Abstract:
Semantic Textual Similarity (STS) measures the degree to which two sentences or short
texts are semantically equivalent to one another. STS is one of the significant problems
in the field of Natural Language Processing (NLP). Text reuse and plagiarism
detection are well-known applications of STS.
STS has three types: (1) Monolingual STS, where the source and suspicious short texts are
in the same language; (2) Multilingual STS, where the source is in one language and the
suspicious short text is in more than two languages; and (3) Cross-lingual STS, where the
source is in one language and the suspicious text is in another, the two often being
translations of each other. STS can be measured at several levels, for example the word,
sentence, paragraph, and document level.
Urdu is a low-resource language. It is the national language of Pakistan and is
widely spoken and used in the electronic and print media of Pakistan, India, and Bangladesh.
The main aim of this thesis is to develop techniques that measure STS at the paragraph
level for the Urdu language.
This thesis develops and investigates new feature extraction techniques to address
the problem of STS for Urdu. The work is divided into three layers: (1) training word and
sentence embedding models on Urdu datasets, (2) applying these feature extraction
techniques to extract features from Urdu short text pairs, and (3) applying machine learning
classification algorithms for Urdu STS. In the first step, we train word and sentence
embedding models on Urdu datasets. Using these embedding models, we extract
word and sentence embedding features from the pre-processed short texts of the Urdu
Short Text Reuse Corpus (USTRC). We then compute the cosine similarity between the
extracted feature vectors and apply classification algorithms to the similarity scores to
classify the short texts as verbatim, paraphrased, or independently written. For word
embeddings, we used Word2Vec, GloVe, and FastText with addition, average, and
multiplication functions; we also explored Smooth Inverse (SI) and Term Frequency (TF)
weighted word embedding techniques. For sentence embeddings, we used
two unsupervised (sent2vec and LASER) and two supervised (InferSent and BERT)
techniques.
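The feature extraction step described above can be illustrated with a minimal sketch. It assumes word vectors are already available as a plain dictionary; the names `TOY_VECTORS`, `compose`, and `cosine_similarity` and the three-dimensional vectors are hypothetical placeholders, not the trained Urdu models or the code used in the thesis. The sketch combines word vectors into one text vector with the addition, average, or multiplication function and then compares two text vectors by cosine similarity.

```python
import math

# Hypothetical toy vectors standing in for a trained word embedding model.
TOY_VECTORS = {
    "w1": [1.0, 2.0, 0.0],
    "w2": [3.0, 0.0, 1.0],
    "w3": [1.0, 1.0, 1.0],
}


def compose(tokens, vectors, mode="average"):
    """Combine the word vectors of a text into a single feature vector."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        raise ValueError("no token has a known vector")
    columns = list(zip(*rows))  # one tuple per embedding dimension
    if mode == "addition":
        return [sum(c) for c in columns]
    if mode == "average":
        return [sum(c) / len(c) for c in columns]
    if mode == "multiplication":
        return [math.prod(c) for c in columns]
    raise ValueError(f"unknown mode: {mode}")


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


source_vec = compose(["w1", "w2"], TOY_VECTORS, mode="average")
suspect_vec = compose(["w2", "w3"], TOY_VECTORS, mode="average")
similarity = cosine_similarity(source_vec, suspect_vec)
```

In this sketch the resulting `similarity` value is the single feature that would be passed to a classifier; tokens without a known vector are simply skipped, which is one possible design choice among several.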
We used seven machine learning algorithms to classify the similarity scores:
Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), Random
Forest (RF), K-Nearest Neighbour (KNN), Gradient Boosting (GB), and Multi-Layer
Perceptron (MLP). We evaluate these classifiers with the F1 measure. The best F1
scores, 0.68, 0.75, 0.92, and 0.70, were obtained with the sent2vec sentence embedding
technique and the GB and MLP classifiers.
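As an illustration of the evaluation measure, the sketch below computes a per-class F1 score from gold and predicted labels, using F1 = 2TP / (2TP + FP + FN). The function name `f1_per_class`, the label abbreviations, and the toy label lists are hypothetical and do not reproduce the thesis results or its exact evaluation setup.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that never occurs and is never predicted gets 0.0 here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical toy labels: V = verbatim, P = paraphrased, I = independently written.
gold = ["V", "V", "P", "P", "I", "I"]
predicted = ["V", "P", "P", "P", "I", "V"]
scores = f1_per_class(gold, predicted, labels=["V", "P", "I"])
```

Reporting one score per class keeps differences between the verbatim, paraphrased, and independently written categories visible; averaging the per-class scores would be a further assumption about how the results are summarised.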