Detecting Urdu Semantic Textual Similarity through Word and Sentence Embedding Techniques

Farhat Ullah, Muhammad

URI: http://dspace.cuilahore.edu.pk/xmlui/handle/123456789/1554

Date: 2019-12-19

Abstract:

Semantic Textual Similarity (STS) measures the degree to which two sentences or short texts are semantically equivalent. STS is one of the significant problems in Natural Language Processing (NLP); text reuse detection and plagiarism detection are well-known applications. STS has three types: (1) Monolingual STS, where the source and suspicious short texts are in the same language; (2) Multilingual STS, where the source is in one language and the suspicious short text is in more than two languages; (3) Cross-lingual STS, where the source is in one language and the suspicious text is in another, often as translations of each other. STS can be measured at several levels, for example the word, sentence, paragraph, and document levels. Urdu is a low-resource language. It is the national language of Pakistan and is widely spoken and used in the electronic and print media of Pakistan, India, and Bangladesh. The main aim of this thesis is to develop techniques that measure STS at the paragraph level for Urdu, and to develop and investigate new feature extraction techniques for this problem. We divide the work into three layers: (1) train word and sentence embedding models on Urdu datasets, (2) apply these new feature extraction techniques to extract features from Urdu short text pairs, and (3) apply machine learning classification algorithms for Urdu STS. In the first step, we train word and sentence embedding models on Urdu datasets. Using these embedding models, we extract word and sentence embedding features from the pre-processed short texts of the Urdu Short Text Reuse Corpus (USTRC). We then compute the cosine similarity between the extracted feature vectors and apply classification algorithms to the similarity scores to classify short texts as verbatim, paraphrased, or independently written.
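The similarity step described above can be illustrated with a minimal sketch. It is not taken from the thesis: the function name cosine_similarity, the toy three-dimensional vectors, and the convention of returning 0.0 for an all-zero vector are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "embeddings" with made-up values (not from the thesis or USTRC).
source_vec = [0.2, 0.7, 0.1]
suspicious_vec = [0.4, 1.4, 0.2]  # same direction, twice the length
similarity = cosine_similarity(source_vec, suspicious_vec)
```

Because cosine similarity depends only on direction, the two toy vectors score close to 1 even though their lengths differ. In the pipeline the abstract describes, a score like this would be the input feature passed to a classifier.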
For word embedding, we used Word2Vec, GloVe, and FastText with Addition, Average, and Multiplication functions; we also explored Smooth Inverse (SI) and Term Frequency (TF) weighted word embedding techniques. For sentence embedding, we used two unsupervised techniques (sent2vec and LASER) and two supervised techniques (InferSent and BERT). We used seven machine learning algorithms to classify the similarity scores: Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), K-Nearest Neighbour (KNN), Gradient Boost (GB), and Multi-Layer Perceptron (MLP). To evaluate these classifiers, we use the F1 measure. The best F1 scores, 0.68, 0.75, 0.92, and 0.70, were obtained using the sent2vec sentence embedding technique with the GB and MLP classifiers.
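As a rough illustration of the Addition, Average, and Multiplication functions mentioned above, the sketch below combines per-word vectors into a single text vector element by element. This is one plausible reading of those functions, not the thesis's implementation; the helper name compose, the mode strings, and the toy vectors are all hypothetical.

```python
def compose(word_vectors, mode="average"):
    """Combine equal-length word vectors into one vector, element-wise.

    mode is one of "addition", "average", or "multiplication"
    (hypothetical names chosen for this sketch).
    """
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    if any(len(vec) != dim for vec in word_vectors):
        raise ValueError("all word vectors must have the same length")

    if mode in ("addition", "average"):
        summed = [sum(vec[i] for vec in word_vectors) for i in range(dim)]
        if mode == "average":
            return [x / len(word_vectors) for x in summed]
        return summed

    if mode == "multiplication":
        product = [1.0] * dim
        for vec in word_vectors:
            product = [p * x for p, x in zip(product, vec)]
        return product

    raise ValueError("unknown mode: " + mode)


# Two toy 2-dimensional word vectors (made-up values).
toy_vectors = [[1.0, 2.0], [3.0, 4.0]]
added = compose(toy_vectors, "addition")
averaged = compose(toy_vectors, "average")
multiplied = compose(toy_vectors, "multiplication")
```

With real embeddings, the word vectors would come from a trained Word2Vec, GloVe, or FastText model, and the composed vectors for a text pair would then be compared by cosine similarity.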

Files in this item

Name: V3_Thesis_Detecti ...

Size: 443.7Kb

Format: PDF

Description: MS Thesis

This item appears in the following Collection(s)

Thesis - MS / PhD
This collection contains the MS/PhD theses of the students of the Department of Computer Science
