Abstract:
The process of finding the degree of similarity between words is called semantic word similarity. Machine Translation and Information Retrieval are common and well-known applications of Semantic Word Similarity. Because of these applications, finding the similarity between words is a crucial task and an important problem in Natural Language Processing.
Semantic Word Similarity has two basic types: (1) Mono-lingual Semantic Word Similarity and (2) Cross-lingual Semantic Word Similarity. When both words are in the same language, it is called Mono-lingual Semantic Word Similarity; when the two words come from two different languages, it is called Cross-lingual Semantic Word Similarity. Semantic similarity can be measured at several levels, i.e., word, phrase, sentence, or document level.
Urdu is the national language of Pakistan. It is spoken by 11 million people as a mother tongue and by 105 million people as a second language [1]. According to Riaz [2], the Urdu language is estimated to have 300 million speakers worldwide. The aim of this thesis is to develop mono- and cross-lingual semantic word similarity datasets and models for the Urdu language.
The purpose of this study is threefold: (1) create a Mono-lingual Semantic Word Similarity dataset, (2) develop a Cross-lingual Semantic Word Similarity dataset, and (3) create new models and evaluate pre-trained models using the newly created gold standard mono- and cross-lingual corpora. In the first step, we translate the SemEval-500 dataset into Urdu using Google, Bing and other translation resources. After translation, a linguistic expert verifies the translation and makes corrections as required. The verified dataset is then assigned to 12 annotators, who annotate it following the guidelines provided in SemEval-2014 Task 3 [3]. In the second step, the newly created dataset and the SemEval-500 dataset are combined to create a cross-lingual dataset. This dataset is also annotated by 12 annotators following the same guidelines. We use the intra-class correlation coefficient to measure inter-annotator agreement on both the mono- and cross-lingual datasets and obtain r = 0.9 with p < .001 for each dataset. In the third step, we create new models for Mono-lingual Semantic Word Similarity and Cross-lingual Semantic Word Similarity. We evaluate these newly created models and the pre-trained models on our mono- and cross-lingual gold standard datasets, using Pearson correlation as the evaluation measure. For the mono-lingual models, corpus-based techniques obtain a correlation of 0.54. The pre-trained fastText CBOW and Skip-gram models obtain correlations of 0.53 and 0.63 respectively. The fastText CBOW and Skip-gram models trained on Ur-Mono obtain maximum correlations of 0.63 and 0.67 respectively. For cross-lingual semantic word similarity, we apply the Translation Plus Monolingual Analysis technique, in which the word pairs are first translated and multiple WordNet measures are then applied. With Google translation, the Path Similarity measure outperforms the other measures with a correlation of 0.70. With Bing translation, the Path Similarity and Leacock measures outperform the other measures, each with a correlation of 0.71.
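As an illustration of the evaluation step described above, the following is a minimal sketch of how Pearson correlation between gold standard similarity scores and model scores could be computed. The function name `pearson` and the score lists are hypothetical and chosen only for illustration; they are not taken from the datasets or models of this thesis.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example data: one gold score and one model score per word pair.
gold_scores = [4.0, 3.0, 1.5, 0.5, 2.0]
model_scores = [0.91, 0.70, 0.35, 0.10, 0.40]

r = pearson(gold_scores, model_scores)
print(f"Pearson r = {r:.2f}")
```

Because Pearson correlation is invariant to linear rescaling, the model scores (for example, cosine similarities) do not need to be mapped onto the annotation scale before comparison.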