dc.description.abstract |
Paraphrasing is the generation of a new text from a pre-existing text. In the literature, this
text-to-text conversion involves a range of linguistic variations, described by several
paraphrase typologies at different levels, and it requires strong domain knowledge and
language expertise. These linguistic modifications sometimes involve more than
one language. Paraphrase type detection has become a fascinating research topic
for many written languages, with applications in text mining, plagiarism detection,
information retrieval, and paraphrase generation.
To date, very few research studies have addressed the detection of
paraphrase types, and those only for the English language. Despite the significance of
paraphrase type detection in natural language processing, there has been a noticeable scarcity of
research specifically targeting this task. To the
best of my knowledge, no corpus exists for the detection of paraphrase types
in any language other than English.
To fill this research gap, this study followed a twofold strategy. First, a
sentential Urdu Paraphrase Types Corpus (UPTC-23) was built, consisting of 5000
sentence pairs tagged with fourteen Urdu paraphrase types; the sentence
pairs were extracted from the COUNTER corpus and online Urdu news articles. Second,
thorough experiments were carried out by fine-tuning the Large Language
Models (bert-base-multilingual-cased, distilbert-base-multilingual-cased) for the task
of Urdu paraphrase type detection. The trained models were evaluated automatically
using standard evaluation measures, i.e., precision, recall, F1-score, and
accuracy. The highest score observed was 86% accuracy with the
‘bert-base-multilingual-cased’ model, while the ‘distilbert-base-multilingual-cased’
model achieved an accuracy of 84%. This research work contributes toward
addressing the research gaps pertaining to the detection of paraphrase types in
the Urdu language. |
en_US |