CUI Lahore Repository

Using Large Language Models for the Detection of Urdu Paraphrase Types

dc.contributor.author Shahzad, Fozia
dc.date.accessioned 2024-11-28T10:34:39Z
dc.date.available 2024-11-28T10:34:39Z
dc.date.issued 2024-11-28
dc.identifier.uri http://repository.cuilahore.edu.pk/xmlui/handle/123456789/4860
dc.description.abstract Paraphrasing is the generation of new text from pre-existing text. In the literature, this text-to-text conversion involves different linguistic variations, described by several paraphrase typologies at different levels, and requires strong domain knowledge and language expertise. These linguistic modifications sometimes involve more than one language. Paraphrase type detection has become a fascinating research topic for many written languages, with applications in text mining, plagiarism detection, information retrieval, and paraphrase generation. Previously, very few research studies have addressed the detection of paraphrase types, and only for the English language. Despite its significance in natural language processing, research specifically targeting paraphrase type detection remains scarce. To the best of my knowledge, there are no corpora for the detection of paraphrase types in any language other than English. To fill this research gap, this study followed a twofold strategy. First, a sentential Urdu Paraphrase Types Corpus (UPTC-23) was built, consisting of 5000 sentence pairs tagged with fourteen Urdu paraphrase types, by extracting sentence pairs from the COUNTER corpus and online Urdu news articles. Second, thorough experimental work was carried out by fine-tuning Large Language Models (bert-base-multilingual-cased, distilbert-base-multilingual-cased) for the task of Urdu paraphrase type detection. The trained models were evaluated automatically using standard evaluation measures, i.e., precision, recall, F1-score, and accuracy. The highest score observed was 86% accuracy for the ‘bert-base-multilingual-cased’ model, while the ‘distilbert-base-multilingual-cased’ model achieved an accuracy of 84%.
This research work shall contribute to addressing latent research gaps pertaining to the detection of paraphrase types in the Urdu language. en_US
dc.language.iso en en_US
dc.publisher Department of Computer Science, CUI Lahore en_US
dc.relation.ispartofseries 9335;SP22-RCS-019
dc.subject pre-existing, typologies, twofold strategy en_US
dc.title Using Large Language Models for the Detection of Urdu Paraphrase Types en_US
dc.type Thesis en_US


This item appears in the following Collection(s)

  • Thesis - MS / PhD
    This collection contains the MS/PhD theses of the students of the Department of Computer Science
