CUI Lahore Repository

Detecting Real, Simulated and Artificial Cases of Paraphrasing

dc.contributor.author Shahzadi, Ayesha
dc.date.accessioned 2021-06-04T08:08:28Z
dc.date.available 2021-06-04T08:08:28Z
dc.date.issued 2021-06-04
dc.identifier.uri http://repository.cuilahore.edu.pk/xmlui/handle/123456789/2339
dc.description.abstract Paraphrase detection is the process of identifying the reuse of existing text as new text in the same context with alterations or modifications. In recent years, paraphrase detection has gained the attention of the research community due to its potential applications in different domains of NLP and Machine Learning. Paraphrasing may be carried out at three use-case levels: 1) Real Cases, 2) Simulated Cases, and 3) Artificial Cases. Real Cases are examples drawn from real-world data available on different platforms, websites, e-books, and across the internet on multiple topics and domains. Simulated Cases arise when a human manipulates content by applying different grammatical rules and edits and then claims the result as original. The world is now shifting from manual work to freely available online tools for modifying data, and modifying existing data with such tools is an example of Artificial Cases paraphrasing. Identifying the use case of paraphrasing is a crucial task because, without a computational and experimental environment, no tool can identify the paraphrasing type. More importantly, Artificial Cases are the most difficult to detect. No standard set of examples exists for Simulated Cases and Artificial Cases. To fill this gap, we propose one Simulated Cases corpus and three Artificial Cases corpora for English paraphrasing at the sentence level, with binary classification as paraphrased or non-paraphrased. This study has three main goals: (1) develop benchmark corpora for Simulated and Artificial Cases of paraphrasing, (2) apply WordNet-based, Kullback-Leibler, and N-gram Overlap approaches to Real, Simulated, and Artificial Cases of paraphrasing, and (3) identify which automatic paraphrasing tools produce paraphrases that are more difficult to detect. 
As an initial contribution, we developed corpora for Simulated Cases and Artificial Cases based on previous studies. The benchmark corpora are the Quora Simulated Corpus (QSC) for Simulated Cases, and the Artificial Article Rewriter Corpus (AARC), Artificial x Rewriter Tool Corpus (ARTC), and Artificial Paraphrasing Tool Corpus (APTC) for Artificial Cases. Each corpus contains 5801 English sentence pairs from different domains. We applied N-gram Overlap, Kullback-Leibler, and WordNet-based approaches to extract features from the Real, Simulated, and Artificial Cases corpora. Five machine learning classifiers (Random Forest, Decision Tree, Multilayer Perceptron, AdaBoost, and Gradient Boosting) were evaluated using Precision, Recall, and F1 measure. We report only the highest F1 scores for all approaches combined: AARC, ARTC, and QSC each scored 85%, APTC scored 89%, and MSRPPC scored 71%. en_US
dc.publisher Department of Computer Science, COMSATS University Lahore. en_US
dc.relation.ispartofseries ;6420
dc.subject Detecting Real, Simulated and Artificial Cases of Paraphrasing en_US
dc.title Detecting Real, Simulated and Artificial Cases of Paraphrasing en_US
dc.type Thesis en_US
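
The abstract names N-gram Overlap and Kullback-Leibler features computed over sentence pairs. The following is a minimal sketch of how such features might be computed, not the thesis's actual implementation. The whitespace tokenizer, the overlap normalization (shared distinct n-grams divided by the smaller n-gram set), the add-alpha smoothing, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tokens(sentence):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return sentence.lower().split()


def ngrams(words, n):
    """Return the set of distinct word n-grams in a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(a, b, n=1):
    """Shared distinct n-grams divided by the size of the smaller n-gram set."""
    ga, gb = ngrams(tokens(a), n), ngrams(tokens(b), n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / min(len(ga), len(gb))


def kl_divergence(a, b, alpha=1.0):
    """KL(P_a || P_b) over add-alpha smoothed unigram distributions."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    vocab = set(ca) | set(cb)
    total_a = sum(ca.values()) + alpha * len(vocab)
    total_b = sum(cb.values()) + alpha * len(vocab)
    divergence = 0.0
    for word in vocab:
        p = (ca[word] + alpha) / total_a
        q = (cb[word] + alpha) / total_b
        divergence += p * math.log(p / q)
    return divergence


def pair_features(a, b):
    """Feature vector for one sentence pair: unigram overlap, bigram overlap, KL."""
    return [ngram_overlap(a, b, 1), ngram_overlap(a, b, 2), kl_divergence(a, b)]
```

Feature vectors of this kind, one per sentence pair and labeled paraphrased or non-paraphrased, could then be passed to any of the classifiers listed in the abstract and scored with Precision, Recall, and F1.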


This item appears in the following Collection(s)

  • Thesis - MS / PhD
    This collection contains the MS/PhD theses of the students of the Department of Management Sciences
