Abstract:
Paraphrase detection is the task of identifying whether a new text reuses existing text(s) in the same context with different alterations or modifications. In recent years, paraphrase detection has gained the attention of the research community due to its potential applications in different domains of NLP and Machine Learning. Paraphrasing can be carried out at three use-case levels: 1) Real Cases, 2) Simulated Cases, and 3) Artificial Cases. Real Cases are examples drawn from real-world data available on different platforms, websites, e-books, and elsewhere on the internet, covering multiple topics and domains. When a human manipulates text by applying different grammatical rules and edits to paraphrase it and then claims it as original, that is Simulated Cases paraphrasing. The world is now shifting from manual paraphrasing to freely available online tools that modify text; modifying available text with such tools is an example of Artificial Cases paraphrasing.
Identifying the use case of paraphrasing is a crucial task because, without a computational and experimental environment, no tool can identify the paraphrasing type. More importantly, Artificial Cases paraphrasing is the most difficult to detect. The research community still lacks standard sets of examples for Simulated Cases and Artificial Cases. To fill this gap, we propose one Simulated Cases corpus and three Artificial Cases corpora for paraphrasing in the English language at the sentence level, with binary classification as paraphrased or non-paraphrased. To address this problem, this study has three main goals: (1) develop benchmark corpora for Simulated and Artificial Cases of paraphrasing, (2) apply WordNet-based, Kullback-Leibler, and N-gram overlap approaches to Real, Simulated, and Artificial Cases of paraphrasing, and (3) identify which automatic paraphrasing tools produce paraphrasing that is most difficult to detect.
As an initial contribution, we developed corpora for Simulated Cases and Artificial Cases building on previous studies. We developed benchmark corpora: the Quora Simulated Corpus for Simulated Cases, and the Artificial Article Rewriter Corpus, Artificial Rewriter Tool Corpus, and Artificial Paraphrasing Tool Corpus for Artificial Cases. Each corpus contains 5801 English sentence pairs from different domains.
We applied different approaches, namely N-gram overlap, Kullback-Leibler, and WordNet-based approaches, to extract features from the Real, Simulated, and Artificial Cases corpora.
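As an illustration, the sketch below computes two such pair-level features, word n-gram overlap and a smoothed Kullback-Leibler divergence between unigram distributions; the tokenizer, smoothing constant, and n-gram order are assumptions and not necessarily those used in this study.

```python
# Minimal sketch (not the exact implementation of this study) of two
# sentence-pair features: word n-gram overlap and a smoothed
# Kullback-Leibler divergence between unigram distributions.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(s1, s2, n=2):
    """Jaccard overlap between the word n-gram sets of two sentences."""
    a, b = set(ngrams(s1.lower().split(), n)), set(ngrams(s2.lower().split(), n))
    return len(a & b) / len(a | b) if a | b else 0.0

def kl_divergence(s1, s2, alpha=1.0):
    """KL(P||Q) between add-alpha-smoothed unigram distributions."""
    p, q = Counter(s1.lower().split()), Counter(s2.lower().split())
    vocab = set(p) | set(q)
    denom_p = sum(p.values()) + alpha * len(vocab)
    denom_q = sum(q.values()) + alpha * len(vocab)
    return sum(
        ((p[w] + alpha) / denom_p)
        * math.log(((p[w] + alpha) / denom_p) / ((q[w] + alpha) / denom_q))
        for w in vocab
    )

pair = ("the cat sat on the mat", "a cat was sitting on the mat")
print(ngram_overlap(*pair), kl_divergence(*pair))
```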
Five machine learning classifiers, namely Random Forest, Decision Tree, Multilayer Perceptron, AdaBoost, and Gradient Boosting, were evaluated using Precision, Recall, and F1 measure. We report only the highest F1 scores: AARC, ARTC, and QSC scored 85%, APTC scored 89%, and MSRPPC scored 71% with all approaches combined.
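A sketch of this evaluation setup with scikit-learn is shown below; the feature matrix X, binary labels y, split ratio, and hyperparameters are illustrative assumptions rather than the exact experimental configuration.

```python
# Illustrative evaluation of the five classifiers named above with
# Precision, Recall, and F1 on pair-level feature vectors X and
# binary labels y (paraphrased or non-paraphrased).
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Multilayer Perceptron": MLPClassifier(max_iter=500),
    "AdaBoost": AdaBoostClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
}

def evaluate(X, y):
    """Train each classifier on a held-out split and report P/R/F1."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    for name, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        p, r, f1, _ = precision_recall_fscore_support(
            y_te, clf.predict(X_te), average="binary")
        print(f"{name}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```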