Abstract:
The process of creating new text(s) from an existing one is called Text Reuse. The goal of text reuse detection is to identify whether a new text was created from a source text. A source text can be reused at different rewrite levels to generate either partially reused (Paraphrased) or fully reused (Verbatim) text pairs. Paraphrased Text Reuse describes the relationship between two texts that convey the same idea in different ways. Paraphrased text reuse is difficult to detect because it can be produced with many different edit operations, which vary considerably with the context in which the text is reused.
In the literature, the Paraphrase Text Reuse Detection problem has been explored for English and some other European languages, but it has not been investigated for Urdu because no Urdu paraphrase text reuse detection corpus is available. To fill this gap, this study develops a large-scale gold standard benchmark, the Urdu Paraphrase Text Reuse Corpus (UPTRC), for Urdu text reuse detection, with special emphasis on paraphrase text reuse detection. In addition, we apply baseline approaches (N-Gram Based Approaches) and propose further approaches, including Word Embedding Based Approaches, Sentence Transformers Based Approaches, and Feature Fusion Approaches, for Urdu Paraphrase Text Reuse Detection (UPTRD). Urdu Short Text Reuse Detection (USTRD), by contrast, has already been explored at the sentence, passage, and document levels, using different approaches applied to the Urdu Short Text Reuse Corpus (USTRC). However, the USTRD task has only been explored with lexical and semantic approaches on USTRC, and not with state-of-the-art Sentence Transformers Based Approaches or Word Embedding Based Approaches. Sentence Transformers Based Approaches are used in a variety of natural language processing tasks, especially semantic text similarity and duplicate pair detection, but they have not been explored for Urdu text reuse detection. There is therefore a need to explore Sentence Transformers Based Approaches on the USTRC corpus for the USTRD task.
In this study, along with the corpus construction, we apply N-Gram Based Approaches as baselines and propose new approaches, including Word Embedding Based Approaches, Sentence Transformers Based Approaches, and Feature Fusion Approaches, for our proposed UPTRC corpus. We apply, compare, and evaluate the same approaches on the USTRC corpus as well as on our proposed corpus to investigate their performance. The best result for the UPTRD task on our proposed UPTRC corpus was a weighted average F1 = 0.855607, obtained by our proposed Feature Fusion Approaches. On the USTRC corpus, our proposed Feature Fusion Approaches also outperformed the baseline approaches for ternary classification, with F1 = 0.715164. The performance of the proposed approaches on both corpora is evaluated and reported using the weighted averages of the Precision (P), Recall (R), and F1 evaluation measures.
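
For readers unfamiliar with the weighted average used above, the following is a minimal sketch of one common way to compute it: per-class Precision, Recall, and F1 are averaged with weights proportional to each class's share of the gold labels. The function name `weighted_prf`, the class labels, and the toy data are illustrative assumptions and are not taken from UPTRC or USTRC; the exact evaluation code used in this study may differ.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-class Precision, Recall, and F1.

    Hypothetical helper for illustration only; classes are taken from y_true.
    """
    support = Counter(y_true)  # number of gold instances per class
    total = len(y_true)
    p_avg = r_avg = f_avg = 0.0
    for label, n in support.items():
        # True positives: gold and predicted label both equal this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n / total  # class weight = share of gold labels
        p_avg += weight * precision
        r_avg += weight * recall
        f_avg += weight * f1
    return p_avg, r_avg, f_avg


# Toy ternary example with made-up labels (not corpus data).
gold = ["verbatim", "verbatim", "paraphrased", "paraphrased", "independent", "independent"]
pred = ["verbatim", "paraphrased", "paraphrased", "paraphrased", "independent", "verbatim"]
p, r, f1 = weighted_prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Because each class is weighted by its support, the weighted Recall equals overall accuracy, while the weighted F1 can differ from both macro and micro F1 when the classes are imbalanced.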