dc.description.abstract |
Detecting cyclic text reuse in Urdu is a growing need as digital technologies become more widely used. Text reuse, in which a passage from an original document is copied and then reused with slight changes in wording and without proper credit, has become very easy because of the vast availability of digital text and free online machine translation tools. The latest AI-powered machine translation systems make it easy to translate a text iteratively across multiple languages and back into the original language, generating paraphrases that can be reused without due credit. Detecting such reuse is important for a range of applications, including journalism, near-duplicate document detection, content reuse in blogs, and plagiarism detection. While much research has addressed text reuse detection in English, only a few manually created corpora exist for Urdu text reuse detection, covering real and simulated instances of text reuse. To date, no large corpus or study targets the detection of cyclic text reuse in Urdu. This study aims to fill this gap by developing a large benchmark corpus through an automated approach based on real cases from journalism. The proposed corpus contains 6,000 documents, each manually annotated as partially derived, wholly derived, or non-derived. The study also examines the effectiveness of different sentence-transformer methods, such as LaBSE, in detecting cyclic text reuse. In addition, we fine-tuned the LaBSE, paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v2, xlm-r-100langs-bert-base-nli-stsb-mean-tokens, and xlm-r-distilroberta-base-paraphrase-v1 models to increase their accuracy and compared them against an N-gram baseline. 
The results demonstrate that, for cyclic text reuse detection in Urdu, the proposed transformer-based approach with LaBSE yields favorable results, with a best F1 score of 0.61. The results also indicate the model's capacity to deal with other intricate translation tasks and its improved accuracy in identifying cyclic text reuse. The contributions of this research are two-fold: first, the construction of a suitable, large-scale corpus for Urdu cyclic text reuse detection, and second, an analysis of the performance of existing state-of-the-art sentence-transformer techniques, such as LaBSE, on this specific task. The techniques and corpus developed in this study should support further improvement of cyclic text reuse detection in Urdu and can be applied to other similar languages. |
en_US |