Abstract:
Text reuse is the process of creating new text from one or more pre-existing texts. In recent years, Urdu Text Reuse Detection (UTRD) has gained the attention of researchers because text is readily available in digital form across the internet and can be copied or paraphrased from another source without proper attribution, which makes reuse easy but detection hard. Previous studies have explored UTRD at the sentence level [1], the sentence/passage level [2], and the document level [3], together with benchmark corpora and methods. However, UTRD has not been explored at the phrasal level with respect to either corpora or methods. To fill this research gap, this study has first developed a large, manually annotated benchmark corpus of 25,001 text pairs at two levels of rewrite: 1) Derived = 15,105 and 2) Non Derived = 9,896. Second, this study has developed, applied, evaluated, and compared baseline approaches (N-gram Overlap and Word Embedding-based approaches) with the proposed transformer-based approaches on the proposed UTRD-Phr-22 corpus. The proposed transformer-based approach (using a combination of 8 different Sentence Transformers: paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v, paraphrase-multilingual-MiniLM-L12-v2, LaBSE, xlm-r-distilroberta-base-paraphrase-v1, xlm-r-100langs-bert-base-nli-mean-tokens, xlm-r-bert-base-nli-stsb-mean-tokens, and xlm-r-100langs-bert-base-nli-stsb-mean-tokens) outperforms with an F1 score of 0.63 compared