Urdu Text Reuse Detection (UTRD) at Phrasal Level using Transformer-Based Approached

Mehak, Gull

URI: http://repository.cuilahore.edu.pk/xmlui/handle/123456789/3771

Date: 2023-08-08

Abstract:

Text reuse is the process of creating new text(s) from pre-existing text(s). In recent years, Urdu Text Reuse Detection (UTRD) has attracted the attention of researchers because text is readily available in digital form across the internet and can be copied or paraphrased from another source without proper attribution, which makes reuse easy but hard to detect. Previous studies have explored UTRD at the sentence level [1], sentence/passage level [2], and document level [3], along with benchmark corpora and methods. However, UTRD has not been explored at the phrasal level with respect to either corpora or methods. To fill this research gap, this study first developed a large, manually annotated benchmark corpus of 25,001 text pairs at two levels of rewrite: 1) Derived = 15,105 and 2) Non Derived = 9,896. Second, this study developed, applied, evaluated, and compared baseline approaches (N-gram Overlap and Word Embedding-based approaches) with the proposed transformer-based approaches on the proposed UTRD-Phr-22 corpus. The proposed transformer-based approach (using a combination of 8 different Sentence Transformers: paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v, paraphrase-multilingual-MiniLM-L12-v2, LaBSE, xlm-r-distilroberta-base-paraphrase-v1, xlm-r-100langs-bert-base-nli-mean-tokens, xlm-r-bert-base-nli-stsb-mean-tokens, and xlm-r-100langs-bert-base-nli-stsb-mean-tokens) outperforms with an F1 score of 0.63 compared
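To make the N-gram Overlap baseline and the F1 evaluation mentioned in the abstract more concrete, the following is a minimal sketch, not the thesis implementation. It assumes whitespace tokenization, a Jaccard-style overlap measure, and a decision threshold of 0.5; the abstract does not specify any of these, and all function names are hypothetical. Only the labels "Derived" and "Non Derived" come from the abstract.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a phrase (whitespace tokenization, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(source: str, candidate: str, n: int = 1) -> float:
    """Jaccard overlap of word n-grams; the thesis may use a different overlap measure."""
    a, b = word_ngrams(source, n), word_ngrams(candidate, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(source: str, candidate: str, n: int = 1, threshold: float = 0.5) -> str:
    """Label a text pair; the threshold value is an illustrative assumption."""
    return "Derived" if ngram_overlap(source, candidate, n) >= threshold else "Non Derived"


def f1_score(gold: List[str], predicted: List[str], positive: str = "Derived") -> float:
    """F1 for the positive class, computed from precision and recall."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A transformer-based variant would replace `ngram_overlap` with a similarity score between sentence embeddings of the two phrases, while the labelling and F1 steps stay the same.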

Files in this item

Name: Thesis - SP21-RCS ...

Size: 1.821Mb

Format: PDF

This item appears in the following Collection(s)

Thesis - MS / PhD
This collection contains the MS/PhD theses of the students of the Department of Computer Science
