CUI Lahore Repository

Data Augmentation using Machine Translation for Urdu Document Classification

dc.contributor.author But, Khizer
dc.date.accessioned 2024-10-29T14:00:46Z
dc.date.available 2024-10-29T14:00:46Z
dc.date.issued 2024-10-28
dc.identifier.uri http://repository.cuilahore.edu.pk/xmlui/handle/123456789/4585
dc.description.abstract Advances in technology and widespread internet use are producing a huge volume of textual data. The exponential growth of Urdu text has given rise to applications such as spam filtering, fake news detection, website classification, and sentiment analysis. However, Urdu is a low-resource language and lacks automated tools to support these applications. This research addresses the data scarcity by translating data from a resource-rich language into Urdu. For this purpose, the BBC English news articles dataset was acquired and translated using the Google-trans-new library. In addition, an Urdu fake news dataset was enriched with 1200 articles collected from the online resource Urdu Point. The Urdu and machine-translated datasets were merged to produce a large, diverse augmented dataset and a downsized augmented dataset. Supervised classification was performed on the datasets with TextCNN, LSTM, simple Bi-LSTM, Bi-LSTM with attention, and SVM. Several training experiments evaluated classification effectiveness on the Urdu, machine-translated, and augmented datasets. Training and testing on the same dataset gave good results, whereas cross-testing from the machine-translated to the Urdu dataset and from the Urdu to the machine-translated dataset resulted in a performance decline. Models trained on an augmented dataset achieved promising results when tested on the original Urdu test dataset, improving on the results achieved without augmentation. These results suggest that the proposed strategy of creating large-scale datasets through machine translation is applicable to the automatic classification of Urdu documents. Bi-LSTM achieved the highest F1 score, 0.97, with the augmented dataset. en_US
dc.publisher Computer Science Department COMSATS University Islamabad Lahore Campus en_US
dc.relation.ispartofseries CIIT/FA19-RCS-018/LHR;8355
dc.subject Augmentation using, Machine, Translation, Urdu Document, Classification en_US
dc.title Data Augmentation using Machine Translation for Urdu Document Classification en_US
dc.type Thesis en_US


This item appears in the following Collection(s)

  • Thesis - MS / PhD
    This collection contains the MS/PhD theses of the students of the Department of Computer Science
