Data Augmentation using Machine Translation for Urdu Document Classification

But, Khizer

dc.contributor.author	But, Khizer
dc.date.accessioned	2024-10-29T14:00:46Z
dc.date.available	2024-10-29T14:00:46Z
dc.date.issued	2024-10-28
dc.identifier.uri	http://repository.cuilahore.edu.pk/xmlui/handle/123456789/4585
dc.description.abstract	A huge volume of textual data is being created due to advances in technology and the widespread use of the internet. The rapid growth of textual data in the Urdu language has given rise to applications in spam filtering, fake news detection, website classification, sentiment analysis, etc. However, Urdu is a low-resource language and lacks automated tools to support these applications. This research addresses the data scarcity problem by translating data from a resource-rich language into Urdu. For this purpose, the BBC English news articles dataset was acquired and translated using the Google-trans-new library. Further, an Urdu dataset of fake news was enriched with 1200 articles collected from an online resource named Urdu Point. The Urdu and machine-translated datasets were merged to produce a large, diverse augmented dataset and a downsized augmented dataset. Supervised classification was performed on the datasets with TextCNN, LSTM, simple Bi-LSTM, Bi-LSTM with attention, and SVM. Different training experiments were run to evaluate the effectiveness of classification with the Urdu, machine-translated, and augmented datasets. Training and testing on the same dataset gave good results, whereas cross-testing from the machine-translated dataset to the Urdu dataset, and from the Urdu dataset to the machine-translated dataset, resulted in a performance decline. Models trained on an augmented dataset achieved promising results when tested on the original Urdu test dataset, improving on the results achieved with no augmentation. These results suggest that the proposed strategy of creating large-scale datasets using machine translation is applicable to automatically classifying documents in the Urdu language. Bi-LSTM achieved the highest F1 score, 0.97, with the augmented dataset.	en_US
dc.publisher	Computer Science Department, COMSATS University Islamabad, Lahore Campus	en_US
dc.relation.ispartofseries	CIIT/FA19-RCS-018/LHR;8355
dc.subject	Data Augmentation, Machine Translation, Urdu Document Classification	en_US
dc.title	Data Augmentation using Machine Translation for Urdu Document Classification	en_US
dc.type	Thesis	en_US