Abstract:
A huge volume of textual data is being created owing to technological advancement and the widespread use of the internet. The exponential growth of textual data in the Urdu
language has given rise to applications such as spam filtering, fake news detection,
website classification, and sentiment analysis. However, Urdu is a low-resource
language and lacks the support of automated tools for these applications. This research
addresses the data scarcity concern by translating data from a resource-rich language
into Urdu. For this purpose, the BBC English news articles dataset was acquired and
translated using the Google-trans-new library.
Further, an Urdu fake news dataset was enriched with 1200 articles collected from an
online resource named Urdu Point. The Urdu and machine-translated datasets were merged
to produce a large, diverse augmented dataset and an augmented downsized dataset.
Supervised classification was performed on these datasets using TextCNN, LSTM, simple
Bi-LSTM, Bi-LSTM with attention, and SVM. Different training experiments were conducted
to evaluate the effectiveness of classification with the Urdu, machine-translated, and
augmented datasets.
Training and testing on the same dataset gave good results, whereas cross-testing from
the machine-translated to the Urdu dataset and from the Urdu to the machine-translated
dataset resulted in a performance decline. Models trained on the augmented dataset
achieved promising results when tested on the original Urdu test set, improving on the
results obtained without augmentation. These promising results suggest that the proposed
strategy of creating large-scale datasets through machine translation is applicable to
automatic document classification in the Urdu language. Bi-LSTM achieved the highest
F1 score of 0.97 with the augmented dataset.