Abstract:
The rise of social media has encouraged researchers to conduct several studies based on
user-generated data. The widespread use of online platforms such as Facebook, Twitter, and
other web forums has made it challenging to keep inappropriate language in check. Pakistan
has about 44 million online social network users who use Roman Urdu to interact with each
other. The prevalence of user-generated content on social media has been accompanied by a
massive rise in offensive text, as this growth also offers avenues for trolls who
contaminate these platforms with abusive remarks.
In South Asia, Roman Urdu is the most widely used language for communication on social
media. Compared to resource-rich languages such as English, less work has been done on
offensive language classification in Roman Urdu. This research (1) presents a lexicon of
offensive terms in Roman Urdu, (2) develops a dataset called RU-OLD (Roman Urdu Offensive
Language Detection) with 15,000 tweets, and (3) conducts a comparative analysis of machine
learning and several deep learning architectures.
The comparative analysis covers well-known machine learning classifiers such as Naive
Bayes, Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression, as well
as deep neural networks such as Long Short-Term Memory (LSTM), Convolutional Neural
Networks (CNN), XLNet, and BERT. XLNet trained on the developed corpus outperformed the
other models, achieving the highest accuracy of 93%. The results suggest that the proposed
approach can be applied at large scale to counter offensive language in Roman Urdu.
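
To illustrate how a lexicon of offensive terms might serve as a simple baseline next to
the trained classifiers, the following is a minimal sketch and not the method used in this
work. The lexicon entries, the function names (tokenize, lexicon_hits, is_offensive,
accuracy), and the exact-match rule are all assumptions made for illustration; the entries
are mild placeholder words and are not taken from the lexicon presented here.

```python
import re

# Hypothetical placeholder lexicon. These entries are illustrative only and
# are NOT taken from the lexicon presented in this work.
OFFENSIVE_LEXICON = {"bewakoof", "pagal", "ullu"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexicon_hits(text, lexicon=OFFENSIVE_LEXICON):
    """Return the tokens of `text` that appear in the lexicon, in order."""
    return [tok for tok in tokenize(text) if tok in lexicon]


def is_offensive(text, lexicon=OFFENSIVE_LEXICON):
    """Flag a text as offensive if it contains at least one lexicon term."""
    return bool(lexicon_hits(text, lexicon))


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Because Roman Urdu has no standardized spelling, an exact-match rule like this one would
likely miss spelling variants of the same term, which is one reason learned classifiers
are compared in this work.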