A Comparative Analysis of Offensive Language Detection in Roman Urdu

Khan, Muneeb Rehman

dc.contributor.author	Khan, Muneeb Rehman
dc.date.accessioned	2024-10-28T10:18:30Z
dc.date.available	2024-10-28T10:18:30Z
dc.date.issued	2024-10-28
dc.identifier.uri	http://repository.cuilahore.edu.pk/xmlui/handle/123456789/4432
dc.description.abstract	The rise of social media has encouraged experts to conduct several studies based on user-generated data. The widespread use of online platforms such as Facebook, Twitter, and other web forums has made it challenging to monitor inappropriate language. Pakistan has about 44 million online social network users who use Roman Urdu to interact with each other. The prevalence of user-generated content on social media has led to a massive rise in offensive text, and this growth also gives trolls an avenue to contaminate these platforms with nasty remarks. In South Asia, Roman Urdu is the most widely used language for communication on social media. Compared to resource-rich languages such as English, less work has been done on offensive language classification in Roman Urdu. This research (1) presents a lexicon of offensive terms in Roman Urdu, (2) develops a dataset called RU-OLD (Roman Urdu Offensive Language Detection) with 15,000 tweets, and (3) conducts a comparative analysis of machine learning classifiers and several deep learning architectures. The comparative analysis covers well-known machine learning classifiers such as Naive Bayes, Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression, as well as deep neural networks such as Long Short-Term Memory, Convolutional Neural Networks, XLNET, and BERT. XLNET trained on the developed corpus outperformed the other models, achieving the highest accuracy of 93%. The results suggest that the proposed approach can be applied at large scale to counter offensive language in Roman Urdu.	en_US
dc.publisher	Computer Science Department COMSATS University Islamabad Lahore Campus	en_US
dc.relation.ispartofseries	ST-SG-0697;8305
dc.subject	The rise of social media has encouraged experts to conduct several studies based on user-generated data	en_US
dc.title	A Comparative Analysis of Offensive Language Detection in Roman Urdu	en_US
dc.type	Thesis	en_US