CUI Lahore Repository

A Comparative Analysis of Offensive Language Detection in Roman Urdu

dc.contributor.author Khan, Muneeb Rehman
dc.date.accessioned 2024-10-28T10:18:30Z
dc.date.available 2024-10-28T10:18:30Z
dc.date.issued 2024-10-28
dc.identifier.uri http://repository.cuilahore.edu.pk/xmlui/handle/123456789/4432
dc.description.abstract The rise of social media has encouraged experts to conduct several studies based on user-generated data. The widespread use of online platforms like Facebook, Twitter and other web forums has made it challenging to keep inappropriate language in check. Pakistan has about 44 million online social network users who use Roman Urdu to interact with each other. The prevalence of user-generated content on social media has been accompanied by a massive rise in offensive text, since this growth also offers avenues for trolls who contaminate these platforms with nasty remarks. In South Asia, Roman Urdu is the most widely used communication language on social media. Compared to resource-rich languages such as English, less work has been done on offensive language classification in Roman Urdu. This research (1) presents a lexicon of offensive terms in Roman Urdu, (2) develops a dataset called RU-OLD (“Roman Urdu Offensive Language Detection”) with 15000 tweets, and (3) conducts a comparative analysis of machine learning and several deep learning architectures. The comparison covers well-known machine learning classifiers such as Naive Bayes, Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression, as well as deep neural networks such as Long Short-Term Memory, Convolutional Neural Networks, XLNet and BERT. XLNet trained on the developed corpus outperformed the rest of the models, achieving the highest accuracy of 93%. The results suggest that the proposed approach can be scaled up to counter offensive language in Roman Urdu. en_US
dc.publisher Computer Science Department COMSATS University Islamabad Lahore Campus en_US
dc.relation.ispartofseries ST-SG-0697;8305
dc.subject The rise of social media has encouraged experts to conduct several studies based on user-generated data en_US
dc.title A Comparative Analysis of Offensive Language Detection in Roman Urdu en_US
dc.type Thesis en_US


This item appears in the following Collection(s)

  • Thesis - MS / PhD
    This collection contains the MS/PhD theses of the students of the Department of Computer Science
