CUI Lahore Repository

Toxic Comment Classification of Roman Urdu Text

dc.contributor.author Abbas, Waheed
dc.date.accessioned 2019-12-20T15:36:17Z
dc.date.available 2019-12-20T15:36:17Z
dc.date.issued 2019-12-19
dc.identifier.uri http://dspace.cuilahore.edu.pk/xmlui/handle/123456789/1558
dc.description.abstract We live in an era of technology in which most of us have easy access to the internet. With the internet boom, the use of social media and discussion forums for communication has increased tremendously in recent years. This advancement has also opened the door to trolls who poison these platforms with abusive behavior toward others, and detecting toxic language online has become an important problem. A toxic comment is defined as a rude, unreasonable, or disrespectful comment that is likely to make you leave a conversation. Current methods for handling toxicity online depend heavily on manual moderation and do not scale to a user base that grows day by day. In this thesis, we investigate toxic language detection for Roman Urdu (a transliteration of Urdu into the script used for English) text, because Roman Urdu is widely used for online communication in the Pakistani and Indian communities. Considerable work has been done on detecting toxicity in English, but owing to the unavailability of large Roman Urdu corpora, no such work has been done so far for Roman Urdu. The aim of this thesis is twofold: first, to develop a large-scale corpus for toxic language detection in Roman Urdu; second, to apply proven classical machine learning and state-of-the-art deep learning methods to the developed corpus. In the first step, we collected data from YouTube comment sections to build the Roman Urdu corpus. The final corpus consists of 10,863 unique comments, of which 5,887 are toxic and 4,976 are non-toxic. We annotated the corpus manually and had it reviewed by a second annotator.
We applied proven classical machine learning algorithms such as Logistic Regression, K-Nearest Neighbors, Naive Bayes, Random Forest, Support Vector Machine, and Multi-layer Perceptron, along with state-of-the-art deep learning methods such as Long Short-Term Memory, Bidirectional Long Short-Term Memory, and Convolutional Neural Networks. The best results were obtained with Random Forest using character 4-gram features (Accuracy = 0.964, Precision = 0.980, Recall = 0.953, F1 = 0.966). en_US
dc.language.iso en en_US
dc.publisher Department of Computer Science, COMSATS University Islamabad, Lahore Campus en_US
dc.subject Toxic Classification, Language, Roman Urdu Text, transliteration of Urdu in English, Logistic Regression, K-Nearest Neighbors, Naive Bayes, Random Forest, Support Vector Machine and Multi-layer Perceptron en_US
dc.title Toxic Comment Classification of Roman Urdu Text en_US
dc.type Thesis en_US
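
The abstract reports its best result with character 4-gram features. As a minimal sketch only, and not the thesis's actual pipeline, the snippet below shows what character 4-gram counts look like for a single comment and checks that the reported F1 agrees with the reported precision and recall. The function names, the lowercasing and whitespace normalization, and the sample comment are assumptions made for illustration; none of them come from the thesis.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams in one comment.

    Assumption: the text is lowercased and runs of whitespace are
    collapsed first; the thesis does not state its preprocessing.
    """
    normalized = " ".join(text.lower().split())
    return Counter(
        normalized[i:i + n] for i in range(len(normalized) - n + 1)
    )


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical, non-toxic Roman Urdu comment used only as sample input.
features = char_ngrams("yeh video bohat achi hai")

# Precision and recall as reported in the abstract.
reported_f1 = round(f1_from(0.980, 0.953), 3)

if __name__ == "__main__":
    print(features.most_common(5))
    print(reported_f1)
```

Counts like these would normally be turned into a fixed-length vector per comment before being passed to a classifier such as Random Forest; that step is omitted here.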


This item appears in the following Collection(s)

  • Thesis - MS / PhD
    This collection contains the MS/PhD theses of the students of the Department of Computer Science
