Toxic Comment Classification of Roman Urdu Text

Abbas, Waheed

dc.contributor.author	Abbas, Waheed
dc.date.accessioned	2019-12-20T15:36:17Z
dc.date.available	2019-12-20T15:36:17Z
dc.date.issued	2019-12-19
dc.identifier.uri	http://dspace.cuilahore.edu.pk/xmlui/handle/123456789/1558
dc.description.abstract	We live in an era of technology in which most of us have easy access to the internet. With the internet boom, the use of social media and discussion forums for communication has increased tremendously in recent years. This advancement has also opened the door to trolls who poison social media and forums with abusive behavior toward others, and detecting toxic language online has become an important problem. A toxic comment is defined as a rude, unreasonable, or disrespectful comment that is likely to make you leave a conversation. Current methods for handling toxicity online often depend heavily on manual moderation and do not scale to a user base that grows day by day. In this thesis, we investigate toxic language detection for Roman Urdu (a transliteration of Urdu into the English alphabet), because Roman Urdu is widely used for online communication in Pakistani and Indian communities. Considerable work has been done on detecting toxicity in English, but because no large Roman Urdu corpora are available, no such work has been done so far for Roman Urdu. The aim of this thesis is twofold: first, to develop a large-scale corpus for toxic language detection in Roman Urdu; second, to apply proven classical machine learning methods and state-of-the-art deep learning methods to that corpus. In the first step, we collected comments from YouTube comment sections to build the Roman Urdu corpus. The final corpus consists of 10,863 unique comments, of which 5,887 are toxic and 4,976 are non-toxic. We annotated the corpus manually and had the annotations reviewed by a second annotator.
We applied proven classical machine learning algorithms (Logistic Regression, K-Nearest Neighbors, Naive Bayes, Random Forest, Support Vector Machine, and Multi-layer Perceptron) along with state-of-the-art deep learning methods (Long Short-Term Memory, Bidirectional Long Short-Term Memory, and Convolutional Neural Networks). The best results were obtained with Random Forest on character 4-gram features (Accuracy = 0.964, Precision = 0.980, Recall = 0.953, F1 = 0.966).	en_US
dc.language.iso	en	en_US
dc.publisher	Department of Computer Science, COMSATS University Islamabad, Lahore Campus	en_US
dc.subject	Toxic Classification, Language, Roman Urdu Text, transliteration of Urdu in English, Logistic Regression, K-Nearest Neighbors, Naive Bayes, Random Forest, Support Vector Machine, Multi-layer Perceptron	en_US
dc.title	Toxic Comment Classification of Roman Urdu Text	en_US
dc.type	Thesis	en_US
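
The abstract reports character 4-gram features and a set of evaluation figures. The following is a minimal Python sketch of what those two things mean, not the thesis's implementation: the record does not describe the preprocessing, so the lowercasing and whitespace normalization are assumptions, the function names `char_ngrams` and `f1_score` are hypothetical, and the sample comment is invented for illustration rather than taken from the corpus. The last two lines only check that the reported numbers are internally consistent.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams in a comment.

    Lowercasing and collapsing whitespace are assumptions made for this
    sketch; the record does not say how the thesis normalized text.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical Roman Urdu comment, not taken from the thesis corpus.
grams = char_ngrams("Yeh video bohat achi hai")
print(sorted(grams)[:5])

# Figures reported in the abstract.
toxic, non_toxic = 5887, 4976
print(toxic + non_toxic)                 # 10863, the stated corpus size
print(round(f1_score(0.980, 0.953), 3))  # 0.966, the stated F1
```

Character n-grams are a common choice for transliterated text because the same word is often spelled several ways, and overlapping character windows still share most of their features across spelling variants. In a full pipeline these counts would be fed to a classifier such as a random forest.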