Abstract:
We live in an era of technology in which most of us have easy access to the internet.
Owing to the internet boom, the use of social media, and of discussion forums in particular,
for communication has increased tremendously in recent years. But this advancement has also
opened the door to trolls who poison social media and forums with their abusive behavior
toward others. Detection of toxic language online has become an important issue in
recent years. A toxic comment is defined as a rude, unreasonable, or disrespectful comment
that is likely to make a reader leave a conversation. Current methods for handling
toxicity online often depend heavily on manual moderation and are not scalable
enough to handle a user base that grows day by day.
In this thesis, we investigate the problem of toxic language detection for Roman
Urdu (Urdu transliterated into the English alphabet) text, because Roman Urdu is widely used
for online communication in the Pakistani and Indian communities. Considerable work has been done to
detect toxicity in the English language, but owing to the unavailability of large Roman Urdu
corpora, no such work has been done so far for Roman Urdu. The aim of this thesis is
twofold: first, to develop a large-scale corpus for the detection of toxic language in Roman
Urdu; second, to apply proven classical machine learning and state-of-the-art
deep learning methods to the developed corpus.
In the first step, we collected data from YouTube comment sections to develop
a Roman Urdu corpus. The final corpus consists of 10,863 unique comments, of which
5,887 are toxic and 4,976 are non-toxic. We annotated the corpus
manually and had it reviewed by a second annotator. We applied proven classical machine
learning algorithms such as Logistic Regression, K-Nearest Neighbors, Naive Bayes,
Random Forest, Support Vector Machine, and Multi-layer Perceptron, along with
state-of-the-art deep learning methods such as Long Short-Term Memory, Bidirectional Long
Short-Term Memory, and Convolutional Neural Networks. The best results were obtained with
Random Forest using character 4-gram features (Accuracy = 0.964, Precision = 0.980, Recall =
0.953, F1 = 0.966).
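
As a rough illustration of the character 4-gram features mentioned above, the following minimal sketch counts overlapping 4-character windows in a comment and recomputes F1 from the reported precision and recall. It is not the thesis's implementation: the helper names `char_ngrams` and `f1_score`, the lower-casing and whitespace normalization, and the sample comment are all assumptions made for illustration only.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams in a comment.

    Assumption: the text is lower-cased and runs of whitespace are
    collapsed to a single space; the thesis may preprocess differently.
    """
    normalized = " ".join(text.lower().split())
    return Counter(
        normalized[i:i + n] for i in range(len(normalized) - n + 1)
    )


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical Roman Urdu comment, used only to show the feature shape.
features = char_ngrams("bohat acha")

# F1 recomputed from the precision and recall reported in the abstract.
reported_f1 = f1_score(0.980, 0.953)
```

Each distinct 4-gram would become one feature column for a classifier such as Random Forest. Character-level windows do not depend on a fixed spelling of each word, which may be one reason they suit Roman Urdu, where the same word is often spelled in several ways.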