Abstract:
Technological advances have made the internet accessible to the vast majority of
people. In the current era, the use of social networking sites and discussion
forums as a form of communication has increased dramatically. The same technology
is also available to trolls who spread bullying, hateful, and toxic speech on
social media and in forums.
technology. The identification of cyberbullying on the internet is becoming an
increasingly important topic. During the last decade, Poisonous comments have been
sarcastic, rude, insulting, and bullying, resulting in 26% of people thinking about
suicide or harming themselves, and many go through depression and anxiety. Methods
currently used to address Internet toxicity are difficult to measure since it relies so
heavily on human moderators larger than necessary to handle a growing user base. The
The author researched the detection of cyberbullying, which has become a significant
problem on the internet and affects internet users around the globe; this research is
carried out in three different languages: English as the international language,
Spanish as the third most used language online, and Roman Urdu, the Latin-script form
of the national language of Pakistan, as used to communicate on the internet. With this
in mind, the author begins by employing a wide range of established classical machine
learning techniques and incorporating cutting-edge deep learning techniques on four
different datasets of distinct sizes.
As part of this project's goal, an ensemble model needs to be developed that can
identify hazardous language. Two different ensemble models are proposed: the first is
built from machine learning classifiers, namely Support Vector Machine, Naive Bayes,
Random Forest, and Logistic Regression, combined with feature extraction. The second
ensemble model is built on deep learning techniques using hyperparameter tuning of
CNN, LSTM, and GRU. Their results are also compared with transfer learning classifiers,
including BERT and DistilBERT. DistilBERT gives 100% accuracy, recall, F1-score, and
precision on all four datasets.
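A minimal sketch of the first proposed ensemble, assuming scikit-learn (not
necessarily the author's exact implementation), is shown below: TF-IDF word-unigram
and 3-3 character n-gram features feed a soft-voting combination of the four
classifiers named above. The training call is left commented, with train_texts and
train_labels as hypothetical placeholders.

```python
# Assumed scikit-learn pipeline: TF-IDF word unigrams plus 3-3 character n-grams
# feeding a soft-voting ensemble of SVM, Naive Bayes, Random Forest, Logistic Regression.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

features = FeatureUnion([
    ("word_unigrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("char_3grams", TfidfVectorizer(analyzer="char", ngram_range=(3, 3))),
])

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),        # probability=True enables soft voting
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)

model = Pipeline([("features", features), ("clf", ensemble)])
# model.fit(train_texts, train_labels)        # hypothetical labelled comment data
# predictions = model.predict(test_texts)
```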
In English Dataset 1, the unigram feature gives 74% accuracy, and the 3-3 character
n-gram gives 77% accuracy. In CNN, the Adam optimizer with batch size 32 and 8 epochs
at a 0.01 learning rate gives the highest accuracy of 56%, while in LSTM, the rmsprop
optimizer gives the highest accuracy of 53% with batch size 32 and 8 epochs at a 0.001
learning rate. In GRU, the rmsprop optimizer gives the highest accuracy of 52% with
batch size 32 and 16 epochs at a 0.001 learning rate. The machine learning ensemble
model provides the highest accuracy of 65%, while the deep learning ensemble model
gives 57% accuracy.
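For context, a hedged sketch of the best-reported Dataset 1 configuration, a CNN
trained with the Adam optimizer at a 0.01 learning rate, batch size 32, and 8 epochs,
is given below; it assumes TensorFlow/Keras, and the vocabulary size, embedding
dimension, and filter counts are illustrative assumptions rather than the author's
reported architecture.

```python
# Assumed Keras CNN for toxic-comment classification with the best-reported
# Dataset 1 hyperparameters (Adam, learning rate 0.01, batch size 32, 8 epochs).
import tensorflow as tf

def build_cnn(vocab_size=20000, num_classes=2):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 100),          # illustrative embedding size
        tf.keras.layers.Conv1D(128, 5, activation="relu"),   # illustrative filters/kernel
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),   # best-reported rate
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# model = build_cnn()
# model.fit(X_train, y_train, batch_size=32, epochs=8)   # best-reported batch size/epochs
```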
In English Dataset 2, the unigram feature gives 95% accuracy, and the 4-4 character
n-gram also gives 95% accuracy with Random Forest. In CNN, the rmsprop optimizer with
batch size 32 and 16 epochs at a 0.001 learning rate gives the highest accuracy of 94%,
while in LSTM, the rmsprop optimizer gives the highest accuracy of 92% with batch size
64 and 16 epochs at a 0.01 learning rate. In GRU, the rmsprop optimizer gives the
highest accuracy of 91% with batch size 64 and 16 epochs at a 0.01 learning rate. The
machine learning ensemble model provides the highest accuracy of 87%, while the deep
learning ensemble model gives 80% accuracy. In the Spanish dataset, the unigram feature
gives 78% accuracy, and the 5-5 character n-gram gives 80% accuracy with Logistic
Regression. The highest accuracy of 76% is obtained with batch size 64, 8 epochs, a
0.001 learning rate, and the Adam optimizer in CNN, while 75% is obtained with batch
size 64, 8 epochs, a 0.001 learning rate, and the rmsprop optimizer in LSTM. In GRU,
74% accuracy is obtained with batch size 64, 8 epochs, a 0.01 learning rate, and the
rmsprop optimizer. The machine learning ensemble model provides the highest accuracy
of 79%, while the deep learning ensemble model gives 74% accuracy.
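Finally, as an illustration of the transfer learning comparison, the sketch below
assumes the Hugging Face transformers library with PyTorch; the model checkpoint,
label mapping, and example comments are illustrative assumptions, and in practice the
model would be fine-tuned on the labelled datasets before prediction.

```python
# Assumed Hugging Face setup for DistilBERT toxic-comment classification;
# in the reported experiments the model would be fine-tuned before use.
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2   # assumed mapping: 0 = non-toxic, 1 = toxic
)

texts = ["have a great day", "nobody likes you, just leave"]   # toy example comments
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits              # fine-tuning is needed for meaningful output
preds = logits.argmax(dim=-1)                 # predicted class per comment
```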