Identifying Multilingual Textual Cyberbullying on Social Networking Using a Novel Machine Learning Approach

Fatima, Manahil

dc.contributor.author	Fatima, Manahil
dc.date.accessioned	2024-10-29T10:52:00Z
dc.date.available	2024-10-29T10:52:00Z
dc.date.issued	2024-10-28
dc.identifier.uri	http://repository.cuilahore.edu.pk/xmlui/handle/123456789/4575
dc.description.abstract	Technological advances have made the internet accessible to the vast majority of people. In the current era, the use of social networking sites and discussion forums as a form of communication has increased dramatically. Trolls who spew bullying, hateful, and toxic speech on social media and in forums are also able to use this technology. Identifying cyberbullying on the internet is therefore becoming an increasingly important topic. During the last decade, toxic comments have been sarcastic, rude, insulting, and bullying, resulting in 26% of people thinking about suicide or self-harm, and many experience depression and anxiety. Current methods for addressing internet toxicity are difficult to scale because they rely heavily on human moderators, who struggle to keep up with a growing user base. The author researched the detection of cyberbullying, which has become a significant problem on the internet and affects users around the globe; the research covers three languages: English as the international language, Spanish as the third most used online language, and Roman Urdu, which is widely used for online communication in South Asia. With this in mind, the author begins by applying a wide range of established classical machine learning and deep learning techniques to four datasets of different sizes. The goal of the project is to develop an ensemble model that can identify harmful language. Two ensemble models are proposed. The first combines the machine learning classifiers Support Vector Machine, Naive Bayes, Random Forest, and Logistic Regression using feature extraction. The second is built on deep learning techniques, using hyperparameter tuning of CNN, LSTM, and GRU. 
The study also compares these results with transfer learning classifiers, including BERT and DistilBERT. DistilBERT gives 100% accuracy, recall, F1-score, and precision on all four datasets. On English Dataset 1, the unigram feature gives 74% accuracy, and the 3-3 character n-gram gives 77% accuracy. For CNN, the Adam optimizer with batch size 32, 8 epochs, and a learning rate of 0.01 gives the highest accuracy of 56%, while for LSTM, the RMSprop optimizer gives the highest accuracy of 53% with batch size 32, 8 epochs, and a learning rate of 0.001. For GRU, the RMSprop optimizer gives the highest accuracy of 52% with batch size 32, 16 epochs, and a learning rate of 0.001. The machine learning ensemble model achieves its highest accuracy of 65%, while the deep learning ensemble model gives 57% accuracy. On English Dataset 2, the unigram feature gives 95% accuracy, and the 4-4 character n-gram gives 95% accuracy with Random Forest. For CNN, the RMSprop optimizer with batch size 32, 16 epochs, and a learning rate of 0.001 gives the highest accuracy of 94%, while for LSTM, the RMSprop optimizer gives the highest accuracy of 92% with batch size 64, 16 epochs, and a learning rate of 0.01. For GRU, the RMSprop optimizer gives the highest accuracy of 91% with batch size 64, 16 epochs, and a learning rate of 0.01. The machine learning ensemble model achieves its highest accuracy of 87%, while the deep learning ensemble model gives 80% accuracy. On the Spanish dataset, the unigram feature gives 78% accuracy, and the 5-5 character n-gram gives 80% accuracy with Logistic Regression. For CNN, the highest accuracy of 76% is obtained with batch size 64, 8 epochs, a learning rate of 0.001, and the Adam optimizer, while LSTM reaches 75% with batch size 64, 8 epochs, a learning rate of 0.001, and the RMSprop optimizer. For GRU, 74% accuracy is obtained with batch size 64, 8 epochs, a learning rate of 0.01, and the RMSprop optimizer. The machine learning ensemble model achieves its highest accuracy of 79%, while the deep learning ensemble model gives 74% accuracy.	en_US
dc.publisher	Computer Science Department COMSATS University Islamabad Lahore Campus	en_US
dc.relation.ispartofseries	CIIT/SP20-RCS-022/LHR;8347
dc.subject	Technological advances have made the internet accessible to the vast majority of people. In the current era, the use of social networking sites and discussion forums as a form of communication has increased dramatically	en_US
dc.title	Identifying Multilingual Textual Cyberbullying on Social Networking Using a Novel Machine Learning Approach	en_US
dc.type	Thesis	en_US
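
The abstract mentions fixed-width character n-gram features (for example the 3-3 range) and an ensemble of several base classifiers. The following is only a minimal sketch of those two ideas, not the thesis implementation: the function names char_ngrams and majority_vote are hypothetical, lowercasing the input is an assumption, and hard majority voting is assumed because the record does not state how the ensemble combines its members.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`.

    n=3 corresponds to a fixed 3-3 range. Lowercasing is an
    assumption made for this sketch, not something the record states.
    """
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def majority_vote(predictions):
    """Hard-voting ensemble (assumed combination rule).

    Returns the label predicted by the most base classifiers.
    Ties go to the label that appears first in `predictions`.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Bag of character 3-grams for one comment.
features = Counter(char_ngrams("so rude", 3))

# Hypothetical outputs of four base classifiers for the same comment.
label = majority_vote(["bullying", "bullying", "neutral", "bullying"])
```

In practice the n-gram counts would be fed to each base classifier, and only the resulting labels would reach the voting step; a weighted or stacked combination would be an equally plausible reading of the abstract.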