Abstract:
Web-based data is increasing day by day and plays a vital role in developing people’s
opinions. Sentiment mining, or sentiment analysis, is the natural language processing
task that helps to identify and classify these opinions. Research on sentiment mining
usually focuses on resource-rich languages. In this thesis, we performed classification of various sentiments
using feature selection techniques for a resource-poor language, namely Roman Urdu. These
feature selection techniques are chi-square, mutual information, and select-from-model,
which are applied to a Roman Urdu dataset of 11k reviews. Well-known
machine learning algorithms are applied for the experimental analysis, including
Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF),
Decision Tree (DT), Multinomial Naïve Bayes (MNB), and Multi-Layer Perceptron
(MLP). These are applied to character-level and word-level features with n-gram
variations: bigram to 7-gram for character-level classification, and Uni, Bi, Uni-Bi,
Uni-Bi-Tri, and Uni-Bi-Tri-Four gram for word-level classification. Results are
evaluated using accuracy, precision, recall, and F1-score.
The highest accuracies achieved for word-level and character-level features are 83.93%
and 83.72%, respectively, which improve on the baseline score of 82.46% on feature
union, while the corresponding F1-scores are 90.51% and 90.42%. Some well-known neural
network techniques are also applied in this thesis, including CNN, LSTM, and Bi-LSTM.
The best results are achieved by Bi-LSTM, which gives 91.8% accuracy and a 91.7%
F1-score.
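
As an illustration of the feature pipeline summarized above, the following minimal sketch shows how character n-grams, combined word n-grams (for example Uni-Bi), and a chi-square score for a single feature might be computed. It is not the implementation used in the thesis: the function names and the four toy Roman Urdu phrases are hypothetical, and the chi-square statistic is the textbook 2x2 presence/absence form, which may differ from the library routine used in the experiments.

```python
def char_ngrams(text, n):
    """Return all character n-grams of `text` (spaces are kept)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n_values):
    """Return word n-grams for every n in `n_values`, e.g. (1, 2) for Uni-Bi."""
    tokens = text.split()
    grams = []
    for n in n_values:
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def chi_square(docs, labels, feature, positive_label):
    """Chi-square of the 2x2 table: feature present/absent vs. class membership."""
    a = b = c = d = 0
    for feats, label in zip(docs, labels):
        present = feature in feats
        is_pos = label == positive_label
        if present and is_pos:
            a += 1          # feature present, positive class
        elif present:
            b += 1          # feature present, other class
        elif is_pos:
            c += 1          # feature absent, positive class
        else:
            d += 1          # feature absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical toy reviews (invented for illustration, not from the dataset).
reviews = ["bohat acha phone", "acha camera", "bura phone", "bohat bura"]
labels = ["pos", "pos", "neg", "neg"]

# Represent each review as the set of its unigrams and bigrams (Uni-Bi).
docs = [set(word_ngrams(r, (1, 2))) for r in reviews]

# Score every feature and keep the highest-scoring ones.
vocab = sorted(set().union(*docs))
scores = {f: chi_square(docs, labels, f, "pos") for f in vocab}
top_features = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this toy setting, features that occur in only one class receive the highest scores, while a word such as "phone", which appears once in each class, scores zero and would be dropped by the selection step.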