Abstract:
Author region identification is a subtask of author profiling that aims to identify the
region of an author from written text. It has applications in fake profile detection,
content recommendation, sales and marketing, and forensic linguistics, and may thereby
improve security and help reduce cyber-crime. A lot of work has been done on several
author profiling tasks in English, Arabic and several European languages, but no
significant work has been reported on author region identification for the Urdu language.
To fill this gap, we propose a large benchmark corpus and methods to predict an author’s
region from Urdu language articles.
The proposed corpus consists of 12,100 Urdu articles written by authors from different
regions. The articles are categorized into three regions: Pakistan, India and Others.
There are 5,000 articles for the Pakistan region, 5,000 for the India region and 2,100
for the Others category. For each article, the web link, author name and domain are
recorded for future research purposes.
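As a minimal sketch of the corpus layout described above, one article could be represented as a record holding its text, region label, web link, author name and domain. The class name `Article`, its field names and the helper `class_shares` are hypothetical and chosen only for illustration; only the three region counts come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """One corpus entry (field names are illustrative assumptions)."""
    text: str    # full Urdu article text
    region: str  # "Pakistan", "India" or "Others"
    url: str     # web link of the article
    author: str  # author name
    domain: str  # domain of the article


# Per-region article counts as stated in the abstract.
REGION_COUNTS = {"Pakistan": 5000, "India": 5000, "Others": 2100}


def class_shares(counts):
    """Return the fraction of the corpus that belongs to each region."""
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}


if __name__ == "__main__":
    print("total articles:", sum(REGION_COUNTS.values()))
    for region, share in class_shares(REGION_COUNTS).items():
        print(f"{region}: {share:.1%}")
```

The Others category is less than half the size of the other two classes, which is one reason support-weighted evaluation measures are a natural fit for this corpus.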
To demonstrate how the proposed corpus can be used for the development and evaluation
of author region identification methods for Urdu, we applied several classical machine
learning and deep learning approaches. For classical machine learning, we used word
n-grams for feature extraction and then applied the Logistic Regression, SVM, Naïve
Bayes, Random Forest, AdaBoost and Gradient Boosting classifiers to the multi-class
classification task. For deep learning, we applied CNN, RNN, LSTM, Bi-LSTM and GRU
models to the proposed corpus. Evaluation was carried out using accuracy, weighted
precision, weighted recall and weighted F1-measure. Experiments show that the best
classical machine learning result, an accuracy of 85%, is achieved by the SVM classifier
on 3,000 word n-gram features (n ranging from 1 to 3), while among the deep learning
approaches a single-layer CNN outperforms the other models with an accuracy of 96%.
The CNN thus performed best among all classical machine learning and deep learning
methods for the author region identification task on the proposed Urdu article corpus.
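The feature extraction and evaluation measures described above can be illustrated with a short, dependency-free sketch. It is not the authors' implementation. Whitespace tokenization, selecting the feature set by corpus frequency, and all function names are assumptions made for illustration, and the classifier itself is omitted.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of a whitespace-tokenized text, n_min <= n <= n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(texts, max_features=3000):
    """Map the max_features most frequent n-grams to column indices (assumed criterion)."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text))
    # Sort by descending count, then alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: idx for idx, (gram, _) in enumerate(ranked[:max_features])}


def vectorize(text, vocabulary):
    """Turn one text into a vector of n-gram counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for gram in word_ngrams(text):
        idx = vocabulary.get(gram)
        if idx is not None:
            vector[idx] += 1
    return vector


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1-measure."""
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    scores = {"accuracy": sum(t == p for t, p in pairs) / total,
              "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = (tp + fn) / total  # share of true instances with this label
        scores["precision"] += weight * precision
        scores["recall"] += weight * recall
        scores["f1"] += weight * f1
    return scores


if __name__ == "__main__":
    # Placeholder texts stand in for Urdu articles.
    texts = ["w1 w2 w3", "w2 w3 w4"]
    vocabulary = build_vocabulary(texts, max_features=3000)
    print(vectorize("w2 w3", vocabulary))
    print(weighted_scores(["PK", "PK", "IN", "OT"], ["PK", "IN", "IN", "OT"]))
```

Weighting each class by its number of true instances keeps the smaller Others category from counting as much as the two larger classes, which is the usual motivation for reporting weighted rather than macro-averaged scores on an imbalanced corpus.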