Abstract:
Author profiling is the task of classifying author attributes, where the main aim is to
predict the profile and demographic features of an author, such as age group, gender,
region, and personality, by examining the author's written content. Author profiling has
several promising applications, including security, forensic analysis, identification of
harassing text messages, marketing intelligence, and fake profile identification. In the
literature, the majority of studies have been carried out for single-label author profiling,
i.e., predicting only one label at a time. Very few studies are available on multi-label
author profiling, i.e., predicting more than one label at a time, and these are limited to
mono-lingual text. Hence, the problem of multi-label author profiling has not been fully
explored for multi-lingual text. The main objective of this research work is to explore
the problem of multi-label author profiling on multi-lingual text (English and Roman Urdu).
For this purpose, the aim is to predict four author traits, namely gender, age, education,
and language, as a multi-label task using three state-of-the-art families of methods:
(1) Content-based Methods (n-gram models at both the word and character level),
(2) Deep Learning Approaches (CNN, LSTM, Bi-LSTM, GRU, and Bi-GRU), and
(3) Transfer Learning Approaches (BERT and XLNet). The evaluations were carried out
on three benchmark multi-lingual datasets: RUEN-AP-17, SMS-AP-18, and BT-AP-19.
After extensive experimentation and comparison, the results show that the content-based
methods outperform the deep learning and transfer learning methods for the multi-label
author profiling task on all multi-lingual corpora used in this study. On the RUEN-AP-17
corpus, the best results (Accuracy = 0.71, F1-measure = 0.65) were obtained using the
word tri-gram model with the Naïve Bayes classifier. On the SMS-AP-18 corpus, the best
results (Accuracy = 0.74, F1-measure = 0.69) were obtained using the word uni-gram
model with a support vector machine under one-vs-rest and one-vs-one classification
schemes, and on the BT-AP-19 corpus, the best results (Accuracy = 0.74,
F1-measure = 0.69) were obtained using the word bi-gram model with a support vector
machine under one-vs-rest and one-vs-one classification schemes.
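
To make the content-based setup concrete, the following is a minimal illustrative sketch, not the study's actual code, of multi-label author profiling with word n-gram (TF-IDF) features and a one-vs-rest linear support vector machine in scikit-learn; the toy texts, trait labels, and variable names are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' implementation): multi-label author
# profiling with word n-gram TF-IDF features and a one-vs-rest linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy multi-lingual (English / Roman Urdu) messages; each author has several traits.
texts = [
    "kya haal hai, I will call you after class tomorrow",
    "meeting postponed, please check the revised schedule",
]
labels = [
    {"gender:male", "age:15-19", "education:graduate", "lang:roman-urdu"},
    {"gender:female", "age:20-24", "education:postgraduate", "lang:english"},
]

# Binarise the label sets so each trait value becomes one output column.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Word uni-gram TF-IDF features feeding a one-vs-rest linear SVM,
# mirroring the content-based configuration described in the abstract.
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("ovr_svm", OneVsRestClassifier(LinearSVC())),
])
model.fit(texts, Y)

# Predict all four traits jointly for an unseen message.
predicted = mlb.inverse_transform(model.predict(["assignment submit kar dena by tonight"]))
print(predicted)
```

Since the best configuration varies by corpus (e.g., word tri-grams with Naïve Bayes on RUEN-AP-17, word bi-grams with an SVM on BT-AP-19), the vectoriser's ngram_range and the classifier in this sketch would be swapped accordingly.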