CUI Lahore Repository

Author Region Identification for Urdu Language


dc.contributor.author Akram Chughtai, Rehman
dc.date.accessioned 2021-06-03T06:50:26Z
dc.date.available 2021-06-03T06:50:26Z
dc.date.issued 2021-06-03
dc.identifier.uri http://repository.cuilahore.edu.pk/xmlui/handle/123456789/2188
dc.description.abstract Author region identification is a subtask of author profiling that aims to identify an author's region from written text. It has applications in fake profile detection, content recommendation, sales and marketing, and forensic linguistics, and may therefore help improve security and reduce cyber-crime. A lot of work has been done on author profiling tasks in English, Arabic and several European languages, but no significant work was found on author region identification for Urdu. To fill this gap, I propose a large benchmark corpus and methods to predict an author's region from Urdu articles. The proposed corpus consists of 12,100 Urdu articles written by authors from different regions, categorized into three classes: Pakistan (5,000 articles), India (5,000 articles) and Others (2,100 articles). For each article, the web link, author name and domain are recorded for future research purposes. To demonstrate how the proposed corpus can be used for the development and evaluation of author region identification for Urdu, I applied classical machine learning and deep learning approaches. For classical machine learning, I used word n-grams for feature extraction and then applied Logistic Regression, SVM, Naïve Bayes, Random Forest, AdaBoost and Gradient Boosting classifiers to the multi-class classification task. For deep learning, I applied CNN, RNN, LSTM, Bi-LSTM and GRU models to the proposed corpus. Evaluation was carried out using accuracy, weighted precision, weighted recall and weighted F1-measure. Experiments show that among the classical approaches the best result, an accuracy of 85%, is achieved by the SVM classifier on 3,000 features with word n-grams (n ranging from 1 to 3), while among the deep learning approaches a one-layer CNN outperforms the other classifiers with an accuracy of 96%. The CNN performed best among all classical machine learning and deep learning methods for the author region identification task on the proposed Urdu article corpus. en_US
dc.publisher Department of Computer Science, COMSATS University Lahore. en_US
dc.subject Author Region Identification for Urdu Language en_US
dc.title Author Region Identification for Urdu Language en_US
dc.type Thesis en_US
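
The abstract above mentions word n-gram features (n from 1 to 3, capped at 3,000 features) and weighted precision, recall and F1 as evaluation measures. The following is a minimal sketch of those two ideas in plain Python, not the thesis implementation. The function names, the whitespace tokenisation, the frequency-based feature cap and the toy labels "PK", "IN" and "OT" are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of a text for n in [n_min, n_max].

    Assumption: tokens are separated by whitespace.
    """
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(texts, max_features=3000):
    """Map the max_features most frequent n-grams to column indices.

    Assumption: the feature cap keeps the most frequent n-grams.
    """
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: idx for idx, (gram, _) in enumerate(ranked[:max_features])}


def vectorize(text, vocabulary):
    """Count how often each vocabulary n-gram occurs in the text."""
    vector = [0] * len(vocabulary)
    for gram in word_ngrams(text):
        idx = vocabulary.get(gram)
        if idx is not None:
            vector[idx] += 1
    return vector


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / total  # share of true examples with this label
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1
```

The count vectors produced by `vectorize` could then be passed to any classifier, such as an SVM. Weighting each class by its support matters for a corpus like this one, where the Others class is smaller than the Pakistan and India classes.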



This item appears in the following Collection(s)

  • Thesis - MS / PhD
    This collection contains the MS/PhD theses of the students of the Department of Computer Science
