Abstract:
Multi-label document classification is a standard supervised machine learning task in which each document is assigned multiple labels that together describe its content. The task arises in every natural language and is relevant to a wide range of research areas, such as natural language processing, document categorization and text mining. Many benchmark corpora have been developed for the multi-label classification task, but most are for English and other European languages. For the Urdu language, annotated benchmark corpora and methods for multi-label document classification remain scarce. To fill this gap, we propose a large benchmark corpus and methods for the multi-label classification of Urdu documents.
The proposed corpus contains 600 documents from the field of journalism, covering the following domains: sports, national, foreign, showbiz and business. The corpus has been manually annotated with the USAS (UCREL Semantic Analysis System) scheme, which provides 21 major labels and 232 sub-labels for fine-grained annotation. Each document was annotated with a minimum of two and a maximum of six labels, so that its content is described in detail and the corpus can be used directly for supervised multi-label document classification.
To demonstrate how the proposed dataset can be used for the development and evaluation of Urdu multi-label document classification systems, three feature extraction techniques were applied: (1) Multi-Word Expression (MWE) N-gram, (2) Single-Word Expression (SWE) N-gram and (3) Most Frequent Word (MFW) N-gram. Seven multi-label classifiers were trained on the features extracted with each of the MWE, SWE and MFW techniques. Evaluation was carried out using the accuracy, exact match, Hamming loss and F1 measures. The best results (F1 = 0.545) were obtained with the MFW technique using the CC and MCC classifiers.
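The four evaluation measures named above can be illustrated with a minimal sketch. It assumes the example-based (per-document) definitions commonly used in the multi-label literature; the abstract does not state which variant of accuracy or F1 was used, so the Jaccard-style accuracy and the per-document F1 below are assumptions. The function names and the label identifiers (L1 to L5) are hypothetical placeholders, not labels from the corpus.

```python
from typing import List, Set


def hamming_loss(y_true: List[Set[str]], y_pred: List[Set[str]], n_labels: int) -> float:
    """Fraction of (document, label) slots that are predicted wrongly."""
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))  # symmetric difference
    return wrong / (len(y_true) * n_labels)


def exact_match(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Fraction of documents whose predicted label set equals the true set."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Mean Jaccard overlap per document (assumed definition of accuracy)."""
    scores = [len(t & p) / len(t | p) if (t | p) else 1.0
              for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


def f1_example_based(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Mean per-document F1 (assumed variant of the F1 measure)."""
    scores = [2 * len(t & p) / (len(t) + len(p)) if (t or p) else 1.0
              for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


# Hypothetical toy data: three documents, a label space of five labels.
y_true = [{"L1", "L2"}, {"L2", "L3"}, {"L1", "L3", "L4"}]
y_pred = [{"L1", "L2"}, {"L2"}, {"L1", "L4", "L5"}]

print("Hamming loss:", hamming_loss(y_true, y_pred, n_labels=5))
print("Exact match: ", exact_match(y_true, y_pred))
print("Accuracy:    ", accuracy(y_true, y_pred))
print("F1:          ", f1_example_based(y_true, y_pred))
```

In this toy example only the first document is predicted perfectly, so exact match is low while accuracy and F1 still give credit for partially correct label sets. This is why multi-label evaluations typically report several measures side by side.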