Neural Architecture Based Urdu Named Entity Recognition

Ehsan, Muhammad

URI: http://repository.cuilahore.edu.pk/xmlui/handle/123456789/2335

Date: 2021-06-04

Abstract:

In natural language processing, named entity recognition (NER) is the task of identifying named entities mentioned in unstructured text and classifying them into predefined categories. To compare, evaluate and develop a NER system, we need standard, high-quality benchmark resources. Unfortunately, large corpora are available for many other languages but not for Urdu, even though Urdu is a rich language with more than 300 million speakers worldwide. State-of-the-art sequence labeling systems rely heavily on domain-specific knowledge and hand-crafted features to learn efficiently from small data. NER can be solved more effectively when a large, high-quality labeled dataset is available, but for most languages the labeled data available for training is limited. In addition, there are scarcely any constraints on which words can be named entities, so learning from a small amount of labeled examples is difficult. Consequently, carefully constructed orthographic features and language-specific knowledge resources (e.g., gazetteers) are widely used to solve this task. Unfortunately, for new languages and new domains, developing language-specific resources and extracting features is an expensive process, which makes adapting an efficient NER system a challenge. One way to build an efficient NER system is to use deep neural network models, which have achieved the highest performance in natural language processing (NLP). However, this performance depends heavily on the availability of a large amount of annotated data. Another way to generalize better from a small amount of data is to deploy semi-supervised models that learn from a small set of annotated examples and a large set of unannotated data.
Because annotated data is limited for many NLP problems such as NER in the Urdu language, we address the NER problem using deep neural algorithms. We perform 44 experiments using a recurrent neural network (RNN), long short-term memory (LSTM), bidirectional LSTM, gated recurrent unit (GRU), bidirectional GRU and GRUCell in a supervised learning setting and compare the results with previous models. Our model outperforms the previous results with an F1-score of 91.43%. We also use a multi-layer perceptron in a semi-supervised setting; this model achieves an F1-score of 51.00%. We use standard evaluation measures to assess the performance of the NER system. However, the deep learning results in the semi-supervised setting were unable to beat the deep learning results in the supervised setting.
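
As an illustration of the kind of standard evaluation measure reported above, the following minimal Python sketch computes precision, recall and F1-score over entity spans. It rests on assumptions that the abstract does not state: that tags follow the BIO scheme and that scoring is done at the entity level with exact span matching. The function names `extract_spans` and `span_f1` and the example tags are hypothetical and are not taken from the thesis.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); an assumed representation
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence tagged in the (assumed) BIO scheme."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        # Close the open span on "O", on a new "B-", or on an "I-" of another type.
        if start is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != label
        ):
            spans.add((start, i, label))
            start, label = None, None
        # Open a span on "B-", or leniently on an "I-" with no span open.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact-match entity spans."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_spans |= {(idx, *s) for s in extract_spans(g)}
        pred_spans |= {(idx, *s) for s in extract_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(span_f1(gold, pred))
```

In this toy example the prediction recovers one of the two gold entities and proposes no spurious ones, so precision is 1.0 and recall is 0.5. A token-level variant would give different numbers, and the abstract does not say which variant was used.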
