Abstract:
In natural language processing, named entity recognition (NER) is the task of identifying
named entities mentioned in unstructured text and classifying them into predefined
categories. To compare, evaluate, and develop an NER system, we need standard,
high-quality benchmark resources. Unfortunately, while large corpora are available for
many other languages, they are not available for Urdu, even though Urdu is a rich
language with more than 300 million speakers worldwide.
State-of-the-art sequence labeling systems rely heavily on large amounts of domain-specific knowledge and hand-crafted features to learn efficiently from small data. When
the dataset is large and of high quality, the NER task can be solved more effectively
and yields better results, but for most languages the labeled data available for
training is limited. Moreover, there are few constraints on which words can be names,
so generalizing from a small amount of labeled data is difficult. Consequently, carefully
built orthographic features and language-specific knowledge resources (e.g., gazetteers) are
widely used to solve this task. Unfortunately, for new languages and new
domains, developing language-specific resources and extracting features is
an expensive process. As a result, adapting an efficient NER system becomes
a challenge. Another way to build an efficient NER system is to use deep neural
network models, which have achieved the highest performance in natural language
processing (NLP). However, this performance depends heavily on the availability of large
amounts of annotated data. A further way to generalize better from a small amount
of data is to deploy semi-supervised models that learn from a small set of annotated
examples and a large set of unannotated data. Because annotated data is limited for many
NLP problems, such as NER for the Urdu language, we address the NER problem using deep
neural algorithms.
Using deep learning methods, we perform 44 experiments with a recurrent neural network
(RNN), long short-term memory (LSTM), bidirectional LSTM, gated recurrent unit
(GRU), bidirectional GRU, and GRUCell in a supervised learning setting and compare them
with the previous models. Our model outperforms the previous results with an F1-score of
91.43%. We also use a multi-layer perceptron in a semi-supervised setting; this model
achieves an F1-score of 51.00%. We use standard evaluation measures to assess the
performance of the NER system. However, the deep learning results in the semi-supervised
setting do not surpass those obtained in the supervised setting.
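
As an illustration of the standard evaluation measures mentioned above, precision, recall, and F1-score for NER could be computed along the following lines. This is only a minimal sketch, assuming entity-level scoring with exact matching of span boundaries and entity type. The function name `entity_prf`, the `Span` representation, and the sample spans are hypothetical and are not taken from this work, which may score differently (for example, at the token level).

```python
from typing import Set, Tuple

# Hypothetical span representation: (start token index, end token index, entity type)
Span = Tuple[int, int, str]


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return entity-level (precision, recall, F1) using exact span-and-type matching."""
    true_pos = len(gold & pred)  # predicted entities that exactly match a gold entity
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: the second prediction has the right span but the wrong type,
# so it counts as both a false positive and a false negative.
gold = {(0, 1, "PERSON"), (4, 4, "LOCATION"), (7, 8, "ORGANIZATION")}
pred = {(0, 1, "PERSON"), (4, 4, "ORGANIZATION"), (7, 8, "ORGANIZATION")}
p, r, f = entity_prf(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Because F1 is the harmonic mean of precision and recall, a model must both find most entities and avoid spurious ones to score well.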