Abstract:
Lemmatization is the process of obtaining the root form (lemma) of a given word. A lemmatizer is an important
part of the Natural Language Processing (NLP) toolkit and is essential for many NLP systems, e.g.,
Information Retrieval (IR), plagiarism and text reuse detection, Information Extraction (IE),
Machine Translation (MT), and Word Sense Disambiguation (WSD). Urdu is a widely spoken
language, but very little work has been done on developing basic NLP tools for it,
including an Urdu lemmatizer. Since Urdu is a morphologically rich language whose
words have many inflectional and derivational forms, developing an efficient lemmatizer is a
challenging task, and such a tool will be useful for many Urdu NLP applications.