Abstract:
Unsupervised part-of-speech (POS) tagging is a challenging task for learning linguistic structure, especially for low-resource languages. Most recent research on this topic has used the Hidden Markov Model (HMM) with variations in model structure, datasets, and hyper-parameter values. Much of this work targets high-resource monolingual, bilingual, or multilingual settings; cross-lingual approaches compare the dictionaries of two languages and then assign tags to unknown words, but no adequate or efficient model has been trained on low-resource languages, particularly in the unsupervised setting. Most prior work also re-tunes the hyper-parameters after each iteration, while only a few studies learn the parameters once and fix those values across all iterations and languages. We focus on the core idea of unsupervised POS tagging, namely clustering unannotated data by grouping similar word-tag pairs using a standardized tag set, and we work on Urdu, a low-resource language, as well as on English.
We designed a model that follows the common approach of maximizing the probability of the hidden structure given the observed data; maximum-likelihood estimation (MLE) of the model parameters is used for this purpose. A Bayesian HMM improves performance in unsupervised part-of-speech tagging: rather than committing to a single set of hyper-parameter values, the Bayesian approach considers all possible values and then selects the pair of hyper-parameter values that gives the best results. Our model has the structure of a regular trigram HMM. Its accuracy for English is close to that of the state-of-the-art unsupervised selective model and is 9% higher than maximum-likelihood estimation. For Urdu, there is no prior work in unsupervised POS tagging, so its accuracy of 77.6% cannot be compared with any existing model; this figure is lower than the English result because of missing linguistic resources for Urdu. In future work, we plan to keep improving our model until it reaches its maximum achievable accuracy.
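The setup described above can be summarized with a brief sketch of a trigram HMM under a Bayesian treatment; the symmetric Dirichlet priors and the symbols $\tau$, $\omega$, $\alpha$, $\beta$ are illustrative assumptions rather than notation taken from this work:

\begin{align}
  P(\mathbf{w}, \mathbf{t} \mid \tau, \omega)
    &= \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}, \tau)\; P(w_i \mid t_i, \omega), \\
  \tau^{(t', t'')} &\sim \mathrm{Dirichlet}(\alpha) \quad \text{for each tag-pair context } (t', t''), \\
  \omega^{(t)} &\sim \mathrm{Dirichlet}(\beta) \quad \text{for each tag } t .
\end{align}

Under MLE the transition and emission parameters $\tau$ and $\omega$ are point estimates chosen to maximize this joint probability, whereas in the Bayesian variant they are integrated out, so only the hyper-parameters $(\alpha, \beta)$ are varied and the best-scoring pair is retained.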