CUI Lahore Repository

Class-based Unsupervised Parts-of-Speech Tagging for Low Resource Languages

dc.contributor.author Ashraf, Zunaira
dc.date.accessioned 2021-06-03T10:29:25Z
dc.date.available 2021-06-03T10:29:25Z
dc.date.issued 2021-06-03
dc.identifier.uri http://repository.cuilahore.edu.pk/xmlui/handle/123456789/2243
dc.description.abstract Unsupervised part-of-speech (POS) tagging is a challenging task, especially for low-resource languages. Most recent research on this topic has used the Hidden Markov Model (HMM) with variations in models, datasets, and hyper-parameter values. Much of it worked on high-resource languages in monolingual, bilingual, or multilingual settings; cross-lingual work compared the dictionaries of two languages and then assigned tags to latent words. However, no proper or efficient model has been learned for low-resource languages, especially in the unsupervised setting. The majority of these studies tuned the hyper-parameters after each iteration, while a few learned the parameters and fixed these values for all iterations and all languages. We focus on the main concept of unsupervised POS tagging, i.e., clustering unannotated data by grouping similar word-tags using standardized tag sets/clusters, and we work on Urdu, a low-resource language, as well as on English. We designed a model that follows the common approach of maximizing the probability of the hidden structure given the observed data, using maximum-likelihood estimation (MLE) of the model parameters for this purpose. A Bayesian HMM improved performance in unsupervised part-of-speech tagging. The Bayesian approach works over all possible hyper-parameter values instead of a single set of parameter values; we then choose the set of hyper-parameter values that gives the best results. Our model has the structure of a regular trigram HMM. Its accuracy for English is close to that of the state-of-the-art unsupervised selective model and is 9% higher than that of maximum-likelihood estimation. For Urdu, there is no prior work on unsupervised POS tagging, so its accuracy of 77.6% is not compared with any other model; it is lower because of missing information for the Urdu language. In future work, we plan to keep improving the model's accuracy. en_US
dc.publisher Department of Computer Science, COMSATS University Lahore. en_US
dc.relation.ispartofseries ;6411
dc.subject Class-based Unsupervised Parts-of-Speech Tagging for Low Resource Languages, HMM, Unsupervised POS, Bayesian HMM en_US
dc.title Class-based Unsupervised Parts-of-Speech Tagging for Low Resource Languages en_US
dc.type Thesis en_US


This item appears in the following Collection(s)

  • Thesis - MS / PhD
    This collection contains the MS/PhD theses of the students of the Department of Computer Science
