Development of Large Scale English-Urdu Machine Translation Corpus for Statistical and Neural Machine Translation Systems

Hussain, Moodser

URI: http://dspace.cuilahore.edu.pk/xmlui/handle/123456789/1578

Date: 2019-12-19

Abstract:

Machine Translation (MT) is the process of automatically converting text in one natural language into another. The two state-of-the-art MT techniques are Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). Both require a large-scale, aligned parallel corpus of source- and target-language text to produce quality models. For Urdu, few parallel resources have been developed so far, and the existing ones are not sufficient to build quality translation systems. This study focuses on: (1) the development of a large-scale, semi-automatically aligned English-Urdu parallel corpus of more than 150k sentences, collected from different domains; (2) the compilation of large monolingual corpora for English and Urdu by merging existing corpora; (3) the development of a character-separated parallel corpus for supervised transliteration, built from Wikipedia titles; (4) the evaluation of the impact of the language model and the monolingual corpus on translation quality, using different data and word orders in language model training; (5) the development of Phrase-based SMT (PB-SMT) and Recurrent Neural Network-based NMT (RNN-NMT) systems to evaluate and compare their performance on the proposed parallel corpus; and (6) the use of supervised transliteration and minimum error rate training to improve the quality of SMT. Bilingual Evaluation Understudy (BLEU) is used as the evaluation measure for the MT systems. For English-to-Urdu translation, the best scores achieved are 37.39 with PB-SMT and 32.70 with RNN-NMT. For Urdu-to-English translation, the best scores are 28.20 with PB-SMT and 30.63 with RNN-NMT.
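The abstract reports BLEU scores but does not say which scoring tool or tokenization was used. As an illustration only, the following is a minimal sketch of corpus-level BLEU, assuming a single reference per sentence, whitespace tokenization, n-grams up to order 4, and no smoothing. The function names `ngram_counts` and `corpus_bleu` are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions (not taken from the thesis): one reference per
    hypothesis, whitespace tokenization, no smoothing.
    """
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = 0
    ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = hyp.split()
        ref_tokens = ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its reference count.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat"]
    refs = ["the cat sat on a mat"]
    print(round(corpus_bleu(hyps, refs), 2))
```

Scores from a sketch like this are not directly comparable to the figures reported above, since BLEU is sensitive to tokenization, the number of references, and smoothing.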

Files in this item

Name: final_Thesis_FA17 ...

Size: 1.650Mb

Format: PDF

Description: MS Thesis

This item appears in the following Collection(s)

Thesis - MS / PhD
This collection contains the MS/PhD theses of the students of the Department of Computer Science
