Abstract:
The process of automatically converting the text of one natural language into another
is known as Machine Translation (MT). The two state-of-the-art machine
translation techniques are Statistical Machine Translation (SMT) and Neural Machine
Translation (NMT). Both techniques require a large-scale, sentence-aligned parallel
corpus of source and target language text to produce quality models. For the Urdu
language, few parallel resources have been developed so far, and the existing ones
are not sufficient to build high-quality translation systems.
This research study focuses on: (1) the development of a large-scale,
semi-automatically aligned English-Urdu parallel corpus of more than 150k sentences,
collected from different domains; (2) the compilation of large-sized monolingual
data for English and Urdu by merging existing corpora; (3) the development of a
character-separated parallel corpus for supervised transliteration (using Wikipedia
titles); (4) the evaluation of the impact of the language model and the monolingual
corpus on translation quality (using different data and word orders in language
model training); (5) the development of Phrase-based SMT (PB-SMT) and Recurrent
Neural Network-based NMT (RNN-NMT) systems to evaluate and compare their performance
on the proposed parallel corpus; and (6) the use of supervised transliteration and
‘minimum error rate training’ to enhance the quality of SMT.
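As an illustration of what a character-separated transliteration corpus can look
like, the following is a minimal sketch and not the preprocessing used in this
study. The function name `char_separate`, the `_` marker for word boundaries, the
`|||` separator and the two example title pairs are all assumptions made for
illustration.

```python
def char_separate(title, space_marker="_"):
    """Turn a title into a space-separated character sequence.

    Assumption: word boundaries are kept as a separate marker token so the
    original spacing can be restored after transliteration.
    """
    words = title.split()
    # Each word becomes "c h a r s"; words are joined by the marker token.
    return f" {space_marker} ".join(" ".join(word) for word in words)


# Illustrative title pairs (not taken from the corpus described here).
pairs = [("Lahore", "لاہور"), ("New York", "نیو یارک")]

for english, urdu in pairs:
    print(char_separate(english), "|||", char_separate(urdu))
```

Note that iterating over a Python string yields Unicode code points, so any Urdu
diacritics would become separate tokens under this sketch.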
To evaluate the performance of the MT systems, ‘Bilingual Evaluation Understudy
(BLEU)’ is used as the evaluation measure. The best scores achieved for
English-to-Urdu translation are 37.39 and 32.70 using PB-SMT and RNN-NMT,
respectively. For Urdu-to-English translation, PB-SMT and RNN-NMT yield best
scores of 28.20 and 30.63, respectively.