Baseline Transliteration Corpus for Improved English-Amharic Machine Translation



Introduction
In today's age of technology and social media, it is increasingly common to incorporate foreign words into one's native tongue and to compose in one language using the script of another. English is the most widely used language in this regard [1]. This can be attributed to many reasons, one of which is the prevalence of the 'QWERTY' keyboard layout in laptops, smartphones, and even mechanical typewriters, especially in developing countries. Thus, many people who do not speak English prefer to compose their ideas using the English script across multiple messaging platforms. This writing method is known as transliteration [2,3].
In the 1990s, NLP researchers became interested in building machines for transliteration to support other research areas; this was when the concept of machine transliteration was first introduced. Machine transliteration is a subfield of MT and cross-language information retrieval (CLIR). Its primary goal is to use computers to convert a text from one language script (the source language) to another language script (the target language) while preserving the pronunciation as much as possible. In technical terms, it is concerned with accurately representing the graphemes of one language script using the script of another language [4].
The literature on MT suggests that transliteration can be used with MT systems to reduce translation errors and improve precision when translating names (named entities), technical terms, and loan (borrowed) words [5,6,7]. This holds particularly for languages with limited resources (e.g., small bilingual corpora), such as Amharic, because learning all of the words of a given language from a small amount of bilingual training data is impossible [8,9,10]. Finch et al. [11] carried out a large-scale real-world evaluation of the use of automatic transliteration in an MT system and demonstrated that using a transliteration system can improve MT quality when translating unknown words. As a result, machine transliteration has become a promising application for use in MT. Table 1 shows the distinction between translation and transliteration for the languages under consideration (Amharic and English). Amharic (amarigna), the main language of Ethiopia, has its own script and is the second most widely spoken Semitic language after Arabic. The Amharic script was originally derived from Ge'ez. Although it has disappeared as a colloquial language, Ge'ez remains the main language used for prayer and ritual, and the main teaching language in the Ethiopian Orthodox Church [12]. Amharic uses a slightly modified version of the Ge'ez alphabet. It consists of 34 basic characters, each of which has seven forms depending on which vowel is pronounced in the syllable. Even though they are no longer widely used, Amharic also inherits all the Ge'ez numeric character sets [13].

Related work
Machine transliteration is rarely an end goal in itself; it is often used as part of other NLP tasks (such as CLIR, QA, or MT). In light of its importance in these fields, a number of transliteration mechanisms have been proposed for non-English languages, including Russian, Chinese, Korean, Arabic, Persian, and Indian languages [14]. These mechanisms generally fall into three broad categories: linguistic (rule-based) approaches, statistical approaches, and deep learning approaches [15].
The linguistic approach uses hand-crafted rules based on pattern matching, which requires linguistic analysis to formulate. This approach demands a thorough understanding of the language under consideration. Early attempts used this method to construct baseline transliteration corpora, and it is still used as a starting point to acquire transliteration corpora for low-resource languages [16].
Deep and Goyal [17] proposed a Punjabi-to-English transliteration system that uses a linguistic-based approach. In the proposed scheme, a grapheme-based method is used to model the transliteration problem, achieving an accuracy of 93.22% when transliterating common names. A similar transliteration system was developed by Goyal and Lehal [18] by implementing fifty complex rules. Their system was found to give about 98% accuracy when transliterating proper names, city names, country names, subject-related technical terms, etc.
Various transliteration systems were proposed during the Named Entities Workshop (NEWS) evaluation campaigns between 2009 and 2018 [19]. During the campaigns, transliteration was done from English into various languages with various writing systems. As a result of this workshop, many advances have been made in methodologies for transliterating proper nouns. Several approaches have been developed, including grapheme-to-phoneme conversion [20,21], statistical methods akin to machine translation [16,22], and neural networks such as sequence-to-sequence models and Long Short-Term Memory (LSTM) networks [23,24,25,26,27].
The three transliteration approaches discussed previously can be based on grapheme, phoneme, hybrid, or correspondence transliteration models.
-Grapheme-based models: directly convert source language graphemes into target language graphemes without requiring phonetic knowledge of the source language words.
-Phoneme-based models: use source language phonemes as a pivot when producing target language graphemes from source language graphemes.
-Hybrid and correspondence-based models: use both source language graphemes and phonemes.
Generally, statistical and neural network techniques based on large parallel transliteration corpora work well for resource-rich languages, but low-resource languages do not have the luxury of such resources. For such languages, rule-based transliteration is the only viable option [16].

Amharic transliteration
In our literature review, we found two cases where Amharic was studied for transliteration tasks. The first attempt was made by Tadele Tedla [28], whose objective was to develop a framework to convert ASCII-transliterated Amharic text back to the original Amharic text. On three random test datasets, the model achieves 97.7, 99.7, and 98.4 percent accuracy, respectively. The first test set consists of an ASCII-transliterated Amharic word list of 32,482 words. The second is a transliterated poem of 1,277 words, and the third is a recipe for injera, a common local food in Ethiopia, with 123 transliterated Amharic words.
Gezmu et al. [29] made the second attempt at Amharic-to-English machine transliteration. In their work, they used machine transliteration as a tool (to facilitate vocabulary sharing) to improve the performance of Amharic-English MT. Despite claiming to have created an Amharic-English transliteration corpus for named entities and borrowed words, they did not make it publicly available. Based on our review of the literature, we believe that our attempt is the first to create a large Amharic-English transliteration corpus for English-Amharic NMT.

Motivation
Developing a reliable English-Amharic MT system remains a challenge. A scarcity of resources and the absence of well-organized MT research projects are the two major obstacles. Our search reveals that the majority, if not all, of the research works on English-Amharic MT are done by independent individuals and are disjointed. The BLEU scores reported for this language pair are, therefore, not indicative of high-quality translation, according to a general interpretation of the BLEU score. Thus, this study aims to enhance English-Amharic MT performance by incorporating transliteration as a tool. To achieve this goal, we created an Amharic-English transliteration corpus from a previously collected English-Amharic MT corpus [30,31] and used it for English-Amharic NMT experiments. This is the first baseline corpus for this language pair, and it will be made available to MT and IR researchers.

Experimental set-up

Corpus preparation
The objective of this study is to improve the performance of English-Amharic MT by using a transliterated and augmented corpus. However, the data required for training the NMT models was not available. As a result, a previously gathered English-Amharic translation corpus is used to generate an Amharic-English transliteration corpus. This section explains the methods and techniques used to create this corpus, as well as the NMT experiments performed with it.

Acquisition of the previously collected translation corpus
The freely available English-Amharic translation corpus was obtained from a GitHub repository [30]. This corpus was compiled from the religious, legal, and news domains and contains 225,304 English-Amharic parallel sentences.

Pre-transliteration preprocessing
This step is completed before the transliteration process begins and is performed on the previously acquired original Amharic translation corpus. Normalization of homophone characters, removal of punctuation marks, and conversion of Ge'ez numerals to Arabic numerals are all carried out. After these preprocessing tasks are completed, the corpus is divided into 25 parts and distributed to data collectors. The data collectors use Google Translate to transliterate the Amharic sentences into the English script and collect the results by copying and pasting them into a text file.
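As an illustration, the three pre-transliteration steps can be sketched as follows. The homophone and numeral mapping tables here are small illustrative excerpts, not the complete tables used in this work:

```python
import re

# Illustrative excerpts only -- the full normalization tables are larger.
HOMOPHONES = {"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ", "ፀ": "ጸ"}
GEEZ_DIGITS = {"፩": "1", "፪": "2", "፫": "3", "፬": "4", "፭": "5",
               "፮": "6", "፯": "7", "፰": "8", "፱": "9"}
ETHIOPIC_PUNCT = "።፣፤፥፦፧፨"

def preprocess(sentence: str) -> str:
    # 1. Normalize homophone characters to a single canonical form.
    sentence = "".join(HOMOPHONES.get(ch, ch) for ch in sentence)
    # 2. Remove Ethiopic punctuation marks.
    sentence = sentence.translate(str.maketrans("", "", ETHIOPIC_PUNCT))
    # 3. Convert Ge'ez numerals to Arabic digits (single digits only here).
    sentence = "".join(GEEZ_DIGITS.get(ch, ch) for ch in sentence)
    return re.sub(r"\s+", " ", sentence).strip()
```

For example, `preprocess("ሠላም።")` normalizes the homophone ሠ to ሰ and strips the sentence-final ። mark.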

Transliterating the acquired corpus
For the successful completion of this task two different steps are followed.
1. Performing transliteration: This process was carried out using Google's online translation tool. Although its primary goal is translation, Google Translate can generate transliterations as a by-product of the translation process whenever the two languages use distinct scripts. The main task completed at this stage, as shown in Figure 1, was transliterating Amharic sentences into the English script using Google Translate and collecting the transliterated sentences.
In order to transliterate and compile a total of 225,304 Amharic sentences, 25 data collectors (computer science students) participated. The entire process of transliterating and normalizing these Amharic sentences took 60 days, with each data collector handling a daily throughput of 150 sentences. Prior to the transliteration task, each data collector was provided with brief training and guidance to improve the quality and consistency of the transliteration process.
2. Normalizing the transliteration corpus: After the transliteration corpus was collected, the next task was corpus normalization. The objective of this task was to make the transliterations of Amharic loan words and named entities (NEs) as close as possible to the spelling of the corresponding English words, so that they become useful for MT purposes. To assist this manual normalization process, true-casing is carried out first using Moses' built-in true-caser script. Because Amharic has a Subject-Object-Verb (SOV) grammatical structure and NEs are more likely to appear at the beginning of a sentence, true-casing allowed us to capitalize the first letter of the majority of NEs. This reduces the amount of work required to locate and correct NEs that are transliterated differently from their English versions. Table 3 contains examples of transliterations produced by Google Translate and their normalized forms. The table also includes the Levenshtein edit distance [32] computed between the English translation and the generated and normalized transliterations. Computing the Levenshtein edit distance allows us to choose the transliterations closest to the English translation.
As depicted in the table, all of the differences between the English translation and the transliterations generated by Google Translate occur in representing the sixth form of Amharic characters. For instance, the name Daniel is spelled Dani'ēli by Google Translate, while its correct English spelling is Daniel. The discrepancy arises when writing the sixth form of Amharic characters: in this example, Google Translate uses ni and li where the English spelling has n and l. So, to make the transliterated loan words and named entities in the corpus closer to the English words, these characters are normalized to n and l. This normalization is done for all sixth-form characters of Amharic. The transliteration character map used in this work is depicted in Table 4; it is a modified version of the United Nations Romanization System for Geographical Names (BGN/PCGN 1967 System) approved for Amharic-to-English transliteration [33]. In this standard, the sixth form of Amharic characters has two optional representations.
Overall, according to the Levenshtein edit distance, the normalized forms of the Google transliterations are closer to the English translations.
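The Levenshtein edit distance [32] used to compare candidate transliterations is the classic dynamic-programming edit distance over insertions, deletions, and substitutions; a minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

For the example above, `levenshtein("Daniel", "Danieli")` is smaller than `levenshtein("Daniel", "Dani'ēli")`, so the normalized form is selected as the closer transliteration.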

Post-transliteration preprocessing
At this stage, cleaning and splitting of the corpus are performed. These two preprocessing steps make the transliterated corpus ready for MT training. The cleaning task removes empty lines from the corpus, collapses redundant spaces between characters and words, and discards extremely long sentences (sentences with more than 80 words). After completing this task, the total number of sentences in the corpus drops from 225,304 to 218,365.
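A minimal sketch of the cleaning step described above (empty-line removal, whitespace collapsing, and the 80-word length cut-off); the function name and pair-wise interface are illustrative, not the actual scripts used:

```python
import re

MAX_LEN = 80  # sentence pairs longer than 80 words are discarded

def clean_corpus(src_lines, tgt_lines):
    """Drop empty or overlong sentence pairs and collapse redundant whitespace."""
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        src = re.sub(r"\s+", " ", src).strip()   # collapse redundant spaces
        tgt = re.sub(r"\s+", " ", tgt).strip()
        if not src or not tgt:
            continue                             # remove empty lines
        if len(src.split()) > MAX_LEN or len(tgt.split()) > MAX_LEN:
            continue                             # discard extremely long sentences
        cleaned.append((src, tgt))
    return cleaned
```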
Finally, for training our MT models, the transliterated and preprocessed texts are divided into three parts (training, validation, and test sets).

Augmentation of transliterated corpus
In addition to the transliteration task, corpus augmentation is performed to increase the size of the transliterated English-Amharic corpus. Several publications have indicated that corpus augmentation can be an effective method of scaling up corpora, especially for languages with a limited resource base. Hence, in this work, token-level corpus augmentation is applied, and the augmented corpus is used as the training dataset for the different NMT models. Among alternative token-level augmentation techniques, random insertion, replacement, deletion, and swapping are selected and implemented. In doing so, seven different augmented corpora are generated by varying the values of the deletion probability, replacement probability, and swapping range. The cosine similarity between the original corpus and each augmented one is then calculated, and the augmented corpus that preserves approximately 90% of the meaning is selected [31]. The augmentation task is done separately for the training, validation, and test sets to avoid overlapping sentences across sets. By combining these augmented data sets with the transliterated corpus, 424,230 training, 10,000 validation, and 2,500 test sentences are created, for a total of 436,730 cleaned, transliterated, and augmented sentences.
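The token-level operations can be sketched as follows; the default probability values and the exact way deletion, replacement, and swapping are combined here are assumptions for illustration, not the paper's tuned settings:

```python
import random

def augment(tokens, p_delete=0.1, p_replace=0.1, swap_range=3, vocab=None, seed=0):
    """Token-level augmentation: random deletion, replacement, and swapping.
    p_delete, p_replace, and swap_range are illustrative hyper-parameters."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_delete:
            continue                        # random deletion
        if r < p_delete + p_replace and vocab:
            tok = rng.choice(vocab)         # random replacement from a vocabulary
        out.append(tok)
    if len(out) > 1:                        # random swap within swap_range
        i = rng.randrange(len(out) - 1)
        j = min(len(out) - 1, i + rng.randint(1, swap_range))
        out[i], out[j] = out[j], out[i]
    return out
```

Varying `p_delete`, `p_replace`, and `swap_range` over several settings and keeping the variant closest to the original corpus (e.g., by cosine similarity) mirrors the selection procedure described above.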

NMT Experiments
In this experiment, three different NMT models are created and their performance is evaluated by comparing them to previous attempts for the language pair. RNN-with-attention, GRU-based, and Transformer-based NMT models were developed, and each model was trained using the transliterated and augmented corpus.

Attention based RNN model:
The open-source toolkit OpenNMT [34][35] is used to build this model. Given that the corpus is divided into three parts (training, validation, and test sets) in the preprocessing stage of this experiment, the first task in training the RNN-based model is performing Byte Pair Encoding (BPE). BPE enables open-vocabulary NMT by encoding rare and unknown words as sequences of sub-word units, based on the intuition that various word classes are translatable via units smaller than words [36]. The next step is preprocessing proper: it computes the vocabularies from the most frequent tokens, filters overly long sentences, and assigns an index to each token. Finally, the RNN-based NMT model with attention mechanisms is trained with the parameters depicted in Table 5. Training is the most time-consuming task in the whole process of creating this model. A larger batch size is advantageous for both training time and quality, so a large batch size is used in this experiment: matrix multiplication with small batches is inefficient, whereas a larger matrix can more effectively utilize GPU cores and RAM [37][38].
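To illustrate the idea behind BPE (a toy merge-learning loop, not the OpenNMT implementation): starting from characters, the most frequent adjacent symbol pair is repeatedly merged into a new sub-word unit.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    merged.append(symbols[i]); i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Applying the learned merge list to unseen words lets the model represent rare words as known sub-word units instead of a single unknown token.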

GRU based model:
In comparison to conventional RNNs and LSTMs, GRUs are relatively new architectures that are used in many machine learning applications. Due to their fewer parameters, they improve on the training time of LSTMs and mitigate the vanishing and exploding gradients that occur with RNNs [39].
In order to conduct the GRU-based NMT experiment, three distinct units (encoder, attention, and decoder) are created.

Table 4: Amharic to English transliteration character map.

In order to determine the hyper-parameter values for the Transformer-BP model, several papers that investigate the effect of hyper-parameter values on the translation quality of NMT models were surveyed. In particular, papers focusing on Transformer-based models for low-resource language translation were critically reviewed. The hyper-parameter values were then determined by considering the size of the corpus. The parameter values of the three models are summarized in Table 7.

Experimental results
Table 8 presents the BLEU score results for the three models. The BLEU scores indicated in the augmented-corpus column are cited from previous works for the purpose of comparison and analysis.
As shown in the table, the BLEU scores of all three models improve due to the use of the transliterated and augmented corpus. The Transformer-BP and GRU models benefited slightly more from the transliteration corpus than the other models. This is because Transformer-BP is trained with hyper-parameters that have been adjusted to account for the size of the corpus, while the GRU inherently uses a small number of parameters, making it easier to select appropriate hyper-parameter values and achieve better BLEU scores.
On the other hand, the hyper-parameter values for the other Transformer-based models (Transformer-B and Transformer-D) are set for bigger corpus sizes, so their performance is lower than that of all the remaining models. This makes them the models that benefited least from the transliterated corpus.
A two-tailed t-test is used to determine whether the BLEU score improvements obtained by the models trained with the transliterated corpus are statistically significant. The computed p-value (0.000301279) is smaller than the significance level (0.05), indicating a significant difference between the two sets of BLEU scores. From this, we can conclude that transliterating the corpus improves the performance of all three NMT models. In particular, the BLEU score of the Transformer-BP model is the highest score so far for English-Amharic MT.
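For reference, the paired t-test statistic behind such a comparison can be computed in pure Python as below; the p-value is then read from the t-distribution with the returned degrees of freedom (e.g., via a statistics library). The inputs would be paired score lists, one entry per model or test subset:

```python
import math
from statistics import mean, stdev

def paired_ttest(a, b):
    """Paired t-test statistic and degrees of freedom for two matched samples.
    The two-tailed p-value follows from the t-distribution with df = n - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```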

Conclusion
Low-resource MT is still a work in progress and in its infancy for a variety of reasons. In contrast, MT research for resource-rich languages has gone a long way in the acquisition of resources and the creation of different MT architectures. As a result, different successful NMT architectures have been introduced, including RNNs, GRUs, and, most importantly, the Transformer. However, due to resource constraints (particularly the lack of huge bilingual corpora), most languages in the low-resource category are not benefiting from these successful architectures. Amharic is one of these languages. So, in this work, we decided to take up this challenge and attempted to improve the performance of English-Amharic MT using corpus transliteration and augmentation.
To that end, we created the largest Amharic-English transliteration corpus from the previously collected English-Amharic parallel corpus, using Google Translate (for transliteration) and human data collectors (for normalization). In the normalization process, transliterated names and borrowed words are spelled as closely as possible to their English translations. After this, a token-level corpus augmentation technique is applied to the transliterated corpus in order to artificially increase its size. By doing so, we were able to create a corpus (transliterated and augmented) with a size of 450,608 parallel sentences.
With the created dataset, RNN-with-attention, GRU-based, and Transformer-based NMT architectures are trained. Compared to a previous work in which we used corpus augmentation with similar training parameters, all three models in this study achieve better MT performance. In particular, the BLEU score achieved by one of the three models (Transformer-BP) is, to the best of our knowledge, the state-of-the-art result (39.67 BLEU) for this language pair so far. Transliteration played a part in this.
Overall, this work adds two contributions to the knowledge base of English-Amharic MT research: the creation of a transliterated and augmented English-Amharic corpus, and the improvement of English-Amharic MT performance.

Table 1 :
Example of Amharic to English translation and transliteration

Table 2 :
Summary of related works

Table 3 :
Example of Amharic to English transliteration using Google Translate and the normalized form of the transliteration.

Table 5 :
Parameters and values of RNN model

Each of the encoder and decoder units has three GRU layers, with a hidden state size of 512. Before training begins, the tokenizer converts each word to a unique integer value, which is then converted to word embeddings by the embedding unit. The embedding layer has a dimension of 128. The entire architecture of our GRU-based NMT model and the detailed training parameters are depicted in Figure 2 and Table 6, respectively.
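The recurrence inside each GRU layer can be sketched as a single cell step in NumPy; weight shapes are illustrative and biases are omitted for brevity:

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: update gate z, reset gate r, candidate state h~."""
    z = 1 / (1 + np.exp(-(Wz @ x + Uz @ h)))   # update gate
    r = 1 / (1 + np.exp(-(Wr @ x + Ur @ h)))   # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate hidden state
    return (1 - z) * h + z * h_tilde           # interpolated new hidden state
```

In the model above, three such layers are stacked with a hidden state size of 512 and a 128-dimensional embedding feeding the first layer.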

Table 6 :
Parameters and values of GRU model

Table 7 :
Hyperparameters of our Transformer model

Table 8 :
Experimental results of the different NMT models