31. March 2015
What is Machine Translation all about?
Machine translation (MT), for a very long time one of the most daring dreams of artificial intelligence, has become a reality for an increasing number of use cases. Nowadays computers can produce translations in no time. While speed and volume of translated data are no longer an issue, opinions still differ about the quality of the translation.
The spectrum ranges from strong beliefs in the usefulness of MT, in particular in scenarios such as getting an understanding of texts in unknown languages and providing raw translations for post-editing, to the outright denial of any benefit of MT. In addition, there are ongoing discussions about MT as a growing threat to translators’ jobs.
But at least for some domains this fear is unfounded: machine translation is never going to replace human translators when it comes to literary texts, poetry, nuances in language, irony or ambiguity.
This is different for technical manuals, specifications and reports, commercial material, and many other genres with repeating structures and wordings, where translation programmes offer substantial support. But even in these scenarios, automatic translation is only as good as the “experience level” of the system. This level rises with more “training”: manual translation runs, the definition of rules, provided example data, and the automatic discovery of rules and patterns in these materials.
Technical approaches to machine translation
Several approaches have been used in the development of machine translation systems, such as rule-based (RBMT), statistical (SMT), and example-based (EBMT) ones.
Many traditional MT systems are rule-based. Their development requires immense time and human resources to maintain the systems, to enhance translation quality, and in particular to incorporate new language pairs (combinations of a source and a target language). They need detailed dictionaries and grammar rule sets, as well as the work of skilled computational linguists and lexicographers.
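To make the idea concrete, here is a deliberately tiny sketch of how a rule-based system combines a bilingual dictionary with hand-written transfer rules. The dictionary, the word lists and the single reordering rule (English adjectives precede the noun, French adjectives usually follow it) are invented for illustration; real RBMT systems encode thousands of such rules, plus morphological agreement, which is omitted here.

```python
# Toy RBMT sketch (invented data): bilingual dictionary + one transfer rule.
DICTIONARY = {"the": "la", "red": "rouge", "car": "voiture"}
ADJECTIVES = {"red"}
NOUNS = {"car"}

def translate(sentence):
    tokens = sentence.lower().split()
    # Transfer rule: in French, most adjectives follow the noun.
    i = 0
    while i < len(tokens) - 1:
        if tokens[i] in ADJECTIVES and tokens[i + 1] in NOUNS:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
            i += 2
        else:
            i += 1
    # Dictionary lookup; unknown words are kept as-is.
    return " ".join(DICTIONARY.get(t, t) for t in tokens)

print(translate("the red car"))  # la voiture rouge
```

Every new language pair requires a new dictionary and a new rule set, which is exactly the maintenance burden described above.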
This is one of the main reasons why many research efforts are now focusing on other approaches to MT, especially the statistical approach. These other approaches are far more effective in terms of time and labour resources.
Statistical models learn, based on a certain amount of manually translated documents, words and phrases and their most probable correspondences in the other language.
SMT started with word-based models but soon moved to phrase-based models, which proved more successful; today’s SMT is therefore mainly based on this approach. Current developments concentrate on syntax-based MT (using syntactic units rather than just single words or strings of words) and on hybrid methods, since pure syntax-based MT still suffers from speed issues.
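The core of the phrase-based idea can be sketched in a few lines: phrase pairs are extracted from word-aligned parallel sentences, and a translation probability is estimated by relative frequency. The miniature “corpus” of phrase pairs below is invented; real systems extract millions of pairs and combine this score with several other model components.

```python
# Minimal sketch of phrase-table estimation (invented data):
# p(target | source) as relative frequency over extracted phrase pairs.
from collections import Counter

# Hypothetical phrase pairs extracted from word-aligned sentence pairs.
phrase_pairs = [
    ("the house", "das Haus"),
    ("the house", "das Haus"),
    ("the house", "dem Haus"),
    ("is small", "ist klein"),
]

counts = Counter(phrase_pairs)
source_totals = Counter(src for src, _ in phrase_pairs)

def phrase_prob(src, tgt):
    """Relative-frequency estimate of p(tgt | src)."""
    return counts[(src, tgt)] / source_totals[src]

print(round(phrase_prob("the house", "das Haus"), 2))  # 0.67
```

The model “learns” simply by counting: the more often a correspondence is observed in the training data, the more probable it becomes, which is why large parallel corpora matter so much.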
These research efforts include the integration of linguistic knowledge into SMT, e.g. by linguistically annotating the training corpus. This improves MT performance especially for languages with rich morphology (the internal structure of words) and free word order.
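As a rough illustration of what “linguistically annotating the training corpus” can look like, the sketch below produces tokens in a factored format of the kind used by tools such as Moses, where each surface form carries additional factors (here lemma and part-of-speech tag). The analyses are hand-written for illustration, not produced by a real tagger.

```python
# Sketch of a factored (annotated) training token stream, assuming a
# Moses-style "surface|lemma|POS" format. Analyses below are hand-made.

def to_factored(tokens):
    """tokens: list of (surface, lemma, pos) triples."""
    return " ".join("|".join(t) for t in tokens)

analysed = [("Häuser", "Haus", "NN"), ("stehen", "stehen", "VVFIN")]
print(to_factored(analysed))  # Häuser|Haus|NN stehen|stehen|VVFIN
```

Training on lemmas and tags alongside surface forms lets the model generalise over inflected variants, which is precisely what helps morphologically rich languages.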
In the area of RBMT systems, however, approaches that use corpus-based statistical technology, e.g. for bilingual term extraction and for importing such terms into the dictionary of a rule-based system, have shown good results in overcoming the problem of unknown words.
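One simple, commonly used co-occurrence score for such bilingual term extraction is the Dice coefficient: candidate term pairs that appear together in many aligned sentence pairs score highly. The corpus and terms below are invented; real extraction pipelines add tokenisation, term candidate detection and filtering.

```python
# Hedged sketch of corpus-based bilingual term extraction (invented data):
# score a candidate term pair by its Dice coefficient over aligned sentences.

def dice(term_src, term_tgt, parallel):
    both = sum(1 for s, t in parallel if term_src in s and term_tgt in t)
    src_count = sum(1 for s, _ in parallel if term_src in s)
    tgt_count = sum(1 for _, t in parallel if term_tgt in t)
    total = src_count + tgt_count
    return 2 * both / total if total else 0.0

parallel = [
    ("the gearbox is broken", "das Getriebe ist defekt"),
    ("replace the gearbox", "das Getriebe austauschen"),
    ("check the engine", "den Motor prüfen"),
]
print(dice("gearbox", "Getriebe", parallel))  # 1.0
```

High-scoring pairs like ("gearbox", "Getriebe") can then be imported into the RBMT dictionary, so the rule-based engine no longer treats them as unknown words.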
Until now, SMT research has mainly focused on widely used languages such as English and French, but also Arabic and Chinese. For “small, under-resourced” languages, MT is not as well developed due to the lack of linguistic resources and multilingual corpora that would enable MT solutions for new language pairs to be developed cost-effectively and with good quality. This has resulted in a technological gap between these two groups of languages.
Quite a number of today’s languages are under-resourced and lack both parallel corpora and language technologies for MT. Parallel corpora are very limited not only in terms of language coverage, but also in quantity and genre, being available only for a handful of languages, and for a small number of domains.
Machine Translation in the MULTISENSOR project
For MULTISENSOR we adopted the statistical (corpus-based, data-driven) approach. It is language independent and it has started to reach a level of quality acceptable for many applications. But, as already mentioned, it requires very large parallel corpora for training language and translation models. Best results can be achieved in translating texts of a similar domain to the training data.
In the context of MULTISENSOR, the focus is on MT for under-resourced languages (Bulgarian), on languages with rich morphology (German, Bulgarian, French, and Spanish) and on languages with “difficult” word order (German). Furthermore, machine translation in MULTISENSOR approaches another central problem: out-of-vocabulary words, especially names.
In MULTISENSOR we will use a workflow which mines monolingual resources, extracts named entities from there, and links them into an update of the machine translation resources.
All pictures except for “serverfarm” (rights are with Linguatec) are taken from Pixabay and are free for commercial use.