Domain Adaptation for Statistical Machine Translation Master Defense

Introduction ◼ Proposed Method I: New Criterion ◼ Proposed Method II: Combination ◼ Proposed Method III: Linguistics ◼ Domain-Specific Online Translator ◼ Conclusion

By Longyue WANG, Vincent
MT Group, NLP2CT Lab, FST, UM
Supervised by Prof. Lidia S. Chao, Prof. Derek F. Wong
20/08/2014

Research Scope
Figure 1: Our Research Scope [1] [2] (diagram labels: Computational Linguistics; Machine Translation; Text Translation; Speech Translation; Statistical MT; Rule-based MT; Hybrid MT; Domain-Specific Statistical MT)
[1] Daniel Jurafsky and James Martin (2008) An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Second Edition. Prentice Hall.
[2] Wikipedia,


Part I: Introduction



Statistical Machine Translation  SMT translations are generated on the basis of statistical models whose parameters are derived from the analysis of text corpora [3].  Currently, the most successful approach of SMT is phrase-based SMT, where the smallest translation unit is n-gram consecutive words. [3] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics. 19:263–311. Figure 2: Phrase-based SMT Framework (6/84)

Statistical Machine Translation  Corpus is a collection of texts. e.g., IWSLT2012 official corpus.  Bilingual corpus is a collection of text paired with translation into another language. Monolingual corpus, in one (mostly are the target side) language.  Corpus may come from different genres, topics etc. Figure 2: Phrase-based SMT Framework Parallel Corpus Monolingual Corpus (7/84)

Statistical Machine Translation  Word alignment can be mined by the help of EM algorithm.  Then extract phrase pairs from word alignment to generate translation table.  Distance-based reordering model is a penalty of changing position of translated phrases. Figure 2: Phrase-based SMT Framework Translation Table Word Alignment Reordering Model (8/84)

Statistical Machine Translation  Language model assigns a probability to a sequence of words. (n-gram) [4] Figure 2: Phrase-based SMT Framework Language Model [4] F Song and W B Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval. pp. 279–280.. 1 1 1 1 ( ) ( | ) l i LM i i n i p s p w w + − − + = = (9/84) (1)

Statistical Machine Translation
The decoding function consists of three components: the phrase translation table, which ensures that the foreign phrases match the target ones; the reordering model, which reorders the phrases appropriately; and the language model, which ensures that the output is fluent.
$$e_{best} = \arg\max_{e} \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1) \prod_{i=1}^{|e|} p_{LM}(e_i \mid e_1 \ldots e_{i-1}) \tag{2}$$
Figure 2: Phrase-based SMT Framework (components shown: Source Text, Decoding, Searching Translation Candidates, Target Text)
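To make Equation (2) concrete, the sketch below scores two candidate translations and keeps the better one. Several parts are assumptions not taken from the slides: the exponential form alpha^|x| for the distortion penalty d, the toy phrase table and language model values, and all function names. Real decoders search the candidate space with beam search rather than enumerating it.

```python
def distortion(start, prev_end, alpha=0.5):
    """d(start_i - end_{i-1} - 1), assumed here to be alpha ** |x|."""
    return alpha ** abs(start - prev_end - 1)

def score_candidate(segments, phrase_table, lm_prob):
    """Score one candidate following Equation (2).

    segments: (src_start, src_end, src_phrase, tgt_phrase) tuples in
    target order, with 1-based source positions.
    """
    score = 1.0
    prev_end = 0
    target_words = []
    for start, end, f_phrase, e_phrase in segments:
        score *= phrase_table.get((f_phrase, e_phrase), 0.0)  # phi(f | e)
        score *= distortion(start, prev_end)                  # reordering
        prev_end = end
        target_words.extend(e_phrase.split())
    return score * lm_prob(target_words)                      # fluency

# Toy models (hypothetical values).
phrase_table = {("das", "the"): 0.7, ("Haus", "house"): 0.8}
toy_lm = {("the", "house"): 0.2, ("house", "the"): 0.01}

def lm_prob(words):
    """Stand-in for p_LM: looks up the whole word sequence."""
    return toy_lm.get(tuple(words), 0.0)

# Two candidates for the source sentence "das Haus".
monotone = [(1, 1, "das", "the"), (2, 2, "Haus", "house")]
swapped = [(2, 2, "Haus", "house"), (1, 1, "das", "the")]

scores = {
    "the house": score_candidate(monotone, phrase_table, lm_prob),
    "house the": score_candidate(swapped, phrase_table, lm_prob),
}
best = max(scores, key=scores.get)
print(best, scores[best])
```

The swapped candidate is penalized twice: once by the distortion term for jumping over and back across the source, and once by the language model for the less fluent word order.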
