Statistical Machine Translation
Neural Methods for Aligning Parallel Corpora
Pages: 13
Time to read: 40 mins
Language: English
This research article presents neural methods for mining parallel corpora from the web, targeting South and East Asian languages. The work builds on ParaCrawl's hierarchical web mining approach and adds a toxicity filtering step, which improves the quality of the resulting machine translation resources. The authors apply these methods to build large-scale parallel corpora for nine languages, including Hindi, Nepali, and several Southeast Asian languages. The pipeline has four stages: targeted web crawling, document alignment, sentence alignment, and parallel corpus filtering. Translation systems trained on the mined corpora achieve higher BLEU scores than systems trained on existing datasets for multiple languages. The article also reports the computational resources used, compares the neural methods against previous approaches, and argues for more attention to lesser-studied languages in machine translation research.
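To make the later pipeline stages concrete, here is a minimal sketch of sentence alignment followed by filtering. It is not the authors' implementation. It assumes that sentences are compared by the cosine similarity of their embeddings, that alignment is a greedy one-to-one match above a threshold, and that the toxicity filter is a simple token blocklist. The names `align_sentences`, `filter_pairs`, `toy_embed`, and `TOY_VECTORS` are hypothetical, and the toy vectors stand in for a real multilingual sentence encoder.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(
    src: List[str],
    tgt: List[str],
    embed: Callable[[str], Vector],
    min_score: float = 0.8,
) -> List[Pair]:
    """Greedy one-to-one alignment: accept the highest-scoring pairs first."""
    src_vecs = [embed(s) for s in src]
    tgt_vecs = [embed(t) for t in tgt]
    scored = sorted(
        (
            (cosine(sv, tv), i, j)
            for i, sv in enumerate(src_vecs)
            for j, tv in enumerate(tgt_vecs)
        ),
        reverse=True,
    )
    used_src: Set[int] = set()
    used_tgt: Set[int] = set()
    pairs: List[Pair] = []
    for score, i, j in scored:
        if score < min_score:
            break  # remaining candidates all score lower
        if i in used_src or j in used_tgt:
            continue  # each sentence is aligned at most once
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((src[i], tgt[j], score))
    return pairs


def is_toxic(sentence: str, blocklist: Set[str]) -> bool:
    """Assumed toxicity check: any whitespace token is on the blocklist."""
    return any(token in blocklist for token in sentence.lower().split())


def filter_pairs(
    pairs: List[Pair], src_blocklist: Set[str], tgt_blocklist: Set[str]
) -> List[Pair]:
    """Drop pairs in which either side contains a blocklisted token."""
    return [
        p
        for p in pairs
        if not is_toxic(p[0], src_blocklist) and not is_toxic(p[1], tgt_blocklist)
    ]


# Toy data: hand-made vectors standing in for a multilingual encoder.
TOY_VECTORS: Dict[str, List[float]] = {
    "src greeting": [1.0, 0.0, 0.0],
    "tgt greeting": [0.9, 0.1, 0.0],
    "src weather": [0.0, 1.0, 0.0],
    "tgt weather": [0.1, 0.9, 0.0],
    "src badword": [0.0, 0.0, 1.0],
    "tgt insult": [0.0, 0.1, 0.9],
}


def toy_embed(sentence: str) -> Vector:
    return TOY_VECTORS[sentence]


source_side = ["src greeting", "src weather", "src badword"]
target_side = ["tgt weather", "tgt insult", "tgt greeting"]

pairs = align_sentences(source_side, target_side, toy_embed)
clean = filter_pairs(pairs, src_blocklist={"badword"}, tgt_blocklist=set())
```

In this toy run, alignment recovers all three sentence pairs even though the target side is shuffled, and the blocklist then removes the pair whose source side contains a flagged token. A production system would replace `toy_embed` with a trained encoder and would need an approximate nearest-neighbour search, since the exhaustive pairwise comparison here is quadratic in the number of sentences.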