Statistical Machine Translation

Neural Methods for Aligning Parallel Corpora

Pages

Time to read

40 mins

Publication

10/25/24

Language

English

Summary

This research article presents neural methods for mining parallel corpora from the web, targeting South and East Asian languages. The work builds on the hierarchical web mining approach of Paracrawl and adds a toxicity filtering step; together, these are reported to improve the quality of the resulting machine translation resources. The authors apply the methods to create large-scale parallel corpora for nine languages, including Hindi, Nepali, and several Southeast Asian languages. The methodology consists of four stages: targeted web crawling, document alignment, sentence alignment, and parallel corpus filtering. The results indicate that the proposed methods yield better translation quality than existing datasets, with improved BLEU scores for multiple languages. The article also discusses the computational resources used, compares the neural methods against previous approaches, and highlights the importance of addressing lesser-studied languages in machine translation research.
