Statistical Machine Translation
Machine Translation Evaluation Benchmark for Wu Chinese
Pages
6
Time to read
19 mins
Publication
Language
English
This technical report presents a machine translation evaluation benchmark for Wu Chinese. It describes the creation of a new dataset that serves as both a training corpus and an evaluation benchmark for machine translation models. The contributions to the FLORES+ dataset are an open-source, manually translated dataset, comprehensive documentation of the dataset creation process, and validation experiments. The report also discusses preliminary tools developed for Wu Chinese normalization and segmentation. It covers the benefits and limitations of the dataset, as well as its implications for other under-resourced languages. The methodology used to construct the dataset is described, including the translation process and considerations specific to the Chongming dialect. The report concludes with a discussion of the experimental results and suggests areas for future work in Wu Chinese machine translation.