Vocapia Research SAS
Comparing Self-Supervised and Semi-Supervised Training for Speech Recognition
Pages
5
Time to read
24 mins
Publication
Language
English
This research article investigates how effectively self-supervised pre-training and semi-supervised training improve automatic speech recognition (ASR) models for low-resource languages. The study focuses on a hybrid ASR model trained on a limited dataset of transcribed and untranscribed audio. As baselines, it uses cross-lingual transfer with MFCC features and features extracted from the multilingual self-supervised model XLSR-53. On well-resourced English data, both training methods improve substantially on the baseline, with relative improvements of 18% for semi-supervised training and 27% for continued self-supervised pre-training. The gains are smaller in low-resource settings: on the South African Soap Opera dataset, semi-supervised training yields only a 3% improvement. The paper also discusses the difficulty of building effective language models in low-resource contexts, particularly because of code-switching and the limited availability of text data.
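For readers unfamiliar with how relative improvements are computed, the following is a minimal sketch. It assumes the improvements are relative reductions in word error rate (WER); the summary above does not name the metric, so this is an assumption. The function name `relative_improvement` and the WER values are hypothetical, chosen only to illustrate the arithmetic, and are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error reduction as a fraction of the baseline.

    Assumes both arguments are word error rates on the same scale
    (for example, both in percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (not taken from the paper).
baseline = 40.0  # baseline system, WER in percent
improved = 30.0  # system after additional training, WER in percent

print(f"{relative_improvement(baseline, improved):.0%}")  # prints "25%"
```

Under this definition, a relative improvement is always measured against the baseline error rate, so the same absolute reduction gives a larger relative figure when the baseline error is lower.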