Vocapia Research SAS
Investigating Transformer Encoders for Speech Emotion Recognition
Pages
17
Time to read
38 mins
Publication
Language
English
This technical report investigates Transformer-based models for speech emotion recognition in emergency call center conversations. It addresses the limited availability of real-life emotion datasets, particularly in emergency contexts. The study uses the CEMO corpus, which consists of telephone conversations involving over 800 callers and 6 agents, and targets four emotion classes: Anger, Fear, Positive, and Neutral. The experiments compare several Transformer encoders, including wav2vec2 and BERT, and examine fine-tuning and fusion strategies for improving recognition accuracy. Certain pre-trained models substantially improve performance: wav2vec2 reaches an Unweighted Accuracy (UA) of 73.1%, compared to a baseline of 55.8%. Late fusion and model-level fusion improve the results further. The report also addresses ethical considerations and reproducibility, emphasizing the need for robust systems in real-world applications.
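The two quantities named in this summary, Unweighted Accuracy and late fusion, can be illustrated with a minimal sketch. It assumes UA is the mean of per-class recalls, as is common in the speech emotion recognition literature, and that late fusion is a weighted average of per-class probabilities from an acoustic model and a text model. The summary does not state the report's exact definitions, so the function names, the `weight` parameter, and the toy data below are all hypothetical.

```python
from collections import defaultdict

# The four emotion classes listed in the summary.
CLASSES = ["Anger", "Fear", "Positive", "Neutral"]


def unweighted_accuracy(y_true, y_pred, classes=CLASSES):
    """Mean of per-class recalls, so every class counts equally.

    Assumption: this is the usual definition of UA; classes that never
    appear in y_true are skipped.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in classes if total[c] > 0]
    return sum(recalls) / len(recalls)


def late_fusion(acoustic_probs, text_probs, weight=0.5):
    """Weighted average of two per-class probability vectors.

    Assumption: a simple score-level fusion; `weight` is the share given
    to the acoustic model. Returns the index of the winning class.
    """
    fused = [weight * a + (1 - weight) * t
             for a, t in zip(acoustic_probs, text_probs)]
    return max(range(len(fused)), key=fused.__getitem__)


# Toy data: a classifier that always answers "Neutral" on an imbalanced set.
y_true = ["Neutral"] * 8 + ["Anger"] * 2
y_pred = ["Neutral"] * 10
ua = unweighted_accuracy(y_true, y_pred)

# Toy posteriors over [Anger, Fear, Positive, Neutral].
fused_class = CLASSES[late_fusion([0.6, 0.2, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1])]
```

On the toy data, plain accuracy would be 8/10, while the per-class recalls are 1.0 for Neutral and 0.0 for Anger. This is why an unweighted metric is a natural choice when classes such as Neutral dominate a corpus.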