Vocapia Research SAS

Investigating Transformer Encoders for Speech Emotion Recognition

Time to read

38 mins

Publication

12/01/23

Language

English

Summary

This technical report investigates Transformer-based models for speech emotion recognition in emergency call center conversations. It focuses on the challenge posed by the scarcity of real-life emotion datasets, particularly in emergency contexts. The study uses the CEMO corpus of telephone conversations, which involves over 800 callers and 6 agents, and analyzes four primary emotion classes: Anger, Fear, Positive, and Neutral. The experiments compare several Transformer encoders, including wav2vec2 and BERT, and examine fine-tuning and fusion strategies for improving emotion recognition accuracy. Results indicate that specific pre-trained models significantly improve performance: wav2vec2 reaches an Unweighted Accuracy (UA) of 73.1%, compared with a baseline of 55.8%. Late and model-level fusion techniques improve performance further. The report also addresses ethical considerations and reproducibility, emphasizing the importance of robust systems for real-world applications.
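The summary reports results as Unweighted Accuracy and mentions late fusion without defining either. As a minimal sketch, the Python below assumes UA means the average of per-class recall (the usual meaning in speech emotion recognition) and that late fusion is a weighted average of the class probabilities produced by an acoustic model and a text model. The function names, the fusion weight, and all example values are hypothetical and are not taken from the report.

```python
from collections import defaultdict

# The four emotion classes named in the summary.
CLASSES = ["Anger", "Fear", "Positive", "Neutral"]


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall: each class counts equally, whatever its size.

    Assumption: this is the common definition of UA, not one quoted from the report.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def late_fusion(acoustic_probs, text_probs, weight=0.5):
    """Weighted average of two class-probability vectors, followed by argmax.

    Assumption: an illustrative fusion rule; the report's rule is not given in the summary.
    """
    fused = [weight * a + (1 - weight) * t
             for a, t in zip(acoustic_probs, text_probs)]
    return CLASSES[fused.index(max(fused))]


# Toy labels invented for illustration; Neutral dominates, as in many real corpora.
y_true = ["Neutral", "Neutral", "Neutral", "Neutral", "Anger", "Fear"]
y_pred = ["Neutral", "Neutral", "Neutral", "Neutral", "Neutral", "Fear"]

ua = unweighted_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical posteriors in CLASSES order from an acoustic and a text model.
fused_label = late_fusion([0.6, 0.2, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1])

print(f"UA={ua:.3f}  plain accuracy={plain:.3f}  fused label={fused_label}")
```

On imbalanced data the two metrics diverge: in the toy labels above, missing the single Anger sample costs a full third of the UA but only one sixth of the plain accuracy, which is why UA is a common choice when Neutral speech dominates.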
