42dot

VoxtLM: Unified Decoder-Only Models for Speech Tasks

Time to read

25 mins

Publication

01/25/24

Language

English

Summary

This technical report presents VoxtLM, a decoder-only language model that performs four speech-related tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates the text vocabulary with discrete speech tokens derived from self-supervised speech features, so a single model handles both modalities. The report emphasizes that multitask learning yields significant improvements over single-task models, detailing gains in speech intelligibility and quality for speech synthesis as well as improvements in speech generation and recognition. The authors train with publicly available datasets and open-source tools to ensure reproducibility. The report also compares VoxtLM with existing models and highlights the advantage of a unified approach: multiple speech functionalities are integrated within a single framework.
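
The idea of combining a text vocabulary with discrete speech tokens can be illustrated with a minimal Python sketch. It is not the report's implementation: the vocabulary sizes, the two modality markers, and all function names below are hypothetical, chosen only to show how two token inventories might share one ID space so that a single decoder-only model can read and emit both.

```python
# Minimal sketch of a unified text + speech token ID space.
# All sizes, marker names, and helpers are hypothetical illustrations,
# not values or APIs taken from the VoxtLM report.

TEXT_VOCAB_SIZE = 8      # assumed number of text (e.g. subword) units
SPEECH_VOCAB_SIZE = 4    # assumed number of discrete speech units
SPECIAL_TOKENS = ["<text>", "<speech>"]  # hypothetical modality markers

# Lay out the unified ID space: specials first, then text, then speech.
TEXT_MARK = SPECIAL_TOKENS.index("<text>")
SPEECH_MARK = SPECIAL_TOKENS.index("<speech>")
TEXT_OFFSET = len(SPECIAL_TOKENS)
SPEECH_OFFSET = TEXT_OFFSET + TEXT_VOCAB_SIZE
UNIFIED_VOCAB_SIZE = SPEECH_OFFSET + SPEECH_VOCAB_SIZE


def text_id(unit: int) -> int:
    """Map a text unit index into the unified ID space."""
    if not 0 <= unit < TEXT_VOCAB_SIZE:
        raise ValueError(f"text unit out of range: {unit}")
    return TEXT_OFFSET + unit


def speech_id(unit: int) -> int:
    """Map a discrete speech unit index into the unified ID space."""
    if not 0 <= unit < SPEECH_VOCAB_SIZE:
        raise ValueError(f"speech unit out of range: {unit}")
    return SPEECH_OFFSET + unit


def recognition_sequence(speech_units: list[int], text_units: list[int]) -> list[int]:
    """Speech first, then text: the model would learn to continue with text."""
    return ([SPEECH_MARK] + [speech_id(u) for u in speech_units]
            + [TEXT_MARK] + [text_id(u) for u in text_units])


def synthesis_sequence(text_units: list[int], speech_units: list[int]) -> list[int]:
    """Text first, then speech: the model would learn to continue with speech."""
    return ([TEXT_MARK] + [text_id(u) for u in text_units]
            + [SPEECH_MARK] + [speech_id(u) for u in speech_units])


if __name__ == "__main__":
    print(UNIFIED_VOCAB_SIZE)
    print(recognition_sequence([0, 3], [1]))
    print(synthesis_sequence([1], [0, 3]))
```

Under these assumptions, recognition and synthesis differ only in the order of the two segments, which is one way a single next-token objective could cover several tasks.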
