Combining conversational speech with read speech to improve prosody in Text-to-Speech synthesis (Submitted to Interspeech 2022)
Authors: Johannah O'Mahony, Catherine Lai, Simon King
CSTR, University of Edinburgh, Scotland
Abstract
For isolated utterances, speech synthesis quality has improved immensely thanks to sequence-to-sequence models. However, these models are generally trained on read speech and fail to generalise to unseen speaking styles. Recently, more research has focused on the synthesis of expressive and conversational speech. Conversational speech contains many prosodic phenomena that are not present in read speech. We would like to learn these prosodic patterns from data, but unfortunately, many large conversational corpora are unsuitable for speech synthesis due to low audio quality. We investigate whether a data-mixing strategy can improve conversational prosody for a target voice built from audiobook monologue data by adding real conversational speech from podcasts. We filter the podcast data to create a set of 26k question-answer pairs. We evaluate two FastPitch models: one trained on 20 hours of monologue speech from a single speaker, and another trained on 5 hours of monologue speech from that speaker plus 15 hours of questions and answers spoken by nearly 15k speakers. Results from three listening tests show that listeners prefer the question prosody generated by the second model.
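The paper itself does not include code, but the data-mixing setup described above can be illustrated with a minimal sketch. The manifest paths, filenames, and pipe-separated "audio|transcript|speaker" field layout below are assumptions for illustration only, not the authors' actual pipeline; FastPitch-style training commonly consumes filelists of this general shape.

```python
import random
from pathlib import Path

# Hypothetical manifests: each line is "audio_path|transcript|speaker_id".
# These paths and the field layout are assumptions, not the paper's pipeline.
AUDIOBOOK_MANIFEST = Path("filelists/audiobook_5h.txt")  # 5 h, single target speaker
PODCAST_MANIFEST = Path("filelists/podcast_qa_15h.txt")  # 15 h, ~15k speakers, Q&A pairs


def load_manifest(path: Path) -> list[str]:
    """Read one manifest, dropping blank lines."""
    return [line for line in path.read_text(encoding="utf-8").splitlines()
            if line.strip()]


def mix_corpora(primary: list[str], auxiliary: list[str],
                seed: int = 1234) -> list[str]:
    """Concatenate the two corpora and shuffle so that each training
    batch mixes read monologue with conversational Q&A speech."""
    mixed = primary + auxiliary
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    mixed = mix_corpora(load_manifest(AUDIOBOOK_MANIFEST),
                        load_manifest(PODCAST_MANIFEST))
    Path("filelists/mixed_train.txt").write_text(
        "\n".join(mixed) + "\n", encoding="utf-8")
```

The resulting mixed filelist would then be passed to FastPitch training in place of the single-speaker filelist; the 20-hour baseline model corresponds to training on the audiobook manifest alone.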