Microsoft unveils VALL-E, an advanced text-to-speech AI that can speak in anyone's voice based on a 3-second sample

Microsoft has revealed details of its latest foray into the world of artificial intelligence. Billed as a "neural codec language model", VALL-E is an advanced AI-driven text-to-speech (TTS) system that the developers say can speak in anyone's voice based on just a three-second sample of it.

The result is a remarkably natural-sounding TTS system that takes an entirely different approach from existing systems. VALL-E conveys tone and emotion well enough to sound realistically human, but there are concerns that it could be used to create audio deepfakes.

The AI has been trained on 60,000 hours of English speech from thousands of individuals, including public domain audiobooks. Working from a short sample, VALL-E is able to mimic the tone and timbre of a voice more closely than has previously been possible.

Writing about VALL-E, a team of Microsoft researchers says:

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.

The team goes on to say: "Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis".
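
To make the idea of "TTS as a conditional language modeling task" a little more concrete, here is a minimal toy sketch in Python. It is not Microsoft's code and bears no resemblance to the scale of VALL-E: the `quantize` function is a crude stand-in for a neural audio codec, and `ToyConditionalLM` is a hypothetical class invented for this illustration that simply counts which code tends to follow which for a given piece of text. It only shows the framing the researchers describe: audio becomes discrete codes, and new codes are predicted one at a time from the text plus a short acoustic prompt.

```python
import random
from collections import Counter, defaultdict


def quantize(samples, levels=8):
    """Toy stand-in for a neural audio codec: floats in [-1, 1] -> integer codes."""
    codes = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp to the valid range
        codes.append(min(levels - 1, int((s + 1.0) / 2.0 * levels)))
    return codes


class ToyConditionalLM:
    """Hypothetical toy model: next-code counts given (text, previous code)."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, pairs):
        # pairs: iterable of (text, codes) training examples
        for text, codes in pairs:
            for prev, nxt in zip(codes, codes[1:]):
                self.counts[(text, prev)][nxt] += 1

    def generate(self, text, prompt_codes, n_new, seed=0):
        """Continue the acoustic prompt, one discrete code at a time."""
        if not prompt_codes:
            raise ValueError("an acoustic prompt of at least one code is required")
        rng = random.Random(seed)
        out = list(prompt_codes)
        for _ in range(n_new):
            options = self.counts.get((text, out[-1]))
            if not options:
                # Unseen context: fall back to repeating the last code.
                out.append(out[-1])
                continue
            codes, weights = zip(*options.items())
            out.append(rng.choices(codes, weights=weights)[0])
        return out[len(prompt_codes):]  # only the newly generated codes


# Usage: one made-up "recording" paired with its text.
model = ToyConditionalLM()
model.fit([("hello", quantize([-1.0, -0.5, 0.0, 0.5, 1.0]))])
print(model.generate("hello", prompt_codes=[0, 2], n_new=3))  # [4, 6, 7]
```

In a real system the generated codes would be passed back through the codec's decoder to produce a waveform; that step is left out here.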

You can find out more on the VALL-E demo page, which has numerous samples of how it sounds when given different voice prompts.

Image credit: ra2studio / depositphotos

© 1998-2024 BetaNews, Inc. All Rights Reserved.