Microsoft unveils VALL-E, an advanced text-to-speech AI that can speak in anyone's voice based on a 3-second sample
Microsoft has revealed details of its latest foray into the world of artificial intelligence. Billed as a "neural codec language model", VALL-E is an advanced AI-driven text-to-speech (TTS) system that the developers say can learn to speak like anyone based on just a three-second sample of their voice.
The result is an incredibly natural-sounding TTS system that takes an entirely different approach from existing systems. VALL-E conveys tone and emotion well enough to sound realistically human, but there are concerns that it could be used for audio deepfakes.
The AI has been built and trained using 60,000 hours of English speech from thousands of individuals, including public domain audiobooks. Working with a short sample, VALL-E is able to closely mimic the tone and timbre of a voice in a way that has not been possible previously.
Writing about VALL-E, a team of Microsoft researchers says:
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.
The team goes on to say: "Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis".
You can find out more on the VALL-E demo page, where there are numerous samples of how it sounds based on various voice prompts.