Microsoft unveils VALL-E, an advanced text-to-speech AI that can speak in anyone's voice based on a 3-second sample

By Sofia Elizabella Wyciślik-Wilson
Published 3 years ago

Microsoft has revealed details of its latest foray into the world of artificial intelligence. Billed as a "neural codec language model", VALL-E is an advanced AI-driven text-to-speech (TTS) system that the developers say can mimic anyone's voice based on just a three-second sample of it.

The result is an incredibly natural-sounding TTS system that takes an entirely different approach to existing systems. Able to convey tone and emotion better than ever, VALL-E sounds realistically human, but there are concerns that it could be used for audio deepfakes.

The AI has been built and trained using 60,000 hours of English speech from thousands of individuals, including public domain audiobooks. Working with a short sample, VALL-E is able to closely mimic the tone and timbre of a voice in a way that has simply not been possible previously.

Writing about VALL-E, a team of Microsoft researchers say:

"We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt."

The team goes on to say: "Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis".

You can find out more on the VALL-E demo page, where there are numerous samples of how it sounds based on various voice prompts.

Image credit: ra2studio / depositphotos
