Microsoft unveils VALL-E, an advanced text-to-speech AI that can speak in anyone's voice based on a 3-second sample

Date: 2023-01-10

Microsoft has revealed details of its latest foray into the world of artificial intelligence. Billed as a "neural codec language model", VALL-E is an advanced AI-driven text-to-speech (TTS) system that the developers say can speak in anyone's voice based on just a three-second sample of that voice.

The result is an incredibly natural-sounding TTS system that takes an entirely different approach from existing systems. Better able to convey tone and emotion than earlier systems, VALL-E sounds realistically human, but there are concerns that it could be used for audio deepfakes.

The AI has been built and trained using 60,000 hours of English speech from thousands of individuals, including public domain audiobooks. Working with a short sample, VALL-E is able to closely mimic the tone and timbre of a voice in a way that has not been possible previously.

Writing about VALL-E, a team of Microsoft researchers say:

"We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt."

The team goes on to say: "Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis".
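
To make the researchers' "conditional language modeling" framing a little more concrete, here is a minimal toy sketch in Python. It is not Microsoft's code and contains no trained model: the function names, the tiny codebook size and the stand-in probability function are all hypothetical, chosen only to show the shape of the idea. Speech is represented as a sequence of discrete codes, and each new code is sampled one at a time, conditioned on the text and on the codes of the short voice prompt.

```python
import random

CODEBOOK_SIZE = 8  # hypothetical; a real audio codec uses far larger codebooks


def toy_next_code_probs(text_tokens, acoustic_codes):
    """Stand-in for a trained model (an assumption, not VALL-E itself).

    Returns a probability distribution over the next discrete audio code,
    which depends on both the text and the codes seen so far.
    """
    seed = sum(text_tokens) + sum(acoustic_codes) + len(acoustic_codes)
    weights = [((seed + 3 * k) % CODEBOOK_SIZE) + 1 for k in range(CODEBOOK_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def synthesize_codes(text_tokens, prompt_codes, n_new, rng):
    """Autoregressively sample n_new codes that continue the voice prompt."""
    codes = list(prompt_codes)  # the short enrolled recording, already encoded
    for _ in range(n_new):
        probs = toy_next_code_probs(text_tokens, codes)
        next_code = rng.choices(range(CODEBOOK_SIZE), weights=probs, k=1)[0]
        codes.append(next_code)
    # Only the newly generated codes would be decoded back into audio.
    return codes[len(prompt_codes):]


if __name__ == "__main__":
    rng = random.Random(0)
    new_codes = synthesize_codes([1, 2, 3], [4, 5, 6], n_new=5, rng=rng)
    print(new_codes)
```

In a real system, the prompt codes would come from a neural audio codec applied to the three-second recording, and the generated codes would be passed back through the codec's decoder to produce a waveform. Both of those steps are left out here.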

You can find out more on the VALL-E demo page, where there are numerous samples of how it sounds based on various voice prompts.

Image credit: ra2studio / depositphotos

© 1998-2023 BetaNews, Inc. All Rights Reserved.