Cloning voices: The opportunities, threats and needed safeguards
Microsoft recently made headlines by announcing that it is working on an artificial intelligence (AI) model called VALL-E that can clone a voice from a three-second audio clip. In other words, AI can now make anyone’s voice say words that the individual never actually spoke. Even more recently, Samsung announced that its software assistant, Bixby, can now clone users’ voices to answer calls. Specifically, Bixby lets English speakers answer a call by typing a message, which Bixby converts to audio and relays to the caller on their behalf.
Technologies like VALL-E and Bixby are bringing voice cloning to reality and have the potential to be industry game changers. The term voice cloning refers to the ability to use AI to build a digital copy of a person’s unique voice, including speech patterns, accents and voice inflection, by training an algorithm with a sample of a person’s speech. Once a voice model is created, plain text is all that’s needed to synthesize a person’s speech, capturing and mimicking the sound of an individual. In fact, many different types of voice cloning companies are now launching, making this technology much more accessible.
AI-based voice cloning, when done ethically, can have many excellent applications, especially in the entertainment industry. For example, imagine being able to listen to the voice of your favorite actor narrating your grocery list as you walk through the aisles. In the unfortunate event that an actor passes away in the middle of production, their voice could still "complete" the film through the use of a deepfake voice.
Another area where voice cloning can be beneficial is helping individuals with speech disabilities. Here, a synthetic voice can give people with speech impairments the ability to express themselves in a voice that is uniquely their own. For example, a patient with throat cancer who needs to undergo removal of the larynx could have their voice cloned prior to surgery, so that the synthetic voice they use afterward sounds like their own.
On the other hand, there are some real issues with this technology going mainstream. Beyond the obvious ethical concerns, creating and using a replica of someone’s voice without their permission, and potentially for malicious activities, is a serious violation of identity and privacy. There are also legal considerations, since voice cloning can be used maliciously to defame, deceive or incriminate people. While there are bound to be cases of scam artists recording people without their knowledge or consent, we must implement the same opt-in/opt-out consent procedures that have become commonplace for facial recognition any time we endeavor to record a person’s voice. This is the only way to enable people to maintain control over their unique, natural biological identifiers.
Regarding scammers, the potential for misuse is sky-high. Until recently, cloning a voice required a large amount of recorded speech to train the algorithm. But voice cloning technology is evolving so quickly that today all that’s needed is a few minutes of speech -- or in Microsoft VALL-E’s case, a few seconds. This means that if a scammer keeps you on the phone for as little as three seconds, that is all they need to synthesize your voice without your consent. In fact, the FBI has already issued warnings about voice cloning technologies being used in grandparent scams, in which scammers call elderly couples and mimic a loved one who claims to be in jail, trapped in a foreign country or in some other difficult situation in order to extort money. Unfortunately, we can expect to see voice cloning used for other roguish purposes as well, such as creating deepfakes of politicians making remarks that spread misinformation or stir controversy.
Another significant consideration is the fact that many organizations rely on voice recognition as a form of biometric authentication -- think of, say, an emerging fintech that uses voice recognition to enable users to access accounts and exchange funds. Where voices are concerned, it can be very hard to tell what is real and what isn’t. As voice cloning breaks out into the real world -- as many expect it will -- these organizations are going to have to take steps to ensure their systems aren’t subverted by malicious use.
There are two key ways that organizations can do this. One is by implementing liveness detection, a process that is already widely used in facial recognition. Liveness detection thwarts attempts to dupe a system by determining whether the sample comes from a live person or from a spoof, such as a photo or video in place of a live face, or a voice recording in place of a live voice. A second technique involves adopting multi-factor authentication (MFA), so that once a person’s voice is identified, they are prompted to provide a second form of authentication such as a password or a one-time code sent to their mobile device. These secondary authentication methods are not foolproof (both can be intercepted) and they can introduce some user friction, but they can be effective in helping guard against spoofs.
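To make the layering concrete, here is a minimal Python sketch of how a login decision might combine a voice match, a liveness check and a one-time code. Everything in it is an illustrative assumption rather than any vendor's actual API: the `VoiceCheck` fields, the threshold values and the `authenticate` function are hypothetical, and a real system would obtain the scores from a biometric engine.

```python
import hmac
from dataclasses import dataclass


@dataclass
class VoiceCheck:
    """Scores a hypothetical biometric engine might return (0.0 to 1.0)."""
    match_score: float     # similarity to the enrolled voiceprint
    liveness_score: float  # confidence that the audio comes from a live speaker


# Illustrative thresholds only; real values would be tuned per deployment.
MATCH_THRESHOLD = 0.90
LIVENESS_THRESHOLD = 0.80


def authenticate(check: VoiceCheck, submitted_code: str, expected_code: str) -> str:
    """Accept only if liveness, voice match and the second factor all pass."""
    # A cloned or replayed voice can match well, so check liveness first.
    if check.liveness_score < LIVENESS_THRESHOLD:
        return "rejected: possible spoof"
    if check.match_score < MATCH_THRESHOLD:
        return "rejected: voice mismatch"
    # Constant-time comparison of the one-time code (the second factor).
    if not hmac.compare_digest(submitted_code.encode(), expected_code.encode()):
        return "rejected: second factor failed"
    return "accepted"


# A cloned voice may score high on similarity yet fail the liveness check.
print(authenticate(VoiceCheck(match_score=0.99, liveness_score=0.30), "123456", "123456"))
```

In this sketch a convincing clone is stopped by the liveness check even though its match score is high, and a spoof that somehow passes both voice checks still needs the one-time code.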
In summary, voice cloning is an exciting new frontier that can deliver many benefits, especially in the area of helping those with speech disabilities. But we need to be cautious with this promising technology, as the potential for ethical and legal liabilities and scamming can be significant. This is why organizations that have invested in voice recognition as a form of biometric authentication would be well-advised to take extra measures to guard against scam threats.
Dr. Mohamed Lazzouni is CTO of Aware.