We showed them chess games and they became unbeatable rivals; we let them read our texts and they began to write. They also learned to paint and retouch photographs. Did anyone really doubt that artificial intelligence would be able to do the same with speech and music?
Google’s research division has released AudioLM (paper), a framework for generating high-quality audio that remains consistent over the long term. Starting from a recording just a few seconds long, it can prolong it in a natural and coherent way. Most remarkably, it achieves this without being trained on transcripts or annotations, yet the generated speech is both syntactically and semantically plausible. In addition, it maintains the identity and prosody of the speaker so well that a listener cannot discern which section of the audio is original and which has been generated by artificial intelligence.
The examples of this artificial intelligence are amazing. Not only can it replicate articulation, pitch, timbre, and intensity, but it can also introduce the sound of the speaker’s breath and form meaningful sentences. If the source is not a studio recording but one with background noise, AudioLM replicates that noise to preserve continuity. More samples can be heard on the AudioLM website.
An artificial intelligence trained in semantics and acoustics
How does it do it? Audio and music generation is nothing new; what is new is the way Google researchers have chosen to address the problem. From each audio clip they extract semantic tokens, which encode high-level structure (phonemes, lexicon, semantics…), and acoustic tokens (speaker identity, recording quality, background noise…). With the data processed into a form the model can work with, AudioLM operates hierarchically: it predicts the semantic tokens first, then uses them as constraints to predict the acoustic tokens. The acoustic tokens are finally decoded back into something we humans can hear.
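The hierarchy described above can be sketched in a few lines of toy code. This is an illustrative outline of the staged pipeline, not Google’s implementation: the function names, the dummy next-token rules, and the token vocabularies are all hypothetical stand-ins for the real trained models and neural codec.

```python
# Toy sketch of AudioLM-style hierarchical generation (all logic is a stand-in).

def predict_semantic(prompt_semantic, n_new):
    """Stage 1: autoregressively extend the coarse semantic tokens.
    A real model would be a trained language model; this toy rule just
    increments the last token."""
    tokens = list(prompt_semantic)
    for _ in range(n_new):
        tokens.append((tokens[-1] + 1) % 100)  # dummy next-token rule
    return tokens

def predict_acoustic(semantic_tokens, prompt_acoustic):
    """Stage 2: predict fine acoustic tokens, constrained by the full
    semantic sequence (toy rule: one acoustic token per semantic token)."""
    acoustic = list(prompt_acoustic)
    for s in semantic_tokens[len(acoustic):]:
        acoustic.append(s * 3 % 256)  # dummy mapping conditioned on semantics
    return acoustic

def decode_to_audio(acoustic_tokens):
    """Stage 3: a neural codec would turn acoustic tokens into a waveform;
    here we just emit placeholder samples in [0, 1)."""
    return [t / 256.0 for t in acoustic_tokens]

# Continue a 3-token "recording" with 5 new steps.
semantic = predict_semantic([10, 11, 12], n_new=5)
acoustic = predict_acoustic(semantic, prompt_acoustic=[30, 33, 36])
audio = decode_to_audio(acoustic)
print(len(semantic), len(acoustic), len(audio))  # 8 8 8
```

The key design point the sketch tries to capture is the ordering: the cheap, high-level semantic plan is committed to first, and the expensive acoustic detail is filled in afterwards under that plan, which is what lets the output stay coherent over long stretches.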
This separation of semantics from acoustics, and the hierarchy between them, is not only beneficial for training language models that generate speech. According to the researchers, it is also more effective for continuing piano compositions, as shown on their website, and performs much better than models trained on acoustic tokens alone.
The most significant thing about AudioLM’s artificial intelligence is not that it is able to continue speeches and melodies, but that it can do everything at once. It is therefore a single language model that can be used for text-to-speech —a robot could read entire books and give professional voice actors a break— or to make any device able to communicate with people through a familiar voice. This idea has already been studied by Amazon, which considered using the voice of loved ones in its Alexa speakers.
Exciting or dangerous?
Programs like Dalle-2 and Stable Diffusion are exceptional tools that let you sketch ideas or generate creative resources in seconds, like the illustration used on the cover of this article. Audio may be even more consequential: one can imagine an announcer’s voice being licensed on demand by various businesses, or movies being dubbed with the voices of deceased actors. The reader may be wondering whether this possibility, although exciting, is not also dangerous. Any audio recording could be manipulated for political, legal, or judicial purposes. Google says that, although humans have difficulty telling what comes from a person and what from artificial intelligence, a computer can detect whether the audio is organic or not. In other words, not only can the machine replace us; to verify its work, another machine will be essential.
At the moment AudioLM is not open to the public; it is just a language model that can be integrated into different projects. But this demo, along with OpenAI’s Jukebox music program, shows how quickly we are entering a new world where no one will ever know, or care, whether a picture was taken by a person, or whether the voice on the other end of the phone is a human being or speech generated artificially in real time.