Three years ago, our colleagues at Xataka published a report analyzing what benefits text transcription software offered back then. On the author's mind, as befits a good journalist, was the usual task of converting interviews to text. And he asked himself a question:
"Isn't it curious that, while we have virtual assistants that understand us almost perfectly, there is no well-known software to transcribe audio to text?"
The final conclusion was that yes, there was software, but for the most part it either was useless for long audio or it was paid; and it usually failed miserably at punctuating the text. And of course there was the question of privacy: transcription was not possible unless the audio was uploaded, at some point, to a 'cloud' service.
And then the artificial intelligence revolution arrived. It opened the door to multiple uses in text generation (with GPT-2), and soon afterwards it turned image generation upside down (with DALL-E 2). But one field was still waiting for AI: audio-to-text transcription. Because we journalists don't want a robot to replace us, but we would happily let one take care of this particular task.
And then one of the leading AI developers, OpenAI, which had already made headlines thanks to GPT-2 and DALL-E 2, launched a new AI last September: Whisper. And all of a sudden, transcribing interviews became a lot lighter.
Whisper isn’t generative audio: it’s something much more useful
Whisper is defined as "an automatic speech recognition (ASR) system" trained on "680,000 hours of multilingual supervised data collected from the web". And although it is true that 65% of those hours are in English, its results in Spanish are also excellent, showing a lower word error rate than in English.
That said, only English gets the direct translation feature: Whisper can translate the text extracted from the audio into English, but not into any other language.
“Such a large and diverse dataset allows for better handling of accents, background noise, and technical language. In addition, it makes it easier to transcribe in multiple languages, as well as translate those languages into English.”
In fact, Whisper is a set of five successively more complex models (technically speaking, with more training parameters, which translates into more GB of disk space) that are successively more demanding in terms of hardware (meaning a higher consumption of GB of RAM).
Thus, we can go from the 'tiny' version, with only 39 million parameters and a consumption of just 1 GB of RAM, to the 'large' version, with 1,550 million parameters, a consumption of 10 GB of RAM, and a speed roughly 32 times slower than 'tiny'.
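To put those numbers side by side: the 'tiny' and 'large' figures below come from the paragraph above, while the three intermediate models are approximations taken from OpenAI's Whisper repository, so treat them as ballpark values. A small sketch (the helper function is ours, purely illustrative) that picks the largest model fitting in a given RAM budget:

```python
# Approximate Whisper model sizes. The 'tiny' and 'large' figures match the
# article; 'base', 'small' and 'medium' are approximations from OpenAI's
# Whisper README and may change between releases.
WHISPER_MODELS = [
    # (name, parameters in millions, approx. RAM needed in GB)
    ("tiny",   39,   1),
    ("base",   74,   1),
    ("small",  244,  2),
    ("medium", 769,  5),
    ("large",  1550, 10),
]

def largest_model_for(ram_gb: float) -> str:
    """Return the most capable model whose RAM requirement fits the budget."""
    fitting = [name for name, _, ram in WHISPER_MODELS if ram <= ram_gb]
    if not fitting:
        raise ValueError("Not enough RAM for even the 'tiny' model")
    return fitting[-1]  # list is ordered from smallest to largest

print(largest_model_for(8))  # → medium
```

On an 8 GB machine, for example, 'medium' is the largest model that fits, which matches the rule of thumb that 'large' is really meant for machines with 10 GB of RAM or more.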
In any case, it remains a task within the reach of most desktop computers today, which, together with its status as open-source software, opens the door for everyone to transcribe on their own computer, without depending on outside services. But, at first, that was easier said than done: as with so many other AI applications, in many cases the only way to use Whisper was through services like Google Colab:
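For reference, the Python code a typical Colab notebook (or a local script) runs boils down to just a few lines. This is a minimal sketch, assuming the open-source `openai-whisper` package is installed; the function name and the audio file name are ours, not part of any official API beyond `load_model` and `transcribe`:

```python
def transcribe_file(path: str, model_name: str = "small") -> str:
    """Transcribe an audio file locally with OpenAI's open-source Whisper.

    Requires `pip install openai-whisper` (plus ffmpeg for audio decoding).
    """
    import whisper  # imported lazily so the sketch loads without the package

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # language is auto-detected by default
    return result["text"]

# Hypothetical usage:
# print(transcribe_file("interview.mp3", model_name="tiny"))
```

The only real choices to make are the model size and, optionally, the source language; everything else is handled by the library.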
It's not an insurmountably complex method (in fact, tutorials have multiplied quickly), but it can put off less experienced users. And, of course, it still does not solve the privacy issue we mentioned before. Fortunately, as happened with Stable Diffusion (another open-source AI), applications have begun to arrive that offer a graphical interface, making the use of Whisper practically trivial and leaving the user little more to do than select a few options and click 'OK'.
Buzz: Whisper for Dummies
And that's where Buzz comes in: a simple cross-platform desktop program (it's available for Windows, macOS and Linux) that we can download from its GitHub repository, and that looks like this:
From that window, we can choose the task (transcribe / translate), the source language (the list is extensive, and it includes automatic detection by default), the model quality (it excludes the most complex of the five models mentioned above) and the microphone to use as audio source. This allows us to dictate as we go and watch our words appear transcribed, or translated into English.
However, the most common use of this program will be to process an audio file. To do this, we click on 'File > Import audio file'. Once we have selected the file in question, another window similar to the previous one appears, although the microphone field is replaced by one that lets us choose the output format of the transcription (.txt, or a subtitle format).
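To give an idea of what that subtitle export involves: Whisper reports each segment's start and end time in seconds, and a subtitle format like .srt expects them as `HH:MM:SS,mmm` timestamps. A minimal sketch of that conversion (the function is ours, for illustration; it is not Buzz's code):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)   # ms per hour
    minutes, millis = divmod(millis, 60_000)    # ms per minute
    secs, millis = divmod(millis, 1_000)        # ms per second
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

print(srt_timestamp(3725.5))  # → 01:02:05,500
```

A plain .txt export simply drops these timestamps and keeps the text.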
In the test we conducted, we decided to bet on a difficult audio (a somewhat frantic dialogue, with three interlocutors in a live video broadcast); we specifically chose this video from Xataka TV. We extracted the audio track and fed it to Buzz twice: first with the 'Very Low' quality model, then with the 'High' quality model. Here is the result:
Well, the 'Very Low' quality doesn't deserve further comment: it's a slightly lysergic transcription with only tangential similarities to reality. The highest-quality model (among those Buzz offers), however, provides a much better experience. Without being infallible, we would say that all it lacks is identifying and separating the interlocutors to be everything an interviewer could ask for for Christmas. In any case, it makes our lives much easier.
Of course, to achieve that result we had to put our machine through its paces for almost half an hour, as can be seen in this screenshot: