Although artificial intelligence has evolved remarkably in recent months through a multitude of generative tools such as ChatGPT, Bing Chat, or Bard, one of the most useful tools we have today may well be Whisper. If you don't know it yet, it is a tool capable of transcribing audio to text with the help of artificial intelligence. Its accuracy is surprising, and since its launch, applications have appeared that use its API to make it easier to use, such as Buzz.
However, OpenAI is not the only company to have delved into this kind of project. Meta has also been working for some time on its own tool to transcribe audio to text, and vice versa, in 1,100 languages. Named MMS (Massively Multilingual Speech) by its team, it promises high accuracy, with less than half the word error rate of Whisper.
A model capable of transcribing audio in some 1,100 languages into text
In an article published on its official website, Meta has shared the details of this new tool, which aims to become one of the most powerful for audio and text transcription. The secret to its effectiveness, as with Whisper, is the use of artificial intelligence to recognize speech in some 1,100 languages, according to Meta. The tool also has the potential to identify some 4,000 languages from around the world.
For this project they used Wav2vec 2.0, a model that learns in a self-supervised fashion from unlabeled training data. They also built a new dataset to train the model on those 1,100 languages. Meta's most powerful model has a total of about 1 billion parameters.
Image: Meta
The curious thing about it all is that they started working with religious texts such as the Bible, since it has been translated into many languages and has been carefully studied in translation research. In fact, there are numerous hours of audio of people reading the New Testament, and from this they built a dataset of about 32 hours of data for each of the 1,100 languages covered.
According to Meta, the word error rate of its 1,100-language model is just 18.7 on the FLEURS benchmark, noticeably lower than the 44.3 produced by Whisper. The number of languages is also much higher than Whisper's, which covers about 100. There is also a smaller MMS model covering 61 languages.
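To put those figures in context: word error rate (WER) is the word-level edit distance (insertions, deletions, and substitutions) between a model's transcript and a reference transcript, divided by the number of words in the reference. A minimal sketch in Python, not Meta's actual evaluation code, might look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One word dropped out of six: WER = 1/6 ≈ 0.167 (i.e. ~16.7%)
print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 3))
```

A WER of 18.7 therefore means roughly one error for every five words transcribed, versus almost one in two for the 44.3 figure Meta quotes for Whisper on the same benchmark.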
This is not the first time that Meta has worked on multilingual translation and transcription projects. Some time ago we covered NLLB, its language model capable of translating between 200 languages.
The company has published the paper corresponding to this project, as well as all its code on GitHub to run the tool. As with Whisper, it is very likely that before long we will see compatible applications for different operating systems built on this project.
The idea is for the project to cover even more languages, while also tackling the difficulty of dialects. Meta also notes that this technology could be beneficial in specific use cases involving virtual and augmented reality.
In Genbeta | After comparing Google Bard, ChatGPT and Bing Chat, it's clear to me that I need a change