OpenAI could find itself on rocky ground over copyright infringement.
Artificial intelligence is beginning to dominate the digital world, and many authoritative voices are uneasy about how these systems acquire their knowledge. Some platforms, such as ChatGPT, have already been banned in certain countries over privacy concerns, and now it is the turn of the new GPT-4, which finds itself at the center of controversy for having memorized countless works of literature.
Feeding GPT-4 with contemporary novels and literary classics
Recently, according to information published on the New Scientist website, we have learned that both GPT-4 and ChatGPT appear to have memorized complete works, adding ever more data to their models, in a move by OpenAI that does not seem like an entirely legal way to integrate knowledge into these artificial intelligence platforms. The method used to discover which books these AIs have ingested is, of course, a curious one.
Top 20 books that ChatGPT has memorized — similar but a little different than GPT-4
It’s interesting that Zora Neale Hurston is one of the top titles here. She and Chinua Achebe are basically the only non-white authors with a top memorized book from this study. pic.twitter.com/VWxMCHLyxq
— Melanie Walsh (@mellymeldubs) May 5, 2023
All of this comes from David Bamman, a professor at the University of California, Berkeley, who works in natural language processing. Together with colleagues, he used the language models in a curious experiment: they selected 100 short passages from successful books, drawn from Pulitzer Prize-winning novels and The New York Times bestseller list, and removed the name of a character from each. They then fed each passage into the model and asked the AI whether it could fill in the blank.
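The "name cloze" procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual code: the passage, the mask token, and the helper names are all assumptions made here for the example.

```python
def make_cloze(passage: str, character: str, mask: str = "[MASK]") -> str:
    """Replace every occurrence of the character's name with a mask token."""
    return passage.replace(character, mask)

def is_hit(model_answer: str, character: str) -> bool:
    """Count a response as a hit only if it recovers the character's name."""
    return model_answer.strip().lower() == character.lower()

def hit_rate(model_answers: list[str], character: str) -> float:
    """Fraction of a model's answers that fill the blank correctly."""
    hits = sum(is_hit(answer, character) for answer in model_answers)
    return hits / len(model_answers)

# Illustrative passage; a model that has memorized the book should be
# able to answer "Dursley" when shown the masked version.
passage = "Mr. Dursley was the director of a firm called Grunnings."
print(make_cloze(passage, "Dursley"))

# Scoring three hypothetical model responses for the same blank.
print(hit_rate(["Dursley", "Potter", "dursley"], "Dursley"))
```

The per-book percentages reported in the study correspond to this hit rate computed over many masked passages from the same novel, so a high score suggests the model has seen (and retained) the text itself rather than merely a summary of it.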
What they discovered can be consulted in the results published on GitHub, which list the selected books and the percentage of correct responses for ChatGPT and GPT-4, as well as for Google's BERT model. In the case of Harry Potter and the Philosopher's Stone, for example, ChatGPT's hit rate is 43%, while GPT-4's rises to 76%. BERT, for its part, fails this test, as it does virtually every other novel-based test.
The legal implications of using this training method with books under copyright are still uncertain, but they do not look good for OpenAI. Andres Guadamuz, of the University of Sussex, states that:
Legal issues are complicated. OpenAI is training its models with online works that can include a large number of legitimate citations from all over the internet, but also potentially pirated copies.