Artificial intelligence models such as ChatGPT and Gemini need a lot of computing resources and a lot of energy, but also a lot of training data. To give AI labs new data for training their models, Harvard will create a huge dataset of one million books through its new Institutional Data Initiative.
Books in the public domain, brought together in a dataset for AI
This data can be used to train future AI models because the works have fallen into the public domain and are therefore no longer protected by copyright. According to Wired magazine, the dataset is five times larger than Books3, a dataset that Meta used to train its Llama model.
A project supported by Google, Microsoft and OpenAI
The project is supported by OpenAI and Microsoft, with participation from Google through its Google Books initiative. The aim is to put all stakeholders on an equal footing, since the dataset will be accessible free of charge. Large organizations like OpenAI or Google can afford to pay for access to copyrighted texts, but that is much harder for a small startup.
More datasets to come
The Harvard Institutional Data Initiative has no plans to stop there. It is already collaborating with the Boston Public Library to digitize millions of news articles that are in the public domain. According to Wired, the university is also open to other partnerships.
This is not the only initiative of its kind. For example, in March 2024, the Hugging Face platform released a dataset totaling 500 billion words, with text in English, French, Dutch, Spanish, German and Italian.
- Developing generative artificial intelligence models requires not only chips and energy, but also an immense amount of training data.
- Harvard is embarking on a new project to publish a dataset of 1 million public domain books, which AI labs will be able to use.
- Harvard is also working on another project to digitize millions of news articles.






