Harvard to Release Dataset of 1 Million Public-Domain Books for AI Training

AI training data can be costly, and access to it is often limited to wealthy tech companies. To counter this, Harvard University intends to publish a dataset of around 1 million public-domain books.
These books, by authors such as Dickens, Dante, and Shakespeare, are out of copyright because of their age and span a range of genres and languages. The dataset is not yet available, and details of its release remain unclear. The books come from Google's long-running book-scanning project, Google Books, and Google will help make this "treasure trove" widely accessible. Harvard announced the Institutional Data Initiative (IDI) in March; it aims to provide a "trusted conduit for legal data for AI." Until today, details were scarce, but it is now confirmed that IDI is financially supported by Microsoft and OpenAI.
Brief news summary
Harvard University is planning to release a dataset of around 1 million public-domain books. These works, spanning various genres and languages, include authors such as Dickens, Dante, and Shakespeare, and are no longer under copyright because of their age. The release date and method for the dataset are still unconfirmed. The books are sourced from Google's extensive book-scanning project, Google Books, and Google will aid in distributing the collection. Harvard introduced the Institutional Data Initiative (IDI) in March, aiming to establish a reliable source of legal data for AI purposes. Today marks the formal launch of the IDI, revealing financial support from Microsoft and OpenAI. The initiative responds to the high cost of AI training data, which is often affordable only to large tech companies, and seeks to make that data more accessible, with Google's collaboration helping to extend the dataset's reach.
