Sept. 2, 2024, 7:12 a.m.

Challenges of Data Access for Generative AI Models Highlighted in New Report

Brief news summary

A new report by the Data Provenance Initiative finds that many organizations are restricting access to the data sets used to train generative AI models, with significant implications for AI companies and their ability to improve those models. The report describes how websites use the Robots Exclusion Protocol (robots.txt) to keep web crawlers out of specific parts of their sites. As news and academic websites add restrictions to protect their content from generative AI, the availability of high-quality data sets has declined. The report also highlights the rise of synthetic data and the challenges and opportunities it presents. Overall, it signals a crisis in obtaining consent for data usage and calls for new standards that let website owners express their data preferences.

Generative AI models rely on large training data sets, typically composed of public data from the internet. However, organizations are increasingly restricting access to their data through robots.txt files, fearing the potential impact of generative AI on their businesses. This restriction poses challenges for AI companies that rely heavily on such data. The Data Provenance Initiative's report, titled "Consent in Crisis: The Rapid Decline of the AI Data Commons," reveals that a significant portion of the data used to train AI models has been restricted in recent years.

This restriction not only affects the quality and freshness of the data but also creates a gap between models that respect robots.txt and those that disregard it. Some potential solutions proposed include licensing data directly from organizations, utilizing synthetic data, or finding ways to extract hidden data, such as that locked away in PDFs. The report emphasizes the need for industry standardization and improved mechanisms for expressing data usage preferences that balance the interests of various stakeholders.
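
To make the robots.txt mechanism described above concrete, here is a minimal sketch of how a well-behaved crawler might check a site's rules before fetching a page. It uses Python's standard `urllib.robotparser` module. The crawler name `ExampleAIBot`, the `example.com` URLs, and the rules themselves are hypothetical and are not taken from the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is blocked from the whole site,
# while all other crawlers are blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/data"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The check is voluntary: robots.txt only states the site's preferences, and a crawler that skips it is not technically prevented from fetching the page. That is the gap the article notes between models that respect robots.txt and those that disregard it.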

