Introducing DataGemma: Grounding LLMs with Google’s Data Commons

Large Language Models (LLMs) have changed the way we interact with information, but grounding their outputs in verifiable facts remains a major challenge. The difficulty is exacerbated by the fragmented nature of real-world knowledge, spread across sources with differing formats and APIs that complicate integration. The lack of grounding often results in "hallucinations," where LLMs produce incorrect or misleading information. Because our research focuses on creating responsible and trustworthy AI systems, addressing hallucinations in LLMs is essential.

We are pleased to introduce DataGemma, an experimental set of open models designed to tackle hallucination by grounding LLMs in the extensive statistical data available in Google's Data Commons. Data Commons already features a natural language interface, allowing users to query data without writing traditional database queries. For instance, one can ask, "What industries contribute to California jobs?" or "Have any countries increased their forest land?" DataGemma thus simplifies access to diverse data formats by acting as a universal API for LLMs.

DataGemma builds on the Gemma family of lightweight, state-of-the-art open models, which draw on the technologies underlying our Gemini models. By using the knowledge stored in Data Commons, DataGemma aims to improve the factual accuracy and reasoning of LLMs, employing advanced retrieval techniques to integrate data from credible institutions, thereby reducing hallucinations and enhancing reliability.

DataGemma operates through natural language queries, so users do not need to understand complex data schemas. It employs two methodologies: Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG). RAG retrieves relevant data from Data Commons before text generation, ensuring a solid factual basis for responses. One challenge with RAG is managing the vast amount of data returned by broad queries, which averages around 38,000 tokens and can reach 348,000 tokens.
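To make the retrieve-then-augment pattern concrete before walking through the individual steps, here is a minimal Python sketch under stated assumptions: the `query_data_commons` helper and its endpoint URL are hypothetical placeholders for whatever interface passes a natural language query to Data Commons and returns statistical tables, not the actual DataGemma or Data Commons client.

```python
import requests


def query_data_commons(nl_query: str) -> list[str]:
    """Hypothetical helper: send a natural language query to Data Commons
    and return matching statistical tables as serialized text.
    The endpoint below is a placeholder, not a real API."""
    resp = requests.post(
        "https://example.com/datacommons/nl-query",  # placeholder URL
        json={"query": nl_query},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("tables", [])


def build_augmented_prompt(user_question: str) -> str:
    """Retrieve relevant tables first, then prepend them to the user's
    question so the generating model answers from retrieved facts."""
    tables = query_data_commons(user_question)
    # Broad queries can return very large tables (tens to hundreds of
    # thousands of tokens), which is why a long-context model is needed
    # to consume the augmented prompt.
    context = "\n\n".join(tables)
    return (
        "Using only the statistics below, answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {user_question}"
    )


prompt = build_augmented_prompt("Have any countries increased their forest land?")
# `prompt` is then passed to a long-context LLM to generate the final response.
```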
Handling that much retrieved data is feasible thanks to Gemini 1.5 Pro’s long context window, which permits extensive data integration. Here’s how RAG functions:

1. **User submission**: A user poses a question to the LLM.
2. **Query processing**: The DataGemma model analyzes the input and formulates a natural language query for Data Commons.
3. **Data retrieval**: The model queries Data Commons and retrieves the pertinent data tables.
4. **Prompt augmentation**: The retrieved data is combined with the user’s original query.
5. **Response generation**: A larger LLM then generates a well-rounded, fact-based response from the augmented prompt.

This approach has clear advantages, such as improved accuracy as LLMs evolve and use longer context more effectively. However, modifying the user prompt can sometimes diminish the user experience, and the results depend heavily on the quality of the generated queries.

We recognize that DataGemma is just the beginning in developing grounded AI, and we invite researchers, developers, and enthusiasts to explore this tool with us. Our aim is to ground LLMs in Data Commons’ real-world data, enhancing AI’s ability to provide intelligent, evidence-based information. We encourage reading our accompanying research paper for further insights. We also hope others extend this research beyond our approach with Data Commons, which offers means for third parties to create their own instances. The principles of this research apply to other knowledge graph formats as well, and we anticipate further exploration in this area.

To get started with DataGemma, download the models from Hugging Face or Kaggle (RIG, RAG) and check out our quickstart notebooks, which provide practical introductions to both approaches.
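For readers who want a code-level starting point, the fragment below is a minimal sketch of loading the RAG-variant model with the Hugging Face transformers library. The model id, dtype, and device settings are assumptions to be checked against the model card; the official quickstart notebooks remain the reference for the full RIG and RAG workflows.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id -- verify the exact name on Hugging Face or Kaggle.
MODEL_ID = "google/datagemma-rag-27b-it"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # roughly halves memory versus float32
    device_map="auto",           # place layers on available accelerators
)

question = "What industries contribute to California jobs?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```

In the RAG flow described above, this model's role corresponds to step 2 (formulating the natural language query for Data Commons); the final response in step 5 is produced by a larger, long-context model.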