
Inside DataGemma: Google DeepMind’s Initiative to Ground LLMs in Factual Knowledge

Last Updated on September 17, 2024 by Editorial Team

Author(s): Jesus Rodriguez

Originally published on Towards AI.

Created Using DALL-E

I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

TheSequence | Jesus Rodriguez | Substack (thesequence.substack.com)

Grounding large foundation models such as LLMs on factual data is one of the biggest challenges facing the current wave of AI systems. From reducing hallucinations to expanding LLMs into mission-critical applications, validating LLM outputs against trustworthy data is rapidly becoming one of the most important building blocks of LLM applications. This is the topic of recent research from Google DeepMind, which resulted in the creation of DataGemma, a series of open models that validate knowledge against a large factual data repository known as Data Commons. DataGemma is the latest addition to the Gemma family, DeepMind’s initiative around small language models.

DataGemma

Conceptually, DataGemma is an innovative system developed to bridge LLMs with vast, real-world data sourced from Google’s Data Commons. Its purpose is to enable LLMs to interact with external databases, ensuring that their responses are more accurate and grounded in real-time information. The system tackles three key challenges in integrating LLMs with external data.

First, the LLM needs to learn when to rely on its internal knowledge and when to seek external information. Determining the right moments to access external sources and framing appropriate questions are crucial steps. Various techniques are being explored to instill this ability in the model.

Second, identifying the appropriate external source is essential. Given the wide range of possible data sources, this decision is kept separate from the LLM’s core knowledge. DataGemma simplifies this by connecting to a comprehensive database, so the LLM does not need to manage multiple sources.

Finally, after pinpointing the necessary data, the LLM must generate appropriate queries. Different databases use different formats, but DataGemma solves this issue by adopting a universal API system that relies on natural language. This approach draws inspiration from Robert McCool’s URL parameter encoding system, which has proven effective over time. With this setup, the model uses natural language to form queries, and the retrieved data can include both text and non-text formats.
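To make that query mechanism concrete, here is a minimal sketch of McCool-style URL parameter encoding applied to a natural-language question. The endpoint path and the parameter name are illustrative assumptions, not the documented Data Commons API:

```python
# Minimal sketch: encoding a natural-language question as URL parameters,
# in the spirit of Robert McCool's URL parameter encoding.
# NOTE: the endpoint path and the "q" parameter are assumptions for illustration.
from urllib.parse import urlencode

BASE_URL = "https://datacommons.org/nl/query"  # hypothetical endpoint

def build_nl_query_url(question: str) -> str:
    """Turn a free-form question into a URL the retrieval layer can fetch."""
    return f"{BASE_URL}?{urlencode({'q': question})}"

print(build_nl_query_url("What is the population of California?"))
# -> https://datacommons.org/nl/query?q=What+is+the+population+of+California%3F
```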

Prior approaches to this problem include tool-use methods and Retrieval Augmented Generation (RAG). In tool-use, models insert external data into their output using structured commands. In RAG, external systems retrieve relevant information to enhance the model’s responses. DataGemma builds on these strategies for more robust solutions.

Data Commons Overview

Data Commons is an open-source initiative by Google that compiles public datasets into a single, accessible framework. This platform includes data from institutions such as the United Nations, census bureaus, and environmental agencies. The dataset currently holds over 250 billion data points from more than 100 countries. Despite its vast coverage, Data Commons faces some limitations, especially outside the United States, where the range of available variables significantly narrows. In particular, the U.S. dataset includes over 180,000 variables, while data from other countries often lacks similar granularity.

Two key innovations power Data Commons. First, it incorporates a knowledge graph built on publicly available data. This graph relies on Schema.org, an open vocabulary for encoding structured data, ensuring that diverse datasets are aligned and comparable. Second, Data Commons uses LLMs to provide a natural language interface, making it easier for users to explore the data. Importantly, the LLMs do not modify or manipulate the raw data, thus avoiding potential issues like generating inaccurate or misleading information.
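For readers who want to explore the data directly, the open-source `datacommons` Python client exposes the knowledge graph programmatically. A short sketch follows; the DCID and variable name are standard Data Commons identifiers, but treat the exact call as an assumption if your client version differs:

```python
# Sketch: reading a statistical variable from Data Commons with the
# open-source Python client (pip install datacommons).
import datacommons as dc

# "geoId/06" is the Data Commons ID (DCID) for California;
# "Count_Person" is the population statistical variable.
population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"Population of California: {population}")
```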

There are two fundamental patterns by which DataGemma uses factual data.

Retrieval Interleaved Generation (RIG)

DataGemma employs a technique called Retrieval Interleaved Generation (RIG), where the model cross-checks its output against trusted external data. When a user submits a query, the model generates a response and simultaneously sends a query to Data Commons. For example, instead of providing an unverified statistic, the model might return, “The population of California is [DC(What is the population of California?) → ’39 million’]”, ensuring external validation.

Once the data is retrieved, it is used to correct any inaccuracies in the initial response. This final answer is then shared with the user, along with links to the source data, ensuring full transparency. While RIG enhances the reliability of the LLM’s output, the model does not retain this external data for future use, so follow-up queries may not reflect updated information. RIG requires fine-tuned models, with DataGemma utilizing both 7-billion and 27-billion parameter versions of its model for this task.
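As a rough illustration of this interleaving, the sketch below parses the [DC(...) → '...'] annotations from a drafted response and swaps in retrieved values. The annotation format comes from the example above; the `query_data_commons` helper is a hypothetical stand-in for the actual Data Commons call:

```python
# Sketch of the RIG post-processing step: find each model-drafted
# statistic annotated as [DC(<question>) -> '<model guess>'] and
# replace it with the externally retrieved value.
import re

ANNOTATION = re.compile(r"\[DC\((?P<q>[^)]+)\)\s*(?:->|→)\s*'(?P<guess>[^']*)'\]")

def query_data_commons(question: str) -> str:
    # Hypothetical placeholder: in DataGemma this resolves against Data Commons.
    return {"What is the population of California?": "39 million"}.get(question, "")

def resolve_rig(text: str) -> str:
    """Replace each drafted statistic with the retrieved value."""
    def substitute(m: re.Match) -> str:
        verified = query_data_commons(m.group("q"))
        return verified or m.group("guess")  # fall back to the model's draft value
    return ANNOTATION.sub(substitute, text)

draft = "The population of California is [DC(What is the population of California?) → '39 million']."
print(resolve_rig(draft))
# -> The population of California is 39 million.
```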

Image Credit: Google DeepMind

Retrieval Augmented Generation (RAG)

Another method employed by DataGemma is Retrieval Augmented Generation (RAG). In this approach, the system fetches relevant information from Data Commons before the LLM crafts its response. This ensures that the generated text is based on up-to-date data. A notable challenge is that broad queries can result in extensive datasets, making it necessary to efficiently manage large inputs. The system uses the Gemini 1.5 Pro model, which has the capability to handle these large data sets due to its long context window.

The RAG process begins with the model analyzing a user’s query and generating the appropriate Data Commons queries. The data retrieved from Data Commons is then integrated into the LLM’s input, forming an augmented prompt that helps generate a more comprehensive response. This process ensures the model’s output is grounded in factual data.
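A high-level sketch of that flow appears below, with hypothetical helpers standing in for DataGemma's fine-tuned query generator, the Data Commons retrieval call, and the long-context answer model; the hardcoded returns are illustrative only:

```python
# Sketch of the RAG flow: generate Data Commons queries from the user
# query, fetch results, then answer from an augmented prompt.
def generate_dc_queries(user_query: str) -> list[str]:
    # In DataGemma a fine-tuned model produces these; hardcoded here.
    return ["What is the population of California?"]

def fetch_from_data_commons(dc_query: str) -> str:
    return "Population of California: ~39 million"  # illustrative value

def answer_with_llm(prompt: str) -> str:
    # Stand-in for the long-context model (Gemini 1.5 Pro in the paper).
    return f"(LLM answer grounded in the prompt below)\n{prompt}"

def rag_answer(user_query: str) -> str:
    retrieved = [fetch_from_data_commons(q) for q in generate_dc_queries(user_query)]
    # Augmented prompt: retrieved statistics followed by the original question.
    prompt = "Data Commons results:\n" + "\n".join(retrieved) + f"\n\nQuestion: {user_query}"
    return answer_with_llm(prompt)

print(rag_answer("How many people live in California?"))
```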

One benefit of RAG is that it evolves along with improvements in LLM technology. As models become more advanced, they can better handle and interpret retrieved data, leading to more accurate responses. However, altering the original user query during this process can sometimes result in a less intuitive user experience.

Image Credit: Google DeepMind

DataGemma fine-tunes its models for each specific task, ensuring that the system generates precise queries for Data Commons. This step is evaluated by human reviewers, who assess whether the generated questions and retrieved data accurately address the user’s needs. The final output is a synthesis of retrieved data and the model’s original capabilities, creating an informative and reliable response.

DataGemma and Data Commons are important steps toward grounding LLMs in trusted data. DataGemma is available on Hugging Face. It will be interesting to see how this research expands into other real-world LLM applications.


Published via Towards AI
