
Inside DataGemma: Google DeepMind’s Initiative to Ground LLMs in Factual Knowledge

Last Updated on September 17, 2024 by Editorial Team

Author(s): Jesus Rodriguez

Originally published on Towards AI.

Created Using DALL-E

I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

TheSequence | Jesus Rodriguez | Substack (thesequence.substack.com)

Grounding large foundation models such as LLMs on factual data is one of the biggest challenges facing the current wave of AI systems. From reducing hallucinations to expanding LLMs into mission-critical applications, validating LLM outputs against trustworthy data is rapidly becoming one of the most important building blocks of LLM applications. This is the topic of recent research from Google DeepMind, which resulted in the creation of DataGemma, a series of open models that validate knowledge against a large factual data repository known as Data Commons. DataGemma is the latest addition to the Gemma family, DeepMind’s initiative around small language models.

DataGemma

Conceptually, DataGemma is an innovative system developed to bridge LLMs with vast, real-world data sourced from Google’s Data Commons. Its purpose is to enable LLMs to interact with external databases, ensuring that their responses are more accurate and grounded in real-time information. The system tackles three key challenges in integrating LLMs with external data.

First, the LLM needs to learn when to rely on its internal knowledge and when to seek external information. Determining the right moments to access external sources and framing appropriate questions are crucial steps. Various techniques are being explored to instill this ability in the model.

Second, identifying the appropriate external source is essential. Given the wide range of possible data sources, this decision is kept separate from the LLM’s core knowledge. DataGemma simplifies this by connecting to a comprehensive database, so the LLM does not need to manage multiple sources.

Finally, after pinpointing the necessary data, the LLM must generate appropriate queries. Different databases use different formats, but DataGemma solves this issue by adopting a universal API system that relies on natural language. This approach draws inspiration from Robert McCool’s URL parameter encoding system, which has proven effective over time. With this setup, the model uses natural language to form queries, and the retrieved data can include both text and non-text formats.
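To make that query mechanism concrete, here is a minimal sketch of McCool-style URL parameter encoding applied to a natural-language question. The endpoint path and the parameter name are illustrative assumptions, not the documented Data Commons API:

```python
# Minimal sketch: encoding a natural-language question as URL parameters,
# in the spirit of Robert McCool's URL parameter encoding.
# NOTE: the endpoint path and the "q" parameter are assumptions for illustration.
from urllib.parse import urlencode

BASE_URL = "https://datacommons.org/nl/query"  # hypothetical endpoint

def build_nl_query_url(question: str) -> str:
    """Turn a free-form question into a URL the retrieval layer can fetch."""
    return f"{BASE_URL}?{urlencode({'q': question})}"

print(build_nl_query_url("What is the population of California?"))
# -> https://datacommons.org/nl/query?q=What+is+the+population+of+California%3F
```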

Prior approaches to this problem include tool-use methods and Retrieval Augmented Generation (RAG). In tool-use, models insert external data into their output using structured commands. In RAG, external systems retrieve relevant information to enhance the model’s responses. DataGemma builds on these strategies for more robust solutions.

Data Commons Overview

Data Commons is an open-source initiative by Google that compiles public datasets into a single, accessible framework. This platform includes data from institutions such as the United Nations, census bureaus, and environmental agencies. The dataset currently holds over 250 billion data points from more than 100 countries. Despite its vast coverage, Data Commons faces some limitations, especially outside the United States, where the range of available variables significantly narrows. In particular, the U.S. dataset includes over 180,000 variables, while data from other countries often lacks similar granularity.

Two key innovations power Data Commons. First, it incorporates a knowledge graph built on publicly available data. This graph relies on Schema.org, an open vocabulary for encoding structured data, ensuring that diverse datasets are aligned and comparable. Second, Data Commons uses LLMs to provide a natural language interface, making it easier for users to explore the data. Importantly, the LLMs do not modify or manipulate the raw data, thus avoiding potential issues like generating inaccurate or misleading information.
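For readers who want to explore the data directly, the open-source `datacommons` Python client exposes the knowledge graph programmatically. A short sketch follows; the DCID and variable name are standard Data Commons identifiers, but treat the exact call as an assumption if your client version differs:

```python
# Sketch: reading a statistical variable from Data Commons with the
# open-source Python client (pip install datacommons).
import datacommons as dc

# "geoId/06" is the Data Commons ID (DCID) for California;
# "Count_Person" is the population statistical variable.
population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"Population of California: {population}")
```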

There are two fundamental patterns by which DataGemma uses factual data.

Retrieval Interleaved Generation (RIG)

DataGemma employs a technique called Retrieval Interleaved Generation (RIG), where the model cross-checks its output against trusted external data. When a user submits a query, the model generates a response and simultaneously sends a query to Data Commons. For example, instead of providing an unverified statistic, the model might return, “The population of California is [DC(What is the population of California?) → ’39 million’]”, ensuring external validation.

Once the data is retrieved, it is used to correct any inaccuracies in the initial response. This final answer is then shared with the user, along with links to the source data, ensuring full transparency. While RIG enhances the reliability of the LLM’s output, the model does not retain this external data for future use, so follow-up queries may not reflect updated information. RIG requires fine-tuned models, with DataGemma utilizing both 7-billion and 27-billion parameter versions of its model for this task.
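As a rough illustration of this interleaving, the sketch below parses the [DC(...) → '...'] annotations from a drafted response and swaps in retrieved values. The annotation format comes from the example above; the `query_data_commons` helper is a hypothetical stand-in for the actual Data Commons call:

```python
# Sketch of the RIG post-processing step: find each model-drafted
# statistic annotated as [DC(<question>) -> '<model guess>'] and
# replace it with the externally retrieved value.
import re

ANNOTATION = re.compile(r"\[DC\((?P<q>[^)]+)\)\s*(?:->|→)\s*'(?P<guess>[^']*)'\]")

def query_data_commons(question: str) -> str:
    # Hypothetical placeholder: in DataGemma this resolves against Data Commons.
    return {"What is the population of California?": "39 million"}.get(question, "")

def resolve_rig(text: str) -> str:
    """Replace each drafted statistic with the retrieved value."""
    def substitute(m: re.Match) -> str:
        verified = query_data_commons(m.group("q"))
        return verified or m.group("guess")  # fall back to the model's draft value
    return ANNOTATION.sub(substitute, text)

draft = "The population of California is [DC(What is the population of California?) → '39 million']."
print(resolve_rig(draft))
# -> The population of California is 39 million.
```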

Image Credit: Google DeepMind

Retrieval Augmented Generation (RAG)

Another method employed by DataGemma is Retrieval Augmented Generation (RAG). In this approach, the system fetches relevant information from Data Commons before the LLM crafts its response. This ensures that the generated text is based on up-to-date data. A notable challenge is that broad queries can result in extensive datasets, making it necessary to efficiently manage large inputs. The system uses the Gemini 1.5 Pro model, which has the capability to handle these large data sets due to its long context window.

The RAG process begins with the model analyzing a user’s query and generating the appropriate Data Commons queries. The data retrieved from Data Commons is then integrated into the LLM’s input, forming an augmented prompt that helps generate a more comprehensive response. This process ensures the model’s output is grounded in factual data.
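A high-level sketch of that flow appears below, with hypothetical helpers standing in for DataGemma's fine-tuned query generator, the Data Commons retrieval call, and the long-context answer model; the hardcoded returns are illustrative only:

```python
# Sketch of the RAG flow: generate Data Commons queries from the user
# query, fetch results, then answer from an augmented prompt.
def generate_dc_queries(user_query: str) -> list[str]:
    # In DataGemma a fine-tuned model produces these; hardcoded here.
    return ["What is the population of California?"]

def fetch_from_data_commons(dc_query: str) -> str:
    return "Population of California: ~39 million"  # illustrative value

def answer_with_llm(prompt: str) -> str:
    # Stand-in for the long-context model (Gemini 1.5 Pro in the paper).
    return f"(LLM answer grounded in the prompt below)\n{prompt}"

def rag_answer(user_query: str) -> str:
    retrieved = [fetch_from_data_commons(q) for q in generate_dc_queries(user_query)]
    # Augmented prompt: retrieved statistics followed by the original question.
    prompt = "Data Commons results:\n" + "\n".join(retrieved) + f"\n\nQuestion: {user_query}"
    return answer_with_llm(prompt)

print(rag_answer("How many people live in California?"))
```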

One benefit of RAG is that it evolves along with improvements in LLM technology. As models become more advanced, they can better handle and interpret retrieved data, leading to more accurate responses. However, altering the original user query during this process can sometimes result in a less intuitive user experience.

Image Credit: Google DeepMind

DataGemma fine-tunes its models for each specific task, ensuring that the system generates precise queries for Data Commons. This step is evaluated by human reviewers, who assess whether the generated questions and retrieved data accurately address the user’s needs. The final output is a synthesis of retrieved data and the model’s original capabilities, creating an informative and reliable response.

DataGemma and Data Commons are important steps toward grounding LLMs in trusted data. DataGemma is available on Hugging Face. It will be interesting to see how this research expands into other real-world LLM applications.


Published via Towards AI
