Demystifying Google’s Data Gemma

Author(s): Chirag Agrawal

Originally published on Towards AI.

Photo by Alvaro Reyes on Unsplash

Discover how Google’s Data Gemma leverages the Data Commons knowledge graph to tackle AI hallucinations. In this blog post, we’ll explore how Data Gemma aims to improve the factual accuracy of Large Language Models (LLMs), set up a Retrieval Augmented Generation (RAG) pipeline, test its capabilities, and compare it with other leading models. Whether you’re an AI enthusiast or a developer looking to enhance your applications, this deep dive into Data Gemma will provide valuable insights into the evolving landscape of AI technology.

To make this exploration hands-on, I’ve created a GitHub repository demonstrating the setup and implementation: Hands-On with Data Gemma. Feel free to follow along!

Introduction

Ever since Google unveiled their new language model called Data Gemma, I’ve been eager to dive in and see what makes it tick. Data Gemma promises to revolutionize how AI models interact with data, aiming to reduce a common issue known as hallucinations — when AI confidently provides inaccurate information.

As someone who frequently tinkers with Large Language Models (LLMs) and grapples with the quirks of Retrieval Augmented Generation (RAG), I was particularly intrigued by Data Gemma’s innovative approach. After poring over their research paper, I decided to get my hands dirty. This blog post chronicles my journey of setting up a RAG pipeline with Data Gemma, testing its capabilities, and comparing it with other models to understand how it addresses these common AI challenges.

Understanding the Problem Space

LLMs are getting impressively sophisticated — they can summarize text, brainstorm creative ideas, even crank out code. But let’s be real: sometimes they confidently spout inaccuracies — a phenomenon we lovingly call hallucination. Google’s research aims to tackle this head-on by addressing three major challenges:

  1. Teaching the LLM when to fetch data from external sources versus relying on its own knowledge.
  2. Helping the LLM decide which external sources to query.
  3. Guiding the LLM to generate queries that fetch the data needed to answer the original question.

Typically, we tackle these problems with Tool Use + Retrieval Augmented Generation. Here’s the playbook:

  1. Tool Use: The LLM is trained — either through fine-tuning or In-Context Learning — to decide which API to call, when to call it, and what arguments to pass.
  2. RAG: Once the data is fetched, it’s augmented into the instruction, and the LLM generates an answer.

Introducing Data Commons

To streamline the process of fetching data, Google introduced an open-source knowledge graph called Data Commons. Think of Data Commons as a massive, well-organized library. Instead of wandering through countless aisles (APIs) to find a book (data), you have a friendly librarian (Natural Language API) who understands exactly what you need and fetches it for you. Google claims that Data Commons brings two key innovations:

  1. A Unified Knowledge Graph: A massive collection of publicly available datasets.
  2. Natural Language API: An API that accepts natural language queries to interact with the knowledge graph — no LLMs required.

Google’s research suggests that relying on the LLM to choose between multiple APIs and determine the right arguments is too error-prone at scale. Replacing that with a single knowledge graph and a natural language API significantly reduces the chances of hallucinations during query inference.

Exploring Retrieval Interleaved Generation (RIG)

While traditional RAG systems retrieve relevant information before generating a response, Google’s approach introduces a new method called Retrieval Interleaved Generation (RIG). Think of it like having a conversation where you pause mid-sentence to check a fact before continuing. In RIG, the model starts generating a response and, when it realizes it needs specific data (like a statistic or factual detail), it produces a natural language query that can be executed on an external database (in this case, Data Commons).

This interleaving of retrieval and generation aims to minimize hallucinations by grounding the AI’s responses in verified data from Data Commons. By fetching information on-the-fly, the model ensures that the answers it provides are accurate and up-to-date.

Data Gemma’s Two Approaches

Google released two versions of Data Gemma to explore these concepts:

  1. RIG Version: This model is fine-tuned to produce answers to statistical questions while also generating natural language queries for Data Commons. Imagine you’re writing a report and, as you type, you note that you need the latest unemployment rate. The model not only provides an answer but also crafts a query to fetch the exact statistic from Data Commons.
  2. RAG Version: This model focuses on generating a list of natural language queries relevant to the user’s original question. Instead of attempting to provide the answer directly, it expands the user’s question into multiple, more specific queries that can be answered using reliable data sources.

Personally, I found the second approach — using the LLM to expand the user query — more intriguing. According to the research paper, human evaluators also preferred the answers from the RAG pipeline over those from the RIG pipeline. So, I decided to build a RAG pipeline myself using Data Gemma and Data Commons to see how it performs.

RAG with Google’s Data Gemma

Getting My Hands Dirty

You can follow along with my code on GitHub: Hands-On with Data Gemma. Let’s set up the environment together.

Setting Up the Environment

Setting up the model wasn’t without its hurdles. Google hasn’t published a 7B version of the model on HuggingFace — or at least I couldn’t find it — and the 27B version is too large for my machine. So, I had to get creative with quantized models. Luckily, I found several quantized versions and decided to go with the most downloaded one: bartowski/datagemma-rag-27b-it-GGUF. I used the 2-bit quantized version of the model. With llama-cpp-python, hosting these models for inference is a breeze.

Here’s how I set up the Data Gemma model:
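A minimal sketch with llama-cpp-python is below; the full script lives in the repo, and the GGUF filename here is an assumption, so check the Hugging Face model page for the exact file you download.

```python
# Minimal sketch: load the 2-bit quantized DataGemma RAG model with llama-cpp-python.
# The GGUF filename is an assumption -- check the bartowski/datagemma-rag-27b-it-GGUF
# repo on Hugging Face for the exact file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="models/datagemma-rag-27b-it-Q2_K.gguf",  # local path to the quantized weights
    n_ctx=4096,         # context window for the prompt plus generated queries
    n_gpu_layers=-1,    # offload all layers to the GPU/Metal backend if available
    verbose=False,
)
```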

Testing the Model

With the model up and running, I wanted to see how well it performed. I used the example query:

“Has the use of renewables increased in the world?”

Data Gemma effectively broke down this question into specific statistical queries:

  • What is the carbon emission in the world?
  • How has carbon emission changed over time in the world?
  • What is the renewable energy consumption in the world?
  • How has renewable energy consumption changed over time in the world?

To be clear, I didn’t give any special instructions — the model is trained to generate these queries.
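For reference, the call that produced this list looked roughly like the sketch below; the Gemma turn markers and the simple line-by-line parsing are illustrative, and the repo has the exact version.

```python
# Rough sketch of the generation call (uses the `llm` instance from the setup above).
# The Gemma turn markers and the line splitting are illustrative simplifications.
question = "Has the use of renewables increased in the world?"
prompt = (
    "<start_of_turn>user\n"
    f"{question}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

out = llm(prompt, max_tokens=256, temperature=0.0, stop=["<end_of_turn>"])
# The fine-tuned model emits one statistical query per line.
queries = [line.strip() for line in out["choices"][0]["text"].splitlines() if line.strip()]
print(queries)
```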

Key Observations:

  • Semantic Mapping: The model mapped “renewables” to related concepts like “carbon emission” and “renewable energy consumption,” capturing the temporal aspect of the question.
  • Context Preservation: It retained the place name “world” in all generated queries.
  • Consistent Formatting: It generated queries in a consistent format, making them easy to translate into structured queries for an external database, in this case the Data Commons knowledge graph.

Integration with Data Commons

To fetch actual data, I wrote a simple client to call the Data Commons Natural Language API using their Python library. You’ll need a Data Commons API key, which you can request from the Data Commons site.

Here’s how I set up the client:
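The sketch below shows the shape of that client; the endpoint URL and parameter names are assumptions, so verify them against the Data Commons docs or use their Python library directly.

```python
# Sketch of a tiny Data Commons NL API client. The endpoint URL and parameter names are
# assumptions -- verify them against the Data Commons docs or use their Python library.
import os
import requests

DC_NL_ENDPOINT = "https://datacommons.org/nodejs/query"  # assumed NL query endpoint
API_KEY = os.environ["DC_API_KEY"]  # your Data Commons API key

def query_data_commons(nl_query: str) -> dict:
    """Send one natural language query to Data Commons and return the JSON response."""
    resp = requests.get(
        DC_NL_ENDPOINT,
        params={"q": nl_query, "apikey": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```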

The RAG pipeline then simply takes the list of generated queries and calls the Data Commons NL API for each of them. The API returns a structured response containing the numerical value, the unit, the underlying data source, and so on. This information is passed on to the next step of the pipeline for answer generation. I wrote a small utility to convert the API response into natural language.
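Putting it together, the loop looks roughly like this; extract_fact stands in for that utility, and the response field names are placeholders to adapt to what the API actually returns.

```python
# Sketch of the retrieval loop (assumes `question`, `queries`, and `query_data_commons`
# from the snippets above). `extract_fact` stands in for the response-to-text utility;
# the field names are placeholders to adapt to the real API response.
def extract_fact(nl_query: str, response: dict) -> str:
    value = response.get("value", "N/A")             # placeholder field name
    unit = response.get("unit", "")                  # placeholder field name
    source = response.get("source", "Data Commons")  # placeholder field name
    return f"{nl_query}: {value} {unit} (source: {source})".strip()

facts = [extract_fact(q, query_data_commons(q)) for q in queries]

# Hand the grounded facts, plus the original question, to any LLM for the final answer.
context = "\n".join(facts)
final_prompt = (
    f"Answer the question using only the statistics below.\n\n{context}\n\nQuestion: {question}"
)
```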

Why This Approach Is Intriguing

One of the key challenges with naive Retrieval Augmented Generation (RAG) is its heavy reliance on the user’s initial query to find relevant documents. Even with semantic search intended to bridge gaps, it often falls short, especially when dealing with broad or ambiguous queries. Other techniques like HyDE or pseudo-relevance feedback exist, but they are usually too specialized for semantic search scenarios.

What makes Data Gemma’s solution stand out is its ability to break down a user query into multiple focused and relevant sub-queries. This query expansion approach enhances retrieval by covering more ground and fetching pertinent information.

The fact that a 2-bit quantized version of a 27B model running on my Mac could achieve this is just icing on the cake.

Practical Benefits

  • Improved Accuracy: By fetching specific data for each sub-query, the model reduces the risk of hallucinations and increases the factual accuracy of its responses.
  • Comprehensive Answers: The expanded queries allow the AI to provide more detailed and nuanced answers, enriching the user experience.
  • Efficiency: This method streamlines the retrieval process, making it more efficient than relying solely on semantic search or expecting the LLM to handle complex API interactions.

Applicability Across Domains

This pattern isn’t limited to statistical queries. Imagine an AI assistant that helps you plan a trip. If you ask, “Help me plan a trip to Brazil,” the assistant could decompose this into sub-queries like:

  • What are the best times to visit Brazil?
  • Which cities in Brazil are must-see destinations?
  • What are the top attractions in Brazil?

While Data Gemma isn’t currently designed for this use case, it exemplifies a general pattern that could significantly enhance AI interactions by making them more comprehensive and context-aware.

Simplifying Tool Use with Natural Language APIs

Most traditional RAG solutions either rely solely on semantic search or depend on the LLM to determine API arguments for tool use. For information retrieval, this approach struggles due to the overwhelming number of variables and relationships involved.

The Natural Language API offered by Data Commons demonstrates how tool use for information retrieval can be greatly simplified. Importantly, this NL API doesn’t rely on an LLM to generate the final query executed on the knowledge graph; it uses predefined translation logic.

As highlighted in Google’s research paper:

“Given a query, we first break it down into the following components: one or more statistical variables or topics (like ‘unemployment rate,’ ‘demographics,’ etc.); one or more places (like ‘California’); and a finite set of attributes (like ‘ranking,’ ‘comparison,’ ‘change rate,’ etc.). The variables and places are further mapped to corresponding IDs in Data Commons. For each of the components, we apply different Natural Language Processing (NLP) approaches that we have been independently iterating on. For statistical variables or topics, we use an embeddings-based semantic search index; for places, we use a string-based named entity recognition implementation; for attribute detection, we use a set of regex-based heuristics.“

Comparing with Other Models

Curious about how Data Gemma stacks up against other models, I tested Claude Sonnet 3.5 by prompting it to produce similar queries for the given user question. The prompt I used is in the Appendix; I found it in the Data Commons code repo. Surprisingly, Claude was also able to generate relevant queries and interact with the Data Commons API effectively. This suggests that a fine-tuned model like Data Gemma isn’t strictly necessary; with proper prompt engineering, other LLMs can achieve similar results.

However, it’s worth noting that I used a 2-bit quantized version of the Data Gemma 27B model running on my Mac. In contrast, Claude Sonnet 3.5 is a much larger model.
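For completeness, the comparison call itself is just a plain Anthropic SDK request; the prompt file path and the model ID below are placeholders, and the actual query-expansion prompt is in the Appendix.

```python
# Sketch of the comparison call via the Anthropic SDK. The prompt file path and the model ID
# are placeholders; the actual query-expansion prompt is reproduced in the Appendix.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

expansion_prompt = open("prompts/dc_query_expansion.txt").read()  # hypothetical local copy
question = "Has the use of renewables increased in the world?"

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder Sonnet 3.5 model ID
    max_tokens=512,
    messages=[{"role": "user", "content": f"{expansion_prompt}\n\nQuestion: {question}"}],
)
print(message.content[0].text)
```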

Conclusion

In the grand scheme of things, I think Data Gemma pushes the envelope by simplifying how LLMs interact with external data sources through natural language APIs. It offers a fresh take on reducing hallucinations and improving the factual accuracy of AI-generated content. Whether this pattern becomes the new standard or not, exploring it has been a valuable exercise in understanding the evolving landscape of AI and how we can make these systems more reliable and effective.

I encourage fellow AI enthusiasts and developers to delve into this technology. Check out my GitHub repository to get started: Hands-On with Data Gemma. Let’s continue the conversation on how to make AI more accurate and helpful.

Appendix

Claude Sonnet 3.5 Prompt


Published via Towards AI
