Chat with Your BigQuery Data

Last Updated on November 5, 2023 by Editorial Team

Author(s): Benedict Neo

Originally published on Towards AI.

Large language models (LLMs) have shown extraordinary ability in understanding natural language and generating code.

One popular use-case for code generation is in Text-To-SQL tasks, where the goal is to automatically generate SQL queries from natural language text.

In this article, we’ll look at an open-source package that can do that.

Introducing Vanna AI

Vanna AI is an open-source Python-based AI SQL agent trained on your schema that writes complex SQL in seconds.

How does it work?

Vanna works in two easy steps:

Step 1: Train a model on your data

Provide Vanna with sample SQL queries, documentation, and DDL statements to teach Vanna about your database structure and terminology. The more high-quality training data you provide, Vanna will understand your data better.

Step 2: Ask questions

Once trained, you can ask Vanna plain English questions about your data, like “What are the top customers by revenue?”. Vanna will analyze the question, search its training data to understand your schema and generate the appropriate SQL query to answer the question.

Behind the scenes, Vanna utilizes GPT-4 to generate highly accurate SQL tailored to your database. As you continue to use Vanna, it learns from your questions and feedback, constantly improving over time.

Read their whitepaper to learn how they maximize SQL generation accuracy.

What does training mean?

Training here means providing the LLM context about your specific data and use case.

Vanna supports automatic training for Snowflake, and it’s in the works for BigQuery.

Below are the options to manually train Vanna.

Train Using Schema

vn.train(ddl="CREATE TABLE ...")

Train Using Documentation

vn.train(documentation="...")

Train Using SQL Examples

vn.train(sql="SELECT ...")

Enough talking; let’s see what Vanna can do!

Vanna U+1F91D BigQuery

Here is the Colab notebook for you to follow along.

Click here to get an account with Vanna.

First, install Vanna for BigQuery.

We’re using Colab to authenticate with BigQuery; refer to the docs for other ways to authenticate.

Now, get your Vanna API key.

Provide your email, and you should receive a code.

Then, provide Vanna with your project ID on GCP.

Medium Blog Titles

The first dataset we’ll test is a dataset of Medium blog post titles and claps I used for a past project for optimizing blog titles using GPT-3.

We give a name for the model we’re building for this dataset.

Here, I call it medium_blog_titles.

Follow the prompt, and choose BigQuery.

Note: You can see all the models you’ve created using vn.get_models()

I give it the DDL of the dataset, which I got using this query.

You can see that Vanna now has this training data with vn.get_training_data()

Let’s start with a simple question.

How many blogs are in the dataset?

A successful run will return the SQL query, the results as a dataframe, a Plotly figure, and follow-up questions.

What is happening under the hood?

vn.askis a wrapper around vn.generate_sql, vn.run_sql, vn.generate_plotly_code, vn.get_plotly_figure, and vn.generate_followup_questions.

Here’s what it looks like in code:

If you don’t want Vanna to output everything, you can set print_results to False, and unpack the results into the respective values.

Now, let’s try another question.

Can you provide a breakdown of claps by different ranges (e.g., 0–100, 100–500, 500–1000, etc.)?

Note: For every question you ask, Vanna fetches context from the model’s training data. You can see the related training data for a specific question using vn.get_related_training_data() as shown below:

Let’s ask Vanna.

This is the SQL query.

The interactive plot.

The follow-up questions.

Now for a statistics challenge.

Are there any outliers in the number of claps?

It did pretty well!

Let’s test Vanna on a BigQuery public dataset.

Chicago Crime data

We’ll be using the Chicago Crime dataset on BigQuery.

After providing the DDL, you can have Vanna come up with questions for this dataset.

Let’s try the first question:

How many unique crime cases are in the dataset?

That’s a lot of crime!

Let’s see which block you should avoid in Chicago by asking where the highest number of reported crimes occurs.

Let’s see some trends in the crime.

Is it increasing?

It’s decreasing!

What about in terms of months? Are there months that are more popular for crimes?

Looks like July is a hot time for crime.

Last one, let’s see if Vanna can do window functions.

It is showing top crimes for each district, but I want it to be only one row for each district.

Let’s tell Vanna that

That’s the right result!

It just needed a litle prompting.

Let’s test Vanna on a dataset with two tables.

Austin bike share

There are two tables in the Austin Bikes dataset:

bikeshare_trips — contains all the trip IDs, trip start and end, membership type and station names + ID
bikeshare_station — a list of all the stations and their geographical location & station ID

Let’s ask a question that incorporates both tables.

One more:

Most of the stations are solar-powered!

Final thoughts

After trying out Vanna on a couple of datasets, my initial impression is that it works great with straightforward queries.

For more complex ones, you would have to phrase your question better.

I tried providing SQL queries as “training data,” but it still failed to get the right query with the same question provided.

I have yet to test it on a more complex dataset with more tables, but so far, it works great. The queries don’t take long, and the generated plots are a nice touch.

So why use Vanna?

Here’s what the GitHub page says.

As someone who’s currently a student, I see this as a good learning tool.

If I were given a dataset to explore, this tool would provide good questions to dig into the dataset and help me write the correct SQL queries.

There’s even a function in Vanna that takes SQL queries and turns them into English questions.

I assume this would also be a good exploration tool for someone in the industry.

Some other thoughts I have is whether more fine-tined models like SQLCoder would outperform GPT-4.

As well as safety concerns and risks of using this as an interface with a data warehouse.

But that’s for another article.

Thank you for reading, and check out some resources below if you want to explore Vanna further!

Resources

You can try out Vanna in this Streamlit app (code)

Thank you for reading!

If you have any suggestions or thoughts, feel free to comment below!

Follow me on LinkedIn and Twitter for awesome Data Science resources.

Like my writing? Join Medium with my referral link for the price of U+2615.

You’ll be supporting me directly U+1F917

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

Chat with Your BigQuery Data

Author(s): Benedict Neo

Introducing Vanna AI

How does it work?

What does training mean?

Vanna U+1F91D BigQuery

Medium Blog Titles

Chicago Crime data

Austin bike share

Final thoughts

Resources

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

I Used ChatGPT to Count My Calories

Resource-Efficient Fine-Tuning of DeepSeek-R1

TAI #138: OpenAI’s o3-Mini and Deep Research: A New Era of Reasoning Powered Agents?

Text Preprocessing for NLP: A Step-by-Step Guide to Clean Raw Text Data

DeepSeek AI — The Future is Here

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Chat with Your BigQuery Data

Author(s): Benedict Neo

Introducing Vanna AI

How does it work?

What does training mean?

Vanna U+1F91D BigQuery

Medium Blog Titles

Chicago Crime data

Austin bike share

Final thoughts

Resources

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement