

Chat with Your BigQuery Data

Last Updated on November 5, 2023 by Editorial Team

Author(s): Benedict Neo

Originally published on Towards AI.

Diagram made with Excalidraw.

Large language models (LLMs) have shown extraordinary ability in understanding natural language and generating code.

One popular use-case for code generation is in Text-To-SQL tasks, where the goal is to automatically generate SQL queries from natural language text.

In this article, we’ll look at an open-source package that can do that.

Introducing Vanna AI

Vanna AI is an open-source Python-based AI SQL agent trained on your schema that writes complex SQL in seconds.

How does it work?


Vanna works in two easy steps:

Step 1: Train a model on your data

Provide Vanna with sample SQL queries, documentation, and DDL statements to teach it about your database structure and terminology. The more high-quality training data you provide, the better Vanna will understand your data.

Step 2: Ask questions

Once trained, you can ask Vanna plain English questions about your data, like “What are the top customers by revenue?”. Vanna will analyze the question, search its training data to understand your schema and generate the appropriate SQL query to answer the question.

Behind the scenes, Vanna utilizes GPT-4 to generate highly accurate SQL tailored to your database. As you continue to use Vanna, it learns from your questions and feedback, constantly improving over time.

Read their whitepaper to learn how they maximize SQL generation accuracy.

What does training mean?

Training here means providing the LLM context about your specific data and use case.

Vanna supports automatic training for Snowflake, and it’s in the works for BigQuery.

Below are the options to manually train Vanna.

Train Using Schema

vn.train(ddl="CREATE TABLE ...")

Train Using Documentation

vn.train(documentation="...")

Train Using SQL Examples

vn.train(sql="SELECT ...")

Enough talking; let’s see what Vanna can do!

Vanna 🤝 BigQuery

Here is the Colab notebook for you to follow along.

Click here to get an account with Vanna.

First, install Vanna for BigQuery.
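The install is a single pip command (package name from PyPI; depending on your environment, you may also need the BigQuery client library):

```shell
pip install vanna google-cloud-bigquery
```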

We’re using Colab to authenticate with BigQuery; refer to the docs for other ways to authenticate.

Now, get your Vanna API key.

Provide your email, and you should receive a code.

Then, provide Vanna with your project ID on GCP.
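Putting those setup steps together, a minimal sketch, assuming Colab and the module-level `vanna` API as documented in late 2023 (exact names may differ in newer versions; the email and project ID are placeholders):

```python
def connect_vanna_to_bigquery(email: str, project_id: str):
    """Authenticate to BigQuery from Colab and point Vanna at a GCP project.

    `email` and `project_id` are placeholders for your own values.
    """
    from google.colab import auth  # Colab-only helper
    import vanna as vn

    auth.authenticate_user()                       # interactive Google login
    api_key = vn.get_api_key(email)                # Vanna emails you a one-time code
    vn.set_api_key(api_key)
    vn.connect_to_bigquery(project_id=project_id)  # link Vanna to your GCP project
    return vn
```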

Medium Blog Titles

The first dataset we’ll test is a collection of Medium blog post titles and clap counts that I used in a past project on optimizing blog titles with GPT-3.

We give a name for the model we’re building for this dataset.

Here, I call it medium_blog_titles.

Follow the prompt, and choose BigQuery.

Note: You can see all the models you’ve created using vn.get_models()
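In code, creating and selecting the model might look like this — a sketch assuming the legacy module-level API, with the `create_model` signature approximated from that era's docs:

```python
def select_model(model_name: str = "medium_blog_titles"):
    import vanna as vn

    # First run only: register a new model backed by BigQuery
    # (signature is approximate; check the docs for your version).
    # vn.create_model(model=model_name, db_type="BigQuery")
    vn.set_model(model_name)
    return vn.get_models()  # lists every model on your account
```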

I give it the DDL of the dataset, which I got using this query.

You can see that Vanna now has this training data with vn.get_training_data()
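That training step can be sketched as follows. BigQuery exposes table DDL through `INFORMATION_SCHEMA.TABLES`; the project and dataset names here are hypothetical:

```python
# Hypothetical project/dataset names; swap in your own.
DDL_QUERY = """
SELECT ddl
FROM `my-project.my_dataset.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 'medium_blog_titles'
"""

def train_on_schema():
    import vanna as vn

    ddl = vn.run_sql(sql=DDL_QUERY)["ddl"][0]  # fetch the CREATE TABLE statement
    vn.train(ddl=ddl)                          # teach Vanna the table structure
    return vn.get_training_data()              # inspect what the model now knows
```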

Let’s start with a simple question.

How many blogs are in the dataset?

A successful run will return the SQL query, the results as a dataframe, a Plotly figure, and follow-up questions.

What is happening under the hood?

vn.ask is a wrapper around vn.generate_sql, vn.run_sql, vn.generate_plotly_code, vn.get_plotly_figure, and vn.generate_followup_questions.

Here’s what it looks like in code:
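A sketch of that chain, using the module-level API names from Vanna's docs at the time (treat the exact keyword arguments as approximate):

```python
def ask_verbose(question: str):
    import vanna as vn

    sql = vn.generate_sql(question=question)  # LLM writes the SQL
    df = vn.run_sql(sql=sql)                  # execute against BigQuery
    plotly_code = vn.generate_plotly_code(question=question, sql=sql, df=df)
    fig = vn.get_plotly_figure(plotly_code=plotly_code, df=df)
    followups = vn.generate_followup_questions(question=question, df=df)
    return sql, df, fig, followups
```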

If you don’t want Vanna to print everything, you can set print_results to False and unpack the results into their respective values.

Now, let’s try another question.

Can you provide a breakdown of claps by different ranges (e.g., 0–100, 100–500, 500–1000, etc.)?

Note: For every question you ask, Vanna fetches context from the model’s training data. You can see the related training data for a specific question using vn.get_related_training_data() as shown below:
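For example, assuming the same legacy API:

```python
def show_context(question: str):
    import vanna as vn

    # Returns the DDL, documentation, and SQL examples Vanna
    # retrieved as context for this particular question.
    return vn.get_related_training_data(question=question)
```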

Let’s ask Vanna.

This is the SQL query.

The interactive plot.

The follow-up questions.

Now for a statistics challenge.

Are there any outliers in the number of claps?

It did pretty well!

Let’s test Vanna on a BigQuery public dataset.

Chicago Crime data

We’ll be using the Chicago Crime dataset on BigQuery.

After providing the DDL, you can have Vanna come up with questions for this dataset.
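A sketch of that step — I believe the legacy API exposed this as `vn.generate_questions`, but verify the name against your version's docs:

```python
def suggest_questions():
    import vanna as vn

    # Vanna proposes questions it believes it can answer
    # from the training data it has seen so far.
    for q in vn.generate_questions():
        print(q)
```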

Let’s try the first question:

How many unique crime cases are in the dataset?

That’s a lot of crime!

Let’s see which block you should avoid in Chicago by asking where the highest number of reported crimes occurs.

Let’s see some trends in the crime.

Is it increasing?

It’s decreasing!

What about in terms of months? Are there months that are more popular for crimes?

Looks like July is a hot time for crime.

Last one, let’s see if Vanna can do window functions.

It is showing top crimes for each district, but I want it to be only one row for each district.

Let’s tell Vanna exactly that.

That’s the right result!

It just needed a little prompting.
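For reference, a deduplicated query of roughly this shape does the trick — table and column names are taken from the public `bigquery-public-data.chicago_crime.crime` schema, though the exact SQL Vanna produced may differ:

```python
# One row per district: rank crime types within each district, keep the top one.
TOP_CRIME_PER_DISTRICT = """
SELECT district, primary_type, crime_count
FROM (
  SELECT
    district,
    primary_type,
    COUNT(*) AS crime_count,
    ROW_NUMBER() OVER (PARTITION BY district ORDER BY COUNT(*) DESC) AS rn
  FROM `bigquery-public-data.chicago_crime.crime`
  GROUP BY district, primary_type
)
WHERE rn = 1
ORDER BY district
"""
```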

Let’s test Vanna on a dataset with two tables.

Austin bike share

There are two tables in the Austin Bikes dataset:

  1. bikeshare_trips — contains all the trip IDs, trip start and end times, membership types, and station names and IDs
  2. bikeshare_station — a list of all the stations with their station IDs and geographical locations

Let’s ask a question that incorporates both tables.
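For instance, counting trips per start station requires joining the two tables on the station ID. A hypothetical query of the kind Vanna would need to generate, with column names guessed from the public Austin bikeshare schema:

```python
TRIPS_PER_STATION = """
SELECT s.name AS station, COUNT(*) AS trips
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS t
JOIN `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS s
  ON t.start_station_id = s.station_id
GROUP BY station
ORDER BY trips DESC
LIMIT 10
"""
```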

One more:

Most of the stations are solar-powered!

Final thoughts

After trying out Vanna on a couple of datasets, my initial impression is that it works great with straightforward queries.

For more complex ones, you would have to phrase your question better.

I tried providing SQL queries as “training data,” but it still failed to produce the right query when asked the same question.

I have yet to test it on a more complex dataset with more tables, but so far, it works great. The queries don’t take long, and the generated plots are a nice touch.

So why use Vanna?

Here’s what the GitHub page says.

As someone who’s currently a student, I see this as a good learning tool.

If I were given a dataset to explore, this tool would provide good questions to dig into the dataset and help me write the correct SQL queries.

There’s even a function in Vanna that takes SQL queries and turns them into English questions.

I assume this would also be a good exploration tool for someone in the industry.

Another thought I have is whether fine-tuned models like SQLCoder would outperform GPT-4 here.

There are also safety concerns and risks in using this as an interface to a data warehouse.

But that’s for another article.

Thank you for reading, and check out some resources below if you want to explore Vanna further!

Resources

You can try out Vanna in this Streamlit app (code)
