Chat with Your BigQuery Data
Last Updated on November 5, 2023 by Editorial Team
Author(s): Benedict Neo
Originally published on Towards AI.
Large language models (LLMs) have shown extraordinary ability in understanding natural language and generating code.
One popular use-case for code generation is in Text-To-SQL tasks, where the goal is to automatically generate SQL queries from natural language text.
In this article, weβll look at an open-source package that can do that.
Introducing Vanna AI
Vanna AI is an open-source Python-based AI SQL agent trained on your schema that writes complex SQL in seconds.
How does it work?
Vanna works in two easy steps:
Step 1: Train a model on your data
Provide Vanna with sample SQL queries, documentation, and DDL statements to teach Vanna about your database structure and terminology. The more high-quality training data you provide, Vanna will understand your data better.
Step 2: Ask questions
Once trained, you can ask Vanna plain English questions about your data, like βWhat are the top customers by revenue?β. Vanna will analyze the question, search its training data to understand your schema and generate the appropriate SQL query to answer the question.
Behind the scenes, Vanna utilizes GPT-4 to generate highly accurate SQL tailored to your database. As you continue to use Vanna, it learns from your questions and feedback, constantly improving over time.
Read their whitepaper to learn how they maximize SQL generation accuracy.
What does training mean?
Training here means providing the LLM context about your specific data and use case.
Vanna supports automatic training for Snowflake, and itβs in the works for BigQuery.
Below are the options to manually train Vanna.
Train Using Schema
vn.train(ddl="CREATE TABLE ...")
Train Using Documentation
vn.train(documentation="...")
Train Using SQL Examples
vn.train(sql="SELECT ...")
Enough talking; letβs see what Vanna can do!
Vanna U+1F91D BigQuery
Here is the Colab notebook for you to follow along.
Click here to get an account with Vanna.
First, install Vanna for BigQuery.
Weβre using Colab to authenticate with BigQuery; refer to the docs for other ways to authenticate.
Now, get your Vanna API key.
Provide your email, and you should receive a code.
Then, provide Vanna with your project ID on GCP.
Medium Blog Titles
The first dataset weβll test is a dataset of Medium blog post titles and claps I used for a past project for optimizing blog titles using GPT-3.
We give a name for the model weβre building for this dataset.
Here, I call it medium_blog_titles
.
Follow the prompt, and choose BigQuery.
Note: You can see all the models youβve created using vn.get_models()
I give it the DDL of the dataset, which I got using this query.
You can see that Vanna now has this training data with vn.get_training_data()
Letβs start with a simple question.
How many blogs are in the dataset?
A successful run will return the SQL query, the results as a dataframe, a Plotly figure, and follow-up questions.
What is happening under the hood?
vn.ask
is a wrapper around vn.generate_sql
, vn.run_sql
, vn.generate_plotly_code
, vn.get_plotly_figure
, and vn.generate_followup_questions
.
Hereβs what it looks like in code:
If you donβt want Vanna to output everything, you can set print_results
to False
, and unpack the results into the respective values.
Now, letβs try another question.
Can you provide a breakdown of claps by different ranges (e.g., 0β100, 100β500, 500β1000, etc.)?
Note: For every question you ask, Vanna fetches context from the modelβs training data. You can see the related training data for a specific question using vn.get_related_training_data()
as shown below:
Letβs ask Vanna.
This is the SQL query.
The interactive plot.
The follow-up questions.
Now for a statistics challenge.
Are there any outliers in the number of claps?
It did pretty well!
Letβs test Vanna on a BigQuery public dataset.
Chicago Crime data
Weβll be using the Chicago Crime dataset on BigQuery.
After providing the DDL, you can have Vanna come up with questions for this dataset.
Letβs try the first question:
How many unique crime cases are in the dataset?
Thatβs a lot of crime!
Letβs see which block you should avoid in Chicago by asking where the highest number of reported crimes occurs.
Letβs see some trends in the crime.
Is it increasing?
Itβs decreasing!
What about in terms of months? Are there months that are more popular for crimes?
Looks like July is a hot time for crime.
Last one, letβs see if Vanna can do window functions.
It is showing top crimes for each district, but I want it to be only one row for each district.
Letβs tell Vanna that
Thatβs the right result!
It just needed a litle prompting.
Letβs test Vanna on a dataset with two tables.
Austin bike share
There are two tables in the Austin Bikes dataset:
bikeshare_trips
β contains all the trip IDs, trip start and end, membership type and station names + IDbikeshare_station
β a list of all the stations and their geographical location & station ID
Letβs ask a question that incorporates both tables.
One more:
Most of the stations are solar-powered!
Final thoughts
After trying out Vanna on a couple of datasets, my initial impression is that it works great with straightforward queries.
For more complex ones, you would have to phrase your question better.
I tried providing SQL queries as βtraining data,β but it still failed to get the right query with the same question provided.
I have yet to test it on a more complex dataset with more tables, but so far, it works great. The queries donβt take long, and the generated plots are a nice touch.
So why use Vanna?
Hereβs what the GitHub page says.
As someone whoβs currently a student, I see this as a good learning tool.
If I were given a dataset to explore, this tool would provide good questions to dig into the dataset and help me write the correct SQL queries.
Thereβs even a function in Vanna that takes SQL queries and turns them into English questions.
I assume this would also be a good exploration tool for someone in the industry.
Some other thoughts I have is whether more fine-tined models like SQLCoder would outperform GPT-4.
As well as safety concerns and risks of using this as an interface with a data warehouse.
But thatβs for another article.
Thank you for reading, and check out some resources below if you want to explore Vanna further!
Resources
You can try out Vanna in this Streamlit app (code)
Thank you for reading!
If you have any suggestions or thoughts, feel free to comment below!
Follow me on LinkedIn and Twitter for awesome Data Science resources.
Like my writing? Join Medium with my referral link for the price of U+2615.
Youβll be supporting me directly U+1F917
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI