Speech-To-SQL Using OpenAI’s Whisper and Ollama (Llama3)
Last Updated on October 31, 2024 by Editorial Team
Author(s): Tapan Babbar
Originally published on Towards AI.
I was rewatching the Harry Potter series and kept wondering: teachers and prefects were deducting points from houses left and right! But do they keep track of these point changes across so many classes? What about data integrity? Scalability? Write conflicts? Surely they’d need something scalable, like a pub-sub system for house point updates. Besides the scale, how well does the speech recognition have to be?
On a slightly serious note, that got me thinking — could we recreate some of it with AI? What if we could go from speech directly to SQL? That’s how I ended up diving into this fun little experiment: Speech-to-SQL using OpenAI’s Whisper for transcription and Meta’s Llama3 for converting text into SQL queries.
Here’s how I did it, and you can too in just four simple steps:
Step 1 — Record Audio
We kick things off by capturing audio using a simple Python setup. Using the sounddevice
library, we record audio directly from your mic, then temporarily save it as a .wav
file to transcribe later.
import sounddevice as sd
import tempfile
import wave
# Function to record audio from the microphone and save it as a WAV file
def record_audio(duration, sample_rate=16000):
print("Recording...")
audio_data = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1, dtype='float32')
sd.wait() # Wait for the recording to finish
print("Recording finished.")
# Save the audio to a temporary WAV file
temp_wav = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
with wave.open(temp_wav.name, 'wb') as wf:
wf.setnchannels(1) # Mono channel
wf.setsampwidth(2) # 16-bit audio
wf.setframerate(sample_rate)
wf.writeframes(np.int16(audio_data * 32767)) # Convert float32 to int16
return temp_wav.name
This is where it all starts: a humble voice command ready to transform into SQL!
Step 2: Speech-to-Text with Whisper
Next, we transcribe the audio using OpenAI’s Whisper model. This model is excellent for converting speech into text. It’s almost like having a personal assistant who listens to your commands and writes them down — just more reliable and scalable.
import whisper
import os
# Function to transcribe audio from the microphone using Whisper
def audio_to_text_from_mic(duration=5):
# Record audio from the microphone
audio_file = record_audio(duration)
# Load Whisper model
model = whisper.load_model("turbo") # You can use "turbo", "small", etc.
# Transcribe the recorded audio file
result = model.transcribe(audio_file)
# Delete the temporary audio file after transcription
os.remove(audio_file)
return result['text']
# Example usage
text = audio_to_text_from_mic(duration=3) # Record for 5 seconds
print("Transcription:", text)
Transcription: 10 points to Gryffindor
Your natural language command is now in text form, ready for the next level of transformation.
Step 3: Text-to-SQL using Llama 3
Now for the real magic — turning that transcribed text into SQL commands. Using the Llama 3 model, we feed the natural language command (“Add 10 points to Gryffindor”) and out a valid SQL query.
We start by building a prompt that gives context about the database schema. In our case, the house_points
table has two columns: house_name
(the name of the house) and points
(the current point total). The prompt clearly explains this structure and instructs the model to return a well-formed SQL UPDATE
query without unnecessary explanations.
Here’s what’s happening step-by-step:
1. Define the Table Schema: We provide the table's structure, making it clear to the model what it looks like. The schema specifies that the table contains house_name
and points
.
table_schemas = """
house_points(house_name TEXT PRIMARY KEY, points INTEGER)
"""
2. Create a Prompt: We generate a prompt that asks Llama 3 to convert the natural language command into an SQL UPDATE
query. It explicitly asks for a response in JSON format with only the query, ensuring that the output is clean and usable.
prompt = f"""
You are a SQL expert.
Please help to convert the following natural language command into a valid UPDATE SQL query. Your response should ONLY be based on the given context and follow the response guidelines and format instructions.
===Tables
{table_schemas}
===Response Guidelines
1. If the provided context is sufficient, please generate a valid query WITHOUT any explanations for the question.
2. Please format the query before responding.
3. Please always respond with a valid well-formed JSON object with the following format
4. There are only UPDATE queries and points are either added or deducted from a house
===Response Format
{{
"query": "A valid UPDATE SQL query when context is sufficient.",
}}
===command
{natural_language_text}
"""
3. Send Request to Llama 3: The text is then sent to the LLM using Ollama API. The model processes the request and returns a JSON object with the SQL query. We parse the model’s JSON response to extract the SQL query. If something goes wrong, such as a failure to parse the response, an error is returned. This ensures robustness in the code.
import ollama
import json
response = ollama.chat(
model="llama3",
messages=[{"role": "user", "content": prompt}]
)
# Directly return the content as it should now be only the SQL query
# Parse the JSON response and return the SQL query if provided
response_content = response['message']['content']
# Directly return the content as it should now be only the SQL query
# Parse the JSON response and return the SQL query if provided
response_content = response['message']['content']
try:
response_json = json.loads(response_content)
if "query" in response_json:
return response_json["query"]
else:
return f"Error: {response_json.get('explanation', 'No explanation provided.')}"
except json.JSONDecodeError:
return "Error: Failed to parse response as JSON."
With this, your “10 points to Gryffindor” becomes an SQL query like:
UPDATE house_points SET points = points + 10 WHERE house_name = 'Gryffindor';
Step 4 — Running the SQL Query
Finally, we take the generated SQL query and execute it on our database to update the house points. But before diving into query execution, let’s make sure the initial setup is in place.
First, you’ll need a table to track the points for each Hogwarts house. Here’s a simple table structure that does the job:
CREATE TABLE house_points (
house_name VARCHAR(50) PRIMARY KEY,
points INT
);
Now, populate the table with the initial points for each house. Here’s a quick SQL command to give each house 100 points to start with:
INSERT INTO house_points (house_name, points)
VALUES ('Gryffindor', 100), ('Hufflepuff', 100), ('Ravenclaw', 100), ('Slytherin', 100);
Once your database is ready, you’ll need to establish a connection to run queries. Using SQLAlchemy makes this super easy. Here’s how you set up the connection:
from sqlalchemy import create_engine, text
engine = create_engine('postgresql://db_user:db_password@localhost/db_name')
def run_sql_query(query):
with engine.connect() as conn:
conn.execute(text(query))
conn.commit()
Replace 'db_user'
, 'db_password'
, and 'db_name'
with your actual PostgreSQL credentials and database name.
This function takes the SQL query produced by our Speech-to-SQL script and executes it on your database. Every time a new voice command updates the points, this function will run the corresponding SQL and commit the changes, ensuring real-time updates to the house points table.
You’ve just updated your Hogwarts house points through voice commands — how’s that for magical?
Bonus: Bringing It All Together in Streamlit
To make your project even more user-friendly, why not wrap the entire solution into a web app? With Streamlit, you can build an interactive interface where users can record their voice commands, see live updates of house points, and view visuals like house logos.
I’ve already included the complete code for this in my Github repo, which you can check out and customize to suit your needs.
Conclusion
With just a few lines of code, you’ve built a voice-powered SQL executor. Perfect for those moments when you’re channeling your inner Professor McGonagall, deducting points with flair. So, whether you’re managing Hogwarts or any dataset, Speech-to-SQL might just be the spell you need to save time.
The full source code and the Jupyter Notebook are available in the GitHub repository. Feel free to reach out if you have ideas for improvements or any other observations.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI