
Fine-Tuning Open-Source LLMs for Text-to-SQL: Project Overview and Motivations (article 1 of 3)

Author(s): Lorentz Yeung

Originally published on Towards AI.

OpenAI’s GPT-4 Mini as a benchmark for this project. Photo by Growtika on Unsplash

In the rapidly evolving world of AI, transforming natural language questions into executable SQL queries, known as text-to-SQL, has become a game-changer for data analysis. Imagine asking your database, "How many customers placed orders last quarter, grouped by region and ordered by compounded growth rate?" and getting a perfectly crafted SQL query in return. This project explores fine-tuning open-source large language models (LLMs) like Meta's Llama 3.1 8B Instruct and Alibaba's Qwen 2.5 series to excel at this task, using advanced techniques like Group Relative Policy Optimization (GRPO). Over 60 training sessions (1,600+ hours), I experimented on a high-end setup (RTX 4090 via WSL2 on Windows 11) to push these models toward handling complex queries.
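
To make the task concrete, here is a minimal, invented example of the kind of schema/question/SQL triple a text-to-SQL model is asked to produce. The table and query below are illustrative only and are not taken from this project's dataset.

```python
# A hypothetical text-to-SQL pair; the schema, question, and query are
# invented for illustration and are not part of the project's data.
example = {
    "schema": "CREATE TABLE orders (id INTEGER, customer_id INTEGER, "
              "region TEXT, order_date TEXT);",
    "question": "How many customers placed orders in the last three months, grouped by region?",
    "sql": (
        "SELECT region, COUNT(DISTINCT customer_id) AS customers "
        "FROM orders "
        "WHERE order_date >= date('now', '-3 months') "
        "GROUP BY region;"
    ),
}
print(example["sql"])
```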

This is the first in a series of articles. Here, I’ll cover the project’s core elements, methods, datasets, my motivations (including why open-source models are crucial for closed ecosystems), and a high-level preview of the results. The second article will dive into the machine setup, and the third will detail the outcomes and key takeaways.

What Is This Project?

At its heart, this project aims to create an "expert" LLM tailored for text-to-SQL tasks on a specific database schema. I started with the Llama 3.1 8B Instruct model (from Meta) and later switched to Qwen 2.5 variants (like Qwen2.5-Coder-7B-Instruct) for better performance on coding-related tasks, and finally alternated between them for comparison. Fine-tuning focused on SQLite-compatible queries, emphasizing complex operations like self joins, temporal analysis, running totals, and recurrence detection.

The end goal? A model that rivals paid services like Grok or Perplexity for intricate SQL generation, but in an open-source, privacy-focused setup. Is it possible? If yes, how many hours does it take? How good can we expect the results to be? By the end of this project, I had answers to all of these questions. If you are only interested in the final results, please proceed to the third article.

Methods Used

I adapted techniques from articles such as "Fine-Tuning a Text-to-SQL LLM for Reasoning Using GRPO" by Yi Ai, optimizing for my hardware (RTX 4090 with 24 GB VRAM). I modified the script to fit my needs: my version balances the training datasets and then concentrates on complex SQL queries, I curated my own dataset for the project, and I use my own post-training evaluation method.

Anyway, the key methods include:

1. Group Relative Policy Optimization (GRPO)

GRPO is a reinforcement learning approach that refines the model based on multiple reward signals. Unlike standard supervised fine-tuning, it encourages exploration while penalizing deviations from a reference policy (via KL divergence).

Reward Functions:

  • Format Reward: Ensures output follows <reasoning>…</reasoning><sql>…</sql> (heuristic-based, score 0–1); a minimal sketch follows this list.
  • SQL Correctness Reward: Compares generated SQL execution results to ground truth (using sqlglot for parsing and actual query running).
  • Complexity Reward: Aligns query complexity with the gold standard (on token length and operations).
  • Reasoning Quality Reward: Evaluates reasoning clarity (heuristics like length, SQL term usage, and structure).
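
As a flavour of how such heuristics can look, here is a minimal sketch of a format reward in Python. It is illustrative only, not the author's exact code, and it assumes plain-string completions.

```python
import re

def format_reward(completions, **kwargs):
    """Heuristic format reward sketch (illustrative, not the author's exact code):
    full credit when a completion contains well-formed
    <reasoning>...</reasoning><sql>...</sql> tags, partial credit otherwise."""
    pattern = r"<reasoning>.+?</reasoning>\s*<sql>.+?</sql>"
    scores = []
    for text in completions:  # assumes plain-string completions
        if re.search(pattern, text, flags=re.DOTALL):
            scores.append(1.0)
        else:
            tags = ("<reasoning>", "</reasoning>", "<sql>", "</sql>")
            # partial credit per tag present, capped below full credit
            scores.append(min(0.25 * sum(tag in text for tag in tags), 0.9))
    return scores
```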

Hyperparameters: Learning rates from 1e-6 to 4e-5, beta (KL penalty) 0.01–0.1, max gradient norm 1.0–10.0, epochs 3–10 (at least 12 recommended), batch size 8.

Libraries: trl for GRPOTrainer, unsloth for GPU optimizations, peft for LoRA, bitsandbytes for 4-bit/8-bit quantization to fit in 14–20 GB VRAM.
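
Putting the hyperparameters and libraries together, a minimal wiring sketch with trl's GRPOTrainer might look like the following. It is written against a recent trl release that ships GRPOTrainer/GRPOConfig (parameter names can differ across versions), and names such as model, train_dataset, and the additional reward functions are assumptions defined elsewhere, not code from this project.

```python
from trl import GRPOConfig, GRPOTrainer

# Representative values from the ranges quoted above, not the exact settings.
config = GRPOConfig(
    output_dir="grpo-text2sql",
    learning_rate=2e-5,              # 1e-6 to 4e-5 in the experiments
    beta=0.03,                       # KL penalty toward the reference policy
    max_grad_norm=1.0,
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model=model,                     # assumed: a LoRA-wrapped base model (see the LoRA sketch below)
    args=config,
    train_dataset=train_dataset,     # assumed: prompts plus ground-truth SQL
    reward_funcs=[                   # assumed callables following trl's reward-function signature
        format_reward,
        sql_correctness_reward,
        complexity_reward,
        reasoning_quality_reward,
    ],
)
trainer.train()
```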

2. LoRA for Efficient Fine-Tuning

  • Instead of full-model updates, LoRA adds low-rank adapters (rank 8–32) to target modules like attention layers, training only ~20M parameters.
  • This kept training times reasonable (~2–72 hours per run) and memory under 15 GB for most sessions.
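
For reference, a minimal LoRA setup with peft could look like the sketch below; the rank, alpha, dropout, and target modules are representative choices within the ranges above, not necessarily the author's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # one of the base models used in the project
    device_map="auto",
)
lora_config = LoraConfig(
    r=16,                                  # rank 8-32 in the experiments
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()         # on the order of tens of millions trainable
```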

3. Evaluation Metrics

  • Syntactic Validity Score (SVS): Does the query run without errors?
  • Ground Truth Semantic Correctness Score (GTSCS): Do results match the gold query?
  • AI Semantic Correctness Score (AISCS): If the output doesn’t match the ground truth, could it still be correct? Grok is asked to judge semantic equivalence in these cases.
  • Composite Precision Score (CPS): Average of SVS, GTSCS, AISCS.
  • Tested on 10 queries (5 easy/medium, 5 hard), executed via SQLite.
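
A minimal sketch of how two of these metrics can be computed is shown below; the function names are mine, not the author's, and the database path matches the synthetic call center database described in the Datasets section.

```python
import sqlite3

def syntactic_validity(sql: str, db_path: str = "wandsworth_callcenter_sampled.db") -> float:
    """SVS sketch: 1.0 if the generated query executes against SQLite without error."""
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute(sql).fetchall()
        return 1.0
    except sqlite3.Error:
        return 0.0

def composite_precision_score(svs: float, gtscs: float, aiscs: float) -> float:
    """CPS sketch: the plain average of the three per-query scores."""
    return (svs + gtscs + aiscs) / 3.0

# Example: a query that runs (SVS=1), mismatches the gold output (GTSCS=0),
# but is judged semantically equivalent by the AI reviewer (AISCS=1).
print(round(composite_precision_score(1.0, 0.0, 1.0), 2))  # 0.67
```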

4. Hardware and Environment

  • Ran on WSL2 (Ubuntu 22.04) with CUDA 12.1, PyTorch 2.2.0.
  • Dependencies: transformers 4.43.0, datasets 2.20.0, sqlglot 25.1.0, etc. (full list in requirements.txt).
  • Challenges: Resolved CUDA mismatches and dependency conflicts (e.g., downgrading trl to 0.8.6 for compatibility).

Datasets

Datasets evolved across sessions to focus on complexity:

  • Initial Dataset: From b-mc2/sql-create-context (Hugging Face), ~300–500 examples for quick prototyping, then the entire dataset for benchmarking. Formatted with prompts like: "Schema: [context] Question: [natural language]".
  • Curated Dataset: The original dataset was ~10,000 rows in total, validated down to 5,020 initially, then narrowed to 616 complexity-3 queries (complexity 1 = easy, 2 = medium, 3 = hard). These cover, for example, category counting with running totals (e.g., quarterly aggregates using strftime), temporal sequences (e.g., julianday for date differences), and recurring issues (e.g., RANK() OVER for detection). At this stage, each data point is formatted with prompts like: "Schema: [context] Question: [natural language], Ground truth: [SQL], Complexity: [level as INT]".
  • Evaluation Set: The evaluation set consists of 10 optimized test queries designed to assess SQL generation models on a synthetic call center database ("wandsworth_callcenter_sampled.db"), which can be found via the link below. It includes natural language prompts requiring complex SQL queries, divided evenly into 5 Easy/Medium and 5 Hard examples (based on factors like self joins, temporal logic, and aggregations). The set evaluates models on syntactic validity (executable SQL), semantic correctness (via ground-truth output matching and AI-assisted review using Grok), and overall precision, providing insight into model performance on realistic, domain-specific tasks.
  • Augmentation: Added synthetic queries for diversity, ensuring SQLite compatibility. No external datasets like Spider to avoid leakage.

Training used 200–10,000 examples per run, with max sequence length 4,096 to handle long schemas.
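
To illustrate the prompt formats quoted above, here is a hedged sketch of how a training prompt could be assembled. The field names follow the b-mc2/sql-create-context convention (context, question, answer) plus the added complexity label; the example row is invented, not drawn from the project's data.

```python
def build_prompt(row: dict) -> str:
    """Assemble a prompt in the shape described above (illustrative only)."""
    prompt = f"Schema: {row['context']} Question: {row['question']}"
    if "complexity" in row:  # curated rows also carry the gold SQL and a 1-3 level
        prompt += f", Ground truth: {row['answer']}, Complexity: {row['complexity']}"
    return prompt

example_row = {
    "context": "CREATE TABLE requests (id INTEGER, ward TEXT, opened TEXT);",
    "question": "How many requests were opened per ward?",
    "answer": "SELECT ward, COUNT(*) FROM requests GROUP BY ward;",
    "complexity": 1,
}
print(build_prompt(example_row))
```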

Reasons for the Project

This isn’t just a research exercise; it’s driven by real-world needs. Here’s why I embarked on this:

  1. Is Fine-Tuning for a Single Database Doable?
    I wanted to test whether we can create a "domain expert" LLM for one project or database. Companies often have proprietary schemas (e.g., customer data in a CRM). Could a fine-tuned small open-source model handle tough, nuanced questions like "Identify recurring service requests in wards with escalating issues over the past year, ranking them by severity" or "Identify addresses with recurring fly-tipping issues across different months, and rank wards by the number of such addresses"? The idea is to make the model so attuned that users can ask complex, ad-hoc queries and get reliable answers without manual SQL writing.
  2. How Much Training Is Needed?
    If viable, what’s the sweet spot? Early runs showed 300–500 examples suffice for simple queries, but complex ones probably needed 600+ hard-focused queries over 3–8 epochs. I aimed to quantify: Can 1,000 data points and 12 epochs push GTSCS >50% on hard queries?
  3. How Much Time Does It Need?
    Efficiency matters for real adoption. On an RTX 4090, runs took 2–4 hours (250–3,000 steps). Scaling to 12 epochs might take 48–72 hours or even more, but is it worth it for production-grade accuracy?
  4. Which Open-Source Model Suits Better?
    I compared Llama 3.1 8B Instruct (strong in instruction-following) vs. Qwen 2.5 Coder series (optimized for code/SQL). Spoiler: Qwen edged ahead in picking up new functions, but both 7B–8B sizes struggled with ultra-complex syntax. My experiments showed that even 13B models struggled with those questions.
  5. Privacy and Closed Ecosystems Demand Open-Source Solutions
    Many companies keep databases air-gapped for security; exposing them to cloud APIs like OpenAI or Grok risks breaches. Paid models are off-limits, so open-source is key. As of March 2025, only massive paid platforms handle truly difficult queries. For example: "Compute running totals of requests per ward, using CTEs to identify months with consecutive escalations, ranked by julianday differences." Grok and Perplexity nailed it with correct CTEs and window functions, but smaller open-source models, and even ChatGPT-4o mini, hallucinated invalid syntax. This project tests whether fine-tuning bridges that gap affordably.

These motivations stem from practical AI deployment: empowering in-house tools without vendor lock-in.

Roughly What Are the Results?

Across 60+ runs and 1,600+ cumulative hours, models improved on easy/medium queries (CPS 0.53–0.87, up from untuned 0.43–0.50), but hard ones remained challenging (GTSCS 0.00–0.20 vs. OpenAI’s 0.60). Qwen outperformed Llama on simple SQL and picked up pieces like strftime and joins, but CTEs and ranking eluded both. Optimal parameters: a learning rate of 2e-5 or less for stability, beta 0.03, and 12+ epochs on 1,000 queries. Time: roughly 72+ hours per solid run. Viable? Yes, for the basics, but complex expertise needs more data and epochs. Full details are in article 3. I decided to include my setup in article 2 because I struggled a lot setting up my own machine for LLM training and fine-tuning, with many crashes and incompatibilities. Whoever wants a quick, working setup for their machine will benefit from those details, not only for GRPO or SQL-related fine-tuning.

Stay tuned for the setup guide and deep-dive results. Questions? Comment below!

To go to Part 2 (Setting Up a Machine for Fine-Tuning LLMs on WSL2), please click here: Fine-Tuning Open-Source LLMs for Text-to-SQL: Setting Up a Machine for Fine-Tuning LLMs on WSL2 (article 2 of 3) | by Lorentz Yeung | Jul, 2025 | Towards AI.

If you want to jump directly to Part 3 (Results and Key Takeaways), please go here: https://pub.towardsai.net/fine-tuning-open-source-llms-for-text-to-sql-results-and-key-takeaways-article-3-of-3-2b887951edda?source=friends_link&sk=542b07803d0a04150922f6c36f41e25e

References:

Initial Training Data: b-mc2/sql-create-context · Datasets at Hugging Face

My synthetic database: wandsworth_callcenter_sampled.db · Entz/council_2 at main

My synthetic database: wandsworth_callcenter_sampled.csv · Entz/council_2 at main

If you’ve enjoyed the article, I’d be stoked if you’d consider tossing a coin my way via PayPal or becoming a GitHub Sponsor. Your support keeps the train running. In the spirit of James Bently’s saying, let’s keep contributing to the AI community and benefiting humanity.

Support the Author

If you found this article useful, please consider donating to my PayPal tip jar!

Pay Pui Yeung using PayPal.Me

Go to PayPal.Me/entzyeung and enter the amount. It’s safer and more secure. Don’t have a PayPal account? No problem.


Your support means the universe to me and allows me to stay on this lonely road of exploration: to keep experimenting, writing articles, making tutorials, …

Thank you!


Published via Towards AI
