
Fine-Tuning Open-Source LLMs for Text-to-SQL: Project Overview and Motivations (article 1 of 3)

Author(s): Lorentz Yeung

Originally published on Towards AI.

OpenAI’s GPT-4 Mini as a benchmark for this project. Photo by Growtika on Unsplash

In the rapidly evolving world of AI, transforming natural language questions into executable SQL queries, known as text-to-SQL, has become a game-changer for data analysis. Imagine asking your database, "How many customers placed orders last quarter, grouped by region and ordered by compounded growth rate?" and getting a perfectly crafted SQL query in return. This project explores fine-tuning open-source large language models (LLMs) like Meta's Llama 3.1 8B Instruct and Alibaba's Qwen 2.5 series to excel at this task, using advanced techniques like Group Relative Policy Optimization (GRPO). Over 60 training sessions (1,600+ hours), I experimented on a high-end setup (RTX 4090 via WSL2 on Windows 11) to push these models toward handling complex queries.
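
To make the task concrete, here is a minimal, invented example of the kind of schema/question/SQL triple a text-to-SQL model is asked to produce. The table and query below are illustrative only and are not taken from this project's dataset.

```python
# A hypothetical text-to-SQL pair; the schema, question, and query are
# invented for illustration and are not part of the project's data.
example = {
    "schema": "CREATE TABLE orders (id INTEGER, customer_id INTEGER, "
              "region TEXT, order_date TEXT);",
    "question": "How many customers placed orders in the last three months, grouped by region?",
    "sql": (
        "SELECT region, COUNT(DISTINCT customer_id) AS customers "
        "FROM orders "
        "WHERE order_date >= date('now', '-3 months') "
        "GROUP BY region;"
    ),
}
print(example["sql"])
```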

This is the first in a series of articles. Here, I’ll cover the project’s core elements, methods, datasets, my motivations (including why open-source models are crucial for closed ecosystems), and a high-level preview of the results. The second article will dive into the machine setup, and the third will detail the outcomes and key takeaways.

What Is This Project?

At its heart, this project aims to create an "expert" LLM tailored for text-to-SQL tasks on a specific database schema. I started with the Llama 3.1 8B Instruct model (from Meta) and later switched to Qwen 2.5 variants (like Qwen2.5-Coder-7B-Instruct) for better performance on coding-related tasks, and finally alternated between them for comparison. Fine-tuning focused on SQLite-compatible queries, emphasizing complex operations like self joins, temporal analysis, running totals, and recurrence detection.

The end goal? A model that rivals paid services like Grok or Perplexity for intricate SQL generation, but in an open-source, privacy-focused setup. Is it possible? If yes, how many hours does it take? How good can we expect the results to be? By the end of this project, I had answers to all of these questions. If you are only interested in the final results, please proceed to the third article.

Methods Used

I adapted techniques from articles such as "Fine-Tuning a Text-to-SQL LLM for Reasoning Using GRPO" by Yi Ai, optimizing for my hardware (RTX 4090 with 24 GB VRAM). I modified the script to fit my needs: my version balances the training datasets and then concentrates on complex SQL queries, I curated my own dataset for the project, and I use my own post-training evaluation method.

Anyway, the key methods include:

1. Group Relative Policy Optimization (GRPO)

GRPO is a reinforcement learning approach that refines the model based on multiple reward signals. Unlike standard supervised fine-tuning, it encourages exploration while penalizing deviations from a reference policy (via KL divergence).

Reward Functions:

  • Format Reward: Ensures output follows <reasoning>…</reasoning><sql>…</sql> (heuristic-based, score 0–1); a minimal sketch follows this list.
  • SQL Correctness Reward: Compares generated SQL execution results to ground truth (using sqlglot for parsing and actual query running).
  • Complexity Reward: Aligns query complexity with the gold standard (on token length and operations).
  • Reasoning Quality Reward: Evaluates reasoning clarity (heuristics like length, SQL term usage, and structure).
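
As a flavour of how such heuristics can look, here is a minimal sketch of a format reward in Python. It is illustrative only, not the author's exact code, and it assumes plain-string completions.

```python
import re

def format_reward(completions, **kwargs):
    """Heuristic format reward sketch (illustrative, not the author's exact code):
    full credit when a completion contains well-formed
    <reasoning>...</reasoning><sql>...</sql> tags, partial credit otherwise."""
    pattern = r"<reasoning>.+?</reasoning>\s*<sql>.+?</sql>"
    scores = []
    for text in completions:  # assumes plain-string completions
        if re.search(pattern, text, flags=re.DOTALL):
            scores.append(1.0)
        else:
            tags = ("<reasoning>", "</reasoning>", "<sql>", "</sql>")
            # partial credit per tag present, capped below full credit
            scores.append(min(0.25 * sum(tag in text for tag in tags), 0.9))
    return scores
```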

Hyperparameters: Learning rates from 1e-6 to 4e-5, beta (KL penalty) 0.01–0.1, max gradient norm 1.0–10.0, epochs 3–10 (at least 12 recommended), batch size 8.

Libraries: trl for GRPOTrainer, unsloth for GPU optimizations, peft for LoRA, bitsandbytes for 4-bit/8-bit quantization to fit in 14–20 GB VRAM.
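
Putting the hyperparameters and libraries together, a minimal wiring sketch with trl's GRPOTrainer might look like the following. It is written against a recent trl release that ships GRPOTrainer/GRPOConfig (parameter names can differ across versions), and names such as model, train_dataset, and the additional reward functions are assumptions defined elsewhere, not code from this project.

```python
from trl import GRPOConfig, GRPOTrainer

# Representative values from the ranges quoted above, not the exact settings.
config = GRPOConfig(
    output_dir="grpo-text2sql",
    learning_rate=2e-5,              # 1e-6 to 4e-5 in the experiments
    beta=0.03,                       # KL penalty toward the reference policy
    max_grad_norm=1.0,
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model=model,                     # assumed: a LoRA-wrapped base model (see the LoRA sketch below)
    args=config,
    train_dataset=train_dataset,     # assumed: prompts plus ground-truth SQL
    reward_funcs=[                   # assumed callables following trl's reward-function signature
        format_reward,
        sql_correctness_reward,
        complexity_reward,
        reasoning_quality_reward,
    ],
)
trainer.train()
```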

2. LoRA for Efficient Fine-Tuning

  • Instead of full-model updates, LoRA adds low-rank adapters (rank 8–32) to target modules like attention layers, training only ~20M parameters.
  • This kept training times reasonable (~2–72 hours per run) and memory under 15 GB for most sessions.
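
For reference, a minimal LoRA setup with peft could look like the sketch below; the rank, alpha, dropout, and target modules are representative choices within the ranges above, not necessarily the author's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # one of the base models used in the project
    device_map="auto",
)
lora_config = LoraConfig(
    r=16,                                  # rank 8-32 in the experiments
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()         # on the order of tens of millions trainable
```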

3. Evaluation Metrics

  • Syntactic Validity Score (SVS): Does the query run without errors?
  • Ground Truth Semantic Correctness Score (GTSCS): Do results match the gold query?
  • AI Semantic Correctness Score (AISCS): If the output doesn’t match the ground truth, could it still be correct? Grok is asked to judge semantic equivalence in these cases.
  • Composite Precision Score (CPS): Average of SVS, GTSCS, AISCS.
  • Tested on 10 queries (5 easy/medium, 5 hard), executed via SQLite.
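
A minimal sketch of how two of these metrics can be computed is shown below; the function names are mine, not the author's, and the database path matches the synthetic call center database described in the Datasets section.

```python
import sqlite3

def syntactic_validity(sql: str, db_path: str = "wandsworth_callcenter_sampled.db") -> float:
    """SVS sketch: 1.0 if the generated query executes against SQLite without error."""
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute(sql).fetchall()
        return 1.0
    except sqlite3.Error:
        return 0.0

def composite_precision_score(svs: float, gtscs: float, aiscs: float) -> float:
    """CPS sketch: the plain average of the three per-query scores."""
    return (svs + gtscs + aiscs) / 3.0

# Example: a query that runs (SVS=1), mismatches the gold output (GTSCS=0),
# but is judged semantically equivalent by the AI reviewer (AISCS=1).
print(round(composite_precision_score(1.0, 0.0, 1.0), 2))  # 0.67
```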

4. Hardware and Environment

  • Ran on WSL2 (Ubuntu 22.04) with CUDA 12.1, PyTorch 2.2.0.
  • Dependencies: transformers 4.43.0, datasets 2.20.0, sqlglot 25.1.0, etc. (full list in requirements.txt).
  • Challenges: Resolved CUDA mismatches and dependency conflicts (e.g., downgrading trl to 0.8.6 for compatibility).

Datasets

Datasets evolved across sessions to focus on complexity:

  • Initial Dataset: From b-mc2/sql-create-context (Hugging Face), ~300–500 examples for quick prototyping, then the entire dataset for benchmarking. Formatted with prompts like: "Schema: [context] Question: [natural language]".
  • Curated Dataset: The original dataset was ~10,000 rows in total, validated down to 5,020 initially, then narrowed to 616 complexity-3 queries (complexity 1 = easy, 2 = medium, 3 = hard). These cover, for example, category counting with running totals (e.g., quarterly aggregates using strftime), temporal sequences (e.g., julianday for date differences), and recurring issues (e.g., RANK() OVER for detection). At this stage, each data point is formatted with prompts like: "Schema: [context] Question: [natural language], Ground truth: [SQL], Complexity: [level as INT]".
  • Evaluation Set: The evaluation set consists of 10 optimized test queries designed to assess SQL generation models on a synthetic call center database ("wandsworth_callcenter_sampled.db"), which can be found via the link below. It includes natural language prompts requiring complex SQL queries, divided evenly into 5 Easy/Medium and 5 Hard examples (based on factors like self joins, temporal logic, and aggregations). The set evaluates models on syntactic validity (executable SQL), semantic correctness (via ground-truth output matching and AI-assisted review using Grok), and overall precision, providing insight into model performance on realistic, domain-specific tasks.
  • Augmentation: Added synthetic queries for diversity, ensuring SQLite compatibility. No external datasets like Spider to avoid leakage.

Training used 200–10,000 examples per run, with max sequence length 4,096 to handle long schemas.
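
To illustrate the prompt formats quoted above, here is a hedged sketch of how a training prompt could be assembled. The field names follow the b-mc2/sql-create-context convention (context, question, answer) plus the added complexity label; the example row is invented, not drawn from the project's data.

```python
def build_prompt(row: dict) -> str:
    """Assemble a prompt in the shape described above (illustrative only)."""
    prompt = f"Schema: {row['context']} Question: {row['question']}"
    if "complexity" in row:  # curated rows also carry the gold SQL and a 1-3 level
        prompt += f", Ground truth: {row['answer']}, Complexity: {row['complexity']}"
    return prompt

example_row = {
    "context": "CREATE TABLE requests (id INTEGER, ward TEXT, opened TEXT);",
    "question": "How many requests were opened per ward?",
    "answer": "SELECT ward, COUNT(*) FROM requests GROUP BY ward;",
    "complexity": 1,
}
print(build_prompt(example_row))
```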

Reasons for the Project

This isn’t just a research exercise; it’s driven by real-world needs. Here’s why I embarked on this:

  1. Is Fine-Tuning for a Single Database Doable?
    I wanted to test whether we can create a "domain expert" LLM for one project or database. Companies often have proprietary schemas (e.g., customer data in a CRM). Could a fine-tuned small open-source model handle tough, nuanced questions like "Identify recurring service requests in wards with escalating issues over the past year, ranking them by severity" or "Identify addresses with recurring fly-tipping issues across different months, and rank wards by the number of such addresses"? The idea is to make the model so attuned that users can ask complex, ad-hoc queries and get reliable answers without manual SQL writing.
  2. How Much Training Is Needed?
    If viable, what’s the sweet spot? Early runs showed 300–500 examples suffice for simple queries, but complex ones probably needed 600+ hard-focused queries over 3–8 epochs. I aimed to quantify: Can 1,000 data points and 12 epochs push GTSCS >50% on hard queries?
  3. How Much Time Does It Need?
    Efficiency matters for real adoption. On an RTX 4090, runs took 2–4 hours (250–3,000 steps). Scaling to 12 epochs might take 48–72 hours or even more, but is it worth it for production-grade accuracy?
  4. Which Open-Source Model Suits Better?
    I compared Llama 3.1 8B Instruct (strong in instruction-following) vs. Qwen 2.5 Coder series (optimized for code/SQL). Spoiler: Qwen edged ahead in picking up new functions, but both 7B–8B sizes struggled with ultra-complex syntax. My experiments showed that even 13B models struggled with those questions.
  5. Privacy and Closed Ecosystems Demand Open-Source Solutions
    Many companies keep databases air-gapped for security; exposing them to cloud APIs like OpenAI or Grok risks breaches. Paid models are off-limits, so open-source is key. As of March 2025, only massive paid platforms handle truly difficult queries. For example: "Compute running totals of requests per ward, using CTEs to identify months with consecutive escalations, ranked by julianday differences." Grok and Perplexity nailed it with correct CTEs and window functions, but smaller open-source models, and even ChatGPT-4o mini, hallucinated invalid syntax. This project tests whether fine-tuning bridges that gap affordably.

These motivations stem from practical AI deployment: empowering in-house tools without vendor lock-in.

Roughly What Are the Results?

Across 60+ runs and 1,600+ cumulative hours, models improved on easy/medium queries (CPS 0.53–0.87, up from untuned 0.43–0.50), but hard ones remained challenging (GTSCS 0.00–0.20 vs. OpenAI’s 0.60). Qwen outperformed Llama on simple SQL and picked up pieces like strftime and joins, but CTEs and ranking eluded both. Optimal parameters: a learning rate of 2e-5 or less for stability, beta 0.03, and 12+ epochs on 1,000 queries. Time: roughly 72+ hours per solid run. Viable? Yes, for the basics, but complex expertise needs more data and epochs. Full details are in article 3. I decided to include my setup in article 2 because I struggled a lot setting up my own machine for LLM training and fine-tuning, with many crashes and incompatibilities. Whoever wants a quick, working setup for their machine will benefit from those details, not only for GRPO or SQL-related fine-tuning.

Stay tuned for the setup guide and deep-dive results. Questions? Comment below!

To go to Part 2 (Setting Up a Machine for Fine-Tuning LLMs on WSL2), please click here: Fine-Tuning Open-Source LLMs for Text-to-SQL: Setting Up a Machine for Fine-Tuning LLMs on WSL2 (article 2 of 3) | by Lorentz Yeung | Jul, 2025 | Towards AI.

If you want to jump directly to Part 3 (Results and Key Takeaways), please go here: https://pub.towardsai.net/fine-tuning-open-source-llms-for-text-to-sql-results-and-key-takeaways-article-3-of-3-2b887951edda?source=friends_link&sk=542b07803d0a04150922f6c36f41e25e

References:

Initial Training Data: b-mc2/sql-create-context · Datasets at Hugging Face

My synthetic database: wandsworth_callcenter_sampled.db · Entz/council_2 at main

My synthetic database: wandsworth_callcenter_sampled.csv · Entz/council_2 at main

If you’ve enjoyed the article, I’d be stoked if you’d consider tossing a coin my way via PayPal or becoming a GitHub Sponsor. Your support keeps the train running. In the spirit of James Bently’s saying, let’s keep contributing to the AI community and benefiting humanity.

Support the Author

If you found this article useful, please consider donating to my PayPal tip jar!

Pay Pui Yeung using PayPal.Me

Go to PayPal.Me/entzyeung and enter the amount. It’s safer and more secure. Don’t have a PayPal account? No problem.


Your support means the universe to me and allows me to stay on this lonely road of exploration: to keep experimenting, writing articles, making tutorials, …

Thank you!


Published via Towards AI
