
NLQ-to-SQL Evaluation: The Metrics That Matter

Author(s): Tiyasa Mukherjee

Originally published on Towards AI.

This article explores how a typical Natural Language to SQL (NLQ-to-SQL) pipeline works, why its evaluation is critical, and introduces key metrics — including LLM-based, rule-based, and mathematical approaches — to measure its accuracy and effectiveness.

Natural Language to SQL (NLQ-to-SQL) systems are becoming increasingly popular for making data access more natural and user-friendly. They allow users to ask questions in plain English and receive instant answers without writing any SQL. This has huge potential to reduce the dependency on data teams and make insights more accessible across an organization.

Large language models (LLMs) have made it easier than ever to build NLQ-to-SQL systems. But one crucial question remains: How can we be confident that the generated SQL queries truly reflect the user’s intent and retrieve the right data?

Just as DJs tune sound, NLQ-to-SQL systems need metric-based tuning for accurate results [Image courtesy: Ideogram AI]

That’s why evaluation is critical — not just in production, but from the earliest prototyping stages. To move beyond human validation, we need measurable ways to assess whether the system is doing its job.

I’ve gathered a practical set of metrics that can help you evaluate NLQ-to-SQL pipelines more effectively. These metrics fall into four categories:

  • Non-execution, LLM-as-a-Judge, with Ground Truth Metrics
  • Non-execution, Mathematical/Rule-Based, without Ground Truth Metrics
  • Execution, Mathematical/Rule-Based, with Ground Truth Metrics
  • Execution, Mathematical/Rule-Based, without Ground Truth Metrics

In this article, I will introduce the metrics that matter, starting with how NLQ-to-SQL pipelines use LLMs to bridge the gap between natural language and structured data.

NLQ-to-SQL: Bridging Language and Data with LLMs

Imagine asking a database a question in plain English and instantly getting the answer. That’s the transformative promise of Natural Language to SQL (NLQ-to-SQL) pipelines.

A sample database schema (Made with drawSQL)

A typical NLQ-to-SQL pipeline powered by LLMs involves:

  • Natural Language Understanding (NLU): We use an LLM to analyze the natural language query, understand user intent, and extract key entities. We then retrieve the schemas of the relevant tables, setting the stage for SQL generation.
NLQ: Which payment method was used the most in the last quarter?
Extracted entities: Table: orders | Column: payment_method, COUNT(order_id) | Condition: order_date >= DATE_FORMAT(NOW() - INTERVAL 3 MONTH, '%Y-%m-01') | Aggregation: COUNT(order_id) GROUP BY payment_method ORDER BY COUNT(order_id) DESC LIMIT 1
  • Prompt Tuning: We tune the prompt for the LLM, preferably with few-shot domain examples and clear instructions.
(Sample Zero-Shot Prompt)

Generate a SQL query based on the provided extracted entities, user query, and table schemas.

### Input Parameters:
- Extracted Entities (e.g., tables, columns, conditions, aggregations):
{entities}

- User Query:
{user_query}

- Relevant Table Schemas:
{filtered_table_schemas_text}

### Requirements:
1. Generate a valid SQL query that accurately fulfills the user's request.
2. Use proper SQL syntax according to the schema structure.
3. Include necessary joins if multiple tables are involved.
4. Apply appropriate filtering conditions based on the extracted entities.
5. Use aggregations (SUM, COUNT, AVG, etc.) if required in the user query.
6. Ensure column names match the schema while avoiding unnecessary computations.
7. Format the SQL query for readability.

Output:
Provide the SQL query as a properly formatted code block.
  • SQL Generation: We combine the gathered information (the user query, detected intent, and extracted entities) with the mapped tables and filtered schema information, such as table and column descriptions. We then compile this metadata into the prompt and call the LLM for SQL generation.
Which payment method was used the most in the last quarter?
SELECT payment_method
FROM orders
WHERE order_date >= DATE_FORMAT(NOW() - INTERVAL 3 MONTH, '%Y-%m-01')
GROUP BY payment_method
ORDER BY COUNT(order_id) DESC
LIMIT 1;
  • SQL Execution: The generated SQL statement is executed against the database, completing the text-to-SQL conversion process.
UPI
  • Natural Language Output: The SQL output is passed to the LLM to generate an answer in a more human-readable format, thereby improving the user experience.
UPI was the most frequently used payment method by our customers in the last quarter.
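The five steps above can be sketched as a single orchestration function. This is a minimal outline, not the author's implementation: `llm` stands in for any prompt-to-text callable (an OpenAI client, a local model, etc.), and the prompts are compressed placeholders for the full templates shown above.

```python
import sqlite3

def run_pipeline(nlq: str, llm, conn: sqlite3.Connection) -> str:
    """Run the NLQ-to-SQL steps: NLU, prompt + SQL generation,
    execution, and natural-language answer generation."""
    # 1. NLU: extract entities (tables, columns, conditions, aggregations).
    entities = llm(f"Extract tables/columns/conditions/aggregations from: {nlq}")
    # 2-3. Prompt the LLM with the query, entities, and (elided) table schemas.
    sql = llm(f"Generate SQL for: {nlq}\nEntities: {entities}")
    # 4. Execute the generated SQL against the database.
    rows = conn.execute(sql).fetchall()
    # 5. Turn the raw result into a human-readable answer.
    return llm(f"Answer '{nlq}' in one sentence, given result rows: {rows}")
```

In practice, each LLM call would also carry the filtered table schemas and few-shot examples from the prompt template above.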

Key Metrics: Measuring the Success of Your NLQ-to-SQL Pipeline

To truly understand how well an NLQ-to-SQL system is performing, we need to go beyond surface-level validation. In this section, I’ll walk you through six metrics that help evaluate both the correctness and the efficiency of generated SQL queries.

1. Non-execution, LLM-as-a-Judge, with Ground Truth Metrics

i. Entity Recognition Score:

This metric is calculated as the average F1 score across entity types, such as tables, columns, conditions, and aggregate functions, by checking whether the entities extracted from the generated SQL query match those of the ground truth SQL query.

The Entity Recognition Score ranges between [0–1]. A score of 1 means that the entities extracted from the generated SQL and the ground truth SQL are exactly the same.

We can set a threshold value, such as 0.9, and if the score for any system falls below this threshold, mismatches in tables, columns, conditions, or aggregate functions in the generated SQL will be flagged early, allowing for intervention and refinement of the NLQ-to-SQL system.

Example: Entity Recognition Score = 0.75

NLQ:

How many orders were placed in the last month?

Ground Truth SQL:

SELECT COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01');

Entities from Ground Truth SQL:

Table: orders | Columns: order_id | Condition: order_date >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01') | Aggregation: COUNT(order_id)

Generated SQL:

SELECT SUM(order_id) AS total_orders
FROM orders
WHERE order_date >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01');

Entities from Generated SQL:

Table: orders | Columns: order_id | Condition: order_date >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01') | Aggregation: SUM(order_id)

Entity Recognition Score:
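The 0.75 above can be reproduced with a short sketch: compute an exact-match F1 per entity type and average across the four types. Here the table, column, and condition entities match while the aggregation does not (COUNT(order_id) became SUM(order_id)), so three types score 1 and one scores 0. The set-based comparison is an assumption about how entities are matched.

```python
def f1(pred: set, truth: set) -> float:
    """Exact-match F1 between two entity sets of one type."""
    if not pred and not truth:
        return 1.0
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def entity_recognition_score(pred: dict, truth: dict) -> float:
    """Average F1 across entity types (tables, columns, conditions, aggregations)."""
    types = set(pred) | set(truth)
    return sum(f1(set(pred.get(t, ())), set(truth.get(t, ()))) for t in types) / len(types)
```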

ii. Semantic Equivalence Score

The Semantic Equivalence Score uses an LLM to evaluate whether two SQL queries produce the same logical outcome. This metric ensures that NLQ-to-SQL models generate functionally correct queries, verifying that variations in phrasing, ordering, or formatting do not affect the final results.

The value of the Semantic Equivalence Score ranges between [0–1]. A score of 1 indicates that the Generated SQL and Ground Truth SQL are logically identical, meaning they retrieve the same results under all conditions.

We can set a threshold value, such as 0.9, and if the score for any system falls below this threshold, it signals a potential mismatch in query logic. This early flagging helps detect errors in query structure, joins, filtering conditions, and aggregations, allowing for intervention and refinement in the NLQ-to-SQL pipeline.

Example: Semantic Equivalence Score = 0 (Generated SQL misses the date filter (last month) present in ground truth)

NLQ:

Show the names of customers who made purchases in the last month.

Ground Truth SQL:

SELECT customer_name FROM orders WHERE purchase_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH);

Generated SQL:

SELECT customer_name FROM customers;
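The article does not show the judge prompt itself, so the sketch below is an assumed minimal shape: build a prompt asking an LLM whether the two queries are logically identical, then map its answer to a 0/1 score. The actual LLM call is left out.

```python
# Hypothetical judge prompt; wording is an assumption, not the author's template.
JUDGE_TEMPLATE = """You are a SQL expert. Do these two queries return the same
results under all database states? Answer EQUIVALENT or NOT_EQUIVALENT only.

Query A: {generated}
Query B: {ground_truth}"""

def build_judge_prompt(generated_sql: str, ground_truth_sql: str) -> str:
    return JUDGE_TEMPLATE.format(generated=generated_sql, ground_truth=ground_truth_sql)

def parse_verdict(llm_response: str) -> int:
    """Map the judge's free-text answer to a 0/1 Semantic Equivalence Score."""
    text = llm_response.upper()
    if "NOT_EQUIVALENT" in text:   # check the negative form first: it contains "EQUIVALENT"
        return 0
    return 1 if "EQUIVALENT" in text else 0
```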

2. Non-execution, Mathematical/Rule-Based, without Ground Truth Metrics

i. Halstead Complexity Scores

The Halstead Complexity Score measures the complexity of a piece of code by analyzing the number of distinct operators and operands. In the context of NLQ-to-SQL, we can apply this metric to the generated SQL queries to understand their complexity.

An ideal SQL query should balance clarity and efficiency, ensuring it is neither too simplistic nor overly complex. Queries with high effort or difficulty scores may indicate poor optimization and should be reviewed for refactoring. Leveraging these scores can help improve NLQ-to-SQL model performance, enhance query generation quality, and drive better database efficiency.

Halstead complexity formulas (image from Wikipedia)

Example:

NLQ:

How many orders has each customer placed since January 1, 2024?

Generated SQL:

SELECT customer_id, COUNT(order_id) FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id;

Halstead Complexity Scores:

{'Vocabulary (n)': 13,
'Length (N)': 13,
'Volume (V)': 48.105716335834195,
'Difficulty (D)': 2.0,
'Effort (E)': 96.21143267166839,
'Estimated Errors': 0.016035238778611398}
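The figures above follow the standard Halstead formulas: vocabulary n = n1 + n2, length N = N1 + N2, volume V = N * log2(n), difficulty D = (n1 / 2) * (N2 / n2), effort E = D * V, and estimated errors V / 3000. The counts used below (4 distinct operators and 9 distinct operands, each appearing once) are an assumption chosen so the sketch reproduces the scores shown; a real implementation would tokenize the SQL first.

```python
import math

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead metrics from operator/operand counts:
    n1/n2 = distinct operators/operands, N1/N2 = total occurrences."""
    n, N = n1 + n2, N1 + N2
    V = N * math.log2(n)        # volume
    D = (n1 / 2) * (N2 / n2)    # difficulty
    return {"Vocabulary (n)": n, "Length (N)": N, "Volume (V)": V,
            "Difficulty (D)": D, "Effort (E)": D * V,
            "Estimated Errors": V / 3000}
```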

ii. SQL Injection Checking

SQL Injection (SQLi) detection is a crucial security measure that identifies malicious patterns in SQL queries to prevent unauthorized data access, modification, or deletion. This method scans queries for common injection patterns, such as DROP, DELETE, and UPDATE statements.

Example:

Safe Query:

SELECT name, email FROM users WHERE id = 5;

Injection (DELETION) Query:

DELETE FROM orders WHERE order_id = 101;
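A rule-based check of this kind can be sketched as a keyword/pattern scan over the generated SQL. The pattern list below is illustrative, not exhaustive; real systems typically also enforce a read-only database role as a backstop.

```python
import re

# Statement types an NLQ-to-SQL system should normally never emit,
# plus stacked statements (";...;") and inline comments ("--").
_SUSPICIOUS = re.compile(
    r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE|GRANT)\b|;.*;|--",
    re.IGNORECASE,
)

def is_suspicious(sql: str) -> bool:
    """Flag queries containing write/DDL keywords or stacked statements."""
    return bool(_SUSPICIOUS.search(sql))
```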

3. Execution, Mathematical/Rule-Based, with Ground Truth Metrics

i. Data Retrieval Accuracy

Data Retrieval Accuracy uses a DataCompy-based score to evaluate how well the generated SQL query retrieves the same data records and structure as the ground truth SQL query. It compares the response (the executed generated-SQL results) with the reference (the executed ground-truth-SQL results) and computes precision and recall across both rows and columns.

Finally, it uses the F1 score, the harmonic mean of precision and recall, to provide a balanced assessment of retrieval accuracy.

A threshold of 0.9 can be set — if the retrieval accuracy falls below this, it indicates a significant mismatch in the retrieved data structure, requiring intervention to improve query generation quality.

Example:

NLQ:

Show all orders placed in March 2024 with order ID, customer ID, and total amount.

Ground Truth SQL:

SELECT order_id, customer_id, total_amount
FROM orders
WHERE order_date BETWEEN '2024-03-01' AND '2024-03-31';

Retrieved Data from Ground Truth SQL:

Generated SQL:

SELECT order_id, total_amount
FROM orders;

Retrieved Data from Generated SQL:

Data Retrieval Accuracy:
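The article computes this comparison with DataCompy on the executed result sets; the pure-Python sketch below shows the same precision/recall/F1 idea on column names and row tuples. Averaging the column-level and row-level F1 is my assumption about the aggregation, and the row data in the usage note is hypothetical.

```python
def f1_score(pred: set, truth: set) -> float:
    """F1 of predicted vs. ground-truth items (column names or row tuples)."""
    if not pred or not truth:
        return 1.0 if pred == truth else 0.0
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(truth)
    return 2 * p * r / (p + r)

def data_retrieval_accuracy(pred_cols, truth_cols, pred_rows, truth_rows) -> float:
    """Average column-level and row-level F1 of the executed results."""
    return (f1_score(set(pred_cols), set(truth_cols)) +
            f1_score(set(pred_rows), set(truth_rows))) / 2
```

For the example above, the generated query keeps 2 of the 3 ground-truth columns (column-level F1 = 0.8) and, by dropping the date filter, also over-retrieves rows, pulling both components of the score down.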

4. Execution, Mathematical/Rule-Based, without Ground Truth Metrics

i. Performance Metrics

SQL queries can vary significantly in execution time, memory consumption, and CPU usage, impacting system performance and database efficiency. Measuring resource utilization during SQL execution helps identify bottlenecks, optimize query performance, and improve NLQ-to-SQL model-generated queries.

This method captures key system metrics:

  • Execution Time (s): The time taken to execute the query.
  • Peak Memory Used (MB): The highest memory consumption during execution.
  • CPU Time Used (s): The total CPU processing time for the query.
  • Disk I/O Read/Write (MB): The amount of data read from or written to disk.
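These measurements can be captured around the query execution call. Below is a minimal sketch using only the Python standard library; disk I/O counters would need psutil or OS-level statistics, so they are omitted here.

```python
import sqlite3
import time
import tracemalloc

def profile_query(conn: sqlite3.Connection, sql: str) -> dict:
    """Measure wall-clock time, CPU time, and peak Python-side memory
    for a single query execution."""
    tracemalloc.start()
    wall0, cpu0 = time.perf_counter(), time.process_time()
    rows = conn.execute(sql).fetchall()
    wall1, cpu1 = time.perf_counter(), time.process_time()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"rows": len(rows),
            "Execution Time (s)": wall1 - wall0,
            "CPU Time (s)": cpu1 - cpu0,
            "Peak Memory (MB)": peak / 1e6}
```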

Example:

Generated SQL:

SELECT customer_id, SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
GROUP BY customer_id;

Performance Metrics:

Execution Time: 0.52 s
Peak Memory: 15.3 MB
CPU Time: 0.25 s
Disk I/O Read: 1.2 MB
Disk I/O Write: 0.3 MB

Conclusion

In conclusion, building an NLQ-to-SQL system isn’t just about translating natural language into syntactically correct SQL — it’s about ensuring the generated queries are truly meaningful and aligned with user intent. That’s why evaluation and governance must be treated as first-class citizens in the NLQ-to-SQL development process.

In this article, we explored the motivation for evaluating these systems and introduced a set of practical metrics — covering aspects like entity recognition, semantic correctness, data retrieval, and query complexity. These metrics provide a foundation for structured assessment, from early prototyping to production deployment.

In the NLQ-to-SQL Evaluation: A Hands-On Guide article, I dive deep into the implementation details: how to calculate each metric to improve and govern your NLQ-to-SQL pipeline with confidence.

I’m an AI Engineer focused on multi-agent systems, enterprise NLP workflows, and AI governance. If you’d like to connect or explore collaboration opportunities, feel free to reach out on LinkedIn!


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.