
NLQ-to-SQL Evaluation: The Metrics That Matter

Author(s): Tiyasa Mukherjee

Originally published on Towards AI.

This article explores how a typical Natural Language to SQL (NLQ-to-SQL) pipeline works, why its evaluation is critical, and introduces key metrics — including LLM-based, rule-based, and mathematical approaches — to measure its accuracy and effectiveness.

Natural Language to SQL (NLQ-to-SQL) systems are becoming increasingly popular for making data access more natural and user-friendly. They allow users to ask questions in plain English and receive instant answers without writing any SQL. This has huge potential to reduce the dependency on data teams and make insights more accessible across an organization.

Large language models (LLMs) have made it easier than ever to build NLQ-to-SQL systems. But one crucial question remains: How can we be confident that the generated SQL queries truly reflect the user’s intent and retrieve the right data?

Just as DJs tune sound, NLQ-to-SQL systems need metric-based tuning for accurate results [Image courtesy: Ideogram AI]

That’s why evaluation is critical — not just in production, but from the earliest prototyping stages. To move beyond human validation, we need measurable ways to assess whether the system is doing its job.

I’ve gathered a practical set of metrics that can help you evaluate NLQ-to-SQL pipelines more effectively. These metrics fall into four categories:

  • Non-execution, LLM-as-a-Judge, with Ground Truth Metrics
  • Non-execution, Mathematical/Rule-Based, without Ground Truth Metrics
  • Execution, Mathematical/Rule-Based, with Ground Truth Metrics
  • Execution, Mathematical/Rule-Based, without Ground Truth Metrics

In this article, I will introduce the metrics that matter, starting with how NLQ-to-SQL pipelines use LLMs to bridge the gap between natural language and structured data.

NLQ-to-SQL: Bridging Language and Data with LLMs

Imagine asking a database a question in plain English and instantly getting the answer. That’s the transformative promise of Natural Language to SQL (NLQ-to-SQL) pipelines.

A sample database schema (Made with drawSQL)

A typical NLQ-to-SQL pipeline powered by LLMs involves:

  • Natural Language Understanding (NLU): We use an LLM to analyze the natural language query, understand user intent, and extract key entities. We then retrieve the schemas of the relevant tables, setting the stage for SQL generation.
NLQ: Which payment method was used the most in the last quarter?
Extracted entities: Table: orders | Column: payment_method, COUNT(order_id) | Condition: order_date >= DATE_FORMAT(NOW() - INTERVAL 3 MONTH, '%Y-%m-01') | Aggregation: COUNT(order_id) GROUP BY payment_method ORDER BY COUNT(order_id) DESC LIMIT 1
  • Prompt Tuning: We tune the prompt for the LLM, preferably with few-shot domain examples and clear instructions.
(Sample Zero-Shot Prompt)

Generate a SQL query based on the provided extracted entities, user query, and table schemas.

### Input Parameters:
- Extracted Entities (e.g., tables, columns, conditions, aggregations):
{entities}

- User Query:
{user_query}

- Relevant Table Schemas:
{filtered_table_schemas_text}

### Requirements:
1. Generate a valid SQL query that accurately fulfills the user's request.
2. Use proper SQL syntax according to the schema structure.
3. Include necessary joins if multiple tables are involved.
4. Apply appropriate filtering conditions based on the extracted entities.
5. Use aggregations (SUM, COUNT, AVG, etc.) if required in the user query.
6. Ensure column names match the schema while avoiding unnecessary computations.
7. Format the SQL query for readability.

Output:
Provide the SQL query as a properly formatted code block.
  • SQL Generation: We combine the gathered information (the user query, detected intent, and extracted entities) with the mapped tables and filtered schema information, such as table and column descriptions. We then compile this metadata into the prompt and call the LLM for SQL generation.
Which payment method was used the most in the last quarter?
SELECT payment_method
FROM orders
WHERE order_date >= DATE_FORMAT(NOW() - INTERVAL 3 MONTH, '%Y-%m-01')
GROUP BY payment_method
ORDER BY COUNT(order_id) DESC
LIMIT 1;
  • SQL Execution: The generated SQL statement is executed against the database, completing the text-to-SQL conversion process.
UPI
  • Natural Language Output: The SQL output is passed to the LLM to generate an answer in a more human-readable format, thereby improving the user experience.
UPI was the most frequently used payment method by our customers in the last quarter.
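The five steps above can be sketched as a single orchestration function. This is a minimal outline, not the author's implementation: `llm` stands in for any prompt-to-text callable (an OpenAI client, a local model, etc.), and the prompts are compressed placeholders for the full templates shown above.

```python
import sqlite3

def run_pipeline(nlq: str, llm, conn: sqlite3.Connection) -> str:
    """Run the NLQ-to-SQL steps: NLU, prompt + SQL generation,
    execution, and natural-language answer generation."""
    # 1. NLU: extract entities (tables, columns, conditions, aggregations).
    entities = llm(f"Extract tables/columns/conditions/aggregations from: {nlq}")
    # 2-3. Prompt the LLM with the query, entities, and (elided) table schemas.
    sql = llm(f"Generate SQL for: {nlq}\nEntities: {entities}")
    # 4. Execute the generated SQL against the database.
    rows = conn.execute(sql).fetchall()
    # 5. Turn the raw result into a human-readable answer.
    return llm(f"Answer '{nlq}' in one sentence, given result rows: {rows}")
```

In practice, each LLM call would also carry the filtered table schemas and few-shot examples from the prompt template above.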

Key Metrics: Measuring the Success of Your NLQ-to-SQL Pipeline

To truly understand how well an NLQ-to-SQL system is performing, we need to go beyond surface-level validation. In this section, I’ll walk you through six metrics that help evaluate both the correctness and the efficiency of generated SQL queries.

1. Non-execution, LLM-as-a-Judge, with Ground Truth Metrics

i. Entity Recognition Score:

This metric is calculated as the average F1 score across entity types, such as tables, columns, conditions, and aggregate functions, by checking whether the entities extracted from the generated SQL query match those of the ground truth SQL query.

The Entity Recognition Score ranges between [0–1]. A score of 1 means that the entities extracted from the generated SQL and the ground truth SQL are exactly the same.

We can set a threshold value, such as 0.9, and if the score for any system falls below this threshold, mismatches in tables, columns, conditions, or aggregate functions in the generated SQL will be flagged early, allowing for intervention and refinement of the NLQ-to-SQL system.

Example: Entity Recognition Score = 0.75

NLQ:

How many orders were placed in the last month?

Ground Truth SQL:

SELECT COUNT(order_id) AS total_orders
FROM orders
WHERE order_date >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01');

Entities from Ground Truth SQL:

Table: orders | Columns: order_id | Condition: order_date >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01') | Aggregation: COUNT(order_id)

Generated SQL:

SELECT SUM(order_id) AS total_orders
FROM orders
WHERE order_date >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01');

Entities from Generated SQL:

Table: orders | Columns: order_id | Condition: order_date >= DATE_FORMAT(NOW() - INTERVAL 1 MONTH, '%Y-%m-01') | Aggregation: SUM(order_id)

Entity Recognition Score:
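The 0.75 above can be reproduced with a short sketch: compute an exact-match F1 per entity type and average across the four types. Here the table, column, and condition entities match while the aggregation does not (COUNT(order_id) became SUM(order_id)), so three types score 1 and one scores 0. The set-based comparison is an assumption about how entities are matched.

```python
def f1(pred: set, truth: set) -> float:
    """Exact-match F1 between two entity sets of one type."""
    if not pred and not truth:
        return 1.0
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def entity_recognition_score(pred: dict, truth: dict) -> float:
    """Average F1 across entity types (tables, columns, conditions, aggregations)."""
    types = set(pred) | set(truth)
    return sum(f1(set(pred.get(t, ())), set(truth.get(t, ()))) for t in types) / len(types)
```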

ii. Semantic Equivalence Score

The Semantic Equivalence Score uses an LLM to evaluate whether two SQL queries produce the same logical outcome. This metric ensures that NLQ-to-SQL models generate functionally correct queries, verifying that variations in phrasing, ordering, or formatting do not affect the final results.

The value of the Semantic Equivalence Score ranges between [0–1]. A score of 1 indicates that the Generated SQL and Ground Truth SQL are logically identical, meaning they retrieve the same results under all conditions.

We can set a threshold value, such as 0.9, and if the score for any system falls below this threshold, it signals a potential mismatch in query logic. This early flagging helps detect errors in query structure, joins, filtering conditions, and aggregations, allowing for intervention and refinement in the NLQ-to-SQL pipeline.

Example: Semantic Equivalence Score = 0 (Generated SQL misses the date filter (last month) present in ground truth)

NLQ:

Show the names of customers who made purchases in the last month.

Ground Truth SQL:

SELECT customer_name FROM orders WHERE purchase_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH);

Generated SQL:

SELECT customer_name FROM customers;
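The article does not show the judge prompt itself, so the sketch below is an assumed minimal shape: build a prompt asking an LLM whether the two queries are logically identical, then map its answer to a 0/1 score. The actual LLM call is left out.

```python
# Hypothetical judge prompt; wording is an assumption, not the author's template.
JUDGE_TEMPLATE = """You are a SQL expert. Do these two queries return the same
results under all database states? Answer EQUIVALENT or NOT_EQUIVALENT only.

Query A: {generated}
Query B: {ground_truth}"""

def build_judge_prompt(generated_sql: str, ground_truth_sql: str) -> str:
    return JUDGE_TEMPLATE.format(generated=generated_sql, ground_truth=ground_truth_sql)

def parse_verdict(llm_response: str) -> int:
    """Map the judge's free-text answer to a 0/1 Semantic Equivalence Score."""
    text = llm_response.upper()
    if "NOT_EQUIVALENT" in text:   # check the negative form first: it contains "EQUIVALENT"
        return 0
    return 1 if "EQUIVALENT" in text else 0
```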

2. Non-execution, Mathematical/Rule-Based, without Ground Truth Metrics

i. Halstead Complexity Scores

The Halstead Complexity Score measures the complexity of a piece of code by analyzing the number of distinct operators and operands. In the context of NLQ-to-SQL, we can apply this metric to the generated SQL queries to understand their complexity.

An ideal SQL query should balance clarity and efficiency, ensuring it is neither too simplistic nor overly complex. Queries with high effort or difficulty scores may indicate poor optimization and should be reviewed for refactoring. Leveraging these scores can help improve NLQ-to-SQL model performance, enhance query generation quality, and drive better database efficiency.

Halstead complexity formulas (image from Wikipedia)

Example:

NLQ:

How many orders has each customer placed since January 1, 2024?

Generated SQL:

SELECT customer_id, COUNT(order_id) FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id;

Halstead Complexity Scores:

{'Vocabulary (n)': 13,
'Length (N)': 13,
'Volume (V)': 48.105716335834195,
'Difficulty (D)': 2.0,
'Effort (E)': 96.21143267166839,
'Estimated Errors': 0.016035238778611398}
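The figures above follow the standard Halstead formulas: vocabulary n = n1 + n2, length N = N1 + N2, volume V = N * log2(n), difficulty D = (n1 / 2) * (N2 / n2), effort E = D * V, and estimated errors V / 3000. The counts used below (4 distinct operators and 9 distinct operands, each appearing once) are an assumption chosen so the sketch reproduces the scores shown; a real implementation would tokenize the SQL first.

```python
import math

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead metrics from operator/operand counts:
    n1/n2 = distinct operators/operands, N1/N2 = total occurrences."""
    n, N = n1 + n2, N1 + N2
    V = N * math.log2(n)        # volume
    D = (n1 / 2) * (N2 / n2)    # difficulty
    return {"Vocabulary (n)": n, "Length (N)": N, "Volume (V)": V,
            "Difficulty (D)": D, "Effort (E)": D * V,
            "Estimated Errors": V / 3000}
```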

ii. SQL Injection Checking

SQL Injection (SQLi) detection is a crucial security measure that identifies malicious patterns in SQL queries to prevent unauthorized data access, modification, or deletion. This method scans queries for common injection patterns, such as DROP, DELETE, and UPDATE statements.

Example:

Safe Query:

SELECT name, email FROM users WHERE id = 5;

Injection (DELETION) Query:

DELETE FROM orders WHERE order_id = 101;
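A rule-based check of this kind can be sketched as a keyword/pattern scan over the generated SQL. The pattern list below is illustrative, not exhaustive; real systems typically also enforce a read-only database role as a backstop.

```python
import re

# Statement types an NLQ-to-SQL system should normally never emit,
# plus stacked statements (";...;") and inline comments ("--").
_SUSPICIOUS = re.compile(
    r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE|GRANT)\b|;.*;|--",
    re.IGNORECASE,
)

def is_suspicious(sql: str) -> bool:
    """Flag queries containing write/DDL keywords or stacked statements."""
    return bool(_SUSPICIOUS.search(sql))
```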

3. Execution, Mathematical/Rule-Based, with Ground Truth Metrics

i. Data Retrieval Accuracy

Data Retrieval Accuracy uses a DataCompy-based score to evaluate how well the generated SQL query retrieves the same data records and structure as the ground truth SQL query. It compares the response (the executed generated-SQL results) with the reference (the executed ground-truth-SQL results) and computes precision and recall across both rows and columns.

Finally, it uses the F1 score, the harmonic mean of precision and recall, to provide a balanced assessment of retrieval accuracy.

A threshold of 0.9 can be set — if the retrieval accuracy falls below this, it indicates a significant mismatch in the retrieved data structure, requiring intervention to improve query generation quality.

Example:

NLQ:

Show all orders placed in March 2024 with order ID, customer ID, and total amount.

Ground Truth SQL:

SELECT order_id, customer_id, total_amount
FROM orders
WHERE order_date BETWEEN '2024-03-01' AND '2024-03-31';

Retrieved Data from Ground Truth SQL:

Generated SQL:

SELECT order_id, total_amount
FROM orders;

Retrieved Data from Generated SQL:

Data Retrieval Accuracy:
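The article computes this comparison with DataCompy on the executed result sets; the pure-Python sketch below shows the same precision/recall/F1 idea on column names and row tuples. Averaging the column-level and row-level F1 is my assumption about the aggregation, and the row data in the usage note is hypothetical.

```python
def f1_score(pred: set, truth: set) -> float:
    """F1 of predicted vs. ground-truth items (column names or row tuples)."""
    if not pred or not truth:
        return 1.0 if pred == truth else 0.0
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(truth)
    return 2 * p * r / (p + r)

def data_retrieval_accuracy(pred_cols, truth_cols, pred_rows, truth_rows) -> float:
    """Average column-level and row-level F1 of the executed results."""
    return (f1_score(set(pred_cols), set(truth_cols)) +
            f1_score(set(pred_rows), set(truth_rows))) / 2
```

For the example above, the generated query keeps 2 of the 3 ground-truth columns (column-level F1 = 0.8) and, by dropping the date filter, also over-retrieves rows, pulling both components of the score down.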

4. Execution, Mathematical/Rule-Based, without Ground Truth Metrics

i. Performance Metrics

SQL queries can vary significantly in execution time, memory consumption, and CPU usage, impacting system performance and database efficiency. Measuring resource utilization during SQL execution helps identify bottlenecks, optimize query performance, and improve NLQ-to-SQL model-generated queries.

This method captures key system metrics:

  • Execution Time (s): The time taken to execute the query.
  • Peak Memory Used (MB): The highest memory consumption during execution.
  • CPU Time Used (s): The total CPU processing time for the query.
  • Disk I/O Read/Write (MB): The amount of data read from or written to disk.
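These measurements can be captured around the query execution call. Below is a minimal sketch using only the Python standard library; disk I/O counters would need psutil or OS-level statistics, so they are omitted here.

```python
import sqlite3
import time
import tracemalloc

def profile_query(conn: sqlite3.Connection, sql: str) -> dict:
    """Measure wall-clock time, CPU time, and peak Python-side memory
    for a single query execution."""
    tracemalloc.start()
    wall0, cpu0 = time.perf_counter(), time.process_time()
    rows = conn.execute(sql).fetchall()
    wall1, cpu1 = time.perf_counter(), time.process_time()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"rows": len(rows),
            "Execution Time (s)": wall1 - wall0,
            "CPU Time (s)": cpu1 - cpu0,
            "Peak Memory (MB)": peak / 1e6}
```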

Example:

Generated SQL:

SELECT customer_id, SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
GROUP BY customer_id;

Performance Metrics:

Execution Time: 0.52 s
Peak Memory: 15.3 MB
CPU Time: 0.25 s
Disk I/O Read: 1.2 MB
Disk I/O Write: 0.3 MB

Conclusion

In conclusion, building an NLQ-to-SQL system isn’t just about translating natural language into syntactically correct SQL — it’s about ensuring the generated queries are truly meaningful and aligned with user intent. That’s why evaluation and governance must be treated as first-class citizens in the NLQ-to-SQL development process.

In this article, we explored the motivation for evaluating these systems and introduced a set of practical metrics — covering aspects like entity recognition, semantic correctness, data retrieval, and query complexity. These metrics provide a foundation for structured assessment, from early prototyping to production deployment.

In the NLQ-to-SQL Evaluation: A Hands-On Guide article, I dive deep into the implementation details: how to calculate each metric to improve and govern your NLQ-to-SQL pipeline with confidence.

I’m an AI Engineer focused on multi-agent systems, enterprise NLP workflows, and AI governance. If you’d like to connect or explore collaboration opportunities, feel free to reach out on LinkedIn!


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.