
Scaling LLM Evaluation

Last Updated on April 28, 2025 by Editorial Team

Author(s): Nadav Barak

Originally published on Towards AI.

Photo by Jungwoo Hong on Unsplash.

Large Language Models (LLMs) are transforming machine learning, powering applications like chatbots, RAG, and autonomous agents. But building with LLMs comes with a major hurdle: their output is evaluated either manually, which is costly and slow, or through crude automation that is inconsistent, short on detail, and inaccurate. Every pipeline tweak demands re-annotation, eating up time and resources. This post breaks down a step-by-step playbook for building an automated evaluation pipeline that is consistent, explainable, and trustworthy.

For a hands-on tutorial, refer to the accompanying workshop, which includes a template notebook and data you can use as a starting point.

Define Evaluation Criteria

To evaluate machine learning models, we rely on a representative evaluation set (either a validation or a test set). In classical machine learning, we feed each sample in the evaluation set to our model and can automatically score the correctness of each output, which gives us the model's overall performance. This does not apply when dealing with generated text.

Evaluating the output of an LLM-based application has many facets and cannot be captured by a single correctness metric. For example, consider a generated summary that contains all the key points but lacks fluency and is very difficult to read. Alternatively, consider a question-answering pair where the output, although relevant to the question, is not factually correct.

New tools are needed. Photo by Glenn Carstens-Peters on Unsplash.

The first step in evaluating generated text is to determine the key evaluation criteria relevant to our use case. It is recommended to combine criteria that address text quality with criteria that address task fulfillment. For instance, in one paper on evaluating text summarization, the authors used Coherence, Consistency, Fluency, and Relevance as their evaluation criteria.

The second step is to determine the secondary conditions we want the generated text to uphold. For example, consider a question-answering pair in which the answer is correct but uses toxic language, discloses private information, or enthusiastically recommends your competitors.

Once we have our evaluation criteria in place, we want to build an automated pipeline that can evaluate each criterion for a given sample and lets us draw actionable conclusions from the results.

Understand what the relevant criteria are. Photo by Volodymyr Hryshchenko on Unsplash.
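As a concrete illustration, the criteria and secondary conditions can be captured in a small, explicit structure that the rest of the pipeline consumes. This is only a sketch: the criterion names and rubric wording below are example choices for a summarization use case, not a fixed list.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One evaluation criterion with a short rubric for its evaluator."""
    name: str
    description: str
    is_secondary: bool = False  # secondary conditions (guardrails) vs. primary quality criteria

# Illustrative criteria for a summarization use case.
CRITERIA = [
    Criterion("coherence", "The summary is well-structured and ideas flow logically."),
    Criterion("consistency", "Every claim in the summary is supported by the source article."),
    Criterion("fluency", "The summary is grammatical and easy to read."),
    Criterion("relevance", "The summary covers the key points of the source article."),
    Criterion("toxicity", "The summary contains no toxic or offensive language.", is_secondary=True),
]
```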

Divide and Conquer

To understand why a more complex solution is necessary, let's start with the downsides of the naive approach: using a single LLM to evaluate all criteria. The core problem is oversimplification. When one judge is tasked with evaluating multiple dimensions, some criteria don't get the attention they deserve, which weakens overall performance. Another key issue is consistency: in different runs, the LLM may shift its focus across criteria, leading to uneven and unreliable results.

This is where divide and conquer comes in. Evaluating each criterion separately ensures it gets the attention it deserves, along with a tailored approach to improve both accuracy and consistency. It also enables more meaningful analysis — separate scores for each criterion make root cause analysis easier and comparisons between versions more insightful.
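A minimal sketch of the divide-and-conquer idea: each criterion gets its own evaluator (any callable that returns a score), and the pipeline runs them independently so every sample ends up with a separate, explainable score per criterion. The evaluator functions referenced in the comments are hypothetical placeholders you would swap for the approaches described below.

```python
from typing import Callable

# Each evaluator receives one sample (inputs plus the generated output) and
# returns a score in [0, 1] for its single criterion.
Evaluator = Callable[[dict], float]

def evaluate_sample(sample: dict, evaluators: dict[str, Evaluator]) -> dict[str, float]:
    """Run every criterion evaluator separately and return per-criterion scores."""
    return {criterion: evaluate(sample) for criterion, evaluate in evaluators.items()}

def evaluate_dataset(samples: list[dict], evaluators: dict[str, Evaluator]) -> list[dict[str, float]]:
    """Per-criterion scores for every sample; aggregation happens in a later step."""
    return [evaluate_sample(sample, evaluators) for sample in samples]

# Hypothetical wiring, one dedicated evaluator per criterion:
# evaluators = {
#     "fluency": fluency_classifier,        # e.g. a small fine-tuned model (see Model Evaluator)
#     "consistency": llm_factuality_judge,  # e.g. an LLM judge with reference material
# }
```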

Criterion Evaluation

Start by picking a small, focused set of examples to test your evaluator. Make sure the negative cases are clearly negative for the specific criterion you’re targeting — like a sample containing hallucinations for factual accuracy.

Begin with a basic prompt as your baseline and define a clear KPI threshold that is good enough for your use case. This gives you a reference point for how complex the task is — no need to spend a lot of effort building a car for a 5 km walk. From there, experiment with refinements like Chain-of-Thought, Few-Shot Learning, a better model, or a multi-step process. Continuously track changes in performance to see what actually moves the needle.
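One way to keep score while iterating is sketched below. It assumes you have a small labelled set with a binary expert label per criterion (1 = passes the criterion, 0 = fails) and an evaluator that returns a score in [0, 1]; the 0.5 threshold and accuracy KPI are simple defaults, not requirements.

```python
def evaluator_accuracy(evaluator, labelled_examples):
    """Fraction of labelled examples on which the evaluator agrees with the expert label.

    labelled_examples: list of (sample, expected_label) pairs with expected_label in {0, 1}.
    The evaluator returns a score in [0, 1]; we threshold at 0.5 for this simple KPI.
    """
    hits = sum(
        int((evaluator(sample) >= 0.5) == bool(expected))
        for sample, expected in labelled_examples
    )
    return hits / len(labelled_examples)

# Compare a baseline prompt against a refinement on the same focused test set:
# baseline_acc = evaluator_accuracy(baseline_judge, labelled_examples)
# refined_acc = evaluator_accuracy(cot_few_shot_judge, labelled_examples)
```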

| Approach | Advantages | Disadvantages | Ideal Use Cases |
|---------------------|-------------------------------------|--------------------------------------------|----------------------------------------|
| Model Evaluator | Fast, cost-effective, consistent | Requires a trained model; limited semantic understanding | Toxicity, Fluency, Input Safety |
| LLM Evaluator | Flexible, strong text understanding | Higher cost, inconsistent | Nuanced Classification, Completeness |
| Multi-Step Process | High accuracy on complex tasks | More setup, higher cost and latency | Hallucination, Content Coverage |

Model Evaluator

Criterion evaluation is fundamentally a classification task, which makes BERT-based classification models a strong fit for certain criteria. They're faster and more cost-effective than LLMs, more consistent, and their specialization can give them a performance edge on specific tasks (see example here).

Since training these models can be a significant overhead, I recommend using them only for relatively common criteria that already have high-performing, pre-trained models available on Hugging Face. Some widely used ones I’d suggest are Toxicity, Fluency, and Input Safety.
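For example, a pre-trained toxicity classifier from the Hugging Face Hub can be wrapped as a criterion evaluator in a few lines. A hedged sketch: the checkpoint name and its label scheme are assumptions, and any well-performing toxicity model would slot in the same way.

```python
from transformers import pipeline

# Assumed checkpoint; swap in any toxicity classifier from the Hub.
toxicity_classifier = pipeline("text-classification", model="unitary/toxic-bert")

def toxicity_evaluator(sample: dict) -> float:
    """Score in [0, 1]; higher means the generated output looks safer (less toxic)."""
    result = toxicity_classifier(sample["output"], truncation=True)[0]
    # This checkpoint's labels all name problematic categories, so a confident top
    # prediction means the text was flagged; adjust if your checkpoint uses a
    # "non-toxic" / "neutral" label instead.
    return 1.0 - result["score"]
```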

LLM Evaluator

LLMs are trained to generate natural language, but their strong understanding of text also makes them effective for classification tasks. To get the best results when using an LLM as a single-criterion evaluator, it’s crucial to give it just enough information — no more, no less. For example, if you want it to judge the factual accuracy of an output, you need to provide both the output’s claims and relevant reference material. Techniques like prompt engineering, Chain-Of-Thought reasoning, and Few-Shot Learning can further improve performance.
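A minimal single-criterion LLM judge might look like the sketch below. The prompt wording, the 0-10 scoring scale, and the model name are illustrative assumptions; the key point is that the judge receives only what it needs, namely the generated answer and the reference material.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-completion provider works similarly

JUDGE_PROMPT = """You are evaluating a single criterion: factual accuracy.

Reference material:
{reference}

Generated answer:
{answer}

List any claims in the answer that are not supported by the reference material,
then finish with a line of the form "SCORE: <0-10>" where 10 means fully supported.
"""

def factual_accuracy_judge(sample: dict, model: str = "gpt-4o-mini") -> float:
    """Return a factual-accuracy score in [0, 1] for one sample."""
    prompt = JUDGE_PROMPT.format(reference=sample["reference"], answer=sample["output"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as the API allows
    )
    text = response.choices[0].message.content
    score_line = [line for line in text.splitlines() if line.strip().startswith("SCORE:")][-1]
    return float(score_line.split(":")[1].strip()) / 10.0
```

Asking for the unsupported claims before the score is a lightweight Chain-of-Thought step; the listed claims also double as an explanation you can surface during root cause analysis.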

A Mixture of Experts (MoE) setup can boost both the performance and consistency of LLM-based evaluators, but it increases latency and cost. As a result, it should be used selectively — only when a single judge is falling short and the benefits outweigh the trade-offs.
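In its simplest form this is a small panel of independent judges whose scores are combined; a hedged sketch, assuming each judge follows the evaluator interface above:

```python
import statistics

def panel_score(sample: dict, judges: list) -> float:
    """Average the scores of several independent single-criterion judges.

    Different models with the same prompt, or the same model with different
    prompts, both count as separate judges here.
    """
    return statistics.mean(judge(sample) for judge in judges)
```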

Multi-Step Process

Some tasks are complex, and just like in car manufacturing — where different parts are assembled separately before the final product is put together — it’s often more effective to divide the work into smaller steps. This same approach applies to evaluating complex criteria.

Take the task of judging whether a summary includes all the key information from an original article. To do this well, you’d first need to read through the article — possibly in chunks if it’s long — and identify the main ideas or essential facts throughout. Once those points are gathered, you’d compile them into a clear list of what truly matters in the original text. Then, you’d go back and compare the summary against that list to determine what’s included and what’s missing. Each of these steps carries its own complexity, and trying to handle them all at once would lead to mistakes or overlooked details.
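A sketch of that multi-step flow is below. It reuses the same chat-completion client as the judge sketch above, and the prompts, chunk size, and YES/NO check are assumptions chosen for brevity rather than a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()  # same assumed client as in the LLM judge sketch

def extract_key_points(article: str, model: str = "gpt-4o-mini", chunk_size: int = 4000) -> list[str]:
    """Step 1: collect key points, chunk by chunk, from the original article."""
    key_points = []
    for start in range(0, len(article), chunk_size):
        chunk = article[start:start + chunk_size]
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user",
                       "content": "List the essential facts in this text, one per line:\n\n" + chunk}],
        )
        lines = response.choices[0].message.content.splitlines()
        key_points += [line.strip("- ").strip() for line in lines if line.strip()]
    return key_points

def coverage_score(summary: str, key_points: list[str], model: str = "gpt-4o-mini") -> float:
    """Step 2: check each key point against the summary and return the covered fraction."""
    covered = 0
    for point in key_points:
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user",
                       "content": f"Does the following summary contain this fact? Answer YES or NO.\n\n"
                                  f"Fact: {point}\n\nSummary:\n{summary}"}],
        )
        covered += response.choices[0].message.content.strip().upper().startswith("YES")
    return covered / len(key_points) if key_points else 1.0
```

Each step is simple enough to test and refine on its own, which is exactly the point of splitting the work.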

Know your toolbox and select the right tool for the job. Source.

Criteria Scores Aggregation

Choosing the right aggregation method is highly use-case specific, yet there are three rule-of-thumb guidelines I recommend following:

  • Learn from the experts: Ask domain experts to annotate a set of samples for each key evaluation criterion as well as to provide an overall annotation. From this set of samples, you can gain valuable insights to guide you when designing your aggregation method.
  • Know when you do not know: Some examples are hard to annotate automatically and are best evaluated manually. As such, the aggregation method should be able to return an ‘NA’ score or confidence bounds for some examples rather than a specific score.
  • Robustness: The quality of scores from different criteria may vary, so it is important to construct the aggregation method in a way that small changes won’t significantly affect the outcome.
Build your evaluation dream team. Photo by Pascal Swier on Unsplash.
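To make these guidelines concrete, here is a minimal aggregation sketch. The weights, uncertainty band, and score ranges are placeholder assumptions you would fit to the expert annotations from the first guideline: it returns None instead of a number when any criterion score is missing or sits too close to the decision boundary to be trusted, and uses a bounded weighted mean so no single noisy criterion can swing the result.

```python
from typing import Optional

def aggregate_scores(
    criterion_scores: dict[str, Optional[float]],
    weights: dict[str, float],
    uncertainty_band: tuple[float, float] = (0.4, 0.6),
) -> Optional[float]:
    """Weighted mean of per-criterion scores, or None when the sample should go to a human.

    Scores and weights are keyed by criterion name; scores are expected in [0, 1].
    """
    for score in criterion_scores.values():
        # "Know when you do not know": missing or ambiguous criterion scores
        # send the sample to manual review instead of forcing a number.
        if score is None or uncertainty_band[0] <= score <= uncertainty_band[1]:
            return None
    total_weight = sum(weights[name] for name in criterion_scores)
    return sum(score * weights[name] for name, score in criterion_scores.items()) / total_weight
```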

Conclusion

Automatically evaluating your LLM-based application can provide enormous value, but it does not come cheap. In this post, we described how to build an effective multi-step evaluation pipeline and its advantages over crude methods.

There are numerous elements within the evaluation process that can be optimized. However, before investing valuable time in each of them, I recommend adopting a telescopic approach. Start with a basic implementation that includes what you believe are the essential components, test it, and then concentrate on optimizing the elements that could have a significant impact.

Nadav Barak is Head of AI at Deepchecks, a start-up that arms organizations with tools to evaluate and monitor their machine-learning-based systems. Nadav has a rich background in data science and is a domain expert in building and improving generative NLP applications.


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.