
Scaling LLM Evaluation

Last Updated on April 28, 2025 by Editorial Team

Author(s): Nadav Barak

Originally published on Towards AI.

Photo by Jungwoo Hong on Unsplash.

Large Language Models (LLMs) are transforming machine learning, powering applications like chatbots, RAG, and autonomous agents. But building with LLMs comes with a major hurdle: their output is evaluated either manually, which is costly and slow, or through crude automation that is inconsistent, short on detail, and inaccurate. Every pipeline tweak demands re-annotation, eating up time and resources. This post breaks down a step-by-step playbook for building an automated evaluation pipeline that is consistent, explainable, and trustworthy.

For a hands-on tutorial, refer to the accompanying workshop, which includes a template notebook and data you can use as a starting point.

Define Evaluation Criteria

To evaluate machine learning models, we rely on a representative evaluation set (either a validation or a test set). In classical machine learning, we feed each sample in the evaluation set to our model and can automatically score the correctness of each output, which gives us the model's overall performance. This does not apply when dealing with generated text.

Evaluating the output of an LLM-based application has many facets and cannot be captured by a single correctness metric. For example, consider a generated summary that contains all the key points but lacks fluency and is very difficult to read. Alternatively, consider a question-answering pair where the output, although relevant to the question, is not factually correct.

New tools are needed. Photo by Glenn Carstens-Peters on Unsplash.

The first step in evaluating generated text is to determine the key evaluation criteria relevant to our use case. It is recommended to combine criteria that address text quality with criteria that address task fulfillment. For instance, in one paper on evaluating text summarization, the authors used Coherence, Consistency, Fluency, and Relevance as their evaluation criteria.

The second step is to determine the secondary conditions we want the generated text to uphold. For example, consider a question-answering pair in which the answer is correct but uses toxic language, discloses private information, or enthusiastically recommends your competitors.

Once we have our evaluation criteria in place, we want to build an automated pipeline that can evaluate each criterion for a given sample and lets us draw actionable conclusions from the results.

Understand what the relevant criteria are. Photo by Volodymyr Hryshchenko on Unsplash.
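As a concrete illustration, the criteria and secondary conditions can be captured in a small, explicit structure that the rest of the pipeline consumes. This is only a sketch: the criterion names and rubric wording below are example choices for a summarization use case, not a fixed list.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One evaluation criterion with a short rubric for its evaluator."""
    name: str
    description: str
    is_secondary: bool = False  # secondary conditions (guardrails) vs. primary quality criteria

# Illustrative criteria for a summarization use case.
CRITERIA = [
    Criterion("coherence", "The summary is well-structured and ideas flow logically."),
    Criterion("consistency", "Every claim in the summary is supported by the source article."),
    Criterion("fluency", "The summary is grammatical and easy to read."),
    Criterion("relevance", "The summary covers the key points of the source article."),
    Criterion("toxicity", "The summary contains no toxic or offensive language.", is_secondary=True),
]
```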

Divide and Conquer

To understand why a more complex solution is necessary, let's start with the downsides of the naive approach: using a single LLM to evaluate all criteria. The core problem is oversimplification. When one judge is tasked with evaluating multiple dimensions, some criteria don't get the attention they deserve, which weakens overall performance. Another key issue is consistency: in different runs, the LLM may shift its focus across criteria, leading to uneven and unreliable results.

This is where divide and conquer comes in. Evaluating each criterion separately ensures it gets the attention it deserves, along with a tailored approach to improve both accuracy and consistency. It also enables more meaningful analysis — separate scores for each criterion make root cause analysis easier and comparisons between versions more insightful.
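A minimal sketch of the divide-and-conquer idea: each criterion gets its own evaluator (any callable that returns a score), and the pipeline runs them independently so every sample ends up with a separate, explainable score per criterion. The evaluator functions referenced in the comments are hypothetical placeholders you would swap for the approaches described below.

```python
from typing import Callable

# Each evaluator receives one sample (inputs plus the generated output) and
# returns a score in [0, 1] for its single criterion.
Evaluator = Callable[[dict], float]

def evaluate_sample(sample: dict, evaluators: dict[str, Evaluator]) -> dict[str, float]:
    """Run every criterion evaluator separately and return per-criterion scores."""
    return {criterion: evaluate(sample) for criterion, evaluate in evaluators.items()}

def evaluate_dataset(samples: list[dict], evaluators: dict[str, Evaluator]) -> list[dict[str, float]]:
    """Per-criterion scores for every sample; aggregation happens in a later step."""
    return [evaluate_sample(sample, evaluators) for sample in samples]

# Hypothetical wiring, one dedicated evaluator per criterion:
# evaluators = {
#     "fluency": fluency_classifier,        # e.g. a small fine-tuned model (see Model Evaluator)
#     "consistency": llm_factuality_judge,  # e.g. an LLM judge with reference material
# }
```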

Criterion Evaluation

Start by picking a small, focused set of examples to test your evaluator. Make sure the negative cases are clearly negative for the specific criterion you’re targeting — like a sample containing hallucinations for factual accuracy.

Begin with a basic prompt as your baseline and define a clear KPI threshold that is good enough for your use case. This gives you a reference point for how complex the task is — no need to spend a lot of effort building a car for a 5 km walk. From there, experiment with refinements like Chain-of-Thought, Few-Shot Learning, a better model, or a multi-step process. Continuously track changes in performance to see what actually moves the needle.
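One way to keep score while iterating is sketched below. It assumes you have a small labelled set with a binary expert label per criterion (1 = passes the criterion, 0 = fails) and an evaluator that returns a score in [0, 1]; the 0.5 threshold and accuracy KPI are simple defaults, not requirements.

```python
def evaluator_accuracy(evaluator, labelled_examples):
    """Fraction of labelled examples on which the evaluator agrees with the expert label.

    labelled_examples: list of (sample, expected_label) pairs with expected_label in {0, 1}.
    The evaluator returns a score in [0, 1]; we threshold at 0.5 for this simple KPI.
    """
    hits = sum(
        int((evaluator(sample) >= 0.5) == bool(expected))
        for sample, expected in labelled_examples
    )
    return hits / len(labelled_examples)

# Compare a baseline prompt against a refinement on the same focused test set:
# baseline_acc = evaluator_accuracy(baseline_judge, labelled_examples)
# refined_acc = evaluator_accuracy(cot_few_shot_judge, labelled_examples)
```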

| Approach | Advantages | Disadvantages | Ideal Use Cases |
|---------------------|-------------------------------------|--------------------------------------------|----------------------------------------|
| Model Evaluator | Fast, cost-effective, consistent | Requires a trained model; limited semantic understanding | Toxicity, Fluency, Input Safety |
| LLM Evaluator | Flexible, strong text understanding | Higher cost, inconsistent | Nuanced Classification, Completeness |
| Multi-Step Process | High accuracy on complex tasks | More setup, higher cost and latency | Hallucination, Content Coverage |

Model Evaluator

Criterion evaluation is fundamentally a classification task, which makes BERT-based classification models a strong fit for certain criteria. They're faster and more cost-effective than LLMs, more consistent, and their specialization can give them a performance edge on specific tasks (see example here).

Since training these models can be a significant overhead, I recommend using them only for relatively common criteria that already have high-performing, pre-trained models available on Hugging Face. Some widely used ones I’d suggest are Toxicity, Fluency, and Input Safety.
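For example, a pre-trained toxicity classifier from the Hugging Face Hub can be wrapped as a criterion evaluator in a few lines. A hedged sketch: the checkpoint name and its label scheme are assumptions, and any well-performing toxicity model would slot in the same way.

```python
from transformers import pipeline

# Assumed checkpoint; swap in any toxicity classifier from the Hub.
toxicity_classifier = pipeline("text-classification", model="unitary/toxic-bert")

def toxicity_evaluator(sample: dict) -> float:
    """Score in [0, 1]; higher means the generated output looks safer (less toxic)."""
    result = toxicity_classifier(sample["output"], truncation=True)[0]
    # This checkpoint's labels all name problematic categories, so a confident top
    # prediction means the text was flagged; adjust if your checkpoint uses a
    # "non-toxic" / "neutral" label instead.
    return 1.0 - result["score"]
```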

LLM Evaluator

LLMs are trained to generate natural language, but their strong understanding of text also makes them effective for classification tasks. To get the best results when using an LLM as a single-criterion evaluator, it’s crucial to give it just enough information — no more, no less. For example, if you want it to judge the factual accuracy of an output, you need to provide both the output’s claims and relevant reference material. Techniques like prompt engineering, Chain-Of-Thought reasoning, and Few-Shot Learning can further improve performance.
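A minimal single-criterion LLM judge might look like the sketch below. The prompt wording, the 0-10 scoring scale, and the model name are illustrative assumptions; the key point is that the judge receives only what it needs, namely the generated answer and the reference material.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-completion provider works similarly

JUDGE_PROMPT = """You are evaluating a single criterion: factual accuracy.

Reference material:
{reference}

Generated answer:
{answer}

List any claims in the answer that are not supported by the reference material,
then finish with a line of the form "SCORE: <0-10>" where 10 means fully supported.
"""

def factual_accuracy_judge(sample: dict, model: str = "gpt-4o-mini") -> float:
    """Return a factual-accuracy score in [0, 1] for one sample."""
    prompt = JUDGE_PROMPT.format(reference=sample["reference"], answer=sample["output"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as the API allows
    )
    text = response.choices[0].message.content
    score_line = [line for line in text.splitlines() if line.strip().startswith("SCORE:")][-1]
    return float(score_line.split(":")[1].strip()) / 10.0
```

Asking for the unsupported claims before the score is a lightweight Chain-of-Thought step; the listed claims also double as an explanation you can surface during root cause analysis.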

A Mixture of Experts (MoE) setup can boost both the performance and consistency of LLM-based evaluators, but it increases latency and cost. As a result, it should be used selectively — only when a single judge is falling short and the benefits outweigh the trade-offs.
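In its simplest form this is a small panel of independent judges whose scores are combined; a hedged sketch, assuming each judge follows the evaluator interface above:

```python
import statistics

def panel_score(sample: dict, judges: list) -> float:
    """Average the scores of several independent single-criterion judges.

    Different models with the same prompt, or the same model with different
    prompts, both count as separate judges here.
    """
    return statistics.mean(judge(sample) for judge in judges)
```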

Multi-Step Process

Some tasks are complex, and just like in car manufacturing — where different parts are assembled separately before the final product is put together — it’s often more effective to divide the work into smaller steps. This same approach applies to evaluating complex criteria.

Take the task of judging whether a summary includes all the key information from an original article. To do this well, you’d first need to read through the article — possibly in chunks if it’s long — and identify the main ideas or essential facts throughout. Once those points are gathered, you’d compile them into a clear list of what truly matters in the original text. Then, you’d go back and compare the summary against that list to determine what’s included and what’s missing. Each of these steps carries its own complexity, and trying to handle them all at once would lead to mistakes or overlooked details.
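A sketch of that multi-step flow is below. It reuses the same chat-completion client as the judge sketch above, and the prompts, chunk size, and YES/NO check are assumptions chosen for brevity rather than a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()  # same assumed client as in the LLM judge sketch

def extract_key_points(article: str, model: str = "gpt-4o-mini", chunk_size: int = 4000) -> list[str]:
    """Step 1: collect key points, chunk by chunk, from the original article."""
    key_points = []
    for start in range(0, len(article), chunk_size):
        chunk = article[start:start + chunk_size]
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user",
                       "content": "List the essential facts in this text, one per line:\n\n" + chunk}],
        )
        lines = response.choices[0].message.content.splitlines()
        key_points += [line.strip("- ").strip() for line in lines if line.strip()]
    return key_points

def coverage_score(summary: str, key_points: list[str], model: str = "gpt-4o-mini") -> float:
    """Step 2: check each key point against the summary and return the covered fraction."""
    covered = 0
    for point in key_points:
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user",
                       "content": f"Does the following summary contain this fact? Answer YES or NO.\n\n"
                                  f"Fact: {point}\n\nSummary:\n{summary}"}],
        )
        covered += response.choices[0].message.content.strip().upper().startswith("YES")
    return covered / len(key_points) if key_points else 1.0
```

Each step is simple enough to test and refine on its own, which is exactly the point of splitting the work.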

Know your toolbox and select the right tool for the job. Source.

Criteria Scores Aggregation

Choosing the right aggregation method is highly use-case specific, yet there are three rule-of-thumb guidelines I recommend following:

  • Learn from the experts: Ask domain experts to annotate a set of samples for each key evaluation criterion as well as to provide an overall annotation. From this set of samples, you can gain valuable insights to guide you when designing your aggregation method.
  • Know when you do not know: Some examples are hard to annotate automatically and are best evaluated manually. As such, the aggregation method should be able to return an ‘NA’ score or confidence bounds for some examples rather than a specific score.
  • Robustness: The quality of scores from different criteria may vary, so it is important to construct the aggregation method in a way that small changes won’t significantly affect the outcome.
Build your evaluation dream team. Photo by Pascal Swier on Unsplash.
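To make these guidelines concrete, here is a minimal aggregation sketch. The weights, uncertainty band, and score ranges are placeholder assumptions you would fit to the expert annotations from the first guideline: it returns None instead of a number when any criterion score is missing or sits too close to the decision boundary to be trusted, and uses a bounded weighted mean so no single noisy criterion can swing the result.

```python
from typing import Optional

def aggregate_scores(
    criterion_scores: dict[str, Optional[float]],
    weights: dict[str, float],
    uncertainty_band: tuple[float, float] = (0.4, 0.6),
) -> Optional[float]:
    """Weighted mean of per-criterion scores, or None when the sample should go to a human.

    Scores and weights are keyed by criterion name; scores are expected in [0, 1].
    """
    for score in criterion_scores.values():
        # "Know when you do not know": missing or ambiguous criterion scores
        # send the sample to manual review instead of forcing a number.
        if score is None or uncertainty_band[0] <= score <= uncertainty_band[1]:
            return None
    total_weight = sum(weights[name] for name in criterion_scores)
    return sum(score * weights[name] for name, score in criterion_scores.items()) / total_weight
```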

Conclusion

Automatically evaluating your LLM-based application can provide enormous value, but it does not come cheap. In this post, we described how to build an effective multi-step evaluation pipeline and its advantages over crude methods.

There are numerous elements within the evaluation process that can be optimized. However, before investing valuable time in each of them, I recommend adopting a telescopic approach. Start with a basic implementation that includes what you believe are the essential components, test it, and then concentrate on optimizing the elements that could have a significant impact.

Nadav Barak is Head of AI at Deepchecks, a start-up that arms organizations with tools to evaluate and monitor their machine-learning-based systems. Nadav has a rich background in data science and is a domain expert in building and improving generative NLP applications.


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.