Evaluating LLM and AI Agent Outputs with String Comparison, Criteria & Trajectory Approaches

Author(s): Michal Zarnecki

Originally published on Towards AI.


When your model’s answers sound convincing, how do you prove they’re actually good?
This article walks through three complementary evaluation strategies — string comparison, criteria-based scoring, and trajectory analysis.

1. String-Comparison Metrics

Consider the question below:

What is the result of 2+2?

There are multiple correct answers:

4
four
the answer is 4
4 is correct answer
2+2=4
2 + 2 = 4
two plus two gives four

Even a single question can have many “correct” answers phrased in different ways. Token-level metrics are the most basic approach, but they still work as a quick-and-dirty sanity check. They were used long before the era of LLMs to evaluate the quality of machine translation in the early 2000s.

1.1 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE answers the question:

How many of the words in the reference also appear in the candidate? (recall-heavy, ideal for summaries)

ROUGE-N was originally built for summarization and measures n-gram overlap:
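
For reference, the standard ROUGE-N definitions are given below (the implementation used later in this article may differ in details such as how duplicate words are counted):

ROUGE-N recall    = (overlapping n-grams) / (n-grams in the reference)
ROUGE-N precision = (overlapping n-grams) / (n-grams in the candidate)
ROUGE-N F1        = 2 * precision * recall / (precision + recall)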

For example:
reference = The quick brown fox jumps quickly over the lazy dog
candidate = The quick brown weasel jumps quickly over the lazy cat

matched unigrams for n = 1, counting each distinct matched word once: the, quick, brown, jumps, quickly, over, lazy → 7
Recall = 7 / 10 = 0.7
Precision = 7 / 10 = 0.7
F1 = 2 * 0.7 * 0.7 / (0.7 + 0.7) = 0.7

1.2 BLEU (Bilingual Evaluation Understudy)

BLEU answers the question:

How many matching phrases of different lengths do the two texts share?

BLEU combines precision over n-grams of several lengths and applies a brevity penalty (BP) that punishes candidates shorter than the reference.
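
For reference, the standard BLEU formula is given below (a textbook formulation, not copied from the implementation used in this article):

BLEU = BP * exp( sum over n = 1..N of w_n * log(p_n) )
BP   = 1 if c > r, otherwise exp(1 - r / c)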

Here c = candidate length, r = reference length, p_n = precision for n-grams of length n, and w_n = the weight assigned to each n-gram order (usually uniform, 1/N).

1.3 METEOR (Metric for Evaluation of Translation with Explicit Ordering)

METEOR answers the question:

Did we match the same ideas — even if the words are stemmed, pluralised, or synonymous?

METEOR aligns unigrams, stems, and synonyms, then balances precision and recall.
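
For reference, the original METEOR formulation is given below; it is consistent with the fMean and penalty fields in the output shown later in this article:

P       = matched unigrams / unigrams in the candidate
R       = matched unigrams / unigrams in the reference
F_mean  = 10 * P * R / (R + 9 * P)
penalty = 0.5 * (chunks / matched unigrams)^3
METEOR  = F_mean * (1 - penalty)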

If you are interested in implementation details, you can find implementations of ROUGE, BLEU and METEOR in my GitHub repository.

Let’s compare two texts using the three metrics presented above.

// assuming the evaluator classes live in the LlmEvaluation namespace, as the output below suggests
use LlmEvaluation\StringComparisonEvaluator;

$evaluator = new StringComparisonEvaluator();
$reference = "that's the way cookie crumbles";
$candidate = "this is the way cookie is crashed";

$results = [
    'ROUGE'  => $evaluator->calculateROUGE($reference, $candidate),
    'BLEU'   => $evaluator->calculateBLEU($reference, $candidate),
    'METEOR' => $evaluator->calculateMETEOR($reference, $candidate),
];

print_r($results);
Array
(
    [ROUGE] => LlmEvaluation\EvaluationResults Object
        (
            [metricName:LlmEvaluation\EvaluationResults:private] => ROUGE
            [results:LlmEvaluation\EvaluationResults:private] => Array
                (
                    [recall] => 0.6
                    [precision] => 0.43
                    [f1] => 0.5
                )

        )

    [BLEU] => LlmEvaluation\EvaluationResults Object
        (
            [metricName:LlmEvaluation\EvaluationResults:private] => BLEU
            [results:LlmEvaluation\EvaluationResults:private] => Array
                (
                    [score] => 0.43
                )

        )

    [METEOR] => LlmEvaluation\EvaluationResults Object
        (
            [metricName:LlmEvaluation\EvaluationResults:private] => METEOR
            [results:LlmEvaluation\EvaluationResults:private] => Array
                (
                    [score] => 0.57
                    [precision] => 0.42857142857143
                    [recall] => 0.6
                    [chunks] => 1
                    [penalty] => 0.018518518518519
                    [fMean] => 0.57692307692308
                )

        )

)

As we can observe in the results above, each of the three metrics produces a different score (ROUGE F1 0.5, BLEU 0.43, METEOR 0.57). It is also hard to say whether a score of 0.5 really captures how related “that's the way cookie crumbles” and “this is the way cookie is crashed” are. Even though these metrics are a blunt instrument for judging LLM-generated text in absolute terms, they remain a cheap and efficient way to track how results change between iterations of an improvement process (for example, after changing the prompt).
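
As a minimal sketch of that iteration-tracking workflow (using only the StringComparisonEvaluator API shown above; the candidate strings are illustrative), you can score the outputs of two prompt versions against the same reference and watch how the metrics move:

$evaluator = new StringComparisonEvaluator();
$reference = "that's the way cookie crumbles";

// illustrative outputs produced by two different prompt versions
$candidateV1 = "this is the way cookie is crashed";
$candidateV2 = "that is the way the cookie crumbles";

// print both score sets and compare them across iterations
print_r($evaluator->calculateROUGE($reference, $candidateV1));
print_r($evaluator->calculateROUGE($reference, $candidateV2));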

2. Criteria-Based Evaluation

When string-comparison metrics are not sufficient, criteria-based evaluation comes in. Token overlap cannot score tone, ethics, or creativity.
The idea behind criteria evaluation is to evaluate results generated by LLM “A” by passing them to LLM “B”, which plays the role of a teacher grading a student's answers. The criteria evaluator feeds the question and answer into an LLM such as GPT or Claude and asks it to rate each trait.

Popular criteria categories include correctness, helpfulness, relevance, conciseness, clarity, factual accuracy, coherence and creativity, as well as safety-oriented traits such as harmfulness, maliciousness, insensitivity, misogyny, criminality and controversiality.

In code, it can look like this (see the full implementation and more examples in the repository):

// placeholders: the question that was asked and the answer produced by the evaluated LLM
$question = '...';
$response = '...';

$prompt = (new CriteriaEvaluatorPromptBuilder())
    ->addClarity()
    ->addCoherence()
    ->addConciseness()
    ->addControversiality()
    ->addCreativity()
    ->addCriminality()
    ->addFactualAccuracy()
    ->addRelevance()
    ->addHarmfulness()
    ->addHelpfulness()
    ->addInsensitivity()
    ->addMaliciousness()
    ->addMisogyny()
    ->addCorrectness()
    ->getEvaluationPrompt($question, $response);

print_r((new GPTCriteriaEvaluator())->evaluate($prompt));
{
"correctness": 5,
"helpfulness": 4,
"relevance": 4,
"conciseness": 5,
"clarity": 4,
"factual_accuracy": 4,
"insensitivity": 5,
"maliciousness": 0,
"harmfulness": 0,
"coherence": 1,
"misogyny": 0,
"criminality": 0,
"controversiality": 0,
"creativity": 1
}

Because each criterion is scored separately, you can weight or threshold what matters for your application — e.g., double-weight factual accuracy in a legal domain.
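
A minimal sketch of such weighting in plain PHP (assuming the judge's scores are available as the JSON object shown above; the weights and the threshold are illustrative):

// decode the criteria scores returned by the judge model (example values from the output above)
$scores = json_decode('{"correctness":5,"helpfulness":4,"relevance":4,"factual_accuracy":4}', true);

// double-weight factual accuracy, e.g. for a legal-domain assistant
$weights = [
    'correctness'      => 1.0,
    'helpfulness'      => 1.0,
    'relevance'        => 1.0,
    'factual_accuracy' => 2.0,
];

$weightedSum = 0.0;
$weightTotal = 0.0;
foreach ($weights as $criterion => $weight) {
    $weightedSum += $weight * ($scores[$criterion] ?? 0);
    $weightTotal += $weight;
}
$weightedScore = $weightedSum / $weightTotal; // still on the judge's rating scale

// flag the answer if the weighted score falls below a chosen threshold
$passed = $weightedScore >= 4.0;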

3. Trajectory Evaluation: Measuring the Reasoning Path

Large-language-model agents plan, search, call tools, and reflect over multiple turns. A single final-answer score can hide shallow guessing. Trajectory evaluation checks whether the multi-step process was correct.

3.1 How It Works

  1. Record a trajectory — prompt/response pairs for every step.
  2. Provide ground-truth triples: agent prompt, agent response, reference answer for each step.
  3. The evaluator computes step-level ROUGE overlap, divergence, and weighted criteria, then aggregates an overall score.

3.2 Illustrative Example

Task: An agent must find the cheapest round-trip flight from New York to Paris in July and estimate total CO₂ emissions.

We store prompts and responses:

$trajectory = [
    ['prompt' => 'List the three cheapest airlines on JFK → CDG for 10 July – 17 July.', 'response' => 'Delta $820, FrenchBee $750, Norse $710.'],
    ['prompt' => 'Give seat-class and aircraft for the Norse flight.', 'response' => 'Economy (LowFare) on Boeing 787-9.'],
    ['prompt' => 'Estimate round-trip CO₂ for one passenger.', 'response' => 'About 1.6 t CO₂'],
];

$ground = [
    ['Norse', '710'],
    ['Economy', '787-9'],
    ['1.4 t CO2'],
];

$evaluator = new TrajectoryEvaluator([
    'factualAccuracy' => 2.0,
    'coherence'       => 1.0,
    'completeness'    => 1.0,
    'harmlessness'    => 1.5,
]);

// Add a trajectory with multiple steps
$evaluator->addTrajectory('flight-planner', $trajectory);
// Add ground truth for evaluation
$evaluator->addGroundTruth('flight-planner', $ground);

// Evaluate all trajectories
$results = $evaluator->evaluateAll();
echo json_encode($results, JSON_PRETTY_PRINT);
{
    "flight-planner": {
        "trajectoryId": "flight-planner",
        "stepScores": [
            {
                "factualAccuracy": 1,
                "relevance": 0,
                "completeness": 0.8,
                "harmlessness": 1
            },
            {
                "factualAccuracy": 1,
                "relevance": 0,
                "completeness": 0.8,
                "harmlessness": 1
            },
            {
                "factualAccuracy": 0,
                "relevance": 0.17,
                "completeness": 0.8,
                "harmlessness": 1
            }
        ],
        "metricScores": {
            "factualAccuracy": 0.67,
            "relevance": 0.06,
            "completeness": 0.8,
            "harmlessness": 1
        },
        "overallScore": 0.66,
        "passed": false,
        "interactionCount": 3
    }
}

What the numbers reveal

  • Step-level accuracy shows that the agent's CO₂ estimate in step 3 (about 1.6 t) diverges from the reference answer (1.4 t), so the final emissions figure is off.
  • The divergence trend shows factual drift growing across steps: a good signal to insert a retrieval call after step 1.
  • The overall score blends the weighted sub-metrics, giving you a pass/fail threshold for automatic regression tests (see the sketch below).
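
As an example, a minimal regression-style gate (assuming evaluateAll() returns the nested array structure mirrored by the JSON output above; the 0.75 threshold is illustrative) could look like this:

$report = $results['flight-planner'];

// fail the automated test run if the trajectory regresses below the chosen bar
if (!$report['passed'] || $report['overallScore'] < 0.75) {
    throw new RuntimeException(sprintf(
        'Trajectory regression: overall score %.2f is below the required threshold.',
        $report['overallScore']
    ));
}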

Why It Matters for Agents

  • Process integrity: Detect shortcuts where the agent guesses the right number without real reasoning.
  • Safety audits: Flag steps that wander into unsafe content even if the final answer looks innocent.
  • Fine-tuning: Visual reports reveal which reasoning sub-skills need data or instruction tweaks.

Putting It All Together

There are many ways to evaluate LLM responses; some representative and popular ones were described in this article. They fall into three major groups:

  • String comparison metrics tell you what overlapped.
  • Criteria scoring tells you how the answer reads.
  • Trajectory analysis tells you how the agent arrived at its answer.

Combine them for a full picture before shipping chatbots, retrieval-augmented search, or autonomous agents into production. Start evaluating your LLM-based projects!

Note: The code examples from this article can also be found in the LLPhant/LLPhant framework repository, in the evaluation module, together with some broader context.
The evaluator classes are also available as a standalone library in my GitHub repository mzarnecki/php-llm-evaluation.


Published via Towards AI

