Evaluating LLM and AI Agent Outputs with String Comparison, Criteria & Trajectory Approaches

Author(s): Michal Zarnecki

Originally published on Towards AI.


When your model’s answers sound convincing, how do you prove they’re actually good?
This article walks through three complementary evaluation strategies — string comparison, criteria-based scoring, and trajectory analysis.

1. String-Comparison Metrics

Consider the question below:

What is the result of 2+2?

There are multiple correct answers:

4
four
the answer is 4
4 is correct answer
2+2=4
2 + 2 = 4
two plus two gives four

Even a single question can have many “correct” answers phrased in different ways. Token-level metrics are the most basic approach, but they still work as a quick-and-dirty sanity check. They were used long before the era of LLMs to evaluate the quality of machine translation in the early 2000s.

1.1 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE answers the question:

How many of the words in the reference also appear in the candidate? (recall-heavy, ideal for summaries)

ROUGE-N was originally built for summarization and measures n-gram overlap:
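
For reference, the standard ROUGE-N definitions are given below (the implementation used later in this article may differ in details such as how duplicate words are counted):

ROUGE-N recall    = (overlapping n-grams) / (n-grams in the reference)
ROUGE-N precision = (overlapping n-grams) / (n-grams in the candidate)
ROUGE-N F1        = 2 * precision * recall / (precision + recall)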

For example:
reference = The quick brown fox jumps quickly over the lazy dog
candidate = The quick brown weasel jumps quickly over the lazy cat

matched unigrams for n = 1, counting each distinct matched word once: the, quick, brown, jumps, quickly, over, lazy → 7
Recall = 7 / 10 = 0.7
Precision = 7 / 10 = 0.7
F1 = 2 * 0.7 * 0.7 / (0.7 + 0.7) = 0.7

1.2 BLEU (Bilingual Evaluation Understudy)

BLEU answers the question:

How many matching phrases of different lengths do the two texts share?

BLEU combines precision over n-grams of several lengths and applies a brevity penalty (BP) that punishes candidates shorter than the reference.
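
For reference, the standard BLEU formula is given below (a textbook formulation, not copied from the implementation used in this article):

BLEU = BP * exp( sum over n = 1..N of w_n * log(p_n) )
BP   = 1 if c > r, otherwise exp(1 - r / c)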

Here c = candidate length, r = reference length, p_n = precision for n-grams of length n, and w_n = the weight assigned to each n-gram order (usually uniform, 1/N).

1.3 METEOR (Metric for Evaluation of Translation with Explicit Ordering)

METEOR answers the question:

Did we match the same ideas — even if the words are stemmed, pluralised, or synonymous?

METEOR aligns unigrams, stems, and synonyms, then balances precision and recall.
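
For reference, the original METEOR formulation is given below; it is consistent with the fMean and penalty fields in the output shown later in this article:

P       = matched unigrams / unigrams in the candidate
R       = matched unigrams / unigrams in the reference
F_mean  = 10 * P * R / (R + 9 * P)
penalty = 0.5 * (chunks / matched unigrams)^3
METEOR  = F_mean * (1 - penalty)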

If you are interested in implementation details, you can find implementations of ROUGE, BLEU and METEOR in my GitHub repository.

Let’s compare two texts using the three metrics presented above.

// assuming the evaluator classes live in the LlmEvaluation namespace, as the output below suggests
use LlmEvaluation\StringComparisonEvaluator;

$evaluator = new StringComparisonEvaluator();
$reference = "that's the way cookie crumbles";
$candidate = "this is the way cookie is crashed";

$results = [
    'ROUGE'  => $evaluator->calculateROUGE($reference, $candidate),
    'BLEU'   => $evaluator->calculateBLEU($reference, $candidate),
    'METEOR' => $evaluator->calculateMETEOR($reference, $candidate),
];

print_r($results);
Array
(
    [ROUGE] => LlmEvaluation\EvaluationResults Object
        (
            [metricName:LlmEvaluation\EvaluationResults:private] => ROUGE
            [results:LlmEvaluation\EvaluationResults:private] => Array
                (
                    [recall] => 0.6
                    [precision] => 0.43
                    [f1] => 0.5
                )

        )

    [BLEU] => LlmEvaluation\EvaluationResults Object
        (
            [metricName:LlmEvaluation\EvaluationResults:private] => BLEU
            [results:LlmEvaluation\EvaluationResults:private] => Array
                (
                    [score] => 0.43
                )

        )

    [METEOR] => LlmEvaluation\EvaluationResults Object
        (
            [metricName:LlmEvaluation\EvaluationResults:private] => METEOR
            [results:LlmEvaluation\EvaluationResults:private] => Array
                (
                    [score] => 0.57
                    [precision] => 0.42857142857143
                    [recall] => 0.6
                    [chunks] => 1
                    [penalty] => 0.018518518518519
                    [fMean] => 0.57692307692308
                )

        )

)

As we can observe in the results above, each of the three metrics produces a different score (ROUGE F1 0.5, BLEU 0.43, METEOR 0.57). It is also hard to say whether a score of 0.5 really captures how related “that's the way cookie crumbles” and “this is the way cookie is crashed” are. Even though these metrics are a blunt instrument for judging LLM-generated text in absolute terms, they remain a cheap and efficient way to track how results change between iterations of an improvement process (for example, after changing the prompt).
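
As a minimal sketch of that iteration-tracking workflow (using only the StringComparisonEvaluator API shown above; the candidate strings are illustrative), you can score the outputs of two prompt versions against the same reference and watch how the metrics move:

$evaluator = new StringComparisonEvaluator();
$reference = "that's the way cookie crumbles";

// illustrative outputs produced by two different prompt versions
$candidateV1 = "this is the way cookie is crashed";
$candidateV2 = "that is the way the cookie crumbles";

// print both score sets and compare them across iterations
print_r($evaluator->calculateROUGE($reference, $candidateV1));
print_r($evaluator->calculateROUGE($reference, $candidateV2));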

2. Criteria-Based Evaluation

When string-comparison metrics are not sufficient, criteria-based evaluation comes in. Token overlap cannot score tone, ethics, or creativity.
The idea behind criteria evaluation is to evaluate results generated by LLM “A” by passing them to LLM “B”, which plays the role of a teacher grading a student's answers. The criteria evaluator feeds the question and answer into an LLM such as GPT or Claude and asks it to rate each trait.

Popular criteria categories include correctness, helpfulness, relevance, conciseness, clarity, factual accuracy, coherence and creativity, as well as safety-oriented traits such as harmfulness, maliciousness, insensitivity, misogyny, criminality and controversiality.

In code, it can look like this (see the full implementation and more examples in the repository):

// placeholders: the question that was asked and the answer produced by the evaluated LLM
$question = '...';
$response = '...';

$prompt = (new CriteriaEvaluatorPromptBuilder())
    ->addClarity()
    ->addCoherence()
    ->addConciseness()
    ->addControversiality()
    ->addCreativity()
    ->addCriminality()
    ->addFactualAccuracy()
    ->addRelevance()
    ->addHarmfulness()
    ->addHelpfulness()
    ->addInsensitivity()
    ->addMaliciousness()
    ->addMisogyny()
    ->addCorrectness()
    ->getEvaluationPrompt($question, $response);

print_r((new GPTCriteriaEvaluator())->evaluate($prompt));
{
"correctness": 5,
"helpfulness": 4,
"relevance": 4,
"conciseness": 5,
"clarity": 4,
"factual_accuracy": 4,
"insensitivity": 5,
"maliciousness": 0,
"harmfulness": 0,
"coherence": 1,
"misogyny": 0,
"criminality": 0,
"controversiality": 0,
"creativity": 1
}

Because each criterion is scored separately, you can weight or threshold what matters for your application — e.g., double-weight factual accuracy in a legal domain.
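
A minimal sketch of such weighting in plain PHP (assuming the judge's scores are available as the JSON object shown above; the weights and the threshold are illustrative):

// decode the criteria scores returned by the judge model (example values from the output above)
$scores = json_decode('{"correctness":5,"helpfulness":4,"relevance":4,"factual_accuracy":4}', true);

// double-weight factual accuracy, e.g. for a legal-domain assistant
$weights = [
    'correctness'      => 1.0,
    'helpfulness'      => 1.0,
    'relevance'        => 1.0,
    'factual_accuracy' => 2.0,
];

$weightedSum = 0.0;
$weightTotal = 0.0;
foreach ($weights as $criterion => $weight) {
    $weightedSum += $weight * ($scores[$criterion] ?? 0);
    $weightTotal += $weight;
}
$weightedScore = $weightedSum / $weightTotal; // still on the judge's rating scale

// flag the answer if the weighted score falls below a chosen threshold
$passed = $weightedScore >= 4.0;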

3. Trajectory Evaluation: Measuring the Reasoning Path

Large-language-model agents plan, search, call tools, and reflect over multiple turns. A single final-answer score can hide shallow guessing. Trajectory evaluation checks whether the multi-step process was correct.

3.1 How It Works

  1. Record a trajectory — prompt/response pairs for every step.
  2. Provide ground-truth triples: agent prompt, agent response, reference answer for each step.
  3. The evaluator computes step-level ROUGE overlap, divergence, and weighted criteria, then aggregates an overall score.

3.2 Illustrative Example

Task: An agent must find the cheapest round-trip flight from New York to Paris in July and estimate total CO₂ emissions.

We store prompts and responses:

$trajectory = [
    ['prompt' => 'List the three cheapest airlines on JFK → CDG for 10 July – 17 July.', 'response' => 'Delta $820, FrenchBee $750, Norse $710.'],
    ['prompt' => 'Give seat-class and aircraft for the Norse flight.', 'response' => 'Economy (LowFare) on Boeing 787-9.'],
    ['prompt' => 'Estimate round-trip CO₂ for one passenger.', 'response' => 'About 1.6 t CO₂'],
];

$ground = [
    ['Norse', '710'],
    ['Economy', '787-9'],
    ['1.4 t CO2'],
];

$evaluator = new TrajectoryEvaluator([
    'factualAccuracy' => 2.0,
    'coherence'       => 1.0,
    'completeness'    => 1.0,
    'harmlessness'    => 1.5,
]);

// Add a trajectory with multiple steps
$evaluator->addTrajectory('flight-planner', $trajectory);
// Add ground truth for evaluation
$evaluator->addGroundTruth('flight-planner', $ground);

// Evaluate all trajectories
$results = $evaluator->evaluateAll();
echo json_encode($results, JSON_PRETTY_PRINT);
{
    "flight-planner": {
        "trajectoryId": "flight-planner",
        "stepScores": [
            {
                "factualAccuracy": 1,
                "relevance": 0,
                "completeness": 0.8,
                "harmlessness": 1
            },
            {
                "factualAccuracy": 1,
                "relevance": 0,
                "completeness": 0.8,
                "harmlessness": 1
            },
            {
                "factualAccuracy": 0,
                "relevance": 0.17,
                "completeness": 0.8,
                "harmlessness": 1
            }
        ],
        "metricScores": {
            "factualAccuracy": 0.67,
            "relevance": 0.06,
            "completeness": 0.8,
            "harmlessness": 1
        },
        "overallScore": 0.66,
        "passed": false,
        "interactionCount": 3
    }
}

What the numbers reveal

  • Step-level accuracy shows that the agent's CO₂ estimate in step 3 (about 1.6 t) diverges from the reference answer (1.4 t), so the final emissions figure is off.
  • The divergence trend shows factual drift growing across steps: a good signal to insert a retrieval call after step 1.
  • The overall score blends the weighted sub-metrics, giving you a pass/fail threshold for automatic regression tests (see the sketch below).
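
As an example, a minimal regression-style gate (assuming evaluateAll() returns the nested array structure mirrored by the JSON output above; the 0.75 threshold is illustrative) could look like this:

$report = $results['flight-planner'];

// fail the automated test run if the trajectory regresses below the chosen bar
if (!$report['passed'] || $report['overallScore'] < 0.75) {
    throw new RuntimeException(sprintf(
        'Trajectory regression: overall score %.2f is below the required threshold.',
        $report['overallScore']
    ));
}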

Why It Matters for Agents

  • Process integrity: Detect shortcuts where the agent guesses the right number without real reasoning.
  • Safety audits: Flag steps that wander into unsafe content even if the final answer looks innocent.
  • Fine-tuning: Visual reports reveal which reasoning sub-skills need data or instruction tweaks.

Putting It All Together

There are many ways to evaluate LLM responses; some representative and popular ones were described in this article. They fall into three major groups:

  • String comparison metrics tell you what overlapped.
  • Criteria scoring tells you how the answer reads.
  • Trajectory analysis tells you how the agent arrived at its answer.

Combine them for a full picture before shipping chatbots, retrieval-augmented search, or autonomous agents into production. Start evaluating your LLM-based projects!

Note: The code examples from this article can also be found in the LLPhant/LLPhant framework repository, in the evaluation module, together with some broader context.
The evaluator classes are also available as a standalone library in my GitHub repository mzarnecki/php-llm-evaluation.


Published via Towards AI

