LLM & AI Agent Applications with LangChain and LangGraph — Part 16: String Evaluators (BLEU, ROUGE, METEOR)

Last Updated on January 3, 2026 by Editorial Team

Author(s): Michal Zarnecki

Originally published on Towards AI.

Hi. Welcome to the next episode. The first category of evaluation techniques I want to cover is String Evaluators.

Unlike semantic evaluators — which analyze meaning using embeddings — String Evaluators compare outputs at the text level. They look at how closely the generated answer matches a reference text in terms of the words used, their order, and overall overlap.

This approach is simpler, but still very useful in specific scenarios — especially when you care about exact alignment with a target, not only semantic similarity.

When should you use String Evaluators?

String Evaluators are especially helpful when:

  • you’re evaluating translations, where wording and ordering can matter,
  • you want to verify that the model reproduced an output in an expected format,
  • you need a quantitative metric to compare two models or two versions of prompts.

These methods have been popular in NLP for many years. Before modern embeddings became mainstream, metrics like BLEU and ROUGE were the standard tools for evaluating machine translation and summarization.

Three popular metrics

1) BLEU (Bilingual Evaluation Understudy)

BLEU is a classic metric used for translation quality.
It measures what percentage of n-grams (text fragments of length n) in the prediction overlaps with n-grams in the reference.

BLEU ranges from 0 to 1, where 1 means a perfect match (some tools report it scaled to 0–100).

Example:
Reference: “The capital of Poland is Warsaw.”
Prediction: “Warsaw is the capital of Poland.”

Unigram BLEU would be high here because exactly the same words are used, but higher-order n-gram scores drop when the word order changes, so the overall score is only moderate.
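To make this concrete, here is a minimal pure-Python sketch of clipped n-gram precision, the core ingredient of BLEU. (Real BLEU combines several n-gram orders geometrically and adds a brevity penalty; this sketch only shows a single order.)

```python
from collections import Counter

def ngram_precision(prediction, reference, n):
    """Clipped n-gram precision: fraction of prediction n-grams that also
    appear in the reference, with counts clipped to the reference counts."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    return overlap / max(sum(pred.values()), 1)

reference = "the capital of poland is warsaw"
prediction = "warsaw is the capital of poland"

print(ngram_precision(prediction, reference, 1))  # 1.0 -- every word matches
print(ngram_precision(prediction, reference, 2))  # 0.6 -- reordering breaks bigrams
```

Notice how the unigram score stays perfect while the bigram score already penalizes the changed word order.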

2) ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is most commonly used for summarization evaluation.

Instead of focusing mainly on precision like BLEU, ROUGE measures recall — how much of the reference content appears in the generated text. In other words: how much of the original information the model managed to “keep”.

Example:
Reference: “Warsaw is the capital of Poland.”
Prediction: “Warsaw is capital.”

BLEU would likely be low because some words are missing, but ROUGE would show that the key information is still present.
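The recall idea behind ROUGE-1 fits in a few lines. (Real ROUGE implementations also report precision and F1, plus variants such as ROUGE-2 and ROUGE-L; this is only the recall component.)

```python
def tokens(text):
    # lowercase and strip trailing punctuation for a fair word match
    return [t.strip(".,!?").lower() for t in text.split()]

def rouge1_recall(prediction, reference):
    """Fraction of reference unigrams that also appear in the prediction."""
    pred = set(tokens(prediction))
    ref = tokens(reference)
    return sum(t in pred for t in ref) / len(ref)

print(rouge1_recall("Warsaw is capital.", "Warsaw is the capital of Poland."))  # 0.5
```

Half of the reference words survive in the short prediction, which is exactly the "how much information did the model keep" signal described above.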

3) METEOR (Metric for Evaluation of Translation with Explicit Ordering)

METEOR is a more advanced metric. Besides n-gram overlap, it also considers synonyms and inflections.

This makes it more flexible and often better aligned with how natural language works — because the same meaning can be expressed in multiple ways.

Example:
Reference: “The dog is running fast.”
Prediction: “The dog runs quickly.”

BLEU and ROUGE might score low because the words differ, but METEOR can recognize that “fast” and “quickly” are close in meaning and assign a higher score.
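A toy illustration of why synonym awareness changes the score. Real METEOR uses WordNet synonym sets and stemming; the tiny hand-written synonym table below is purely hypothetical and only demonstrates the idea.

```python
# Toy synonym table -- a stand-in for WordNet, NOT what METEOR actually uses.
SYNONYMS = {("fast", "quickly"), ("runs", "running")}

def match(a, b):
    return a == b or (a, b) in SYNONYMS or (b, a) in SYNONYMS

def soft_unigram_match(prediction, reference):
    """Fraction of prediction words matched exactly or via the synonym table."""
    pred = prediction.lower().replace(".", "").split()
    ref = reference.lower().replace(".", "").split()
    matched = sum(any(match(p, r) for r in ref) for p in pred)
    return matched / len(pred)

print(soft_unigram_match("The dog runs quickly.", "The dog is running fast."))  # 1.0
```

With exact matching only, "runs" and "quickly" would be misses; the synonym table lets every prediction word find a partner, which is the flexibility METEOR adds over BLEU and ROUGE.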

Limitations of string-based metrics

Of course, these methods have real limitations:

  • they’re sensitive to synonyms — if the model uses different wording, the score can drop even if the meaning is identical,
  • they don’t understand context or factual correctness — they only measure textual similarity,
  • they require a reference text, so they’re not suitable for open-ended tasks where there isn’t a single “correct” answer.

That’s why today we often rely more on embeddings for semantic evaluation. Still, String Evaluators absolutely have their place — especially in test pipelines where you want repeatability and clear numeric comparisons.

Alright — let’s move to the notebook and see practical examples.

Import support libraries and load environment variables

import json
from dotenv import load_dotenv

load_dotenv()

String Comparison

The first evaluator compares two texts using a string-distance metric (Jaro-Winkler by default in LangChain's string_distance evaluator). Note that the score is a distance, not a similarity: 0 means the strings are identical, and larger values mean greater difference.

from langchain_classic.evaluation import load_evaluator

evaluator = load_evaluator("string_distance")

result1 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is Warsaw"
)

result2 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland = Warsaw"
)

result3 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="Warsaw is the capital of Poland"
)

print(round(result1["score"], 4))
print(round(result2["score"], 4))
print(round(result3["score"], 4))

output:

0.0
0.0334
0.289
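For a quick sanity check without LangChain, the standard library's difflib gives a comparable character-level similarity signal. Note it returns a similarity ratio rather than a distance, so here 1.0 means identical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity("The capital of Poland is Warsaw",
                 "The capital of Poland is Warsaw"))   # 1.0
print(similarity("The capital of Poland is Warsaw",
                 "Warsaw is the capital of Poland"))   # lower for reordered text
```

This is handy in unit tests where you only need a repeatable, dependency-free number.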

Embedding Distance Evaluator

Embedding Distance Evaluator compares two responses by converting them into embedding vectors and measuring distance or cosine similarity. This allows it to assess semantic content proximity, not just word matching.

from langchain_classic.evaluation import load_evaluator
from langchain_openai import OpenAIEmbeddings

# uses OpenAI embeddings; requires OPENAI_API_KEY in the environment
evaluator = load_evaluator("embedding_distance", embeddings=OpenAIEmbeddings())

result1 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is Warsaw"
)

result2 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is called Warsaw"
)

result3 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Burkina Faso is called Ouagadougou"
)

print(round(result1["score"], 4))
print(round(result2["score"], 4))
print(round(result3["score"], 4))

output:

0.0
0.0168
0.1829
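Under the hood, the score is a distance between embedding vectors, typically cosine distance. A minimal sketch over toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the vectors below are invented for illustration):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, larger when vectors diverge."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1 - dot / norm

# Toy "embeddings": semantically similar texts map to nearby vectors.
warsaw_a = [0.9, 0.1, 0.3]    # "The capital of Poland is Warsaw"
warsaw_b = [0.85, 0.15, 0.32] # "The capital of Poland is called Warsaw"
ouaga = [0.1, 0.9, 0.4]       # the Burkina Faso sentence

print(round(cosine_distance(warsaw_a, warsaw_b), 4))  # close to 0
print(round(cosine_distance(warsaw_a, ouaga), 4))     # noticeably larger
```

This mirrors the pattern in the LangChain output above: near-paraphrases score close to 0, unrelated statements score further away.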

A/B Tests

PairwiseStringEvaluator compares two text responses against a single reference to determine which one is better. Under the hood, the labeled_pairwise_string evaluator asks an LLM judge to pick the winner, which lets you automatically evaluate which response is closer to the expected result.

from langchain_classic.evaluation import load_evaluator

evaluator = load_evaluator("labeled_pairwise_string")

result = evaluator.evaluate_string_pairs(
    input="What is the capital of Poland?",
    prediction="Warsaw is the capital of Poland",
    prediction_b="I don't know",
    reference="Warsaw is Poland's capital"
)

print(json.dumps(result, indent=4))

output:

{
"reasoning": "Assistant A's response is helpful, relevant, correct, and accurate. It directly answers the user's question about the capital of Poland. On the other hand, Assistant B's response is not helpful or accurate. It does not provide the user with the information they were seeking. Therefore, Assistant A's response is superior in this case. \n\nFinal verdict: [[A]]",
"value": "A",
"score": 1
}

That’s all in this part dedicated to LLM output evaluation with string comparison and embedding distance approaches. In the next article, I’ll show how to evaluate the output of one LLM using another LLM with defined criteria.

see next chapter

see previous chapter

see the full code from this article in the GitHub repository


Published via Towards AI


Note: Article content contains the views of the contributing authors and not Towards AI.