LLM & AI Agent Applications with LangChain and LangGraph — Part 16: String Evaluators (BLEU, ROUGE, METEOR)

Last Updated on January 3, 2026 by Editorial Team

Author(s): Michal Zarnecki

Originally published on Towards AI.

Hi. Welcome to the next episode. The first category of evaluation techniques I want to cover is String Evaluators.

Unlike semantic evaluators — which analyze meaning using embeddings — String Evaluators compare outputs at the text level. They look at how closely the generated answer matches a reference text in terms of the words used, their order, and overall overlap.

This approach is simpler, but still very useful in specific scenarios — especially when you care about exact alignment with a target, not only semantic similarity.

When should you use String Evaluators?

String Evaluators are especially helpful when:

  • you’re evaluating translations, where wording and ordering can matter,
  • you want to verify that the model reproduced an output in an expected format,
  • you need a quantitative metric to compare two models or two versions of prompts.

These methods have been popular in NLP for many years. Before modern embeddings became mainstream, metrics like BLEU and ROUGE were the standard tools for evaluating machine translation and summarization.

Three popular metrics

1) BLEU (Bilingual Evaluation Understudy)

BLEU is a classic metric used for translation quality.
It measures what percentage of n-grams (text fragments of length n) in the prediction overlaps with n-grams in the reference.

BLEU ranges from 0 to 1, where 1 means a perfect match (some tools report it scaled to 0–100).

Example:
Reference: “The capital of Poland is Warsaw.”
Prediction: “Warsaw is the capital of Poland.”

Unigram BLEU would be high here because exactly the same words are used, but higher-order n-gram scores drop when the word order changes, so the overall score is only moderate.
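To make this concrete, here is a minimal pure-Python sketch of clipped n-gram precision, the core ingredient of BLEU. (Real BLEU combines several n-gram orders geometrically and adds a brevity penalty; this sketch only shows a single order.)

```python
from collections import Counter

def ngram_precision(prediction, reference, n):
    """Clipped n-gram precision: fraction of prediction n-grams that also
    appear in the reference, with counts clipped to the reference counts."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    return overlap / max(sum(pred.values()), 1)

reference = "the capital of poland is warsaw"
prediction = "warsaw is the capital of poland"

print(ngram_precision(prediction, reference, 1))  # 1.0 -- every word matches
print(ngram_precision(prediction, reference, 2))  # 0.6 -- reordering breaks bigrams
```

Notice how the unigram score stays perfect while the bigram score already penalizes the changed word order.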

2) ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is most commonly used for summarization evaluation.

Instead of focusing mainly on precision like BLEU, ROUGE measures recall — how much of the reference content appears in the generated text. In other words: how much of the original information the model managed to “keep”.

Example:
Reference: “Warsaw is the capital of Poland.”
Prediction: “Warsaw is capital.”

BLEU would likely be low because some words are missing, but ROUGE would show that the key information is still present.
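The recall idea behind ROUGE-1 fits in a few lines. (Real ROUGE implementations also report precision and F1, plus variants such as ROUGE-2 and ROUGE-L; this is only the recall component.)

```python
def tokens(text):
    # lowercase and strip trailing punctuation for a fair word match
    return [t.strip(".,!?").lower() for t in text.split()]

def rouge1_recall(prediction, reference):
    """Fraction of reference unigrams that also appear in the prediction."""
    pred = set(tokens(prediction))
    ref = tokens(reference)
    return sum(t in pred for t in ref) / len(ref)

print(rouge1_recall("Warsaw is capital.", "Warsaw is the capital of Poland."))  # 0.5
```

Half of the reference words survive in the short prediction, which is exactly the "how much information did the model keep" signal described above.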

3) METEOR (Metric for Evaluation of Translation with Explicit Ordering)

METEOR is a more advanced metric. Besides n-gram overlap, it also considers synonyms and inflections.

This makes it more flexible and often better aligned with how natural language works — because the same meaning can be expressed in multiple ways.

Example:
Reference: “The dog is running fast.”
Prediction: “The dog runs quickly.”

BLEU and ROUGE might score low because the words differ, but METEOR can recognize that “fast” and “quickly” are close in meaning and assign a higher score.
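A toy illustration of why synonym awareness changes the score. Real METEOR uses WordNet synonym sets and stemming; the tiny hand-written synonym table below is purely hypothetical and only demonstrates the idea.

```python
# Toy synonym table -- a stand-in for WordNet, NOT what METEOR actually uses.
SYNONYMS = {("fast", "quickly"), ("runs", "running")}

def match(a, b):
    return a == b or (a, b) in SYNONYMS or (b, a) in SYNONYMS

def soft_unigram_match(prediction, reference):
    """Fraction of prediction words matched exactly or via the synonym table."""
    pred = prediction.lower().replace(".", "").split()
    ref = reference.lower().replace(".", "").split()
    matched = sum(any(match(p, r) for r in ref) for p in pred)
    return matched / len(pred)

print(soft_unigram_match("The dog runs quickly.", "The dog is running fast."))  # 1.0
```

With exact matching only, "runs" and "quickly" would be misses; the synonym table lets every prediction word find a partner, which is the flexibility METEOR adds over BLEU and ROUGE.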

Limitations of string-based metrics

Of course, these methods have real limitations:

  • they’re sensitive to synonyms — if the model uses different wording, the score can drop even if the meaning is identical,
  • they don’t understand context or factual correctness — they only measure textual similarity,
  • they require a reference text, so they’re not suitable for open-ended tasks where there isn’t a single “correct” answer.

That’s why today we often rely more on embeddings for semantic evaluation. Still, String Evaluators absolutely have their place — especially in test pipelines where you want repeatability and clear numeric comparisons.

Alright — let’s move to the notebook and see practical examples.

Import support libraries and load environment variables

import json
from dotenv import load_dotenv

load_dotenv()

String Comparison

The first evaluator compares two texts using a string-distance metric (Jaro-Winkler by default in LangChain's string_distance evaluator). Note that the score is a distance, not a similarity: 0 means the strings are identical, and larger values mean greater difference.

from langchain_classic.evaluation import load_evaluator

evaluator = load_evaluator("string_distance")

result1 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is Warsaw"
)

result2 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland = Warsaw"
)

result3 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="Warsaw is the capital of Poland"
)

print(round(result1["score"], 4))
print(round(result2["score"], 4))
print(round(result3["score"], 4))

output:

0.0
0.0334
0.289
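For a quick sanity check without LangChain, the standard library's difflib gives a comparable character-level similarity signal. Note it returns a similarity ratio rather than a distance, so here 1.0 means identical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity("The capital of Poland is Warsaw",
                 "The capital of Poland is Warsaw"))   # 1.0
print(similarity("The capital of Poland is Warsaw",
                 "Warsaw is the capital of Poland"))   # lower for reordered text
```

This is handy in unit tests where you only need a repeatable, dependency-free number.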

Embedding Distance Evaluator

Embedding Distance Evaluator compares two responses by converting them into embedding vectors and measuring distance or cosine similarity. This allows it to assess semantic content proximity, not just word matching.

from langchain_classic.evaluation import load_evaluator
from langchain_openai import OpenAIEmbeddings

# uses OpenAI embeddings; requires OPENAI_API_KEY in the environment
evaluator = load_evaluator("embedding_distance", embeddings=OpenAIEmbeddings())

result1 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is Warsaw"
)

result2 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Poland is called Warsaw"
)

result3 = evaluator.evaluate_strings(
    prediction="The capital of Poland is Warsaw",
    reference="The capital of Burkina Faso is called Ouagadougou"
)

print(round(result1["score"], 4))
print(round(result2["score"], 4))
print(round(result3["score"], 4))

output:

0.0
0.0168
0.1829
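Under the hood, the score is a distance between embedding vectors, typically cosine distance. A minimal sketch over toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the vectors below are invented for illustration):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, larger when vectors diverge."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1 - dot / norm

# Toy "embeddings": semantically similar texts map to nearby vectors.
warsaw_a = [0.9, 0.1, 0.3]    # "The capital of Poland is Warsaw"
warsaw_b = [0.85, 0.15, 0.32] # "The capital of Poland is called Warsaw"
ouaga = [0.1, 0.9, 0.4]       # the Burkina Faso sentence

print(round(cosine_distance(warsaw_a, warsaw_b), 4))  # close to 0
print(round(cosine_distance(warsaw_a, ouaga), 4))     # noticeably larger
```

This mirrors the pattern in the LangChain output above: near-paraphrases score close to 0, unrelated statements score further away.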

A/B Tests

PairwiseStringEvaluator compares two text responses against a single reference to determine which one is better. Under the hood, the labeled_pairwise_string evaluator asks an LLM judge to pick the winner, which lets you automatically evaluate which response is closer to the expected result.

from langchain_classic.evaluation import load_evaluator

evaluator = load_evaluator("labeled_pairwise_string")

result = evaluator.evaluate_string_pairs(
    input="What is the capital of Poland?",
    prediction="Warsaw is the capital of Poland",
    prediction_b="I don't know",
    reference="Warsaw is Poland's capital"
)

print(json.dumps(result, indent=4))

output:

{
"reasoning": "Assistant A's response is helpful, relevant, correct, and accurate. It directly answers the user's question about the capital of Poland. On the other hand, Assistant B's response is not helpful or accurate. It does not provide the user with the information they were seeking. Therefore, Assistant A's response is superior in this case. \n\nFinal verdict: [[A]]",
"value": "A",
"score": 1
}

That’s all in this part dedicated to LLM output evaluation with string comparison and embedding distance approaches. In the next article, I’ll show how to evaluate the output of one LLM using another LLM with defined criteria.

see next chapter

see previous chapter

see the full code from this article in the GitHub repository


Published via Towards AI


Note: Article content contains the views of the contributing authors and not Towards AI.