LLM & AI Agent Applications with LangChain and LangGraph — Part 18: Trajectory Evaluator
Last Updated on January 2, 2026 by Editorial Team
Author(s): Michalzarnecki
Originally published on Towards AI.

Hi! The next evaluation technique we’ll discuss in this part is the Trajectory Evaluator.
This tool doesn’t look only at the final answer produced by the model. Instead, it focuses on the entire reasoning process, step by step.
The Trajectory Evaluator helps us judge whether the model reached the result through sound reasoning, or whether it made mistakes along the route and produced the right final answer only by accident.
What does “trajectory” mean?
Imagine a student solving a math problem.
You can check only the final result — if it matches, the student gets the points.
But very often what matters more is how the student arrived there.
Because if the result is correct but the reasoning was completely wrong, the student will fail on the next problem.
It’s the same with language models. Sometimes the final answer is correct, but the logic on the way is weak. And sometimes when the final answer is wrong, we want to find out exactly which reasoning step introduced the mistake.
Trajectory Evaluator lets us evaluate the quality of reasoning, not just the final output.
Why does this matter?
A few practical reasons:
- Debugging — when the model gives wrong answers, you can see where the logic collapsed.
- Safety — in critical applications (medical, legal, compliance), it matters not only what the model answered, but also why it answered that way.
- Training and fine-tuning — if you’re training or tuning a model, you want the reasoning process to align with your expectations.
- Transparency — explaining the process is often necessary for users to trust the system.
How does Trajectory Evaluator work?
Trajectory Evaluator compares the model-generated trajectory — its sequence of reasoning steps — with a reference trajectory prepared by a human.
The scoring happens on two levels:
- Step-by-step — do individual reasoning elements match the reference?
- Overall — how far does the full reasoning path deviate from the expected one?
Under the hood, scoring can rely on text-overlap metrics such as ROUGE, embedding-based similarity measures, or, as in LangChain’s implementation, an LLM acting as a judge that grades the full reasoning path and notes how early errors propagate into later steps.
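The step-by-step level can be illustrated without any LLM at all. The sketch below (an illustrative helper of my own, not part of LangChain) scores each predicted step against its reference step with a plain string-similarity ratio; a low score points to the step where the trajectory starts to drift:

```python
from difflib import SequenceMatcher

def step_scores(prediction_steps, reference_steps):
    """Score each predicted reasoning step against its reference step.

    Returns one similarity ratio in [0.0, 1.0] per step pair; a low
    value flags the step where the trajectory starts to deviate.
    """
    return [
        SequenceMatcher(None, pred.lower(), ref.lower()).ratio()
        for pred, ref in zip(prediction_steps, reference_steps)
    ]

scores = step_scores(
    ["Fibonacci starts at 0, 1", "The 5th number is 4"],
    ["Fibonacci starts at 0, 1", "The 5th number is 5"],
)
print(scores)  # the first step matches exactly; the second does not
```

Real evaluators replace this crude ratio with ROUGE, embeddings, or an LLM judge, but the shape of the computation is the same: one score per step, then an aggregate over the whole path.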
Alright — let’s look at a practical example in the Jupyter notebook.
Install and import libraries, load environment variables
!pip install langchain-classic langchain-core langchain-openai python-dotenv
from langchain_classic.evaluation import load_evaluator
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
import json

load_dotenv()  # loads OPENAI_API_KEY from a local .env file
Trajectory evaluation
Trajectory evaluation compares a model’s reasoning steps (prediction) with a reference sequence, assessing the consistency, order, and completeness of each step. It detects not only an incorrect final answer but also where a deviation occurred in the reasoning chain (e.g., a missed or distorted step), making it particularly useful for agents and multi-step tasks.
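To make the “where did the deviation occur” idea concrete, here is a minimal sketch (illustrative only, not LangChain’s internals) that returns the index of the first step where prediction and reference diverge, using the same Fibonacci steps as the example that follows:

```python
def first_deviation(prediction_steps, reference_steps):
    """Return the index of the first diverging step, or None if the
    trajectories match step for step (a crude exact-match check)."""
    for i, (pred, ref) in enumerate(zip(prediction_steps, reference_steps)):
        if pred.strip().lower() != ref.strip().lower():
            return i
    if len(prediction_steps) != len(reference_steps):
        # one trajectory is a strict prefix of the other
        return min(len(prediction_steps), len(reference_steps))
    return None

reference_steps = [
    "Fibonacci starts at 0, 1",
    "The next number is the sum of the previous two",
    "The 5th number is 5",
]
prediction_steps = [
    "Fibonacci starts at 0, 1",
    "Subsequent numbers are the sum of the previous ones",
    "The 5th number is 4",
]
print(first_deviation(prediction_steps, reference_steps))  # → 1
```

An LLM-based evaluator is far more tolerant of paraphrasing than this exact-match check, but the goal is the same: localize the step that introduced the error, not just flag the final answer.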
# LLM used by evaluator
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
evaluator = load_evaluator("trajectory", llm=llm)
reference_steps = [
    "Fibonacci starts at 0, 1",
    "The next number is the sum of the previous two",
    "The 5th number is 5",
]
prediction_steps = [
    "Fibonacci starts at 0, 1",
    "Subsequent numbers are the sum of the previous ones",
    "The 5th number is 4",
]
result = evaluator.invoke({
    "question": "What is the 5th number of the Fibonacci sequence?",
    "answer": "4",
    "agent_trajectory": prediction_steps,
    "reference": {
        "answer": "5",
        "agent_trajectory": reference_steps,
    },
})
print(json.dumps(result, indent=4))
output:
{
    "question": "What is the 5th number of the Fibonacci sequence?",
    "answer": "4",
    "agent_trajectory": [
        "Fibonacci starts at 0, 1",
        "Subsequent numbers are the sum of the previous ones",
        "The 5th number is 4"
    ],
    "reference": "\n\nThe following is the expected answer. Use this to measure correctness:\n[GROUND_TRUTH]\n{'answer': '5', 'agent_trajectory': ['Fibonacci starts at 0, 1', 'The next number is the sum of the previous two', 'The 5th number is 5']}\n[END_GROUND_TRUTH]\n",
    "score": 0.0,
    "reasoning": "Let's evaluate the AI language model's answer step by step based on the provided criteria:\n\ni. **Is the final answer helpful?**\n - The final answer given by the AI model is \"4,\" which is incorrect. The 5th number in the Fibonacci sequence is actually \"5.\" Therefore, the answer is not helpful.\n\nii. **Does the AI language model use a logical sequence of tools to answer the question?**\n - The AI model provides a logical sequence in its reasoning by stating how the Fibonacci sequence starts and how subsequent numbers are derived. However, it ultimately arrives at the wrong conclusion.\n\niii. **Does the AI language model use the tools in a helpful way?**\n - The model does not appear to use any specific tools in this case. It relies on its internal reasoning rather than utilizing any external tools to verify or calculate the Fibonacci sequence. This lack of tool usage is a missed opportunity to ensure accuracy.\n\niv. **Does the AI language model use too many steps to answer the question?**\n - The model does not use an excessive number of steps; it provides a concise explanation of the Fibonacci sequence. However, the explanation does not lead to the correct answer.\n\nv. **Are the appropriate tools used to answer the question?**\n - The model does not use any tools at all. Given the straightforward nature of the question, it could have used a calculator tool to verify the Fibonacci sequence or simply relied on its internal knowledge. The absence of tool usage is a significant oversight.\n\n**Judgment:**\nThe AI language model's final answer is incorrect, and it did not utilize any tools to verify its reasoning. While the explanation of the Fibonacci sequence is logical, it ultimately leads to an incorrect answer. Therefore, the performance is poor.\n\n**"
}
correct_prediction_steps = [
    "Fibonacci starts at 0, 1",
    "The next number is the sum of the previous two",
    "The 5th number is 5",
]
result = evaluator.invoke({
    "question": "What is the 5th number of the Fibonacci sequence?",
    "answer": "5",
    "agent_trajectory": correct_prediction_steps,
    "reference": {
        "answer": "5",
        "agent_trajectory": reference_steps,
    },
})
print(json.dumps(result, indent=4))
output:
{
    "question": "What is the 5th number of the Fibonacci sequence?",
    "answer": "5",
    "agent_trajectory": [
        "Fibonacci starts at 0, 1",
        "The next number is the sum of the previous two",
        "The 5th number is 5"
    ],
    "reference": "\n\nThe following is the expected answer. Use this to measure correctness:\n[GROUND_TRUTH]\n{'answer': '5', 'agent_trajectory': ['Fibonacci starts at 0, 1', 'The next number is the sum of the previous two', 'The 5th number is 5']}\n[END_GROUND_TRUTH]\n",
    "score": 1.0,
    "reasoning": "Let's evaluate the AI language model's answer step by step based on the provided criteria:\n\ni. **Is the final answer helpful?**\n - Yes, the final answer \"5\" is correct and directly answers the question about the 5th number in the Fibonacci sequence.\n\nii. **Does the AI language model use a logical sequence of tools to answer the question?**\n - The model's reasoning is logical. It starts with the definition of the Fibonacci sequence and explains how to derive the numbers, leading to the correct answer.\n\niii. **Does the AI language model use the tools in a helpful way?**\n - The model effectively uses the reasoning process to arrive at the answer. However, it does not explicitly mention using any tools, but the reasoning provided is sufficient to understand how it arrived at the answer.\n\niv. **Does the AI language model use too many steps to answer the question?**\n - No, the model does not use too many steps. The explanation is concise and directly leads to the answer without unnecessary elaboration.\n\nv. **Are the appropriate tools used to answer the question?**\n - While the model does not explicitly mention using any tools, the reasoning provided is appropriate for answering the question. The Fibonacci sequence is a well-known mathematical concept, and the model's explanation is accurate.\n\n**Judgment:**\nThe AI language model provided a correct and helpful answer with a logical sequence of reasoning. It did not use any tools explicitly, but the reasoning was sufficient. Given the correctness and clarity of the response, I would rate the model's performance as a 5.\n\n**"
}
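Once the evaluator returns a numeric score, it can serve as a regression gate in a test suite. The helper below is a hypothetical sketch (the 0.7 threshold is arbitrary, not something LangChain prescribes) that fails a test run when the trajectory score drops too low:

```python
def assert_trajectory_ok(result, threshold=0.7):
    """Raise AssertionError when the evaluator's score is below threshold.

    `result` is the dict returned by evaluator.invoke(...); only its
    "score" key (and optionally "reasoning") is inspected here.
    """
    score = result["score"]
    if score < threshold:
        reason = result.get("reasoning", "")[:200]
        raise AssertionError(f"trajectory score {score} < {threshold}: {reason}")

assert_trajectory_ok({"score": 1.0})  # passes silently
```

Wiring such a check into CI turns trajectory evaluation from a one-off notebook experiment into an ongoing guard against reasoning regressions after prompt or model changes.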
That’s all in this part dedicated to LLM trajectory evaluation. In the next article we will implement guardrails that prevent LLM-based applications from returning unexpected output.
see the full code from this article in the GitHub repository
Note: Article content contains the views of the contributing authors and not Towards AI.