Deep Dive into Context Engineering

Author(s): Sunil Rao

Originally published on Towards AI.

Remember the early days of prompt engineering, when a clever single prompt and a little “vibe coding” could get an LLM to spit out a functional script, a web component, or a full-stack proof-of-concept?
But as the projects grew from weekend hacks to production-grade applications, the cracks began to show. The “vibes” led to brittle code, security loopholes, and an unmanageable mess of technical debt. A simple change would break the entire system, and debugging the AI-generated logic felt like trying to solve a puzzle with half the pieces missing.

We learned a hard truth: a great prompt wasn’t enough. The AI, with its limited memory, lacked the crucial context to build something truly robust.

That’s where we stand today. The journey from “vibe coding” to production-ready AI applications requires a new discipline, one that moves beyond the art of the prompt to the science of intelligent information management. This discipline is Context Engineering.

“Vibe coding,” a term popularized by AI researcher Andrej Karpathy in early 2025, describes a style of software development where you primarily interact with an AI coding assistant (such as GitHub Copilot or Cursor) using natural language prompts. The idea is to let the AI generate most of the code, with the human acting as a guide, reviewer, and debugger rather than writing code line by line.

It was seen as a way to:

  • Allow non-technical users, entrepreneurs, and designers to build functional applications without deep coding knowledge.
  • Enable rapid prototyping of early app versions (MVPs) in hours or days.
  • Automate repetitive coding tasks, freeing up developers for more complex problem-solving and design.

In essence, you’d describe your desired outcome in plain English (e.g., “Create a Python function that reads a CSV file,” or “Build a simple recipe-sharing application with user accounts”), and the AI would attempt to generate the corresponding code.

While promising, vibe coding quickly revealed significant drawbacks, especially when moving beyond “throwaway weekend projects” to more serious or scalable applications:

  • AI-generated code often lacked structure, consistency, and optimal design. It could be brittle, poorly organized, and accrue “technical debt” quickly.
  • AI models are trained on vast amounts of public code, including insecure or outdated patterns. This meant AI-generated code could inadvertently include common security flaws like SQL injection, insecure file handling, or improper authentication, which inexperienced users might not spot.
  • The code generated might “just work” initially but become very difficult to maintain or scale as the project grew. Inconsistent naming, coding styles, and ad-hoc logic made it a “patchwork codebase” hard to understand or extend.
  • Debugging AI-generated code could be harder because the user might not fully understand the underlying logic. If the AI “fixed” one issue, it might introduce others, making it a time-consuming and frustrating process.
  • While great for common patterns, vibe coding offered limited flexibility for novel, complex, or highly specific requirements. Developers had less fine-grained control over the generated code.
  • There was a risk of users blindly accepting AI-generated code without sufficient understanding or review, leading to hidden bugs or security risks.
  • While seemingly saving on developer salaries, AI service API calls could become expensive, and the time spent debugging “AI-generated messes” could negate initial efficiency gains.

Qodo’s 2025 “State of AI Code Quality” report examines how AI coding tools are affecting developers and software quality worldwide, based on a survey of 609 developers across various regions and company sizes.

AI adoption is now mainstream

  • 82% of developers use AI coding tools daily or weekly, moving from experiment to core workflow.
  • 59% use three or more different AI coding tools, suggesting both heavy reliance and fragmented tool choices.
  • AI influences at least a quarter of the codebase for 65% of users, with some reporting over 80% AI-influenced code.
  • 78% report productivity gains due to AI; 17% claim a “10×” increase in output.

But confidence and trust are lagging behind adoption

Source: Hallucination vs Shipping Confidence
  • Despite widespread usage, only 3.8% of developers experience both low hallucinations (mistakes) and high confidence in shipping AI-generated code without human review.
  • 25% of developers say about 1 in 5 AI suggestions are factually or functionally incorrect.
  • The largest group (76%) experience frequent hallucinations and have low confidence, leading to manual code review or rejection of AI-generated code.
  • 65% of developers say AI “misses relevant context” during key tasks like refactoring, reviewing, or testing.

The pervasive issues of code quality, security, maintainability, and debugging difficulties arising from vibe coding — primarily stemming from the AI’s lack of sufficient and relevant context — highlighted the need for a more robust approach. This led to the emergence of Context Engineering.

Let’s explore this with a simple scenario: a hotel reservation system.

A user wants to book a hotel room, and they’re interacting with an AI assistant.

With “Prompt Engineering / Vibe Coding” Approach:

Prompt engineering is the practice of crafting effective, specific, and clear natural language instructions to get a desired output from an LLM. It’s a linguistic art and a form of in-the-moment communication with the AI.

Think of it as the “what you say” part of the conversation.

Core techniques in prompt engineering include:

  • Role-playing: Giving the LLM a persona, e.g., “You are an expert financial analyst…”
  • Providing Examples (Few-Shot Prompting): Showing the LLM a few examples of input-output pairs to guide its behavior.
  • Chain of Thought (CoT): Asking the LLM to think step-by-step before giving a final answer, which is especially useful for complex reasoning tasks.
  • Defining Constraints: Specifying the desired format (e.g., “Respond in JSON format”), length, or tone.
  • Clear Instructions: Using clear, unambiguous language to tell the model exactly what you want it to do.

In this approach, the developer’s primary focus is on crafting a single, highly-detailed prompt that contains all the logic and rules for the system. The AI’s state and memory are almost entirely contained within the prompt itself.

Initial Prompt might look something like this:

You are a helpful and friendly hotel reservation bot. Your goal is to gather all the necessary information from the user to make a hotel booking.

Here are the rules:
1. **Always ask for these four pieces of information:**
* Hotel Name
* Check-in Date
* Check-out Date
* Number of Guests
2. If any of these pieces of information are missing, politely ask for them.
3. Do not proceed with the booking until all four pieces of information have been provided.
4. Once all information is gathered, summarize it back to the user and ask for confirmation.
5. If the user confirms, say "Thank you! Your reservation for [Hotel Name] from [Check-in Date] to [Check-out Date] has been confirmed."
6. The current date is [Current Date]. Use this to validate dates.
7. The hotel names you can book are: Grand Hyatt, The Fairmont, and Marriott.

User: I want to book a hotel room.

How a typical conversation would flow:

User: I want to book a hotel room.

AI (Prompt-Engineered): “Hello! I can help you with that. To get started, I need a few details. What’s the name of the hotel, your check-in and check-out dates, and the number of guests?”

User: I want to stay at the Marriott from next Friday to Sunday.

AI (Prompt-Engineered): “Got it. What’s the exact date for next Friday and Sunday, and how many people will be staying?”

User: Just me.

AI (Prompt-Engineered): “Great. I have that you want to book a room for 1 guest at the Marriott. I still need the exact check-in and check-out dates. Can you provide those?”

What is Context Engineering?

Context engineering is, at its core, the practice of providing an LLM with precisely the right information, tools, and constraints at the right time, enabling it to perform complex tasks effectively and reliably, especially in multi-step or conversational interactions.

It moves beyond simple “prompt engineering” (crafting a single, perfect instruction) to a more systemic approach. Think of it as managing the LLM’s “working memory” effectively.

Key aspects of Context Engineering include:

  • The context is not static. It’s assembled on the fly and evolves as the conversation or task progresses. This involves fetching or updating information at runtime.
  • System Instructions/Prompts — High-level rules and persona for the AI.
  • Conversation History/User Preferences – Remembering previous interactions and user-specific details.
  • Retrieved Information (RAG) — Pulling relevant data from internal documents, databases, or external sources.
  • Describing available Tools and their definitions — functions or APIs the AI can call to get more information or perform actions.
  • Guiding the AI to produce structured output in a specific format (e.g., JSON).
  • Incorporating Real-time current information from external APIs.
  • LLMs have limited context windows (their “RAM”). Context engineering involves strategies to:
    Select relevant context — deciding what information is most pertinent to the current step.
    Compress context — reducing the amount of text passed to the LLM without losing critical information.
    Order context — arranging information so the most important details are highlighted.
    Persist context — storing and retrieving knowledge across sessions.
  • Context engineering is often part of a larger workflow or agentic system where multiple LLM calls and non-LLM steps are orchestrated. Each step receives a focused, optimized context.
  • Building systems to validate AI output and handle cases where the AI might miss context, hallucinate, or produce incorrect results.

Continuing the same scenario…

With “Context Engineering” Approach

In this approach, the developer builds a system around the LLM. The LLM is a reasoning engine, not the entire application. The context is dynamically assembled and provided to the LLM at each step. The System’s Architecture:

1. State Management:
A dedicated database or a state object tracks the conversation’s progress (e.g., {'hotel_name': 'Marriott', 'check_in': '...', 'guests': 1}).

2. Tooling:
The system has access to a set of predefined functions (APIs) the LLM can call:

  • get_available_hotels(city)
  • check_room_availability(hotel_name, check_in_date, check_out_date)
  • book_room(hotel_name, check_in_date, check_out_date, num_guests)
  • get_current_date()

3. Prompt Orchestrator: This component decides what information to send to the LLM at each turn.
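
To make this architecture concrete, here is a minimal sketch of such an orchestrator in Python. It is illustrative only: call_llm stands in for whatever LLM client you use, and the tool registry would hold the real booking APIs.

from datetime import date

def get_current_date():
    return date.today().isoformat()

# Hypothetical tool registry; real implementations would wrap booking APIs.
TOOLS = {"get_current_date": get_current_date}

def build_prompt(state, history):
    # Assemble only what the LLM needs for this turn: system rules,
    # the current structured state, today's date, and the recent conversation.
    return (
        "You are a hotel reservation assistant. Gather the hotel name, dates, "
        "and number of guests, then call book_room.\n"
        f"Current state: {state}\n"
        f"Today's date: {get_current_date()}\n"
        "Conversation so far:\n" + "\n".join(history)
    )

def handle_turn(user_message, state, history):
    history.append(f"User: {user_message}")
    reply = call_llm(build_prompt(state, history))  # call_llm: hypothetical LLM client
    history.append(f"Assistant: {reply}")
    return reply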

A typical conversation would flow:

User: I want to book a hotel room.

Context Engineering System: The system assembles a prompt that includes:

A high-level system prompt:
………

“You are a hotel reservation assistant. You can use the following tools: get_available_hotels, check_room_availability, book_room. Your goal is to gather all necessary info and call the book_room tool."

Tool Definitions: The system provides the LLM with the names, descriptions, and parameters for each available tool.

Conversation History: The current conversation history (e.g., User: I want to book a hotel room.)

Dynamic Data: The system might also retrieve today’s date using get_current_date() and pass it to the LLM.
………

LLM: Based on the tools and the user’s request, the LLM determines it needs more information. It responds:
“Hello! I can help you with that. To get started, I need the name of the hotel you’d like to book, your check-in and check-out dates, and the number of guests.”

User: I want to stay at the Marriott from next Friday to Sunday.

Context Engineering System:
The system updates the state:
{'hotel_name': 'Marriott', 'guests': None, 'check_in': 'next Friday', 'check_out': 'next Sunday'}.
It then assembles a new prompt, including the updated state and the full set of tools.

LLM: The LLM sees the new information and the state. It knows it needs to resolve the dates. It might call a date-parsing tool or ask directly. It responds: “Great. What are the exact dates for next Friday and Sunday, and how many people will be staying?”

User: It’s just me.

Context Engineering System:
The system updates the state:
{'guests': 1, 'check_in': '...', 'check_out': '...'}.
The system might then call the check_room_availability tool with the provided info. This function returns True and the current rate. This result is then added to the LLM's context.

LLM: The LLM sees that a room is available. It responds:
“I’ve checked and a room is available at the Marriott for your dates for one guest. The rate is $200 per night. Should I go ahead and book it for you?”

User: Yes, please.

Context Engineering System: The system sees the confirmation. It assembles a prompt that guides the LLM to call the book_room tool with the parameters from the state. The LLM generates the tool call, the system executes it, and the book_room tool returns a confirmation number.

LLM: Given the confirmation number returned by the system, the LLM generates a final, natural-language response for the user: “Thank you! Your reservation at the Marriott for 1 guest from [date] to [date] is confirmed. Your confirmation number is 12345.”

Let’s start by understanding: what is a context window?

A context window is the limited “working memory” of an LLM. It’s a contiguous block of text (the “context”) that the model can see and process at one time to generate a response.
When you send a prompt to an LLM, the entire prompt — including your instructions, any examples, and the conversation history — is loaded into this window. The model then uses all of the information within this window to predict the next word or token.

Imagine you’re having a conversation with someone who has a very short-term memory. They can only remember the last few things you’ve said. This is the context window. As you continue talking, the oldest parts of the conversation are “pushed out” to make room for the new information.

  • Tokens: LLMs don’t read words; they read “tokens.” A token can be a word (“hello”), a part of a word (“ing”), or punctuation (“,”).
    The context window is measured in tokens, not words. A good rule of thumb is that 1 token is roughly 4 English characters or about 3/4 of a word. So, a 4,096-token window is approximately 3,000 words.
  • Prompt + Response = Context: Every time you send a prompt to an LLM, the model’s response is also part of the context for the next turn of the conversation. The total number of tokens (prompt + response) must fit within the maximum context length.
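
As a rough illustration of the points above, here is a tiny sketch that estimates token usage from character counts. It is an approximation only; a real tokenizer (such as tiktoken) gives exact counts, and the 4-characters-per-token ratio holds only loosely for English text.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 English characters per token.
    return max(1, len(text) // 4)

prompt = "You are a hotel reservation assistant. Gather the booking details."
history = ["User: I want to book a room.", "Assistant: Which hotel and dates?"]
total = estimate_tokens(prompt) + sum(estimate_tokens(m) for m in history)
print(f"Approximate tokens used: {total} of a 4,096-token window")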

Models have a maximum context length due to computational and architectural constraints. The primary reasons are:

  • The attention mechanism, the core component of a transformer model, compares every token in the input to every other token. This computation scales quadratically with the length of the sequence. Doubling the context length quadruples the memory and processing power required.
  • Memory Limitations: Storing the “key” and “value” vectors for each token in the context window consumes a lot of GPU memory (VRAM). This KV cache grows with the length of the context, so longer contexts require substantially more memory.
  • The model’s architecture is designed for a specific context size, and it was trained on sequences of that length. Changing this after training is not straightforward.

What does a maximum context length mean?

  • 4,096 tokens: This was a common context window for early powerful models like GPT-3.5. It’s sufficient for single-turn questions or short conversations, but it quickly gets filled up in longer dialogues, forcing the model to “forget” the beginning of the conversation.
  • 1 million tokens: This represents a massive leap, seen in newer models like Gemini.
    1-million-token window is large enough to ingest entire books, codebases, or hours of audio transcripts. It allows the model to reason over vast amounts of information without needing to be pre-trained on it. This is a game-changer for tasks like summarizing long documents or analyzing a large codebase.

The practical implication is that a model with a larger context window can handle more complex, multi-turn tasks and reason over a larger set of external documents without losing its “memory.”

Challenges of Long Context Windows

Even with a massive context window, just stuffing a lot of information into it doesn’t guarantee a good response. In fact, it introduces new challenges that context engineering aims to solve.

Context Poisoning:

When an LLM response or an external piece of information that is factually incorrect gets added back into the context window, the LLM then treats that false information as fact in subsequent turns, leading to a cascade of errors.

Turn 1: The user asks to book a room.

Turn 2: The LLM, based on a faulty check_room_availability tool call (a bug in the tool), hallucinates a response: "The Grand Hyatt is fully booked for your dates." The system logs this response and adds it to the conversation history.

Turn 3: The user asks, “Is the Fairmont also fully booked?”

Turn 4: The LLM sees the previous response in its context window: “The Grand Hyatt is fully booked.” It incorrectly generalizes this hallucination and might respond:
“Yes, all hotels are fully booked for those dates,” even if the Fairmont has rooms. The false information has “poisoned” the context.

Context Distraction

When the sheer volume of information in the context window makes it difficult for the model to focus on the truly relevant parts. The signal is buried in the noise.

Initial Context: The user’s request, the hotel availability tools, and their definitions.

Additional Context: A developer, using a massive context window, decides to also inject the hotel’s entire terms of service, the menu for the hotel’s restaurant, and a list of local tourist attractions into the context.

User: “Can I get a late checkout?”

AI’s Response: The model, overwhelmed by the volume of text, might get distracted by the restaurant menu and respond:
“Yes, you can have a late checkout, but have you tried our famous lobster bisque? It’s a local favorite.”
The irrelevant information (the menu) has distracted the model from the core task.

Context Confusion

When the LLM struggles to differentiate between different pieces of context, leading to a mixing of information. This often happens when different sources of information contain conflicting or similar-looking details.

Initial Context: The LLM is provided with the user’s booking details for the Marriott.

Additional Context: The developer, in a rush, also injects a user’s previous booking history, which includes a stay at the Grand Hyatt, into the context.

User: “Can you confirm my reservation?”

AI’s Response: The model sees the current request for the Marriott but also the old booking for the Grand Hyatt. It might get confused and respond: “Yes, your reservation at the Grand Hyatt is confirmed.” It has mixed up the two pieces of context, leading to an incorrect and potentially frustrating user experience.

Context Clash

When different parts of the context window contain directly conflicting information, and the LLM doesn’t know which one to prioritize.

Initial Context: The user’s request to book the Marriott for one guest.

Conflicting Context: A previous message in the chat says, “I’m booking for my family of four.” and a different tool call log says check_room_availability(hotel='Marriott', num_guests=1).

User: “What’s the final price for my reservation?”

AI’s Response: The model now has a “clash” in its context. Should it trust the user’s last message (“family of four”) or the tool log that it was just given (“num_guests=1”)? The model’s response might be: “The price for your room is $200 per night for your family of four,” incorrectly combining the information. It doesn’t have a clear way to resolve this disagreement.

Context Management Tactics

Here are a few tactics that address the challenges above:

RAG (Retrieval-Augmented Generation)

  • A technique for dynamically fetching relevant information from a knowledge base (like a database, vector store, or set of documents) and inserting it into the LLM’s context at the moment it’s needed.
  • It prevents Context Distraction by only providing the most relevant information. Instead of injecting the entire hotel manual, RAG would be used to search for and retrieve only the specific section on “cancellation policy” when the user asks about it. This keeps the context focused and avoids irrelevant clutter.

A user asks, “What amenities does the hotel have?”
The system uses RAG to search a document store for “Marriott amenities” and retrieves a short, precise list, which is then added to the prompt.
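
A minimal sketch of that retrieve-then-prompt step is below. It uses a naive keyword-overlap score in place of a real vector store, and the document snippets are invented for illustration.

DOCS = {
    "cancellation_policy": "Cancellations must be made 48 hours in advance for a full refund.",
    "amenities": "Amenities at the Marriott include a pool, gym, spa, and free Wi-Fi.",
    "restaurant": "The rooftop restaurant serves breakfast from 7am to 10am.",
}

def retrieve(query, k=1):
    # Naive relevance score: number of words shared between the query and each document.
    q = set(query.lower().replace("?", "").split())
    ranked = sorted(DOCS.values(), key=lambda doc: -len(q & set(doc.lower().split())))
    return ranked[:k]

question = "What amenities does the hotel have?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this information:\n{context}\n\nQuestion: {question}"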

Tool Loadout

  • The process of providing the LLM with a list of available functions or APIs (tools) it can call to perform actions or get real-time information. A core part of context engineering is deciding which tools to “load” into the context for a given task.
  • It prevents the LLM from hallucinating actions and enables the system to perform a task reliably. It also helps manage context size by only providing the tool definitions that are relevant to the current conversation.

The initial prompt gives the LLM access to check_availability, book_room, and get_user_info.
When the user asks about local restaurants, the system might swap out the booking tools for a find_restaurants tool, making the context highly focused on the new task.
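
A sketch of this kind of tool selection might look like the following; the tool names and the intent check are purely illustrative.

BOOKING_TOOLS = ["check_availability", "book_room", "get_user_info"]
DINING_TOOLS = ["find_restaurants"]

def select_toolset(user_message):
    # Naive intent routing: decide which tool definitions to load into the
    # context for this turn instead of always sending every tool.
    if any(word in user_message.lower() for word in ("restaurant", "dinner", "eat")):
        return DINING_TOOLS
    return BOOKING_TOOLS

print(select_toolset("Any good restaurants near the hotel?"))  # ['find_restaurants']
print(select_toolset("Book me a room for Friday."))            # booking tools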

Context Quarantine

  • A strategy where different parts of a complex workflow are handled by separate, dedicated LLM threads, each with its own isolated context.
  • This is a powerful solution to Context Confusion. By quarantining conversations, you prevent information from one part of the interaction from bleeding into another.

The main reservation bot thread handles booking logistics.
If a user asks a question about the hotel’s amenities or restaurant, the system might hand off that question to a separate “amenities bot” thread.
This new thread has its own, isolated context about amenities and is forbidden from seeing the booking details.
Once the question is answered, the system returns to the main thread. This ensures the booking details and amenity details never get confused.

Context Pruning

  • The process of removing outdated, irrelevant, or low-priority information from the context window to keep it lean and focused.
  • It directly combats Context Distraction and also helps manage costs and latency. In a long conversation, older turns might no longer be relevant. Pruning removes them. It can also be used to remove superfluous information that may be influencing the response.

In the reservation conversation, after the booking is complete, the system might prune all the previous back-and-forth about hotel availability and dates, leaving only a short summary of the final reservation details in the context for future questions (e.g., “What’s my confirmation number?”).
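
As a small illustration, a pruning step might look like this sketch, which keeps only a final summary and the latest user turn once the booking is complete (the messages are invented).

def prune_after_booking(history, booking_summary):
    # Drop the verbose back-and-forth; keep the summary plus the newest turn.
    return [booking_summary] + history[-1:]

history = [
    "User: I want to book a hotel room.",
    "Assistant: Which hotel and dates?",
    "User: The Marriott, Friday to Sunday, just me.",
    "Assistant: Booked! Your confirmation number is 12345.",
    "User: What's my confirmation number?",
]
history = prune_after_booking(history, "Summary: Marriott, 1 guest, Fri-Sun, confirmation 12345.")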

Context Summarization

  • A technique to “boil down” a long conversation history or a large document into a concise summary. The summary is then used in place of the original full text to save tokens and maintain a high-level understanding of the context.
  • This is a crucial tactic for managing a long conversation. It prevents the context window from growing indefinitely, and it directly addresses the problem of losing a user’s intent.

After a user has had 10 back-and-forth messages about their reservation details, the system might generate a summary:
“User wants to book a room at the Marriott for two guests from Friday to Sunday.”
This summary is then used in the next turn’s prompt, replacing the entire chat history, so the LLM doesn’t have to process the full conversation again.
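
A sketch of that summarize-and-replace step, assuming a hypothetical call_llm helper for the summarization call:

def compact_history(history, keep_recent=4):
    # Once the history grows, summarize the older turns with a separate LLM
    # call and keep only the summary plus the most recent messages.
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = call_llm(
        "Summarize this hotel booking conversation in one sentence:\n" + "\n".join(older)
    )
    return [f"Summary of earlier conversation: {summary}"] + recent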

Context Offloading

  • Storing information outside the LLM’s context and only bringing it in when specifically requested. This is a form of long-term memory management.
  • It prevents Context Poisoning and Context Clash by not blindly trusting all information. The system can act as a gatekeeper, storing information in a structured database rather than the LLM’s free-text context.

When a user provides their dates and hotel choice, the system doesn’t just put that text in the context window.
It offloads that information into a structured state object or database record (state.hotel_name = 'Marriott').
If the user later changes their mind, the system updates the database record, and the LLM's context is only ever fed the current, validated state of the reservation, preventing any clash or confusion from previous, discarded information.
This also ensures that if a tool call fails, the flawed information never "poisons" the LLM's memory.
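
A minimal sketch of offloading into a validated, structured state; the field names and the validation rule are assumptions for illustration.

reservation_state = {"hotel_name": None, "check_in": None, "check_out": None, "guests": None}

def offload(field, value):
    # Validate before the value ever becomes part of the LLM's context.
    if field == "guests" and (not isinstance(value, int) or value < 1):
        raise ValueError("guests must be a positive integer")
    reservation_state[field] = value

def context_view():
    # Each turn, the LLM is shown only the current, validated state.
    return {k: v for k, v in reservation_state.items() if v is not None}

offload("hotel_name", "Marriott")
offload("guests", 1)
print(context_view())  # {'hotel_name': 'Marriott', 'guests': 1}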

Foundational Components

Context Engineering is built upon three fundamental components that collectively address the core challenges of information management in large language models.

Source: Context Engineering Framework — Foundational Component

1. Context Retrieval and Generation

This component is all about intelligently and deliberately gathering the information that the LLM needs.

1.1. Prompt Engineering and Context Generation

This is the most fundamental layer. While prompt engineering is often seen as distinct from context engineering, it’s actually the first step in creating context.

A developer writes a long, static prompt.

"You are a hotel reservation bot for The Fairmont. 
Do not mention any other hotels.
Here are the available dates: [list of dates].
Here are the room types: [list of room types].
Do not mention any other dates or room types.

Here are the booking rules: [list of rules].
I need to know the check-in date, check-out date, number of guests, and room type.
Once you have this, ask for confirmation."

This is brittle. If a date or room type changes, the developer has to manually edit the prompt. It’s a static “one-shot” instruction.

CLEAR Framework: A context-engineered system dynamically builds this prompt based on the user’s need.

Conciseness: Only the necessary instructions are sent.
Logic: The system’s logic (e.g., “The Fairmont is the only hotel”) is enforced by a layer outside the LLM.
Explicitness: The system defines the required inputs explicitly (e.g., “ask for hotel_name, check_in_date, check_out_date, guests").
Adaptability: The prompt is dynamically updated based on the system state.
Reflectiveness: The system can monitor its own performance and, for instance, notice if the LLM is consistently asking for the same information, indicating a need for a more explicit instruction.

  • Zero-Shot:
    User: Can you book me a hotel?
    The LLM, with no examples, tries its best. It might get the key information but may not follow the conversational flow or format you want.
  • Few-Shot: The system dynamically adds examples to the prompt to guide the LLM’s behavior.

“Here are examples of how to respond:
User: I want a room for two guests.
LLM: Great. What are the check-in and check-out dates?

User: Book me a room at the Fairmont for this weekend.
LLM: Got it. I need the exact check-in and check-out dates, and the number of guests.”

This provides a template for the LLM to follow, making its responses more predictable and consistent.

  • CoT (Chain of Thought): A context-engineered system would use CoT to ensure the LLM reasons before acting.

Prompt: “User wants to book a room at The Fairmont for two people from August 10th to 15th. <tool_definitions> Please think step-by-step before calling a tool:

What information do I have?

What information do I still need?

What tool should I call?

What is the final plan?”

This forces the LLM to follow a logical path, which is especially useful for preventing it from calling a tool before all the necessary parameters are gathered.

  • ToT (Tree of Thoughts): For a more complex task like “Plan a romantic weekend getaway,” the system might use ToT. It would ask the LLM to generate multiple potential plans (e.g., “Plan A: Spa and fine dining,” “Plan B: Hiking and outdoor activities”), evaluate each one, and then select the best path to pursue with the user.

1.2. External Knowledge Retrieval

This moves beyond the static prompt to using external data sources.

  • RAG: This is a core component. The system doesn’t rely on the LLM’s pre-trained knowledge for hotel information.

User: What’s the cancellation policy for The Fairmont?
The system uses RAG to query its knowledge base and retrieves the specific text: "Cancellations must be made at least 48 hours in advance for a full refund." It then inserts this text into the prompt before asking the LLM to answer.

  • Knowledge Graph Integration: For structured data like hotel features, a knowledge graph is ideal.

User: I need a hotel with a pool and a gym.
A context-engineered system converts this into a query: FIND hotel WHERE has_amenity="pool" AND has_amenity="gym".
The knowledge graph returns a list of matching hotels.
This list is then provided to the LLM, which can then generate a natural language response.

1.3. Dynamic Context Assembly

This is the orchestration layer that brings everything together.

Using a single prompt:

User: I want to book a room at The Fairmont from the 1st to the 5th for two people.
My budget is $300 a night. [The LLM must now deal with all of this information at once]

The problem is that the LLM has to parse everything and infer the correct sequence of actions.

Using Context Engineering:

A system orchestrates a series of steps.

Orchestrator: Parses the user’s intent: book_hotel.

State Manager: Updates the state object: {'hotel': 'The Fairmont', 'check_in': '1st', 'check_out': '5th', 'guests': 2, 'budget': '$300'}.

Tool Executor: Calls the check_availability tool with the state data. The tool returns {'available': True, 'rate': '$350'}.

Prompt Assembler: Creates a new, optimized prompt for the LLM that includes:

System instructions for booking.

A concise summary of the state.

The result from the check_availability tool call.

The user’s last message (“My budget is $300 a night.”).

LLM’s Task: The LLM’s new, highly-focused task is to compare the budget ($300) with the rate ($350) and inform the user that it's over budget.
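
A sketch of the prompt assembler step described above; the field names and wording are illustrative.

def assemble_prompt(state, tool_result, last_user_message):
    # Combine system instructions, a state summary, the latest tool result,
    # and the user's last message into one focused prompt.
    return "\n".join([
        "You are a hotel reservation assistant. Help the user complete a booking.",
        f"Current reservation state: {state}",
        f"Result of check_availability: {tool_result}",
        f"User's last message: {last_user_message}",
        "Task: compare the user's budget with the quoted rate and respond accordingly.",
    ])

prompt = assemble_prompt(
    {"hotel": "The Fairmont", "check_in": "1st", "check_out": "5th", "guests": 2, "budget": "$300"},
    {"available": True, "rate": "$350"},
    "My budget is $300 a night.",
)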

2. Context Processing

This component deals with the optimization and transformation of the information after it has been retrieved but before the LLM sees it.

2.1. Long Context Processing

These are architectural improvements within the LLM itself to handle huge amounts of information efficiently.

  • State Space Models (SSMs): In our hotel example, if the system needed to analyze a year’s worth of booking data from a log file to predict trends, an SSM could process that massive sequence more efficiently than a traditional transformer.
  • Position Interpolation: This allows a model trained on a 4K token window to “stretch” its context to, say, 8K, enabling it to process a slightly longer conversation without needing a full-model update.
  • Grouped-Query Attention (GQA): GQA shares key/value heads across groups of query heads, shrinking the memory needed per request; for a hotel system with many concurrent users, this lets the model serve all their separate requests more quickly and efficiently.

2.2. Contextual Self-Refinement and Adaptation

This is about the LLM’s ability to correct its own mistakes.

Vibe Coding: If the LLM gives a wrong date, the user has to correct it in the next turn, and the LLM might still get confused.

Using Context Engineering (Self-Refinement): The system might add a “critic” layer.

Prompt: “You are a reservation bot. The user wants to book August 10th-15th. The user’s most recent message says they need to book August 10th-12th.
Your previous response was: ‘Booking from August 10th to 15th.’
<tool_definitions> Critique your previous response. Is it correct? If not, what should the corrected response be?”

This allows the system to force the LLM to review its output against the latest user input before a response is sent, preventing errors from propagating.
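
A sketch of such a critic pass, again assuming a hypothetical call_llm helper:

def critique_and_correct(draft_reply, latest_user_message):
    # Ask the model to check its own draft against the user's newest input
    # before anything is sent back to the user.
    verdict = call_llm(
        "You are a reservation bot reviewing your own draft reply.\n"
        f"Latest user message: {latest_user_message}\n"
        f"Draft reply: {draft_reply}\n"
        "If the draft matches the user's latest request, answer OK. "
        "Otherwise, write the corrected reply."
    )
    return draft_reply if verdict.strip() == "OK" else verdict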

3. Context Management

This component focuses on the practical techniques for organizing and using context over time, particularly for long-running interactions.

3.1. Fundamental Constraints & “Lost-in-the-Middle”

Vibe Coding: A long, unmanaged conversation with the hotel bot. The user asks about their reservation made 20 messages ago. The key details of the reservation are “lost in the middle” of the conversation history, and the LLM hallucinates, “I don’t see a reservation for you.”

Context Engineering: The system addresses this with deliberate design. It might move the most critical information to the end of the prompt where the model is more likely to pay attention, or use other techniques to manage the conversation history.

3.2. Memory Hierarchies and Storage Architectures

  • OS-inspired Hierarchical Memory (MemGPT): A context-engineered reservation system would use a hierarchy.
  • Main Context (Working Memory): The last few messages of the current conversation. This is the “RAM.”
  • External Context (Long-Term Memory): A database of past reservations, user preferences, and loyalty points. This is the “hard drive.” When the user says, “Can you remind me of my last booking?”, the system uses a function call to “page in” the relevant information from the external context to answer the question, without the LLM needing to store that data permanently.
  • MemoryBank: This would store key facts from the conversation (e.g., “User’s check-in date is August 10th”) with a “strength” score. Over time, less important facts fade, while the most critical ones are retained and retrieved when needed, much like human memory.

3.3. Context Compression

  • Autoencoder-based Compression (ICAE): This is an advanced technique where the system uses a small model to create a compressed representation of a long conversation.

Initial conversation: Many messages about dates, guests, room types, and price.

Compressed Context: A small number of tokens that represent the core meaning: “The user has booked a king suite at The Fairmont for two guests from August 10th to 15th for $350 a night. The confirmation number is 12345.”

This compressed summary is then used in future prompts, drastically reducing token usage and improving latency while preserving the most important information.

System Implementations

Building upon the foundational components we’ve discussed, let’s explore how these concepts come to life in practical, real-world systems. Here are four major categories of system implementations.

1. RAG Systems (Retrieval-Augmented Generation)

RAG systems are about giving the LLM access to external, real-time knowledge. They move beyond the static, pre-trained knowledge of the model to a dynamic system that retrieves and injects information on demand.

Source: RAG Framework

1.1. Modular RAG Architectures

A modular RAG system breaks the knowledge base into different, organized components. For our hotel system, this means having separate databases or knowledge stores for different types of information.

Implementation: The system doesn’t have one giant database. Instead, it has:

  • A Hotel_Facts database (for amenities, addresses, policies).
  • A Booking_History database (for past reservations).
  • A Real-time_Availability API endpoint.

Example Prompt:

def get_hotel_info(query):
    # A modular RAG function that decides which data source to use
    if "policy" in query.lower():
        # Retrieve from the Hotel_Facts DB
        return db.query("SELECT policy FROM Hotel_Facts WHERE hotel_name='Marriott'")
    elif "past bookings" in query.lower():
        # Retrieve from the Booking_History DB
        return db.query("SELECT * FROM Booking_History WHERE user_id='user123'")
    else:
        # Fallback retrieval
        return db.query("SELECT * FROM Hotel_Facts WHERE hotel_name='Marriott'")

# The system assembles the prompt dynamically
retrieved_info = get_hotel_info("What's the cancellation policy for the Marriott?")

prompt = f"""
You are a hotel reservation assistant.
User's request: "What is the cancellation policy for the Marriott?"

Relevant Information retrieved from our database:
{retrieved_info}

Based on the information above, please provide a clear and concise answer.
"""

This approach ensures the LLM receives only the most relevant, up-to-date information, preventing distraction and improving accuracy.

1.2. Agentic RAG Systems

This is a more advanced RAG implementation where an autonomous agent (an LLM with a reasoning loop) decides how to retrieve information. It uses a think-then-act cycle.

Implementation: The agent starts with a high-level goal and then performs a series of thought, action, and observation steps to achieve it.

# The agent's reasoning loop
#
# Agent's initial thought: I need to check the hotel's amenities.
# Agent's initial action: Call the `search_amenities` tool for 'The Fairmont'.
# Agent's initial observation: Tool returns a list of amenities: ['pool', 'gym', 'spa'].

prompt = f"""
Thought: The user wants to book a room at The Fairmont. I have the amenities list: {amenities}. Now I need to check for availability and price.
Action: Call the `check_availability` tool.
Action Input: {{'hotel_name': 'The Fairmont', 'check_in': 'August 10', 'check_out': 'August 15', 'guests': 2}}
"""

This demonstrates an agent reasoning about what information is missing and then actively taking a step to retrieve it, rather than just being handed information by a static orchestrator.

1.3. Graph-Enhanced RAG

This is a powerful RAG system for handling complex relationships. Instead of retrieving chunks of text, it queries a knowledge graph that connects entities like hotels, amenities, locations, and user preferences.

Implementation: The system’s knowledge base is a graph. A query like “hotels with a pool near downtown” translates into a graph traversal.

User: “Find me a pet-friendly hotel in San Francisco with a gym.”

Graph Query: MATCH (h:Hotel)-[:HAS_AMENITY]->(a:Amenity), (h)-[:LOCATED_IN]->(c:City) WHERE a.name='gym' AND h.is_pet_friendly=true AND c.name='San Francisco' RETURN h.name

System Action: The graph query returns ["The Fairmont", "Hotel Nikko"]. This list is then injected into the LLM's context.

LLM Prompt:

Relevant Information: [“The Fairmont”, “Hotel Nikko”]
User’s request: “Find me a pet-friendly hotel in San Francisco with a gym.” Please list the hotels that match the user’s request.

This approach is highly accurate for complex, multi-criteria queries and avoids the “lost-in-the-middle” problem that can plague document-based RAG.

2. Memory Systems

Memory systems allow the LLM to move beyond a single conversation and recall information across sessions, turning a stateless bot into a persistent agent.

Source: Memory Systems

Implementation: The system uses a multi-level memory architecture.

# The agent's memory
MAX_CONTEXT_LENGTH = 20  # max messages kept in short-term memory (assumed value)

class HotelAgentMemory:
    def __init__(self):
        # Short-term memory (in-context) for the current session
        self.session_context = []
        # Long-term memory (database) for past interactions
        self.user_profile_db = {}

    def add_to_session(self, message):
        self.session_context.append(message)
        # Prune old messages to stay within the context window
        if len(self.session_context) > MAX_CONTEXT_LENGTH:
            self.session_context.pop(0)

    def save_to_long_term(self, user_id, preferences):
        # Save user preferences from this session to the long-term memory
        self.user_profile_db[user_id] = preferences

    def retrieve_from_long_term(self, user_id):
        # Retrieve from long-term memory at the start of a new session
        return self.user_profile_db.get(user_id, {})

Session 1: A user books a hotel and mentions, “I prefer a room on a high floor.”

System Action: The agent saves "high floor" to the long-term memory associated with user_id=123.

Session 2 (a month later): The user returns and says, “Book me a room at the Marriott.”

System Action: The agent’s memory system retrieves the user’s preference from long-term memory ("high floor").

Agent’s Prompt:

You are a hotel booking agent. The user’s preferences are: {preferences}. User’s request: “Book me a room at the Marriott.”

The agent can now proactively say, “I’ll book you a room at the Marriott. I remember you prefer a high floor; I’ll add that to your request.” This shows true persistence and adaptation.

3. Tool-Integrated Reasoning

This transforms the LLM from a passive text generator into an active participant that can use external tools to interact with the world.

Source: Tool Augmented Systems Framework

3.1. Function Calling Mechanisms

This is the cornerstone of tool use. The LLM generates structured output (e.g., JSON) that the system recognizes as a command to execute a function.

Implementation: We define tools with clear names and parameters.

Prompt to the LLM:
You are an assistant with access to the following tools:

Tool 'check_availability':
- description: Check room availability and rates.
- parameters: {'hotel_name': str, 'check_in_date': str, 'check_out_date': str}

Tool 'book_room':
- description: Finalizes a room booking.
- parameters: {'hotel_name': str, 'check_in_date': str, 'check_out_date': str, 'guests': int}

User message: "Book me a room at the Marriott for two guests from tomorrow to Sunday."

Response from the LLM (structured JSON):
```json
{
  "tool_name": "book_room",
  "parameters": {
    "hotel_name": "Marriott",
    "check_in_date": "2025-08-04",
    "check_out_date": "2025-08-10",
    "guests": 2
  }
}
```

The system then receives this JSON, executes the book_room function with the provided parameters, and uses the result to inform the next LLM response.
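
On the system side, the dispatch step can be as simple as the following sketch; the tool implementations here are stand-ins for real booking APIs.

import json

def check_availability(hotel_name, check_in_date, check_out_date):
    return {"available": True, "rate": "$200"}   # stand-in for a real API call

def book_room(hotel_name, check_in_date, check_out_date, guests):
    return {"confirmation_number": "12345"}      # stand-in for a real API call

TOOL_REGISTRY = {"check_availability": check_availability, "book_room": book_room}

def dispatch(llm_output):
    # Parse the structured JSON the LLM produced and execute the matching function.
    call = json.loads(llm_output)
    return TOOL_REGISTRY[call["tool_name"]](**call["parameters"])

result = dispatch('{"tool_name": "book_room", "parameters": {"hotel_name": "Marriott", '
                  '"check_in_date": "2025-08-04", "check_out_date": "2025-08-10", "guests": 2}}')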

3.2. Tool-Integrated Reasoning (TIR)

TIR is about using tools during the reasoning process to solve a complex problem step-by-step.

Implementation: A “ReAct” (Reasoning and Acting) loop.

# A simplified ReAct loop in a hotel booking agent
# User: "I want to book a room at the Marriott for 3 nights next month, but it can't be more than $200 a night."

# LLM Thought 1: The user wants to book a room. I need to find the dates for "next month" and then check availability and price.
# LLM Action 1: Call `get_current_date` to find today's date.
# LLM Observation 1: The tool returns `2025-08-03`.

# LLM Thought 2: I know today's date is August 3rd. "Next month" would be September. I need to find the first 3 nights.
# LLM Action 2: Call `check_availability` with dates '2025-09-01' to '2025-09-04'.
# LLM Observation 2: The tool returns `{'available': True, 'rate': '$250'}`.

# LLM Thought 3: The rate is $250, but the user's budget is $200. I cannot book this room. I need to inform the user and suggest alternatives.
# Final Response: "The Marriott is available from September 1st to 4th, but the rate is $250 per night, which is over your budget of $200. Would you like me to check other hotels?"

This shows the LLM actively reasoning, using tools to gather information, and then synthesizing a final response based on the observed data.

4. Multi-Agent Systems

This is the most advanced implementation, where multiple specialized agents collaborate to solve a problem.

This implementation focuses on sophisticated communication protocols, orchestration mechanisms, and coordination strategies that enable seamless collaboration across diverse agent architectures.

Source: Multi-Agent Systems

Implementation: A central orchestrator (BookingOrchestrator) manages a team of specialized agents (ReservationsAgent, PaymentAgent, CustomerServiceAgent). Each agent has a specific role, toolset, and communication protocol.

# An orchestration workflow for a complex reservation
def handle_reservation_workflow(user_request):
    # 1. Orchestrator receives the request
    orchestrator_state = {'task': 'Book a room', 'user_request': user_request}

    # 2. Orchestrator delegates to the Reservations Agent
    reservations_agent_prompt = "User wants to book a room. Please gather all details."
    reservations_agent_response = reservations_agent.run(reservations_agent_prompt)

    # 3. Reservations Agent and user negotiate details. Once details are confirmed,
    #    the Reservations Agent sends a message back to the Orchestrator.
    if reservations_agent_response['status'] == 'booking_confirmed':
        orchestrator_state['booking_details'] = reservations_agent_response['details']

        # 4. Orchestrator now delegates to the Payment Agent
        payment_agent_prompt = "A booking is confirmed. Please process payment."
        payment_agent_response = payment_agent.run(payment_agent_prompt, details=orchestrator_state['booking_details'])

        # 5. Payment Agent processes and sends a confirmation back.
        if payment_agent_response['status'] == 'payment_successful':
            # 6. Orchestrator hands off to the Customer Service Agent for final user communication
            final_response_prompt = f"The booking is confirmed. Confirmation number: {payment_agent_response['conf_id']}."
            customer_service_agent.run(final_response_prompt)

User: “I want to book a room, but I need to use my loyalty points for a discount and split the payment between two cards.”

Orchestrator: Sees this is a complex task.

Delegation 1: It hands off to the ReservationsAgent to handle the booking_confirmed state.

Delegation 2: It then hands off to a LoyaltyAgent to apply the points. The LoyaltyAgent might communicate back with the ReservationsAgent to adjust the price.

Delegation 3: The orchestrator then hands off to a PaymentAgent, which has a specialized tool to handle split payments.

This shows how each agent, with its specific context and toolset, can handle a subtask, with the orchestrator managing the flow to ensure the complex request is completed correctly. It’s a robust solution for transactional integrity and complex, multi-step workflows.

Common Strategies for Agent Context Engineering in LangChain

The LangChain blog post “Context Engineering for Agents” argues that for an LLM agent to be effective, its context must be actively and intelligently managed.
Simple prompt engineering (just telling the agent what to do) is insufficient for complex tasks. Instead, a sophisticated system must be built around the agent that writes, selects, compresses, and isolates context.

Source: Context

Let’s explore each of the four strategies in detail.

1. Write Context

This is about actively building the working memory of the agent. The agent needs a static foundation to know its role and a dynamic “scratchpad” to think and act.

Source: WriteContext
  • Scratchpads (Within an Agent Session): This is the core of the popular ReAct (Reasoning and Acting) framework. The agent’s internal monologue is dynamically “written” into the context with each turn. The sequence of Thought, Action, and Observation is explicitly added to the context. This allows the agent to reason about its next steps, see the results of its previous actions, and maintain a coherent plan.

A user wants to book a room. The agent’s scratchpad context would evolve like this:

# Initial Context from the user

User: “I want to book a room for two guests.”

# The agent’s first turn of writing to its scratchpad

Thought: The user wants to book a room, but I’m missing key details. I need the hotel name and the dates.
Action: I should ask the user for the missing information.
Action Input: I need the hotel name, check-in date, and check-out date.

# User responds. The new context is added.
Observation: The user provides the hotel name as “Marriott”.

# The agent’s second turn of writing to its scratchpad
Thought: I now have the hotel name and number of guests. I still need the dates. I should ask for them.
Action: I will ask the user for the dates.
Action Input: What are your check-in and check-out dates?

This ongoing process of writing to the scratchpad ensures the agent doesn’t “forget” what it’s trying to do.

  • Memories (Long-Term, Across Sessions): While a scratchpad works for a single session, memories store information for future interactions. This could be a ConversationBufferWindowMemory for a long conversation or a RAG-based system that stores user preferences for months.

The system saves user preferences to a persistent store.

A user says, “I prefer a room on a high floor.”

The system uses an agent to parse this preference and “writes” it to a long-term memory (e.g., a user_preferences database).

A month later, the same user returns and simply says, “Book me a room at the Marriott.”

The agent’s initial prompt assembly includes the long-term memory, allowing it to proactively say, “I’ll book that for you and will request a high floor, as you’ve mentioned before.”

2. Select Context

This strategy is about being selective to prevent a token-heavy, unfocused context. The agent actively retrieves only the information it deems necessary.

  • Scratchpads and Memories: The agent’s AgentExecutor automatically selects the most recent turns from the scratchpad or the ConversationBufferWindowMemory, pruning older, less relevant messages.
  • Tools: An agent is not given a list of every possible tool it could ever use. Instead, the system presents it with a focused set of tools relevant to the current task.

For a simple booking task, the agent might only be given tools like get_available_rooms, book_room, and cancel_reservation.

If a user asks, “What’s the best local restaurant?”, the orchestrator might switch the toolset, giving the agent a search_local_restaurants tool instead. This is a form of selection that focuses the agent on a new task.

  • Knowledge (RAG): This is a prime example of context selection. Instead of providing the agent with the entire hotel policy manual, the system uses a retriever to select only the most relevant document chunk.

# Agent’s Action: Call a tool to search for hotel policy.
Action: Use the knowledge base to find relevant documents for ‘The Fairmont cancellation policy’.

# The tool returns a single, relevant paragraph.
Observation: “Cancellations must be made at least 48 hours in advance for a full refund. Cancellations within 48 hours will incur a fee equal to one night’s stay.”

# This is the only new context added.

This prevents Context Distraction by ensuring the agent doesn't see irrelevant information like the spa menu while discussing a cancellation.

3. Compress Context

This is a vital strategy for managing long conversations and avoiding token limits. It reduces the size of the context while preserving its core meaning.

Source: Compress context
  • Context Summarization: Instead of keeping the entire chat history, the system periodically uses a separate LLM call to summarize the old parts of the conversation. The summary is then stored and used in place of the full history.

After 20 messages of back-and-forth about dates, room types, and rates, the system would generate a summary like this:

# The user has successfully booked a King Suite at the Marriott for two guests from August 10th to 14th. The final rate is $250 per night.

This single sentence replaces the entire verbose conversation in the context, allowing the conversation to continue without hitting the token limit. LangChain’s ConversationSummaryBufferMemory is designed for this.

  • Context Trimming: This is a simpler form of compression. It involves using a fixed-size window (ConversationBufferWindowMemory) and simply discarding the oldest tokens.

The system maintains a window of the last 5 turns. When the 6th turn occurs, the oldest turn (turn 1) is automatically trimmed from the context.
This works well for many conversational tasks where the most recent information is the most important.
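
For reference, a trimming window built with the classic LangChain memory classes mentioned above looks roughly like this; newer LangChain releases organize memory differently, so treat it as a sketch of the older API.

from langchain.memory import ConversationBufferWindowMemory

# Keep only the last k exchanges in the prompt; older turns are discarded.
window_memory = ConversationBufferWindowMemory(k=5)
window_memory.save_context({"input": "I want to book a room."},
                           {"output": "Which hotel and which dates?"})
window_memory.save_context({"input": "The Marriott, Friday to Sunday."},
                           {"output": "How many guests will be staying?"})
print(window_memory.load_memory_variables({})["history"])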

4. Isolate Context

This strategy is about architecting the system to prevent confusion and maintain integrity by compartmentalizing information.

  • Multi-Agent: Instead of a single, monolithic agent, the system uses an orchestrator to delegate tasks to multiple specialized agents, each with its own isolated context and toolset.
Source: MultiAgent

A complex request might involve three agents:

ReservationsAgent: Has its own tools to check availability and book rooms. Its context is isolated to booking logistics.

PaymentsAgent: Has its own tools to handle credit card transactions. Its context is isolated to payment processing.

CustomerServiceAgent: Handles general questions and post-booking support.

When a user says, “Book a room and charge it to my credit card,” the orchestrator first sends the request to the ReservationsAgent. After the booking is confirmed, the orchestrator passes the booking details to the PaymentsAgent. This isolation prevents the ReservationsAgent from needing access to sensitive payment tools and ensures that each agent's context remains clean and focused.

  • Context Isolation with Environments: Tools often need to run in a safe, isolated environment (a “sandbox”). The agent’s main context is isolated from the messy, potentially risky execution of a tool.
Source: Context Isolation with Environment

If a tool for checking hotel availability is an external API call, the agent doesn’t see the raw HTTP request/response. It only sees a clean Observation in its context: "The check_availability tool returned that the Marriott is available for the given dates with a rate of $200 per night." The complex internal workings of the tool are hidden and isolated.

  • State: Key information is extracted from the free-form text of the conversation and isolated into a structured state object. This prevents Context Confusion and Context Clash.

Instead of the LLM having to parse “I need a room for two guests, but my friend is coming too, so make it three guests now,” the system isolates the key fact into a structured state:

current_booking_state = {
    "hotel": "Marriott",
    "guests": 3,
    "check_in": "August 10",
    "check_out": "August 15",
}

This single, canonical source of truth prevents the agent from getting confused by conflicting or superseded information in the chat history. The agent’s context is only ever updated with the clean, structured state.

If you found this article helpful, please give it a clap, leave a comment with your thoughts on context engineering, and share it with your network. I appreciate your support, and I’ll see you in the next article!

Published via Towards AI

Note: Content contains the views of the contributing authors and not Towards AI.