In July 2025, Gartner declared: "Context engineering is in, and prompt engineering is out." This wasn't just analyst hype—it reflected a fundamental shift in how production AI systems are built. While prompt engineering focused on crafting clever instructions, context engineering treats the entire information environment as a designable system.
This post explores three interconnected ideas: the paradigm shift from prompts to context, why XML structures outperform JSON for LLM communication, and how progressive disclosure transforms memory management from a token-burning liability into a strategic advantage.
The Shift: From Prompt Engineering to Context Engineering
Prompt engineering was the craft of tweaking input phrasing to coax better outputs. "Maybe if I say 'step by step' the model will reason better." It was necessary but ultimately limited: you can't prompt your way out of missing information.
Context engineering is the discipline of designing systems that provide the right information, in the right format, at the right time. It encompasses:
- System prompts that set behavioral boundaries
- Tool definitions that extend model capabilities
- Retrieved knowledge from RAG and vector databases
- Memory systems that persist across sessions
- Dynamic context loaded based on task requirements
"Most agent failures are not model failures anymore—they are context failures." — Anthropic Engineering
The insight is that LLMs have finite attention budgets. Research on "context rot" shows that as token counts increase, recall accuracy degrades. The goal isn't to stuff more information in—it's to curate the smallest set of high-signal tokens that maximize success probability.
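As a concrete illustration of that curation goal, here is a minimal sketch (illustrative, not from any of the cited sources) of a context assembler that packs the highest-scoring snippets into a fixed token budget; the chars/4 token estimate and the pre-scored snippets are assumptions.

# Illustrative: greedily pack the highest-signal snippets into a fixed token budget
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose
    return len(text) // 4

def assemble_context(scored_snippets: list[tuple[float, str]], budget: int = 4000) -> str:
    selected, used = [], 0
    for score, text in sorted(scored_snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return "\n\n".join(selected)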
Why XML Beats JSON for LLM Structured Data
When building memory systems, format matters more than you'd expect. The examples below come from a real insurance brokerage CRM, where customer memory is stored in an llm.txt-style format.
The Format Question
# JSON: Flat, ambiguous boundaries
{
"instruction": "Analyze this contract",
"context": "Client is a roofing contractor...",
"contract": "AGREEMENT entered into...",
"example": "Previous analysis showed..."
}
# XML: Clear semantic boundaries, nestable
<task>
<instruction>
Analyze this contract for liability risks.
Focus on indemnification and insurance clauses.
</instruction>
<context>
<client_profile>
Roofing contractor, 25 years in business,
$500K revenue, 4 employees
</client_profile>
<risk_factors>
High-risk industry, hurricane zone,
uses subcontractors
</risk_factors>
</context>
<contract>
AGREEMENT entered into this 4th day of December...
[Full contract text]
</contract>
<output_format>
<findings>List each risk with severity rating</findings>
<recommendations>Actionable next steps</recommendations>
</output_format>
</task>

Why XML Wins for Claude (and Most LLMs)
Claude was specifically trained with XML tags in inputs and outputs. Controlled studies across 10,000+ prompts found that XML-scaffolded prompts achieved 23% higher accuracy on mathematical reasoning tasks compared to JSON.
Semantic Clarity
Tags like <instruction>, <context>, and <output_format>
eliminate ambiguity about what's being asked vs. what's reference material.
Hierarchical Nesting
XML naturally represents nested structures—a customer with events, each event with attachments, each attachment with metadata. JSON requires awkward array indexing.
Parseability
When Claude outputs XML, post-processing is trivial. Extract <findings> and
<recommendations> separately for different downstream uses.
Reasoning Delimiters
Tags like <thinking> and <answer> enable sophisticated
multi-step reasoning within structured frameworks.
The tradeoff is token count—XML uses more tokens than JSON for the same data. But the accuracy gains more than compensate, especially for complex reasoning tasks where misinterpretation is costly.
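To make the parseability point concrete: when Claude wraps its answer in the <findings> and <recommendations> tags requested in the <output_format> above, downstream extraction needs only a few lines of standard-library code. A minimal sketch (the sample response text is invented for illustration):

import re

response = """<findings>Broad indemnification clause (high severity); no waiver of subrogation.</findings>
<recommendations>Request mutual indemnification; confirm GL limits before signing.</recommendations>"""

def extract_tag(text: str, tag: str) -> str:
    # Return the content of a single XML-style tag, or "" if the tag is absent
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

findings = extract_tag(response, "findings")
recommendations = extract_tag(response, "recommendations")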
The Problem: Flat Memory Dumps
Consider a typical customer memory file for an insurance brokerage. Every interaction—phone calls, emails, documents, notes—serialized into a single context payload:
{
"llm_txt": "<company_context>
Company: Skyline Roofing Solutions
Industry: Roofing Contractor
Address: 1250 Harbor View Drive, Tampa, FL
Phone: +18135550142
Email: mrodriguez@skylineroofing.com
Annual Revenue: $500,000
Employees: 4 full-time
...
</company_context>
<events_timeline>
<total_events>68</total_events>
<event index='1'>
<type>Phone</type>
<timestamp>2025-12-04T13:44:58.765000+00:00</timestamp>
<summary>Discussion about claims made policy...</summary>
<content>[Full 2000-word transcript]</content>
</event>
<event index='2'>
<type>Email</type>
<timestamp>2025-12-03T09:15:22.000000+00:00</timestamp>
<sender>sarah@acmeinsurance.com</sender>
<recipient>mrodriguez@skylineroofing.com</recipient>
<content>[Full email body + attachments]</content>
<attachments count='3'>
<attachment index='1'>
<filename>Quote_GL_2025.pdf</filename>
[Full PDF content extracted]
</attachment>
...
</attachments>
</event>
<!-- 66 more events with full content... -->
</events_timeline>"
}
# Total: ~132KB, ~16,000 words, ~46,000 tokens

Why This Fails
- Token waste: Most queries need 2-3 events, not 68
- Context rot: Important details buried in noise, recall degrades
- Cost explosion: $0.69 per query × thousands of daily queries = budget destroyed
- Latency: More tokens = slower inference
- Attention dilution: Model struggles to identify relevant information
The Solution: Progressive Disclosure
Anthropic introduced progressive disclosure as a core design principle for Agent Skills, but the pattern applies broadly to any memory system. The idea: organize information in layers, like a filesystem, allowing the LLM to explore only what's needed.
# Layer 0: Company Context (always loaded)
<company_context>
<name>Skyline Roofing Solutions</name>
<industry>Roofing Contractor - Residential/Commercial</industry>
<location>Tampa, FL</location>
<contact>Marcus Rodriguez (+18135550142)</contact>
<financials>
<revenue>$500,000</revenue>
<payroll>$300,000</payroll>
</financials>
<insurance_needs>General Liability, Workers Comp</insurance_needs>
</company_context>
# Layer 1: Events Index (summaries only)
<events_index total="68">
<event id="evt_001" type="phone" date="2025-12-04">
Claims made policy discussion - awaiting quotes
</event>
<event id="evt_002" type="email" date="2025-12-03">
GL quote documents sent (3 attachments)
</event>
<event id="evt_003" type="document" date="2025-12-02">
Certificate of Insurance uploaded
</event>
<!-- Summaries only, ~50 tokens each vs ~500+ for full content -->
</events_index>
# Layer 2+: On-demand via Memory MCP
# LLM can request: get_event_detail(evt_001)
# Returns full transcript/content only when needed

The Three Layers
Company Context
Always loaded. Core identity, contact info, key metrics. ~500 tokens.
Events Index
Summaries only. Type, date, one-line description. ~50 tokens each.
On-Demand Detail
Full content via tool calls. Transcripts, emails, PDFs—loaded only when relevant.
This mirrors how humans navigate information. You don't memorize every email—you remember "there was an email about quotes last week" and retrieve it when needed.
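One way to make the three layers concrete is a small set of data classes, where only the first two layers are ever serialized into the base context. This is an illustrative sketch; the field names are assumptions rather than the CRM's actual schema.

from dataclasses import dataclass, field

@dataclass
class CompanyContext:      # Layer 0: always loaded (~500 tokens)
    name: str
    industry: str
    location: str
    contact: str
    insurance_needs: list[str] = field(default_factory=list)

@dataclass
class EventSummary:        # Layer 1: index entry (~50 tokens)
    event_id: str
    event_type: str        # "phone" | "email" | "document" | "note"
    date: str
    one_liner: str

@dataclass
class EventDetail:         # Layer 2: loaded on demand via a tool call
    event_id: str
    full_content: str
    attachments: list[str] = field(default_factory=list)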
Memory MCP: The Implementation
The Model Context Protocol (MCP) provides the infrastructure for progressive disclosure. Instead of dumping everything into context, we expose tools that let the agent explore:
from mcp.server import Server
from mcp.types import Tool, TextContent
import json

# memory_store and document_store are assumed async data-access layers (not shown here)
server = Server("customer-memory-mcp")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="get_customer_summary",
            description="Get high-level customer context and event index",
            inputSchema={
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"}
                },
                "required": ["customer_id"]
            }
        ),
        Tool(
            name="get_event_detail",
            description="Retrieve full content of a specific event",
            inputSchema={
                "type": "object",
                "properties": {
                    "event_id": {"type": "string"},
                    "include_attachments": {"type": "boolean", "default": False}
                },
                "required": ["event_id"]
            }
        ),
        Tool(
            name="search_events",
            description="Search events by keyword, type, or date range",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "event_type": {"type": "string", "enum": ["phone", "email", "document", "note"]},
                    "date_from": {"type": "string", "format": "date"},
                    "date_to": {"type": "string", "format": "date"}
                }
            }
        ),
        Tool(
            name="get_document_content",
            description="Extract and return content from a specific document/attachment",
            inputSchema={
                "type": "object",
                "properties": {
                    "document_id": {"type": "string"},
                    "extract_type": {"type": "string", "enum": ["full", "summary", "key_points"]}
                },
                "required": ["document_id"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "get_customer_summary":
        # Returns ~2KB: company context + event summaries
        summary = await memory_store.get_summary(arguments["customer_id"])
        return [TextContent(type="text", text=json.dumps(summary))]
    elif name == "get_event_detail":
        # Returns full event content only when explicitly requested
        event = await memory_store.get_event(
            arguments["event_id"],
            include_attachments=arguments.get("include_attachments", False)
        )
        return [TextContent(type="text", text=json.dumps(event))]
    elif name == "search_events":
        # Returns matching event summaries, not full content
        results = await memory_store.search(
            query=arguments.get("query"),
            event_type=arguments.get("event_type"),
            date_range=(arguments.get("date_from"), arguments.get("date_to"))
        )
        return [TextContent(type="text", text=json.dumps(results))]
    elif name == "get_document_content":
        # Extracts PDF/document content on demand
        content = await document_store.extract(
            arguments["document_id"],
            extract_type=arguments.get("extract_type", "summary")
        )
        return [TextContent(type="text", text=content)]

The Agent Workflow
# Progressive Disclosure in Action
# Step 1: Agent receives query
query = "What insurance quotes has Marcus received?"
# Step 2: Load Layer 0 (always in context) - ~500 tokens
context = """
<company_context>
Skyline Roofing Solutions | Tampa, FL
Contact: Marcus Rodriguez | Seeking GL Insurance
</company_context>
"""
# Step 3: Agent calls Memory MCP to explore
events = await mcp.call("search_events", {
    "query": "quote",
    "event_type": "email"
})
# Returns summaries (~200 tokens):
# - evt_002: GL quote from Coastal Insurance ($4,200/yr)
# - evt_015: Quote from SafeGuard Agency ($3,800/yr)
# - evt_031: Updated quote with hurricane coverage
# Step 4: Agent decides which events need detail
detail = await mcp.call("get_event_detail", {
    "event_id": "evt_002",
    "include_attachments": True
})
# Only NOW loads full content (~2,000 tokens)
# Step 5: Agent synthesizes answer from relevant data only
# Total tokens used: ~2,700 vs ~46,000 for flat approach
# Savings: 94% fewer tokens, faster response, less noise

| Metric | Flat Dump | Progressive Disclosure | Improvement |
|---|---|---|---|
| Tokens per query | ~46,000 | ~2,700 | 94% reduction |
| Cost per query | $0.69 | $0.04 | 94% savings |
| Latency | ~8 seconds | ~2 seconds | 75% faster |
| Accuracy | Degraded by noise | Focused context | Higher recall |
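The cost column follows directly from the token counts. A quick back-of-the-envelope check, assuming an input price of roughly $15 per million tokens (a pricing assumption, not stated in the post):

INPUT_PRICE_PER_MTOK = 15.00  # assumed $ per 1M input tokens

def query_cost(tokens: int) -> float:
    return tokens / 1_000_000 * INPUT_PRICE_PER_MTOK

print(f"Flat dump:   ${query_cost(46_000):.2f}")   # ~$0.69
print(f"Progressive: ${query_cost(2_700):.2f}")    # ~$0.04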
Compaction: Managing Long-Running Sessions
Progressive disclosure handles the width of information. Compaction handles the depth—what happens when conversations span hours and context windows fill up?
# Compaction: Summarizing when approaching context limits
class ContextManager:
    def __init__(self, customer_context, max_tokens=100000):
        self.max_tokens = max_tokens
        self.customer_context = customer_context  # Layer 0, always retained
        self.working_memory = []
        self.compressed_history = []

    def estimate_tokens(self):
        # Rough heuristic: ~4 characters per token
        return sum(len(str(i)) for i in self.working_memory) // 4

    async def add_interaction(self, interaction):
        self.working_memory.append(interaction)
        if self.estimate_tokens() > self.max_tokens * 0.8:
            await self.compact()

    async def compact(self):
        # Preserve: architectural decisions, key facts, user preferences
        # Discard: verbose outputs, intermediate reasoning, duplicates
        # (llm.summarize stands in for whatever summarization call you use)
        summary = await llm.summarize(
            self.working_memory,
            preserve=[
                "decisions",
                "customer_preferences",
                "policy_details",
                "action_items"
            ],
            discard=[
                "greetings",
                "acknowledgments",
                "intermediate_calculations"
            ]
        )
        self.compressed_history.append(summary)
        self.working_memory = self.working_memory[-5:]  # Keep recent

    def get_context(self):
        return {
            "compressed_history": self.compressed_history,
            "recent_interactions": self.working_memory,
            "customer_summary": self.customer_context  # Layer 0
        }

The key insight: not all information has equal value. Preserve architectural decisions, customer preferences, and action items. Discard greetings, acknowledgments, and verbose intermediate outputs.
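A brief usage sketch under the same assumptions (the interaction payloads and the Layer 0 string are invented for illustration):

import asyncio

async def demo():
    manager = ContextManager(
        customer_context="<company_context>Skyline Roofing Solutions | Tampa, FL</company_context>"
    )
    await manager.add_interaction({"role": "user", "content": "What GL quotes do we have on file?"})
    await manager.add_interaction({"role": "assistant", "content": "Coastal $4,200/yr and SafeGuard $3,800/yr."})
    print(manager.get_context())

asyncio.run(demo())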
Anthropic's Claude Pokémon agent demonstrates this at scale—maintaining coherent behavior across thousands of steps by persistently storing strategic learnings in external notes (NOTES.md pattern) that survive context resets.
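A minimal sketch of that external-notes pattern, assuming a plain NOTES.md file on disk; the helper names are hypothetical:

from pathlib import Path

NOTES_PATH = Path("NOTES.md")

def append_note(note: str) -> None:
    # Persist a strategic learning outside the context window
    with NOTES_PATH.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")

def load_notes() -> str:
    # Re-inject accumulated notes after a context reset
    return NOTES_PATH.read_text(encoding="utf-8") if NOTES_PATH.exists() else ""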
Trade-offs and Considerations
Progressive Disclosure Costs
- Latency: Tool calls add round-trips. A flat dump is one inference; exploration requires multiple turns.
- Complexity: Requires careful system design—what goes in Layer 0? How granular should summaries be?
- Risk of missing context: If the agent doesn't know to explore, it might miss relevant information.
Mitigation Strategies
- Smart Layer 0: Include enough context that the agent knows what to explore. "3 insurance quotes received" triggers exploration; "some emails" doesn't.
- Hybrid approaches: Pre-load critical recent events, enable exploration for history.
- Semantic summaries: Layer 1 summaries should be information-dense, not just "Email from Sarah" but "GL quote $3,800/yr, excludes hurricane, 30-day validity".
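For that last point, here is a sketch of how a Layer 1 summary might be generated so it carries decision-relevant facts rather than a bare label; the prompt wording and the generate() call are assumptions, not the post's implementation:

SUMMARY_PROMPT = """Summarize this event in one line (max 25 words).
Include amounts, dates, exclusions, and deadlines when present.
Bad: "Email from Sarah"
Good: "GL quote $3,800/yr, excludes hurricane, 30-day validity"

<event>
{event_content}
</event>"""

async def summarize_event(event_content: str) -> str:
    # generate() stands in for whatever LLM client call you use
    return await generate(SUMMARY_PROMPT.format(event_content=event_content))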
Real-World Results
Research from Mem0 validates this architecture at scale:
- 26% higher response accuracy vs. OpenAI's memory on LOCOMO benchmark
- 91% lower p95 latency through selective retrieval
- 90% token savings by loading only relevant context
The pattern has been adopted by production systems including MCP Memory Service (compatible with Claude Code, Cursor, VS Code) and enterprise CRM integrations.
Key Takeaways
Context > Prompts
The shift from prompt engineering to context engineering reflects a maturation of the field. Design the information environment, not just the instructions.
XML for Structured Data
Claude (and most LLMs) parse XML more reliably than JSON. The token overhead is worth the accuracy gains for complex reasoning tasks.
Progressive Disclosure
Layer information like a filesystem. Always-loaded summaries, on-demand details via tool calls. 90%+ token savings are achievable.
Memory MCP Pattern
Expose exploration tools (search, get_detail, get_document) rather than dumping everything. Let the agent navigate autonomously.
References
- Effective Context Engineering for AI Agents — Anthropic Engineering
- Use XML Tags to Structure Your Prompts — Claude Documentation
- Equipping Agents for the Real World with Agent Skills — Anthropic
- Context Engineering: Bringing Engineering Discipline to Prompts — Addy Osmani
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory — arXiv
- Memory in the Age of AI Agents: A Survey — arXiv
- MCP Memory Service — GitHub
- Structured Prompting Techniques: XML & JSON Guide — CodeConductor