In July 2025, Gartner declared: "Context engineering is in, and prompt engineering is out." This wasn't just analyst hype—it reflected a fundamental shift in how production AI systems are built. While prompt engineering focused on crafting clever instructions, context engineering treats the entire information environment as a designable system.
This post explores three interconnected ideas: the paradigm shift from prompts to context, why XML structures outperform JSON for LLM communication, and how progressive disclosure transforms memory management from a token-burning liability into a strategic advantage.
The Shift: From Prompt Engineering to Context Engineering
Prompt engineering was the craft of tweaking input phrasing to coax better outputs. "Maybe if I say 'step by step' the model will reason better." It was necessary but ultimately limited: you can't prompt your way out of missing information.
Context engineering is the discipline of designing systems that provide the right information, in the right format, at the right time. It encompasses:
- System prompts that set behavioral boundaries
- Tool definitions that extend model capabilities
- Retrieved knowledge from RAG and vector databases
- Memory systems that persist across sessions
- Dynamic context loaded based on task requirements
"Most agent failures are not model failures anymore—they are context failures." — Anthropic Engineering
The insight is that LLMs have finite attention budgets. Research on "context rot" shows that as token counts increase, recall accuracy degrades. The goal isn't to stuff more information in—it's to curate the smallest set of high-signal tokens that maximize success probability.
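As a concrete illustration of that curation goal, here is a minimal sketch (illustrative, not from any of the cited sources) of a context assembler that packs the highest-scoring snippets into a fixed token budget; the chars/4 token estimate and the pre-scored snippets are assumptions.

# Illustrative: greedily pack the highest-signal snippets into a fixed token budget
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose
    return len(text) // 4

def assemble_context(scored_snippets: list[tuple[float, str]], budget: int = 4000) -> str:
    selected, used = [], 0
    for score, text in sorted(scored_snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return "\n\n".join(selected)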
Why XML Beats JSON for LLM Structured Data
When building memory systems, format matters more than you'd expect. The examples below come from a real insurance brokerage CRM, where customer memory is stored in an llm.txt-style format.
The Format Question
# JSON: Flat, ambiguous boundaries
{
"instruction": "Analyze this contract",
"context": "Client is a roofing contractor...",
"contract": "AGREEMENT entered into...",
"example": "Previous analysis showed..."
}
# XML: Clear semantic boundaries, nestable
<task>
<instruction>
Analyze this contract for liability risks.
Focus on indemnification and insurance clauses.
</instruction>
<context>
<client_profile>
Roofing contractor, 25 years in business,
$500K revenue, 4 employees
</client_profile>
<risk_factors>
High-risk industry, hurricane zone,
uses subcontractors
</risk_factors>
</context>
<contract>
AGREEMENT entered into this 4th day of December...
[Full contract text]
</contract>
<output_format>
<findings>List each risk with severity rating</findings>
<recommendations>Actionable next steps</recommendations>
</output_format>
</task>

Why XML Wins for Claude (and Most LLMs)
Claude was specifically trained with XML tags in inputs and outputs. Controlled studies across 10,000+ prompts found that XML-scaffolded prompts achieved 23% higher accuracy on mathematical reasoning tasks compared to JSON.
Semantic Clarity
Tags like <instruction>, <context>, and <output_format>
eliminate ambiguity about what's being asked vs. what's reference material.
Hierarchical Nesting
XML naturally represents nested structures—a customer with events, each event with attachments, each attachment with metadata. JSON requires awkward array indexing.
Parseability
When Claude outputs XML, post-processing is trivial. Extract <findings> and
<recommendations> separately for different downstream uses.
Reasoning Delimiters
Tags like <thinking> and <answer> enable sophisticated
multi-step reasoning within structured frameworks.
The tradeoff is token count—XML uses more tokens than JSON for the same data. But the accuracy gains more than compensate, especially for complex reasoning tasks where misinterpretation is costly.
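To make the parseability point concrete: when Claude wraps its answer in the <findings> and <recommendations> tags requested in the <output_format> above, downstream extraction needs only a few lines of standard-library code. A minimal sketch (the sample response text is invented for illustration):

import re

response = """<findings>Broad indemnification clause (high severity); no waiver of subrogation.</findings>
<recommendations>Request mutual indemnification; confirm GL limits before signing.</recommendations>"""

def extract_tag(text: str, tag: str) -> str:
    # Return the content of a single XML-style tag, or "" if the tag is absent
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

findings = extract_tag(response, "findings")
recommendations = extract_tag(response, "recommendations")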
The Problem: Flat Memory Dumps
Consider a typical customer memory file for an insurance brokerage. Every interaction—phone calls, emails, documents, notes—serialized into a single context payload:
{
"llm_txt": "<company_context>
Company: Skyline Roofing Solutions
Industry: Roofing Contractor
Address: 1250 Harbor View Drive, Tampa, FL
Phone: +18135550142
Email: mrodriguez@skylineroofing.com
Annual Revenue: $500,000
Employees: 4 full-time
...
</company_context>
<events_timeline>
<total_events>68</total_events>
<event index='1'>
<type>Phone</type>
<timestamp>2025-12-04T13:44:58.765000+00:00</timestamp>
<summary>Discussion about claims made policy...</summary>
<content>[Full 2000-word transcript]</content>
</event>
<event index='2'>
<type>Email</type>
<timestamp>2025-12-03T09:15:22.000000+00:00</timestamp>
<sender>sarah@acmeinsurance.com</sender>
<recipient>mrodriguez@skylineroofing.com</recipient>
<content>[Full email body + attachments]</content>
<attachments count='3'>
<attachment index='1'>
<filename>Quote_GL_2025.pdf</filename>
[Full PDF content extracted]
</attachment>
...
</attachments>
</event>
<!-- 66 more events with full content... -->
</events_timeline>"
}
# Total: ~132KB, ~16,000 words, ~46,000 tokens

Why This Fails
- Token waste: Most queries need 2-3 events, not 68
- Context rot: Important details buried in noise, recall degrades
- Cost explosion: $0.69 per query × thousands of daily queries = budget destroyed
- Latency: More tokens = slower inference
- Attention dilution: Model struggles to identify relevant information
The Solution: Progressive Disclosure
Anthropic introduced progressive disclosure as a core design principle for Agent Skills, but the pattern applies broadly to any memory system. The idea: organize information in layers, like a filesystem, allowing the LLM to explore only what's needed.
# Layer 0: Company Context (always loaded)
<company_context>
<name>Skyline Roofing Solutions</name>
<industry>Roofing Contractor - Residential/Commercial</industry>
<location>Tampa, FL</location>
<contact>Marcus Rodriguez (+18135550142)</contact>
<financials>
<revenue>$500,000</revenue>
<payroll>$300,000</payroll>
</financials>
<insurance_needs>General Liability, Workers Comp</insurance_needs>
</company_context>
# Layer 1: Events Index (summaries only)
<events_index total="68">
<event id="evt_001" type="phone" date="2025-12-04">
Claims made policy discussion - awaiting quotes
</event>
<event id="evt_002" type="email" date="2025-12-03">
GL quote documents sent (3 attachments)
</event>
<event id="evt_003" type="document" date="2025-12-02">
Certificate of Insurance uploaded
</event>
<!-- Summaries only, ~50 tokens each vs ~500+ for full content -->
</events_index>
# Layer 2+: On-demand via Memory MCP
# LLM can request: get_event_detail(evt_001)
# Returns full transcript/content only when needed

The Three Layers
Company Context
Always loaded. Core identity, contact info, key metrics. ~500 tokens.
Events Index
Summaries only. Type, date, one-line description. ~50 tokens each.
On-Demand Detail
Full content via tool calls. Transcripts, emails, PDFs—loaded only when relevant.
This mirrors how humans navigate information. You don't memorize every email—you remember "there was an email about quotes last week" and retrieve it when needed.
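One way to make the three layers concrete is a small set of data classes, where only the first two layers are ever serialized into the base context. This is an illustrative sketch; the field names are assumptions rather than the CRM's actual schema.

from dataclasses import dataclass, field

@dataclass
class CompanyContext:      # Layer 0: always loaded (~500 tokens)
    name: str
    industry: str
    location: str
    contact: str
    insurance_needs: list[str] = field(default_factory=list)

@dataclass
class EventSummary:        # Layer 1: index entry (~50 tokens)
    event_id: str
    event_type: str        # "phone" | "email" | "document" | "note"
    date: str
    one_liner: str

@dataclass
class EventDetail:         # Layer 2: loaded on demand via a tool call
    event_id: str
    full_content: str
    attachments: list[str] = field(default_factory=list)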
Memory MCP: The Implementation
The Model Context Protocol (MCP) provides the infrastructure for progressive disclosure. Instead of dumping everything into context, we expose tools that let the agent explore:
from mcp.server import Server
from mcp.types import Tool, TextContent
import json

# memory_store and document_store are assumed async data-access layers (not shown here)
server = Server("customer-memory-mcp")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="get_customer_summary",
            description="Get high-level customer context and event index",
            inputSchema={
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"}
                },
                "required": ["customer_id"]
            }
        ),
        Tool(
            name="get_event_detail",
            description="Retrieve full content of a specific event",
            inputSchema={
                "type": "object",
                "properties": {
                    "event_id": {"type": "string"},
                    "include_attachments": {"type": "boolean", "default": False}
                },
                "required": ["event_id"]
            }
        ),
        Tool(
            name="search_events",
            description="Search events by keyword, type, or date range",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "event_type": {"type": "string", "enum": ["phone", "email", "document", "note"]},
                    "date_from": {"type": "string", "format": "date"},
                    "date_to": {"type": "string", "format": "date"}
                }
            }
        ),
        Tool(
            name="get_document_content",
            description="Extract and return content from a specific document/attachment",
            inputSchema={
                "type": "object",
                "properties": {
                    "document_id": {"type": "string"},
                    "extract_type": {"type": "string", "enum": ["full", "summary", "key_points"]}
                },
                "required": ["document_id"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "get_customer_summary":
        # Returns ~2KB: company context + event summaries
        summary = await memory_store.get_summary(arguments["customer_id"])
        return [TextContent(type="text", text=json.dumps(summary))]
    elif name == "get_event_detail":
        # Returns full event content only when explicitly requested
        event = await memory_store.get_event(
            arguments["event_id"],
            include_attachments=arguments.get("include_attachments", False)
        )
        return [TextContent(type="text", text=json.dumps(event))]
    elif name == "search_events":
        # Returns matching event summaries, not full content
        results = await memory_store.search(
            query=arguments.get("query"),
            event_type=arguments.get("event_type"),
            date_range=(arguments.get("date_from"), arguments.get("date_to"))
        )
        return [TextContent(type="text", text=json.dumps(results))]
    elif name == "get_document_content":
        # Extracts PDF/document content on demand
        content = await document_store.extract(
            arguments["document_id"],
            extract_type=arguments.get("extract_type", "summary")
        )
        return [TextContent(type="text", text=content)]

The Agent Workflow
# Progressive Disclosure in Action
# Step 1: Agent receives query
query = "What insurance quotes has Marcus received?"
# Step 2: Load Layer 0 (always in context) - ~500 tokens
context = """
<company_context>
Skyline Roofing Solutions | Tampa, FL
Contact: Marcus Rodriguez | Seeking GL Insurance
</company_context>
"""
# Step 3: Agent calls Memory MCP to explore
events = await mcp.call("search_events", {
    "query": "quote",
    "event_type": "email"
})
# Returns summaries (~200 tokens):
# - evt_002: GL quote from Coastal Insurance ($4,200/yr)
# - evt_015: Quote from SafeGuard Agency ($3,800/yr)
# - evt_031: Updated quote with hurricane coverage
# Step 4: Agent decides which events need detail
detail = await mcp.call("get_event_detail", {
    "event_id": "evt_002",
    "include_attachments": True
})
# Only NOW loads full content (~2,000 tokens)
# Step 5: Agent synthesizes answer from relevant data only
# Total tokens used: ~2,700 vs ~46,000 for flat approach
# Savings: 94% fewer tokens, faster response, less noise

| Metric | Flat Dump | Progressive Disclosure | Improvement |
|---|---|---|---|
| Tokens per query | ~46,000 | ~2,700 | 94% reduction |
| Cost per query | $0.69 | $0.04 | 94% savings |
| Latency | ~8 seconds | ~2 seconds | 75% faster |
| Accuracy | Degraded by noise | Focused context | Higher recall |
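The cost column follows directly from the token counts. A quick back-of-the-envelope check, assuming an input price of roughly $15 per million tokens (a pricing assumption, not stated in the post):

INPUT_PRICE_PER_MTOK = 15.00  # assumed $ per 1M input tokens

def query_cost(tokens: int) -> float:
    return tokens / 1_000_000 * INPUT_PRICE_PER_MTOK

print(f"Flat dump:   ${query_cost(46_000):.2f}")   # ~$0.69
print(f"Progressive: ${query_cost(2_700):.2f}")    # ~$0.04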
Compaction: Managing Long-Running Sessions
Progressive disclosure handles the width of information. Compaction handles the depth—what happens when conversations span hours and context windows fill up?
# Compaction: Summarizing when approaching context limits
class ContextManager:
    def __init__(self, customer_context, max_tokens=100000):
        self.max_tokens = max_tokens
        self.customer_context = customer_context  # Layer 0, always retained
        self.working_memory = []
        self.compressed_history = []

    def estimate_tokens(self):
        # Rough heuristic: ~4 characters per token
        return sum(len(str(i)) for i in self.working_memory) // 4

    async def add_interaction(self, interaction):
        self.working_memory.append(interaction)
        if self.estimate_tokens() > self.max_tokens * 0.8:
            await self.compact()

    async def compact(self):
        # Preserve: architectural decisions, key facts, user preferences
        # Discard: verbose outputs, intermediate reasoning, duplicates
        # (llm.summarize stands in for whatever summarization call you use)
        summary = await llm.summarize(
            self.working_memory,
            preserve=[
                "decisions",
                "customer_preferences",
                "policy_details",
                "action_items"
            ],
            discard=[
                "greetings",
                "acknowledgments",
                "intermediate_calculations"
            ]
        )
        self.compressed_history.append(summary)
        self.working_memory = self.working_memory[-5:]  # Keep recent

    def get_context(self):
        return {
            "compressed_history": self.compressed_history,
            "recent_interactions": self.working_memory,
            "customer_summary": self.customer_context  # Layer 0
        }

The key insight: not all information has equal value. Preserve architectural decisions, customer preferences, and action items. Discard greetings, acknowledgments, and verbose intermediate outputs.
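A brief usage sketch under the same assumptions (the interaction payloads and the Layer 0 string are invented for illustration):

import asyncio

async def demo():
    manager = ContextManager(
        customer_context="<company_context>Skyline Roofing Solutions | Tampa, FL</company_context>"
    )
    await manager.add_interaction({"role": "user", "content": "What GL quotes do we have on file?"})
    await manager.add_interaction({"role": "assistant", "content": "Coastal $4,200/yr and SafeGuard $3,800/yr."})
    print(manager.get_context())

asyncio.run(demo())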
Anthropic's Claude Pokémon agent demonstrates this at scale—maintaining coherent behavior across thousands of steps by persistently storing strategic learnings in external notes (NOTES.md pattern) that survive context resets.
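A minimal sketch of that external-notes pattern, assuming a plain NOTES.md file on disk; the helper names are hypothetical:

from pathlib import Path

NOTES_PATH = Path("NOTES.md")

def append_note(note: str) -> None:
    # Persist a strategic learning outside the context window
    with NOTES_PATH.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")

def load_notes() -> str:
    # Re-inject accumulated notes after a context reset
    return NOTES_PATH.read_text(encoding="utf-8") if NOTES_PATH.exists() else ""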
Trade-offs and Considerations
Progressive Disclosure Costs
- Latency: Tool calls add round-trips. A flat dump is one inference; exploration requires multiple turns.
- Complexity: Requires careful system design—what goes in Layer 0? How granular should summaries be?
- Risk of missing context: If the agent doesn't know to explore, it might miss relevant information.
Mitigation Strategies
- Smart Layer 0: Include enough context that the agent knows what to explore. "3 insurance quotes received" triggers exploration; "some emails" doesn't.
- Hybrid approaches: Pre-load critical recent events, enable exploration for history.
- Semantic summaries: Layer 1 summaries should be information-dense, not just "Email from Sarah" but "GL quote $3,800/yr, excludes hurricane, 30-day validity".
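For that last point, here is a sketch of how a Layer 1 summary might be generated so it carries decision-relevant facts rather than a bare label; the prompt wording and the generate() call are assumptions, not the post's implementation:

SUMMARY_PROMPT = """Summarize this event in one line (max 25 words).
Include amounts, dates, exclusions, and deadlines when present.
Bad: "Email from Sarah"
Good: "GL quote $3,800/yr, excludes hurricane, 30-day validity"

<event>
{event_content}
</event>"""

async def summarize_event(event_content: str) -> str:
    # generate() stands in for whatever LLM client call you use
    return await generate(SUMMARY_PROMPT.format(event_content=event_content))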
Real-World Results
Research from Mem0 validates this architecture at scale:
- 26% higher response accuracy vs. OpenAI's memory on LOCOMO benchmark
- 91% lower p95 latency through selective retrieval
- 90% token savings by loading only relevant context
The pattern has been adopted by production systems including MCP Memory Service (compatible with Claude Code, Cursor, VS Code) and enterprise CRM integrations.
Key Takeaways
Context > Prompts
The shift from prompt engineering to context engineering reflects a maturation of the field. Design the information environment, not just the instructions.
XML for Structured Data
Claude (and most LLMs) parse XML more reliably than JSON. The token overhead is worth the accuracy gains for complex reasoning tasks.
Progressive Disclosure
Layer information like a filesystem. Always-loaded summaries, on-demand details via tool calls. 90%+ token savings are achievable.
Memory MCP Pattern
Expose exploration tools (search, get_detail, get_document) rather than dumping everything. Let the agent navigate autonomously.
References
- Effective Context Engineering for AI Agents — Anthropic Engineering
- Use XML Tags to Structure Your Prompts — Claude Documentation
- Equipping Agents for the Real World with Agent Skills — Anthropic
- Context Engineering: Bringing Engineering Discipline to Prompts — Addy Osmani
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory — arXiv
- Memory in the Age of AI Agents: A Survey — arXiv
- MCP Memory Service — GitHub
- Structured Prompting Techniques: XML & JSON Guide — CodeConductor