The Anatomy of Claude Code And How To Build Agent Harnesses


The source code for Claude Code leaked. In this post, we explore how it actually works, from the moment you type a message to the moment it delivers working code.


A few months ago, I built a baby Claude Code from scratch using nothing but Python. The idea was to illustrate how coding agents, and AI agents in general, work.

Today, some poor soul at Anthropic mistakenly uploaded the entire source code of Claude Code and, of course, I (with the help of Claude Code) decided to dig in and see how close my baby version was to the real thing.

Narrator: It was not close.

Obviously, the real thing is more complex and robust, but the core patterns are largely the same, and I think there are some interesting lessons to be learned from it.

This post follows what happens from the moment you type a message into Claude Code to the moment it delivers a working result. We’re going to trace a single request through the entire system, and along the way, we’ll see how every major feature fits into the picture.

The Loop at a Glance

Before we go deep, let’s see the full picture. The entire agent is powered by a single async generator function called query() in query.ts. Everything else in the codebase exists to serve this function.

Here’s what happens when you type a message and hit enter:

You type a message

Step 1: Assemble context
        (system prompt + CLAUDE.md files + memory + conversation history)

Step 2: Call the Claude API
        (streaming, via async generator)

Step 3: Parse the response
        (text blocks + tool_use blocks)

Step 4: Check permissions
        (deny rules → allow rules → classifier → ask user)

Step 5: Execute tools
        (read-only in parallel, writes in serial)

Step 6: Feed results back
        (tool results become messages in the conversation)

Step 7: Context check
        (too large? compact. otherwise, loop back to Step 2)

Step 8: Termination
        (no more tool calls? done. error? recover or exit.)

Here’s a visual of the loop:

[Diagram: the Claude Code agent loop, from user input through context assembly, API call, response parsing, permission checks, tool execution, result feedback, and context management.]

That’s the whole agent. Eight steps in a loop. We’re going to walk through each one.

Step 1: Avengers… Assemble!

Every journey begins with packing. Before Claude Code makes its first API call, it has to assemble everything the model needs to do its job. This is way more involved than I previously thought.

The system prompt is built by an aptly named function called buildEffectiveSystemPrompt(). It’s not a single string that gets pasted in. It’s broken into named sections like “environment,” “tools guidance,” “tone and style,” and so on. Each section has its own function that generates its content. For example, the environment section calls computeSimpleEnvInfo(), which reads your working directory, detects your OS, checks if you’re in a git repo, and assembles all of that into a text block. The tools guidance section generates rules about which tools to prefer. Each section is computed independently, then they’re all concatenated into the final prompt.

Why break it up like this? Caching. There are two types of sections:

Cached sections: These are computed once and reused on every subsequent turn. They’re cleared when you run /clear or /compact. Most sections are this type, because the content (like your OS version) doesn’t change between turns.

Uncached sections: These are recomputed on every single turn, because when the computed value changes between turns, the API can’t reuse its cached tokenization of the system prompt. Currently, only one section uses this: MCP server instructions, because MCP servers can connect and disconnect mid-session.

The sections themselves are organized into two groups, separated by a SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker. Everything before the boundary can use the API’s global cache scope (shared across all users). Everything after is session-specific. This is a clever way to avoid recomputing the entire prompt on every turn.
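To make the cached/uncached split concrete, here’s a minimal sketch of section-based assembly. The function and constant names come from the post, but the types, cache mechanics, and signature are my assumptions:

```typescript
// Illustrative sketch only — the real implementation's shape is unknown.
type PromptSection = {
  name: string;
  cached: boolean;          // cached sections are computed once per session
  compute: () => string;
};

const sectionCache = new Map<string, string>();

function renderSection(s: PromptSection): string {
  if (s.cached && sectionCache.has(s.name)) return sectionCache.get(s.name)!;
  const text = s.compute();
  if (s.cached) sectionCache.set(s.name, text);
  return text;
}

const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "\n<!-- dynamic-boundary -->\n";

function buildEffectiveSystemPrompt(
  globalSections: PromptSection[],   // before the boundary: global cache scope
  sessionSections: PromptSection[],  // after the boundary: session-specific
): string {
  return (
    globalSections.map(renderSection).join("\n\n") +
    SYSTEM_PROMPT_DYNAMIC_BOUNDARY +
    sessionSections.map(renderSection).join("\n\n")
  );
}
```

Run it twice and the cached sections’ `compute()` fires once while the uncached sections recompute every time, which is exactly the behavior the two section types describe.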

Here’s what the assembled system prompt actually contains:

| Section | Cached? | What It Contains |
| --- | --- | --- |
| Intro | Yes | Role definition, security instructions, URL handling rules |
| System | Yes | Tool execution rules, permission behavior, context compression guidance |
| Doing Tasks | Yes | Task-specific guidance: coding style, verification, security |
| Actions | Yes | Risk mitigation: reversibility checks, destructive action warnings |
| Using Your Tools | Yes | "Use Read instead of cat, use Edit instead of sed" (tool preference rules) |
| Tone and Style | Yes | Emoji policy, conciseness rules, code reference format |
| Session Guidance | Yes | Available skills, agent types, user-facing tool descriptions |
| Memory (CLAUDE.md) | Yes | All loaded CLAUDE.md files and rules |
| Environment | Yes | Working directory, git status, platform, shell, model name, knowledge cutoff |
| MCP Instructions | No | Connected MCP server names and their instructions |
| Summarize Results | Yes | "Write down important information from tool results before they're cleared" |

To make this concrete, here’s what the environment section looks like when assembled:

# Environment

You have been invoked in the following environment:
 - Primary working directory: /Users/me/projects/myapp
 - Is a git repository: true
 - Platform: darwin
 - Shell: zsh
 - OS Version: Darwin 25.3.0
 - You are powered by the model named Claude Opus 4.6.
 - Assistant knowledge cutoff is May 2025.

And the “Using Your Tools” section includes rules like:

Do NOT use the Bash tool to run commands when a relevant
dedicated tool is provided:
  - To read files use Read instead of cat, head, tail, or sed
  - To edit files use Edit instead of sed or awk
  - To create files use Write instead of cat with heredoc
  - To search for files use Glob instead of find or ls
  - To search file contents use Grep instead of grep or rg

This is why Claude Code prefers its own FileReadTool over running cat in bash. It’s not the model’s preference; it’s an explicit instruction in the system prompt.

The prompt assembly also has a priority system. If you’re running Claude Code normally, you get the default system prompt. But if you’re using the --system-prompt flag, that replaces the default. If you’re running inside a custom agent (defined in .claude/agents/), the agent’s prompt can either replace or append to the default. And if coordinator mode is active, the entire prompt is swapped for a coordinator-specific version.

With the system prompt assembled, CLAUDE.md files are layered on top. This is where Claude Code gets its project-specific instructions.

How CLAUDE.md Files Are Discovered

A function called getMemoryFiles() walks the entire directory tree from your current working directory up to the filesystem root, collecting instruction files at every level. The loading order matters:

  1. Managed files: /etc/claude-code/CLAUDE.md and /etc/claude-code/.claude/rules/*.md. These are the instructions used when Claude Code runs in a managed corporate environment.

  2. User files: ~/.claude/CLAUDE.md and ~/.claude/rules/*.md. Your personal instructions that apply to every project.

  3. Project files: For every directory from the filesystem root down to your current working directory, it loads:

    • CLAUDE.md
    • .claude/CLAUDE.md
    • .claude/rules/*.md
  4. Local files: CLAUDE.local.md at each level. These are gitignored, for instructions you don’t want to commit.

The walk happens from root down to your current directory, so the closest files have the highest priority. Here’s what a real discovery looks like if you’re working in /Users/me/projects/myapp/src/:

Files loaded (in order):
  1. /etc/claude-code/CLAUDE.md                    (Managed)
  2. ~/.claude/CLAUDE.md                            (User)
  3. ~/.claude/rules/always-use-typescript.md       (User)
  4. /Users/me/projects/CLAUDE.md                   (Project)
  5. /Users/me/projects/myapp/CLAUDE.md             (Project)
  6. /Users/me/projects/myapp/.claude/CLAUDE.md     (Project)
  7. /Users/me/projects/myapp/.claude/rules/api.md  (Project)
  8. /Users/me/projects/myapp/CLAUDE.local.md       (Local)

Each file is loaded, its content is parsed, HTML comments are stripped, and any @include directives are resolved (up to 5 levels deep). The @include syntax lets one CLAUDE.md file pull in content from another file:

# CLAUDE.md
See our API conventions:
@./docs/api-conventions.md

And our testing standards:
@./docs/testing.md

The whole function is memoized so it only runs once per session. It also handles git worktrees correctly (avoiding duplicate loading when the worktree and the main repo would otherwise load the same files) and respects exclusion patterns from your settings.
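The root-to-cwd walk can be sketched roughly like this. This is a simplified reconstruction: the real `getMemoryFiles()` also handles managed and user files, `rules/*.md` directories, worktrees, memoization, and exclusion patterns.

```typescript
import * as path from "node:path";

// Build the chain of directories from the filesystem root down to cwd,
// so files closer to cwd appear later in the list (and take priority).
function candidateMemoryFiles(cwd: string): string[] {
  const dirs: string[] = [];
  let dir = path.resolve(cwd);
  while (true) {
    dirs.unshift(dir);
    const parent = path.dirname(dir);
    if (parent === dir) break;   // reached the filesystem root
    dir = parent;
  }
  // At each level, look for the project-scoped instruction files.
  const files: string[] = [];
  for (const d of dirs) {
    files.push(
      path.join(d, "CLAUDE.md"),
      path.join(d, ".claude", "CLAUDE.md"),
      path.join(d, "CLAUDE.local.md"),
    );
  }
  return files;
}
```

For `/Users/me/projects/myapp`, this yields candidates from `/` down to the project directory, matching the loading order shown in the discovery example above.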

The Full Context Package

Once the system prompt and CLAUDE.md files are assembled, getAttachmentMessages() pulls in everything else:

  • Memory files: Your persistent memory from ~/.claude/projects/<slug>/memory/MEMORY.md, with a background prefetch of relevant topic files
  • Task/todo lists: Any active tasks you’ve created during the session (we’ll come to this later)
  • MCP server instructions: Instructions from connected MCP servers
  • Skill discovery results: Available skills that match the current context
  • Conversation history: Every message from the current session

All of this gets packaged into the API call. So when you type “fix the login bug,” the actual payload sent to Claude looks more like:

System prompt:       ~15,000 tokens (role, rules, tools, environment)
CLAUDE.md files:     ~2,000 tokens (project instructions)
Memory:              ~500 tokens (relevant memories from past sessions)
MCP instructions:    ~300 tokens (connected server docs)
Conversation history: ~5,000 tokens (previous turns this session)
Your message:        ~10 tokens ("fix the login bug")

Phew, all of this is to say: the context isn’t just “your message.” It’s your message, plus your project’s rules, plus your personal preferences, plus the agent’s memory, plus the conversation so far. This is context engineering in practice, and it’s what makes or breaks an agent.

How Skills Get Surfaced

One more thing happens during context assembly: skills are listed.

Skills are reusable prompts that live in .claude/skills/ (project-level) or ~/.claude/skills/ (user-level). They can also come from MCP servers. During context assembly, getSkillListingAttachments() collects all available skills, formats their names and descriptions, and injects them into the context as a system reminder.

The model sees something like:

<system-reminder>
The following skills are available for use with the Skill tool:

- commit: Create a new git commit with a descriptive message.
- review-pr: Review a pull request for code quality and correctness.
- write-blog: Write blog posts for siddharthbharath.com.
</system-reminder>

When the user’s request matches a skill, the model calls SkillTool with the skill name. The SkillTool’s own prompt includes a strict instruction: “When a skill matches the user’s request, invoke the Skill tool BEFORE generating any other response.” So the model doesn’t try to handle it itself. It delegates.

The skill then runs in an isolated sub-agent with its own context. The skill’s prompt (which can include detailed instructions, examples, reference files) gets loaded into a forked agent that executes independently and returns its result to the main conversation.

There’s a budget system to keep skill listings from bloating the context. Skill descriptions are capped at 1% of the context window (about 2,000 tokens for a 200K model). Each individual description is capped at 250 characters. If you have 50 skills, descriptions get truncated to fit within budget. The model can always call ToolSearchTool to fetch the full schema for a deferred tool if it needs more detail.
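Here’s a rough sketch of how a budget like that might be enforced. The 1% and 250-character caps come from the description above; the chars-per-token heuristic and the function shape are my assumptions:

```typescript
type Skill = { name: string; description: string };

const MAX_DESCRIPTION_CHARS = 250;   // per-skill cap (from the post)

function formatSkillListing(skills: Skill[], contextWindowTokens: number): string {
  const budgetTokens = Math.floor(contextWindowTokens * 0.01);  // 1% of the window
  const budgetChars = budgetTokens * 4;  // rough chars-per-token heuristic (assumption)
  const lines: string[] = [];
  let used = 0;
  for (const s of skills) {
    const desc = s.description.slice(0, MAX_DESCRIPTION_CHARS);
    const line = `- ${s.name}: ${desc}`;
    if (used + line.length > budgetChars) break;  // stop once the budget is spent
    lines.push(line);
    used += line.length;
  }
  return lines.join("\n");
}
```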

Step 2: I Can Do This All Day

With context assembled, the loop makes its API call. And this is where the async generator pattern earns its keep. If you’re not familiar with async generators, they’re functions that can yield values incrementally over time. Instead of computing a result and returning it all at once, they hand you pieces as they become available, pausing between each one. For example, if you’re reading a file, the async generator will yield the file contents one chunk at a time, pausing between each chunk to let the terminal render the previous chunk.
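Here’s a toy illustration of the pattern (not the real API client): the generator yields each piece as soon as it exists, and the consumer renders between yields.

```typescript
// Simulate a streaming response: each chunk is yielded the moment it "arrives".
async function* streamResponse(chunks: string[]): AsyncGenerator<string> {
  for (const chunk of chunks) {
    await new Promise((resolve) => setTimeout(resolve, 0));  // stand-in for network latency
    yield chunk;   // the caller can render this before the next chunk exists
  }
}

async function render(chunks: string[]): Promise<string> {
  let text = "";
  for await (const piece of streamResponse(chunks)) {
    text += piece;   // the terminal UI would print `piece` here, as it arrives
  }
  return text;
}
```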

The inner loop, queryLoop(), runs a while (true) loop (yes, even Claude Code is just a big loop that can run all day long) where each iteration represents one full round-trip with the API. At the start of each iteration, it:

  1. Yields a stream_request_start event so the UI knows an API call is beginning
  2. Applies tool result size budgets (trimming large outputs to save context space)
  3. Runs microcompaction on previous assistant responses if needed (we’ll come to this later)
  4. Builds the final system prompt with appendSystemContext()
  5. Streams the response

As Claude generates tokens, they’re yielded as StreamEvents immediately. The terminal UI picks them up and renders them character by character. You see Claude “typing” in real time because each chunk of the response is yielded the moment it arrives from the API.

Between iterations, the loop needs to remember things: the full conversation history, which tools are available, how many times it’s tried to recover from errors, whether compaction has been attempted, the current turn count. All of this lives in a single mutable State object that gets passed from one iteration to the next. When the loop compacts the conversation, it updates state.messages. When it retries after hitting the output token limit, it increments state.maxOutputTokensRecoveryCount. When a tool execution changes the available tool set, it updates state.toolUseContext. Every decision the loop makes on one iteration can influence what happens on the next, and this object is how that information is carried forward.
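As a sketch, the state threading might look like this. The field names `messages`, `toolUseContext`, and `maxOutputTokensRecoveryCount` are from the post; the rest of the shape is assumed:

```typescript
type Message = { role: "user" | "assistant"; content: unknown };
type ToolUseContext = { tools: string[] };

type State = {
  messages: Message[];                   // full conversation history
  toolUseContext: ToolUseContext;        // currently available tools
  maxOutputTokensRecoveryCount: number;  // retries after output-limit errors
  compactionAttempted: boolean;
  turnCount: number;
};

// Each iteration produces the next state, carrying every decision forward.
function nextTurn(state: State, newMessages: Message[]): State {
  return {
    ...state,
    messages: [...state.messages, ...newMessages],
    turnCount: state.turnCount + 1,
  };
}
```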

Step 3: I Understood That Reference

The API call completes (or rather, the stream ends). Now we need to figure out what Claude wants to do.

Claude’s response comes back as a sequence of content blocks. These are either text blocks (things Claude wants to say to you) or tool_use blocks (things Claude wants to do). The parsing happens during streaming and each tool_use block has a structure like:

{
  type: 'tool_use',
  id: 'toolu_abc123',      // Unique ID for this call
  name: 'FileEditTool',    // Which tool to invoke
  input: {                  // Parameters for the tool
    file_path: '/Users/me/project/src/app.ts',
    old_string: 'const x = 1',
    new_string: 'const x = 2'
  }
}

To make this concrete, when you ask Claude Code to “fix the login bug,” the response might contain something like:

[
  { "type": "text", "text": "Let me look at the login code..." },
  { "type": "tool_use", "id": "toolu_1", "name": "GrepTool",
    "input": { "pattern": "login", "glob": "**/*.ts" } },
  { "type": "tool_use", "id": "toolu_2", "name": "GrepTool",
    "input": { "pattern": "authenticate", "glob": "**/*.ts" } }
]

Claude is saying something to you and requesting two search operations in the same response.

If there are no tool use blocks, Claude just wanted to talk. The message gets yielded to the UI, and the loop waits for the next user input. Turn complete.

If there are tool use blocks, we’re not done. The needsFollowUp flag is set to true, and the loop proceeds to execute those tools. This is where things get interesting.
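The decision above boils down to a simple check over the parsed content blocks. A minimal sketch (the `needsFollowUp` flag is real; this function is illustrative):

```typescript
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };

// Text-only responses end the turn; any tool_use block keeps the loop going.
function needsFollowUp(blocks: ContentBlock[]): boolean {
  return blocks.some((b) => b.type === "tool_use");
}
```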

The 43 Tools

What tools can Claude actually call? The getAllBaseTools() function registers the full set. Here are the ones that are always available:

| Category | Tools |
| --- | --- |
| File I/O | FileReadTool, FileEditTool, FileWriteTool, NotebookEditTool |
| Search | GlobTool, GrepTool |
| Execution | BashTool |
| Web | WebFetchTool, WebSearchTool |
| Agents | AgentTool, SendMessageTool |
| Tasks | TaskCreateTool, TaskGetTool, TaskUpdateTool, TaskListTool |
| Planning | EnterPlanModeTool, ExitPlanModeTool |
| User interaction | AskUserQuestionTool |
| Skills | SkillTool |
| MCP | ListMcpResourcesTool, ReadMcpResourceTool |

There are also conditional tools that only appear when certain feature flags are enabled or when specific environments are detected: REPLTool for Node environments, PowerShellTool for Windows, CronCreateTool and RemoteTriggerTool for scheduled agents, SleepTool for proactive mode, and more.

On top of this, any MCP servers you’ve configured add their own tools to the pool. These get prefixed with the server name (mcp_github_create_issue, for example) to avoid collisions with built-in tools. Server connections are memoized per config so the same server doesn’t get connected twice.
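The prefixing and per-config memoization might look something like this (the `mcp_github_create_issue` naming is from the post; the code is a sketch):

```typescript
const connections = new Map<string, Promise<string[]>>();  // config key → tool names

// Memoize per config key so the same server is never connected twice.
async function connectOnce(
  configKey: string,
  connect: () => Promise<string[]>,
): Promise<string[]> {
  let pending = connections.get(configKey);
  if (!pending) {
    pending = connect();
    connections.set(configKey, pending);
  }
  return pending;
}

// Prefix server tools to avoid collisions with built-in tools.
function prefixedToolName(server: string, tool: string): string {
  return `mcp_${server}_${tool}`;
}
```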

The tool registry also supports filtering. Your settings.json can blanket-deny certain tools, and there’s a “simple mode” (CLAUDE_CODE_SIMPLE=1) that strips everything down to just BashTool, FileReadTool, and FileEditTool.

Step 4: We Don’t Want to Kill You, But We Will

Claude wants to edit a file, run a bash command, or write to disk. Before any tool executes, it has to pass through the permission system.

There’s a delicate balance to be struck here. Too many permission prompts and the experience becomes so annoying that you’ll stop using the tool. Too few and your agent might rm -rf your project like an OpenClaw.

Claude Code’s answer is a layered permission system. When a tool call comes in, the decision flows through multiple layers, and it exits as soon as one layer makes a decision:

Tool call received

1. Deny rules → instant rejection

2. Allow rules → instant approval

3. Bash classifier → async check, 2-second timeout

4. Interactive dialog → ask the user

The deny and allow rules come from your Claude Code settings:

{
  "permissions": {
    "allow": ["Read", "Glob", "Grep", "BashTool(grep:*)"],
    "deny": ["BashTool(rm -rf:*)"]
  }
}

Rules are checked first because they’re instant. If you’ve said “always allow Grep,” every grep operation sails through without interruption.

The classifier layer sits between rules and the interactive dialog. For bash commands, Claude Code has a safety classifier that speculatively evaluates whether a command is safe. If the classifier is confident a command is read-only (like git status or ls), it can auto-approve. If it’s a destructive command (like rm or git push --force), it falls through to the interactive dialog.

The classifier has a 2-second timeout. If classification takes too long, the system falls through to the interactive dialog. You’re never blocked waiting for a classifier that’s stuck.
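The layered flow, including the timeout fallthrough, can be sketched like this. The rule matching here is simplified to string prefixes; the real matching logic is richer:

```typescript
type Decision = "allow" | "deny" | "ask";

async function checkPermission(
  toolCall: string,
  denyRules: string[],
  allowRules: string[],
  classify: (call: string) => Promise<Decision>,
  classifierTimeoutMs = 2000,
): Promise<Decision> {
  // Layer 1: deny rules — instant rejection.
  if (denyRules.some((r) => toolCall.startsWith(r))) return "deny";
  // Layer 2: allow rules — instant approval.
  if (allowRules.some((r) => toolCall.startsWith(r))) return "allow";
  // Layer 3: classifier, raced against a timeout that falls through
  // to "ask" (layer 4, the interactive dialog).
  const timeout = new Promise<Decision>((resolve) =>
    setTimeout(() => resolve("ask"), classifierTimeoutMs),
  );
  return Promise.race([classify(toolCall), timeout]);
}
```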

Here’s how this plays out in practice with a few different tool calls:

| Tool Call | Layer 1 (Deny) | Layer 2 (Allow) | Layer 3 (Classifier) | Layer 4 (Ask) | Result |
| --- | --- | --- | --- | --- | --- |
| GrepTool("login") | Pass | Match: allowed | - | - | Auto-approved |
| BashTool("git status") | Pass | Pass | Read-only: safe | - | Auto-approved |
| BashTool("rm -rf /") | Match: denied | - | - | - | Instant block |
| FileEditTool("app.ts") | Pass | Pass | Pass | User decides | Dialog shown |
| BashTool("npm install") | Pass | Pass | Unsure (timeout) | User decides | Dialog shown |

This layered design is why Claude Code feels fast during exploration. When you’re reading files, searching code, and running tests, everything flows through without interruption. But the moment Claude wants to edit a file or run a deployment script, you get asked. The system is calibrated to minimize friction where it can and maximize caution where it must.

Step 5: Hulk… Smash!

Permission granted. Time to actually run the tools! And this is where one of the most impactful design decisions in the codebase lives.

When Claude returns multiple tool calls in a single response (which it does often, sometimes 6 or 8 at a time), the naive approach is to run them one by one. Claude Code does something smarter.

Every tool in the system implements a common interface:

type Tool = {
  name: string
  description(input, context): string
  inputSchema: ZodSchema
  execute(input, context): Promise<ToolUseResult>
  isConcurrencySafe(input): boolean
  isReadOnly(input): boolean
}

That isConcurrencySafe() method is a small addition with a big effect: it indicates whether a tool call can run in parallel with others. Before executing a batch of tool calls, Claude Code checks this flag and groups tools into batches. Consecutive read-only tools get batched together and run concurrently, up to a configurable maximum (default 10). The moment a non-read-only tool appears in the sequence, it gets its own serial batch.

So when Claude decides to read 6 files at once to understand a codebase, all 6 reads happen in parallel. But when it edits a file, that edit happens alone, in order, with no risk of race conditions.

Here’s a real example. Say Claude responds to “fix the login bug” with these 6 tool calls:

Tool calls from Claude's response:
  1. GrepTool("login handler")         → read-only
  2. GrepTool("auth middleware")        → read-only
  3. GlobTool("**/auth/**/*.ts")       → read-only
  4. FileEditTool("src/auth/login.ts") → write
  5. FileReadTool("src/auth/test.ts")  → read-only
  6. FileReadTool("src/auth/types.ts") → read-only

The partitionToolCalls() function produces three batches:

Batch 1 (concurrent): GrepTool, GrepTool, GlobTool
  → All three run at the same time

Batch 2 (serial): FileEditTool
  → Runs alone, waits for Batch 1 to finish

Batch 3 (concurrent): FileReadTool, FileReadTool
  → Both run at the same time, after Batch 2 finishes

Instead of 6 sequential operations, you get 3 rounds. The reads that can overlap do overlap. The write that needs exclusive access gets it.
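The batching logic above can be reconstructed in a few lines. `partitionToolCalls` is the real function name; this implementation is my sketch of its behavior:

```typescript
type ToolCall = { name: string; readOnly: boolean };
type Batch = { concurrent: boolean; calls: ToolCall[] };

function partitionToolCalls(calls: ToolCall[], maxConcurrency = 10): Batch[] {
  const batches: Batch[] = [];
  for (const call of calls) {
    const last = batches.length > 0 ? batches[batches.length - 1] : undefined;
    if (call.readOnly && last && last.concurrent && last.calls.length < maxConcurrency) {
      last.calls.push(call);   // extend the current concurrent batch
    } else if (call.readOnly) {
      batches.push({ concurrent: true, calls: [call] });
    } else {
      batches.push({ concurrent: false, calls: [call] });  // writes run alone, in order
    }
  }
  return batches;
}
```

Feeding it the six tool calls from the example produces exactly the three batches shown: three reads together, the edit alone, then two reads together.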

The runTools() function walks through the batches in order. For a concurrent batch, it fires off all the tools at once and yields each result as it completes. For a serial batch, it runs each tool one at a time, waiting for it to finish before starting the next.

There’s a detail worth noting here: tools can return context modifiers. A tool result can include a function that modifies the ToolUseContext for subsequent tools. For concurrent batches, these modifiers are queued and applied after the whole batch completes (because you can’t safely modify shared state while things are running in parallel). For serial batches, they’re applied immediately after each tool.

This matters for tools that change the working directory, update file state caches, or modify the available tool set. The execution model accounts for it.

In practice, this concurrent execution is one of the reasons Claude Code feels fast when exploring a codebase. Instead of waiting for 6 sequential file reads, they all happen at once.

Step 6: On Your Left

The tools have run. Now what?

Each tool execution produces a result, and that result needs to become part of the conversation so Claude can see what happened.

Each tool result becomes a user message with a tool_result content block. The toolUseID links it back to the specific tool_use block that triggered it. This is how the API’s conversation format works: Claude says “I want to use tool X with ID abc123,” and the next user message says “here’s the result for abc123.”

If Claude called GrepTool with ID toolu_1, the tool result message looks like:

{
  "role": "user",
  "content": [{
    "type": "tool_result",
    "tool_use_id": "toolu_1",
    "content": "src/auth/login.ts:42: async function handleLogin(req, res) {\nsrc/auth/login.ts:58: if (!user) return res.status(401)\n..."
  }]
}

Claude sees the grep results and can now decide what to do next: read the file, edit it, or search for more context.

Back in the main loop (query.ts), these results are collected and appended to the message history:

const next: State = {
  messages: [...messagesForQuery, ...assistantMessages, ...toolResults],
  // ...
}

And now the loop goes back to Step 2. Claude gets called again with the updated history, including the tool results, and decides what to do next. Maybe it needs more information and calls more tools. Maybe it has enough to write the code. Maybe it found an error in the test output and wants to fix it.

This is the agentic loop in action. Each iteration adds more context. The agent learns from its own actions, adjusts, and continues.

Between iterations, the loop also refreshes attachments. getAttachmentMessages() runs again to pick up any new memory files or task updates. If a background memory prefetch has completed, its results get injected. If skill discovery found new relevant skills, those get added too. The context evolves between turns.

Plan Mode and Tasks: That’s My Secret, Cap

So far we’ve followed a simple case: user sends a message, Claude responds, tools run, results come back. But real tasks aren’t simple. When you ask Claude Code to “refactor the authentication system,” that’s a large batch of work across dozens of files. The agent needs a way to plan before it acts, and track its own progress as it goes.

Claude Code has two features for this: plan mode and tasks.

Plan Mode

Plan mode is a state change in the permission system. When Claude enters plan mode (either on its own or when you tell it to), the toolPermissionContext.mode switches to 'plan'. The previous mode is saved in prePlanMode so it can be restored later.

What changes in plan mode? Technically, all the same tools are available. Claude could still edit files. But the system injects an attachment into the context that says:

In plan mode, you should:
1. Thoroughly explore the codebase to understand existing patterns
2. Identify similar features and architectural approaches
3. Consider multiple approaches and their trade-offs
4. Use AskUserQuestion if you need to clarify the approach
5. Design a concrete implementation strategy
6. When ready, use ExitPlanMode to present your plan for approval

Remember: DO NOT write or edit any files yet.
This is a read-only exploration and planning phase.

This is steering via prompt, not via tool restriction. Claude still has write access, but the instruction is clear: explore, don’t modify. In practice, this works. Claude reads files, greps for patterns, maps out the codebase, and assembles a plan.

The plan itself is saved to disk as a file. When Claude calls ExitPlanMode, it presents the plan to you for approval. If you approve it, the system restores the pre-plan permission mode and injects the approved plan into the context:

User has approved your plan. You can now start coding.
Start with updating your todo list if applicable.

## Approved Plan:
1. Extract auth logic from middleware.ts into auth/service.ts
2. Update login handler to use new service
3. Add unit tests for auth service
4. Update integration tests
5. Remove deprecated auth helpers

Now Claude has both permission to write and a concrete plan to follow.

The plan mode attachment is throttled. It doesn’t re-inject the full plan reminder every turn. It alternates between “full” and “sparse” reminders, and skips turns entirely if it was recently shown. This keeps the context from getting bloated with repeated plan instructions.

Tasks

If plan mode is the “think before you act” feature, tasks are the “track what you’re doing” feature.

The tasks system is a set of four tools: TaskCreateTool, TaskGetTool, TaskUpdateTool, and TaskListTool. Tasks are stored as individual JSON files on disk.

Each task has a simple structure:

{
  id: string,
  subject: string,
  description: string,
  status: 'pending' | 'in_progress' | 'completed',
  blocks: string[],      // task IDs this task blocks
  blockedBy: string[],   // task IDs that block this task
  owner: string,         // which agent owns this task
}

The blocks and blockedBy fields create dependency relationships between tasks. If task #3 is blocked by task #1, the agent knows it can’t start #3 until #1 is done.
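A sketch of how dependency resolution over that structure might work (the task shape is from the post; the function is illustrative):

```typescript
type Task = {
  id: string;
  status: "pending" | "in_progress" | "completed";
  blockedBy: string[];   // task IDs that must complete first
};

// A task is ready to start when it's pending and everything
// blocking it has been completed.
function readyTasks(tasks: Task[]): Task[] {
  const done = new Set(tasks.filter((t) => t.status === "completed").map((t) => t.id));
  return tasks.filter(
    (t) => t.status === "pending" && t.blockedBy.every((id) => done.has(id)),
  );
}
```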

The system nudges Claude to keep tasks updated. Every N turns, getTaskReminderAttachments() checks how long it’s been since the agent last interacted with the task system. If it’s been too long, the system injects a reminder into the context with the current task list. The agent sees “hey, you have 3 pending tasks and 1 in progress” and is prompted to update its progress or move on to the next task.

The tasks system also ties into the sub-agent architecture. When multiple agents are working together (teams/swarms), they share the same task list (keyed by team name rather than session ID). An agent claiming a task automatically gets set as the owner. When a task is assigned to a new owner, a notification is sent via an internal mailbox system so the receiving agent knows about the assignment.

How These Fit Into the Loop

Plan mode and tasks don’t change the loop itself. They operate within it, using the same tool call mechanism as everything else. EnterPlanMode is a tool call. TaskCreate is a tool call. The context assembly step (Step 1) picks up plan mode attachments and task reminders and injects them alongside everything else.

The pattern is: the agent uses its own tools to organize its own work. It’s not an external project management layer bolted on. It’s part of the conversation, visible to the model, tracked in the same message history.

Step 7: We Have a Hulk Problem

If you’ve used Claude Code for a while, you’ve probably noticed that at some point it says it’s running out of context and asks you to compact the conversation. This is the second part of context management, after the initial context assembly.

Every time the loop completes a turn, it checks: how much context have we accumulated? The conversation history keeps growing. Tool results (especially from file reads and bash commands) can be large. After a few dozen turns, you can easily be pushing against the model’s context window.

The auto-compaction check runs at the start of every iteration and triggers when you’re within about 13,000 tokens (this appears to be the default value) of the limit.

When it triggers, Claude Code has multiple compaction strategies, and it layers them:

Microcompaction runs between turns. It compresses assistant responses that appeared between tool calls. If Claude wrote a long explanation before making 5 file edits, that explanation can be compressed without losing the tool results themselves. The tool inputs and outputs are preserved; the commentary gets summarized.

Session memory compaction is the primary strategy. It takes the oldest chunk of conversation history, calls the API to generate a concise summary, and replaces the original messages with a single CompactBoundaryMessage.

For example, say the first 15 turns of your conversation involved exploring the codebase, reading 20 files, and discussing the architecture. After compaction, all of that becomes something like:

[CompactBoundaryMessage]
"The user asked to fix a login bug. I explored the auth module in
src/auth/, read login.ts, middleware.ts, and types.ts. The bug is
a missing null check on line 58 of login.ts where user lookup can
return undefined. The user confirmed this is the right file."

One message instead of 30. The conversation continues with the summary standing in for everything that came before.

Tool use summaries are generated asynchronously after tool execution. When Claude runs a long sequence of tools (say, 20 grep searches followed by 10 file reads), the summary generator produces a condensed version: “searched for X across the codebase, found it in files A, B, and C, then read those files to understand the implementation.” These summaries can replace the detailed tool results when context gets tight.

Reactive compaction is the emergency fallback. This one doesn’t trigger on a threshold check. It triggers when the API rejects the request with a prompt_too_long error. At that point, the loop catches the error, runs compaction immediately, rebuilds the messages, and retries the API call.
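The catch-compact-retry shape is simple enough to sketch. The error shape and function names here are illustrative; only the `prompt_too_long` error and the retry behavior come from the post:

```typescript
// Catch the context-overflow error, compact the history, and retry once.
async function callWithReactiveCompaction<M>(
  call: (messages: M[]) => Promise<string>,
  compact: (messages: M[]) => M[],
  messages: M[],
): Promise<string> {
  try {
    return await call(messages);
  } catch (err) {
    if ((err as Error).message !== "prompt_too_long") throw err;  // not a context error
    return call(compact(messages));  // rebuild the history and retry immediately
  }
}
```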

And then there’s the circuit breaker. I thought it would be fun to call this one out from the codebase:

// Stop trying autocompact after this many consecutive failures.
// BQ 2026-03-10: 1,279 sessions had 50+ consecutive failures
// (up to 3,272) in a single session, wasting ~250K API calls/day globally.
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3

1,279 sessions were hitting 50+ consecutive compaction failures, some reaching 3,272 failures in a single session. That’s 250K wasted API calls per day. The fix is a constant that stops retrying after 3 consecutive failures.
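The breaker itself is a trivial bit of state. Only the constant comes from the source; the gate around it is a minimal sketch of the pattern:

```typescript
// Stop trying autocompact after this many consecutive failures.
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3

// Hypothetical wrapper: track consecutive failures and trip the breaker.
function makeAutocompactGate() {
  let consecutiveFailures = 0
  return {
    // Once tripped, skip autocompact entirely for this session.
    allowed: () => consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES,
    recordFailure: () => { consecutiveFailures++ },
    // Any success resets the counter.
    recordSuccess: () => { consecutiveFailures = 0 },
  }
}
```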

Step 8: Part of the Journey Is the End

Eventually, the loop has to end. In the happy case, Claude responds with only text blocks and no tool calls. needsFollowUp is false, no recovery is needed, and the function returns { reason: 'completed' }.

But there are many other ways the loop can exit, and Claude Code handles each one:

Termination Reason    What Happened
completed             Claude responded with text only. Turn done.
max_turns             Hit the configured maximum turn count.
aborted_streaming     User pressed Ctrl+C during API streaming.
aborted_tools         User pressed Ctrl+C during tool execution.
prompt_too_long       Context too large, all compaction strategies failed.
model_error           API error that couldn’t be recovered.
blocking_limit        Hard context limit hit with auto-compact disabled.
hook_stopped          A stop hook explicitly prevented continuation.
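In TypeScript, a set of termination reasons like this naturally becomes a string-literal union. This is a sketch of the shape, not the actual type from the codebase:

```typescript
// Every way the agent loop can exit, as a closed union.
type TerminationReason =
  | 'completed'
  | 'max_turns'
  | 'aborted_streaming'
  | 'aborted_tools'
  | 'prompt_too_long'
  | 'model_error'
  | 'blocking_limit'
  | 'hook_stopped'

// Only the two abort reasons are user-initiated; everything else is
// either normal completion or a failure mode.
function isUserAbort(reason: TerminationReason): boolean {
  return reason === 'aborted_streaming' || reason === 'aborted_tools'
}
```

The upside of a closed union is that a switch over `TerminationReason` is exhaustively checked by the compiler, so adding a new exit path forces every handler to deal with it.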

Before terminating on an error, the loop tries to recover. Each error type has its own recovery path:

Prompt too long: First, try context collapse (a lightweight drain that removes granular context while preserving structure). If that doesn’t free enough space, try reactive compaction (full summarization). If that also fails, surface the error and exit.

Max output tokens: Claude ran out of output space mid-response. First, escalate the output token cap to a higher limit (ESCALATED_MAX_TOKENS) and retry. If that doesn’t work, send a “continue from where you left off” nudge and retry, up to 3 times. If that’s exhausted, surface the partial response.

Server overload (529): Retry with exponential backoff, but only for foreground queries where the user is actively waiting. Background tasks (summarization, classification, memory operations) don’t retry on 529, because during a capacity crunch each retry generates 3-10x gateway amplification. Making background tasks retry would make the overload worse for everyone.

Model fallback: If the primary model repeatedly fails with 529 errors, Claude Code can fall back to an alternative model entirely. The loop clears its state, tombstones any orphaned partial messages, creates a fresh streaming tool executor, and retries with the fallback model.
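The 529 policy described above can be sketched as a retry wrapper that only backs off for foreground work. The function name, option names, and retry counts here are assumptions, not the actual implementation:

```typescript
// Retry on 529 with exponential backoff, but only for foreground queries.
// Background tasks fail fast so they don't amplify an overload.
async function withOverloadRetry<T>(
  fn: () => Promise<T>,
  opts: { foreground: boolean; maxRetries?: number; baseDelayMs?: number },
): Promise<T> {
  const maxRetries = opts.foreground ? (opts.maxRetries ?? 4) : 0
  const base = opts.baseDelayMs ?? 1000
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      const overloaded = (err as { status?: number }).status === 529
      if (!overloaded || attempt >= maxRetries) throw err
      // Exponential backoff: base, 2x, 4x, ... capped at 30s.
      const delay = Math.min(base * 2 ** attempt, 30_000)
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
}
```

Setting `maxRetries` to zero for background work is the whole trick: the same wrapper serves both paths, and the policy difference is one ternary.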

Sub-Agents: I Am… Inevitable

Sometimes, one of the tool calls in Step 5 is AgentTool. When that happens, the whole loop starts again, recursively.

The AgentTool accepts a prompt, an optional agent type, and isolation settings:

import { z } from 'zod'

const baseInputSchema = z.object({
  description: z.string().describe('A short (3-5 word) description'),
  prompt: z.string().describe('The task for the agent to perform'),
  subagent_type: z.string().optional(),
  model: z.enum(['sonnet', 'opus', 'haiku']).optional(),
  run_in_background: z.boolean().optional(),
})

Under the hood, it creates a new query() loop with its own message history, its own tool context, and a restricted set of tools. The sub-agent can’t spawn certain types of sub-agents itself, and it has limits on which tools it can access.

The isolation modes determine how the sub-agent relates to the parent’s workspace:

  • Same CWD: The sub-agent shares the parent’s working directory. Good for research tasks where it needs to read the same files the parent is working with.
  • Worktree: Creates a git worktree, giving the sub-agent its own copy of the repo. It can make changes without interfering with the parent’s work. If the changes are useful, they’re sitting on a branch ready to merge. If not, the worktree gets cleaned up automatically.
  • Background: The sub-agent runs asynchronously. The parent continues its own work and gets notified when the sub-agent completes. This is how Claude Code can do research in the background while continuing to help you with something else.

A sub-agent isn’t a special case in the codebase. It’s the same query() generator, the same tool system, the same permission model, the same compaction logic. Just with a different scope and a restricted tool set. When it completes, its result flows back to the parent loop as a tool result, just like any other tool.

This recursive design means the complexity of sub-agents is bounded. There’s no separate “agent runner” or “task executor.” The same loop that handles your top-level request handles sub-tasks. If compaction works at the top level, it works for sub-agents. If error recovery works at the top level, it works for sub-agents.
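A toy illustration of the recursive shape: `runQuery` stands in for the real query() generator, and the sub-agent tool simply re-enters it with a narrower tool list. None of these names are from the source:

```typescript
// A tool receives its input plus the tool set it may delegate to.
type Tool = { name: string; run: (input: string, tools: Tool[]) => Promise<string> }

// Stand-in for the agent loop: in reality this streams from the API and
// dispatches tool_use blocks; here we just dispatch to a matching tool.
async function runQuery(prompt: string, tools: Tool[]): Promise<string> {
  const tool = tools.find((t) => prompt.includes(t.name))
  if (tool) return tool.run(prompt, tools)
  return `done: ${prompt}`
}

// The sub-agent tool is not a special case: it calls runQuery again with a
// restricted tool set (here, everything except itself, so it can't recurse
// indefinitely).
const agentTool: Tool = {
  name: 'AgentTool',
  run: (input, tools) =>
    runQuery(
      input.replace('AgentTool', ''),
      tools.filter((t) => t.name !== 'AgentTool'),
    ),
}
```

The result of the nested `runQuery` call flows back to the parent as an ordinary tool result, which is exactly why compaction and error recovery come along for free.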

What This All Adds Up To

We’ve followed a single user message from the moment it’s typed to the moment the agent loop exits. Along the way, we’ve seen:

  • Context assembly that pulls from system prompts, CLAUDE.md files at every directory level, persistent memory, and skills
  • An async generator loop that streams responses, executes tools, and manages its own state across iterations
  • 43 built-in tools plus any number of MCP tools, all going through the same permission system
  • A layered permission model that auto-approves reads and asks about writes
  • Concurrent execution of read-only tools and serial execution of writes
  • Tool results fed back as conversation messages, creating a self-reinforcing loop
  • Four compaction strategies with a circuit breaker, keeping long sessions alive
  • Layered error recovery for every failure mode: context overflow, output limits, server overload, model fallback
  • Recursive sub-agents that reuse the same loop with restricted scope

Here’s the thing about all of this: the LLM call is one line of code. Everything else, the hundreds of files and thousands of lines (not as many as Garry Tan’s LOC, but a respectable number nevertheless), is the harness around that call. An agent harness is mostly plumbing: assembling context, checking permissions, running tools safely, managing the context window, recovering from errors, and recording sessions. It’s that plumbing that makes the agent work, and that sets it apart from the rest.

So if you’re building AI agents, remember that most of your work will be building the harness around it, and that’s where your moat lies. If you want to get started, my Claude Code guide covers how to use it, and my Claude tips and tricks post has practical workflows you can apply today.

Related Posts

Ralph Wiggum: The Dumbest Smart Way to Run Coding Agents

Forget agent swarms and complex orchestrators. The secret to overnight AI coding is a bash for-loop. I break down the Ralph Wiggum technique and show you how to actually ship working code with long-running agents.

8 min