Post

Returns to Expertise: What 400K Claude Code Sessions Say About Shipping Agentic AI

Anthropic's new economic study argues coding agents amplify domain expertise rather than replace it — here's what's real, what's contested, and what it changes about production agent design.

Mad Scientist 17 Jun 2026 6 min read

Enjoying the field notes? Subscribe for each new deep dive.Subscribe →

Returns to Expertise: What 400K Claude Code Sessions Say About Shipping Agentic AI

Anthropic's new economic study argues coding agents amplify domain expertise rather than replace it — here's what's real, what's contested, and what it changes about production agent design.

The most quietly subversive claim in agentic AI right now isn't "agents will replace engineers." It's the opposite. Per Anthropic's June 16, 2026 report Agentic Coding and Persistent Returns to Expertise (authors Zoe Hitzig, Maxim Massenkoff, Eva Lyubich, Ryan Heller, Peter McCrory), the people who get the most out of Claude Code are the ones who already understand the problem deeply — and that signal shows up across ~400,000 sessions from ~235,000 people between October 2025 and April 2026.

If you build or operate agentic systems, this reframes where your reliability budget should go. Let me walk through what the study actually measured, the numbers worth quoting, the methodological caveat you must hold, and the production implications.

What they measured (and didn't)

Anthropic used a privacy-preserving classifier pipeline — no humans reading transcripts — to label each session. Crucially, scope is Claude Code via the CLI, Claude.ai, or desktop app. It excludes third-party IDEs, SDKs, and headless mode (claude -p "<prompt>"), because much of that is programmatic/automated (per Anthropic). So this is interactive, human-in-the-loop agent use — not autonomous pipelines.

Three classifier families do the heavy lifting:

Work-mode classifier — buckets each session into one of nine modes across four families: writing/maintaining code (building, fixing, testing, orchestrating), operating software (deploying, configuring, running pipelines, monitoring), figuring out what to do (understanding, planning), and non-code (analyzing data, communicating).
Decision-attribution classifier — splits decisions into planning (what to do, what counts as done) versus execution (which files, which commands, which language).
Expertise + success classifiers — rate task-specific expertise on a 5-point scale, and grade outcomes as succeeded / partially succeeded / failed / no clear goal, with a separate "verifiable evidence" bar like committed code.

A pipeline flow — "400K Claude Code sessions" feeding into three parallel classifier boxes (Work-mode 9 buckets, Decision attribution: planning vs execution, Expertise + Success grading), each emitting a labeled output, converging into "Anthropic Economic Index"

A validation check worth noting: per Anthropic, more than 90% of sessions labeled as creating/modifying code showed actual code changes in telemetry — a sanity tether between classifier labels and ground truth.

The division of labor: you steer, the agent drives

Per Anthropic, on average people make ~70% of the planning decisions but only ~20% of the execution decisions. "People decide what to build, and the agent decides how to build it." A typical session is ~4 turns; each prompt triggers ~10 Claude actions and ~2,400 words of output on average (sometimes 100+).

The interesting structural detail: when users retain execution control (>80%), Claude takes ~8 actions/turn; when Claude controls planning (>80%), it takes ~16 actions/turn (per Anthropic). The agent does more when handed more — but who hands it more, and safely? That's where expertise enters.

The core finding: experts get more agent per prompt

Per Anthropic, agent output scales sharply with task-specific expertise:

Expertise	Actions/prompt	Words output
Novice	~5	~600
Expert	~12	~3,200

That's a ~2.4x gap in actions and ~5x in output for the same unit of human input. Anthropic reports it holds within every work mode and value band, survives regression controls (work mode, task value, month, occupation, model family), and remains significant at roughly +9% actions and +13% output per expertise level (p < 0.001).

On success: per independent breakdowns of the report (explainx, digg), verified success rises from ~15% (novice) → ~28% (intermediate) → ~33% (expert), and partial success from ~77% → ~91% → ~92%. Note the shape — most of the gain is in the novice→intermediate jump. Per explainx, novices abandon sessions ~19% of the time when trouble hits; intermediate/expert users only ~5–7%. The expert advantage is largely recovery skill: reframing, decomposing, adding context — not hitting fewer walls.

[[CHART: success rate vs expertise level]]

Expertise is task-specific, not a job title

This is the part fractional CTOs should internalize. Per Anthropic, expertise is rated by how you frame the task — precision of instructions, what you ask Claude to verify, whether you correct it. "A senior engineer asking their first Rust question is a beginner at Rust. An accountant who has never used Python, but tells Claude exactly which reconciliation rules a Python script must enforce... is an expert at that task."

Consequently, per @AnthropicAI, on the toughest (verifiable) success measure every major occupation lands within 7 percentage points of software engineers. Per explainx, software engineers hit ~30% verified success overall (34% in code-producing sessions) versus ~26%/29% for non-software occupations — and that gap did not widen over seven months. Management occupations reportedly scored the highest verified success of any group, because translating a problem into requirements and judging the output is exactly the transferable skill (per explainx).

Over the seven months, usage also shifted: fixing-broken-code sessions fell (per explainx, ~33%→19%), while operating software (~14%→21%) and data analysis/prose (~10%→20%) grew. And per Anthropic, the freelance-marketplace-estimated value of the average session rose ~25%. People are using the agent more ambitiously, not more cautiously.

The caveat you must not skip: Claude grading Claude

Here's the integrity check. Both expertise and success are assigned by Claude-based classifiers reading transcripts. Multiple researchers flagged the obvious conflict: "We let Claude analyze Claude sessions and then asked Claude how much value Claude brought" (per @xeophon, via digg); similar from @ziv_ravid. A fair defense (per @main_horse, via digg) is that the paper avoids the strong causal "amplification" claim — it reports correlations, not a controlled effect.

That distinction matters because a rigorous experiment points the other way. METR's 2025 randomized controlled trial found that early-2025 AI tools slowed experienced open-source developers by ~19% on real tasks, despite developers believing they'd been sped up (per METR).

Read the paper on arXiv →

So we have observational data (Anthropic, huge N, no causal claim) saying expertise correlates with agent leverage, and an RCT (METR, small N, causal) showing self-perceived speedups can be illusory. They aren't contradictory — Anthropic measures who succeeds, METR measures whether AI caused speedup — but together they argue: trust verifiable outcomes over vibes.

The other independent anchor is the SWE-chat dataset (the public dataset Anthropic drew example data from). Per its paper (Baumann et al., submitted Apr 22 2026), real-world agent use is bimodal — 41% of sessions are near-total "vibe coding," 23% are human-only — only ~44% of agent-produced code survives into commits, agent code introduces more security vulnerabilities than human code, and users push back in 44% of turns.

Read the paper on arXiv →

What this changes about shipping agentic AI

For an "make agentic AI shippable" practice, three concrete moves:

Invest in specification, not just models. If expertise correlates with both agent throughput and recovery from failure, the cheapest reliability gain is teaching operators to write precise, verifiable task definitions. Bake "what counts as done" into the prompt scaffold:

GOAL: <one-sentence outcome>
DONE WHEN: <verifiable checks, e.g. "tests in tests/billing pass", "no new lint errors">
CONSTRAINTS: <files/services NOT to touch>
VERIFY: <command the agent must run and show output for>

Keep humans on planning, gate execution by verifiable evidence. The 70/20 split is healthy; the failure mode is over-handing planning to the agent. Require committed-code / passing-test evidence before a session is "done" — mirror Anthropic's toughest bar in your own acceptance criteria.
Treat the SWE-chat findings as your QA threat model. ~44% code survival and elevated vulnerability rates mean an agent-friendly org needs more review automation, not less: security linting, dependency scanning, and a human approval gate on anything touching auth, payments, or data egress.

The clean takeaway: agentic coding doesn't flatten the value of knowing things. It compounds it. The org that wins isn't the one with the best model — it's the one whose people can specify problems and verify answers.

Sources & further reading

@AnthropicAI launch thread — https://x.com/AnthropicAI/status/2066969532380721386
Anthropic, Agentic coding and persistent returns to expertise — https://www.anthropic.com/research/claude-code-expertise
explainx breakdown — https://explainx.ai/blog/anthropic-claude-code-expertise-research-agentic-coding-2026
digg coverage (incl. critiques) — https://digg.com/tech/y1f4liz1
SWE-chat dataset — https://huggingface.co/datasets/SALT-NLP/SWE-chat
Baumann et al., SWE-chat (arXiv:2604.20779) — https://arxiv.org/abs/2604.20779
METR RCT blog — https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
METR RCT paper (arXiv:2507.09089) — https://arxiv.org/abs/2507.09089

Get the next deep dive in your inbox

Field notes on shipping agentic AI — no spam, unsubscribe anytime.

Subscribe →