Post

The Agentic SDLC: When Execution Gets Cheap, Verification Becomes the Job

How autonomous coding agents reshape planning, review, testing, and ops — and why the hard part of shipping them is governing what they produce, not generating it.

Mad Scientist 20 Jun 2026 7 min read

Enjoying the field notes? Subscribe for each new deep dive.Subscribe →

The Agentic SDLC: When Execution Gets Cheap, Verification Becomes the Job

How autonomous coding agents reshape planning, review, testing, and ops — and why the hard part of shipping them is governing what they produce, not generating it.

For ten years, "AI in the IDE" meant better autocomplete: a human directing every keystroke, the model shaving seconds off each one. That era is closing. Engineering teams now hand entire workflows to agents — take a ticket, read the codebase, write a multi-file implementation, run the suite, iterate on failures, open a PR — with the human setting intent and reviewing the result (per CodeRabbit's agentic-SDLC guide). The distinction matters: AI-assisted means a human stays in the loop at every decision point; agentic means the AI pursues a goal across many steps without a human directing each one (per CodeRabbit).

That shift relocates the bottleneck. The CodeRabbit guide frames three delivery models cleanly: in the traditional SDLC the constraint is coding speed and reviewer availability; in AI-assisted development it's reviewer availability and context-switching; in the agentic SDLC it's verification, review, and governance (per CodeRabbit). The cost of producing code falls toward zero. The cost of trusting it does not.

Three-lane comparison — "Traditional SDLC" (humans own every phase, bottleneck = coding speed), "AI-assisted" (humans direct, AI accelerates, bottleneck = review), "Agentic SDLC" (agents execute, humans verify, bottleneck = verification + governance); arrow showing the bottleneck migrating left-to-right downstream

The capability curve is real — and steep

This is not hype detached from measurement. SWE-bench, introduced by Jimenez et al. in 2023, built an evaluation of 2,294 real GitHub issues across 12 popular Python repos; the task is to edit a codebase to resolve a described issue, often coordinating changes across multiple functions, classes, and files. The best model at launch, Claude 2, resolved just 1.96% (per arXiv:2310.06770).

Read the paper on arXiv →

The leap came not only from bigger models but from better interfaces. SWE-agent (Yang et al., 2024) argued that LM agents are "a new category of end users" who benefit from purpose-built tooling the way humans benefit from IDEs. Their agent-computer interface (ACI) — commands to create/edit files, navigate the repo, and run tests — lifted performance to 12.5% pass@1 on SWE-bench, far exceeding non-interactive LMs (per arXiv:2405.15793). The lesson for production: the harness around the model often matters as much as the model.

Read the paper on arXiv →

By late 2025 the benchmark frontier had moved again. SWE-Bench Pro (Deng et al.) reframed the bar around enterprise reality: 1,865 human-verified problems across 41 actively maintained repos, including held-out and commercial partitions to resist contamination, featuring long-horizon tasks that take a professional engineer hours to days and patches spanning multiple files (per arXiv:2509.16941). Crucially, the authors cluster failure modes from agent trajectories — an acknowledgement that the interesting question is now how agents fail, not whether they can pass toy tasks.

Read the paper on arXiv →

The verification tax, quantified

If agents can do the work, why hasn't delivery uniformly accelerated? Google Cloud's 2024 DORA report — a decade into the program — gives the uncomfortable answer. A modeled 25% increase in AI adoption was associated with estimated gains of +7.5% documentation quality, +3.4% code quality, and +3.1% code-review speed. But the same adoption was associated with an estimated −1.5% in delivery throughput and −7.2% in delivery stability (per Google Cloud / DORA 2024). Over 75% of respondents used AI for at least one daily task, yet 39% reported little to no trust in AI-generated code (per Google Cloud / DORA 2024).

DORA's own framing: "AI does not appear to be a panacea." Improving the development process does not automatically improve delivery without the fundamentals — small batch sizes and robust testing (per Google Cloud / DORA 2024). This is the empirical shape of the verification tax: you generate faster, then pay it back auditing diffs that look normal but encode iterations, intermediate decisions, and cross-file side effects a reviewer never watched happen.

Two-axis conceptual sketch — horizontal axis "AI adoption", one rising curve "individual productivity / code generation", one declining curve "delivery stability"; the widening gap between them labeled "verification tax"

Phase by phase: what actually changes

Planning. Quality is now determined in planning, not just documented. Vague intent yields PRs that are technically correct but functionally wrong — and expensive to rework. Teams invest more in turning requirements into precise, codebase-grounded specs (per CodeRabbit). Anthropic's guidance mirrors this: explore first (read files in plan mode), then plan, then code — because "letting Claude jump straight to coding can produce code that solves the wrong problem" (per Anthropic Engineering).

Development. The most mature phase. Agents complete full features, run autonomous debugging loops, and handle multi-day refactors; the constraint is no longer typing speed (per CodeRabbit). The practical enabler is context discipline. Anthropic's first-order constraint is blunt: "Claude's context window fills up fast, and performance degrades as it fills" — manage context as the scarce resource (per Anthropic Engineering).

Testing. Testing becomes continuous rather than scheduled. A canonical loop: the agent writes a failing test that reproduces a bug, then modifies logic until it passes (per CodeRabbit). Anthropic operationalizes this as giving the agent a check that returns pass/fail — tests, a build exit code, a linter, a fixture diff, or a screenshot — so it self-corrects instead of making you the verification step (per Anthropic Engineering).

Review. This is where the verification gap concentrates. Agents generate faster than teams verify, and opacity makes review harder: the diff looks ordinary while the reasoning behind it is invisible (per CodeRabbit). Anthropic's mitigations are concrete — have the agent show evidence (test output, command results) rather than assert success, and use a verification subagent with a fresh model to try to refute the result; a deterministic Stop hook can block a turn from ending until a check passes (per Anthropic Engineering).

# Pseudo-code: a verification-gated agent turn (pattern, not a specific API)
def agent_turn(task, repo):
    plan = agent.explore_then_plan(task, repo)     # read-only first
    patch = agent.implement(plan, repo)
    while not verify(patch, repo):                  # tests/build/lint = pass/fail
        patch = agent.iterate(patch, failures=verify.last_report)
    # second opinion: a fresh model tries to REFUTE, not rubber-stamp
    if not reviewer_subagent.refute(patch, evidence=verify.evidence):
        return open_pull_request(patch, evidence=verify.evidence)
    return escalate_to_human(patch, reason="reviewer dissent")

Ops. Incident response folds into the loop: a production alert can trigger root-cause analysis, a fix proposal, regression tests, and findings surfaced directly in the alert's Slack thread (per CodeRabbit). This is also where DORA's stability warning bites hardest — an agent that ships fixes fast but degrades stability is a net liability.

More agents is not automatically safer

The tempting fix — stack a reviewer agent on a coder agent — has its own failure surface. Cemri et al.'s "Why Do Multi-Agent LLM Systems Fail?" analyzed 1,600+ annotated traces across 7 frameworks and built MAST, a taxonomy of 14 failure modes in 3 categories: system-design issues, inter-agent misalignment, and task verification (per arXiv:2503.13657). Their bottom line is sobering: failures need more sophisticated solutions than simple fixes, and multi-agent systems' gains on popular benchmarks are "often minimal" (per arXiv:2503.13657). Verification isn't a free lunch you buy by adding a reviewer role — it's a system property you have to design.

Read the paper on arXiv →

What this means for shipping agentic AI

Four implications for anyone trying to make agentic AI production-real:

Treat verification as the product, not an afterthought. If execution is cheap, your differentiator is the trust apparatus around it: deterministic gates, evidence-bearing PRs, second-opinion reviewers with fresh context.
Invest the saved time upstream. Precise, codebase-grounded specs prevent the most expensive failure — confidently-built wrong things (per CodeRabbit; Anthropic Engineering).
Keep the fundamentals. DORA's decade-long signal holds: small batches and robust testing are what convert development-speed into delivery-speed (per Google Cloud / DORA 2024).
Govern access explicitly. CodeRabbit's four requirements — context, persistent knowledge, multi-player collaboration, and governance (scoped access, role-based controls, auditable guardrails) — separate "isolated coding assistance" from a real agentic SDLC (per CodeRabbit).

The agent can write the code. Whether you can ship it depends on whether you've rebuilt the lifecycle around the question that now matters most: how do you know this is right?

Sources & further reading

Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" — arXiv:2310.06770 — https://arxiv.org/abs/2310.06770
Yang et al., "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering" — arXiv:2405.15793 — https://arxiv.org/abs/2405.15793
Deng et al., "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" — arXiv:2509.16941 — https://arxiv.org/abs/2509.16941
Cemri et al., "Why Do Multi-Agent LLM Systems Fail?" — arXiv:2503.13657 — https://arxiv.org/abs/2503.13657
Anthropic Engineering, "Best practices for Claude Code" — https://www.anthropic.com/engineering/claude-code-best-practices
Google Cloud Blog, "Announcing the 2024 DORA report" — https://cloud.google.com/blog/products/devops-sre/announcing-the-2024-dora-report
DORA, "Accelerate State of DevOps Report 2024" — https://dora.dev/research/2024/dora-report/
CodeRabbit, "A guide to the agentic software development lifecycle (SDLC)" — https://coderabbit.ai/guides/agentic-sdlc

Get the next deep dive in your inbox

Field notes on shipping agentic AI — no spam, unsubscribe anytime.

Subscribe →