MacleodLabs.ai
← All posts

Post

The Agentic SDLC: What Actually Changes When Agents Plan, Review, Test, and Operate Your Software

How autonomous coding agents are rewiring the software development lifecycle — and the primary-source evidence on where they help, where they hurt, and what you must keep human.

The Agentic SDLC: What Actually Changes When Agents Plan, Review, Test, and Operate Your Software

Enjoying the field notes? Subscribe for each new deep dive.Subscribe →

The Agentic SDLC: What Actually Changes When Agents Plan, Review, Test, and Operate Your Software

How autonomous coding agents are rewiring the software development lifecycle — and the primary-source evidence on where they help, where they hurt, and what you must keep human.

The fastest way to understand the agentic software development lifecycle is to watch one number. In March 2024, SWE-agent — the first system designed to let a language model autonomously navigate a repo and submit a patch — resolved 12.5% of real GitHub issues (pass@1) on SWE-bench (per arXiv:2405.15793). As of February 2026, the top entry on the SWE-bench Verified leaderboard, Claude 4.5 Opus, resolves 76.8% at an average of $0.75 per task (per swebench.com). SWE-bench itself is 2,294 real issues drawn from popular Python repos, where a model is given a codebase and an issue and must produce a patch that passes the project's own tests (per arXiv:2310.06770).

arXiv:2310.06770 — first page

Read the paper on arXiv →

That curve reframes the entire discussion. The question is no longer "can an agent write code that works." Increasingly, on well-scoped tasks, it can. The question is how an autonomous agent reshapes each stage of the lifecycle — and what new failure modes you inherit when you let it.

The unlock was the interface, not the model

The most underappreciated finding from the SWE-agent paper is that the leap came from interface design. The authors argue LM agents are "a new category of end users" who, like humans with IDEs, "benefit from specially-built interfaces" (per arXiv:2405.15793). They call this the Agent-Computer Interface (ACI).

arXiv:2405.15793 — first page

Read the paper on arXiv →

The concrete ACI choices are humble and instructive (per the SWE-agent ACI docs):

  • A linter runs on every edit and rejects the edit if the result isn't syntactically valid — the agent can't commit broken code.
  • A custom file viewer shows ~100 lines per turn, with scroll and in-file search, instead of letting the agent cat whole files.
  • A search command lists only the files that contain a match — showing more per-match context "proved to be too confusing for the model."
# ACI design principle: constrain the action space, validate every step
on edit(file, patch):
    new = apply(patch, file)
    if not lint(new).ok:
        return "Edit rejected: " + lint(new).errors   # block the bad state
    write(new)
    return "OK"

The lesson for production: a frontier model behind a sloppy harness underperforms a weaker model behind a well-designed one. Your scaffolding — the tools, the validation gates, the context you show per turn — is a first-class engineering artifact.

A loop showing an LLM agent connected to an Agent-Computer Interface layer (linter-gated editor, 100-line file viewer, file-listing search, test runner), which sits between the agent and the repository/runtime; arrows show observe -> reason -> act -> validate -> observe

Stage 1 — Planning

Agentic planning shows up empirically as scope. In a study of 567 PRs created with Claude Code across 157 open-source repos (Feb–Apr 2025), agentic PRs (APRs) carried a median description length of 355 words vs 56 for human PRs, and 40.0% of APRs were multi-purpose vs 12.2% for humans (per arXiv:2509.14745). Agents bundle a fix with tests, a refactor, and docs in one go.

arXiv:2509.14745 — first page

Read the paper on arXiv →

That cuts both ways. Rich self-documentation and broad coverage are good. But multi-purpose PRs are harder to review and revert, and they blur the unit of change. For shipping, the planning-stage discipline is to constrain the agent's task to one reviewable intent — the same hygiene you'd enforce on a junior engineer.

Stage 2 — Code review

Review is where agentic SDLC has scaled furthest in industry. Microsoft reports its AI PR-review assistant now supports over 90% of PRs and touches more than 600K pull requests per month, with 5,000 onboarded repositories observing 10–20% median PR-completion-time improvements (per Microsoft Engineering). The design constraints are the interesting part: the AI is "treated just like any other reviewer," it does not commit changes directly — the author must click "apply change" — and every change is attributed in commit history for accountability (per Microsoft Engineering). Their learnings fed GitHub Copilot's PR review, which reached general availability in April 2025.

Two patterns generalize. First, AI review is best aimed at the toil — style, null checks, missing exception handling — so humans spend their attention on architecture and security. Second, keep the human as the committer. The agent proposes; a person applies. That single rule preserves accountability and a clean audit trail.

The skepticism is real: an agentic reviewer that floods authors with low-value or low-confidence suggestions trains humans to ignore it, which is why aiming AI review at clear-cut toil — and keeping a human as the committer — matters.

Stage 3 — Testing

Testing is the stage agents want to do, because it's the toil humans skip. An MSR '26 empirical study analyzing 2,232 test-related commits from the AIDev dataset found AI authored 16.4% of all commits adding tests in real-world repositories, and that AI-generated tests deliver coverage comparable to human-written tests (per arXiv:2603.13724).

arXiv:2603.13724 — first page

Read the paper on arXiv →

Two caveats matter for production. First, structure differs: AI test methods were longer, with higher assertion density but lower cyclomatic complexity — i.e., linear, repetitive logic (per arXiv:2603.13724). Second, prevalence is wildly bimodal: in small agent-heavy projects AI wrote up to 100% of test-adding commits, but in large enterprise repos it fell to 1.9%–14.4% (e.g., azure-sdk-for-js 1.9%, cal.com 14.4%) (per arXiv:2603.13724).

The trap is using the agent to both write the code and write the tests that "prove" it. SWE-bench is honest precisely because the project's own tests are the oracle. Replicate that: gate agent code against tests the agent didn't author, or against a human-reviewed spec. Coverage going up while assertions are shallow is a false sense of safety.

Conceptual bar comparison of AI vs human test methods across three axes — eLoC (AI higher), assertion density (AI higher), cyclomatic complexity (AI lower) — labelled "conceptual, from arXiv:2603.13724 directional findings"

Stage 4 — Ops

The newest frontier is the agent on call. The emerging pattern in AI-SRE tooling is to separate detection, triage, investigation, remediation, and escalation into distinct stages rather than handing an agent a blank prod console (per Augment Code's AI-SRE guide). The safe default mirrors code review: the agent investigates alerts, correlates recent changes, runs read-only diagnostics, drafts a root-cause hypothesis and a remediation plan — and a human approves any action that mutates production. Drafting incident summaries, on-call handover notes, and runbook updates is where practitioners report the clearest wins today (per r/sre practitioner discussion).

The honest counter-evidence

The benchmark curve is not the production reality. METR ran a randomized controlled trial: 16 experienced open-source developers, 246 real issues, repos averaging 22k+ stars and 1M+ lines of code. Allowing AI tools made them 19% slower — yet they believed AI had sped them up by ~20%, and had forecast a 24% speedup beforehand (per METR; arXiv:2507.09089).

arXiv:2507.09089 — first page

Read the paper on arXiv →

METR is careful: this does not show AI fails to help most developers; their devs and giant codebases aren't representative, and learning effects for tools like Cursor may only appear after hundreds of hours (per METR). They later revised their experiment design for 2026 (per metr.org). But the durable lesson is brutal: self-reported and anecdotal speedup estimates can be badly wrong. Measure outcomes, not feelings.

The second counter-signal is quality. GitClear's analysis of 211M changed lines (2020–2024) found refactoring-associated lines fell from 25% (2021) to under 10% (2024) while copy/pasted lines rose from 8.3% to 12.3% — and 2024 was the first year ever where copy/pasted lines exceeded "moved" (refactored) lines (per GitClear). Agents generate more code; they don't automatically generate DRY code.

What this means for shipping agentic AI

  • The harness is the product. Per SWE-agent, interface design (linter gates, bounded context, validated actions) drives more capability than model swaps. Invest there.
  • Humans own the merge and the deploy. Per Microsoft's design, the agent proposes and a human applies — preserving accountability and audit trail. Apply the same gate to prod actions.
  • Don't let the agent grade its own homework. Per SWE-bench's design and the testing study's shallow-assertion finding, gate agent code against tests or specs the agent didn't author.
  • Constrain scope per task. Per the agentic-PR study, agents sprawl into multi-purpose PRs; enforce one reviewable intent.
  • Measure outcomes. Per METR, perceived speedup is unreliable. Track merge rates, revert rates, churn, and time-to-resolution — not gut feel.

The agentic SDLC isn't "agents replace engineers." It's that the engineer's job moves up a level: from writing every line to designing the interfaces, gates, and oracles that let an autonomous worker operate safely. The 12.5% → 76.8% curve says the worker is getting good. The METR and GitClear data say the gates are what keep it shippable.

Sources & further reading

  • SWE-bench paper — arXiv:2310.06770 (https://arxiv.org/abs/2310.06770)
  • SWE-bench Verified leaderboard — https://www.swebench.com/
  • SWE-bench repo — https://github.com/swe-bench/SWE-bench
  • SWE-agent / ACI paper — arXiv:2405.15793 (https://arxiv.org/abs/2405.15793)
  • SWE-agent ACI docs — https://github.com/SWE-agent/SWE-agent/blob/main/docs/background/aci.md
  • METR developer productivity RCT — https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ ; arXiv:2507.09089 ; 2026 update https://metr.org/blog/2026-02-24-uplift-update/
  • Agentic PRs empirical study — arXiv:2509.14745 (https://arxiv.org/html/2509.14745v1)
  • Microsoft AI-powered code reviews — https://devblogs.microsoft.com/engineering-at-microsoft/enhancing-code-quality-at-scale-with-ai-powered-code-reviews/
  • Testing with AI agents (MSR '26) — arXiv:2603.13724 (https://arxiv.org/html/2603.13724)
  • GitClear AI code quality 2025 — https://www.gitclear.com/ai_assistant_code_quality_2025_research

Get the next deep dive in your inbox

Field notes on shipping agentic AI — no spam, unsubscribe anytime.

Subscribe →