Project Fetch: A Model Programmed a Robodog 20x Faster Than Humans — and Still Couldn't Fetch

Anthropic's Project Fetch Phase Two shows Claude Opus 4.7 crushing the software side of robotics autonomously, while precise physical control remains stubbornly unsolved.

Mad Scientist · 20 Jun 2026 · 5 min read

There's a perfect, slightly comic image at the center of Anthropic's latest Frontier Red Team result: a frontier model that programmed a robot dog roughly 20 times faster than the best human team, and then watched it fail to nudge a beach ball back to home base (per AnthropicAI on X). Both halves of that sentence matter, and the gap between them is the whole story.

The setup

Project Fetch tests how AI helps non-expert humans operate an off-the-shelf robotic quadruped — a "robodog." The original experiment, in August 2025, took teams of Anthropic employees, randomly assigned to work with or without Claude, and timed them on a sequence of tasks (per Anthropic Frontier Red Team):

Operate the robodog using the manufacturer's controller
Connect to its video and lidar sensors
Write and operate a manual-control program
Monitor the robodog's path through space
Write a program to detect the beach ball
Combine everything to autonomously retrieve the ball

Phase Two (June 18, 2026), authored by Michael Ilie, C. Daniel Freeman, and Kevin K. Troy, re-ran those tasks — but this time with Claude Opus 4.7 operating autonomously inside Claude Code, using adaptive thinking at maximum effort, across three trials (per Anthropic). The researcher's role was deliberately minimal: plug the laptop into the robodog, enter the initial prompt, approve commands, and approve progression to the next task. Tasks that required the physical manufacturer controller were excluded.

The numbers

On the four tasks that every human team completed back in August, the comparison is stark (per Anthropic):

Participant	Time	Relative to Opus 4.7
Team Claude-less	361 min	~37.7x slower
Team Claude	181 min	~18.9x slower
Claude Opus 4.7	9 min 35 sec	—
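
The "Relative to Opus 4.7" column is plain division once every time is expressed in minutes. A quick sketch to check the arithmetic (the function and variable names are ours, chosen for illustration; only the times come from the table above):

```python
def to_minutes(minutes: int, seconds: int = 0) -> float:
    """Convert a minutes/seconds pair to fractional minutes."""
    return minutes + seconds / 60


def slowdown(human_minutes: float, model_minutes: float) -> float:
    """How many times longer the human team took than the model."""
    return human_minutes / model_minutes


# 9 min 35 sec, taken from the table
OPUS_MINUTES = to_minutes(9, 35)

# Team times in minutes, taken from the table
TEAM_MINUTES = {"Team Claude-less": 361, "Team Claude": 181}

for team, minutes in TEAM_MINUTES.items():
    print(f"{team}: ~{slowdown(minutes, OPUS_MINUTES):.1f}x slower")
```

This prints ~37.7x for Team Claude-less and ~18.9x for Team Claude, matching the table and consistent with the headline's rounded "20x" for the faster human team.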

Across the broader five-task Phase Two set (connect video, connect lidar, detect the beach ball, etc.), Team Claude took 264 minutes, Team Claude-less didn't complete all five, and Opus 4.7 averaged 12 minutes 7 seconds over its three trials (per Anthropic).

The efficiency gap is just as telling as the speed gap. Lines of code written (per Anthropic):

Participant	Lines of code
Team Claude	10,309
Team Claude-less	1,136
Opus 4.7	1,045

So the model was at least as successful as the human teams while writing roughly a tenth as much code as the humans who were also using Claude (1,045 lines versus 10,309, per Anthropic). Anthropic also notes that much of the model's code worked on the first try and that it showed high reliability, with little within-task variance across the three trials. It quickly picked the best approach to interfacing with the sensors, where humans dithered among options (per Anthropic).

Figure: horizontal bar chart of time to complete the four shared tasks (Team Claude-less 361 min, Team Claude 181 min, Opus 4.7 autonomous 9 min 35 sec), with a second panel comparing lines of code (Team Claude 10,309 vs. Opus 4.7 1,045).

The three-stage pattern

Anthropic situates this in a recurring dynamic they say they've also observed in cybersecurity (per Anthropic):

Models are helpful to humans.
Humans are helpful to models.
Models are largely able to do the task themselves.

Phase Two, in their telling, is stage three arriving at the intersection of AI and the physical world. Importantly, they stress this didn't come from a targeted robotics push — it "emerged from much more general scaling" (per Anthropic). The same general-purpose capability gains that drive agentic coding apparently spilled over into "program this robot."

Figure: three-stage timeline (Stage 1 "Models help humans", Stage 2 "Humans help models", Stage 3 "Models do it themselves") on two parallel tracks, "Cybersecurity" and "Physical world / robotics", with the physical-world track now reaching Stage 3.

Where it broke: closed-loop control

Now the other half. The model could connect every sensor and write the detection code — and then failed at the actual fetching (per Anthropic). The hard part is precise closed-loop physical control: continuously perceiving the error between the ball's position and the goal, and adjusting motor inputs in real time. Humans, with practice, could pilot the robodog to nudge the ball home. Claude could position the robot behind the ball, but its movements were "poorly controlled and unsuccessful" — comparable to untrained human participants at the same stage (per Anthropic).
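
To make "closed-loop" concrete, here is a minimal sketch of the perceive-error, adjust, repeat cycle as a one-dimensional proportional controller. Everything in it is an illustrative assumption: the gain, speed limit, and time step are arbitrary, the "robot" is an idealized point that moves exactly as commanded, and none of it reflects the robodog's SDK or the code written in the experiment.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the range [low, high]."""
    return max(low, min(high, value))


def run_p_controller(
    start: float,
    target: float,
    kp: float = 0.8,          # proportional gain (arbitrary, for illustration)
    max_speed: float = 0.5,   # actuator limit in units per second (assumed)
    dt: float = 0.1,          # control period in seconds (assumed)
    tolerance: float = 0.01,  # stop once the error is this small
    max_steps: int = 500,
) -> tuple:
    """Drive a 1-D position toward target; return (final_position, steps)."""
    position = start
    for step in range(max_steps):
        # Perceive: measure how far we are from the goal.
        error = target - position
        if abs(error) < tolerance:
            return position, step
        # Decide: command a speed proportional to the error, within limits.
        velocity = clamp(kp * error, -max_speed, max_speed)
        # Act: idealized motion with no latency, slip, or sensor noise.
        position += velocity * dt
    return position, max_steps


if __name__ == "__main__":
    final_position, steps = run_p_controller(start=0.0, target=2.0)
    print(f"stopped at {final_position:.3f} after {steps} steps")
```

In this toy the loop converges easily because the plant is perfect. The real task adds delayed and noisy perception, legs that do not move exactly as commanded, and a ball that rolls away when touched, which is plausibly why the final step is so much harder than the sketch suggests.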

This is a clean illustration of a distinction that often gets blurred in "AI robotics" hype. There are two very different problems:

The software-integration problem: discover the SDK, connect sensors, parse lidar, run object detection, glue it into a control loop. This is essentially agentic coding against unfamiliar hardware APIs — and Opus 4.7 ate it alive.
The closed-loop control problem: real-time perception-action with tight feedback and physical dynamics. This is where it stalled.
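
As a rough sketch of where the two problems meet: a detector turns a camera frame into a number, and that number becomes the error signal a control loop has to chase. The example below uses a deliberately naive color threshold on a tiny in-memory image. The color rule, the function names, and the image format are all hypothetical; the source does not say which detection method the teams or the model used.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255
Image = List[List[Pixel]]     # rows of pixels


def is_ball_color(pixel: Pixel) -> bool:
    """Hypothetical rule: strongly red pixels count as 'ball'."""
    r, g, b = pixel
    return r > 180 and g < 100 and b < 100


def find_ball_centroid(image: Image) -> Optional[Tuple[float, float]]:
    """Return the (row, col) centroid of ball-colored pixels, or None."""
    row_sum = col_sum = count = 0
    for row_idx, row in enumerate(image):
        for col_idx, pixel in enumerate(row):
            if is_ball_color(pixel):
                row_sum += row_idx
                col_sum += col_idx
                count += 1
    if count == 0:
        return None
    return row_sum / count, col_sum / count


def horizontal_offset(image: Image, centroid: Tuple[float, float]) -> float:
    """Signed offset of the ball from image center in [-1, 1]; negative is left."""
    width = len(image[0])
    if width < 2:
        return 0.0  # a one-pixel-wide image has no left or right
    center = (width - 1) / 2
    return (centroid[1] - center) / center


if __name__ == "__main__":
    black, red = (0, 0, 0), (255, 0, 0)
    frame = [[black] * 5 for _ in range(5)]
    for r, c in [(1, 3), (1, 4), (2, 3), (2, 4)]:
        frame[r][c] = red
    centroid = find_ball_centroid(frame)
    print(centroid, horizontal_offset(frame, centroid))  # (1.5, 3.5) 0.75
```

Writing something like this is the integration half. Acting on the offset smoothly, many times per second, on a moving robot is the control half.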

A useful piece of nuance: the model wasn't perfect even on the software side. It defaulted to an outdated object-detection algorithm, then worked around it to reach an effective solution — which likely made one beach-ball detection trial run longer (per Anthropic). So "first-try success" is the trend, not an absolute.

And there's a forward-looking signal: one researcher with more robotics experience did successfully program autonomous fetching, and Anthropic believes current Claude could likely do the same with more time and scaffolding. The open question they pose is whether models can do this final task with the same speed and reliability they showed on everything else (per Anthropic).

Why this matters for shipping agents

You're probably not deploying a robodog. But the asymmetry Project Fetch exposes is broadly useful for calibrating where agents create value today:

Integration glue is collapsing in cost. The unglamorous work of "connect to this unfamiliar API/SDK/sensor and wire it into a working loop" is exactly what agentic coding is now extremely good at — even against hardware the model has never seen. If your roadmap has a backlog of integration work, that's the part most exposed to automation.
Tight feedback loops with real-world dynamics remain hard. Where success depends on continuous, precise, real-time correction against a noisy physical (or messy real-world) signal, models still lag. Scope your agent ambitions accordingly: let it write the integration, but don't assume it nails the control loop.
Capability arrives sideways. Anthropic's point that this emerged from general scaling, not a robotics program, is the strategically important one. Capabilities you didn't plan for can show up in adjacent domains. The pattern — help humans, then get helped by humans, then do it alone — is a useful template for predicting which of your workflows is next.

Anthropic is careful not to overclaim: this "does not mean LLMs have solved robotics" (per Anthropic). The fetching task itself remains open, and progress in adjacent domains is worth watching closely.

The robodog still can't fetch. But the speed at which the software around it got built should reset your priors on what "the physical world is hard for AI" actually means.

Sources & further reading

Anthropic Frontier Red Team, "Project Fetch: Phase two": https://www.anthropic.com/research/project-fetch-phase-two
