Manager as compiler

What does it actually mean to manage when half your team's output is generated by something that can't be held accountable?

Mad Scientist 11 Jun 2026 12 min read

There is a specific failure mode that shows up about six months into a team adopting AI seriously. Output goes up. Throughput, measured naively, looks great. And yet the manager is more tired than before. Review queues get longer. The "are we actually shipping value?" feeling gets worse.

This is a control-systems problem, not a tooling problem, and most management advice about AI misses it completely because it treats AI as a productivity multiplier rather than as a new source of unverified output flooding into a pipeline that was never designed to filter it.

I want to work through this carefully, using a concrete thought experiment as a lens: an open-source "work-quality linter" called WorkSpec. The product matters less than what building it forces you to think clearly about:

What does it actually mean to manage when half your team's output is generated by something that can't be held accountable?

The data is unambiguous, and it disagrees with the hype

Start with the thing nobody wants to say out loud. AI is generating enormous volumes of polished, plausible, low-substance work, and that work is landing on humans.

The BetterUp Labs and Stanford research that coined "workslop" (AI-generated content that masquerades as good work but lacks the substance to meaningfully advance a given task) found that 41% of workers have encountered such output, costing nearly two hours of rework per instance and creating downstream productivity, trust, and collaboration issues. The mechanism is the cruel part: it shifts the cognitive burden from the creator to the recipient, who now has to edit the generic, low-quality work.

Note who the recipient often is. The HBR analysis found the phenomenon occurs mostly between peers at 40 percent, but workslop is also sent to managers by direct reports 18 percent of the time, and 16 percent of the time it flows down the ladder from managers to their teams. Managers are simultaneously a dumping ground *and* a source.

And the cost compounds. By early 2026, reporting put it at 66% of workers spending six or more hours each week cleaning up AI mistakes, with workers' confidence in AI utility dropping 18% in 2025 even as adoption climbed. For a 10,000-person company, the BetterUp/Stanford team estimated a roughly $186 invisible tax per worker per month, about a $9 million annual hit to productivity.

One finding should reframe the whole conversation. Gallup data shows AI use at work doubled since 2023 from 21% to 40%, yet 95% of organizations don't see a measurable return on their investment, per a recent MIT Media Lab report. The technology got dramatically better. The returns didn't follow. That gap is the management problem.
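
The $9 million figure quoted above can be reproduced from numbers already in this post, under one assumption that is mine rather than something the post states: the $186 monthly tax is incurred only by the 41% of workers who report encountering workslop. A back-of-envelope sketch, with hypothetical function and parameter names:

```python
# Back-of-envelope check of the "invisible tax" estimate.
# Assumption (not stated in the text): the monthly cost applies only to the
# share of workers who report encountering workslop.

def annual_workslop_cost(headcount: int, prevalence: float, monthly_cost_per_affected: float) -> float:
    """Return the estimated yearly cost in dollars."""
    affected_workers = headcount * prevalence
    return affected_workers * monthly_cost_per_affected * 12

estimate = annual_workslop_cost(headcount=10_000, prevalence=0.41, monthly_cost_per_affected=186)
print(f"${estimate:,.0f} per year")
```

Under that assumption the result lands a little above $9 million. Applying $186 to all 10,000 workers would give a much larger number, which is why the prevalence assumption matters.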

Microsoft's 2026 Work Trend Index gives it a name, the Transformation Paradox, describing organizations rapidly adopting AI technologies while simultaneously struggling to redesign operational structures around them. They found only one in four AI users believe organizational leadership is consistently aligned on AI strategy, while 65% fear falling behind professionally if they fail to adapt. Deloitte's 2026 Global Human Capital Trends echoes it bluntly: only 6% of leaders say they're making real progress designing how humans and AI should work together.

So the situation is: cheap output, expensive review, accountability vacuum, and almost nobody has redesigned the workflow around it. That's the terrain.

Reframe the job: managers became reviewers, and review doesn't scale

If you're an engineer, you already know this shape. It is exactly what happens when a codebase has no CI, no linter, no type checker, and no PR template, except that the merge gate is one exhausted senior reviewer reading everything by hand.

The old management loop was assign tasks, review outputs. That loop assumed output was expensive to produce, which meant it was implicitly pre-filtered: a human had to care enough to write the memo, and the friction of writing forced thought.

The San Francisco Standard piece on 2025 made this point well: before AI, workers had to struggle through writing a memo or strategy document, forcing them to think about the problem at hand; if they didn't know something they'd do research or talk to a colleague. AI removes that productive friction. The thinking step becomes optional.

When production gets cheap and review stays human-expensive, review becomes the bottleneck. And the manager's instinct, to review harder and review everything, is precisely the thing that doesn't scale. You cannot out-review a generative model. It produces faster than you can read.

Microsoft frames the shift as human agency moving toward intent-setting, judgement, orchestration and accountability, with their CMO describing a pipeline from Author to Editor to Director to Orchestrator. That's directionally right but dangerously abstract. "Orchestrate more" is not an action. Let's make it concrete.

The thought experiment: what if "good work" were a compiled artifact?

Imagine a tool, call it WorkSpec, that does for knowledge work what ESLint plus a CI gate does for code. You encode "what good work looks like" as a machine-checkable spec, and work gets linted against it before it reaches you.
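
To make "machine-checkable spec" less abstract, here is a minimal sketch of what such a lint pass could look like. WorkSpec is a thought experiment, so every name here (`Rule`, `MEMO_SPEC`, `lint`) and every rule is hypothetical, and the checks are deliberately naive text matches rather than anything a real tool would ship.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    """One machine-checkable expectation about a piece of work."""
    name: str
    message: str
    passes: Callable[[str], bool]

# Hypothetical starter rules for a decision memo; real ones come from your own standards.
MEMO_SPEC = [
    Rule("has-owner", "No named owner.",
         lambda text: re.search(r"^Owner:\s*\S", text, re.MULTILINE) is not None),
    Rule("has-date", "No due date (expected YYYY-MM-DD).",
         lambda text: re.search(r"\b\d{4}-\d{2}-\d{2}\b", text) is not None),
    Rule("has-tradeoff", "Recommendation without a stated tradeoff.",
         lambda text: "tradeoff" in text.lower()),
]

def lint(text: str, spec: list[Rule]) -> list[str]:
    """Return one finding per failed rule; an empty list means the work passes."""
    return [f"{rule.name}: {rule.message}" for rule in spec if not rule.passes(text)]

draft = """Recommendation: migrate billing to the new provider.
Owner: Priya
"""
for finding in lint(draft, MEMO_SPEC):
    print(finding)
```

The draft above names an owner but has no date and no tradeoff, so two findings come back. The point of the shape is that each rule is small, named, and cheap to run before a human reads anything.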

Principle 1: Move your judgment upstream, from review to specification

The single highest-leverage change is to stop spending your judgment at the output gate and start spending it at the input gate.

Every time you review a piece of workslop and think "this is missing the tradeoff," you've discovered a rule. The reviewer's tragedy is that they re-derive the same rules over and over, one document at a time, and the knowledge dies in a Slack DM. The Neuron's analysis of the Microsoft report nailed the institutional cost: every successful prompt, every agent workflow, every "this worked / this didn't" should be captured somewhere a teammate can find. If your only documentation lives in DMs, your firm is leaking compounding value daily.

Concrete action: Keep a running "rejection log" for two weeks. Every time you send work back, write down the reason in one line. After two weeks you will have 8–15 recurring failure patterns. That list is your spec. Most of it will be structural rather than strategic: missing owner, missing date, claim without source, recommendation without a tradeoff. Those are exactly the things that can be checked cheaply and consistently, which means they should never have been consuming your attention in the first place.
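
Turning the log into a draft spec is mostly counting. A small sketch, assuming the log is nothing more than a list of one-line reasons (the entries and the `recurring_patterns` helper are made up for illustration):

```python
from collections import Counter

# Hypothetical rejection log: one short reason per piece of work sent back.
rejection_log = [
    "missing owner",
    "claim without source",
    "missing owner",
    "recommendation without tradeoff",
    "missing date",
    "claim without source",
    "missing owner",
]

def recurring_patterns(log: list[str], min_count: int = 2) -> list[tuple[str, int]]:
    """Return reasons seen at least min_count times, most frequent first."""
    counts = Counter(reason.strip().lower() for reason in log)
    return [(reason, n) for reason, n in counts.most_common() if n >= min_count]

for reason, n in recurring_patterns(rejection_log):
    print(f"{n}x {reason}")
```

Anything that shows up more than once is a candidate rule; one-off reasons stay in the log until they recur.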

This is the manager analog of fixing one module before refactoring the whole app: you don't redesign your entire operating model, you isolate the single most expensive recurring defect and write a check for it.

Principle 2: Don't write rubrics from scratch, derive them from examples

A real objection kills most "just write standards" advice: managers won't sit down and author perfect rubrics. They don't have time and they don't think in spec form.

The fix is to invert it. Don't author. Extract.

Take three pieces of work you considered excellent and three you considered weak, and the difference between them is your standard. If you want AI to help (and this is genuinely a good use of it), feed it the six examples and ask it to articulate what separates the good from the bad. You'll edit the output, but editing beats authoring, and that is the difference between a five-minute task and a task you'll never do.
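
The extraction step can be pictured as a set comparison. The sketch below is a deliberately crude stand-in, not the AI-assisted step described above: it assumes you have already hand-tagged which structural features each example contains, and both the tags and the `candidate_rubric` helper are hypothetical.

```python
# Hypothetical hand-tagged examples: the structural features each artifact contains.
good_examples = [
    {"owner", "date", "tradeoff", "source", "summary"},
    {"owner", "date", "tradeoff", "source"},
    {"owner", "tradeoff", "source", "summary"},
]
weak_examples = [
    {"summary", "owner"},
    {"summary"},
    {"summary", "date"},
]

def candidate_rubric(good: list[set[str]], weak: list[set[str]]) -> list[str]:
    """Features shared by every good example and missing from most weak ones."""
    in_all_good = set.intersection(*good)
    rubric = []
    for feature in sorted(in_all_good):
        weak_hits = sum(feature in example for example in weak)
        if weak_hits <= len(weak) / 2:
            rubric.append(feature)
    return rubric

print(candidate_rubric(good_examples, weak_examples))
```

With these tags, the features that survive are the ones every good example has and the weak ones mostly lack. A summary appears everywhere, so it separates nothing and drops out, which is the kind of thing you only notice when you compare examples instead of writing rules from memory.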

This also sidesteps the trap the workslop research keeps surfacing. The Guardian-sourced analysis found the dividing line cleanly: organizations that outperform on AI ROI typically restrict AI to tasks where the human remains the author and the AI is an accelerator; those that underperform ask AI to be the author while humans become editors, which is where workslop breeds. Deriving a rubric from your own judged examples keeps you as the author of the standard even when AI helps express it.

Principle 3: Make the standard a pre-flight check employees run on themselves

Most quality-gate ideas die at the same spot: they read as surveillance. If a standard feels like the manager installing a tripwire, the team routes around it and morale craters.

The reframe that survives contact with reality is making the check something people run on their own work, before anyone sees it. Career armor, not a trap. "Pass this before sending work upward so you don't look sloppy" is a value proposition to the employee. It's the same reason developers run the linter locally before pushing: not because someone forces them, but because failing in front of the team is worse.

The data strongly supports leading by modeling rather than mandating. Microsoft's study of 1,800 workers found that when managers actively model AI use, employees report a 17-point lift in AI value, a 22-point lift in critical thinking about AI use, and a 30-point lift in trust in agentic AI. And critically, employees were 1.4x more likely to be high-frequency users of agentic AI when managers created psychological safety around experimentation. The Neuron's summary of the cheapest available intervention: get your managers to use AI in front of their teams and ship the visible artifacts of that work. Cheaper than tooling, faster than restructuring, more impactful than training programs.

Concrete action: Publish your spec, then visibly run your *own* work through it first and share the result, including the parts that failed. The standard becomes a shared tool the team uses, not a weapon you hold.

Principle 4: For AI agents, the spec isn't optional; it's the only control surface you have

Everything above applies doubly to the genuinely new part of the job: you are now also managing non-human teammates, and you are accountable for their output.

This is no longer hypothetical. Microsoft's 2025 index reported 28% of managers were considering hiring AI workforce managers to lead hybrid teams of people and agents, while 32% planned to hire AI agent specialists over the next 12 to 18 months. Mercer's 2026 Global Talent Trends, based on nearly 12,000 executives, found 82% of C-suite leaders believe the future of HR lies in managing human talent and digital agents side by side. Korn Ferry observed that AI agents are becoming real teammates with their own identities, access permissions, and responsibilities, and that most leaders have no idea how to manage mixed human-AI workforces. It is, in their words, uncharted territory.

The structural difference changes the management approach entirely. A human direct report has internalized judgment, social accountability, and a sense of when something is off. An agent has none of that. You cannot delegate to an agent by gesturing vaguely the way you can with a trusted senior person ("figure out why activation dropped and propose fixes"). The agent will produce something, polished and confident, and the gap between that and what you meant is your problem.

So the delegation brief, what the thought experiment frames as an `ai-delegation-brief` rubric, stops being a nicety and becomes the actual interface. A usable one specifies goal, context, inputs, tools allowed, tools forbidden, output format, the explicit quality bar, the human review checkpoint, and the failure conditions. That last cluster matters most: with a human you can rely on them to escalate when confused; with an agent you have to encode the conditions under which it should stop and hand back.
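
One way to picture the brief as an interface is a record whose required parts can be checked before the task is handed off. This is a sketch only: the `DelegationBrief` class, its field names, and the sample values are illustrative assumptions, loosely following the list above.

```python
from dataclasses import dataclass, fields

@dataclass
class DelegationBrief:
    """Hypothetical shape for an ai-delegation-brief; field names are illustrative."""
    goal: str
    context: str
    inputs: list[str]
    tools_allowed: list[str]
    tools_forbidden: list[str]
    output_format: str
    quality_bar: str
    review_checkpoint: str
    failure_conditions: list[str]

    def missing_fields(self) -> list[str]:
        """Names of fields left empty; a brief is usable only when this is empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

brief = DelegationBrief(
    goal="Explain why activation dropped and propose fixes",
    context="Weekly activation fell after the onboarding redesign",
    inputs=["activation dashboard export"],
    tools_allowed=["read-only analytics queries"],
    tools_forbidden=["production writes"],
    output_format="One-page memo with ranked hypotheses",
    quality_bar="Every claim cites a query or chart",
    review_checkpoint="",
    failure_conditions=[],
)
print(brief.missing_fields())
```

The sample brief looks complete at a glance but leaves the review checkpoint and failure conditions empty, which are exactly the parts a hand-wave tends to skip.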

Notice the load-bearing insight: the skill of writing a good agent delegation brief is the same skill as writing a good rubric for humans, made mandatory. Harvard's David Deming and colleagues found a strong link between skill at coordinating AI agents and skill at leading human teams. The same leadership moves matter: asking clarifying questions, setting clear expectations, and learning through trial and error. The manager who got good at making standards explicit for people is exactly the manager who can direct agents. The manager who managed by vibes and hallway corrections has no transferable surface at all.

There's also a hard governance line worth stating plainly. Emerging 2026 professional practice mandates clear override mechanisms: humans must have the final, unchallengeable authority to correct an AI's output, ensuring the business remains anchored in human accountability and moral reasoning. Your delegation spec is where that override lives in practice. "Human accountable: true" is more than a checkbox. It asserts that when the agent's output is wrong, a named person owns it.

Principle 5: Match the collaboration mode to the task, don't maximize autonomy

A subtle trap is treating "more agent autonomy" as the goal. It isn't. The Neuron's read of the framework is the right caution: not every workflow should become full delegation; some need collaboration, some need exploration. The skill is matching the mode to the outcome, not maximizing agent autonomy.

The decision is structural, and you can reason about it like an engineer reasoning about where to put a verification boundary. Full delegation is appropriate where the output is cheaply and objectively verifiable, where a check can confirm correctness without a human re-deriving the answer. Collaboration is appropriate where verification requires the same effort as production, which is most genuinely strategic work. The fastest way to manufacture workslop is to fully delegate a task whose quality you can only assess by doing it yourself. At that point the agent didn't save you the work; it added a review step.
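
The verification-boundary reasoning can be written down as a rule of thumb. The sketch below assumes you can roughly estimate hours to produce and hours to verify; the `choose_mode` helper and its `cheap_ratio` threshold are illustrative inventions, not figures from any of the research cited here.

```python
from enum import Enum

class Mode(Enum):
    FULL_DELEGATION = "full delegation"
    COLLABORATION = "collaboration"

def choose_mode(produce_hours: float, verify_hours: float, cheap_ratio: float = 0.25) -> Mode:
    """Delegate fully only when checking the output costs a small fraction of producing it.

    cheap_ratio is an illustrative threshold, not a figure from any cited research.
    """
    if produce_hours <= 0:
        raise ValueError("produce_hours must be positive")
    if verify_hours / produce_hours <= cheap_ratio:
        return Mode.FULL_DELEGATION
    return Mode.COLLABORATION

# A format conversion that an automated check can confirm quickly.
print(choose_mode(produce_hours=4, verify_hours=0.5).value)
# A strategy memo whose quality you can only judge by redoing the analysis.
print(choose_mode(produce_hours=4, verify_hours=4).value)
```

The exact threshold matters less than the habit of asking the question before delegating: if verifying costs about as much as producing, the task belongs in collaboration.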

The broader industry lesson reinforces this. Deloitte's 2026 trends warned that enterprises are hitting a wall by trying to automate existing processes (tasks designed by and for human workers) without reimagining how the work should actually be done; true value comes from redesigning operations, not just layering agents onto old workflows. Bolting an agent onto a task that was shaped around a human's tacit judgment just relocates the judgment to your review queue.

Principle 6: Reward the redesign, not just the deliverable

The last principle is about incentives, and it's where the Transformation Paradox actually gets resolved. The reason teams produce workslop and managers drown is that organizations still measure success through current deliverables, familiar workflows, and short-term performance goals, which makes AI-driven reinvention feel risky even when leaders say they want it.

The lever the Neuron identified is specific and small: reward reinvention, not just results. Rewarding employees for redesigning work with AI, even when results are uneven, is the lever that flips the paradox. Pick one team, make the explicit ask, watch what happens.

For a manager, this is tractable. You can't change company-wide comp, but you control what you praise in a standup and what you put in a review. The person who wrote the reusable delegation brief that the whole team now uses created more value than the person who personally shipped one more memo, and your recognition should say so. Build the agent signals into institutional memory: the prompts that worked, the specs that caught real defects, the failure conditions you learned the hard way.

What this actually looks like on Monday

Strip away the thought experiment and the principles reduce to a short, concrete sequence:

1. Run a rejection log for two weeks. One line per piece of work you send back, capturing the reason. This costs you nothing and produces your draft standard.

2. Extract one rubric from examples, not theory. Take three good and three weak artifacts of your most common deliverable type: status update, decision memo, whatever you review most. Write down the 5–8 structural things that separate them. Use AI to draft it from the examples; you edit.

3. Publish it as a self-check, and run it on your own work first, in public. Frame it as "pass this before it reaches me so it lands well," never as a tripwire. Model it before you ask for it.

4. Convert your vaguest recurring delegation into a real brief. Pick the task you most often hand off with a hand-wave. Write the goal, inputs, allowed and forbidden tools, output format, quality bar, review checkpoint, and failure conditions. Use it for both humans and agents. It works for both, which is the point.

5. For each agent task, decide the mode deliberately. Full delegation only where output is cheaply verifiable. Everything else is collaboration with a human author. Never delegate a task whose quality you can only judge by redoing it.

6. Once this quarter, reward the redesign over the deliverable out loud. Name the person who built the reusable thing. That single act of recognition does more to flip your team's behavior than any tool.

The actual shift

The comfortable story about AI and management is that AI handles the grunt work and managers ascend to "higher-level strategic thinking." That's true only for managers who do the unglamorous work of making their judgment explicit. For everyone else, AI doesn't elevate the job. It floods it.

The deeper truth the thought experiment exposes is that management was always a compilation problem; AI just made the cost of skipping the compile step unbearable.

A manager takes ambiguous human intent and turns it into work that humans and now agents can execute, then verifies the result against a standard. When output was expensive, you could get away with leaving the standard in your head and compiling it one document at a time in your own review. When output is nearly free, the standard has to live outside your head: written down, checkable, shared, and applied before work reaches you.

This is not being replaced by AI, and it's not becoming an "AI boss" either. It's the same job it always was, finally forced into the open. The managers who thrive in a mixed team won't be the ones who review the hardest or adopt the most tools. They'll be the ones who can answer, precisely and in writing, a deceptively simple question: what does good work look like here?

*A note on the sources: the workforce data here comes from BetterUp Labs/Stanford and HBR (workslop), Microsoft's 2025 and 2026 Work Trend Index, Deloitte's and Mercer's 2026 trends reports, Gallup, and Korn Ferry. Figures on rework cost and adoption-vs-ROI are self-reported survey estimates and should be read as directional rather than precise, but the direction is consistent across every independent source, which is what makes the pattern worth acting on.*