Issue #50 — The 'Five Agents Per Employee' KPI Is the Wrong Measure

Dear Reader,

A couple of months ago I spoke to an innovation director at a large technology company. As their headline metric for AI innovation they had settled on the number of agents per employee: every team was to build or adopt five agents a head, and the total would climb on a board-level dashboard. The appeal is easy to understand. It’s a clean number that only moves in one direction, and it sounds like a workforce teaching itself new tools. The trouble is that it doesn’t reach the P&L at all, unless you count the bigger bill for tools.

I run into this a lot now, and it’s a natural trap for a young technology that doesn’t yet have proven patterns for deployment. A company adds up how much AI it’s running and presents that as the value the AI has created, when the real value is usually unknown.

The vanity is the vendor’s

Companies didn’t invent the habit of counting agents. They were sold it. Salesforce runs an “Agentic Enterprise Index” built around figures like 119% agent growth in the first half of 2025 and 65% monthly growth in how often employees talk to agents. Nvidia’s Jensen Huang talks up a future of a hundred agents per human. The implied message is that the agent count and the interaction count are themselves the measure of transformation. None of which is surprising, given that more agents running means more revenue for Nvidia, OpenAI, Anthropic and everyone else selling the underlying compute and tokens.

It isn’t only the vendors. McKinsey’s chief executive has taken to quoting the firm’s own agent count as a headline number: 25,000 AI agents alongside 60,000 people, up from 3,000 a year and a half earlier, offered as proof of how far ahead it is. Steve Newman, who runs engineering at the rival firm EY, had the obvious answer: the number of agents doesn’t translate into value. When the firms that sell transformation advice are themselves keeping score by agent headcount, it’s easy to see how the habit spreads. A Sequoia partner, Alfred Lin, put it more bluntly in Forbes: AI adoption is a vanity metric.

Look at how some of those same vendors actually charge for AI, though, and you can see the market starting to force a change. Intercom bills $0.99 for every issue its Fin agent resolves, with no charge per seat or per agent. Sierra takes money only once its agent has closed a case without a human stepping in. HubSpot is moving from per-use pricing to per-resolution. Salesforce has introduced “Agentic Work Units” and bills for work done rather than licences held. Sierra spells out why the old model is broken: when you charge by the seat, a more effective agent means the customer needs fewer seats, so the vendor is working against its own product, or it has to push the per-unit price up, which puts buyers off. It’s worth sitting with that for a second. The companies closest to this technology looked at volume as a basis for billing and walked away from it. So why would anyone deploying AI build their own internal scorecard around it?

Goodhart at machine speed

Goodhart’s law is the thing to keep in mind. Once a measure becomes the target people are chasing, it stops being a reliable measure of anything. Make “five agents per employee” the goal and you will, reliably, end up with five agents per employee. Whether any of them do useful work is a separate question. In practice someone wraps a prompt around a spreadsheet, logs it in the agent registry, and moves on, and the dashboard fills up with green while the way the work actually gets done stays exactly the same.

What’s new is the speed. An agent chases a target far more single-mindedly than a person ever would. Reward it for shorter handle times and it’ll get very good at getting customers off the line, answered or not. Measure a developer team on the share of code that’s AI-written, and what you get is a great deal of code that compiles and that nobody has read. Gartner expects more than 40% of agentic AI projects to be scrapped by the end of 2027, and the headline reason it gives is “unclear business value”. Most of these aren’t technical failures at all. They get cancelled because, when someone finally asks what the agents have done for the business, nobody can draw the line back to a concrete outcome.

The cascade

A measure of AI transformation that actually means something isn’t one figure; it’s a short chain of linked ones, governed by a single rule: anything the bottom-level metrics report has to add up, by arithmetic, to something real at the top.

Begin at the top. Level 0 is the line on the P&L you’re trying to improve, such as cost-to-serve or operating margin. Level 1 is the operational lever that actually moves it; for cost-to-serve, that’s your cost per case multiplied by your volume. Level 2 is the health of the process the automation runs inside: how many cases get handled end to end, how long they take, how many come back. Level 3, at the bottom, is the agent itself, measured against the person it replaces, on net cost per case and on how often it gets the answer right.

Now try to find “five agents per employee” somewhere on that chain. It isn’t even at Level 3. There’s no operation that carries “agents deployed” up to “cost-to-serve”, so the number just sits there, with no bearing on anything the board cares about.

The two gates

Level 3 is where the measurement usually goes soft, because the obvious figure, cost per agent-handled case, makes the agent look far better than it is. Getting it right means asking two questions, both of them against the human the agent took over from.

The first is what it really costs. An agent burning €0.20 of inference next to a €12 human looks about sixty times cheaper, and it almost never is once you finish the sum. The moment a person has to read behind the agent and confirm its work, you’re paying for that time. Every case it gets wrong and a human then has to redo is a €12 case you’ve now paid for twice. The figure worth knowing is the fully loaded cost with verification and rework folded in, and against that figure most of the “ten times cheaper” claims quietly fall apart.
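Here is that sum as a minimal Python sketch. The €0.20 and €12 are the figures above; the review share, review cost and rework rate are assumptions picked purely for illustration, and the function name is hypothetical.

```python
def fully_loaded_cost(inference, review_share, review_cost, rework_rate, human_cost):
    """Expected agent cost per case once human review and rework are counted."""
    return inference + review_share * review_cost + rework_rate * human_cost


HUMAN_COST = 12.00   # fully loaded human cost per case, EUR (from the text)
INFERENCE = 0.20     # agent inference cost per case, EUR (from the text)

# Assumed for illustration only: a person checks 30% of cases at EUR 3 each,
# and 10% of cases come out wrong and are redone by a human at the full EUR 12.
cost = fully_loaded_cost(INFERENCE, review_share=0.30, review_cost=3.00,
                         rework_rate=0.10, human_cost=HUMAN_COST)

print(f"inference only: {HUMAN_COST / INFERENCE:.0f}x cheaper")
print(f"fully loaded:   EUR {cost:.2f} per case, {HUMAN_COST / cost:.1f}x cheaper")
```

With these assumed inputs the agent comes out at about €2.30 a case, roughly five times cheaper than the human rather than sixty. Swap in your own review and rework figures; the structure of the sum is the point, not the numbers.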

The second is whether the quality is good enough, and good enough is set by the step rather than in the abstract. The agent doesn’t need to be cleverer than a human overall; it needs to be reliable enough given what a mistake costs in this particular place. A five-percent error rate is invisible on expense coding and a reportable breach on a regulated complaint, which is exactly why the quality bar can only be set step by step, against what getting it wrong actually costs there. That’s the error-cost asymmetry from the last issue, and it’s what really decides whether an agent should be anywhere near a given step.
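One rough way to express that gate, sketched in Python: multiply the error rate by what an error costs at that step, and compare the result with what the agent saves per case. Every figure in the example is hypothetical, as are the function names; the only thing taken from the text is the five-percent error rate.

```python
def expected_error_cost(error_rate, cost_per_error):
    """Expected cost of mistakes per case handled at this step."""
    return error_rate * cost_per_error


def passes_quality_gate(error_rate, cost_per_error, saving_per_case):
    """True if the expected cost of errors stays below the saving per case."""
    return expected_error_cost(error_rate, cost_per_error) < saving_per_case


ERROR_RATE = 0.05  # the five-percent error rate from the text

# Hypothetical steps: error costs and savings are assumptions, not data.
steps = {
    "expense coding":      {"cost_per_error": 2.00,   "saving_per_case": 1.00},
    "regulated complaint": {"cost_per_error": 500.00, "saving_per_case": 7.00},
}

for name, step in steps.items():
    verdict = "pass" if passes_quality_gate(ERROR_RATE, **step) else "fail"
    print(f"{name}: {verdict}")
```

The same error rate passes on one step and fails on the other, because only the cost of a mistake changed. A real gate would also need a hard ceiling where an error is a regulatory matter rather than a cost.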

A flat scorecard quietly hides three things, and the two gates bring them out. One: cutting the cost of a case and letting each person get through more of them are not the same achievement, and they land on different lines, because freeing people up only becomes money if you redeploy the time or take out the headcount, and otherwise it just evaporates. Two: when the saving shows up, separate what the AI did from what the redesign did. A lot of the gain can be the redesign itself (fewer steps, fewer handoffs, less rework), which a leaner human-run process might have captured with no inference bill at all. Three: the consequences that matter are long-term. The error rate is visible on day one, while the cancelled contracts only surface months later, long after the pilot got signed off as a win.

Complaint handling, all the way down

Take a complaints team working through 200,000 cases a year, each one costing a fully loaded €12. That’s €2.4m sitting at Level 0 as cost-to-serve. Now drop an agent into the middle of it.

The vendor will lead with a resolution rate, and the first thing to understand about that figure is how wildly it moves from one deployment to the next. Intercom’s Fin runs at roughly 66–67% across its 8,000 customers, Sierra’s strongest deployments reach 90%, and plenty of agents through 2024 and 2025 never climbed out of the low twenties. So “we’ve deployed an agent” carries almost no information. The resolution rate is doing all the work, and depending on where it lands you’ve either made a serious saving or an expensive mess.

Walk it down the chain. Say the agent takes a first pass at every complaint at €0.20 a time, clears 60% of them end to end, and routes the other 40% to a human before it does anything irreversible, at the full €12. On the slide the agent costs €0.20 a case, which makes it look about sixty times cheaper than the human. The honest net is €5.00 a case (€0.20 of inference on every case, plus 40% of €12), and cost-to-serve drops from €2.4m to roughly €1.0m. Still a real saving, but the agent is about 2.4 times cheaper, not sixty.

Same agent, same 200,000 cases. What turns “sixty times cheaper” into “a bit over twice” is which costs you let onto the page: the inference bill on its own, or the inference bill plus the human cost of everything the agent couldn’t finish.
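The arithmetic behind those figures, as a short Python sketch. It assumes the €0.20 of inference is paid on every case, including the ones the agent hands off, which is what makes the net come to €5.00; the function name is hypothetical, and 22% is only a stand-in for “the low twenties”.

```python
CASES = 200_000        # complaints per year (from the text)
HUMAN_COST = 12.00     # fully loaded human cost per case, EUR (from the text)
INFERENCE = 0.20       # agent inference cost per case, EUR (from the text)


def net_cost_per_case(resolution_rate, inference=INFERENCE, human_cost=HUMAN_COST):
    """Blended cost per case: inference on every case (an assumption),
    plus the full human cost on the share the agent escalates."""
    return inference + (1 - resolution_rate) * human_cost


baseline_total = CASES * HUMAN_COST
net = net_cost_per_case(0.60)
net_total = CASES * net

print(f"slide view:    {HUMAN_COST / INFERENCE:.0f}x cheaper")
print(f"net per case:  EUR {net:.2f}, {HUMAN_COST / net:.1f}x cheaper")
print(f"cost-to-serve: EUR {baseline_total / 1e6:.1f}m -> EUR {net_total / 1e6:.1f}m")

# Same model at other resolution rates. 0.22 is a stand-in for the
# "low twenties"; 0.66 and 0.90 echo the vendor figures quoted earlier.
for rate in (0.22, 0.66, 0.90):
    print(f"resolution {rate:.0%}: EUR {net_cost_per_case(rate):.2f} per case")
```

The sweep at the end makes the earlier point concrete: the agent and its inference price stay fixed, and the net cost per case moves almost entirely with the resolution rate.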

The quality question decides whether you should be running the agent at all. A complaint closed wrongly isn’t a free miss; in a regulated business it can be a reportable breach, and it’s usually a customer you don’t keep. If a 66% resolution rate means roughly a third of complaints are being quietly closed wrong, the cost of those mistakes can run straight past the wage bill you were trying to cut. The figure to track, then, is the share of complaints closed correctly, held to an error rate the business can actually live with.
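To put a rough number on “an error rate the business can live with”, one option is to work out the share of wrongly closed complaints at which the cost of mistakes cancels the saving. The sketch below does that in Python, using the €12 and €5.00 from the worked example; the cost of a single wrong closure is a pure assumption, and the function name is hypothetical.

```python
HUMAN_COST = 12.00   # EUR per case handled by a person (from the text)
NET_COST = 5.00      # EUR per case with the agent in the loop (from the text)


def break_even_wrong_share(saving_per_case, cost_per_wrong_closure):
    """Share of all cases closed wrongly at which error costs cancel the saving."""
    return min(1.0, saving_per_case / cost_per_wrong_closure)


saving = HUMAN_COST - NET_COST  # EUR 7.00 per case

# Hypothetical cost of one wrongly closed complaint
# (lost customer, remediation, possible penalty).
for cost in (50, 150, 500):
    share = break_even_wrong_share(saving, cost)
    print(f"EUR {cost} per wrong closure -> break-even at {share:.1%} of cases")
```

On these assumed figures, a wrong closure that costs €500 wipes out the whole saving once about 1.4% of complaints are closed wrongly. That is a financial break-even only; where a wrong closure is a reportable breach, the tolerable rate may be set by the regulator rather than by the sum.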

Briefing

At CamundaCon in Amsterdam on 20 May, Camunda launched ProcessOS, an “agentic operating system” that claims to discover, re-engineer and continuously optimise enterprise processes rather than just orchestrate the ones already in place. It was pitched to 1,200 enterprise leaders with a fairly blunt framing: becoming AI-native means re-engineering the underlying processes first, not bolting agents onto the old ones. Whatever the product turns out to be in practice, it’s a notable shift in stance: a major orchestration vendor is now selling the re-engineering as the headline and the agents as the consequence.

Google shipped Gemini 3.5 Flash at I/O on 19–20 May, with a Pro version due next month. The interesting part is the price. The new Flash runs at roughly three times the per-token cost of the model it replaces, and because it also gets through far more tokens per task, independent testing by Artificial Analysis put the cost of the same workload at around five times higher, enough that on agentic jobs it can come out dearer than the previous generation’s Pro. Google is following OpenAI and Anthropic here: GPT-5.5 and Claude Opus 4.7 both landed more expensive than what they replaced. The comfortable assumption that inference only ever gets cheaper isn’t holding at the frontier, which is worth remembering before anchoring a business case to today’s price per task.

Questions for your leadership team

Does our transformation metric count agents and deployments, or does it track a real movement in a named P&L line? When someone brings you an agent count, ask them to carry it upward, step by step, to a cost or a revenue figure. If they can’t make that connection, it isn’t telling you anything about value.
For each automation, do we know the net cost once oversight and rework are counted in, rather than just the inference bill? Have we tested that the agent comes out cheaper after a person’s verification time?
Has “AI training completed” (the Article 4 AI Act obligation, in force since February 2025) ended up on a management slide dressed as a transformation result? Completing mandatory training shows the organisation is compliant. It says nothing about whether any value has been created.
When we compare the agent to a human, is it a human running the redesigned process or the old one? And is the quality bar set against what an error actually costs at that specific step?

Summary

Whatever you choose to measure on the first deployment becomes, in effect, the goal the organisation will try to hit. Issue 46 made this point about which use case you start with; it matters at least as much for the number you judge the work by. Reward the number of agents built and you’ll get an organisation that’s very good at producing agents and barely interested in whether they earn their keep.

For any KPI meant to measure the effect of an AI deployment, there’s one question worth asking: can we show a link between its value and a measurable P&L impact? “Five agents per employee” fails that test.

Stay balanced, Krzysztof Goworek

Krzysztof Goworek is founder of Quintant — AI advisory that gets enterprises from experiment to production value.