By Bill Sourour

You're Measuring Your AI Wrong

The chatbot hit 91% accuracy on support ticket classification. Average response time dropped from four hours to twelve minutes in the test group. The five agents who volunteered rated it 4.3 out of 5.

Six months into the next phase, customer churn rate hasn't moved. Same renewal numbers. Same escalation volume. The chatbot works. The retention dashboard didn't notice.

The vanity metrics trap

Most AI pilots report metrics that describe the model: accuracy, latency, speed improvement over a manual baseline. What they should report is whether the pilot moved a business number that matters.

McKinsey's 2025 global survey found that 88% of organizations use AI in at least one function. Only 39% report any impact on EBIT. For most of that 39%, the impact is less than 5% of total EBIT.

The gap persists because AI teams report what they can control. Model accuracy is within the team's control. Business outcomes depend on adoption, workflow change, and integration with existing systems, none of which the AI team owns. So the quarterly review gets a slide about accuracy, and finance gets a question about ROI that nobody can answer.

The perception vacuum

When measurement doesn't exist, perception fills the gap.

A BCG study published in Harvard Business Review found that 76% of executives believe their employees feel enthusiastic about AI adoption. Only 31% of individual contributors expressed enthusiasm. Larridin's 2025 survey of 350 senior finance and IT leaders found that 55% are unsure whether their AI investments are paying off.

Without hard numbers, the quarterly review becomes a storytelling exercise. Each team reports what feels true. The demo looked good. The users seemed happy. The numbers should improve once adoption picks up. This is how a pilot that isn't working gets another quarter of budget: nobody measured its absence of impact, so nobody can argue against its presence.

The pendulum

No measurement creates two failure modes, and most companies cycle through both.

First: everything gets funded. Without data to distinguish a working pilot from a dead one, every team gets "just one more quarter." Pilots accumulate. Budgets grow.

Then the backlash. S&P Global found that 42% of companies scrapped most of their AI initiatives in 2025, up from 17% the year before. The average organization abandoned 46% of its AI proofs of concept before production. The CFO ran out of patience before the data came in.

"Fund everything" is faith. "Kill everything" is frustration. Both are what happens when decisions get made without data.

What the 6% measure

McKinsey defines AI high performers as organizations that attribute 5% or more of EBIT to AI and report significant value. About 6% of respondents qualify.

The single strongest predictor of EBIT impact, out of 25 attributes tested, was workflow redesign: 55% of high performers fundamentally reworked processes when deploying AI, nearly three times the rate of everyone else.

This tells you where the metric should point. The high performers measure the process the AI is supposed to improve. If a document-summarization tool is meant to speed up claims processing, the metric is average days from claim filed to claim resolved. If an automated risk-scoring model is meant to improve underwriting accuracy, the metric is loss ratio over the next renewal cycle. If a customer-routing agent is meant to reduce call center costs, the metric is cost per resolution.
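
To make the first of those concrete, here's a rough sketch in Python of the claims cycle-time metric. The record shape and field names (filed_at, resolved_at) are hypothetical; pull the real ones from whatever system already tracks claims.

```python
from datetime import date
from statistics import mean

# Hypothetical claim records pulled from the existing system of record.
claims = [
    {"id": "C-101", "filed_at": date(2025, 3, 1), "resolved_at": date(2025, 3, 18)},
    {"id": "C-102", "filed_at": date(2025, 3, 4), "resolved_at": date(2025, 3, 9)},
    {"id": "C-103", "filed_at": date(2025, 3, 7), "resolved_at": None},  # still open
]

def avg_days_to_resolution(claims):
    """Average days from claim filed to claim resolved; open claims don't count."""
    durations = [
        (c["resolved_at"] - c["filed_at"]).days
        for c in claims
        if c["resolved_at"] is not None
    ]
    return mean(durations) if durations else None

print(avg_days_to_resolution(claims))  # 11 days with the sample data above
```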

The 6% track business outcomes. Everything else follows from that.

The baseline nobody sets

The simplest reason companies can't prove AI value: they never recorded where the number started.

Larridin found that 81% of enterprises cite ROI measurement as their top governance challenge. The Cisco AI Readiness Index found that only 32% have a defined process to measure AI ROI at all. Most teams launch a pilot, run it for a few months, then try to argue retroactively that things improved. Without a baseline, "improved" is just a feeling.

A baseline takes a week. Pull the current state of the business metric the pilot is supposed to move. Log it. Agree on a threshold: if this number doesn't improve by X% in Y months, the pilot dies. This turns a pile of experiments into a portfolio with clear entry and exit criteria. It also removes politics from the kill decision, because the threshold was set before anyone got attached.
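
Here's a rough sketch of what that record can look like, in Python. The numbers and field names are hypothetical; the point is that the metric, baseline, threshold, and deadline get written down before launch.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PilotBaseline:
    """The agreement made before the pilot launches."""
    pilot: str
    business_metric: str
    baseline_value: float        # current state, measured before the AI touches anything
    required_change_pct: float   # negative means the number has to go down
    deadline: date

# Hypothetical example: the support chatbot gets judged on cost per resolution,
# not on classification accuracy.
chatbot = PilotBaseline(
    pilot="support-chatbot",
    business_metric="cost per resolution (USD)",
    baseline_value=14.20,
    required_change_pct=-10.0,   # "if this doesn't drop 10% in six months, the pilot dies"
    deadline=date(2026, 4, 1),
)
```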

At one organization, setting baselines before launch cut the average pilot review from a 45-minute debate to a 10-minute check. The threshold did the arguing. Three pilots got killed in the first cycle, and the engineers were reassigned to the two that were working. The CTO said it was the first AI review meeting that ended with a decision instead of a request for more time.

A measurement discipline

Four steps. None require a new platform or a data science team.

Before the pilot launches, name the business metric it's supposed to move: time to resolution, cost per transaction, retention rate, revenue per customer.

Baseline it. Measure the current state before the AI touches anything.

Set a threshold and a timeframe. "If cost per resolution doesn't drop 10% in six months, we stop."

At the deadline, check. If the number moved, the pilot earned production resources and a real engineering team. If it didn't, the pilot gave you information, and the team is free to try the next thing.
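
Put together, the deadline check is a few lines. A sketch with hypothetical numbers; the threshold does the deciding, not the meeting.

```python
from datetime import date

def review(baseline: float, current: float, required_change_pct: float,
           deadline: date, today: date) -> str:
    """Compare the measured business metric against the pre-agreed threshold."""
    if today < deadline:
        return "pending"
    change_pct = (current - baseline) / baseline * 100
    if required_change_pct < 0:
        met = change_pct <= required_change_pct   # the number had to drop
    else:
        met = change_pct >= required_change_pct   # the number had to rise
    return "scale to production" if met else "kill it, reassign the team"

# Hypothetical numbers: cost per resolution fell from $14.20 to $12.30 by the
# deadline -- roughly a 13% drop against a required 10% drop.
print(review(14.20, 12.30, -10.0, date(2026, 4, 1), date(2026, 4, 1)))
```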

McKinsey's data makes the case: nearly two-thirds of organizations remain in the experimentation stage. The measurement is what hasn't caught up.

Churn dropping two points earns a production timeline. Hitting 91% classification accuracy earns another demo.

Bill Sourour

Founder, Arcnovus

25 years in enterprise technology. Writes about AI strategy for CTOs.

Featured in Fortune, WIRED, and CBC