Measuring an AI OS, Honestly — What We Track, and What We Refuse to Claim

Every leadership team that adopts AI eventually asks the same question out loud in its weekly review: is the AI actually working? We got asked it on our own scorecard call. The honest first answer was that we could not fully prove it. The system was not failing. The problem was that the thing everyone wants on a slide, return on the AI investment, is genuinely hard to measure, and most of the people claiming a number cannot back it up.

That admission is the whole point of this piece. We run a large internal AI system. Around 60 governed agents live inside our workspace, they share 140+ skills, and roughly 80 scheduled jobs run on their own around the clock. One enrichment agent alone has logged about 7,700 runs. We moved the company onto Notion as its operating substrate in November 2025, and the AI sits on top of it in two halves: a governed workspace brain (Notion AI) that already knows the business, and an autonomous external-reach layer (NanoClaw) that acts beyond it. For a system that size, "is it worth it?" is not rhetorical. It is a budget line we defend every month.

We could not answer it with a clean ROI figure. Almost nobody can. So we stopped pretending, and built a scorecard around what is actually measurable. Here is what made the cut, what we threw out, and the one number we trust without reservation.

The honest answer is the common one

If you read the research, our uncertainty is the norm, not the exception. A widely cited MIT study found that roughly 95% of enterprise generative AI pilots produced no measurable return. Survey after survey shows most companies reporting little measurable productivity gain despite real spend. And studies that compare how much time people think AI saves them against how much it measurably saves keep finding the perceived number running well ahead of the measured one.

None of that means AI is not working. It means the standard ROI claim is mostly vibes. We did not want to add to that pile.

Why AI ROI is genuinely hard to measure

Four things break the clean before-and-after story.

Selection bias. You apply AI to the work where you already expect it to help. So the with-AI task set and the without-AI task set are not comparable in the first place. The comparison is rigged before you start.

No counterfactual. You cannot run the same person on the same task twice, once with AI and once without, under identical conditions. The clean A/B everyone imagines does not exist in real operations.

Attribution noise. Output moves for a dozen reasons in any given week. Pulling the AI's contribution out of headcount changes, seasonality, and ordinary process tweaks is mostly guesswork.

The best work was invisible to begin with. The thing AI does best for us is the routine work that used to quietly slip: the follow-up that never got sent, the record that never got updated. That work never showed up in a metric before, so "we now do it consistently" does not register as a gain on any chart we used to keep.

What we measure instead

We gave up on a single ROI number and measure three things we can actually stand behind.

1. Adoption inputs. How much the system is actually used, counted as AI sessions per person per week and fed automatically to the company scorecard. It is a leading indicator. Usage comes before integration, and integration comes before any outcome. If usage drops, everything downstream drops a few weeks later.

2. Maturity. A five-stage, self-reported ladder that asks each person where they honestly sit: Aware (Stage 1), Trying (Stage 2), Using (Stage 3), Integrated (Stage 4), Transforming (Stage 5). "Using" means AI is part of a weekly workflow you would miss if it disappeared. The stages roll up into a single 0 to 100 AI Adoption Index. Self-reporting sounds soft, but it captures what a usage count misses: whether someone has actually rebuilt a workflow around AI or is just pinging a chatbot a lot. We cross-check the maturity line against the usage line. If self-reports climb while usage stays flat, that gap is itself a conversation. Our targets are concrete: at least 60% of the team at Stage 3 (Using) or above, and at least 40% at Stage 4 (Integrated) or above, by the end of 2026.

3. Cost. This is the one measure we track rigorously, because it is the one that holds still. We have token-level telemetry on what the system spends, month-to-date guardrails that step the models down to cheaper tiers as spend approaches a cap, and a single view that pulls together what used to live across three separate billing surfaces. Before we unified it, no single place showed the total, which is its own kind of risk when the system runs itself.
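
To make the guardrail in item 3 concrete, here is a minimal sketch of a month-to-date step-down rule. It is an illustration under stated assumptions, not a description of the production system: the tier names, the 70% and 90% thresholds, the function name allowed_tier, and the choice to stop entirely once the cap is reached are all hypothetical.

```python
# Hypothetical tiers, ordered from most to least expensive. Each entry
# carries the fraction of the monthly cap below which it is still allowed.
# Names and thresholds are assumptions made for illustration.
TIERS = [
    ("premium", 0.70),   # allowed while spend is below 70% of the cap
    ("standard", 0.90),  # allowed while spend is below 90% of the cap
    ("economy", 1.00),   # allowed until the cap itself is reached
]


def allowed_tier(month_to_date_spend, monthly_cap):
    """Return the most capable tier still allowed, or None once the cap is hit."""
    if monthly_cap <= 0:
        raise ValueError("monthly_cap must be positive")
    used = month_to_date_spend / monthly_cap
    for name, ceiling in TIERS:
        if used < ceiling:
            return name
    return None  # assumption: at the cap, no new paid work is scheduled


if __name__ == "__main__":
    for spend in (100, 750, 950, 1000):
        print(spend, allowed_tier(spend, monthly_cap=1000))
```

The point of a rule like this is that it runs on a number the system already knows, so it can act without a human noticing the bill first.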

The scorecard journey

Getting to those three was not a clean design exercise. It was an argument we had in public, week after week, on our Ninety.io scorecard.

We started with a checkbox: "AI tools adopted, yes or no." After a year of investment the only honest reading was "yes, partly," which told us nothing. A binary cannot move week to week. It scores a daily power user the same as someone who opened a chatbot once, and it cannot catch a team that leaned on AI heavily last quarter and quietly stopped.

So we tried to count the calls. Even that turned out to need real definitional choices. Is one "call" a whole agent run, or each underlying model request? Do you count cached reads, which can inflate the number several times over? Where does the week start, given the system runs around the clock across time zones? We picked the unit a human actually feels, the agent run, and moved on. The lesson was that even "just count it" is a modeling decision. Pick the wrong unit and the number swings while nothing real changes underneath it.
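
The definitional choices above are easier to see with a small example. The sketch below counts the same hypothetical log three ways and buckets agent runs by ISO week in UTC. The record shape (ModelRequest), the field names, the boolean cached flag, and the rule that a run belongs to the week of its earliest request are all assumptions made for illustration, not the telemetry schema the system actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ModelRequest:
    run_id: str    # the agent run this request belongs to
    person: str    # who triggered the run
    at: datetime   # timezone-aware timestamp
    cached: bool   # simplification: True if the request was a cache read


def runs_per_person_week(requests):
    """Count distinct agent runs per (person, ISO year, ISO week), in UTC.

    Assumption: a run is assigned to the week of its earliest request.
    """
    first_seen = {}
    for r in requests:
        utc = r.at.astimezone(timezone.utc)
        if r.run_id not in first_seen or utc < first_seen[r.run_id][1]:
            first_seen[r.run_id] = (r.person, utc)
    counts = {}
    for person, utc in first_seen.values():
        iso = utc.isocalendar()
        key = (person, iso[0], iso[1])
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical log: two agent runs that fan out into six model requests.
# run-1 starts on a Sunday evening at UTC-8, which is already Monday in UTC.
pacific = timezone(timedelta(hours=-8))
requests = [
    ModelRequest("run-1", "ana", datetime(2026, 1, 11, 20, 0, tzinfo=pacific), False),
    ModelRequest("run-1", "ana", datetime(2026, 1, 11, 20, 1, tzinfo=pacific), True),
    ModelRequest("run-1", "ana", datetime(2026, 1, 11, 20, 2, tzinfo=pacific), True),
    ModelRequest("run-2", "ana", datetime(2026, 1, 13, 9, 0, tzinfo=timezone.utc), False),
    ModelRequest("run-2", "ana", datetime(2026, 1, 13, 9, 1, tzinfo=timezone.utc), False),
    ModelRequest("run-2", "ana", datetime(2026, 1, 13, 9, 2, tzinfo=timezone.utc), True),
]

agent_runs = runs_per_person_week(requests)
all_requests = len(requests)
uncached_requests = sum(1 for r in requests if not r.cached)
```

On this log the three units give 2, 6, and 3 for the same activity, and the Sunday-evening run lands in a different week depending on whether the week is cut in local time or UTC. Nothing about the underlying work changed; only the definition did.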

Counting still only captured behavior, not depth. So we paired the objective usage line with the subjective maturity line. Two numbers, one honest picture: are people using it, and is it changing how they work? Both sit on the leadership scorecard, and the system feeds the usage line itself, automatically, before anyone walks into the Monday review. A number that moves on its own every week is worth far more than a green dot someone has to remember to tick.
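
Pairing the two lines can be sketched in a few functions. The article does not say how the five stages roll up into the 0 to 100 index, so the linear mapping below (Stage 1 maps to 0, Stage 5 maps to 100, averaged across the team) is an assumption, as are the function names, the sample data, and the thresholds in the divergence check.

```python
STAGES = ["Aware", "Trying", "Using", "Integrated", "Transforming"]


def adoption_index(stage_numbers):
    """Average stage rescaled to 0-100.

    ASSUMED linear mapping: Stage 1 -> 0, Stage 5 -> 100.
    """
    if not stage_numbers:
        raise ValueError("need at least one self-report")
    mean_stage = sum(stage_numbers) / len(stage_numbers)
    return (mean_stage - 1) / (len(STAGES) - 1) * 100


def share_at_or_above(stage_numbers, stage):
    """Fraction of the team reporting the given stage or higher."""
    return sum(1 for s in stage_numbers if s >= stage) / len(stage_numbers)


def diverging(index_then, index_now, usage_then, usage_now,
              min_index_gain=5.0, max_usage_growth=0.10):
    """Flag when self-reported maturity climbs while usage is flat or falling.

    The thresholds are illustrative assumptions, not calibrated values.
    """
    if usage_then <= 0:
        return False  # no usage baseline to compare against
    usage_growth = (usage_now - usage_then) / usage_then
    return (index_now - index_then) >= min_index_gain and usage_growth <= max_usage_growth


# Hypothetical self-reports (stage numbers 1-5) for a ten-person team.
before = [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]
now = [1, 2, 2, 3, 3, 3, 4, 4, 4, 5]

index_now = adoption_index(now)
meets_stage3_target = share_at_or_above(now, 3) >= 0.60
meets_stage4_target = share_at_or_above(now, 4) >= 0.40

# Usage is given as sessions per person per week (hypothetical values).
needs_conversation = diverging(adoption_index(before), index_now,
                               usage_then=3.0, usage_now=3.1)
```

In this made-up example the index rises while usage barely moves, so the check returns True. That is the gap worth raising in the review, not a verdict on who is right.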

What we refuse to claim

We do not publish hours saved. We do not attribute revenue to the AI. We do not claim headcount avoided. Each of those needs a counterfactual that, as we explained above, does not exist, and putting a precise figure on a guess is how AI reporting loses credibility.

What we accept instead are honest proxies: run counts, the jobs that no longer slip, the consistency of the output, and the cost per workflow. None of them prove ROI on their own. Together they tell us whether the engine is running and roughly what it costs to run it.
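
Of those proxies, cost per workflow is the simplest to compute once per-run costs exist. A minimal sketch, assuming telemetry can be reduced to one (workflow, cost) row per run; the workflow names, the dollar amounts, and the function name cost_per_workflow are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-run telemetry: (workflow name, cost of that run in USD).
run_costs = [
    ("enrichment", 0.04),
    ("enrichment", 0.06),
    ("weekly-digest", 0.30),
]


def cost_per_workflow(rows):
    """Return {workflow: (run_count, total_cost, average_cost_per_run)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for workflow, cost in rows:
        totals[workflow][0] += 1
        totals[workflow][1] += cost
    return {w: (n, total, total / n) for w, (n, total) in totals.items()}


summary = cost_per_workflow(run_costs)
```

The result says what each workflow costs to run, not what it is worth, which is exactly the limit of the proxy.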

The one number we trust completely

Cost. You cannot prove the benefit to the decimal, but you can bound the bet. We know within tight margins what the system spends, we cap it, and we watch it daily. A bounded cost sitting next to a stack of visible operational wins is a rational bet, even without a clean ROI figure.

That reframe is the practical takeaway. The question is not "what is the ROI," which you cannot honestly answer. The question is "what does this cost, and can we see it earning its keep," which you can.

Where we landed

Measure the inputs honestly. Bound the costs tightly. Let the operational wins make their own case instead of inventing a return number to justify them. That is the scorecard we trust, and it is the one we would build again.

If you are deploying an AI system and wrestling with the same question, come find me. I am Kadeem Clarke, and I am around in the Canton ecosystem Slack.

Keep reading

Start with the hub: The Infrastructure Mindset, Turned Inward — How BitSafe Runs on AI

How BitSafe Runs on Notion — the brain:

Part 1: Notion as the Company OS · Part 2: The Architecture · Part 3: Agents, Automations, and the AI Layer · Part 4: Replacing Salesforce with Notion · Part 5: The Agent Governance Model

The NanoClaw series — the reach:

Part 1: Building a Company-Wide AI Assistant · Part 2: The Architecture · Part 3: The Autonomous Engine · Part 4: The Substrate · Part 5: Working With NanoClaw · Companion: Cost Discipline

Standalone deep-dives:

Why Not Just Use the Claude App? · The Invisible Seam · Measuring an AI OS, Honestly