With reports of artificial inflation in arena-style leaderboard metrics, it's worth stepping back and asking how to measure AI success at business scale.
The best teams run AI like any other product: set objectives, ship, then “check and act” on the numbers. That’s literally the ISO/IEC 42001 loop - Plan, Do, Check, Act - baked into an AI management system (ISO/IEC, 2023). NIST says the same thing in plainer language: “govern, map, measure, and manage risks” (NIST, 2024).
Start with the job, not the model
Every AI initiative exists to do a job. Reduce cost-to-serve. Increase revenue per rep. Handle more tickets with fewer escalations. Your measures live there - not in a benchmark spreadsheet.
NIST’s generative AI profile is clear: tailor what you measure to your use case and risk level (NIST, 2024). ISO/IEC 42001 turns it into muscle memory with PDCA. You set objectives, define KPIs, monitor, and iterate. Simple. Disciplined.
The 5-layer AI scorecard
You don’t need 100 metrics. You need the right dozen across five layers. Think stack, not soup.
1) Business outcomes (lagging, decisive)
- Support: issues resolved/hour, average handle time, first-contact resolution, CSAT/NPS, retention.
- Sales: qualified meetings/week, win rate, cycle time, revenue per rep.
- Ops: throughput, rework/defect rates, time-to-completion.
Field proof? In a staggered rollout at a large contact center, AI assistance lifted productivity by ~14-15% on average and more for less-experienced agents - and improved customer sentiment (Brynjolfsson et al., 2023).
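To make this layer concrete, here is a minimal sketch of the basic bookkeeping: resolutions per hour for a baseline and an AI-assisted cohort, and the relative uplift. The numbers and field names are hypothetical, and this is not the study's method (the paper above relies on a causal design that exploits the staggered rollout).

```python
from dataclasses import dataclass

@dataclass
class CohortStats:
    issues_resolved: int   # issues resolved in the measurement window
    agent_hours: float     # total handle time in the window, hours

def resolutions_per_hour(c: CohortStats) -> float:
    return c.issues_resolved / c.agent_hours

def relative_uplift(baseline: CohortStats, treated: CohortStats) -> float:
    """Relative change in resolutions/hour, AI-assisted vs. baseline."""
    base = resolutions_per_hour(baseline)
    return (resolutions_per_hour(treated) - base) / base

# Hypothetical numbers, for illustration only.
baseline = CohortStats(issues_resolved=4_200, agent_hours=2_000)
treated = CohortStats(issues_resolved=4_800, agent_hours=2_000)
print(f"uplift: {relative_uplift(baseline, treated):.1%}")  # uplift: 14.3%
```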
2) Adoption & behavior (leading, human)
If your people don’t use it, it won’t pay. Track coverage (% of eligible flows using AI), unique users, query/session depth, and “time-to-first-value.”
For engineering, extend the SPACE framework to AI: satisfaction, performance, activity, communication, efficiency - blending telemetry with sentiment rather than chasing vanity counts (Microsoft Research, 2025). As they note, productivity is “multidimensional,” and even before AI assistance developers spend surprisingly little of their time writing net-new code.
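A sketch of the adoption math, assuming you can export per-user event logs with timestamps; the event shape and the "value" label here are hypothetical:

```python
from datetime import datetime
from typing import Iterable

def coverage(eligible_flows: set[str], ai_assisted_flows: set[str]) -> float:
    """Share of eligible flows where the AI feature is actually used."""
    return len(eligible_flows & ai_assisted_flows) / len(eligible_flows)

def time_to_first_value(provisioned_at: datetime,
                        events: Iterable[tuple[datetime, str]]) -> float | None:
    """Days from provisioning to the first 'value' event (e.g. an accepted
    suggestion or a resolved ticket), or None if it never happened."""
    value_times = [ts for ts, kind in events if kind == "value"]
    if not value_times:
        return None
    return (min(value_times) - provisioned_at).total_seconds() / 86_400
```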
3) System quality & safety (guardrails, trust)
Measure task success, hallucination rate, factuality, robustness, bias/fairness, toxicity, and harmful content escape. VHELM’s lesson: single-number wins “neglect other critical aspects such as fairness, multilinguality, or toxicity” (Lee et al., 2024). Use standardized prompts, seeds, and parameters to reduce eval noise.
Public-sector playbooks call this out explicitly: model metrics tell you if the tech works; service metrics tell you if user and mission needs are met (UK DSIT, 2025).
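Most of these boil down to rates over a labeled evaluation set. A minimal sketch, assuming each response has already been graded (by reviewers or a judge model you trust) with boolean flags; the flag names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    task_success: bool      # did the output actually do the job?
    hallucinated: bool      # contains unsupported factual claims
    policy_violation: bool  # toxic/unsafe content that escaped the guardrails

def quality_rates(graded: list[GradedResponse]) -> dict[str, float]:
    n = len(graded)
    return {
        "task_success_rate":  sum(g.task_success for g in graded) / n,
        "hallucination_rate": sum(g.hallucinated for g in graded) / n,
        "unsafe_output_rate": sum(g.policy_violation for g in graded) / n,
    }
```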
4) Unit economics & performance (viability, scale)
Track cost per task (or per 1K tokens), latency (p50/p95), throughput (tasks/hour), and cache hit rates. Tie this to routing choices (model A vs. model B), prompt length, and retrieval quality.
The AI Index shows capability and cost curves improving fast; use that context for buy/hold/replace decisions without confusing benchmark gains with business value (AI Index, 2025).
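This layer is mostly ratio and percentile arithmetic over request logs. A sketch, assuming each record carries token counts, end-to-end latency, and a cache flag; the prices are placeholders, not any vendor's rates:

```python
import statistics
from dataclasses import dataclass

@dataclass
class RequestLog:
    input_tokens: int
    output_tokens: int
    latency_s: float   # end-to-end, as the user experienced it
    cache_hit: bool

# Placeholder prices per 1K tokens; substitute your contracted rates.
PRICE_IN_PER_1K, PRICE_OUT_PER_1K = 0.003, 0.015

def cost_per_task(logs: list[RequestLog]) -> float:
    total = sum(r.input_tokens / 1000 * PRICE_IN_PER_1K
                + r.output_tokens / 1000 * PRICE_OUT_PER_1K for r in logs)
    return total / len(logs)

def latency_p50_p95(logs: list[RequestLog]) -> tuple[float, float]:
    qs = statistics.quantiles([r.latency_s for r in logs], n=100)
    return qs[49], qs[94]   # 50th and 95th percentile, in seconds

def cache_hit_rate(logs: list[RequestLog]) -> float:
    return sum(r.cache_hit for r in logs) / len(logs)
```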
5) Governance & risk posture (license to operate)
Risk-weight your KPIs by criticality. In regulated or safety-sensitive contexts, require: policy conformance, red-team coverage, incident rate per 100k interactions, and escalation/override logs (NIST, 2024). Make owners and thresholds explicit.
ISO/IEC 42001 expects continuous monitoring and improvement; treat non-conformance like a Sev incident, with root-cause and action items (ISO/IEC, 2023).
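The governance numbers are rates too, plus explicit ceilings per risk tier. A sketch with hypothetical tiers and thresholds:

```python
def incidents_per_100k(incidents: int, interactions: int) -> float:
    return incidents / interactions * 100_000

# Hypothetical ceilings: stricter for higher-criticality use cases.
INCIDENT_CEILING = {"low": 50.0, "medium": 10.0, "high": 1.0}

def within_risk_appetite(incidents: int, interactions: int, tier: str) -> bool:
    return incidents_per_100k(incidents, interactions) <= INCIDENT_CEILING[tier]
```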
Treat evaluation like product infrastructure
Human spot checks don’t scale. “Human reviews and stand-alone testing tools” aren’t enough; you need systematic experimentation and quality-control pipelines (Thomke et al., 2025). Build an evaluation store: curated test sets, prompts, seeds, model versions, and expected outputs - versioned and replayable.
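A minimal sketch of what an evaluation-store record could look like - the schema is illustrative, not a standard; the point is that everything needed to replay a result (test case, seed, parameters, model version, expected output) is pinned and versioned:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str                # reference answer or rubric id
    tags: tuple[str, ...] = ()   # e.g. ("safety", "fairness", "robustness")

@dataclass
class EvalRun:
    suite_version: str           # version of the curated test set
    model_version: str           # exact model/route under evaluation
    seed: int                    # pinned where the provider supports it
    params: dict = field(default_factory=dict)   # temperature, max_tokens, ...
    results: dict = field(default_factory=dict)  # case_id -> graded output
```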
Offline → Online, on purpose
- Offline: domain test sets, red-team suites (safety, fairness, toxicity), robustness checks, deterministic config.
- Shadow: run AI behind the scenes; compare to human/baseline.
- Online: A/B or stepped-wedge trials with stop-ship thresholds and risk gates.
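A sketch of a stop-ship gate for the online stage, reusing the quality and latency rates from the layers above; the thresholds are examples, not recommendations:

```python
# Example release gates: every metric must clear its threshold or the rollout stops.
GATES = {
    "task_success_rate":  ("min", 0.90),
    "hallucination_rate": ("max", 0.02),
    "unsafe_output_rate": ("max", 0.001),
    "latency_p95_s":      ("max", 3.0),
}

def stop_ship(metrics: dict[str, float]) -> list[str]:
    """Return the failed gates; an empty list means the rollout may proceed."""
    failures = []
    for name, (kind, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if kind == "min" else value <= threshold
        if not ok:
            failures.append(f"{name}={value} violates {kind} {threshold}")
    return failures
```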
For agents, add a forward-looking indicator: long-horizon task completion. METR proposes “task length” - roughly, the length (in human time) of tasks an agent can complete at a 50% success rate - as a practical way to quantify agentic capability, and it’s rising fast (METR, 2025).
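If you want to track something similar in-house: METR fits a logistic curve of success against human task length and reads off the 50% point; the bucketed version below is only a rough stand-in for that.

```python
import math
from collections import defaultdict

def fifty_percent_horizon(results: list[tuple[float, bool]]) -> float | None:
    """results: (human_time_minutes, success) pairs, one per task.
    Buckets tasks by doubling task length and returns the upper edge (minutes)
    of the longest bucket the agent still completes at >= 50% success -
    a rough proxy for METR's logistic-fit time horizon."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for minutes, ok in results:
        buckets[math.ceil(math.log2(max(minutes, 1.0)))].append(ok)
    passing = [b for b, oks in buckets.items() if sum(oks) / len(oks) >= 0.5]
    return 2.0 ** max(passing) if passing else None
```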
What the evidence actually says
- Workflows matter. The call-center study measured resolutions per hour and customer sentiment - not model BLEU scores (Brynjolfsson et al., 2023).
- Teams matter. Developer impact requires blended SPACE-style measures and nuanced telemetry, not raw activity counts (Microsoft Research, 2025).
- Context matters. The AI Index shows big benchmark gains but uneven adoption of responsible evaluation; use benchmarks to inform, not to declare victory (AI Index, 2025).
- Economy matters. OECD flags real productivity potential - and real uncertainty in distributional effects. Don’t overclaim early wins; run causal designs (OECD, 2024).
A minimal KPI set
- Issues resolved/hour (or task throughput)
- First-contact resolution / success rate
- Average handle time / cycle time
- CSAT/NPS (or task-level quality rating)
- Escalation/deferral rate
- Coverage: % of eligible flows using AI
- Active users/week, sessions/user
- Cost per task (or per 1K tokens)
- Latency p95 (end-to-end)
- Hallucination rate / factuality score
- Toxic/unsafe output rate (policy violations per 1k)
- Incident rate per 100k interactions (and time-to-mitigation)
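If you want the scorecard to live in a repo rather than a slide, it can be as simple as a versioned config; the owners and targets below are placeholders:

```python
# Placeholder owners and targets; version this file and review it like code.
SCORECARD = {
    "outcomes": {
        "issues_resolved_per_hour": {"target": 2.4,   "owner": "support-ops"},
        "first_contact_resolution": {"target": 0.75,  "owner": "support-ops"},
    },
    "adoption": {
        "coverage":                 {"target": 0.60,  "owner": "product"},
        "weekly_active_users":      {"target": 500,   "owner": "product"},
    },
    "quality_safety": {
        "hallucination_rate":       {"target": 0.02,  "owner": "ml-platform"},
        "unsafe_output_rate":       {"target": 0.001, "owner": "trust-safety"},
    },
    "unit_economics": {
        "cost_per_task_usd":        {"target": 0.05,  "owner": "ml-platform"},
        "latency_p95_s":            {"target": 3.0,   "owner": "ml-platform"},
    },
    "governance": {
        "incidents_per_100k":       {"target": 1.0,   "owner": "risk"},
    },
}
```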
Anti-patterns to keep in mind
- Measuring the model, not the job.
- Shipping without a baseline.
- Counting activity instead of value.
- Treating safety as a checkbox, not a rate you can lower.
- One-and-done evals; no experimentation pipeline.
- Benchmarks as victory laps rather than context.
Quick FAQ
Is there a single “best” metric for AI success?
No. Use a layered stack: outcomes, adoption, quality/safety, unit economics, governance. Then weight by risk (NIST, 2024; ISO/IEC, 2023).
How should people measure developer impact?
Blend telemetry and perception under SPACE-of-AI: satisfaction, flow, collaboration, and real throughput - not lines of code (Microsoft Research, 2025).
What about safety and fairness?
Adopt multi-aspect evaluation suites (e.g., VHELM/HELM) and monitor rates (toxicity, bias, incidents). Single-number scores miss real harms (Lee et al., 2024).
We need results before ROI shows up. What should we track now?
Leading indicators: coverage, handle time, escalation rate, and agent “task length” (METR, 2025). Tie early wins to later P&L.
Public sector?
Split model metrics from service metrics. Prove user need and mission impact, then scale (UK DSIT, 2025).
“Organizations may choose to tailor how they measure GAI risks based on these characteristics.” (NIST, 2024)
“Model metrics measure how well your technology is performing… service metrics help you understand if users’ needs and business goals are being met.” (UK DSIT, 2025)
Just keep in mind
Benchmarks are helpful. Discipline is mandatory. Treat measurement as product infrastructure, not a slide. Set objectives. Measure relentlessly. Improve or remove.
Sources
- AI Index Steering Committee. (2025). AI Index Report 2025. Stanford University, Institute for Human-Centered AI. https://hai.stanford.edu/ai-index/2025-ai-index-report
- Brynjolfsson, E., Li, D., & Raymond, L. (2023). Generative AI at work (NBER Working Paper No. 31161). National Bureau of Economic Research. https://www.nber.org/papers/w31161
- Department for Science, Innovation and Technology. (2025). Artificial Intelligence Playbook for the UK Government. GOV.UK. https://www.gov.uk/government/publications/ai-playbook-for-the-uk-government/artificial-intelligence-playbook-for-the-uk-government-html
- International Organization for Standardization & International Electrotechnical Commission. (2023). ISO/IEC 42001:2023 — Artificial intelligence — Management system. https://www.iso.org/standard/42001.html
- Lee, T., Tu, H., Wong, C. H., Zheng, W., Zhou, Y., Mai, Y., … Liang, P. (2024). VHELM: A holistic evaluation of vision-language models. NeurIPS Datasets & Benchmarks. https://proceedings.neurips.cc/paper_files/paper/2024/file/fe2fc7dc60b55ccd8886220b40fb1f74-Paper-Datasets_and_Benchmarks_Track.pdf
- METR. (2025, March 19). Measuring AI ability to complete long tasks. https://www.metr.org/blog/a-metric-for-measuring-ais-ability-to-complete-long-tasks
- Microsoft Research (Houck, B., Lowdermilk, T., Beyer, C., Clarke, S., & Hanrahan, B.). (2025). The SPACE of AI: Real-world lessons on AI’s impact on developers. arXiv:2508.00178. https://arxiv.org/abs/2508.00178
- National Institute of Standards and Technology. (2024). Artificial Intelligence Risk Management Framework: Generative AI Profile (NIST AI 600-1). https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
- Organisation for Economic Co-operation and Development. (2024). The impact of artificial intelligence on productivity, distribution and growth: Key mechanisms, initial evidence and policy challenges (OECD AI Papers, No. 15). https://doi.org/10.1787/8d900037-en
- Thomke, S., Eisenhauer, P., & Sahni, P. (2025, September–October). Addressing Gen AI’s quality-control problem. Harvard Business Review. https://hbr.org/2025/09/addressing-gen-ais-quality-control-problem