
The experimentation industry has a metrics problem. Walk into any experimentation team meeting and you’ll hear the same numbers celebrated: “We ran 47 tests last month!” “Our win rate is up to 33%!” “We achieved a 12% average lift!”
Yet ask these same teams about their strategic impact on business decisions, and the room falls silent.
This problem has only gotten worse with the rise of supposedly “sophisticated” metrics that still miss the fundamental point. Recently, industry thought leaders have proposed metrics like “FTEs per Experiment” as a measure of organizational experimentation maturity. While clever, this metric perpetuates the exact thinking that keeps experimentation trapped in operational optimization rather than strategic transformation.
These vanity metrics—whether traditional or seemingly advanced—create a dangerous illusion of progress while masking the fundamental question: Is your experimentation program actually driving better business decisions, or is it just generating activity?
The difference between experimentation programs that influence boardroom decisions and those stuck in optimization theater lies in what they choose to measure. It’s time to abandon metrics that count activity and embrace those that demonstrate strategic impact through governance.
The Metrics That Mislead
Traditional experimentation metrics tell a compelling story—just not the one that matters. Consider the most common measurements that teams parade in front of leadership, and why they fundamentally miss the mark.
The Velocity Trap
“Number of experiments run” remains the most seductive and destructive metric in experimentation. On the surface, it seems logical: more experiments mean more learning, which should mean more value. This thinking has spawned entire programs optimized for test velocity, with teams celebrating monthly increases in experiment count.
The reality is more sobering. We’ve analyzed programs running hundreds of experiments annually that generate less business impact than focused teams running a fraction of that number. Why? Because velocity without governance creates noise, not insight. Teams launch poorly conceived tests to hit quotas. They run variations without strategic purpose. They duplicate previous experiments because no one remembers what was tested before.
One Fortune 500 retailer we studied ran 312 experiments in a year. Impressive, until you discover that only 27% had clear hypotheses, 41% were variations of previously run tests, and less than 15% connected to any strategic business objective. The velocity metric had transformed their program into an expensive random number generator.
The “Efficiency” Trap: Why FTEs per Experiment Misses the Point
Recent industry discourse has introduced more sophisticated-sounding metrics like “FTEs per Experiment”—calculating how many full-time employees an organization needs to run each experiment. This metric attempts to measure organizational experimentation penetration and efficiency, but it falls into the same fundamental trap as velocity metrics.
Why FTEs per Experiment Fails:
- Denominator Problem: The total FTE count includes everyone from security guards to accountants who play no role in experimentation
- Specialist Trap: A small, highly efficient experimentation team could run many experiments while experimentation remains completely siloed from decision-making
- Volume Bias: Still rewards quantity over quality or strategic relevance
- Operational vs. Strategic Confusion: Measures experimentation team efficiency, not organizational transformation
A company could have an incredibly “efficient” FTE-to-experiment ratio while experimentation remains completely disconnected from how the business actually makes strategic decisions. This metric tells you about the operational efficiency of experimentation specialists, not whether experimentation is embedded in organizational decision-making.
The Real Question: Instead of “How many FTEs does your organization need to run one experiment?” we should ask: “What percentage of your experiments actually influence strategic business decisions?”
The Win Rate Illusion
“Percentage of successful experiments” creates an even more insidious problem. This metric seems to measure quality—surely a higher win rate indicates better experimentation? In practice, it incentivizes the opposite of what organizations need.
Teams optimize for win rate by testing only the safest, most obvious changes. They avoid bold experiments that might transform the business but carry higher failure risk. Worse, they engage in p-hacking and cherry-picking metrics to manufacture “wins” that exist only in spreadsheets.
We’ve seen teams report 45% win rates while their businesses stagnate. Meanwhile, organizations with 15% win rates drive dramatic growth because they’re testing meaningful changes and learning from failures. The win rate metric doesn’t distinguish between a button color test that “wins” and a pricing model experiment that transforms revenue strategy.
The Conversion Lift Mirage
“Average conversion rate improvement” might be the most dangerous metric of all, because it sounds so connected to business value. Teams proudly report achieving “8.3% average lift across all experiments” as if this number means something.
But average lift without context is meaningless. A 20% lift on a page that gets 100 visitors monthly has less impact than a 2% lift on your checkout flow. More critically, these metrics assume that tested lifts translate directly to implemented results—an assumption that governance data consistently disproves.
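To see why, consider the arithmetic directly. The Python sketch below (with illustrative traffic and conversion figures, not data from any case in this article) compares the absolute impact of those two lifts:

```python
# Illustrative figures only: extra conversions = visitors * base conversion rate * relative lift
def monthly_conversions_gained(visitors: int, base_cvr: float, lift: float) -> float:
    """Absolute monthly impact of a relative conversion lift."""
    return visitors * base_cvr * lift

# A 20% lift on a page with 100 monthly visitors...
low_traffic = monthly_conversions_gained(visitors=100, base_cvr=0.05, lift=0.20)
# ...versus a 2% lift on a checkout flow with 250,000 monthly visitors.
checkout = monthly_conversions_gained(visitors=250_000, base_cvr=0.05, lift=0.02)

print(f"20% lift, low-traffic page: {low_traffic:.0f} extra conversion/month")   # 1
print(f"2% lift, checkout flow:     {checkout:,.0f} extra conversions/month")    # 250
```

The “bigger” relative lift produces a 250x smaller absolute result, which is exactly the context that average-lift reporting erases.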
One SaaS company celebrated their 11% average lift across 89 experiments. Our governance analysis revealed that only 34% of “winning” experiments were properly implemented, and of those, only half delivered the promised lift in production. Their real average impact? Less than 2%. The metric had become a comfortable lie everyone agreed to believe.
The Metrics That Matter: A Governance Framework
Meaningful experimentation metrics measure not activity but impact, not tests but decisions, not velocity but value. These governance-focused metrics reveal whether your experimentation program deserves strategic investment or needs fundamental transformation.
The Trust Gap Score: Measuring Leadership Confidence
At the heart of experimentation governance lies a simple question: Do your executives trust experiment results enough to make strategic decisions based on them? The Trust Gap Score quantifies this critical confidence measure through multiple dimensions:
Subjective Confidence Assessment:
- Executive survey: “How confident are you in using experimental results for strategic decisions?” (0-100 scale)
- “What percentage of experiment results do you trust enough to implement at scale?”
- “How often do experiment predictions match real-world outcomes after implementation?”
Behavioral Confidence Indicators:
- Decision Velocity: Time from experiment completion to strategic decision
- Implementation Rate: Percentage of “winning” experiments that actually get implemented within 90 days
- Budget Allocation Confidence: Strategic budget allocated based on experimental evidence vs. opinion
- Executive Engagement Frequency: How often C-suite reviews and acts on experimental results
Trust Gap Diagnostic Questions:
- Can leadership immediately tell you what percentage of “successful” experiments delivered predicted lift when implemented?
- Do executives question experiment reliability when making strategic decisions?
- Are major business decisions routinely delayed pending experimental validation?
A Trust Gap Score below 60 indicates an experimentation program that executives tolerate but don’t trust. Scores above 80 suggest a program that actively influences strategic thinking. Most importantly, this metric directly connects to what matters: experimentation’s role in decision-making.
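The score can be operationalized in several ways; here is a minimal sketch assuming a 50/50 blend of the subjective survey average and the behavioral indicators. The weights and the normalization of each indicator to a 0-1 rate are illustrative assumptions, not a prescribed formula:

```python
from statistics import mean

def trust_gap_score(survey_scores: list[float],
                    implementation_rate: float,
                    prediction_match_rate: float,
                    evidence_based_budget_share: float) -> float:
    """Blend subjective confidence (0-100 survey responses) with
    behavioral indicators (each a 0-1 rate), weighted 50/50.
    Both choices are assumptions for illustration."""
    subjective = mean(survey_scores)  # already on a 0-100 scale
    behavioral = 100 * mean([implementation_rate,
                             prediction_match_rate,
                             evidence_based_budget_share])
    return 0.5 * subjective + 0.5 * behavioral

score = trust_gap_score(
    survey_scores=[55, 40, 62],        # executive confidence responses
    implementation_rate=0.31,          # winners implemented within 90 days
    prediction_match_rate=0.45,        # tested lift reproduced in production
    evidence_based_budget_share=0.25,  # budget allocated on experimental evidence
)
print(f"Trust Gap Score: {score:.0f}")  # 43 -> tolerated, not trusted
```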
One pharmaceutical company discovered their Trust Gap Score was 42—executives simply didn’t believe experiment results would translate to real-world impact. By focusing on closing this gap through governance, they increased the score to 78 within six months and saw experimentation influence three major strategic pivots.
True Adoption Metrics: Beyond FTE Ratios
Rather than measuring FTEs per experiment, meaningful adoption metrics assess how experimentation penetrates organizational decision-making:
Experimentation Adoption Index (EAI):
- Department Coverage: Percentage of departments running experiments
- Decision-Maker Engagement: Percentage of decision-makers who use experimental evidence monthly
- Cross-Functional Collaboration: Percentage of experiments involving multiple departments
- Resource Allocation: Percentage of departments with experimentation budgets
Decision Intelligence Score (DIS):
- Experiment-to-Decision Conversion Rate: Percentage of experiments that influence strategic decisions within 90 days
- Strategic Alignment: Percentage of experiments linked to business OKRs
- Implementation Velocity: Time from insight to action
- Leadership Decision Confidence: Percentage of strategic decisions informed by experimental evidence
These paired metrics reveal both the breadth (EAI) and depth (DIS) of experimentation impact. You could have high EAI but low DIS (experimentation everywhere but not driving decisions) or low EAI but high DIS (limited but highly effective experimentation).
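As a minimal sketch of how the paired indices might be computed, assuming equal weighting of each component and treating implementation velocity as a pre-normalized 0-1 score (faster is higher):

```python
from statistics import mean

def experimentation_adoption_index(dept_coverage: float,
                                   decision_maker_engagement: float,
                                   cross_functional_share: float,
                                   budgeted_dept_share: float) -> float:
    """Breadth (EAI): average of four adoption rates (0-1), scaled to 0-100.
    Equal weighting is an assumption."""
    return 100 * mean([dept_coverage, decision_maker_engagement,
                       cross_functional_share, budgeted_dept_share])

def decision_intelligence_score(decision_conversion: float,
                                okr_alignment: float,
                                velocity_score: float,
                                leadership_confidence: float) -> float:
    """Depth (DIS): same construction over the four decision-impact rates."""
    return 100 * mean([decision_conversion, okr_alignment,
                       velocity_score, leadership_confidence])

eai = experimentation_adoption_index(0.80, 0.70, 0.55, 0.60)
dis = decision_intelligence_score(0.20, 0.35, 0.40, 0.25)
print(f"EAI {eai:.0f} / DIS {dis:.0f}")  # 66 / 30: broad adoption, shallow decision impact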
Implementation Success Rate
The most honest metric in experimentation asks: Of your “successful” experiments, how many actually get implemented, and of those, how many deliver the promised value?
This compound metric reveals the effectiveness of your entire experimentation value chain. First, track the percentage of winning experiments that move to implementation within 90 days. Then, measure how many implemented changes deliver at least 80% of their tested impact after six months.
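In code, the compound calculation is simply the product of those two stages; the counts below are illustrative:

```python
def implementation_success_rate(winners: int,
                                implemented_within_90d: int,
                                delivered_80pct_at_6mo: int) -> float:
    """Compound rate: share of winning experiments implemented within
    90 days, times the share of those that delivered >= 80% of their
    tested impact after six months."""
    if winners == 0 or implemented_within_90d == 0:
        return 0.0
    implementation_rate = implemented_within_90d / winners
    delivery_rate = delivered_80pct_at_6mo / implemented_within_90d
    return implementation_rate * delivery_rate

rate = implementation_success_rate(winners=40,
                                   implemented_within_90d=22,
                                   delivered_80pct_at_6mo=13)
print(f"Implementation Success Rate: {rate:.0%}")  # ~33%
```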
Most organizations discover their Implementation Success Rate hovers between 25% and 40%—a shocking revelation that explains why experimentation struggles for credibility. High-governance programs achieve 70-85% rates by ensuring experiments are designed for implementation from the start.
A financial services firm tracking this metric discovered that only 31% of their winning experiments delivered promised value when implemented. The insight led them to redesign their experimentation process with implementation feasibility as a core consideration, ultimately tripling their program’s real-world impact.
Strategic Alignment Index
Every experiment should connect to a strategic business objective. The Strategic Alignment Index measures this connection systematically.
Score each experiment from 0-100 based on: clarity of connection to strategic objectives (40%), potential impact on key business metrics (30%), executive stakeholder engagement (20%), and integration with strategic planning cycles (10%). Average these scores across all experiments for your program-wide index.
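The rubric reduces to a weighted sum. This sketch uses the weights above; the per-experiment ratings are hypothetical inputs:

```python
# Weights come directly from the rubric above.
SAI_WEIGHTS = {
    "strategic_connection": 0.40,   # clarity of link to strategic objectives
    "business_impact":      0.30,   # potential impact on key business metrics
    "executive_engagement": 0.20,   # stakeholder engagement
    "planning_integration": 0.10,   # integration with strategic planning cycles
}

def experiment_alignment_score(ratings: dict[str, float]) -> float:
    """Weighted 0-100 score for a single experiment; each rating is 0-100."""
    return sum(SAI_WEIGHTS[k] * ratings[k] for k in SAI_WEIGHTS)

def strategic_alignment_index(experiments: list[dict[str, float]]) -> float:
    """Program-wide index: average of per-experiment scores."""
    return sum(map(experiment_alignment_score, experiments)) / len(experiments)

index = strategic_alignment_index([
    {"strategic_connection": 90, "business_impact": 70,
     "executive_engagement": 60, "planning_integration": 50},   # scores 74
    {"strategic_connection": 20, "business_impact": 40,
     "executive_engagement": 10, "planning_integration": 0},    # scores 22
])
print(f"Strategic Alignment Index: {index:.0f}")  # 48 -> running tests, not strategy
```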
Programs with Strategic Alignment Index scores below 50 are running tests, not driving strategy. Scores above 75 indicate experimentation embedded in strategic thinking. This metric forces hard conversations about why certain experiments exist and whether testing efforts align with business priorities.
A retail organization discovered their Strategic Alignment Index was 34—two-thirds of their experiments had no clear connection to strategic objectives. By requiring strategic alignment documentation before experiment approval, they increased their index to 71 and saw executive engagement with experimentation triple.
Insight Reuse Rate
Knowledge compounds only when it’s preserved and applied. The Insight Reuse Rate measures how effectively your organization builds on experimental learnings rather than repeatedly rediscovering them.
Track what percentage of new experiments explicitly build on previous learnings. Measure how often past experimental insights influence new strategic decisions. Monitor how frequently teams reference historical experiments when designing new ones.
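A minimal sketch of the core ratio, assuming each experiment record tracks which prior experiments it explicitly builds on (the record schema is hypothetical):

```python
# Hypothetical experiment log; "builds_on" lists prior experiment IDs.
experiments = [
    {"id": "exp-101", "builds_on": []},
    {"id": "exp-102", "builds_on": ["exp-101"]},
    {"id": "exp-103", "builds_on": []},
    {"id": "exp-104", "builds_on": ["exp-101", "exp-102"]},
    {"id": "exp-105", "builds_on": []},
]

def insight_reuse_rate(experiments: list[dict]) -> float:
    """Share of experiments that explicitly build on previous learnings."""
    reusing = sum(1 for e in experiments if e["builds_on"])
    return reusing / len(experiments)

print(f"Insight Reuse Rate: {insight_reuse_rate(experiments):.0%}")  # 40%
```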
Organizations with Insight Reuse Rates below 20% are trapped in an experimentation Groundhog Day, repeatedly testing similar concepts without institutional memory. High-governance programs achieve 60-70% reuse rates by treating insights as strategic assets rather than test byproducts.
One technology company found they had tested variations of their onboarding flow 23 times over three years, with no systematic building on previous learnings. Implementing insight governance increased their reuse rate to 64% and reduced redundant experimentation by 70%.
Governance Score
The Governance Score provides a comprehensive health metric for experimentation quality and reliability. Unlike simple quality scores, it evaluates the complete experimentation lifecycle through a governance lens.
Calculate governance scores by evaluating: hypothesis quality and strategic alignment (20%), methodological rigor and statistical validity (20%), stakeholder engagement and communication (20%), implementation planning and feasibility (20%), and knowledge capture and insight documentation (20%).
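Because the five components are weighted equally, the calculation is a straightforward average; the component ratings below are hypothetical:

```python
# The five components and their equal 20% weights come from the rubric above;
# the 0-100 rating scale per component is an assumption.
GOVERNANCE_COMPONENTS = [
    "hypothesis_quality",       # hypothesis quality & strategic alignment
    "methodological_rigor",     # statistical validity
    "stakeholder_engagement",   # engagement & communication
    "implementation_planning",  # implementation feasibility
    "knowledge_capture",        # insight documentation
]

def governance_score(ratings: dict[str, float]) -> float:
    """Equal-weighted (20% each) average of the five component ratings."""
    return sum(ratings[c] for c in GOVERNANCE_COMPONENTS) / len(GOVERNANCE_COMPONENTS)

score = governance_score({
    "hypothesis_quality": 70, "methodological_rigor": 85,
    "stakeholder_engagement": 40, "implementation_planning": 55,
    "knowledge_capture": 30,
})
print(f"Governance Score: {score:.0f}")  # 56 -> activity without reliable insight
```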
Programs maintaining Governance Scores above 80 can be trusted for strategic decision-making. Scores below 60 indicate programs that generate activity without reliable insights. This metric becomes your north star for program improvement.
An enterprise software company implemented governance scoring and discovered their average was 53—explaining why executives remained skeptical of experiment results. By focusing on improving governance scores, they reached an average of 81 within four months and saw experimentation influence their product roadmap for the first time.
Decision Impact Metric
Ultimately, experimentation exists to improve decision-making. The Decision Impact Metric quantifies this purpose by tracking how experimentation influences strategic choices.
Document every significant business decision where experimentation played a role. Categorize influence levels: primary driver, significant input, supporting evidence, or minimal impact. Track the business value of decisions influenced by experimentation.
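A minimal sketch of that tracking logic, assuming a simple decision log (the records shown are hypothetical):

```python
from collections import Counter

# Influence levels as defined above; the decision-log format is hypothetical.
decisions = [
    {"decision": "pricing model change", "influence": "primary driver"},
    {"decision": "market expansion",     "influence": "minimal impact"},
    {"decision": "onboarding redesign",  "influence": "significant input"},
    {"decision": "feature sunset",       "influence": "supporting evidence"},
    {"decision": "channel reallocation", "influence": "minimal impact"},
]

def decision_impact(decisions: list[dict]) -> float:
    """Share of major decisions where experimentation was a primary
    driver or significant input."""
    strong = {"primary driver", "significant input"}
    hits = sum(1 for d in decisions if d["influence"] in strong)
    return hits / len(decisions)

print(Counter(d["influence"] for d in decisions))
print(f"Decision Impact: {decision_impact(decisions):.0%}")  # 40%, below the 60% target
```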
This metric shifts focus from running experiments to influencing strategy. Programs should target having experimentation as a primary driver or significant input for at least 60% of major product and growth decisions.
A financial technology firm discovered experimentation influenced only 18% of their strategic decisions, despite running 200+ experiments annually. By reorganizing their program around decision support rather than test execution, they increased decision impact to 67% while actually reducing test volume.
From Activity to Impact: Making the Transition
Transitioning from vanity metrics to governance metrics requires more than changing your dashboards—it demands fundamentally reimagining what experimentation success looks like.
Start by acknowledging the political challenge. Teams optimized for velocity will resist metrics that reveal their tests don’t connect to strategy. Practitioners comfortable with win rates won’t welcome implementation tracking that shows their winners don’t win in reality. Address this resistance by connecting new metrics to career growth and team success, not just program evaluation.
Implement new metrics gradually. Don’t abandon all traditional metrics overnight—that creates chaos. Instead, add governance metrics alongside traditional ones, gradually shifting emphasis as teams adapt. Show how governance metrics explain why traditional metrics haven’t translated to business impact.
Most critically, ensure executive sponsorship for the metrics transition. When leadership asks about experiment velocity, redirect to implementation success. When they celebrate win rates, show them trust gap scores. Executive attention drives organizational behavior—use it to reinforce what matters.
The Metrics Dashboard That Drives Strategy
Your experimentation dashboard should tell a story of strategic impact, not tactical activity. Structure it in three layers that guide viewers from health to impact to opportunity.
The health layer shows governance foundations: average Governance Score, Trust Gap trajectory, and Implementation Success Rate. These metrics indicate whether your experimentation engine can be trusted.
The impact layer demonstrates strategic value: Strategic Alignment Index, Decision Impact numbers, and calculated ROI from influenced decisions. These metrics justify experimentation investment.
The opportunity layer guides future focus: lowest governance scores highlighting improvement areas, highest-impact experiments suggesting expansion opportunities, and insight reuse patterns revealing knowledge gaps.
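As a minimal sketch, the three layers might be assembled like this; the metric values and field names are placeholders, not a product specification:

```python
# Hypothetical three-layer dashboard payload: health -> impact -> opportunity.
dashboard = {
    "health": {            # can the engine be trusted?
        "avg_governance_score": 81,
        "trust_gap_trend": [42, 55, 68, 78],
        "implementation_success_rate": 0.72,
    },
    "impact": {            # does it justify the investment?
        "strategic_alignment_index": 71,
        "decision_impact": 0.67,
        "influenced_decision_roi": 4_200_000,
    },
    "opportunity": {       # where should we focus next?
        "lowest_governance_areas": ["knowledge_capture", "stakeholder_engagement"],
        "expansion_candidates": ["pricing experiments", "retention flows"],
        "insight_reuse_gaps": ["onboarding", "checkout"],
    },
}

for layer, metrics in dashboard.items():
    print(layer.upper(), "->", ", ".join(metrics))
```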
One global retailer redesigned their experimentation dashboard around these principles. The CEO, who previously ignored experimentation reports, now opens every strategic planning session by reviewing the governance dashboard. That behavioral change—from indifference to engagement—demonstrates the power of measuring what matters.
Beyond Metrics: Building a Governance Culture
Metrics alone don’t transform programs—they enable transformation by making governance visible and valuable. Use these metrics not as judgment tools but as improvement guides.
When governance scores are low, don’t punish—investigate and improve. When implementation rates disappoint, examine the full system from hypothesis to execution. When strategic alignment wavers, strengthen the connection between experimentation and planning cycles.
Create regular governance reviews that celebrate improvement, not just achievement. A team that increases their governance score from 45 to 65 deserves more recognition than a team that maintains 70. Progress matters more than position.
Most importantly, use metrics to tell stories. The Trust Gap Score becomes compelling when you share how closing it influenced a major product decision. Implementation Success Rate resonates when you calculate the revenue recovered by fixing it. Strategic Alignment Index matters when you show experiments driving competitive advantage.
The Path Forward
The experimentation industry stands at a crossroads. We can continue celebrating vanity metrics—whether traditional velocity measures or seemingly sophisticated efficiency ratios—that make us feel productive while delivering minimal strategic impact. Or we can embrace governance metrics that reveal hard truths but guide us toward genuine business value.
Organizations that make this transition stop asking “How many experiments did we run?” or “How efficient is our experimentation team?” and start asking “How many better decisions did we make?” They stop celebrating test wins and start measuring implementation impact. They stop counting activity and start demonstrating strategic value.
This isn’t just a metrics change—it’s a maturity evolution. Programs measured by governance metrics can’t hide behind velocity or efficiency ratios. They can’t claim success through p-hacked wins or operational optimization. They must deliver what executives actually need: reliable insights that drive confident decisions.
The metrics you choose define the program you build. Choose vanity metrics—whether traditional or sophisticated-sounding—and you’ll create an expensive theater of optimization. Choose governance metrics, and you’ll build a strategic capability that transforms how your organization competes.
The question isn’t whether you’ll make this transition—competitive pressure ensures that programs stuck in vanity metrics won’t survive. The question is whether you’ll lead this change or be forced into it after watching governance-focused programs deliver the strategic impact yours cannot.
Your current dashboard tells a story. Make sure it’s the story that matters: not how busy or efficient your experimentation program is, but how much your business trusts and benefits from its insights. That’s the only metric that ultimately counts.