
Somewhere in your business, a team is reporting that experimentation generated several million pounds of revenue impact last year. It may sit under marketing, product, growth, digital or e-commerce, depending on how your organisation is structured. Wherever it sits, that number is being used to justify next year’s testing budget, the headcount on the team, the renewal of the testing platform, and possibly a slide in the board pack.
If you asked the team to show you the workings, what would you actually get?
In our experience auditing experimentation programmes, the answer is: less than you would accept from any other function reporting a number that size. We call the gap between what gets reported and what would hold up under scrutiny Phantom Revenue. For most mid-market and enterprise businesses we look at, it sits between 40 and 70 per cent of the headline figure.
This is not a small problem dressed up. It is a real number, with real downstream consequences, that almost no finance function is currently checking.
Why this should be on your radar
Experimentation has stopped being a marketing tactic and quietly become a budget line. A typical mid-market business now spends somewhere between £400,000 and £2 million a year across testing platforms, analytics, in-house headcount, agency support and the engineering time spent shipping winning changes. On your books, this is usually scattered: a software subscription in the marketing or product line, agency fees under professional services, internal headcount across two or three functions. Few CFOs have ever totalled it up. The justification for that combined spend lives in a single quarterly slide: tests run, win rate, and total revenue lift attributed to experimentation.
That slide is unusual for one specific reason. It is the only material number in your business that nobody outside the team that produced it has ever audited.
You would not accept that on the P&L. You would not accept it on a capital expenditure case. You would not accept it on a marketing attribution model without at least understanding the methodology. But experimentation has been allowed to operate outside those norms, partly because the language is technical enough to discourage challenge, and partly because the team reporting the number sits two or three levels below the decisions the number is influencing.
That deference is costing you money.
What Phantom Revenue actually means
Phantom Revenue is the portion of the reported experimentation impact that would not hold up if you asked your audit committee to review the workings. It is not fraud. It is rarely deliberate. It is what happens when a function reports its own results without anyone independent checking the basis for those results.
It accumulates in two layers. The first layer is the experiment itself: whether the test was a fair fight or rigged in advance by how it was set up. The second layer is the reporting of the result: whether the win that gets written on the slide reflects what the test actually showed. A clean experiment can still be reported badly. A clean report of a biased experiment is worse, because it looks defensible at first glance.
You need to be alert to both.
Layer one: the experiment was biased before it ran
This is the harder layer for a non-specialist to see, because the test looks fine on paper. The methodology section is in order, the numbers add up, the chart goes in the right direction. The problem is that the test was never a fair test of the question it claimed to answer.
Hypotheses built to confirm, not challenge. The team had already decided the change should work. The experiment was designed to demonstrate it, not to test it. The variant received engineering polish the control did not. The launch coincided with a marketing push that lifted attention to the page. The comparison was uneven from the start. The lift is real, but it measures the unevenness, not the change. In finance terms, this is commissioning a feasibility study from a contractor who is also bidding for the build.
Confounding factors ignored. A test ran during peak season and the seasonal lift was attributed to the variant. A test ran on a segment that already converted well above average. A test ran while a competitor was offline. The numbers are statistically valid for the period they cover. They will not hold once the conditions change, which they will. In finance terms, this is projecting a forecast off a quarter that included a one-off event without flagging the one-off.
These are not edge cases. They are the most common reasons that yesterday’s “winning” tests fail to reproduce when they are rerun six months later. The team rarely revisits them. The number stays on the cumulative dashboard.
Layer two: the reporting of the result was inflated
Even when the experiment is well designed, the way the result gets reported up can quietly turn observation into claimed business impact. Five common patterns.
Predictions written after the fact. The team saw the result of the test first, then wrote down what they were “predicting”. The experiment now appears to confirm a hypothesis. It is actually an observation in hypothesis clothing. In finance terms, this is the equivalent of writing the forecast after the actuals are in and then taking credit for forecast accuracy.
Moving the goalposts. The original measure of success did not move, so a different measure became the headline. Once a team is allowed to choose which metric tells the story after the test has finished, every test eventually finds a way to win on something. In finance terms, this is restating the KPI after a missed quarter so the result looks like a hit.
No agreed definition of success. The team did not state in advance what would count as a win. The decision was made by whoever was reading the chart, after the fact. Without an agreed threshold, success becomes a matter of interpretation, and interpretation always favours the team that ran the test. In finance terms, this is approving an investment without a hurdle rate and then declaring afterwards that the return was acceptable.
No proof the change was implemented. The test was declared a win, the revenue was added to the dashboard, but nobody can show you when the winning version was actually rolled out to all customers, or whether the lift held up in production. The reported revenue is theoretical. The change may not have shipped. The lift may have disappeared the moment it left the controlled test environment. In finance terms, this is recognising revenue before the cash has cleared, and never going back to check whether it did.
Numbers that aren’t statistically reliable. The test was stopped as soon as it looked good. The sample size was too small. The team tested for many things at once and reported the one that moved. The lift on the slide is mathematically unstable, but it is on the slide. In finance terms, this is taking a small-sample anecdote and reporting it as a trend.
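The "tested for many things at once" problem is easy to put a number on. The sketch below is illustrative only: it assumes the conventional 5 per cent significance threshold and assumes the metrics are independent of each other, neither of which comes from any audited programme. The function name `chance_of_false_win` is ours, chosen for this example.

```python
# Illustrative only: the 5% threshold and the metric counts are assumptions,
# not figures from any audited programme.
ALPHA = 0.05  # conventional significance threshold (assumed)


def chance_of_false_win(metrics_checked: int, alpha: float = ALPHA) -> float:
    """Probability that at least one of several independent metrics shows a
    'significant' lift when the change being tested has no real effect."""
    return 1 - (1 - alpha) ** metrics_checked


for k in (1, 5, 10, 20):
    share = chance_of_false_win(k)
    print(f"{k:>2} metrics checked -> {share:.0%} chance of a spurious win")
```

Under those assumptions, a test that checks one metric has a 5 per cent chance of a spurious win. Check twenty and report whichever moved, and the chance rises to roughly 64 per cent.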
Any one of these alone is recoverable. Most programmes we audit have failures across both layers operating at the same time.
The industry knows
What makes this situation unusual is that the experimentation industry itself is not particularly quiet about it. In a public roundtable hosted by the testing platform Kameleoon, senior figures from Speero, Brainlab and Convoy went on record stating that the revenue numbers being reported up to executives are methodologically unsound. Craig Sullivan, a veteran of the optimisation industry, described the standard approach of taking a weekly test lift and multiplying it by fifty-two to produce an annual figure, and admitted that the actual value is a range with significant uncertainty, often very different from the headline number. Ben Labay, the managing director of Speero, was more direct: test statistics are not meant to be transferable across time, and projecting them forward is where the trouble starts.
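To see why the multiply-by-fifty-two figure is a range and not a number, consider a minimal sketch. Every figure in it is hypothetical: we assume a test whose weekly lift is estimated at £10,000, with a confidence interval running from £1,000 to £19,000.

```python
WEEKS_PER_YEAR = 52

# Hypothetical weekly revenue lift from one test, in pounds:
# a point estimate plus the lower and upper ends of its confidence interval.
weekly_lift = {"low": 1_000, "estimate": 10_000, "high": 19_000}

# The standard approach annualises by simple multiplication.
annualised = {name: value * WEEKS_PER_YEAR for name, value in weekly_lift.items()}

print(f"Headline on the slide: £{annualised['estimate']:,}")
print(f"Range it sits inside:  £{annualised['low']:,} to £{annualised['high']:,}")
```

The slide would show £520,000. The same data supports anything from £52,000 to £988,000, and even that range assumes the lift holds unchanged for a full year, which is precisely the assumption the practitioners quoted above say should not be made.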
Oliver Palmer, an independent consultant, named the incentive problem plainly. Most clients do not want reality, they want wins. Agencies that fail to deliver inflated numbers lose the contract to the next agency that will. Internal teams that fail to claim significant uplifts risk losing funding, credibility and their jobs. The reporting is shaped by what the system rewards, not by what the data supports.
Their proposed solution is to stop reporting revenue altogether and report “insights” instead. That works for the practitioner. It does not work for the CFO who has to plan next year’s budget against a real figure. The point worth holding onto is the prior one. The people producing the number admit in public that it is unreliable, and the executive team is still planning against it.
What this looks like in pounds
A mid-market e-commerce business asked us to audit twelve months of experimentation last year. The team had reported £4.1 million in revenue impact. The CFO had renewed the testing platform contract and approved an additional senior hire on the strength of that figure.
We reviewed 87 completed experiments against the two layers above. The picture looked like this.
| Category | Tests | Reported value | Value that survived audit |
|---|---|---|---|
| Clean: sound experiment design, valid reporting, implemented, measured in production | 22 | £1.6 million | £1.6 million |
| Compromised: design or reporting failure that reduced confidence in the result | 31 | £1.4 million | £0.7 million |
| Phantom: failures across both layers, or no evidence the change shipped | 34 | £1.1 million | £0 |
| Total | 87 | £4.1 million | £2.3 million |
The reported return on the experimentation investment was around five times its cost. The auditable return was closer to two and a half times.
That is not a programme that needs to be cut. It is a programme that needs to be governed. The team was generating real value. The problem was that the executive team could not tell which experiments were doing the work and which were noise, so they kept funding all of it equally, and they were planning next year on a revenue figure that was nearly double what could be substantiated.
That is a £1.8 million gap between what was claimed and what was real. On a single line of operational reporting. Inside a business that almost certainly has tighter controls on a £50,000 marketing campaign.
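The arithmetic behind those figures is simple enough to reproduce. The sketch below uses the numbers from the audit table above, in £ millions; the category keys and variable names are labels chosen for this example, not part of any tooling.

```python
# Figures are taken from the audit table above, in £ millions.
# The category keys are hypothetical labels chosen for this sketch.
audit = {
    "clean":       {"tests": 22, "reported": 1.6, "survived": 1.6},
    "compromised": {"tests": 31, "reported": 1.4, "survived": 0.7},
    "phantom":     {"tests": 34, "reported": 1.1, "survived": 0.0},
}

tests = sum(row["tests"] for row in audit.values())
# Round to one decimal place to match the precision of the table.
reported = round(sum(row["reported"] for row in audit.values()), 1)
survived = round(sum(row["survived"] for row in audit.values()), 1)

# Phantom Revenue: the reported value that did not survive the audit.
phantom_revenue = round(reported - survived, 1)
phantom_share = phantom_revenue / reported

print(f"Tests reviewed:  {tests}")
print(f"Reported:        £{reported}m")
print(f"Survived audit:  £{survived}m")
print(f"Phantom Revenue: £{phantom_revenue}m ({phantom_share:.0%} of the headline)")
```

On these figures, Phantom Revenue comes to about 44 per cent of the headline, which sits inside the 40 to 70 per cent band we typically see.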
Why this is a CFO problem
Three reasons it does not get solved without finance involvement.
The team running the experiments cannot audit themselves. Whether they sit in marketing, product or growth, they are measured on test velocity, win rate and revenue claimed. Asking them to review their own work against the criteria above is asking them to reduce their reported impact and call their own competence into question. A few will do it anyway. Most will not, and it is not reasonable to expect them to.
The number is influencing decisions made above the team. Headcount, platform renewal, agency spend, roadmap prioritisation, and in some businesses the assumptions inside the acquisition model. If the input is wrong, every decision downstream is wrong, and the cost compounds quietly across the year.
This is exactly the kind of control gap finance functions are built to close. You already insist on audit trails for revenue, hurdle rates for investment, methodology review for attribution. The principle is identical. Experimentation has simply been allowed to sit outside the perimeter because nobody thought to bring it in.
What an audit actually looks like
This is the work we do.
Observatory is a four-week diagnostic that connects read-only to your existing experimentation tooling, ingests twelve to twenty-four months of historical experiments, and applies the governance criteria above against every test the team has reported as a win. It covers both layers: was the experiment a fair test of the question it claimed to answer, and was the result reported in a way that holds up?
The output is a Governance Health Report written for the executive team. It contains three things.
A Phantom Revenue calculation. The reported figure alongside the auditable figure, broken down by failure type, so you can see exactly where the gap is coming from.
A breakdown by category. Which tests would survive an audit, which would survive with conditions, and which would not. With examples named. So the conversation moves from abstract to specific.
A set of governance recommendations. What needs to change in how experiments are designed, run and reported, so that next year’s number is one you can stand behind.
Most CFOs we have done this for have found the gap larger than they expected. A few have found it smaller, which is also useful information. None of them have ever told us they regretted asking the question.
The number on next year’s experimentation slide is going to influence decisions worth several times the cost of finding out whether it is right.
If you want to know what your team is reporting and what would survive an audit, that is a conversation. We would be happy to have it.
