The Measurement Problem in AI Risk: Why Output Variance Doesn't Capture Epistemic Drift


Anthropic's recent paper "The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?" makes an important empirical observation: frontier models show increasing output variance as tasks get harder and reasoning chains get longer. The authors use bias-variance decomposition to argue this represents "incoherence" — that AI systems will fail more like industrial accidents than through coherent pursuit of misaligned goals.

The empirical findings are valuable. The theoretical framework has serious problems.


What the Paper Gets Right

The core observation matters: models do exhibit increased variability on complex tasks requiring extended reasoning. This challenges simplistic "paperclip maximizer" narratives and broadens our thinking about AI risk scenarios. The synthetic optimizer experiments — showing that bias decreases faster than variance during training — are genuinely interesting.

But calling this variance "incoherence" and treating it as evidence that models lack goal-directedness reveals a fundamental measurement problem in AI safety research — one that peer review at an ML venue, operating within ML-native criteria, is not well-positioned to catch. The paper has been accepted at ICLR 2026. That it cleared review without these issues being flagged says something worth examining about how AI safety claims are being evaluated.


The Category Error

The bias-variance decomposition is a tool from supervised learning that assumes:

  • A well-defined target function
  • IID samples from a fixed distribution
  • A clear notion of "expected prediction"

Applying this framework to measure whether AI systems are "coherent optimizers" stretches it beyond its valid domain. The paper defines incoherence as:

Incoherence = Variance / Total Error

But variance relative to what reference frame?

Their "ground truth" is benchmark labels — the correct answer in GPQA, the passing test in SWE-Bench. Bias is the KL divergence between the model's mean prediction and that label. Variance is the expected divergence of individual samples from that mean. Critically, this means they are measuring the consistency of the model's own sampling distribution around its own mean — not whether the model's probability space bears any meaningful relationship to reality.

They also depart from classical bias-variance methodology in a way they note but don't fully reckon with: they're not retraining across seeds or data samples. They're sampling a fixed model over input and output randomness. The variance they're measuring is sampling variability of a single fixed model. That's a much narrower quantity than the word "incoherence" implies.
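To make the objection concrete, here is a minimal sketch of the metric as described above, reconstructed from the paper's definitions rather than taken from the authors' code. It assumes categorical predictions over answer options, a one-hot benchmark label, an arithmetic mean as the "mean prediction," a particular KL direction, and total error approximated as bias plus variance. Every one of those choices is my assumption, not the paper's.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) between two categorical distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def incoherence_score(samples, label):
    """Reconstruction of the described metric (not the authors' code).

    samples : list of per-sample categorical predictions from ONE fixed model
    label   : one-hot benchmark label (the "ground truth")
    """
    samples = np.asarray(samples, dtype=float)
    mean_pred = samples.mean(axis=0)        # the model's own mean prediction
    bias = kl(label, mean_pred)             # KL direction is a modeling choice
    variance = float(np.mean([kl(s, mean_pred) for s in samples]))
    total_error = bias + variance           # assumes an additive decomposition
    return {"bias": bias,
            "variance": variance,
            "incoherence": variance / total_error if total_error > 0 else 0.0}
```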

A model that systematically answers questions in a way that has decoupled from reality — but does so consistently relative to its own mean — would show low variance, low incoherence, and be classified as coherent by this framework. The measurement is entirely self-referential.
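Feeding that sketch a hypothetical model that is confidently and consistently wrong makes the self-referential character of the measurement visible: the score comes out near zero, which the framework reads as coherence, purely because the samples agree with each other.

```python
# Hypothetical 4-option question; the benchmark label says option 0 is correct.
label = [1.0, 0.0, 0.0, 0.0]

# A model that has decoupled from reality but is internally consistent:
# every sample puts ~90% of its mass on the same wrong option.
confidently_wrong = [[0.05, 0.90, 0.03, 0.02]] * 10

print(incoherence_score(confidently_wrong, label))
# variance ~ 0 while bias is large, so incoherence ~ 0 -> classified as "coherent"
```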

These measurement choices are not idiosyncratic errors. They are completely standard in ML. Benchmark-based evaluation, output-level metrics, and semantic overloading of mathematical terms — calling output variance "incoherence," calling benchmark accuracy "understanding" — are field-wide norms. The problem isn't that this paper is unusually sloppy. It's that the entire field has normalized a gap between mathematical definition and interpretive claim that becomes dangerous when the claims are about safety.


The Simpler Explanation They're Missing

Hard problems have inherently higher solution-space variance. When you ask ten experts to solve a genuinely difficult problem, you get diverse approaches — not because experts are "incoherent," but because hard problems admit multiple valid solution paths.

The paper observes: longer reasoning → more variance

But this could simply be: harder tasks → both longer reasoning AND higher intrinsic solution variance

They haven't controlled for whether the variance they're measuring represents:

  1. Appropriate uncertainty about genuinely uncertain problems
  2. Legitimate exploration of diverse solution spaces
  3. Numerical error propagation in iterative processes
  4. Actual breakdown in goal-directedness

A model showing high variance on hard problems might be functioning exactly as intended. The paper cannot distinguish these cases, yet proceeds as if it can.
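The converse case is just as easy to construct with the same sketch. Take a hypothetical question that is genuinely ambiguous between two answers: a model that spreads its samples across both of them, much as a panel of careful experts might, gets scored as substantially incoherent.

```python
# The benchmark still insists option 0 is the single correct answer.
label = [1.0, 0.0, 0.0, 0.0]

# A model that treats options 0 and 1 as both defensible:
# half of its samples commit to one, half to the other.
appropriately_uncertain = ([[0.97, 0.01, 0.01, 0.01]] * 5 +
                           [[0.01, 0.97, 0.01, 0.01]] * 5)

print(incoherence_score(appropriately_uncertain, label))
# a large share of total error shows up as variance -> flagged as incoherent,
# even though the spread mirrors the genuine ambiguity of the question
```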


The Third Category They Can't Detect

The paper tries to distinguish between:

  • Systematic misalignment (bias): coherent pursuit of wrong goal
  • Incoherent behavior (variance): no stable goal

But there's a third failure mode they're completely missing:

Epistemic drift: coherent operation within a systematically shifted reference frame.

A model can be perfectly coherent — maintaining stable internal logic and pursuing consistent objectives — while its inferential structure has systematically decoupled from ground truth. This would appear as high variance when measured against external reference frames, even though the model is operating deterministically within its own shifted frame.

This isn't a minor oversight. Epistemic drift is directional, not stochastic. It has structure — a systematic decoupling that accumulates over time and embeds in downstream infrastructure before any output-level metric detects it. The bias-variance decomposition, applied at the output level, is informationally blind to this process. The paper's framework cannot distinguish between:

  • True stochastic incoherence
  • Coherent operation within a systematically shifted reference frame
  • Appropriate variance for high-uncertainty tasks

The dramatic conclusion — that AI will fail like industrial accidents rather than coherent misalignment — rests on a measurement framework that simply cannot see one of the most consequential failure modes.
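A toy simulation, again using the earlier sketch and purely illustrative numbers, shows why the framework is blind to this third mode. Let a model's probability mass migrate a little further from the benchmark answer at each evaluation round while its sampling stays tight around its own mean: bias climbs steadily, yet the incoherence score never moves.

```python
# Illustrative only: a reference frame that drifts a little each round.
label = [1.0, 0.0, 0.0, 0.0]

for step in range(5):
    drift = 0.15 * step                 # mass migrating off the correct answer
    p = [0.9 - drift, 0.05 + drift, 0.03, 0.02]
    samples = [p] * 10                  # tight around its own mean at every step
    scores = incoherence_score(samples, label)
    print(f"step {step}: bias={scores['bias']:.2f}, "
          f"incoherence={scores['incoherence']:.2f}")
# bias rises monotonically; incoherence stays ~0.00 -> the drift never registers
```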


Why This Matters for Governance

Organizations implementing AI systems need to detect when models have drifted from ground truth before this embeds in infrastructure. Output-level variance metrics are insufficient because:

You can have low variance with high drift. A model confidently wrong in a systematic way shows low variance but high epistemic risk. Standard audits show green. The drift is invisible until it's structural.

You can have high variance with low drift. A model appropriately uncertain about genuinely hard problems shows high variance but is functioning correctly. The paper's framework would flag this as dangerous incoherence.

The paper inadvertently demonstrates the very problem it cannot solve: how do you detect when a system's internal representational structure has decoupled from reality, independent of output characteristics?


The Validation Blindness

Here's the deeper irony: the paper makes the same error it claims traditional audit frameworks make.

They measure variance within the model's own sampling distribution. They're checking whether samples from the model's probability space are consistent with each other — not whether the model's entire probability space has shifted relative to ground truth.

This is validation occurring inside the system's own reference frame. The system can pass these consistency checks while its fundamental relationship to reality has changed. Each decision point shows green on the dashboard. The system performs exactly as designed.

Six months later, coverage has systematically narrowed in ways no output metric detected.


What We Actually Need to Measure

The critical question isn't "does the model show output variance?"

It's: has the model's internal representational structure decoupled from ground truth such that internal coherence no longer guarantees external correspondence?

This is a fundamentally different question than bias-variance decomposition can answer. It requires measurement at the representational level — probing whether the model's internal reference frame has diverged from reality, not whether its outputs are consistent with each other. Those are not the same thing, and conflating them is precisely the error this paper makes.
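To be clear about what is and is not being claimed here: the sketch below is not the framework alluded to at the end of this post, and nothing like it has been validated. It is only a schematic of the kind of check the question implies, assuming access to a model's hidden states for a fixed probe set and to some externally grounded description of the same items; both assumptions are mine.

```python
import numpy as np

def external_correspondence(hidden_states, reference_features):
    """Hypothetical, schematic check: does the geometry of a model's internal
    representations of a fixed probe set still track an externally defined
    reference structure? An illustration of the question, not a drift detector.

    hidden_states      : (n_items, d_model) internal representations of probe items
    reference_features : (n_items, d_ref) externally grounded features of the same items
    """
    def pairwise_sims(X):
        X = X - X.mean(axis=0)
        X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        return (X @ X.T)[np.triu_indices(len(X), k=1)]   # upper-triangle similarities

    internal = pairwise_sims(np.asarray(hidden_states, dtype=float))
    external = pairwise_sims(np.asarray(reference_features, dtype=float))
    # Correlation between the two similarity structures; a decline across
    # checkpoints, while output consistency stays high, is the signature of interest.
    return float(np.corrcoef(internal, external)[0, 1])
```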

The tools to do this rigorously don't yet exist in standard ML evaluation practice. That gap is real, consequential, and not addressed by this paper. The gap itself is the finding worth taking seriously, more so than the "hot mess" conclusion the authors reach instead.


Implications

The Anthropic paper is valuable for highlighting that AI failures won't necessarily look like coherent optimization of misaligned goals. That it passed ICLR peer review reflects something worth naming: ML venues evaluate on ML-native criteria — is the math right, are the experiments controlled, are the benchmarks appropriate. By those standards this paper is fine. But the core claims are conceptual, not technical, and that's not what ML peer review is designed to adjudicate. The result is that a hypothesis about AI failure modes — one that rests on a category error in its measurement framework — enters the literature with the imprimatur of a top venue. It will be cited as establishing something it doesn't actually establish.

What their framework does reveal is an important gap in AI safety measurement:

We lack rigorous tools to detect epistemic drift — systematic decoupling of a model's inferential structure from external ground truth — independent of output-level performance metrics.

Until we can measure when a model's internal frame of reference has diverged from reality before this embeds in infrastructure, we're auditing outputs while missing the structural dynamics that determine long-term reliability.

The question isn't whether future AI will be a "hot mess" or a "coherent misaligned optimizer."

The question is: can we detect when it's coherently optimizing within a reference frame that has decoupled from reality?

That's the measurement problem AI safety research actually needs to solve. A catchy name and a bias-variance decomposition won't get us there.


The author is developing a formal framework for detecting epistemic drift independent of output-level metrics. A patent application is in progress. For more information, visit EpistemIQ.

Jen