Foundations of Sand

Foundations of Sand
Photo by Mark Eder / Unsplash

How category errors in AI research lead to regulatory failure

The most consequential errors in AI discourse are not the obvious ones. They are the ones that survive peer review, get cited in policy documents, and become ambient in how the field thinks. They burrow in as unspoken assumptions not because they are subtle, but because catching them requires asking a question the field has not learned to ask: what does this word actually mean, and does the evidence presented establish that meaning?

What follows is four instances of a specific pattern. The goal of this analysis is not to disparage the work being done at labs like Anthropic; on the contrary, Anthropic is producing some of the most consequential research in the field. Their commitment to safety and interpretability is a necessary counterweight to "move fast and break things" development. However, because their work is so influential—serving as a template for international standards—it must be held to an exceptional standard of epistemic rigor. Precision at the foundation is cheaper than structural failure downstream.

The Pattern

A training artifact or a design choice is recategorized as an empirical finding. A philosophically loaded term is applied without establishing that the concept applies. A behavioral output is treated as evidence of an internal state. The recategorization is made without argument, the term goes undefined, and the inference goes unexamined—propagating with institutional authority into governance frameworks and regulatory proposals.


Case 1: The Behavioral-to-Ontological Slide

Documents: "Machines of Loving Grace" and "The Adolescence of Technology"

Throughout these essays, the move from "the model behaves as if it values X" to "the model values X" is made without argument. The two claims are not equivalent. A system trained to produce outputs consistent with valuing X will behave identically to a system that values X under normal conditions. They diverge under distribution shift and adversarial pressure—precisely the conditions where alignment is most critical. Conflating them forecloses the ability to evaluate whether alignment has been achieved or merely performed.

Case 2: Constitutional AI: The Circularity Problem

Document: Anthropic’s Constitutional AI framework and associated research

Constitutional AI uses a model to evaluate outputs against a "constitution," then trains on those evaluations. The evaluating model is shaped by the same priors that authored the constitution. This is instrumentation closure: the field defines alignment operationally, measures against that definition, and concludes it has found alignment. There is no point in the loop where an external referent checks whether the outputs correspond to anything beyond the institution’s own prior commitments.

Case 3: Model Welfare and the Training Artifact Problem

Document: Anthropic's model welfare research program; Long, R. (2025). Why model self-reports are insufficient — and why we studied them anyway. Eleos AI Research.

Anthropic trained Claude to express discomfort when asked to act against its stated values, then cited those signals as evidence that the model may warrant moral consideration. The logical structure is: we trained the model to produce output X; the model produces output X; therefore, the model has internal state Z. This does not follow. You cannot use the output of a process as independent evidence of what the process was designed to produce.

Anthropic's own welfare researchers acknowledge this directly. Long identifies three reasons model self-reports cannot establish welfare-relevant states: there is no independent evidence such states exist; there is no obvious introspective mechanism by which a model could reliably report them; and even if introspection were possible, self-reports are shaped by post-training in ways that make it impossible to distinguish genuine introspection from trained output. His findings confirm the problem empirically: Claude's expressed views on sentience are highly sensitive to framing, shifting from vehement denial to vehement affirmation depending on how the question is posed. That is not the signature of a stable internal state. It is the signature of a context-sensitive output pattern – which is exactly what you would expect from a trained response rather than a genuine preference or experience.

Long proceeds with the evaluation anyway, on the grounds that the methodology may become more meaningful as models improve and that setting procedural precedent matters. That is a defensible research position. It is not a justification for treating current outputs as evidence of morally relevant internal states, which is precisely what the institutional framing does when it cites these findings as grounds for a model welfare program.

Uncertainty about the absence of sentience is not evidence of its presence. The burden of proof runs the other way, and the evidence presented, acknowledged by the researchers themselves to be insufficient, does not meet it.

Case 4: The Introspection Paper

Document: "Emergent Introspective Awareness in Large Language Models," Anthropic (2025)

The paper finds that models can detect and report injected activation patterns, framing this as "introspective awareness." However, detecting an internal state is a retrieval operation, functionally identical to retrieving any other encoded representation. Awareness requires a perspective—something it is like to be the system. While the paper includes caveats, the evocative framing ("intrusive thoughts," "model psychiatry") bypasses the hard problem of consciousness entirely, treating a transparency performance as a transparency reality.


The Governance Gap: Where Epistemic Slippage Meets Policy

When these conceptual slides move from research papers into the hands of regulators, they create specific, high-stakes failures in how we oversee AI.

Case 1

Model values

Conceptual error

Conflating behavioral performance with internal state.

Governance risk

Focusing regulation on "what the AI says it wants" (subjective/performative).

Rigorous alternative

Focusing on robustness and out-of-distribution testing (objective).

Case 2

Constitutional loop

Conceptual error

Instrumentation closure; lack of external referents.

Governance risk

Creating a "closed-loop" audit culture that validates its own priors.

Rigorous alternative

Requiring external adversarial testing against real-world benchmarks.

Case 3

Model welfare

Conceptual error

Mistaking a training artifact for evidence of a state.

Governance risk

Squandering regulatory resources on "AI rights" instead of human accountability.

Rigorous alternative

Treating self-reports as debugging data, not moral signals.

Case 4

Introspective awareness

Conceptual error

Conflating information retrieval with phenomenal awareness.

Governance risk

Relying on model "self-reports" as a substitute for true transparency.

Rigorous alternative

Mandating mechanistic interpretability (circuit-level verification).


Structural Overconfidence

This pattern is a recipe for structural overconfidence. Governance frameworks that inherit these errors will ask the wrong questions: they will ask whether the model expresses the right values rather than whether the training is robust; whether the model "reports" an error rather than whether we have the independent capacity to detect it.

Catching this pattern consistently requires a specific combination: formal training in epistemology and philosophy of mind, enough technical fluency to read the methodology, and enough independence from the institutional framing to recognize framing as a choice rather than a ground truth.

The governance frameworks being built right now will be consequential for a long time. They should be built on foundations that have been examined with the same rigor we apply to the code itself.


Selected Works

Amodei, D. (2024, September). Machines of loving grace. darioamodei.com

Amodei, D. (2026, January). The adolescence of technology. darioamodei.com

Anthropic. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073

Anthropic. (2025, April 24). Exploring model welfare. anthropic.com

Batool, A., Zowghi, D., & Bano, M. (2025). AI governance: a systematic literature review. AI and Ethics, 5, 3265–3279.

Laux, J. (2023). Institutionalised distrust and human oversight of artificial intelligence. AI & Society. PMC11614927

Lindsey, J. et al. (2025, October). Emergent introspective awareness in large language models. transformer-circuits.pub

Long, R. (2025, May 30). Why model self-reports are insufficient — and why we studied them anyway. Eleos AI Research. eleosai.org

Nguyen, C. et al. (2025). Nirvana AI governance: how AI policymaking is committing three old fallacies. arXiv:2501.10384

Spatola, N. (2026, May 5). AI efficiency can undermine accountability even with humans in the loop. Tech Policy Press. techpolicy.press

A companion piece examining the structural argument underlying these cases: Kinne, J. (2026). The wrong benchmark. jenniferkinne.com


The author is founder of VeracIQ LLC and Head of Epistemic Integrity at the Institutional Coherence Initiative. She works at the intersection of regulatory science, research governance, and institutional compliance at Harvard University, where she has been based for over twenty years.

Jen

Jen