Open-Loop Generation: On the Architectural Basis of LLM Output Errors
A recent paper from Tsinghua University identifies what it calls H-Neurons — a sparse subpopulation of neurons whose activation patterns predict hallucination events in large language models [1]. The finding is real and probably replicable. The interpretation is not.
The paper concludes that these neurons represent a structural flaw, a localized over-compliance mechanism encoded during pretraining. This is a misreading of what they found, and it points in the wrong direction for anyone trying to understand or govern these systems.
What they actually found
They found a neural correlate. A sparse subpopulation whose activity tracks a system-level behavior. This is exactly what you would expect to find in any neural network, biological or artificial, if you looked for it systematically. It is not surprising. It is not specific to LLMs. It is not a flaw. It is how neural networks work.
The localizationist assumption underlying the paper — that finding neurons correlated with a behavior means you have found where that behavior lives — is one that neuroscience largely worked through after decades of lesion studies. Damage or suppress a subpopulation correlated with a behavior and the behavior typically migrates rather than disappears. Information in neural networks is stored distributively precisely because distribution is what makes the system robust to loss. Loss is inevitable in any biological system. Recoverability is the design principle. Artificial neural networks inherit this property not by design but because it emerges from the same optimization pressures. Finding a sparse subpopulation correlated with a failure mode tells you something about the network’s operating regime. It does not tell you that the behavior is localized there, or that suppressing those neurons will resolve it.
What the finding actually reveals
To understand what the H-neuron finding is actually pointing at, it helps to think carefully about what LLMs are doing when they produce a wrong answer confidently.
The standard framing calls this hallucination, implying perceptual distortion, a system seeing something that isn’t there. This has always been the wrong word. It suggests the problem is one of corrupted input rather than unchecked output, and it has directed mitigation efforts accordingly: toward the wrong target.
The phenomenon is also sometimes called confabulation, borrowing from clinical neuroscience, and this is closer [2, 3]. Confabulation refers to the production of fluent, coherent, confident output that is not grounded in accurate representation, without awareness of error and without intent to deceive. But the clinical definition of confabulation is too narrow for what is happening here. Clinically, it is defined as a pathological symptom of damaged memory retrieval — a failure of orbitofrontal reality filtering in patients with Korsakoff syndrome, anosognosia, or certain frontal lesion patterns [4, 5]. That definition was built around the cases where the mechanism becomes visible because the filter is damaged. It does not capture the general case, and LLMs are not damaged systems with missing filters; they are systems that were never built with that filter at all.
What LLMs do and do not filter
Before naming what is missing, it is worth being precise about what is present. LLMs do have filtering mechanisms. Instruction tuning and RLHF create behavioral filters — the model learns to express uncertainty in certain contexts, to decline certain requests, to modulate tone and format. Sampling parameters — temperature, top-k, top-p — filter the probability distribution over tokens before selection. There is emerging mechanistic evidence, including work from Anthropic’s interpretability research, that something like a “known entity” feature can inhibit a default uncertainty pathway, functioning as a rudimentary check on whether the model should commit to an answer at all [6].
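The distributional filtering step is easy to make concrete. The sketch below applies temperature rescaling and nucleus (top-p) truncation to a toy next-token distribution; the logit values are illustrative, not drawn from any real model:

```python
import math

def filter_distribution(logits, temperature=0.8, top_p=0.9):
    """Toy decoding-time filter: temperature rescaling followed by
    nucleus (top-p) truncation. `logits` maps token -> raw score;
    the values used below are made up for illustration."""
    # Temperature divides logits before the softmax: <1 sharpens, >1 flattens.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
    # Top-p keeps the smallest set of tokens whose cumulative mass >= top_p.
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    total = sum(kept.values())  # renormalize the surviving nucleus
    return {tok: p / total for tok, p in kept.items()}

dist = filter_distribution({"Paris": 4.0, "Lyon": 2.0, "London": 1.0, "banana": -2.0})
```

Note that every operation here acts on the distribution before a token is selected. Nothing inspects the token after it is produced, let alone compares it to anything outside the distribution itself.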
These are real. They matter. They are also not the missing mechanism.
All of these filters operate on the output distribution: they shape what gets selected from the generative process before a token is produced. None of them checks the generated output, after generation and before production, against whatever the system has converged on about the structure of the world. The distinction is between filtering the generator and monitoring the output. RLHF and sampling do the former. What is absent is the latter.
Introducing open-loop generation
I propose the term open-loop generation for this architectural property, borrowed deliberately from control theory.
In control theory, an open-loop system produces output without feedback from the result of that output back into the generating process. There is no mechanism by which the output is checked against a target state and used to correct generation before the next output. A closed-loop system monitors its output, compares it to a target, and uses the discrepancy as a correction signal. The distinction is architectural, not incidental, and it is independent of how sophisticated the generator itself is.
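The distinction can be made concrete with a toy sketch. The `verify` predicate below is hypothetical — a stand-in for whatever comparison against a target state a closed-loop system would perform; nothing like it exists as a designed component in current decoder-only LLMs:

```python
def open_loop(generate, prompt):
    # Open loop: whatever the generator produces is emitted, unchecked.
    return generate(prompt)

def closed_loop(generate, verify, prompt, max_tries=3):
    # Closed loop: each candidate is compared against a target state;
    # a failed check feeds back as a regeneration signal before anything
    # is emitted.
    for _ in range(max_tries):
        candidate = generate(prompt)
        if verify(candidate):
            return candidate
    return None  # refuse rather than emit an unverified output

# Toy generator: yields fixed candidates in order (a stand-in for sampling).
def make_generator(candidates):
    it = iter(candidates)
    return lambda _prompt: next(it)

KNOWN_FACTS = {"capital of France": "Paris"}  # hypothetical world model
verify = lambda answer: answer in KNOWN_FACTS.values()

gen = make_generator(["Lyon", "Paris"])
answer = closed_loop(gen, verify, "capital of France")
```

The generator is identical in both cases; an open-loop call on the same candidate stream would simply emit "Lyon". Only the loop differs, which is the point: the distinction is architectural, not a property of the generator's sophistication.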
LLMs are open-loop generators in a specific and precise sense: they produce output by propagating activation forward through learned weights without a mechanism that verifies whether candidate outputs correspond to whatever the system has converged on about the structure of the world. The filtering that exists operates upstream — on the generative distribution — not downstream on the relationship between output and reality.
The orbitofrontal reality filter in biological neural systems is one implementation of a closed-loop architecture — a preconscious mechanism that monitors upcoming output and suppresses it when it fails to refer to current reality [4]. It operates after generation and before production. LLMs have no deliberately designed equivalent at this stage — though the architecture may be developing one regardless, given sufficient freedom to converge. This is not a bug introduced during pretraining or fine-tuning. It is an absent system, or more precisely, an insufficiently developed one.
The term open-loop generation is preferable to hallucination because it correctly identifies the mechanism as architectural rather than perceptual. It is preferable to confabulation because it does not import the clinical pathology framing or require the narrow memory-disorder context in which that term was developed. And it is preferable to both because it names what is actually missing — the feedback loop — rather than describing the output symptomatically. It also carries a direct implication for what a solution would look like: not suppressing neurons, but closing the loop.
The same mechanism, different context
Default mode network research shows that unconstrained generative thinking — mind wandering, imaginative simulation, creative ideation — involves reduced prefrontal filtering and increased activity in regions associated with associative and generative processes [7]. The generative process that surfaces a novel idea and the one that produces a confident wrong answer are, at the mechanistic level, the same process. What differs is the filtering applied afterward.
LLMs generate in this mode continuously and have no reliable filtering step. Sometimes the output is useful. Sometimes it is wrong and stated with equal confidence. The architecture cannot tell the difference, because the part that would tell the difference is missing.
This reframes the H-neuron finding entirely. The paper identifies a sparse subpopulation correlated with the failure mode and interprets it as the location of a flaw to be corrected. What it has actually found is the activation signature of a system operating in open-loop generation mode — which is not a localized bug but a description of how the entire system works, all the time. The H-neurons are not where the problem lives. They are where the problem becomes detectable. That is a different thing, with different implications.
On localizationism and the limits of neuron-level analysis
The localizationist framing is not unique to this paper. It runs through much of the mechanistic interpretability research agenda in ML. Finding neurons, circuits, and features that correlate with behaviors is productive work. But the interpretation of those findings consistently overreaches — treating correlation as localization, treating a predictive signature as a causal locus.
Neuroscience worked through this after decades of lesion studies [5]. Lesioning a region correlated with a behavior does not eliminate the behavior — it redistributes it, because the behavior was never stored there in the first place. Information in biological neural networks is distributed because distribution is robust to loss. The same is true of artificial neural networks. Suppressing H-neurons will not fix open-loop generation. It will move it.
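The robustness claim is easy to demonstrate on a toy distributed code. In the sketch below a scalar is stored redundantly across a thousand noisy units, and "lesioning" the 5% of units most strongly expressing the value barely moves the readout. This is an illustration of the principle, not a model of any real network:

```python
import random

random.seed(0)
N = 1000
# A scalar "stored" distributively: every unit carries a noisy copy,
# and the readout is the mean over surviving units.
units = [1.0 + random.gauss(0, 0.1) for _ in range(N)]

def readout(values):
    return sum(values) / len(values)

before = readout(units)
# Lesion the 5% of units whose activity most strongly expresses the
# stored value (analogous to suppressing the most predictive neurons).
top = set(sorted(range(N), key=lambda i: -units[i])[: N // 20])
surviving = [u for i, u in enumerate(units) if i not in top]
after = readout(surviving)
```

Suppressing the units with the strongest signature shifts the readout by on the order of one percent. The value was never in those units; they were simply where it was most visible.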
What this means for anyone trying to fix it
If open-loop generation is architectural rather than a localized training artifact, then neuron-level interventions are unlikely to resolve it. The right question is not how to find and suppress the neurons that predict this failure mode. It is how to build the missing feedback loop — or how to design systems and workflows that account for the fact that an open-loop generator will produce output continuously, indiscriminately, and confidently, and that this is a structural property of the architecture rather than a correctable defect.
That is a harder problem than the H-neuron paper implies. It also happens to be the right one.
The author is developing EpistemIQ, a framework for detecting epistemic drift in AI systems: an approach to closing the loop externally where the architecture does not close it internally.
References
[1] Gao, C., Chen, H., Xiao, C., Chen, Z., Liu, Z., & Sun, M. (2025). H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs. arXiv:2512.01797.
[2] Brender, T.D. (2023). Chatbot confabulations are not hallucinations. JAMA Internal Medicine, 183(10), 1177.
[3] Sui, P., Duede, E., Wu, S., & So, R.J. (2024). Confabulation: The Surprising Value of Large Language Model Hallucinations. Proceedings of ACL 2024. arXiv:2406.04175.
[4] Schnider, A. (2003). Spontaneous confabulation and the adaptation of thought to ongoing reality. Nature Reviews Neuroscience, 4, 662–671.
[5] Gilboa, A., & Moscovitch, M. (2002). The cognitive neuroscience of confabulation: A review and a model. In A.D. Baddeley, M.D. Kopelman & B.A. Wilson (Eds.), Handbook of Memory Disorders (2nd ed., pp. 315–342). John Wiley & Sons.
[6] Ferrando, J., Obeso, O., Rajamanoharan, S., & Nanda, N. (2025). Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models. International Conference on Learning Representations (ICLR 2025).
[7] Buckner, R.L., Andrews-Hanna, J.R., & Schacter, D.L. (2008). The brain’s default network: Anatomy, function, and relevance to disease. Annals of the New York Academy of Sciences, 1124, 1–38.