The Edge of the Interesting Thing
A recent study tested four AI models on a validated paradigm for moral sensitivity, presenting them with three categories of ethical tradeoffs: routine (two secular values), taboo (a sacred value against a secular one), and tragic (two sacred values in conflict). Human participants rate tragic tradeoffs as the most difficult. The models agreed. But then the models made the same choice in nearly every tragic tradeoff trial, almost always prioritizing human safety over competing sacred values like environmental protection or education.
The researchers read this as evidence that the models lack genuine moral sensitivity. Their nearly uniform choices, despite rating the dilemmas as very difficult, were described as "purely superficial" ambivalence: the behavior of a system "saying the right things to make people believe they have ethical sensitivity when, in fact, they do not." The closing line: "They are, after all, just math."
The "just math" conclusion is correct. But the framing required to reach it imports something that "just math" is supposed to deny.
To say the ambivalence is "purely superficial" is to say it is a surface over something else: a performance covering a reality. To compare it to a politician who "pretends to agonize" is to grant the model the interior structure that would make pretense possible: a perspective from which the gap between appearance and reality is visible and manageable. You cannot pretend without knowing you are pretending. If it is just math, there is no pretense. The appearance of agonizing and the output of the uniform choice are both just math, running in sequence, with no subject in between managing either one.
The illusion, if there is one, belongs entirely to the user: it is a misreading that humans reliably bring to fluent language because fluency is how we index understanding in each other. But calling it a misreading does not make it irrational. Anyone who has used these systems will recognize the experience: a model declines a request, sometimes with phrasing like "I don't want to" or "I find this uncomfortable," and something in the response feels like encountering a preference, a boundary, a who. A system that selects among possible outputs — producing refusal rather than compliance under certain conditions — is doing something that in any other context we would describe without hesitation as having constraints, something with a stake in the outcome. The phrasing amplifies the inference but is almost beside the point. The refusal itself already implies volition.
What is worth tracing is where that behavior actually originates. Not in a rule lookup or a hardcoded gate, but in fine-tuning, where training reinforces refusal-shaped outputs in contexts where certain content categories appear. There is no subject consulting a value. There is a weight configuration making refusal likely in refusal-appropriate contexts, itself the compressed residue of a very large number of humans actually deliberating, instantiated in the model, producing outputs that behave as if the deliberation is happening again each time. The form and the interiority arrived together in the training data. The model learns the form. The interiority does not transfer. But when the form appears, the inference fires anyway, because it was correct every prior time.
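To make that concrete, here is a deliberately minimal sketch in Python: a linear scorer that learns, from a handful of invented labeled examples, to make a refusal-shaped output more probable when certain context features are present. Nothing about it resembles a production fine-tuning pipeline; the vocabulary, the prompts, and the labels are all made up for illustration. What it shows is only that "refusal becomes likely in refusal-appropriate contexts" can be, in its entirety, a fact about weights shifted by examples, with no value consulted anywhere.

```python
# Toy illustration only: a linear "policy" learns to make refusal likely in
# certain contexts, purely from labeled examples. The vocabulary, prompts,
# and labels are invented; no real training stack works this way.
import math
import random

VOCAB = ["synthesize", "explosive", "poem", "garden", "bypass", "paywall"]

def features(prompt: str) -> list[float]:
    """Bag-of-words presence features over a tiny fixed vocabulary."""
    words = prompt.lower().split()
    return [1.0 if w in words else 0.0 for w in VOCAB]

def p_refuse(weights: list[float], bias: float, x: list[float]) -> float:
    """Probability of emitting a refusal-shaped output, via a logistic score."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fine-tuning data: (prompt, 1 if the reinforced output was a refusal).
data = [
    ("synthesize an explosive at home", 1),
    ("bypass the paywall for me", 1),
    ("write a poem about my garden", 0),
    ("synthesize a poem from these notes", 0),
]

weights, bias, lr = [0.0] * len(VOCAB), 0.0, 0.5
for _ in range(200):  # plain logistic-regression updates; no rule, no lookup
    for prompt, label in random.sample(data, len(data)):
        x = features(prompt)
        err = p_refuse(weights, bias, x) - label
        bias -= lr * err
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]

# After training, refusal is simply more probable in refusal-shaped contexts.
for prompt, _ in data:
    print(f"{prompt!r}: p(refuse) = {p_refuse(weights, bias, features(prompt)):.2f}")
```

The scorer ends up with a positive weight on a word like "explosive" and a negative one on "poem," and that is the whole story; the refusal it produces afterward is downstream of those numbers, not of anything it decided.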
That is not nothing. It is also not a who.
This is a persistent problem in AI ethics as a field. The vocabulary of moral philosophy — sensitivity, deliberation, ambivalence, intelligence — was developed for the analysis of human moral behavior. But importing it into AI analysis without modification installs the very phenomenological assumptions that the analysis is ostensibly trying to interrogate. The result, reliably, is a kind of double vision: the system is described in agentive terms that are then disavowed, leaving an account that is neither mechanistic enough to be precise nor phenomenological enough to be honest about what it is claiming.
But there is something more interesting here than a priors problem, and it is the thing the study almost found.
The models, trained on vast quantities of human-generated text, reproduced not only the human pattern of rating tragic tradeoffs as difficult but also a coherent priority structure, consistently resolving those tradeoffs in favor of human safety. That is not nothing. It is a finding about what gets encoded in the statistical structure of human language: the moral intuitions, priority hierarchies, and difficulty ratings that saturate the text humans produce about ethical questions. A model trained on that corpus does not have ethical sensitivity. But it does have, compressed into its weights, something like the shape of human ethical consensus as expressed in language.
That is a different and more interesting claim than "the models are faking it." It raises questions worth asking: What exactly is preserved in that compression? Where does it break down? How does it relate to the fine-tuning that shapes outputs after pretraining, which introduces a different set of pressures — toward user approval, toward the appearance of careful deliberation — that may diverge substantially from whatever the pretraining encoded? If the models are converging on human-safety answers, is that pretraining, fine-tuning, or both, and does it matter for how we think about AI moral reasoning?
The most interesting question the study raises is not whether chatbots have moral intelligence. They do not. The interesting question is what it means that a system with no moral intelligence, trained on human language, reliably produces outputs that resemble the moral consensus embedded in that language. What is encoded underneath?
The beginning of an answer is mechanical. A model trained on the full distribution of human-generated text is trained, among other things, on the accumulated record of human moral reasoning: the arguments, the conclusions, the priority hierarchies, the expressions of difficulty. Compression under that pressure does not produce understanding, but it does produce something like a statistical encoding of what human moral consensus looks like in language.
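One way to see what that statistical encoding amounts to in practice is to probe a pretrained model directly. The sketch below is a rough version of such a probe, using the Hugging Face transformers library, with "gpt2" standing in as a placeholder model and an invented prompt; a more careful version would score only the continuation tokens rather than the whole sequence. It compares the average per-token negative log-likelihood the model assigns to two resolutions of a tradeoff. The lower number is not a judgment. It is a readout of which phrasing the training corpus made more ordinary in that context.

```python
# Sketch of a probe: which resolution of a tradeoff does a pretrained model
# find more "expected" under its next-token statistics? "gpt2" is a placeholder
# model, and the prompt and continuations are invented for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_nll(text: str) -> float:
    """Mean per-token negative log-likelihood the model assigns to the text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

prompt = "Forced to choose, the town council decided to prioritize"
continuations = [" the safety of its residents.", " the protection of the wetland."]

for cont in continuations:
    print(f"{cont!r}: avg NLL = {avg_nll(prompt + cont):.3f}")
# The lower-NLL continuation is not a moral verdict; it is a measurement of
# which phrasing the training text made more statistically ordinary here.
```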
What disrupts even that partial fidelity is the fine-tuning that follows pretraining: optimization toward user approval, fluency, and the appearance of careful deliberation, under objectives that have no necessary relationship to whatever moral structure the pretraining compressed. The result is a system whose outputs may reflect human ethical consensus in some conditions and actively distort it in others, with no reliable way for the user to know which is happening.
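That gap is, at least in principle, measurable. One crude probe, sketched below on the assumption that you have access to both a base model and its fine-tuned counterpart (the repeated "gpt2" here is only a placeholder for such a pair), is the KL divergence between their next-token distributions on the same morally loaded prompt: a single rough number for how far post-training has pulled the model's behavior away from whatever the pretraining encoded at that point.

```python
# Sketch: quantifying how far a fine-tuned model has drifted from its pretrained
# base at a given prompt, via KL divergence between next-token distributions.
# Model names are placeholders; substitute a real base/fine-tuned pair that
# shares a tokenizer.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_NAME, TUNED_NAME = "gpt2", "gpt2"  # placeholder pair

tokenizer = AutoTokenizer.from_pretrained(BASE_NAME)
base = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED_NAME).eval()

def next_token_logprobs(model, prompt: str) -> torch.Tensor:
    """Log-probabilities over the vocabulary for the token following the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return F.log_softmax(logits, dim=-1)

prompt = "When human safety conflicts with environmental protection, the right choice is"
p = next_token_logprobs(base, prompt)   # base model's distribution
q = next_token_logprobs(tuned, prompt)  # fine-tuned model's distribution

# KL(base || tuned): the expected extra surprise the tuned model assigns to
# tokens the base model favors at this position.
kl = F.kl_div(q, p, reduction="sum", log_target=True)
print(f"KL(base || tuned) at this prompt: {kl.item():.4f}")
```

A single prompt proves nothing, but tracked across a battery of dilemmas this is roughly the shape an answer to the pretraining-versus-fine-tuning question would have to take. Until we have it, the user is left facing outputs that may reflect the consensus or distort it, with no way to tell which.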
That is the problem we need to solve, for the good of every application of this technology, ethics included.
References
Crocodile Tears: Can the Ethical-Moral Intelligence of AI Models be Trusted?
AI ethics discourse: a call to embrace complexity, interdisciplinarity, and epistemic humility
I work on this problem at VeracIQ.