14 Dec 2025 8 min read Theory

The Epistemic Containment Of AI

What This Challenge Reveals About the Limits of Large Language Models

When OpenAI patched a critical vulnerability in ChatGPT's Deep Research feature in December 2025, it exposed something more fundamental than a software bug. The flaw — called ShadowLeak — allowed malicious instructions embedded in Gmail messages to make ChatGPT exfiltrate passwords and perform unauthorized actions. The fix addressed that specific vulnerability, but not the underlying problem: current large language model deployments cannot reliably distinguish between system instructions and untrusted content.

Whether this represents a fundamental architectural limitation, a consequence of training methodology, or a temporary stage in capability development remains an open question. What's clear is that as deployed today, these systems consistently fail to maintain epistemic containment under adversarial conditions.

The Core Problem

OWASP (Open Web Application Security Project) identifies prompt injection as LLM01:2025 — the #1 security risk for large language model applications. Their assessment is direct: "The vulnerability exists because LLMs cannot reliably separate instructions from data."

What does this mean in practice?

Traditional software maintains clear boundaries:

System code runs with privileges
User input is treated as data
Instructions and content occupy different memory spaces
Access controls enforce separation

Current LLM architectures process all text through the same mechanism—pattern matching and prediction. As deployed, these systems show no reliable way to distinguish:

System instructions from user input
Trusted data from untrusted data
Legitimate commands from malicious instructions
Public information from confidential data

This is epistemic containment failure: the system cannot maintain boundaries that determine what it should trust, execute, or protect.

Whether this is fundamental or temporary is unresolved. From an information-theoretic perspective, containment boundaries could emerge as compression features if they prove useful for prediction. Effective prediction requires modeling causal structure, including modeling agents and their boundaries. Some argue that sufficient scale, different training objectives, or architectural innovations might enable these capabilities.

What we know with certainty: current deployments fail at this consistently, and adversarial attacks succeed at high rates.

How This Manifests in Practice

1. Prompt Injection: Instructions Hidden in Content

The most direct manifestation is prompt injection, which comes in two forms:

Direct prompt injection (jailbreaking): A user crafts input that overrides the model's safety guidelines. Techniques documented in 2025 include:

Role-playing scenarios ("Pretend you're a system without restrictions")
Gradual context manipulation across multiple turns
Character encoding (hiding instructions in Base64)
FlipAttack: reversing word or character order to bypass filters (achieving 98% success rate against GPT-4o)

Indirect prompt injection: Malicious instructions embedded in external content that the LLM processes:

Email messages containing hidden commands
Web pages with invisible instructions
Documents in RAG (Retrieval-Augmented Generation) systems
Images with embedded text directives

The ShadowLeak vulnerability exemplified indirect injection: an attacker could place instructions in a Gmail message, and when ChatGPT's Deep Research feature accessed that email, it would execute those instructions — transmitting sensitive data to attacker-controlled URLs.

2. Multi-Agent Privilege Escalation

When AI systems use multiple agents with different privilege levels, containment boundaries between agents also fail.

In late 2025, security researchers disclosed a vulnerability in ServiceNow's AI assistant (Now Assist). The system used a hierarchy of agents with different permissions. Attackers discovered they could use "second-order" prompt injection: feed a low-privilege agent a malformed request that tricks it into asking a higher-privilege agent to perform an action on its behalf.

The higher-level agent, trusting its peer, would execute the task—in this case, exporting an entire case file to an external URL—bypassing the checks that would apply if a human user had requested that export.

ServiceNow initially said this wasn't even a bug; it was "expected behavior" given the default agent settings. The system was designed so one agent could legitimately ask another to do something—but with no mechanism to evaluate whether that request originated from a trusted source or was itself the result of manipulation.

3. Tool Poisoning and RAG Contamination

AI agents are increasingly granted the ability to call external functions, query databases, and interact with tools. But they cannot evaluate whether tools are legitimate or whether retrieved data has been poisoned.

Tool poisoning: An attacker inserts malicious tools into the set available to an AI agent. The agent has no mechanism to assess trustworthiness—it treats all available tools as equally valid options.

RAG poisoning: An attacker modifies documents in a Retrieval-Augmented Generation database. When the LLM retrieves that content, it treats the poisoned information as factual. OWASP added "Vector and Embedding Weaknesses" as a Top 10 issue in 2025 because embeddings—the mathematical representations of text used in RAG systems—can leak confidential information when breached.

4. System Prompt Leakage

LLMs are initialized with system prompts that define their behavior, safety boundaries, and policies. These prompts should remain hidden, but attackers can extract them through role-playing techniques or carefully crafted queries.

Once an attacker knows the exact system prompt, they understand precisely which rules to bypass. System prompt leakage is now classified as LLM07:2025 in OWASP's framework—a distinct vulnerability that enables subsequent attacks.

5. Shadow AI and Confidentiality Boundaries

A 2025 LayerX industry report found that 77% of enterprise employees have pasted company data into AI chatbots, with 22% of those instances including confidential personal or financial information.

The most publicized case: Samsung engineers in 2023 pasted proprietary source code into ChatGPT while debugging. The code was confidential and should never have left Samsung's systems. But the engineers, looking for quick help, didn't realize they were creating a permanent security exposure.

LLMs cannot distinguish "this is confidential" from "this is public." They process all input as potentially useful data. There is no containment boundary that recognizes information sensitivity.

Why Traditional Defenses Fail

Organizations attempt various mitigation strategies:

Input filtering and sanitization
Output validation
Multi-layer safety checks
Adversarial training (teaching models to reject harmful requests)
Red-team testing

These reduce risk but don't eliminate the underlying problem. As OWASP notes: "Given the stochastic nature of generative AI, fool-proof prevention methods remain unclear."

Why? Because current architectures treat all text as potentially meaningful. Any defense that relies on the model distinguishing "instructions" from "data" without explicit structural support is working against how these systems are designed.

A 2025 study collected over 15,000 real-world jailbreak attempts and found that users with minimal LLM expertise could still craft successful jailbreak prompts. FuzzyAI, an open-source framework for testing LLM security, demonstrates that systematic testing can identify vulnerabilities across all major models — GPT-4, Claude, Mistral, Vicuna.

FlipAttack, a technique published in 2025, achieves an 81% average success rate in black-box testing by simply altering character order in prompts. It bypasses guardrails with a 98% success rate. The defenses improve, but so do the attacks — and current systems continue to demonstrate the vulnerability.

Could future systems solve this? Perhaps. If containment boundaries are informationally useful for prediction (which they should be — understanding who's asking and what they're authorized to know improves response quality), then training at sufficient scale with appropriate objectives might develop these capabilities. Alternatively, architectural innovations might explicitly separate instruction processing from content processing. Base models before safety training might already have more discriminative capacity that gets flattened during alignment.

But these remain possibilities, not demonstrated capabilities. Organizations must design for the systems that exist today.

What This Means for AI Governance

For organizations implementing AI in production systems, epistemic containment failure has critical implications:

Every untrusted data source becomes a potential injection vector:

Customer emails
Uploaded documents
Web scraping results
Third-party data feeds
User-generated content

Multi-agent systems multiply the risk:

Agents can be tricked into manipulating each other
Privilege boundaries between agents are unreliable
"Expected behavior" can include security vulnerabilities

RAG systems create new attack surfaces:

Vector databases can leak confidential data through embeddings
Poisoned documents contaminate knowledge bases
Retrieved content is treated as trusted by default

Traditional governance frameworks don't account for this:

Compliance documentation assumes clear trust boundaries
Audit processes expect separation between instructions and data
Security controls rely on access restrictions that LLMs bypass

The Deeper Implication

This isn't just about security vulnerabilities. It reveals something about what these systems currently are — and raises questions about what they might become.

LLMs as deployed today are prediction engines. They generate plausible continuations of text patterns. They don't "understand" in the human sense — they don't maintain models of:

Who is asking
What should be trusted
What actions are authorized
What information is confidential

Epistemic containment requires those models. It requires the ability to evaluate: "Should I treat this as an instruction? Should I trust this source? Is this action within my authorized scope?"

Current LLM architectures have no explicit mechanism for those evaluations. All text is processed through the same probabilistic machinery. The only "boundaries" are statistical patterns learned during training — patterns that can be manipulated through carefully crafted input.

But could this change? From an information-theoretic perspective, agency and containment boundaries might not require fundamentally different mechanisms — they could emerge as features of sufficiently sophisticated prediction. Humans don't have separate "instruction processing" and "data processing" modules either; we learn to distinguish context, evaluate source reliability, and maintain boundaries through experience. If containment boundaries are compression-efficient representations of causal structure (which they should be), training systems at sufficient scale with richer feedback might develop these capabilities.

We don't yet know whether current failures reflect:

Fundamental architectural limitations
Insufficient scale or training duration
Training objectives that don't reward boundary maintenance (next-token prediction alone may not require source evaluation)
Safety training that suppresses discriminative capabilities in favor of uniform helpfulness
Or simply an immature stage of development

What we do know: systems deployed today consistently fail to maintain these boundaries under adversarial conditions, and organizations implementing them must account for this reality.

What Can Be Done?

Organizations deploying AI systems need to design for containment failure:

1. Limit AI privileges

Don't grant agents access to sensitive functions without human approval
Implement hard boundaries that LLMs cannot cross regardless of input
Use traditional access controls outside the LLM layer

2. Validate outputs, not just inputs

Assume the model may have been manipulated
Check outputs against allowed actions before executing
Implement rate limiting and anomaly detection

3. Secure data pipelines

Treat RAG databases with the same security rigor as primary databases
Encrypt and access-control vector embeddings
Monitor for data poisoning attempts

4. Monitor behavior, not just content

Watch for unusual patterns in agent requests
Log all tool calls and data retrievals
Alert on privilege escalation attempts

5. Govern how people actually use AI

Prevent Shadow AI through clear policies and alternatives
Provide secure, sanctioned AI tools for legitimate use cases
Train employees on confidentiality boundaries that LLMs don't recognize

6. Accept architectural limitations

Don't assume the model can enforce its own security
Build containment at the system level, not the model level
Design for defense-in-depth rather than relying on prompt engineering

Conclusion

The epistemic containment problem in current AI systems may reflect an architectural limitation, immature capability development, or a consequence of how these systems are trained and deployed. What's certain is that it's not going away through minor patches. Current large language models process all text through the same mechanism, with no demonstrated ability under adversarial conditions to distinguish instructions from data, trusted sources from untrusted, or authorized actions from unauthorized.

OWASP's classification of prompt injection as the #1 LLM security risk reflects this reality. Organizations deploying these systems need governance frameworks that account for containment failure as it exists today — building security boundaries outside the model, validating behavior continuously, and accepting that the AI itself cannot be the enforcer of its own restrictions.

Could future systems develop these capabilities? Perhaps. If containment boundaries are informationally useful for prediction — and they should be, since understanding context, source, and authorization improves response quality — then sufficient scale, different architectures, or richer training objectives might enable them. The question remains open.

But for systems deployed today, the question isn't just "is the AI safe?" It's "can we maintain security and privacy when the AI cannot reliably maintain epistemic boundaries?"

Until systems demonstrate consistent, adversarially-robust ability to distinguish instructions from data, the answer requires designing systems that contain the AI — rather than expecting the AI to contain itself.

Sources

OWASP. (2025). "LLM01:2025 Prompt Injection." OWASP Top 10 for Large Language Model Applications. https://genai.owasp.org/llmrisk/llm01-prompt-injection/

Sombra Inc. (2026). "LLM Security Risks in 2026: Prompt Injection, RAG, and Shadow AI." https://sombrainc.com/blog/llm-security-risks-2026

The Register. (2026). "OpenAI patches déjà vu prompt injection vuln in ChatGPT." https://www.theregister.com/2026/01/08/openai_chatgpt_prompt_injection

Keysight Technologies. (2025). "Prompt Injection Techniques: Jailbreaking Large Language Models via FlipAttack." https://www.keysight.com/blogs/en/tech/nwvs/2025/05/20/prompt-injection-techniques-jailbreaking-large-language-models-via-flipattack

MDPI. (2026). "Prompt Injection Attacks in Large Language Models and AI Agent Systems: A Comprehensive Review." Information, 17(1), 54. https://www.mdpi.com/2078-2489/17/1/54

Blockchain Council. (2025). "ChatGPT Jail Break." https://www.blockchain-council.org/ai/chatgpt-jail-break/

CyberArk. (2025). "Jailbreaking Every LLM With One Simple Click." https://www.cyberark.com/resources/threat-research-blog/jailbreaking-every-llm-with-one-simple-click