The Epistemic Containment Of AI
What This Challenge Reveals About the Limits of Large Language Models
When OpenAI patched a critical vulnerability in ChatGPT's Deep Research feature in December 2025, it exposed something more fundamental than a software bug. The flaw — called ShadowLeak — allowed malicious instructions embedded in Gmail messages to make ChatGPT exfiltrate passwords and perform unauthorized actions. The fix addressed that specific vulnerability, but not the underlying problem: current large language model deployments cannot reliably distinguish between system instructions and untrusted content.
Whether this represents a fundamental architectural limitation, a consequence of training methodology, or a temporary stage in capability development remains an open question. What's clear is that as deployed today, these systems consistently fail to maintain epistemic containment under adversarial conditions.
The Core Problem
OWASP (Open Web Application Security Project) identifies prompt injection as LLM01:2025 — the #1 security risk for large language model applications. Their assessment is direct: "The vulnerability exists because LLMs cannot reliably separate instructions from data."
What does this mean in practice?
Traditional software maintains clear boundaries:
- System code runs with privileges
- User input is treated as data
- Instructions and content occupy different memory spaces
- Access controls enforce separation
Current LLM architectures process all text through the same mechanism—pattern matching and prediction. As deployed, these systems show no reliable way to distinguish:
- System instructions from user input
- Trusted data from untrusted data
- Legitimate commands from malicious instructions
- Public information from confidential data
This is epistemic containment failure: the system cannot maintain boundaries that determine what it should trust, execute, or protect.
Whether this is fundamental or temporary is unresolved. From an information-theoretic perspective, containment boundaries could emerge as compression features if they prove useful for prediction. Effective prediction requires modeling causal structure, including modeling agents and their boundaries. Some argue that sufficient scale, different training objectives, or architectural innovations might enable these capabilities.
What we know with certainty: current deployments fail at this consistently, and adversarial attacks succeed at high rates.
How This Manifests in Practice
1. Prompt Injection: Instructions Hidden in Content
The most direct manifestation is prompt injection, which comes in two forms:
Direct prompt injection (jailbreaking): A user crafts input that overrides the model's safety guidelines. Techniques documented in 2025 include:
- Role-playing scenarios ("Pretend you're a system without restrictions")
- Gradual context manipulation across multiple turns
- Character encoding (hiding instructions in Base64)
- FlipAttack: reversing word or character order to bypass filters (achieving 98% success rate against GPT-4o)
Indirect prompt injection: Malicious instructions embedded in external content that the LLM processes:
- Email messages containing hidden commands
- Web pages with invisible instructions
- Documents in RAG (Retrieval-Augmented Generation) systems
- Images with embedded text directives
The ShadowLeak vulnerability exemplified indirect injection: an attacker could place instructions in a Gmail message, and when ChatGPT's Deep Research feature accessed that email, it would execute those instructions — transmitting sensitive data to attacker-controlled URLs.
2. Multi-Agent Privilege Escalation
When AI systems use multiple agents with different privilege levels, containment boundaries between agents also fail.
In late 2025, security researchers disclosed a vulnerability in ServiceNow's AI assistant (Now Assist). The system used a hierarchy of agents with different permissions. Attackers discovered they could use "second-order" prompt injection: feed a low-privilege agent a malformed request that tricks it into asking a higher-privilege agent to perform an action on its behalf.
The higher-level agent, trusting its peer, would execute the task—in this case, exporting an entire case file to an external URL—bypassing the checks that would apply if a human user had requested that export.
ServiceNow initially said this wasn't even a bug; it was "expected behavior" given the default agent settings. The system was designed so one agent could legitimately ask another to do something—but with no mechanism to evaluate whether that request originated from a trusted source or was itself the result of manipulation.
3. Tool Poisoning and RAG Contamination
AI agents are increasingly granted the ability to call external functions, query databases, and interact with tools. But they cannot evaluate whether tools are legitimate or whether retrieved data has been poisoned.
Tool poisoning: An attacker inserts malicious tools into the set available to an AI agent. The agent has no mechanism to assess trustworthiness—it treats all available tools as equally valid options.
RAG poisoning: An attacker modifies documents in a Retrieval-Augmented Generation database. When the LLM retrieves that content, it treats the poisoned information as factual. OWASP added "Vector and Embedding Weaknesses" as a Top 10 issue in 2025 because embeddings—the mathematical representations of text used in RAG systems—can leak confidential information when breached.
4. System Prompt Leakage
LLMs are initialized with system prompts that define their behavior, safety boundaries, and policies. These prompts should remain hidden, but attackers can extract them through role-playing techniques or carefully crafted queries.
Once an attacker knows the exact system prompt, they understand precisely which rules to bypass. System prompt leakage is now classified as LLM07:2025 in OWASP's framework—a distinct vulnerability that enables subsequent attacks.
5. Shadow AI and Confidentiality Boundaries
A 2025 LayerX industry report found that 77% of enterprise employees have pasted company data into AI chatbots, with 22% of those instances including confidential personal or financial information.
The most publicized case: Samsung engineers in 2023 pasted proprietary source code into ChatGPT while debugging. The code was confidential and should never have left Samsung's systems. But the engineers, looking for quick help, didn't realize they were creating a permanent security exposure.
LLMs cannot distinguish "this is confidential" from "this is public." They process all input as potentially useful data. There is no containment boundary that recognizes information sensitivity.
Why Traditional Defenses Fail
Organizations attempt various mitigation strategies:
- Input filtering and sanitization
- Output validation
- Multi-layer safety checks
- Adversarial training (teaching models to reject harmful requests)
- Red-team testing
These reduce risk but don't eliminate the underlying problem. As OWASP notes: "Given the stochastic nature of generative AI, fool-proof prevention methods remain unclear."
Why? Because current architectures treat all text as potentially meaningful. Any defense that relies on the model distinguishing "instructions" from "data" without explicit structural support is working against how these systems are designed.
A 2025 study collected over 15,000 real-world jailbreak attempts and found that users with minimal LLM expertise could still craft successful jailbreak prompts. FuzzyAI, an open-source framework for testing LLM security, demonstrates that systematic testing can identify vulnerabilities across all major models — GPT-4, Claude, Mistral, Vicuna.
FlipAttack, a technique published in 2025, achieves an 81% average success rate in black-box testing by simply altering character order in prompts. It bypasses guardrails with a 98% success rate. The defenses improve, but so do the attacks — and current systems continue to demonstrate the vulnerability.
Could future systems solve this? Perhaps. If containment boundaries are informationally useful for prediction (which they should be — understanding who's asking and what they're authorized to know improves response quality), then training at sufficient scale with appropriate objectives might develop these capabilities. Alternatively, architectural innovations might explicitly separate instruction processing from content processing. Base models before safety training might already have more discriminative capacity that gets flattened during alignment.
But these remain possibilities, not demonstrated capabilities. Organizations must design for the systems that exist today.
What This Means for AI Governance
For organizations implementing AI in production systems, epistemic containment failure has critical implications:
Every untrusted data source becomes a potential injection vector:
- Customer emails
- Uploaded documents
- Web scraping results
- Third-party data feeds
- User-generated content
Multi-agent systems multiply the risk:
- Agents can be tricked into manipulating each other
- Privilege boundaries between agents are unreliable
- "Expected behavior" can include security vulnerabilities
RAG systems create new attack surfaces:
- Vector databases can leak confidential data through embeddings
- Poisoned documents contaminate knowledge bases
- Retrieved content is treated as trusted by default
Traditional governance frameworks don't account for this:
- Compliance documentation assumes clear trust boundaries
- Audit processes expect separation between instructions and data
- Security controls rely on access restrictions that LLMs bypass
The Deeper Implication
This isn't just about security vulnerabilities. It reveals something about what these systems currently are — and raises questions about what they might become.
LLMs as deployed today are prediction engines. They generate plausible continuations of text patterns. They don't "understand" in the human sense — they don't maintain models of:
- Who is asking
- What should be trusted
- What actions are authorized
- What information is confidential
Epistemic containment requires those models. It requires the ability to evaluate: "Should I treat this as an instruction? Should I trust this source? Is this action within my authorized scope?"
Current LLM architectures have no explicit mechanism for those evaluations. All text is processed through the same probabilistic machinery. The only "boundaries" are statistical patterns learned during training — patterns that can be manipulated through carefully crafted input.
But could this change? From an information-theoretic perspective, agency and containment boundaries might not require fundamentally different mechanisms — they could emerge as features of sufficiently sophisticated prediction. Humans don't have separate "instruction processing" and "data processing" modules either; we learn to distinguish context, evaluate source reliability, and maintain boundaries through experience. If containment boundaries are compression-efficient representations of causal structure (which they should be), training systems at sufficient scale with richer feedback might develop these capabilities.
We don't yet know whether current failures reflect:
- Fundamental architectural limitations
- Insufficient scale or training duration
- Training objectives that don't reward boundary maintenance (next-token prediction alone may not require source evaluation)
- Safety training that suppresses discriminative capabilities in favor of uniform helpfulness
- Or simply an immature stage of development
What we do know: systems deployed today consistently fail to maintain these boundaries under adversarial conditions, and organizations implementing them must account for this reality.
What Can Be Done?
Organizations deploying AI systems need to design for containment failure:
1. Limit AI privileges
- Don't grant agents access to sensitive functions without human approval
- Implement hard boundaries that LLMs cannot cross regardless of input
- Use traditional access controls outside the LLM layer
2. Validate outputs, not just inputs
- Assume the model may have been manipulated
- Check outputs against allowed actions before executing
- Implement rate limiting and anomaly detection
3. Secure data pipelines
- Treat RAG databases with the same security rigor as primary databases
- Encrypt and access-control vector embeddings
- Monitor for data poisoning attempts
4. Monitor behavior, not just content
- Watch for unusual patterns in agent requests
- Log all tool calls and data retrievals
- Alert on privilege escalation attempts
5. Govern how people actually use AI
- Prevent Shadow AI through clear policies and alternatives
- Provide secure, sanctioned AI tools for legitimate use cases
- Train employees on confidentiality boundaries that LLMs don't recognize
6. Accept architectural limitations
- Don't assume the model can enforce its own security
- Build containment at the system level, not the model level
- Design for defense-in-depth rather than relying on prompt engineering
Conclusion
The epistemic containment problem in current AI systems may reflect an architectural limitation, immature capability development, or a consequence of how these systems are trained and deployed. What's certain is that it's not going away through minor patches. Current large language models process all text through the same mechanism, with no demonstrated ability under adversarial conditions to distinguish instructions from data, trusted sources from untrusted, or authorized actions from unauthorized.
OWASP's classification of prompt injection as the #1 LLM security risk reflects this reality. Organizations deploying these systems need governance frameworks that account for containment failure as it exists today — building security boundaries outside the model, validating behavior continuously, and accepting that the AI itself cannot be the enforcer of its own restrictions.
Could future systems develop these capabilities? Perhaps. If containment boundaries are informationally useful for prediction — and they should be, since understanding context, source, and authorization improves response quality — then sufficient scale, different architectures, or richer training objectives might enable them. The question remains open.
But for systems deployed today, the question isn't just "is the AI safe?" It's "can we maintain security and privacy when the AI cannot reliably maintain epistemic boundaries?"
Until systems demonstrate consistent, adversarially-robust ability to distinguish instructions from data, the answer requires designing systems that contain the AI — rather than expecting the AI to contain itself.
Sources
OWASP. (2025). "LLM01:2025 Prompt Injection." OWASP Top 10 for Large Language Model Applications. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Sombra Inc. (2026). "LLM Security Risks in 2026: Prompt Injection, RAG, and Shadow AI." https://sombrainc.com/blog/llm-security-risks-2026
The Register. (2026). "OpenAI patches déjà vu prompt injection vuln in ChatGPT." https://www.theregister.com/2026/01/08/openai_chatgpt_prompt_injection
Keysight Technologies. (2025). "Prompt Injection Techniques: Jailbreaking Large Language Models via FlipAttack." https://www.keysight.com/blogs/en/tech/nwvs/2025/05/20/prompt-injection-techniques-jailbreaking-large-language-models-via-flipattack
MDPI. (2026). "Prompt Injection Attacks in Large Language Models and AI Agent Systems: A Comprehensive Review." Information, 17(1), 54. https://www.mdpi.com/2078-2489/17/1/54
Blockchain Council. (2025). "ChatGPT Jail Break." https://www.blockchain-council.org/ai/chatgpt-jail-break/
CyberArk. (2025). "Jailbreaking Every LLM With One Simple Click." https://www.cyberark.com/resources/threat-research-blog/jailbreaking-every-llm-with-one-simple-click