Prompt Injection, End of 2025: Progress, Without the Self-Deception
Agentic AI will reward organizations that are honest about the risks they are taking — and intentional about where they are willing to take them.
Frontier labs now report sub-1% attacker success rates against synthetic prompt-injection tests. Anthropic’s recent work on automated defenses is a good example, showing strong results against model-generated attacks in controlled settings (Prompt Injection Defenses). That reflects real work and real progress.
It does not reflect how these systems fail in the wild.
Human red-teamers continue to reliably compromise agentic systems via indirect prompt injection: instructions embedded in web pages, documents, tool outputs, and long context. Public Arena-style testing, such as the ongoing results on the Gray Swan Arena leaderboard, shows repeated attacker success across models and across workflows involving browsing, tools, memory, and code execution. These are the same agent capabilities vendors are actively encouraging customers to deploy.
The incentives to surface these failures remain modest. Prize pools for high-quality indirect prompt-injection research are small compared to mature bug bounty programs for browsers, mobile platforms, or cloud infrastructure, despite comparable impact and far greater ambiguity. In security, sustained high bounties usually follow problems that are both severe and tractable. Here, severity is clear; tractability is not.
Vendors are candid about the difficulty. Both OpenAI and Anthropic consistently frame prompt injection, especially in agentic systems, as a hard problem with no clear path to elimination. Both evidently operate large-scale abuse monitoring and incident-response pipelines, but neither publishes prompt-injection-specific detection metrics, response SLAs, or guarantees. What exists looks closer to Trust & Safety operations than to an IDS or WAF analogue for agents.
Synthetic robustness metrics are improving. Treating them as a proxy for real-world risk is the mistake.
Constrain Agency, Not Adoption
The practical response is not to slow down agentic AI adoption, but to treat agency itself as a privileged capability, scoped deliberately by role.
Knowledge workers (office, ops, support)
AI is an assistant, not an actor. Chat, summarize, draft. No autonomous browsing, no tool chaining, no write access without explicit confirmation.
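As a rough illustration, here is a minimal sketch of that gate in Python. The tool names, the `run_tool` stub, and the `confirm` callback are hypothetical placeholders rather than any vendor's API; the point is simply that reads pass through, writes require an explicit human yes, and anything unrecognized is denied by default.

```python
from typing import Callable

# Hypothetical tool registry for a knowledge-worker assistant.
READ_ONLY_TOOLS = {"summarize_document", "draft_reply", "search_wiki"}
WRITE_TOOLS = {"send_email", "update_ticket"}

def run_tool(name: str, args: dict) -> dict:
    # Stub standing in for the real tool implementations.
    return {"status": "ok", "tool": name, "args": args}

def handle_tool_call(name: str, args: dict, confirm: Callable[[str], bool]) -> dict:
    """Gate every model-requested action: reads pass, writes need a human,
    unknown tools are denied outright."""
    if name in READ_ONLY_TOOLS:
        return run_tool(name, args)
    if name in WRITE_TOOLS:
        # Never execute a side-effecting action autonomously.
        if confirm(f"Assistant wants to run {name} with {args!r}. Allow?"):
            return run_tool(name, args)
        return {"status": "denied", "reason": "user declined"}
    return {"status": "denied", "reason": "tool not permitted for this role"}
```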
Engineers and analysts (e.g., Claude Code)
Enable agents, but sandbox them. Run in isolated environments, restrict access to secrets and control planes, default to read-only, log write actions, and reset context aggressively.
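A sketch of what that can look like for a coding agent's file tool, assuming a dedicated sandbox directory and a JSONL audit log (both paths are illustrative, not any product's defaults): reads are confined to the workspace and kept away from secret-bearing paths, writes are opt-in per call and always logged, and nothing escapes the sandbox.

```python
import json
import time
from pathlib import Path

# Illustrative sandbox policy; paths and the deny-list are assumptions.
WORKSPACE = Path("/sandbox/workspace").resolve()   # isolated checkout, no prod access
DENY_SUBSTRINGS = (".env", "credentials", ".ssh", ".aws")
AUDIT_LOG = Path("/sandbox/agent-audit.jsonl")

def _resolve(rel_path: str) -> Path:
    target = (WORKSPACE / rel_path).resolve()
    if not target.is_relative_to(WORKSPACE):
        raise PermissionError(f"path escapes sandbox: {rel_path}")
    if any(s in str(target) for s in DENY_SUBSTRINGS):
        raise PermissionError(f"secret-bearing path blocked: {rel_path}")
    return target

def read_file(rel_path: str) -> str:
    # Default capability: read-only inside the workspace.
    return _resolve(rel_path).read_text()

def write_file(rel_path: str, content: str, approved: bool = False) -> None:
    # Writes are opt-in per call and leave an audit trail, so reviewers
    # can replay exactly what the agent changed.
    if not approved:
        raise PermissionError("write requires explicit approval")
    target = _resolve(rel_path)
    with AUDIT_LOG.open("a") as log:
        log.write(json.dumps({"ts": time.time(), "action": "write",
                              "path": str(target)}) + "\n")
    target.write_text(content)
```

Context resets and secret isolation sit outside a snippet like this, at the environment level (ephemeral containers, no mounted credentials), which is where they belong.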
Executives
Allow analysis and briefing. Avoid inbox access, browsing agents, or persistent memory tied to identity or strategy. AI informs; humans decide.
Platform and automation owners
This is where autonomy belongs. Scope tools narrowly, use short-lived credentials, monitor actions rather than prompts, and assume injection will occur.
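One way to make that concrete, sketched below with illustrative scopes and TTLs: mint a credential that covers only the actions a given workflow needs and expires on its own, then judge every action against that scope at runtime instead of trying to detect malicious prompts.

```python
import secrets
import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ScopedToken:
    """Short-lived credential: narrow scopes, hard expiry."""
    value: str
    scopes: frozenset
    expires_at: float

    def allows(self, action: str) -> bool:
        return time.time() < self.expires_at and action in self.scopes

def issue_token(scopes: set, ttl_seconds: int = 900) -> ScopedToken:
    # A hijacked agent cannot escalate beyond these scopes or outlive the TTL.
    return ScopedToken(secrets.token_urlsafe(32), frozenset(scopes),
                       time.time() + ttl_seconds)

def alert(message: str) -> None:
    # Stub for the real paging / kill-switch hook.
    print(f"ALERT: {message}")

@dataclass
class ActionMonitor:
    """Judge what the agent does, not what it was prompted with."""
    max_actions: int = 50
    log: list = field(default_factory=list)

    def record(self, token: ScopedToken, action: str, target: str) -> bool:
        allowed = token.allows(action) and len(self.log) < self.max_actions
        self.log.append({"ts": time.time(), "action": action,
                         "target": target, "allowed": allowed})
        if not allowed:
            alert(f"blocked {action} on {target}")   # page someone, kill the run
        return allowed

# Usage: a deploy workflow gets exactly two actions for fifteen minutes.
token = issue_token({"read_repo", "open_pull_request"})
monitor = ActionMonitor()
monitor.record(token, "open_pull_request", "infra/config")   # allowed
monitor.record(token, "delete_branch", "main")                # blocked and alerted
```

The useful property is that the control is evaluated at the action boundary, so it still holds after prompt-level defenses have been bypassed.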
Conclusion
Organizations have seen this pattern before. Browsers were deployed before they were safe, then sandboxed. The move to cloud required new trust boundaries, isolation models, identity systems, monitoring, and compensating controls. AI agents follow the same trajectory.
Ultimately, this isn’t a tooling problem so much as a risk ownership problem. When systems are non-deterministic, adaptive, and capable of taking action, technical controls alone will never remove ambiguity. Progress depends on decision-makers being explicit about risk appetite — not in abstract terms, but in how much autonomy, persistence, and blast radius the organization is willing to accept.
The job then becomes composition, not elimination: combining protect, detect, respond, and adapt controls in a way that enables real upside while bounding downside. Crucially, that balance must reflect company risk appetite, not the optimism of tool builders or the caution of individual practitioners.

