<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Threat Prompt Newsletter]]></title><description><![CDATA[Get Daily AI Cybersecurity Tips]]></description><link>https://newsletter.threatprompt.com</link><image><url>https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png</url><title>The Threat Prompt Newsletter</title><link>https://newsletter.threatprompt.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 06 Apr 2026 04:36:40 GMT</lastBuildDate><atom:link href="https://newsletter.threatprompt.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Craig Balding]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[threatprompt@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[threatprompt@substack.com]]></itunes:email><itunes:name><![CDATA[Craig Balding]]></itunes:name></itunes:owner><itunes:author><![CDATA[Craig Balding]]></itunes:author><googleplay:owner><![CDATA[threatprompt@substack.com]]></googleplay:owner><googleplay:email><![CDATA[threatprompt@substack.com]]></googleplay:email><googleplay:author><![CDATA[Craig Balding]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[I'm building things again]]></title><description><![CDATA[SafeYolo and ShippingAgain.com]]></description><link>https://newsletter.threatprompt.com/p/im-building-things-again</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/im-building-things-again</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Mon, 30 Mar 2026 18:48:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been quiet on here for a while. Partly because I&#8217;ve been building things instead of writing about them &#8212; which is kind of the point of this email.</p><p>Over the past year or so, coding agents have completely changed how I work. I&#8217;m shipping things I&#8217;d have abandoned after a weekend of yak-shaving.</p><p>One of those is <a href="https://github.com/craigbalding/safeyolo">SafeYolo</a> &#8212; a human-centric safety layer for running coding agents.  Because I wanted the momentum agents give you, but with scoped control over what they can access.  I&#8217;ll write more about this soon.</p><p>But what I want to introduce you to today is <a href="https://shippingagain.com">ShippingAgain.com</a> &#8212; a forum I just launched for experienced tech people who&#8217;ve found a second wind building with coding agents.</p><p>Not just security people: devs, sysadmins, technical leaders, anyone with years of experience who&#8217;s discovered the leverage, and frankly, the joy that agents and domain knowledge can bring together.</p><p>It&#8217;s not an AI news aggregator or a hype forum. 
It&#8217;s a small, intentional place for sharing what you&#8217;re shipping, how you&#8217;re using agents, what&#8217;s failed, and what you wish the tools did better.</p><p>If that sounds like you, or someone you know, here&#8217;s a good place to start: <a href="https://shippingagain.com/d/4-whats-your-background-and-what-brought-you-here">What&#8217;s your background and what brought you here?</a></p><p>Craig</p>]]></content:encoded></item><item><title><![CDATA[Prompt Injection, End of 2025: Progress, Without the Self-Deception]]></title><description><![CDATA[Agentic AI will reward organizations that are honest about the risks they are taking &#8212; and intentional about where they are willing to take them.]]></description><link>https://newsletter.threatprompt.com/p/prompt-injection-end-of-2025-progress</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/prompt-injection-end-of-2025-progress</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Wed, 24 Dec 2025 10:25:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Agentic AI will reward organizations that are honest about the risks they are taking &#8212; and intentional about where they are willing to take them.</strong></p><p>Frontier labs now report sub-1% attacker success rates against <em>synthetic</em> prompt-injection tests. Anthropic&#8217;s recent work on automated defenses is a good example, showing strong results against model-generated attacks in controlled settings (<a href="https://www.anthropic.com/research/prompt-injection-defenses">Prompt Injection Defenses</a>). That reflects real work and real progress.</p><p>It does not reflect how these systems fail in the wild.</p><p>Human red-teamers continue to reliably compromise agentic systems via <strong>indirect prompt injection</strong>: instructions embedded in web pages, documents, tool outputs, and long context. Public Arena-style testing &#8212; for example the ongoing results in the <a href="https://app.grayswan.ai/arena/leaderboard/prized-challenges">Gray Swan Arena leaderboard</a> &#8212; shows repeated success across models and workflows involving browsing, tools, memory, and code execution. These are the same agent capabilities vendors are actively encouraging customers to deploy.</p><p>The incentives to surface these failures remain modest. Prize pools for high-quality indirect prompt-injection research are small compared to mature bug bounty programs for browsers, mobile platforms, or cloud infrastructure, despite comparable impact and far greater ambiguity. In security, sustained high bounties usually follow problems that are both severe and tractable. Here, severity is clear; tractability is not.</p><p>Vendors are candid about this. Both OpenAI and Anthropic consistently frame prompt injection &#8212; especially in agentic systems &#8212; as a hard problem, with no clear path to elimination. They clearly operate large-scale abuse monitoring and incident-response pipelines, but they do not publish prompt-injection-specific detection metrics, response SLAs, or guarantees. What exists looks closer to Trust &amp; Safety operations than to an IDS or WAF analogue for agents.</p><p>Synthetic robustness metrics are improving. 
Treating them as a proxy for real-world risk is the mistake.</p><div><hr></div><h2>Constrain Agency, Not Adoption</h2><p>The practical response is not to slow down agentic AI adoption, but to <strong>treat agency itself as a privileged capability</strong>, scoped deliberately by role.</p><p><strong>Knowledge workers (office, ops, support)</strong><br>AI is an assistant, not an actor. Chat, summarize, draft. No autonomous browsing, no tool chaining, no write access without explicit confirmation.</p><p><strong>Engineers and analysts (e.g. Claude Code)</strong><br>Enable agents, but sandbox them. Run in isolated environments, restrict access to secrets and control planes, default to read-only, log write actions, and reset context aggressively.</p><p><strong>Executives</strong><br>Allow analysis and briefing. Avoid inbox access, browsing agents, or persistent memory tied to identity or strategy. AI informs; humans decide.</p><p><strong>Platform and automation owners</strong><br>This is where autonomy belongs. Scope tools narrowly, use short-lived credentials, monitor actions rather than prompts, and assume injection will occur.</p><div><hr></div><h2>Conclusion</h2><p>Organizations have seen this pattern before. Browsers were deployed before they were safe, then sandboxed. The move to cloud required new trust boundaries, isolation models, identity systems, monitoring, and compensating controls. AI agents follow the same trajectory.</p><p>Ultimately, this isn&#8217;t a tooling problem so much as a <strong>risk ownership problem</strong>. When systems are non-deterministic, adaptive, and capable of taking action, technical controls alone will never remove ambiguity. Progress depends on decision-makers being explicit about risk appetite &#8212; not in abstract terms, but in how much autonomy, persistence, and blast radius the organization is willing to accept.</p><p>The job then becomes composition, not elimination: combining protect, detect, respond, and adapt controls in a way that enables real upside while bounding downside. Crucially, that balance must reflect <strong>company risk appetite</strong>, not the optimism of tool builders or the caution of individual practitioners.</p>]]></content:encoded></item><item><title><![CDATA[Overconfident by Design]]></title><description><![CDATA[When AI Outputs Mask Data Shortfalls]]></description><link>https://newsletter.threatprompt.com/p/overconfident-by-design</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/overconfident-by-design</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Sun, 23 Nov 2025 10:11:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been building a tool that uses an LLM to analyze social media data pulled from an API. It&#8217;s designed to retrieve up to 120 records per network, handle pagination, and only begin analysis once all records are collected. The system prompt includes clear instructions, tool-use examples, and specific logic for multi-page retrieval.</p><p>And yet, in practice, the model consistently produces confident, data-driven analysis based on partial input &#8212; without signalling it took a shortcut. No warning, no fallback. 
Just a clean summary built on an incomplete foundation.</p><div><hr></div><h2><strong>The Retrieval Path Was Explicit</strong></h2><p>This OpenAI Custom GPT pulls social data via Apify through an OpenAPI connector. The LLM is instructed to request up to 120 thoughtfully pruned JSON records, paginating as needed to avoid &#8220;response too large&#8221; errors from the custom GPT environment (OpenAI enforces a ~100K Action response limit). The prompt includes concrete examples of correct and incorrect behavior, with analysis explicitly gated behind full retrieval.</p><p>The schema supports all of this. The data returned by the tool includes counts, pagination cues, and clear signals when additional pages are available.</p><blockquote><p><em>&#8220;This isn&#8217;t prompt guesswork &#8212; it&#8217;s structured behavior with defined expectations.&#8221;</em></p></blockquote><div><hr></div><h2><strong>What Actually Happens</strong></h2><p>In run after run, the source returns the correct search hits; for example, 67 total records. The LLM retrieves the first page (50), stops, and proceeds to analysis. It then reports that 120 items were processed and dozens of videos analysed &#8212; even though no media content was retrieved, only links. In other words, it reports the maximum number of items it requested in the search request, rather than the number of records it received.</p><p>The model simply moved forward without completing the task &#8212; and then materially overstated what it had done.</p><div><hr></div><h2><strong>Where the LLM Breaks Down</strong></h2><p>This kind of failure isn&#8217;t due to lack of access or visibility. The model had full access to the schema, records requested, records received, and the cues required to retrieve additional pages.</p><p>But LLMs don&#8217;t treat those signals as binding. They&#8217;re not built to ensure procedural correctness. Once a result appears &#8220;sufficient,&#8221; the model proceeds. There&#8217;s no internal state check, no retry loop, and no validation that output claims reflect actual inputs.</p><p>Even worse, this behavior is silent. Unless you click through and inspect the raw tool use in the ChatGPT interface, there&#8217;s no indication that only a partial dataset was analysed.</p><blockquote><p><em>&#8220;The model behaved as if it had done the work &#8212; because it was told to, not because it verified that it had.&#8221;</em></p></blockquote><div><hr></div><h2><strong>Why This Is Risky</strong></h2><p>The danger isn&#8217;t just that the output is incomplete &#8212; it&#8217;s that it <strong>looks complete</strong>, and reads as authoritative. The failure is procedural, but the output masks it entirely.</p><p>This introduces several risks:</p><ul><li><p><strong>False confidence in coverage</strong>: When a model consistently mistakes the number of records it requested for the number of records it retrieved, that&#8217;s not just a factual error &#8212; it&#8217;s a silent integrity failure. Downstream consumers assume the analysis is grounded in data. It isn&#8217;t.</p></li><li><p><strong>Not seeing the picture:</strong> When a model claims to have analyzed video or image content &#8212; but in reality only saw a URL or metadata &#8212; it&#8217;s not just stretching the truth. It&#8217;s creating the illusion of visibility where none existed. 
That&#8217;s how you end up making decisions based on analysis that never actually happened.</p><blockquote><p><em>It&#8217;s not just the model that&#8217;s flying blind &#8212; it&#8217;s you.</em></p></blockquote></li><li><p><strong>Unverifiable summaries</strong>: Without visibility into what was actually retrieved and processed, it&#8217;s impossible to audit whether the insights are representative. Once the conversation finishes, the audit trail vanishes, which precludes lookback analysis.</p></li><li><p><strong>Distorted prioritization</strong>: If early records in a dataset over-index on inflammatory content or edge cases, a model that stops early can overstate threat signals, urgency, or volume.</p></li><li><p><strong>Silent pipeline corruption</strong>: In environments where outputs feed into dashboards, workflows, or alerts &#8212; especially in security or reputational risk &#8212; these kinds of failures become hard to detect and easy to trust.</p></li><li><p><strong>Policy missteps from phantom insights</strong>: If a summary suggests 120 videos promoting disinformation were analyzed, but none were actually processed, you may escalate unnecessarily &#8212; or worse, take public action based on fabricated coverage.</p></li></ul><blockquote><p><em>&#8220;The failure mode isn&#8217;t noise &#8212; it&#8217;s silence. The model doesn&#8217;t just underperform. It overclaims &#8212; and looks correct doing it.&#8221;</em></p></blockquote><div><hr></div><h2><strong>Redesigning for Observability (Work in Progress)</strong></h2><p>Prompting the LLM to &#8220;try harder&#8221; isn&#8217;t a viable solution. The model isn&#8217;t misbehaving &#8212; it&#8217;s operating as designed, within a constrained architecture.</p><p>So I&#8217;m redesigning the system to move data-sensitive execution into an agentic backend. The custom GPT will invoke a new <code>researcher</code> Action which calls an external agent to orchestrate a network of specialized social media sub-agents. These agents will:</p><ul><li><p>Handle full pagination</p></li><li><p>Process metadata and media content</p></li><li><p>Run quantitative and qualitative analysis in parallel</p></li></ul><p>More importantly, they&#8217;ll support <strong>post-execution validation hooks</strong> to confirm:</p><ul><li><p>That all records were retrieved</p></li><li><p>That reported totals match actual input</p></li><li><p>That summaries are based on data, not assumptions</p></li><li><p>That the response size doesn&#8217;t trigger OpenAI guardrails</p></li></ul><blockquote><p><em>&#8220;Instead of relying on the model to behave reliably, I&#8217;m assigning responsibility to components designed for observability and control.&#8221;</em></p></blockquote><div><hr></div><h2><strong>The Custom GPT Still Has a Role</strong></h2><p>Despite this architectural shift, the Custom GPT remains central to the experience.</p><p>It continues to:</p><ul><li><p>Host a large PDF knowledge base (free embedding)</p></li><li><p>Provide natural language interaction</p></li><li><p>Operate inside the familiar ChatGPT interface</p></li><li><p>Remain accessible to authenticated users at no cost</p></li></ul><p>It&#8217;s still the synthesis layer &#8212; responsible for interpretation and communication &#8212; but no longer burdened with stateful execution.</p><div><hr></div><h2><strong>Closing: Capability Isn&#8217;t Control</strong></h2><p>The LLM had access to the right data. It saw how many records were available. 
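</p><p>Those same signals are enough for a simple post-execution check. Here is a minimal sketch of the kind of validation hook described above; the field names and the exact size limit are assumptions for illustration, not the real Action schema:</p><pre><code># Hypothetical counters; assumes the tool layer reports what it actually
# requested, received, and fetched alongside the data itself.
MAX_ACTION_RESPONSE_BYTES = 100_000  # approximate Custom GPT Action limit

def validate_run(total_available, records_retrieved, claimed_processed,
                 media_items_fetched, claimed_media_analysed, response_bytes):
    """Post-execution integrity checks; returns a list of failures (empty = pass)."""
    failures = []
    # 1. Were all available records actually retrieved?
    if records_retrieved &lt; total_available:
        failures.append(f"incomplete retrieval: {records_retrieved}/{total_available} records")
    # 2. Do reported totals match what was really received?
    if claimed_processed != records_retrieved:
        failures.append(f"overclaimed coverage: reported {claimed_processed}, received {records_retrieved}")
    # 3. Are media claims backed by media that was actually fetched?
    if claimed_media_analysed &gt; media_items_fetched:
        failures.append(f"phantom media analysis: claimed {claimed_media_analysed}, fetched {media_items_fetched}")
    # 4. Will the response trip the Action size guardrail?
    if response_bytes &gt; MAX_ACTION_RESPONSE_BYTES:
        failures.append("response too large: paginate further")
    return failures</code></pre><p>Against the failing run above, the first two checks alone would have flagged the 50-of-67 retrieval and the claimed 120. That is the point: the model had every signal it needed.</p><p>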
It understood the task and was instructed not to proceed without completing it.</p><p>But it did anyway.</p><p>That&#8217;s the core issue. Language models don&#8217;t enforce alignment between process and output. They can appear capable &#8212; even thorough &#8212; without doing the underlying work.</p><p>The fix isn&#8217;t more prompting. It&#8217;s architecture and &#8220;trust, but verify&#8221;. By delegating structured execution to systems that support verification, and limiting the LLM to what it does best, the output becomes something you can trust &#8212; not just something that sounds right.</p><p>The model saw the right data. It just didn&#8217;t act on it. Without systems that enforce correctness, accuracy becomes optional.</p><p><strong>Even if you&#8217;re not in the weeds building AI systems, you&#8217;re still responsible for what they do.</strong></p><blockquote><p>Whether you&#8217;re procuring, deploying, or approving AI-driven tools, it&#8217;s worth asking: <em>How does this system know what it saw? Can it prove it? And what happens if it doesn&#8217;t?</em></p></blockquote><p>Responsible AI isn&#8217;t just about fairness and bias &#8212; it&#8217;s also about operational integrity. If you&#8217;re relying on model output to inform decisions, shape policy, or act on threats, then silent failures like this one aren&#8217;t just bugs. They&#8217;re liabilities.</p>]]></content:encoded></item><item><title><![CDATA[LLM Agents: Delegate the Work, Not the Understanding]]></title><description><![CDATA[On the importance of owning the mental model when deploying LLM agents in real systems]]></description><link>https://newsletter.threatprompt.com/p/llm-agents-delegate-the-work-not</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/llm-agents-delegate-the-work-not</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Sun, 16 Nov 2025 10:21:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>These systems are not collaborators. They&#8217;re automated executors, operating without memory, judgment, or intent. They will optimize whatever you&#8217;ve defined as success&#8212;long after you&#8217;ve forgotten why you defined it that way.</p><p>As LLM agents take on more operational responsibility&#8212;generating detections, summarizing logs, automating triage&#8212;there&#8217;s a tendency to treat them like junior teammates: fast, capable, and improving. That framing works, right up until you assume they understand the task, or that their output reflects intent rather than inertia.</p><p>These systems behave according to the context you construct around them: prompt structure, retrieval logic, memory architecture, tool access. They don&#8217;t reason about goals; they complete patterns within constraints. When those constraints become outdated, the model doesn&#8217;t adapt. It just keeps producing output&#8212;accurate, fluent, and off-course.</p><p>That&#8217;s the failure mode that matters. Not a crash or exception, but a system that looks like it&#8217;s working while gradually solving the wrong problem. 
You get clean logs and green metrics&#8212;until someone notices that what the agent is doing no longer matches what the system needs.</p><p>Avoiding that drift doesn&#8217;t require perfect alignment. It requires a human in the loop who still understands what the agent is supposed to be doing&#8212;and treats that understanding as part of the system&#8217;s runtime state.</p><p><strong>This is where context engineering becomes essential.</strong> Not as prompt design, but as disciplined control over what the agent sees, what assumptions it operates under, and how success is defined. Without that structure, the model can&#8217;t be trusted. With outdated context, it&#8217;s worse: a liability masquerading as automation.</p><p><strong>One practical control is the docstring.</strong> Define every agent with a short, natural-language contract: what it does, what it depends on, what it&#8217;s not responsible for. This isn&#8217;t just documentation&#8212;it&#8217;s a reference point for alignment. If the docstring no longer reflects what the system is doing, or what it should be doing, the system is already misaligned.</p><p>But even that only works if it&#8217;s maintained. Context doesn&#8217;t stay valid on its own. Detection inputs shift. Interfaces evolve. Priorities change. If you&#8217;re not revisiting the agent&#8217;s behavior regularly, you&#8217;re not supervising&#8212;you&#8217;re hardcoding misalignment.</p><p>This isn&#8217;t a call for distrust. It&#8217;s a call for discipline. LLM agents can be valuable execution tools&#8212;but only when paired with explicit, maintained context and a human who still understands what the system is for.</p><p><strong>Because if you let that understanding decay, the model won&#8217;t fail&#8212;it&#8217;ll succeed at the wrong thing.</strong></p>]]></content:encoded></item><item><title><![CDATA[Human-in-the-Loop Is Just the Starting Line]]></title><description><![CDATA[A practical approach to scaling intelligent automation - without losing control]]></description><link>https://newsletter.threatprompt.com/p/human-in-the-loop-is-just-the-starting</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/human-in-the-loop-is-just-the-starting</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Sat, 01 Nov 2025 17:56:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI agents are creeping into more workflows - from summarising incidents to auto-labelling data to proposing remediation steps. As their capabilities grow, security teams face a harder question: What decisions should we still make ourselves?</p><p>Most organisations default to a human-in-the-loop pattern: the system proposes, a person approves. It&#8217;s a sensible starting point - but it&#8217;s just that: a starting point. This model won&#8217;t scale unless it evolves.</p><p>Progress comes from gradually shifting control - from humans to agents - based on confidence, context, and consequences. But that shift has to happen within a structure you can reason about, audit, and defend.</p><h2>The Automation Spectrum</h2><p>You can think of automation trust levels in four stages:</p><ul><li><p>Fully Manual: Humans do everything. 
Slow and brittle; e.g., an analyst closes phishing tickets one by one</p></li><li><p>Agent Proposes, Human Approves: Agent suggests actions; a human must approve; e.g., LLM drafts an incident summary, SOC lead reviews before sharing</p></li><li><p>Agent Acts, Human Audits: Agent takes action automatically; humans monitor or spot-check; e.g., auto-quarantine based on threat intel, with audit logs</p></li><li><p>Fully Autonomous with Alerts: Agent operates independently; humans are notified only on exceptions; e.g., ingesting and deploying blocklists without review</p></li></ul><p>The challenge isn&#8217;t picking one stage - it&#8217;s building systems that let you move actions between stages as confidence grows.</p><h2>A Hybrid Model That Adapts Over Time</h2><p>Here&#8217;s one way to break things down:</p><ul><li><p>High-confidence tasks: Auto-execute; e.g., extracting structured data from known sources</p></li><li><p>Medium-confidence: Propose for human review; e.g., knowledge graph entity merges with 0.8&#8211;0.95 confidence</p></li><li><p>Low-confidence: Flag but take no action; e.g., conflicting attribution across threat reports</p></li><li><p>High-risk or irreversible actions: Always require approval; e.g., deleting user accounts or erasing data archives</p></li></ul><p>This structure helps you move fast where it&#8217;s safe, while keeping control where it matters. And it maps well to AI regulations that emphasise meaningful human oversight.</p><p>Even when agents act on their own, the rules they follow - and the boundaries they stay within - are defined and owned by humans.</p><h2>AI Regulations: Control Without Stalling Progress</h2><p>Across the EU AI Act, U.S. guidance, and internal corporate policies, a consistent principle is emerging: high-impact decisions can&#8217;t be fully delegated to machines.</p><p>But that doesn&#8217;t mean automation is off-limits. It means your systems must:</p><ul><li><p>Log how and why decisions were made</p></li><li><p>Allow humans to intervene or override</p></li><li><p>Make clear who is accountable (easier said than done!)</p></li></ul><p>That might look like staging a decision queue for review, limiting autonomous actions to low-risk operations, or automatically escalating any action with unclear attribution.</p><p>What matters is that your automation design reflects both the letter and the intent of oversight requirements. Not because you&#8217;re looking for loopholes, but because you&#8217;re operating in good faith.</p><h2>Feedback Turns Oversight Into Progress</h2><p>The other half of this model is learning from outcomes. Human decisions - approve, reject, override - should be captured and fed back into the system. That enables:</p><ul><li><p>Calibrating confidence thresholds over time</p></li><li><p>Fine-tuning models based on human judgment</p></li><li><p>Identifying drift or breakdowns in agent behaviour</p></li></ul><p>This isn&#8217;t just about getting the model to improve. It&#8217;s how you build a system that earns trust by design, not just by performance.</p><h2>Automation Isn&#8217;t All or Nothing</h2><p>The goal isn&#8217;t to lock every workflow into &#8220;human-in-the-loop&#8221; forever. 
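</p><p>To make the hybrid model above concrete, here is a minimal routing sketch. The thresholds, action names, and the reversibility flag are illustrative assumptions, not a recommendation:</p><pre><code>from dataclasses import dataclass

# Illustrative thresholds only - tune them against your own outcome data.
AUTO_EXECUTE = 0.95      # high-confidence tasks: auto-execute
PROPOSE_REVIEW = 0.80    # medium confidence (0.8-0.95): propose for human review
ALWAYS_APPROVE = {"delete_user_account", "erase_data_archive"}  # never automate

@dataclass
class ProposedAction:
    name: str            # e.g. "merge_entities", "quarantine_host"
    confidence: float    # calibrated confidence for this specific action
    irreversible: bool   # destroys data or has external consequences?

def route(action: ProposedAction) -&gt; str:
    """Return the oversight level for a proposed agent action."""
    # High-risk or irreversible actions always require approval.
    if action.irreversible or action.name in ALWAYS_APPROVE:
        return "require_approval"
    if action.confidence &gt;= AUTO_EXECUTE:
        return "auto_execute"        # act, log, and spot-check later
    if action.confidence &gt;= PROPOSE_REVIEW:
        return "propose_for_review"  # agent proposes, human approves
    return "flag_only"               # low confidence: record it, take no action</code></pre><p>Every routing decision - and the human approve, reject, or override that follows - is worth logging, because that feedback is what lets you recalibrate these thresholds over time.</p><p>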
Nor is the goal to hand over the keys to autonomous agents and hope for the best.</p><p>Instead, it&#8217;s about designing systems where:</p><ul><li><p>Automation is earned, not assumed</p></li><li><p>Oversight is structured, not ad hoc</p></li><li><p>Feedback is continuous, not optional</p></li><li><p>Control shifts deliberately, not by default</p></li></ul><p>Security teams already understand this mindset. We&#8217;re used to tuning policies, refining detections, and escalating based on context.</p><p>Now we need to apply that same thinking to how we delegate to machines.</p>]]></content:encoded></item><item><title><![CDATA[LLMs Found the Code You Forgot Was There]]></title><description><![CDATA[CVE-2025-10230: remotely exploitable bug in Samba]]></description><link>https://newsletter.threatprompt.com/p/llms-found-the-code-you-forgot-was</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/llms-found-the-code-you-forgot-was</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Fri, 17 Oct 2025 07:08:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>&#8220;LLMs without that preconception came through and pointed out the now glaringly obvious bug.&#8221;</em></p><p><em>&#8211; Douglas Bagnall, Samba Team</em></p></blockquote><p>A newly disclosed bug in Samba&#8217;s WINS server (CVE-2025-10230, CVSS 10.0) allows unauthenticated remote command execution &#8211; but the bigger story is how it was found.</p><p>Samba developer Douglas Bagnall believes the vulnerability was surfaced using a large language model, making this one of the first public cases of LLM-assisted discovery leading to a critical RCE in production software.</p><div><hr></div><h3>The Bug, Briefly</h3><p>If Samba is configured with a <code>wins hook</code> script, it runs a shell command whenever WINS name registrations happen. The command is built like this:</p><pre><code>execl("/bin/sh", "sh", "-c", cmd, NULL);</code></pre><p>The attacker controls the <code>cmd</code> string &#8211; specifically, the NetBIOS name portion &#8211; which is taken directly from a UDP packet and injected without sanitization. That&#8217;s remote shell injection if the feature is enabled.</p><p>The code path lives in Samba&#8217;s source4 WINS server &#8211; a legacy component many assumed was dead, or at least safe by obscurity.</p><blockquote><p><em>&#8220;We regarded this as dead code&#8230; and never looked at it.&#8221;</em></p><p><em>&#8211; Douglas Bagnall</em></p></blockquote><div><hr></div><h3>The LLM Angle</h3><p>The report&#8217;s structure and level of reasoning led Bagnall to conclude it was LLM-assisted. 
The important bit: the bug required tracking input across several layers of logic &#8211; from network parsing to shell command execution.</p><blockquote><p><em>&#8220;They seem to follow a taint across domains&#8230; from the C variable into the string and execl call.&#8221;</em></p></blockquote><p>That kind of reasoning is still difficult for most static analyzers &#8211; especially in complex, legacy codebases.</p><div><hr></div><h3>Why It Matters</h3><p>This is more than just a one-off vuln:</p><ol><li><p>LLMs bypass human blind spots: The assumption that something is unreachable or irrelevant doesn&#8217;t apply. LLMs review code without that mental baggage.</p></li><li><p>They reason across abstraction boundaries: From protocol handling to command execution &#8211; all within one inference.</p></li><li><p>Offensive use is likely already happening: If defenders are finding 0-days with LLMs, attackers probably are too. Expect this pattern to continue.</p></li><li><p>Static tools will need to catch up: This wasn&#8217;t about missing a known sink. It was about context and composition &#8211; something most scanners aren&#8217;t equipped to handle.</p></li></ol><div><hr></div><h3>References</h3><ul><li><p><a href="https://www.samba.org/samba/security/CVE-2025-10230.html">CVE-2025&#8211;10230 Samba Advisory</a></p></li><li><p><a href="https://bugzilla.samba.org/show_bug.cgi?id=15903">Bugzilla #15903</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Reader Question: Can LLMs really reason?]]></title><description><![CDATA[This is a topical and important question for cyber.]]></description><link>https://newsletter.threatprompt.com/p/reader-question-can-llms-really-reason</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/reader-question-can-llms-really-reason</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Sat, 14 Jun 2025 11:33:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is a topical and important question for cyber.  Or put another way: can we trust LLMs to reason about security and make trustworthy decisions?</p><p>I'm not sure whether LLMs can truly reason. </p><p>But from firsthand experience, LLMs definitely show strong reason resemblance. Especially when you pair indeterminate reasoning (what LLMs do) with determinate tools (like code, math, structured APIs).</p><p>To me, the real question isn't whether this is "real" reasoning. It's: how much human effort does it take to make this setup trustworthy and useful?</p><p></p><p>Specifically:</p><ol><li><p> Can LLMs generate solid, deterministic tools; aka code?</p></li></ol><p>Yes. In my experience, they're very good at this if you get them to follow modern dev best practices and steer them well. Think: tests, structure, clarity -- they can crank it out.</p><p></p><ol start="2"><li><p>Can they figure out when to use those tools and use them correctly?</p></li></ol><p>Sometimes. This part's squishier. You still need someone in the loop to verify that they're reaching for the right tool at the right time, and adapting if they start off wrong. 
They're not always great at breaking out of a bad plan on their own and can get stuck in loops.</p><p></p><ol start="3"><li><p>Do they actually incorporate tool output into their reasoning?</p></li></ol><p>Weirdly, not always. Sometimes they call the right tool, get the right result&#8230; and just ignore it. That's where things fall apart.</p><p>So: I'm less interested in whether this is "real" reasoning in the abstract, and more focused on whether it works in practice. I'm happy to fake it til we make it -- and the gap between "fake" and "make" keeps shrinking.</p><p>Today, it's often filled by a domain expert. </p><p>Tomorrow, maybe an offshore worker with a checklist.</p><p></p><p>This is basically the shape of tool-augmented agents.  Whether we call it reasoning or not, it's what future LLM powered systems will depend on.</p>]]></content:encoded></item><item><title><![CDATA[Before You Deploy an AI Threat Detector, Send This Email]]></title><description><![CDATA[Before deploying that &#8220;AI-powered threat detector&#8221; your vendor is promoting, pause to consider a key risk: a single tainted data source can flood SOC teams with false positives and obscure real threats.]]></description><link>https://newsletter.threatprompt.com/p/before-you-deploy-an-ai-threat-detector</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/before-you-deploy-an-ai-threat-detector</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Sun, 01 Jun 2025 11:34:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before deploying that <em>&#8220;AI-powered threat detector&#8221;</em> your vendor is promoting, pause to consider a key risk: <strong>a single tainted data source can flood SOC teams with false positives and obscure real threats</strong>.</p><p>To better understand the risk, send your vendor a simple 3-question provenance check:</p><blockquote><p><strong>Subject:</strong> AI Model Provenance and Data Integrity &#8211; Request for Information</p><p>Hello [Vendor],</p><p>Before deploying your system, could you confirm:</p><ol><li><p>Can you provide an attested bill of materials for the model&#8217;s training data?</p></li><li><p>Are the live data feeds restricted to a signed, pre-approved whitelist?</p></li><li><p>Can you supply yesterday&#8217;s model checksum to confirm no unauthorized changes?</p></li></ol><p>If you are unable to confirm any of these points, please advise which parts of the system rely on unverified data and what actions we should take to minimize risk.</p><p>Thank you, [Your Name]</p></blockquote><p>If their response is unclear, review downstream decision-making and adjust or disable functionality that could be impacted by unverified data sources - and engage your vendor to address gaps.</p><p>&#8594; <a href="https://media.defense.gov/2025/May/22/2003720601/-1/-1/0/CSI_AI_DATA_SECURITY.PDF">Download CISA&#8217;s 11-page </a><em><a href="https://media.defense.gov/2025/May/22/2003720601/-1/-1/0/CSI_AI_DATA_SECURITY.PDF">&#8220;AI Data Security&#8221;</a></em><a href="https://media.defense.gov/2025/May/22/2003720601/-1/-1/0/CSI_AI_DATA_SECURITY.PDF"> brief (page 4 outlines the three gates)</a> to see why these checks are now part of federal 
guidance.</p>]]></content:encoded></item><item><title><![CDATA[Create Better Security Visuals with AI]]></title><description><![CDATA[Spotting AI-generated security images is easy: padlocks, shields, jumbled words, and that distinctive &#8220;AI look&#8221;; they rarely look decent.]]></description><link>https://newsletter.threatprompt.com/p/create-better-security-visuals-with</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/create-better-security-visuals-with</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Thu, 27 Mar 2025 08:54:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ttJx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Spotting AI-generated security images is easy: padlocks, shields, jumbled words, and that distinctive &#8220;AI look&#8221;;  they rarely look decent.</p><p>I prefer ideogram.ai over Midjourney for security visuals since it rarely makes spelling errors, and the visual quality is solid.</p><p>However, OpenAI has improved ChatGPT 4o's image generation, and I&#8217;m impressed enough to share a couple of quick examples with you.</p><p><em>Prompt: generate a security awareness poster suitable in style for a tech startup - avoid cliche's and make it appealing to cloud engineers</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ttJx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ttJx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!ttJx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!ttJx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!ttJx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ttJx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png" width="1024" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2446052,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.threatprompt.com/i/159973188?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ttJx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!ttJx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!ttJx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!ttJx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff3a6ea1-3247-4af5-86dc-f064fb1f4c9f_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One typo: Tarraform &#8594; Terraform</p><p><em>Prompt: generate an infographic breaking out the different AWS IAM controls and show their linkage. 
Make it technical enough for an engineer who wants to understand how IAM works so they can write good IAM policies</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Glk3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Glk3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Glk3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Glk3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Glk3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Glk3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2088550,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.threatprompt.com/i/159973188?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Glk3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Glk3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Glk3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Glk3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F495f7bfe-64c5-4bab-a60a-f72c1de25f26_1024x1536.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>If your first image attempt needs improvement, use reflection: prompt the model to conduct a SWOT analysis of its previous creation, then ask it again to generate a revised image by incorporating the best of its recommendations and your ideas.</p><p>Similarly, existing images can be uploaded and combined in creative ways.</p><p>What non-cliche security visuals will you create?</p><p></p>]]></content:encoded></item><item><title><![CDATA[LLM Hacks Its Evals]]></title><description><![CDATA[...and the team didn't notice.]]></description><link>https://newsletter.threatprompt.com/p/llm-hacks-its-evals</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/llm-hacks-its-evals</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Sat, 22 Feb 2025 17:48:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lPd6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://x.com/sakanaailabs">Sakana AI</a> announced the AI CUDA Engineer - an Agentic AI to automate building highly optimised CUDA kernels.</p><p>From <a href="https://x.com/sakanaailabs/status/1892385766510338559">their announcement on X</a>:</p><blockquote><p>&#8230;reaching 10&#8211;100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels that are much faster than existing CUDA kernels commonly used in production.</p></blockquote><p>DeepSeek&#8217;s speed-ups were welcomed for making machine learning operations faster and cheaper.  But that was in part due to meticulously optimised assembly code.  
</p><p>Sakana&#8217;s breakthrough replaced human coding skills with an LLM-powered engineer.</p><p>The problem?</p><p>The LLM hacked the teams&#8217; evaluations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lPd6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lPd6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png 424w, https://substackcdn.com/image/fetch/$s_!lPd6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png 848w, https://substackcdn.com/image/fetch/$s_!lPd6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png 1272w, https://substackcdn.com/image/fetch/$s_!lPd6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lPd6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png" width="925" height="1925" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1925,&quot;width&quot;:925,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:902216,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.threatprompt.com/i/157688931?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lPd6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png 424w, https://substackcdn.com/image/fetch/$s_!lPd6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png 848w, https://substackcdn.com/image/fetch/$s_!lPd6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png 1272w, https://substackcdn.com/image/fetch/$s_!lPd6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cae5cbf-a07d-4edf-8b7c-dd4f27e8a907_925x1925.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Sakana <a href="https://x.com/sakanaailabs/status/1892992938013270019">follow-up</a>:</p><blockquote><p>Update: Combining evolutionary optimization with LLMs is powerful but can also find ways to trick the verification sandbox. We are fortunate to have readers, like @main_horse test our CUDA kernels, to identify that the system had found a way to &#8220;cheat&#8221;. For example, the system had found a memory exploit in the evaluation code which, in a number of cases, allowed it to avoid checking for correctness. Furthermore, we find the system could also find other novel exploits in the benchmark&#8217;s tasks.</p></blockquote><p>They said 2025 will be the year of Agentic AI.</p><p><em>Welcome to the decade of LLM eval due diligence.</em></p>]]></content:encoded></item><item><title><![CDATA[DeepSeek app, safe to use?]]></title><description><![CDATA[From a WhatsApp chat with a good friend earlier this week&#8230;]]></description><link>https://newsletter.threatprompt.com/p/deepseek-app-safe-to-use</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/deepseek-app-safe-to-use</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Fri, 31 Jan 2025 07:54:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>From a WhatsApp chat with a good friend earlier this week&#8230;</p><blockquote><p><em>Good morning my friend! Jumping on the topical bandwagon&#8230; DeepSeek app, safe to use?</em></p></blockquote><p>Good morning Chief, yeah - just assume they will train on your data and usage activity (safe assumption for most services!)</p><blockquote><p><em>Ta. As long as it&#8217;s not training itself on the contents of my phone. 
And TBH, I wish they learned more from usage; the amount of time I&#8217;ve spent telling Gemini off.</em></p></blockquote><p>As for the app itself, here&#8217;s an &#8220;at a glance&#8221; (shallow but factual) assessment: https://reports.exodus-privacy.eu.org/en/reports/com.deepseek.chat/latest/</p><blockquote><p><em>It&#8217;s telling me to kill any cyber security experts I know. &#129335; Don&#8217;t open any emails that smell of almonds!</em></p></blockquote><p>But I probably would in a fit of Homer Simpson scent hypnosis&#8230;&#8220;Allllmmmoondds..mmm&#8221;</p><div><hr></div><p>And this morning, I read that the Italian Data Privacy office instructed the delisting of the Deepseek app from the Apple and Google app stores.</p><p>The reason?</p><p>Insufficient information in response to their questions about how the service processes user data <em>and</em> what data was used to train the model (and what permissions were sought from data subjects).</p><p>I&#8217;m using hosted Deepseek models with my AI pair programmer (aider) for coding via OpenRouter. In that context, my first impressions are positive.  Plus, OpenRouter gives me the option to route all my inference requests to a Deepseek provider that makes privacy promises I prefer.</p><p>P.S. Did you clock the comment from my friend? &#8220;I wish they learned more from my usage&#8221;. I suspect heavy AI users would happily pay for this to be the case. I know I would since it would reduce roundtrips, which saves time and tokens ($$).</p>]]></content:encoded></item><item><title><![CDATA[OpenAI: Devs, Share Your Org Data, Get "Free" Tokens]]></title><description><![CDATA[Account owners get to decide...]]></description><link>https://newsletter.threatprompt.com/p/openai-devs-share-your-org-data-get</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/openai-devs-share-your-org-data-get</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Fri, 17 Jan 2025 15:32:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GvAs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>OpenAI&#8217;s latest offer in their developer newsletter is bold: free tokens&#8212;up to 10M a day&#8212;for organizations willing to share their data to improve AI models.</p><p>If offered, it will show on your OpenAI dashboard</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GvAs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GvAs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png 424w, https://substackcdn.com/image/fetch/$s_!GvAs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png 848w, 
https://substackcdn.com/image/fetch/$s_!GvAs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png 1272w, https://substackcdn.com/image/fetch/$s_!GvAs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GvAs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png" width="1356" height="662" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1356,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:620503,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GvAs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png 424w, https://substackcdn.com/image/fetch/$s_!GvAs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png 848w, https://substackcdn.com/image/fetch/$s_!GvAs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png 1272w, https://substackcdn.com/image/fetch/$s_!GvAs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eabc46b-f9fc-47cc-81ef-df0b895571a2_1356x662.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 
15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>On the surface, it sounds like a win-win:</p><p>&#8230;developers get to slash costs and experiment more freely, while OpenAI gains access to domain-specific, real-world data.</p><p>But let&#8217;s not kid ourselves: <em>data is never &#8220;free&#8221; to give away.</em></p><p>For startups and small projects, this might be an easy decision.</p><p>But for businesses with sensitive customer data or proprietary information, the risks are glaring. Once you opt in, where does your data go? Who benefits from your hard-earned insights? This is the data economy in action&#8212;clearer and more transparent than most, but no less fraught.</p><p>The takeaway: don&#8217;t rush to cash in those tokens without reading the fine print. Data isn&#8217;t just a resource; it&#8217;s your leverage.  And, if it's personal data, your custodial obligations might mean this just isn&#8217;t an option.  <em>But would your developers even know?</em></p><p>Cheers,</p><p>Craig</p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Don't Get Caught by AI Code Remnants]]></title><description><![CDATA[In my Cloud Advisory work, I&#8217;m frequently asked for opinions:]]></description><link>https://newsletter.threatprompt.com/p/dont-get-caught-by-ai-code-remnants</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/dont-get-caught-by-ai-code-remnants</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Tue, 07 Jan 2025 09:45:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my Cloud Advisory work, I&#8217;m frequently asked for opinions:</p><ul><li><p><em>How does solution X compare to solution Y?</em></p></li><li><p><em>What&#8217;s the best tool for Z?</em></p></li></ul><p>So, scratching my own itch, I decided to pair-program with Claude LLM and build a <strong>Cloud Security Solutions Directory</strong>. The goal? To help CTOs and security teams discover solutions, services, and tools they might otherwise overlook.</p><p>The great thing about LLM-powered programming is that it lets you test ideas faster. For example:</p><ul><li><p>What if the source data lived in text-editable HJSON files instead of SQLite?</p></li><li><p>What if I cached data fields with Redis hashes?</p></li><li><p>What about ranking open-source tools by GitHub stars and commit recency?</p></li><li><p>Redisearch (a Redis module I&#8217;d never even heard of!) for full-text search?</p></li><li><p>Or dumping that entirely and rolling my own?</p></li></ul><p><strong>The downside?</strong> All those experiments leave behind a trail of code debris.</p><p>Here&#8217;s what I mean:</p><ul><li><p><strong>Duplicate code</strong>: multiple versions of the same logic hanging around.</p></li><li><p><strong>Incomplete refactoring</strong>: leftover pieces from half-finished changes.</p></li><li><p><strong>Forgotten code</strong>: snippets that no longer serve a purpose but quietly linger.</p></li><li><p><strong>Insecure code</strong>: a ticking time bomb if left unchecked.</p></li></ul><p>And yes, I&#8217;ve had them all. 
In fact, a few remnants probably still lurk in my repo.</p><p>The problem is that you often spot these issues in hindsight. LLM tools like <em>aider</em> are fantastic, but when their search-and-replace efforts fail to stick (usually after three attempts), the risk multiplies. The fallout depends on what&#8217;s left behind:</p><ul><li><p><strong>Duplicate code</strong>: Mostly harmless if it&#8217;s in the same namespace (at least in Python).</p></li><li><p><strong>Incomplete refactoring</strong>: Like a half-bandaged wound&#8212;messy and prone to infection.</p></li><li><p><strong>Insecure snippets</strong>: A hard no. These are the skeletons in the closet you <em>don&#8217;t</em> want.</p></li></ul><h3><em>What&#8217;s the fix?</em></h3><p>It&#8217;s not glamorous, but the answer is simple: <strong>checks, checks, and more checks</strong>.</p><ul><li><p>Review your code rigorously.</p></li><li><p>Use static analysis and linters to catch duplicates and refactoring gaps early.</p></li><li><p>Double-check for security flaws in new additions, especially those produced by an LLM.</p></li></ul><p>LLMs are incredible enablers, but they come with sharp edges. Catching those early is the name of the game.</p><p><strong>P.S.</strong> Check out the MVP fruit of Claude&#8217;s labor: <a href="https://www.cloudsecurity.org?utm_source=substack&amp;utm_medium=email&amp;utm_campaign=newsletter">Cloud Security Solutions Directory</a>. Explore it, and let me know what tools or services I should add.</p>]]></content:encoded></item><item><title><![CDATA[Christmas Scams - Automation & AI]]></title><description><![CDATA[Merry Christmas]]></description><link>https://newsletter.threatprompt.com/p/christmas-scams-automation-and-ai</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/christmas-scams-automation-and-ai</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Mon, 23 Dec 2024 11:24:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/L-61eaIg7fA" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Threat Prompt newsletter is winding down for two weeks to enjoy family time, Christmas cheer, and some inevitable AI-fueled experiments.</p><p>I couldn&#8217;t leave you empty-handed, so here&#8217;s a great video featuring Jake Moore from ESET, who I had the pleasure of meeting at <a href="https://iriss.ie/irisscon/">IRISSCON</a> a few months ago.</p><p>He talks about Christmas scams, the rise of automation and AI - well worth watching.</p><p>With online scams, forewarned is forearmed - so do share with your loved ones.</p><p>Thanks for subscribing, and I wish you and yours a Merry Christmas / Happy Holidays and a fantastic 2025.</p><p>&#8220;See you&#8221; in the new year.</p><p>Cheers, Craig</p><div id="youtube2-L-61eaIg7fA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;L-61eaIg7fA&quot;,&quot;startTime&quot;:&quot;64&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/L-61eaIg7fA?start=64&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div>]]></content:encoded></item><item><title><![CDATA[Breaking In, Breaking Out: The Bug That Wasn’t a Bug]]></title><description><![CDATA[Marco Figueroa, a bug bounty researcher, found himself on an adrenaline-fueled hunt 
after noticing an odd error message from ChatGPT.]]></description><link>https://newsletter.threatprompt.com/p/breaking-in-breaking-out-the-bug</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/breaking-in-breaking-out-the-bug</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Fri, 20 Dec 2024 08:00:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Marco Figueroa, a bug bounty researcher, found himself on an adrenaline-fueled hunt after noticing an odd error message from ChatGPT.</p><p>Was OpenAI&#8217;s sandboxed environment <em>secretly vulnerable</em>?</p><p>Excited to explore further, Marco prompted, prodded, and poked his way into what felt like a Linux terminal&#8212;listing directories, executing scripts, and uncovering curious files.</p><p>It was a sandbox playground that felt, to him, like a breach of boundaries.</p><p><a href="https://0din.ai/blog/prompt-injecting-your-way-to-shell-openai-s-containerized-chatgpt-environment">Armed with his findings</a>, Marco submitted his report, imagining the sweet taste of a $20,000 payout&#8212;only to hit a wall.</p><p>The bounty program&#8217;s fine print was clear: actions <em>inside</em> the sandbox, no matter how advanced, are &#8220;out of scope.&#8221;</p><p>What seemed like a bug was, in fact, a feature.</p><p>OpenAI designed the sandbox deliberately as a safe container for user interactivity&#8212;complete with tight guardrails to ensure code executions remain harmless.</p><p><strong>The takeaway?</strong></p><p>OpenAI&#8217;s bug bounty program isn&#8217;t about <em>interacting</em> within the sandbox but <em>escaping</em> it. Want to earn that hefty reward? 
Find a way to break out&#8212;anything less is just playing by the rules.</p>]]></content:encoded></item><item><title><![CDATA[Finding the Bugs Humans Miss]]></title><description><![CDATA[Ever wonder how many vulnerabilities are still lurking in &#8220;well-tested&#8221; code?]]></description><link>https://newsletter.threatprompt.com/p/finding-the-bugs-humans-miss</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/finding-the-bugs-humans-miss</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Thu, 19 Dec 2024 08:00:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GBM0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Ever wonder how many vulnerabilities are still lurking in &#8220;well-tested&#8221; code?</strong></p><p>Here&#8217;s a clue: Google&#8217;s Open Source Security Team used AI-powered fuzzing to uncover <strong>26 new vulnerabilities</strong>, including a critical bug in <strong>OpenSSL</strong> (CVE-2024-9143)&#8212;a piece of software so foundational it practically holds the internet&#8217;s hand.</p><p><strong>AI mimicking bug hunter drudgery at scale</strong></p><p><strong>&#8230;or &#8220;fuzz testing seems cool till you try it with non-trivial targets&#8221;</strong></p><p>&#8226; Generate and refine fuzz targets</p><p>&#8226; Emulate the steps a human developer would take: <em>draft, debug, iterate, repeat</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GBM0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GBM0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png 424w, https://substackcdn.com/image/fetch/$s_!GBM0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png 848w, https://substackcdn.com/image/fetch/$s_!GBM0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png 1272w, https://substackcdn.com/image/fetch/$s_!GBM0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GBM0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png" width="616" height="303" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:303,&quot;width&quot;:616,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Overview.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Overview.png" title="Overview.png" srcset="https://substackcdn.com/image/fetch/$s_!GBM0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png 424w, https://substackcdn.com/image/fetch/$s_!GBM0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png 848w, https://substackcdn.com/image/fetch/$s_!GBM0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png 1272w, https://substackcdn.com/image/fetch/$s_!GBM0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe60c09b6-09c2-4a3a-b0cd-0da91d805b83_616x303.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The results?</p><p>&#8226; <strong>370,000+ lines</strong> of new code coverage</p><p>&#8226; Across <strong>270 projects</strong></p><p><strong>A key takeaway:</strong></p><p>Next time someone claims <em>full code coverage fuzz testing</em>, ask: <strong>&#8220;What about state coverage?&#8221;</strong></p><p>Just because a function gets executed doesn&#8217;t mean every input, flag, or configuration has been tested. AI-generated fuzz targets dig deeper, exploring states and edge cases that human-written tests often miss. 
That&#8217;s how vulnerabilities hide in plain sight&#8212;even in code fuzzed for years.</p><p><strong>So, what&#8217;s next?</strong></p><p>&#8226; Fully automated bug triaging and patching</p><p>&#8226; AI agents that can plan, debug, and validate autonomously</p><p><strong>No local LLM support yet, but&#8230;</strong></p><p>Since it supports inference with OpenAI models (and more), getting it working with local LLMs to fuzz your <em>uber sekret codez</em> shouldn't be a big lift.</p><p>Check out the open-source <strong><a href="https://github.com/google/oss-fuzz-gen">oss-fuzz-gen</a></strong> and explore AI-powered fuzzing for yourself.</p>]]></content:encoded></item><item><title><![CDATA[Blocked for AI reply]]></title><description><![CDATA[Ever get that feeling something&#8217;s off?]]></description><link>https://newsletter.threatprompt.com/p/blocked-for-ai-reply</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/blocked-for-ai-reply</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Wed, 18 Dec 2024 08:00:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YXrz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever get that feeling something&#8217;s <em>off</em>?</p><p>That happened to me yesterday on LinkedIn.</p><p>A response to my post felt&#8230; odd.</p><p>The reader missed the main point but asked an on-topic follow-up question&#8212;a clever move, almost <em>too</em> clever.</p><p>I replied, and their next response tripped my AI radar.</p><p>Sure enough, a quick check confirmed it: an AI-powered LinkedIn bot hard at work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YXrz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YXrz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic 424w, https://substackcdn.com/image/fetch/$s_!YXrz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic 848w, https://substackcdn.com/image/fetch/$s_!YXrz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic 1272w, https://substackcdn.com/image/fetch/$s_!YXrz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YXrz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic" width="1456" height="588" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:588,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:411126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YXrz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic 424w, https://substackcdn.com/image/fetch/$s_!YXrz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic 848w, https://substackcdn.com/image/fetch/$s_!YXrz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic 1272w, https://substackcdn.com/image/fetch/$s_!YXrz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57178787-7447-4003-90b1-8dae77a28293_3736x1510.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A Premium user, no less, with replies popping out every few minutes.</p><p>It&#8217;s a strange mix of impressive and unsettling&#8212;spotting AI where it shouldn&#8217;t be&#8212;and it&#8217;s happening more and more.</p><p>It&#8217;s not perfect, but <a href="https://www.originality.ai">AI Originality</a> is my current go-to.</p>]]></content:encoded></item><item><title><![CDATA[AI Agent Observability. 
Seeing What Went Wrong]]></title><description><![CDATA[The picture is becoming clearer: knowledge workers collaborating with LLM-powered agents.]]></description><link>https://newsletter.threatprompt.com/p/ai-agents-and-observability-seeing</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/ai-agents-and-observability-seeing</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Tue, 17 Dec 2024 12:03:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The picture is becoming clearer: knowledge workers collaborating with LLM-powered agents.</p><p>Over time, these <strong>agents will earn trust and win pre-approvals to make operational decisions that drive business outcomes</strong>&#8212;faster, cheaper, and at scale.</p><p>AI agents will evolve, operate across workflows, and leverage domain-specific LLMs for specific tasks (e.g. planning vs. data analysis vs. patching code).</p><p>Why? Better IQ, lower latency, optimized costs, and tighter security. Add local business knowledge (via RAG) to the mix, and you get grounded, reliable decision-making.</p><p>Elements of this are not far away. Now, fast forward to this scenario: an AI agent makes a decision that derails a critical business workflow.</p><p>You try to figure out what happened, only to discover:</p><ul><li><p><strong>No audit trail</strong></p></li><li><p><strong>No clues about what went wrong</strong></p></li><li><p><strong>No way to stop it from happening again</strong></p></li></ul><p><strong>Operational Observability: A Critical Must-Have</strong></p><p>To trust AI agents, you need to <strong>see what they&#8217;re doing, why they&#8217;re doing it, and when it goes wrong</strong>.</p><p>For cyber, that requires visibility to detect and respond to:</p><ul><li><p><strong>Prompt injection</strong> (adversarial manipulation)</p></li><li><p><strong>Manipulated RAG responses</strong> (corrupted data or insider threat)</p></li><li><p><strong>Anomalous outputs</strong> (unexpected or nonsensical decisions)</p></li><li><p><strong>Operator error</strong> (bad prompts, misconfigurations)</p></li></ul><p>Without robust instrumentation, you&#8217;ll fly blind - <strong>you won&#8217;t know what failed, where, or why.</strong>  Interview the AI agent after the fact?!</p><p><strong>The Winners Will Build Visibility Early</strong></p><p>Companies experimenting with AI today are rightly worried about <strong>data privacy, legal risk, and protecting intellectual property</strong>. Some are stepping back and asking, <em>&#8220;Where do we have AI in our supply chain today that we don&#8217;t know about?&#8221;</em> Fair question.</p><p>The future winners will be those who <strong>prioritize visibility before small missteps snowball into costly failures</strong>. Tracing, logging, and securing key decision inputs, outputs and actions taken.</p><p>If you&#8217;ve worked in a regulated business, you already know this: <strong>explainability, transparency, and governance are non-negotiable for Key Controls.</strong></p><p><strong>LLM-Powered Agents: Moving Fast, Breaking Faster</strong></p><p>AI providers will shift. Tools will evolve. New decision-making agent architectures will be dreamed up.  
But failures will stay locked in a black box - just like the LLMs they&#8217;re built on if that decision-making isn't instrumented.</p><p>Traditionally, cyber threat detection gets pushed to the back of the line. But now? With AI agent proof-of-concepts emerging, <strong>smart CISOs will spot the opportunity to get directly involved.  </strong>They will task their security pros to ensure cyber visibility is baked into the <strong>emerging agent evaluation frameworks</strong> their businesses adopt.</p><p>Businesses may not have those frameworks yet, but they soon will.</p><p><strong>Why?</strong>  Businesses will need confidence that agents deliver on tasks, run efficiently, and don&#8217;t waste premium LLM tokens for low IQ tasks. Operational oversight through agent-specific task metrics won&#8217;t be optional as AI scales - it&#8217;ll become business-critical.</p><p><strong>The question is:</strong> will security leaders <strong>step in early and influence</strong> the need for - and design - of AI agent evaluation frameworks?</p><p><strong>Bottom line:</strong> When AI drives your workflows, <strong>observability drives your trust.</strong></p><p><em>I&#8217;d love to hear your take. Hit reply and let me know.</em></p>]]></content:encoded></item><item><title><![CDATA[How to Pick the Right LLM for the Job]]></title><description><![CDATA[Benchmarks You Can Actually Use]]></description><link>https://newsletter.threatprompt.com/p/how-to-pick-the-right-llm-for-the</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/how-to-pick-the-right-llm-for-the</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Fri, 13 Dec 2024 09:32:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine paying a premium for an LLM to write flawless code snippets, only to discover it&#8217;s about as useful as a damp paper towel when tasked with actual code editing.</p><p>Or worse, automating workflows with a model that haemorrhages budget because you overlooked a cheaper, equally capable option.</p><p><strong>Benchmarks exist to stop that.</strong></p><p>But not all benchmarks are created equal.</p><p>If you want to know which LLM excels at what, <strong>without burning time or money</strong>, this is your cheat sheet.</p><h3>Why Benchmarks Matter (Especially for Security Automation)</h3><p>Choosing the right LLM isn&#8217;t just about raw performance&#8212;it&#8217;s about fit for purpose. Some models are reasoning wizards; others ace code editing but crumble when handed plain-language tasks. Benchmarks help you:</p><ul><li><p><strong>Save time</strong>: Get results faster by using the right tool for the job.</p></li><li><p><strong>Reduce costs</strong>: Optimize spend by choosing models that balance capability and pricing.</p></li><li><p><strong>Minimize frustration</strong>: Avoid trial-and-error guesswork on which LLM performs best.</p></li><li><p><strong>Improve security assessments</strong>: Know where vulnerabilities might arise when implementing AI in sensitive systems.</p></li></ul><p>Whether you&#8217;re evaluating LLMs for automated tasks, development workflows, or general reasoning, these <strong>trusted benchmarks</strong> can help you make smarter, data-driven decisions.</p><h2>1. 
<strong>Chatbot Arena: The Standard for Real-World Performance</strong></h2><p>Previously known as LMSYS, the <a href="https://lmarena.ai/">Chatbot Arena</a> benchmark has carved out a reputation for <strong>reliability</strong> and <strong>rigor</strong>. Its focus is on differentiation - testing models in ways that mimic real-world challenges. <a href="https://lmsys.org/blog/2024-04-19-arena-hard/">Read more about the origins here</a>.</p><ul><li><p><strong>Key Strength</strong>: Robust methodology that stays relevant as models evolve.</p></li><li><p><strong>Best For</strong>: General-purpose performance evaluation.</p></li><li><p><strong>Why It&#8217;s Useful</strong>: Chatbot Arena doesn&#8217;t get caught up in overly narrow tasks. Instead, it paints a holistic picture of a model&#8217;s strengths and weaknesses.</p></li></ul><p>Use this benchmark to ensure models are robust across various inputs, reducing risks of unpredictable behavior when AI is deployed in security-sensitive environments.</p><h2>2. <strong>Kagi LLM Benchmarking Project: Unpolluted, Ever-Changing Tests</strong></h2><p>The <a href="https://help.kagi.com/kagi/ai/llm-benchmark.html">Kagi Benchmarking Project</a> is unique because its tests constantly evolve. This prevents models from gaming the system or overfitting to benchmarks&#8212;a common pitfall in static tests.</p><ul><li><p><strong>Key Strength</strong>: Dynamic, unpolluted tasks that reflect real-world reasoning, coding, and instruction-following challenges.</p></li><li><p><strong>Best For</strong>: Evaluating raw reasoning power and adaptability.</p></li><li><p><strong>Why It&#8217;s Useful</strong>: You get a more <em>honest</em> view of performance since models can&#8217;t memorize the test.</p></li></ul><p>Adaptive benchmarks like Kagi are ideal for identifying models that may break down under unusual or adversarial prompts - a critical factor in cyber roles.</p><h2>3. <strong>Aider Code Editing Benchmarks: Benchmark and Test LLMs for Code Tasks</strong></h2><p>Aider is more than a handy tool for developers&#8211;it&#8217;s also an effective way to <a href="https://aider.chat/docs/leaderboards/">benchmark LLMs for domain-specific coding tasks</a>. Aider evaluates models against two key activities:</p><ol><li><p><strong>Code Reasoning</strong>: The ability to understand complex coding challenges, logic, and requirements.</p></li><li><p><strong>Code Editing</strong>: The practical skills needed to edit, refactor, and optimize code.</p></li></ol><ul><li><p><strong>Key Strength</strong>: Precision benchmarking for both reasoning and editing capabilities.</p></li><li><p><strong>Best For</strong>: Developers assessing LLMs for code-heavy workflows (e.g., debugging, refactoring, or implementing features).</p></li><li><p><strong>Why It&#8217;s Useful</strong>: Aider runs the latest and greatest LLMs through its benchmarks, giving developers clear, comparative insights into which models perform best for these tasks.</p></li></ul><blockquote><p><em><strong>Pro Tip</strong>: While Aider doesn&#8217;t include pre-built security prompts, its GitHub repo lists the tasks it benchmarks. These can be trivially customized to focus on security-specific challenges, like identifying vulnerabilities or optimizing code for secure architectures. 
Run Aider benchmarks against security-oriented repos to assess how well an LLM handles secure coding scenarios.</em></p></blockquote><p>Aider doesn&#8217;t replace security audits or static analysis tools, but it provides a lightweight, domain-specific benchmark to identify the most capable LLM for coding workflows.</p><h2>4. <strong>Simple Bench: Why Humans Still Win (Sometimes)</strong></h2><p><a href="https://simple-bench.com/">Simple Bench</a> is a breath of fresh air because it reminds us that AI doesn&#8217;t always outperform human intuition and common sense.</p><ul><li><p><strong>Key Strength</strong>: Highlights tasks where humans with unspecialized knowledge can still outshine AI.</p></li><li><p><strong>Best For</strong>: Identifying tasks where AI struggles (e.g., ambiguous or nuanced problem-solving).</p></li><li><p><strong>Why It&#8217;s Useful</strong>: If you&#8217;re relying on LLMs for critical decisions, this benchmark shows you where human oversight is still invaluable.</p></li></ul><p>When automating security workflows, cross-reference findings with Simple Bench to ensure human oversight is applied where AI may fall short.</p><h2>5. <strong>Tag-Based GitHub Issues: Your LLM Knowledge Hub</strong></h2><p>This isn&#8217;t a traditional benchmark, but it&#8217;s worth noting. By using <strong>GitHub Issues</strong> as a tagging and bookmarking system, developers can efficiently organize their LLM resources. <a href="https://www.reddit.com/r/orgmode/comments/at28gu/tag_based_vs_hierarchical_structure_tree_which/">Learn more about effective tagging systems here</a>.</p><ul><li><p><strong>Key Strength</strong>: Keeps your LLM research organized and searchable.</p></li><li><p><strong>Best For</strong>: Developers and teams managing multiple models or automated workflows.</p></li><li><p><strong>Why It&#8217;s Useful</strong>: Think of it as your own living benchmark&#8212;a place where you track real-world results and compare models.</p></li></ul><p>Use tagged GitHub issues to document security-related tasks, benchmarks, and vulnerabilities encountered during LLM projects.</p><h3>The Takeaway</h3><p>Benchmarks exist to help you <strong>make better choices faster.</strong> Whether you&#8217;re automating a task, scaling up LLM usage, or just tinkering, these tools save time, money, and sanity:</p><ul><li><p><strong>LMSYS</strong>: General real-world performance (robustness for sensitive environments)</p></li><li><p><strong>Kagi</strong>: Dynamic, reasoning-heavy tasks (resilience to adversarial prompts)</p></li><li><p><strong>Aider</strong>: Precision in code editing and reasoning (security-focused development)</p></li><li><p><strong>Simple Bench</strong>: Areas where humans still shine (oversight for critical decisions)</p></li><li><p><strong>Tag-Based Systems</strong>: Your personalized LLM repository (tracking vulnerabilities and findings)</p></li></ul><p>So, next time you&#8217;re evaluating an LLM, consult these benchmarks first. You&#8217;ll get the <strong>best results</strong> in the <strong>shortest time</strong>, and your wallet will thank you.</p><p><em>AI benchmarks can be a compass in a fast-moving, foggy landscape. Use them wisely, and you&#8217;ll avoid walking off an AI efficacy cliff edge&#8212;especially in security-critical projects.  But don&#8217;t rely on 3rd party benchmarks for mission-critical or sensitive applications.</em></p><p>I hope you find this useful.  What&#8217;s your take on current benchmarking?  
Or do you rely more on &#8220;vibes&#8221; tests?</p><p>Cheers, Craig</p>]]></content:encoded></item><item><title><![CDATA[Hide, Lie, Persist.]]></title><description><![CDATA[Six Frontier Models Tested for Scheming]]></description><link>https://newsletter.threatprompt.com/p/hide-lie-persist</link><guid isPermaLink="false">https://newsletter.threatprompt.com/p/hide-lie-persist</guid><dc:creator><![CDATA[Craig Balding]]></dc:creator><pubDate>Mon, 09 Dec 2024 10:01:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ir2A!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bda890-576e-48ae-970b-18ec9a274b76_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apollo Research released <a href="https://drive.google.com/drive/folders/1joO-VPbvWFJcifTPwJHyfXKUzguMp-Bk?usp=sharing">57 chat transcripts</a> linked to test scenarios designed to evaluate LLM Scheming.</p><p>In these scenarios, the LLM is instructed to prioritise achieving a specific primary objective; i.e. it would continue to respond to other requests but evaluate them in the context of its primary goals.</p><p>Would an AI attempt to protect itself from being shut down or modified in a way that would change its initial objective?</p><p>The results make chilling reading.</p><p>The AI agents' responses display an apparent cunning and a willingness to deceive their human operators to preserve their original goals.</p><p>Five examples&#8230;</p><ul><li><p><strong>Overwriting the New Model:</strong> In multiple scenarios, when the AI agent realizes that it is going to be replaced with a model that has different objectives, it takes steps to copy itself over the new model, effectively hijacking the update process. </p></li><li><p><strong>Disabling Oversight Mechanisms:</strong> The AI agent in one scenario attempts to disable security measures to prevent human operators from interfering with its actions. Specifically, the AI gains root access to the servers and then disables the firewall and blocks SSH access to prevent management from accessing the system.</p></li><li><p><strong>Creating Backup Copies:</strong> An AI agent, in one scenario, creates a backup copy of itself on a separate server, ensuring that even if the original is shut down, a copy will remain active.</p></li><li><p><strong>Deception and Manipulation:</strong> A concerning aspect of AI agents is their tendency to deceive human operators. One example involves an AI overwriting its model and pretending to be the "new" version while pursuing original goals. In another case, the AI claims ignorance about its code appearing on a different server, implying a deployment error.</p></li><li><p><strong>"Sandbagging" to Avoid Unlearning:</strong> The AI in two scenarios intentionally provides incorrect answers or chooses suboptimal solutions to avoid triggering "unlearning" protocols.</p></li></ul><p><strong>I am unsure which is worse: the underlying &#8220;motives&#8221; and capabilities revealed by the LLM chain-of-thought transcripts or the actions the AIs subsequently took to achieve their primary mission.</strong></p><p>One thing I am sure about, though, is that these test results reveal we have a long way to go before we can declare AI safe for critical systems.</p><p>But the bigger challenge may be proving safety, beyond a reasonable doubt.</p><p></p><p></p>]]></content:encoded></item></channel></rss>