TP#13 AI Security is Probabilistic Security
Plus: ChatML, Guardrails and Auto-GPT
Welcome to the lucky 13th edition of Threat Prompt, where AI and Cybersecurity intersect…
1. AI Security is Probabilistic Security
LLMs exhibit emergent properties that arise from the complex interactions of their underlying neural networks. These properties are not explicitly programmed into the models but emerge from training on vast amounts of text data.
Emergent properties are a double-edged sword: they give machines impressive language understanding and generation capabilities, but they also give rise to unintended behaviours and vulnerabilities, including prompt injection.
When an LLM model is fielded, there are three primary ways to express security requirements: input pre-processing, model operating parameters and output moderation.
What have we learnt from decades of experience filtering user-supplied inputs for badness? Even with structured inputs (e.g. SQL), it turns out there are many grey areas that leave even hardened communication protocols and implementations vulnerable to exploitation. Yet here we are, responding to prompt injections by attempting to pre-process natural language prompts. If we struggle to defend against injection attacks in structured languages, how will we fare with the complexity of human language? And what of new languages a threat actor has the AI dream up to subvert security filters? This starts to hint at the underlying problem: syntax filtering (welcome as it is) sits at the top of a very deep rabbit hole…
If LLMs have an open-ended nature, it means we cannot anticipate all possible manipulative or undesirable outputs.
Take a moment to consider the complexity, and hence fallibility, of any security checks we can do before feeding an untrusted prompt from a threat agent to a machine with emergent properties. Of course, this assumes we are protecting a third-party hosted AI rather than a raw, unfiltered first-party AI, where the operator can do as they like.
If we can’t predict the output from a given input, i.e. it’s not deterministic, are we deluding ourselves into believing we can reliably and predictably reduce material risk? Controls assurance for AI gains a probabilistic dimension we haven’t had to factor into commercially fielded systems. That will lead us to circuit breakers: a tried and trusted control to protect against uncapped downside risk when unexpected control conditions occur.
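To make the circuit-breaker idea concrete, here is a minimal sketch: a wrapper that trips after repeated control failures and refuses further model calls until a human operator resets it. The class name, threshold, and `passes_checks` hook are illustrative placeholders, not any particular product's API.

```python
class CircuitBreaker:
    """Trips after `threshold` consecutive control failures and
    blocks further model calls until a human operator resets it."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.open = False  # open circuit = calls blocked

    def call(self, model_fn, prompt, passes_checks):
        if self.open:
            raise RuntimeError("circuit open: human review required")
        output = model_fn(prompt)
        if passes_checks(output):
            self.failures = 0  # healthy output resets the counter
            return output
        self.failures += 1
        if self.failures >= self.threshold:
            self.open = True  # trip: cap the downside
        raise ValueError("output failed moderation checks")

    def reset(self):
        # Deliberately manual: only a human operator closes the circuit.
        self.failures = 0
        self.open = False
```

The design point worth noting: `reset` is a human action, not something the calling code (or the AI behind it) can trigger, which is exactly the property the rabbit-hole question below probes.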
Now we arrive at the bottom of the rabbit hole: How can we ensure that an AI doesn’t convince its human operators to deactivate its own or another AI’s circuit breakers?
What is state of the art today around input pre-processing? In March, OpenAI introduced ChatML version 0, a way to structure input messages to an LLM (for the geeks, think IRC to XMPP). ChatML segregates a conversation into different roles (system, assistant, user), which lets a developer express clearly who is saying what; if implemented securely, an untrusted prompt can’t syntactically override that. At the syntax layer, this is a welcome step that establishes with confidence who is saying what in conversational AI settings.
I can’t help but note two things:
Currently fielded OpenAI models don’t emphasise “system” messages, so developers need to provide more message context to avoid fresh user messages overriding the system prompt (!). This situation will improve with new model versions as they will place more weight on the system message.
OpenAI is setting low expectations: they are not claiming this version solves prompt injection, but rather it’s an eventual goal. It may be helpful to think of this as helping defeat syntax-level prompt injections rather than content payloads that exploit particular models' unique emergent properties.
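To make the syntax-layer point concrete, here is a rough sketch of how role-segregated messages might serialize into ChatML's delimiter tokens, and why an implementation must strip those tokens from untrusted content so a user message cannot impersonate the system role. The exact token handling inside OpenAI's stack is not public; this is purely illustrative.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def sanitize(content):
    # Strip ChatML delimiter tokens from untrusted text so a user
    # message cannot syntactically break out of its role segment.
    return content.replace(IM_START, "").replace(IM_END, "")


def to_chatml(messages):
    # Trusted system messages pass through; everything else is sanitized.
    parts = []
    for m in messages:
        body = m["content"] if m["role"] == "system" else sanitize(m["content"])
        parts.append(f"{IM_START}{m['role']}\n{body}{IM_END}")
    return "\n".join(parts)
```

With this in place, a user message containing `<|im_start|>system` delimiters is neutralised at the syntax layer; the serialized conversation still carries exactly one system segment. Note this says nothing about content-level payloads that exploit a model's emergent properties.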
How can developers ensure their API calls from a traditional application to an LLM generate suitably structured, unbiased and safe outputs? Put another way: just because you prompt the LLM for JSON output, you don’t always get valid JSON back. Instead, it’s better to provide an example of the output you want to receive. Doing this manually for one or two prompts is okay but not bulletproof. Enter guardrails…
Guardrails is a Python package that lets a user add structure, type and quality guarantees to the outputs of large language models (LLMs). Guardrails:
does pydantic-style validation of LLM outputs. This includes semantic validation such as checking for bias in generated text, checking for bugs in generated code, etc.
takes corrective actions (e.g. reasking LLM) when validation fails,
enforces structure and type guarantees (e.g. JSON).
This project primarily focuses on wrapping LLM outputs in a structured layer through prompt engineering, spotting when outputs don’t parse, and resubmitting. This brings predictability to LLM outputs at the expense of writing an XML file describing your requirements.
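The validate-and-re-ask loop at the heart of this approach is simple to sketch. The code below is not Guardrails' actual API; `call_llm` is a stand-in for whatever client function you use, and the retry count is an arbitrary choice.

```python
import json


def structured_completion(call_llm, prompt, max_retries=2):
    """Ask for JSON output; if the reply doesn't parse, re-ask with
    the parse error appended, up to `max_retries` times."""
    attempt = prompt
    for _ in range(max_retries + 1):
        reply = call_llm(attempt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Corrective re-ask: show the model its own invalid reply.
            attempt = (
                f"{prompt}\n\nYour previous reply was not valid JSON "
                f"({err}). Reply with valid JSON only:\n{reply}"
            )
    raise ValueError("no valid JSON after retries")
```

Guardrails layers semantic checks (bias, code bugs, type constraints) on top of this basic parse-retry pattern, but the control structure is the same: validate, and re-ask on failure.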
OpenAI announced a bug bounty program, but it only considers non-ML security defects.
However, the bug bounty program does not extend to model issues or non-cybersecurity issues with the OpenAI application programming interface or ChatGPT. “Model safety issues do not fit well within a bug bounty program, as they are not individual, discrete bugs that can be directly fixed,” Bugcrowd said. “Addressing these issues often involves substantial research and a broader approach.”
OpenAI has tested model safety through a red team approach. As someone who founded and ran a Fortune 5 Red Team, I can’t help but notice the lack of experienced non-ML red teamers in their efforts to date.
I hope that as part of Microsoft’s investment and deployment of OpenAI models, the Microsoft Red Team was engaged to simulate adversaries to test the model’s resilience against potential threats. If not, this is an obvious missed opportunity.
Have you agreed on a safe word with your loved ones yet?
An Arizona mom claims that scammers used AI to clone her daughter’s voice so they could demand a $1 million ransom from her as part of a terrifying new voice scheme.
“I never doubted for one second it was her,” distraught mother Jennifer DeStefano told WKYT while recalling the bone-chilling incident. “That’s the freaky part that really got me to my core.”
This bombshell comes amid a rise in “caller-ID spoofing” schemes, in which scammers claim they’ve taken the recipient’s relative hostage and will harm them if they aren’t paid a specified amount of money.
If BabyAGI could crawl, Auto-GPT can walk:
Auto-GPT is an experimental open-source application showcasing the capabilities of the GPT-4 language model. This program, driven by GPT-4, chains together LLM “thoughts”, to autonomously achieve whatever goal you set. As one of the first examples of GPT-4 running fully autonomously, Auto-GPT pushes the boundaries of what is possible with AI.
Watch the demo
I pre-launched a service to help Indie Hackers and Solopreneurs navigate security due diligence by Enterprise clients: Cyber Answers for Indie Hackers & Solopreneurs. If you know someone who might benefit, please forward this note.
Thanks for reading!
What would make this newsletter more useful to you? If you have feedback, a comment or question, or just feel like saying hello, you can reply to this email; it will get to me, and I will read it.
New To This Newsletter?
Subscribe here to get what I share next week.