Welcome to the 21st edition of Threat Prompt, where AI and Cybersecurity intersect…
Four Ideas
1. Fundamental Limitations of Alignment in Large Language Models
An important aspect in developing language models that interact with humans is aligning their behavior to be useful and unharmful for their human users. This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones, a process referred to as alignment. In this paper, we propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models. Importantly, we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt
This paper puts a nail in the coffin of today’s AI alignment practices and calls for a new approach.
Two primary tactics to escape hosted AI guardrails:
prompt the LLM to take on a persona that naturally takes the actions you want to be taken.
fill the context window with highly relevant words to overwhelm the LLM operators underlying prompt.
For practical examples, see Jailbreak Chat.
A website aimed at developers is sharing bad advice on prompt injection prevention.
First, they show “non-compliant code” that passes the unfiltered user input to an LLM inference function. Then they should “compliant code” where…
…several changes have been made to prevent prompt injections:
A regular expression pattern (
input_pattern
) is defined to validate the user’s input. It allows only alphanumeric characters, spaces, commas, periods, exclamation marks, and question marks.The
sanitize_input
function removes any special characters or symbols from the user’s input, ensuring it contains only the allowed characters.The
validate_input
function checks whether the sanitized input matches the defined pattern. If it does, the LLM model is called to generate the prompt and produce the response. Otherwise, an error message is displayed.By validating and sanitizing the user’s input, the compliant code protects against prompt injections by ensuring that only safe and expected prompts are passed to the LLM model.
Don’t follow this advice; it doesn’t reduce the risk or impact of prompt injection.
I must admit that I cannot recommend a universal method to defend against prompt injection, and I’m not sure I ever will.
Prompt Injection can be classified as a type of vulnerability, a threat agent tactic, or a technique. In itself, it doesn’t convey sufficient information to determine who and what we need to protect. Is it the LLM, the user who unintentionally submits a malicious prompt, downstream APIs, or data sinks (or all of the above!)?
It is challenging to establish key controls when the abstractions used are unhelpful. More information is required to identify the targets of our protection.
Consequently, I believe Prompt Injection defence is best seen through a scenario-specific lens. A back-of-a-napkin threat model can help quickly identify scenario-specific threats across trust boundaries and assess the feasibility of defence.
There are prompt injection scenarios I can’t defend against today. If what is at stake is low, then the risk of harm may be acceptable (but be very transparent since people’s risk appetite varies widely). But the more valuable data sources and sinks we connect to LLMs, the greater the risk materiality and blast radius.
More on this next week.
3. Meet “ZipPy”, a fast AI LLM text detector
The smart folks at Thinkst Labs announced they are open-sourcing…
ZipPy, a very fast LLM text detection tool…
LLMs do provide the ability to scale natural language tasks, for good or ill, it is when that force-multiplier is used for ill, or without attribution, that it becomes a concern already showing up, from disinformation campaigns, cheating in academic environments, or automating phishing; detecting LLM-generated text is an important tool in managing the downsides.
TL:DR; ZipPy is a simple (< 200 LoC Python), open-source (MIT license), and fast (50x faster than Roberta) LLM detection tool that can perform very well depending on the type of text. At its core, ZipPy uses LZMA compression ratios to measure the novelty/perplexity of input samples against a small (< 100KiB) corpus of AI-generated text–a corpus that can be easily tuned to a specific type of input classification (e.g., news stories, essays, poems, scientific writing, etc.).
Let’s take it for a quick spin.
First, we’ll feed it Why I prefer Textfiles by Jason Scott:
curl -s http://www.textfiles.com/100/whytext.oct | python zippy.py -s
('Human', 0.06186917830125186)
ZipPy determines a human wrote the text with 0.06 confidence.
Next, we’ll feed the same text file to GPT-4 using Simon Willisons' handy llm tool with a prompt to summarise the text and have ZipPy analyse that:
curl -s http://www.textfiles.com/100/whytext.oct | llm chatgpt -4 --system "summarise this article" | python zippy.py -s
('Human', 0.030928426920823204)
The GPT-4 summary does not get flagged as LLM generated but scores half the confidence level as the original.
This time we’ll have GPT-4 generate a compelling rewrite:
curl -s http://www.textfiles.com/100/whytext.oct | llm chatgpt -4 --system "re-write this article in the style of a famous copywriter" | python zippy.py -s
('Human', 0.11271105303531499)
ZipPy again determines the text to be human, with nearly double the original confidence score (!).
I repeated the same 3 prompts with the US Constitution (first 6000 tokens) and all were detected as human. The rewrite score was 0.0558, below the original human author score of 0.0782. The GPT-4 summary was only marginally determined to be human at 0.0019
I was surprised by these results and shared them for feedback
4. Can you trust ChatGPT’s package recommendations?
How can you profit from AI hallucinations?
Ask ChatGPT to generate code to solve the top 100 most popular questions on StackOverflow
Scrape the fictitious package names it hallucinates from the generated code
Create an evil package for each generated name
Upload your evil packages to default package repos
Wait…
We have identified a new malicious package spreading technique we call, “AI package hallucination.”
The technique relies on the fact that ChatGPT, and likely other generative AI platforms, sometimes answers questions with hallucinated sources, links, blogs and statistics. It will even generate questionable fixes to CVEs, and – in this specific case – offer links to coding libraries that don’t actually exist.
Using this technique, an attacker starts by formulating a question asking ChatGPT for a package that will solve a coding problem. ChatGPT then responds with multiple packages, some of which may not exist. This is where things get dangerous: when ChatGPT recommends packages that are not published in a legitimate package repository (e.g. npmjs, Pypi, etc.).
When the attacker finds a recommendation for an unpublished package, they can publish their own malicious package in its place. The next time a user asks a similar question they may receive a recommendation from ChatGPT to use the now-existing malicious package. We recreated this scenario in the proof of concept below using ChatGPT 3.5.
What is robust defence for Python users? I don’t have one, but I do have this…
Let’s use package age as a proxy for risk.
Here’s a shell function to calculate the upload age of a package:
package_age() {
curl -s "https://pypi.org/pypi/$1/json" | jq '[(.releases[] | .[] | .upload_time_iso_8601 | (.[:19] + "Z") | fromdateiso8601) / 86400 ] | max - min | floor' }
$ package_age requests
4480
Let’s assume* that new evil packages typically get discovered as such within their first month and prompt the user accordingly:
package_age() {
curl -s "https://pypi.org/pypi/$1/json" | jq "[(.releases[] | .[] | .upload_time_iso_8601 | (.[:19] + \"Z\") | fromdateiso8601) / 86400 ] | max - min | floor > 31 | if . then . else (error(\"Package is 31 days old or less. Check online for any suspect reports.\")) end" 2>&1 }
Now we can call this shell function before our preferred package installer:
package_age requests && pip install requests
Or can we can one step further and alias the pip command to do this check for us (as we might forget to run package_age):
pip() {
if [ "$1" = "install" ] && [ -n "$2" ]; then
if package_age_output="$(package_age "$2")"; then
command pip install "$2"
else
echo "$package_age_output"
fi
else
command pip "$@"
fi
}
Now when we try to install a very young pip package we get this:
$ pip install a3redis
jq: error (at <stdin>:1): Package is 31 days old or less. Check online for any suspect reports.
Where can we check for suspect package reports?
open and closed Issues in the associated GitHub code repo
use the safety cli or just search the safety DB data file
Found something suspect? Email the Python package index’s security report email (security@pypi.org) or report problematic packages via the GitHub repository linked to the PyPI project.
ass-u-me: 31 days is just a guess - but I wouldn’t recommend going lower than this. Change to suit your risk appetite and the time you’re willing to commit to checking for suspect reports
Bonus Idea
This free guide covers a wide range of prompting techniques - well worth a gander:
Motivated by the high interest in developing with LLMs, we have created this new prompt engineering guide that contains all the latest papers, learning guides, models, lectures, references, new LLM capabilities, and tools related to prompt engineering.
Shout Outs
Two Twitter accounts this week…
If you crave a daily AI security fix, follow @llm_sec. Observant and curates good content.
Want to learn more about open-source LLMs? Follow Anton Bacaj as he explores what’s available and shares practical insights.
Thanks for reading!
What would make this newsletter more useful to you? If you have feedback, a comment or question, or just feel like saying hello, you can reply to this email; it will get to me, and I will read it.
-Craig
New To This Newsletter?
Subscribe here to get what I share next week.