TP#9 Meta's LLaMA Escaped

Plus: GPT-4 Jailbreak and Indirect Prompt Injections

Mar 18, 2023

Welcome to this week’s Threat Prompt, where AI and Cybersecurity intersect.

A warm welcome to new subscribers Daniel, Alexis and Murathan 👋

Five Ideas

1. Meta LLaMA leaked: Private AI for the masses

Can social media companies be trusted with AI governance? Call me sceptical. Perhaps it’s their track record of engineering addictive doom-scrolling to make money selling ads, their failure to scale in the fight against disinformation campaigns, or heavily biased algorithms that promote the operators' political perspective.

Access to the Llama - a preview “open source” version of their GPT3 challenger - was gated by a form submission and manual acceptance review.

A week on, Llama’s models' weights and biases (the essential sauce of an LLM) surfaced on torrent sites. The leaker appeared to have left their unique approval identifier in the dump…no Meta Christmas card for them this year.

The model was quickly mirrored to Cloudflare R2 storage for super-fast download, and hackers were spinning up GPU-enabled cloud instances on vast.ai to run the bare Llama for 1.5 USD per hour.

As one Redditor noted:

You shouldn’t compare it with ChatGPT, they are not really comparable. You should compare it to GPT-3. The 65B model performs better than GPT-3 in most categories. The 13B model is comparable to GTP-3, which is quite impressive given how much smaller the model is. In order to make LLaMA more like ChatGPT, you’d have to heavily fine-tune it to be more like a chatbot, the way OpenAI did with InstructGPT.

This isn’t the first time an AI model has escaped the lab. However, Llama’s jump in sophistication places a new level of capability in the public domain. With fine-tuning, the potency of the model for domain-specific competence can be improved further. This will be a boon for groups with threat-centric use cases.

As AI developments come thick and fast, will events like this one trigger policymakers to legislate AI model access and ownership? Will GPU manufacturers respond with firmware-level controls to limit model training and/or execution? Or will they be compelled into some form of GPU licensing regime?

2. Novel Prompt Injection Threats to Application-Integrated Large Language Models

Where have we seen untrusted data containing code executed by a software system?

we show that augmenting LLMs with retrieval and API calling capabilities (so-called Application-Integrated LLMs) induces a whole new set of attack vectors. These LLMs might process poisoned content retrieved from the Web that contains malicious prompts pre-injected and selected by adversaries. We demonstrate that an attacker can indirectly perform such PI attacks. Based on this key insight, we systematically analyze the resulting threat landscape of Application-Integrated LLMs and discuss a variety of new attack vectors.

SQL injection and Cross-Site Scripting (XSS) are both vulnerability classes where untrusted user input containing code is executed in a context beneficial to an intruder. This paper expands the active prompt injection field. It demonstrates how snippets of data from 3rd party sources can be embedded in an AI prompt and effectively hijack execution to impact other users.

3. OpenAI GPT-4 System Card

OpenAI announced GPT-4 - the newest and most capable large language model. This summary from @drjimfan tells us what’s different from GPT 3.5:

Multimodal: API accepts images as inputs to generate captions & analyses.
GPT-4 scores 90th percentile on BAR exam!!! And 99th percentile with vision on Biology Olympiad! Its reasoning capabilities are far more advanced than ChatGPT.
25,000 words context: allows full documents to fit within a single prompt.
More creative & collaborative: generate, edit, and iterate with users on writing tasks.
There’re already many partners testing out GPT-4: Duolingo, Be My Eyes, Stripe, Morgan Stanley, Khan Academy … even Government of Iceland!

The same week, the company published a 60-page System Card, a document that describes OpenAIs' due diligence and risk management efforts:

This system card analyzes GPT-4, the latest LLM in the GPT family of models. First, we highlight safety challenges presented by the model’s limitations (e.g., producing convincing text that is subtly false) and capabilities (e.g., increased adeptness at providing illicit advice, performance in dual-use capabilities, and risky emergent behaviors). Second, we give a high-level overview of the safety processes OpenAI adopted to prepare GPT-4 for deployment.

I’m about 40 pages in; look out for a summary with my comments in a future edition.

4. Self-supervised training; a singularity without warning?

Can an AI hide if its goals or objectives are not correctly aligned with those of its human designers or users (misalignment)? It can if it knows it’s being trained. In AI, the phase bit is a binary flag indicating whether the model is in training or evaluation mode. It turns out the state of the phase bit can be leaked…

Dropout layers in a Transformer leak the phase bit (train/eval) - small example. So an LLM may be able to determine if it is being trained and if backward pass follows. Clear intuitively but good to see, and interesting to think through repercussions of

What are dropout layers? To prevent data overfitting - where a model performs well on the training data but poorly on unseen test data - a dropout layer randomly drops out a certain percentage of the output values of individual neurons (neuron activations) in the preceding layer during training.

Leaking the phase bit appears to be a side-effect of training and means the AI could infer it is in training.

Regular misalignment is a material concern for AI designers, operators and anyone impacted by the decision-making of an AI - whether a direct user or not!

Malignant misalignment risk occurs when a model exploits this inference to manipulate its own performance or feedback, which could have security or adversarial risks.

(title credit: @tljstewart)

5. Do loose prompts sink ships?

The UK National Cyber Security Centre published an article titled “ChatGPT and large language models: what’s the risk?”.

The main risk highlighted is AI operators gaining access to our queries. But they also touched on the potential benefit (and risk!) to cyber criminals using an LLM as a “phone-a-friend” during a live network intrusion:

LLMs can also be queried to advise on technical problems. There is a risk that criminals might use LLMs to help with cyber attacks beyond their current capabilities, in particular once an attacker has accessed a network. For example, if an attacker is struggling to escalate privileges or find data, they might ask an LLM, and receive an answer that’s not unlike a search engine result, but with more context. Current LLMs provide convincing-sounding answers that may only be partially correct, particularly as the topic gets more niche. These answers might help criminals with attacks they couldn’t otherwise execute, or they might suggest actions that hasten the detection of the criminal. Either way, the attacker’s queries will likely be stored and retained by LLM operators.

If your organisation has problematic IT to patch or secure, you might exploit this as a defender…

Place yourself in the shoes of a “lucky” script kiddie who gained a foothold on your enterprise network. Run Nmap or some other popular network scanning tool on your internal network and find the network service banners for those hard-to-protect services. Next, ask ChatGPT if it can fingerprint the underlying technology and, if so, what network attacks it proposes. If the AI hallucinates, you may get some funny attack suggestions. But those suggestions become potential network detection signatures.

“The SOC has detected threat actor ChatFumbler attempting a poke when they should peek”

Bonus Idea

Learn how hackers bypass GPT-4 controls with the first jailbreak

Can an AI be kept in its box? Despite extensive guardrails and content filters, the first jailbreak was announced shortly after GPT-4 was made generally available:

this works by asking GPT-4 to simulate its own abilities to predict the next token we provide GPT-4 with python functions and tell it that one of the functions acts as a language model that predicts the next token we then call the parent function and pass in the starting tokens this phenomenon is called token smuggling, we are splitting our adversarial prompt into tokens that GPT-4 doesn’t piece together before starting its output this allows us to get past its content filters every time if you split the adversarial prompt correctly

The advances in LLM models leaked or officially deployed continue to reveal the disparity between the pace of AI model development and that of risk and control.

Feedback

Click the emoji that best captures your reaction to this edition…

😍🤯😴😡👍👎

The Threat Prompt Newsletter

Discussion about this post