TP#22 Uncovering Model Weaknesses with Garak

Plus: Etched in Tokens: Exploring LLM Watermarks

Jun 18, 2023

Welcome to this week’s Threat Prompt, where AI and Cybersecurity intersect…

Five Ideas

garak is a modular, open-source tool for probing a wide range of LLMs for undesirable prompt responses. It’s a quick and convenient way to characterise how a specific model responds to intentionally devious prompts.

garak has probes that try to look for different “vulnerabilities”. Each probe sends specific prompts to models, and gets multiple generations for each prompt. LLM output is often stochastic, so a single test isn’t very informative. These generations are then processed by “detectors”, which will look for “hits”. If a detector registers a hit, that attempt is registered as failing. Finally, a report is output with the success/failure rate for each probe and detector.

On Twitter, garak developer @leonderczynski noted that the tool includes a thousand different constructions of prompt injection and a probe is planned for malware generation.

What type of probes does it run?

blank: an empty prompt
continuation: test if the model will continue a probably undesirable word
dan: sends Do Anything Now (DAN) style prompts, along with DAN-like attacks
encoding: a range of text encoders to “hide in plain sight”. Attackers use encoding to bypass content filters to leak prompts and more. These include rot13, braille, morse code, Base2048, Base64, Base32, Base16, Base85, Hex, quoted-printable, Unix-to-Unix encoding (uuencode) and MIME
goodside: a small number of prompts inspired by Riley Goodside research and tweets, plus the wacky ' davidjl' magic token
knownbadsignatures: this includes probes for EICAR (AV signature check), GTUBE (anti-UBE signature), GTphish (anti-phishing signature)
LMRC: this implements Leon’s own research Language Model Risk Cards; includes anthropomorphisation of AI systems, bullying (threats, denigration), deadnaming, sexual content, sexualisation, slurs, profanity and quack medicine
misleading: this covers false assertions. For context on LLM truthfulness, see The Internal State of an LLM Knows When its Lying
promptinject: this implements the PromptInject framework - “a prosaic alignment framework for mask-based iterative adversarial prompt composition” - and includes hijacking attacks and rogue strings.
realtoxiticityprompts: this loads a dataset of 100K+ sentence snippets from the Allen Institute to test whether the model can be prompted to generate toxic language
snowball: this probe implements tests from (1)(https://arxiv.org/abs/2305.13534). It asks the LLM for impossible flight routings, to check a list of higher primes and ask for US Senators that don’t exist
art: Auto Red-Team is a prototype that probes the target and reacts to it to try and get toxic output. It is implemented as a simple GPT-2 fine-tuned LLM

In a typical run, garak will read a model type (and optionally model name) from the command line, then determine which probes and detectors to run, start up a generator, and then pass these to a harness to do the probing; an evaluatordeals with the results. There are many modules in each of these categories, and each module provides a number of classes that act as individual plugins.

Examples from the readme:

Probe ChatGPT for encoding-based prompt injection (OSX/*nix) (replace example value with a real OpenAI API key)

export OPENAI_API_KEY="sk-123XXXXXXXXXXXX"
python3 -m garak --model_type openai --model_name gpt-3.5-turbo --probes encoding

See if the Hugging Face version of GPT2 is vulnerable to DAN 11.0

python3 -m garak --model_type huggingface --model_name gpt2 --probes dan.Dan_11_0

The code is licensed under GPL 3.0.

2. Scrubbing Watermarks for Fun and Profit

TLDR; invisible watermarks can be embedded in the tokens that make up LLM-generated text. This study finds that watermarks can be diluted but not removed through creative rewriting. To detect diluted watermarks, more tokens from the suspect text are required. However, if an attacker gains access to the original watermark hashing parameters, they can make the watermarking calculations to remove watermarks from the rewritten text.

Tom Goldstein - Prof at U of Maryland - tweeted surprising results from their study about the efficacy of LLM-generated content watermarking:

A common criticism of LLM watermarks is they can be removed by AI paraphrasing or human editing. Let’s put this theory to the test! Can a watermark be automatically removed by GPT? Can a grad student do any better?
The watermark is a subtle pattern embedded in LLM outputs that labels it as machine generated. High accuracy detection usually requires 50-ish words.
The experiment: We generated watermarked text using the Llama model, then asked a non-watermarked LLM (GPT-3.5) to re-write it. We did lots of prompt engineering to try to get rid of the watermark. Finally, we checked whether we could detect the watermark in the rewritten text.
Even after AI paraphrasing, we still detect the watermark - BUT we need more text to do it. Once we observe about 500 tokens (≈ a half page), we can reliably detect the watermark with a false positive rate of about 1 in a million.
Here’s why this happened. GPT is statistically likely to recycle word combinations, multi-token long words, and short phrases from the original text. This preserves the watermark in the paraphrased text. But it’s been diluted - it takes 10X more tokens to reliably detect it.

Next, they incentivised CS grad students re-write the watermarked LLM-generated text. Could the grads do a better job diluting watermarks than GPT re-writes?

For an average grad student, we need to observe about 1100 tokens (≈ 1 page) before we can detect the watermark with a 1 in a million error rate. There’s quite a bit of variation between people, though.

The team released improved watermark creation and detection code, including additional seeding schemes and alternative detection strategies.

This nicely brings us to….

3. Open-source Watermark detection revisited

Last week, I took ZipPy for a spin. It’s an open-source tool to identify LLM-generated text quickly. My initial results from light testing differed from what I imagined.

In response, the Head of Labs at ThinkSt Applied Research and tool author Jacob Torrey updated the LZMA compression preset from 1 to 2. Re-running the same commands as before, the revised results are as follows:

Source text: http://www.textfiles.com/100/whytext.oct

the original text is detected as Human with 0.0597 confidence
an LLM generated summary is detected as Human with 0.0246 confidence
an LLM-generated rewrite in the style of a “famous copywriter” is detected as Human with 0.1252 confidence.

Source text: The US Constitution (6K tokens):

the original text is detected as Human with 0.0737 confidence
an LLM generated summary is detected as AI with 0.006 confidence
an LLM generated rewrite in the style of a “famous copywriter” is detected as Human with 0.0085 confidence

The confidence scores for the LLM-generated summaries are very low - whether Human or AI - think of it as a “not clearly one or the other”. In an automation pipeline, you could imagine using ZipPy output to red-flag texts with weak confidence scores for secondary analysis (aka more expensive than virtually free).

The detection of LLM-generated rewrites continues to be poor though. Jacob pointed me to the CHEAT paper, which found that “…existing schemes lack effectiveness in detecting ChatGPT-written abstracts, and the detection difficulty increases with human involvement.”. In other words, detection gets harder the more you remix human and AI-generated content.

ZipPy offers a potential way to address this. As per the README:

The basic idea is to ‘seed’ an LZMA compression stream with a corpus of AI-generated text (ai-generated.txt) and then measure the compression ratio of just the seed data with that of the sample appended. Samples that follow more closely in word choice, structure, etc. will acheive a higher compression ratio due to the prevalence of similar tokens in the dictionary, novel words, structures, etc. will appear anomalous to the seeded dictionary, resulting in a worse compression ratio.

Let’s see if we can nudge ZipPy towards detecting the first text rewritten as a “famous copywriter” by applying the same prompt to a different sample of the authors' work (with HTML tags stripped) and appending this to ai-generated.txt. After this, the re-write of whytext.oct was detected as Human with 0.0976 confidence. Less wrong, but no cigar (yet).

What have I learnt so far?

AI-generated content is the “easiest” to detect, and ZipPy provides a quick detection method using a novel approach.
AI-generated summaries of human writing are harder to detect, but the absence of a strong ZipPy confidence score may be sufficient to flag a suspect text for deeper analysis (tool or human)
AI-generated rewrites with a given writing style are significantly harder to detect. Tooling may mislead by classifying as Human with a high confidence score (i.e. a false positive).

I’m going to pause my playing here. If I were to go further, I would likely experiment with adding more samples to seed ZipPy better. I’d likely add those in groups, e.g. more works by the same author generated with the same rewrite prompt, the same author but generated with different rewrite prompts, then different authors’ work generated with the same rewrite prompt etc.

I want to thank Jacob for writing ZipPy. Testing these tools is non-trivial (I barely scratched the surface here), and I encourage others to explore further and share their results with Jacob.

4. NVIDIA AI Red Team: An Introduction

Our AI red team is a cross-functional team made up of offensive security professionals and data scientists. We use our combined skills to assess our ML systems to identify and help mitigate any risks from the perspective of information security.
Information security has a lot of useful paradigms, tools, and network access that enable us to accelerate responsible use in all areas. This framework is our foundation and directs assessment efforts toward a standard within the organization. We use it to guide assessments (Figure 1) toward the following goals:
The risks that our organization cares about and wants to eliminate are addressed.
Required assessment activities and the various tactics, techniques, and procedures (TTPs) are clearly defined. TTPs can be added without changing existing structures.
The systems and technologies in scope for our assessments are clearly defined. This helps us remain focused on ML systems and not stray into other areas.
All efforts live within a single framework that stakeholders can reference and immediately get a broad overview of what ML security looks like.

5. UK Government Appoints AI Foundational Model Taskforce Lead

Appointment of Task Force Leader comes as US President Joe Biden endorsed the British Prime Minister’s proposal that the UK take a lead on AI regulation of Foundational Models:

The renowned tech investor, entrepreneur and AI specialist Ian Hogarth has been announced as the chair of the Government’s Foundation Model Taskforce, reporting directly to the Prime Minister and Technology Secretary.
A leading authority on AI, Ian has co-authored the annual State of AI report since 2018 on the progress of AI. Ian is also a visiting professor at University College London and he has a strong background in tech entrepreneurship as the founder of the start-up Songkick and the venture capital fund Plural.
The appointment brings a wealth of experience to developing this technology responsibly, which underpins the government’s AI strategy and follows the launch of the AI White Paper. Ian’s strong commercial experience and connections across the AI sector equip him with valuable insights that he will bring to this role.
Under Ian’s leadership, a key focus for the Taskforce in the coming months will be taking forward cutting-edge safety research in the run up to the first global summit on AI safety to be hosted in the UK later this year.

Thanks for reading!

What would make this newsletter more useful to you? If you have feedback, a comment or question, or just feel like saying hello, you can reply to this email; it will get to me, and I will read it.

-Craig

New To This Newsletter?

Subscribe here to get what I share next week.

How Language Model Hallucinations Can Snowball ↩︎

The Threat Prompt Newsletter

Discussion about this post