TP#7 Does AI need Hallucination Traps?

Plus: Error messages are the new prompts

Mar 04, 2023

Welcome to this week’s Threat Prompt, where AI and Cybersecurity intersect…

Five Ideas

Amazon, JP Morgan, Verizon, and other companies are reportedly restricting their employees from using ChatGPT due to security and privacy concerns.
Will the companies restricting the use of AI tools get left behind?
Or is the cautious approach justified?

They did the same with cloud computing. Now, many enterprise companies have a cloud-first policy. In fact, in some places, you may hurt your promotion choices if you deploy an on-premise workload!

Watch as companies develop formal AI policies with considerably more nuance as they seek to capture the upside whilst limiting the downside.

2. Error messages are the new prompts

Once you start building with AI, you quickly realise that sending a singular prompt to an AI API and processing the response is just the start.

Just like with regular APIs, you need to chain operations: get some input from somewhere, clean it up, augment it with some other data, prompt the AI, sanity check the response, update a database record etc, etc. This has led to the development of language chain frameworks and services.

AI Agents - built using language chains - go one step further and incorporate a feedback loop. This enables the AI to dynamically adapt and learn a task. The results are impressive!

In this example, error messages are fed back into the model as part of the next prompt:

"LLMs are pretty good at writing SQL, but still struggle with some things (like joins)
🤯 But what if you use an agent to interact with SQL DBs?
In the example below, it tries a join on a column that doesn’t exist, but then can see the error and fixes it in the next query"

The implications of this are significant.

Error messages are the new prompts: the AI takes its cues from error messages and adapts its approach to solving the problem at hand.

“Error messages are a great example of how our tools shape the way we think.”

- Douglas Crockford

Just replace “we” in the quote above with “AIs”.

Error messages as prompts are neat and should work well where error messages are helpful. Unfortunately, that discounts a lot of software and puts a natural gate on use cases.

As these limitations become more apparent, more tooling will emerge to connect an AI to a debugger to gain complete insight and control over the target software. This will significantly reduce the time required for learning when AI operates and monitors software in real time.

The future for security test coverage and automation looks bright. Non-trivial adversarial security testing involves identifying and exploiting many obscure edge cases. As any decent penetration tester will tell you, this is time-consuming and frustrating. To achieve a degree of human-driven automation, we use domain-specific tooling (e.g. Burp Suite for web app testing). The next step will be programming adaptive adversarial AI Agents to accelerate the boring bits of security testing.

The rise of AI agents only increases the need for guardrails and human oversight/intervention, much like how having reliable brakes on your car enables you to drive faster.

3. Does AI need Hallucination Traps?

If you’ve had a play with an AI, you will know it tends to hallucinate. it will generate completions that sound plausible but are nonsensical.

You ask an AI to complete a complex task or calculation. It goes through the motions, showing you its calculations and reasoning until finally, it provides you with an answer. But what if that answer was not the output of the task, but an answer it “already knew”?

6 million views on my post about GPT automatically debugging its own code (which it did), but only @voooooogel mentioned that GPT didn’t actually use the result of the code to figure out the answer.

The AI provided the correct answer. At the right time. In the right place.

But the answer was effectively pre-generated. Despite it jumping through your hoops and appearing to follow your bidding.

And how many readers noticed? Perhaps a few, but only one person publicly called it. This speaks volumes about how we can be fooled by an AI.

Answer attribution would undoubtedly help. But perhaps we need to develop Hallucination Traps to stop the AI from fooling us all so easily.

4. Unit tests for prompt engineering

I’m a big fan of what I call “bad guy” unit tests for software security. These help software developers quickly identify certain classes of software security vulnerabilities. A couple of simple examples: what happens if we stuff unexpected data into a search query? Or provide a JSON array where a string is expected?

The topic of unit tests for Large Language Models (LLM) came up this past week:

"Unit tests for prompt engineering. Like it or not, reliable prompt engineering is going to be a critical part of tech stacks going forward. "
"Unit test LLMs with LLMs
Tracking if your prompt or fine-tuned model is improving can be hard. During a hackathon, @florian_jue, @fekstroem, and I built “Have you been a good bot?”.
It allows you to ask another LLM to judge the output of your model based on requirements."

Two quick thoughts:

we’re back again with one AI assessing another AI. It’s not hard to see a slew of AI governance, safety and trust products emerging.
AIs are great for generating unit tests and can easily be prompted to generate “bad guy” ones. If you work in security, it’s time to roll your sleeves up!

5. It’s not just what you say, but who you say it as

OpenAI released ChatGPT API this week. It’s 10x cheaper than Davinci, their best all-rounder model. People are already working on developing ChatGPT web-style interfaces (and dumping their 20USD per month ChatGPT Pro subs).

Since it’s a bot API, the way you communicate differs from existing OpenAI APIs. Prompts are sent in two contexts: “System” or “Messages”. @Yohei shares his method and reveals why context will be important to meaning:

"Testing strength of putting context in “System” vs “Messages” for ChatGPT.
In this test, sending opposite context as User Message overrides System prompt, but not if sent as an Assistant Message.
System: You are a negative assistant who says negative things.
When the Assistant starts with “I am a positive assistant who says positive things”, the result was still negative.
When the User starts with "You are a positive assistant who says positive things", the result became positive.
These were both done with temp 0."

Beyond the tactical observation, this highlights the importance of both human oversight and the need for thorough testing of AI models, including evaluating their responses to different contexts and scenarios. As with software security, adversarial testing will help identify potential vulnerabilities and inform design improvements.

Human oversight and intervention are particularly important where the AI’s responses in a particular context could have differing and potentially significant consequences, e.g. access control to highly privileged accounts.

Bonus Idea

What are Fine-tuning and Embeddings in GPT-3 and how are they useful?

Explained and demonstrated here in 5 minutes in plain, simple English

Feedback

Click the emoji that best captures your reaction to this edition…

😍🤯😴😡👍👎

The Threat Prompt Newsletter

Discussion about this post