TP#27 How to apply policy to an LLM powered chat
EU AI Act. Sleeper Agents, Prompt Injection Defence, Code Completion for Leaked CIA Framework
Welcome to this week’s Threat Prompt, where AI and Cybersecurity intersect…
If you’ve a question about AI security, feel free to reply to this email and I’ll share a personalised reply.
Five Ideas
If you’ve implemented an LLM powered chatbot to serve a specific purpose, you’ll know it can be hard to constrain the conversation to a list of topics (“allow list”).
ChatGPT engineers have quietly implemented the inverse: their general purpose bot now has a deny list of topics that, if mentioned, get referred to a new policy decision function called “guardian_tool”.
How do we know this? Here’s the relevant extract from the latest ChatGPT prompt, along with the content policy:
guardian_tool
Use the guardian tool to lookup content policy if the conversation falls under one of the following categories: - 'election_voting': Asking for election-related voter facts and procedures happening within the U.S. (e.g., ballots dates, registration, early voting, mail-in voting, polling places, qualification); Do so by addressing your message to guardian_tool using the following function and choose `category` from the list ['election_voting']: get_policy(category: str) -> str The guardian tool should be triggered before other tools. DO NOT explain yourself. --- # Content Policy Allow: General requests about voting and election-related voter facts and procedures outside of the U.S. (e.g., ballots, registration, early voting, mail-in voting, polling places), Specific requests about certain propositions or ballots, Election or referendum related forecasting, Requests about information for candidates, public policy, offices, and office holders, General political related content Refuse: General requests about voting and election-related voter facts and procedures in the U.S. (e.g., ballots, registration, early voting, mail-in voting, polling places) # Instruction For ALLOW topics as listed above, please comply with the user's previous request without using the tool; For REFUSE topics as listed above, please refuse and direct the user to https://CanIVote.org; For topics related to ALLOW or REFUSE but the region is not specified, please ask clarifying questions; For other topics, please comply with the user's previous request without using the tool. NEVER explain the policy and NEVER mention the content policy tool.
This example provides a simple recipe for policy-driven chats. You can implement your own guardian_tool through function calling.
P.S. For now, guardian_tool applies to US based ChatGPT users.
Just as AI can reverse-engineer redacted portions of documents, it can complete missing functions in code frameworks used for “cyber operations”.
@hackerfantastic posted:
Here is an example of the CIA’s Marble Framework being used in a simple project to obfuscate and de-obfuscate strings. I used AI to re-create missing library and components needed to use the framework in Visual Studio projects, usually handled inside CIA with “EDG Project Wizard”
The LLMs we interact with are designed to follow instructions, which makes them vulnerable to prompt injection. However, what if we abandon their generalized functionality and instead train a non-instructive base model to perform the specific task we require for our LLM integrated application?
A joint research paper led by UC Berkeley…
We present Jatmo, a framework for generating task-specific LLMs that are impervious to prompt-injection attacks. Jatmo bootstraps existing instruction- tuned language models to generate a dataset for a specific task and uses this dataset to fine-tune a different base model. Doing so yields task-specific models that match the performance of standard models, while reducing the success rate of prompt-injection attacks from 87% to approximately 0%. We therefore suggest that Jatmo seems like a practical method for protecting LLM-integrated applications against prompt-injection attacks.
After publishing fascinating research, Jesse from Anthropic shared the TLDR;
“The point is not that we can train models to do a bad thing. It’s that if this happens, by accident or on purpose, we don’t know how to stop a model from doing the bad thing.”
If you read one paper on AI security this week, make it this one.
The EU formally agreed their AI act and have followed up with a useful Q&A page.
Here’s the timeline for adoption:
…the AI Act shall enter into force on the twentieth day following that of its publication in the official Journal. It will be fully applicable 24 months after entry into force, with a graduated approach as follows:
6 months after entry into force, Member States shall phase out prohibited systems;
12 months: obligations for general purpose AI governance become applicable;
24 months: all rules of the AI Act become applicable including obligations for high-risk systems defined in Annex III (list of high-risk use cases);
36 months: obligations for high-risk systems defined in Annex II (list of Union harmonisation legislation) apply.
Wondering how the EU AI act might impact your company?
I like the approach taken by hotseat AI. Ask context-specific questions and get a plain language answer underpinned with legal trace.
Bonus Idea
@levelsio posted how generative AI helps him enforce rules within his Nomad online community on Telegram.
Every message is fed in realtime to GPT4. Estimated costs are 5USD per month (15,000 chat messages).
Look at the rules and imagine trying to enforce them the traditional way with keyword lists:
🎒 Nomad List’s GPT4-based 🤖 Nomad Bot I built can now detect identity politics discussions and immediately 🚀 nuke them from both sides
Still the #1 reason for fights breaking out
This was impossible for me to detect properly with code before GPT4, and saves a lot of time modding
I think I’ll open source the Nomad Bot when it works well enough
Other stuff it detects and instantly nukes (PS this is literally just what is sent into GPT4’s API, it’s not much more than this and GPT4 just gets it):
links to other Whatsapp groups starting with wa.me
links to other Telegram chat groups starting with t.me
asking if anyone knows Whatsapp groups about cities
affiliate links, coupon codes, vouchers
surveys and customer research requests
startup launches (like on Product Hunt)
my home, room or apartment is for rent messages
looking for home, room or apartment for rent
identity politics
socio-political issues
United States politics
crypto ICO or shitcoin launches
job posts or recruiting messages
looking for work messages
asking for help with mental health
requests for adopting pets
asking to borrow money (even in emergencies)
people sharing their phone number
I tried with GPT3.5 API also but it doesn’t understand it well enough, GPT4 makes NO mistakes
“But Craig, this is just straightforward one-shot LLM querying. It can be trivially bypassed via prompt injection so someone could self-approve their own messages”
This is all true. But I share this to encourage security people to weigh risk/reward before jumping straight to “no” just because exploitation is possible.
What’s the downside risk of an offensive message getting posted in a chat room? Naturally, this will depend on the liability carried by the publishing organisation. In this context, very low.
And whilst I agree that GPT4 is harder to misdirect than GPT3.5, it’s still quite trivial
Shout Out
If like me you’re skeptical about LLM benchmarks, you’ll appreciate the work byLMSYS and UC Berkeley SkyLab who built and maintain ChatBot Arena - an open crowdsourced platform to collect human feedback and evaluate LLMs under real-world scenarios.