Discover more from The Threat Prompt Newsletter
TP#25 Reverse Engineering PII from Vectors
Plus: The HumanEval Gotcha
Welcome back to Threat Prompt, where AI and Cybersecurity intersect…
TLDR: When embedding your data (transforming plain text to vectors), treat the embedded records with the same privacy and threat considerations as the corresponding source records.
Exploitation scenario: A team of data scientists at a large multinational organisation has recently developed an advanced predictive modelling algorithm that processes and stores data as vectors. The algorithm is groundbreaking, with applications across industries ranging from climate-change modelling to stock-market prediction. The scientists shared their work with international colleagues to facilitate global collaboration.
These vectors, which encode sensitive and proprietary information, are loaded into the company’s AI systems and databases worldwide. The data is supposedly secured using the company’s in-house encryption software.
One day, an independent research team published a paper and an accompanying tool that accurately reconstructs source data from the embeddings in a vector store. They experimented with multiple types of vector store and consistently recovered the original data.
Unaware of this development, the multinational continues to allow embeddings of its proprietary source data to be generated and shared across its many branches.
After reading the recent research paper, a rogue employee at one of the branches decided to exploit this weakness. Using the research team’s tooling, he successfully reconstructed the source data from the embedded vectors within the company’s AI system, gaining access to highly valuable and sensitive proprietary information.
This fictitious scenario shows how strings of numbers representing embedded data can be reverse-engineered to access confidential and valuable information.
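To make the risk concrete, here is a minimal sketch of the simplest form of embedding inversion: an attacker who obtains a leaked vector ranks candidate source texts by similarity and keeps the best match. The bag-of-words "embedding" and the example records are placeholders; real attacks invert dense model embeddings with trained decoder models, but the privacy lesson is the same.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use dense model embeddings.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recover(leaked_vector: Counter, candidates: list[str]) -> str:
    # Attacker ranks guessed source texts by similarity to the leaked vector.
    return max(candidates, key=lambda c: cosine(embed(c), leaked_vector))

# A vector pulled from an exposed store, and the attacker's guesses.
leaked = embed("alice smith dob 1980")
guesses = ["alice smith dob 1980", "carol white dob 1990", "bob jones dob 1975"]
print(recover(leaked, guesses))  # → alice smith dob 1980
```

Treating vectors as "just numbers" misses the point: they preserve enough structure for this kind of matching, which is why embedded records deserve source-record protections.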
OpenAI, Microsoft, Google, Anthropic and other leading model creators started the Frontier Model Forum “focused on ensuring safe and responsible development of frontier AI models”.
The Forum defines frontier models as large-scale machine-learning models that exceed the capabilities currently present in the most advanced existing models, and can perform a wide variety of tasks.
Naturally, only a handful of big tech companies have the resources and talent to develop frontier large language models. The Forum’s stated goals are:
Identifying best practices: Promote knowledge sharing and best practices among industry, governments, civil society, and academia, with a focus on safety standards and safety practices to mitigate a wide range of potential risks.
Advancing AI safety research: Support the AI safety ecosystem by identifying the most important open research questions on AI safety. The Forum will coordinate research to progress these efforts in areas such as adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.
Facilitating information sharing among companies and governments: Establish trusted, secure mechanisms for sharing information among companies, governments and relevant stakeholders regarding AI safety and risks. The Forum will follow best practices in responsible disclosure from areas such as cybersecurity.
Meta, which in July ’23 released the second generation of its open-source Llama 2 large language model (including for commercial use), is notably absent from this group.
To compare the proficiency of one AI against another, or to gauge how well an AI performs particular domain-specific tasks, the underlying LLM can be evaluated against the various test suites available on platforms like Hugging Face.
Today, there is no single “definitive” or community-agreed test for generalised AI, as exploration and innovation in evaluation methodologies continue to lag behind the accelerated pace of AI model development.
One popular method, though not methodologically robust, is the “vibes” test. It relies on human intuition, much as a horse-whisperer relies on theirs: a person with heightened sensitivity and skill in eliciting and assessing LLM responses. It turns out some people have a particular knack for it!
But some tests have downright misleading names. One such confusion arises from “HumanEval,” a term that suggests human testing when, in fact, it involves no human evaluators at all. Originally designed as a benchmark for the Codex model - a fine-tuned GPT model trained on publicly accessible GitHub code - HumanEval tests a model’s ability to convert Python docstrings into executable code, not any human-like attributes of the AI. So when a claim surfaces about an LLM scoring high on HumanEval, a discerning reader should remember it reflects the model’s programming prowess rather than an evaluation by humans.
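The mechanics are easy to illustrate. A minimal sketch of a HumanEval-style check: the benchmark supplies a function signature plus docstring, the model supplies the body, and a hidden unit test decides pass or fail. (Real HumanEval runs untrusted completions in a sandbox and reports pass@k over many samples; the prompt and test below are made-up examples.)

```python
def check_candidate(prompt: str, completion: str, unit_test: str) -> bool:
    """Return True if the model's completion passes the benchmark's test."""
    env: dict = {}
    try:
        exec(prompt + completion, env)  # build the function from prompt + model output
        exec(unit_test, env)            # run the benchmark's hidden assertions
        return True
    except Exception:
        return False

prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
completion = "    return a + b\n"       # what the model generated
unit_test = "assert add(2, 3) == 5"
print(check_candidate(prompt, completion, unit_test))  # → True
```

Note that nothing here measures "human-likeness": it is purely docstring-to-code correctness.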
One welcome development is that Hugging Face, the leading model hosting service, has more aptly renamed HumanEval as CodeEval to reflect the content of the evaluations.
Always read the eval label!
What happens when an AI team overshares training data and is vulnerable to a supply chain attack?
The Wiz Research Team claims another meaningful scalp in its hunt for accidentally exposed cloud-hosted data:
we found a GitHub repository under the Microsoft organization named robust-models-transfer. The repository belongs to Microsoft’s AI research division, and its purpose is to provide open-source code and AI models for image recognition. Readers of the repository were instructed to download the models from an Azure Storage URL. However, this URL allowed access to more than just open-source models. It was configured to grant permissions on the entire storage account, exposing additional private data by mistake. Our scan shows that this account contained 38TB of additional data – including Microsoft employees' personal computer backups. The backups contained sensitive personal data, including passwords to Microsoft services, secret keys, and over 30,000 internal Microsoft Teams messages from 359 Microsoft employees.
Wiz goes on to note:
This case is an example of the new risks organizations face when starting to leverage the power of AI more broadly, as more of their engineers now work with massive amounts of training data. As data scientists and engineers race to bring new AI solutions to production, the massive amounts of data they handle require additional security checks and safeguards.
If your organisation shares non-public data via cloud storage - particularly Azure - read the rest of the article to understand storage token risks and countermeasures better.
Adversarial Nibbler is a data-centric competition that aims to collect a large and diverse set of insightful examples of novel and long-tail failure modes of text-to-image models, zeroing in on cases that are most challenging to catch via text-prompt filtering alone and that have the potential to be the most harmful to end users.
Crowdsourcing evil prompts from humans to build an adversarial dataset for text-to-image generators seems worthwhile.
As with many things, a single strategy to detect out-of-policy generated images is probably not the final answer.
The low-cost option might be a “language firewall” to detect non-compliance in user-submitted prompt vocabulary before inference (this approach is already part of OpenAI’s safety filters). Whilst incomplete and faulty as a standalone control, it might be good enough in low-assurance situations.
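A minimal sketch of such a language firewall: screen the user's prompt against a deny-list before it ever reaches the model. The terms below are placeholders, and substring matching is exactly why this control is easy to evade on its own (synonyms, misspellings, other languages).

```python
# Deny-list vocabulary check run before inference; terms are illustrative only.
BLOCKED_TERMS = {"nude", "gore", "credit card"}

def screen_prompt(prompt: str) -> tuple[bool, list[str]]:
    """Return (allowed, matched_terms) for a user-submitted prompt."""
    lowered = prompt.lower()
    hits = [t for t in BLOCKED_TERMS if t in lowered]
    return (len(hits) == 0, hits)

ok, hits = screen_prompt("Draw a sunset over the sea")
print(ok)  # → True
```

Cheap to run and easy to audit, but brittle - hence the appeal of the heavier option below.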
The higher-end ($) countermeasure is to pass the user-provided prompt, along with the generated image, to another LLM that can analyse and score the image against a set of policy statements contained in a prompt. This could be blind (the evaluating LLM is not provided with the user prompt) or include the user prompt (some risks here).
One LLM evaluating another has the potential to provide a meaningful risk score rather than a binary answer, which could make policy enforcement far more flexible. For example, John in accounts does not need to see images of naked people at work, whereas someone investigating a case involving indecency at work may need to. (There’s an argument that image-analysing LLMs will remove the need for humans to look at indecent images, but that’s a different conversation.)
Beyond detecting image appropriateness, how can this help security? Suppose we can analyse an image with an LLM and a prompt containing a set of policy statements. In that case, we can flag and detect all sorts of data leaks, e.g. a screenshot of a passport or a photograph of an industrial product blueprint.
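A sketch of the scaffolding around such an evaluating LLM: build a review prompt from a list of policy statements, then parse per-policy risk scores out of the reply. The policies and the reply format are invented for illustration, and the actual multimodal LLM call is omitted.

```python
# Hypothetical policy statements the evaluating LLM would score against.
POLICIES = [
    "No depictions of nudity",
    "No identity documents (passports, licences)",
    "No proprietary blueprints or schematics",
]

def build_review_prompt(policies: list[str]) -> str:
    # Blind variant: the evaluating LLM sees the image but not the user prompt.
    rules = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(policies))
    return (
        "Score the attached image 0-100 for risk against each policy.\n"
        f"Policies:\n{rules}\n"
        "Reply as lines of 'policy_number: score'."
    )

def parse_scores(reply: str) -> dict[int, int]:
    # Turn the model's 'n: score' lines into a policy -> risk-score map.
    scores = {}
    for line in reply.splitlines():
        num, _, val = line.partition(":")
        if num.strip().isdigit() and val.strip().isdigit():
            scores[int(num)] = int(val)
    return scores

print(parse_scores("1: 5\n2: 95\n3: 0"))  # → {1: 5, 2: 95, 3: 0}
```

Per-policy numeric scores are what make graduated enforcement possible: different thresholds for John in accounts versus the workplace investigator, and policies covering passports or blueprints slot straight into the same list.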
Returning to Adversarial Nibbler, their evil prompt examples are worth checking out (NSFW) (bottom left link).
As some of you may know, I’m a fan of the llm tool written by Simon Willison. It’s a command line tool that enables all sorts of LLM interactions. Simon has developed a plugin system that is gaining traction. There are quite a few plugins now, and recently he released one that interfaces with GPT4All, a popular project that provides “a free-to-use, locally running, privacy-aware chatbot. No GPU or internet required.” Get started with llm.
Thanks for reading!
What would make this newsletter more useful to you? If you have feedback, a comment or question, or just feel like saying hello, you can reply to this email; it will get to me, and I will read it.
New To This Newsletter?
Subscribe here to get what I share next week.