TP#26 AI knows what you typed
Plus: Clickworkers, Benchmarking LLMs, Mac Studio
Welcome to this week’s Threat Prompt, where AI and Cybersecurity intersect…
Side channel attacks (SCA) collect and interpret signals a device emits to reveal otherwise confidential information or operations.
Researchers at Durham, Surrey and Royal Holloway published a paper applying ML and AI to SCA:
With recent developments in deep learning, the ubiquity of microphones and the rise in online services via personal devices, acoustic side-channel attacks present a greater threat to keyboards than ever. This paper presents a practical implementation of a state-of-the-art deep learning model in order to classify laptop keystrokes, using a smartphone integrated microphone. When trained on keystrokes recorded by a nearby phone, the classifier achieved an accuracy of 95%, the highest accuracy seen without the use of a language model. When trained on keystrokes recorded using the video-conferencing software Zoom, an accuracy of 93% was achieved, a new best for the medium. Our results prove the practicality of these side-channel attacks via off-the-shelf equipment and algorithms. We discuss a series of mitigation methods to protect users against these series of attacks.
Arthur’s ML Engineers Max Cembalest & Rowan Cheung on LLMs evaluating other LLMs. Topics covered:
Evolving Evaluation: LLMs require new evaluation methods to determine which models are best suited for which purposes.
LLMs as Evaluators: LLMs are used to assess other LLMs, leveraging their human-like responses and contextual understanding.
Biases and Risks: Understanding biases in LLM responses when judging other models is essential to ensure fair evaluations.
Relevance and Context: LLMs can create testing datasets that better reflect real-world context, enhancing model applicability assessment.
Traditional LLMs benchmarks have drawbacks: they quickly become part of training datasets and are hard to relate to in terms of real-world use-cases.
I made this as an experiment to address these issues. Here, the dataset is dynamic (changes every week) and composed of crowdsourced real-world prompts.
We then use GPT-4 to grade each model’s response against a set of rubrics (more details on the about page). The prompt dataset is easily explorable.
Everything is then stored in a Postgres database and this page shows the raw results.
Each benchmarked LLM is ranked by score, linked to detailed results. You can also compare two LLM scores side by side.
If you apply LLMs within a security context, having a non-AI grader run the benchmark will highlight things an AI wouldn’t. It may also help sceptics who (with some merit) see one AI benchmarking another as a potential risk that should be carefully controlled.
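As a rough illustration of rubric-based grading with a swappable judge, here is a minimal Python sketch. It is not the benchmark's actual code: the rubric shape, the function names and the `keyword_judge` stand-in are all assumptions made for the example. In practice the judge would be a call to a grading model, or a human reviewer answering the same yes/no question.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical rubric shape: a criterion plus the points it is worth.
Rubric = List[Dict[str, object]]
# A judge answers: does this response satisfy this criterion?
Judge = Callable[[str, str], bool]


def score_response(response: str, rubric: Rubric, judge: Judge) -> float:
    """Return the fraction of rubric points the response earns (0.0 to 1.0)."""
    total = sum(int(item["points"]) for item in rubric)
    earned = sum(
        int(item["points"])
        for item in rubric
        if judge(response, str(item["criterion"]))
    )
    return earned / total if total else 0.0


def rank_models(
    responses: Dict[str, str], rubric: Rubric, judge: Judge
) -> List[Tuple[str, float]]:
    """Score each model's response and sort from best to worst."""
    scores = {
        model: score_response(text, rubric, judge)
        for model, text in responses.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def keyword_judge(response: str, criterion: str) -> bool:
    """Deterministic stand-in judge (assumption): criterion is a required keyword."""
    return criterion.lower() in response.lower()


rubric = [
    {"criterion": "paris", "points": 2},
    {"criterion": "france", "points": 1},
]
responses = {
    "model_a": "Paris is the capital of France.",
    "model_b": "I think it is Paris.",
}
ranked = rank_models(responses, rubric, keyword_judge)
```

Because the judge is just a function, the same rubric can be run once with an LLM judge and once with a deterministic or human judge, and the two rankings compared.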
Morgan Meaker, writing for Wired:
AI companies are only going to need more data labor, forcing them to keep seeking out increasingly unusual labor forces to keep pace. As Metroc [Finnish Construction Company] plots its expansion across the Nordics and into languages other than Finnish, Virnala [CEO] is considering whether to expand the prison labor project to other countries. “It’s something we need to explore,” he says.
Data labourers, or “clickworkers”, are part of the AI supply chain. In this case they label data to help an LLM differentiate “between a hospital project that has already commissioned an architect or a window fitter, for example, and projects that might still be hiring.”
Supply chain security (and integrity) is already challenging. How far do we need to peer up-chain to establish the integrity of LLMs?
Andrej Karpathy posted:
I always struggle a bit with I’m asked about the “hallucination problem” in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines. We direct their dreams with prompts. The prompts start the dream, and based on the LLM’s hazy recollection of its training documents, most of the time the result goes someplace useful. It’s only when the dreams go into deemed factually incorrect territory that we label it a “hallucination”. It looks like a bug, but it’s just the LLM doing what it always does.
Andrej goes on to explain that when people complain about hallucinations, what they mean is they don’t want their LLM assistants hallucinating.
This implies an attempt to shift focus away from LLMs, the source of the problem (“too hard to fix”), and move the problem to some safety layer within an AI assistant.
Cart, meet horse.
Local Inference Hardware
You need a suitable CPU or GPU to run truly private AI on your hardware. Small LLMs - or heavily quantised larger models - can run well on recent CPUs. But larger or less quantised models need serious GPU power, and the two games in town are Nvidia and Apple.
Today, the Mac Studio with the M2 Ultra and 192GB RAM is the most powerful Mac for LLM inference. Released in June '23, the M2 Ultra is two M2 Max dies joined by a very high-bandwidth interconnect. Apple watchers suggest an M3 Ultra could be released in June '24.
Given the pace of open-source LLM development and associated tooling, this may be worth waiting for if you are already in the Apple ecosystem and need strictly private and performant inference.
It's seriously expensive, but if you have confidential workflows that could materially benefit from a fast, private assistant, this could pay for itself quickly.
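To make the quantisation point concrete, here is a back-of-the-envelope Python sketch of how much memory a model's weights need at different precisions. It counts weights only and assumes memory is roughly parameters times bits per weight divided by 8. Real usage is higher once the KV cache, activations and runtime overhead are added, and the function name and example model sizes are illustrative assumptions.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: ignores KV cache, activations and framework overhead.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Illustrative sizes and precisions (assumptions, not recommendations).
for params in (7, 70):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params}B parameters at {bits}-bit: ~{gb:.1f} GB of weights")
```

By this estimate a 70B-parameter model needs about 140 GB for 16-bit weights and about 35 GB at 4-bit, which is why memory capacity matters as much as raw compute when choosing hardware for local inference.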
Thanks for reading!
What would make this newsletter more useful to you? If you have feedback, a comment or question, or just feel like saying hello, you can reply to this email; it will get to me, and I will read it.
New To This Newsletter?
Subscribe here to get what I share next week.