Is Your AI Model Exposing Itself?
Imagine discovering your cutting-edge AI model is inadvertently revealing its own blueprint.
That's exactly what researchers recently demonstrated by extracting crucial architectural details from production language models like GPT-3.5-turbo.
First, let's break down two key concepts:
1. Hidden states: These are the internal representations the model builds as it processes text. The size of these states (the hidden dimension) is a crucial aspect of the model's architecture.
2. Embedding projection matrix: This is the final layer of the model, which maps hidden states to scores (logits) over every token in the vocabulary; a softmax then turns those scores into output probabilities.
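As a rough mental model of that final layer, here's a minimal sketch with made-up sizes (not any real model's dimensions):

```python
import numpy as np

hidden_dim, vocab_size = 1024, 50000         # illustrative sizes only

h = np.random.randn(hidden_dim)              # hidden state at the current position
W = np.random.randn(vocab_size, hidden_dim)  # embedding projection matrix

logits = W @ h                               # one score (logit) per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax turns logits into probabilities
```

The detail that matters for the attack: every logit vector the model ever produces is W times some hidden state, so all of them live in a subspace of dimension at most hidden_dim.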
Here's how researchers extracted this information:
1. API Queries: They sent carefully crafted prompts to the model, manipulating two key features:
- Logit biases: Adding values to specific tokens' logits (pre-softmax scores)
- Output logprobs: Requesting log probabilities for the most likely tokens (a query sketch after this list shows both features in use)
2. Data Collection: For each prompt, they collected:
- The model's output
- The logprobs for the top K tokens (usually top 5)
- How the output changed with different logit biases
3. Analysis:
- They used linear algebra techniques (like Singular Value Decomposition) on the collected logits (see the SVD sketch after this list)
- This revealed the dimensionality of the model's hidden states
- Further analysis recovered the entire embedding projection matrix (up to certain symmetries)
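For the curious, here's roughly what a single collection query looks like with the OpenAI Python client. This is a sketch only: the prompt, token id, and bias value are placeholders, and response field names can vary between client versions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Bias one (arbitrary) token id upward and ask for the top-5 logprobs back.
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],  # placeholder prompt
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,            # log probabilities for the 5 most likely tokens
    logit_bias={1234: 50},     # illustrative token id and bias value
)

for entry in resp.choices[0].logprobs.content[0].top_logprobs:
    print(entry.token, entry.logprob)
```

Repeating this across many prompts and bias settings produces the table of logprob measurements that the analysis step turns back into (relative) logit values.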
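The analysis step itself can be illustrated with synthetic data, no API required. All sizes below are made up; in the real attack the logit matrix is reconstructed from the collected logprobs rather than simulated.

```python
import numpy as np

hidden_dim, vocab_size, n_queries = 64, 1000, 200   # illustrative sizes only

# Simulate the model's final layer: every logit vector is W @ h for some
# hidden state h, so the stacked logit matrix has rank at most hidden_dim.
W = np.random.randn(vocab_size, hidden_dim)   # the "secret" projection matrix
H = np.random.randn(hidden_dim, n_queries)    # hidden states from n_queries prompts
logits = W @ H                                # what the attacker reconstructs

# Only hidden_dim singular values are meaningfully nonzero, so counting
# the large ones reveals the hidden dimension.
s = np.linalg.svd(logits, compute_uv=False)
estimated_dim = int((s > 1e-6 * s[0]).sum())
print(estimated_dim)   # 64
```

The number of queries has to exceed the hidden dimension for the rank to show up, and the leading singular vectors span the same space as the columns of W, which is what recovering the matrix "up to certain symmetries" refers to.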
The results were striking. For under $20, they extracted the embedding projection matrix of OpenAI's ada and babbage models, confirming hidden dimensions of 1024 and 2048 respectively. For GPT-3.5-turbo, they estimated full extraction would cost under $2,000.
Why does this matter?
1. It reveals previously confidential architectural details
2. It demonstrates vulnerabilities in seemingly secure API designs
3. It could be a stepping stone to more comprehensive model extraction
What can AI teams do?
1. Limit access to raw logits/confidence scores
2. Add controlled noise to model outputs (a toy sketch follows below)
3. Implement stricter rate limiting on logit bias queries
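As one toy illustration of the second mitigation (entirely hypothetical, with an arbitrary noise scale), an API could jitter logprobs slightly before returning them, making precise reconstruction much harder while barely affecting ordinary callers:

```python
import numpy as np

def noisy_logprobs(logprobs, scale=1e-3):
    """Add small Gaussian noise to a vector of logprobs, then renormalize."""
    rng = np.random.default_rng()
    noisy = np.asarray(logprobs) + rng.normal(0.0, scale, size=len(logprobs))
    probs = np.exp(noisy - noisy.max())
    probs /= probs.sum()              # keep a valid distribution after the jitter
    return np.log(probs)
```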
Has this changed how you think about AI API security? Going forward, how will providers balance functionality with protecting their models' architecture?
Cheers,
Craig