Get an easy 5% performance gain
Ever wondered if there's a way to make AI more effective for complex security tasks? A recent experiment in AI-assisted coding reveals a simple trick that could be a game-changer.
The key: use two AI models in sequence - one to plan, one to execute.
"But Craig, this is what research paper XYZ already said…and our LLM-powered agents that gobble up all those tokens already do”.
So they said, but where are their credible, repeatable, real-world benchmark results?
Here are some eye-opening results from a code editing benchmark using Aider:
Pairing OpenAI's o1-preview model as the planner with DeepSeek as the executor completed 85% of the coding tasks without error.
o1-preview alone completed 79.7% of tasks, so the two-step approach delivered a significant 5.3 percentage-point improvement.
Even pairing a model with itself improved results. GPT-4o jumped from 71.4% to 75.2% when used for both steps.
These are mighty impressive performance jumps when you consider how much effort and resources LLM providers put into getting just a 1% improvement in their models.
That means for every 100 code development requests, you're only fixing around 15, rather than the current ~20.
Play those numbers forward and that’s an awful lot of expensive dev time reclaimed.
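If you want to try that exact pairing yourself, it's what aider calls "architect" mode. Here's a minimal sketch that launches it from Python; the flag names match aider's architect-mode docs at the time of writing, but the model identifiers are assumptions you should adjust for your own providers and API keys (check `aider --help` for your installed version):

```python
import subprocess

# A sketch of reproducing the planner/executor pairing with aider's
# "architect" mode. Flag names are as of aider's architect-mode release;
# the model identifiers below are assumptions -- adjust for your setup.
subprocess.run([
    "aider",
    "--architect",                                # enable the plan/execute split
    "--model", "o1-preview",                      # the "architect" that plans
    "--editor-model", "deepseek/deepseek-chat",   # the "editor" that writes the code
])
```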
While these tests focused on solving real-world programming challenges, the principle could apply to any security task that mixes high-level reasoning with lower-level execution.
Why it works:
Reduced complexity per step avoids attention splitting: Each AI can focus on a single aspect of the task, either high-level reasoning or specific implementation.
Optimized model selection: Different models can be chosen based on their observed strengths in either strategic reasoning or detailed execution.
Improved task division: Splitting the process allows for more precise prompting (with LLM-specific tweaks) and potentially better utilization of each model's capabilities.
This approach enables us to leverage the strengths of different AI models more effectively, or to use the same model in a more focused manner by giving it distinct, separate tasks.
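To make the pattern concrete outside of aider, here's a minimal sketch of the two-step flow using the OpenAI Python client (DeepSeek exposes an OpenAI-compatible endpoint). The task string, prompts, and model names are illustrative placeholders, not aider's actual implementation:

```python
import os
from openai import OpenAI

# Two clients: one for the planner, one for the executor.
# DeepSeek's endpoint is OpenAI-compatible, so the same client works.
planner = OpenAI()  # reads OPENAI_API_KEY from the environment
executor = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

# Placeholder task -- substitute your own.
task = "Refactor parse_logs() to stream input instead of loading the whole file."

# Step 1: the planner reasons about WHAT to change -- no code yet.
plan = planner.chat.completions.create(
    model="o1-preview",
    messages=[{
        "role": "user",
        "content": f"Describe, step by step, how to solve this coding task. "
                   f"Do not write any code.\n\n{task}",
    }],
).choices[0].message.content

# Step 2: the executor turns that plan into a concrete implementation.
code = executor.chat.completions.create(
    model="deepseek-chat",
    messages=[{
        "role": "user",
        "content": f"Implement this plan exactly. Return only code.\n\n"
                   f"Task: {task}\n\nPlan:\n{plan}",
    }],
).choices[0].message.content

print(code)
```

Notice that the planner is never asked for code and the executor is never asked to plan; that clean separation is the whole trick.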
Next time you're using AI for a complex security task, try this two-step approach (or at least test it!). You might be surprised by the improvement.
Have you ever experimented with using AI in stages for security work? What were your results?
Cheers, Craig