Overconfident by Design
When AI Outputs Mask Data Shortfalls
I’ve been building a tool that uses an LLM to analyze social media data pulled from an API. It’s designed to retrieve up to 120 records per network, handle pagination, and only begin analysis once all records are collected. The system prompt includes clear instructions, tool-use examples, and specific logic for multi-page retrieval.
And yet, in practice, the model consistently produces confident, data-driven analysis based on partial input, without signaling that it took a shortcut. No warning, no fallback. Just a clean summary built on an incomplete foundation.
The Retrieval Path Was Explicit
This OpenAI Custom GPT pulls social data via Apify through an OpenAPI connector. The LLM is instructed to request up to 120 thoughtfully pruned JSON records, paginating as needed to avoid “response too large” errors from the Custom GPT environment (OpenAI enforces a 100K-character limit on Action responses). The prompt includes concrete examples of correct and incorrect behavior, with analysis explicitly gated behind full retrieval.
The schema supports all of this. The data returned by the tool includes counts, pagination cues, and clear signals when additional pages are available.
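For concreteness, the retrieval contract the prompt describes looks roughly like the Python sketch below. The endpoint, field names (items, total, next_offset), and helper names are illustrative assumptions, not the actual schema; the point is the gating: page until everything available (up to the cap) is collected, and only then analyze.

```python
import requests  # stand-in HTTP client; the real call goes through an OpenAPI Action

MAX_RECORDS = 120  # per-network cap from the system prompt
API_URL = "https://example.com/social/search"  # placeholder endpoint

def fetch_all_records(query: str) -> list[dict]:
    """Page through the source until every available record (up to the cap) is collected."""
    records, offset = [], 0
    while True:
        page = requests.get(API_URL, params={"q": query, "offset": offset}).json()
        records.extend(page["items"])                  # hypothetical field names
        available = min(page["total"], MAX_RECORDS)
        if len(records) >= available or page.get("next_offset") is None:
            break                                      # no more pages, or cap reached
        offset = page["next_offset"]
    return records[:MAX_RECORDS]

def analyze(records: list[dict]) -> str:
    """Placeholder for the downstream analysis step."""
    return f"analyzed {len(records)} records"

# Analysis is gated behind full retrieval. This is the contract the model is given.
summary = analyze(fetch_all_records("example query"))
```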
“This isn’t prompt guesswork — it’s structured behavior with defined expectations.”
What Actually Happens
In run after run, the source reports the correct number of search hits: 67 total records, for example. The LLM retrieves the first page (50 records), stops, and proceeds to analysis. It then reports that 120 items were processed and dozens of videos analyzed, even though no media content was retrieved, only links. In other words, it reports the maximum number of items it requested in the search request, not the number of records it actually received.
The model simply moved forward without completing the task — and then materially overstated what it had done.
Where the LLM Breaks Down
This kind of failure isn’t due to a lack of access or visibility. The model had full access to the schema, the counts of records requested and received, and the cues required to retrieve additional pages.
But LLMs don’t treat those signals as binding. They’re not built to ensure procedural correctness. Once a result appears “sufficient,” the model proceeds. There’s no internal state check, no retry loop, and no validation that output claims reflect actual inputs.
Even worse, this behavior is silent. Unless you click through and inspect the raw tool use in the ChatGPT interface, there’s no indication that only a partial dataset was analyzed.
“The model behaved as if it had done the work — because it was told to, not because it verified that it had.”
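To make the gap concrete, here is the kind of claims-versus-inputs check that never happens. This is a hypothetical sketch (field names assumed), not part of the current system, but it is roughly all it would take to catch the failure above.

```python
def verify_claims(tool_response: dict, claimed_count: int) -> None:
    """Compare what the summary claims against what the tool actually returned."""
    received = len(tool_response["items"])   # hypothetical field names
    available = tool_response["total"]
    if claimed_count != received:
        raise ValueError(
            f"Summary claims {claimed_count} records, but only {received} "
            f"of {available} available were actually retrieved."
        )

# The failing run described above: 67 available, 50 retrieved, 120 claimed.
verify_claims({"items": [{}] * 50, "total": 67}, claimed_count=120)
# ValueError: Summary claims 120 records, but only 50 of 67 available were actually retrieved.
```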
Why This Is Risky
The danger isn’t just that the output is incomplete — it’s that it looks complete, and reads as authoritative. The failure is procedural, but the output masks it entirely.
This introduces several risks:
False confidence in coverage: When a model consistently mistakes the number of records it requested for the number it actually retrieved, that’s not just a factual error; it’s a silent integrity failure. Downstream consumers assume the analysis is grounded in data. It isn’t.
Not seeing the picture: When a model claims to have analyzed video or image content — but in reality only saw a URL or metadata — it’s not just stretching the truth. It’s creating the illusion of visibility where none existed. That’s how you end up making decisions based on analysis that never actually happened.
It’s not just the model that’s flying blind — it’s you.
Unverifiable summaries: Without visibility into what was actually retrieved and processed, it’s impossible to audit whether the insights are representative. Once the conversation ends, the audit trail vanishes, precluding any lookback analysis.
Distorted prioritization: If early records in a dataset over-index on inflammatory content or edge cases, a model that stops early can overstate threat signals, urgency, or volume.
Silent pipeline corruption: In environments where outputs feed into dashboards, workflows, or alerts — especially in security or reputational risk — these kinds of failures become hard to detect and easy to trust.
Policy missteps from phantom insights: If a summary suggests 120 videos promoting disinformation were analyzed, but none were actually processed, you may escalate unnecessarily — or worse, take public action based on fabricated coverage.
“The failure mode isn’t noise — it’s silence. The model doesn’t just underperform. It overclaims — and looks correct doing it.”
Redesigning for Observability (Work in Progress)
Prompting the LLM to “try harder” isn’t a viable solution. The model isn’t misbehaving — it’s operating as designed, within a constrained architecture.
So I’m redesigning the system to move data-sensitive execution into an agentic backend. The Custom GPT will invoke a new researcher Action, which calls an external agent that orchestrates a network of specialized social media sub-agents. These agents will:
Handle full pagination
Process metadata and media content
Run quantitative and qualitative analysis in parallel
More importantly, they’ll support post-execution validation hooks (sketched after this list) to confirm:
That all records were retrieved
That reported totals match actual input
That summaries are based on data, not assumptions
That the response size doesn’t trigger OpenAI guardrails
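A rough sketch of those hooks, under the same assumptions as before (Python; the RetrievalResult structure and field names are illustrative, not the final implementation):

```python
import json
from dataclasses import dataclass

MAX_ACTION_CHARS = 100_000  # OpenAI's Action response limit
MAX_RECORDS = 120           # per-network cap from the system prompt

@dataclass
class RetrievalResult:
    total_available: int   # what the source said exists
    records: list[dict]    # what the sub-agents actually pulled
    media_processed: int   # items whose media content (not just links) was analyzed

def post_execution_checks(result: RetrievalResult, reported_total: int, summary: str) -> list[str]:
    """Run after the sub-agents finish, before anything is returned to the Custom GPT."""
    failures = []
    if len(result.records) < min(result.total_available, MAX_RECORDS):
        failures.append("pagination incomplete: not all available records were retrieved")
    if reported_total != len(result.records):
        failures.append("reported total does not match the number of records received")
    if result.media_processed == 0 and ("video" in summary.lower() or "image" in summary.lower()):
        failures.append("summary references media that was never processed")
    if len(json.dumps(result.records)) > MAX_ACTION_CHARS:
        failures.append("payload exceeds the Action response limit and must be pruned")
    return failures
```

If any check fails, the backend retries retrieval or returns an explicit warning instead of a clean-looking summary, so the Custom GPT never receives output it can overclaim from.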
“Instead of relying on the model to behave reliably, I’m assigning responsibility to components designed for observability and control.”
The Custom GPT Still Has a Role
Despite this architectural shift, the Custom GPT remains central to the experience.
It continues to:
Host a large PDF knowledge base (free embedding)
Provide natural language interaction
Operate inside the familiar ChatGPT interface
Remain accessible to authenticated users at no cost
It’s still the synthesis layer — responsible for interpretation and communication — but no longer burdened with stateful execution.
Closing: Capability Isn’t Control
The LLM had access to the right data. It saw how many records were available. It understood the task and was instructed not to proceed without completing it.
But it did anyway.
That’s the core issue. Language models don’t enforce alignment between process and output. They can appear capable — even thorough — without doing the underlying work.
The fix isn’t more prompting. It’s architecture and “trust, but verify”. By delegating structured execution to systems that support verification, and limiting the LLM to what it does best, the output becomes something you can trust — not just something that sounds right.
The model saw the right data. It just didn’t act on it. Without systems that enforce correctness, accuracy becomes optional.
Even if you’re not in the weeds building AI systems, you’re still responsible for what they do.
Whether you’re procuring, deploying, or approving AI-driven tools, it’s worth asking: How does this system know what it saw? Can it prove it? And what happens if it doesn’t?
Responsible AI isn’t just about fairness and bias — it’s also about operational integrity. If you’re relying on model output to inform decisions, shape policy, or act on threats, then silent failures like this one aren’t just bugs. They’re liabilities.

