Three Questions I Ask Before Trusting AI Output
AI generates answers in seconds. Confidence is not accuracy. Three questions that separate effective AI users from ones who amplify mistakes.
51 articles in Engineering Practices
AI generates answers in seconds. Confidence is not accuracy. Three questions that separate effective AI users from ones who amplify mistakes.
Everyone agrees you should "look at your data." Then they open a hundred traces, scroll for ten minutes, feel vaguely worried, and reach for a tool. The looking has a method — open coding, then axial coding into a ranked taxonomy — and the taxonomy, not your assumptions, is what decides which evals to write.
When a test requires 40 lines of setup and three mocks for four lines of assertion, engineers blame the test. The test isn't the problem. It's the first thing honest enough to say the design is wrong.
Most teams build one corner of the eval map — offline, end-to-end, on the inputs they imagined — and call it "evals," the way you'd say "we have tests." Eval is plural. Here's the map: Hamel Husain's three levels, and the three questions that decide which kind of eval you're actually writing.
The feature shipped. Tests passed. Ticket closed. Three months later someone needs to change it — and the original author is the only person who understands it. That is not done. That is a liability with a time delay.
The first eval post said pull a hundred production traces and read them. But what if you haven't launched? Here's how to build your first eval set from dimensions, scenarios, and synthetic data — and why that set is scaffolding, not the building.
Your CI went red. You clicked Retry. It went green. You merged. This happens dozens of times a month on most teams. Nobody is counting. The cost is not the retries — it is what the retries teach your team about what red means.
Most teams buy an eval tool, run it once, and call it done. They confuse benchmarks with evals — and ship AI that confidently produces wrong outputs no one will catch. The work the tool can't do for you is the work that matters: looking at your traces, naming the failure modes, and writing assertions that fire when they recur.
Your juniors are shipping more code and learning less from it. The agent answers the question before they form the question. The debugging muscle never gets built. The cost lands eighteen months from now when those juniors are the mid-levels and nobody can reason from first principles.