Three Questions I Ask Before Trusting AI Output
AI generates answers in seconds. Confidence is not accuracy. Here are three questions that separate effective AI users from those who amplify mistakes.
35 articles in AI Engineering
Everyone agrees you should "look at your data." Then they open a hundred traces, scroll for ten minutes, feel vaguely worried, and reach for a tool. The looking has a method — open coding, then axial coding into a ranked taxonomy — and the taxonomy, not your assumptions, is what decides which evals to write.
Most teams build one corner of the eval map — offline, end-to-end, on the inputs they imagined — and call it "evals," the way you'd say "we have tests." Eval is plural. Here's the map: Hamel Husain's three levels, and the three questions that decide which kind of eval you're actually writing.
The first eval post said pull a hundred production traces and read them. But what if you haven't launched? Here's how to build your first eval set from dimensions, scenarios, and synthetic data — and why that set is scaffolding, not the building.
Most teams buy an eval tool, run it once, and call it done. They confuse benchmarks with evals — and ship AI that confidently produces wrong outputs no one will catch. The work the tool can't do for you is the work that matters: looking at your traces, naming the failure modes, and writing assertions that fire when they recur.
Your juniors are shipping more code and learning less from it. The agent answers the question before they form the question. The debugging muscle never gets built. The cost lands eighteen months from now when those juniors are the mid-levels and nobody can reason from first principles.
CODEOWNERS, blame, and branch protection were built around named humans with reputation. When the committer becomes an agent, the trust scaffolding stays standing on a foundation that quietly disappeared — and the load migrates to a single reviewer's signature.
Your engineering agent reviews PRs, writes tests, and ships code. Nobody can tell you whether it is better than rubber-stamping. Without evals on the agent itself, you are flying blind on the tool that is making 40% of your engineering decisions.
AI code passes review on day one and ages worst by month six — not because it's wrong, but because the design intent that makes it refactorable was never part of the diff. The fix is encoding intent in artifacts the next iteration must read.