AI Engineering

35 articles in AI Engineering

Three verification question cards with a risk-stakes calibration scale — a Warm Beige background showing a framework for deciding when to trust AI output

Three Questions I Ask Before Trusting AI Output

AI generates answers in seconds. Confidence is not accuracy. Three questions that separate effective AI users from those who amplify mistakes.

By Vishvjitsinh Vanar • June 11, 2026

A single production trace on the left flowing into a free-form open-coding note, then a stack of notes clustering into a ranked taxonomy of named failure modes on the right, with the most frequent mode at the top.

AI Engineering, Engineering Practices

Error Analysis Is the Eval Work. Here's How to Actually Do It.

Everyone agrees you should "look at your data." Then they open a hundred traces, scroll for ten minutes, feel vaguely worried, and reach for a tool. The looking has a method — open coding, then axial coding into a ranked taxonomy — and the taxonomy, not your assumptions, is what decides which evals to write.

By Harsh Parmar • June 8, 2026

A pyramid of three evaluation levels — unit tests at the base, human and model eval in the middle, A/B testing at the top — beside three cross-cutting axes: offline to online, reference-based to reference-free, end-to-end to per-component

AI Engineering, Engineering Practices

You Don't Have Evals. You Have One Kind of Eval.

Most teams build one corner of the eval map — offline, end-to-end, on the inputs they imagined — and call it "evals," the way you'd say "we have tests." Eval is plural. Here's the map: Hamel Husain's three levels, and the three questions that decide which kind of eval you're actually writing.

By Harsh Parmar • June 6, 2026

An empty dimensions-by-scenarios grid on the left labelled "no traces yet" being filled cell by cell with synthetic inputs, then progressively backfilled with real production traces on the right.

AI Engineering, Engineering Practices

You Can't Look at Data You Don't Have: Building Your First Eval Set

The first eval post said pull a hundred production traces and read them. But what if you haven't launched? Here's how to build your first eval set from dimensions, scenarios, and synthetic data — and why that set is scaffolding, not the building.

By Harsh Parmar • June 2, 2026

Two columns contrasting a generic benchmark suite on the left with a domain-specific eval written from production traces on the right, framed as borrowed versus yours.

AI Engineering, Engineering Practices

Evals Aren't a Benchmark Suite. They're a Habit of Looking at Your Data.

Most teams buy an eval tool, run it once, and call it done. They confuse benchmarks with evals — and ship AI that confidently produces wrong outputs no one will catch. The work the tool can't do for you is the work that matters: looking at your traces, naming the failure modes, and writing assertions that fire when they recur.

By Harsh Parmar • May 28, 2026

Two-panel diagram showing the traditional 10-year apprenticeship curve where a junior builds debugging, design, and reasoning skills through struggle, versus the AI-shortcut version where the agent answers every question before the engineer forms it and the skill curve stays flat

AI Engineering, Engineering Leadership

AI Is Making Your Junior Engineers Worse At Their Jobs

Your juniors are shipping more code and learning less from it. The agent answers the question before they form the question. The debugging muscle never gets built. The cost lands eighteen months from now when those juniors are the mid-levels and nobody can reason from first principles.

By Vishvjitsinh Vanar • May 27, 2026

AI Engineering, Engineering Practices

CODEOWNERS Was Built for Humans You Could Trust. Your Committers Are Now Agents.

CODEOWNERS, blame, and branch protection were built around named humans with reputation. When the committer becomes an agent, the trust scaffolding stays standing on a foundation that quietly disappeared — and the load migrates to a single reviewer's signature.

By Harsh Parmar • May 25, 2026

Diagram showing three engineering agents — code writer, code reviewer, test writer — each with a paired evaluation harness measuring its output against a labeled golden set

AI Engineering, Engineering Practices

Evals for Engineering Agents: How We Test the AI That Tests Your Code

Your engineering agent reviews PRs, writes tests, and ships code. Nobody can tell you whether it is better than rubber-stamping. Without evals on the agent itself, you are flying blind on the tool that is making 40% of your engineering decisions.

By Vishvjitsinh Vanar • May 22, 2026

AI Engineering, Code Quality

The Half-Life of AI-Generated Code

AI code passes review on day one and ages worst by month six — not because it's wrong, but because the design intent that makes it refactorable was never part of the diff. The fix is encoding intent in artifacts the next iteration must read.

By Harsh Parmar • May 22, 2026