Engineering Practices

51 articles in Engineering Practices

Three verification question cards with a risk-stakes calibration scale on a warm beige background, showing a framework for deciding when to trust AI output

Three Questions I Ask Before Trusting AI Output

AI generates answers in seconds. Confidence is not accuracy. Three questions that separate effective AI users from those who amplify mistakes.

By Vishvjitsinh Vanar • June 11, 2026

A single production trace on the left flowing into a free-form open-coding note, then a stack of notes clustering into a ranked taxonomy of named failure modes on the right, with the most frequent mode at the top.

AI Engineering, Engineering Practices

Error Analysis Is the Eval Work. Here's How to Actually Do It.

Everyone agrees you should "look at your data." Then they open a hundred traces, scroll for ten minutes, feel vaguely worried, and reach for a tool. The looking has a method — open coding, then axial coding into a ranked taxonomy — and the taxonomy, not your assumptions, is what decides which evals to write.

By Harsh Parmar • June 8, 2026

An orange background with two cards side by side — left card in deep teal showing the number 31 lines of setup against 4 lines of assertion, right card in warm beige listing the three design problems revealed: too many mocks, complex setup, cannot isolate.

Engineering Practices, Software Design

The Test That's Hardest to Write Is Telling You Something About Your Design

When a test requires 40 lines of setup and three mocks for four lines of assertion, engineers blame the test. The test isn't the problem. It's the first thing honest enough to say the design is wrong.

By Shivani Sutreja • June 8, 2026

A pyramid of three evaluation levels — unit tests at the base, human and model eval in the middle, A/B testing at the top — beside three cross-cutting axes: offline to online, reference-based to reference-free, end-to-end to per-component

AI Engineering, Engineering Practices

You Don't Have Evals. You Have One Kind of Eval.

Most teams build one corner of the eval map — offline, end-to-end, on the inputs they imagined — and call it "evals," the way you'd say "we have tests." Eval is plural. Here's the map: Hamel Husain's three levels, and the three questions that decide which kind of eval you're actually writing.

By Harsh Parmar • June 6, 2026

A deep teal background showing two states side by side: Day 1 with a green DONE checkmark, and 3 Months Later with a coral question mark asking whether the next engineer can change it without asking the original author.

Engineering Practices, Software Design

"Done" Doesn't Mean It Works. It Means Someone Else Can Change It.

The feature shipped. Tests passed. Ticket closed. Three months later someone needs to change it — and the original author is the only person who understands it. That is not done. That is a liability with a time delay.

By Shivani Sutreja • June 5, 2026

An empty dimensions-by-scenarios grid on the left labelled "no traces yet" being filled cell by cell with synthetic inputs, then progressively backfilled with real production traces on the right.

AI Engineering, Engineering Practices

You Can't Look at Data You Don't Have: Building Your First Eval Set

The first eval post said pull a hundred production traces and read them. But what if you haven't launched? Here's how to build your first eval set from dimensions, scenarios, and synthetic data — and why that set is scaffolding, not the building.

By Harsh Parmar • June 2, 2026

A CI run history on a warm beige background showing an alternating pattern of green pass and coral fail circles with Retry annotations, alongside two stat cards showing 47 retries and zero investigations last month.

Testing, Engineering Practices

The Flaky Test Is the Most Expensive Test You Have

Your CI went red. You clicked Retry. It went green. You merged. This happens dozens of times a month on most teams. Nobody is counting. The cost is not the retries — it is what the retries teach your team about what red means.

By Shivani Sutreja • May 28, 2026

Two columns contrasting a generic benchmark suite on the left with a domain-specific eval written from production traces on the right, framed as borrowed versus yours.

AI Engineering, Engineering Practices

Evals Aren't a Benchmark Suite. They're a Habit of Looking at Your Data.

Most teams buy an eval tool, run it once, and call it done. They confuse benchmarks with evals — and ship AI that confidently produces wrong outputs no one will catch. The work the tool can't do for you is the work that matters: looking at your traces, naming the failure modes, and writing assertions that fire when they recur.

By Harsh Parmar • May 28, 2026

Two-panel diagram showing the traditional 10-year apprenticeship curve, where a junior builds debugging, design, and reasoning skills through struggle, versus the AI-shortcut version, where the agent answers every question before the engineer forms it and the skill curve stays flat.

AI Engineering, Engineering Leadership

AI Is Making Your Junior Engineers Worse At Their Jobs

Your juniors are shipping more code and learning less from it. The agent answers the question before they form the question. The debugging muscle never gets built. The cost lands eighteen months from now when those juniors are the mid-levels and nobody can reason from first principles.

By Vishvjitsinh Vanar • May 27, 2026