Context Beats Tests: An Engineering Discipline for the LLM Era

Why "AI should always TDD" is half-right, and what senior engineers actually do instead. The real leverage stack: Context, Types, Tests, Vibe coding, and where evals fit.

A thesis is making the rounds in 2026: AI-assisted coding without tests is dangerous slop generation, therefore every prompt should be wrapped in a Red-Green-Refactor loop. The conclusion is correct. The reasoning is half wrong. And acting on the half-wrong reasoning produces slightly-better slop, faster.

Test-Driven Development was popularized in the early 2000s for a world where the spec already lived in the developer’s head and tests externalized it. That world had two properties LLM development does not: outputs were deterministic, and the cost of writing a test was negligible compared to writing the implementation. Both assumptions break in the AI era, but not in the same way for every kind of AI work. Most “AI must TDD” advice quietly conflates two very different problems and answers them with one tool.

Let me separate them.

Two Different Problems Wearing the Same Hat

Problem A: LLM-assisted coding. You’re using Claude Code, Cursor, or Copilot to generate deterministic software: a Go CLI, a TypeScript service, a database migration. The output is normal code. The LLM is a faster pair-programmer.

Problem B: LLM-containing systems. You’re shipping a product where the LLM itself is part of the runtime: a RAG pipeline, an agent that calls tools, a chatbot, a content classifier. The output is non-deterministic by design.

These are completely different engineering problems. The TDD-everywhere crowd is mostly answering Problem A. The “evals are the new tests” crowd is mostly answering Problem B. Both are right inside their domain and wrong outside it. Conflating them is how teams burn six weeks scaffolding test infrastructure for a system whose actual failure modes are hallucination, drift, and prompt injection, none of which a unit test catches.

Why TDD Alone Doesn’t Make LLMs Smarter

For Problem A, the TDD evangelists are partially right: an LLM with a precise failing test produces dramatically better code than an LLM with a vague prose request. The test is an unambiguous done-condition, the failure is a clean feedback signal, and the loop converges fast.

But TDD-first skips the actual bottleneck. Watch a senior engineer use an LLM well and you’ll notice the test isn’t where the leverage is. It’s in the 200 tokens of context, constraints, and interface design that come before any test. Junior devs (and many AI tools) jump to code. Seniors map the problem first.

The reason is depressingly simple: an LLM with a perfect test suite and a confused spec will produce confidently wrong code that passes tests. It will hallucinate file paths, invent library functions, mis-model your domain. All behind a green CI badge. Tests don’t fix bad context. They just verify that the wrong thing works.

For Problem B, TDD doesn’t even apply cleanly. Unit tests assume binary pass/fail outcomes against deterministic inputs. An agent answering customer questions has neither. You need a different instrument entirely: golden datasets, distribution-aware metrics, drift monitoring, human review loops. That instrument is called evals, and it’s the actual TDD-analog for LLM-containing systems.

The Engineering Hierarchy That Actually Works

Strip out the religion and what you’re left with is a leverage stack:

Context  >>  Types  >>  Tests  >>  Vibe coding

Each level gives roughly 5-10x more leverage than the next, as a rule of thumb. Skipping levels is how teams ship slop.

Context (or: spec). What problem are we solving, why, for whom, with what constraints, and what are the explicit non-goals? Data shapes, invariants, failure modes you care about, latency and cost budgets. A page of spec changes the game; LLMs reward sharp specs and punish ambiguity. This is the work CTOs and tech leads have always done. The LLM era didn’t replace it, it amplified its value.

Types. A strong type system (TypeScript with strict mode, Rust, Go’s interface design, Python with Pydantic) does maybe 60% of TDD’s job for free. Types are machine-readable specs. LLMs respect them. The compiler is a free, instant test that runs on every keystroke. Compared to writing a test, defining a type is nearly free; the ROI is brutal.
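
To make “types are machine-readable specs” concrete, here is a minimal TypeScript sketch. Every name in it (Sku, parseSku, LineItem, lineTotalCents) is hypothetical, and the 8-character alphanumeric SKU format is an assumption borrowed from the worked example later in this piece. The idea is a branded type: a plain string cannot be used where a validated SKU is required, so the compiler rejects the mistake before any test runs.

```typescript
// Hypothetical example: a branded type makes "validated SKU" distinct from "any string".
type Sku = string & { readonly __brand: "Sku" };

// The only intended way to obtain a Sku is through this parser.
// Assumed format: exactly 8 alphanumeric characters.
function parseSku(raw: string): Sku | null {
  return /^[A-Za-z0-9]{8}$/.test(raw) ? (raw as Sku) : null;
}

interface LineItem {
  sku: Sku;
  qty: number;
  unitPriceCents: number; // integer cents sidestep floating-point rounding
}

function lineTotalCents(item: LineItem): number {
  return item.qty * item.unitPriceCents;
}

const sku = parseSku("AB12CD34");
if (sku !== null) {
  const item: LineItem = { sku, qty: 3, unitPriceCents: 1999 };
  console.log(lineTotalCents(item));
}

// If uncommented, the next line fails to compile: a raw string is not a Sku.
// const bad: LineItem = { sku: "AB12CD34", qty: 3, unitPriceCents: 1999 };
```

An LLM generating code against LineItem gets the same compiler feedback a human would: it has to go through parseSku, and it has to handle the null case.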

Tests. For non-trivial business logic, write the test before the implementation. Not because TDD is sacred, but because it forces you to express the contract before the LLM tries to satisfy it. The test becomes a precise prompt for the model. Property-based and integration tests beat unit-test maximalism. LLMs are clever and will write tautological unit tests if given the chance.
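
The difference between a tautological unit test and a property-based one is easier to see in code. This is a sketch under stated assumptions: applyDiscount is a hypothetical function, and the property loop is hand-rolled with a small seeded generator to stay dependency-free. In a real project you would more likely reach for a property-testing library such as fast-check.

```typescript
// Hypothetical function under test: apply a percentage discount to a price in cents.
function applyDiscount(priceCents: number, percent: number): number {
  return Math.round(priceCents * (1 - percent / 100));
}

// Tautological test: it re-derives the expected value with the same formula,
// so it would still pass if the formula itself were wrong.
function tautologicalTest(): boolean {
  const expected = Math.round(1000 * (1 - 15 / 100));
  return applyDiscount(1000, 15) === expected;
}

// Small seeded pseudo-random generator (an LCG) so any failure is reproducible.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

// Property test: asserts invariants that must hold for every input.
// Returns a description of the first counterexample, or null if all runs pass.
function checkDiscountProperties(runs: number, seed: number): string | null {
  const rng = makeRng(seed);
  for (let i = 0; i < runs; i++) {
    const price = Math.floor(rng() * 1000000); // 0 to 999,999 cents
    const pct = Math.floor(rng() * 101); // 0 to 100 percent
    const out = applyDiscount(price, pct);
    if (!Number.isInteger(out)) return `non-integer result for ${price}, ${pct}`;
    if (out < 0 || out > price) return `result out of range for ${price}, ${pct}`;
    if (pct < 100 && applyDiscount(price, pct + 1) > out) {
      return `bigger discount gave a higher price for ${price}, ${pct}`;
    }
  }
  return null;
}

console.log(tautologicalTest(), checkDiscountProperties(1000, 42));
```

The property version never restates the formula. It states what must be true of any correct answer, which is much harder for a model to satisfy by accident.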

Vibe coding. What you’re left with when you skip the first three. Sometimes appropriate (throwaway scripts, prototypes, exploratory spikes). The trap is doing it for production code and calling it engineering.

Where TDD Genuinely Earns Its Place

Once context is locked in, tests-first does change the loop in real ways:

Unambiguous done-conditions. The LLM knows when to stop.
Failures become precise prompts. “Here’s the failing assertion, fix it” is a higher-quality instruction than any prose.
Refactor safety. When an LLM rewrites a 300-line module, tests are the only thing between you and silent regression.
It defuses the “looks plausible, subtly broken” failure mode. This is the failure LLMs love most.
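
As a rough illustration of “failures become precise prompts”, the sketch below turns failing contract cases into messages you could paste back to the model. The names (ContractCase, failuresAsPrompts, normalizeSku) and the prompt wording are all hypothetical; this is one possible shape, not a standard tool.

```typescript
// Hypothetical shape for one contract test: a name, an input, and the expected output.
interface ContractCase<I, O> {
  name: string;
  input: I;
  expected: O;
}

// Runs each case against an implementation and returns one prompt-ready message
// per failure. An empty array is the done-condition.
function failuresAsPrompts<I, O>(
  impl: (input: I) => O,
  cases: ContractCase<I, O>[],
): string[] {
  const prompts: string[] = [];
  for (const c of cases) {
    let actual: string;
    try {
      actual = JSON.stringify(impl(c.input));
    } catch (err) {
      actual = `threw ${String(err)}`;
    }
    const expected = JSON.stringify(c.expected);
    if (actual !== expected) {
      prompts.push(
        `Test "${c.name}" fails.\n` +
          `Input: ${JSON.stringify(c.input)}\n` +
          `Expected: ${expected}\n` +
          `Actual: ${actual}\n` +
          `Fix the implementation; do not edit the test.`,
      );
    }
  }
  return prompts;
}

// Deliberately buggy implementation: it forgets to trim whitespace.
const normalizeSku = (raw: string): string => raw.toUpperCase();

const prompts = failuresAsPrompts(normalizeSku, [
  { name: "uppercases", input: "ab12cd34", expected: "AB12CD34" },
  { name: "trims whitespace", input: " ab12cd34 ", expected: "AB12CD34" },
]);
console.log(prompts.join("\n---\n"));
```

Only the whitespace case produces a message, and that message carries the input, the expectation, and the actual result: everything the model needs, nothing it has to guess.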

So: yes to TDD, but only after you’ve done the work TDD doesn’t do.

A Concrete Workflow

For new features in deterministic code:

1. Write the spec. Two paragraphs minimum: what, why, constraints, non-goals.
2. Sketch the interfaces. Type signatures, data shapes, dependencies. Think in contracts.
3. Write the failing tests. Or have the LLM draft them, and review them. LLMs love tautologies.
4. Generate the implementation. Now the model has everything it needs.
5. Refactor with tests as the safety net. Let the model rewrite freely; the suite is your guarantee.

For LLM-containing features:

1. Define success in measurable terms. Faithfulness, task completion, latency budget. Pick one or two. More than three and you’ve lost the plot.
2. Build a golden dataset. 30 to 100 realistic inputs with expected behaviour annotations. Hand-curated. This is the work; the rest is mechanical.
3. Run evals before iterating. Establish the baseline. Most teams skip this and “improve” things they can’t measure.
4. Add code-based assertions for deterministic failures, LLM-as-judge for subjective ones. Hamel Husain’s framing here is the canonical reference.
5. Monitor in production. For non-deterministic systems, the eval suite is incomplete by definition. Drift detection and sampled human review are non-negotiable.
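
The golden-dataset and baseline steps can be sketched in a few lines. This is a minimal illustration, not a real eval framework: GoldenCase, scoreOutputs, and isRegression are hypothetical names, the two cases and their outputs are made up, and the 2% tolerance is an arbitrary placeholder. It covers only code-based assertions. The model outputs are assumed to have been collected elsewhere, and an LLM-as-judge for subjective criteria is not shown.

```typescript
// One golden-dataset entry: a realistic input plus a code-based check on the output.
interface GoldenCase {
  id: string;
  input: string;
  check: (output: string) => boolean;
}

interface EvalReport {
  passRate: number;
  failedIds: string[];
}

// Scores already-collected outputs (keyed by case id) against the golden set.
// A missing output counts as a failure.
function scoreOutputs(
  golden: GoldenCase[],
  outputs: Record<string, string | undefined>,
): EvalReport {
  const failedIds: string[] = [];
  for (const c of golden) {
    const output = outputs[c.id];
    if (output === undefined || !c.check(output)) failedIds.push(c.id);
  }
  const passRate =
    golden.length === 0 ? 0 : (golden.length - failedIds.length) / golden.length;
  return { passRate, failedIds };
}

// Flags a regression when the candidate drops below the baseline by more than `tolerance`.
function isRegression(baseline: EvalReport, candidate: EvalReport, tolerance = 0.02): boolean {
  return candidate.passRate < baseline.passRate - tolerance;
}

// Made-up golden cases for a support chatbot.
const golden: GoldenCase[] = [
  { id: "refund-policy", input: "Can I return an opened item?", check: (o) => /30 days/.test(o) },
  { id: "no-invented-code", input: "Do you have a discount code?", check: (o) => !/[A-Z]{4,}\d{2}/.test(o) },
];

// Baseline: outputs from the current prompt. Candidate: outputs after a prompt change.
const baseline = scoreOutputs(golden, {
  "refund-policy": "Opened items can be returned within 30 days.",
  "no-invented-code": "We don't currently offer discount codes.",
});
const candidate = scoreOutputs(golden, {
  "refund-policy": "Opened items can be returned within 30 days.",
  "no-invented-code": "Sure, use SAVE20 at checkout.",
});
console.log(baseline.passRate, candidate.passRate, isRegression(baseline, candidate));
```

Without the recorded baseline, the hallucinated discount code in the candidate run is just one more output nobody looked at.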

Why Evals Are the Real Discipline

Here’s the unglamorous truth about production LLM systems in 2026: the teams that ship reliable products spend 60 to 80% of their development time on error analysis and evals. Not on agent frameworks. Not on prompt tricks. Not on the latest model release. Looking at data, building golden sets, aligning judges with human reviewers.

Husain, who has helped 30+ companies set up eval systems, puts it bluntly: most failed LLM products fail because their teams never built robust evaluation. They optimize what they can see (vector DBs, frameworks, agent topologies) and ignore what actually determines quality: does the system do the right thing on real inputs?

This is the work senior engineers were already doing under different names: error budgets, observability, regression suites, A/B testing, post-mortems. Evals are the LLM-shaped version of that discipline. They’re not exciting. They are what separates production-grade systems from demos.

A Worked Example

Suppose I’m shipping a function that extracts structured order data from a supplier email, typical e-commerce LLM work.

Wrong way (vibe + TDD):

“Write a function that extracts order info from an email. Add tests.”

The LLM cheerfully invents a schema, writes tautological tests, and you ship something that fails on every email format you didn’t think of.

Right way (hierarchy applied):

Context: emails are from 6 suppliers, schemas vary. Fields needed are
SKU, qty, price, delivery_date. Currency is always EUR. Non-goal:
handling consumer returns.

Type contract:
  type Order = {
    sku: string;       // 8-char alphanumeric
    qty: number;       // positive integer
    priceEur: number;  // 2 decimal places
    deliveryDate: Date;
    supplier: SupplierId;
  };

Eval set: 30 real emails (anonymized) with hand-labeled expected
Orders. Metric: field-level exact match. Run on every prompt change.

Code-based tests: extracted SKUs match supplier regex; qty > 0;
deliveryDate within 90 days.

Now the LLM has a real contract. Code tests catch deterministic failures. Evals catch LLM-specific ones. You can iterate on the prompt without flying blind. Roughly 30 minutes of upfront work that saves weeks of debugging in production.
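
Here is roughly what the two layers of checking could look like for this example. Treat it as a sketch with assumptions spelled out: SupplierId is left undefined above, so it is aliased to string here; “within 90 days” is read as “between the reference date and 90 days after it”; and validateOrder and fieldMatchRate are hypothetical helper names.

```typescript
type SupplierId = string; // assumption: the contract above does not define SupplierId

type Order = {
  sku: string;
  qty: number;
  priceEur: number;
  deliveryDate: Date;
  supplier: SupplierId;
};

// Code-based checks for deterministic failures. Returns a list of violations.
// Pass a regex without the `g` flag, since `test` is stateful on global regexes.
function validateOrder(order: Order, skuPattern: RegExp, today: Date): string[] {
  const problems: string[] = [];
  if (!skuPattern.test(order.sku)) problems.push("sku does not match supplier pattern");
  if (!Number.isInteger(order.qty) || order.qty <= 0) problems.push("qty must be a positive integer");
  const days = (order.deliveryDate.getTime() - today.getTime()) / 86400000;
  if (days < 0 || days > 90) problems.push("deliveryDate outside the 90-day window");
  return problems;
}

const FIELDS = ["sku", "qty", "priceEur", "deliveryDate", "supplier"] as const;

// Eval metric: share of fields that exactly match the hand-labeled expectation.
// A missing extraction counts as every field wrong.
function fieldMatchRate(expected: Order[], actual: Order[]): number {
  let matched = 0;
  let total = 0;
  expected.forEach((exp, i) => {
    const act = actual[i];
    for (const f of FIELDS) {
      total++;
      if (act === undefined) continue;
      const a = act[f];
      const e = exp[f];
      const same = a instanceof Date && e instanceof Date ? a.getTime() === e.getTime() : a === e;
      if (same) matched++;
    }
  });
  return total === 0 ? 0 : matched / total;
}

const today = new Date("2026-01-10T00:00:00Z");
const expectedOrder: Order = {
  sku: "AB12CD34",
  qty: 5,
  priceEur: 19.99,
  deliveryDate: new Date("2026-02-01T00:00:00Z"),
  supplier: "supplier-1",
};
// Simulated extraction error: the model read "5" as "50".
const extracted: Order = { ...expectedOrder, qty: 50 };

console.log(validateOrder(extracted, /^[A-Z0-9]{8}$/, today)); // no violations: 50 is a valid qty
console.log(fieldMatchRate([expectedOrder], [extracted])); // 4 of 5 fields match
```

The misread quantity passes every code-based check, because 50 is a perfectly valid quantity. Only the comparison against the labeled email catches it, which is why the eval set is not optional.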

A Curated Reading List for the AI Era

The field is drowning in content-marketing garbage. Here are ten resources actually worth your time, ranked by ROI for engineers shipping production systems.

Books

1. Chip Huyen, AI Engineering (O’Reilly, 2025). If you read one book, this. Production thinking, evals, agents, deployment tradeoffs. Engineering-first, not research-first.
2. Iusztin & Labonne, LLM Engineer’s Handbook. Code-heavy production lifecycle: RAG at scale, fine-tuning, deployment patterns. Complements Huyen rather than repeating her.
3. Biswas & Talukdar, Building Agentic AI Systems. Multi-agent architectures, tool use, memory. Directly relevant to anyone building on the agentic-web thesis.
4. Berryman & Ziegler, Prompt Engineering for LLMs. The serious prompting book. Patterns, reliability, evals, not “10 tricks” content marketing.

Online: free, read these now

5. Anthropic, “Building Effective Agents” (anthropic.com/engineering). The reference for when to use workflows vs. agents. ~30 minutes, saves months. Read before adopting LangGraph or CrewAI.
6. hamel.dev, “LLM Evals FAQ” (Husain & Shankar, January 2026). Best free eval resource on the internet. Then read the Field Guide to Rapidly Improving AI Products and Your AI Product Needs Evals.
7. applied-llms.org, “What We Learned From a Year of Building with LLMs”. Six practitioners (Yan, Husain, Shankar, Bischof, Frye, Liu), hard-won lessons in long form. The single best synthesis document on production LLM patterns.
8. DSPy, official docs + Khattab et al. paper. Programmatic prompt optimization. Becoming standard; replaces manual prompt iteration once you have evals to optimize against.

Courses worth the time

9. DeepLearning.AI short courses (free). Specifically Multi AI Agent Systems with crewAI and AI Agentic Design Patterns with AutoGen. 1 to 2 hours each, hands-on agent code.
10. Husain & Shankar, AI Evals for Engineers & PMs (Maven, live cohort). ~$2k. Worth it because evals are the discipline that separates production-grade systems from demos. If too expensive, the FAQ in #6 is roughly 70% of the content.

Suggested order of attack (4 to 6 weeks part-time)

1. Anthropic’s agents post + applied-llms.org. One weekend; sets the framing.
2. Huyen’s AI Engineering. Go deep on the evals and agents chapters; skim the rest.
3. hamel.dev FAQ, then start instrumenting your real system with real evals on a small slice.
4. LLM Engineer’s Handbook + Building Agentic AI Systems in parallel.
5. DSPy docs and a DeepLearning.AI agent course as you build.

Ignore the rest until you hit a specific gap. The field moves fast enough that anything you don’t apply within six months is wasted reading.

The Honest Version of “Test, Test, Test”

The TDD evangelists aren’t wrong that LLMs without rigour produce dangerous slop. They are wrong about where the rigour goes.

The honest version of the rule is this:

AI-assisted code should be specified, typed, contract-tested, and eval’d, in that order of leverage.

TDD is one tool in that stack, not the stack itself. Skipping the first two layers is what turns “AI-augmented engineering” into “automated slop at scale.”

Senior engineers have always known this. Specifications, interface design, type discipline, observability, post-mortems. These aren’t new. The LLM era didn’t invent them and doesn’t replace them. It just made them more valuable, because the cost of skipping them now compounds at machine speed.

Build the discipline first. Then let the model help.