Vibe Coding Is Dead. Spec-Driven AI Engineering Is the Future.

Posted by Alvar Laigna on June 5, 2026 at 10:00 AM

AI coding is no longer a toy. That is the uncomfortable part. It is also not magic. That is the part many teams still refuse to accept.

The frontier has moved from autocomplete to agentic engineering. Modern coding agents can inspect a repository, edit multiple files, run tools, execute tests, create pull requests, and keep enough context to handle tasks that would have looked impossible only a few years ago. The DeepSWE benchmark, which was designed around original long-horizon software engineering tasks, reports frontier models solving difficult repository-level problems at meaningful rates, with the leading captured result showing gpt-5.5 at 70% ±4% Pass@1 and Claude Opus variants following behind on the same benchmark.[1] That is not “toy demo” territory anymore.

But there is a second number every serious developer should care about more than the leaderboard. Princeton’s AI Agent Reliability Tracker argues that rising agent accuracy has not produced equally strong reliability. Their central finding is blunt: agents still fail unpredictably in practice, consistency remains weak, and a single accuracy score hides whether an agent is stable, robust, predictable, and safe.[2]

Capability is not reliability. A model that can solve a hard task once is not the same thing as an engineering system you can trust with production responsibility.

That distinction is where the real work starts. If we treat AI coding as a faster version of copy-paste programming, we get faster chaos. If we treat it as a new software supply chain, we can get leverage without surrendering engineering discipline.

From Prompting to a Software Supply Chain

The wrong question is: “Which AI tool writes the most code?”

The better question is: “What system of constraints, memory, review, tests, and deployment controls turns AI output into trustworthy software?”

This is why I do not think the next phase belongs to developers who merely prompt well. It belongs to teams that can build a repeatable AI-assisted engineering process. NIST’s Secure Software Development Framework frames secure software work around preparing the organization, protecting software, producing well-secured software, and responding to vulnerabilities.[3] OWASP’s LLM application risk work adds a more agent-specific warning: prompt injection, insecure output handling, supply-chain vulnerabilities, sensitive information disclosure, excessive agency, and overreliance all become real engineering risks when models gain tool access.[4]

AI-generated code is not just text. In a real workflow, it becomes database migrations, authorization rules, API endpoints, cloud resources, CI pipelines, dependency changes, infrastructure configuration, and sometimes destructive shell commands. That means AI coding is part of your supply chain. It needs the same seriousness as dependencies, container images, Terraform modules, GitHub Actions, and production credentials.

| Old mental model | Better mental model |
| --- | --- |
| AI is an autocomplete tool | AI is a semi-autonomous contributor with tool access |
| Prompt quality is the main skill | Specification, review, memory, tests, and orchestration are the main skills |
| The output is probably fine if it compiles | The output is untrusted until reviewed, tested, and constrained |
| One model should do everything | Different models should challenge each other in different roles |
| Context is whatever fits in the chat | Context is engineered through docs, specs, memory, hooks, and task boundaries |
| Vendor tools own the workflow | The repository, contracts, and infrastructure code own the workflow |

This is the line I now draw for myself: AI can propose and implement, but it should not silently decide architecture, security policy, production permissions, or deployment state.

My Current Operating Model

My preferred workflow is intentionally multi-model and task-driven. I do not want one model to be the architect, developer, reviewer, tester, and release manager at the same time. That creates monoculture. It also creates a false sense of certainty, because the same failure mode can survive every step.

For serious work, I like to split the process roughly like this.

| Stage | Primary tool/model role | Output I want |
| --- | --- | --- |
| Initial specification | Claude Code | A concrete feature spec, repository map, affected files, acceptance criteria, risks, and a first implementation plan |
| Spec review | Gemini / Google AI tooling | A second-opinion critique: missing constraints, architecture alternatives, edge cases, simpler designs, and possible overengineering |
| Security/code review | GPT-5.5 / Codex-style reviewer | High-severity issue detection, auth flaws, data-leak paths, dependency risks, race conditions, and review comments |
| Implementation orchestration | Claude Code with strong tool use, agents, skills, hooks, and scoped memory | Small PR-sized changes, tests, migrations, docs, and repeatable command output |
| Task tracking | GitHub Issues / Jira | Audit trail, acceptance criteria, linked PRs, and human-readable engineering intent |
| Deployment discipline | Terraform/OpenTofu/Pulumi, Ansible, containers, GitHub Actions, Kubernetes where justified | Reproducible state, rollback options, and less vendor lock-in |

Claude Code is strong as an implementation and repository-navigation environment because it can read code, edit files, run commands, use project memory, work with hooks, use custom agents, and integrate with development workflows.[5] Gemini is useful as a critic because I often want a broad-context reviewer that is not emotionally attached to the first plan and can challenge the architecture before code exists.[6] GPT-5.5 or a Codex-style reviewer is useful as a separate security and code-review pass, especially when the instruction is not “make this prettier” but “find serious issues I would regret shipping.” OpenAI’s Codex GitHub integration, for example, can be invoked for PR review, reads repository instructions such as AGENTS.md, and focuses review comments on serious issues by default.[7]

This is not about brand loyalty. It is about separation of duties. In human teams, we do not ask the same developer to write code, approve their own security model, rubber-stamp the PR, and deploy straight to production without telemetry. AI should not get a lower bar just because it is fast.

Start With Tasks, Not Vibes

The biggest improvement most teams can make is boring: write better tasks.

Agentic coding works best when the work is bounded. GitHub’s own guidance for assigning work to coding agents says issues should be clear, well-scoped, include acceptance criteria, and include relevant files or implementation notes where possible.[8] It also warns that agents are a poor fit for ambiguous, production-critical, sensitive, security-heavy, or deep-domain tasks unless there is strong human oversight.[8]

This matches my experience. If the task is “improve onboarding,” the agent will hallucinate product intent. If the task is “implement invite-only team onboarding using the existing OrganizationMember model, add a pending invite state, enforce tenant isolation, update the OpenAPI contract, and add negative authorization tests,” the agent has rails.

A good AI-ready issue should include the following information in prose, not just as a checklist.

| Field | Why it matters |
| --- | --- |
| Problem statement | Prevents the agent from optimizing a different problem |
| Non-goals | Reduces scope creep and token waste |
| Acceptance criteria | Gives the agent and reviewer a shared definition of done |
| Security notes | Forces early thinking about auth, data boundaries, secrets, and abuse cases |
| Likely files | Reduces repository wandering and irrelevant context loading |
| Contracts | Keeps API, database, and event boundaries explicit |
| Test expectations | Prevents “it compiles” from becoming the definition of quality |
| Rollback or migration notes | Forces operational thinking before deployment |

This is not bureaucracy. This is how we wrap probabilistic tool use in a predictable, checkable engineering flow.
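The fields above can be checked mechanically before an issue is handed to an agent. The following is a minimal sketch, assuming issues are written with one `## ` heading per field. The heading names and the `missing_sections` helper are hypothetical illustrations, not part of any GitHub or Jira API.

```python
import re

# Hypothetical section names, mirroring the fields listed above.
REQUIRED_SECTIONS = [
    "Problem statement",
    "Non-goals",
    "Acceptance criteria",
    "Security notes",
    "Likely files",
    "Contracts",
    "Test expectations",
    "Rollback or migration notes",
]


def missing_sections(issue_body: str) -> list[str]:
    """Return required sections that are absent or have no prose under them."""
    sections: dict[str, str] = {}
    current = None
    for line in issue_body.splitlines():
        # Assumed format: each field starts with a "## Heading" line.
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1).lower()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return [
        name
        for name in REQUIRED_SECTIONS
        if not sections.get(name.lower(), "").strip()
    ]
```

A check like this only proves the sections exist and are non-empty. Whether the acceptance criteria are any good still needs a human reader.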

The Repository Must Become AI-Readable

A codebase that is hard for humans to understand is almost always worse for agents. The model may look confident, but it is still navigating your architecture through text, file names, local patterns, tests, and instructions. If those are missing, stale, or contradictory, the agent will invent glue.

The practical answer is to make the repository self-describing.

repo/
  AGENTS.md
  CLAUDE.md
  GEMINI.md
  .github/
    copilot-instructions.md
    workflows/
  docs/
    architecture.md
    security.md
    ai-workflow.md
    decisions/
      ADR-0001-stack.md
  specs/
    feature-name.md
  api/
    openapi.yaml
  db/
    migrations/
    policies/
      rls-tests.md
  tests/
    contract/
    e2e/
    security/
  infra/
    terraform/ or opentofu/
    ansible/
  .claude/
    skills/
    agents/
    hooks/

The exact folder names matter less than the principle: intent should live in the repository, not only in somebody’s head or in yesterday’s chat window.

Anthropic’s Claude Code documentation describes CLAUDE.md as a place for project instructions, style conventions, architecture notes, and recurring commands. It also notes that Claude can import other files, which makes it practical to reference a shared AGENTS.md instead of duplicating instructions across tools.[9] GitHub Copilot’s coding agent documentation similarly supports repository custom instructions, including AGENTS.md, CLAUDE.md, and GEMINI.md patterns.[10]

The important constraint is that these instruction files should be short, concrete, and maintained. A 700-line graveyard of old preferences becomes prompt pollution. A concise file that states architecture boundaries, commands, test strategy, security expectations, and “never do this” rules becomes leverage.
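Keeping instruction files short is easier when something complains as they grow. A minimal sketch of such a check follows. The file list matches the layout shown earlier, while the 150-line budget and the `audit_instruction_files` name are arbitrary assumptions to tune per project.

```python
from pathlib import Path

# Assumed file names and an arbitrary line budget; adjust both per project.
INSTRUCTION_FILES = ["AGENTS.md", "CLAUDE.md", "GEMINI.md"]
MAX_LINES = 150


def audit_instruction_files(repo_root: Path, max_lines: int = MAX_LINES) -> list[str]:
    """Return warnings for missing or oversized AI instruction files."""
    warnings = []
    for name in INSTRUCTION_FILES:
        path = repo_root / name
        if not path.is_file():
            warnings.append(f"{name}: missing")
            continue
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > max_lines:
            warnings.append(f"{name}: {line_count} lines exceeds budget of {max_lines}")
    return warnings
```

Run in CI, a non-empty result could fail the build or simply post a comment. Line count is a crude proxy for staleness, but it is cheap and model-independent.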

Here is the kind of content I want in an AI-facing project file.

| Instruction category | Example content |
| --- | --- |
| Architecture boundaries | “API routes must call service layer methods; do not query the database directly from route handlers.” |
| Security invariants | “Every tenant-scoped query must include organization_id; add negative tests for cross-tenant access.” |
| Command map | “Use pnpm test:unit, pnpm test:contract, pnpm lint, and pnpm typecheck before proposing a PR.” |
| Stack constraints | “Do not introduce a new queue, ORM, cloud service, or auth provider without an ADR.” |
| Migration rules | “All schema changes require reversible migrations and a compatibility note.” |
| Review expectations | “Explain risky changes, changed auth behavior, and any generated code that was not tested.” |

This is where many AI coding workflows quietly fail. Developers try to fix hallucination with more prompting, when the deeper problem is that the project itself has no stable operating manual.

Hooks Are the New Guardrails

If an agent can run shell commands, edit files, call APIs, and use external tools, instructions are not enough. You need enforcement.

Claude Code hooks can run at lifecycle events such as session start, before or after tool use, on file changes, and before compaction.[11] More importantly, Anthropic’s documentation explicitly warns that instructions are context for the model, while blocking behavior should be implemented with hooks.[9] That distinction matters. “Please do not run destructive commands” is a request. A PreToolUse hook that blocks dangerous commands is a control.

In practice, I want hooks around at least four categories of behavior.

| Hook target | Practical control |
| --- | --- |
| Dangerous shell commands | Deny or require approval for rm -rf, direct production commands, credential export, force pushes, destructive database operations |
| Secrets | Scan written files for API keys, tokens, private keys, .env leakage, and accidental log exposure |
| Dependency changes | Flag lockfile changes, new packages, lifecycle scripts, native modules, and suspicious dependency sources |
| Security-sensitive files | Require extra review for auth middleware, RLS policies, Firebase rules, IAM, Terraform state backends, CI secrets, and payment code |

This is also where token efficiency improves. A hook can run the actual type checker instead of asking the model to infer type safety from memory. A hook can run tests. A hook can block a bad command before it becomes a production incident. We should let deterministic tools do deterministic work.
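To make the "request versus control" distinction concrete, here is a minimal sketch of the decision logic a dangerous-command hook might use. It assumes the hook event is a JSON object carrying `tool_name` and `tool_input.command`, and that exit code 2 signals "block", which is my reading of the Claude Code hooks documentation and should be verified against the current docs. The deny patterns and the `decide` function are illustrative assumptions, not a complete policy.

```python
import re

# Illustrative deny-list only. Each entry is (regex, human-readable label).
DENY_PATTERNS = [
    (r"\brm\s+-\w*(rf|fr)\b", "recursive force delete"),
    (r"\bgit\s+push\b.*\s(--force|-f)\b", "force push"),
    (r"\bdrop\s+(table|database)\b", "destructive database operation"),
    (r"\bcat\s+\S*\.env\b", "reading a .env file"),
]

ALLOW = 0
BLOCK = 2  # assumed to be the "blocking" exit code for a PreToolUse hook


def decide(event: dict) -> tuple[int, str]:
    """Map a hook event to (exit_code, message) without running anything."""
    if event.get("tool_name") != "Bash":
        return ALLOW, ""
    command = event.get("tool_input", {}).get("command", "")
    for pattern, label in DENY_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return BLOCK, f"Blocked ({label}): {command}"
    return ALLOW, ""
```

A real hook script would read the event with `json.load(sys.stdin)`, call `decide`, print the message to stderr, and exit with the returned code. Regex deny-lists are easy to bypass (for example `rm -r -f`), so this belongs alongside least-privilege credentials and sandboxing, not in place of them.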

Security: Generated Code Is Untrusted Code

AI-generated code should enter the repository with the same suspicion as code copied from a random blog post. That does not mean it is bad. It means it has no trust until it earns trust.

The main security risks are not exotic. They are the same boring failures that already hurt software teams, accelerated by a machine that can produce them faster.

| Risk | How it appears in AI-assisted development | Guardrail |
| --- | --- | --- |
| Broken authorization | The agent adds an endpoint but forgets tenant boundaries, ownership checks, role checks, or row-level policy coverage | Negative auth tests, policy tests, route middleware tests, human review |
| Weak database rules | Supabase RLS, Firebase rules, or SQL policies are too broad, happy-path only, or inconsistent with app-layer auth | Dedicated policy tests, deny-by-default review, cross-user/cross-tenant test cases |
| Secret exposure | The agent logs tokens, commits .env, hardcodes API keys, or prints secrets during debugging | Secret scanning, vault-backed envs, hooks, CI checks, no secrets in prompts |
| Supply-chain risk | The agent installs packages casually, accepts typosquatted dependencies, or upgrades lockfiles without review | Lockfile diff review, SBOM, vulnerability scan, package provenance checks |
| Prompt injection | Untrusted docs, issues, websites, or dependency files contain instructions that manipulate the agent | Treat external text as data, restrict tool permissions, review commands before execution |
| Excessive agency | The agent has permission to deploy, delete, charge, email, migrate, or mutate production state without approval | Least privilege, approval gates, sandbox-first execution, separate production credentials |
| Insecure output handling | Model-generated strings flow into shell, SQL, HTML, eval-like APIs, or workflow files without escaping | Static analysis, code review, parameterized APIs, sandbox tests |

OWASP’s LLM application risks are directly relevant here because AI coding agents combine model output, tool access, plugins, data retrieval, and sometimes autonomous action.[4] NIST’s SSDF is also relevant because it treats secure development as a lifecycle, not a final scan.[3]
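For the supply-chain risk in particular, a reviewer needs a short summary of what an agent changed in the dependency manifest. The sketch below assumes a Node-style package.json and compares two versions of it. The `diff_manifest` helper is hypothetical, and it only inspects the project's own manifest; checking scripts inside the dependencies themselves would need the installed packages or registry metadata.

```python
import json

# npm lifecycle script names that run automatically around installation.
LIFECYCLE_SCRIPTS = ("preinstall", "install", "postinstall", "prepare")


def _all_dependencies(manifest: dict) -> dict:
    """Merge runtime and dev dependencies into one name -> version map."""
    return {**manifest.get("dependencies", {}), **manifest.get("devDependencies", {})}


def diff_manifest(old_manifest: str, new_manifest: str) -> dict[str, list[str]]:
    """Summarize added packages, version changes, and new or changed lifecycle scripts."""
    old, new = json.loads(old_manifest), json.loads(new_manifest)
    old_deps, new_deps = _all_dependencies(old), _all_dependencies(new)
    old_scripts, new_scripts = old.get("scripts", {}), new.get("scripts", {})
    return {
        "added": sorted(set(new_deps) - set(old_deps)),
        "changed": sorted(
            name for name in set(new_deps) & set(old_deps)
            if new_deps[name] != old_deps[name]
        ),
        "lifecycle_scripts": sorted(
            name for name in LIFECYCLE_SCRIPTS
            if name in new_scripts and new_scripts[name] != old_scripts.get(name)
        ),
    }
```

Any non-empty list is a prompt for human review, not a verdict. A typosquatted name such as `expresss` shows up under `added`, where a reviewer has a chance to notice it.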

AI code review helps, but it is not enough. A second model can catch issues the first model missed, and Codex-style PR review can focus attention on serious findings.[7] But if the test suite has no negative authorization tests, if your RLS policies are never exercised, if secrets are available in the same environment where an agent can run arbitrary shell commands, and if deployment approval is just a green button, then AI review is theater.

The standard I want is simple: AI may accelerate implementation, but the security boundary must be deterministic. Auth rules, policy tests, CI checks, deployment gates, infrastructure state, and secrets handling must not depend on the model “remembering to be careful.”
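Secrets handling is a good example of a check that should not rely on the model. The following is a minimal sketch of a line-based scanner that could run in a hook or in CI. The three patterns are illustrative assumptions and far from exhaustive; a dedicated secret scanner will cover many more formats.

```python
import re

# Illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded_credential": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"][^'\"]{12,}['\"]"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings
```

The function reports line numbers and rule names rather than the matched text, so the scanner's own output does not become another place where the secret leaks.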

Memory Is Useful Only If It Is Curated

Every serious AI coding setup eventually runs into the memory problem. The model forgets something important, repeats a mistake, rediscovers a command, ignores a style convention, or loses project context after compaction. The naive fix is to throw more text into CLAUDE.md, AGENTS.md, or some global memory file.

That works for about a week. Then the memory turns into mud.

I prefer to think about memory in layers.

| Memory layer | Purpose | Risk |
| --- | --- | --- |
| Personal/global knowhow | My general preferences, engineering taste, communication style, recurring tool choices | Can become too broad and bias every project incorrectly |
| Project memory | Architecture, commands, conventions, known traps, deployment model, testing strategy | Can become stale when the codebase changes |
| Local/session memory | Temporary notes, current branch context, work-in-progress constraints | Can leak into long-term memory when it should expire |
| Dreaming/consolidation | Periodic review of sessions to extract durable lessons and remove noise | Can hallucinate patterns or preserve bad habits if not audited |

Anthropic’s Claude Code memory docs distinguish project instructions and auto memory, and Claude’s managed-agent Dreams feature goes further by asynchronously reviewing past sessions and an existing memory store to produce a reorganized memory store.[9][12] The Dreams documentation says the process can merge duplicates, replace stale or contradicted entries, and output a separate memory store that can be reviewed or discarded rather than modifying the original memory in place.[12]

That last detail is important. Memory should be reviewable. It should not silently become law.

A practical memory hygiene routine looks like this: project instructions stay concise; session notes stay local unless promoted; recurring mistakes become tests or hooks, not just prose; architecture decisions become ADRs; and periodic “dreaming” or consolidation outputs are reviewed before they become durable project memory.

If a model repeatedly forgets to include organization_id in tenant-scoped queries, the fix is not only “remember to include organization_id.” The real fix is a query helper, a linter rule, a policy test, and a project instruction explaining the invariant.
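A query helper of the kind mentioned above might look like the following minimal sketch. It uses Python's built-in `sqlite3` purely for illustration; the table names, the column list, and the `select_for_tenant` function are hypothetical, and a real project would adapt the idea to its own database layer.

```python
import sqlite3

# Hypothetical allow-list of tenant-scoped tables.
# Table names cannot be bound as SQL parameters, so they are validated here.
TENANT_TABLES = {"projects", "invites"}


def select_for_tenant(
    conn: sqlite3.Connection, table: str, organization_id: int
) -> list[tuple]:
    """Run a SELECT that is always filtered by organization_id."""
    if table not in TENANT_TABLES:
        raise ValueError(f"{table!r} is not a registered tenant-scoped table")
    # The table name comes from the allow-list; the tenant id is a bound parameter.
    query = f"SELECT id, name FROM {table} WHERE organization_id = ? ORDER BY id"
    return conn.execute(query, (organization_id,)).fetchall()
```

The point of the helper is that forgetting the filter is no longer possible at the call site. A linter rule can then flag raw `conn.execute` calls against tenant tables, which turns the prose instruction into something a machine enforces.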

Good Stack Selection Beats Clever Prompting

AI makes it easy to generate complexity. That is not the same as engineering progress.

A mature AI-assisted stack should be boring where boring helps. Prefer explicit contracts, ordinary databases, reproducible builds, infrastructure as code, source-controlled migrations, and clear runtime boundaries. Use Terraform, OpenTofu, Pulumi, Ansible, containers, Kubernetes, or managed services where they make sense, but keep the state understandable. Do not let an agent stitch together six proprietary services because it found examples in its training data.

Interoperability is a security feature as much as a business feature. If your app logic, auth model, database rules, background jobs, file storage, deployment, and observability are trapped inside one platform’s opaque runtime, you have made future review harder. You have also made incident response harder.

| Design choice | AI-friendly reason | Security/interoperability reason |
| --- | --- | --- |
| OpenAPI / typed contracts | Agents can implement against explicit boundaries | Reviewers can detect breaking changes and auth gaps |
| Postgres migrations | Agents can reason over schema history | Data model is portable and auditable |
| IaC | Agents can propose diffs instead of clicking consoles | Infrastructure state is reproducible and reviewable |
| Containers / Nix-like reproducible dev envs | Agents run consistent commands | Builds are less dependent on one laptop or hidden cloud state |
| ADRs | Agents inherit decision context | Teams can challenge and reverse decisions deliberately |
| Minimal service count | Less context and fewer failure modes | Smaller attack surface and easier migration path |

The rule is not “never use managed services.” That would be naive. The rule is: use managed services through explicit contracts, source-controlled configuration, exportable data, and a migration story.

Minimal Testing, Maximum Signal

AI can generate thousands of tests. Most of them will be noise if the specification is weak.

I want fewer tests that hit the real risk. For AI-assisted work, the minimal efficient test set usually includes type checking, linting, unit tests around pure logic, contract tests for API boundaries, migration tests for schema changes, and a small number of end-to-end tests for the highest-value flows. For security-sensitive features, add negative authorization tests and policy tests.

| Test type | What it catches | Why it matters with AI |
| --- | --- | --- |
| Typecheck/lint | Mechanical mistakes, wrong imports, dead assumptions | Deterministic feedback beats model self-assessment |
| Unit tests | Pure logic regressions | Good for generated helper functions and edge cases |
| Contract tests | API/schema mismatch | Stops models from changing implicit interfaces silently |
| Migration tests | Broken schema evolution | Prevents “works on empty DB” illusions |
| Auth/RLS tests | Cross-user, cross-tenant, role escalation bugs | The most important generated-code failure class in many SaaS apps |
| E2E smoke tests | Broken core flows | Gives fast confidence without building a slow test cathedral |
| Security regression tests | Previously found vulnerabilities | Converts AI mistakes into permanent guardrails |

The goal is not test volume. The goal is model-independent evidence.
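A negative authorization test is the clearest example of such evidence: it asserts that something is refused, not that something works. The sketch below uses a hypothetical in-memory `DOCUMENTS` store and `get_document` function in place of a real service, so the names and data are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    organization_id: int


@dataclass(frozen=True)
class Document:
    id: int
    organization_id: int
    title: str


# Hypothetical in-memory store standing in for a real database.
DOCUMENTS = {
    10: Document(id=10, organization_id=1, title="Org 1 roadmap"),
    20: Document(id=20, organization_id=2, title="Org 2 payroll"),
}


def get_document(user: User, document_id: int) -> Document:
    """Return a document only if it belongs to the caller's organization."""
    document = DOCUMENTS.get(document_id)
    # Same error for "missing" and "other tenant" so existence is not leaked.
    if document is None or document.organization_id != user.organization_id:
        raise PermissionError("document not found or not accessible")
    return document


def test_same_tenant_can_read() -> None:
    assert get_document(User(id=1, organization_id=1), 10).title == "Org 1 roadmap"


def test_cross_tenant_read_is_denied() -> None:
    # The negative case: a valid user from org 1 asks for an org 2 document.
    try:
        get_document(User(id=1, organization_id=1), 20)
    except PermissionError:
        return
    raise AssertionError("cross-tenant read was allowed")
```

The happy-path test alone would pass even if the organization check were deleted. Only the second test fails when tenant isolation breaks, which is why generated test suites that lack it give false comfort.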

When an agent says “all tests pass,” I want command output. When it says “authorization is enforced,” I want negative tests. When it says “the migration is safe,” I want rollback notes and compatibility reasoning. When it says “this dependency is fine,” I want to know why the package changed, who publishes it, and what scripts it runs.
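Asking for command output is easy to automate. The following is a minimal sketch of a wrapper that runs a check and keeps the exit code and raw output, so a claim like "all tests pass" arrives with evidence attached. The `run_check` and `format_evidence` names are hypothetical, and the commands passed in would be whatever the project's command map defines.

```python
import shlex
import subprocess


def run_check(name: str, argv: list[str]) -> dict[str, object]:
    """Run one check and keep its exit code and raw output as evidence."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return {
        "name": name,
        "command": shlex.join(argv),
        "exit_code": completed.returncode,
        "output": (completed.stdout + completed.stderr).strip(),
    }


def format_evidence(results: list[dict[str, object]]) -> str:
    """Render results as plain text that can be pasted into a PR description."""
    lines = []
    for result in results:
        status = "PASS" if result["exit_code"] == 0 else "FAIL"
        lines.append(
            f"[{status}] {result['name']}: {result['command']} (exit {result['exit_code']})"
        )
        if result["output"]:
            lines.append(str(result["output"]))
    return "\n".join(lines)
```

For example, `run_check("unit", ["pnpm", "test:unit"])` would record the real result of the unit test command named earlier in this article, assuming pnpm is installed. The status comes from the process exit code, not from the agent's summary of it.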

Token Efficiency Is an Architecture Problem

Developers often talk about token usage as if it is only a cost problem. It is also a quality problem.

Large contexts can help, but unstructured context becomes noise. If the model has to read half the repository to understand a single task, the architecture is either too implicit, too coupled, or poorly documented. Good project structure reduces token use because the agent can retrieve the right files, commands, and invariants quickly.

The highest-leverage token optimizations are not clever prompt tricks. They are engineering practices.

| Practice | Token impact | Quality impact |
| --- | --- | --- |
| Small issues and PRs | Less irrelevant context | Easier review and rollback |
| Repository maps | Faster file targeting | Fewer hallucinated paths and duplicated logic |
| Concise instruction files | Less repeated explanation | More consistent behavior across sessions |
| Specs and ADRs | Less re-derivation | Better architectural continuity |
| Hooks and CI | Less model reasoning about deterministic facts | More reliable feedback |
| Memory curation | Less stale context | Better long-term consistency |
| Contract-first APIs | Less ambiguous implementation space | Fewer integration bugs |

The paradox is that the best way to use powerful models is to make them think less about things deterministic systems can already know.

The SPEC-SEC-RUN Loop

The practical model I keep coming back to is SPEC-SEC-RUN.

| Layer | Meaning | Concrete artifacts |
| --- | --- | --- |
| SPEC | Make intent explicit before code generation | GitHub/Jira issue, acceptance criteria, ADR, OpenAPI contract, DB migration sketch, threat notes |
| SEC | Treat generated code and agent actions as untrusted until reviewed | RLS/auth tests, secrets scanning, dependency review, SAST, code review by a second model, human approval |
| RUN | Make implementation reproducible and portable | IaC, containers, migrations, CI, hooks, minimal E2E tests, rollback notes |

This loop is intentionally simple. It is not a grand methodology. It is a reminder that AI-assisted engineering still has the same three hard parts: knowing what to build, proving it is safe enough, and making it run reliably outside the demo.

The Human Still Owns the Work

There is one final point I care about deeply. AI systems are tools. Useful tools. Sometimes extremely useful tools. I like using them. I love what they unlock when they are placed inside a serious engineering workflow.

But they are not the author of my architecture. They are not responsible for my production system. They are not accountable to my users. They do not carry the incident pager, sign the contract, protect the customer data, or explain the breach.

If I publish an article, the author is Alvar Laigna. If I ship code, the responsible person or team is the human team that chose to ship it. Manus, Claude, Gemini, GPT, Codex, Lovable, Replit, Firebase Studio, or any other AI service is a tool used under human direction.

That is not a philosophical detail. It is an engineering requirement.

The future of software development will not be won by people who let AI generate the most code. It will be won by developers who can turn AI into a disciplined, reviewable, secure, interoperable, and human-owned engineering system.

Vibe coding was the prototype. Spec-driven AI engineering is the production version.

References

Footnotes

  1. DeepSWE — Measuring frontier coding agents on original, long-horizon engineering tasks

  2. Princeton HAL — AI Agent Reliability Tracker

  3. NIST Secure Software Development Framework, SP 800-218

  4. OWASP Top 10 for Large Language Model Applications

  5. Anthropic Claude Code Overview

  6. Google Gemini API Code Execution

  7. OpenAI Codex GitHub Integration and Code Review

  8. GitHub Copilot Coding Agent Best Practices

  9. Anthropic Claude Code Memory

  10. GitHub Copilot Coding Agent Custom Instructions

  11. Anthropic Claude Code Hooks

  12. Anthropic Managed Agents Dreams