Vibe Coding Is Dead. Spec-Driven AI Engineering Is the Future.

AI agents can now write serious code, but mature engineering teams still need specifications, task boundaries, security review, tests, project memory, hooks, and human ownership.

AI coding is no longer a toy. That is the uncomfortable part. It is also not magic. That is the part many teams still refuse to accept.

The frontier has moved from autocomplete to agentic engineering. Modern coding agents can inspect a repository, edit multiple files, run tools, execute tests, create pull requests, and keep enough context to handle tasks that would have looked impossible only a few years ago. The DeepSWE benchmark, designed around original long-horizon software engineering tasks, reports frontier models solving difficult repository-level problems at meaningful rates. The leading result captured there shows gpt-5.5 at 70% ±4% Pass@1, with Claude Opus variants behind it on the same benchmark.¹ That is no longer “toy demo” territory.

But there is a second number every serious developer should care about more than the leaderboard. Princeton’s AI Agent Reliability Tracker argues that rising agent accuracy has not produced equally strong reliability. Their central finding is blunt: agents still fail unpredictably in practice, consistency remains weak, and a single accuracy score hides whether an agent is stable, robust, predictable, and safe.²

Capability is not reliability. A model that can solve a hard task once is not the same thing as an engineering system you can trust with production responsibility.

That distinction is where the real work starts. If we treat AI coding as a faster version of copy-paste programming, we get faster chaos. If we treat it as a new software supply chain, we can get leverage without surrendering engineering discipline.

From Prompting to a Software Supply Chain

The wrong question is: “Which AI tool writes the most code?”

The better question is: “What system of constraints, memory, review, tests, and deployment controls turns AI output into trustworthy software?”

This is why I do not think the next phase belongs to developers who merely prompt well. It belongs to teams that can build a repeatable AI-assisted engineering process. NIST’s Secure Software Development Framework frames secure software work around preparing the organization, protecting software, producing well-secured software, and responding to vulnerabilities.³ OWASP’s LLM application risk work adds a more agent-specific warning: prompt injection, insecure output handling, supply-chain vulnerabilities, sensitive information disclosure, excessive agency, and overreliance all become real engineering risks when models gain tool access.⁴

AI-generated code is not just text. In a real workflow, it becomes database migrations, authorization rules, API endpoints, cloud resources, CI pipelines, dependency changes, infrastructure configuration, and sometimes destructive shell commands. That means AI coding is part of your supply chain. It needs the same seriousness as dependencies, container images, Terraform modules, GitHub Actions, and production credentials.

Old mental model	Better mental model
AI is an autocomplete tool	AI is a semi-autonomous contributor with tool access
Prompt quality is the main skill	Specification, review, memory, tests, and orchestration are the main skills
The output is probably fine if it compiles	The output is untrusted until reviewed, tested, and constrained
One model should do everything	Different models should challenge each other in different roles
Context is whatever fits in the chat	Context is engineered through docs, specs, memory, hooks, and task boundaries
Vendor tools own the workflow	The repository, contracts, and infrastructure code own the workflow

This is the line I now draw for myself: AI can propose and implement, but it should not silently decide architecture, security policy, production permissions, or deployment state.

My Current Operating Model

My preferred workflow is intentionally multi-model and task-driven. I do not want one model to be the architect, developer, reviewer, tester, and release manager at the same time. That creates monoculture. It also creates a false sense of certainty, because the same failure mode can survive every step.

For serious work, I like to split the process roughly like this.

Stage	Primary tool/model role	Output I want
Initial specification	Claude Code	A concrete feature spec, repository map, affected files, acceptance criteria, risks, and a first implementation plan
Spec review	Gemini / Google AI tooling	A second-opinion critique: missing constraints, architecture alternatives, edge cases, simpler designs, and possible overengineering
Security/code review	GPT-5.5 / Codex-style reviewer	High-severity issue detection, auth flaws, data-leak paths, dependency risks, race conditions, and review comments
Implementation orchestration	Claude Code with strong tool use, agents, skills, hooks, and scoped memory	Small PR-sized changes, tests, migrations, docs, and repeatable command output
Task tracking	GitHub Issues / Jira	Audit trail, acceptance criteria, linked PRs, and human-readable engineering intent
Deployment discipline	Terraform/OpenTofu/Pulumi, Ansible, containers, GitHub Actions, Kubernetes where justified	Reproducible state, rollback options, and less vendor lock-in

Claude Code is strong as an implementation and repository-navigation environment because it can read code, edit files, run commands, use project memory, work with hooks, use custom agents, and integrate with development workflows.⁵ Gemini is useful as a critic because I often want a broad-context reviewer that is not emotionally attached to the first plan and can challenge the architecture before code exists.⁶ GPT-5.5 or a Codex-style reviewer is useful as a separate security and code-review pass, especially when the instruction is not “make this prettier” but “find serious issues I would regret shipping.” OpenAI’s Codex GitHub integration, for example, can be invoked for PR review, reads repository instructions such as AGENTS.md, and focuses review comments on serious issues by default.⁷

This is not about brand loyalty. It is about separation of duties. In human teams, we do not ask the same developer to write code, approve their own security model, rubber-stamp the PR, and deploy straight to production without telemetry. AI should not get a lower bar just because it is fast.

Start With Tasks, Not Vibes

The biggest improvement most teams can make is boring: write better tasks.

Agentic coding works best when the work is bounded. GitHub’s own guidance for assigning work to coding agents says issues should be clear, well-scoped, include acceptance criteria, and include relevant files or implementation notes where possible.⁸ It also warns that agents are a poor fit for ambiguous, production-critical, sensitive, security-heavy, or deep-domain tasks unless there is strong human oversight.⁸

This matches my experience. If the task is “improve onboarding,” the agent will hallucinate product intent. If the task is “implement invite-only team onboarding using the existing OrganizationMember model, add a pending invite state, enforce tenant isolation, update the OpenAPI contract, and add negative authorization tests,” the agent has rails.

A good AI-ready issue should include the following information in prose, not just as a checklist.

Field	Why it matters
Problem statement	Prevents the agent from optimizing a different problem
Non-goals	Reduces scope creep and token waste
Acceptance criteria	Gives the agent and reviewer a shared definition of done
Security notes	Forces early thinking about auth, data boundaries, secrets, and abuse cases
Likely files	Reduces repository wandering and irrelevant context loading
Contracts	Keeps API, database, and event boundaries explicit
Test expectations	Prevents “it compiles” from becoming the definition of quality
Rollback or migration notes	Forces operational thinking before deployment

This is not bureaucracy. This is how we turn probabilistic tool use into deterministic engineering flow.
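One way to keep issue quality from depending on goodwill is a small check that runs before an issue is handed to an agent. The sketch below is only an illustration: the section names mirror the field table above, and the plain-text issue layout (a heading line followed by prose) is an assumption, not a GitHub or Jira convention.

```python
# Section names taken from the field table; the issue layout is assumed.
REQUIRED_SECTIONS = [
    "Problem statement",
    "Non-goals",
    "Acceptance criteria",
    "Security notes",
    "Likely files",
    "Contracts",
    "Test expectations",
    "Rollback or migration notes",
]


def missing_sections(issue_body: str) -> list[str]:
    """Return required sections that are absent or have no prose under them."""
    wanted = {name.lower(): name for name in REQUIRED_SECTIONS}
    content: dict[str, list[str]] = {}
    current = None
    for raw in issue_body.splitlines():
        line = raw.strip()
        # Accept "Security notes", "## Security notes", and "Security notes:".
        heading = line.lstrip("#").strip().rstrip(":").lower()
        if heading in wanted:
            current = wanted[heading]
            content.setdefault(current, [])
        elif line and current is not None:
            content[current].append(line)
    return [name for name in REQUIRED_SECTIONS if not content.get(name)]
```

An empty return value means every field has at least one line of prose. It says nothing about whether that prose is any good, which remains a human judgment.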

The Repository Must Become AI-Readable

A codebase that is hard for humans to understand is almost always worse for agents. The model may look confident, but it is still navigating your architecture through text, file names, local patterns, tests, and instructions. If those are missing, stale, or contradictory, the agent will invent glue.

The practical answer is to make the repository self-describing.

repo/
  AGENTS.md
  CLAUDE.md
  GEMINI.md
  .github/
    copilot-instructions.md
    workflows/
  docs/
    architecture.md
    security.md
    ai-workflow.md
    decisions/
      ADR-0001-stack.md
  specs/
    feature-name.md
  api/
    openapi.yaml
  db/
    migrations/
    policies/
      rls-tests.md
  tests/
    contract/
    e2e/
    security/
  infra/
    terraform/ or opentofu/
    ansible/
  .claude/
    skills/
    agents/
    hooks/

The exact folder names matter less than the principle: intent should live in the repository, not only in somebody’s head or in yesterday’s chat window.

Anthropic’s Claude Code documentation describes CLAUDE.md as a place for project instructions, style conventions, architecture notes, and recurring commands. It also notes that Claude can import other files, which makes it practical to reference a shared AGENTS.md instead of duplicating instructions across tools.⁹ According to its documentation, GitHub Copilot’s coding agent similarly supports repository custom instructions, including AGENTS.md, CLAUDE.md, and GEMINI.md patterns.¹⁰

The important constraint is that these instruction files should be short, concrete, and maintained. A 700-line graveyard of old preferences becomes prompt pollution. A concise file that states architecture boundaries, commands, test strategy, security expectations, and “never do this” rules becomes leverage.
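A cheap way to keep these files from growing into that graveyard is to give them a line budget and check it in CI. This is a minimal sketch: the file list and the 200-line default are arbitrary assumptions, and the function name is hypothetical.

```python
from pathlib import Path

# Assumed file names and an arbitrary budget; tune both per repository.
INSTRUCTION_FILES = ("AGENTS.md", "CLAUDE.md", "GEMINI.md")
MAX_LINES = 200


def oversized_instruction_files(repo_root: str, max_lines: int = MAX_LINES) -> dict[str, int]:
    """Map each instruction file over the budget to its non-blank line count."""
    over = {}
    for name in INSTRUCTION_FILES:
        path = Path(repo_root) / name
        if not path.is_file():
            continue  # a missing file is not a budget violation
        text = path.read_text(encoding="utf-8")
        count = sum(1 for line in text.splitlines() if line.strip())
        if count > max_lines:
            over[name] = count
    return over
```

A line count is a blunt proxy for quality, but it forces the conversation about what deserves to stay in the file.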

Here is the kind of content I want in an AI-facing project file.

Instruction category	Example content
Architecture boundaries	“API routes must call service layer methods; do not query the database directly from route handlers.”
Security invariants	“Every tenant-scoped query must include `organization_id`; add negative tests for cross-tenant access.”
Command map	“Use `pnpm test:unit`, `pnpm test:contract`, `pnpm lint`, and `pnpm typecheck` before proposing a PR.”
Stack constraints	“Do not introduce a new queue, ORM, cloud service, or auth provider without an ADR.”
Migration rules	“All schema changes require reversible migrations and a compatibility note.”
Review expectations	“Explain risky changes, changed auth behavior, and any generated code that was not tested.”

This is where many AI coding workflows quietly fail. Developers try to fix hallucination with more prompting, when the deeper problem is that the project itself has no stable operating manual.

Hooks Are the New Guardrails

If an agent can run shell commands, edit files, call APIs, and use external tools, instructions are not enough. You need enforcement.

Claude Code hooks can run at lifecycle events such as session start, before or after tool use, on file changes, and before compaction.¹¹ More importantly, Anthropic’s documentation explicitly warns that instructions are context for the model, while blocking behavior should be implemented with hooks.⁹ That distinction matters. “Please do not run destructive commands” is a request. A PreToolUse hook that blocks dangerous commands is a control.

In practice, I want hooks around at least four categories of behavior.

Hook target	Practical control
Dangerous shell commands	Deny or require approval for `rm -rf`, direct production commands, credential export, force pushes, destructive database operations
Secrets	Scan written files for API keys, tokens, private keys, `.env` leakage, and accidental log exposure
Dependency changes	Flag lockfile changes, new packages, lifecycle scripts, native modules, and suspicious dependency sources
Security-sensitive files	Require extra review for auth middleware, RLS policies, Firebase rules, IAM, Terraform state backends, CI secrets, and payment code

This is also where token efficiency improves. A hook can run the actual type checker instead of asking the model to infer type safety from memory. A hook can run tests. A hook can block a bad command before it becomes a production incident. We should let deterministic tools do deterministic work.
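To make the difference between a request and a control concrete, here is a minimal sketch of the decision logic a PreToolUse hook script could use. It rests on my understanding of the hook interface, which should be checked against the current Claude Code documentation: the hook receives a JSON payload on stdin with `tool_name` and `tool_input`, and exit code 2 blocks the call while stderr is passed back to the model. The deny patterns are illustrative and far from complete.

```python
import json
import re
import sys
from typing import Optional

# Illustrative deny patterns only; a real list needs review and tuning.
DENY_PATTERNS = [
    (re.compile(r"\brm\s+-\w*r\w*f|\brm\s+-\w*f\w*r", re.IGNORECASE),
     "recursive force delete"),
    (re.compile(r"\bgit\s+push\b.*\s(--force\S*|-f)(\s|$)"),
     "force push"),
    (re.compile(r"\b(drop\s+(table|database)|truncate\s+table)\b", re.IGNORECASE),
     "destructive SQL"),
]


def deny_reason(command: str) -> Optional[str]:
    """Return why a shell command is denied, or None if it is allowed."""
    for pattern, reason in DENY_PATTERNS:
        if pattern.search(command):
            return reason
    return None


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one hook payload (assumed shape)."""
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    reason = deny_reason(command)
    if reason is None:
        return 0, ""
    return 2, f"Blocked by policy ({reason}): {command}"


def main() -> int:
    """Read one payload from stdin and report the decision on stderr."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code

# In an actual hook script, the file would end with: raise SystemExit(main())
```

Pattern matching on command strings is easy to evade, so a hook like this complements least-privilege credentials and sandboxing rather than replacing them.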

Security: Generated Code Is Untrusted Code

AI-generated code should enter the repository with the same suspicion as code copied from a random blog post. That does not mean it is bad. It means it has no trust until it earns trust.

The main security risks are not exotic. They are the same boring failures that already hurt software teams, accelerated by a machine that can produce them faster.

Risk	How it appears in AI-assisted development	Guardrail
Broken authorization	The agent adds an endpoint but forgets tenant boundaries, ownership checks, role checks, or row-level policy coverage	Negative auth tests, policy tests, route middleware tests, human review
Weak database rules	Supabase RLS, Firebase rules, or SQL policies are too broad, happy-path only, or inconsistent with app-layer auth	Dedicated policy tests, deny-by-default review, cross-user/cross-tenant test cases
Secret exposure	The agent logs tokens, commits `.env`, hardcodes API keys, or prints secrets during debugging	Secret scanning, vault-backed envs, hooks, CI checks, no secrets in prompts
Supply-chain risk	The agent installs packages casually, accepts typosquatted dependencies, or upgrades lockfiles without review	Lockfile diff review, SBOM, vulnerability scan, package provenance checks
Prompt injection	Untrusted docs, issues, websites, or dependency files contain instructions that manipulate the agent	Treat external text as data, restrict tool permissions, review commands before execution
Excessive agency	The agent has permission to deploy, delete, charge, email, migrate, or mutate production state without approval	Least privilege, approval gates, sandbox-first execution, separate production credentials
Insecure output handling	Model-generated strings flow into shell, SQL, HTML, eval-like APIs, or workflow files without escaping	Static analysis, code review, parameterized APIs, sandbox tests

OWASP’s LLM application risks are directly relevant here because AI coding agents combine model output, tool access, plugins, data retrieval, and sometimes autonomous action.⁴ NIST’s SSDF is also relevant because it treats secure development as a lifecycle, not a final scan.³
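The secret-exposure row is a good example of a guardrail that does not need a model at all. The sketch below scans text for a few well-known shapes; the three patterns are illustrative assumptions, and a dedicated secret scanner covers far more formats.

```python
import re

# Illustrative patterns only; dedicated scanners cover far more formats.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded_credential": re.compile(
        r"(?i)(?:api[_-]?key|secret|token|password)\w*\s*[:=]\s*['\"][^'\"\s]{12,}['\"]"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

Run against files an agent has just written, a non-empty result is a reason to stop and look, not proof of a leak.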

AI code review helps, but it is not enough. A second model can catch issues the first model missed, and Codex-style PR review can focus attention on serious findings.⁷ But if the test suite has no negative authorization tests, if your RLS policies are never exercised, if secrets are available in the same environment where an agent can run arbitrary shell commands, and if deployment approval is just a green button, then AI review is theater.

The standard I want is simple: AI may accelerate implementation, but the security boundary must be deterministic. Auth rules, policy tests, CI checks, deployment gates, infrastructure state, and secrets handling must not depend on the model “remembering to be careful.”
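A negative authorization test is the smallest piece of that deterministic boundary. The sketch below uses a hypothetical in-memory `INVOICES` store and a hypothetical `get_invoice` function standing in for a real service layer; only the shape of the tests matters.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when a caller asks for data outside its organization."""


@dataclass(frozen=True)
class Invoice:
    id: int
    organization_id: int
    total_cents: int


# Hypothetical in-memory store standing in for a real database.
INVOICES = {
    1: Invoice(id=1, organization_id=10, total_cents=5000),
    2: Invoice(id=2, organization_id=20, total_cents=9900),
}


def get_invoice(invoice_id: int, caller_organization_id: int) -> Invoice:
    """Return the invoice only if it belongs to the caller's organization."""
    invoice = INVOICES.get(invoice_id)
    # Treat "missing" and "not yours" identically so callers cannot probe IDs.
    if invoice is None or invoice.organization_id != caller_organization_id:
        raise Forbidden(f"invoice {invoice_id} is not accessible")
    return invoice


def test_same_tenant_can_read():
    assert get_invoice(1, caller_organization_id=10).total_cents == 5000


def test_cross_tenant_read_is_denied():
    try:
        get_invoice(2, caller_organization_id=10)
    except Forbidden:
        return
    raise AssertionError("cross-tenant read was allowed")
```

The happy-path test is the one an agent tends to write unprompted. The second test is the one that catches a forgotten ownership check.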

Memory Is Useful Only If It Is Curated

Every serious AI coding setup eventually runs into the memory problem. The model forgets something important, repeats a mistake, rediscovers a command, ignores a style convention, or loses project context after compaction. The naive fix is to throw more text into CLAUDE.md, AGENTS.md, or some global memory file.

That works for about a week. Then the memory turns into mud.

I prefer to think about memory in layers.

Memory layer	Purpose	Risk
Personal/global knowhow	My general preferences, engineering taste, communication style, recurring tool choices	Can become too broad and bias every project incorrectly
Project memory	Architecture, commands, conventions, known traps, deployment model, testing strategy	Can become stale when the codebase changes
Local/session memory	Temporary notes, current branch context, work-in-progress constraints	Can leak into long-term memory when it should expire
Dreaming/consolidation	Periodic review of sessions to extract durable lessons and remove noise	Can hallucinate patterns or preserve bad habits if not audited

Anthropic’s Claude Code memory docs distinguish between project instructions and auto memory, and Claude’s managed-agent Dreams feature goes further by asynchronously reviewing past sessions and an existing memory store to produce a reorganized memory store.⁹ ¹² The Dreams documentation says the process can merge duplicates, replace stale or contradicted entries, and output a separate memory store that can be reviewed or discarded rather than modifying the original memory in place.¹²

That last detail is important. Memory should be reviewable. It should not silently become law.

A practical memory hygiene routine looks like this: project instructions stay concise; session notes stay local unless promoted; recurring mistakes become tests or hooks, not just prose; architecture decisions become ADRs; and periodic “dreaming” or consolidation outputs are reviewed before they become durable project memory.

If a model repeatedly forgets to include `organization_id` in tenant-scoped queries, the fix is not only “remember to include `organization_id`.” The real fix is a query helper, a linter rule, a policy test, and a project instruction explaining the invariant.
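As a rough sketch of the query-helper part, the function below makes the tenant filter impossible to omit. It uses `sqlite3` only so the example is self-contained; `tenant_select` and the `TENANT_TABLES` allowlist are hypothetical names, and a real project would express the same idea in its own data layer.

```python
import sqlite3

# Hypothetical allowlist: table names cannot be bound as SQL parameters,
# so they are checked against a fixed set before being placed in the query.
TENANT_TABLES = {"projects", "invites"}


def tenant_select(conn: sqlite3.Connection, table: str, organization_id: int) -> list[tuple]:
    """Run a SELECT that is always filtered by organization_id."""
    if table not in TENANT_TABLES:
        raise ValueError(f"{table!r} is not a registered tenant-scoped table")
    if organization_id is None:
        raise ValueError("organization_id is required for tenant-scoped queries")
    sql = f"SELECT * FROM {table} WHERE organization_id = ?"
    return conn.execute(sql, (organization_id,)).fetchall()
```

Once a helper like this exists, the project instruction shrinks to one line ("use the helper"), and a linter rule can flag raw queries against tenant tables.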

Good Stack Selection Beats Clever Prompting

AI makes it easy to generate complexity. That is not the same as engineering progress.

A mature AI-assisted stack should be boring where boring helps. Prefer explicit contracts, ordinary databases, reproducible builds, infrastructure as code, source-controlled migrations, and clear runtime boundaries. Use Terraform, OpenTofu, Pulumi, Ansible, containers, Kubernetes, or managed services where they make sense, but keep the state understandable. Do not let an agent stitch together six proprietary services because it found examples in its training data.

Interoperability is a security feature as much as a business feature. If your app logic, auth model, database rules, background jobs, file storage, deployment, and observability are trapped inside one platform’s opaque runtime, you have made future review harder. You have also made incident response harder.

Design choice	AI-friendly reason	Security/interoperability reason
OpenAPI / typed contracts	Agents can implement against explicit boundaries	Reviewers can detect breaking changes and auth gaps
Postgres migrations	Agents can reason over schema history	Data model is portable and auditable
IaC	Agents can propose diffs instead of clicking consoles	Infrastructure state is reproducible and reviewable
Containers / Nix-like reproducible dev envs	Agents run consistent commands	Builds are less dependent on one laptop or hidden cloud state
ADRs	Agents inherit decision context	Teams can challenge and reverse decisions deliberately
Minimal service count	Less context and fewer failure modes	Smaller attack surface and easier migration path

The rule is not “never use managed services.” That would be naive. The rule is: use managed services through explicit contracts, source-controlled configuration, exportable data, and a migration story.

Minimal Testing, Maximum Signal

AI can generate thousands of tests. Most of them will be noise if the specification is weak.

I want fewer tests that hit the real risk. For AI-assisted work, the minimal efficient test set usually includes type checking, linting, unit tests around pure logic, contract tests for API boundaries, migration tests for schema changes, and a small number of end-to-end tests for the highest-value flows. For security-sensitive features, add negative authorization tests and policy tests.

Test type	What it catches	Why it matters with AI
Typecheck/lint	Mechanical mistakes, wrong imports, dead assumptions	Deterministic feedback beats model self-assessment
Unit tests	Pure logic regressions	Good for generated helper functions and edge cases
Contract tests	API/schema mismatch	Stops models from changing implicit interfaces silently
Migration tests	Broken schema evolution	Prevents “works on empty DB” illusions
Auth/RLS tests	Cross-user, cross-tenant, role escalation bugs	The most important generated-code failure class in many SaaS apps
E2E smoke tests	Broken core flows	Gives fast confidence without building a slow test cathedral
Security regression tests	Previously found vulnerabilities	Converts AI mistakes into permanent guardrails

The goal is not test volume. The goal is model-independent evidence.

When an agent says “all tests pass,” I want command output. When it says “authorization is enforced,” I want negative tests. When it says “the migration is safe,” I want rollback notes and compatibility reasoning. When it says “this dependency is fine,” I want to know why the package changed, who publishes it, and what scripts it runs.
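"I want command output" can be taken literally. The sketch below runs each check as a subprocess and keeps the exit code and combined output as evidence; the `Evidence` record and function names are hypothetical, and the commands passed in would be whatever the project's command map defines.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Evidence:
    command: list[str]
    exit_code: int
    output: str


def run_check(command: list[str], timeout_seconds: int = 600) -> Evidence:
    """Run one check and keep its exit code and combined output as evidence."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so the log reads in order
        text=True,
        timeout=timeout_seconds,
    )
    return Evidence(command=command, exit_code=result.returncode, output=result.stdout)


def all_passed(evidence: list[Evidence]) -> bool:
    """True only if every recorded check exited with status 0."""
    return all(item.exit_code == 0 for item in evidence)
```

For example, `run_check(["pnpm", "typecheck"])` would record the real type checker's verdict, which can be attached to the PR instead of the agent's summary of it.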

Token Efficiency Is an Architecture Problem

Developers often talk about token usage as if it were only a cost problem. It is also a quality problem.

Large contexts can help, but unstructured context becomes noise. If the model has to read half the repository to understand a single task, the architecture is either too implicit, too coupled, or poorly documented. Good project structure reduces token use because the agent can retrieve the right files, commands, and invariants quickly.

The highest-leverage token optimizations are not clever prompt tricks. They are engineering practices.

Practice	Token impact	Quality impact
Small issues and PRs	Less irrelevant context	Easier review and rollback
Repository maps	Faster file targeting	Fewer hallucinated paths and duplicated logic
Concise instruction files	Less repeated explanation	More consistent behavior across sessions
Specs and ADRs	Less re-derivation	Better architectural continuity
Hooks and CI	Less model reasoning about deterministic facts	More reliable feedback
Memory curation	Less stale context	Better long-term consistency
Contract-first APIs	Less ambiguous implementation space	Fewer integration bugs

The paradox is that the best way to use powerful models is to make them think less about things deterministic systems can already know.
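A repository map is one of those deterministic facts. The sketch below builds a compact directory listing that could be pasted into a spec or an instruction file. The skip list, depth limit, and output format are all assumptions chosen for illustration.

```python
import os

# Directories that rarely help an agent locate code; adjust per project.
SKIP_DIRS = {".git", "node_modules", "dist", "build", "__pycache__", ".venv"}


def repository_map(root: str, max_depth: int = 2) -> list[str]:
    """List directories up to max_depth with a file count for each."""
    root = os.path.abspath(root)
    lines = []
    for current, dirs, files in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirs[:] = sorted(d for d in dirs if d not in SKIP_DIRS)
        relative = os.path.relpath(current, root)
        depth = 0 if relative == "." else relative.count(os.sep) + 1
        if depth >= max_depth:
            dirs[:] = []  # stop descending past the depth limit
        label = "." if relative == "." else relative.replace(os.sep, "/")
        lines.append(f"{label}/ ({len(files)} files)")
    return lines
```

A few dozen lines like this cost far fewer tokens than letting the agent rediscover the layout by listing directories one at a time.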

The SPEC-SEC-RUN Loop

The practical model I keep coming back to is SPEC-SEC-RUN.

Layer	Meaning	Concrete artifacts
SPEC	Make intent explicit before code generation	GitHub/Jira issue, acceptance criteria, ADR, OpenAPI contract, DB migration sketch, threat notes
SEC	Treat generated code and agent actions as untrusted until reviewed	RLS/auth tests, secrets scanning, dependency review, SAST, code review by a second model, human approval
RUN	Make implementation reproducible and portable	IaC, containers, migrations, CI, hooks, minimal E2E tests, rollback notes

This loop is intentionally simple. It is not a grand methodology. It is a reminder that AI-assisted engineering still has the same three hard parts: knowing what to build, proving it is safe enough, and making it run reliably outside the demo.
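If it helps to see the loop as a merge gate, here is a minimal sketch. The artifact names are hypothetical labels loosely drawn from the table above, and how each one is detected (CI status, PR labels, linked issues) is left out on purpose.

```python
# Hypothetical artifact labels per layer; detection is out of scope here.
REQUIRED = {
    "SPEC": {"issue", "acceptance_criteria"},
    "SEC": {"auth_tests", "secret_scan", "human_approval"},
    "RUN": {"ci_green", "rollback_notes"},
}


def missing_artifacts(present: set[str]) -> dict[str, list[str]]:
    """Return, per layer, the required artifacts that are not yet present."""
    gaps = {}
    for layer, required in REQUIRED.items():
        missing = sorted(required - present)
        if missing:
            gaps[layer] = missing
    return gaps
```

An empty result means the change has something to show for every layer. A non-empty one tells the reviewer exactly which question is still unanswered.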

The Human Still Owns the Work

There is one final point I care about deeply. AI systems are tools. Useful tools. Sometimes extremely useful tools. I like using them. I love what they unlock when they are placed inside a serious engineering workflow.

But they are not the author of my architecture. They are not responsible for my production system. They are not accountable to my users. They do not carry the incident pager, sign the contract, protect the customer data, or explain the breach.

If I publish an article, the author is Alvar Laigna. If I ship code, the responsible person or team is the human team that chose to ship it. Manus, Claude, Gemini, GPT, Codex, Lovable, Replit, Firebase Studio, or any other AI service is a tool used under human direction.

That is not a philosophical detail. It is an engineering requirement.

The future of software development will not be won by people who let AI generate the most code. It will be won by developers who can turn AI into a disciplined, reviewable, secure, interoperable, and human-owned engineering system.

Vibe coding was the prototype. Spec-driven AI engineering is the production version.
