AI coding is no longer a toy. That is the uncomfortable part. It is also not magic. That is the part many teams still refuse to accept.
The frontier has moved from autocomplete to agentic engineering. Modern coding agents can inspect a repository, edit multiple files, run tools, execute tests, create pull requests, and keep enough context to handle tasks that would have looked impossible only a few years ago. The DeepSWE benchmark, which was designed around original long-horizon software engineering tasks, reports frontier models solving difficult repository-level problems at meaningful rates: the leading result captured at the time of writing shows GPT-5.5 at 70% ±4% Pass@1, with Claude Opus variants following behind on the same benchmark.1 That is not “toy demo” territory anymore.
But there is a second number every serious developer should care about more than the leaderboard. Princeton’s AI Agent Reliability Tracker argues that rising agent accuracy has not produced equally strong reliability. Their central finding is blunt: agents still fail unpredictably in practice, consistency remains weak, and a single accuracy score hides whether an agent is stable, robust, predictable, and safe.2
Capability is not reliability. A model that can solve a hard task once is not the same thing as an engineering system you can trust with production responsibility.
That distinction is where the real work starts. If we treat AI coding as a faster version of copy-paste programming, we get faster chaos. If we treat it as a new software supply chain, we can get leverage without surrendering engineering discipline.
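The gap between capability and reliability can be made concrete with a small calculation. The sketch below is my own illustration, not the metric used by either benchmark: it assumes each task was attempted several times and compares the average success rate with the share of tasks solved on every attempt. The task names and results are hypothetical.

```python
def pass_at_1(runs: dict[str, list[bool]]) -> float:
    """Average success rate over every individual attempt."""
    attempts = [ok for results in runs.values() for ok in results]
    return sum(attempts) / len(attempts)


def solved_every_time(runs: dict[str, list[bool]]) -> float:
    """Fraction of tasks that succeeded on all repeated attempts."""
    return sum(all(results) for results in runs.values()) / len(runs)


# Hypothetical data: four tasks, four attempts each.
runs = {
    "task-a": [True, True, True, True],
    "task-b": [True, False, True, True],
    "task-c": [True, True, False, True],
    "task-d": [False, True, True, False],
}

print(f"pass@1: {pass_at_1(runs):.2f}")                        # 0.75
print(f"solved on every attempt: {solved_every_time(runs):.2f}")  # 0.25
```

The same agent looks like a 75% system on one number and a 25% system on the other. Which number matters depends on whether you need a lucky run or a dependable colleague.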
From Prompting to a Software Supply Chain
The wrong question is: “Which AI tool writes the most code?”
The better question is: “What system of constraints, memory, review, tests, and deployment controls turns AI output into trustworthy software?”
This is why I do not think the next phase belongs to developers who merely prompt well. It belongs to teams that can build a repeatable AI-assisted engineering process. NIST’s Secure Software Development Framework frames secure software work around preparing the organization, protecting software, producing well-secured software, and responding to vulnerabilities.3 OWASP’s LLM application risk work adds a more agent-specific warning: prompt injection, insecure output handling, supply-chain vulnerabilities, sensitive information disclosure, excessive agency, and overreliance all become real engineering risks when models gain tool access.4
AI-generated code is not just text. In a real workflow, it becomes database migrations, authorization rules, API endpoints, cloud resources, CI pipelines, dependency changes, infrastructure configuration, and sometimes destructive shell commands. That means AI coding is part of your supply chain. It needs the same seriousness as dependencies, container images, Terraform modules, GitHub Actions, and production credentials.
| Old mental model | Better mental model |
|---|---|
| AI is an autocomplete tool | AI is a semi-autonomous contributor with tool access |
| Prompt quality is the main skill | Specification, review, memory, tests, and orchestration are the main skills |
| The output is probably fine if it compiles | The output is untrusted until reviewed, tested, and constrained |
| One model should do everything | Different models should challenge each other in different roles |
| Context is whatever fits in the chat | Context is engineered through docs, specs, memory, hooks, and task boundaries |
| Vendor tools own the workflow | The repository, contracts, and infrastructure code own the workflow |
This is the line I now draw for myself: AI can propose and implement, but it should not silently decide architecture, security policy, production permissions, or deployment state.
My Current Operating Model
My preferred workflow is intentionally multi-model and task-driven. I do not want one model to be the architect, developer, reviewer, tester, and release manager at the same time. That creates a monoculture. It also creates a false sense of certainty, because the same failure mode can survive every step.
For serious work, I like to split the process roughly like this.
| Stage | Primary tool/model role | Output I want |
|---|---|---|
| Initial specification | Claude Code | A concrete feature spec, repository map, affected files, acceptance criteria, risks, and a first implementation plan |
| Spec review | Gemini / Google AI tooling | A second-opinion critique: missing constraints, architecture alternatives, edge cases, simpler designs, and possible overengineering |
| Security/code review | GPT-5.5 / Codex-style reviewer | High-severity issue detection, auth flaws, data-leak paths, dependency risks, race conditions, and review comments |
| Implementation orchestration | Claude Code with strong tool use, agents, skills, hooks, and scoped memory | Small PR-sized changes, tests, migrations, docs, and repeatable command output |
| Task tracking | GitHub Issues / Jira | Audit trail, acceptance criteria, linked PRs, and human-readable engineering intent |
| Deployment discipline | Terraform/OpenTofu/Pulumi, Ansible, containers, GitHub Actions, Kubernetes where justified | Reproducible state, rollback options, and less vendor lock-in |
Claude Code is strong as an implementation and repository-navigation environment because it can read code, edit files, run commands, use project memory, work with hooks, use custom agents, and integrate with development workflows.5 Gemini is useful as a critic because I often want a broad-context reviewer that is not emotionally attached to the first plan and can challenge the architecture before code exists.6 GPT-5.5 or a Codex-style reviewer is useful as a separate security and code-review pass, especially when the instruction is not “make this prettier” but “find serious issues I would regret shipping.” OpenAI’s Codex GitHub integration, for example, can be invoked for PR review, reads repository instructions such as AGENTS.md, and focuses review comments on serious issues by default.7
This is not about brand loyalty. It is about separation of duties. In human teams, we do not ask the same developer to write code, approve their own security model, rubber-stamp the PR, and deploy straight to production without telemetry. AI should not get a lower bar just because it is fast.
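Separation of duties is easy to state and easy to forget, so it helps to check it mechanically. This is a minimal sketch under my own assumptions: the `Stage` type, the stage names, and the model labels are all hypothetical, and the only rule enforced is that whoever produces an artifact does not also review it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    produced_by: str  # model or tool that creates the artifact
    reviewed_by: str  # model or human that checks it


def self_reviewed(stages: list[Stage]) -> list[str]:
    """Return the names of stages where producer and reviewer are the same."""
    return [s.name for s in stages if s.produced_by == s.reviewed_by]


# Hypothetical pipeline; the last stage violates the rule on purpose.
pipeline = [
    Stage("spec", produced_by="claude-code", reviewed_by="gemini"),
    Stage("implementation", produced_by="claude-code", reviewed_by="codex-reviewer"),
    Stage("release", produced_by="claude-code", reviewed_by="claude-code"),
]

print(self_reviewed(pipeline))  # ['release']
```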
Start With Tasks, Not Vibes
The biggest improvement most teams can make is boring: write better tasks.
Agentic coding works best when the work is bounded. GitHub’s own guidance for assigning work to coding agents says issues should be clear, well-scoped, include acceptance criteria, and include relevant files or implementation notes where possible.8 It also warns that agents are a poor fit for ambiguous, production-critical, sensitive, security-heavy, or deep-domain tasks unless there is strong human oversight.8
This matches my experience. If the task is “improve onboarding,” the agent will hallucinate product intent. If the task is “implement invite-only team onboarding using the existing OrganizationMember model, add a pending invite state, enforce tenant isolation, update the OpenAPI contract, and add negative authorization tests,” the agent has rails.
A good AI-ready issue should include the following information in prose, not just as a checklist.
| Field | Why it matters |
|---|---|
| Problem statement | Prevents the agent from optimizing a different problem |
| Non-goals | Reduces scope creep and token waste |
| Acceptance criteria | Gives the agent and reviewer a shared definition of done |
| Security notes | Forces early thinking about auth, data boundaries, secrets, and abuse cases |
| Likely files | Reduces repository wandering and irrelevant context loading |
| Contracts | Keeps API, database, and event boundaries explicit |
| Test expectations | Prevents “it compiles” from becoming the definition of quality |
| Rollback or migration notes | Forces operational thinking before deployment |
This is not bureaucracy. This is how we turn probabilistic tool use into deterministic engineering flow.
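One way to keep this honest is to refuse to hand an issue to an agent until the fields above are filled in. The sketch below assumes the issue has already been parsed into a dictionary; the field keys are my own naming for the rows of the table, not a GitHub or Jira schema.

```python
# Hypothetical keys mirroring the table above.
REQUIRED_FIELDS = (
    "problem_statement",
    "non_goals",
    "acceptance_criteria",
    "security_notes",
    "likely_files",
    "contracts",
    "test_expectations",
    "rollback_notes",
)


def missing_fields(issue: dict[str, str]) -> list[str]:
    """List required fields that are absent or contain only whitespace."""
    return [name for name in REQUIRED_FIELDS if not issue.get(name, "").strip()]


draft = {
    "problem_statement": "Teams cannot invite members without admin help.",
    "acceptance_criteria": "Pending invites expire; cross-tenant invites are rejected.",
    "likely_files": "OrganizationMember model, invite service, OpenAPI contract",
    "security_notes": "   ",  # whitespace only, so it counts as missing
}

print(missing_fields(draft))
```

The check is deliberately dumb. It does not judge quality, it only stops an empty security section from slipping through because everyone was in a hurry.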
The Repository Must Become AI-Readable
A codebase that is hard for humans to understand is almost always worse for agents. The model may look confident, but it is still navigating your architecture through text, file names, local patterns, tests, and instructions. If those are missing, stale, or contradictory, the agent will invent glue.
The practical answer is to make the repository self-describing.
    repo/
      AGENTS.md
      CLAUDE.md
      GEMINI.md
      .github/
        copilot-instructions.md
        workflows/
      docs/
        architecture.md
        security.md
        ai-workflow.md
        decisions/
          ADR-0001-stack.md
        specs/
          feature-name.md
      api/
        openapi.yaml
      db/
        migrations/
        policies/
          rls-tests.md
      tests/
        contract/
        e2e/
        security/
      infra/
        terraform/ or opentofu/
        ansible/
      .claude/
        skills/
        agents/
        hooks/
The exact folder names matter less than the principle: intent should live in the repository, not only in somebody’s head or in yesterday’s chat window.
Anthropic’s Claude Code documentation describes CLAUDE.md as a place for project instructions, style conventions, architecture notes, and recurring commands. It also notes that Claude can import other files, which makes it practical to reference a shared AGENTS.md instead of duplicating instructions across tools.9 GitHub Copilot’s coding agent documentation similarly supports repository custom instructions, including AGENTS.md, CLAUDE.md, and GEMINI.md patterns.10
The important constraint is that these instruction files should be short, concrete, and maintained. A 700-line graveyard of old preferences becomes prompt pollution. A concise file that states architecture boundaries, commands, test strategy, security expectations, and “never do this” rules becomes leverage.
Here is the kind of content I want in an AI-facing project file.
| Instruction category | Example content |
|---|---|
| Architecture boundaries | “API routes must call service layer methods; do not query the database directly from route handlers.” |
| Security invariants | “Every tenant-scoped query must include organization_id; add negative tests for cross-tenant access.” |
| Command map | “Use pnpm test:unit, pnpm test:contract, pnpm lint, and pnpm typecheck before proposing a PR.” |
| Stack constraints | “Do not introduce a new queue, ORM, cloud service, or auth provider without an ADR.” |
| Migration rules | “All schema changes require reversible migrations and a compatibility note.” |
| Review expectations | “Explain risky changes, changed auth behavior, and any generated code that was not tested.” |
This is where many AI coding workflows quietly fail. Developers try to fix hallucination with more prompting, when the deeper problem is that the project itself has no stable operating manual.
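Instruction files rot quietly, so I like a cheap check that runs in CI. The following is a sketch, not a standard tool: the 200-line budget and the list of required topics are assumptions you would tune per project, and the topic check is a plain substring match.

```python
# Assumed topics; adjust to whatever your project file must cover.
REQUIRED_TOPICS = ("architecture", "commands", "testing", "security", "never")


def lint_instructions(text: str, max_lines: int = 200) -> list[str]:
    """Report an oversized instruction file and any missing topics."""
    problems = []
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"too long: {line_count} lines (budget {max_lines})")
    lowered = text.lower()
    for topic in REQUIRED_TOPICS:
        if topic not in lowered:
            problems.append(f"missing topic: {topic}")
    return problems


sample = """# Architecture
API routes call the service layer.
# Commands
pnpm test:unit, pnpm lint, pnpm typecheck
# Security
Every tenant-scoped query includes organization_id.
"""

print(lint_instructions(sample))
# ['missing topic: testing', 'missing topic: never']
```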
Hooks Are the New Guardrails
If an agent can run shell commands, edit files, call APIs, and use external tools, instructions are not enough. You need enforcement.
Claude Code hooks can run at lifecycle events such as session start, before or after tool use, on file changes, and before compaction.11 More importantly, Anthropic’s documentation explicitly warns that instructions are context for the model, while blocking behavior should be implemented with hooks.9 That distinction matters. “Please do not run destructive commands” is a request. A PreToolUse hook that blocks dangerous commands is a control.
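To show the difference between a request and a control, here is a minimal sketch of the decision logic such a hook could use. It assumes the hook receives a JSON payload with `tool_name` and `tool_input.command` fields and that exit code 2 signals a block; check both assumptions against the current hooks documentation before relying on them. The deny patterns are illustrative and far from complete.

```python
import re

# Illustrative patterns only; a real deny list needs review and testing.
DENY_PATTERNS = [
    (r"\brm\s+-[a-z]*r[a-z]*f|\brm\s+-[a-z]*f[a-z]*r", "recursive force delete"),
    (r"\bgit\s+push\b.*(--force|-f\b)", "force push"),
    (r"\bdrop\s+(table|database)\b", "destructive SQL"),
]


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message); 2 is assumed to mean 'block this call'."""
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    for pattern, reason in DENY_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            return 2, f"Blocked by policy: {reason}"
    return 0, ""


# Hypothetical payloads shaped like the assumed hook input.
print(decide({"tool_name": "Bash", "tool_input": {"command": "rm -rf build/"}}))
print(decide({"tool_name": "Bash", "tool_input": {"command": "pnpm test:unit"}}))
```

A real hook script would read the payload from standard input, write the message to standard error, and exit with the returned code. Keeping the decision in a pure function means the policy itself can be unit tested.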
In practice, I want hooks around at least four categories of behavior.
| Hook target | Practical control |
|---|---|
| Dangerous shell commands | Deny or require approval for rm -rf, direct production commands, credential export, force pushes, destructive database operations |
| Secrets | Scan written files for API keys, tokens, private keys, .env leakage, and accidental log exposure |
| Dependency changes | Flag lockfile changes, new packages, lifecycle scripts, native modules, and suspicious dependency sources |
| Security-sensitive files | Require extra review for auth middleware, RLS policies, Firebase rules, IAM, Terraform state backends, CI secrets, and payment code |
This is also where token efficiency improves. A hook can run the actual type checker instead of asking the model to infer type safety from memory. A hook can run tests. A hook can block a bad command before it becomes a production incident. We should let deterministic tools do deterministic work.
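The secrets row in the table is a good example of deterministic work. The sketch below scans text for a few well-known shapes; the pattern set is a small assumption-laden sample, and a dedicated secret scanner will catch far more. The sample input is fabricated for illustration.

```python
import re

# A deliberately small, illustrative pattern set.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded credential": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"][^'\"\s]{12,}['\"]"
    ),
}


def scan(text: str) -> list[tuple[int, str]]:
    """Return (line_number, label) for every suspicious line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, label))
    return findings


# Fabricated sample content; none of these values are real credentials.
sample = "\n".join([
    "DEBUG = True",
    'aws_key = "AKIA' + "A" * 16 + '"',
    'api_key = "sk_live_0123456789abcdef"',
    'token = os.environ["SERVICE_TOKEN"]',
])

for line_no, label in scan(sample):
    print(f"line {line_no}: {label}")
```

Note that the last line, which reads the token from the environment, is not flagged. That is the behavior I want the agent to converge on.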
Security: Generated Code Is Untrusted Code
AI-generated code should enter the repository with the same suspicion as code copied from a random blog post. That does not mean it is bad. It means it has no trust until it earns trust.
The main security risks are not exotic. They are the same boring failures that already hurt software teams, accelerated by a machine that can produce them faster.
| Risk | How it appears in AI-assisted development | Guardrail |
|---|---|---|
| Broken authorization | The agent adds an endpoint but forgets tenant boundaries, ownership checks, role checks, or row-level policy coverage | Negative auth tests, policy tests, route middleware tests, human review |
| Weak database rules | Supabase RLS, Firebase rules, or SQL policies are too broad, happy-path only, or inconsistent with app-layer auth | Dedicated policy tests, deny-by-default review, cross-user/cross-tenant test cases |
| Secret exposure | The agent logs tokens, commits .env, hardcodes API keys, or prints secrets during debugging | Secret scanning, vault-backed envs, hooks, CI checks, no secrets in prompts |
| Supply-chain risk | The agent installs packages casually, accepts typosquatted dependencies, or upgrades lockfiles without review | Lockfile diff review, SBOM, vulnerability scan, package provenance checks |
| Prompt injection | Untrusted docs, issues, websites, or dependency files contain instructions that manipulate the agent | Treat external text as data, restrict tool permissions, review commands before execution |
| Excessive agency | The agent has permission to deploy, delete, charge, email, migrate, or mutate production state without approval | Least privilege, approval gates, sandbox-first execution, separate production credentials |
| Insecure output handling | Model-generated strings flow into shell, SQL, HTML, eval-like APIs, or workflow files without escaping | Static analysis, code review, parameterized APIs, sandbox tests |
OWASP’s LLM application risks are directly relevant here because AI coding agents combine model output, tool access, plugins, data retrieval, and sometimes autonomous action.4 NIST’s SSDF is also relevant because it treats secure development as a lifecycle, not a final scan.3
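Insecure output handling is the easiest row in the table to demonstrate. The sketch below uses Python's standard `sqlite3` module with a hypothetical `users` table and contrasts string interpolation with a parameterized query when the input is attacker-shaped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",)],
)


def find_user_unsafe(email: str) -> list[tuple]:
    # Anti-pattern: untrusted text becomes part of the SQL statement.
    return conn.execute(f"SELECT id FROM users WHERE email = '{email}'").fetchall()


def find_user(email: str) -> list[tuple]:
    # Parameterized: the driver passes the value separately from the SQL.
    return conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchall()


payload = "x' OR '1'='1"
print(find_user_unsafe(payload))  # [(1,), (2,)] : every row leaks
print(find_user(payload))         # []           : treated as a literal string
```

The same reasoning applies to shell commands, HTML, and workflow files: model output is data until a safe API says otherwise.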
AI code review helps, but it is not enough. A second model can catch issues the first model missed, and Codex-style PR review can focus attention on serious findings.7 But if the test suite has no negative authorization tests, if your RLS policies are never exercised, if secrets are available in the same environment where an agent can run arbitrary shell commands, and if deployment approval is just a green button, then AI review is theater.
The standard I want is simple: AI may accelerate implementation, but the security boundary must be deterministic. Auth rules, policy tests, CI checks, deployment gates, infrastructure state, and secrets handling must not depend on the model “remembering to be careful.”
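A negative authorization test is the smallest piece of deterministic evidence I know. This is a sketch against a hypothetical in-memory document store; `get_document`, `Forbidden`, and the sample data are invented for illustration, and a real test would exercise your actual service layer or RLS policies.

```python
class Forbidden(Exception):
    """Raised when the caller may not see the requested resource."""


# Hypothetical data: each document belongs to exactly one organization.
DOCUMENTS = {
    "doc-1": {"organization_id": "org-a", "title": "Roadmap"},
    "doc-2": {"organization_id": "org-b", "title": "Payroll"},
}


def get_document(doc_id: str, caller_org: str) -> dict:
    doc = DOCUMENTS.get(doc_id)
    # Same error for "missing" and "not yours" so existence is not leaked.
    if doc is None or doc["organization_id"] != caller_org:
        raise Forbidden(doc_id)
    return doc


def test_owner_can_read() -> None:
    assert get_document("doc-1", caller_org="org-a")["title"] == "Roadmap"


def test_cross_tenant_read_is_denied() -> None:
    try:
        get_document("doc-1", caller_org="org-b")
    except Forbidden:
        return
    raise AssertionError("cross-tenant read was allowed")


test_owner_can_read()
test_cross_tenant_read_is_denied()
print("authorization tests passed")
```

The happy-path test is the one an agent writes unprompted. The second test is the one that tells me the boundary exists.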
Memory Is Useful Only If It Is Curated
Every serious AI coding setup eventually runs into the memory problem. The model forgets something important, repeats a mistake, rediscovers a command, ignores a style convention, or loses project context after compaction. The naive fix is to throw more text into CLAUDE.md, AGENTS.md, or some global memory file.
That works for about a week. Then the memory turns into mud.
I prefer to think about memory in layers.
| Memory layer | Purpose | Risk |
|---|---|---|
| Personal/global knowhow | My general preferences, engineering taste, communication style, recurring tool choices | Can become too broad and bias every project incorrectly |
| Project memory | Architecture, commands, conventions, known traps, deployment model, testing strategy | Can become stale when the codebase changes |
| Local/session memory | Temporary notes, current branch context, work-in-progress constraints | Can leak into long-term memory when it should expire |
| Dreaming/consolidation | Periodic review of sessions to extract durable lessons and remove noise | Can hallucinate patterns or preserve bad habits if not audited |
Anthropic’s Claude Code memory docs distinguish project instructions and auto memory, and Claude’s managed-agent Dreams feature goes further by asynchronously reviewing past sessions and an existing memory store to produce a reorganized memory store.9 12 The Dreams documentation says the process can merge duplicates, replace stale or contradicted entries, and output a separate memory store that can be reviewed or discarded rather than modifying the original memory in place.12
That last detail is important. Memory should be reviewable. It should not silently become law.
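The reviewable part is the piece worth copying even in a homemade setup. The sketch below is not the Dreams API; it is a toy consolidation under my own assumptions, where each note has a key and a revision counter, the newest note per key wins, and the original store is never modified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    key: str       # what the note is about, e.g. "test-command"
    text: str
    revision: int  # higher means more recent


def consolidate(store: list[Note]) -> list[Note]:
    """Return a NEW store that keeps only the latest note per key."""
    latest: dict[str, Note] = {}
    for note in store:
        current = latest.get(note.key)
        if current is None or note.revision > current.revision:
            latest[note.key] = note
    return sorted(latest.values(), key=lambda n: n.key)


def dropped(original: list[Note], proposed: list[Note]) -> list[Note]:
    """Notes a human should look at before accepting the proposal."""
    kept = set(proposed)
    return [note for note in original if note not in kept]


# Hypothetical memory entries.
store = [
    Note("test-command", "Run npm test", revision=1),
    Note("test-command", "Run pnpm test:unit", revision=3),
    Note("tenant-rule", "Scope queries by organization_id", revision=2),
]

proposal = consolidate(store)
print([n.text for n in proposal])
print([n.text for n in dropped(store, proposal)])
```

Because `consolidate` returns a proposal instead of mutating `store`, the diff can go through the same review path as any other change.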
A practical memory hygiene routine looks like this: project instructions stay concise; session notes stay local unless promoted; recurring mistakes become tests or hooks, not just prose; architecture decisions become ADRs; and periodic “dreaming” or consolidation outputs are reviewed before they become durable project memory.
If a model repeatedly forgets to include organization_id in tenant-scoped queries, the fix is not only “remember to include organization_id.” The real fix is a query helper, a linter rule, a policy test, and a project instruction explaining the invariant.
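As an illustration of the query-helper part of that fix, here is a minimal sketch using `sqlite3`. The `projects` table, the allow-list, and the helper name `tenant_rows` are hypothetical; the point is only that the tenant filter lives in one place and cannot be omitted by the caller.

```python
import sqlite3

# Only tables known to carry an organization_id column may be queried.
TENANT_TABLES = {"projects"}


def tenant_rows(conn: sqlite3.Connection, table: str, organization_id: str) -> list[tuple]:
    """Fetch rows from a tenant-scoped table, always filtered by organization."""
    if table not in TENANT_TABLES:
        raise ValueError(f"{table!r} is not a registered tenant-scoped table")
    if not organization_id:
        raise ValueError("organization_id is required")
    # The table name comes from the allow-list; the value is parameterized.
    sql = f"SELECT name FROM {table} WHERE organization_id = ? ORDER BY name"
    return conn.execute(sql, (organization_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT, organization_id TEXT)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?)",
    [("alpha", "org-a"), ("beta", "org-b"), ("gamma", "org-a")],
)

print(tenant_rows(conn, "projects", "org-a"))  # [('alpha',), ('gamma',)]
```

With a helper like this in place, the project instruction shrinks to one line: use the helper, never raw queries, for tenant data.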
Good Stack Selection Beats Clever Prompting
AI makes it easy to generate complexity. That is not the same as engineering progress.
A mature AI-assisted stack should be boring where boring helps. Prefer explicit contracts, ordinary databases, reproducible builds, infrastructure as code, source-controlled migrations, and clear runtime boundaries. Use Terraform, OpenTofu, Pulumi, Ansible, containers, Kubernetes, or managed services where they make sense, but keep the state understandable. Do not let an agent stitch together six proprietary services because it found examples in its training data.
Interoperability is a security feature as much as a business feature. If your app logic, auth model, database rules, background jobs, file storage, deployment, and observability are trapped inside one platform’s opaque runtime, you have made future review harder. You have also made incident response harder.
| Design choice | AI-friendly reason | Security/interoperability reason |
|---|---|---|
| OpenAPI / typed contracts | Agents can implement against explicit boundaries | Reviewers can detect breaking changes and auth gaps |
| Postgres migrations | Agents can reason over schema history | Data model is portable and auditable |
| IaC | Agents can propose diffs instead of clicking consoles | Infrastructure state is reproducible and reviewable |
| Containers / Nix-like reproducible dev envs | Agents run consistent commands | Builds are less dependent on one laptop or hidden cloud state |
| ADRs | Agents inherit decision context | Teams can challenge and reverse decisions deliberately |
| Minimal service count | Less context and fewer failure modes | Smaller attack surface and easier migration path |
The rule is not “never use managed services.” That would be naive. The rule is: use managed services through explicit contracts, source-controlled configuration, exportable data, and a migration story.
Minimal Testing, Maximum Signal
AI can generate thousands of tests. Most of them will be noise if the specification is weak.
I want fewer tests that hit the real risk. For AI-assisted work, the minimal efficient test set usually includes type checking, linting, unit tests around pure logic, contract tests for API boundaries, migration tests for schema changes, and a small number of end-to-end tests for the highest-value flows. For security-sensitive features, add negative authorization tests and policy tests.
| Test type | What it catches | Why it matters with AI |
|---|---|---|
| Typecheck/lint | Mechanical mistakes, wrong imports, dead assumptions | Deterministic feedback beats model self-assessment |
| Unit tests | Pure logic regressions | Good for generated helper functions and edge cases |
| Contract tests | API/schema mismatch | Stops models from changing implicit interfaces silently |
| Migration tests | Broken schema evolution | Prevents “works on empty DB” illusions |
| Auth/RLS tests | Cross-user, cross-tenant, role escalation bugs | The most important generated-code failure class in many SaaS apps |
| E2E smoke tests | Broken core flows | Gives fast confidence without building a slow test cathedral |
| Security regression tests | Previously found vulnerabilities | Converts AI mistakes into permanent guardrails |
The goal is not test volume. The goal is model-independent evidence.
When an agent says “all tests pass,” I want command output. When it says “authorization is enforced,” I want negative tests. When it says “the migration is safe,” I want rollback notes and compatibility reasoning. When it says “this dependency is fine,” I want to know why the package changed, who publishes it, and what scripts it runs.
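"I want command output" can be taken literally. The sketch below runs each check as a subprocess and records the command, exit code, and combined output, so the claim "all tests pass" arrives with its evidence attached. The `Evidence` type and the two stand-in commands are assumptions for illustration; in a real project they would be your test and lint commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Evidence:
    command: list[str]
    exit_code: int
    output: str


def run_with_evidence(command: list[str], timeout: int = 300) -> Evidence:
    """Run a command and capture what actually happened."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return Evidence(command, result.returncode, result.stdout + result.stderr)


# Stand-in checks: one that succeeds and one that fails.
checks = [
    [sys.executable, "-c", "print('2 passed')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],
]

report = [run_with_evidence(check) for check in checks]
for item in report:
    status = "ok" if item.exit_code == 0 else "FAILED"
    print(f"{status}: exit={item.exit_code} output={item.output.strip()!r}")
```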
Token Efficiency Is an Architecture Problem
Developers often talk about token usage as if it were only a cost problem. It is also a quality problem.
Large contexts can help, but unstructured context becomes noise. If the model has to read half the repository to understand a single task, the architecture is either too implicit, too coupled, or poorly documented. Good project structure reduces token use because the agent can retrieve the right files, commands, and invariants quickly.
The highest-leverage token optimizations are not clever prompt tricks. They are engineering practices.
| Practice | Token impact | Quality impact |
|---|---|---|
| Small issues and PRs | Less irrelevant context | Easier review and rollback |
| Repository maps | Faster file targeting | Fewer hallucinated paths and duplicated logic |
| Concise instruction files | Less repeated explanation | More consistent behavior across sessions |
| Specs and ADRs | Less re-derivation | Better architectural continuity |
| Hooks and CI | Less model reasoning about deterministic facts | More reliable feedback |
| Memory curation | Less stale context | Better long-term consistency |
| Contract-first APIs | Less ambiguous implementation space | Fewer integration bugs |
The paradox is that the best way to use powerful models is to make them think less about things deterministic systems can already know.
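A repository map is a good example of this trade. The sketch below builds a compact map from the first non-empty line of each matching file; the file suffixes, the 80-character cap, and the demo files are my own assumptions, and a real map would probably be hand-curated or generated from richer metadata.

```python
import tempfile
from pathlib import Path


def first_line(path: Path) -> str:
    """Return the first non-empty line of a text file, or an empty string."""
    for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        if line.strip():
            return line.strip()
    return ""


def repo_map(root: Path, suffixes: tuple[str, ...] = (".py", ".md"), max_files: int = 50) -> str:
    """One line per file: relative path plus a short hint about its content."""
    lines = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            lines.append(f"{path.relative_to(root).as_posix()}: {first_line(path)[:80]}")
        if len(lines) >= max_files:
            break
    return "\n".join(lines)


# Demo on a throwaway directory with two hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "docs" / "architecture.md").write_text("# Service layer boundaries\n", encoding="utf-8")
    (root / "app.py").write_text('"""HTTP entry point."""\n', encoding="utf-8")
    demo_map = repo_map(root)

print(demo_map)
```

A few dozen lines like these cost far fewer tokens than letting the agent open files until it finds the right one.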
The SPEC-SEC-RUN Loop
The practical model I keep coming back to is SPEC-SEC-RUN.
| Layer | Meaning | Concrete artifacts |
|---|---|---|
| SPEC | Make intent explicit before code generation | GitHub/Jira issue, acceptance criteria, ADR, OpenAPI contract, DB migration sketch, threat notes |
| SEC | Treat generated code and agent actions as untrusted until reviewed | RLS/auth tests, secrets scanning, dependency review, SAST, code review by a second model, human approval |
| RUN | Make implementation reproducible and portable | IaC, containers, migrations, CI, hooks, minimal E2E tests, rollback notes |
This loop is intentionally simple. It is not a grand methodology. It is a reminder that AI-assisted engineering still has the same three hard parts: knowing what to build, proving it is safe enough, and making it run reliably outside the demo.
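The loop can double as a merge gate. This is a sketch with hypothetical artifact names that I chose to mirror the table above; it reports which layer is missing what, and leaves the decision to a human.

```python
# Hypothetical artifact names per layer; adapt to your own process.
REQUIRED_ARTIFACTS = {
    "SPEC": {"issue", "acceptance_criteria", "api_contract"},
    "SEC": {"auth_tests", "secret_scan", "human_approval"},
    "RUN": {"ci_run", "migration", "rollback_notes"},
}


def missing_by_layer(present: set[str]) -> dict[str, list[str]]:
    """Map each incomplete layer to the artifacts it still lacks."""
    gaps = {}
    for layer, required in REQUIRED_ARTIFACTS.items():
        missing = sorted(required - present)
        if missing:
            gaps[layer] = missing
    return gaps


pull_request = {
    "issue", "acceptance_criteria", "api_contract",
    "auth_tests", "secret_scan",
    "ci_run", "migration",
}

print(missing_by_layer(pull_request))
# {'SEC': ['human_approval'], 'RUN': ['rollback_notes']}
```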
The Human Still Owns the Work
There is one final point I care about deeply. AI systems are tools. Useful tools. Sometimes extremely useful tools. I like using them. I love what they unlock when they are placed inside a serious engineering workflow.
But they are not the author of my architecture. They are not responsible for my production system. They are not accountable to my users. They do not carry the incident pager, sign the contract, protect the customer data, or explain the breach.
If I publish an article, the author is Alvar Laigna. If I ship code, the responsible person or team is the human team that chose to ship it. Manus, Claude, Gemini, GPT, Codex, Lovable, Replit, Firebase Studio, or any other AI service is a tool used under human direction.
That is not a philosophical detail. It is an engineering requirement.
The future of software development will not be won by people who let AI generate the most code. It will be won by developers who can turn AI into a disciplined, reviewable, secure, interoperable, and human-owned engineering system.
Vibe coding was the prototype. Spec-driven AI engineering is the production version.