Stop Treating Claude Like a Chatbot: A Practical Field Guide to Agentic Software Work
The first generation of AI coding felt like autocomplete with better marketing. The next generation is different. Tools like Claude Code, Gemini CLI, Google Antigravity, and OpenAI Codex are not merely answering questions; they are reading codebases, editing files, running commands, calling tools, reviewing pull requests, and operating inside the same messy development environments where real software is built.1 2 3
That is powerful, but it also means the old mental model is wrong. If you treat an agent like a chatbot, you get chatbot-shaped software: confident, fast, plausible, and occasionally dangerous. If you treat it like a junior developer, you get disappointed, because it has no embodied memory of your product, no social cost for breaking conventions, and no instinctive fear of production. The better framing is more operational: an AI coding agent is a workflow amplifier. It becomes useful when you give it context, constraints, tools, tests, review loops, and a narrow definition of done.
The real unit of AI-assisted engineering is not the prompt. It is the closed loop: product intent → specification → issue → plan → implementation → test → review → fix → documentation → memory.
This article is a field guide for building that loop. It is written primarily for developers, but it should also be useful to product leads, founders, and one-person entrepreneurs who need to turn fuzzy ideas into working software without losing control of architecture, quality, or security.
The mistake: using an agent as a clever textbox
Most failed AI coding sessions have the same shape. The user opens a tool, types a broad instruction, watches it produce a large diff, notices two obvious mistakes, argues with it for thirty minutes, then either merges with anxiety or reverts in frustration. The model is blamed, but the process was already broken.
Software work has always required context. Human teams use architecture documents, coding standards, user stories, pull requests, CI, test suites, incident reviews, design systems, and product roadmaps because code is not self-explanatory. Agents need the same scaffolding, often more explicitly. Claude Code’s official documentation describes it as an agentic coding tool that can understand a codebase, edit files, run commands, and integrate with developer workflows.1 That capability is not a license to skip engineering discipline. It is a reason to make discipline executable.
The cheat-sheet version is simple. A project needs a short rulebook, reusable skills, connected tools, isolated subagents, plan-first behavior, compact context, durable memory, and bounded loops. In Claude Code vocabulary, that means CLAUDE.md, skills, MCP, subagents, /plan, /compact, hooks, and memory. In Codex vocabulary, the equivalent durable guidance is AGENTS.md, plus configuration, approval modes, sandboxing, MCP, skills, and plan-first workflows.4 In Google’s newer agentic stack, Antigravity similarly emphasizes agents that plan, execute, and verify across the editor, terminal, and browser, with artifacts such as plans, task lists, screenshots, and recordings.3
| Bad pattern | Better operating pattern |
|---|---|
| “Build this feature.” | “Read this issue, inspect these files, propose a plan, wait for approval, implement only scope A, run tests X/Y/Z, update docs, and list follow-up tasks.” |
| One mega-session for everything. | Short sessions with compacted context, explicit handoff notes, and durable memory updates. |
| Rules hidden in chat history. | Project rules in CLAUDE.md or AGENTS.md, product specs in Markdown, and enforcement in hooks/CI. |
| Agent writes code, human skims diff. | Agent writes tests, runs checks, self-reviews, then Gemini/Codex/Claude perform independent review before human merge. |
| Unlimited autonomy. | Least-capability autonomy: narrow tools, scoped credentials, approval gates, auditable operations. |
The point is not to slow down. The point is to stop paying the tax of repeatedly explaining your project to an amnesiac genius.
The agentic loop in one picture
A useful agentic workflow looks boring. That is a feature. It resembles a good engineering team, not a magic show. The agent should not “just code.” It should enter a controlled loop where product intent becomes testable work, and every step leaves an artifact.
| Loop stage | Human responsibility | Agent responsibility | Useful tools |
|---|---|---|---|
| Research | Define the market, user, constraint, or problem. | Summarize sources, compare options, identify unknowns. | Claude, Gemini, web research, docs. |
| Product intent | Decide what matters and what does not. | Convert fuzzy intent into user stories and acceptance criteria. | Claude Design, Pencil, Markdown PRDs. |
| Specification | Approve scope, edge cases, and non-goals. | Produce implementation-ready specs and risk notes. | docs/specs/*.md, issue templates. |
| Issue/task | Prioritize and assign. | Break work into atomic tasks with test commands. | GitHub Issues, Jira, Linear. |
| Plan | Challenge assumptions before code exists. | Inspect codebase, propose architecture, ask clarifying questions. | Claude /plan, Codex Plan mode, Antigravity Planning mode. |
| Build | Keep autonomy bounded. | Implement the approved plan, update files, run local checks. | Claude Code, Codex CLI, Gemini CLI. |
| Test | Define what “working” means. | Add unit, integration, and E2E tests; run them. | Playwright, Gauge, Pydantic, CI. |
| Review | Make final judgment. | Self-review, request independent model review, fix high-signal issues. | Codex review, Gemini Code Assist, Qlty. |
| Document | Preserve decisions. | Update README, ADRs, specs, changelog, issue status. | Markdown docs, PR notes. |
| Memory | Decide what becomes durable. | Summarize repeated lessons and propose rule/skill updates. | CLAUDE.md, .claude/rules/, skills, AGENTS.md. |
This loop matters because it changes the failure mode. Without it, the agent hallucinates requirements and invents architecture. With it, the agent becomes an execution engine inside a narrow corridor.
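The "every step leaves an artifact" rule can be made mechanical. The following is a minimal sketch, not a prescribed implementation: the stage names mirror the table above, while `LoopState`, `complete`, and the artifact paths are hypothetical names invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Stage order follows the loop table above; identifiers are illustrative.
STAGES = [
    "research", "product_intent", "specification", "issue", "plan",
    "build", "test", "review", "document", "memory",
]


@dataclass
class LoopState:
    """Tracks which artifact each finished stage left behind."""

    artifacts: dict[str, str] = field(default_factory=dict)

    @property
    def current_stage(self) -> str | None:
        # The first stage without an artifact is the one still in progress.
        for stage in STAGES:
            if stage not in self.artifacts:
                return stage
        return None  # every stage has an artifact: the loop is closed

    def complete(self, stage: str, artifact: str) -> None:
        """Record an artifact for `stage`, refusing to skip ahead."""
        if stage != self.current_stage:
            raise ValueError(
                f"cannot complete {stage!r}; current stage is {self.current_stage!r}"
            )
        if not artifact.strip():
            raise ValueError(f"stage {stage!r} must leave a non-empty artifact")
        self.artifacts[stage] = artifact


state = LoopState()
state.complete("research", "docs/research/notes.md")  # hypothetical path
```

Calling `state.complete("build", ...)` at this point raises `ValueError`, which is the whole idea: an agent cannot jump to implementation while the spec and plan artifacts are still missing.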
Pre-work: make the idea inspectable before code exists
The cheapest time to fix a bad product idea is before it becomes code. This is where developers and non-developers can collaborate well with AI. A product person does not need to understand the internals of a dependency injection container to provide examples, edge cases, user language, pricing constraints, onboarding friction, or compliance concerns. A founder does not need to write Playwright tests by hand to define what a user must be able to do.
Anthropic’s Claude Design is interesting because it moves some of this pre-work into a visual canvas. Anthropic describes Claude Design, in research preview, as a tool for creating designs, prototypes, slides, one-pagers, marketing collateral, and code-powered prototypes from prompts, files, images, codebases, and web capture.5 The documented workflow is deliberately product-friendly: create a project, add context, describe what to build, review the generated canvas, iterate through chat or inline comments, then export or share.6
That makes Claude Design useful before Claude Code. Use it to ask questions like: What screens are implied by this feature? Which onboarding path is confusing? What does the empty state say? How should the admin view differ from the customer view? What must be responsive on mobile? A good visual artifact is not final truth, but it is inspectable. It gives product, design, and engineering something concrete to argue with.
Pencil sits in a related but more marketing-oriented part of the workflow. Pencil describes itself as an AI operating system for marketing and ads, with workflows for generating, understanding, editing, launching, approving, and evaluating creative work.7 Its official material emphasizes model aggregation, brand governance, creative scoring, localization, approval workflows, audit trails, and direct publishing to ad platforms.7 For a solo entrepreneur or product team, the lesson is broader than the tool itself: before building a feature, test the message. If you cannot describe the value proposition in an ad, landing page, or onboarding card, you probably cannot specify the software cleanly either.
| Pre-work artifact | What it should answer | Why it helps the agent |
|---|---|---|
| One-page PRD | Who is this for, what problem is solved, what is out of scope? | Prevents invented requirements. |
| Clickable or visual prototype | What should the user experience feel like? | Reduces UI ambiguity before implementation. |
| Acceptance examples | What inputs, outputs, and edge cases matter? | Converts product intent into tests. |
| Decision log | Why this approach and not another? | Prevents the agent from reopening settled debates. |
| Review checklist | What must be true before merge? | Gives AI reviewers and humans the same target. |
A simple pre-work prompt is enough:
You are helping me turn a product idea into an implementation-ready spec.
Context:
- Product: [describe product]
- User: [target user]
- Business goal: [goal]
- Existing constraints: [tech, timeline, compliance, brand, pricing]
Task:
Interview me before proposing a solution. Ask up to 10 high-leverage questions.
Then produce:
1. A one-page PRD.
2. User stories.
3. Acceptance criteria.
4. Non-goals.
5. Edge cases.
6. Open questions.
7. Suggested GitHub issues.
Do not write code yet.
The final sentence is not decoration. It is the difference between planning and premature implementation.
Claude Code setup: the minimum viable operating system
A Claude Code project should feel less like an empty folder and more like a small operating system for engineering decisions. The official Claude Code docs distinguish between user-written instructions, such as CLAUDE.md, and auto memory, which Claude writes based on learned project patterns and corrections.8 Both are context. Neither should be treated as absolute enforcement. If you need enforcement, use hooks, tests, CI, and permissions.9
A practical repository structure might look like this:
    my-product/
      CLAUDE.md
      CLAUDE.local.md              # gitignored personal preferences
      .claude/
        rules/
          frontend.md
          backend.md
          security.md
        skills/                    # each skill is a directory containing a SKILL.md
          write-playwright-test/
            SKILL.md
          review-api-security/
            SKILL.md
          update-adr/
            SKILL.md
        hooks/
          block-dangerous-shell.sh
          run-format-after-edit.sh
      docs/
        prd/
          checkout-redesign.md
        specs/
          loyalty-rules-engine.md
        adr/
          0007-use-event-sourcing-for-points.md
        runbooks/
          local-dev.md
      .github/
        ISSUE_TEMPLATE/
          agent-task.yml
          security-review.yml
      tests/
      src/
The exact folders matter less than the principle: stable knowledge belongs in files, not in chat history. Chat is where work happens. The repository is where the team remembers.
CLAUDE.md: the constitution, not the novel
The most common mistake with project instructions is writing a manifesto. Long, vague rule files are easy to create and easy for an agent to ignore in practice. Anthropic’s memory guidance recommends keeping CLAUDE.md concise, specific, and structured, with larger or path-specific instructions moved into imported files, rules, or skills.8
Here is a practical starting point:
# CLAUDE.md
## Project context
This is a SaaS loyalty and customer-management platform. The product serves merchants, admins, and end customers. Reliability, auditability, and clear UX matter more than clever abstractions.
## Stack
- Frontend: React, TypeScript, Tailwind.
- Backend: Node.js, TypeScript, PostgreSQL.
- Tests: Vitest for unit tests, Playwright for E2E.
- Package manager: pnpm.
## Commands
- Install: `pnpm install`
- Typecheck: `pnpm typecheck`
- Lint: `pnpm lint`
- Unit tests: `pnpm test`
- E2E tests: `pnpm playwright test`
- Build: `pnpm build`
## Architecture rules
- Keep business rules in domain services, not UI components.
- Do not bypass repository/service layers for database access.
- Prefer explicit types over inferred `any`.
- Do not add new dependencies without explaining why.
## Security rules
- Never log tokens, passwords, API keys, session IDs, or PII.
- Treat all external input as untrusted.
- Use server-side authorization checks even if the UI hides an action.
- Any permission change must include tests for unauthorized access.
## Testing rules
- Every bug fix needs a regression test where practical.
- Every user-facing flow change needs either a Playwright test or a documented reason why not.
- Do not mark work complete until typecheck, lint, relevant unit tests, and relevant E2E tests pass.
## Definition of done
A task is done only when:
1. The requested behavior is implemented.
2. Relevant tests are added or updated.
3. The test commands above pass, or failures are documented as unrelated.
4. The PR description explains the change, risk, and verification.
5. Follow-up tasks are created for any discovered out-of-scope issues.
This file is not there to make the agent obedient by magic. It is there to make the expected behavior visible. If Claude repeatedly makes the same mistake, ask it for a short retrospective and update the rulebook only if the lesson is stable.
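Because a rule file is only context, the definition of done is more reliable when a script can check it. Here is a minimal sketch, assuming the commands listed in the example CLAUDE.md above; `failed_checks` and `run_command` are hypothetical helper names, and the runner is injectable so the gate can be exercised without pnpm installed.

```python
from __future__ import annotations

import shlex
import subprocess
from typing import Callable

# Commands copied from the "Commands" section of the example CLAUDE.md above.
VERIFICATION_COMMANDS = [
    "pnpm typecheck",
    "pnpm lint",
    "pnpm test",
    "pnpm playwright test",
]


def run_command(command: str) -> int:
    """Run one command and return its exit code (0 means success)."""
    return subprocess.run(shlex.split(command), check=False).returncode


def failed_checks(
    commands: list[str], run: Callable[[str], int] = run_command
) -> list[str]:
    """Return the commands that exited non-zero; an empty list means the gate passed."""
    return [command for command in commands if run(command) != 0]


# Demonstration with a fake runner that pretends only lint fails.
demo_failures = failed_checks(
    VERIFICATION_COMMANDS, run=lambda command: 1 if command == "pnpm lint" else 0
)
```

The same function could back a CI step or a hook; the point is that "done" becomes an exit code rather than a claim in chat.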
Skills, plugins, hooks, MCP, and subagents
Claude Code’s power comes from composability. The same is true across the newer agentic tools. Reusable workflows beat heroic prompts.
Claude Code plugins can package slash commands, specialized agents, hooks, and MCP servers so teams can share consistent workflows across projects.10 The official plugin marketplace also warns that users must trust plugins before installing them because plugins may include MCP servers, files, or software outside Anthropic’s control.11 That warning should be taken seriously. Plugin installation is software supply-chain behavior, not a cosmetic preference.
| Component | Plain-English meaning | Best use | Failure mode if misused |
|---|---|---|---|
| CLAUDE.md | Project rulebook loaded as context. | Stable project conventions, commands, architecture, definition of done. | Turns into a vague novel nobody follows. |
| Skills | Reusable playbooks or task templates. | Code review, Playwright test writing, ADR updates, release notes, security checks. | Duplicated prompts and inconsistent behavior. |
| Plugins | Shareable bundles of commands, agents, hooks, and MCP. | Team-wide setup, official workflows, repeated project bootstrapping. | Supply-chain risk if installed blindly. |
| Hooks | Commands or endpoints triggered at lifecycle events. | Blocking unsafe shell commands, formatting after edits, linting before commit. | False sense of safety if hooks only advise and never enforce. |
| MCP | A standard way for agents to connect to tools and data. | Databases, logs, browser automation, design systems, internal APIs. | Over-broad access and accidental data exposure. |
| Subagents | Specialized assistants with isolated context and tools. | Code search, security review, documentation, QA, migration planning. | Too many unsupervised agents with unclear success criteria. |
Hooks deserve special attention because they are where “please do not” becomes “cannot.” Claude Code hooks can run at lifecycle points such as SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, TaskCreated, TaskCompleted, PreCompact, and SessionEnd.9 A PreToolUse hook can deny dangerous operations. A PostToolUse hook can run formatters or log actions. A PreCompact hook can preserve key decisions before context is compressed.
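To make the "cannot" concrete, here is a sketch of a PreToolUse hook written in Python, as an alternative to the `block-dangerous-shell.sh` script in the repository layout. It rests on assumptions you should verify against the current hooks reference: that the hook receives a JSON payload on stdin with `tool_name` and `tool_input` fields, and that exit code 2 blocks the tool call while stderr is reported back to the agent. The deny-list patterns and the names `decide` and `run_hook` are illustrative only.

```python
from __future__ import annotations

import json
import re
import sys

# Illustrative deny-list; a real project would tune and extend these patterns.
DANGEROUS_PATTERNS = [
    r"\brm\s+-(?=\w*r)(?=\w*f)\w+",           # rm with both -r and -f flags
    r"\bgit\s+push\b.*(--force|-f\b)",        # force pushes
    r"\bdrop\s+(table|database)\b",           # destructive SQL via a CLI
]


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message); exit code 2 is assumed to block the call."""
    if payload.get("tool_name") != "Bash":
        return 0, ""  # this hook only inspects shell commands
    command = payload.get("tool_input", {}).get("command", "")
    for pattern in DANGEROUS_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return 2, f"Blocked by policy: command matches {pattern!r}"
    return 0, ""


def run_hook(raw_stdin: str) -> int:
    """Parse the hook payload, print any denial reason to stderr, return the exit code."""
    code, message = decide(json.loads(raw_stdin))
    if message:
        print(message, file=sys.stderr)
    return code


# A real hook script would end with: sys.exit(run_hook(sys.stdin.read()))
```

A regex deny-list is easy to evade, so treat it as one layer. Permissions, sandboxing, and scoped credentials still have to do the heavy lifting.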
Subagents are equally useful when they protect context. Anthropic describes subagents as specialized assistants with their own context windows, prompts, tool access, and permissions.12 Use them for tasks that would pollute the main thread: reading logs, searching a large codebase, checking security, summarizing docs, or comparing implementation options. Do not use them as a substitute for a plan.
Planning before implementation
The highest-leverage command in an agentic workflow is often “do not code yet.” OpenAI’s Codex best-practice docs recommend Plan mode for complex or ambiguous tasks and suggest asking the agent to interview the user when the idea is rough.4 Claude Code has the same practical pattern: force architecture before implementation.
A strong planning prompt looks like this:
We need to implement [feature/fix].
Before writing code:
1. Inspect the relevant files.
2. Summarize the current architecture.
3. List assumptions and unknowns.
4. Propose 2 implementation options.
5. Recommend one option with trade-offs.
6. Define acceptance criteria and tests.
7. Identify security/privacy risks.
8. Produce a step-by-step implementation plan.
Constraints:
- Do not change public APIs unless explicitly approved.
- Do not add dependencies without justification.
- Prefer small commits and minimal surface area.
- Ask questions if requirements conflict.
Stop after the plan and wait for approval.
For harder tasks, ask for a risk register. For migrations, ask for rollback steps. For user-facing flows, ask for acceptance tests before code. For performance work, ask for baseline measurement first. The agent will write code faster than it thinks; your job is to make thinking mandatory.
Tasks: GitHub Issues and Jira as the agent’s work queue
Agents work best when tasks are atomic, observable, and reviewable. A good issue is not bureaucracy. It is a machine-readable contract between product intent and engineering execution.
# Agent-ready issue template
## Background
What problem are we solving? Link PRD/spec/design if available.
## User story
As a [user], I want [capability], so that [outcome].
## Scope
Implement:
- [specific change]
- [specific change]
Do not implement:
- [explicit non-goal]
- [explicit non-goal]
## Relevant files or areas
- `src/...`
- `tests/...`
- `docs/...`
## Acceptance criteria
- [ ] Given [state], when [action], then [expected result].
- [ ] Unauthorized users cannot [restricted action].
- [ ] Empty/loading/error states are handled.
## Verification commands
- `pnpm typecheck`
- `pnpm lint`
- `pnpm test -- [relevant tests]`
- `pnpm playwright test [relevant spec]`
## Security and privacy notes
- Does this touch authentication, authorization, PII, billing, or external integrations?
- What must not be logged?
## Required updates
- [ ] Tests
- [ ] Docs or ADR if architecture changes
- [ ] Follow-up issues for discovered out-of-scope work
This template gives Claude Code or Codex enough information to act without inventing scope. It also gives product people a place to contribute without writing code. The acceptance criteria become tests. The non-goals prevent feature creep. The verification commands define done.
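Since the template is plain Markdown, the acceptance criteria can be lifted out mechanically and used as a checklist for tests or PR descriptions. The sketch below assumes the issue body follows the template above; `Criterion` and `acceptance_criteria` are hypothetical names.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Criterion:
    text: str
    done: bool


# Matches Markdown task-list items such as "- [ ] ..." or "- [x] ...".
CHECKBOX = re.compile(r"^- \[( |x|X)\] (.+)$")


def acceptance_criteria(issue_markdown: str) -> list[Criterion]:
    """Collect checkbox items found under the '## Acceptance criteria' heading."""
    criteria: list[Criterion] = []
    in_section = False
    for line in issue_markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Entering a new section; only one of them is the one we want.
            in_section = stripped.lower() == "## acceptance criteria"
            continue
        match = CHECKBOX.match(stripped)
        if in_section and match:
            criteria.append(
                Criterion(text=match.group(2), done=match.group(1) != " ")
            )
    return criteria
```

An agent or CI job could refuse to close the issue while any `Criterion.done` is still false.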
For Jira, use the same structure. The tool matters less than the artifact. The agent should be able to pick up a task, read the linked spec, propose a plan, implement, test, update the issue status, create follow-up tasks, and ask for feedback when requirements conflict. That is the difference between delegation and abdication.
Verification: Playwright, Gauge, Pydantic, and Qlty
AI-generated code does not reduce the need for verification. It increases it. Fast code generation can create a large volume of plausible code before anyone has proven that the system still works.
Playwright is my preferred default for browser E2E because it is built around reliable automation: auto-waiting, web-first assertions, isolated browser contexts, resilient locators, tracing, and parallel cross-browser execution.13 It is also becoming explicitly agent-friendly, with structured accessibility snapshots, MCP support, a CLI for coding agents, and session monitoring.13 That matters because agents should not rely on screenshots and guesswork when a structured accessibility tree is available.
Gauge has a different strength. It is an open-source acceptance testing framework where tests are written in Markdown, with support for multiple languages, plugins, screenshots on failure, reports, parallelization, and data-driven testing.14 If your product team needs readable acceptance specs that map closely to behavior, Gauge can be useful. I still reach for Playwright first in modern web apps, but Gauge is a serious option when acceptance language and stakeholder readability matter.
Pydantic belongs in the agentic stack for a subtler reason. It lets Python teams define data contracts with type hints, validation, serialization, strict or lax modes, and JSON Schema generation.15 That makes it useful for agent inputs, tool outputs, generated task specifications, API payloads, and any workflow where unstructured text must become structured data. If an agent produces JSON that drives automation, validate it before acting on it.
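As a sketch of "validate it before acting on it", the snippet below checks a JSON task description emitted by an agent. It assumes Pydantic v2; the `AgentTask` model, its fields, and `parse_agent_task` are hypothetical and exist only for illustration.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class AgentTask(BaseModel):
    """Hypothetical contract for a task an agent hands to automation."""

    # Forbid unknown keys and disable type coercion so bad output fails loudly.
    model_config = ConfigDict(extra="forbid", strict=True)

    issue_number: int = Field(gt=0)
    title: str = Field(min_length=1)
    verification_commands: list[str] = Field(min_length=1)


def parse_agent_task(raw_json: str) -> Optional[AgentTask]:
    """Return a validated task, or None when the agent output breaks the contract."""
    try:
        return AgentTask.model_validate_json(raw_json)
    except ValidationError as error:
        print(error)  # in a real pipeline, feed this back to the agent or a log
        return None
```

Strict mode is a deliberate choice here: an `issue_number` sent as the string `"123"` is rejected rather than silently coerced, which is usually what you want when the next step is automated.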
Qlty adds the code-health layer. Its docs describe a platform for linting, auto-formatting, security scanning, coverage, duplication detection, complexity, code smells, and metrics, with PR analysis that focuses on newly introduced issues and configurable quality gates.16 That is exactly the kind of guardrail AI-heavy teams need: not a quarterly debt sermon, but immediate feedback on the diff about to be merged.
| Verification layer | Recommended default | What it catches | What it does not replace |
|---|---|---|---|
| Types and schemas | TypeScript, Pydantic, OpenAPI/JSON Schema. | Contract drift, malformed data, unsafe assumptions. | Product correctness. |
| Unit tests | Vitest/Jest/Pytest/etc. | Local logic errors and regression cases. | Browser behavior or integration risk. |
| E2E tests | Playwright. | Real user flows, auth, navigation, UI regressions. | Human UX judgment. |
| Acceptance specs | Gauge where readable specs matter. | Stakeholder-readable behavior coverage. | Low-level implementation correctness. |
| Static quality | Qlty, linters, SAST, dependency scans. | New complexity, security smells, style regressions. | Threat modeling and runtime testing. |
| Independent review | Gemini, Codex, Claude reviewer subagent. | Missed edge cases, security concerns, diff risk. | Human accountability. |
External review: Gemini, Antigravity, and Codex
Do not let the agent that wrote the code be its only reviewer. Self-review helps, but independent review is valuable because different models and tools fail differently.
Google’s Gemini CLI is described as an open-source AI agent for the terminal, integrated with Gemini Code Assist, and usable for coding, research, problem solving, and task management.2 The Gemini CLI Security Extension documentation describes /security:analyze for analyzing code changes for vulnerabilities such as hardcoded secrets, injection risks, broken access control, insecure data handling, prompt-injection risks, and unsafe tool usage.17 The same documentation warns that the report is a first-pass analysis, not a complete security audit.17 That caveat is exactly right. Use it as another reviewer, not as a compliance stamp.
Gemini Code Assist can also work inside GitHub pull requests, where it can summarize changes, review PRs, identify potential bugs or best-practice issues, and respond to /gemini comments.18 That makes it a useful second pass after Claude Code has implemented a feature.
Google Antigravity takes a broader agentic-development approach. Its public pages position it as an environment where agents plan, execute, and verify across editor, terminal, and browser, producing artifacts such as plans, task lists, screenshots, and recordings.3 This is directionally important: the future coding environment is not just an editor with a chat panel. It is a manager surface for multiple agents, each with review policies, tool permissions, and evidence trails.
OpenAI Codex is also moving in this direction. Official docs recommend treating Codex as a teammate configured over time, with prompts that include goal, context, constraints, and done-when criteria.4 Codex’s GitHub integration supports @codex review, automatic PR reviews, and repository-specific review guidelines in AGENTS.md; OpenAI states that Codex focuses GitHub review comments on serious P0/P1 issues to keep feedback high-signal.19 Codex Security adds repository-wide audits, diff scans, and bounded remediation workflows, with cloud scanning in research preview.20
A practical review flow looks like this:
1. Claude Code implements the approved issue.
2. Claude Code runs tests and writes a self-review.
3. Qlty comments on newly introduced quality/security issues.
4. Gemini reviews the diff for security, auth, data handling, and edge cases.
5. Codex reviews the PR for serious P0/P1 implementation risks.
6. Human reviews architecture, product behavior, and merge risk.
7. Agent fixes approved findings only.
8. Follow-up issues are created for everything out of scope.
This may sound heavy, but it can be faster than a human-only process because most of the repetitive review happens in parallel or in near real time. The human reviewer should spend less time finding missing null checks and more time deciding whether the system is becoming better or worse.
Security posture: least-capability autonomy
StrongDM’s explainer on the principle of least privilege states the classic security idea: give users only the access they need to do their jobs, balancing productivity and protection.21 AI agents need the same principle, but adapted. I call it least-capability autonomy.
An agent should have exactly the tools, files, credentials, and network access needed for the task, and no more. It should ask before destructive operations. It should not see production secrets unless there is a specific audited reason. It should run in a sandbox or development environment by default. It should not be allowed to silently deploy, charge customers, email users, delete databases, or rotate credentials because a prompt accidentally implied urgency.
| Risk | Practical control |
|---|---|
| Agent deletes or rewrites important files. | Require approval for destructive shell commands; use git branches and small commits. |
| Agent leaks secrets into logs or prompts. | Keep secrets out of context; use secret managers; add hooks and scanners for accidental exposure. |
| Agent makes unauthorized production changes. | Separate dev/staging/prod credentials; require human approval for deploys and migrations. |
| Agent over-fetches private data. | Scope MCP/database access to read-only or filtered resources where possible. |
| Agent expands task scope silently. | Require issue-linked plans, non-goals, and follow-up tasks for discovered work. |
| Agent produces insecure code confidently. | Use security-specific review prompts, SAST, dependency scans, and human threat modeling. |
This is not anti-agent. It is pro-agent. Good autonomy requires boundaries. A race car needs brakes because it is fast, not because it is slow.
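One way to picture least-capability autonomy is as a per-task policy that answers allow, ask, or deny before any tool runs. The following sketch is an assumption-laden illustration, not a feature of any particular agent: `TaskPolicy`, `authorize`, the tool names, and the action names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class TaskPolicy:
    """Capabilities granted for one task; everything else is denied."""

    allowed_tools: frozenset[str]
    writable_roots: tuple[str, ...]
    # Actions that always require a human, even when the tool is allowed.
    needs_approval: frozenset[str] = frozenset({"deploy", "migrate", "delete"})


def authorize(
    policy: TaskPolicy, tool: str, action: str, path: str | None = None
) -> str:
    """Return 'allow', 'ask', or 'deny' for a requested operation."""
    if tool not in policy.allowed_tools:
        return "deny"
    if action in policy.needs_approval:
        return "ask"
    if action == "write":
        if path is None:
            return "deny"
        target = PurePosixPath(path)
        if ".." in target.parts:
            return "deny"  # refuse path traversal out of the allowed roots
        if not any(target.is_relative_to(root) for root in policy.writable_roots):
            return "deny"
    return "allow"


policy = TaskPolicy(
    allowed_tools=frozenset({"edit", "shell"}),
    writable_roots=("src", "tests"),
)
```

With this policy, writing `src/points.ts` is allowed, writing `infra/prod.tf` is denied, and a `deploy` action through the shell comes back as `ask`. Default-deny keeps new tools from becoming available by accident.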
Memory and “dreaming”: how projects should learn
Memory is not magic. It is project hygiene. Claude Code’s docs distinguish CLAUDE.md, which is user-written, from auto memory, which Claude can write based on learned project patterns and corrections.8 That distinction matters. Stable rules belong in explicit files. Observed patterns can become memory. Repeated workflows should become skills. Enforced policies should become hooks or CI checks.
The “dreaming” metaphor is useful if we keep it grounded. After a long feature or a sprint, ask the agent to summarize what it learned while the work is still fresh.
Review this completed task and produce a memory update proposal.
Include:
1. Repeated mistakes or corrections.
2. Project conventions that should be added to CLAUDE.md.
3. Workflows that should become skills.
4. Tests that should be added to prevent recurrence.
5. Documentation or ADR updates needed.
6. Items that should NOT become durable rules because they were task-specific.
Do not modify files yet. Propose the changes first.
This prevents two opposite failures. One failure is amnesia: the same correction is repeated every week. The other is context pollution: every temporary preference becomes a permanent rule until the agent is carrying a junk drawer of obsolete instructions. Use compacting deliberately. Summarize decisions, preserve links, and cut noise.
Concrete workflow recipe: solo founder MVP loop
A one-person entrepreneur has a different problem from a platform team. The bottleneck is not only code. It is deciding what to build, testing whether anyone cares, and shipping without creating an unmaintainable mess.
| Step | Action |
|---|---|
| 1. Product sketch | Use Claude or Claude Design to turn the idea into a one-page PRD, rough user flow, and landing-page story. |
| 2. Market/creative test | Use Pencil or a similar workflow to generate ad/landing variants and clarify the value proposition before building too much. |
| 3. Issue generation | Ask Claude to convert the PRD into 5–10 GitHub issues, each with acceptance criteria and non-goals. |
| 4. Plan first | For each issue, ask Claude Code to inspect the repo and propose a plan before implementation. |
| 5. Build small | Let Claude implement one issue at a time, with tests. |
| 6. Verify | Use Playwright for the core user flow and Qlty/lint/typecheck for PR feedback. |
| 7. Review | Ask Gemini or Codex to review the diff for security and edge cases. |
| 8. Learn | Update CLAUDE.md, docs, and follow-up tasks only after the feature is verified. |
The founder’s advantage is speed. The founder’s danger is entropy. Agents amplify both.
Concrete workflow recipe: developer feature loop
For a normal engineering feature, the workflow can be very explicit.
Take GitHub issue #123.
Process:
1. Read the issue, linked PRD, and relevant docs.
2. Inspect the codebase.
3. Produce a plan and wait.
4. After approval, implement the smallest viable change.
5. Add or update tests.
6. Run verification commands.
7. Produce a self-review with risk notes.
8. Update docs if needed.
9. Create follow-up issues for out-of-scope discoveries.
Definition of done:
- All acceptance criteria pass.
- Typecheck, lint, unit tests, and relevant Playwright tests pass.
- No new secrets, PII logging, or authorization bypasses.
- PR summary includes verification evidence.
This prompt is intentionally procedural. Agents are good at following procedure when the procedure is visible. They are bad at inferring your team’s hidden rituals from vibes.
Concrete workflow recipe: security review loop
Security review should be scoped. “Audit my app” is too broad unless you are deliberately running a deep assessment. For day-to-day development, review the diff and the touched threat model.
Review this pull request for security regressions.
Focus areas:
- Authentication and authorization.
- Tenant isolation.
- PII and secret handling.
- Injection risks.
- SSRF, XSS, CSRF where relevant.
- Unsafe deserialization or file handling.
- Missing rate limits or abuse controls.
- Logging of sensitive data.
Output format:
| Severity | File/Line | Finding | Exploit scenario | Recommended fix | Confidence |
Rules:
- Do not report generic best practices unless they apply to this diff.
- Prefer high-confidence findings.
- If evidence is insufficient, say what to inspect next.
Run this with a reviewer model separate from the implementer. Then run tools. Then let a human decide. Generated security analysis is a useful first pass, not a signed audit.17
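Because the prompt asks for a fixed table, the reviewer's output can be parsed and triaged before a human reads it. This is a rough sketch assuming the reviewer follows the six-column format above and puts no pipe characters inside cells; the `Finding` class, the function names, and the severity labels ("critical", "high") are assumptions, since the prompt does not define a severity scale.

```python
from __future__ import annotations

from dataclasses import dataclass

EXPECTED_COLUMNS = 6  # Severity, File/Line, Finding, Exploit scenario, Fix, Confidence


@dataclass
class Finding:
    severity: str
    location: str
    finding: str
    exploit_scenario: str
    recommended_fix: str
    confidence: str


def parse_findings(report: str) -> list[Finding]:
    """Turn Markdown table rows into Finding objects, skipping header and separator."""
    findings: list[Finding] = []
    for line in report.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != EXPECTED_COLUMNS:
            continue  # malformed row; a stricter pipeline might flag it instead
        if cells[0].lower() == "severity" or set(cells[0]) <= set("-: "):
            continue  # header row or the |---| separator
        findings.append(Finding(*cells))
    return findings


def needs_human(findings: list[Finding]) -> list[Finding]:
    """Keep serious findings the reviewer did not mark as low confidence."""
    return [
        item
        for item in findings
        if item.severity.lower() in {"critical", "high"}
        and item.confidence.lower() != "low"
    ]
```

Filtering is only a way to order the queue. Low-severity or low-confidence rows still belong in the PR discussion, and the human decision stays with the human.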
A compact tool map
| Need | Primary tool | Why |
|---|---|---|
| Product exploration and visual planning | Claude Design | Fast prototypes, canvases, one-pagers, slides, and handoff bundles.5 6 |
| Marketing/ad creative exploration | Pencil | Brand-governed generation, scoring, localization, approval, and launch workflows.7 |
| Main coding agent | Claude Code | Strong codebase interaction, memory, hooks, skills, subagents, MCP.1 8 9 12 |
| Parallel agentic environment | Google Antigravity | Planning, execution, verification, artifacts, multi-agent workspace.3 |
| Terminal/code review/security second pass | Gemini CLI / Gemini Code Assist | CLI agent workflows, GitHub review, security diff analysis.2 17 18 |
| PR review and OpenAI second pass | Codex | Plan-first workflows, AGENTS.md, GitHub @codex review, Codex Security.4 19 20 |
| Browser E2E and agent browser automation | Playwright | Reliable tests, traces, accessibility snapshots, MCP/CLI support.13 |
| Readable acceptance specs | Gauge | Markdown specs, multi-language support, reports, data-driven acceptance tests.14 |
| Structured validation | Pydantic | Type-hint validation, serialization, strict/lax modes, JSON Schema.15 |
| PR quality gates | Qlty | Newly introduced issues, coverage, security scanning, code health.16 |
The bottom line
The winning pattern is not “AI writes the code.” That is too small and too chaotic. The winning pattern is this: humans define intent, constraints, and taste; agents execute bounded loops; tests and reviewers keep everyone honest; memory turns repeated lessons into system behavior.
Claude Code is strongest when it is not alone. It needs product pre-work from tools like Claude Design, structured task queues in GitHub or Jira, executable verification through Playwright and Qlty, structured contracts through Pydantic, readable acceptance criteria through Gauge where appropriate, and independent review from Gemini, Antigravity, Codex, or another model. The exact stack will change. The discipline will not.
If you remember one rule, remember this: do not ask the agent to be brilliant; build a workflow where ordinary agent behavior produces reliable progress. That is how developers, product people, and solo entrepreneurs get leverage without surrendering judgment.
References
1. Anthropic Claude Code overview
2. Google Gemini CLI code analysis codelab
3. Google Antigravity
4. OpenAI Codex best practices
5. Anthropic announcement: Claude Design
6. Anthropic Help Center: Claude Design
7. Pencil official website
8. Anthropic Claude Code memory documentation
9. Anthropic Claude Code hooks reference
10. Anthropic Claude Code plugins README
11. Anthropic official Claude plugins directory
12. Anthropic Claude Code subagents documentation
13. Playwright official documentation
14. Gauge official website
15. Pydantic official documentation
16. Qlty documentation: What is Qlty?
17. Gemini CLI Security Extension official repository
18. Gemini Code Assist GitHub code review documentation
19. OpenAI Codex code review in GitHub
20. OpenAI Codex Security documentation
21. StrongDM: Principle of Least Privilege Explained
22. Referenced Towards AI article on Claude Code plugin; only accessible excerpt used
23. Referenced Plain English article on agentic AI; only accessible excerpt used