Most AI code reviews are noise. Here's how to fix that.

Published March 11, 2026 · Updated July 13, 2026

⚡ TL;DR

Ask Claude to review code without careful directions and you’ll get one of two extremes: a no-value LGTM or a storm of noise and miscellanea.

We’re all fighting cognitive exhaustion. Make the robot minimize your effort!

This skill flips the default: the reviewer must prove each issue with a real failing scenario before reporting it, and concisely summarize why it warrants your valuable attention. It ships for both Claude Code and Codex.

No proof, no report. The result is a short list of actual bugs instead of a long list of maybes.

When the stakes are higher — say, committing code a subagent wrote — the second-opinion skill gets a review from the other vendor’s model, and trusts neither pass until ground truth settles each finding.

Want to skip the explanation? Jump to the setup prompt.

⚠️ The problem: noise disguised as thoroughness

Claude’s default code review has two failure modes, and they compound each other.

Quantity over quality. Claude treats review as a completeness exercise. It scans for anything that could be an issue and reports everything. Style nitpicks sit next to real bugs. Speculative “what if someone later removes this guard” concerns sit next to actual logic errors. The signal drowns in noise.

Insufficient research. Claude reads the diff (or the file) and forms opinions based on what it sees in isolation. It doesn’t trace the full call chain. It doesn’t search for other callers. It doesn’t check whether a “missing” null check is handled three levels up. The result: findings that look plausible but fall apart when you check the actual code paths.

The combination is toxic. You get 15 findings, 12 are noise, and now you have to research each one to figure out which three matter. That’s the opposite of what code review is for.

🔍 What the review skill does differently

The skill inverts Claude’s default behavior with one rule: prove it or discard it.

For every potential issue, Claude must:

Read the project’s coding standards and design principles to understand the baseline for “issues.”
Read the actual code, not just the diff.
Search for all callers and usages to understand context.
Check for existing documentation that explains the design rationale (TPPs, design docs, comments).
Construct a specific failing scenario. If Claude can’t describe exactly how the bug manifests, it’s not an issue.
Discard it if research shows it’s intentional or already handled.

This is the opposite of how Claude naturally operates. Left to its defaults, Claude will report a potential null pointer without checking whether the caller guarantees non-null input, and then, gallingly, ask you to look into it. The skill forces it to check first.

What not to report

The skill explicitly excludes:

Style, organization, or naming preferences
Speculative future risks
Feature requests disguised as issues
Anything not backed by evidence from the codebase

This exclusion list matters. Without it, Claude will pad the review with “you might want to consider…” suggestions that waste your time.

📋 The skill

The canonical source lives in our plugin marketplace: photostructure/coding-skills. Install the coding plugin in either product:

# Claude Code
/plugin marketplace add photostructure/coding-skills
/plugin install coding@photostructure

# Codex
codex plugin marketplace add photostructure/coding-skills
codex plugin add coding@photostructure

Use /coding:review in Claude Code or $coding:review in Codex. You can also type $ in Codex to choose a skill.

Both the exclusion list and the proof gate live in the skill’s references/single-pass.md, which every reviewer — the primary one and any delegated helper — reads before judging a line of code.

The exclusion list:

Do not report style preferences, speculative future risks, feature requests,
issues outside changed lines, or diagnostics a compiler, typechecker, or linter
already reports.

Do not report an issue the code explicitly silences (`// eslint-disable`,
`# noqa`, `@ts-expect-error`, and similar). The author already made that call
deliberately. Report it only if you can prove the suppression itself is wrong.

That last one is worth dwelling on. A suppression comment is a decision someone already made, in writing. A reviewer that flags it isn’t finding a bug; it’s reopening an argument the author thought they’d finished. The escape hatch matters too: if the suppression is itself the bug, that’s still fair game — but now you have to prove it.

And the proof discipline:

1. Read the implementation, not only the diff.
2. Trace the complete call path and search for all relevant callers and uses.
3. Read nearby comments, tests, history, and design documents that may explain
   the behavior.
4. Construct a concrete failing scenario and compare it with the supplied
   ground truth when one is available.
5. Discard the candidate if an existing guard handles it, the behavior is
   intentional, or the failure cannot be demonstrated.

The marketplace file has the rest: the read-only rule, bounded delegation, and the two-step write-up-then-decide format.

Review first, edit only with permission

The shared skill treats a review request as read-only. It reports proven issues first and does not implement a fix until you authorize that change. Once you do, the agent can reproduce the failure, apply the focused fix, and run the relevant tests in the same task.

Validation needs executable evidence

When the host permits commands, the reviewer should run tests or a focused reproducer to prove that a suspected bug fails. The tool name differs by product; the evidence requirement does not.

Why explicit triage matters

Each finding gets a short ID and a complete write-up before the agent asks what to do. You accept, veto, or discuss each issue before any edit. The exact UI can be a normal question, a picker, or a checkbox; the decision boundary is what prevents unilateral fixes.

🔧 Adapting for your project

Reference your project’s standards. Replace the generic guidance with specific file paths. Put shared instructions in AGENTS.md, and keep CLAUDE.md when the project uses it:

## Before you start

Study these before reviewing:

- [AGENTS.md](AGENTS.md)
- [CLAUDE.md](CLAUDE.md) (when present)
- [DESIGN-PRINCIPLES.md](docs/DESIGN-PRINCIPLES.md)

Add project-specific “what to look for” items. The generic list covers correctness, code quality, and test gaps. Your project likely has additional concerns:

“Check that new public APIs have rate limiting”
“Verify database queries use parameterized inputs”
“Confirm error messages don’t leak internal paths”

Tune the exclusion list. If your team does want style feedback, remove that line. If you want Claude to suggest refactors, remove the “feature requests” line. The default is strict because noise is the bigger problem, but your team’s tolerance may differ.

Delegation is bounded, because it once wasn’t. An earlier version of these skills told the reviewer to use subagents freely: one per file, one per finding, another round for anything promising. It worked right up until a review subagent decided the correct way to review its slice was to run a double review. Which spawned two more reviewers. Which each had subagents available, and a skill telling them to use them.

That’s a fork bomb with a system prompt.

The fix is the one every recursive algorithm needs: a base case and a budget. The skill now spends at most two leaf reviews on an entire change — no subagent per file, no subagent per finding, no second iteration round. The primary review is yours. One leaf may take a coherent file group or a distinct angle (repository-guidance compliance, historical context). The other gets every candidate that survived the primary pass and is asked to disprove them by tracing the guards, callers, and design constraints the first pass might have missed.

The base case is the interesting part. Every leaf is launched with an explicit contract in its prompt — role: leaf-reviewer and delegation-budget: 0 — and the skill’s first section is a guard that reads it:

If the task identifies your role as `leaf-reviewer` or sets
`delegation-budget: 0`, read and follow `references/single-pass.md`, complete
one review yourself, return the report to the caller, and stop before the
orchestration and user-interaction steps below.

A leaf reviewer reads the same proof gate as everyone else, but the part of the skill that knows how to delegate is switched off before it ever gets there. It reviews, it reports, it terminates. The recursion has a floor.

The plugin ships a tool-restricted reviewer agent for those leaves; the skill uses it when the host exposes it, and falls back to an ordinary subagent otherwise. When no delegation mechanism exists at all, do the same exploration and validation yourself — narrow the scope rather than pretend an unverified scan is complete.

📦 Reviewing what you’re about to commit

review takes whatever scope you hand it. Most of the time the scope you actually care about is “the thing I’m about to commit,” so the plugin ships a sibling for exactly that: /coding:review-staged in Claude Code, or $coding:review-staged in Codex.

It reviews git diff --cached against the same proof gate, the same exclusion list, and the same two-leaf delegation budget — review-staged reads the very same single-pass.md. Two things are different, and both come from knowing the diff is commit-bound:

It judges the diff as a story. Beyond hunting bugs, it asks whether the staged changes tell one coherent story, and recommends how to split them if they don’t. That’s the same instinct behind gitplan, applied to the index you already built.

It ends in a commit message, and stops there. After the findings are adjudicated it drafts a Conventional Commit — focused on the why, since the diff already shows the what — and then waits. The skill says “Do NOT commit directly after the review” and requires explicit approval before it runs git commit.

Pair it with /coding:stage: stage decides what goes in the commit, review-staged decides whether what’s in there is correct and worth committing.

✌️ When one reviewer isn’t enough: ask the other model

The review skill fixes the noise problem for a single reviewer. The same plugin includes the second-opinion skill, which addresses a different problem: any single reviewer, however disciplined, is sometimes confidently wrong.

(It shipped as double-review through v1.1.0 of the coding plugin. The name described how it used to work — two blind reviewers over one diff — rather than what it gives you, so v2.0.0 renamed it.)

It’s a validation gate for freshly written code, especially code a subagent wrote. Run it once the code is written and the tests pass, before you commit. Two observations motivate it:

A green test suite is not proof of correctness. Implementers (subagents especially) settle for good-enough: they code until their own tests pass. Semantic mismatches with the spec, stateful-API gotchas, and edge cases the tests never pin — all survive.
Reviewers are confidently wrong, too. Every review pass mixes real bugs with plausible-but-wrong findings. Accept everything and you inject regressions; veto everything and you ship the bugs.

So the gate reviews the change twice, independently, then withholds judgment until ground truth rules on each finding. It ends with every finding either accepted with proof or vetoed with proof, and the accepted ones fixed and pinned by tests.

The version I actually use gets the second review from a different vendor’s model. A second opinion from the same model shares your blind spots: same training, same priors, same failure modes. Cross-model is the whole point.

Scope before launching

Both reviews get exactly the same scope, written down before anything runs:

The diff range: commit range, staged diff, or working-tree diff, plus the file list.
The ground truth: whatever a disputed finding can be tested against (a reference implementation you can execute, a spec with runnable examples, the real API), including the exact command to query it. If no executable ground truth exists, say so and name the fallback: spec text, or a maintainer ruling.
A scrutiny list: the three to six riskiest spots you’d check first. Stateful APIs, encoding boundaries, off-by-one-prone length math, error paths, concurrency. The list points both reviewers without capping them.

Ask the other model

Both CLIs run a review non-interactively and read-only, so each can call the other. If Claude is doing the work, it shells out to Codex:

codex exec review --base main "<scope, ground truth, scrutiny list>"

Use --commit <sha> for a commit you already made, or --uncommitted for working-tree and staged changes. If Codex is doing the work, it shells out to Claude:

claude -p "<same prompt>" --permission-mode plan

--permission-mode plan keeps that pass read-only.

The prompt carries the scope, the ground truth, the scrutiny list, and the proof-before-reporting bar. It does not carry your suspected findings: the second opinion is worth less the more you steer it.

Kick it off in the background, then read the new code yourself while it runs. You are the other reviewer, and the only one who knows the full context of what the change was supposed to do.

When the other CLI isn’t installed or isn’t authenticated, the skill falls back to a same-model subagent and tells you it did. That’s a weaker gate, and you should know when you’re getting it.

Vet every finding empirically

For each finding, from either reviewer or your own read, run ground truth and the new code on the same input and compare.

Accept only when ground truth confirms the bug.
Veto only when ground truth confirms the code is right, or the finding demands fidelity nothing requires (mimicking a reference’s internals on a path no contract pins, say).
A finding you can’t test this way gets downgraded to a question, never silently accepted.
When the diagnosis is right but the proposed fix is mediocre, take the better fix. Reviewers identify problems; you own the remedy.

Reviewer confidence, eloquence, and even agreement between the two passes are not evidence. Two models converging on the same wrong finding is less common across vendors than within one, but it still happens. One command against ground truth beats both.

Fix, pin, and record the verdicts

Every accepted finding gets a pinning test whose expected values come from ground truth (paste the command that produced them into the test’s comment). The full suite must be green again, not just the new tests.

Record the vetoes, too, and which model raised each one. The next session or the next reviewer will rediscover the same “bug,” and a written verdict with one-line evidence stops the re-litigation.

Two other skills call this gate. gitplan runs it on each staged theme before proposing the commit message, so the second opinion arrives while the local review is still going. And if you run whole queues of plans through subagents, tpp-orchestrate runs it after every plan — the TPP article covers that workflow.

🚀 Setting this up

Install the plugin:

# Claude Code
/plugin marketplace add photostructure/coding-skills
/plugin install coding@photostructure

# Codex
codex plugin marketplace add photostructure/coding-skills
codex plugin add coding@photostructure

Use /coding:review and /coding:second-opinion in Claude Code, or $coding:review and $coding:second-opinion in Codex. Start a new task after installing or updating the plugin.

To tune it for your project, open your coding agent in the project directory, use its planning mode when available, and paste this prompt:

We installed the `coding` plugin from
photostructure/coding-skills. The bundled
`review` skill is generic (`/coding:review` in Claude Code,
`$coding:review` in Codex).

Audit this project and propose project-specific tuning:

1. Which AGENTS.md, optional CLAUDE.md, and design docs the skill should
   reference by path
2. Project-specific "what to look for" items
   (e.g. parameterized DB queries, rate limiting)
3. Whether the strict exclusion list needs loosening
4. Any review concerns that should always be checked

Capture durable shared guidance in this project's AGENTS.md.
Also honor CLAUDE.md when present. Do not fork the skill.

🎲 Your mileage WILL vary

The exclusion list, the verification steps, the response format: these are starting points. Season to taste. Do you have testing requirements that it can help enforce? Linting tools it should integrate with?

The core principle holds regardless: require proof before reporting. A review with three proven bugs is worth more than a review with fifteen maybes.

Coding