The Four-Reviewer Gauntlet: Reviewing AI Code Before It Ships

Here is a thing that will happen to you if you build software with AI long enough: you will ship a bug that a ten-minute code review would have caught. The build was green. The TypeScript was clean. The feature worked in your test. And somewhere in the released code was a security hole, or a broken navigation link, or a data collection with permissions set to public that should never have been public.

The AI didn't catch it because the AI wrote it. You didn't catch it because you were reading the code as the person who commissioned it, not as the person who has to defend it. Both of those are the same problem: the reviewer is too close to the work.

The fix I landed on is four reviewers. Not one. Four separate lenses, run in order, before anything ships. Here's why each one exists.

1. The grumpy reviewer who almost saved me from myself

Before I formalised this process, I was building an app backend — API routes, a data collection, a submission widget. tsc passed. wix build passed. I was ready to push.

On a hunch, I asked Claude to re-read the code as a hostile reviewer. Grumpy, by-the-book, impatient. The kind of reviewer who assumes it's broken until proven otherwise.

It found five things in five minutes that the build had not caught:

The widget never populated instanceId, so every submission would return a
The widget was calling a relative URL (/api/submit-lead) from inside a Wix site, so it would never reach the app's backend server.
The data collection had itemInsert: ANYONE, which meant any anonymous request could write records directly, defeating the entire point of having an API in front of it.
No API route had any authentication.
The mandatory pre-check for the component library had never been run.
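The relative-URL finding is the easiest one to picture in code. Here is a minimal sketch, assuming a made-up backend origin (APP_BACKEND_ORIGIN and backendUrl are hypothetical names; only the /api/submit-lead path comes from the app described above):

```typescript
// Hypothetical origin of the app's own backend. The real value depends on
// where the app is hosted; this one is an assumption for illustration.
const APP_BACKEND_ORIGIN = "https://my-app-backend.example.com";

// Resolve an API path against the app's backend rather than against
// whatever page happens to host the widget.
function backendUrl(path: string, origin: string = APP_BACKEND_ORIGIN): string {
  return new URL(path, origin).toString();
}

// A relative fetch("/api/submit-lead") resolves against the host page's origin.
// An absolute URL points at the backend no matter where the widget is embedded.
const submitLeadUrl = backendUrl("/api/submit-lead");
```

The build cannot flag the relative version because it is a perfectly valid string. It only fails once the widget runs on someone else's domain.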

None of those are compiler errors. All of them are ship-stoppers. The green build said the code was valid. The grumpy reviewer said it didn't work.

The rule: treat the grumpy review as a separate pass from writing. Re-read every file changed in the session as if someone else wrote it and you are looking for reasons to block the release. Look specifically for: does the happy path actually work end-to-end? Are data permissions what you intended? Does auth exist? This is not polish — it is a different job from building.

2. The security audit that found a cross-tenant data leak

A few days later, a different app. Same discipline — build passes, grumpy review passes. Then the security gate.

The security audit found a cross-tenant IDOR: one of the API routes was trusting an x-wix-instance-id header supplied by the client. In a multi-tenant app, that means any tenant could read or modify another tenant's data just by changing a header value. The grumpy review had passed because the code was internally consistent. The security review asked a different question: what can a malicious caller do with this?
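One way to close that kind of hole is to derive the tenant from something the server has verified, and never from a value the caller can type. The sketch below is an illustration under assumptions, not the app's actual code: TokenVerifier stands in for whatever signature check your platform provides, and the request shape is invented for the example.

```typescript
// Claims we trust only after the token's signature has been checked.
interface VerifiedClaims {
  instanceId: string;
}

// Hypothetical verifier: returns claims for a validly signed token, or null.
type TokenVerifier = (token: string) => VerifiedClaims | null;

// Simplified request shape, assumed for this example.
interface IncomingRequest {
  headers: Record<string, string | undefined>;
}

// Resolve the tenant from the signed token and ignore client-supplied IDs.
function resolveTenantId(req: IncomingRequest, verify: TokenVerifier): string {
  const token = req.headers["authorization"];
  if (!token) {
    throw new Error("401: missing token");
  }
  const claims = verify(token);
  if (!claims) {
    throw new Error("401: invalid token");
  }
  // Deliberately never read req.headers["x-wix-instance-id"] here.
  return claims.instanceId;
}
```

With this shape, changing the header does nothing: the only tenant a caller can act as is the one their signed token names.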

That same audit found that when the list of authorised users was empty, the auth check passed rather than failed — an open-by-default posture where the intent was closed-by-default. Two lines of logic, one character of difference, and the app would have been exploitable from day one.
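A closed-by-default check can be sketched in a few lines. This is an illustration of the posture, not the code from that app; isAuthorised and its parameters are names I made up for the example.

```typescript
// Closed by default: an empty allow-list authorises nobody.
function isAuthorised(userId: string, allowedUsers: readonly string[]): boolean {
  if (allowedUsers.length === 0) {
    // The open-by-default bug is returning true on this branch.
    return false;
  }
  return allowedUsers.some((allowed) => allowed === userId);
}
```

The explicit empty-list branch is redundant with the lookup below it, and that is on purpose. It puts the intended behaviour where a reviewer can see it.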

The rule: the security pass is not part of the grumpy review. It is its own gate, and it asks a different question: not "does this work?" but "what can someone do with this that I didn't intend?" Check every endpoint for authentication, every data collection for permission settings, every place a caller can supply an ID or change a parameter that affects whose data they see.

3. The resilience audit nobody thinks to run

By this point in the process, the code works and it's secure. The next thing that will kill it is reality.

The field-service app that passed both previous reviews still had a complete single point of failure — the whole thing died if one external auth service returned an error. It had non-atomic multi-step writes that silently corrupted data on partial failure. It had no pagination beyond 500 records, no handling for GPS permission denial, and no timeout on background jobs, which meant a stuck job would stay "running" forever with no way to recover.
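The stuck-job problem has a simple shape: something has to decide that "running" has gone on too long. A minimal sketch follows, assuming a job record with a start timestamp and a ten-minute limit. Both the Job shape and MAX_RUNTIME_MS are hypothetical, chosen for the example.

```typescript
type JobStatus = "running" | "succeeded" | "failed";

// Assumed job record; real schemas will differ.
interface Job {
  id: string;
  status: JobStatus;
  startedAt: number; // epoch milliseconds
}

// Hypothetical limit: a job running longer than this is treated as stuck.
const MAX_RUNTIME_MS = 10 * 60 * 1000;

// Return a copy of the jobs with every over-deadline "running" job marked failed.
function expireStuckJobs(
  jobs: readonly Job[],
  now: number,
  maxRuntimeMs: number = MAX_RUNTIME_MS
): Job[] {
  return jobs.map((job) =>
    job.status === "running" && now - job.startedAt > maxRuntimeMs
      ? { ...job, status: "failed" as const }
      : job
  );
}
```

Run on a schedule, a sweep like this turns "running forever" into a failed state that can be retried or surfaced to the user.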

None of those were code quality issues. None were security issues. They were resilience failures: the app was not built for the conditions real users produce.

The rule: before shipping, enumerate the failure modes for every critical path. Network down. API quota hit. Third-party service returns 503. User denies location permission. Dataset grows by 10x. What does the user see? What state is the data in? If the answer is "crash" or "silent corruption," fix it first.

4. The cold reader who reads like the App Market does

The fourth lens came later, after a CreatorHub release cycle. After grumpy, security, and resilience had all passed, I added one more read: imagine a senior platform engineer opening the repo for the first time. No context. No prior conversation. Cold.

Can they understand what each hook does and why? Are the non-obvious choices explained — the cross-tenant cache isolation, the two-watermark eviction logic, the user-gesture requirement for audio? Or is the code clever in ways that are only obvious to the person who wrote it?

That read surfaced a hook with inverted control flow — a setter being injected instead of data being returned, which was confusing enough to look like a bug to anyone reading it fresh. It found a JWT cache key that didn't include tenant ID, which meant two tenants could share a cached token. It found effect race conditions on tab selection that the previous three lenses had missed entirely.
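The cache-key finding is worth a sketch because the fix is so small. The code below is an assumed, simplified cache and not the real implementation; tokenCacheKey, cacheToken and getCachedToken are hypothetical names.

```typescript
// In-memory token cache, keyed by tenant and user together.
const tokenCache: Record<string, string> = {};

// Build a key scoped to one tenant. JSON-encoding the pair avoids
// collisions that a plain separator could allow.
function tokenCacheKey(tenantId: string, userId: string): string {
  return JSON.stringify([tenantId, userId]);
}

function cacheToken(tenantId: string, userId: string, token: string): void {
  tokenCache[tokenCacheKey(tenantId, userId)] = token;
}

function getCachedToken(tenantId: string, userId: string): string | undefined {
  return tokenCache[tokenCacheKey(tenantId, userId)];
}
```

A key built from the user ID alone would hand tenant B whatever token tenant A cached for the same user ID. With the tenant in the key, that lookup misses.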

This matters for a specific reason: the Wix App Market review team reads the code cold. If something looks wrong to a senior engineer on a first read, it is going to look wrong to the reviewer, and it may be the thing that comes back in a rejection.

The rule: do a fourth pass as a stranger. Flag anything that would require asking the author to explain. If it needs a question, it needs a comment — or it needs to be rewritten.

The pattern underneath

Four reviews sounds slow. It isn't, in practice — each pass is 15-20 minutes when the session's diff is a few files. The alternative is an App Market rejection cycle, a production security incident, or a data integrity bug you find out about from a user.

The four lenses are additive, not overlapping:

Grumpy: does the happy path actually work?
Security: what can a malicious caller do?
Resilience: what happens when reality doesn't cooperate?
Cold reader: would someone smart, reading this fresh, understand it?

The order matters. Grumpy first, because there's no point auditing security in code that doesn't work. Security second, because a resilient app that leaks data is not a better outcome. Resilience third, to stress-test what's now confirmed to work correctly. Cold reader last, as the final gate before the outside world sees it.

Claude is a very capable builder. It is also, like every builder, worse at reviewing its own work than reviewing someone else's. The gauntlet exists because "I wrote it" is not a review.
