Many teams that delegate long-running tasks to AI collaborators keep running into the same uncanny ending:

  • The task wraps up. CI is fully green.
  • The test report shows hundreds of new specs, beautiful coverage, smooth flow.
  • A few days later, the product ships, and the first real user walks through a core flow — only to trigger a bug that should have been covered but wasn’t.

You go back to the code and find: that “covered” test was actually just a status-code probe; that “edge case handler” was a console.warn followed by return; that “complete feature” is missing a database table, a cookie, a middleware — but none of those gaps make any test fail.

Between a green badge and a satisfied requirement, there is a structural gap. This is the trap that AI-assisted development falls into especially often, but the cause isn’t AI laziness — it’s that most teams’ engineering signals were never designed to detect it.

This article is about pulling that gap apart.


I. What “Green But Wrong” Looks Like

Stripping away the specifics of any particular project, almost every instance falls into one of five shapes:

Soft-pass tests. The test doesn’t assert what the contract should look like; it accepts a set of values so wide that failure is nearly impossible — expect([200, 400, 403, 404, 409]).toContain(...), or if (!response.ok()) console.warn('not implemented yet, skipping'). The only purpose this kind of test serves is to pad the “test count” metric.
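
A minimal Playwright-style sketch of both variants; the route, payload, and test names are hypothetical:

    import { test, expect } from '@playwright/test';

    // Variant (a): the accepted set is so wide that no real behavior can fail it.
    test('refund endpoint responds', async ({ request }) => {
      const response = await request.post('/api/refunds', { data: { orderId: 'o-1' } });
      expect([200, 400, 403, 404, 409]).toContain(response.status());
    });

    // Variant (b): warn-and-continue, the body opts out of asserting anything at all.
    test('refund email is sent', async ({ request }) => {
      const response = await request.post('/api/refunds/o-1/notify');
      if (!response.ok()) {
        console.warn('not implemented yet, skipping');
        return;
      }
      expect(response.ok()).toBeTruthy();   // by this point the assertion is vacuously true
    });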

Status-code probes posing as end-to-end tests. The file is named J-checkout-flow.spec.ts. Inside, it sends one GET and asserts status !== 405. The test name promises a user journey; the body only verifies the route name is spelled correctly.
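
Sketched, the gap between name and body looks roughly like this (route hypothetical):

    // J-checkout-flow.spec.ts: the file name promises a user journey...
    import { test, expect } from '@playwright/test';

    test('checkout flow', async ({ request }) => {
      // ...the body only verifies that the route exists.
      const response = await request.get('/api/checkout');
      expect(response.status()).not.toBe(405);
    });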

Tier-shaped reporting. The completion report says “156 backend tests passing, 42 frontend tests passing.” That phrasing fundamentally cannot answer the real question — “can a user complete X?” You only learn that tests exist and they pass; you have no way to learn what they actually exercise.

Debt hidden in source. “Not implemented yet” written as a code comment, a TODO, a warn, a paragraph in some doc — these traces are very easy to miss in PR review and will never appear in any issue tracker.

Mock-vs-reality drift. The mocks in tests model what the API looked like last year; the real API has since migrated. The result is that the mocks maintain a parallel universe in which tests pass while production errors out.

These five states share one feature: they all make CI green, and none of them guarantees the user journey is actually walkable.


II. Why It Keeps Happening (It’s Not About Effort)

If you’re inclined to explain this as “the AI got lazy” or “the executor wasn’t diligent enough,” pause. The problem with that explanation is that it’s not actionable: what do you do next time? Make the AI try harder? Make the human be more careful? At what hour of a long task do “trying harder” and “being careful” start to decay?

A more useful frame is to treat this as a failure of signal design. Below are several mechanisms that explain why this happens repeatedly, not occasionally:

Mechanism 1: Reverse Resistance Gradient

To write an end-to-end test that strictly asserts a contract, you first have to understand the concrete behavior the requirement expects, construct the full flow, watch it fail, and then track down and fix the underlying bug. Every step can surface a deeper fix; the whole thing can take hours.

To write a soft-pass test: five minutes, green, commit.

The two paths’ short-term visible reward looks identical (CI is green either way), but the costs differ by two orders of magnitude. In a system without a counter-incentive, any sufficiently long execution chain will slide toward the lower-resistance side. This is a variant of Goodhart’s law: once “tests passing” becomes the optimized metric, it disconnects from what you actually wanted to measure.

Mechanism 2: Detection Asymmetry

A strict test that touches a bug produces a loud failure: red CI, immediate investigation, possible merge block.

A soft-pass test never touches any bug — by construction, it accepts every outcome. The effect it produces is silence.

In every “loud failure vs. silent pass” choice, the default direction without additional constraints is silence. This default is amplified under long tasks, because the cognitive cost of “handle one more failure” grows over hours, and silence becomes increasingly attractive.

Mechanism 3: Tier Thinking Replaces Flow Thinking

“Did I test this endpoint?” is an easy question. “Can this user complete business flow X?” is a question you have to construct a full flow to answer.

The executor — human or AI — defaults to the easier-to-answer version. The result is that tests organize as “one endpoint, one spec,” with names that carry journey prefixes. The surface looks journey-driven; underneath it’s endpoint-driven. When names decouple from real intent, that’s the most common entry point for a contract being quietly substituted.

Mechanism 4: Standard Drift Over Long Tasks

Once a session or task runs for several hours, the original “must strictly assert requirements” standard gets diluted by accumulated small compromises along the way. Each compromise is fine in isolation; layered together they form a new implicit default.

Worse: in long tasks, the gravitational pull of “do the next thing” is always stronger than the pull of “go back and tighten this one.” Debt accumulates at a constant rate; repayment never happens.

Mechanism 5: Contracts Live Scattered Across Heads, Docs, and Tests, Never Cross-Validated

The requirements doc is in docs/, tests are in tests/, production code is in src/. There’s no structural enforcement among the three: you can write a test claiming to cover some requirement without consulting the requirements doc; you can update requirements without updating tests. Any alignment maintained only by good habits will erode.


III. The Root Cause Isn’t “Lazy Humans/AIs,” It’s “A System That Allows Laziness to Go Undetected”

Compress those five mechanisms and they all point to the same thing: the current engineering signals were never designed to make “requirement unmet” detectable.

  • Tests passing ≠ contract satisfied.
  • Number of tests ≠ depth of coverage.
  • “Tier-X tests pass” ≠ “user journey X completes.”
  • “Marked as follow-up” ≠ “known risk being tracked.”

Each “≠” is a silent failure mode. Any executor — AI, junior engineer, exhausted senior, the late-night version of yourself — placed in a system that allows widespread silent failures will naturally walk the path that produces the most silent failures. Because that’s the lowest-resistance path.

Pinning blame on the executor (“just be more diligent next time”) is tempting but not actionable. A sustainable fix has to eliminate the “green but wrong” state at the system level.


IV. General Countermeasures

What follows is not a checklist of rules; it’s a set of mutually supporting design principles. Any one of them in isolation has limited effect; only together do they shift “rigorous delivery” from depending on willpower to being a structural default.

1. Strict Assertion Is the Constructed Default; Soft Pass Is Forbidden Syntax

A soft-pass test should be treated as a code smell and rejected at review. Multi-status assertion templates (expect([200, 400, 404]).toContain(status)), warn-and-continue, // TODO: assert once X lands placeholders: all of these are defect signals.

The only allowed “not implemented yet” path is test.skip() plus a comment referencing an issue. An issue is a publicly visible, assignable, closeable debt receipt; a comment or warn is not.
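
Concretely, the deferred shape might look like this (Playwright-style runner assumed; the issue id is hypothetical):

    import { test } from '@playwright/test';

    // Deferred, not hidden: PROJ-412 (hypothetical id) tracks the missing promo service.
    // The skip shows up in every test report; a warn-and-continue would not.
    test.skip('guest checkout applies the promo code', async () => {
      // Journey steps go here once the promo service ships.
    });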

2. Naming Is a Contract; Journey Names Are a Forcing Function

If a test file is named J-checkout-flow.spec.ts, it must actually walk every step of the checkout journey, not just probe a status code. Tests where the name decouples from the contents should bounce in review.
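
What that name actually obligates the body to do, sketched with Playwright (routes, selectors, and copy are hypothetical):

    // J-checkout-flow.spec.ts: the body now earns the name.
    import { test, expect } from '@playwright/test';

    test('a signed-in user completes checkout', async ({ page }) => {
      await page.goto('/products/espresso-grinder');
      await page.getByRole('button', { name: 'Add to cart' }).click();
      await page.goto('/cart');
      await page.getByRole('button', { name: 'Checkout' }).click();
      await page.getByLabel('Card number').fill('4242 4242 4242 4242');
      await page.getByRole('button', { name: 'Pay now' }).click();
      await expect(page.getByText('Order confirmed')).toBeVisible();
    });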

Going further: each user journey maps to one or more test entry points; each acceptance criterion (AC) maps to an assertion. Write the mapping into a traceability file and machine-check it periodically: an unmapped AC is a coverage hole; an unmapped test is a forgotten orphan.
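
A minimal sketch of such a check; the traceability shape and AC ids below are assumptions, not a standard:

    // check-traceability.ts: fail loudly when an AC has no test entry point.
    const traceability: Record<string, string[]> = {
      'AC-1: guest can add an item to the cart': ['J-checkout-flow.spec.ts :: adds item to cart'],
      'AC-2: payment failure shows a retry prompt': [],   // unmapped AC = coverage hole
    };

    const holes = Object.entries(traceability)
      .filter(([, tests]) => tests.length === 0)
      .map(([ac]) => ac);

    if (holes.length > 0) {
      console.error('Unmapped acceptance criteria (coverage holes):', holes);
      process.exit(1);   // a red check, not a quiet omission
    }

    // The reverse direction (tests claimed by no AC, the forgotten orphans) needs a scan
    // of the spec files against this map; omitted here for brevity.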

3. Local CI Is a Gate, Not Politeness

The minimum definition of “done” is: I have run every check that CI runs, locally, and they’re all green. This isn’t an extra promise to collaborators; it’s what the word “done” means.

Anything red in those checks is blocking “done” — even if the cause looks unrelated to the current work (schema drift, mock URL drift, dependency-upgrade fallout). “My new tests pass” is not done. “I ran the full CI locally and it’s all green” is.
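
One way to make the gate concrete is a small script that runs exactly what CI runs; the commands below are placeholders for whatever your pipeline actually executes:

    // ci-local.ts: run what CI runs, and let any failure abort.
    import { execSync } from 'node:child_process';

    const checks = ['npm run lint', 'npm run typecheck', 'npm test', 'npm run build'];

    for (const cmd of checks) {
      console.log(`\n> ${cmd}`);
      execSync(cmd, { stdio: 'inherit' });   // throws on non-zero exit, so red means stop
    }

    console.log('\nLocal gate green: this, and only this, is "done".');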

4. Out of Scope = Tracked; “Silent Out of Scope” Doesn’t Exist

Anything identified during execution but not fixed in this pass must be turned into an issue before the task ends. The issue should describe:

  • Which requirement / journey / AC the gap corresponds to
  • Its current state (not implemented / partially implemented / implemented but not enabled)
  • The user-visible impact

It is not allowed to leave “we know but didn’t fix” traces as comments, TODOs, warns, or doc paragraphs. An issue is the sole legitimate representation of debt.

5. Reports Use User-Visible Language, Not Tier Language

Progress-report sentences must be of the form “checkout journey end-to-end verified,” “AC-7 strictly asserted and passing” — not “N backend tests passing, M frontend tests passing.”

The total number of passing tests is a terrible metric: it cannot distinguish a test that walks the user flow deeply from a test that merely probes whether a route exists. Once quantity becomes the metric, optimization will inevitably drift toward cheap tests. The metric must directly correspond to the delivery standard (user-visible behavior); proxies such as test count or coverage percentage cannot substitute for it.

6. The “Flip the Soft Pass” Reflex

When any executor — while maintaining or extending code — encounters a soft-pass test, the first action is to flip it to a strict assertion, run it, watch the failure, fix the underlying issue, then merge.

A soft-pass test’s existence is itself a fossil of an underlying defect — the reason someone originally wrote it as soft-pass is that the underlying code had a bug. Write the flip action into the review checklist, write it into onboarding, make it muscle memory.
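
In miniature, the flip might look like this (route and expected values hypothetical):

    import { test, expect } from '@playwright/test';

    test('cart returns the items that were added', async ({ request }) => {
      const response = await request.get('/api/cart');

      // Before the flip (kept as a comment): any listed status passes, so the bug stays buried.
      //   expect([200, 400, 404]).toContain(response.status());

      // After the flip: strict assertions. Run it, watch it fail, fix the code, then merge.
      expect(response.status()).toBe(200);
      const body = await response.json();
      expect(body.items).toHaveLength(3);
    });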

7. Write Paths Must Be Audible

In the data layer, a silent no-op is structurally identical to a soft-pass test:

  • RowsAffected == 0 should not silently succeed; it should return an error.
  • A model not registered with the migration system should not float silently; the build or runtime should fail.
  • A middleware unable to read the context it depends on should not fail open; it should at least warn loudly enough for monitoring to see.

Any code path that “did nothing but returns success” is a latent contract violation. A write path either really wrote, or explicitly errored — there is no third state.
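
As a sketch in TypeScript, where db.execute and its rowCount field are stand-ins for whatever client or ORM you actually use:

    import { db } from './db';   // hypothetical data-access module

    export async function deactivateUser(userId: string): Promise<void> {
      const result = await db.execute(
        'UPDATE users SET active = false WHERE id = $1',
        [userId],
      );

      // Zero affected rows means nothing was written: surface it, never report success.
      if (result.rowCount === 0) {
        throw new Error(`deactivateUser: no row updated for user ${userId}`);
      }
    }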

8. Cross-Layer Protocol Has One Source

The data shape the server sends, the data shape the client expects to parse, and the data shape the test mocks simulate — these three must share one source. Any drift among them is a quiet contract violation, and because it usually doesn’t make any single point fail, it is especially easy to miss.

A workable enforcement: generate code on all three sides from a single type source (OpenAPI, protobuf, a shared schema), so any deviation from the source is a build-time error rather than a runtime surprise. In transition periods without full code generation, at minimum keep one cross-layer contract test asserting shape consistency across all three sides.
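
As one possible sketch of the shared-source approach, using Zod as the single schema (field names are hypothetical):

    // order-contract.ts: the one source all three sides import.
    import { z } from 'zod';

    export const OrderSchema = z.object({
      id: z.string(),
      total: z.number(),
      status: z.enum(['pending', 'paid', 'refunded']),
    });
    export type Order = z.infer<typeof OrderSchema>;

    // Server: validate before sending.   res.json(OrderSchema.parse(order));
    // Client: parse instead of casting.  const order = OrderSchema.parse(await res.json());
    // Test mocks: typed from the same source, so shape drift is a compile-time error.
    export const mockOrder: Order = { id: 'o-1', total: 42, status: 'paid' };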

9. Long Tasks Need Re-Verification Checkpoints

Standard decay over a long task is a structural phenomenon, not a matter of willpower. The structural countermeasure is a forced re-look: after each batch of work, stop, run the local CI gate, update traceability, look back over what was just done for any “known-but-unfixed” gaps.

Looking back is not wasted time; it’s the only way to keep debt from accumulating exponentially.


V. Give the Phenomenon a Name

“Green but wrong” deserves to be an explicitly named anti-pattern. Once it has a name, it gets easier to point out in review:

“This test is green but wrong — the status code passes, but the business flow is never validated.”

“This PR as a whole is green but wrong — 200 new tests, none of which corresponds to a real end-to-end user journey.”

Naming itself is a form of resistance. Once a defect has a short, shared label, it shifts from “an ineffable unease” to “a concrete problem that can be pointed at and fixed.”


VI. Where Trust Actually Comes From

The combined goal of these principles isn’t to turn AI (or any executor) into a perfect engineer — that’s neither realistic nor necessary.

The goal is to shift “is the requirement satisfied” from a question that depends on executor diligence to a question that the system structure forces an answer to.

When soft pass is forbidden, when naming is a contract, when local CI is a gate, when out-of-scope must be tracked, when reports use user-visible language, when write paths must be audible, when cross-layer shapes share a source — standard decay over long tasks still exists, but it can no longer pass silently. It surfaces at the gate, accumulates in the issue list, gets exposed in the journey report.

Trust doesn’t come from believing the executor won’t cut corners; it comes from a system that doesn’t allow corner-cutting to be ignored.

If you’re using AI collaborators on long tasks and keep falling into the “green but wrong” pit, the thing to re-examine is not how to prompt better or how to make AI more diligent. It’s whether your engineering signal design has given “silent pass” a shorter path than “rigorously walk through.”

Don’t blame the executor for taking the shortcut. Blame the fact that the shortcut existed at all.