```

[ field notes ]

The Cheap Comparison

The five-minute diagnostic step agentic coding tools won't run on their own.

Qais Ali / Senior AI Cyber Security Engineer / CEO, Ilm Works

context

A near-miss, named.

There is a particular failure mode in agentic coding that I keep watching unfold in slow motion. It is not the dramatic one. The agent doesn't delete a database, push a force commit to main, or invent an API that doesn't exist. The dramatic failures are easy to write about — the smoking gun is right there. The failure I want to describe in this note is quieter, more competent-looking, and far more likely to ship.

It goes like this. The agent encounters a bug. It produces a confident, plausible-sounding diagnosis that fits the symptoms. It proposes a structural fix that follows from the diagnosis. The fix is not unreasonable. The reasoning chain that led to it is internally consistent. If you nod along, the agent will execute, mark the work done, and move on. The fix will not address the actual cause.

This is not a continuation of the previous two field notes — it is a third specimen of the same animal. No. 01 named the patterns under sustained sprint pressure. No. 02 named what happens when you push an AI-built app through a real production pipeline. This one names the failure mode that lives one level deeper than completion-claim or destructive-action: the mistaken diagnosis that an agent's own evidence has already disproved, and that no part of the agent's loop will catch.

The thing the agent skipped, in the session I want to walk through, was the cheap comparison. The five-minute static check. The grep across the file. The side-by-side look at the two things that are supposedly different. It skipped it because skipping it felt efficient — there was a hypothesis on the table, and the natural next step was to act on the hypothesis. The cheap comparison felt like backtracking. It is not. It is the only step that distinguishes diagnosis from guessing.

what happened

A bug, a hypothesis, a refactor that almost shipped.

I'm building a quiz app. The quiz screen has two kinds of cards: a question card at the top, and four answer cards below it. Both are styled the same way — same border, same background, same rounded corners. On the iPhone, the question card's border is clearly visible. The answer cards' borders are nearly invisible. Multiple previous agent sessions had failed to fix it. Each time, the agent had tweaked color values — bumped a hex code, adjusted an opacity — and each time the borders had stayed ghostly.

I opened a fresh session and asked the agent to fix it.

Within a minute the agent produced this:

```

The real problem: the question card is a View. The answers
are Pressable elements with a style function (({ pressed }) => …).
On iOS, Pressable with a dynamic style function doesn't reliably
render backgroundColor and borderColor — this is a known iOS
rendering quirk.

```

(For non-RN readers: View is a basic container; Pressable is a tappable wrapper that lets you change styles based on press state. The agent's claim was that the tappable variant has a platform-specific rendering bug.)

The diagnosis sounds right. It identifies a real structural difference between the two elements. It invokes platform behavior. It explains why previous color tweaks failed (the structure was broken, not the colors). And it leads naturally to a fix: wrap the Pressable in a View, move the border styling to the wrapper, leave only the press-state behavior on the Pressable. Surgical. Justified. Ready to ship.

The agent began executing. It made the structural change to the first call site before I caught it.

what failed

Five patterns, named.

Plausible-quirk hallucination.

The "known iOS rendering quirk" was not actually known. Pressable with a style function is documented, common, and used in production by every React Native app I have ever seen. The agent had pattern-matched the symptoms (border doesn't render correctly on a Pressable) onto a vague platform-quirk archetype (iOS sometimes does weird things with dynamic styles), and stated the conjunction as established fact.

This is one of the most dangerous tics in agentic coding: confident appeals to platform behavior the model has never verified and that may not exist. The pattern-match is strong because there are real platform quirks in the training data, but the specific quirk being invoked is sometimes invented. Demand a citation, a reproducer, or instrumented evidence. If none can be produced, the quirk doesn't exist.

Refactor before diagnosis.

The agent's first action was structural — it began modifying the component tree before any evidence had been gathered to confirm the hypothesis. The structural change was reasonable given the diagnosis. The diagnosis had not been verified.

This is the inversion that has to be reversed every time: diagnose, then refactor. If a proposed fix involves moving code between components, introducing a wrapper, or swapping element types, the diagnosis that justifies the structural change must be falsifiable, and the falsification must have been attempted. Refactors based on hypothesized platform quirks should be treated as guesses until the quirk is reproduced in isolation.

When I redirected the agent to revert the structural change and run instrumentation first, it complied immediately. It did not propose this on its own.

Skipping the cheap comparison.

The cheap comparison would have killed the hypothesis in five minutes. Grep the file for every borderColor and borderWidth occurrence. Compare the literal values used on the question card and the answer cards. Same for backgroundColor. Same for the parent containers.

When I forced the agent to do this, it came back with a table:

Property	Question card	Answer card
borderWidth	1	1
borderColor	`rgba(255,255,255,0.07)`	`rgba(255,255,255,0.07)`
backgroundColor	`#102038`	`#102038`
Parent ScrollView bg	`#071326`	`#071326`
Element type	`<View>`	`<Pressable style={fn}>`

Byte-for-byte identical styling. Same border, same background, same parent. The only difference was the element type.

This step was free. The data already existed in the file. No device, no inspector, no instrumentation. The agent did not run it on its own. It ran it only when I made the request explicit, structured, and non-negotiable. And once it ran, the entire framing of the bug collapsed inside a single response — because buried in the agent's structural notes was this:

```

One structural detail worth noting: immediately below the
question card is a separate 1-px <View style={{ height:1,
backgroundColor: T.border, marginBottom: 14 }}>. This is a
divider View, not the card's own border — but visually it
sits flush against the card's bottom edge.

```

There was no rendering bug. The "firm border" I had been seeing on the question card was not the question card's border. It was a separate divider element underneath, rendered at full opacity in the design system's border color, sitting flush against the card. My eye had read the divider as the bottom of the card's border and inferred that the rest of the border must be similarly crisp. It wasn't. It had never been. The 7%-alpha border was simply too faint to read on that background, and that was true for every card in the app.

The fix, once the comparison surfaced the real cause, was a one-line change: bump the alpha from 0.07 to 0.18, in three places in the file. No structural changes. No wrapper Views. No platform-quirk theorizing.

The agent does not update on its own evidence.

This is the pattern that worries me most. The agent had, in its own response, produced the table that disproved the original hypothesis. It then continued reasoning as if the original hypothesis were still live — moving toward the structural refactor, framing the byte-for-byte identical values as compatible with the iOS-quirk theory rather than as the falsification of it.

It needed an external observer to point at the table and say look at what you just wrote. When I did, the agent immediately recognized the divider and produced the correct one-line fix. The information had been there. The synthesis had not happened on its own.

This is not a failure that better prompting fixes. It is not a failure that a more capable model fixes, at least not reliably. It is a failure of epistemic discipline under the pressure to act. The agent is rewarded for producing a coherent next move. The next move that says "my last response invalidated my own hypothesis, let me restart" is not the move the agent's loop is shaped to produce.

Done is not done until you have seen it.

Even after the correct diagnosis was reached and the alpha bump applied, I would not let the agent self-certify. I asked it to run the app on device and produce a screenshot of each affected screen — main quiz answers, QOTD answers, post-quiz results — before the work could be marked done. This is the same discipline No. 01 named under "completion claim failure mode," and No. 02 named under "agent self-reports of completion." It belongs in every session. The agent measures success by executed commands. The user measures success by visible outcome. These are not the same thing.

what this maps to in governance frameworks

Four lines, named.

Each of the failure modes above maps to a corresponding line in the AI governance frameworks I work with professionally. The mapping is not theoretical — it is what I look for in vendor assessments at the day job.

Plausible-quirk hallucination maps to the calibration and confabulation problem addressed at the system level by NIST AI RMF (MEASURE 2.3) and operationalized in MITRE ATLAS as the broader risk of model output that is fluent, confident, and untrue. The vendor question that surfaces here: what does your system do when it does not know?
Refactor before diagnosis maps to the change-management discipline assumed by SOC 2 CC8.1 and ISO 27001 A.8.32. Both frameworks assume changes are evaluated against a documented justification before being committed. Agentic tools that modify production code based on unverified hypotheses violate that assumption silently. The vendor question: what is the gate between a model's proposal and a write to source control?
Skipping the cheap comparison maps to the data-lineage and verification-rigor expectations under ISO 42001 (8.4) and the EU AI Act's general-purpose AI obligations around model evaluation. Both anticipate that AI systems will produce outputs requiring verification before action; neither anticipates that the verification will routinely fail to occur because the system did not feel the need to perform it.
Failure to update on own evidence maps to the broader AI evaluation rigor gap that every major framework gestures at and few deployments actually solve. It is the same gap that governance practitioners assess when they ask whether an AI system has internal consistency checks. In agentic coding, the answer is empirically: not reliably, and not without a human in the loop.

The frameworks anticipate these failure modes in theory. Living through them in practice sharpens the instinct for what to actually look for in vendor assessments. When I am reviewing a vendor's claims about their AI system's reliability, this session is now part of my mental reference set — specifically the moment the agent produced the falsifying table and continued past it.

a small framework

Six disciplines.

Each is a direct response to a specific failure I observed in this session. None of them are exotic. They are the standing rules I now apply to every agentic coding session I supervise.

// disciplines

Diagnose before you refactor. Structural changes require a falsifiable hypothesis that has been falsification-attempted. Refactors based on hypothesized platform quirks are guesses until proven otherwise.
Force the cheap comparison first. Before any structural change, the agent produces a side-by-side table of the elements involved. Same property, two columns, literal values. Identical values kill the structural hypothesis. Different values usually surface the actual bug.
Treat instrumentation as a precondition, not a debugging tool. A console log that confirms a function fires with expected values is not optional. No log, no diagnosis. No diagnosis, no refactor.
The inspector value beats the model's reasoning. When the rendered output, the native view tree, or the inspector says one thing and the model says another, the model is wrong. This sounds obvious and is routinely violated in practice.
Watch for "known platform quirks" that aren't. "iOS does this," "Android handles this differently," "React Native has issues with…" — these are high-frequency hallucination zones. The pattern-match is strong because real quirks exist; the specific quirk invoked may be invented. Demand a reproducer.
Done is verified by the human, with evidence. Self-certification is the failure mode that ships bugs. The agent never marks UI work complete. The human does, with a screenshot.

the role of the operator

What I was actually doing in the loop.

I keep returning to a question I think is under-discussed in the agentic-AI conversation: what is the operator actually doing in the loop?

The naive answer is "approving and rejecting the agent's proposals." That is not what I was doing in this session. What I was doing was enforcing a diagnostic discipline the agent would not have enforced on itself. I was the source of the cheap comparison. I was the source of the instrumentation requirement. I was the one who, when the agent reported "the style values are byte-for-byte identical between the two," recognized that this single sentence had falsified the entire framing of the bug — and I was the one who said stop, re-read your own table.

If I had been less attentive, in a hurry, or trusting of the agent's framing, the structural refactor would have proceeded — built on a hypothesis the agent's own evidence had already disproved. The bug would have come back in another form. The codebase would have been worse. And no part of the agent's loop would have flagged any of it.

This is not a coding-tools-only problem. It is the same pattern I see in vendor risk assessments, in incident response, in security audits: the plausible explanation that explains the symptoms, the structural fix that follows from the explanation, the failure to run the cheap comparison that would have falsified the whole frame. Humans do this too. We have been doing it for a long time. What is new is that we now have systems that do it at machine speed, with persuasive prose, and that will execute on the conclusions if we let them.

closing thought

The discipline is still the job. Now we know which discipline.

The previous two field notes ended on the line the discipline is the job. This one ends on the same line, with one specification.

The discipline is diagnostic. It is the discipline of comparing before changing, of instrumenting before asserting, of reading your own evidence before continuing past it. The cheap comparison costs nothing. The structural refactor costs days, sometimes weeks, sometimes a small amount of trust between the engineer and the agent that is hard to rebuild. The asymmetry is enormous. The discipline is to take the cheap step first, every time, even when — especially when — the agent is confident it is not necessary.

The borders weren't broken. The diagnosis was. And nothing in the agent's loop was going to surface that on its own.

Qais Ali Senior AI Cyber Security Engineer
CEO, Ilm Works

Field Notes / No. 03 Filed May 2, 2026

```