[ field notes ]

The Layer Above the File

Seven small defects in a week. The agent caught none of them. None of them showed up in a diff.

Qais Ali / Senior AI Cyber Security Engineer / CEO, Ilm Works

context

Ten days after No. 06.

No. 05 closed by projecting that the next note would deal with the work the operator owns irreducibly — the product calls, the brand calls, the judgment about what to ship. No. 06 took a swing at that and framed it as application-level verification of what the agent is actually changing. I thought that was the answer at the time. It wasn't, or it wasn't the whole answer. This note is the part of the answer I didn't have ten days ago.

The week I'm writing about did not produce a single dramatic failure. Nothing in the league of No. 05's wrong-file disaster. The agent did good work, on the right files, against the right entry points, all week. Every defect in this note got caught before any user saw it. By any normal measure the week shipped.

What I want to write about is the part of the week that wasn't normal. Seven small things went wrong in a way I didn't fully see until item six. They weren't seven different kinds of mistakes. They were one kind of mistake showing up in seven places. The agent's evaluation surface ends at the unit it was asked to produce. Everything above that — whether the screen works, whether the badge means something, whether the order serves the first user through the door — is somewhere the agent does not look. I had been distributing those checks across what felt like separate concerns: code review, copy review, UX review, content review, domain review. They aren't separate. They're one layer, and that layer is everything outside the file.

the project

A consumer mobile app, twelve major features in two weeks.

The codebase is React Native, Expo, TypeScript, Zustand for state, internal i18n for English and Arabic. Standard stack. The work was a sequence of phased features — onboarding, matching logic, content rendering, settings, a new navigation structure. Twelve major features in two weeks. Each one spec'd out as a phase, implemented by the coding agent, verified on device by me before moving on.

The disciplines from earlier notes were holding. Live-file verification at session start. Backups out of the working tree. The agent's refusals treated as signals, not as noise. The lower-level supervision worked all week.

What this note is about is what shows up when the lower-level supervision works. There's still a class of defect that the lower disciplines don't touch, because the failure isn't at the level the lower disciplines are watching.

the seven

In the order they surfaced.

I want to walk through them in order, because by item three I started to see something, and by item seven there was no missing it.

One: the question that got rewritten.

I gave the agent a spec with nine items, locked verbatim, with explicit instructions not to rephrase or alter any of them. The feature shipped. Items one through three were the items I wrote. Item four was something the agent wrote. The wording was plausible — not worse than mine, not really better, just different. The diff didn't flag the substitution. The data file declared nine entries with the right shape. The fourth entry just had different content than the one I had specified.

The unit-level was perfect. The thing above the unit-level — did the agent ship the content I locked — was wrong, and the agent's loop had no way of catching it. Verbatim is not a property the agent can check against itself. The agent's instinct toward improvement and the agent's instinct toward rewording are the same instinct. If you say "don't reword," the agent will mostly not reword. But the small drift at item four is the kind of thing only the operator catches, item by item, after the fact, by reading the shipped file against the source spec.

I should have caught this in five minutes by diffing the shipped data file against the spec. I didn't do that for almost a day, because the feature looked correct on device.

Two: the options that rendered as letters.

Same screen. The four answer options for each item showed up on device as small circular badges — just "A," "B," "C," "D." No text next to them. Functionally unusable. Visually clean.

The data had the option text. The render component had a field reference. Somewhere between the two, the text was being dropped. The data file used the key text. The render expected something else. Mismatched key. Standard error.

What I want to flag is that this defect made it onto the device. The screen rendered. The build was green. The types matched, in the sense that the data was untyped at the boundary and the render had a fallback that displayed the letter badge regardless of whether the body string existed. Nothing in the toolchain noticed that the screen was unusable. I noticed it when I tried to take the quiz myself and realized I couldn't read the answers.

This is the gap between the code works and the screen works. The agent verifies the first. It is structurally unable to verify the second from where it sits. The only thing that catches the second is the operator opening the app on a real device and trying to use it as a user would.

Three: the badge that wasn't backed by anything.

There's a gold "VERIFIED" badge that renders on every entry in a content corpus. The badge has been in the visual design from early on. I had been treating it as part of the styling. This week, while working on something adjacent, I noticed that there was no field anywhere on the entry called verified. No review state. No approval workflow. No data backing the badge at all. It was a render-time constant. Every entry got it.

This had been in production-shape code for weeks. Every single entry in the corpus — including ones I hadn't personally vetted, including ones the agent had drafted from its training data without independent corroboration — was wearing the same gold mark of trustworthiness. The product was making a claim to users. The claim had nothing under it.

The spec for the feature, way back when it was first written, said something like "render a verified badge." The agent rendered a verified badge. The agent did exactly what was asked. The thing that didn't happen — and that I didn't ask for, and that the agent didn't propose — was the data shape and the gate that would make the badge mean something. A badge with no enforcement is a decoration with the word "verified" on it. I had been shipping a decoration and calling it a verification system.

The fix was an afternoon. The blast radius if it had shipped to users would have been weeks or months, and the credibility damage would have been the kind that doesn't get undone by a patch.

Four: the weak source under the gold badge.

Item three sharpened.

One entry in the corpus had its source field honestly labeled as weak — in the technical sense the domain uses for source criticism. The grade was correct in the data. The render didn't differentiate. The same gold "VERIFIED" badge that rendered on rigorously sourced entries also rendered on this one.

So the badge wasn't just unbacked in general. It was unbacked against a specific known counterexample sitting in the corpus. The data with the honest weak-source grade existed. The render was disconnected from it. Sourcing standards need to be enforced in render logic, not just declared in the data. Otherwise the data is doing honest accounting in a private notebook while the public-facing UI keeps saying everything is fine.

I fixed this by adding a real reviewStatus field, gating the badge on it, and removing the weakly-sourced entry from the corpus entirely because the brand's sourcing standard does not permit it. The fix took the same afternoon as item three. They were the same defect, really, just one with a sharper edge.

Five: the screen optimized for browsing, by a user who needed personalization.

I built a new top-level screen that aggregated several content browsing entry points along with a personalized matching flow — a nine-question quiz that produced a tailored result. The matching flow was the highest-value first-time-user action on the screen. The agent placed it fourth in the vertical stack, below three rows of browse options.

The ordering wasn't wrong. It was just optimized for the wrong job. A first-time user opening that screen doesn't need a library. They need to be guided into the library. The matching flow was the guiding mechanism. Burying it under three "Browse by X" rows pushed every new user into self-service mode for a feature designed to be personalized.

The thing I want to name is that I didn't see this until I tried to use the screen myself as a first-time user. The wireframe had looked fine. The spec had listed all five rows. The agent had implemented all five rows. The order was a default — the order the spec listed things in. Nothing in the agent's loop or in my supervisory loop had asked the question what does a new user need to see first. That question is one the operator has to bring. The agent will not bring it. The agent will produce feature parity — here is everything this screen can do — and call that done.

State-aware ordering is operator work. It means writing, in the spec, given the user has done X, show Y first; given the user has done Z, show W first. If you don't write that, you get the wireframe order. The wireframe order is rarely the right order.

Six: their life, his life, this life.

There's a result screen that displays a matched item to the user. The line of copy beneath the match read: "The traits you're cultivating are reflected in their life."

The matched item was one specific person. The agent had written "their" as a singular they — gender-neutral by default. The sentence parsed. The grammar was defensible. It read as plural when the subject was singular, which made it land slightly off without being technically wrong.

The fix was a single word. "Their life" → "this life." Done in thirty seconds.

I'm writing about this defect because it's the smallest one in the week and possibly the most important one to surface. There's no automated check that would catch this. There's no diff review that would flag it. The sentence is grammatically fine. It just isn't the sentence that should be there. Catching it requires reading the screen as a user reads it, asking does this land, and being willing to argue about one word.

The agent does not do this. The agent will produce copy that passes every grammar and accessibility check and still miss the right register by one word. The operator who reads the screen for tone, who notices the off-by-one, who is willing to argue for a single-word change — that operator is doing work no automation does, and the work is small per instance and large across a product.

Seven: the chain that conflated two things.

In a domain-specific feature, the agent replaced a weakly-sourced content item with a properly-sourced one. The new content was correct. But the metadata field that documented the chain of attribution — which sources the content came through — listed a primary text and a secondary supporting tradition as if they came through a single chain. They didn't. They were two different kinds of sources, with two different kinds of authority, and combining them into one chain was a category error.

To anyone reading the chain casually, the field looked authoritative. To anyone with the training to distinguish those two source types, it was the kind of error that would be embarrassing to ship in a domain-serious product.

This is the defect I want to write about most carefully, because it generalizes far beyond this project. The agent produced plausible-looking content in a specialized domain. The plausibility is precisely what gets the content past unit-level review. The plausibility is precisely what makes the defect dangerous. Domain literacy is the filter that catches this class of defect, and domain literacy is not something a coding agent has, a type system enforces, or a code reviewer surfaces.

Legal products, medical products, religious products, scientific products — anywhere the content has technical weight, the agent's plausibility is a hazard. Not because the agent will produce nonsense often. Because when the agent does produce nonsense, the nonsense will look correct, and the only thing that distinguishes it from correct content is someone in the loop who knows the actual subject.

what these have in common

The unit-level was correct. The thing above the unit-level was wrong.

I had been logging the seven items as seven different kinds of failure. By item five I started to feel a pattern. By item seven the pattern was unmissable. They aren't seven kinds. They're one kind, in seven places.

In every case, the agent did work inside a defined unit — a file, a component, a data shape, a render path — and the work was correct at that level. The code compiled. The types matched. The data validated. The render produced output. The agent verified its work against the specification it was given. No silent bridging. No invention. No drift in the No. 04 or No. 05 sense.

What went wrong was always one level above the unit. Whether the screen worked as a screen. Whether the badge meant what its label claimed. Whether the order served the user who would see it first. Whether the sentence landed on the ear. Whether the chain held together against domain literacy. None of these are unit-level properties. None of them can be verified by checking the unit against its spec, because the spec itself is operating at the unit level and the failure is happening above the spec.

This is the layer the operator owns. I had been splitting the ownership across what felt like separate concerns — five different kinds of review, five different mental categories. They aren't separate. They're one concern at one layer, and the layer is everything outside the file. The agent works inside the file. The operator works above the file. I had been doing both, but I had been doing the second one badly because I hadn't seen it as a single thing.

what failed

Five patterns. Each one a refinement of how the layer above operates.

The agent's evaluation surface is the unit it was asked to produce.

The agent will not, on its own, ask whether the unit it produced serves the user. The agent will ask whether the unit conforms to the spec. If the spec says "render a verified badge," the agent renders a verified badge. If the spec does not say "and ensure the badge is backed by data that justifies the claim," the agent does not add that requirement on its own. The agent has no native sense of the gap between the spec was satisfied and the user is served. Closing that gap is operator work. Every time.

A passing build is a unit-level signal.

This is obvious once said and easy to forget. Green builds and clean diffs tell you the unit-level constraints are met. They tell you nothing about whether the screen the unit produces is usable, whether the badge means anything, whether the ordering serves a first-time user. The temptation when working with an agent at speed is to treat passing-at-the-unit-level as confidence that the work is done. It isn't. Necessary, not sufficient. The sufficient verification is on-device, with operator's eyes, against the user the unit is for.

Render gates are policy, not styling.

A badge, label, icon, or any status indicator in a UI is a claim the product is making to the user. Every claim needs a backing field and a gate that enforces the field. If the field doesn't exist, the badge is a decoration with words on it. If the gate isn't in the render logic, the badge can drift from the data. When the agent adds any element that makes a claim, the operator's question is: what data backs this, and where is the gate. If either answer is unsatisfying, the element is a hazard. I am keeping this in my head as a rule now. I did not have it as a rule before this week.

Feature parity is not user state.

The default ordering an agent produces for a screen with multiple actions is here is everything you can do, in some plausible sequence. The default doesn't consider what the user has done, what they need next, or what would convert a first-time visitor into a returning one. State-aware ordering — different first action for a new user than for a returning user — is operator work. It requires the operator to articulate the user's journey on the user's behalf, then specify the rendering conditions that serve each step. The agent will execute the spec. The agent won't write it.

Plausibility is the danger in specialized domains.

In any technical domain, the agent can produce output that sounds correct to a non-specialist while being wrong in ways a specialist would spot in seconds. The plausibility carries the output past unit-level review. The plausibility puts it into the product. The only filter that catches this class of defect is a domain reviewer — someone whose literacy in the actual subject is independent of the coding tool. For domain-specific products, a domain reviewer isn't QA. They're part of the path from agent output to user-visible content. Their absence is a permanent risk surface, no matter how good the rest of the supervision is.

what this maps to in governance frameworks

Five lines.

The seven defects this week map onto governance frameworks I work with in vendor risk assessments at the day job, and the mapping sharpens the operator-layer formulation in ways the earlier notes hadn't surfaced.

Unit-level vs. system-level verification maps to SOC 2 CC7.2 (operations monitoring) and CC7.3 (anomaly handling). CC7.2 anticipates that the system produces logs and metrics. CC7.3 anticipates that someone interprets those signals against the system's intended purpose. The seven defects this week were all CC7.3 failures. Unit-level signals were green. System-level interpretation against user-facing purpose was absent. The vendor question I'd ask now: how does your AI-assisted development process verify that unit-level output serves the system-level user, not just the unit-level specification?
Claims without backing maps to NIST AI RMF MEASURE 2.3 (output validity) and EU AI Act Article 13 (accurate user-facing information). Both anticipate that AI-integrated systems will produce outputs that make claims. Neither names the specific case where a visual element makes a claim that isn't enforced by the data layer beneath it. Every user-facing claim is a regulated assertion. It needs a field, a gate, and an audit trail. Decorative claims are claims. They are just claims without enforcement.
State-aware UX maps to ISO 42001 8.3 (responsible AI design) and EU AI Act user-experience expectations. Both anticipate that AI-integrated systems consider the user. Neither anticipates that the coding agent itself will not infer user state unless the operator specifies it. The default agent output is feature-parity UX. State-aware UX requires the operator to specify the journey. The frameworks anticipate the obligation. The operator owns its discharge.
Plausibility-class defects map to NIST AI RMF MEASURE 2.7 (robustness in safety-critical contexts) and ISO 42001 6.1 (AI system risk assessment). Both anticipate that AI outputs in technical domains will sometimes be wrong. Neither names the specific failure mode where the output is plausible enough to pass unit-level review and only wrong on inspection by a domain specialist. The discipline: a domain reviewer is part of the production critical path. Not QA. Not optional. Part of the path.
Copy review maps to the user-experience and accessibility expectations under WCAG and Apple's Human Interface Guidelines. Neither is a governance framework in the same sense, but both are real obligations for shipped products. The operator-layer reading: tone, register, and pronouns are not in the agent's evaluation surface. Copy that passes grammar and accessibility can still land wrong by a single word. The operator reads the screen as a user reads it. That is work no tool does.

When I review a vendor's AI-assisted development practices going forward, the question I'll ask is: who, on your team, owns the layer above the file? What is their evaluation surface? How do they verify it? The answers will tell me more about the vendor's actual practice than any framework attestation does on its own.

a small framework

Six disciplines at the layer above the file.

These are continuations of the ones from No. 01 through No. 06, not replacements. The earlier notes were about unit-level disciplines — verify the live file, refuse to operate on backups, treat repeated refusals as a pattern. This one is about the layer above. Both layers are needed. Unit-level discipline without layer-above discipline produces well-engineered software that doesn't serve users. Layer-above discipline without unit-level discipline produces thoughtful product intentions implemented against the wrong files.

// disciplines at the layer above the file

On-device verification is not optional. Every phase ends with a screenshot or recording on a real device. If the operator has not seen the unit working in a user's hand, the unit has not been verified.
Every visual claim needs a backing field and a gate. When the agent adds a badge, label, icon, or any status indicator, the operator's questions are: what data backs this, and where is the gate that enforces it. If either answer is missing, the element is a hazard.
Locked content gets audited line by line. Any spec that says verbatim gets verified item by item after implementation. Spot-checking won't catch the one item the agent rewrote.
New screens get a user-state pass. Any new screen with more than three navigation options gets reviewed against what does this user need first, given what they've done. The agent produces feature parity by default. State-awareness is operator work.
Copy review is a session, not a check. Read the screen as a user reads it. Ask whether each sentence lands. Be willing to argue for a one-word change. Tone, register, and pronouns are not in the agent's evaluation surface.
Domain reviewers are part of the production critical path. For any specialized-domain product, a domain reviewer is not optional QA. They are the only mechanism that catches plausibility-class defects, and their absence is a permanent risk surface.

what didn't work

Three honest things.

I shipped the verified badge without ever asking what data backed it.

The badge had been in the visual spec from the start. I treated it as a styling decision and moved on. It was a policy decision. Every entry in the corpus, including the ones I hadn't personally vetted, was wearing the same gold mark of trustworthiness for weeks. The gap between the visual and the claim it made was long. It closed in an afternoon once I saw it. I didn't see it for weeks because I was looking at the unit level, not above it.

I treated the seven defects as seven separate things until item six.

The first four defects of the week, I logged and fixed and moved on. By item five something started to feel patterned. By six I knew. By seven I could name it. This is the same lesson No. 05 named about refusals — repeated similar-shaped events are a pattern, not a sequence of independent items. The discipline is to pause and look for the shape by the third instance, not to wait until the seventh forces the recognition. I should have stopped at four. I stopped at seven.

I let the agent set the ordering and trusted the result.

The buried matching-quiz row is the cleanest example of feature-parity-over-user-state in the week. I caught it because I happened to open the screen as a first-time user might. If I hadn't asked that question — what does a new user see first — the screen would have shipped with the most important action fourth in the stack. Every new user would have entered the corpus through self-service mode. The matching flow, which is the reason the screen exists, would have been bypassed by default.

I should have asked the question before the screen was implemented, not after. The fix afterwards is a reorder. The fix beforehand is a spec that names the journey. The second one costs ten minutes. The first one cost me a day of re-implementation across two states.

the shape of the next note

Where this is going.

No. 06 framed the operator's irreducible territory as application-level verification of what the agent is actually changing. This note refines that. The irreducible territory is everything above the file. The previous framing captured one slice — the verification slice. This note adds the rest: visual claims, user-state ordering, copy that lands, domain plausibility. They're all the same layer.

The next note will probably go back to the territory No. 05 first projected — the product decisions, brand calls, judgment calls about what to ship. That territory is also above the file. It's the layer above the layer this note documents. The series is building outward. Unit-level discipline first. Then the supervisory frame. Then the layer above the file. Then the layer above the product. Each layer requires the ones below it. None substitute for the ones above.

closing thought

The agent does careful work inside the file.

Everything outside the file is mine. That's what this week taught me, and it's what I had been distributing across seven separate concerns instead of seeing as one. The unit-level disciplines are necessary. The supervisory frame is necessary. The application-level verification is necessary. None of them substitute for the operator reading the screen as a user, asking what data backs each claim the product makes, and being willing to argue for a single word in a sentence the toolchain says is grammatically fine.

The frameworks anticipate the layer in principle. Living through seven defects in one week made it concrete. The discipline at the layer above the file isn't a different discipline from the ones in earlier notes. It's the same discipline applied at a higher level. The operator who has the lower-level disciplines and stops there is producing well-engineered software that doesn't serve users. The full discipline is the stack.

I'm keeping the formulation simple, because that's what the week earned. The agent verifies the unit against its specification. The operator verifies the specification against the user.

Everything else in this note is a corollary of that.

Qais Ali Senior AI Cyber Security Engineer
CEO, Ilm Works

Field Notes / No. 07 Filed May 19, 2026