[ field notes ]

The File That Wasn't Running

A week of paywall work, edited into a backup file the production app never imports.

Qais Ali / Senior AI Cyber Security Engineer / CEO, Ilm Works

context

Filed the same day as No. 04. Twelve hours later.

I filed Field Notes No. 04 — The Paste and the Actual Code — earlier today. Its central observation was that the agent had begun catching prompt-code mismatches I would not have caught on my own, and that this represented a meaningful shift in how the supervisory loop was working. I stand by the observation as it was made. It described what was visible to me at the time of writing.

What I did not see at the time of writing is that the agent had been catching prompt-code mismatches inside the wrong file for an entire week. The file I had been editing was a backup. The file the production app actually imports was different. Every careful diff, every rewarded refusal, every "good catch — apply the new value" — all of it had been happening in a file that wasn't running.

I want to be careful with the framing here. No. 04 is not retracted. It was honest about what the operator could see, and the operator's vantage was incomplete. This note documents what came after — the deeper finding the same evening's diagnostic work eventually surfaced. It is filed separately so the lineage stays clear: each note in this series records what the operator understood at the moment of filing. Sometimes the next note reveals the previous note was incomplete. That is the structure. That is how the work actually unfolds.

This entry is about a class of failure that lives one level deeper than No. 03's "the cheap comparison" and one level deeper than No. 04's "the paste and the actual code." It is the failure where you and the agent agree on what to do, agree on how to do it, agree the work is done — and the work has been happening on the wrong file the whole time.

the project context

A monolithic file, a sprint, and four backup copies.

The codebase is the same as in the previous note: ʿIlm Arena, a React Native monolith with a single 17,000-line index.tsx. The week's goal was the v2.0 paywall sprint — six surfaces, four configuration domains, a RevenueCat integration, App Store submission. By Friday evening the code work was, as far as I could tell, complete. The paywall had been redesigned. The benefits had been updated. The Privacy and Terms links had been added. The Arabic localization had been verified. The agent had been a reliable second pair of hands all week.

The work that triggered this note was Stage F — verification logging. With the RevenueCat dashboard configured and the API key in .env, the next step was to add three diagnostic console.log lines to confirm offerings loaded correctly on a real device. Two of the three logs landed cleanly. The third did not.

The agent's response on the third log was this:

Log #2 (offerings) skipped — Purchases.getOfferings() is not
called anywhere in the app yet. Adding it would be a
functional change. If you want those logs, you'll need to
first decide where to call getOfferings() (e.g. in the paywall
flow), then I can add the diagnostic logs around it.

This is the agent doing exactly what No. 04 said it had learned to do — refusing to silently bridge a gap between the prompt and the code. The prompt asked for a log around Purchases.getOfferings(). The code did not call Purchases.getOfferings() anywhere. The agent flagged the discrepancy and waited.

This was the third time in two days the agent had done something like this. The previous two refusals had each turned out to expose small mistakes in my prompts, which we corrected and moved on. This refusal was different. The thing the agent was flagging was not a small mistake. It was the surface of a much larger one.

the four diagnostics

Each one deeper than the last.

Diagnostic 1: the entitlement identifier mismatch.

I started where the original problem seemed to be. The RevenueCat dashboard had been showing an entitlement with the identifier 'Ilm Arena Premium'. My code, I assumed, used the conventional lowercase 'premium'. I asked the agent to find every place the entitlement identifier appeared in the codebase and report back.

The report came back with seven references. Five of them, in mobile/src/app/index.tsx, used 'Ilm Arena Premium' via a constant called RC_ENTITLEMENT. Two of them, in a separate file mobile/utils/usePremium.ts, used 'premium' via its own local constant. The agent flagged the inconsistency and noted that usePremium.ts, if used anywhere, would always return false for real subscribers — its identifier would never match what RevenueCat actually issued.

This was already worse than the original framing. The bug was not "code says X, dashboard says Y." It was "two files in the code say two different things, and only one of them matches the dashboard." Standard configuration drift, but layered.

I nearly made the cheap-comparison mistake immediately. I almost decided to fix the dashboard to match the code without first verifying which code was the live code. The agent's report listed two files with two different identifiers and I had no idea which one was authoritative. Fixing the wrong file would have shipped a working bug.

Diagnostic 2: is the dead-looking file actually dead?

Before changing anything, I asked the agent whether usePremium.ts was actually used. If the file was orphaned, the inconsistency was harmless and could be cleaned up at leisure. If the file was live, the inconsistency was a bug masking another bug.

The report:

Importer 1:
  File: mobile/src/app/index.tsx
  Line: 39
  Import: `import { usePremium } from '../../utils/usePremium';`
  Call site: line 10995
    const hookPremium = usePremium();
    const effectivePremium =
      !freeModeActive &&
      (hookPremium || bypassActive || (premium && !loadingPremium));

The file was alive. It was imported, it was called, and its return value fed directly into effectivePremium — the master flag that gates every Premium feature in the app.

But the bug was masked. effectivePremium uses an OR of three sources. hookPremium always returned false for real subscribers because of the identifier mismatch. premium, set by a separately-functional code path called syncPremiumFromRC, returned true for real subscribers. The OR meant the broken hook was harmless dead weight — it never independently confirmed any subscriber, but it never blocked one either.

Two new lessons here, both worth naming. Redundant code is not dead code. It still runs, it still produces values, those values still flow through your logic — they just do not change outcomes when other paths agree. And: the OR-rescue pattern is dangerous. A bug whose visible effects are masked by a parallel working path will look fine in production for as long as the working path keeps working. The day the working path breaks, two failures cascade at once.

I made my decision: standardize everything to the lowercase RevenueCat-convention identifier 'premium'. One line change in index.tsx, no change needed to usePremium.ts, then update the dashboard. The agent applied the change cleanly. The cheap comparison had once again killed the original framing of the problem and replaced it with a real fix.
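The two-identifier drift from Diagnostic 1 is the kind of thing a single shared constant is meant to prevent. What follows is a minimal sketch, not the project's code: it assumes a hypothetical shared module that both index.tsx and usePremium.ts would import, and CustomerInfoLike is a local stand-in for the entitlements.active map that RevenueCat's customer info is assumed to expose, keyed by entitlement identifier.

```typescript
// Hypothetical shared constant. In a real project this would live in one
// module imported by both index.tsx and usePremium.ts (location assumed).
const RC_ENTITLEMENT = 'premium';

// Local stand-in for the slice of customer info this check reads.
// It is not the SDK's type, only the shape assumed for this sketch.
interface CustomerInfoLike {
  entitlements: { active: Record<string, unknown> };
}

// True only when the shared identifier is among the active entitlements.
function hasPremium(info: CustomerInfoLike): boolean {
  return Object.prototype.hasOwnProperty.call(
    info.entitlements.active,
    RC_ENTITLEMENT,
  );
}
```

With one constant, a dashboard rename breaks every check at once, which is loud, instead of breaking one check quietly behind an OR.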

Diagnostic 3: where does the paywall actually get its prices?

This was the diagnostic that was supposed to be unnecessary, and the one that broke the case open. The agent had refused to add the getOfferings log because the function was never called. Before deciding whether to add a call to it, I needed to understand what the paywall was using instead. The PaywallModal component took an rcOfferings prop. Something had to be supplying that prop.

I asked the agent to trace the data flow. Where was rcOfferings declared, where was it set, and how did it reach the paywall.

The report was structured cleanly. Each section ended with a finding I had not expected:

Section 1 — PaywallModal rcOfferings prop declaration:
  Not found in any active source file. PaywallModal as a
  component does not exist. There is no rcOfferings prop
  anywhere in the live codebase.

Section 2 — PaywallModal callsites:
  Zero callsites. The 6 previously referenced callsites do
  not exist in the current index.tsx. The paywall is invoked
  via RevenueCatUI.presentPaywall() at line 9915 — the SDK
  handles its own UI entirely.

Section 3 — Source of rcOfferings value:
  Does not exist in the active file. Only declaration found
  is in backup files: index.tsx.bak line 8732.

Section 4 — getOfferings / availablePackages references:
  0 matches in active source. All matches in .bak files only.

I want to describe what reading this felt like, because the emotional shape matters for the lessons. The first section read as a typo. The second section read as a confused agent. By the third section I understood what the report was telling me. The custom paywall I had been editing all week — the one with adaptive Hard and Soft variants, four benefit rows, audio mechanics copy, Privacy and Terms links, Arabic localization — did not exist in the file the production app imports. It existed only in a series of .bak files: index.tsx.bak, index.tsx.bak2, index.tsx.bak3, all sixteen-thousand-five-hundred-twenty-three lines long, all snapshots from May 7. The active index.tsx, eight hundred and sixty-eight lines longer, contained no PaywallModal at all.

The live paywall, in production, was a single nine-line function:

async function openPaywall(source: PaywallSource) {
  console.log('[Paywall] presented source:', source);
  setTimerOn(false);
  const result = await RevenueCatUI.presentPaywall();
  if (result === PAYWALL_RESULT.PURCHASED || result === PAYWALL_RESULT.RESTORED) {
    await syncPremiumFromRC();
  }
  setTimerOn(true);
}

RevenueCat's SDK has a built-in native paywall component. RevenueCatUI.presentPaywall() fetches its own offerings, renders its own UI, handles its own purchase flow. None of the prop-passing, none of the offerings state, none of the custom layout I had spent the week refining was reachable from anywhere in the running app. Every Vibe Code prompt that "successfully applied" a change to PaywallModal had been applied to a file that index.tsx never imports.

Diagnostic 4: when did this happen and what is actually installed.

The fourth diagnostic was the confirmation pass. I needed to know whether this was a tonight problem, a this-week problem, or a months-old problem. I needed to know what RevenueCat packages were actually installed. I needed to know whether the .bak files were recoverable as a unit or were already orphans.

The findings:

Question	Finding
Backup files present	`index.tsx.bak`, `.bak2`, `.bak3`, `.bak-normscan` — four copies, all 16,523 lines
Active file size	`index.tsx` — 17,391 lines, 868 lines longer than the backups
Both packages installed?	Yes. `react-native-purchases` AND `react-native-purchases-ui`
Active imports	`Purchases` and `RevenueCatUI` both imported (lines 33–34)
Custom paywall helpers retained?	Yes — `purchasePackage`, `syncPremiumFromRC`, `checkYearlyTrialEligibility` all present
Live paywall callsite	One. `RevenueCatUI.presentPaywall()` at line 9915

The evidence pattern was specific. The active file was bigger, not smaller. The custom paywall infrastructure (the helpers, the imports, the type annotations) was still in place. Only the rendered component and its data flow had been removed. This is the signature of a deliberate replacement at some prior point in the project's history — not an accident, not a missing file, not a partial revert. Someone or something had decided, at some point, to replace the custom PaywallModal with the SDK-native paywall and clean up the rendered surface while leaving the helpers around the call site intact.

That decision had been made before the week's sprint started. The custom paywall in .bak was a snapshot of an earlier era. I had been editing the snapshot, agent-pair-programming carefully against the snapshot, polishing the snapshot, all without ever noticing that the snapshot was not the production target.

what failed

Six patterns, each more upstream than the last.

The discovery had a half-life that expired before the sprint started.

Field Notes No. 04 named "discovery has a half-life" as discipline #2. I cited the lesson and then immediately violated it, at a depth I had not anticipated. The discovery report I had been working from said that PaywallModal had six callsites at specific line numbers in index.tsx. That report was probably accurate at the moment it was generated. By the moment I started writing prompts based on it, the report had already been stale for an unknown number of days. By the end of the week, every line number in the report referenced code that did not exist in the live file. The agent obediently reported "found, applied, diff" against the file paths I named — which meant the .bak file paths, because those were the only paths where the structures from the discovery report still existed.

The lesson is sharper than I had it in No. 04. Discovery does not just go stale. Discovery can target a file that becomes a backup. The agent will continue to operate against that backup if the prompts keep referencing it, and nothing in the agent's loop will surface "this file isn't the one being imported by the entry point." The agent has no native sense of which file is the live file in the running application. It edits whichever file the prompt names.

The agent verifies prompt against code, not code against the application.

This is the architectural finding I want to flag clearly. The supervisory frame I described in No. 04 — show me the existing code first, propose the change, wait for approval, apply — does work. The agent will check that the prompt's claims about a file match the file's contents. What the supervisory frame does not do, and what the agent's own loop does not do, is verify that the file in question is actually the file being imported by the running application's entry point.

This is a category of verification the agent cannot perform without the operator setting it up. It requires walking from the entry point through the import graph to confirm the file under edit is reachable. The agent does not do this on its own and does not propose to do it on its own. The agent's locus of attention is whatever file the prompt last referenced. If that file is a backup, the agent does excellent careful work on a backup.
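The walk from entry point to file under edit can be sketched in a few dozen lines. This is an illustration under stated assumptions, not tooling from the project: the function names, the extension list, and the regex-based import scan are all hypothetical, and a real check would defer to the bundler's own resolver rather than a regex.

```typescript
// Returns a file's source text, or undefined when the path does not exist.
type ReadSource = (path: string) => string | undefined;

// Suffixes tried when an import omits its extension (an assumption about
// the project; a real resolver has more rules than this).
const EXTENSIONS = ['', '.ts', '.tsx', '.js', '.jsx', '/index.ts', '/index.tsx'];

// Join a relative import specifier onto the importing file's directory.
function resolveRelative(fromFile: string, spec: string): string {
  const parts = fromFile.split('/').slice(0, -1);
  for (const seg of spec.split('/')) {
    if (seg === '.' || seg === '') continue;
    if (seg === '..') parts.pop();
    else parts.push(seg);
  }
  return parts.join('/');
}

// Breadth-first walk over relative imports, starting at the entry point.
function reachableFrom(entry: string, read: ReadSource): Set<string> {
  const seen = new Set<string>([entry]);
  const queue: string[] = [entry];
  while (queue.length > 0) {
    const file = queue.shift() as string;
    const source = read(file) ?? '';
    // Fresh regex per file so lastIndex always starts at zero.
    const importRe = /(?:from|import|require)\s*\(?\s*['"](\.{1,2}\/[^'"]+)['"]/g;
    let match: RegExpExecArray | null;
    while ((match = importRe.exec(source)) !== null) {
      const base = resolveRelative(file, match[1]);
      const hit = EXTENSIONS.map((ext) => base + ext).find((p) => read(p) !== undefined);
      if (hit !== undefined && !seen.has(hit)) {
        seen.add(hit);
        queue.push(hit);
      }
    }
  }
  return seen;
}

// The question to ask before any prompt names a path.
function isLive(entry: string, target: string, read: ReadSource): boolean {
  return reachableFrom(entry, read).has(target);
}
```

In practice read would wrap a file-system call that returns undefined for missing paths. The point of the sketch is the shape of the question: a file such as index.tsx.bak is never visited unless something imports it by that exact name, so isLive returns false for it no matter how plausible its contents look.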

The .bak files were a trap that looked like a safety net.

Backup files are a defensive measure. They exist to protect you from data loss. In this case they did the opposite: they created a parallel, plausible-looking version of the codebase that swallowed an entire week of edits because nothing mechanical distinguished them from the live file at the level the agent was looking. index.tsx and index.tsx.bak are both real files. Both will return contents to a view tool call. Both can be edited. Neither is marked "this is the production file" or "this is the one being imported."

The lesson is not "don't keep backup files." Backups have real uses. The lesson is that a backup file that lives in the same directory as the live file, with similar naming, is an attractive nuisance for any system that operates on file paths rather than the import graph. If you must keep .bak files, keep them somewhere the agent's default file-resolution heuristics will not reach — a separate directory, a separate branch, an external archive. Not next to the live file.
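Where moving the backups is not immediately possible, one mechanical way to make the distinction visible is a path guard in whatever tooling sits between the prompt and the file. The sketch below is hypothetical: the function names and the suffix list are assumptions drawn from the file names in this note, not an exhaustive rule.

```typescript
// Suffix patterns treated as backup-like. The list mirrors the names seen
// in this note and is an assumption, not a complete catalogue.
const BACKUP_PATTERNS: RegExp[] = [
  /\.bak(\d+|-[^/]*)?$/i, // index.tsx.bak, index.tsx.bak2, index.tsx.bak-normscan
  /\.old$/i,
  /\.backup$/i,
  /\.\d+$/, // purely numeric suffix, e.g. index.tsx.1
];

// True when the path ends in any backup-like suffix.
function isBackupPath(path: string): boolean {
  return BACKUP_PATTERNS.some((re) => re.test(path));
}

// Gate for any edit: returns the path unchanged, or throws for a backup.
function assertEditable(path: string): string {
  if (isBackupPath(path)) {
    throw new Error(`Refusing to edit backup-like file: ${path}`);
  }
  return path;
}
```

A guard like this only catches files that announce themselves by name. It complements the import-graph check rather than replacing it.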

The agent's refusal worked. It was the third refusal in two days.

The agent's refusal on Log #2 ("getOfferings() is not called anywhere in the app") is exactly the kind of "reward refusal" behavior No. 04's discipline #4 articulated. The agent did the right thing. It refused to add a log around a call that did not exist in the live code. Without that refusal, I would have approved the third log, the diff would have applied to .bak, the build would have shown no [RC] Offerings output on device, and I would have spent hours debugging why a function I never actually called was not producing logs.

The agent's refusal worked perfectly. The operator's gap was that this was the third refusal in two days. The first two had been chalked up to small prompt sloppiness on my end and corrected without further investigation. The pattern across three refusals — each one revealing a small mismatch between my prompt and the live code — should have prompted me to ask why the prompt-code drift was so persistent. The drift was persistent because my mental model of the codebase was stale, and the staleness was not random — it was systematically pointing at an old file. I treated three signals as three errors. They were one error, repeated.

The OR-rescue pattern hid the worst version of the bug for an unknown duration.

The effectivePremium = hookPremium || bypassActive || (premium && !loadingPremium) expression is, in one sense, defensive engineering — three independent sources of premium status, any one can save the others. In another sense it is a way to never find out which source is broken. hookPremium has been returning false for every real subscriber since the day the identifier mismatch was introduced. The system has been functioning — because premium, set by syncPremiumFromRC, has been correct. The day syncPremiumFromRC breaks, the OR will collapse and every subscriber will simultaneously lose Premium access. The bug has been there the whole time. The system has been one parallel-path failure away from cascading.

This is a class of risk that fault-tolerant patterns introduce by design. Code with multiple redundant sources of truth that can disagree silently should be treated as monitoring-required, not reliability-providing. If a system has three sources of truth and only one needs to work, the system needs a watchdog that flags when the three disagree. Without that, you get a system that ships in a permanently-broken-but-functioning state until the day the working path stops working.
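A watchdog of that kind can be very small. The sketch below is a hypothetical helper, not code from the app: it assumes only the sources that are supposed to agree are passed in (here that would be hookPremium and premium; a deliberate override such as bypassActive would be left out), and the report callback stands in for whatever logging or telemetry the project uses.

```typescript
// One named source of premium status, e.g. the hook or the RevenueCat sync.
interface PremiumSource {
  name: string;
  value: boolean;
}

interface PremiumDecision {
  effective: boolean; // the OR the app would act on
  disagreement: boolean; // true when the sources do not all agree
  dissenting: string[]; // sources whose value differs from the result
}

// Combine sources with OR, but report disagreement instead of letting a
// working source silently cover for a broken one.
function combineWithWatchdog(
  sources: PremiumSource[],
  report: (message: string) => void,
): PremiumDecision {
  const effective = sources.some((s) => s.value);
  const dissenting = sources
    .filter((s) => s.value !== effective)
    .map((s) => s.name);
  const disagreement = dissenting.length > 0;
  if (disagreement) {
    report(`[premium-watchdog] sources disagree; dissenting: ${dissenting.join(', ')}`);
  }
  return { effective, disagreement, dissenting };
}
```

Fed hookPremium as false and premium as true, the result is still true, so no subscriber is blocked, but the report names hookPremium as the dissenting source on every evaluation. That is the signal the bare OR never produced.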

I did not run the cheapest possible verification at the start of the sprint.

The cheapest possible verification — at the start of the sprint, before any prompt was written — would have been: "Show me the active openPaywall function in the file the app entry point imports." Five minutes. One view call. Would have surfaced the entire situation immediately. The custom paywall was already gone. Replaced by a nine-line wrapper around RevenueCatUI.presentPaywall(). Anyone reading those nine lines would have known instantly what kind of paywall the app was using.

I did not run that check. I read a discovery report from a previous session, treated it as authoritative, and built a week of prompts on top of it. Every individual prompt referenced specific line numbers and component names that the discovery report had named. The agent obligingly went and edited those line numbers in whatever file matched the path I gave it. The system worked as designed. I just had a bad map.

This is the No. 03 lesson at full force. The cheap comparison costs nothing. The structural refactor costs days, sometimes weeks, sometimes a small amount of trust between the engineer and the agent that is hard to rebuild. A week of structural refactoring on a backup file is a perfect specimen of the cost asymmetry No. 03 named, transposed up one level — from "wrong fix" to "wrong file."

what this maps to in governance frameworks

Five lines, named.

Each pattern above maps to a governance framework expectation that practitioners often gesture at but rarely operationalize. I work with these frameworks in vendor risk assessments at the day job, and this week sharpened how I think about all five.

Code-application drift maps to the data-lineage and change-traceability expectations under SOC 2 CC8.1, ISO 27001 A.8.32, and ISO 42001 8.4. All three frameworks anticipate that the artifact under change is the artifact in production. None of them anticipate that an entire sprint of agent-assisted edits could complete successfully against a file the production system never imports. The vendor question that surfaces from this: how does your change management process verify that proposed changes will affect the running system, not just the named file?
Stale discovery maps to the configuration management expectations under NIST 800-53 CM-2 (baseline configuration) and ISO 27001 A.8.9. These frameworks anticipate that an organization knows its current state. They do not anticipate that an agent-assisted developer would knowingly operate from a snapshot of a previous state because the snapshot is what was at hand. The lesson: discovery artifacts in agentic workflows need expiration timestamps, and re-discovery should be cheap enough to run at the start of every session.
OR-rescue masking maps to the AI system reliability expectations under NIST AI RMF (MEASURE 2.5, MEASURE 2.6) and the EU AI Act's deployer obligations around continuous monitoring. Both anticipate that systems will produce outputs that need to be monitored for correctness. Neither anticipates that the system's internal redundancy would itself be the mechanism that hides a long-running failure. Redundant code paths are monitoring surfaces, not reliability guarantees.
Backup-as-attractive-nuisance maps to the secure development lifecycle expectations under SOC 2 CC8.1 and the EU AI Act Article 15 (accuracy, robustness). Backup files in the working tree are an artifact-management problem that becomes an AI-supervision problem the moment an agent operates on file paths. The discipline: backups belong in version control, in branches, in archives — not next to the live file with a .bak suffix.
Refusal-pattern interpretation maps to the operator-oversight expectations under ISO 42001 (8.4 operational controls) and the AI RMF MANAGE 4 cluster. The frameworks anticipate that operators will respond to AI-system uncertainty signals. They do not anticipate that operators will respond correctly to individual signals while missing the meaning of a pattern of signals. Three refusals in two days is not three errors. It is one error repeated. The discipline: when the agent flags the same shape of mismatch repeatedly, ask why the mismatch keeps happening, not just how to fix the immediate instance.

The frameworks anticipate the failure modes in theory. Living through them in practice sharpens the instinct for which ones to actually look for in vendor assessments. When I am reviewing a vendor's claim about their AI development practices, this week is now part of my mental reference set — specifically the moment I realized the agent had been a careful junior engineer on the wrong file for an entire week.

a small framework

Six disciplines.

These are continuations of, not replacements for, the standing rules from No. 01 through No. 04. They are specific to the code-application drift and refusal-pattern interpretation failures this week surfaced.

// disciplines

Verify the live file before writing the prompt. At the start of every session, do one view on the entry-point file. Trace from there to the function or component you intend to edit. Confirm the path you are about to give the agent is reachable from the entry point. Five minutes. Saves weeks.
Refuse to operate on backup files. If your project keeps .bak files in the working tree, move them out before starting an agentic session. Or at minimum, instruct the agent at session start: "Never read or edit any file with a .bak, .old, .backup, or numeric suffix. If a prompt names such a file, refuse and ask for clarification." The agent will follow that instruction.
Treat repeated refusals as a pattern, not as N independent errors. The first time the agent flags a prompt-code mismatch, it is a small mistake in the prompt. The second time, it might still be a small mistake. The third time, it is not three small mistakes. It is one larger mistake repeating itself. Stop fixing and start investigating why the prompts keep being wrong about the same shape of thing.
OR-rescue patterns require monitoring. Any expression that combines multiple sources of truth into a single output via OR (or any logical fallback) must have a watchdog that flags when the sources disagree. Otherwise the redundancy is not reliability — it is concealment. Add the watchdog at the time you write the OR.
Discovery artifacts have expiration timestamps. Any "where is X in the codebase" report from a previous session is presumed stale at the start of the next session. The presumption is rebuttable by re-running discovery, not by trusting the previous report. Treat discovery as a perishable resource.
The cheapest verification at the start of a sprint is the most valuable. Before any agent prompt is written, before any plan is made, view the entry point. View the function being modified. Confirm the modification path is reachable from the entry point. This is the No. 03 cheap comparison applied at sprint-start rather than mid-bug. The cost is five minutes. The cost of skipping it is up to a week.
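The expiration discipline above can be made mechanical by pinning each discovery report to a cheap fingerprint of the file it described. This is a minimal sketch with hypothetical names throughout: the artifact fields, the eight-hour lifetime, and the use of line count as the fingerprint are all assumptions, and a content hash would be a stricter choice than a line count.

```typescript
// A discovery report pinned to the file it described (fields hypothetical).
interface DiscoveryArtifact {
  filePath: string; // file the report was generated from
  lineCount: number; // cheap fingerprint of that file at discovery time
  createdAtMs: number; // when the report was generated (epoch milliseconds)
  findings: string[];
}

type Freshness = { fresh: true } | { fresh: false; reason: string };

// Default lifetime: one working session, taken here as 8 hours (assumption).
const MAX_AGE_MS = 8 * 60 * 60 * 1000;

// A report is trusted only if it is recent and the file still matches it.
function checkFreshness(
  artifact: DiscoveryArtifact,
  currentLineCount: number,
  nowMs: number,
  maxAgeMs: number = MAX_AGE_MS,
): Freshness {
  if (nowMs - artifact.createdAtMs > maxAgeMs) {
    return { fresh: false, reason: 'expired: re-run discovery' };
  }
  if (currentLineCount !== artifact.lineCount) {
    return {
      fresh: false,
      reason: `file changed: ${artifact.lineCount} -> ${currentLineCount} lines`,
    };
  }
  return { fresh: true };
}
```

Using this note's numbers purely as illustrative inputs, a report recorded against a 16,523-line file and checked against a 17,391-line file comes back stale on the line-count test alone, before age is even considered.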

what this means for No. 04

Standing, with one footnote.

I want to address No. 04 directly, because the temptation is to retract it and the right move is not to retract it.

No. 04 was filed this morning. Its central observation — that the agent had begun catching prompt-code mismatches the operator would not have caught alone — was true at the time of filing. The agent did do that. Three times in five days. The work documented in No. 04 happened. The observation was not invented.

What No. 04 missed is that the agent was catching mismatches inside the file the agent was editing, which was not the file the application was running. The supervisory frame No. 04 described works at the level of "is this prompt consistent with this file." It does not work at the level of "is this file consistent with the application." That is a layer up. That is the layer the operator owns. That is the layer I forfeited this week.

No. 04 stands as written. This note stands beside it as the next layer of the same story — the layer where the careful work documented in No. 04 turned out to be careful work on a file the production system never sees. Both are true. Both are part of the record. The discipline of leaving No. 04 published, dated, and unedited is itself a discipline I want to keep: the series records what the operator understood at each moment of filing, including when the next moment reveals the previous understanding was incomplete.

what didn't work

Three honest things.

I trusted a discovery report I did not generate myself this session.

The discovery report listing the six PaywallModal callsites and the line ranges was generated in an earlier session, and I worked from it for a week without re-running it. I cited the discipline against this in No. 04 and violated it in the same week.

I read three refusals as three small mistakes instead of one pattern.

The agent's prompt-code mismatch flags happened on Tuesday, Wednesday, and Friday. Each time, I corrected the local prompt and moved on. By Friday I should have asked: why does my prompt keep being slightly wrong about the same kind of thing? I did not ask that question until Friday evening, when the third refusal happened to be the one that did not have a small fix. By then I had a week of edits in the wrong file.

I let an entire week pass without rendering the paywall on a real device.

The cheapest possible verification of "is this paywall actually being shipped" is to launch the app on a phone and trigger the paywall. I did not do this once during the sprint. The Vibe Code preview tab and the agent's diff output were the surfaces I trusted. Both are upstream of the actual running system, and both can be correct about a file that does not run. A single TestFlight build, mid-week, would have surfaced the discrepancy in seconds.

the shape of the next note

Where this is going.

I had projected, at the close of No. 04, that No. 05 would be about the part of the development cycle that does not transfer to agentic tooling at all — product decisions, brand calls, judgment about what to ship. That note will still happen. It is not this one.

This one is about a failure mode I did not see in time, named after the diagnostic that surfaced it. The next one will return to the projected territory: the part of the work the operator owns irreducibly. With the additional finding that the operator's irreducible territory is one layer larger than I had it framed in No. 04 — not just product judgment, but also application-level verification of what the agent's careful work is actually changing.

closing thought

The discipline is splitting. The split is one layer deeper than I had it.

No. 04 ended on the line "the discipline is splitting; some of it now lives in the agent." That is still true. The supervisory frame, repeated, transfers: the agent does verify prompt against code, does flag mismatches, does refuse to bridge gaps it does not understand. That much is real.

What this note adds is the layer above. The agent verifies prompt-against-code. The operator must verify code-against-application. The agent has no native sense of which file is the live file. The operator must supply that sense, every session, by walking the import graph from the entry point to the function under edit before any prompt is written.

This is not a complaint about the agent. The agent did what it was structured to do. It edited the file the path named. It refused to invent code that did not exist. It flagged mismatches. It earned the credit No. 04 gave it, and the credit still stands. The failure was that I gave the agent the wrong file path, every prompt, for a week, because the report I was working from was older than the codebase had become.

The frameworks anticipate the failure modes. The cost of skipping the cheap comparison at sprint-start, this week, was a week of careful work on a file the production system does not import. Living through it sharpens the instinct for what to actually look for next time. The cheapest verification belongs at the start of the sprint, not at the start of the bug.

Qais Ali / Senior AI Cyber Security Engineer
CEO, Ilm Works

Field Notes / No. 05 / Filed May 9, 2026