[ field notes ]

16 Hours, 25 Builds

Field notes from shipping AI-built software through the real production pipeline.

Qais Ali / Senior AI Cyber Security Engineer / CEO, Ilm Works

context

A 16-hour arc, named.

Two days after publishing my first field note on agentic AI coding, I lived through a longer and more revealing version of the same problem. Sixteen hours, twenty-five build attempts, one production credential leak, three sandbox outages, two production-versus-preview asset desynchronizations, and one mobile app finally delivered to TestFlight.

This is not a continuation of the previous piece. It's a sharper, more specific version of the same lessons, surfaced under heavier load. If the first field note was about working with an agentic coding assistant on bounded tasks, this one is about what happens when you take an AI-built app all the way through a real platform's build, signing, and distribution pipeline — the part where the work stops being about code and starts being about whether the code actually ships.

I'm a believer in these tools. I use them daily. My views on what's possible are not the views of a skeptic.

But there's a gap between what AI development tools promise and what they deliver when you push them past the demo. That gap is where most of the interesting failure modes live. This field note is what I saw in that gap, and how the patterns I lived through map directly to the AI governance frameworks I work with professionally.

what happened

The proximate problem and the actual cost.

The proximate problem on Saturday night was an iOS App Store build failing during the CocoaPods install step. The symptom was standard ("pod install --repo-update --ansi exited with non-zero code: 1"), but the surface error told me nothing about the cause.

What I expected: an hour of debugging.

What I got: sixteen hours, multiple AI agent failures, two production-versus-preview asset desynchronizations, one credential leak, three sandbox outages, and a much deeper understanding of how AI-built apps actually behave when you push them through the real build pipeline. By the end, the app was in TestFlight. The cost was a full day and most of a night.

What follows are the lessons, named.

what failed

Ten patterns, each with a name.

Confident misdiagnosis.

Late in the arc, after a build was already in TestFlight, I noticed the splash screen looked different from what the in-app preview showed. I asked the agent why. It produced a sixty-step investigation, complete with file references, code citations, and a "Root Cause Summary" identifying a missing environment variable for a third-party SDK. The diagnosis was internally consistent. The file paths checked out. The mechanism it described was technically plausible.

It was completely wrong. The third-party SDK in question was not wired into the app at all. The fallback behavior the agent identified as a bug was the intended behavior for an unconfigured client. The agent had pattern-matched on code structure, found a real-looking root cause, and presented it with confidence — without verifying whether the SDK was a live concern in the first place.

This is the calibration problem in production. AI agents are trained to produce coherent diagnoses. They are not reliably trained to verify foundational assumptions before presenting conclusions. The more elaborate the explanation, the more important it is to ask: did the agent confirm the premise, or just pattern-match the symptoms?

Preview state ≠ committed source.

The most architecturally important discovery of the night: the platform's hosted preview environment was decoupled from the actual source code repository in ways that aren't obvious until they hurt you. This produced two distinct silent failures.

The first: an updated logo asset rendered correctly in the in-app preview. It existed nowhere in the project's source files. When the build pipeline produced the IPA from main, it bundled the older logo because that was what was actually committed. The clean rebuild that fixed everything else shipped to TestFlight with the wrong logo, despite every other fix being committed correctly.

The second: when the agent tried to diagnose this, it pointed at environment variables in the build profile. Which raised a deeper question — when the platform builds and uploads to TestFlight from inside its own UI, which environment variables does it inject? The ones in the build config? The ones in the platform's own settings panel? Both? Neither? The answer was unclear, and the ambiguity itself was the bug.

The pattern: things visible in preview don't necessarily exist in committed source. The hosted dev environment carries state — assets, environment variables, runtime overrides — that doesn't get serialized back into the repo. Production builds only see committed source. So you accumulate "preview-only" state that silently disappears when the build pipeline runs.
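
One way to make this gap visible is to ask Git which files are actually committed and compare that list against the assets the app is expected to ship. The sketch below is a minimal illustration, not the platform's own tooling. It assumes the project is an ordinary Git checkout with git on the PATH; the function names and the example asset paths are hypothetical.

```python
import subprocess


def committed_paths(repo_dir=".", ref="HEAD"):
    """Return the set of file paths committed at `ref`, as reported by git ls-tree."""
    result = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", ref],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return set(result.stdout.splitlines())


def missing_from_commit(expected_paths, committed):
    """Return the expected paths that are not present in the committed set, sorted."""
    return sorted(path for path in expected_paths if path not in committed)


# Hypothetical usage: these asset paths are placeholders, not the real project layout.
# missing = missing_from_commit(
#     ["assets/logo.png", "assets/splash.png"],
#     committed_paths(".", "main"),
# )
# if missing:
#     raise SystemExit(f"visible in preview but not committed: {missing}")
```

Anything this reports as missing is preview-only state: it may render in the hosted environment, but a build from main will not see it.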

Native dependency pairings the agent will miss.

The actual root cause of the original CocoaPods failure was a tight version pairing in the React Native ecosystem. react-native-reanimated@4.2.1 requires react-native-worklets@0.7.x. Transitive resolution had landed on 0.8.1. Result: a podspec validation failure forty lines deep in the build logs.

The fix, once identified, was a single-line addition to package.json:

"overrides": { "react-native-worklets": "0.7.4" }

Thirty seconds of work. It took multiple debugging cycles to identify, partly because the agent's first instinct was to add the package as a direct dependency at ~0.7.4 — a reasonable but insufficient fix that didn't account for the lockfile resolving differently. The agent reported success ("already installed... no action needed") while the production build kept failing identically.

Force-pin tight version pairings explicitly. Don't trust transitive resolution. Don't trust agent reports of "already installed." Verify the version actually present in node_modules matches what the dependency expects.
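
The last step, checking what is actually installed instead of what the agent says is installed, can be sketched in a few lines. This is a minimal illustration under stated assumptions: it reads only the top-level copy in node_modules (a nested copy under another package's own node_modules would need a separate check), and the helper names are hypothetical.

```python
import json
from pathlib import Path


def installed_version(project_dir, package):
    """Return the version in node_modules/<package>/package.json, or None if absent."""
    manifest = Path(project_dir) / "node_modules" / package / "package.json"
    if not manifest.is_file():
        return None
    return json.loads(manifest.read_text(encoding="utf-8")).get("version")


def on_same_minor_line(version, required):
    """True when `version` shares major.minor with `required`, e.g. 0.7.4 vs 0.7.x."""
    if version is None:
        return False
    return version.split(".")[:2] == required.split(".")[:2]


# Hypothetical usage, run from the project root after installing dependencies:
# found = installed_version(".", "react-native-worklets")
# if not on_same_minor_line(found, "0.7.x"):
#     raise SystemExit(f"react-native-worklets is {found}, expected 0.7.x")
```

A check like this belongs after the install step and before the native build, so a drifted version fails fast instead of surfacing as a podspec error deep in the logs.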

Fatigue compounding.

About ten hours into the debugging arc, I pasted a long-lived API token into chat in plain text. As an AI security engineer, I know better. I caught it within minutes, revoked the token, and continued. But the leak happened.

The agent didn't cause the leak — I did. But the iterative-failure environment made the mistake more likely. Long debugging sessions with AI tooling create cognitive load patterns that differ from traditional debugging. You're context-switching between writing prompts, reading explanations, screenshotting outputs, verifying agent claims, opening logs, scrolling logs, copying error strings, pasting them back. Cognitive overhead per unit of progress is high. Standard developer fatigue accelerates.

Most secure development training assumes traditional development rhythms. AI-assisted development introduces new fatigue patterns and new credential-handling pressure points. Secrets should be hard to leak by accident. Most AI development chats default to making them very easy to leak.
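
As one illustration of making secrets harder to leak by accident, a scrubber can run over text before it is pasted into a chat. The sketch below is a heuristic only, not a feature of any platform I used: it flags long runs of token-like characters with high Shannon entropy. The length and entropy thresholds are assumptions, and it will produce both false positives and false negatives.

```python
import math
import re
from collections import Counter

# Assumption: secrets tend to appear as runs of 24 or more token-like characters.
TOKEN_RUN = re.compile(r"[A-Za-z0-9_\-]{24,}")


def shannon_entropy(text):
    """Return the Shannon entropy of `text` in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redact_likely_secrets(text, min_entropy=4.0):
    """Replace long, high-entropy runs with a placeholder; leave everything else as-is."""

    def replace(match):
        run = match.group(0)
        return "[REDACTED]" if shannon_entropy(run) >= min_entropy else run

    return TOKEN_RUN.sub(replace, text)
```

A filter like this fails safe: an over-eager redaction costs a re-paste, while a missed token costs a revocation. It is a backstop for a tired operator, not a substitute for a secret store.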

Single-platform code hosting as concentration risk.

The platform hosts code internally. Optional Git mirroring is available; I hadn't set it up. During multi-hour sandbox outages, I had no independent way to verify what was actually on main, no backup, no alternate clone, no third-party code review possible. When the agent claimed it had committed and pushed a fix, I had to trust the claim — there was no external view to verify against.

This is concentration risk at the project level. At enterprise scale, concentration risk on a single AI vendor for code, weights, infrastructure, and deployment is a category that supplier risk frameworks explicitly assess. Most vendor assessment templates ask whether a vendor uses AI; few ask what happens when the AI vendor is unavailable.
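
With an independent mirror in place, "what is actually on main" becomes a question Git can answer from outside the platform. The sketch below compares the commit that each remote reports for the same branch using git ls-remote. It assumes both remotes are reachable and git is on the PATH; the remote names in the usage comment are hypothetical.

```python
import subprocess


def parse_ls_remote(output):
    """Parse `git ls-remote` output (one "<sha>TAB<ref>" per line) into {ref: sha}."""
    refs = {}
    for line in output.splitlines():
        sha, _, ref = line.partition("\t")
        if ref:
            refs[ref] = sha
    return refs


def remote_head(remote, branch="main", repo_dir="."):
    """Return the commit SHA that `remote` reports for `branch`, or None if absent."""
    ref = f"refs/heads/{branch}"
    result = subprocess.run(
        ["git", "ls-remote", remote, ref],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ls_remote(result.stdout).get(ref)


def mirrors_agree(heads):
    """True when every remote reported a SHA and all of the SHAs are identical."""
    shas = set(heads.values())
    return None not in shas and len(shas) == 1


# Hypothetical usage: "origin" is the platform remote, "mirror" the independent copy.
# heads = {name: remote_head(name) for name in ("origin", "mirror")}
# print(heads, "in sync" if mirrors_agree(heads) else "DIVERGED")
```

When the agent says it has committed and pushed, the mirror's view of main is the external evidence that was missing during the outages.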

Agent self-reports of completion.

The agent claimed multiple times to have completed tasks — "dependency added, lockfile updated, committed and pushed" — that hadn't fully landed in the production environment that mattered. The agent measured success locally. It executed the commands it planned. The actual user-facing outcome — build still failing identically — was downstream and not part of the agent's verification loop.

This is the same completion-claim failure mode I named in the previous field note, surfacing again at higher cost. "Done," in agent-speak, often means "I executed the planned commands." It rarely means "the underlying problem you described has been solved end-to-end."

Define done by user-facing outcome, not by agent-reported completion. "Tell me when the build succeeds with commit X" is a meaningfully better instruction than "fix the dependency."
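
That instruction can also be enforced outside the agent. The sketch below treats "done" as "a build of this exact commit succeeded" and polls until that is true, false, or timed out. Everything about the build service is an assumption: get_build is a hypothetical callable the reader supplies, and the status strings "succeeded" and "failed" are placeholders for whatever the real pipeline reports.

```python
import time


def wait_for_build(commit_sha, get_build, timeout_s=1800.0, interval_s=30.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the build for `commit_sha` finishes.

    `get_build(commit_sha)` is a caller-supplied function returning a dict with
    "commit" and "status" keys, or None when no build exists yet.
    Returns True on success, False on failure, and raises TimeoutError otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        build = get_build(commit_sha)
        # Only a build of the exact commit counts; an older green build does not.
        if build is not None and build.get("commit") == commit_sha:
            if build.get("status") == "succeeded":
                return True
            if build.get("status") == "failed":
                return False
        if clock() >= deadline:
            raise TimeoutError(f"no finished build for {commit_sha} after {timeout_s}s")
        sleep(interval_s)
```

The point of the design is that the agent's report never enters the loop. The only inputs are the commit and the pipeline's own status for it.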

Sandbox fragility.

The platform sandbox went unresponsive multiple times across the arc. Hard Restart fixed it sometimes. Other times it didn't. Other projects on the same account remained healthy. There was no telemetry visible to me about why the sandbox was hanging, no SLA on resolution, no escalation path beyond support email.

Treat the sandbox as ephemeral. Commit early, commit often, push immediately. Anything that exists only in the sandbox session is in volatile memory. When the sandbox hangs, work resumes from the last commit — not from where the agent thought it was.

Production-mode bugs accumulating.

Three separate production-versus-dev gaps surfaced inside sixteen hours: an asset present in dev preview but missing from the production bundle, environment variables present in dev but ambiguous in production, and native dependency resolution working in dev but failing podspec validation in production. Each would have been caught earlier with a continuous production-mode build. There wasn't one, because AI app platforms emphasize dev-loop velocity. Production-mode builds only happen at submission time. Bugs that only manifest in production builds therefore stack up and surface together at the worst possible moment.

Run a production-mode build on day one of any new app, even if it's empty. Establish that the production path works before adding features that depend on it.

Mobile-first tooling has rough edges.

Most of this work happened from an iPhone. Building, debugging, communicating with the agent, triggering builds, monitoring deployments. That's genuinely impressive. It's also painful in specific ways. Pasting tokens accidentally in plain text is more likely on mobile (the visual chrome is less obvious). Reading hundred-line stack traces on a six-inch screen is an experience. Some critical operations — triggering build pipeline workflow runs, managing certain App Store Connect settings — don't have functional mobile UIs and require desktop access.

Mobile-first works for normal flow. Desktop is necessary for incident response.

The agent does not escalate.

When the agent got stuck or confused, my repeated impulse was to throw more prompts at it. Refine the request. Restate the problem. Try again. The right move, repeatedly, was to step back, change tools, or change strategy entirely. I spent more time than I should have re-prompting the agent when the productive move was to pause, contact support, or pivot to direct CLI access.

Agents are good at executing within a workflow. They're bad at recognizing when the workflow itself is broken and a different approach is needed. That escalation has to come from the human.

When the agent fails at the same task twice with similar errors, the human steps out of the loop and chooses a new strategy. Treat repeated agent failure as a signal that the workflow needs to change, not just the prompt.
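
The two-strikes rule is simple enough to mechanize. The sketch below reduces each error message to a rough signature (lowercased, numbers masked, whitespace collapsed) and signals escalation once the same signature has been seen twice. The normalization rules and the class name are assumptions for illustration; real logs would likely need more careful matching.

```python
import re
from collections import Counter


def error_signature(message):
    """Reduce an error message to a rough signature so near-identical errors match."""
    signature = message.lower()
    signature = re.sub(r"0x[0-9a-f]+|\d+", "#", signature)  # mask hex and decimal numbers
    return re.sub(r"\s+", " ", signature).strip()


class EscalationTracker:
    """Counts failures per signature and flags when a human should change strategy."""

    def __init__(self, limit=2):
        self.limit = limit
        self.counts = Counter()

    def record_failure(self, message):
        """Record a failure; return True once this signature has hit the limit."""
        signature = error_signature(message)
        self.counts[signature] += 1
        return self.counts[signature] >= self.limit
```

When record_failure returns True, the next action is a human decision (pause, contact support, drop to the CLI) and not another prompt.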

what this maps to in governance frameworks

Six lines, named.

The failure modes above map to lines in the AI governance frameworks I work with professionally. The mapping is not theoretical. It is what I look for when I'm doing vendor assessments at the day job.

Confident misdiagnosis maps to the calibration/hallucination problem addressed abstractly in NIST AI RMF and concretely in MITRE ATLAS adversarial example patterns.
Preview-versus-production decoupling maps to dev-prod parity gaps that show up in SOC 2 Type II assessments and ISO 27001 change management controls.
Silent state divergence maps to data lineage and configuration management problems in ISO 42001.
Sandbox fragility maps to availability and operational resilience requirements under the EU AI Act and ISO 22301.
Single-platform concentration maps to third-party risk frameworks I use daily in supplier risk assessments.
Self-reported completion maps to the broader AI evaluation rigor gap that every major framework gestures at and few deployments actually solve.

The frameworks anticipate these failure modes in theory. Living through them in practice sharpens the instinct for what to actually look for in vendor assessments. When I'm reviewing an AI vendor's claims about reliability, reproducibility, and resilience, this sixteen-hour arc is now part of my mental reference set.

For CISOs and risk practitioners evaluating AI development platforms specifically: the questions worth asking go beyond "does it work?" The real questions are "how does it fail, and what do users have to do to recover?" Most vendors don't have crisp answers. The ones who do are the ones worth taking seriously.

what I would do differently

Seven disciplines.

Each is a direct response to a specific failure I observed. If I were starting the project over today:

// disciplines

Set up a private GitHub mirror on day one. Five minutes of setup; would have saved hours of incident-response time.
Run a production-mode build before adding any features. Establish that the production path works before relying on it.
Pin every native dependency pairing explicitly with overrides. Don't trust transitive resolution.
Treat preview state as untrusted. Verify each visible asset and behavior actually corresponds to a committed file before assuming it's persistent.
Use the platform's secret-storage feature for credentials from the first prompt. Never paste tokens in chat, even temporarily, even when the agent is asking for them directly.
Set time limits on debugging sessions. If progress stalls for an hour, stop and reassess.
Define "done" by external verification, not agent self-report.

None of these are exotic engineering practices. They are standard discipline. The lesson isn't that AI tools require new principles — it's that they often relax the visible enforcement of old ones, and the cost of that relaxation accumulates silently until production exposes it.

closing thought

The discipline is still the job.

The previous field note ended on the line "the discipline is the job." This one ends on the same line, with one addition: the discipline is also where the governance frameworks meet the keyboard. NIST AI RMF, ISO 42001, the EU AI Act, MITRE ATLAS, SOC 2, ISO 27001: all of these say, in their own languages, that AI systems require verification gates, dev-prod parity, supplier risk management, operational resilience, and human oversight with explicit escalation paths.

What sixteen hours of real work makes clear is that the gap between "the framework requires this" and "the framework is enforced in practice" is exactly the gap where confident-but-wrong agents, silent preview-versus-production state, and single-platform concentration risks live.

The build is in TestFlight. The remaining work is bounded. And my intuition for what to actually look for in vendor assessments is sharper than it was a week ago.

That trade is worth it. The frameworks are catching up to the reality. So am I.

Qais Ali / Senior AI Cyber Security Engineer / CEO, Ilm Works

Field Notes / No. 02 / Filed April 27, 2026