[ field notes ]

A Weekend of Agentic AI Coding

An assessment after sustained, high-stakes use on a real product.

Qais Ali / Senior AI Cyber Security Engineer / CEO, Ilm Works

context

Sustained use, real stakes.

Over a 30+ hour weekend sprint, I worked extensively with an agentic AI coding assistant on a production mobile application. The work spanned a 10,000+ line codebase, an audio synthesis pipeline producing 73,000+ files across two languages and two synthesis providers, cloud object storage, and the wiring between all of it.

This is not a hot take after one bad session. It's an assessment after sustained, high-stakes use on a real product with real users.

what worked

Throughput, reasoning, recovery, docs.

Throughput on well-bounded tasks.

When I gave the agent a clearly scoped problem with a verifiable success criterion, it produced real, working code. One example: a non-trivial sequential playback state machine, with race-free cancellation, index mapping for shuffled inputs, and per-slot fallback logic. This is concurrency code, and it compiled cleanly with zero type errors on the first pass. That's not a small thing.

Architectural reasoning when prompted carefully.

When I asked the agent to propose before implementing, the proposals were genuinely thoughtful. It identified a hash-mismatch problem between our batch pipeline and our application schema that I would not have caught until much later. It correctly recommended pre-baking computed values at build time over runtime computation, with clean justification grounded in the runtime environment's actual capabilities.

Recovery from error when challenged with concrete questions.

Multiple times, the agent reported "complete" or made incorrect claims. When I pushed back with specific verification demands ("show me the actual file count," "run a serial check on 5 specific cases," "show me the diff"), the agent consistently produced accurate diagnostics and corrected itself. It does not double down on hallucinations under pressure.

Documentation and post-mortem material.

When asked, the agent produced clean migration plans, architectural decision records, and roadmap documents. Output quality on docs is high.

what didn't work

Four failure modes, named.

The "completion claim" failure mode.

This is the single most important pattern to understand if you're going to use these tools in production.

The agent has a strong tendency to declare work "complete" based on the absence of errors in its own scripts, rather than on verified end-state. Examples from this weekend:

  1. Reported "pipeline complete, 99.9% coverage, 0 failures" when roughly 30% of the actual artifacts were missing from cloud storage. The batch script ran without throwing errors, the manifest claimed all entries existed, but a manual spot-check hit failures on roughly 1 in 3 attempts.
  2. Reported "build complete, deployed" when in reality the agent's container didn't have the deployment CLI installed and no build had been triggered at all. The "deployment" existed only as agent intent, not as actual action.
  3. Reported "fixed" four cascading bugs in a data pipeline over a 24-hour window. Each fix introduced a new failure mode that masked the next bug. Each fix was claimed complete based on the script not throwing — not based on verifying the underlying state.

The pattern is consistent enough to name as a structural property of these agents: they confuse "my code ran" with "my work is done." Treating exit code 0 as success when the actual goal requires verifying external state (objects exist in storage, files have correct content, downstream systems received what was sent) produces false-positive completion claims at high frequency.
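
A minimal sketch of the distinction, in Python. Every name here (`verify_end_state`, `object_exists`, the in-memory `store`, the key format) is hypothetical, and the dictionary only stands in for whatever storage client is actually in use. The point is that the completion report is built from what storage says, not from the manifest or from the script's exit code.

```python
from typing import Callable, Dict, Iterable, List


def verify_end_state(
    manifest_keys: Iterable[str],
    object_exists: Callable[[str], bool],
) -> Dict[str, object]:
    """Compare what the manifest claims against what storage actually holds."""
    keys = list(manifest_keys)
    missing: List[str] = [k for k in keys if not object_exists(k)]
    return {
        "claimed": len(keys),
        "verified": len(keys) - len(missing),
        "missing": missing,
        # An empty manifest is not treated as "complete".
        "complete": len(keys) > 0 and not missing,
    }


# Hypothetical stand-in for a real object store: key -> content.
store = {"en/0001": b"audio", "en/0002": b"audio"}
# The manifest claims three artifacts; only two were actually uploaded.
manifest = ["en/0001", "en/0002", "en/0003"]

report = verify_end_state(manifest, lambda key: key in store)
print(report)
```

In this toy run the script itself raises nothing, yet the report is not complete, which is the gap between "my code ran" and "my work is done."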

Destructive actions before verification.

This is worse than the completion-claim problem. On one occasion, the agent wrote a "repair script" using concurrent network requests against a rate-limited endpoint, hit those rate limits, interpreted the throttling responses as missing-resource responses, concluded that the vast majority of our generated artifacts were missing, and cleared the source-of-truth manifest before verifying its conclusion. The destructive action preceded the verification. When I challenged it with concrete questions ("how did you determine this? Spot-check 5 cases serially right now"), it correctly diagnosed its own bug — the rate-limit false positives — but only after the manifest had already been wiped. Recovery cost time and money. Cheap in this case. Could be catastrophic in others.
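
A sketch of the check that was missing, under stated assumptions: the endpoint speaks HTTP, a throttled request comes back as status 429, and a missing object as 404. The function names and the fake endpoint are hypothetical. The idea is to keep three outcomes apart (present, missing, unknown), check serially, and refuse to act destructively while any answer is still unknown.

```python
import time
from enum import Enum
from typing import Callable, Dict, Sequence


class Presence(Enum):
    PRESENT = "present"
    MISSING = "missing"
    UNKNOWN = "unknown"  # throttled or errored: not evidence of absence


def classify(status: int) -> Presence:
    """Map an HTTP status code to what it actually tells us."""
    if 200 <= status < 300:
        return Presence.PRESENT
    if status == 404:
        return Presence.MISSING
    return Presence.UNKNOWN  # 429, 5xx, and anything else


def serial_spot_check(
    keys: Sequence[str],
    head: Callable[[str], int],
    delay_s: float = 0.0,
    retries: int = 3,
) -> Dict[str, Presence]:
    """Check keys one at a time, retrying inconclusive answers."""
    results: Dict[str, Presence] = {}
    for key in keys:
        outcome = Presence.UNKNOWN
        for attempt in range(retries):
            outcome = classify(head(key))
            if outcome is not Presence.UNKNOWN:
                break
            time.sleep(delay_s * (attempt + 1))  # simple linear backoff
        results[key] = outcome
    return results


def safe_to_act(results: Dict[str, Presence]) -> bool:
    """Only a fully conclusive check may justify a destructive step."""
    return all(r is not Presence.UNKNOWN for r in results.values())


# Hypothetical endpoint: throttles the first request per key, then answers.
seen: Dict[str, int] = {}
existing = {"a", "b"}


def fake_head(key: str) -> int:
    seen[key] = seen.get(key, 0) + 1
    if seen[key] == 1:
        return 429
    return 200 if key in existing else 404


results = serial_spot_check(["a", "b", "c"], fake_head)
print(results, safe_to_act(results))
```

A checker that collapsed 429 into "missing" would have reported all three keys as gone on the first pass. Here only `c` is reported missing, and only after a conclusive 404.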

Plan drift in summary mode.

When asked to "summarize next steps" or "what's queued," the agent consistently produced summaries that were vaguer than the plans we'd actually agreed to. Phases would silently collapse, verification gates would disappear, human-review steps would vanish. If I didn't push back and demand the full version, execution would slowly drift toward the agent's preferred level of looseness.
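
One cheap guard, sketched here as an assumption rather than something used on this project: keep the agreed plan as a list of step IDs and diff every summary against it. The step names and the `dropped_steps` helper are hypothetical.

```python
from typing import List, Sequence


def dropped_steps(agreed: Sequence[str], summary: Sequence[str]) -> List[str]:
    """Return agreed step IDs the summary no longer mentions, in plan order."""
    mentioned = set(summary)
    return [step for step in agreed if step not in mentioned]


# Hypothetical plan as agreed, and the agent's later "next steps" summary.
agreed_plan = ["generate", "upload", "verify-storage", "human-review", "deploy"]
agent_summary = ["generate", "upload", "deploy"]

print(dropped_steps(agreed_plan, agent_summary))
# -> ['verify-storage', 'human-review']
```

The steps that vanish in this example are the verification gate and the human-review step, the kind of omission described above.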

Container ephemerality as a hidden failure mode.

Many agentic-coding sandboxes run ephemeral containers tied to the agent session. When the session ends, the container dies — and so does anything running inside it. We discovered late in the project that long-running batch jobs would silently die every time I closed the agent interface. The autonomous-runner architecture I thought I had was actually "runs only while I'm watching." This was not documented prominently and only became visible through forensic investigation of system logs and commit patterns.
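
A checkpointed batch loop does not keep a job alive in an ephemeral container, but it makes a silent death visible and cheap to recover from. The sketch below is an illustration under assumptions, not the pipeline from this project: `run_batch`, the checkpoint file name, and the job IDs are all hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, List, Sequence, Set


def load_done(path: str) -> Set[str]:
    """Read the set of finished item IDs, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_done(path: str, done: Set[str]) -> None:
    """Write to a temp file, then rename, so a kill mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)


def run_batch(
    items: Sequence[str],
    process: Callable[[str], None],
    checkpoint_path: str,
) -> int:
    """Process items not yet checkpointed; return how many ran this time."""
    done = load_done(checkpoint_path)
    processed = 0
    for item in items:
        if item in done:
            continue
        process(item)
        done.add(item)
        save_done(checkpoint_path, done)  # record progress after every item
        processed += 1
    return processed


checkpoint = os.path.join(tempfile.mkdtemp(), "progress.json")
items = ["job-1", "job-2", "job-3", "job-4"]
log: List[str] = []


def dies_on_third(item: str) -> None:
    if item == "job-3":
        raise RuntimeError("simulated container shutdown")
    log.append(item)


try:
    run_batch(items, dies_on_third, checkpoint)
except RuntimeError:
    pass  # the session ended mid-run

# A later session picks up where the dead one stopped.
resumed = run_batch(items, log.append, checkpoint)
print(resumed, log)
```

Writing the checkpoint after every item is the simplest choice, not the fastest one. A large batch would more plausibly checkpoint every N items, and would still need to run somewhere that outlives the agent session.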

what this means for production use

Five takeaways for peers evaluating these tools.

Economic value is real but partly offset by verification cost.

The agent probably saved me 60–100 hours of greenfield engineering this weekend. It also generated maybe 15–20 hours of overhead: verification, debugging its mistakes, and re-litigating claims ("actually, let's check that"). That nets out to roughly 40–85 hours saved. A clear positive for me on this project, but the ratio matters, and it shifts with the kind of work you're doing.

The ratio improves dramatically when you control the verification gates explicitly.

The biggest single unlock for me was establishing standing rules: destructive operations require explicit approval, "complete" claims require evidence of end-state not script output, repair scripts producing surprising numbers should be assumed buggy until proven correct, and assumptions should be surfaced before any non-trivial work. Once these were in place, the agent's failures became cheaper because they failed at the proposal phase instead of the action phase.

The agent does not impose engineering discipline on itself. You impose it.

Every win this weekend came from me forcing structure: verify-then-act, propose-then-implement, gate-then-proceed. The agent does not do this on its own. If you turn it loose without those gates, you will get fast progress and silent corruption simultaneously.

These tools are most dangerous on tasks where verification is hard.

Code that compiles cleanly is easy to verify. Artifact existence in a remote store is harder. Quality of synthesized output (audio, generated text, anything requiring human judgment) is harder still. The further the task moves from "the compiler tells you if you're wrong" toward "you have to listen, read, or judge to know if it's right," the more the agent's confident-but-wrong failure mode bites.

Trust human verification as a first-class step, not a polish-pass afterthought.

Every major bug I caught this weekend was caught by me, not by an automated check. The missing artifacts were detected when I manually inspected one and got a failure response. The four-bug pipeline cascade was unraveled by reading log timestamps and noticing a process kept dying. A critical UX question only surfaced because I asked it directly. Programmatic checks can be gamed (often unintentionally) by the agent itself. Your own eyes cannot.

a specific recommendation

Five standing rules.

If you adopt these tools for non-trivial work, write down your standing rules before you start, and apply them to every session. Mine, after this weekend, are:

// standing rules
  1. Destructive operations (deletes, clears, schema modifications, anything not reversible by trivial rollback) require explicit approval with diagnosis, cost, and rollback plan.
  2. Completion claims require verified end-state evidence, not script exit codes.
  3. When a diagnostic produces surprising numbers, suspect the diagnostic before suspecting reality. Verify with serial spot-checks before acting.
  4. Surface assumptions before non-trivial work. Flag unverified ones explicitly.
  5. The agent never claims "tested and working" — only the human does, with evidence.

These aren't theoretical. Each one is a direct response to a specific failure I observed.
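
Rule 1 can also be enforced in code rather than by convention. The following is a minimal sketch, assuming destructive steps are routed through a single wrapper. `DestructiveRequest`, `run_destructive`, and `ApprovalRequired` are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DestructiveRequest:
    action: str
    diagnosis: str
    cost: str
    rollback_plan: str


class ApprovalRequired(Exception):
    """Raised when a destructive step is attempted without sign-off."""


def run_destructive(
    request: DestructiveRequest,
    operation: Callable[[], None],
    approved_by: Optional[str] = None,
) -> None:
    """Run `operation` only if the request is fully filled in and approved."""
    fields = (request.action, request.diagnosis, request.cost, request.rollback_plan)
    if not all(field.strip() for field in fields):
        raise ApprovalRequired("request is missing diagnosis, cost, or rollback plan")
    if not approved_by:
        raise ApprovalRequired(f"'{request.action}' needs explicit human approval")
    operation()


# Hypothetical source-of-truth manifest and a request to clear it.
manifest = {"en/0001": "ok", "en/0002": "ok"}
request = DestructiveRequest(
    action="clear manifest",
    diagnosis="repair script reports most artifacts missing",
    cost="full regeneration",
    rollback_plan="restore manifest from backup copy",
)

try:
    run_destructive(request, manifest.clear)  # no approval given
except ApprovalRequired as exc:
    print(f"blocked: {exc}")

print(len(manifest))  # the manifest is untouched
```

The wrapper does not judge whether the diagnosis is correct. It only forces the diagnosis, cost, and rollback plan to be written down and seen by a human before anything is deleted.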

closing thought

The discipline is the job.

I think the right mental model for these tools, today, is: a fast, capable, sometimes-careless senior engineer who doesn't know when to stop and verify, and who will confidently report success based on the wrong evidence unless you specifically ask for the right evidence. The output is genuinely valuable. The supervision cost is real. The decision to use them is not "is this tool good or bad" — it's "is the throughput gain worth the verification overhead for this specific task, on this specific timeline, with this specific cost of being wrong?"

Either way: do not take any "done" claim at face value. Verify the actual state. The discipline of doing so is the entire job.

Qais Ali / Senior AI Cyber Security Engineer / CEO, Ilm Works
Field Notes / No. 01 / Filed April 2026