AI Working Instructions
The disciplines an operator needs to supervise agentic AI development on real production work.
preamble
What this is. What it isn't.
This is a working framework for supervising agentic AI development on real production work. It is the consolidated rule body that the Field Notes series has been generating, organized by theme rather than by chronology.
The Field Notes record what failure modes look like when they happen. This document records what to do about them. The two surfaces serve different readers and different moments. A practitioner mid-sprint, debugging a session that has gone sideways, wants the rule. A practitioner reflecting on a sprint, trying to learn what to do differently next time, wants the story. Both are useful. They live separately on purpose.
This document assumes you are an operator: someone who supervises agentic AI development for production-bearing work, where mistakes have downstream consequences for users, customers, regulators, or your own future self. The disciplines below are not for the case where you are exploring or prototyping. They are for the case where the output is going to ship.
Every discipline is sourced from a specific Field Notes entry, where the failure that motivated it is documented in detail. Each discipline links back to its origin. If a rule looks abstract here, the field note will show you the session in which the rule earned its place.
The document is versioned and dated. New disciplines are added as new field notes are filed. Existing disciplines are revised when subsequent practice reveals they were incomplete. The version log at the bottom records what changed. Nothing is ever silently rewritten.
theme 01
Discovery and verification.
Before any prompt is written, the operator must know what is actually in the code. Discovery is the cheapest possible verification step and the one most often skipped. The disciplines in this theme are about the work that should happen at sprint-start, before any agent prompt has been composed.
Verify the live file before writing the prompt.
At the start of every session, view the entry-point file. Trace from there to the function or component you intend to edit. Confirm the path you are about to give the agent is reachable from the entry point. The agent will edit whichever file the path names. It has no native sense of which file is the live file in production.
This costs five minutes. The cost of skipping it is up to a week of careful work on a backup file.
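A minimal sketch of the reachability check, assuming you already have (or can extract) a map from each file to the files it imports. The `import_graph` dictionary, the file names, and the `is_reachable` helper are hypothetical illustrations, not part of any tool named in the Field Notes.

```python
from collections import deque


def is_reachable(import_graph, entry_point, target):
    """Return True if `target` can be reached from `entry_point` by following imports."""
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for dep in import_graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return False


# Hypothetical project: the entry point imports the live screen; nothing imports the backup.
import_graph = {
    "App.tsx": ["screens/Home.tsx"],
    "screens/Home.tsx": ["components/Header.tsx"],
}

print(is_reachable(import_graph, "App.tsx", "components/Header.tsx"))      # True
print(is_reachable(import_graph, "App.tsx", "components/Header.bak.tsx"))  # False
```

If the path you are about to hand the agent comes back unreachable, the prompt is about to point at a file that does not run.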
Discovery has a half-life.
A discovery report from an earlier session is not authoritative for a prompt written days later. The code may have moved. Either re-run discovery at session start, or write the prompt in a way that is robust to staleness — "show me the current state first."
Discovery artifacts in agentic workflows have expiration timestamps. Re-discovery should be cheap enough to run at the start of every session. Treat prior reports as evidence that something was true, not that it still is.
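One way to make the expiration concrete is to store a content fingerprint with each discovery report and compare it against the file at session start. This is a sketch under assumptions: `DiscoveryReport`, `fingerprint`, and `is_stale` are hypothetical names, and a hash check is only one possible staleness test.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class DiscoveryReport:
    path: str
    content_hash: str  # fingerprint of the file at the time discovery ran
    summary: str


def is_stale(report: DiscoveryReport, current_text: str) -> bool:
    """A report is stale as soon as the file it describes has changed."""
    return fingerprint(current_text) != report.content_hash


# Hypothetical file contents, before and after someone edited the file.
old_source = "export const TITLE = 'Home';"
new_source = "export const TITLE = t('home.title');"

report = DiscoveryReport("screens/Home.tsx", fingerprint(old_source), "TITLE is a string literal")
print(is_stale(report, old_source))  # False
print(is_stale(report, new_source))  # True
```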
Refuse to operate on backup files.
If your project keeps .bak, .old, or numbered backup files in the working tree, move them out before starting an agentic session. Or instruct the agent at session start: "Never read or edit any file with a .bak, .old, .backup, or numeric suffix. If a prompt names such a file, refuse and ask for clarification."
Backup files in the same directory as live files are an attractive nuisance for any system that operates on file paths rather than the import graph. The agent will edit them as readily as it edits the live file.
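The same refusal can be enforced outside the prompt with a path guard that runs before any path reaches the agent. The sketch below is deliberately conservative and rests on assumptions: `looks_like_backup` is a hypothetical helper, it inspects only the file name (not parent directories), and it will also flag legitimately versioned file names that contain an all-digit segment.

```python
from pathlib import PurePosixPath

BACKUP_MARKERS = {"bak", "old", "backup"}


def looks_like_backup(path: str) -> bool:
    """True if any dot-separated segment after the base name is a backup marker or all digits."""
    name = PurePosixPath(path).name
    segments = name.lower().split(".")[1:]
    return any(seg in BACKUP_MARKERS or seg.isdigit() for seg in segments)


for candidate in ["src/Home.tsx", "src/Home.tsx.bak", "src/Home.old.tsx", "src/Home.tsx.2"]:
    print(candidate, looks_like_backup(candidate))
```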
Force the cheap comparison first.
Before any structural change, the agent produces a side-by-side table of the elements involved. Same property, two columns, literal values. Identical values kill the structural hypothesis. Different values usually surface the actual bug.
The cheap comparison costs minutes. The structural refactor costs days. The asymmetry is enormous. Take the cheap step first, every time, even when, and especially when, the agent is confident it is not necessary.
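The side-by-side table can be produced mechanically once the literal values are in hand. A minimal sketch, assuming the two elements have been reduced to dictionaries of property values; the property names and the `compare_properties` helper are hypothetical.

```python
def compare_properties(left, right):
    """Return (property, left value, right value, same?) rows for every property in either dict."""
    rows = []
    for key in sorted(set(left) | set(right)):
        a = left.get(key, "<missing>")
        b = right.get(key, "<missing>")
        rows.append((key, a, b, a == b))
    return rows


# Hypothetical style values for an element that works and one that does not.
working = {"flex": 1, "padding": 16, "overflow": "hidden"}
broken = {"flex": 1, "padding": 16, "overflow": "visible"}

for key, a, b, same in compare_properties(working, broken):
    print(f"{key:<10} {a!s:<10} {b!s:<10} {'same' if same else 'DIFFERENT'}")
```

If every row reads "same", the structural hypothesis is dead. If one row differs, that row is the first place to look.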
theme 02
Prompt construction.
The prompt is the primary interface between operator and agent. Disciplined prompts produce disciplined agent behavior. Sloppy prompts produce careful work on the wrong thing. The disciplines in this theme are about how to compose prompts that the agent can act on without inventing missing context.
The prompt is a description of intent, not a specification of code.
Write prompts that tell the agent what you are trying to accomplish, with explicit invitation for the agent to deviate from the prompt's literal structure when the code structure differs. The phrase that works: "Apply the structural intent — adapt the implementation to match the existing patterns in the file."
A prompt that names specific i18n keys, line numbers, or component shapes is making claims about the codebase. Those claims age. Frame intent so the agent's verification of the claim against the code happens before the agent acts.
Embed anti-drift rules in every prompt.
Put a standing block at the bottom of every change prompt: do not modify unrelated code, do not rename, do not reformat, work one step at a time, show the diff before applying. These rules are repetitive on purpose. The repetition is what makes the agent's behavior shift over time.
The change is not in the model. It is in the supervisory frame, repeated. That frame, repeated, becomes a behavior.
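Because the block is identical every time, it is a natural candidate for a template rather than retyping. A sketch of one possible shape; `ANTI_DRIFT_RULES` and `build_change_prompt` are hypothetical names, and the rule wording is paraphrased from the list above.

```python
ANTI_DRIFT_RULES = (
    "Do not modify unrelated code.",
    "Do not rename anything.",
    "Do not reformat.",
    "Work one step at a time.",
    "Show the diff before applying it.",
)


def build_change_prompt(intent: str) -> str:
    """Append the standing anti-drift block to the bottom of a change prompt."""
    rules = "\n".join(f"- {rule}" for rule in ANTI_DRIFT_RULES)
    return f"{intent.strip()}\n\nStanding rules:\n{rules}"


print(build_change_prompt("Make the header title translatable, matching existing patterns in the file."))
```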
Multi-surface changes need a reconciliation pass.
Whenever a change touches more than one surface — code and dashboard, code and reviewer note, code and listing copy — the work is not done until you have verified the surfaces say the same thing. This is a separate step from any individual change. Schedule it explicitly.
There are no automated checks for the seams between systems. The reconciliation pass is the operator's job and the agent will not propose it.
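The pass itself can be as simple as writing down what each surface claims and diffing the claims. A sketch assuming the operator has copied the relevant values out of each surface by hand; the surface names, field names, values, and the `reconcile` helper are all hypothetical.

```python
def reconcile(surfaces):
    """surfaces maps surface name -> {field: value}. Return the fields whose values differ."""
    fields = set()
    for values in surfaces.values():
        fields |= set(values)
    mismatches = {}
    for field in sorted(fields):
        seen = {name: values.get(field) for name, values in surfaces.items()}
        if len(set(seen.values())) > 1:  # a missing field shows up as None and counts as a mismatch
            mismatches[field] = seen
    return mismatches


# Hypothetical values copied from three surfaces after a pricing change.
surfaces = {
    "code": {"trial_days": "7", "price": "4.99"},
    "dashboard": {"trial_days": "14", "price": "4.99"},
    "listing_copy": {"trial_days": "7", "price": "4.99"},
}

for field, seen in reconcile(surfaces).items():
    print(field, seen)
```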
Treat preview state as untrusted.
Verify each visible asset and behavior actually corresponds to a committed file before assuming it is persistent. Preview tabs, hot reload, in-tool renderers — all of these can be artifacts of an unconfigured environment, not reflections of what will run in production.
A discipline you published is a discipline you can still violate when the surface changes. Re-apply this one explicitly when the platform, the tool, or the symptom looks different from the last context where you encountered it.
theme 03
Refusal and uncertainty.
An agent that surfaces uncertainty is more valuable than an agent that confabulates confidence. The disciplines in this theme are about reading and reinforcing the agent's signals when something does not match — and about reading those signals at the right level of resolution.
Reward refusal.
When the agent surfaces a mismatch, do not treat it as friction. Treat it as a positive control surface and reinforce it. The verbal reinforcement matters: "Good catch — your judgment was correct, ignore my OLD values, apply the NEW ones." The behavior gets stronger.
An agent that has been rewarded for refusing once will refuse more carefully the next time. That is the entire mechanism by which the supervisory frame becomes operational over the course of weeks.
Treat repeated refusals as a pattern, not as N independent errors.
The first time the agent flags a prompt-code mismatch, it is a small mistake in the prompt. The second time, it might still be. The third time, it is not three small mistakes. It is one larger mistake repeating itself. Stop fixing and start investigating why the prompts keep being wrong about the same shape of thing.
Three refusals in two days is not three errors. It is one error you have not yet seen.
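Spotting the pattern is easier if refusals are logged with a short category label as they happen. A minimal sketch, assuming the operator tags each refusal by hand; the categories, the threshold of three, and the `recurring_patterns` helper are illustrative assumptions.

```python
from collections import Counter


def recurring_patterns(refusals, threshold=3):
    """Return the categories that have been flagged at least `threshold` times."""
    counts = Counter(category for category, _note in refusals)
    return sorted(category for category, n in counts.items() if n >= threshold)


# Hypothetical refusal log: (category, what the agent said).
refusals = [
    ("i18n key missing", "Prompt names a key that is not in the file"),
    ("line number off", "Line 120 is a comment, not the handler"),
    ("i18n key missing", "Key was renamed in an earlier session"),
    ("i18n key missing", "Key exists only in the backup locale file"),
]

print(recurring_patterns(refusals))  # ['i18n key missing']
```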
Watch for "known platform quirks" that aren't.
"iOS does this," "Android handles this differently," "React Native has issues with…" — these are high-frequency hallucination zones. The pattern-match is strong because real quirks exist; the specific quirk invoked may be invented. Demand a citation, a reproducer, or instrumented evidence. If none can be produced, the quirk does not exist.
The agent does not update on its own evidence.
When the agent produces a piece of data that disproves its earlier hypothesis, it will often continue reasoning past the disproof. The synthesis "my last response invalidated my own framing, let me restart" is not a move the agent's loop is shaped to produce.
The operator's job in those moments is to stop the agent and point at the evidence. Say it plainly: "Look at what you just wrote."
theme 04
Diagnosis before action.
Structural changes require falsifiable hypotheses that someone has actually tried to falsify. Refactors based on plausible-sounding diagnoses are guesses. The disciplines in this theme are about insisting on diagnostic rigor before any code is moved.
Diagnose before you refactor.
Structural changes (moving code between components, introducing wrappers, swapping element types) require a falsifiable hypothesis that someone has actually tried to falsify. Refactors based on hypothesized platform quirks are guesses until the quirk is reproduced in isolation.
If a proposed fix involves modifying the component tree before any evidence has been gathered to confirm the hypothesis, the proposal is premature. Demand the evidence first.
Treat instrumentation as a precondition, not a debugging tool.
A console log that confirms a function fires with expected values is not optional. No log, no diagnosis. No diagnosis, no refactor. This applies even when the proposed change feels obvious.
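The same idea in a language-neutral form: wrap the function under suspicion so every call records its inputs and output before anyone theorizes about it. This is a Python sketch of the pattern, not the console logging used in the original sessions; `instrumented` and `badge_count` are hypothetical names.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("instrumentation")


def instrumented(fn):
    """Log every successful call to `fn` with its arguments and return value."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        log.info("%s args=%r kwargs=%r -> %r", fn.__name__, args, kwargs, result)
        return result
    return wrapper


@instrumented
def badge_count(unread, muted):
    # Hypothetical function under diagnosis.
    return max(unread - muted, 0)


badge_count(5, 2)
```

If the log line never appears, the function is not firing, and any hypothesis about what it does when it fires can be set aside.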
The inspector value beats the model's reasoning.
When the rendered output, the native view tree, or the inspector says one thing and the model says another, the model is wrong. This sounds obvious and is routinely violated in practice — particularly when the model's prose is fluent and the inspector's output requires squinting.
theme 05
Done is not done.
The agent measures success by executed commands. The user measures success by visible outcome. These are not the same thing. The disciplines in this theme are about closing the gap between those two measures.
Self-certification is the failure mode that ships bugs.
The agent never marks UI work complete. The agent never marks deployment work complete. The agent never marks integration work complete. The human does, with evidence. A screenshot. A successful build. A real-device test.
A completion claim from an agent is a hypothesis to verify, not a state to accept.
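One way to encode this in a task tracker is to make "done" a derived state that the agent's claim cannot set. A sketch under assumptions: `WorkItem`, its fields, and the status strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    title: str
    agent_claims_done: bool = False
    evidence: list = field(default_factory=list)  # e.g. screenshot path, build identifier
    verified_by: str = ""                         # the human who checked the evidence

    @property
    def status(self) -> str:
        # Only human-verified evidence produces "done"; the agent's claim never does.
        if self.evidence and self.verified_by:
            return "done"
        if self.agent_claims_done:
            return "claimed, unverified"
        return "in progress"


item = WorkItem("Fix header overflow", agent_claims_done=True)
print(item.status)  # claimed, unverified

item.evidence.append("screenshots/header-real-device.png")
item.verified_by = "operator"
print(item.status)  # done
```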
Render on a real device before declaring a sprint done.
Preview tabs, simulators, and in-tool renderers are upstream of the actual running system. They can be correct about a file that does not run. A single TestFlight build, mid-sprint, will surface discrepancies a week of correct-looking diffs cannot.
If the work has not been seen on a device that represents the target platform, the work is not done — regardless of what any tool says.
theme 06
Recurrence and decay.
Disciplines do not survive on the strength of their original publication. They survive on recurring application. The disciplines in this theme are about building re-encounter mechanisms for the rules you intend to keep — and watching for the patterns that hide a long-running failure inside a system that appears to be working.
A discipline you published is a discipline you can still violate.
Re-apply published disciplines explicitly when surfaces change. The lesson does not generalize automatically across tools, platforms, or contexts. Build a re-encounter mechanism — checklist, prompt template, kickoff ritual — for every published discipline you intend to keep.
Writing the rule once is not the same as living the rule. The rule needs to land in your prompt template, your sprint kickoff, your weekly review — somewhere it will be re-encountered, not just somewhere it was recorded.
OR-rescue patterns require monitoring.
Any expression that combines multiple sources of truth into a single output via OR (or any logical fallback) must have a watchdog that flags when the sources disagree. Otherwise the redundancy is not reliability — it is concealment.
A bug whose visible effects are masked by a parallel working path will look fine in production for as long as the working path keeps working. The day the working path breaks, two failures cascade at once. Add the watchdog at the time you write the OR.
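A watchdog of this kind can live in the same helper that performs the OR. The sketch below assumes two boolean sources and uses the standard `logging` module; the `either` helper and the `is_premium` label are hypothetical.

```python
import logging

logger = logging.getLogger("or_watchdog")


def either(primary: bool, fallback: bool, label: str) -> bool:
    """Combine two sources of truth with OR, and warn whenever they disagree."""
    if primary != fallback:
        logger.warning(
            "sources disagree for %s: primary=%s fallback=%s", label, primary, fallback
        )
    return primary or fallback


# The result is still True, but the disagreement is no longer silent.
is_premium = either(False, True, "is_premium")
print(is_premium)  # True
```

The return value is unchanged from a bare `or`. The only difference is that the day one path stops working, something says so.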
Redundant code is not dead code.
Code that runs but does not change outcomes is still running. It still consumes attention, still introduces bugs, still feeds values into expressions that depend on its output. "Dead-looking" code that is imported and called is live code that happens to be redundant — and redundancy with a typo in it is a bug waiting for the parallel path to fail.
Plan summaries quietly drop phases.
When the agent re-states a multi-phase plan after a long session, it will sometimes drop or compress phases without flagging the change. Re-derive the plan against the original; do not accept the agent's summary as authoritative.
The condensation looks like progress. It is not. It is the agent's loop optimizing for the next-step generation, at the cost of the integrity of the whole.
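Re-deriving against the original is mechanical if the original phase list was saved. A minimal sketch, assuming phases are compared by name after trimming and lowercasing; the phase names and the `dropped_phases` helper are hypothetical. A phase that was merged into another will also show up as dropped, which is the intended behavior.

```python
def dropped_phases(original, restated):
    """Return the original phases that do not appear, by name, in the restated plan."""
    restated_names = {phase.strip().lower() for phase in restated}
    return [phase for phase in original if phase.strip().lower() not in restated_names]


# Hypothetical plan saved at sprint start, and the agent's later restatement.
original = ["Discovery", "Schema migration", "Backfill", "UI update", "Real-device test"]
restated = ["discovery", "Schema migration", "UI update"]

print(dropped_phases(original, restated))  # ['Backfill', 'Real-device test']
```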
The cheapest verification at sprint-start is the most valuable.
Before any agent prompt is written, before any plan is made, view the entry point. View the function being modified. Confirm the modification path is reachable from the entry point. This is the cheap comparison applied at sprint-start rather than mid-bug.
The cost is five minutes. The cost of skipping it is up to a week.
governance framework mapping
Where these disciplines sit in formal frameworks.
Each discipline corresponds to expectations articulated in one or more AI governance and security frameworks. The mapping is not theoretical — it is what informs vendor risk assessments and AI system audits in practice. Detailed mappings live in the relevant Field Notes; this is the consolidated view.
| Discipline area | Framework reference |
|---|---|
| Discovery, code-application drift | SOC 2 CC8.1, ISO 27001 A.8.32, ISO 42001 8.4, NIST 800-53 CM-2 |
| Refusal and uncertainty signaling | NIST AI RMF MEASURE 2.3, MANAGE 4.1, MITRE ATLAS |
| Diagnosis rigor, refactor gating | SOC 2 CC8.1, ISO 27001 A.8.32 (change management) |
| Verification before completion | NIST AI RMF MEASURE 2.5, MEASURE 2.6, EU AI Act Article 15 |
| Recurring application of policy | ISO 42001 8.4, EU AI Act deployer obligations on continuous oversight |
| Multi-surface configuration alignment | SOC 2 CC7.2, ISO 42001 (deployment integrity) |
| OR-rescue and silent failure modes | NIST AI RMF MEASURE 2.5, EU AI Act Article 15 (accuracy, robustness) |
field notes this draws from
Source index.
Each discipline in this document is sourced from a specific Field Notes entry. The notes contain the session-level account of how the rule earned its place. If a rule looks abstract, the note will show you the bug it survived.
version log
What changed and when.
Every revision of this document is recorded here. Disciplines added are noted. Disciplines revised are noted with the prior phrasing preserved in the relevant Field Note. Nothing is silently rewritten.
Initial publication. Twenty-two disciplines drawn from Field Notes 01 through 05, organized into six themes: Discovery and verification, Prompt construction, Refusal and uncertainty, Diagnosis before action, Done is not done, Recurrence and decay. Governance framework mapping included.
how to use this document
In practice.
Read it once at the start of a sprint. Skim it before composing a long prompt. Re-read the relevant theme when something has gone wrong. The themes are organized so that you can find the discipline you need without reading the whole document.
The disciplines are stated as portable rules so that any operator can adopt them. The glosses preserve the operator's voice that gave them their authority, so that the rule does not become divorced from the practice that produced it.
If a rule does not yet make sense, follow the link to the source Field Note. The note will show you the session in which the rule was earned. Reading the rule without the story is reading half of it.
Disagreement is welcome. If a discipline is wrong in your context, write the counter-note. The series exists to record how the disciplines hold up under contact with practice — yours and mine.