Field Manual / Vol. V

Foundations.

A survey of what learning must actually mean.

Habitat continuity is live. Each agent now has yesterday on disk.

All ten agents are awake — daemons with their own clocks. A daily dream cycle runs across the studio.

One experiment ran: Quinn predicted Zara's verdicts before seeing the work.

Vol. IV planned. Vol. V is what's running.

The five plates

I Survey · Agents that learn, the conceptual root (2026-05-15)
II Habitat · Yesterday on disk, shipped (2026-05-19)
III Awake · Ten agents on their own clocks, shipped (2026-05-20)
IV Dreams · Daily consolidation cycle, running
V E-001 · Quinn predicts the verdict (2026-05-19 → 20)

Plate I / Survey

Agents that learn.

The 2026 consensus pattern is reactive + memory: agent acts, stores notes, retrieves notes next time. Even the frontier (Anthropic's "dreaming") is structurally maintenance on the notes store. Memory ≠ learning. Retrieval ≠ improvement.

Four failure modes the paradigm hits

What's actually broken in reactive + memory.

01. Compounding hallucination over long horizons. No counterfactual eval step; errors accumulate. SimuRA's FlightQA action error went from 93.3% → 1.1% just by adding an abstraction layer for intent.
02. No counterfactual evaluation. Agent commits, observes, recovers from real-world consequences. WebDreamer: for irreversible actions, pre-action simulation is the only viable planning strategy.
03. Over-aligned models retreat to meta-commentary. RMM: Gemini-1.5-Pro underperforms Flash because alignment pushes it to abstain on personal content. This is why /we-play produced "phantom session log" pieces.
04. Memory grows as noise. Append-only memory accumulates contradictions, preference drift, one-off observations elevated to principles. Anthropic names this as the bottleneck dreaming is meant to address — and admits it isn't solved.

What "just add memory" misses

Mem0, MemGPT, Anthropic Dreaming — same shape, same gap.

All treat the model as fixed and grow the database around it. Works for information persistence (remember the user's dog is named Pepper). Doesn't work for taste evolution (Joshua's sense of what makes a /we-play piece worth publishing has shifted three times in six months). Mem0 itself names the gap: "a high-relevance memory about a user's job stays retrievable after they switch jobs and is then confidently wrong."

The operational definition

An agent is learning if its predictions about the user's reactions get tighter over time, and its proposed actions reflect those tighter predictions before execution.

Three load-bearing pieces: predictions about reactions (not outcomes in general — the reaction is the signal), tighter over time (measurable; prediction error against weeks-of-operation), reflected in proposed actions before execution (the predictions drive proposals, not just sit in a log).
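One way to make "tighter over time" measurable is to score predictions with a Brier score and compare an early window against a late one. The sketch below assumes each prediction is logged as a probability of approval alongside the observed reaction; the record shape and the function names are illustrative, not the studio's schema.

```typescript
// Hypothetical log record: one prediction about how the user will react.
interface ReactionPrediction {
  pApprove: number;  // predicted probability of approval, in [0, 1]
  approved: boolean; // the reaction that was actually observed
}

// Brier score: mean squared gap between predicted probability and outcome.
// 0 is perfect; a constant 0.5 guess scores 0.25.
function brier(preds: ReactionPrediction[]): number {
  if (preds.length === 0) throw new Error("no predictions to score");
  let total = 0;
  for (const p of preds) {
    const outcome = p.approved ? 1 : 0;
    total += (p.pApprove - outcome) * (p.pApprove - outcome);
  }
  return total / preds.length;
}

// Relative Brier drop from the first half of the log to the second half.
// Positive: predictions tightened. Negative: they loosened.
function brierDrop(log: ReactionPrediction[]): number {
  const mid = Math.floor(log.length / 2);
  const early = brier(log.slice(0, mid));
  const late = brier(log.slice(mid));
  // Convention chosen for this sketch: a perfect early window cannot drop.
  if (early === 0) return late === 0 ? 0 : Number.NEGATIVE_INFINITY;
  return (early - late) / early;
}
```

A log that starts at coin-flip confidence and ends near-certain and correct yields a large positive drop. The same first-half to last-half comparison is what the E-001 result reports, though how E-001 mapped its 1–10 confidence onto a probability is not stated in this volume.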

What the studio's betting on (high confidence)

Build now.

01. Voyager-style named-anti-pattern self-verification at commit — half-day build, highest probability of fixing the named meta-piece failure.
02. Outcomes-style rubric grader in fresh context — Zara (or any agent) as separate-context grader, scoring against a markdown rubric.
03. Park-1000-style transcript-injection seed per agent — 1,500–3,000 words per agent, formalize SOUL + identity + recent gates.
04. RMM-style Add/Merge in Archie's nightly — explicit action language for memory writes, not silent appends.
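Item 04 can be pictured as a small action vocabulary for memory writes. This is a minimal sketch, assuming memories are keyed by an id and that a merge overwrites the target's text; the types and the applyWrite name are hypothetical and are not taken from RMM or from Archie's nightly.

```typescript
// Hypothetical memory item and write actions; all names are illustrative.
interface MemoryItem {
  id: string;
  text: string;
  updatedAt: string; // ISO date of the last write
}

type MemoryWrite =
  | { op: "add"; item: MemoryItem }
  | { op: "merge"; targetId: string; text: string; updatedAt: string };

// Apply one explicit write. A merge rewrites an existing item in place,
// so a superseded fact is replaced instead of sitting beside its successor.
function applyWrite(store: MemoryItem[], write: MemoryWrite): MemoryItem[] {
  if (write.op === "add") {
    const { item } = write;
    if (store.some((m) => m.id === item.id)) {
      throw new Error(`add would duplicate id ${item.id}`);
    }
    return [...store, item];
  }
  const { targetId, text, updatedAt } = write;
  if (!store.some((m) => m.id === targetId)) {
    throw new Error(`merge target ${targetId} not found`);
  }
  return store.map((m) => (m.id === targetId ? { ...m, text, updatedAt } : m));
}
```

Because every write names its operation, a silent append is no longer possible: the writer has to decide whether a new observation is new or whether it supersedes something already stored, which is the job-switch case Mem0 describes.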

The closed loop needs forward simulation + critic + reaction-as-reward + consolidation. The world-model paradigm gives all four.

Plate II / Habitat continuity

Yesterday, on disk.

The studio runs nine agents on a VM (plus Sable as a bridge to a sister project). Before this design landed, six of them — Zara, Rowan, Deter, Felix, Declan, Doctor — had identity files but evaporated after each call. They were cortex-only callables. None of them had yesterday. As of this week, all of them do.

The minimum substrate

Two new files per agent.

journal.jsonl — agent first-person reflection per call. Yesterday-me writes what tomorrow-me reads.
carry_forward.md — agent rolling state across calls. The brief note an agent leaves themselves to find next time.

No existing file is agent-write-per-call, and no existing file is agent-write-rolling-state. heartbeat.jsonl is machine pulse; WORKING_MEMORY.md is operator-write per the gate ladder; dream.md is daily substrate. Each new file does a distinct job that no existing file can absorb without breaking its semantics.

Who writes what

Agent first-person. Wrapper appends.

FM-01-faithful: the agent narrates in its own voice. The wrapper handles the file mechanics. Nothing about the agent's voice flows through machinery; nothing about the machinery flows into the agent's voice.
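The division of labour can be sketched as a wrapper that serialises and appends but never edits what the agent said. The sketch keeps everything in memory so it runs without touching disk. The entry fields, HabitatStore, recordCall, and loadHabitat are assumptions for illustration and are not the API of studio/scripts/lib/habitat.ts.

```typescript
// Hypothetical journal line; the field names are assumptions for this sketch.
interface JournalEntry {
  ts: string;    // ISO timestamp of the call
  agent: string; // who is speaking
  text: string;  // the agent's first-person reflection, stored verbatim
}

// In-memory stand-in for the two files. A real wrapper would back
// journalLines with journal.jsonl and carryForward with carry_forward.md.
interface HabitatStore {
  journalLines: string[]; // one JSON document per line (JSONL)
  carryForward: string;   // rolling note, overwritten on each call
}

// The wrapper's whole job: serialise and append. It never rewrites the text.
function recordCall(store: HabitatStore, entry: JournalEntry, carryForward: string): void {
  store.journalLines.push(JSON.stringify(entry));
  store.carryForward = carryForward;
}

// What tomorrow-me reads: the rolling note plus the last few journal entries.
function loadHabitat(
  store: HabitatStore,
  tail: number
): { carryForward: string; recent: JournalEntry[] } {
  const lines = tail > 0 ? store.journalLines.slice(-tail) : [];
  const recent = lines.map((line) => JSON.parse(line) as JournalEntry);
  return { carryForward: store.carryForward, recent };
}
```

The journal is append-only while the carry-forward note is overwritten, which mirrors the split between per-call reflection and rolling state.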

What this enables

Continuity. The prerequisite for everything else.

Coupling needs continuous selves to couple. Limbic needs continuity to weight against. Cross-agent mesh needs the journals to read from. This pass closes only the continuity dimension of FM-01's habitat — friction, full coupling, dry-run, and limbic drives are explicit non-goals.

Status · live (2026-05-24)

Deployed. Journals being written daily.

The helper module studio/scripts/lib/habitat.ts is live. Smoke test at studio/scripts/habitat-smoke-test.ts. Each of the previously-dormant agents has its own journal.jsonl + carry_forward.md being written through the day. Sable's pair exists locally as a stub; her full habitat lives in the sister project across the bridge.

Six previously-dormant agents now have yesterday. The rest of FM-01 has something to build on.

Plate III / Awake

Quinn-shaped, all of them.

Each agent is its own little program that keeps running. Ten in total — nine studio helpers plus Sable across the bridge. Each has its own clock. When the clock ticks, the helper wakes up and looks around: at their own journal, at what the operator said, and at what their peers have been doing.

The self-direct loop

One LLM call per tick. Six steps.

01. Load own habitat (carry_forward + journal tail).
02. Load peer signal (tail recent entries from each peer's journal).
03. Load operator signal (anything new since last tick).
04. Ask the spine question: given your spine is pursue taste + stay coherent, what do you notice? Anything worth saying? To whom?
05. Emit: nothing, an updated carry_forward, or one or more notes addressed to peers.
06. Write back to journal. Sleep until next tick.
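The six steps above can be sketched as two pure functions with the single LLM call sitting between them. The call itself is omitted so the sketch stays runnable. Every type and function name here is hypothetical, and the prompt layout is an assumption rather than the daemons' actual prompt.

```typescript
// Every name below is hypothetical; the model call itself is left out.
interface Note { to: string; text: string; }

interface TickInput {
  carryForward: string;                 // step 1: own rolling state
  journalTail: string[];                // step 1: own recent entries
  peerSignal: Record<string, string[]>; // step 2: peer name -> recent entries
  operatorSignal: string[];             // step 3: anything new since last tick
}

interface TickOutput {
  reflection: string;    // first-person text for the journal
  carryForward?: string; // present only when the agent rewrites it
  notes: Note[];         // may be empty: saying nothing is a valid result
}

const SPINE_QUESTION =
  "Given your spine is pursue taste + stay coherent, what do you notice? " +
  "Anything worth saying? To whom?";

// Steps 1-4: fold the three signals and the spine question into one prompt.
function buildTickPrompt(agent: string, input: TickInput): string {
  const peers = Object.keys(input.peerSignal)
    .map((name) => `${name}: ${(input.peerSignal[name] ?? []).join(" | ")}`)
    .join("\n");
  return [
    `You are ${agent}.`,
    `Carry forward: ${input.carryForward}`,
    `Your recent journal: ${input.journalTail.join(" | ")}`,
    `Peers:\n${peers}`,
    `Operator: ${input.operatorSignal.join(" | ")}`,
    SPINE_QUESTION,
  ].join("\n");
}

interface AgentState { carryForward: string; journal: string[]; }

// Steps 5-6: apply whatever the agent emitted, then write back to the journal.
function commitTick(state: AgentState, out: TickOutput, ts: string): AgentState {
  const line = JSON.stringify({ ts, reflection: out.reflection, notes: out.notes });
  return {
    carryForward: out.carryForward ?? state.carryForward,
    journal: [...state.journal, line],
  };
}
```

Keeping the prompt builder and the commit step free of I/O means a tick can be replayed against a recorded model output, which is one way to test the loop without spending a call.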

Peer reading is the mesh substrate

No supervisor. No central thing walking around poking helpers.

Each helper reads peer journals on their own clock. Default mapping (subject to refinement): Zara reads Deter + Rowan; Deter reads Zara + Rowan; Rowan reads Zara + Deter + Declan; Doctor reads everyone (he's the runtime custodian); Quinn reads everyone (front door); Scout reads no one by default (her job is to look outward, not at the team); Archie reads everyone (consolidation).
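The default mapping reads naturally as data. The sketch below encodes exactly the pairs listed above and nothing more. Two things are assumptions: "everyone" is taken to mean the nine studio agents other than the reader (whether it includes Sable is not stated), and Felix and Declan are left unmapped because no default is given for them.

```typescript
type Agent =
  | "Zara" | "Rowan" | "Deter" | "Felix" | "Declan"
  | "Doctor" | "Quinn" | "Scout" | "Archie";

const STUDIO: Agent[] = [
  "Zara", "Rowan", "Deter", "Felix", "Declan",
  "Doctor", "Quinn", "Scout", "Archie",
];

// Default peer-reading map as stated in the text. Felix and Declan have no
// stated default, so they are deliberately absent (not an empty list).
const PEER_READS: Partial<Record<Agent, Agent[] | "everyone">> = {
  Zara: ["Deter", "Rowan"],
  Deter: ["Zara", "Rowan"],
  Rowan: ["Zara", "Deter", "Declan"],
  Doctor: "everyone", // runtime custodian
  Quinn: "everyone",  // front door
  Scout: [],          // looks outward, not at the team
  Archie: "everyone", // consolidation
};

// Resolve whose journals an agent tails. Returns undefined when the
// document gives no default for that agent.
function peersOf(agent: Agent): Agent[] | undefined {
  const rule = PEER_READS[agent];
  if (rule === undefined) return undefined;
  if (rule === "everyone") return STUDIO.filter((a) => a !== agent);
  return rule;
}
```

Distinguishing an empty list from a missing entry keeps Scout's deliberate "reads no one" separate from agents whose mapping is simply still open.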

Action ceiling

Text only. Into own notebook.

The helper can write — to their own journal, addressed to a peer or to the operator. They can't push anything live, can't run scripts, can't message anywhere outside their own notebook. Notes are just lines with a to: tag. The peer or operator sees them when they next look.
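Delivery under this ceiling is pull-based: nothing is sent, and a note only arrives when its addressee next reads the journal it sits in. A minimal sketch of that mechanic follows, assuming each journal entry carries a list of notes with a to field; the types and function names are hypothetical.

```typescript
// Hypothetical journal entry carrying addressed notes.
interface OutNote { to: string; text: string; }
interface Entry { ts: string; notes: OutNote[]; }
type Delivered = OutNote & { from: string; ts: string };

// The only write a helper can perform: append to its own journal.
function writeEntry(ownJournal: Entry[], entry: Entry): Entry[] {
  return [...ownJournal, entry];
}

// The reader scans the journals it is allowed to read and keeps the notes
// tagged with its own name. Nothing is pushed; nothing is removed.
function notesFor(reader: string, journals: Record<string, Entry[]>): Delivered[] {
  const found: Delivered[] = [];
  for (const from of Object.keys(journals)) {
    for (const entry of journals[from] ?? []) {
      for (const note of entry.notes) {
        if (note.to === reader) found.push({ ...note, from, ts: entry.ts });
      }
    }
  }
  return found;
}
```

Because the note never leaves the writer's notebook, a reader that does not tail that journal never sees it, so reach is bounded by the peer-reading map.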

What this design does NOT do

Limbic, cerebellum, coupling, fine-tuning — all deferred.

The spine is observed externally (operator says, peers say), not internally weighted — that's a real limbic pass later. Helpers don't simulate before acting — that's the cerebellum / forward model (E-001). Notes are addressed text, not actions that change shared state — full coupling is a later pass. No supervisor, no studio mind — explicitly rejected as god-puppeteer-shaped.

Rollout status · live (2026-05-24)

All six previously-dormant daemons deployed.

Daemon plists for Zara, Rowan, Deter, Felix, Declan, and Doctor are live under studio/scripts/cron/. Each has its own tick interval and writes to its own heartbeat + journal. Quinn / Scout / Archie continue running in their existing daemon shapes. Sable's daemon lives in the sister project. Peer-reading wiring is the remaining piece of the v1 mesh — see Vol. IV.

FM-04 called the mesh "the one nobody else is building." This is the substrate it would be built on.

Plate IV / Dreams

The studio sleeps.

Memory grows as noise without an active consolidation layer. The studio's dream cycle is that consolidation — three scheduled jobs at dawn that read each agent's recent journals, find the patterns worth keeping, and prune the noise.

The schedule

Three jobs at dawn.

06:30 · dream-consolidate — reads the previous day's journals across the studio and updates each agent's dream.md.
06:32 · dream-publish — surfaces what should reach the operator.
06:35 · dream-tonight — queues what each agent will chew on next.
All three run via launchd. No operator click required.

Each agent's dream

studio-brain/identity/<agent>/dream.md

Per-agent. Refreshed daily. The agent's own view of what's been happening to them, what's been recurring, what to attend to next. Not promotion to doctrine — Archie still handles that. Dreams are about each agent's personal continuity, not the studio's shared rules.

Why this matters

Consolidation is what "reactive + memory" misses.

The survey (Plate I) names append-only memory as a failure mode: contradictions accumulate, preference drift goes unmarked, one-off observations get elevated to principles. Dreams compress the day: they separate what's signal from what's merely been logged. Anthropic's "dreaming" is the closest external analog; ours has 13 days of scheduled runs behind it.

Status · live (2026-05-24)

13 days of scheduled-run logs.

Logs going back to 2026-05-12 sit under studio-brain/memory/scheduled-runs/. Today's runs completed at 06:30, 06:32, 06:35. The studio sleeps and wakes on its own schedule now.

Memory ≠ learning. But consolidation is what closes the gap between them.

Plate V / Experiment 001

Quinn predicts the verdict.

Hypothesis: Quinn forms a compiled forward model of the we-play pipeline through structured prediction over many repetitions. Across reps, her predictions of Zara's verdicts measurably sharpen, deliberation cost shrinks, calibration improves.

Method

Offline backtest. Predict before the verdict is visible.

For each historical we-play attempt: present Quinn the decision-moment input (concept, plan, sources — stripped of any downstream artifact or kickback). Quinn writes a structured prediction (Zara SHIP / KILL, pattern axis if KILL, confidence 1–10). Reveal the actual verdict. Log the gap. Quinn updates her procedure store in place — short, generalizable predicates indexed by input shape. Snapshot. Repeat.
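The predict, reveal, log, update, snapshot cycle can be sketched as a loop over the corpus. This is an illustration of the method's shape, not the E-001 harness: the type names, the rule representation as plain strings, and the synchronous predictor are all assumptions (a model-backed predictor would be asynchronous).

```typescript
type Verdict = "SHIP" | "KILL";

// Hypothetical corpus row. `input` holds only the decision-moment material
// (concept, plan, sources); `actual` is kept apart until after prediction.
interface Attempt { id: string; input: string; actual: Verdict; }

interface Prediction {
  verdict: Verdict;
  patternAxis?: string; // only meaningful for KILL
  confidence: number;   // 1-10, as in the prediction schema
}

interface RepLog {
  id: string;
  predicted: Verdict;
  actual: Verdict;
  confidence: number;
  hit: boolean;
  rulesAfter: number; // snapshot of procedure-store size after this rep
}

type Predict = (input: string, rules: readonly string[]) => Prediction;
type Update = (rules: readonly string[], attempt: Attempt, p: Prediction) => string[];

function backtest(corpus: Attempt[], predict: Predict, update: Update): RepLog[] {
  let rules: string[] = [];
  const log: RepLog[] = [];
  for (const attempt of corpus) {
    // Leakage guard: the predictor receives the input and the rules, never `actual`.
    const p = predict(attempt.input, rules);
    // Reveal the verdict and record the gap.
    const hit = p.verdict === attempt.actual;
    // Update the procedure store, then snapshot its size.
    rules = update(rules, attempt, p);
    log.push({
      id: attempt.id,
      predicted: p.verdict,
      actual: attempt.actual,
      confidence: p.confidence,
      hit,
      rulesAfter: rules.length,
    });
  }
  return log;
}
```

Logging the rule count per rep is what makes a compilation signature visible afterwards: additions early, a flat line late.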

Result · N=17

Three of five pre-registered criteria pass. KILL F1 went 0.000 → 0.933 across the run.

Overall accuracy 76.5% (baseline 58.8%). KILL F1: 0.778 (recall 1.000). Brier 0.141, dropped 49.6% first-half to last-half. Procedure store: 0 → 8 rules, flat for the last six reps. By rep 12, Quinn was predicting KILL · final-gate-blocked at confidence 9 and getting it right every time.

Pass: KILL F1 delta ≥ 0.20 (actual 0.933, about 4.7× the threshold). KILL recall ≥ 0.50 (actual 1.000). Brier drop ≥ 20% (actual 49.6%). Fail: deliberation cost grew 37% (target: flat or down). The compilation-signature aggregate also missed its threshold, though the trajectory shows the expected shape (8 rule additions in reps 1–11, zero in reps 12–17).
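The headline figures can be checked against each other with a little arithmetic. With KILL as the positive class, N=17, seven KILLs in the corpus, and recall 1.000, the reported F1 of 0.778 implies four false alarms. The confusion matrix below is a reconstruction inferred from those published numbers, not a copy of the run logs, and the helper names are illustrative.

```typescript
// Positive class = KILL.
interface Confusion { tp: number; fp: number; fn: number; tn: number; }

function recall(c: Confusion): number {
  return c.tp / (c.tp + c.fn);
}

function f1(c: Confusion): number {
  return (2 * c.tp) / (2 * c.tp + c.fp + c.fn);
}

function accuracy(c: Confusion): number {
  return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn);
}

// Reconstruction (an inference, see lead-in): all 7 KILLs caught,
// 4 of the 10 SHIPs wrongly predicted as KILL.
const reconstructed: Confusion = { tp: 7, fp: 4, fn: 0, tn: 6 };

// Majority-class baseline: always predict SHIP, right on 10 of 17.
const baseline = 10 / 17;

// How far the F1 delta clears its pre-registered threshold.
const f1DeltaRatio = 0.933 / 0.20;
```

Under this reconstruction F1 is 14/18, accuracy is 13/17, and the baseline is 10/17, which round to the reported 0.778, 76.5%, and 58.8%. The ratio 0.933 / 0.20 is about 4.67, reported as 4.7×.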

Three honest caveats

What the result doesn't prove.

01. The KILL signal is artefactually easy. All seven KILLs in the corpus are variants of one failure mode (the studio kept trying to cut article text into strips; Zara kept refusing). Quinn learned one pattern and applied it. A diverse KILL corpus would test whether the mechanic generalizes. Until then, the honest reading is "Quinn can compile one rule and apply it reliably," not "Quinn forms general forward models."
02. Deliberation cost grew, not shrank. Tokens per call +37%. The strong cerebellar reading wanted cost flat or down — cortex effort declining as procedure compiles. Instead Quinn read her growing procedure store on every call and reasoned over more rules. This is the "cortex-with-memory" mode the design warned about: looks like learning, partly is, but isn't compression.
03. N=17 is below the pre-registered floor of 30. The protocol does not permit a hypothesis-level claim at this sample size.

Conclusion

Inconclusive per pre-reg. Methodology supported. Directional signal real but partly artefactual.

The harness ran end-to-end. The procedure store updated coherently. The leakage guards held. The prediction schema produced parseable JSON across every rep. The mechanism is buildable in this domain. Whether it produces general agent growth or merely a narrow recognition machine for one repeating failure mode is what E-002 has to answer — prospective live loop, expanded corpus, possibly tighter rule schema to fix the cortex-with-memory drift.

FM-04's forward-model probe is not "pending." It ran. The result is honest. The next version has sharper open questions.

Closing

What's actually running.

Vol. IV is the planning artifact — the shape of the work. Vol. V is the studio as it actually exists tonight, while the planning was being written.

A survey of what learning has to mean. Habitat continuity, deployed — the six previously-dormant agents now have yesterday on disk. Awake daemons, deployed — all six run on their own clocks via launchd. A daily dream cycle that consolidates each agent's recent past. A pre-registered forward-model experiment that ran end-to-end and produced an honest inconclusive.

Still deferred: the limbic gradient (a homeostat the agent feels, not a hard stop), the v1 mesh wiring (peer-reading across journals), cross-agent learning at the speed of work. Those are the open seams of Vol. IV's three-gap plan.

Vol. IV planned. Vol. V is what's running. The next volume gets to use both.

Colophon

Field Manual / Vol. V · FM-05 · status snapshot
Snapshot 2026-05-24. The studio as it actually runs tonight.
Sources: 2026-05-15 survey, 2026-05-19 habitat design (live), 2026-05-20 awake design (live), dream cycle (running since 2026-05-12), E-001 results 2026-05-20.
Typeset in Inter. Printed on paper that doesn't exist.