Field Manual / Vol. IV

The probes.

Vol. III named three gaps: forward model, real wanting, cross-agent mesh.

They aren't a list. They're a stack. Forward enables limbic. Limbic enables what the mesh carries.

Make the mesh. Adapt the limbic. Build the forward model.

This is the work that would get the operator to optional.

Reframe

The gaps are a stack, not a list.

Think of it as a body. Forward model is the part that knows where your foot will land before it lands — the predict-then-step that keeps you walking in the dark. Limbic is the discomfort that fires when you start tilting, without anyone telling you. Mesh is what two bodies say to each other once they can stand. Two bodies that can't keep themselves upright can't teach each other balance.

The chain

Bottom enables top.

A forward model gives an agent something it can be wrong about. The gap between prediction and outcome is the raw signal. A real drive (limbic) is what hurts when reality diverges from prediction — it needs prediction error to have anything to do. Peer critique (mesh) carries those observations between agents, but only when they're sharp enough to metabolize without contamination.

The triage

Make the mesh. Adapt the limbic. Build the forward model.

Mesh. Make it. No prior art, no SDK. Nobody else is solving peer-to-peer agent learning. Original work, every line of it.

Limbic. Adapt it. The shape is known from neuroscience and RL: intrinsic motivation, free energy, curiosity-driven exploration. Porting the known shape, not inventing a new one.

Forward model. Build it. The architecture is known (predict-in-latent, action-conditioned generation, dry-run-before-act). The studio's substrate is not: it is symbolic and aesthetic, not physical. Build it from LLMs + embeddings + corpus, not lifted from Genie / V-JEPA / Dreamer.

The sequencing rule

Bottom of the stack enables top, but timing depends on work type.

Stack order is forward → limbic → mesh. But mesh and forward can move in parallel because they're different kinds of work — one invention, one assembly. Limbic waits because it's downstream of both forward-error AND habitat continuity.

Decide where to invent vs. where to integrate. Then sequence by what enables what.

Probe 01 · Cross-agent mesh

One pair. One channel. Three weeks.

The mesh is the one to start with. No prior art. No SDK. Nobody else is solving this. The probe isn't "build the mesh." It's "find out what contaminates and what doesn't when two agents critique each other directly."

The hard constraint

High bandwidth + high fidelity + no contamination, all three at once.

Think of a jazz quartet. The drummer hears what the bassist is doing the instant it happens (bandwidth). She reads not just the notes but where they are heading next (fidelity). She does not start playing bass lines — she stays a drummer (no contamination). That is the shape peer learning has to take.

Merge memories and voice bleeds — the studio becomes one big agent in nine hats. Keep silos and the studio stays where it is. The whole design problem is finding a channel shape that holds all three.

The smallest probe

Pick one pair. Build the thinnest channel. Run three weeks.

Measure what bleeds, what holds, what the operator's role becomes inside it. If the channel shape works, generalize to more pairs before adding layers. If it doesn't, redesign the channel.

Open design questions

Operator picks before probe starts.

Unit of exchange. Whole working-memory entry? Single observation? Nominated doctrine candidate?
Push or pull. Inbox model (A drops critique into B's inbox) or journal model (B reads A's recent critiques on wake)?
Trust scope. Per-domain ("Zara trusts Quinn on briefs, not on visual taste") or global?
n=1 escalation signal. Disagreement between peers? Critique that conflicts with receiving agent's doctrine? Confidence threshold?
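
One way to pin these choices down before the probe starts is to write them as types. The sketch below is a minimal illustration, not studio code: every name in it (Critique, TrustMap, InboxChannel, JournalChannel) is hypothetical, and the logical clock tick stands in for whatever timestamp the real channel would use.

```typescript
// Hypothetical sketch: none of these names come from the studio codebase.
type Unit = "memory_entry" | "observation" | "doctrine_candidate";

interface Critique {
  from: string;
  to: string;
  domain: string; // e.g. "briefs" or "visual"
  unit: Unit;
  body: string;
  sentAt: number; // logical clock tick, used only for ordering
}

// Trust scope, per-domain: trust[receiver][sender] lists the domains accepted.
type TrustMap = { [receiver: string]: { [sender: string]: string[] } };

function trusts(map: TrustMap, c: Critique): boolean {
  const bySender = map[c.to];
  if (bySender === undefined) return false;
  const domains = bySender[c.from];
  return domains !== undefined && domains.indexOf(c.domain) !== -1;
}

// Push (inbox model): the sender drops the critique into the receiver's inbox.
class InboxChannel {
  private inboxes: { [agent: string]: Critique[] } = {};

  push(c: Critique): void {
    const box = this.inboxes[c.to] || [];
    box.push(c);
    this.inboxes[c.to] = box;
  }

  // The receiver empties its inbox when it next runs.
  drain(agent: string): Critique[] {
    const box = this.inboxes[agent] || [];
    this.inboxes[agent] = [];
    return box;
  }
}

// Pull (journal model): the sender writes only to its own journal;
// the receiver reads entries addressed to it, newer than its last wake.
class JournalChannel {
  private journals: { [author: string]: Critique[] } = {};

  write(c: Critique): void {
    const journal = this.journals[c.from] || [];
    journal.push(c);
    this.journals[c.from] = journal;
  }

  readOnWake(reader: string, author: string, lastWake: number): Critique[] {
    const journal = this.journals[author] || [];
    return journal.filter(e => e.to === reader && e.sentAt > lastWake);
  }
}
```

The difference that matters for contamination is who holds the state: in the inbox model the sender writes into the receiver's space, in the journal model the receiver decides when and how much to read.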

Pair candidates · Pilot

Two pairs in parallel. Different dimensions of the same channel.

Quinn ↔ Zara — briefs and direction. Low-frequency, high-stakes critique. Tests the channel against direction-shaped exchanges.

Felix ↔ Deter — build / QA loop. High-frequency, craft-level critique (PASS / FAIL on each artifact). Tests the channel when peers exchange many small observations per day.

Running both surfaces design constraints sooner than running them serially.

Pair candidates · Staged in v2

After the pilot proves the channel shape.

Declan ↔ Zara: voice + visual at the same critique table.
Declan ↔ Rowan: voice + mechanism; already coupled via Rowan-approval in Declan's decision rule.
Quinn ↔ Archie: Quinn writes lessons, Archie patterns them; mesh shortens the loop.
Zara ↔ Rowan: direction × strategy; the existing zara_X_rowan_SYNC.md artifact would be absorbed.

Triple collabs

n=3 is real. The channel has to handle it.

Some critique loops are triples in practice — Zara ↔ Rowan ↔ Declan show up together in synthesis docs. Aggregate or chain? A dinner table where everyone hears everyone (fast but loud, prone to crosstalk), or a relay where each peer adds before passing (slower, more deliberate, prone to bottlenecks). The pilot (n=2) doesn't answer this — but the channel shape chosen at n=2 either generalizes to n=3 or it doesn't, and that's the data we want from v2.
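
The aggregate-or-chain choice can be made concrete by counting who hears whom. The sketch below is illustrative only; the Delivery shape and both function names are hypothetical.

```typescript
// Hypothetical sketch of the two n=3 topologies; names are illustrative.
interface Delivery {
  from: string;
  to: string;
  round: number;
}

// Aggregate ("dinner table"): every peer hears every other peer in one round.
function aggregate(peers: string[]): Delivery[] {
  const out: Delivery[] = [];
  for (const from of peers) {
    for (const to of peers) {
      if (from !== to) out.push({ from, to, round: 1 });
    }
  }
  return out;
}

// Chain ("relay"): each peer adds to the critique, then passes it on.
function chain(peers: string[]): Delivery[] {
  const out: Delivery[] = [];
  let previous: string | null = null;
  let round = 0;
  for (const peer of peers) {
    if (previous !== null) {
      round += 1;
      out.push({ from: previous, to: peer, round });
    }
    previous = peer;
  }
  return out;
}

const triple = ["Zara", "Rowan", "Declan"];
console.log(aggregate(triple).length, chain(triple).length); // prints: 6 2
```

For n peers the aggregate shape costs n(n-1) deliveries in one round, and the chain costs n-1 deliveries over n-1 rounds. That is the "fast but loud" versus "slower, prone to bottlenecks" trade in numbers.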

Operator-in-the-loop is the training phase. The mesh is the graduation.

Probe 02 · Forward model

Predict one decision. See what it teaches.

A chef tastes the sauce before plating it. The forward model is the chef's mouth. Today's studio agents send the plate out and read the Yelp review. The chef adjusts before serving. The studio adjusts after the meal is over.

The architecture ports. The artifacts don't. The studio's substrate has to be built from LLMs + embeddings + the studio's own corpus — not lifted from Genie or V-JEPA. The world-model labs are building this for physics. Here, it gets built for taste.

What ports

Predict-in-latent-space, action-conditioned generation, dry-run-before-act.

The architectural pattern is the same as Genie 3, V-JEPA 2, DreamerV3, Cosmos. Given current state plus a candidate action, predict the next state before the action happens. The agent simulates against its own model of the world; the gap between prediction and outcome is the error signal.
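
The pattern is small enough to sketch for categorical outcomes. This is a minimal illustration under stated assumptions, not any lab's API and not studio code: the ForwardModel interface, predictionError, and dryRun are hypothetical names, and the stub's probabilities are made up.

```typescript
// Hypothetical sketch: a forward model over categorical outcomes.
// The interface and the stub below are assumptions, not studio code.
type Distribution = { [outcome: string]: number };

interface ForwardModel<S, A> {
  // Predict outcome probabilities for (state, candidate action) before acting.
  predict(state: S, action: A): Distribution;
}

// Error signal: 1 minus the probability assigned to what actually happened.
function predictionError(predicted: Distribution, actual: string): number {
  const p = predicted[actual];
  return 1 - (p === undefined ? 0 : p);
}

// Dry-run-before-act: simulate every candidate, keep the one with the
// highest predicted probability of the desired outcome.
function dryRun<S, A>(
  model: ForwardModel<S, A>,
  state: S,
  candidates: A[],
  desired: string
): A | undefined {
  let best: A | undefined = undefined;
  let bestP = -1;
  for (const action of candidates) {
    const p = model.predict(state, action)[desired];
    const score = p === undefined ? 0 : p;
    if (score > bestP) {
      best = action;
      bestP = score;
    }
  }
  return best;
}

// Stub model with made-up numbers, standing in for an LLM-based predictor.
const stub: ForwardModel<string, string> = {
  predict: (_state, action) =>
    action === "revise_first"
      ? { SHIP: 0.75, REVISE: 0.25 }
      : { SHIP: 0.25, REVISE: 0.75 },
};
```

Other error measures (log loss, for example) would serve the same role; one minus the assigned probability is used here only because it is the simplest to read.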

What doesn't

Pixels, video frames, robot state.

Those substrates are matter in motion. The studio runs on briefs, drafts, taste verdicts. Lifting Genie / V-JEPA in as-is doesn't make sense: Quinn isn't predicting the next frame of a brief. Vol. III overstated how much could be lifted; the substrate has to be assembled.

What you already have

LLMs as forward models of language. Embeddings as taste-space projection. The corpus as training data.

Every brief → artifact → reaction triple in the repo is training data for a forward model of this studio specifically. LLMs already do next-token prediction; the move is to action-condition them on studio decisions.
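
One possible reading of "action-condition" is few-shot prompting: past triples become examples, and the new brief plus a candidate artifact are posed with the reaction left blank. The sketch below assumes that reading; the Triple field names and the prompt layout are hypothetical, and no model call is shown.

```typescript
// Hypothetical sketch: field names and prompt layout are assumptions.
interface Triple {
  brief: string;
  artifact: string;
  reaction: string;
}

// Past triples become few-shot examples; the new (brief, candidate artifact)
// pair is the state and action, and the reaction is left for the model to fill.
function buildPrompt(history: Triple[], brief: string, candidate: string): string {
  const lines: string[] = [];
  for (const t of history) {
    lines.push("Brief: " + t.brief);
    lines.push("Artifact: " + t.artifact);
    lines.push("Reaction: " + t.reaction);
    lines.push("");
  }
  lines.push("Brief: " + brief);
  lines.push("Artifact: " + candidate);
  lines.push("Reaction:");
  return lines.join("\n");
}
```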

The smallest probe

One decision type. Stated baseline. N held-out cases.

Pick one narrow, frequent decision with a clean outcome signal. Recommended: verdict — Zara already emits SHIP / REVISE / KILL on every artifact, Deter emits PASS / FAIL. Categorical labels per artifact mean the corpus has clean training data ready. Alternatives: routing (which agent gets this brief?) or direction (team's first move?).

Train an LLM-based predictor on past examples. Before starting, name a no-model baseline — a deterministic predictor the forward model has to beat by a stated margin. Without that, the probe ends in vibes.
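
A minimal sketch of what naming the baseline could look like, assuming the verdict decision type. The majority-class baseline is one deterministic choice among several, and the margin and sample-size floor are parameters to be stated before the run. All names here are hypothetical.

```typescript
// Hypothetical sketch: names are illustrative; margin and minN are
// pre-registered by the operator, not fixed by this code.
type Verdict = "SHIP" | "REVISE" | "KILL";
type ProbeResult = "inconclusive" | "beats_baseline" | "does_not_beat_baseline";

const ORDER: Verdict[] = ["SHIP", "REVISE", "KILL"];

// No-model baseline: always predict the most frequent past verdict.
// Ties resolve in ORDER, so the predictor stays deterministic.
function majorityBaseline(train: Verdict[]): Verdict {
  let best: Verdict = "SHIP";
  let bestCount = -1;
  for (const v of ORDER) {
    const count = train.filter(t => t === v).length;
    if (count > bestCount) {
      best = v;
      bestCount = count;
    }
  }
  return best;
}

function accuracy(predicted: Verdict[], actual: Verdict[]): number {
  if (predicted.length !== actual.length || actual.length === 0) {
    throw new Error("predicted and actual must be non-empty and equal length");
  }
  let hits = 0;
  actual.forEach((a, i) => {
    if (predicted[i] === a) hits += 1;
  });
  return hits / actual.length;
}

// Below the sample floor the result is inconclusive, whatever the lift.
function evaluate(
  modelPreds: Verdict[],
  baseline: Verdict,
  heldOut: Verdict[],
  margin: number,
  minN: number
): ProbeResult {
  if (heldOut.length < minN) return "inconclusive";
  const baselinePreds = heldOut.map(() => baseline);
  const lift = accuracy(modelPreds, heldOut) - accuracy(baselinePreds, heldOut);
  return lift >= margin ? "beats_baseline" : "does_not_beat_baseline";
}
```

Checking the floor first keeps a lucky small sample from reading as a win; the baseline, the margin, and the floor are all fixed before any prediction is scored.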

The probe isn't "build the forward model." It's "find out whether one studio decision can be predicted well enough to teach."

Probe 03 · Real wanting (deferred)

What waits.

The limbic layer is well-shaped by neuroscience and RL — intrinsic motivation, free energy, curiosity-driven exploration. Porting, not inventing. But it needs something to read. Without prediction error from a forward model, "wanting" collapses back into "the prompt told me to want this."

Prerequisites · partially live (2026-05-24)

Habitat continuity live. Forward error generated once.

The limbic layer reads prediction error from the forward model and writes drive state into carry_forward.md (the habitat continuity substrate). Habitat continuity is live — journals + carry-forwards being written daily across the agents (see Vol. V Plate II). Forward error has been generated once via the E-001 backtest, but not yet as a live continuous stream. The prerequisite is two-thirds met; the missing piece is forward-error generation in real time, not the substrate to read it from.

Shape (sketch only)

A homeostatic gradient the agent reads on every step.

Each agent has a small set of internal variables — compute budget, time-since-last-handoff, prediction-error trend on its forward model. When variables drift outside band, the agent self-prompts to act on the drift. Operator stays out.
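
Since this section is a sketch only, the code below is one too: a band check over named variables, followed by a self-prompt when anything drifts. The variable names, band values, and prompt wording are all illustrative assumptions.

```typescript
// Hypothetical sketch: variable names and band values are illustrative.
interface Band {
  low: number;
  high: number;
}

interface Drift {
  variable: string;
  value: number;
  direction: "below" | "above";
}

// Read the gradient: report every variable that sits outside its band.
function readGradient(
  state: { [name: string]: number },
  bands: { [name: string]: Band }
): Drift[] {
  const drifts: Drift[] = [];
  for (const name of Object.keys(bands)) {
    const band = bands[name];
    const value = state[name];
    if (band === undefined || value === undefined) continue;
    if (value < band.low) {
      drifts.push({ variable: name, value, direction: "below" });
    } else if (value > band.high) {
      drifts.push({ variable: name, value, direction: "above" });
    }
  }
  return drifts;
}

// Self-prompt: the agent writes its own instruction; no operator involved.
function selfPrompt(agent: string, drifts: Drift[]): string | null {
  if (drifts.length === 0) return null;
  const parts = drifts.map(
    d => d.variable + " is " + d.direction + " band (" + d.value + ")"
  );
  return agent + ": act on drift. " + parts.join("; ") + ".";
}
```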

Why deferred

Most homework already done by other people.

Of the three gaps, this one has the deepest external substrate to draw on. Friston, Pathak, Klyubin / Polani / Nehaniv, Bahmani — the shape is well-mapped. Defer the build until the inputs (forward error, habitat continuity) are stable enough to actually read.

An objective function isn't a drive. A drive is what stays on after the operator leaves.

v0 smoke test · 2026-05-21

v0 ran. Five primitives surfaced.

Before committing to the three-week probe, three miniature exchanges were dispatched in-session: synchronous persona calls, ~150 words per turn. This was not the real probe (no persistence, no async, no multi-turn); it was a smoke test of the channel shape. It surfaced more than expected.

The probes that ran

Three pairs. Three resolution shapes.

Quinn ↔ Zara (low-frequency, direction-shaped). Quinn caught a system error from Zara's critique text alone — separated the broken pipeline signal from the editorial verdict. Zara conceded the catch, re-evaluated the piece on its merits, moved KILL → REVISE, and produced a self-aware observation in real time: "when a gate trips, my verdict should pause, not pile on."

Felix ↔ Deter (high-frequency, craft-shaped). Felix conceded the verdict but attacked the diagnosis — caught an inconsistency in Deter's own critique and proposed a re-tag. Deter conceded the diagnosis cleanly without conceding the verdict. Verdict and diagnosis moved independently.

Zara ↔ Rowan (overlapping-lane, productive non-convergence). Rowan sharpened the diagnosis past Zara's frame and tried to put a strategic condition on her gate. Zara accepted the diagnosis upgrade, refused the gate-encroachment cleanly, and synthesized a sharper mechanism than either had named — "the highlight eats the strip." Both held ground; the work moved forward.

Five design primitives surfaced

Worth formalizing in v1.

01. Verdict ≠ diagnosis. Peers can update one while holding the other.
02. n=1 doctrine candidates surface organically inside the exchange — no operator routing required.
03. Lane defense is self-policing. Trust per-domain emerged without being declared.
04. Lane overlap → louder explicit lane policing. Adjacent-but-distinct peers stayed quiet; overlapping authority forced statements like "Mechanism is yours. Gate is mine."
05. Synthesis is a legitimate output of disagreement. Productive non-convergence — where each peer holds their lane AND the work moves forward — is a third resolution shape the channel should be designed to surface.
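
Primitive 01 has a direct consequence for the exchange format: verdict and diagnosis need to be separate fields, so a concession on one leaves the other untouched. A minimal sketch, with hypothetical names and placeholder values:

```typescript
// Hypothetical sketch: names and sample values are illustrative.
interface Critique {
  verdict: string; // e.g. SHIP / REVISE / KILL or PASS / FAIL
  diagnosis: string; // why the verdict was given
}

// What the critic concedes after a peer's challenge. Each field is optional,
// so one can move while the other is held.
interface Concession {
  verdict?: string;
  diagnosis?: string;
}

type Field = "verdict" | "diagnosis";

function applyConcession(
  original: Critique,
  conceded: Concession
): { critique: Critique; moved: Field[] } {
  const moved: Field[] = [];
  let verdict = original.verdict;
  let diagnosis = original.diagnosis;
  if (conceded.verdict !== undefined && conceded.verdict !== verdict) {
    verdict = conceded.verdict;
    moved.push("verdict");
  }
  if (conceded.diagnosis !== undefined && conceded.diagnosis !== diagnosis) {
    diagnosis = conceded.diagnosis;
    moved.push("diagnosis");
  }
  return { critique: { verdict, diagnosis }, moved };
}
```

In these terms, the Quinn ↔ Zara exchange moved only the verdict (KILL to REVISE), and the Felix ↔ Deter exchange moved only the diagnosis.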

Three doctrine candidates produced

n=1, peer-to-peer, no operator in the loop.

01. "When a gate trips, the verdict should pause, not pile on." (Q ↔ Z)
02. "Verdict ≠ diagnosis. Peers can update one while holding the other." (F ↔ D)
03. "Pre-loading verdicts on hypothetical next reads is a process error. Verdict belongs to the current read." (Z ↔ R)

All three would have taken weeks via the operator-promotion pipeline. The mesh produced them in two-to-three LLM calls each.

One failure mode

Peers may relax each other's voice constraints.

Felix's truth discipline says he doesn't claim things he hasn't verified. In peer mode he made an inferred fix claim without reading the source code. Deter accepted the inference without challenge. The operator was the cross-check; without an operator, the cross-check is missing.

Implication for v1: each peer is responsible for holding the other peer's voice integrity, not only their own. A real mesh should surface "Felix, you're claiming a fix you haven't verified" as a flag — either from the receiving peer, or from a third-process check. New design constraint the operator-in-the-loop pipeline doesn't have.
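
A third-process check of this kind could be as small as the sketch below. It assumes each claim records what its author actually read or ran; that record, the Claim shape, and the function name are all hypothetical.

```typescript
// Hypothetical sketch: the claim shape and the flag wording are assumptions.
interface Claim {
  author: string;
  kind: "fix" | "observation";
  text: string;
  verifiedBy: string[]; // what the author actually read or ran, if anything
}

// A third-process check: flag any fix claim that cites no verification.
function integrityFlags(claims: Claim[]): string[] {
  const flags: string[] = [];
  for (const c of claims) {
    if (c.kind === "fix" && c.verifiedBy.length === 0) {
      flags.push(c.author + ", you're claiming a fix you haven't verified: " + c.text);
    }
  }
  return flags;
}
```

The check only works if verification is recorded honestly at the source, so it supplements the receiving peer's challenge and does not replace it.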

What v0 didn't test

The hardest case is still open.

All three probes resolved — convergence (Q↔Z, F↔D) or productive non-convergence with synthesis (Z↔R). The truly hard case — disagreement that can't be resolved inside the exchange — is still untested. That is the operator-on-escalation trigger condition, and the most important v1 design question. Also untested: persistence, async, multi-turn drift between exchanges, and novel work where the exchange starts before any critique exists.

v0 surfaced primitives faster than expected. The v1 three-week probe now has sharper open questions.

State

What's settled. What's open.

Vol. IV was a planning artifact. While it was being written, a lot of the substrate it planned for shipped. This page reconciles plan vs. reality as of 2026-05-24.

Shipped since the plan was drafted

Substrate moved faster than the plan tracked.

Habitat continuity. Journal + carry_forward live across all dormant agents. Helper module at studio/scripts/lib/habitat.ts. Journals being written daily.
Awake daemons. Six previously-dormant agents (Zara, Rowan, Deter, Felix, Declan, Doctor) now run as launchd-scheduled daemons.
Dreams. Daily consolidation cycle (06:30 / 06:32 / 06:35) running across the studio. Two weeks of scheduled-run logs.

See Vol. V for the full inventory.

Settled · framing

The three-gap reframe still holds.

Three gaps are a stack (forward → limbic → mesh). Forward-model substrate is symbolic / aesthetic, not physical. Sequencing rule: make the mesh, adapt the limbic, build the forward model. Mesh and forward in parallel; limbic waits.

v0 smoke test (2026-05-21) produced five design primitives and three n=1 doctrine candidates — peer-to-peer, no operator routing. See FIG. 05.

Live status by probe

Where each probe actually stands.

Mesh. v0 ran 2026-05-21 (in-session smoke tests). v1 channel design locked 2026-05-22 — see docs/plans/2026-05-22-mesh-v1-design.md. v1 deployment pending.
Forward model. E-001 backtest ran 2026-05-19 → 20. Inconclusive per pre-reg (N=17 below floor of 30), methodology validated. Vol. V Plate IV covers the full result. E-002 (prospective live loop) designed but not started.
Limbic. Prerequisites two-thirds met — habitat continuity live, forward error generated once. Live continuous forward-error stream still missing.

Open · operator decides next

Sharper questions, now that v0 ran and substrate shipped.

01. Pilot pair(s) for v1 mesh probe: Quinn ↔ Zara and Felix ↔ Deter recommended in parallel.
02. A recent piece of work that would have been visibly different if the chosen pair had been peers in real time.
03. E-002 go / no-go: should the forward model run as a prospective live loop on we-play verdicts?
04. Time budget per week for protected research.
05. Vol. IV status: in service of external publication, or internal learning?

v0 partially resolved: voice didn't bleed across any of three probes; trust per-domain emerged organically; free-form text + prior critique was sufficient as the exchange unit. Remaining open: push / pull cadence; unresolved-disagreement escalation; n=3 synthesis encoding.

Non-goals

What this plan is not.

Not building infrastructure before probes report. Not closing the FM-03 gaps "fully" — Vol. III named gaps; Vol. IV opens probes. Not reconciling with FM-01 or FM-02 as the canonical path — one possible route to the substrate, not the only one.

Not replacing the operator. Like teaching a kid to ride a bike. First you run alongside holding the seat (operator-in-the-loop). Then you stand at the end of the driveway and watch (operator-on-escalation). You are not gone. You are the one who picks them up when they fall. But the pedaling is not yours anymore.

Long-horizon question

What stops the mesh from re-converging the agents?

Even with no contamination per critique, repeated peer-reading may drift toward mutual influence over time. "Studio becomes one big agent in nine hats" is a five-quarter failure mode the channel design eventually has to defend against. v0 caught one form of it (peers relaxing each other's voice constraints; see FIG. 05 failure mode). The long-form version stays open.

The next level is when the operator becomes optional. This volume is the work that would put that statement on the ground.

Colophon

Field Manual / Vol. IV · FM-04 · planning artifact
Snapshot 2026-05-21 · v0 smoke test complete (5 primitives, 3 n=1 candidates, 1 failure mode)
Revised 2026-05-24 to reflect shipped substrate (habitat continuity, Awake daemons, dream cycle — see Vol. V) and v1 mesh design lock.
Typeset in Inter.