Xihe · a realm for crafted agents

Harness

The reliable system around the model — graph, cache, router, and the roles that drive them.


The DAG executor

A leaf either succeeds with all postconditions met, or it is discarded as if it never ran—failure contained at the graph boundary.

The DAG executor is the harness’s fault‑boundary primitive. Work compiles into a directed acyclic graph of atomic leaves, and the executor enforces a strict lifecycle: every unit either fully succeeds with all postconditions asserted, or is discarded as if it never ran. Reliability is a function of the graph’s shape and this binary contract, not of the model running inside a leaf.

The failure mode

Without this atomic contract, multi‑step pipelines built on fallible models or services suffer from partial state corruption. A leaf that fails midway can leave half‑written outputs, which downstream leaves consume unknowingly, silently compounding errors. Race conditions arise when multiple leaves share dependencies without a clear done/not‑done signal, so the system cannot answer “is this input stable?”. Retry logic becomes ambiguous: if a leaf partially mutated state, re‑executing it risks double‑applying effects or propagating stale data. The conductor cannot distinguish between a clean failure and a leaf that never properly started, so every failure forces either a costly full‑graph rerun or a manual untangling of side effects. The harness leaks ambiguity into every retry, skip, or reroute decision, and reliability collapses into a tangle of per‑leaf defensive checks.

How it works

The executor enforces a strict status lifecycle for every leaf:

pending – not yet evaluated for readiness.
ready – all upstream dependencies are done or skipped; the leaf is eligible for dispatch.
running – the leaf is executing on its assigned model or tool.
done – execution completed and the leaf asserted its postconditions successfully.
failed – execution raised an exception, or it completed but the postcondition assertion failed.
paused / skipped – external control states (e.g., manual hold or conductor‑directed skip).

A leaf transitions from pending to ready only when every upstream leaf is done or skipped. A failed upstream leaf keeps its dependents pending until the conductor retries it, skips it, or reroutes the workflow. There is no partial readiness, no best‑effort speculation. This topological gate ensures no leaf ever runs against an incomplete or unverified input.
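The readiness rule can be sketched in a few lines. This is a minimal illustration rather than the harness's actual code: the `Status` enum mirrors the lifecycle names listed above, while `is_ready`, `SATISFIED`, and the `upstream` mapping are hypothetical names chosen for the example.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    READY = "ready"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"
    PAUSED = "paused"
    SKIPPED = "skipped"


# States that satisfy a dependency, per the readiness rule in the text.
SATISFIED = {Status.DONE, Status.SKIPPED}


def is_ready(leaf: str, upstream: dict[str, list[str]], status: dict[str, Status]) -> bool:
    """A pending leaf becomes ready only when every upstream leaf is done or skipped."""
    if status[leaf] is not Status.PENDING:
        return False
    return all(status[dep] in SATISFIED for dep in upstream.get(leaf, []))


# Hypothetical three-leaf graph: c depends on a and b.
upstream = {"c": ["a", "b"]}
status = {"a": Status.DONE, "b": Status.RUNNING, "c": Status.PENDING}
blocked = is_ready("c", upstream, status)    # b is still running
status["b"] = Status.SKIPPED
unblocked = is_ready("c", upstream, status)  # a is done, b is skipped
```

Here `blocked` is false and `unblocked` is true: the gate opens only once every dependency has settled into done or skipped.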

When a leaf finishes its payload, the executor runs the leaf’s postcondition check—a set of invariants that must hold (e.g., output format valid, required fields present). If the assertion passes, the leaf moves to done. If it fails, the leaf is immediately discarded: its status becomes failed, its output is erased from the DAG’s context, and downstream leaves never see it. The failure is contained at the leaf boundary. The conductor can then decide to retry the leaf (re‑instantiate it with possibly different parameters), skip it (mark it skipped so dependents adapt), or reroute the workflow—all without ambiguity, because no partial state escaped the leaf.

The DAG executor never interprets leaf logic. It only enforces the lifecycle, the dependency topology, and the postcondition gate. It guarantees that every leaf that reaches done is fully verified, and every failed leaf leaves zero visible trace inside the graph.
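The postcondition gate described above might look roughly like the following. It is a sketch under stated assumptions: `run_leaf`, the shared `context` dictionary, and the string statuses are hypothetical stand-ins, and the real executor may store outputs differently.

```python
from typing import Any, Callable


def run_leaf(
    name: str,
    payload: Callable[[dict[str, Any]], Any],
    postcondition: Callable[[Any], bool],
    context: dict[str, Any],
    status: dict[str, str],
) -> None:
    """Run one leaf and publish its output only if the postcondition holds."""
    status[name] = "running"
    try:
        # The payload receives a shallow copy, so its top-level writes
        # do not reach the shared context.
        output = payload(dict(context))
        ok = postcondition(output)
    except Exception:
        ok = False
    if ok:
        context[name] = output   # commit: downstream leaves may now read it
        status[name] = "done"
    else:
        context.pop(name, None)  # discard: nothing of this leaf stays visible
        status[name] = "failed"


context: dict[str, Any] = {}
status: dict[str, str] = {}
run_leaf("parse", lambda ctx: {"id": 7}, lambda out: "id" in out, context, status)
run_leaf("draft", lambda ctx: {"text": ""}, lambda out: bool(out["text"]), context, status)
```

After both calls, `parse` is done and its output sits in `context`, while `draft` is failed and has left no entry behind. The executor never inspects what the payload does; it only evaluates the gate.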

Trade-offs & boundaries

Postcondition cost. Every leaf must define and execute its own invariants. Cheap postconditions (type checks, schema validation) add low overhead, but heavyweight checks (e.g., re‑evaluating a model’s output) can become a bottleneck. The executor provides no optimization—leaf authors own that cost.
No partial results or streaming. The binary contract forbids incremental output. Any partial work produced by a failing leaf is lost; the leaf is an all‑or‑nothing unit. Workloads that need progressive delivery must split into smaller leaves or use a different primitive.
Side effects are not automatically rolled back. “Discarded as if it never ran” makes the leaf invisible to downstream consumers, but any pre‑committed external side effects (API calls, writes) remain. The contract relies on leaf authors to design leaves that are idempotent, delay irreversible effects until postcondition success, or manage compensation externally.
Acyclic only. The DAG is strictly acyclic. Cyclic workflows must be handled outside the executor, typically by the conductor replanning a new graph each iteration.
Degradation path. If postconditions are too weak (always pass), the executor degenerates into a simple scheduler and unverified outputs propagate, erasing the reliability guarantee. If postconditions are too strict, healthy outputs are rejected, causing spurious failures and needless retries. Tuning postcondition strictness is a continuous operational concern.
Granularity is a design choice, not a runtime adjustment. Coarse leaves discard large amounts of work on failure; fine leaves increase graph overhead and scheduling complexity. The executor treats every leaf as atomic regardless.
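One way a leaf author could follow the advice on side effects is to stage irreversible actions and fire them only after the postcondition passes. The sketch below is an assumption about how that might be done, not a facility the executor is stated to provide; `run_with_deferred_effects` and the `defer` callback are hypothetical names.

```python
from typing import Any, Callable, Optional

Effect = Callable[[], None]


def run_with_deferred_effects(
    payload: Callable[[Callable[[Effect], None]], Any],
    postcondition: Callable[[Any], bool],
) -> Optional[Any]:
    """Run a payload that registers effects instead of firing them directly."""
    staged: list[Effect] = []
    output = payload(staged.append)
    if not postcondition(output):
        return None          # staged effects are dropped without running
    for effect in staged:    # postcondition passed: now fire the effects
        effect()
    return output


sent: list[str] = []


def notify(defer: Callable[[Effect], None]) -> dict:
    defer(lambda: sent.append("email"))  # hypothetical irreversible action
    return {"status": "ok"}


good = run_with_deferred_effects(notify, lambda out: out["status"] == "ok")
bad = run_with_deferred_effects(notify, lambda out: False)
```

Only the passing run appends to `sent`. If an effect itself raises partway through the loop, earlier effects have already fired, so idempotency or external compensation is still needed for that case.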

When it earns its place

The DAG executor pays off in any multi‑step workflow where a single failure must not corrupt the entire pipeline:

Model chaining with verification: Leaf A generates output; leaf B verifies it with a postcondition. If verification fails, only A is retried, with no contamination of downstream steps.
Expensive, mixed‑model pipelines: A cheap fast model runs a leaf; a postcondition leaf checks consistency. Failure can trigger a rerun of only that leaf on a more capable model, saving cost.
Parallel fan‑out with dependencies: The readiness rule guarantees downstream leaves see only stable, verified inputs from all upstream leaves, eliminating race conditions.
Audit and debugging: Every leaf logs a binary success or failure with its postcondition result, giving a clean, atomically‑traced execution history.
High‑reliability workflows: Financial calculations, compliance checks, or configuration generation where a half‑executed step would corrupt downstream state.

A leaf is either done or it never happened—the graph guarantees it.


Compare-and-swap state

Per-leaf compare-and-swap eliminates double-claim and double-completion, making parallel graph execution safe without global locks.

Compare-and-swap (CAS) state is a per‑leaf atomic state machine — pending → claimed → completed — that guarantees exactly one worker can claim and complete a given leaf under any degree of concurrency. It turns on a single idea: only the worker whose compare‑and‑swap instruction succeeds can transition the state; all others see a mismatch and move on. The result is a lock‑free harness that transforms parallel leaf execution from a race condition into a deterministic claim.

The failure mode

Without CAS, every ready leaf in the DAG becomes a potential point of silent corruption. Two or more workers inspect the same leaf, both see pending, and both decide to execute. They duplicate the work — wasting compute — and then both write a completion marker. Downstream dependencies receive two transitions where they expect exactly one, breaking their invariants. If the leaf produces side effects (a file write, an API call, a database insert), those fire twice. The graph’s dependency resolution, built on single‑completion semantics, silently becomes nonsense. The brittleness scales with concurrency: more workers raise the probability of collision, and a single unguarded leaf can corrupt the entire plan. A global mutex could prevent this, but at the cost of serialising all leaf transitions and killing throughput. Without CAS, reliable parallel execution is impossible — the harness becomes a lottery, not a pipeline.

How it works

Each leaf carries an atomically accessible state word. The harness defines exactly three legal states and two guarded transitions:

Claiming: A worker reads the leaf’s state (expecting pending) and attempts an atomic compare‑and‑swap to claimed. Only one worker’s CAS succeeds; the loser sees a mismatch — either claimed (another worker won) or completed (already finished). The losing worker does not block. It picks another leaf from the dispatch queue, or retries the same leaf only if the state remains pending (an edge case that quickly resolves).
Completing: After executing the leaf’s task, the winning worker performs a second CAS from claimed to completed. This prevents duplicate completions: if a bug or a stale recovery attempt tries to complete the same leaf again, the CAS fails because the state is already completed. Only the first successful completion matters.
Failure handling: If a worker crashes after the claiming CAS but before the completing CAS, the leaf stays claimed. CAS itself cannot time out or roll back that state — it is a pure atomic primitive. Recovery is delegated to the dispatch layer, which monitors leaf states and applies a timeout/retry policy (e.g., resetting the leaf to pending after a configurable deadline). This separation keeps the harness fast and minimal while higher‑level components handle liveness.

The CAS primitive is always on, invisible to the model, and enforced by the harness on every state transition. No global lock exists; workers never wait on each other.
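The claim and complete transitions can be illustrated with a small simulation. Python exposes no hardware compare-and-swap, so this sketch emulates the atomic instruction with a tiny per-leaf guard; that guard, the `LeafState` class, and the `worker` function are all assumptions made for the example and say nothing about how the harness is actually implemented.

```python
import threading


class LeafState:
    """Per-leaf state word with an emulated compare-and-swap.

    The per-leaf guard stands in for a hardware atomic instruction;
    no lock is shared between leaves.
    """

    def __init__(self) -> None:
        self._state = "pending"
        self._guard = threading.Lock()

    def compare_and_swap(self, expected: str, new: str) -> bool:
        with self._guard:
            if self._state != expected:
                return False
            self._state = new
            return True

    @property
    def state(self) -> str:
        return self._state


def worker(leaf: LeafState, worker_id: int, winners: list[int]) -> None:
    if not leaf.compare_and_swap("pending", "claimed"):
        return  # lost the race: move on to other work
    winners.append(worker_id)  # stands in for executing the leaf
    leaf.compare_and_swap("claimed", "completed")


leaf = LeafState()
winners: list[int] = []
threads = [threading.Thread(target=worker, args=(leaf, i, winners)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Sixteen workers race for one leaf, exactly one entry lands in `winners`, and a later attempt to complete the leaf again fails because the state is no longer `claimed`.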

Trade-offs & boundaries

CAS is a single‑state primitive. It does not coordinate multi‑leaf transactions — it cannot atomically claim leaf A and leaf B together, nor roll back a completed leaf. Those responsibilities belong to the conductor (plan‑level orchestration) and dispatch (topological ordering, recovery).

Stuck claimed leaves after a crash demand an external timeout; without it, progress halts. The dispatch layer’s retry policy must be tuned per workload to balance false‑positive resets against prolonged stalls. CAS also does not provide side‑effect idempotency: if a leaf writes to an external database and then the worker crashes before the completing CAS, the harness can retry the leaf after timeout, but it cannot undo the database write. The safety guarantee is state‑transition integrity, not external compensation.
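A dispatch-layer recovery pass for stuck leaves could be as simple as the sketch below. The record layout (`state`, `claimed_at`), the function name `reclaim_stale`, and the 30-second deadline are hypothetical; a real implementation would presumably perform the reset itself as a compare-and-swap from claimed to pending so that it cannot clobber a worker that completes at the last moment.

```python
import time
from typing import Optional


def reclaim_stale(
    leaves: dict[str, dict],
    deadline_s: float,
    now: Optional[float] = None,
) -> list[str]:
    """Reset leaves that have stayed claimed for longer than the deadline."""
    now = time.monotonic() if now is None else now
    reset: list[str] = []
    for name, leaf in leaves.items():
        if leaf["state"] == "claimed" and now - leaf["claimed_at"] > deadline_s:
            leaf["state"] = "pending"  # eligible for a fresh claim
            reset.append(name)
    return reset


# Hypothetical snapshot: a was claimed long ago, b only recently, c is finished.
leaves = {
    "a": {"state": "claimed", "claimed_at": 0.0},
    "b": {"state": "claimed", "claimed_at": 95.0},
    "c": {"state": "completed", "claimed_at": 0.0},
}
reset = reclaim_stale(leaves, deadline_s=30.0, now=100.0)
```

Only `a` is reset. A shorter deadline recovers faster but risks resetting a leaf whose worker is merely slow, which is the tuning trade-off described above.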

Under extreme contention — many workers racing for the same leaf — CAS failures waste a few CPU cycles on retries. In practice, the dispatch layer distributes work over many ready leaves, keeping contention negligible. The cost of a successful CAS is a single hardware‑atomic instruction (a few cycles), dwarfed by model inference or I/O inside a leaf. If a workload consists solely of microsecond‑long tasks, that overhead might become measurable, but such use is atypical for an agent platform.

CAS does not prevent a leaf from producing a wrong result (model hallucination, buggy tool call). Verification — a separate component on the roadmap — addresses output correctness.

When it earns its place

CAS earns its weight whenever multiple workers execute leaves concurrently. This is the default for any throughput‑sensitive Xihe deployment. It shines when:

Leaves are costly (seconds of model inference or I/O) and duplicate execution would waste both time and external quota.
Leaves produce side effects that must not fire twice — e.g., rate‑limited API calls, database inserts, file writes.
Workers are heterogeneous or unreliable (varying model latency, possible crashes). CAS keeps state transitions honest even as workers come and go.
The DAG contains tens to thousands of leaves and the dispatch layer feeds many ready leaves simultaneously; a global lock would become a serialisation bottleneck.

In these situations, CAS transforms parallelism from a correctness hazard into a safe, predictable pipeline. If the system runs with a single worker, CAS adds a trivial overhead and never contends; the moment concurrency increases, it silently prevents corruption the developer never has to think about.

CAS turns concurrent leaf execution from a race into a predictable pipeline: one worker, one claim, one completion.


Frozen prefix-cache & concurrency

Frozen prefix-cache decouples concurrency width from prompt cost by sharing a byte-stable prefix across all leaves.

The harness materialises the shared context — system prompt plus task framing — once and locks it as a byte-stable prompt prefix. Every leaf in a DAG inherits that prefix through the provider’s prompt-cache mechanism, eliminating redundant token recomputation. The one idea it turns on is that parallel execution should not multiply the cost of a shared context; by caching that context, concurrency width becomes a function of provider rate limits, not token budgets.

The failure mode

Without a frozen prefix, every leaf in a wide DAG independently transmits and processes the identical system prompt and task framing. That redundancy is not merely wasteful; it compounds. Each leaf pays the full input‑token price for the prefix, turning a 256‑way fan-out into a 256× token multiplier on the same context. Latency balloons because the provider must compute KV caches for the prefix again and again, saturating its own resources and throttling throughput. Concurrency collapses under the weight of token budgets and rate-limit headroom: what should be a fast, parallel leaf execution becomes a serialised, cost-exploding parade.

The failure is silent — no errors, no crashes. A developer may test with a handful of leaves and never see the problem, only to hit a wall when the DAG widens. The brittleness is that concurrency and prompt size are coupled; eliminating that coupling is the design’s core job.

How it works

The harness materialises the shared context exactly once, before any leaf is dispatched, and locks it as a deterministic, immutable byte sequence. Subsequent leaves do not re‑assemble this prefix; they reference it through the provider’s prompt‑cache mechanism. The provider stores the KV state of the prefix and serves it to every leaf that presents the same byte‑exact prefix, skipping recomputation entirely.

Each leaf’s request consists of the cached prefix (or a reference to it) plus a leaf‑specific suffix. The provider treats the prefix tokens as a cache hit, billing them at roughly 10% of the uncached input‑token price.
Recomputation drops to zero; only a lightweight cache‑lookup per leaf remains.
Structural sharing makes wide concurrency affordable: under this scheme, DeepSeek workloads routinely fan out ~256‑way on a single cached prefix; MiMo scales to ~8‑wide.
The frozen prefix is a red line. Any edit — even a single token change — invalidates the cache for all downstream leaves, forcing a full recompute of the prefix’s KV states for every leaf. Therefore the harness treats prefix modifications as schema‑breaking operations, permitted only at the conductor’s explicit re‑planning step.
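The idea of materialising the prefix once and keeping it byte-stable can be sketched as follows. The `FrozenPrefix` class, its joining rule (two newlines between system prompt and task framing), and the SHA-256 fingerprint are illustrative assumptions; how the prefix is actually handed to a provider's cache depends on that provider's API and is not shown.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenPrefix:
    data: bytes   # the exact bytes every leaf will present
    digest: str   # fingerprint used to detect any drift

    @classmethod
    def materialise(cls, system_prompt: str, task_framing: str) -> "FrozenPrefix":
        data = (system_prompt + "\n\n" + task_framing).encode("utf-8")
        return cls(data=data, digest=hashlib.sha256(data).hexdigest())

    def request_for(self, suffix: str) -> bytes:
        """Leaf request body: identical prefix bytes plus a leaf-specific suffix."""
        return self.data + suffix.encode("utf-8")


prefix = FrozenPrefix.materialise("You are a careful engineer.", "Implement one function per leaf.")
requests = [prefix.request_for(f"\n\nLeaf {i}: implement step {i}.") for i in range(4)]

# A single-character edit yields a different fingerprint, i.e. a different cache key.
edited = FrozenPrefix.materialise("You are a careful engineer!", "Implement one function per leaf.")
```

Every request starts with the same bytes, and the frozen dataclass refuses in-place mutation, so the only way to change the prefix is to materialise a new one deliberately.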

The design divorces concurrency width from prompt‑size cost. Parallelism is now bounded by the provider’s request‑per‑second or token‑per‑minute limits, not by how large the shared context is.
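A back-of-envelope calculation makes the decoupling concrete. The model below is a simplification with assumptions labelled in the code: the first leaf pays full price for the prefix, every later leaf is a cache hit at the stated ratio, and any provider-specific cache-write surcharge or eviction is ignored. The 4,000-token prefix is a made-up figure.

```python
def prefix_token_cost(
    leaves: int,
    prefix_tokens: int,
    cache_hit_ratio: float = 0.10,
) -> tuple[float, float]:
    """Return (uncached, cached) prefix cost in full-price token equivalents.

    Assumes one full-price pass to populate the cache, then cache hits
    for every other leaf; real provider billing may differ.
    """
    uncached = float(leaves * prefix_tokens)
    cached = prefix_tokens + (leaves - 1) * prefix_tokens * cache_hit_ratio
    return uncached, cached


uncached, cached = prefix_token_cost(leaves=256, prefix_tokens=4000)
```

For a 256-way fan-out this comes to about 1,024,000 token equivalents without caching against about 106,000 with it, so the prefix's share of the bill shrinks to roughly a tenth while the suffix cost is unchanged.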

Trade-offs & boundaries

The frozen prefix imposes strict rigidity: the shared context must be identical for all leaves in a DAG. There is no hybrid “partially dynamic” mechanism — any per‑leaf variation in the system prompt or task framing would require a different prefix, defeating the cache’s benefit. Those leaf‑specific needs must reside in the suffix, which does not benefit from caching.

The biggest cost is the red‑line penalty. Changing the prefix mid‑flight forces the provider to recompute the prefix for every subsequent leaf, spiking token usage and latency. This is acceptable only when re‑plans are rare: the conductor’s re‑planning step is explicitly gated so that the cost is absorbed deliberately, not triggered accidentally by a leaf.

The mechanism depends on the provider supporting prompt caching. If a provider does not offer caching or charges full token price even on cache hits, the token‑savings vanish; only the structural simplification remains. For very small DAGs (a few leaves) or trivial shared contexts, the overhead of managing the frozen prefix may exceed the benefit.

Cache‑hit economics rely on the prefix being reused densely within a short window — exactly the pattern the harness guarantees. However, the harness does not control the provider’s cache eviction policy; it simply ensures the prefix is byte‑stable and transmitted uniformly. The degradation path is a full recompute per leaf, which the harness treats as an expensive cache miss, but does not attempt to retry or compensate.

When it earns its place

The frozen prefix pays off whenever a large, static shared context feeds many concurrent leaves.

Wide‑fan DAGs such as code generation where a single specification spawns hundreds of independent function implementations. Each leaf inherits the same architectural constraints; the prefix is paid once, parallelism becomes nearly free.
Multi‑agent evaluation where the same scoring rubric and system prompt are applied to dozens of responses in parallel.
Batch extraction or classification with a fixed schema description and formatting rules; only the input record varies per leaf.
Cost‑sensitive deployments where token budgets are tight and the ~10% cache‑hit pricing turns a prohibitive fan‑out into a manageable expense.
Latency‑sensitive real‑time agents where skipping the KV prefill shaves hundreds of milliseconds per leaf.

The mechanism is overkill for single‑leaf tasks or DAGs whose shared context is trivial (e.g., a one‑line system prompt). It earns its place precisely when the parallelism would otherwise be throttled by token cost — and the shared context is large enough to make the savings decisive.

The frozen prefix turns concurrency from a token cost into a rate‑limit game: pay for the shared context once, and only once.


Best-of-N graft

Best-of-N graft trades serial risk for parallel exploration — frozen prefix makes N attempts cheap, and a judge panel selects the winner.

Best-of-N graft is a harness strategy for high-stakes leaf nodes: the conductor launches N independent attempts in parallel, all sharing the same frozen prefix. A configurable judge panel scores each on cost, quality, and latency; only the top-scoring attempt is grafted into the main tree, while the rest are discarded. The one idea it turns on is that the frozen prefix means you pay for concurrency, not N times the tokens — the cost applies only to the divergent suffixes.

The failure mode

Without Best-of-N, every leaf is a single model call. For a tricky algorithm, a design decision, or an ambiguous spec, that one attempt is effectively a coin-flip. If the model’s first sample is suboptimal — a brittle implementation, a misguided design choice, a misinterpretation — the error propagates silently downstream. Later leaves build on a flawed artifact, compounding the damage. Detection comes late, forcing an expensive re‑plan that rolls back the entire tree or subtree. Even serial retries are wasteful: each pays the full token cost of the prefix again, and there is no mechanism to compare candidates systematically. The plan’s quality collapses to the reliability of its weakest leaf, and developers are left betting the whole tree on a single guess.

How it works

Frozen prefix. All N attempts share identical context up to the leaf — same conversation history, same intermediate results, same instructions. Divergence begins only at the leaf input, so token expenditure is bounded: you pay roughly for the common prefix once plus N times the suffix tokens. Concurrency replaces token multiplication.
Parallel execution. The conductor marks the leaf for Best-of-N and launches N independent invocations simultaneously. Every attempt runs in its own model call with no shared mutable state. Latency is the slowest attempt’s wall‑clock time, not the sum.
Judge panel. A per‑leaf scoring function evaluates each completed attempt on three axes, cost (tokens consumed), quality (correctness, style, adherence to constraints), and latency, each with a configurable weight (e.g., quality 0.7, cost 0.2, latency 0.1). The panel may be a rule, a schema validator, or a model call. Only the highest-scoring attempt is grafted into the tree; the discarded N‑1 attempts are ephemeral and never visible downstream.
Quality gate. If every attempt falls below a minimum quality threshold, the leaf fails as a whole. No “least bad” result propagates. The conductor must then re‑plan or escalate the failure, preventing silent degradation.
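The judge panel and quality gate can be sketched as a pure scoring function. Several details here are assumptions not given in the text: quality is taken as a value in 0..1, cost and latency are normalised against the worst attempt so that lower is better, and attempts below the threshold are excluded before ranking. `Attempt` and `select_winner` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    name: str
    quality: float    # 0..1, higher is better
    tokens: int       # cost axis, lower is better
    latency_s: float  # lower is better


def select_winner(
    attempts: list[Attempt],
    weights: dict[str, float],
    min_quality: float,
) -> Optional[Attempt]:
    """Return the attempt to graft, or None if the quality gate fails the leaf."""
    eligible = [a for a in attempts if a.quality >= min_quality]
    if not eligible:
        return None  # no "least bad" result propagates
    max_tokens = max(1, max(a.tokens for a in attempts))
    max_latency = max(a.latency_s for a in attempts) or 1.0

    def score(a: Attempt) -> float:
        return (
            weights["quality"] * a.quality
            + weights["cost"] * (1 - a.tokens / max_tokens)
            + weights["latency"] * (1 - a.latency_s / max_latency)
        )

    return max(eligible, key=score)


attempts = [
    Attempt("a", quality=0.9, tokens=1200, latency_s=4.0),
    Attempt("b", quality=0.8, tokens=600, latency_s=2.0),
    Attempt("c", quality=0.3, tokens=300, latency_s=1.0),
]
weights = {"quality": 0.7, "cost": 0.2, "latency": 0.1}
winner = select_winner(attempts, weights, min_quality=0.5)
failed = select_winner(attempts, weights, min_quality=0.95)
```

With these made-up numbers attempt `b` wins: it gives up a little quality against `a` but is cheaper and faster, while `c` is excluded by the gate. Raising the threshold above every attempt's quality returns `None`, which corresponds to the leaf failing as a whole. The function depends only on the attempts and the weights, in line with the requirement that the judge be pure.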

Trade-offs & boundaries

What it costs. Concurrency overhead is the primary price: N parallel calls consume rate‑limit headroom, worker slots, and memory. When many leaves in a plan use Best-of-N simultaneously, the system may saturate and starve other branches. Latency is bounded by the max attempt duration plus judge evaluation, not the average. Discarded suffixes waste tokens — if the suffix is long relative to the prefix, the token savings shrink. Finally, the judge panel itself adds latency and, if it is a model call, additional token cost.

Where it does NOT apply. Leaves that produce expensive external side effects — database writes, email dispatch, payment API calls — must never use Best-of-N. Every attempt would fire its side effect, and there is no transactional rollback for losing attempts. State‑dependent judge logic also breaks the pattern: if the judge’s score depends on mutable external state (e.g., a counter that increments per attempt, or the order in which attempts complete), comparisons become invalid. The judge must be a pure function of the attempt’s output and the leaf’s input. Likewise, leaves whose output cannot be independently scored (e.g., purely subjective creative copy without a rubric) are poor candidates; an automated judge provides a false sense of reliability.

Degradation path. If all attempts fail the quality gate, the leaf fails cleanly — better than a silent rotten output. In practice, developers can tune N downward when resources are tight (a small N still diversifies risk) or increase N when the solution space is especially wide. The score gap between top attempts serves as a signal: if attempts score similarly, the extra concurrency buys little; if they diverge sharply, the graft is working.

When it earns its place

A tricky algorithm or refactor with a wide solution space. Multiple plausible implementations exist; one might handle edge cases poorly. Best-of-N tests the space concurrently and picks the cleanest, cheapest result.
A design decision with several plausible approaches. Choosing between an event‑driven vs. polling architecture, or a factory vs. strategy pattern. Each attempt argues for a different approach, and the judge picks the best trade‑offs.
Ambiguous specs where a single pass is a coin‑flip. When the prompt leaves room for interpretation, N attempts sample the interpretation space; the judge selects the most coherent or conservative match.
Any leaf where being wrong is expensive to discover later. Core business logic, data‑model schemas, security boundaries — a single mistake cascades through dozens of downstream nodes. The concurrency cost of a graft is an insurance premium against much larger rework.

Best-of-N graft buys reliability at the cost of concurrency: the frozen prefix keeps token cost low, and only the winner survives.


Hashline edits

Hashline uses content hashing to catch and auto-correct line-shift errors in cheap executor patches, decoupling positional accuracy from model quality.

Hashline is a harness-level verification step that eliminates line-shift errors from cheap executor models. It works by hashing every line an edit touches—plus surrounding context—before and after the patch, then comparing. If the intended target line’s content is found uniquely elsewhere in the modified file, the offset is auto-corrected; otherwise the patch is rejected outright. No prompt changes, no model involvement, negligible latency. The core idea: use the line’s own content as the positional anchor, not a model’s fragile line-number guess.

The failure mode

A cheap executor model, chosen deliberately for speed and cost, often misplaces an edit by one line. An instruction to change line 42 lands at line 41 or 43. The resulting diff still looks plausible—the code change is syntactically correct, the file compiles, and static checks pass. The error is invisible. Over repeated turns, these off-by-one mistakes compound silently. A guard condition drifts to the wrong branch, a comment migrates one line up, a function call shifts into a different scope. The codebase still builds, tests may still pass, but the logical structure rots. The corruption surfaces only when a downstream bug forces a costly forensic trace across turn history. Debugging that mess far exceeds the compute savings from using a cheap model. The root cause is that weak models treat line numbers as textual coordinates, not semantic anchors—their token-level spatial reasoning is fuzzy, and every edit is a gamble on positional accuracy. Without a mechanical guard, the harness trusts the model’s reported coordinates, and that trust is brittle.

How it works

Hashline inserts a content-based check into the patch-commit pipeline, running entirely outside the model’s invocation:

Before dispatch, the harness hashes every line the edit is supposed to touch, plus a configurable number of surrounding context lines. This captures the exact content of each line—whitespace and all—and stores the mapping of line-number to hash.
After the executor returns the patch, the harness applies it to a local copy of the file and re-hashes the same logical lines (by their original positions).
If the context-line hashes all match and only the targeted lines changed, the edit was correctly placed and the patch proceeds to commit.
If a hash mismatch occurs, the harness scans the neighbourhood (typically the whole file) for a line whose content hash matches the pre-edit hash of the intended target line. If exactly one such line exists, the offset is automatically corrected—the edit is applied to that line instead. If zero or multiple lines share that hash, the patch is rejected outright.
The model is never aware of this step. No prompt modifications, no retraining, no extra inference. The check completes in microseconds, orders of magnitude faster than the model call itself.

This mechanism decouples positional accuracy from model quality. The executor need only produce the correct content for the target line; the harness ensures that content lands on the correct line in the file. Reliability is engineered outside the model.
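The relocation rule can be sketched with the standard `hashlib` module. This is a simplified illustration, not the harness's pipeline: it checks the executor's claimed position before applying anything instead of re-hashing a patched copy, it ignores context lines, and the names `line_hash` and `resolve_target` are hypothetical.

```python
import hashlib
from typing import Optional


def line_hash(line: str) -> str:
    """Hash the exact line content, whitespace included."""
    return hashlib.sha256(line.encode("utf-8")).hexdigest()


def resolve_target(lines: list[str], claimed_index: int, expected_hash: str) -> Optional[int]:
    """Return the index the edit should land on, or None to reject the patch."""
    if 0 <= claimed_index < len(lines) and line_hash(lines[claimed_index]) == expected_hash:
        return claimed_index  # the executor pointed at the right line
    matches = [i for i, line in enumerate(lines) if line_hash(line) == expected_hash]
    # Auto-correct only when the content is unique in the file.
    return matches[0] if len(matches) == 1 else None


original = ["def f(x):", "    if x < 0:", "        return 0", "    return x"]
target_hash = line_hash(original[1])                    # recorded before dispatch
corrected = resolve_target(original, 2, target_hash)    # executor was off by one
ambiguous = resolve_target(["}", "}", "x"], 2, line_hash("}"))
```

The off-by-one claim is pulled back to index 1 because the guard line is unique, while the duplicated `}` lines give the search nothing to disambiguate on, so the result is `None` and the patch would be rejected.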

Trade-offs & boundaries

Hashline is a positional guardrail, not a semantic one. It verifies where an edit lands, not whether the edit itself is logically correct. A patch that is perfectly aligned but introduces a bug will still pass. It complements, rather than replaces, other verification stages.

Hash uniqueness requirement. Auto-correction depends on the intended target line having a unique content hash in the file. If the line is blank, contains only whitespace, or duplicates a neighbour (e.g., consecutive } lines), the harness cannot disambiguate and falls back to rejection. This is by design: a safe rejection beats a silent misapplication. In practice, non-trivial code lines with distinct identifiers or comments rarely collide, so auto-correction activates most of the time. For files with heavy repetition (auto-generated headers, tabular data), the rejection rate may rise.

Architectural cost. The harness must maintain a pre-edit snapshot of the file to re-hash after the patch is applied—a minor memory overhead that is usually already present in the conductor’s state management. Hashing adds a few microseconds per edit, negligible next to model inference.

Degradation path. As duplicate content increases, rejections become more frequent. Each rejection costs a retry (the conductor redispatching the edit, possibly to a stronger model) but prevents silent corruption. Users can tune the context window size; a wider window can improve uniqueness but also increases the chance that a legitimate edit to a context line triggers a false mismatch, so the trade-off is deliberate.

Scope. Hashline operates on single-file, line-based edits. It does not apply to whole-file rewrites (where “touched lines” lose meaning) or to cross-file moves. Insertions and deletions are handled implicitly: the search compares content, not absolute positions, so an insertion that pushes the target line down does not break the check.

When it earns its place

Hashline earns its weight whenever a cheap executor model is selected for speed or cost rather than positional precision. Concrete scenarios:

Iterative, multi-turn editing loops—a conductor making dozens of small patches across a session. Without Hashline, the probability of at least one off-by-one error approaches certainty; with it, each patch is pinned to content, not fragile line numbers.
Cheap executor farms—teams running small, fast models (e.g., Llama 8B, Gemma, Claude Haiku) for bulk leaf edits. These models are cost-efficient but notoriously weak at spatial reasoning. Hashline lets them do surgical, line-precise work that they otherwise couldn’t be trusted with.
CI/CD fix-bots or automated refactoring—any pipeline where a patch must land correctly without human inspection. A rejection surfaces the failure immediately, rather than hiding it in a corrupted file that only breaks later.
Mixed-tier systems—a costly conductor model (e.g., Opus, GPT-4o) paired with cheap executors. The planner’s expensive reasoning is wasted if an executor misplaces a line; Hashline decouples the two, preserving the economic advantage of the cheap worker.

It is less useful when the executor model already has near-perfect positional accuracy (top-tier models on short contexts) or when the edit simply regenerates the whole file. But the overhead is so small that it is often left enabled by default—even strong models occasionally miscount lines in very long files.

Never trust a model to count lines—let the content hash be the anchor.


Language-server integration

Language-server integration replaces probabilistic text completion with anchored symbol queries, grounding edits in actual code structure.

Language-server integration feeds the executor a live, structured view of the codebase through the Language Server Protocol. It turns the problem of “guessing the program” into one of “querying the program,” shifting agent reasoning from probabilistic text completion to an anchored symbol table provided by the same toolchain that powers human editors.

The failure mode

Without LSP integration the executor operates on raw text. Every symbol reference, import path, and method signature must be inferred from surrounding tokens. The model can produce code that looks plausible—correct indentation, familiar naming—yet silently breaks invariants. It may rename a function while missing call sites in other files, generate a call to a method that does not exist, or import a module that was never installed. This is the hallucinated‑path failure mode: the model confidently constructs a reality that has no ground truth in the codebase, and because the text appears coherent the agent has no internal alarm.

The waste compounds. The executor might dump entire files into context hoping the model will spot relevant parts, burning tokens and attention budget on irrelevant function bodies. It may issue clarification rounds (“what is the return type of parseConfig?”) that could have been answered instantly. Worst is silent corruption—edits that pass visual review but break the build or introduce subtle semantic mismatches. Every such edit requires manual debugging that defeats the purpose of automation, and the brittleness scales combinatorially with the number of files the model consults.

How it works

The executor acts as a lightweight LSP client. Instead of asking for a file as free text, it sends targeted, read‑only queries and receives structured replies:

textDocument/documentSymbol returns the symbol tree of a file—names, kinds (function, class, variable), and location ranges.
textDocument/references enumerates every usage of a symbol across the entire project.
textDocument/definition resolves a symbol to its exact declaration site, including file, line, and column.
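
As a minimal sketch of what these three queries look like on the wire, the snippet below builds the JSON-RPC messages with the Content-Length framing that LSP uses. The helper names (frame, document_symbol, references, definition) are hypothetical, and a real client would first complete the initialize handshake with the server before sending any of them.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC request ids, one per message


def frame(method: str, params: dict) -> bytes:
    """Encode one JSON-RPC request with the LSP Content-Length header."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": next(_request_ids), "method": method, "params": params}
    ).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def document_symbol(uri: str) -> bytes:
    # Asks for the symbol tree of one file.
    return frame("textDocument/documentSymbol", {"textDocument": {"uri": uri}})


def references(uri: str, line: int, character: int, include_declaration: bool = True) -> bytes:
    # Asks for every usage of the symbol at a zero-based position.
    return frame(
        "textDocument/references",
        {
            "textDocument": {"uri": uri},
            "position": {"line": line, "character": character},
            "context": {"includeDeclaration": include_declaration},
        },
    )


def definition(uri: str, line: int, character: int) -> bytes:
    # Asks for the declaration site of the symbol at a zero-based position.
    return frame(
        "textDocument/definition",
        {"textDocument": {"uri": uri}, "position": {"line": line, "character": character}},
    )
```

All three requests are read-only; nothing in this sketch can change a file.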

Each query returns AST‑level metadata, never raw text. The executor constructs its internal symbol table exclusively from these responses. The model can request only what the LSP reports; it cannot fabricate a definition or call site that does not exist. Edits become grounded: renaming a symbol reveals every affected location before the model writes a single token; adding a function call checks the target signature against the actual code graph. The executor reads only the files and line ranges the LSP points to—never entire files hoping context aligns—dramatically reducing token consumption.
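
One way to picture "reads only the files and line ranges the LSP points to" is the sketch below. It takes a list of Location objects in the shape LSP returns them (a uri plus a range with zero-based start and end lines) and merges them into per-file line windows. The function name ranges_to_read and the padding parameter are assumptions for illustration, not part of the harness.

```python
from collections import defaultdict


def ranges_to_read(locations, padding=2):
    """Turn LSP Location objects into merged (start_line, end_line) windows per file."""
    per_file = defaultdict(list)
    for loc in locations:
        start = max(0, loc["range"]["start"]["line"] - padding)
        end = loc["range"]["end"]["line"] + padding
        per_file[loc["uri"]].append((start, end))

    merged = {}
    for uri, spans in per_file.items():
        spans.sort()
        windows = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = windows[-1]
            if start <= last_end + 1:
                # Overlapping or adjacent windows collapse into one read.
                windows[-1] = (last_start, max(last_end, end))
            else:
                windows.append((start, end))
        merged[uri] = windows
    return merged
```

The executor would then read only these windows, so two references a few lines apart cost one small read rather than a whole file.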

The integration is read‑only on the LSP and per‑language. It never sends mutations like textDocument/rename; it activates only when a language server is running for the relevant file type. If no server is available, the executor falls back to text‑based heuristics—less reliable, but functional. When a server is present, structured queries are always preferred.

Trade-offs & boundaries

Language‑server integration has honest edges. It requires a running LSP per language, adding startup latency while the server indexes the project. For short‑lived, single‑file tasks the overhead may outweigh the benefit. The fallback to text heuristics is noisier and less precise, so environments without an LSP (some dynamic languages, custom DSLs) operate at reduced reliability.

The integration is strictly static. It provides structural metadata—symbol names, types, references—but not runtime types, dynamic dispatch, or metaprogramming patterns. A method that compiles but throws at runtime will not be caught. The model must still read bodies and comments for semantic understanding; LSP anchors the structure of reasoning, not its meaning.

LSP servers themselves vary: some lag behind language versions, some omit certain symbol kinds, some crash during large indexing. The harness must handle these gracefully—timeouts, retries, degradation—never blocking the plan indefinitely.
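
The timeout, retry, and degradation behaviour described above could be wrapped roughly as follows. This is a sketch under stated assumptions: lsp_query and text_heuristic are hypothetical zero-argument callables supplied by the caller, and the default timeout and retry count are placeholders rather than harness settings.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def query_with_fallback(lsp_query, text_heuristic, timeout_s=2.0, retries=1):
    """Try the structured query; degrade to the text heuristic on timeout or error."""
    # One worker per attempt, so a hung first attempt cannot block the retry.
    pool = ThreadPoolExecutor(max_workers=retries + 1)
    try:
        for _ in range(retries + 1):
            future = pool.submit(lsp_query)
            try:
                return {"source": "lsp", "result": future.result(timeout=timeout_s)}
            except FutureTimeout:
                # The call keeps running in its thread; a real client would also
                # tell the server to stop, for example with $/cancelRequest.
                future.cancel()
            except Exception:
                # Server error or crash: fall through to the next attempt.
                pass
    finally:
        pool.shutdown(wait=False)
    # Never block the plan indefinitely: answer from the noisier fallback instead.
    return {"source": "heuristic", "result": text_heuristic()}
```

Tagging the result with its source lets downstream steps treat heuristic answers with less trust than structured ones.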

When it earns its place

LSP integration pays off whenever the agent must navigate a codebase with multiple modules or produce edits that demand structural precision. Concrete triggers:

Cross‑file refactoring: renaming a symbol that appears in dozens of files. The references query provides a definitive list of affected locations, preventing silent misses.
Type‑aware code generation: adding a function call that must match a signature in another file. The definition query reveals the exact parameter types and return type, eliminating guesswork.
Import management: documentSymbol lists a module’s exports, so the agent can choose the correct import path without scanning the file manually.
Validating edits before commit: a follow‑up references query can verify that no dangling references remain after a change.

For single‑file tasks with simple patterns (e.g., generating a unit test for a known signature), the startup cost may not be justified. But any time the agent would need to ask “what is the definition of this name?” or “who uses this variable?”, the LSP answers immediately, from the code itself rather than from inference.

The LSP is the single source of truth for static program structure—the harness engineers reliability from outside the model.

Harness

The MCP router

Centralizing all MCP calls through a harness router eliminates silent context poisoning by deferring schemas and sandboxing outputs.

The MCP router is a single entry point inside the Xihe harness through which every Model‑Context‑Protocol tool invocation must pass. It turns on one idea: separate the inventory of available tools from their schemas and results. Agents see a lightweight menu and compact pointers; schemas and large payloads are fetched or stored on demand, never preloaded or dumped into context.

The failure mode

Without a centralized router, every agent must preload the complete MCP tool schema into each prompt turn—a fixed tax that grows with the tool catalog. A single schema can weigh hundreds of tokens; a dozen tools waste 4–8K tokens per turn on unused metadata, crowding out task instructions and reasoning space. This is silent context poisoning: no crash, just a gradual degradation of response quality as the prompt fills with irrelevant structure.

The second failure is output dumping. When a tool returns a large payload—file contents, search hits, database rows—it lands directly in the context window. The context bloats monotonically, triggering latency spikes (the model must re‑ingest the entire bloated history), token overflow, and cost blowouts. Because each agent manages its own tool calls, there is no central point to trim, meter, or audit what happened. Adding tools makes prompts heavier; executing tools makes contexts larger. The result is a brittle, scale‑negative integration: more capability means worse performance, often invisibly until the system breaks.

How it works

The router sits in the harness, never on the agent side. It enforces a uniform interface through three meta‑tools, replacing raw schema injection with controlled, on‑demand access.

Resident menu. At startup the harness builds a lightweight list of tool names and one‑line descriptions. This menu costs only a handful of tokens per tool and lives in the agent’s context as the sole static reference. No schemas, no parameters—enough for an agent to discover what’s available.
On‑demand schema loading. When an agent decides to invoke a tool, it first calls describe(tool_name). The router fetches the full JSON schema only then, and only for that single tool. The schema never appears in the prompt unless the agent explicitly requests it, cutting per‑turn tool overhead by roughly 70% compared to preloading everything.
Three meta‑tools. All MCP interactions flow through search (discover tools by capability), describe (retrieve a tool’s schema), and call (execute with parameters). The router never exposes raw tool injection; an agent cannot bypass this interface. The harness intercepts and rejects any direct MCP traffic.
Output sandboxing. When a tool returns a large result, the router stores the full payload externally and returns a compact pointer—an identifier plus the byte size—instead of the raw output. The agent sees only the pointer, keeping the bulk out of the context window. For oversized results this yields approximately 98% context savings.
Audit boundary. Every search, describe, and call is logged at the router boundary: agent ID, tool name, input size, output size, latency. The harness maintains a single, structured audit stream for billing, debugging, and governance. The conductor and dispatch remain completely oblivious to output bulk; they schedule leaf tasks that produce pointers, not documents.
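
The mechanisms above can be sketched in one small class. Everything here is illustrative: the Router class, the shape of the tool registry, the 2048-byte threshold, and the in-memory dictionary standing in for the external store are assumptions, not the harness's actual interface.

```python
import json
import time
import uuid


class Router:
    """Single entry point for tool access: search, describe, call."""

    def __init__(self, tools, sandbox_threshold=2048):
        # tools maps name -> {"description": str, "schema": dict, "fn": callable}.
        # This registry shape is an assumption made for the sketch.
        self._tools = tools
        self._threshold = sandbox_threshold  # bytes; larger outputs are sandboxed
        self._store = {}  # in-memory stand-in for the external payload store
        self.audit = []   # one structured record per router operation

    def menu(self):
        # Names and one-line descriptions only; no schemas, no parameters.
        return [f"{name}: {tool['description']}" for name, tool in sorted(self._tools.items())]

    def search(self, agent_id, keyword):
        start = time.perf_counter()
        needle = keyword.lower()
        hits = [name for name, tool in sorted(self._tools.items())
                if needle in f"{name} {tool['description']}".lower()]
        self._log(agent_id, "search", keyword, len(keyword), len(json.dumps(hits)), start)
        return hits

    def describe(self, agent_id, tool_name):
        start = time.perf_counter()
        schema = self._tools[tool_name]["schema"]  # fetched for this one tool only
        self._log(agent_id, "describe", tool_name, len(tool_name), len(json.dumps(schema)), start)
        return schema

    def call(self, agent_id, tool_name, params):
        start = time.perf_counter()
        output = self._tools[tool_name]["fn"](**params)
        raw = output if isinstance(output, str) else json.dumps(output)
        size = len(raw.encode("utf-8"))
        if size > self._threshold:
            # Large result: keep the bulk outside the context, return a pointer.
            pointer_id = uuid.uuid4().hex
            self._store[pointer_id] = raw
            result = {"pointer": pointer_id, "bytes": size}
        else:
            result = raw
        self._log(agent_id, "call", tool_name, len(json.dumps(params)), size, start)
        return result

    def _log(self, agent_id, op, target, input_size, output_size, start):
        self.audit.append({
            "agent": agent_id, "op": op, "target": target,
            "input_size": input_size, "output_size": output_size,
            "latency_ms": (time.perf_counter() - start) * 1000.0,
        })
```

An agent holding only the menu would call search or describe when it needs detail, and would see a two-field pointer in place of a multi-kilobyte payload.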

The developer experience is simple: register MCP servers with the router via a configuration map. No agent code changes when tools are added or removed. The router’s menu updates automatically at harness initialization.

Trade‑offs & boundaries

On‑demand schema loading adds one extra round‑trip (describe) the first time a tool is used in a turn. For agents that call the same small set of tools repeatedly, this overhead is small compared to LLM inference time but not zero. A very low‑latency pipeline (sub‑200 ms per turn) might feel the cost; in most other cases the token savings outweigh it.

The router is a single point of failure. If the router process goes down, no agent can call any MCP tool. The harness is already a critical component, so this is usually acceptable, but high‑reliability deployments should plan a standby or a graceful degradation path—for example, falling back to a static menu with no sandboxing.

Output sandboxing depends on an external store (in‑memory, file, or object store). If that store is unavailable, the router cannot return a valid pointer. A fallback policy must be explicit: either fail the tool call or allow inline output with a warning. The ~98% savings apply only to large outputs; results already below the sandboxing threshold pass through directly, yielding no savings.
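
The explicit fallback policy could be expressed as below. This is a sketch only: the names Fallback, StoreUnavailable, ToolCallFailed, and sandbox are hypothetical, and put stands for whatever function writes a payload to the external store and returns its pointer.

```python
from enum import Enum


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) store client when a write cannot complete."""


class ToolCallFailed(Exception):
    """Raised to the agent when the policy is to fail rather than inline."""


class Fallback(Enum):
    FAIL = "fail"
    INLINE_WITH_WARNING = "inline_with_warning"


def sandbox(payload: str, put, threshold: int, policy: Fallback) -> dict:
    size = len(payload.encode("utf-8"))
    if size <= threshold:
        # Below the threshold the result passes through directly: no savings.
        return {"inline": payload}
    try:
        return {"pointer": put(payload), "bytes": size}
    except StoreUnavailable as exc:
        if policy is Fallback.FAIL:
            raise ToolCallFailed("sandbox store unavailable") from exc
        # Degraded path: the bulk enters context, but never silently.
        return {"inline": payload, "warning": "sandbox store unavailable; output inlined"}
```

Making the policy a required argument keeps the choice visible at every call site instead of burying it in a default.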

The router does not enforce tool authentication or authorization—that remains the MCP server’s responsibility. It also does not rank tools or decide which to call; the agent’s reasoning loop owns that choice. Finally, the router is not a replacement for overall context window management. Repeatedly calling the same tool and holding old pointers still grows context, so the conductor must still apply summarization or eviction policies.

When it earns its place

The MCP router pays for itself as soon as an agent uses more than three MCP tools, or any tool regularly returns more than a few hundred tokens of output. It is indispensable when:

Agents share a growing set of tools (codebase search, file I/O, API clients) and preloading all schemas burns 5–15% of the context window.
Tools produce variable, often large, results—file contents, database dumps, search results—where a single oversized payload would otherwise poison the turn.
Token budgets are tight, either because of cost constraints or small model context windows.
Every tool call must be audited for compliance, cost allocation, or replay debugging.
Multiple agents use the same MCP servers; the router provides a single metered, sandboxed interface and prevents cross‑agent context contamination.

In contrast, a two‑tool prototype with sub‑500‑byte outputs and no audit requirements sees only overhead. The router’s value emerges at scale—precisely when the naïve MCP pattern begins to break.

The MCP router turns MCP from a per‑turn tax into a per‑use toll—metered, sandboxed, and auditable at the harness gate.

Harness

Context-mode (think in code)

Context-mode offloads deterministic work from expensive LLM context to cheap, auditable script execution by treating context as a budget to compress, not expand.

Instead of reading files into the context window for computation, the agent writes a small, self-contained script (Python, shell, SQL) that runs in a sandbox. Only the script’s stdout or a failure signal re-enters context — raw file contents never do. This shifts counting, diffing, formatting, and aggregation from an untyped, expensive channel (LLM reasoning) to a typed, cheap channel (execution), turning the model into a thinker rather than a calculator.

The failure mode

Without context-mode, every deterministic sub-task forces the LLM to ingest raw data and perform arithmetic, pattern matching, or logic inside a probabilistic environment. This breaks in three concrete ways. First, token waste and context bloat: raw files consume the budget that should go to planning, making the conductor forget earlier turns as the window fills with noise. Second, silent data hallucination: the model miscounts, invents numbers, or misaligns columns with high confidence, and the error propagates invisibly into downstream reasoning because there is no verifiable trace. Third, non-reproducible output and lost auditability: natural-language reasoning is opaque; a reviewer cannot replay the computation to find the mistake, and a hallucinated summary looks just as plausible as a true one. The harness cedes deterministic work to the component least suited for it, turning every such task into a reliability landmine that poisons the agent’s output without a crash.

How it works

The harness inspects the conductor’s plan and classifies leaf tasks whose result depends only on deterministic computation over known data (counting matches, diffing files, aggregating columns, formatting structured output). When the query is unambiguous enough, it routes the job to a sandboxed executor. This decision can be per-task or a global setting, and it is enforced by the dispatch layer, which is deterministic code, not an LLM.

Script generation. The agent composes a short, self-contained script with explicit input paths, inline comments, type annotations, and assertions. These serve as the reasoning trace; the code is the thought.
Sandboxed execution. The script runs in an isolated environment (no network, read‑only filesystem, resource limits). Only stdout and an exit code cross the boundary. The sandbox rejects side‑effects and enforces timeouts.
Result compression. The harness captures stdout (typically a single value, JSON, or table) and forwards it to the conductor as a compressed fact. A non‑zero exit or timeout becomes an explicit, structured failure signal — never a guessed summary.
Full audit trail. The script source, invocation timestamp, exit code, and output are logged externally. Any developer can replay the exact computation later, turning reasoning from opaque prose into a reproducible artefact.
Fallback on ambiguity. When the task demands nuance or subjective judgment, or the agent’s script fails irrecoverably, the harness falls back to standard context reading. The conductor is not asked to intervene, but the fallback is always recorded in the logs, preserving correctness at the cost of extra tokens.
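
The execution, compression, and audit steps can be sketched with the standard library alone. The function name run_script and the record fields are assumptions. Note that a child Python process started with -I is only isolated from environment variables and user site packages; it is not a security sandbox, so the network, filesystem, and resource limits described above would have to come from a container or similar mechanism.

```python
import subprocess
import sys
import time


def run_script(source: str, timeout_s: float = 5.0, audit_log=None) -> dict:
    """Run an agent-written script and return only a fact or a structured failure."""
    started = time.time()
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode, not a sandbox
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        record = {"exit_code": proc.returncode, "stdout": proc.stdout.strip(), "reason": None}
    except subprocess.TimeoutExpired:
        record = {"exit_code": None, "stdout": "", "reason": "timeout"}

    if audit_log is not None:
        # Source, timestamp, exit code, and output are kept for later replay.
        audit_log.append({"source": source, "started": started, **record})

    if record["exit_code"] == 0:
        return {"fact": record["stdout"]}
    # A failure is reported as a failure, never replaced by a guessed summary.
    return {"failure": record["reason"] or f"exit {record['exit_code']}"}
```

For example, a script such as print(len("alpha beta gamma".split())) would come back as a single short fact, while the text it counted never enters context.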

Trade-offs & boundaries

Context-mode imposes real costs:

Latency. Sandbox startup and script writing can be disproportionate for trivial queries (“count words in a single line”).
Token overhead. Generating correct code consumes output tokens; the trade‑off is upfront token spend for later correctness and auditability.
Agent competence. Small or non‑coding models may produce broken scripts, forcing a fallback and wasting time.
Security. The sandbox must prevent writes, network access, and runaway loops; a misconfiguration turns code execution into an attack vector.
Not for nuance. The mode cannot handle open‑ended tasks like tone analysis or style critique. If forced, a script returns a vacuous mechanical answer (e.g., “0 differences” because it only checked string equality).
Explicit degradation. If a script errors out, the harness does not retry blindly but escalates to standard reading, adding latency but avoiding silent corruption.
Classification. Deciding which tasks are unambiguous enough remains an engineering art, not a solved algorithm: over‑eager routing produces brittle failures, while under‑eager routing wastes tokens.

Red line: while context-mode is active for a task, raw file contents must never enter context; the harness must intercept and strip any unintended data.

When it earns its place

Context-mode earns its place whenever a concrete, verifiable question can be answered from structured or semi‑structured data. Examples: counting occurrences of a pattern across a codebase (PR review), diffing two configuration files or API responses, formatting a CSV into a nested JSON structure, aggregating metrics across multiple log files, or extracting the third column from rows where a value exceeds a threshold. It also shines in multi‑step DAGs where an intermediate count, sum, or constraint check must be exact before the next step proceeds; for instance, a conductor planning a refactor first needs an exact count of deprecated API calls. In audit‑critical environments, the script is a repeatable reproducer, and the output is a single deterministic line. The pattern eliminates the “hallucinated number” bug class that looks plausible in review but erodes trust in the entire system.

Context is a budget to compress, not expand — write a tool, not a document; let code carry the precision, and keep the model for what it does best.

Harness

Conductor · Dispatch · Executor

Conductor plans, Dispatch routes deterministically, Executor runs leaves—failures loop back to reasoning, never silently past.

The harness decomposes agent work into three hard-boundary layers—Conductor, Dispatch, Executor—each decoupled from any specific model. The one idea it turns on is that reliability is an engineered property of control flow, not a byproduct of careful prompting. By forcing planning, routing, and execution into separate roles, the harness eliminates the entanglement that makes model‑driven pipelines brittle.

The failure mode

Without this three‑way separation, agent loops collapse into monoliths where the same model decides what to do, when to do it, and how to recover. That coupling breeds a family of failures:

Race conditions on shared state. Two subtasks that appear independent in a prompt may update the same mutable context (a file handle, a progress counter) without coordination. Without compare‑and‑swap gating, they interleave and corrupt each other’s data, producing outputs that look plausible but are logically inconsistent.
Silent corruption. A leaf returns a wrong but well‑formatted result. No independent layer validates it; the error propagates downstream, and the developer sees a correct‑looking final answer. The damage is invisible.
Wasted inference. A single model re‑plans the entire task for a leaf‑level failure, burning tokens and latency to re‑reason over work that was already correct. Alternatively, the system retries the same prompt indefinitely because it lacks a distinct re‑planning vocabulary.
Model lock‑in. Every step runs on the same model because there is no routing layer. A cheap, fast model cannot be used for routine leaves while a reasoning‑heavy model handles only the planning, so cost and quality are locked together.
Brittle recovery. When the model itself both plans and executes, a failure is indistinguishable from a change of intent. The system either retries blindly or gives up entirely.

All of these are failures of orchestration, not model quality. The harness solves them by giving each responsibility a separate owner with a fixed contract.

How it works

The harness enforces three roles with iron boundaries:

Conductor receives a user intent and returns a directed acyclic graph (DAG) of atomic leaves. Each leaf defines a unit of work—a prompt template, a model binding (Opus, Codex, GLM), and a failure threshold. The conductor is reasoning‑heavy and hot‑swappable; you can swap the planning model without touching the rest of the harness. It never touches execution, never reads a model output, and never consumes raw tokens. Its product is the DAG alone.
Dispatch is deterministic harness code with no model in the loop. It polls the DAG for leaves whose dependencies are satisfied, layers them topologically, and fans them out under bounded concurrency. Each leaf is routed to its pre‑bound model. State transitions (ready, running, done, failed) are gated with compare‑and‑swap (CAS) to prevent double‑dispatch or races. Dispatch never plans, never validates an output, and never calls a model itself. Its job is strictly sequencing and routing.
Executor runs a single leaf atomically on its assigned model. It receives the leaf definition and pre‑computed inputs from its dependencies, calls the model, and returns either a structured output or a failure signal. It has zero visibility into the DAG, prior leaves, or the overall plan. It is cheap, stateless, and massively parallelizable. Executors do not validate—they only return a binary pass/fail.
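
Dispatch's two mechanical duties, topological layering and CAS-gated state transitions, can be sketched as follows. The class and function names are hypothetical, the in-progress state is called running here, and a lock-protected dictionary stands in for whatever atomic store a real harness would use. The layering relies on graphlib from the Python standard library (3.9 and later).

```python
import threading
from graphlib import TopologicalSorter


class LeafStatus:
    """Status table whose transitions succeed only from the expected state."""

    def __init__(self, leaves):
        self._status = {leaf: "pending" for leaf in leaves}
        self._lock = threading.Lock()

    def cas(self, leaf, expected, new) -> bool:
        # Compare-and-swap: the caller wins only if nobody moved the leaf first.
        with self._lock:
            if self._status[leaf] != expected:
                return False
            self._status[leaf] = new
            return True

    def get(self, leaf):
        with self._lock:
            return self._status[leaf]


def layers(deps):
    """deps maps leaf -> set of upstream leaves; returns batches that can run together."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

A worker that wants to dispatch a leaf would call cas(leaf, "ready", "running") and proceed only on True, which is what rules out double dispatch when several workers poll the same graph.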

The critical loop: when an executor signals failure, dispatch marks the leaf as failed, stops the DAG for that branch, and returns control to the conductor. The conductor then replans—re‑splitting the sub‑DAG, changing the model, or aborting the intent entirely. A failure never passes the dispatch gate silently; it always re‑enters the reasoning layer.
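
The loop can be illustrated with a deliberately sequential sketch; real dispatch would fan leaves out concurrently. The function run_plan, the execute callback signature, and the blocked status for leaves downstream of a failure are all assumptions made for the example.

```python
def run_plan(deps, execute):
    """deps maps leaf -> set of upstream leaves; execute(leaf, inputs) -> (ok, output)."""
    status = {leaf: "pending" for leaf in deps}
    outputs = {}
    progressed = True
    while progressed:
        progressed = False
        for leaf in sorted(deps):
            if status[leaf] != "pending":
                continue
            upstream = [status[d] for d in deps[leaf]]
            if any(s in ("failed", "blocked") for s in upstream):
                # The branch below a failure stops; it is never run on bad inputs.
                status[leaf] = "blocked"
                progressed = True
            elif all(s == "done" for s in upstream):
                ok, output = execute(leaf, {d: outputs[d] for d in deps[leaf]})
                status[leaf] = "done" if ok else "failed"
                if ok:
                    outputs[leaf] = output
                progressed = True
    # Failed leaves go back to the conductor, which decides how to replan.
    failed = sorted(leaf for leaf, s in status.items() if s == "failed")
    return {"status": status, "outputs": outputs, "replan": failed}
```

Independent branches still complete, so the conductor replans only the sub-DAG rooted at the failed leaf.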

Trade-offs & boundaries

The separation is not free:

DAG construction overhead. Every task, even a trivial one, requires at least one conductor call to produce the initial plan. For a single‑step operation, this adds latency and cost without benefit.
Re‑planning latency. A failed leaf must bubble up to the conductor, be re‑reasoned, and produce a new sub‑DAG before execution resumes. For workflows where failures are rare, the indirection may be overkill.
Granularity burden. The developer must decompose intents into genuinely atomic leaves. Leaves that are too coarse waste re‑planning cycles; leaves that are too fine bloat the DAG and increase dispatch overhead. There is no automatic right‑sizing.
The verifier closes the plausibility gap. A leaf can return a plausible but wrong result, and the executor’s own pass/fail signal won’t catch it. The verifier — a model‑agnostic cross‑model skeptic, off by default — reviews results after the DAG runs; on a real flaw the conductor silently escalates to a stronger model and re‑plans. Opt‑in: without it configured, a returned result counts as success.
Determinism limits. Dispatch’s CAS gating guarantees that no leaf is dispatched twice, but the timing of concurrent dispatches is non‑deterministic. If leaves modify shared external resources, ordering can still cause races; the harness provides no transactional safety for external state.
Conductor degradation. If the conductor itself fails repeatedly (mis‑specifies the DAG, hits rate limits), the system degrades to a hard error. There is no graceful partial output; either the plan completes or it does not.

Red lines are absolute and breaking them voids the reliability guarantees: dispatch never plans, the executor never validates, and the conductor never consumes raw model outputs. If the conductor ever touches a leaf output, it reintroduces the silent‑corruption path the harness was built to eliminate.

When it earns its place

This architecture pays off when:

The workflow spans multiple models with different cost/latency profiles—e.g., a reasoning‑heavy planner, a fast executor for bulk operations, and a specialist for code generation. Each leaf is pre‑bound to the right model.
Independent subtasks can run in parallel (e.g., generating tests, docs, and stubs for the same function). Bounded concurrency prevents resource saturation while exploiting full parallelism.
High‑stakes correctness is non‑negotiable—financial, legal, or safety‑critical tasks where a silently propagated error is unacceptable. The re‑plan loop ensures that every failure is diagnosed at the reasoning level.
You need auditable traceability—each leaf’s input, output, and failure history is logged, and the DAG provides a full provenance record for debugging.
You expect model instability and want a fault‑tolerant system where retries or model switches happen at the plan level, not in an opaque prompt loop.

The planner plans, the router routes, the executor acts—failures return to the reasoning level, never to a model‑in‑the‑loop retry.

Harness

The web stack

The web stack engineers retrieval reliability outside the model, turning search intent into clean, bounded text through provider rotation, extraction, and sandboxing.

The web stack is a harness‑level resource that turns the internet into research‑grade signal by owning the entire retrieval pipeline — querying, failover, content extraction, and size sandboxing — outside the model. Its core premise: every fragile part of web retrieval is engineered into the harness, not prompted into an LLM. The conductor issues a search intent, and the stack autonomously returns clean, bounded text excerpts; the model never parses HTML, handles API keys, or reasons about missing providers.

The failure mode

Without the web stack, any agent that needs online information falls into silent failure patterns. A single‑provider dependency means one expired or exhausted API key stops all retrieval, often without a clear signal — the model receives an empty string or an error page and continues reasoning on noise. Manual URL curation or prompt‑based search forces the conductor to guess links or rely on the model to parse messy HTML, bloating context with navigation and ads. Oversized pages then explode the context window: a single article packed with boilerplate can consume thousands of tokens before any fact appears, triggering silent truncation or drowning the signal in irrelevant content. Because retrieval is treated as an in‑model problem, brittleness compounds — recovery logic, retry prompts, and hallucinated fallbacks all waste reasoning budget and corrupt downstream conclusions. The harness is absent, so reliability rests on fragile tool‑call chains that break silently.

How it works

The web stack operates as a single harness‑level resource invoked by the conductor with a search intent. No hand‑fed URLs, no recovery prompts. It encapsulates three autonomous mechanisms:

Provider rotation – A pool of search backends (Bing, SerpAPI, self‑hosted endpoints) is cycled automatically. When a provider’s API key is absent or exhausted, that backend is silently skipped. The stack returns a structured “no results” envelope only when every provider in the pool is unreachable. Degradation is graceful: a missing key never breaks retrieval, and the conductor never sees the failover.
Content extraction – Every fetched page is reduced to its text signal using a Trafilatura‑style extractor that discards navigation, ads, boilerplate, and formatting. The output is a clean block of paragraphs and headings, stripped to the semantic core. If the extracted text exceeds a configurable size threshold, the MCP router’s output sandboxing applies automatically. Downstream agents receive a pruned excerpt or a condensed reference, never a raw DOM that could overflow context or introduce noise.
Model‑agnostic reliability – Search dispatch, provider failover, extraction, and sandboxing all happen outside the model loop. The harness owns the mechanics, so the model stays focused on decision‑making. The control flow is deterministic even though search results are not: the conductor sends an intent, and the stack returns structured text snippets with source metadata.
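
Provider rotation with silent skipping might look like the sketch below. Nothing here calls a real search API: each provider is a hypothetical (name, api_key, fetch) tuple whose fetch function the caller supplies, and the size bound is a plain character cut standing in for extraction and sandboxing.

```python
class ProviderUnavailable(Exception):
    """Raised by a (hypothetical) fetch function when its backend is exhausted or down."""


class ProviderPool:
    def __init__(self, providers):
        # Each entry: (name, api_key or None, fetch(intent, key) -> [{"url", "text"}]).
        self._providers = list(providers)
        self._next = 0  # rotation cursor

    def search(self, intent, max_chars=4000):
        count = len(self._providers)
        for offset in range(count):
            index = (self._next + offset) % count
            _name, key, fetch = self._providers[index]
            if not key:
                continue  # absent key: skip without surfacing an error
            try:
                hits = fetch(intent, key)
            except ProviderUnavailable:
                continue  # exhausted or unreachable: try the next backend
            self._next = (index + 1) % count  # start from the next provider next time
            return {
                "status": "ok",
                "results": [
                    {"url": h["url"], "text": h["text"][:max_chars],
                     "truncated": len(h["text"]) > max_chars}
                    for h in hits
                ],
            }
        # Only when every provider was skipped or failed.
        return {"status": "no_results", "results": []}
```

The caller sees one envelope shape in both cases, which is what lets the conductor stay unaware of which backend answered.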

Trade-offs & boundaries

The web stack’s design is deliberate, and its edges are honest.

Latency overhead – Rotating through providers and fetching full pages adds network time per query. The stack trades absolute speed for guaranteed structural signal; it is not tuned for sub‑second interactive loops.
Content extraction is lossy – Trafilatura‑style extraction works best on static, well‑structured HTML. It discards images, layout, and interactive elements, and it cannot execute JavaScript. Dynamic content (single‑page apps, lazy‑loaded text) may be minimal or missing. Workflows that need visual evidence or original page layout will get nothing useful.
Sandboxing is a blunt instrument – Oversized content is truncated, not intelligently summarised. A low threshold may discard nuance; a high threshold may still stress context windows. The configurable size limit demands tuning for the model in use.
Result quality depends on provider relevance – The stack returns the top results from the search provider as they are. It does not re‑rank, filter spam, or compensate for low‑quality indexes. A degraded provider that returns noisy results will feed those into extraction without signal‑quality checks.
Network and key dependency – At least one valid provider key must exist in the pool. If all keys are exhausted or misconfigured, the structured “no results” envelope is returned, and the conductor must handle that case explicitly (e.g., fallback to local knowledge or mark the branch unresolved). The stack will not retry indefinitely or alert beyond the empty envelope.
No interactive or streaming pages – The stack fetches a static URL per result. It cannot follow pagination chains, handle form submissions, or maintain authenticated sessions. It is a one‑shot, text‑first extractor, not a general web browser.

Red line: the stack never exposes raw HTML, never interprets JavaScript, and never bypasses the size sandbox. The conductor cannot override sandboxing or demand a specific provider on a per‑request basis.

When it earns its place

The web stack pays off wherever research breadth is high, context budgets are tight, and the conductor must ground plans in live, uncurated sources without fragile per‑agent integration.

Research‑intensive plans – A conductor that needs to verify facts from multiple sites or collect documentation across domains can issue search intents as DAG steps; the stack handles parallel retrieval, failover, and pruning, keeping token consumption predictable.
Long‑running autonomous agents – In loops that run for hours, the cost of a single oversized page compounds, because it is re‑ingested on every later turn. Mandatory sandboxing prevents context‑window explosions from accumulating over time.
Multi‑step synthesis or reports – Clean extracted text can be fed directly into a summarisation executor or verifier without preprocessing. No manual URL whitelisting or prompt‑engineered recovery logic.
Environments with multiple search providers – Teams that avoid vendor lock‑in by pooling Bing, SerpAPI, and self‑hosted endpoints gain resilience: one expiring key is a non‑event, and retrieval continues until all providers are exhausted.
Scenarios that previously demanded hand‑curated URLs – If the workflow required a human to pre‑select pages or a brittle script to scrape them, the web stack replaces that with a single intent call, keeping the conductor in its planning lane.

The harness owns retrieval so the model never has to negotiate a rate‑limit or parse a <div>.

Substrate

Memory that distils and forgets — layered, self-curating, compounding into capability.

Substrate

The temporal knowledge graph

Memory is never a replay log—it’s a layered graph that distills, abstracts, and forgets, so only provenanced facts survive.

Memory in Xihe is not a replay log; it is a layered temporal graph (L0–L6) that actively distills, abstracts, and forgets. The one idea it turns on: knowledge worth keeping must earn its place through repeated, conflict‑free reinforcement, and then gets compressed into higher‑level representations while lower‑level traces decay. This structural confidence—where a fact’s reliability is implicit in the layer it reached and the provenance behind it—replaces ad‑hoc eviction policies with a design‑driven memory system.

The failure mode

Without a temporal knowledge graph, agent memory succumbs to three intertwined failures. First, fact bloat: raw events, messages, and tool outputs accumulate without bound, inflating retrieval costs and drowning reasoning in noise. Second, silent pollution: stale or contradictory observations persist alongside current ones, competing for attention and corrupting downstream decisions—an agent treats a one‑off misstatement with the same weight as a repeatedly confirmed constraint. Third, confidence blindness: all facts appear equally reliable because no structural provenance exists; the agent cannot distinguish a single ephemeral trace from a pattern reinforced across dozens of independent sessions. These failures compound, making long‑running agents brittle, amnesiac, or expensive as context windows fill with irrelevance. The temporal knowledge graph eliminates them by making forgetting a first‑class, configurable process and provenance a structural backbone.

How it works

The graph is organised as seven layers, each with a distinct retention and abstraction rate:

L0 – Raw ephemeral event streams (high volume, minimal retention).
L1 – Normalised, deduplicated signals after absorption cleans the firehose.
L2 – Structured facts with provenance (who said what, when, and from which source layer).
L3 – Entity graph tracking relationships and temporal confidence, linking each fact to its source layer.
L4 – Behavioural genes: reusable interaction patterns compressed from repeated, conflict‑free evidence.
L5 – Strategic patterns and cause‑effect chains that emerge after sufficient consolidation.
L6 – Abstract concepts and invariant rules distilled from lower layers.

Three concurrent processes shape the graph continuously:

Absorption ingests new traces into L0, normalises them to L1, and constructs L2 facts with initial provenance.
Promotion elevates a consolidated fact when it passes a configurable threshold—typically repeated, independent reinforcement without contradiction. As a fact rises, it abstracts further: L3 maps entities and temporal edges; L4 compresses sequences into reusable genes; L5 extracts strategic cause‑effect; L6 crystallises invariants.
Decay demotes or removes edges that lose reinforcement. Lower layers degrade quickly unless constantly reaffirmed; higher layers persist only as long as they compress repeated, conflict‑free evidence. Configurable promotion thresholds and decay rates govern the tempo of learning and forgetting.

Provenance is structural: every L2+ fact stores a pointer to its source layer, so confidence is not a floating number but a property of the graph—a fact at L4 survived multiple promotion cycles and carries more weight than one at L2. Queries surface only consolidated facts (L2 and above); raw events (L0–L1) are inaccessible past a configurable horizon. Memory is never a replay log.
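The promotion and decay rules above can be sketched in a few lines. This is a minimal illustration, not the Substrate's implementation: the class names, the `promote_after` and `decay_after` knobs, and the choice to count reinforcement by distinct session id are all assumptions made for the example.

```python
from dataclasses import dataclass, field

MAX_LAYER = 6     # L0..L6
QUERY_FLOOR = 2   # queries surface only L2 and above


@dataclass
class Fact:
    statement: str
    layer: int = 2              # absorption constructs facts at L2
    source_layer: int = 1       # structural provenance: the layer it came from
    sessions: set = field(default_factory=set)  # independent reinforcements
    contradicted: bool = False
    idle_cycles: int = 0        # cycles since the last reinforcement


class TemporalGraph:
    """Hypothetical toy graph; thresholds are illustrative tuning knobs."""

    def __init__(self, promote_after: int = 3, decay_after: int = 2):
        self.promote_after = promote_after
        self.decay_after = decay_after
        self.facts: dict[str, Fact] = {}

    def absorb(self, statement: str, session_id: str) -> None:
        fact = self.facts.setdefault(statement, Fact(statement))
        fact.sessions.add(session_id)   # repeats from one session count once
        fact.idle_cycles = 0

    def contradict(self, statement: str) -> None:
        if statement in self.facts:
            self.facts[statement].contradicted = True

    def cycle(self) -> None:
        """One asynchronous pass: decay first, then promotion."""
        for key, fact in list(self.facts.items()):
            if fact.contradicted or fact.idle_cycles >= self.decay_after:
                fact.layer -= 1
                fact.contradicted = False
                fact.idle_cycles = 0
                if fact.layer < QUERY_FLOOR:
                    del self.facts[key]   # fell below the queryable layers
                continue
            if len(fact.sessions) >= self.promote_after and fact.layer < MAX_LAYER:
                fact.source_layer = fact.layer
                fact.layer += 1
                fact.sessions.clear()     # the next promotion needs fresh evidence
            fact.idle_cycles += 1

    def query(self) -> list[str]:
        ranked = sorted(self.facts.values(), key=lambda f: -f.layer)
        return [f.statement for f in ranked if f.layer >= QUERY_FLOOR]
```

In this sketch a fact confirmed by three distinct sessions rises one layer on the next cycle, while a fact seen once is demoted and removed after it goes unreinforced for `decay_after` cycles.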

Trade-offs & boundaries

Latency of insight: Promotion and decay run asynchronously. A fact from a single conversation may not reach higher layers for many cycles, or ever, if it is never reinforced. Real‑time recall of fresh events requires a separate working buffer.
Tuning sensitivity: Thresholds that are too high starve abstraction; thresholds too low let noise flood upper layers. The right balance depends on the domain’s signal‑to‑noise ratio and novelty rate.
No raw replay: After the configurable horizon, raw events vanish. Debugging a mistaken inference relies on provenance traces, not full logs. Production observability trades precise historical playback for trust in the graph’s consolidation.
Not a database: The graph does not guarantee point‑in‑time queries, exact reconstruction, or transactional consistency. It is a probabilistic, always‑changing structure. Store immutable facts (e.g., legal names) in a separate key‑value store.
Lossy compression: Abstraction from L4 to L5 to L6 discards the original exemplars; the system cannot recover the raw conversations that created a behavioural gene.
Adversarial risk: If noise is deliberately repeated to simulate reinforcement, promotion can be gamed. Additional robustness checks outside the graph’s scope are needed in adversarial settings.
Degradation path: Deeply reinforced patterns at L4–L6 may persist stubbornly after a sudden behaviour change. They eventually decay, but the timeline can be long; explicit “forget” signals or temporarily lowered thresholds can accelerate the shift.

When it earns its place

Use the temporal knowledge graph in any long‑running agent system where past interactions shape future behaviour but noise and staleness are real dangers. It pays off in:

Personal assistants that learn user habits over weeks—yesterday’s miscommunication should not overshadow a year of consistent preference.
Codebase agents that observe commit patterns—a one‑off experimental refactor never becomes a “standard” unless repeatedly reinforced.
Continuous monitoring agents that absorb metric streams, promote recurring anomalies, and let transient blips fade.
Multi‑turn planning or research where the conductor must reason over consolidated facts, not full transcripts, to avoid context bloat.
Multi‑agent consensus where executors produce contradictory claims and the graph’s structural confidence determines which fact gets promoted, providing a shared, trustworthy memory.

The only fact worth keeping is one that has survived repeated, independent scrutiny and time—everything else is noise waiting to be forgotten.

Substrate

Dream consolidation

Dream consolidation turns an agent's past throughput into compounding aptitude by lossily distilling operational history during idle cycles.

Dream consolidation is the Substrate’s idle-time mechanism that turns raw operational history into layered, reusable memory—making an agent’s experience compound, not just accumulate. The single idea it turns on is that reliability and speed compound when past throughput is deliberately and lossily distilled into persistent patterns during downtime, so the agent never re‑derives lessons it has already learned.

The failure mode

Without consolidation, an agent’s memory remains a flat append‑only log. Every new task forces the conductor and executors to re‑derive patterns from scratch, scanning the same noise and the same contradictory signals every time. This creates three compounding failure modes:

Latency debt – retrieval and re‑clustering costs grow with log size, making the agent slower the longer it runs.
Silent pattern loss – useful recurring lessons (rate‑limit backoff that always works, a recovery sequence that succeeds on third attempt) stay buried as raw events and never crystallise into a queryable form. The agent’s behaviour remains fragile, at the mercy of the raw data.
No compounding – even after thousands of identical failures, the agent never improves its default approach because every task starts from the same undifferentiated pile. Manual curation or model retraining become the only escape—both slow, expensive, and brittle.

The root cause is treating memory as a single undifferentiated bucket. There is no mechanism to discriminate signal from noise, no way to promote stable lessons into reusable state, and no discipline that prevents past failures from being re‑experienced as if new. Dream consolidation eliminates all three by design.

How it works

While the agent is idle—no active work, no pending dispatches—the conductor triggers a background dream pass over a sliding window of recent events. The pass runs asynchronously and never preempts active tasks. Within the window, it applies three sequential operations:

Clusters episodes – groups causally or semantically related events into coherent chunks. Isolated noise (one‑off timeouts, irrelevant log lines) is discarded.
Extracts recurring lessons – identifies patterns that repeat across multiple independent clusters. A pattern might be “rate‑limit errors from service X are always preceded by >50 requests/min” or “the retry path sleeping 1 s, then 2 s, then 4 s succeeds within three attempts.”
Lifts patterns – lessons that have survived multiple passes (and are therefore stable) are promoted into persistent, configurable forms. These become entries in an experience graph—a structured memory layer that the conductor and executor query directly, bypassing raw log scans entirely.

The output is a layered memory. Raw logs remain intact for audit and debugging. The distilled patterns form a compact experience graph. Once promoted, a pattern is immutable: later passes may add new patterns alongside it but cannot overwrite the existing one. The sliding window advances continuously, so older raw events eventually drop out of the consolidation window (they stay in the raw log), while the promoted patterns persist. By design the process is lossy: contradictory or irrelevant signals never reach the distilled layer, which compresses memory into a sharper, faster‑to‑retrieve structure. Over successive idle cycles, this turns past throughput into compounding aptitude, with no model retraining and no manual curation required.
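The three operations of a dream pass can be sketched as follows. This is an illustration under stated assumptions: events are plain dicts with hypothetical `episode` and `pattern` keys, grouping by episode id stands in for causal or semantic clustering, and `min_clusters` and `min_passes` are invented parameter names.

```python
from collections import Counter, defaultdict


def dream_pass(window, candidates, promoted, min_clusters=2, min_passes=2):
    """Run one idle-time pass over `window`; mutates `candidates` and `promoted`.

    window:     list of {"episode": str, "pattern": str} events (hypothetical shape)
    candidates: Counter of how many passes each lesson has survived
    promoted:   dict acting as the experience graph; entries are written once
    """
    # 1. Cluster episodes (here: simply group events by episode id).
    clusters = defaultdict(set)
    for event in window:
        clusters[event["episode"]].add(event["pattern"])

    # 2. Extract recurring lessons: patterns seen in several independent clusters.
    #    Patterns confined to a single cluster are treated as isolated noise.
    seen_in = Counter(p for patterns in clusters.values() for p in patterns)
    lessons = {p for p, n in seen_in.items() if n >= min_clusters}

    # 3. Lift: promote lessons that have survived enough passes.
    for lesson in lessons:
        candidates[lesson] += 1
        if candidates[lesson] >= min_passes and lesson not in promoted:
            promoted[lesson] = {"passes": candidates[lesson]}  # never overwritten
    return lessons
```

Calling `dream_pass` once per idle cycle with a fresh window keeps the promoted entries stable: a lesson that is already in `promoted` is left untouched by later passes.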

Trade-offs & boundaries

Lossy by nature. Information that is contradictory, statistically insignificant, or temporally isolated is discarded permanently. Rare but critical edge cases (e.g., a once‑in‑a‑million race condition) may never survive cluster extraction and will not appear in the experience graph. Raw logs remain the fallback for such events, but the consolidated memory is optimised for high‑frequency, recurring patterns.
Requires idle cycles. If the agent is never idle, consolidation never runs. The sliding window advances without processing, and memory stays flat. The mechanism therefore suits only deployments where idle time is guaranteed (e.g., scheduled workloads with predictable gaps).
Immutable promoted lessons. A pattern that was correct yesterday may become stale tomorrow (e.g., a deprecated API error code changes meaning). Dream consolidation does not automatically retract or mutate promoted lessons; contradictory new evidence is simply not lifted into the same slot. The system assumes stability after multiple windows, but cannot predict eventual drift.
No real‑time adaptation. The experience graph updates only after the next complete idle pass. Immediately critical new lessons must wait for consolidation—or rely on slower raw‑log queries in the interim.
Does not replace training. Consolidation operates purely on the agent’s operational context; it never updates the underlying model’s weights. It is a runtime memory optimisation, not a learning algorithm.

When it earns its place

Dream consolidation pays off whenever an agent repeatedly performs similar tasks over time, especially in high‑noise environments:

Long‑running autonomous agents (code‑review bots, triage systems) that process thousands of similar requests daily. Without consolidation, each hour feels like the first; with it, response times drop and success rates rise as common failure patterns become pre‑avoided.
Environments with predictable idle windows (nightly batches, weekends). The agent consolidates the day’s experience during off‑peak hours and starts the next day sharper.
Teams that want to avoid manual pattern extraction. Instead of humans reading logs to write retry strategies or rate‑limiter thresholds, the agent derives and promotes those lessons automatically.
Any system where raw log size grows faster than retrieval speed. The lossy compression keeps hot memory cheap to query while preserving audit trails.

The agent’s past throughput becomes its future aptitude, but only if idle time is spent distilling that history and discarding the rest.

Substrate

Restraint gates & the write validator

A model’s output earns permanence only when it is both permitted and falsifiable.

Restraint gates and the write validator are a two‑stage guard on every automatic fact write in the Substrate. The mechanism turns on a single principle: persistent memory must never trust a model’s output by default; it demands that any automatically stored fact be of an explicitly allowed event type and capable of being disproven by future evidence. Secrets are scrubbed between the model and the gates; an explicit remember command bypasses all protections, preserving owner sovereignty. Reliability against hallucination contamination is engineered into the substrate, independent of which model generated the claim.

The failure mode

Without these guards, every model-generated assertion would be eligible for permanent storage the moment it reaches a write instruction. A confident hallucination—“the project deadline is next Tuesday”—is recorded as fact with the same status as a genuine observation. The failure is silent and cumulative: there is no automatic correction, no natural decay, and no flag to mark contamination. False memories pollute all downstream reasoning that reads memory, steering decisions off course and eroding trust in the system’s state.

The damage is not limited to outright falsehoods. Vague generalities (“the user seems happy”), unfalsifiable interpretations, and contradictory statements can all coexist in memory, turning it into a jumble of unverifiable assertions. Once written, these entries become dependencies for future agent steps, including planner and executor steps that rely on memory as ground truth. The result is brittleness: a single model slip rewrites shared state irreversibly, and the only recovery is manual pruning or a full reset.

Secrets that leak into model output—tokens, API keys, credentials—would be written alongside legitimate facts, creating an unrepairable security and compliance leak. The axis of trust would be placed entirely on the output distribution of a model that, at any moment, may drift or be exploited. The substrate’s memory would become a liability, not an asset.

How it works

Every automatic fact write—any fact the system attempts to persist without an explicit user directive—is passed through a deterministic, model‑free pipeline before it reaches persistent storage.

Secrets scrubbing runs before either gate. Tokens, keys, and credentials are stripped from the fact payload. This prevents accidental exposure on the automatic path even if the model inadvertently emits a secret.
Restraint gates inspect the event type of the write against a fail‑closed whitelist of allowed event types. Any type not explicitly permitted is silently dropped; no write occurs, no error is surfaced to the model. This blocks surprising writes—unexpected structured fact categories, model hallucinations that accidentally resemble memory operations—from ever touching persistent store.
Write validator tests the remaining fact content for falsifiability: the statement must be capable of being disproven by future evidence. Concrete claims (“the user’s time zone is America/New_York”) pass; vague or ungrounded statements (“the user is interested in performance”) are rejected. The validator does not assess truth—only whether the fact has a clear, testable truth condition. Hallucinations that cannot be pinned down never become memory.
Explicit remember command bypasses scrubbing and both gates entirely. When the user issues this directive, the fact is stored directly, secrets and all. Owner sovereignty is absolute: the user has the final say over what enters long‑term memory, and no substrate‑level guard interferes with that deliberate act.
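The pipeline above can be sketched as a deterministic, model-free function chain. Everything named here is hypothetical: the allowed event types, the secret patterns, the `Fact` shape, and especially the falsifiability test, which the source does not specify and which is approximated with a crude hedging-word check.

```python
import re
from dataclasses import dataclass

# Hypothetical fail-closed allowlist of event types.
ALLOWED_EVENT_TYPES = frozenset({"user_profile", "environment"})

# Hypothetical patterns; real scrubbing would need a far broader set.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+"),
]

# Crude stand-in for a falsifiability test: reject hedged or interpretive wording.
VAGUE_MARKERS = ("seems", "probably", "interested in", "prefers", "might")


@dataclass(frozen=True)
class Fact:
    event_type: str
    subject: str
    attribute: str
    value: str


def scrub(text: str) -> str:
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def is_falsifiable(fact: Fact) -> bool:
    if not (fact.subject and fact.attribute and fact.value):
        return False
    lowered = f"{fact.attribute} {fact.value}".lower()
    return not any(marker in lowered for marker in VAGUE_MARKERS)


def auto_write(store: list, fact: Fact) -> bool:
    """Automatic path: scrub, then restraint gate, then write validator."""
    fact = Fact(fact.event_type, fact.subject, fact.attribute, scrub(fact.value))
    if fact.event_type not in ALLOWED_EVENT_TYPES:
        return False          # silently dropped, nothing surfaced to the model
    if not is_falsifiable(fact):
        return False
    store.append(fact)
    return True


def remember(store: list, fact: Fact) -> None:
    """Explicit user command: stored as given, no scrubbing, no gates."""
    store.append(fact)
```

Keeping each stage a pure function with static configuration matches the point made above that the gates should be simple, static, and rarely touched.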

Trade‑offs & boundaries

Conservatism vs. recall. The whitelist is fail‑closed. Any legitimate event type that the developer forgot to register is silently dropped. This forces the operator to anticipate every class of fact the application should automatically remember; a missed type results in invisible data loss. The falsifiability check is similarly strict: true but non‑falsifiable statements (“the user prefers clean code”) will be rejected, sacrificing useful context for the guarantee that every automatic fact is testable. Developers must either shape model outputs into falsifiable observations or rely on the remember path for those interpretations.

Sovereignty and its risks. The remember command is a deliberate escape hatch, but it is also a sharp edge. Once invoked, no gate applies. A user can store a hallucination, a false premise, or a secret intentionally. The substrate treats this as ground truth. This trade‑off is acceptable only because the user is the ultimate authority on their own memory; the gates exist solely for the automatic, model‑driven write path where the user is not in the loop.

Falsifiability is not truthfulness. A fact that passes the write validator can still be wrong. “The server’s IP is 10.0.0.1” is falsifiable and would be stored even if the real IP is different. The substrate’s guarantee is about memory hygiene—only testable statements accumulate—not about epistemic correctness. Truth verification is the domain of a separate verifier role (roadmap); the write validator ensures that when verification arrives, stored facts can actually be checked.

Boundaries of the guard system. The two‑stage guard applies only to automatic writes. Non‑persistent state (conversation context, ephemeral working memory) is not filtered. System‑internal writes and user‑directed remember commands circumvent the gates entirely. The scrubbing of secrets is a best‑effort defence: novel or obfuscated credentials may escape pattern detection, though the other gates still block unexpected event types that could carry them.

Degradation path. If the whitelist is over‑broadened (e.g., a catch‑all event type), the restraint gate becomes cosmetic. If a bug allows bypassing the validator, memory contamination returns with no automatic repair. Recovery demands manual audit. The design relies on these gates being simple, static, and rarely touched—change them carelessly and the protection evaporates.

When it earns its place

The two‑stage guard pays for itself whenever model‑generated facts are stored automatically and later used to drive behaviour. It is indispensable when:

The agent’s memory is long‑lived and shared across sessions or multiple agents. A single bad write would corrupt weeks of downstream reasoning.
Multiple models (different providers, sizes, or versions) contribute to the same memory store. The gates enforce a uniform safety standard regardless of model reliability.
Secrets are present in the environment—any pipeline dealing with tokens, keys, or credentials needs automatic scrubbing and the guarantee that a model cannot accidentally create a surprise write carrying a leak.
The operator cannot afford silent corruption. Each fact in memory acts as a dependency for planner and executor steps; contamination does not announce itself.
The system includes, or will include, a verifier that challenges stored assertions. The falsifiability check pre‑conditions memory so every entry is testable.

The gates are overkill for ephemeral, stateless interactions or for systems where every fact is curated by a human in a controlled pipeline. They earn their weight when memory is automatic, persistent, and the cost of one hallucinated entry outweighs the cost of losing a few unfalsifiable but true observations.

A fact that cannot be disproven by future evidence has no place in persistent memory.

Substrate

Entropy-reducing curation

Entropy-reducing curation is a lossy, asynchronous machine that compacts memory stores via a uniform purify-cluster algorithm across three adapters, preserving retrieval precision automatically.

Entropy-reducing curation is the substrate process that prevents persistent memory stores from degrading into noise by continuously compacting and restructuring their contents. It turns on a single insight: a uniform two‑phase algorithm — purify then cluster — applied through adapter‑specific lenses for facts, skills, and genes, can maintain a lossy, high‑signal graph without any hand‑tending, even as volume grows indefinitely. The machine runs asynchronously, triggered by write volume or idle cycles, and never fabricates new knowledge; it only refines what already exists.

The failure mode

Without curation, every write adds a fragment that may be a duplicate, a near‑duplicate, or a contradiction. Over time the store becomes a swamp: identical facts appear under slightly different phrasings, procedural steps accumulate redundant variants, and lineage forks sprout dead ends. Retrieval precision collapses because the index is diluted by near‑identical entries that all match a query, returning noise instead of signal. The failure is not a crash — it is a slow, cumulative erosion of the graph’s signal‑to‑noise ratio that makes the system progressively less reliable without any single observable breakage. Downstream, the conductor issues plans from contradictory premises, the executor picks noise‑laden skill steps, and verifiers waste cycles on ambiguous reads. Storage bloat compounds the waste. This silent degradation forces developers to periodically rebuild the store or manually deduplicate, neither of which scales.

How it works

A single curation machine drives three adapters — facts, skills, and genes — each applying the same two‑phase logic, but interpreting redundancy and coherence according to its own domain rules.

Purify removes redundant, contradictory, or low‑signal entries through deduplication and compression. The facts adapter discards duplicate assertions and contradictory statements to preserve consistency; the skills adapter collapses functionally identical procedural steps and prunes dead variants; the genes adapter merges equivalent ancestry records and removes dead branches. The machine never invents replacements — it only removes or merges existing entries.
Cluster groups the surviving fragments into coherent neighborhoods and rebuilds retrieval indices. Facts are linked by topic, skills by workflow dependency, genes by lineage. The rebuilt index ensures subsequent queries return a compact, high‑signal set.

The machine runs asynchronously: it triggers either after a configurable number of new writes (write‑volume threshold) or during idle cycles. Developers select which adapters a memory store uses and can optionally tune purification or clustering aggressiveness per adapter. Everything else — triggering, execution order, index rebuild — is automatic. Memory curates itself; you never hand‑tend the store.
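A minimal sketch of the purify-then-cluster pass for a facts adapter follows. It assumes facts are `(topic, key, value)` tuples, that redundancy means equality after light text normalisation, and that a key with conflicting values is dropped entirely; none of these representations come from the source.

```python
from collections import defaultdict


def normalise(text: str) -> str:
    """Lower-case, collapse whitespace, and strip a trailing full stop."""
    return " ".join(text.lower().split()).rstrip(".")


def purify(facts):
    """Deduplicate assertions and drop keys whose values contradict each other."""
    values = defaultdict(set)
    for topic, key, value in facts:
        values[(topic, normalise(key))].add(normalise(value))
    # Keep only keys with a single agreed value; nothing new is invented.
    return [
        (topic, key, next(iter(vals)))
        for (topic, key), vals in values.items()
        if len(vals) == 1
    ]


def cluster(facts):
    """Group surviving facts by topic to form the rebuilt retrieval index."""
    index = defaultdict(list)
    for topic, key, value in facts:
        index[topic].append((key, value))
    return dict(index)


def curate(facts):
    return cluster(purify(facts))
```

A skills or genes adapter would keep the same two phases and swap in its own notion of equivalence and grouping (workflow dependency, lineage).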

Trade‑offs & boundaries

Curation is lossy by design. It intentionally discards non‑essential variance to keep the representation lean. Important nuance may be collapsed if aggressiveness is set too high; rare queries may lose some recall. Tune per adapter to find the right balance.

No knowledge creation. Curation refines existing memory; it never fabricates new facts, skills, or genes. Gaps remain unfilled; contradictions that cannot be resolved may be dropped entirely.

Eventual consistency. Because the machine runs asynchronously, there is a window after a write during which duplicates or contradictions may still be read. Systems requiring immediate precision after every write must either gate reads on curation completion or enforce instant deduplication at write time — neither of which curation provides.

Not a write‑time guard. Curation is a long‑term hygiene process, not an instantaneous check. If duplicate detection must happen at write time, it must be implemented separately.

Degradation paths. Over‑aggressive curation collapses distinct but adjacent concepts into one, reducing expressiveness (e.g., merging “fast sort” and “stable sort”). Under‑aggressive curation leaves enough noise that retrieval precision still decays. Write‑heavy workflows can also cause a backlog: if write volume consistently outpaces curation throughput, the store inflates and precision drops temporarily; retrieval recovers only when idle cycles catch up. Monitoring curation lag is recommended for high‑throughput deployments.

Red line. Curation discards data permanently. It is not a backup. To preserve full historical fidelity for auditing or rollback, maintain a separate append‑only log alongside the curated store.

When it earns its place

Long‑running agents that accumulate facts and skills across days or weeks: retrieval precision would otherwise degrade monotonically, undermining the conductor and verifiers.
Multi‑agent or multi‑model stores where different executors write overlapping entries; curation normalizes divergences without manual cross‑review.
Skill libraries that ingest trial‑and‑error procedures; curation consolidates verified routines and discards failed variants, keeping the procedural index lean.
Gene registries tracking model variant ancestry; curation prunes dead branches, preserving only active inheritance paths.
Any store where the cost of manual cleanup outweighs the loss of a small fraction of variance — curation becomes the default.
Temper it or disable it when the store is ephemeral, tiny (hundreds of fragments), or must retain every raw entry verbatim (e.g., strict audit trails).

Memory that curates itself stays sharp; memory that does

Substrate

Prune & decay

Prune & decay intentionally forgets low-signal memories using recency, confidence, and relevance, with a staleness floor protecting new items from premature culling.

Prune & decay is the substrate mechanism that turns memory into a finite, aging resource. It treats every stored item as subject to intentional forgetting, driven by three signals—recency, confidence, and relevance—and a staleness floor that grants new memories a protected window before decay begins. The one idea: forgetting is a designed filter, not a leak; without it, low-signal memories accumulate and degrade decisions.

The failure mode

Without prune & decay, every observation, intermediate result, and weakly reinforced fact persists indefinitely. Over hours or days, the substrate fills with noise. The conductor, while planning a DAG, must sift through stale context, which degrades its ability to decompose tasks cleanly. Executors retrieve irrelevant memories that pollute their inputs, wasting cycles on abandoned paths. No single error surfaces; the agent still runs, but it drifts. Latency creeps up, decision quality erodes, and the system becomes brittle: it remembers everything, and therefore nothing useful. The failure is a silent corrosion of reliability, not a crash.

How it works

Prune & decay evaluates each memory against three independent signals that together determine its survival:

Recency – unused items lose weight over time. Each access resets the item’s recency clock; the longer it goes untouched, the lower its score.
Confidence – weakly reinforced or uncertain memories decay faster. A memory that was written only once, or that lacks reinforcement, carries a higher decay multiplier.
Relevance – context that no longer matches the agent’s active goals is pruned aggressively. When the agent’s focus shifts, the relevance signal drops sharply, accelerating the memory’s removal.

A staleness floor, anchored on the item’s created-at timestamp, protects fresh-but-unused entries. No matter how low the initial recency or confidence, a newly created memory is granted a minimum lifetime—no decay is applied until the floor duration expires. This prevents a burst of low-confidence observations from being culled before they can contribute.

Once past the floor, decay compounds. The item’s combined signals decline until they cross a configurable retirement threshold, at which point the memory is pruned immediately from the active substrate. Pruning is hard removal, not archival.

All decay parameters—decay rates per signal, confidence floors, staleness floor duration, retirement threshold—are configurable per subspace. This allows each subspace to tune its forgetting profile without changing source code, making the mechanism a tunable policy, not a hardcoded rule.
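One way to picture the policy is as a per-subspace scoring function. The sketch below is an assumption-laden illustration: the source does not say how the three signals combine, so it uses an exponential recency decay whose half-life shrinks with low confidence, multiplied by relevance. All field names and default values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DecayPolicy:
    """Hypothetical per-subspace configuration (times in seconds)."""
    recency_half_life: float = 3600.0
    staleness_floor: float = 600.0
    retirement_threshold: float = 0.1


@dataclass
class Memory:
    created_at: float
    last_access: float   # reset on every access
    confidence: float    # 0..1, rises with reinforcement
    relevance: float     # 0..1, match with the agent's active goals


def survival_score(m: Memory, now: float, policy: DecayPolicy) -> float:
    # Staleness floor: no decay at all during the protected window.
    if now - m.created_at < policy.staleness_floor:
        return 1.0
    idle = now - m.last_access
    # Weakly reinforced memories decay faster: a shorter effective half-life.
    half_life = policy.recency_half_life * max(m.confidence, 0.05)
    recency = 0.5 ** (idle / half_life)
    return recency * m.relevance


def prune(memories: dict, now: float, policy: DecayPolicy) -> dict:
    """Hard removal: anything below the retirement threshold is gone."""
    return {
        key: m
        for key, m in memories.items()
        if survival_score(m, now, policy) >= policy.retirement_threshold
    }
```

With the defaults shown, a fully confident and relevant memory that is never accessed falls below the threshold after four half-lives, while a memory created inside the floor window survives regardless of its other signals.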

Trade-offs & boundaries

Prune & decay introduces configuration complexity. Each subspace must be tuned against its expected memory lifespan; too-aggressive decay prunes useful context, while too-lenient decay lets the substrate bloat, re-creating the very failure mode the mechanism was meant to prevent. There is no universal default.

The staleness floor guarantees a minimum lifetime for every new memory, which means even useless entries created under high load persist for the floor duration. This trades a small amount of noise retention for the safety of genuinely fresh signals.

Once pruned, a memory is irrecoverable. The mechanism provides no audit trail or backup—workflows that require complete retention must route critical information to a separate, non-pruned store. It also does not guard against adversarial manipulation (e.g., a malicious agent repeatedly accessing a low-value memory to keep it alive); that lies outside the scope of a decay policy.

Prune & decay is designed for persistent, shared memory across agent roles. It does not affect transient state within a single executor invocation, nor does it replace manual lifecycle management for permanent reference data.

When it earns its place

Prune & decay pays off whenever an agent accumulates context over time and active goals shift. Concrete scenarios:

Long-running engineering agents that iterate for days: resolved bugs, merged branches, and stale design notes decay, keeping the conductor focused on the current task.
High-throughput executor fleets that produce many low-confidence intermediate outputs (scraping results, candidate lists), most of which are noise and must be quickly forgotten.
Conversational or support agents that switch topics frequently; relevance-based pruning aggressively clears old context, reducing retrieval latency.
Memory-constrained deployments where unbounded growth is unacceptable; decay caps growth because items that are not reinforced are eventually retired.
Any workflow where retrieval speed matters—a smaller, signal-dense substrate yields faster context lookups, directly improving agent responsiveness.

Intentional forgetting is a designed filter, not a leak—the substrate stays lean and the agent focused on what matters.

Substrate

Behavioural genes & BM25 reuse

Behavioural genes turn known how‑to into idempotent, BM25‑retrieved fragments — reuse over recall, no drift.

Behavioural genes are reusable, idempotent fragments of how‑to: model‑agnostic execution patterns that each encode a specific capability. Discovery runs on BM25, a bag‑of‑words ranking function that weighs term frequency and term rarity against document length: no embeddings, no GPU, no drift. The entire mechanism turns on reuse over recall: a gene is applied exactly as stored, never trained, remembered, or fine‑tuned, and the harness falls back to deliberate planning only when the best match score drops below a configurable threshold.

The failure mode

Without behavioural genes every task forces the conductor to reconstruct a solution from scratch, even when the same capability has been exercised thousands of times. This triggers a race between latency and quality: cheap, fast models hallucinate or produce inconsistent output; deliberate models burn budget and wall time re‑inventing stable patterns. Embedding‑based retrieval introduces GPU dependency, drift when the corpus evolves, and the risk of superficially similar but structurally wrong matches. Models fine‑tuned on particular phrasings silently corrode — they forget rare formats, degrade after updates, and couple behaviour to a specific checkpoint. Without a deterministic “I have solved this before” signal, the harness cannot distinguish the known from the novel. Every invocation wastes tokens on trivial repetition, idempotent guarantees evaporate (two identical tasks minutes apart can yield different outcomes), and debugging becomes opaque. The net result is brittle reuse, accumulating technical debt, and a harness that drifts instead of remaining a reliable substrate.

How it works

Discovery is lexical, not embedding‑based. BM25 scores every gene’s textual signature against the task description using term frequency, inverse document frequency, and length normalisation. It catches intent through overlapping vocabulary: “validate email format” and “check email regex” share the core term “email”, so a gene whose signature contains those words surfaces even when phrasing shifts. No embeddings, no GPU, no vector store; the score is deterministic for a fixed corpus.
Migration is idempotent. The highest‑scoring gene above the configurable threshold is injected into the execution context verbatim, with zero transformation. Re‑applying the same gene to the same task and context produces identical results with no side effects. Genes can be composed, retried, or swapped safely because they carry no hidden state.
Fallback on novelty. If no gene scores above threshold, the conductor treats the task as novel and falls back to deliberative planning (which may in turn produce a new gene). This preserves reliability: genes automate the known; the unknown is handled by explicit reasoning.
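Genes evolve only by explicit addition or removal. The gene library is a static catalogue edited by a curator. There is no training loop, no gradient update, no memory of past successes. Adding or removing a gene needs only a cheap index rebuild, because each gene’s own term frequencies are fixed; corpus‑level statistics (inverse document frequency, average signature length) do shift with the edit, so scores for other genes can move slightly. No drift between edits, no stale embeddings, but also no automatic improvement.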
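Discovery and fallback can be sketched with a small pure-Python BM25 scorer. This is not the harness's code: `GeneIndex`, the tokenizer, the `k1` and `b` defaults, and the example threshold are assumptions, and the IDF uses the always-non-negative variant `log(1 + (N - df + 0.5) / (df + 0.5))`.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class GeneIndex:
    """Hypothetical BM25 index over gene signatures ({name: signature text})."""

    def __init__(self, genes: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = {name: Counter(tokenize(sig)) for name, sig in genes.items()}
        self.lengths = {name: sum(tf.values()) for name, tf in self.docs.items()}
        self.avg_len = (sum(self.lengths.values()) / max(len(self.docs), 1)) or 1.0
        n = len(self.docs)
        df = Counter(term for tf in self.docs.values() for term in tf)
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, name: str, query_terms: list[str]) -> float:
        tf, length = self.docs[name], self.lengths[name]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query_terms
            if t in tf
        )

    def select(self, task: str, threshold: float):
        """Return the best gene name, or None to signal fallback to planning."""
        if not self.docs:
            return None
        terms = tokenize(task)
        best = max(self.docs, key=lambda name: self.score(name, terms))
        return best if self.score(best, terms) >= threshold else None


genes = {
    "validate_email": "validate email address format regex",
    "to_json": "format output as json",
    "translate_ja": "translate text to japanese",
}
index = GeneIndex(genes)
match = index.select("check email regex", threshold=1.0)       # a gene name
novel = index.select("summarise quarterly earnings", threshold=1.0)  # None
```

Because the index holds only term counts, the same task text always selects the same gene until the catalogue is edited.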

Trade‑offs & boundaries

BM25’s strength — zero‑drift, zero‑GPU lexical matching — is also its boundary. It captures intent only when the query shares enough significant terms with a gene’s signature. Heavily paraphrased or synonym‑rich descriptions (“make the output polite” vs. “apply courtesy filter”) may fail to match unless the gene’s signature explicitly includes the alternate vocabulary. The mitigation is manual curation: widening the signature with known paraphrases increases recall, but at the cost of ongoing maintenance.

The configurable threshold is a sharp lever. Set too low, generic genes match tasks they were not designed for, silently injecting inappropriate behaviour. Set too high, too many tasks fall through to expensive conductor planning, defeating the purpose of reuse. Finding the sweet spot demands profiling a sample of task‑gene pairs specific to the domain.
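Profiling the threshold can be as simple as sweeping candidate values over a labelled sample. The sketch assumes each sample is a hypothetical `(best_score, gene_is_appropriate)` pair collected by hand for the domain; the function name and the two reported rates are illustrative choices.

```python
def sweep(samples, thresholds):
    """For each threshold, report (threshold, fallback_rate, wrong_injection_rate).

    samples: list of (best_score, gene_is_appropriate) pairs from a labelled sample.
    """
    rows = []
    for t in thresholds:
        matched = [ok for score, ok in samples if score >= t]
        fallback_rate = 1 - len(matched) / len(samples)
        wrong_rate = matched.count(False) / len(matched) if matched else 0.0
        rows.append((t, fallback_rate, wrong_rate))
    return rows
```

A threshold that is too low shows up as a high wrong-injection rate; one that is too high shows up as a high fallback rate.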

Genes do not learn from usage. A pattern that works “well enough” for thousands of runs remains exactly as stored, and if the underlying capability needs refinement the gene must be manually edited. This is a feature for determinism but a limitation for systems that expect automatic adaptation. Domain‑specific vocabulary helps BM25 enormously — tasks and gene signatures thick with API names, industry jargon, or command tokens yield high‑precision matches. Conversely, tasks described in buzzwords a gene never uses degrade matching.

Fallback is safe but not free. When no gene scores above threshold, the harness gracefully degrades to planning, paying a per‑task inference cost. Operators must monitor fallback rate as a health signal; a rising rate signals a gene library that no longer covers the incoming task vocabulary. Finally, the idempotency guarantee holds within a single gene version. Swapping a gene for a different implementation is safe because each is an immutable artifact, but the new version’s idempotent chain is independent — cross‑version replay yields identical results only if the new version is itself designed to be idempotent.

When it earns its place

Behavioural genes pay off in workflows where the same how‑to recurs across many tasks with minor phrasing variance. Concrete situations include:

Repetitive, single‑purpose tasks such as “translate to Japanese”, “extract email addresses”, or “format as JSON”. Each phrasing variant keeps the same few significant terms, so BM25 maps it to the same gene without re‑planning.
High‑throughput, low‑budget workloads. The CPU‑only BM25 lookup replaces token‑burning conductor calls, slashing per‑task cost and latency.
Deterministic pipelines with audit or replay requirements. A gene’s output for a given input is fixed, so replaying a job yields identical results — impossible with model recall that shifts with temperature, seed, or weight updates.
Cross‑model automation. The same gene works identically on GPT‑4, Claude, Mistral, or a local model, eliminating per‑model prompt engineering.
Stable domains with drifting phrasing. BM25 catches new wording as long as significant terms survive, so the gene library stays frozen while terminology evolves — no MLOps cycle, no model refresh.
Shared procedure libraries for teams. Developers contribute genes as battle‑tested how‑to fragments; new members add tasks without rewriting fundamental behaviours.

A gene is an execution contract discovered by a static text match, not a memory. The harness never learns; it only picks the right lever from a fixed toolbox.

Substrate

The skill-creator flywheel

The flywheel converts high-confidence execution patterns into skills via a held-out eval gate that prevents self-confirming bias and requires accumulated positive evidence for promotion.

The skill-creator flywheel transforms validated execution patterns into candidate skills, anchoring the agent’s growing repertoire in demonstrable outcomes rather than model introspection. Its central idea is a held‑out eval gate that prevents any proposal from ever glimpsing its own evaluation data, enforcing a strict separation between pattern discovery and skill promotion. Every change is logged with provenance on a single SQLite substrate shared with genes and facts, making the full lineage from raw emergence to operational skill completely auditable.

The failure mode

Without the flywheel, an agent’s skill base drifts toward the self‑confirming. The same runtime that produces a pattern also decides whether it’s good enough to keep, creating a closed feedback loop that conflates familiarity with generality. Two failure modes dominate:

Bias cascade. A model‑internal “this feels right” signal replaces empirical verification. A pattern that succeeded in one context is promoted without ever being tested against unseen scenarios, overfitting to coincidences. Once promoted, it seeds further biased extraction—the agent learns to favour what worked once, amplifying its own blind spots cycle after cycle.
Silent rot. Skills are never revisited. A pattern that was correct months ago becomes stale as the environment shifts, but without periodic re‑evaluation it remains operational, degrading behaviour gradually. The agent has no mechanism to forget or downgrade a skill until failure manifests externally.

Both pathologies share a root cause: the gate is either absent or porous. When the evaluation data overlaps with the data that patterns are mined from, self‑confirming bias leaks straight into production skills. The skill set bloats with untested claims, provenance evaporates, and no one can trace whether a skill was derived from a genuine invariant or a statistical fluke.

How it works

The flywheel runs entirely on a unified SQLite substrate where every row—skill, gene, fact—carries a provenance column. This forces all operational knowledge into one truth schema and makes every mutation traceable. The mechanism has four phases:

Episodic mining. A miner periodically scans the consolidated pattern store, selecting only entries tagged both confident and worked—patterns that have survived repeated execution and produced verified outcomes. It drafts candidate skill proposals from these high‑signal patterns, but it does so behind a held‑out eval gate: a partition of evaluation cases the proposal can never read. The separation guarantees that no proposal is ever constructed to fit the data that will later judge it.
Closed refinement loop. Each proposal enters a tight cycle:
- eval-generate runs the candidate against the held‑out cases, producing pass/fail signals and failure detail.
- record logs every outcome and fires evolution events (e.g., “proposal confidence increased,” “failure on edge case X”).
- suggest-actions converts results into follow‑up steps—fresh patterns to mine, tweaks to the candidate, or queries back to the planner—feeding the next miner run.
Every step is logged with provenance; the full history from raw pattern emergence to final promotion is replayable.
Promotion gating. A proposal never self‑promotes. It can only accumulate positive evaluation evidence, and only when that evidence crosses a predefined confidence tier does the gate open. The gate is a deterministic rule evaluated against the logged eval history—not a model call, not an internal state heuristic. If the threshold isn’t met, the proposal stays in the refinement loop or decays back into the pattern pool.
Auditability. Because every change is logged, any operator can walk backwards: “What raw pattern produced this skill? How many eval iterations did it survive? Which edge cases defeated it before promotion?” The audit trail is a first‑class query, leaving no black boxes.
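The promotion gate can be pictured as a pure query over the logged eval history. The following is a sketch under stated assumptions: the `eval_log` table, its columns, and the tier (at least 5 held‑out evaluations with a 90% pass rate) are hypothetical and stand in for whatever schema and confidence tier a deployment defines.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE eval_log (
    proposal_id TEXT NOT NULL,
    case_id     TEXT NOT NULL,
    passed      INTEGER NOT NULL,   -- 1 = pass, 0 = fail
    provenance  TEXT NOT NULL       -- e.g. the miner run that drafted the proposal
);
""")

# Hypothetical confidence tier, not taken from the source.
MIN_EVALS, MIN_PASS_RATE = 5, 0.9

def record(proposal_id, case_id, passed, provenance):
    con.execute("INSERT INTO eval_log VALUES (?, ?, ?, ?)",
                (proposal_id, case_id, int(passed), provenance))

def may_promote(proposal_id):
    """Deterministic gate: depends only on logged outcomes, never on a model call."""
    total, passes = con.execute(
        "SELECT COUNT(*), COALESCE(SUM(passed), 0) FROM eval_log WHERE proposal_id = ?",
        (proposal_id,),
    ).fetchone()
    return total >= MIN_EVALS and passes / total >= MIN_PASS_RATE

for i in range(5):
    record("p1", f"case-{i}", True, "miner-run-17")
    record("p2", f"case-{i}", i != 0, "miner-run-17")  # one failure out of five

print(may_promote("p1"), may_promote("p2"), may_promote("p3"))
```

A proposal with no history, or with too little positive evidence, simply stays below the tier; nothing in the proposal itself can open the gate.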

Trade-offs & boundaries

Cost of the held‑out gate. Reserving unseen cases for evaluation shrinks the pool available for mining. In data‑sparse domains, the gate can starve the miner of signal, forcing deliberate sample budgeting.
Mining latency. The episodic miner does not react instantly. High‑velocity scenarios where a pattern must become a skill within seconds are incompatible; the flywheel is asynchronous by design. Any time‑critical promotion needs a separate hot path.
Confidence threshold tuning. Set the tier too low, and noise becomes skill. Set it too high, and the agent ossifies, never adapting to genuine shifts. The threshold is a governance parameter calibrated per domain, not a model property.
Pattern scope limits. Only patterns that can be expressed as confident + worked are eligible. Tacit knowledge—things the agent does well but cannot articulate as a discrete pattern—is invisible to the miner. The flywheel addresses only the explicit, evaluable part of agent behaviour.
Degradation path. If the held‑out eval set itself becomes stale or contaminated, promotion decisions deteriorate silently. Provenance can detect this, but the set must be periodically refreshed from recent production data. Likewise, without pruning, the substrate’s growing scan cost can exceed the time budget, requiring a separate culling mechanism.
Red line. Never merge eval‑side tables into the miner’s read path. Any query that leaks eval data during proposal drafting reintroduces self‑confirming bias and collapses the gate.

When it earns its place

The flywheel pays off whenever skill reliability depends on cumulative, auditable experience—especially when the cost of a bad skill is high:

Long‑running autonomous agents that operate for weeks without human retraining. The flywheel keeps the skill set lean and tested, preventing drift from stale patterns.
Multi‑model environments where executors run on different models. Because promotion is model‑agnostic (the gate evaluates outcomes, not internals), a skill validated on a cheap model can be promoted for use on an expensive one, or vice versa.
Regulated or audit‑heavy deployments where every operational decision must be traced to evidence. The provenance substrate makes compliance a side effect of normal operation.
Systems that evolve their own tooling. The flywheel lets the agent discover and promote new capabilities without human curators, as long as the eval gate stays intact—the core of a self‑improving agent that does not overfit itself.

The gate, not the model’s internal state, decides what becomes operational — evidence accumulated behind a held‑out partition is the only path from pattern to skill.

Method

The discipline that makes cheap-and-wide safe — contracts, modes, drift control, deliberation.

Method

SDD + TDD & falsifiable contracts

A falsifiable contract turns the spec into a continuously challenged executable boundary; correctness is a negotiation, not a declaration.

In Xihe's approach, the specification is not prose — it is a falsifiable contract defined by Type constraints, Validation rules, a State machine, and Given-When-Then scenarios, all machine‑executable. This inverts the ordinary delivery pipeline: passing acceptance tests is the floor, not the ceiling, because a passing contract only means no violation was found. The one idea it turns on is that correctness is a continuous negotiation between spec and code under a shared proof regime, where every TDD cycle attempts to falsify the current claim of correctness.

The failure mode

Without a falsifiable contract, the spec–code–acceptance pipeline breeds false confidence. Prose requirements are ambiguous; two implementers interpret “the user must be notified” in incompatible ways, and acceptance tests pass because they validate the implementer’s reading, not the spec’s intent. Silent drift sets in — the code does what it does, and the prose spec becomes a decorative artifact. When an edge case or a state transition falls through the cracks, a system that “passes acceptance” deadlocks in production. High‑risk seams (payment settlement, authorization, external integrations, model outputs) receive the same shallow coverage as happy paths because the spec never forces anyone to attack its own assumptions. The waste is rework at integration or UAT, the brittleness is the illusion that passing tests equals correctness, and the silent corruption is that a flaw in the specification sits unchallenged, reinforced by passing tests, until a real incident surfaces it. Without an executable adversary, the pipeline conflates conformance with correctness and ships its own blind spots as features.

How it works

The contract is an executable boundary composed of four mutually reinforcing artifacts:

Type constraints — structural invariants (e.g., UUID, non‑negative integer) that reject invalid data before logic runs.
Validation rules — cross‑field invariants beyond types (e.g., “delivery address must be within service area”) enforced at domain boundaries.
State machine — explicit allowed transitions and forbidden states; any operation that attempts an illegal transition is rejected immediately.
Given-When-Then scenarios — concrete, machine‑executable examples that define expected system‑edge behavior.
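To make the four artifacts concrete, here is a small sketch of a contract for an order. Everything in it is illustrative: the `Order` type, the postcodes standing in for a service area, the transition table, and the scenario are assumptions, not Xihe's actual contract format.

```python
from dataclasses import dataclass
from uuid import UUID

class ContractViolation(Exception):
    pass

# State machine: explicit allowed transitions; anything absent is forbidden.
TRANSITIONS = {
    ("created", "pay"): "paid",
    ("paid", "ship"): "shipped",
    ("created", "cancel"): "cancelled",
    ("paid", "cancel"): "cancelled",
}
SERVICE_AREA = {"94103", "94107"}  # hypothetical postcodes

@dataclass
class Order:
    order_id: UUID
    quantity: int
    postcode: str
    state: str = "created"

    def __post_init__(self):
        # Type constraints: structural invariants checked before any logic runs.
        if not isinstance(self.order_id, UUID):
            raise ContractViolation("order_id must be a UUID")
        if not isinstance(self.quantity, int) or self.quantity < 0:
            raise ContractViolation("quantity must be a non-negative integer")
        # Validation rule: a domain invariant that types alone cannot express.
        if self.postcode not in SERVICE_AREA:
            raise ContractViolation("delivery address outside service area")

    def apply(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ContractViolation(f"illegal transition from {self.state} on {event}")
        self.state = TRANSITIONS[key]

def scenario_cannot_ship_unpaid_order():
    # Given an order that has been created but not paid
    order = Order(UUID(int=1), quantity=2, postcode="94103")
    # When shipping is attempted
    try:
        order.apply("ship")
    except ContractViolation:
        # Then the transition is rejected and the state is unchanged
        return order.state == "created"
    return False
```

The scenario returning True is evidence that no violation was found for this one case, which is exactly as far as a passing check goes.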

During TDD cycles, the contract is continuously falsified: a developer writes a scenario that the contract requires and the current code does not yet satisfy (red), implements enough code to pass it (green), then examines whether the implementation has exposed a flaw in the contract itself. Each pass is only evidence that no violation was found, not proof of correctness.

The negotiation is governed by two resolution rules:

Contract wins — if the implementation contradicts a constraint, rule, transition, or scenario, the implementation is rejected. The code must adapt.
Code wins — if the implementation proves that a contract element is impossible, contradictory, or insufficient (for example, a state transition the domain actually requires but the state machine forbids), the contract yields and is amended on the record. The amendment is a first‑class artifact, making the change visible and auditable.

High‑risk seams receive an extra layer: adversarial disproof. Automated fuzzing or dedicated adversarial scenarios attempt to break the contract from outside‑in, assuming the spec is incomplete. Passing these adversarial probes is mandatory; if an adversarial scenario reveals a valid behavior the contract forbids, the contract yields and is amended. This outside‑in acceptance ensures that passing the spec is never taken as being correct — it remains a starting point for falsification.
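An adversarial probe can be as simple as throwing random event sequences at the state machine and checking an invariant the contract is supposed to guarantee. The sketch below uses only the standard library's `random` module; the transition table, the event list (including a "refund" event the contract does not know), and the invariant are hypothetical.

```python
import random

TRANSITIONS = {
    ("created", "pay"): "paid",
    ("paid", "ship"): "shipped",
    ("created", "cancel"): "cancelled",
    ("paid", "cancel"): "cancelled",
}
EVENTS = ["pay", "ship", "cancel", "refund"]  # "refund" is unknown to the contract

def fuzz(transitions, runs=500, max_len=12, seed=0):
    """Return a (state, event) counterexample, or None if none was found."""
    rng = random.Random(seed)  # seeded so a failure can be replayed
    for _ in range(runs):
        state, paid = "created", False
        for _ in range(rng.randint(1, max_len)):
            event = rng.choice(EVENTS)
            # A rejected event leaves the state unchanged.
            nxt = transitions.get((state, event), state)
            paid = paid or nxt == "paid"
            # Invariant under attack: nothing ships without having been paid.
            if nxt == "shipped" and not paid:
                return (state, event)
            state = nxt
    return None

# A deliberately flawed contract that lets an unpaid order ship.
BROKEN = {**TRANSITIONS, ("created", "ship"): "shipped"}

print(fuzz(TRANSITIONS))
print(fuzz(BROKEN))
```

A None result means only that this probe found nothing; a counterexample, by contrast, is a concrete input for the contract-wins or code-wins decision.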

Throughout, spec and code negotiate correctness under a shared proof regime. Neither is trusted alone. The TDD cycle becomes a constant attempt to falsify the current proposition of correctness, with each successful pass only temporarily deferring suspicion.

Trade-offs & boundaries

Cost. Crafting a falsifiable contract demands upfront precision. Writing state machines, validation rules, and structured GWT scenarios requires domain modeling and discipline; the overhead is substantial. For rapidly changing or throwaway code, the cost exceeds the benefit — the contract becomes a maintenance burden that outpaces the value of its protection. A sloppy contract (vague types, missing transitions) provides false comfort while adding ceremony.

Red lines. Equating “passing acceptance” with “correctness” undermines the entire regime. A passing contract is only the absence of detected violations; the falsification loop must continue. Ignoring adversarial disproof on high‑risk seams degrades the contract into a standard test suite that verifies conformance but does not challenge its own assumptions. Silently relaxing the contract to match a suspect implementation — without a recorded amendment — turns the contract into decoration and reverts to the original failure mode. The resolution rules require organizational trust; if a team weaponizes “code wins” to override a correct spec, the contract’s authority collapses.

Non‑applicability. Falsifiable contracts cover structural and behavioral conformance, not subjective qualities like usability, latency, or emergent system properties. They operate at component boundaries and do not replace full integration testing. Exploratory research, prototypes, and artistic‑code domains are poor fits. Legacy systems where the spec cannot be retrofitted into executable form without a major rewrite are similarly out of scope.

Degradation path. If adversarial disproof is skipped, the falsification loop weakens — the contract persists as executable documentation but ceases to be an adversary. If resolution rules are applied inconsistently, the contract loses credibility and developers treat it as a trap rather than a negotiation partner. The system can still function, but the primary benefit evaporates while the cost remains.

When it earns its place

This approach pays off wherever the cost of spec misinterpretation is high:

Critical business logic with complex state transitions (payment flows, order routing, account operations).
APIs with multiple consumers, where the contract prevents silent drift and makes versioning explicit.
Systems that integrate model outputs or external AI agents (as in Xihe), where the contract acts as an adversary that tries to disprove the model’s behavior, catching hallucinations or out‑of‑bound actions.
Long‑lived codebases that will be inherited across teams and years — the continuously falsified contract becomes the most reliable documentation.
Environments practicing TDD at scale, where the contract serves as the first failing test and every cycle becomes a negotiation.
High‑risk seams subjected to adversarial disproof, where “passes acceptance” is too weak a standard and surviving sustained attempts at falsification is the minimum acceptable floor.

A passing contract is not proof of correctness — it is only the starting point for falsification.

Method

M1–M7 thinking modes

Thinking modes assign a deterministic inference strategy per task type, preventing misrouted reasoning that produces brittle, costly output.

A thinking mode is not a persona — it is the default inference strategy an agent applies per task type. The single idea it turns on is that a typo fix and a subsystem redesign demand fundamentally different reasoning structures, and routing either to the wrong structure produces output that is brittle, expensive, and impossible to audit with confidence. Mode routing makes that selection deterministic and visible, anchoring the agent’s reasoning to the shape of the task rather than the model’s default next-token behaviour.

The failure mode

Without explicit thinking modes, an agent falls into a universal-solver trap. A bug-fix request triggers the same chain-of-thought pattern as an architecture discussion: the agent models system behaviour instead of tracing an error path, argues from first principles when a grep would suffice, or generates an elegant refactor plan for a problem that required pure subtraction. The output looks plausible on the surface but is misaligned with the logical demands of the task, and the mismatch only becomes visible when the fix fails review — or worse, when it silently passes and sows confusion later.

The first concrete symptom is silent cost escalation. A trivial bug that should be isolated with a log-dive and a binary search instead spawns a multi-step reasoning graph that burns expensive inference capacity on irrelevant speculation. The second symptom is reproducibility failure: the same task run twice at slightly different temperatures produces radically different reasoning paths because nothing structurally constrains the approach. The third is audit blindness. When a trace only records what was concluded, a human reviewer cannot distinguish “the agent considered the wrong approach” from “the model made a mistake inside a correct approach.” The system becomes a black box whose behaviour is dominated by hidden model biases rather than the engineer’s intent.

How it works

Tasks are classified on arrival and assigned one of seven modes. Classification is deterministic: it uses clear, non-LLM task signals, not heuristic guesswork. The assigned mode then constrains planning and execution throughout the conductor–dispatch–executor flow, and enforcement is structural: an agent that starts reasoning in M1 on a task routed as a refactor (M3) is cut short.

Each mode prescribes a canonical inference pattern:

M1 – root-cause (bugs): trace the error path, isolate the point of failure, and prove a single testable hypothesis before any fix is proposed.
M2 – first-principles (clean build): rebuild logic from axioms; existing code is not trusted as truth. Inherited behaviour is assumed suspect.
M3 – subtraction (refactor): start from what to remove, then minimise the diff. Additions are permitted only after removal boundaries are defined.
M4 – search-first (unknown cause): grep, log-dive, binary-search — do not model, find. The agent must exhaust search-based isolation before it may propose an explanatory model.
M5 – working-backwards (architecture): define the target state first, then derive the necessary changes. Implementation constraints may not be discussed until the target state is formalised.
M6 – evidence-driven (perf/quality): measure before acting, compare after, and reject speculative claims. A baseline measurement is a prerequisite for any action.
M7 – closed-loop (ops/default): execute, observe, re-enter on deviation — no silent resilience. The agent must verify output and restart the planning loop if reality diverges from expectation.

The active mode is a first-class field in every planner and executor trace. A human can therefore audit how a conclusion was reached, not just what the conclusion was. Mode labels are part of the output, providing a permanent, testable record of the reasoning strategy.
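Deterministic routing and the mode-as-trace-field idea can be sketched as a plain lookup. The task-type keys and the `TraceEntry` shape below are assumptions made for illustration; the source only specifies that signals are explicit and non-LLM, and that unresolved cases default to M7.

```python
from dataclasses import dataclass

# Hypothetical task taxonomy: explicit task type -> mode.
MODE_BY_TASK_TYPE = {
    "bug": "M1",            # root-cause
    "clean_build": "M2",    # first-principles
    "refactor": "M3",       # subtraction
    "unknown_cause": "M4",  # search-first
    "architecture": "M5",   # working-backwards
    "performance": "M6",    # evidence-driven
    "ops": "M7",            # closed-loop
}
DEFAULT_MODE = "M7"  # conservative default when no signal resolves

@dataclass(frozen=True)
class TraceEntry:
    task_id: str
    mode: str   # first-class field: how the conclusion was reached
    step: str

def route(task):
    """Pure lookup on an explicit field: same task in, same mode out."""
    return MODE_BY_TASK_TYPE.get(task.get("type"), DEFAULT_MODE)

def trace(task, step):
    return TraceEntry(task_id=task["id"], mode=route(task), step=step)

print(trace({"id": "T-1", "type": "bug"}, "isolate failing call"))
print(route({"id": "T-2"}))
```

Because `route` has no hidden state and makes no model call, running the same task twice cannot produce two different reasoning shapes.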

Trade-offs & boundaries

The primary cost is classification overhead. Deterministic routing demands well-defined signals; ambiguous tasks (e.g., “this is slow, can you make it better?”) may not map cleanly to a single mode, and misclassification produces the wrong reasoning shape. The system does not attempt to infer modes from free-form conversation: it relies on an explicit task taxonomy, and ambiguous cases require clarification before routing. There is no “guess the mode” path. The degradation path is conservative: if the signal set still cannot resolve a match, the system defaults to M7 (closed-loop). That mode will not generate creative work, but it will not produce wrong-mode generation either. The design choice forgoes plausible-looking output in favour of safe, observable under-generation.

Modes are strategies, not constraints on creativity. An agent in M2 may still reference existing code to validate axioms, but it may not treat existing code as authoritative. Mode-switching mid-task is deliberately blocked unless triggered by a verifier signal or a manual reset — the harness prevents self-reclassification to eliminate mode drift.
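The block on mid-task switching amounts to a guard that accepts only two triggers. A minimal sketch, assuming hypothetical trigger names and a hypothetical `TaskContext` holder:

```python
class ModeSwitchDenied(Exception):
    pass

# Only these triggers may change the mode; the names are illustrative.
ALLOWED_TRIGGERS = {"verifier_signal", "manual_reset"}

class TaskContext:
    def __init__(self, mode):
        self._mode = mode
        self.history = [(mode, "initial_routing")]

    @property
    def mode(self):
        return self._mode

    def switch_mode(self, new_mode, trigger):
        # The agent's own request is not an accepted trigger, so it cannot
        # reclassify its task halfway through.
        if trigger not in ALLOWED_TRIGGERS:
            raise ModeSwitchDenied(f"{trigger!r} may not change the mode")
        self._mode = new_mode
        self.history.append((new_mode, trigger))

ctx = TaskContext("M3")
try:
    ctx.switch_mode("M2", trigger="agent_request")
except ModeSwitchDenied:
    pass  # mode stays M3
ctx.switch_mode("M1", trigger="verifier_signal")
```

Keeping the history alongside the mode means every switch that does happen is attributable to a named trigger.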

These modes are designed for engineering work where reliable outcomes and auditable reasoning matter. They are not applicable to free-form brainstorming, creative writing, or open-ended exploration. For those use cases, the cognitive scaffolding is unnecessary and may feel obstructive; the harness default (M7) provides discipline without prescribing a reasoning shape.

When it earns its place

Mode routing pays for itself in any non-trivial codebase where tasks vary in kind. A frontend bug routed to M1 reaches isolation sooner because the agent cannot waste cycles on architectural speculation. A refactor routed to M3 produces a small, safe diff instead of an accidental rewrite. An architecture proposal routed to M5 forces agreement on a target state before the agent generates implementation detail, reducing review churn.

The mechanism becomes indispensable when a system picks work from a queue without human supervision — PR reviews, bug triage, performance tickets. In that setting, the absence of mode routing means the agent treats all tasks as equally expensive and equally opaque, producing plausible-but-wrong output at scale. With it, each task gets the correct reasoning shape by default, and the trace provides a structured explanation of how the conclusion was reached. That trace is what separates a reliable engineering agent from a stochastic parrot.

Routing a task to the wrong mode is a design error, not a prompt one — the reasoning shape belongs to the task, not the agent.

Method

Drift detection & rewind

Drift detection classifies reasoning deviations into five archetypes and rewinds to the last anchor, turning quality into a harness property.

Drift detection & rewind is a harness mechanism that classifies reasoning deviations into one of five archetypes and rewinds to the last immutable anchor. The core idea: reasoning quality becomes an engineered property of the system, not a bet on the model’s consistency.

The failure mode

Without drift detection, every agent run is a blind bet. A model’s reasoning can wander—rationalising a dead end, creeping scope, locking onto a premature solution, or dropping critical references—and the harness has no signal that degradation is underway. Outputs remain plausible while internal logic quietly corrupts, wasting tokens, compute, and time. The system compounds errors on a broken foundation, producing brittle artifacts that fail unpredictably. Debugging means manually replaying traces to guess where the reasoning broke, with no guarantee a rerun won’t repeat the same drift. The cost is not just wasted resources; it is silent corruption of the result, a systemic fragility where a run that looks productive can be entirely unsound.

How it works

A lightweight detector, operating at the harness level, monitors the conductor’s reasoning output. After each step, it compares the reasoning to the goal and the last committed anchor. When a deviation is found, it classifies the drift into one of five archetypes:

A (real-drift): genuine topic shift away from the goal.
B (rationalisation): fabricating after‑the‑fact justification for a dead end.
C (scope-creep): unauthorised expansion of task boundaries.
D (premature-solution): jumping to a final answer before sufficient exploration.
E (lost-thread): dropped references or logical breaks.

On detection, the system does not retry blindly. It rewinds to the most recent anchor, a validated checkpoint the conductor previously committed. Anchors are immutable once committed, providing a clean, auditable recovery surface. The rewind discards all reasoning between that anchor and the drift point; partial rollback within a single step is never attempted.

The classification itself is attached as semantic context to the conductor’s next planning cycle. For example, after a scope-creep rewind the conductor can re‑inject original constraints; after a rationalisation it can demand alternative evidence. Recovery logic lives in the harness, not inside the model prompt, so strategy adjustments remain deterministic and auditable. There is no silent degradation: every deviation is seen, named, and rolled back.
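The rewind mechanics can be sketched independently of the detector. In the example below the classifier is out of scope: the archetype is passed in by the caller. The `Run` and `Anchor` types and their fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum

class Drift(Enum):
    REAL_DRIFT = "A"
    RATIONALISATION = "B"
    SCOPE_CREEP = "C"
    PREMATURE_SOLUTION = "D"
    LOST_THREAD = "E"

@dataclass(frozen=True)  # frozen: an anchor cannot be modified once committed
class Anchor:
    step_index: int      # number of steps validated at commit time
    summary: str

@dataclass
class Run:
    steps: list = field(default_factory=list)
    anchors: tuple = ()  # only ever extended, never edited in place
    context: list = field(default_factory=list)

    def commit_anchor(self, summary):
        self.anchors += (Anchor(len(self.steps), summary),)

    def rewind(self, drift):
        """Discard every step after the most recent anchor and record why."""
        if not self.anchors:
            raise RuntimeError("no anchor committed: nothing to rewind to")
        anchor = self.anchors[-1]
        discarded = len(self.steps) - anchor.step_index
        self.steps = self.steps[:anchor.step_index]
        # The classification travels into the next planning cycle as context.
        self.context.append({"archetype": drift.name,
                             "anchor": anchor.summary,
                             "discarded": discarded})
        return anchor

run = Run()
run.steps.append("parse schema")
run.commit_anchor("schema parsed")
run.steps += ["add caching layer", "rewrite logger"]
run.rewind(Drift.SCOPE_CREEP)
```

After the rewind the run holds only the validated step, and the next planning cycle sees that the discarded work was classified as scope creep.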

Trade-offs & boundaries

Cost: every step incurs detection overhead (classifier latency and compute). Rewinds discard work done after the last anchor, which can be costly if steps are long and the drift occurs early.
False readings: an imperfect classifier may produce false positives (unnecessary rewinds) or false negatives (drift that passes through). Tuning sensitivity is a per‑deployment calibration.
Granularity: rewinding always targets the most recent anchor. If the conductor commits anchors infrequently, a single drift can wipe out a long computation; if anchors are too dense, overhead rises. The mechanism cannot partially roll back within a step: the atomic unit of retreat is everything after the last anchor.
Preconditions: drift detection requires a conductor that commits discrete anchors. If no anchors are committed (or they are poorly chosen), there is no known-good state to fall back to. The mechanism also doesn’t catch a model that starts on a fundamentally wrong track—it only catches deviations from an already‑committed plan.
Closed taxonomy: the five archetypes are fixed. Novel drift patterns may be misclassified or missed entirely, though the classification still gives the conductor a signal to adjust.
Degradation: if repeated rewinds trigger on the same anchor, the harness must bound attempts to avoid a loop. A hard failure with a clear error is preferable to indefinite hopping.
Non‑applicability: the mechanism does not apply to brainstorming, creative generation, or open‑ended exploration where intentional divergence is the mode. It also does not address tool‑execution errors or data‑transformation bugs; those need separate observability strategies.
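The degradation rule above, bounding repeated rewinds on the same anchor, could look like the following. The limit of three and the class names are arbitrary illustrative choices, not values from the harness.

```python
from collections import Counter

class RewindLimitExceeded(RuntimeError):
    pass

class RewindBudget:
    def __init__(self, max_per_anchor=3):  # illustrative limit
        self.max_per_anchor = max_per_anchor
        self.counts = Counter()

    def charge(self, anchor_id, archetype):
        """Count one rewind against an anchor; fail hard once the budget is spent."""
        if self.counts[anchor_id] >= self.max_per_anchor:
            raise RewindLimitExceeded(
                f"anchor {anchor_id!r} already rewound {self.counts[anchor_id]} times; "
                f"latest archetype: {archetype}"
            )
        self.counts[anchor_id] += 1

budget = RewindBudget(max_per_anchor=2)
budget.charge("anchor-4", "C")
budget.charge("anchor-4", "C")
try:
    budget.charge("anchor-4", "C")
    stopped = False
except RewindLimitExceeded:
    stopped = True  # surfaced as a clear error instead of another loop
```

Counting per anchor, not per run, lets a long task recover from separate drifts at different checkpoints while still stopping a loop on one of them.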

When it earns its place

Drift detection pays off in any multi‑step workflow where correctness is critical and silent failures are unacceptable. Typical fits include:

Code generation spanning multiple modules, where a model that drifts into a different signature or an unauthorised optimisation cascades wasted work downstream.
Research, analysis, or compliance pipelines that must follow a fixed methodology without inventing extra steps or jumping to conclusions.
Constraint‑heavy outputs (configuration files, legal templates, financial reports) where scope‑creep or lost‑thread would produce invalid artefacts.
Auditable production agents: every drift event is logged with archetype and anchor, enabling post‑mortems and prompt tuning.
Delegation to cheaper, drift‑prone executors: the harness absorbs the risk, turning an unreliable model into a recoverable subsystem.

It is overkill for single‑turn queries or trivial chains where the cost of a full re‑run is negligible; the mechanism’s overhead would dominate without meaningful benefit.

No silent degradation: detected, classified, rewound.

Method

Plan mode

A read-only deliberation cockpit that enforces plan‑first, spend‑later by verifying a structured DAG inside‑out before any execution token is spent.

Plan mode is a read‑only deliberation cockpit: it reads, reasons, and structures but never writes or executes. The invariant is “plan first, spend later”—no token is consumed on execution until a structured DAG with verification gates has been validated inside‑out and explicitly approved. It turns planning from a lightweight preamble into a proactive quality gate that catches contradictions and hallucination before they burn budget.

The failure mode

Without a disciplined planning phase, an agent defaults to execution on the first plausible interpretation. This conflation of deliberation and action breeds three predictable failures.

Premature commitment and token waste. The agent reads a prompt, emits code or mutations, and discovers later that it misread a schema, missed a dependency, or built on an invalid assumption. It then backtracks, spending tokens twice—once on the wrong path, again on the correction. A single unverified guess cascades into a multi‑turn retry spiral that inflates cost without any progress.

Silent corruption. When steps are never cross‑checked against a plan, subtle inconsistencies slip through: a migration script that drops a column a later step still queries, an API refactor that changes a public signature without updating all callers, a cache key that changes without an invalidation hook. The defect is merged into the codebase and only surfaces later as a runtime error, a broken build, or data loss—long after the agent has moved on. Rolling back becomes a manual archaeology exercise.

Brittleness from hallucinated context. Without source grounding, the agent invents variable names, routes, or business logic that sound plausible but never existed. It then defends its own inventions, burning tokens to reconcile hallucinations against reality. The result is a fragile execution trail that collapses under the slightest real‑world friction, leaving the developer to restart the conversation and pay for the same reasoning a second time.

At root, every one of these failures shares the same cause: execution tokens were spent before the structure was verified inside‑out. Plan mode removes that race.

How it works

Plan mode enforces a strict read‑only boundary. The agent operating in this mode reads, requests information, decomposes the task, and produces a plan—but it cannot invoke any mutating tool until the human grants explicit approval. Six interdependent mechanisms enforce this discipline.

Source‑grounded planning. Every reasoning step must anchor to provided context: code files, schemas, conversation history. The agent cannot fabricate a step like “add validation for the payment payload” without citing the exact schema field and its location. If the needed information is absent, the agent must request it via the tool‑call gate rather than invent it. This closes the main opening for hallucinated steps, as long as the relevant context has actually been supplied.
Tool‑call gate. The planner may call only read‑only tools—read a file, list a directory, query a schema. Any attempt to write, execute, or deploy is blocked deterministically. The gate serves two purposes: it prevents accidental side‑effects during deliberation, and it forces the planner to treat information gathering as part of planning, not as a shortcut that skips verification. All writes are deferred until the plan is approved.
Automatic complexity routing. Not every change deserves the same depth of deliberation. A one‑line tweak (renaming a variable) follows a fast lane: a lightweight plan of a few steps, surfaced almost immediately. A cross‑layer build (a feature that touches the database, API, and frontend) automatically earns a full plan with cross‑model review, root‑cause probing, and multiple verification gates. Deliberation scales to the change’s real depth, not its prompt length.
Root‑cause probing. When a sub‑problem is vague—“improve error handling,” “fix the login bug”—the planner recursively decomposes it until every leaf is concrete and testable. “Improve error handling” becomes “add a try‑except for FileNotFoundError in loader.py line 47” and “return a structured 500 response with correlation ID.” This probing stops only when every leaf is a single, executable action that can be verified independently, turning fuzzy requests into a decidable DAG.
Inside‑out verification. As the DAG is built, each step’s preconditions are checked against the outputs of prior steps. Detected contradictions—cyclic dependencies, a variable consumed before it is produced, a column dropped while a later step still queries it—are flagged and reported. Verification runs against the plan itself, catching structural faults before a single token is spent on execution.
Structured DAG output. The final plan is a DAG of steps, each annotated with verification gates (preconditions, expected outcomes). It is presented to the developer in human‑readable form. Only after explicit approval does the agent switch to execution mode, spending tokens against a validated blueprint, not a guess.
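
The inside‑out verification step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Xihe's implementation: the `Step` shape, its `depends_on`/`consumes`/`produces` fields, and the `verify_plan` function are hypothetical names chosen for the example. It covers two of the contradictions listed (cyclic dependencies and an artifact consumed before it is produced) and runs against the plan alone, before anything executes.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Step:
    """One plan leaf. Field names are illustrative, not Xihe's schema."""
    name: str
    depends_on: list[str] = field(default_factory=list)
    consumes: set[str] = field(default_factory=set)   # artifacts the step reads
    produces: set[str] = field(default_factory=set)   # artifacts the step creates


def verify_plan(steps: list[Step], context: set[str]) -> list[str]:
    """Return structural faults found in the plan; an empty list means it passed."""
    by_name = {s.name: s for s in steps}
    faults = [
        f"{s.name}: unknown dependency '{d}'"
        for s in steps for d in s.depends_on if d not in by_name
    ]
    if faults:
        return faults

    graph = {s.name: set(s.depends_on) for s in steps}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the cycle.
        return ["cyclic dependency: " + " -> ".join(exc.args[1])]

    # Walk in dependency order, tracking what each step can legitimately see:
    # the supplied context plus everything its ancestors produce.
    visible: dict[str, set[str]] = {}
    for name in order:
        step = by_name[name]
        seen = set(context)
        for dep in step.depends_on:
            seen |= visible[dep] | by_name[dep].produces
        visible[name] = seen
        for artifact in sorted(step.consumes - seen):
            faults.append(
                f"{name}: consumes '{artifact}' before any upstream step produces it"
            )
    return faults
```

A plan whose only step consumes `schema` fails unless `schema` is either in the supplied context or produced by an upstream step, which mirrors the rule that missing information must be requested rather than invented.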

Trade‑offs & boundaries

Plan mode is not free. The upfront deliberation costs tokens and time; for a typo fix, even the fast lane adds a mode‑switch and approval step that a pure execution agent would skip. The trade‑off is worthwhile only when the cost of a wrong execution outweighs the price of a deliberate plan.

The human‑in‑loop approval introduces latency. If the developer is unavailable to review the DAG, progress stalls. In high‑velocity flows—pair programming, live debugging—the delay can feel heavier than the risk of a small mistake. Plan mode assumes a reviewer with both time and attention; if the reviewer rubber‑stamps the plan without scrutiny, the safety net collapses.

Source‑grounded planning is only as good as the supplied context. When the developer provides a vague intent without relevant files, the planner still produces a DAG—but grounded on nothing. The inside‑out verification will find no internal contradictions because there are no facts to contradict. The plan will look coherent but be disconnected from reality. Plan mode cannot compensate for incomplete input.

The DAG itself can become unwieldy. A complex refactor with forty steps and a dozen verification gates demands minutes of review. If the plan feels overwhelming, the developer may approve without reading, turning the mode into a ceremonial tax. The tool provides every leaf; it does not compress the plan for skim‑reading.

Plan mode is not a general‑purpose planning oracle. It cannot validate a plan against unstated, external requirements (“the feature should feel snappy”), and it does not replace runtime monitoring. It is also unsuitable for inherently exploratory tasks—data analysis where the shape emerges from intermediate results, creative writing, or parameter tuning where outcomes are stochastic. In those domains, Plan mode’s rigidity fights the workflow.

When it earns its place

Use Plan mode whenever the blast radius of a wrong change is large:

Cross‑layer refactors. Renaming a database column that propagates through the ORM, API, and UI: the DAG exposes every edge that must be updated, and verification gates catch missing references before any file is touched.
Sensitive state changes. Migration scripts, configuration updates, and auth rule modifications: the tool‑call gate prevents accidental writes, and a human must assess the blast radius.
Onboarding to an unfamiliar codebase. The plan serves as a map of intended changes, and the developer can correct the agent’s understanding of the architecture before a single line is written.
Feature work with vague requirements. Root‑cause probing forces the decomposition of ambiguous intent into ordered, testable actions, surfacing hidden assumptions early.
Token‑budget‑sensitive environments. When expensive models or tight API quotas make the cost of a multi‑turn execution failure prohibitive, the deliberation investment pays for itself by eliminating speculative retries.

Plan first, spend later: no execution token is burned against a guess—only against a verified blueprint.

Method

Roles decoupled from models

Models bolt onto roles at dispatch time — swap any role’s model in config without touching code.

Roles in Xihe are interfaces, not backends. They bind to models at runtime, per task, with no hardcoded dependency. The one idea this turns on is that the model is a pluggable lever, not a pillar you build around. You design agent architecture around role responsibilities — plan, execute, verify — and then decide which model fills each role at task start, purely through configuration.

The failure mode

Without decoupling, every role becomes permanently married to a specific model. Swapping a model — whether to upgrade, cut costs, or escape deprecation — forces you to touch code, rewrite prompts, re‑test the pipeline, and redeploy. The result is three brittle failure patterns:

Vendor lock‑in disguised as architecture. Model choice is buried in platform code. Migrating from one provider to another turns into a multi‑sprint engineering effort, so teams stick with an inferior model long after a better one appears.
Pipeline brittleness and silent corruption. When a model silently changes behaviour after an API update or fine‑tune, the role that depends on it degrades without a clear boundary. Because reliability logic — retries, validation, fallback — is tangled with model‑specific prompts, every role inherits the shift. A degraded executor can silently corrupt results that the conductor and verifier expected to be valid.
Wasteful allocation and concurrency lock‑in. Without decoupling, all roles tend to share one model family. You either waste expensive tokens on trivial executor work or cripple the conductor with a cheap model that can’t plan. Concurrency suffers when a single model endpoint becomes a bottleneck, serialising work instead of running roles in parallel on separate endpoints.

Ultimately, the model becomes a central pillar: change it, and the whole structure trembles. The platform’s reliability — validation, retry, fallback, logging — gets cooked into model‑specific prompts instead of a reusable harness.

How it works

Xihe inverts the coupling. Every role — conductor, executor, verifier — is defined as an interface with a job description, not a model‑specific backend. Model binding happens at task dispatch:

Independent assignment at dispatch. At task start, the dispatch engine (deterministic code that never calls a model) reads a configuration map that pairs role names with model endpoints. Conductor receives its allocation; executor receives its own; verifier gets another. All are drawn from a config you control.
No code change to swap models. Changing the conductor from Opus to GLM for the next task means updating a single config key (an environment variable, a JSON value, whatever fits your deployment). No code change, no redeploy, no pipeline rebuild. The same DAG logic and prompts execute; only the endpoint and its API credential differ.
Concurrent, isolated execution. Each role runs within the task’s DAG boundaries using its assigned endpoint. Executor leaves fire in parallel on their cheap model while the conductor plans; the verifier can cross‑examine results without waiting for all leaves. Failures stay contained: a misbehaving executor model does not propagate to the conductor or verifier.
Harness owns reliability. Retry logic, response validation, fallback to an alternative model, and logging all live in the harness layer surrounding each role, not in the prompt. The model is a callable endpoint that receives structured instructions and returns output. Your domain logic, brand rules, and secrets never cross into a model’s API or weight space.

This turns model choice into a configuration dial, not a design commitment.
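
As a rough sketch of what binding at dispatch can look like, the snippet below parses a role‑to‑model map and binds the three roles. The JSON layout, the `Binding` type, the `bind_roles` function, the lowercase model identifiers, and the `models.example` URLs are all assumptions made for illustration; the source does not specify Xihe's config schema.

```python
import json
from dataclasses import dataclass

ROLES = ("conductor", "executor", "verifier")


@dataclass(frozen=True)
class Binding:
    role: str
    model: str
    endpoint: str


def bind_roles(config_text: str) -> dict[str, Binding]:
    """Dispatch-time binding: plain parsing and lookup, no model call involved."""
    config = json.loads(config_text)
    missing = [role for role in ROLES if role not in config]
    if missing:
        raise ValueError("no model configured for: " + ", ".join(missing))
    return {
        role: Binding(role, config[role]["model"], config[role]["endpoint"])
        for role in ROLES
    }


# Placeholder identifiers and URLs; substitute whatever your providers expose.
EXAMPLE_CONFIG = """
{
  "conductor": {"model": "opus",  "endpoint": "https://models.example/opus"},
  "executor":  {"model": "mimo",  "endpoint": "https://models.example/mimo"},
  "verifier":  {"model": "codex", "endpoint": "https://models.example/codex"}
}
"""

bindings = bind_roles(EXAMPLE_CONFIG)
```

Swapping the conductor is then a change to the `"conductor"` entry in the config text; `bind_roles` and everything downstream of it stay as they are.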

Trade‑offs & boundaries

Decoupling introduces a new configuration surface that must be managed. A misconfigured endpoint (a typo, an expired key) gives no warning until the role is dispatched and fails, so operational discipline becomes essential: validate config before dispatch and monitor model availability.

The per‑role binding is one‑to‑one for the lifetime of a task. You cannot, within a single task, split the executor role to send different leaves to different models. If you need per‑leaf model selection, decompose the work into separate tasks. This boundary keeps the interface simple and each role’s contract clear.

Decoupling does not guarantee that any model can fill any role. A conductor requires strong reasoning; an executor thrives on speed and low cost; a verifier needs cross‑model skepticism. Assigning a weak model to the conductor naturally degrades plan quality. If a role’s logic depends on a model‑specific primitive (e.g., a particular function‑calling schema unique to one provider), swapping models may break the contract. In that case, pin that role to a narrow set of allowed models in configuration — no code change needed.
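
Pinning a role to an allowed set can be expressed as a small config check. The sketch below is hypothetical: `check_pins`, the `ALLOWED` map, and the model identifiers are illustrative names, and the source does not say how Xihe represents allowed sets.

```python
def check_pins(assignment: dict[str, str], allowed: dict[str, set[str]]) -> list[str]:
    """Return one error per role whose configured model falls outside its pinned set."""
    errors = []
    for role, model in assignment.items():
        pinned = allowed.get(role)  # roles without an entry accept any model
        if pinned is not None and model not in pinned:
            errors.append(f"{role}: '{model}' not in allowed set {sorted(pinned)}")
    return errors


# Assumption: only the conductor depends on a provider-specific primitive.
ALLOWED = {"conductor": {"opus", "glm"}}
```

Running a check like this when the config is loaded turns an incompatible swap into a config error rather than a broken contract at runtime.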

Latency overhead from reading runtime config is microseconds, negligible compared to model inference time. However, running each role on an independent endpoint means you cannot share a connection pool or reuse an in‑memory context window across roles. Under heavy concurrency this may increase total API latency compared to a monolithic pipeline; the trade‑off is worth the isolation.

The degradation path is clear: if a model endpoint fails or returns garbage, the harness retries and, if configured, falls back to an alternative model for that role. Without a configured fallback, the task fails cleanly for that role, but other roles continue unaffected. The platform does not silently absorb bad output; it stops on contract violations.
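
The degradation path can be sketched as a wrapper that lives in the harness rather than in the prompt. Everything here is an assumption for illustration: the function name `call_with_fallback`, the `ContractViolation` exception, the per‑model attempt count, and the idea that a model is represented as a plain callable.

```python
from typing import Callable, Optional


class ContractViolation(Exception):
    """Raised when no configured model returns output that passes validation."""


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Optional[Callable[[str], str]],
    validate: Callable[[str], bool],
    prompt: str,
    attempts: int = 2,
) -> str:
    """Try the primary model, then the fallback if one is configured."""
    for model in (primary, fallback):
        if model is None:
            continue  # no fallback configured for this role
        for _ in range(attempts):
            try:
                output = model(prompt)
            except Exception:
                continue  # endpoint error: retry (broad catch keeps the sketch short)
            if validate(output):
                return output
            # Invalid output ("garbage") also counts as a failed attempt.
    raise ContractViolation("no model produced output that satisfies the contract")
```

The final `raise` is the "fails cleanly" branch: bad output is never passed downstream, it is converted into an explicit error for that role.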

When it earns its place

Role‑to‑model decoupling pays for itself as soon as you operate more than one model (or model version) in production and need to evolve without rewriting code. Concrete scenarios:

Multi‑model cost/quality tuning. Plan with an expensive reasoning model (e.g., Opus), execute hundreds of leaves with a cheap, fast model (MiMo), and verify with a third model (Codex) that brings an independent perspective. Each role gets the model best suited to its task.
A/B testing and gradual rollout. Route a small percentage of tasks to a candidate planner model while keeping the rest on the current one. Measure completion rates, latency, and cost without touching code. Roll back by reverting a config value.
Migration without downtime. A model is deprecated or becomes unreliable. Update the config map to point to the replacement. The DAG, prompts, and harness stay untouched.
Compliance and data isolation. Run the conductor on a cloud model for reasoning, but run the executor on a local model that never sends sensitive data outside your network. Configuration per role enforces data boundaries.
Multi‑tenant deployments. Different clients or use‑cases demand different model stacks. A config map per tenant lets one codebase serve all, without forking.
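
For the gradual‑rollout scenario, one common way to route a fixed share of tasks is to hash a stable task identifier into a bucket. The sketch below assumes a string task ID and a hypothetical `pick_planner` helper; it is one possible routing rule, not a description of how Xihe selects models.

```python
import hashlib


def pick_planner(task_id: str, current: str, candidate: str, candidate_percent: int) -> str:
    """Route roughly candidate_percent of tasks to the candidate model.

    Hashing the task ID keeps the choice stable across retries of the same task.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return candidate if bucket < candidate_percent else current
```

Setting `candidate_percent` to 0 is the rollback; raising it only ever adds tasks to the candidate group, because a task's bucket does not change.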

A role is not a model — it is a contract that a model fulfills, and that contract is wired per task, not per deploy.

Method

Hooks & extensibility

Four synchronous in‑process hooks let you inject deterministic policy at critical lifecycle points, engineering reliability outside the model.

Xihe provides four synchronous, in‑process extension points that let you plug custom logic into the agent runtime without forking or patching the core. Each hook maps to a specific lifecycle event, defaults to a no‑op, and is registered per deployment only where you need it. The one idea they turn on is that reliability guarantees the model cannot provide must be engineered outside the model, through deterministic, model‑blind interception seams.

The failure mode

Without these hooks, every domain‑specific policy — permission checks, PII redaction, external monitoring, private tool exposure — forces you into either forking the runtime (brittle, merge‑heavy, unrepeatable) or trusting the model to self‑regulate. That trust breaks silently:

Leaves execute unchecked. A leaf with malformed or unauthorized parameters still consumes model quota, and a destructive action (file deletion, external API call) can slip through because no gate exists to halt it before dispatch.
Observability is bolted on later. No lifecycle‑level tap for task outcomes means operators either tail raw model output or retrofit fragile wrappers — leaving audit blackouts and guesswork when things fail in production.
Memory writes commit uninspected. Model‑generated content, including PII, hallucinated facts, or data that violates retention policy, enters the persistent store with no interception point to redact, modify, or drop it.
Every new tool requires a core change. Internal APIs, legacy endpoints, or team‑specific services must be added by patching the runtime or by maintaining out‑of‑band proxies, coupling system releases to agent capability.

These aren’t edge cases; they are the difference between a prototype that works with a friendly model and a production system that must survive real‑world corruption, compliance audits, and operational noise.

How it works

Each hook is a callable you register at startup. The runtime invokes it synchronously at the defined moment, uses its return value (where applicable), and treats an absent hook as a no‑op. All four run in‑process and are model‑agnostic and deterministic.

pre_leaf_execute — Called immediately before any leaf task begins, receiving the leaf’s resolved parameters. To halt execution cleanly, raise a structured error; the error propagates to the conductor for re‑planning or surfacing, and the leaf is never dispatched to a model. This is the only hook that can block execution.
post_task_complete — Fires after any task (leaf or composite) finishes, with the outcome. Its return value is ignored, and it must not affect flow. Designed as a fire‑and‑forget side‑effect outlet: log, emit a metric, trigger a webhook. Exceptions are caught and logged, not fatal, preserving the task result.
memory_write_gate — Intercepts every memory write before it commits. Receives the proposed content; return it unchanged, return a modified version, or return None to silently drop the write. This overrides the built‑in restraint logic per domain, enabling PII redaction, retention enforcement, or content‑type bans without touching the persistence layer.
tool_discovery — Executed once at startup, after system tools are loaded. Returns a list of tool schemas; the conductor merges them with the system tool set. Gives agents access to internal APIs, databases, or custom endpoints without gateway changes. If the hook throws, the agent fails to start.

All hooks run on the dispatch thread in a fixed order, with no concurrency for a given event. They are registered per agent instance, one callable per hook; to combine several policies at the same lifecycle point, compose them inside that single callable.
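
The semantics of the four hooks can be summarised in a small registry sketch. The hook names come from the text above; the `Hooks` class, its method names, the `LeafBlocked` exception, and the argument shapes are assumptions made for illustration rather than Xihe's actual registration API.

```python
import logging
from typing import Callable, Optional


class LeafBlocked(Exception):
    """Structured error a pre_leaf_execute hook raises to stop a leaf."""


class Hooks:
    """Holds at most one callable per hook; an unset hook behaves as a no-op."""

    def __init__(self) -> None:
        self.pre_leaf_execute: Optional[Callable[[dict], None]] = None
        self.post_task_complete: Optional[Callable[[dict], object]] = None
        self.memory_write_gate: Optional[Callable[[str], Optional[str]]] = None
        self.tool_discovery: Optional[Callable[[], list]] = None

    def before_leaf(self, params: dict) -> None:
        if self.pre_leaf_execute is not None:
            self.pre_leaf_execute(params)  # an exception here blocks dispatch

    def after_task(self, outcome: dict) -> None:
        if self.post_task_complete is None:
            return
        try:
            self.post_task_complete(outcome)  # return value is ignored
        except Exception:
            logging.exception("post_task_complete hook failed")  # logged, never fatal

    def gate_write(self, content: str) -> Optional[str]:
        if self.memory_write_gate is None:
            return content
        return self.memory_write_gate(content)  # None means "drop the write"

    def discover_tools(self, system_tools: list) -> list:
        extra = self.tool_discovery() if self.tool_discovery is not None else []
        return system_tools + list(extra)  # additive: system tools are kept as-is
```

Note the asymmetry the sketch makes visible: only `before_leaf` and `gate_write` let a hook change what happens next, while `after_task` deliberately isolates the task result from hook failures.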

Trade‑offs & boundaries

Latency coupling. Synchronous in‑process execution means every hook adds its wall‑clock time to the critical path. A slow pre_leaf_execute (e.g., an external permission check) blocks leaf dispatch; a heavy memory_write_gate stalls every write. The degradation is linear: a 200 ms gate per leaf adds 2 seconds to a 10‑leaf chain. Hooks must stay fast; expensive work should be offloaded (e.g., enqueue to an async pipeline inside post_task_complete).
Failure semantics are sharp. In pre_leaf_execute and memory_write_gate, an unhandled exception aborts the operation (leaf or write). In post_task_complete, the exception is caught and logged but never fails the task, so side‑effects can go missing unless you monitor those logs. tool_discovery exceptions prevent startup. Every hook demands robust internal error handling.
No cross‑hook state. There is no shared context or coordination; a pre_leaf_execute cannot see another leaf’s post_task_complete outcome. For correlation, thread your own identifiers through metadata.
tool_discovery is static at startup. You cannot add or remove tools mid‑session without a restart. The hook only adds tools; it cannot modify or remove system tools.
Not a general middleware pipeline. Hooks cannot inject new tasks, modify the DAG, or inspect plan state. They are point extensions, not control-flow owners. Recovery from model failures (the verifier’s role) is out of scope.
Process‑local only. In distributed deployments, each runtime instance registers its own hooks independently; there is no cross‑process coordination.
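
To keep a hook off the critical path, the expensive part can be handed to a background worker, as the latency note suggests. The following is a minimal sketch using only the standard library; the queue size, the `worker` function, the `None` shutdown sentinel, and the choice to drop events on overflow are assumptions, and a real deployment might prefer to count or log dropped events.

```python
import queue
import threading

events = queue.Queue(maxsize=1000)


def post_task_complete(outcome: dict) -> None:
    """Hook body: enqueue the outcome and return immediately."""
    try:
        events.put_nowait(outcome)
    except queue.Full:
        pass  # drop on overflow rather than block the dispatch thread


def worker(sink: list) -> None:
    """Drain the queue off the dispatch thread; None is the shutdown sentinel."""
    while True:
        outcome = events.get()
        try:
            if outcome is None:
                return
            sink.append(outcome)  # stand-in for a slow exporter (HTTP call, DB write)
        finally:
            events.task_done()
```

A typical setup starts the worker once at startup with `threading.Thread(target=worker, args=(sink,), daemon=True).start()`. The hook itself then costs only an enqueue, so its contribution to per‑leaf latency stays small regardless of how slow the exporter is.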

When it earns its place

Multi‑tenant deployments. Use pre_leaf_execute to enforce role‑based access on tool calls, aborting unauthorized leaves before they consume model budget or reach external services.
Compliance‑sensitive domains. memory_write_gate redacts PII (SSNs, credit‑card numbers) or enforces retention limits deterministically — the only reliable way to guarantee policy regardless of what the model outputs.
Operational observability. post_task_complete feeds every task outcome into your metrics pipeline, giving you latency histograms, error counts, and audit trails without touching the core.
Internal tool ecosystems. tool_discovery registers private APIs (employee directory, internal dashboards, legacy service wrappers) as first‑class tools, avoiding gateway releases.
Gradual policy rollout. Because an unregistered hook is a no‑op, a policy can be switched on per deployment simply by registering its hook there (staging first, then production), with no conditional branches in the core code.
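
Two of these scenarios are easy to sketch as hook bodies. The hook names and the return/raise semantics follow the descriptions above; everything else is an assumption for illustration: the parameter keys `caller_role` and `tool`, the `ROLE_TOOLS` table, the `DO-NOT-STORE` marker, and the deliberately simple regular expressions, which will miss some formats and over‑match some long numbers.

```python
import re
from typing import Optional

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def memory_write_gate(content: str) -> Optional[str]:
    """Redact obvious PII; drop writes carrying a (hypothetical) no-retention marker."""
    if "DO-NOT-STORE" in content:
        return None  # None tells the runtime to drop the write
    content = SSN.sub("[REDACTED-SSN]", content)
    return CARD.sub("[REDACTED-CARD]", content)


class Unauthorized(Exception):
    """Structured error that stops a leaf before dispatch."""


# Hypothetical role-to-tool table for a multi-tenant deployment.
ROLE_TOOLS = {
    "analyst": {"read_file", "query_schema"},
    "admin": {"read_file", "query_schema", "delete_file"},
}


def pre_leaf_execute(params: dict) -> None:
    """Raise if the caller's role may not use the requested tool."""
    allowed = ROLE_TOOLS.get(params.get("caller_role"), set())
    if params.get("tool") not in allowed:
        raise Unauthorized(
            f"role {params.get('caller_role')!r} may not call {params.get('tool')!r}"
        )
```

Both functions are plain deterministic code, so the policy holds regardless of what the model outputs, which is the point of placing it in a hook.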

Hooks are where the model stops and your policy begins — engineer reliability outside the model.