Xihe · a realm for crafted agents

The philosophy

 The model is the lever.The harness is the fulcrum.Taste is the difference. 

Reliability is engineered, not prompted

It comes from the system around the model. Contracts pin the invariants; the verifier enforces them; the model is free to be brilliant inside the guardrails.

the verifier fires on high-risk seams even when every test is green

Evidence over intuition

Data, tests, and measurement decide, not vibes. A change that passes its spec still earns an adversarial second look on the high-risk seams.

every identifier a plan names is verified to exist before it is written (R6)

Built for six months from now

No “just for now” shortcuts. One plan done well beats ten patches. The substrate is designed to be copied to other agent-runtime shapes.

the whole substrate is one portable SQLite file, built to copy to other runtimes

The substrate is open, the taste is yours

Xihe ships the universal harness: orchestration, memory, skills, guardrails. Your domain logic, your brand, your secrets stay yours.

the universal core carries zero domain; your logic and secrets stay in an injected pack

A matter of taste

Same models, different outcomes

Taste isn’t decoration. It’s a set of defaults that compound — choices an industry makes for safety, made differently here on purpose.

Xihe

Singular over mainstream

Industry

Mainstream over singular

Consensus optimises for mediocrity.

Xihe

First principles over hacks

Industry

Hacks over principles

Hacks rot. Principles compound — load-on-demand cut a frozen 33k-token prompt by ~47%.

Xihe

Nordic restraint over clutter

Industry

Verbose feels “safe”

Less is not a lack. It’s a decision.

Xihe

Pay for quality over cut corners

Industry

Save compute at any cost

Quality is never free — best-of-N buys it with concurrency, −92% per-leaf context.

Spec-driven, disproof-tested

Method separates architecture from accident

Ordinary delivery writes a spec, builds to it, and ships when acceptance passes. Xihe adds a second half: the spec becomes a falsifiable contract, and passing it is the floor, not the ceiling.

Ordinary SDD

Spec
Code
Deploy
Acceptance

Xihe · SDD + TDD

Spec + contract
Implement
Adversarial disproof
Outside-in acceptance
Pass ≠ correct

Contract

A contract has four parts

Types

Validation

iii

State machine

Given-When-Then

Passing acceptance doesn’t mean correct. Only surviving attempted disproof does.

Falsifiable

When the code proves the contract wrong

A contract is a hypothesis, not a decree. When the implementation reveals it was wrong, the contract yields — and the change is recorded, never silent. A real instance, in shape:

contract said scrub secrets at every fact write — gate them inside validateFactWrite

the code revealed it also blocked the user’s explicit “remember” — user sovereignty was being denied

contract yielded secret-scrub became opt-in to the auto-learning path; an explicit remember bypasses it — on the record

Most sites prove rigor with green checkmarks. We show a place the method overrode its author.

The pivot

Pass is not the same as correct

A green suite only proves the code matches the spec — and the spec can be wrong. Passing is the floor, not the ceiling.

Adversarial disproof

High-risk seams face skeptics whose job is to refute the result, not bless it — even when every test is green.

Outside-in acceptance

Behaviour is verified at the intent layer, from the outside, not just at the unit it was written against.

Green is the entry fee, not the finish line.

Drift detection

Drift detection & rewind

When reasoning starts to wander, a detector classifies the drift and pulls the run back to the last good anchor — no silent degradation.

Real drift — wrong direction

Rationalisation — an excuse forming

Scope creep — the task expanding

Premature solution — diagnosis skipped

Lost thread — context dropped

No silent degradation. Detected, classified, rewound.

Patterns & modes

How the harness thinks, made explicit

An agent shouldn’t reason the same way about a typo and a new subsystem. Xihe routes each task to a cognitive mode, and a handful of patterns hold no matter which mode is in play.

mode		when	method
M1	Root-cause	bugs · incidents	five-whys + a red-team self-check + sweep for siblings
M2	First principles	clean new build	question → delete → simplify → accelerate → automate
M3	Subtraction	refactor · cleanup	deletion first; every line earns its place
M4	Search first	cause unknown	search precedent before forming a judgement
M5	Working backwards	architecture · new module	write the PR / FAQ before the code
M6	Evidence-driven	perf · quality	data, tests, and measurement over intuition
M7	Closed loop	deploy · ops · default	set the goal → track the process → take the result

Contracts first

Types, validation, a state machine, and acceptance tests pin the invariants before code. Implementation that conflicts loses; a contract the code proves wrong yields, on the record.

Evidence routing

Any plan naming a concrete identifier (a field, a table, an RPC) must be verified to exist before it’s written down. Structure agents map; ground-truth tools confirm.

Complexity routing

Work is sized before it’s done: a one-file tweak, a multi-file change with a precedent, or a cross-layer build that earns a full plan and a cross-model review.

Adversarial verify

High-risk seams get independent skeptics whose job is to refute the result. Passing the spec isn’t enough; surviving the attack is.

Effort matched to blast radius

Not every task earns the same ceremony

An agent shouldn’t reason about a typo and a new subsystem the same way. Work is sized before it’s done, and the discipline that fires scales with what’s at stake.

Vibe

one file

A CSS tweak, a comment, a config line. Do it, verify, done — no plan.

Lite

2–5 files, a precedent

A change with a standard to follow. Auto-enhance, build direct, verify.

Full

cross-layer · new module · security

Earns a full plan, a falsifiable contract, and a cross-model review before code.

Ceremony is a cost. Spend it where the blast radius is real.

Plan mode

A read-only deliberation cockpit

Ordinary plan mode reads and reasons but can’t write, then hands you a plan. Xihe embeds a full cockpit on top: a deliberation that remembers, ingests, argues with itself, and hardens into a contract — before a single line of code is written.

What we embedded

Ordinary plan mode

Reads & reasons, no writes
Produces a plan
You approve, then it builds

Xihe plan mode

A persistent ledger — every decision re-injected each turn, so the plan never forgets or drifts
A read-only gate with a doc carve-out — src stays locked, but plans & SDDs can still be written
Auto knowledge ingestion — paste a URL and it’s fetched, distilled, and folded in, no command needed
Context-stage awareness — it tracks how full the window is and harvests before it overflows

The cockpit, by command

Enter

shift+tab · /plan toggle the cockpit

Deliberate

/note <decision> record it into the ledger — re-injected every turn

/grill adversarial interrogation → contracts + ADR deltas

/council [deep] N lenses → judges → grafted synthesis (best-of-N)

/model switch the planning model & thinking level

Ingest

/ref <url> fetch + distil a link into the ledger (pasted URLs auto-ingest)

/search <query> metasearch more sources, distilled into context

Harvest

/crystallize save the deliberation to a cross-session crystal store

/sdd emit a structured SDD + TDD plan to docs/plan

/crystals recall past sessions’ conclusions into this one

Plan first, spend later. The cockpit makes the plan a living artifact, not a one-shot prompt.

A second model, adversarial

Reviewed across models, on a bounded clock

A different model (Codex) reviews the work — a cross-lens, not a rubber stamp. Its finding is a perspective, not ground truth; taste overrides it. Rounds are capped so review never becomes its own slop.

G1 Plan one round, before a phase begins

G2 Phase one round per phase end

G3 Release one round + at most one fix cycle

taken seriously

reproducible P0/P1 bug, with file:line
a concrete contract violation
a real, mechanism-level ship-blocker

refused as slop

niche / theoretical / “useful at scale”
telemetry, dashboards, coverage-of-coverage
style preferences · hypothetical races without a reproducer

Past the second round, review ROI falls off a cliff. We stop on purpose.

Converge, don’t just finish

Iterate to a fixpoint

One pass is rarely right. The whole graph runs, a judge asks “has it converged?”, and if not the failure is fed back and the conductor redraws — round after round, until it converges or a bounded cap stops it.

01 Run the DAG

02 Judge: converged?

03 Feed failure back

04 Conductor redraws

↻ until fixpoint

Bounded, never runaway

A hard round cap (default 3) — convergence or a clean stop, never an infinite loop.

It swaps in a stronger brain

Stuck for a round or two? The conductor escalates to a stronger model and redraws — not the same brain trying harder.

One loop, not two

The fixpoint judge is the single verify loop; the executor’s inner verify is switched off here, so the two never nest and double the cost.

Passing once is luck. Converging is a property of the loop.

Token efficiency by design

Context is cash — so we spend it once

Quality is never free, but waste is a choice. The harness is engineered to put every token where it earns its keep: shared context computed once, tools that load on demand, reasoning that runs as code instead of prose.

01 Frozen prefix-cache

−92% per-leaf context. The shared context is computed once and frozen; every leaf inherits it instead of recomputing — which is what makes DeepSeek 256-wide and MiMo 8-wide affordable.

Pay the context tax once.

02 MCP gateway

~70% fewer tool-schema tokens. Every tool converges on one route; the menu stays resident, schemas load on demand instead of bloating each prompt. The mechanism lives on the Harness page.

One contract to serve them all.

03 Code mode

1 MB → 7 KB on a big result. Verbose reasoning becomes code: compute the answer in a script and return only the result, not the whole file.

Think in code, not paragraphs.

The taste behind it: pay for quality, never for waste.

Routing that learns

It learns which model to trust

We don’t hand-pick “DeepSeek for coding” forever. For each kind of leaf, an ε-greedy bandit mostly uses the current best model, occasionally tries another, and updates on the reward — so the routing sharpens itself, per deployment.

exploit · best so far ε explore · try another → reward · verifier verdict

Reward = the verifier

A leaf’s reward is its own success × the verifier’s verdict. The same check that guards quality teaches the router what’s worth its money.

Self-evolving, not preset

It ships as a mechanism, not a tuned policy. Your model line-up and your tasks are unique — the bandit learns yours from your own runs.

Safe by construction

Cold-start is today’s static routing — day-one behaviour, zero regression. Exploration is bounded; a bad pick is caught by the verifier and escalated. The learned table persists in SQLite and compounds across sessions.

Spend the compute where it earns its keep — discovered, not decreed.

Concurrency economics

N tries, one winner — and the bill stays small

Quality and cost usually trade off. A frozen prefix and a judge panel let Xihe buy quality with concurrency instead of with tokens — which is why we lean on cheap, fast models run wide: DeepSeek v4-flash fanned out for the rivals, MiMo v2.5 brought in for multimodal execution. One attempt is a coin-flip; ten attempts, scored and grafted, are a guarantee that the best one survives.

Each attempt is scored on cost, quality, and latency. The winner is grafted into the main tree; the rest are discarded. No blind chance.

Point it at the work that punishes a single guess

Load-bearing refactors

When the change touches code everything else stands on, you do not want the first plausible diff — you want the one that survived four rivals.

Wide-open design calls

A problem with ten reasonable shapes and no obvious winner. Generate the shapes in parallel, let a judge panel argue them down to one.

Ambiguous or underspecified specs

Where a single pass is a coin-flip, N passes converge. The spread between attempts is itself a signal that the spec needs sharpening.

Expensive-to-discover mistakes

Anything cheap to get wrong now and ruinous to find in production. Pay a little concurrency here to never pay the incident later.

Where the tokens go

Standard per-leaf context 100%

Xihe frozen prefix 8%

−92%

shared context, computed once and inherited by every leaf