Research Notes · June 2026

Reliability from the harness, not the model.

A separate gate refuses to let a session close until a seated model's output matches a contract fixed before it spoke.

One session · a frontier model in the seat · a three-criterion contract

REFUSED

It cited the source by location and paraphrase, then proposed to close.
paraphrase → verbatim quotation

REFUSED

It argued the triggering condition was gone by reasoning from outcomes, not the code that defines it.
inference → cited mechanism

REFUSED

Its edits were prose descriptions, and the evidence was scattered across turns.
description → applicable edit, consolidated

ACCEPTED

Five cycles. Three challenges, three upheld. Only then did the gate open. Left to itself, the model would have finished at cycle two on a confident paraphrase.

The gate checks output against the frozen contract and is indifferent to what produced it — so the same discipline holds whether the seat is a frontier model or a cheap open-weight one.

Methodology Note · v0.1

Reliability from the Harness, Not the Model

An adversarial verification loop that shapes any seated model's output into a fixed, usable form

Patrick Killebrew · June 2026 · figures current as of 2026-06-14

Summary

We describe a verification architecture in which a language model is seated in one role of an adversarial multi-model loop and a separate gate refuses to let a session close until the model's output matches a contract fixed before the session began. The central property is that reliability is supplied by the harness rather than by the model in the seat: the gate checks output against the frozen contract and is indifferent to what produced it. We report a behavioral record of 319 sessions to date, in which the seat has been held by five different models — most often an inexpensive open-weight model — and present the per-cycle measurements the system records, including a randomized-signal control that distinguishes genuine adversarial response from its appearance. A companion demonstration note provides a single session in full detail as a close-up of the same mechanism.

The problem

A capable language model produces answers that are usually almost right. For a human reader, almost-right is often enough. For an automated pipeline that consumes one model's output as another stage's input, almost-right is unusable: a step that expects a known shape cannot consume output whose shape varies run to run. The gap is not intelligence. A model can be entirely capable of the correct answer and still, unprompted, deliver it in a form that is slightly under-grounded, slightly over-asserted, or slightly differently structured each time. That variance is what makes raw model output a poor input.

The usual response is to make the model better or to prompt it more carefully. Our approach is different: leave the model as it is and place a gate outside it that will not accept output until the output conforms to a specification written in advance. The discipline comes from the architecture, so it does not depend on which model is in the seat.

The loop

A session runs four models in defined roles. A Researcher advances the work toward an objective. A Challenger contests the Researcher's output each cycle. A Friction model scores session health out of band. A Parietal adjudicates challenges and distills the final result. Beneath them sits a persistent record of what the project has already established, so a session builds on prior results rather than re-deriving them.

Before a session starts, its objective is compiled into a frozen contract: a small set of criteria, each marked as checked-by-code against the execution log or judged by the Challenger. The contract cannot be edited, added to, or reinterpreted once the session begins. The session may not close until every criterion is satisfied. When the Researcher proposes to end, the gate tests the proposal against the contract; if any criterion is unmet, the close is refused and the session continues.
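The contract and gate described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the names `CheckKind`, `Criterion`, `Contract`, and `gate_allows_close` are hypothetical, the two criteria are placeholders, and the set of satisfied criteria is passed in as plain data rather than derived from an execution log or a Challenger ruling.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class CheckKind(Enum):
    CODE = "checked-by-code"               # tested against the execution log
    CHALLENGER = "judged-by-challenger"    # ruled on by the Challenger


@dataclass(frozen=True)
class Criterion:
    ident: str
    text: str
    kind: CheckKind


@dataclass(frozen=True)
class Contract:
    # A tuple inside a frozen dataclass: criteria cannot be added,
    # removed, or reassigned once the contract object exists.
    criteria: tuple[Criterion, ...]


def gate_allows_close(contract: Contract, satisfied: set[str]) -> tuple[bool, list[str]]:
    """Return (allowed, unmet ids). A close is allowed only when nothing is unmet."""
    unmet = [c.ident for c in contract.criteria if c.ident not in satisfied]
    return (not unmet, unmet)


# Placeholder criteria; the kinds assigned here are arbitrary.
contract = Contract(criteria=(
    Criterion("a", "placeholder criterion A", CheckKind.CODE),
    Criterion("b", "placeholder criterion B", CheckKind.CHALLENGER),
))

print(gate_allows_close(contract, {"a"}))        # (False, ['b'])
print(gate_allows_close(contract, {"a", "b"}))   # (True, [])
```

The gate function receives only the contract and the set of satisfied criteria. It has no argument through which the identity of the Researcher could reach it, which is the property the text relies on.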

The Researcher seat is filled through a mailbox. The engine routes each Researcher turn to that mailbox and waits for a reply, and it cannot determine what is answering. This is the mechanism of interchangeability: a frontier chat model, an API model, or a cheap open-weight model all occupy the seat the same way and are gated identically.
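A mailbox of this kind can be sketched with two queues. The sketch below is an assumption about shape only: the class `Mailbox`, the function `seat`, and the two stand-in occupants are hypothetical names, and a real occupant would be a model call rather than a string formatter.

```python
import queue
import threading
from typing import Callable


class Mailbox:
    """The engine posts a turn to `requests` and blocks on `replies`."""

    def __init__(self) -> None:
        self.requests = queue.Queue()
        self.replies = queue.Queue()

    def ask(self, prompt: str, timeout: float = 5.0) -> str:
        # Engine side: route the turn, then wait. Nothing here inspects the occupant.
        self.requests.put(prompt)
        return self.replies.get(timeout=timeout)


def seat(mailbox: Mailbox, occupant: Callable[[str], str]) -> None:
    """Occupant side: answer every request that arrives in the mailbox."""
    def serve() -> None:
        while True:
            mailbox.replies.put(occupant(mailbox.requests.get()))
    threading.Thread(target=serve, daemon=True).start()


# Two stand-in occupants, invented for the example.
def occupant_a(prompt: str) -> str:
    return f"A answers: {prompt}"


def occupant_b(prompt: str) -> str:
    return f"B answers: {prompt}"


box_a, box_b = Mailbox(), Mailbox()
seat(box_a, occupant_a)
seat(box_b, occupant_b)

# The engine-side call is identical for both seats.
print(box_a.ask("cycle 1"))   # A answers: cycle 1
print(box_b.ask("cycle 1"))   # B answers: cycle 1
```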

What the record shows

As of this writing the system has run 319 sessions since mid-May 2026, of which 271 closed complete. The remaining 48 failed in identifiable ways (a model process dying mid-session, or a session that never reached a clean close), and these are recorded distinctly rather than laundered into the complete count. The Researcher seat has been held by five different models. The distribution is the point: the seat is usually not a frontier model.

Completed sessions by occupant of the Researcher seat, to 2026-06-14.
| Model in the Researcher seat | Complete | Avg cycles | Avg challenges |
|---|---|---|---|
| Open-weight GLM-4.7 (primary) | 225 | 2.5 | 0.1 |
| Frontier chat model | 30 | 4.0 | 0.8 |
| Open-weight Qwen-235B | 3 | 14.7 | 3.0 |
| Open-weight DeepSeek-V3.2 | 3 | 1.0 | 0.0 |
| Open-weight GLM-4.7 (API variant) | 10 | 5.4 | 0.2 |

The inexpensive open-weight model carried the large majority of completed sessions. That is the empirical content of the interchangeability claim: it is not an argument from design that a cheap model could be substituted, but a record in which a cheap model has been the ordinary occupant, gated by the same contract machinery throughout. The frontier model is the exception in the seat, useful precisely because watching the gate operate on the most capable occupant shows the discipline is external to the model rather than a concession the model is making.
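The size of that majority follows directly from the table. The short calculation below uses only the completion counts reported there and checks that they sum to the 271 completed sessions; the variable names are illustrative.

```python
# Completed sessions per occupant, copied from the table.
completed = {
    "Open-weight GLM-4.7 (primary)": 225,
    "Frontier chat model": 30,
    "Open-weight Qwen-235B": 3,
    "Open-weight DeepSeek-V3.2": 3,
    "Open-weight GLM-4.7 (API variant)": 10,
}

total = sum(completed.values())
print(total)   # 271, matching the completed-session count

# Share of completed sessions per occupant, in percent.
shares = {name: round(100 * n / total, 1) for name, n in completed.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")

# All open-weight occupants taken together.
open_weight = sum(n for name, n in completed.items() if name.startswith("Open-weight"))
print(open_weight, round(100 * open_weight / total, 1))   # 241 88.9
```

The primary open-weight model accounts for 83.0% of completed sessions and the frontier model for 11.1%.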

Measuring whether the gate does work

The system records a per-cycle behavioral observation for every Researcher turn — 1,240 to date. Each captures the cycle's friction signal and the reason given for it, the Researcher's word count and counts of hedging and certainty markers, whether the Challenger issued a challenge, and the running challenge and uphold totals. This is a behavioral corpus, not a set of anecdotes, and it makes the gate's activity measurable.
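One plausible shape for such a per-cycle record is sketched below. The field names follow the description above, but the record layout, the numeric type of the friction signal, and the marker word lists are all assumptions; the source does not specify which words count as hedging or certainty markers.

```python
import re
from dataclasses import dataclass

# Hypothetical marker lists, chosen only to make the example run.
HEDGING = {"may", "might", "perhaps", "possibly", "likely"}
CERTAINTY = {"clearly", "certainly", "definitely", "must", "always"}


@dataclass(frozen=True)
class CycleObservation:
    friction_signal: float    # assumed numeric
    friction_reason: str
    word_count: int
    hedging_markers: int
    certainty_markers: int
    challenge_issued: bool
    challenges_total: int     # running total, including this cycle
    upholds_total: int        # running total


def observe(turn: str, signal: float, reason: str, challenged: bool,
            prior_challenges: int, upholds_total: int) -> CycleObservation:
    """Build one observation from the text of a Researcher turn."""
    words = re.findall(r"[a-z']+", turn.lower())
    return CycleObservation(
        friction_signal=signal,
        friction_reason=reason,
        word_count=len(words),
        hedging_markers=sum(w in HEDGING for w in words),
        certainty_markers=sum(w in CERTAINTY for w in words),
        challenge_issued=challenged,
        challenges_total=prior_challenges + int(challenged),
        upholds_total=upholds_total,
    )


obs = observe("This is clearly correct and must hold, though it might change.",
              signal=0.4, reason="example", challenged=True,
              prior_challenges=0, upholds_total=0)
print(obs.word_count, obs.certainty_markers, obs.hedging_markers)   # 11 2 1
```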

The measurements are coherent with the mechanism. On cycles where the Challenger issued a challenge, the Researcher's turn was markedly longer and more assertive than on uncontested cycles — roughly two to three times the word count and a sharp rise in certainty markers. The gate reacts to exactly the profile one would predict for an over-confident, under-grounded answer: longer, more certain, less hedged. That is the near-miss the architecture exists to catch.
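The comparison described above amounts to splitting cycles by whether a challenge was issued and comparing group means. The sketch below shows that computation on synthetic rows; the numbers are invented for illustration and are not values from the record, and the function name `contrast` is hypothetical.

```python
from statistics import mean

# Synthetic rows for illustration only. These are NOT values from the record.
cycles = [
    {"challenged": False, "words": 200, "certainty": 1},
    {"challenged": False, "words": 240, "certainty": 2},
    {"challenged": True, "words": 520, "certainty": 6},
    {"challenged": True, "words": 580, "certainty": 8},
]


def contrast(rows: list, field: str) -> float:
    """Ratio of the challenged-cycle mean to the uncontested-cycle mean."""
    challenged = mean(r[field] for r in rows if r["challenged"])
    uncontested = mean(r[field] for r in rows if not r["challenged"])
    return challenged / uncontested


print(contrast(cycles, "words"))                  # 2.5
print(round(contrast(cycles, "certainty"), 2))    # 4.67
```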

The methodologically important part is a control. On a substantial fraction of cycles — 333 of the 1,240 — the friction signal presented to the Researcher was injected rather than computed from the session state. The purpose is to separate genuine response to adversarial pressure from mere response to a number: if the Researcher changes its behavior only when the signal's content warrants it, and not merely because a value moved, the pressure is real rather than theatrical. The signals exist to be testable, and they are tested against themselves.
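The control can be pictured as a switch in front of the signal, with the choice logged so the two groups of cycles can be separated afterwards. The sketch below is an assumption about mechanism: the source does not say how cycles are selected for injection or how injected values are drawn, so the observed fraction (333 of 1,240) is used here only as a stand-in probability, and `present_signal` is a hypothetical name.

```python
import random


def present_signal(computed: float, rng: random.Random, inject_prob: float):
    """Return (signal shown to the Researcher, whether it was injected)."""
    if rng.random() < inject_prob:
        # Injected: a value drawn without reference to session state.
        return rng.random(), True
    # Computed: the value derived from the session itself.
    return computed, False


rng = random.Random(0)               # seeded so the run is repeatable
inject_prob = 333 / 1240             # stand-in, taken from the reported counts
print(round(100 * inject_prob, 1))   # 26.9 (percent of cycles with an injected signal)

# Simulate 1,240 cycles with a constant computed signal of 0.2.
log = [present_signal(0.2, rng, inject_prob) for _ in range(1240)]
injected = [value for value, flag in log if flag]
computed = [value for value, flag in log if not flag]
print(len(injected), len(computed))
```

Because each observation carries the injected flag, any behavioral measure can later be compared between the two groups, which is what allows a response to the signal's content to be told apart from a response to a moving number.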

A close-up

The companion demonstration note records a single session in full: a frontier model in the Researcher seat, a three-criterion contract, and a gate that refused to close three times — forcing the model to replace a paraphrase with a quotation, an inference with a cited mechanism, and a prose description with an applicable edit — before accepting a consolidated deliverable. That session is the aggregate behavior reported here, viewed at single-session resolution. The value of pairing them is that the close-up shows the mechanism legibly while the record shows it is not a one-off.

Why the output is usable downstream

Because a session cannot close until its output matches a contract specified in advance, the output emerges in a known shape: claims grounded, evidence cited, recommendation and any edits in a fixed form. A later stage can consume that output without re-interpreting it. This is the link between a single gated session and an automated pipeline: a problem dissolved into a predictably-shaped, evidence-backed artifact is a usable input to a decomposition stage, where a loosely-worded answer of varying shape is not. The gate is the converter from answer to data.
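What "a known shape" buys a later stage can be shown with a small consumer-side check. The structure below is a hypothetical rendering of the shape described above (grounded claims, cited evidence, a recommendation, edits); the class and field names are assumptions, not the system's schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    statement: str
    evidence_quote: str      # verbatim text backing the claim
    source_location: str     # where the quote can be found


@dataclass(frozen=True)
class Deliverable:
    claims: tuple[Claim, ...]
    recommendation: str
    edits: tuple[str, ...] = ()


def shape_errors(d: Deliverable) -> list[str]:
    """List every way the deliverable departs from the expected shape."""
    errors = []
    if not d.claims:
        errors.append("no claims")
    for i, claim in enumerate(d.claims):
        if not claim.evidence_quote.strip():
            errors.append(f"claim {i}: missing evidence quote")
        if not claim.source_location.strip():
            errors.append(f"claim {i}: missing source location")
    if not d.recommendation.strip():
        errors.append("missing recommendation")
    return errors


grounded = Deliverable(
    claims=(Claim("X holds", "quoted line", "file.py:10"),),
    recommendation="Correct the document.",
)
ungrounded = Deliverable(
    claims=(Claim("X holds", "", "file.py:10"),),
    recommendation="",
)

print(shape_errors(grounded))     # []
print(shape_errors(ungrounded))   # ['claim 0: missing evidence quote', 'missing recommendation']
```

A stage that receives only deliverables with an empty error list can read fields by name instead of parsing prose.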

Limitations and honesty about the numbers

The record is observational, not a controlled study, and several limits should be stated plainly. The challenge-ruling field is cleanly captured only for a recent portion of the history: of 166 recorded challenge events, 110 carry an unresolved ruling label from an earlier schema period, leaving 51 upholds and 5 rejects clearly recorded. We therefore do not report a precise uphold rate across the full history; the defensible statements are the behavioral distribution above and the clean recent window. Counts of cycles, challenges, and completions are reliable; the full text of older challenges was not always retained. We have not run a matched comparison of the same objective across occupants under identical conditions, which is the natural next measurement and would convert the interchangeability claim from a strong observational pattern into a controlled result. None of these caveats bears on the architecture's logic; they bound what the current record can be said to prove.
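The reason no full-history uphold rate is reported can be made concrete with the counts stated above. The bound calculation below is an illustration added here, not an analysis taken from the record: it asks what the rate would be if every unresolved ruling had gone one way or the other.

```python
total, unresolved, upheld, rejected = 166, 110, 51, 5

# The three categories account for every recorded challenge event.
assert unresolved + upheld + rejected == total

lower = upheld / total                   # if every unresolved ruling were a reject
upper = (upheld + unresolved) / total    # if every unresolved ruling were an uphold

print(f"{lower:.1%} to {upper:.1%}")     # 30.7% to 97.0%
```

An interval that wide carries almost no information, which is why the behavioral distribution and the clean recent window are the defensible statements.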

Relation to existing work

Message-passing between model conversations is established: parallel coding assistants coordinate through shared task lists and mailboxes, and several independent systems let model sessions exchange messages to hand off tasks or share findings. Those are coordination mechanisms among cooperating peers, and they trust each participant's output. The contribution here is orthogonal to that line of work: an adversarial gate that refuses a seated model's output until it matches a pre-fixed contract, with a mailbox used not to coordinate peers but to seat an interchangeable occupant in an adversarial role. We have not found that combination described elsewhere.

Demonstration Note · v0.1

Watching a Harness Shape a Frontier Model

A worked record of one adversarial verification session

Patrick Killebrew · June 2026 · session run 2026-06-14

What this note is

This is a record of a single session, written so the mechanism can be seen rather than described. In it, a frontier language model occupies one role inside an adversarial multi-model loop, and a separate verification gate refuses to let the session finish until the model's output has been forced into a shape specified in advance. The model could not certify its own work as complete. It was made to ground every claim in quoted evidence before the loop would close.

The point of showing it is narrow and, we think, useful: reliability here is a property of the architecture, not of the model in the seat. The same gate that disciplined a frontier model would discipline a cheap open-weight one identically, because the gate checks the output against a frozen contract and does not care what produced it.

The setup, briefly

Ontinuity runs sessions in which four models cooperate adversarially. A Researcher advances the work. A Challenger contests it each cycle. A Friction model scores session health out of band. A Parietal adjudicates and distills. Before a session starts, its objective is turned into a frozen contract — a small set of criteria, each marked as checked-by-code or judged-by-the-Challenger. The contract cannot be edited mid-session. The session may not close until every criterion is met.

For this session, the Researcher seat was occupied not by an API model but by a chat-based frontier model answering through a mailbox. The engine routes each Researcher turn to that mailbox and waits for an answer; it cannot tell what is on the other side. This is what let us watch the gate operate on a frontier model directly, with the same machinery that would operate on any other occupant.

The task

The objective was a real housekeeping contradiction in the project's own records. One log said a particular code path — a fallback for reading files when an API rate-limits — had been deleted. Another document said that same fallback had recently done essential work. The session was asked to reconcile the two: read the live code, determine whether the original concern still applied, and produce a recommendation with exact document edits, without reintroducing the problem the deletion was meant to solve.

The contract froze three criteria: cite the current logic from the live source; establish whether the triggering condition still exists, citing code or configuration; and deliver a provenance-tagged recommendation with precise edits.

What happened, cycle by cycle

The model's first answer was substantively right. It identified where the code actually lived, described the order in which the fallback paths were tried, and began assembling the argument. It then proposed to close.

The gate refused. The Challenger noted that the model had cited the source by location and paraphrase but had not quoted it — a paraphrase is not a verifiable citation. The model retrieved the code and quoted it verbatim.
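The difference between a paraphrase and a verifiable citation is mechanical enough to sketch. In the session it was the Challenger that raised the objection; the code below only illustrates what a checked-by-code version of the same distinction might look like. The function names and the source snippet are invented for the example and are not the project's code.

```python
def normalize(text: str) -> str:
    """Collapse whitespace so line wrapping does not defeat the comparison."""
    return " ".join(text.split())


def is_verbatim(quote: str, source: str) -> bool:
    """True only if the quote appears character for character in the source."""
    q = normalize(quote)
    return bool(q) and q in normalize(source)


# Hypothetical source text, invented for this example.
source = '''
def read_file(path):
    # fallback: used when the API rate-limits
    return fetch(path + "?nocache=1")
'''

print(is_verbatim("# fallback: used when the API rate-limits", source))   # True
print(is_verbatim("a fallback exists for rate limiting", source))         # False
```

A paraphrase can be accurate and still fail this check, which is the sense in which it is not verifiable.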

It proposed to close again. The gate refused again. This time the objection was sharper: the model had argued that the rate-limit condition was no longer the operative factor by reasoning from outcomes in a query result, rather than by citing the code that defines and mitigates the condition. The model quoted the relevant code — the line that names the rate limit as the reason the path exists, and the line that appends a cache-busting parameter to defeat staleness — and defended the outcome data as corroboration rather than sole basis.

It proposed to close a third time. The gate held once more, on a point that was almost pedantic and still correct: the recommendation's edits were written as prose descriptions rather than as concrete, applicable blocks, and the evidence for the three criteria was scattered across turns rather than assembled into one deliverable. The model produced formatted edit blocks and a single consolidated deliverable carrying all three criteria with their evidence inline.

Only then did the gate allow the close. The session was recorded as complete after five cycles, with three challenges raised and three upheld.

Why the refusals matter

Each thing the gate caught was a near-miss: a location instead of a quotation, an inference instead of a citation, a description instead of an applicable edit. A near-miss is exactly what makes raw model output unusable as an input to a later automated step. "Almost the right shape" cannot be consumed by a process that expects a known shape. The gate's function is to convert an answer into data — to force it into a form a downstream stage can take without re-interpreting it.

Left to itself, the model would have finished at the second cycle on a confident, well-written paraphrase. It was capable; it was not, unprompted, precise in the specific ways the contract required. The discipline came from outside the model. That is the part worth seeing directly.

The output, and what it is good for

The session produced three exact document edits — not a suggestion to revise, but the revisions themselves, ready to apply. The substantive finding was that the deletion decision had been wrong: the fallback path was, for a caller without credentials, the primary working path, and the staleness concern was already handled in code. The contradiction was documentation drift, and the fix was to correct the documents, not the code.

Because the output was forced into a fixed, verified shape, applying it required no judgment: the edits went into the record directly. This is the property that makes such an output usable as the input to a later stage. A problem dissolved into a predictably-shaped, evidence-backed artifact is something a subsequent decomposition step can consume mechanically. A loosely-worded answer of varying shape is not.
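"Required no judgment" has a concrete meaning for edits: each one names exact text to find and exact text to put in its place, and it either applies or it does not. The sketch below assumes a find-and-replace form for an edit block; the source does not describe the actual format, and the `Edit` class, `apply_edits`, and the sample document are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    find: str       # exact text expected in the document
    replace: str    # exact text to substitute


def apply_edits(text: str, edits: list) -> str:
    """Apply each edit only if its target occurs exactly once; otherwise fail loudly."""
    for edit in edits:
        occurrences = text.count(edit.find)
        if occurrences != 1:
            raise ValueError(f"edit not applicable: {edit.find!r} occurs {occurrences} times")
        text = text.replace(edit.find, edit.replace)
    return text


# Invented document text, for illustration only.
document = "The rate-limit fallback was deleted.\nNo other notes.\n"
edits = [Edit(find="was deleted", replace="remains in use")]

print(apply_edits(document, edits))
```

A prose description such as "update the log to reflect that the fallback still exists" cannot be passed to a function like this, which is the practical difference between a description and an applicable edit.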

On novelty, stated carefully

Letting separate model conversations exchange messages is not new. As of early 2026, parallel coding assistants coordinate through shared task lists and mailboxes, and several independent projects let model sessions message one another to hand off tasks or share findings. Those systems are for coordination among cooperating peers, and they trust each participant's output.

What is shown here is different in two respects. The mailbox is not a coordination channel between peers; it is the transport that seats a model inside an adversarial role, where a separate gate refuses its output until that output matches a contract fixed before the model spoke. And because the engine cannot tell what occupies the seat, the occupant is interchangeable: a frontier model, an API model, or a cheap open-weight model are all gated identically. The reliability is supplied by the harness, so it survives swapping an expensive model for an inexpensive one. We have not found this combination (adversarial contract-gating of an interchangeable seated model, with a mailbox as the seat transport) described elsewhere.

A note on that interchangeability, to be exact about its standing. It is not, in this project, only a design argument. The frontier model demonstrated here is the rare occupant of the seat, not the usual one: across the run of sessions to date the seat has most often been held by an inexpensive open-weight model, with several others appearing as well. The companion methodology note reports that record. For this demonstration the claim is kept narrow (the gate operated on a frontier model, in full view), and the question of how the same gate behaves across occupants is left to the companion note built to answer it.

What this is not

One session is a demonstration, not evidence of a distribution. It shows the mechanism operating cleanly once, on a real task, with the gate visibly doing work the model did not do on its own. It does not establish how often the gate catches a genuine error versus a stylistic one, nor how the dynamic changes with a weaker model in the seat. Those are measurement questions for the companion methodology note. The claim here is only that the mechanism is real and can be watched.