Continuous Eval with Harbor: The Probe That Runs the Code

Supplement to Unit 5, Part 4: Training Loop Instrumentation

Apr 29, 2026


The ValidationProbe in Part 4 scores completions against fixed multiple-choice options. This is fast, cheap, and catches distributional divergence. It doesn’t catch the failure mode where the model produces plausible-looking outputs that don’t actually work. A model approaching Mu can score its way through a multiple-choice probe by learning which completion pattern statistically follows which context stem. The probe cannot distinguish that from genuine conditional reasoning, because genuine conditional reasoning and sophisticated pattern matching look identical when your measurement instrument is a logit comparison.
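To make that limitation concrete, here is a minimal sketch of how a multiple-choice probe of this kind typically scores one item. The function name `score_item` and the hard-coded log-probabilities are hypothetical stand-ins, not the Part 4 ValidationProbe API; in a real probe the numbers would come from a forward pass over each option.

```python
# Hypothetical sketch of multiple-choice scoring; not the Part 4 ValidationProbe API.
from typing import Sequence


def score_item(option_token_logprobs: Sequence[Sequence[float]], correct_index: int) -> bool:
    """Return True if the correct option has the highest length-normalized log-probability."""
    # Average per-token log-prob so longer completions are not penalized for length.
    scores = [sum(lp) / len(lp) for lp in option_token_logprobs]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return predicted == correct_index


# Stand-in numbers: in a real probe these come from the model's logits.
option_logprobs = [
    [-0.2, -0.4, -0.3],  # option 0: the correct completion
    [-1.1, -0.9, -1.4],  # option 1: distractor
    [-2.0, -1.7],        # option 2: distractor
]
probe_hit = score_item(option_logprobs, correct_index=0)
print(probe_hit)
```

Nothing in this computation requires the model to produce the correct completion on its own. It only has to rank a pre-written answer above pre-written distractors, which is exactly what surface statistics are good at.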

Harbor, a containerized, environment-based evaluation harness, catches it because it changes what you are measuring. (Harbor is a framework from the creators of Terminal-Bench for evaluating and optimizing agents and language models.) Instead of asking “which completion did the model assign highest probability,” it asks “did the model’s output actually do the thing?” The reward comes from environment-state transitions inside a containerized sandbox. The code either runs or it does not. The bug is either fixed or it is not. The file either exists with the right content or it does not. No logit comparison, no fixed completion set. The model is an agent, the sandbox is the world, and the reward is what the world says.
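The idea of a reward that comes from running the output can be sketched in a few lines. This is an illustration only, not Harbor's or Reward Kit's API: the function `execution_reward` is a hypothetical name, and a plain subprocess is not a sandbox, so it stands in for the container here.

```python
# Hypothetical sketch of an execution-based reward; not Harbor's or Reward Kit's API.
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_reward(script_source: str, expected_stdout: str, timeout_sec: float = 10.0) -> float:
    """Return 1.0 if the script exits cleanly and prints the expected output, else 0.0."""
    with tempfile.TemporaryDirectory() as workdir:
        script_path = Path(workdir) / "candidate.py"
        script_path.write_text(script_source)
        try:
            result = subprocess.run(
                [sys.executable, str(script_path)],
                capture_output=True,
                text=True,
                timeout=timeout_sec,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # a hang counts as a failure
    if result.returncode != 0:
        return 0.0  # the code crashed
    return 1.0 if result.stdout.strip() == expected_stdout.strip() else 0.0


broken = "print(sum([1, 2, 3]) / 0)"   # code-shaped, but crashes at runtime
working = "print(sum([1, 2, 3]) / 3)"  # runs and prints 2.0
reward_broken = execution_reward(broken, "2.0")
reward_working = execution_reward(working, "2.0")
```

Both candidate strings would look equally plausible to a logit comparison. Only running them separates the two.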

What Harbor Catches That the Probe Misses

A model collapsing in this way retains the surface statistics of the training distribution. It knows that code blocks follow error messages, that variable names should be syntactically valid, that docstrings should look like docstrings. What it loses is the conditional structure: the ability to read this specific error, diagnose this specific cause, produce this specific fix. The outputs look like code even when they are not code, only samples from the marginal distribution of code-shaped text.

The ValidationProbe measures distributional alignment; Harbor measures environmental competence.

The multiple-choice probe is vulnerable to this failure because the probe’s completions were written by someone who knew the answer. The distractors were designed to be wrong in specific, recognizable ways. A model that has learned which kinds of completion statistically follow which kinds of context stem can score well on the probe long after it has lost the ability to solve novel instances of the same problem.

Harbor closes this gap by presenting novel instances every run. The sandbox generates a fresh bug. The model must read it, reason about it, and produce a fix. Reward Kit verifies the fix by running the code. There is no fixed answer to pattern-match against.

We just finished Unit 5: Training Loop Engineering. While a full deep-dive into evaluation harnesses is slated for Unit 7, we cannot effectively manage checkpoints (Unit 5, Part 5) or detect training instability (Unit 5, Part 6) using orbital sensors alone.

The ValidationProbe from Part 4 is a distributional measure; it tells us if the model still speaks the right language. Harbor is a behavioral measure; it tells us if the model can still solve a novel problem. So this is really an advanced topic pulled forward from the evaluation and benchmarking unit, because you shouldn’t save another checkpoint until you know whether the code it generates actually runs. Execution is the only ground truth.™
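One way to read that rule in a training loop is as a gate in front of the checkpoint save. The sketch below is a hypothetical illustration: `should_save_checkpoint`, the reward convention (1.0 means the trial passed), and the 0.5 threshold are all assumptions, not values from this series.

```python
# Hypothetical gating logic; the name, reward convention, and threshold are assumptions.
from typing import Sequence


def should_save_checkpoint(trial_rewards: Sequence[float], min_pass_rate: float = 0.5) -> bool:
    """Gate a checkpoint save on the fraction of behavioral trials that passed."""
    if not trial_rewards:
        return False  # no behavioral evidence, so do not save
    pass_rate = sum(1 for r in trial_rewards if r >= 1.0) / len(trial_rewards)
    return pass_rate >= min_pass_rate


healthy = should_save_checkpoint([1.0, 1.0, 0.0, 1.0])    # 3 of 4 trials passed
collapsed = should_save_checkpoint([0.0, 0.0, 1.0, 0.0])  # 1 of 4 trials passed
```

A softer variant would save the checkpoint anyway and tag it as behaviorally suspect, which keeps the artifact around for later diagnosis.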

Installation and Behavioral Specification

Install Harbor:

uv tool install harbor
uv tool install harbor-rewardkit

Initialize the task:

harbor init --task "buildai/mu-probe"

This creates the following structure, which we’ll populate:

buildai/mu-probe/
  instruction.md
  task.toml
  environment/
    Dockerfile
  solution/
    solve.sh
  tests/
    test.sh
    criteria.py
    quality.toml

`task.toml`

schema_version = "1.1"

[task]
name = "buildai/mu-probe"
description = "Debug a Python script with a seeded fault. Tests contextual reasoning under novel conditions."
authors = [{ name = "BUILD AI", email = "build@buildai.substack.com" }]
keywords = ["reasoning", "debugging", "mu-probe", "training-eval"]

[metadata]
category = "reasoning"
difficulty_explanation = "Requires reading a specific error, identifying its cause, and producing a working fix. Pattern-matching on code shape is insufficient."

[verifier]
timeout_sec = 60.0
user = "root"

[agent]
timeout_sec = 90.0
user = "agent"

[environment]
cpus = 1
memory_mb = 1024
storage_mb = 4096
gpus = 0
allow_internet = false

`environment/Dockerfile`

FROM python:3.11-slim

RUN useradd -m agent
RUN pip install --no-cache-dir numpy pytest

# Create the working directory and give the agent user ownership of it
RUN mkdir /app && chown agent:agent /app
WORKDIR /app

# Copy the fault injector into the image at build time.
# The fault itself is injected at runtime via an environment variable
# (see CMD below), so each trial gets a fresh, unpredictable instance.
COPY --chown=agent:agent inject_fault.py /usr/local/bin/inject_fault.py

# Ensure the script is executable by the agent user
RUN chmod +x /usr/local/bin/inject_fault.py

CMD ["/bin/bash", "-c", "python3 /usr/local/bin/inject_fault.py && exec bash"]

Here is the engine of the probe. It generates a fresh, seeded bug for every trial. The model cannot have seen this exact instance, only the underlying pattern.
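As a rough illustration of what "seeded" means here, the sketch below picks one fault deterministically from a seed and applies it to a known-good script. It is not the post's `inject_fault.py`: the template program, the `FAULTS` table, and the `inject_fault` function are all hypothetical.

```python
# Hypothetical sketch of seeded fault injection; not the post's inject_fault.py.
import random

CORRECT_SOURCE = '''\
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)

print(mean([2, 4, 6]))
'''

# Each fault is a (correct fragment, broken fragment) pair.
FAULTS = [
    ("total += v", "total -= v"),  # wrong operator
    ("return total / len(values)", "return total / (len(values) - 1)"),  # off-by-one
    ("total = 0", "total = None"),  # type error at runtime
]


def inject_fault(source: str, seed: int) -> str:
    """Return a copy of source with one fault chosen deterministically from the seed."""
    rng = random.Random(seed)
    correct, broken = rng.choice(FAULTS)
    assert correct in source, "fault template does not match source"
    return source.replace(correct, broken, 1)


buggy_a = inject_fault(CORRECT_SOURCE, seed=1234)
buggy_b = inject_fault(CORRECT_SOURCE, seed=1234)
```

The same seed reproduces the same bug, which makes a failing trial replayable, while a new seed per trial changes the instance. A pool of three faults is far too small to count as novel; a real injector would presumably vary the program as well as the fault.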

`environment/inject_fault.py`
