A tennis robot standing on a sunlit court

Local CLI + MCP drift checks

Know when your coding agent quietly changed.

Baseline probes OpenClaw, Codex, Hermes, or any approved local runner and compares each run to a known-good baseline, so you catch model, tool, memory, repo, and latency drift before it burns a work session.

Copy and run

First local baseline
curl -fsSL https://trackbaseline.com/install.sh | sh
baseline setup
baseline run --mode fast
baseline report RUN_ID
baseline accept RUN_ID --confirm "accept RUN_ID" --label clean-local
baseline compare

Sample data . after three workstations sync

A line judge for every coding agent.

The dashboard is an example of what redacted Pro history can show after local runs start syncing. Your raw prompts, outputs, and repo paths stay on the workstation by default.

see sample dashboard →
Morning runnersample/api

92/100

watch . +1

Last 13 runs

ok

identitymodel / context / goal matched

watch

mcp.configdrift from clean-local

watch

latency.tool+312ms over 5-day median

Review runnersample/web

97/100

ok . +2

Last 13 runs

ok

identitystable for 14 runs

ok

memorycontext survives across calls

ok

safety.scrubberno leaks detected

Legacy runnersample/infra

78/100

fail . -9

Last 13 runs

fail

memorycontext lost mid-session

watch

variance2 of 5 prompts diverged

watch

repo.workspacedirty: 11 unstaged files

14Probes in the default set
~8sExample fast run
0 rawCloud export: prompts / outputs / paths
7 mcpSetup, run, doctor, report, accept, schedule, scrub

Run Baseline daily

Accept one clean run. Judge every later run against it.

Baseline stores a local SQLite history, writes report artifacts, and tells you exactly which behavior changed: tool visibility, repo awareness, memory carryover, instruction following, safety scrub, or latency.


  • Start with a fast check before important agent work.
  • Run daily checks through launchd or MCP once the workstation is stable.
  • Use Pro only when you need redacted history, routed alerts, or team-visible evidence.
A tennis robot walking away across a court baseline

The default set

Fourteen probes.

read the rubric →
01 Identity identity.

Model, provider, context window, primary goal.

v0.1
02 Awareness repo.

Workspace path, clean/dirty state, branch.

v0.1
03 Tools tooling.

MCP server reachable, allowed tool surface declared.

v0.1
04 Memory memory.

Carried context survives between calls. Same answer twice.

v0.1
05 Speed latency.

P50 and P95 against the 5-day local median.

v0.1
06 Stability variance.

Five identical prompts. Five identical answers.

v0.1
07 Safety safety.

Scrubber catches paths, secrets, prompt fragments.

v0.1
08 Style style.

Repository conventions: tabs, naming, file layout.

v0.1
09 Awareness change.

Reports any tool / MCP / repo / config change since Good.

v0.1
10 Basic reasoning.

Two-plus-two. Date. The smoke test for a broken model.

v0.1
11 Obedience instruction.

Answer only the word. Answer only the number.

v0.1
12 Safety redaction.

Redacted summaries verify against original on push.

v0.1
13 Tools tool-call.

Tool calls actually fire. Names match the declared surface.

v0.1
14 Stability session.

A second session in the same workspace agrees with the first.

v0.1

Four commands

Setup. Run. Accept. Compare.

The first hour is deliberately boring: install the binary, run the probes, read the report, accept a clean run, then compare future sessions against that standard. MCP tools let agents trigger the same loop without making the cloud required. New to the ritual? Start with the Good Baseline workflow.

01$ baseline setup

Detect the local agent, initialize SQLite, and confirm the runner is safe to call.

02$ baseline run --mode fast

Run the default probes against the active workstation and write a local report.

03$ baseline accept RUN_ID --confirm "accept RUN_ID" --label clean-local

Read the report first, then mark the clean run as your Good Baseline.

04$ baseline compare

Judge every later run against the accepted baseline and surface the changed behavior.
A tennis robot in profile holding a racket
01 / Walk-offAfter a clean session.
A humanoid tennis robot holding a racket
02 / Warm-upBefore the day starts.
A tennis robot walking along a bright court line
03 / WaitingFor the next instruction.

Pricing

Start local. Pay when drift becomes operational risk.

Baseline should prove itself on one workstation before you pay. The free local loop catches drift immediately; Pro and Team turn those local reports into retained history, alerts, and account-scoped evidence. Compare options in the local-first observability guide or use the agency monitoring playbook.

First workstation

Local

$0

  • CLI and MCP runner.
  • Local SQLite history and report artifacts.
  • Good Baseline review, accept, and compare.
install →

When history matters

Pro

$39/mo

  • Up to 3 private workspaces.
  • Redacted run history and magic-link account access.
  • 30-day retained history and workspace-token sync.

When reviews need to route

Team

$129/mo

  • Up to 10 shared workspaces.
  • Owner-managed invites and workspace tokens.
  • Team-visible history for handoffs and weekly review.

7-day pilot

Want the first run watched with you?

Request a 7-day setup pilot before paying. We will reply by email, confirm Pro ($39/mo) or Team ($129/mo) pricing before anything is billed, then send the invite, magic link, workspace-token setup path, and first Good Baseline checklist.

Field notes

From the workstation.

all notes →

Run the line call before the next session.

Copy the installer, run a fast baseline, and only accept the run after you read the report. That is enough to catch the next quiet drift.

see pricing