Fingerprint similarity over time
Daily mean character-level edit distance (D) between Opus 4.6's
output and the pinned baseline reference, across all 30 probes.
Green zone = expected range
(within 2σ of calibration mean, ~95% of stable days land here).
Yellow zone = elevated
(2σ–3σ, worth watching).
Red zone = alarm
(beyond 3σ, very unlikely on a stable day).
Hover any point for the date and exact value.
No tick data yet. Run `uv run python -m src.worker` to populate.
Per-prompt evidence (latest tick)
Each row is one of the 30 test prompts we send to Opus 4.6 daily.
- Drift is how much today's output differs from the pinned baseline (0% = identical, 100% = completely different).
- Expected noise is the range of drift we measured during calibration — anything inside that range is normal T=0 jitter.
- Z-score is how unusual today's drift is relative to that noise (positive = more drift than expected, negative = less; only high positives are concerning).
- Status is derived from the z-score using standard statistical thresholds: <2.0 = 🟢 Stable (normal jitter), 2.0–3.0 = 🟡 Watch (mildly unusual, but with 30 prompts you'd expect 1–2 per day from chance alone), >3.0 = 🔴 Anomaly (genuinely unusual for a single prompt).

Rows are sorted most-suspicious first. Click any row to see the actual diff.
| id | category | drift | expected noise | z-score | status |
|---|---|---|---|---|---|
No tick data yet.
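The status thresholds described above are simple to state in code. A minimal sketch (the helper name `status_for_z` is ours, not necessarily the site's actual code):

```python
def status_for_z(z: float) -> str:
    """Map a per-prompt z-score to the status label used in the table."""
    if z > 3.0:
        return "🔴 Anomaly"   # genuinely unusual for a single prompt
    if z >= 2.0:
        return "🟡 Watch"     # mildly unusual; ~1-2 expected daily by chance
    return "🟢 Stable"        # normal T=0 jitter

print(status_for_z(0.7))   # 🟢 Stable
print(status_for_z(2.4))   # 🟡 Watch
print(status_for_z(3.6))   # 🔴 Anomaly
```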
Methodology
What this site does
Every day, a GitHub Actions cron calls Anthropic's API and runs a fixed
set of 30 prompts against `claude-opus-4-6` at
`temperature=0`, three times each. We measure how much each
prompt's output differs from a pinned baseline reference (recorded once
during one-time calibration), aggregate, and publish the result.
The metric
Per-prompt distance is the median normalized character-level
Levenshtein distance between today's three samples and the
baseline reference. This is implemented with
`rapidfuzz.distance.Levenshtein.normalized_distance` and
is reproducible in any language in a few lines of code.
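For readers who want to check the metric without installing rapidfuzz, here is a dependency-free sketch. The helper names are ours; rapidfuzz's C++ implementation is far faster but, with default weights, computes the same value (distance divided by the longer string's length):

```python
from statistics import median

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def per_prompt_distance(samples: list[str], baseline: str) -> float:
    """Median normalized distance of today's samples vs the pinned baseline."""
    return median(normalized_distance(s, baseline) for s in samples)

print(levenshtein("kitten", "sitting"))  # 3
```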
We chose character-level (not token-level) because the Anthropic Python
SDK no longer ships a local tokenizer (the legacy
`get_tokenizer()` was removed in SDK 0.39.0), and the
server-side `count_tokens` endpoint returns only an integer
— not the token sequence we'd need for a Levenshtein alignment. Any
local tokenization is therefore a proxy. Character-level is the
simplest, most reproducible proxy and, on our 200–400-token outputs,
offers strictly better signal-to-noise than a hash-equality metric.
The verdict
For each prompt we compute a z-score against the per-prompt noise floor measured in calibration. The site's primary indicator combines:
- An abrupt detector: the daily aggregate Z compared against empirical yellow/red thresholds (the 95th and 99.5th percentiles of the calibration Z distribution).
- A slow-drift detector: an upper-sided CUSUM control chart on the daily aggregate D, with k and h calibrated against the calibration D distribution.
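The slow-drift detector is a textbook upper-sided CUSUM. A minimal sketch under our naming (the function and signature are ours; `k` is the allowance and `h` the decision threshold, both taken from calibration):

```python
def cusum_upper(daily_d: list[float], k: float, h: float):
    """Upper-sided CUSUM on the daily aggregate distance D.

    S_t = max(0, S_{t-1} + (d_t - k)); an alarm fires once S_t exceeds h.
    Small excursions decay back to zero; sustained drift accumulates.
    """
    s, path = 0.0, []
    for d in daily_d:
        s = max(0.0, s + (d - k))
        path.append(s)
    return path, any(x > h for x in path)

# Illustrative values only: D hovers near 0.2, then drifts up to 0.5.
path, alarm = cusum_upper([0.2, 0.2, 0.5, 0.5], k=0.3, h=0.3)
print(alarm)  # True: the sustained rise accumulates past h
```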
What this site does not claim
From black-box API access, the categories "quantization change," "weight swap," and "server-side moderation rewrite" are not cleanly separable. A quantized model is a weight change. A server-side classifier rewrite can mimic either. So the site does not attempt to attribute which kind of change occurred. It detects "something changed" and shows the literal before/after evidence — visitors decide what it means.
Sensitivity floor (important)
Calibration measured the intrinsic `temperature=0`
nondeterminism of Opus 4.6 on long-form prose at roughly
20% of output characters per call, sample-to-sample.
That is the noise floor we live with. We did not pick it; it
comes from Anthropic's serving stack (mixed-precision arithmetic,
variable batching, possibly speculative decoding) and we have no
way to lower it from outside.
What this means in practice:
- Detected: dramatic changes — model swaps (a different checkpoint silently shipped under the same model ID), heavy requantization (e.g., FP16 → INT4), large fine-tuning rounds, server-side moderation rewrites that change refusal patterns. Anything that adds roughly ≥0.10 to the aggregate distance metric.
- Not detected: subtle quantization changes that stay inside the noise floor. The kind of update that would take a careful human side-by-side comparison to spot is also plausibly small enough to hide here. We are honest about this limit instead of pretending we have superhuman drift sensitivity.
We chose long-form prompts on purpose: long outputs with many
near-margin token decisions are the most sensitive to genuine
weight drift. They are also, for the same reason, the noisiest
at `temperature=0`. The trade-off is intrinsic, and
the empirical calibration thresholds (`yellow_z`,
`red_z`, `cusum.k`, `cusum.h`)
are derived from the actual measured noise distribution rather
than hardcoded — so the system flags drift relative to whatever
the noise floor turned out to be on the day of calibration.
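Deriving those empirical thresholds is a one-liner with `numpy.percentile`; here is a dependency-free sketch using the same linear interpolation as numpy's default method. The calibration values below are purely illustrative, not real measurements:

```python
def percentile(values: list[float], p: float) -> float:
    """Linear-interpolation percentile (numpy.percentile's default method)."""
    xs = sorted(values)
    idx = (p / 100) * (len(xs) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(xs) - 1)
    frac = idx - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac

# Purely illustrative calibration z-scores, NOT real data.
calib_z = [0.1, 0.4, 0.5, 0.8, 1.1, 1.3, 1.6, 1.9, 2.2, 3.0]
yellow_z = percentile(calib_z, 95.0)   # 95th percentile of calibration Z
red_z = percentile(calib_z, 99.5)      # 99.5th percentile of calibration Z
```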
For the same reason, the per-prompt drill-down on the Evidence tab will routinely show diffs even on stable days. Some drift between baseline and today's samples is normal; only aggregate drift well above the calibration distribution is meaningful. Read the diffs as supporting evidence for the aggregate verdict, not as standalone proof.
Baseline resets
The site does not auto-reset its baseline. When the verdict goes red and stays red, the evidence stays up. A human (anyone with repo access) decides whether to run a new calibration — for example, after Anthropic publishes a model update — or leave the alarm in place as a flag for an unannounced change. Old baselines stay in the repo forever under their version number, so anyone can replay the full history.
Reproducibility
Every probe, every calibration sample, every daily result, and every intermediate artifact lives in the GitHub repo as a git-tracked JSON file. Anyone can clone, fork, or replay the analysis from scratch. The dataset is the provenance.
External references
- Anthropic changelog — context for officially announced model updates.
- Source code & full dataset