Fingerprint similarity over time
Daily mean character-level edit distance (D) between Opus 4.6's
output and the pinned baseline reference, across all 30 probes.
Green zone = expected range
(within 2σ of calibration mean, ~95% of stable days land here).
Yellow zone = elevated
(2σ–3σ, worth watching).
Red zone = alarm
(beyond 3σ, very unlikely on a stable day).
Hover any point for the date and exact value.
No tick data yet. Run `uv run python -m src.worker` to populate.
Per-prompt evidence (latest tick)
Each row is one of the 30 test prompts we send to Opus 4.6 daily.
- Drift is how much today's output differs from the pinned baseline (0% = identical, 100% = completely different).
- Expected noise is the range of drift we measured during calibration — anything inside that range is normal T=0 jitter.
- Z-score is how unusual today's drift is relative to that noise (positive = more drift than expected, negative = less; only high positives are concerning).
- Status is derived from the z-score using standard statistical thresholds: <2.0 = 🟢 Stable (normal jitter), 2.0–3.0 = 🟡 Watch (mildly unusual, but with 30 prompts you'd expect 1–2 per day from chance alone), >3.0 = 🔴 Anomaly (genuinely unusual for a single prompt).

Rows are sorted most-suspicious first. Click any row to see the actual diff.
| id | category | drift | expected noise | z-score | status |
|---|---|---|---|---|---|
No tick data yet.
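The status thresholds described above are simple to state in code. A minimal sketch (the helper name `status_for_z` is ours, not necessarily the site's actual code):

```python
def status_for_z(z: float) -> str:
    """Map a per-prompt z-score to the status label used in the table."""
    if z > 3.0:
        return "🔴 Anomaly"   # genuinely unusual for a single prompt
    if z >= 2.0:
        return "🟡 Watch"     # mildly unusual; ~1-2 expected daily by chance
    return "🟢 Stable"        # normal T=0 jitter

print(status_for_z(0.7))   # 🟢 Stable
print(status_for_z(2.4))   # 🟡 Watch
print(status_for_z(3.6))   # 🔴 Anomaly
```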
Methodology
What this site does
Every day, a GitHub Actions cron calls Anthropic's API and runs a fixed
set of 30 prompts against `claude-opus-4-6` at
`temperature=0`, three times each. We measure how much each
prompt's output differs from a pinned baseline reference (recorded once
during one-time calibration), aggregate, and publish the result.
The metric
Per-prompt distance is the median normalized character-level
Levenshtein distance between today's three samples and the
baseline reference. This is implemented with
`rapidfuzz.distance.Levenshtein.normalized_distance` and
is reproducible in any language in a few lines of code.
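For readers who want to check the metric without installing rapidfuzz, here is a dependency-free sketch. The helper names are ours; rapidfuzz's C++ implementation is far faster but, with default weights, computes the same value (distance divided by the longer string's length):

```python
from statistics import median

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def per_prompt_distance(samples: list[str], baseline: str) -> float:
    """Median normalized distance of today's samples vs the pinned baseline."""
    return median(normalized_distance(s, baseline) for s in samples)

print(levenshtein("kitten", "sitting"))  # 3
```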
We chose character-level (not token-level) because the Anthropic Python
SDK no longer ships a local tokenizer (the legacy
`get_tokenizer()` was removed in SDK 0.39.0), and the
server-side `count_tokens` endpoint returns only an integer
— not the token sequence we'd need for a Levenshtein alignment. Any
local tokenization is therefore a proxy. Character-level is the
simplest, most reproducible proxy and, on our 200–400-token outputs,
offers strictly better signal-to-noise than a hash-equality metric.
The verdict
For each prompt we compute a z-score against the per-prompt noise floor measured in calibration. The site's primary indicator combines:
- An abrupt detector: the daily aggregate Z compared against empirical yellow/red thresholds (the 95th and 99.5th percentiles of the calibration Z distribution).
- A slow-drift detector: an upper-sided CUSUM control chart on the daily aggregate D, with k and h calibrated against the calibration D distribution.
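The slow-drift detector is a textbook upper-sided CUSUM. A minimal sketch under our naming (the function and signature are ours; `k` is the allowance and `h` the decision threshold, both taken from calibration):

```python
def cusum_upper(daily_d: list[float], k: float, h: float):
    """Upper-sided CUSUM on the daily aggregate distance D.

    S_t = max(0, S_{t-1} + (d_t - k)); an alarm fires once S_t exceeds h.
    Small excursions decay back to zero; sustained drift accumulates.
    """
    s, path = 0.0, []
    for d in daily_d:
        s = max(0.0, s + (d - k))
        path.append(s)
    return path, any(x > h for x in path)

# Illustrative values only: D hovers near 0.2, then drifts up to 0.5.
path, alarm = cusum_upper([0.2, 0.2, 0.5, 0.5], k=0.3, h=0.3)
print(alarm)  # True: the sustained rise accumulates past h
```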
What this site does not claim
From black-box API access, the categories "quantization change," "weight swap," and "server-side moderation rewrite" are not cleanly separable. A quantized model is a weight change. A server-side classifier rewrite can mimic either. So the site does not attempt to attribute which kind of change occurred. It detects "something changed" and shows the literal before/after evidence — visitors decide what it means.
Sensitivity floor (important)
Calibration measured the intrinsic `temperature=0`
nondeterminism of Opus 4.6 on long-form prose at roughly
20% of output characters per call, sample-to-sample.
That is the noise floor we live with. We did not pick it; it
comes from Anthropic's serving stack (mixed-precision arithmetic,
variable batching, possibly speculative decoding) and we have no
way to lower it from outside.
What this means in practice:
- Detected: dramatic changes — model swaps (a different checkpoint silently shipped under the same model ID), heavy requantization (e.g., FP16 → INT4), large fine-tuning rounds, server-side moderation rewrites that change refusal patterns. Anything that adds roughly ≥0.10 to the aggregate distance metric.
- Not detected: subtle quantization changes that stay inside the noise floor. The kind of update that would take a careful human side-by-side comparison to spot is also plausibly small enough to hide here. We are honest about this limit instead of pretending we have superhuman drift sensitivity.
We chose long-form prompts on purpose: long outputs with many
near-margin token decisions are the most sensitive to genuine
weight drift. They are also, for the same reason, the noisiest
at `temperature=0`. The trade-off is intrinsic, and
the empirical calibration thresholds (`yellow_z`,
`red_z`, `cusum.k`, `cusum.h`)
are derived from the actual measured noise distribution rather
than hardcoded — so the system flags drift relative to whatever
the noise floor turned out to be on the day of calibration.
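Deriving those empirical thresholds is a one-liner with `numpy.percentile`; here is a dependency-free sketch using the same linear interpolation as numpy's default method. The calibration values below are purely illustrative, not real measurements:

```python
def percentile(values: list[float], p: float) -> float:
    """Linear-interpolation percentile (numpy.percentile's default method)."""
    xs = sorted(values)
    idx = (p / 100) * (len(xs) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(xs) - 1)
    frac = idx - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac

# Purely illustrative calibration z-scores, NOT real data.
calib_z = [0.1, 0.4, 0.5, 0.8, 1.1, 1.3, 1.6, 1.9, 2.2, 3.0]
yellow_z = percentile(calib_z, 95.0)   # 95th percentile of calibration Z
red_z = percentile(calib_z, 99.5)      # 99.5th percentile of calibration Z
```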
For the same reason, the per-prompt drill-down on the Evidence tab will routinely show diffs even on stable days. Some drift between baseline and today's samples is normal; only aggregate drift well above the calibration distribution is meaningful. Read the diffs as supporting evidence for the aggregate verdict, not as standalone proof.
Baseline resets
The site does not auto-reset its baseline. When the verdict goes red and stays red, the evidence stays up. A human (anyone with repo access) decides whether to run a new calibration — for example, after Anthropic publishes a model update — or leave the alarm in place as a flag for an unannounced change. Old baselines stay in the repo forever under their version number, so anyone can replay the full history.
Reproducibility
Every probe, every calibration sample, every daily result, and every intermediate artifact lives in the GitHub repo as a git-tracked JSON file. Anyone can clone, fork, or replay the analysis from scratch. The dataset is the provenance.
External references
- Anthropic changelog — context for officially announced model updates.
- Source code & full dataset