Research

The Frequency Prior Series

A paper series on how GPT-2 Small encodes, amplifies, and yields to training-frequency priors — and what that reveals about the limits of mechanistic intervention, from attention heads down to SAE features.

Model: GPT-2 Small (124M parameters)
Platform: TransformerLens (all results from real-model inference)

Paper 1 · Live

Frequency Wins

Diagnosing the lesion

Ask GPT-2 for the capital of India with worked examples in context, and the correct answer leads through layer eight — then gets overwritten by Mumbai at layer nine. A frequency prior, amplified by an identified retrieval circuit, beats in-context evidence at scale.

›ICL accuracy peaks at n=1 (79%), degrades monotonically to 55% at n=5
›Circuit: heads L9H8, L8H11, L10H0 do both retrieval and frequency amplification
›Two failure modes: late-override (Australia, Canada) vs. early-dominance (India, Switzerland, South Africa)
›Scale sweep: persistent errors dissolve at GPT-2 XL — capacity, not architecture
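The late-override pattern described above is the kind of thing a logit-lens trace makes visible. The following is a minimal sketch on toy data, not the paper's code: the function names, the 3-token vocabulary, and the numbers are all hypothetical, and the final LayerNorm that a real logit lens applies before unembedding is left out for brevity.

```python
import numpy as np

def logit_lens_trajectory(resid_by_layer, W_U, token_a, token_b):
    """Per-layer logit difference (token_a minus token_b).

    resid_by_layer: (n_layers, d_model) residual stream at the final position.
    W_U: (d_model, d_vocab) unembedding matrix.
    Assumption of this sketch: the final LayerNorm is skipped.
    """
    logits = resid_by_layer @ W_U  # (n_layers, d_vocab)
    return logits[:, token_a] - logits[:, token_b]

def first_override_layer(diff):
    """First layer where a positive lead becomes non-positive, else None."""
    for layer in range(1, len(diff)):
        if diff[layer - 1] > 0 and diff[layer] <= 0:
            return layer
    return None

# Toy data (hypothetical, not measurements): token 0 is the correct answer,
# token 1 is the high-frequency competitor that is boosted from layer 9 on.
n_layers = 12
resid = np.zeros((n_layers, 3))
resid[:, 0] = np.linspace(0.5, 2.0, n_layers)
resid[9:, 1] = 5.0
W_U = np.eye(3)

diff = logit_lens_trajectory(resid, W_U, token_a=0, token_b=1)
override = first_override_layer(diff)
print("override layer:", override)
```

On real activations, `resid_by_layer` would come from a cached forward pass and `W_U` from the model's unembedding; an early-dominance case would show a non-positive difference from the first layers rather than a late flip.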

mechanistic-interpretability · gpt-2 · in-context-learning · activation-patching · logit-lens

Read paper →

Paper 2 · Live

Steering the Prior

Why activation steering mostly fails

Paper 1 diagnosed the lesion. This paper administers the indicated treatment and reports the trial honestly. The mechanistically derived steering vector corrects one country in five, at triple the tolerable dose, while a black-box learned vector quietly fixes the cases the interpretable one cannot.

›Difference-of-means vector at L8 resid_post: stable, real, nearly orthogonal to embeddings
›Corrects Switzerland at α=3.0, three times the upper limit of the safe operating window
›Hypothesis inverted: late-override countries resist; early-dominance partially yields
›Learned vector (same norm, same hook) corrects Australia and Canada at safe doses
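For readers unfamiliar with difference-of-means steering, here is a minimal sketch of the general technique on synthetic activations. It is not the paper's script: the helper names and the toy data are hypothetical, and the choice to normalize the direction to unit length before scaling by α is an assumption of this sketch. In TransformerLens, the L8 resid_post activations would be read at the hook named `blocks.8.hook_resid_post`.

```python
import numpy as np

def difference_of_means(acts_correct, acts_error):
    """Mean activation on correct prompts minus mean on error prompts."""
    return acts_correct.mean(axis=0) - acts_error.mean(axis=0)

def steer(resid, direction, alpha):
    """Add alpha times the unit-norm direction to each residual vector."""
    unit = direction / np.linalg.norm(direction)
    return resid + alpha * unit

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins for cached residual activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 16
shift = np.zeros(d_model)
shift[0] = 2.0  # the planted difference between the two prompt sets
acts_correct = rng.normal(size=(50, d_model)) + shift
acts_error = rng.normal(size=(50, d_model))

v = difference_of_means(acts_correct, acts_error)
resid = rng.normal(size=(4, d_model))
steered = steer(resid, v, alpha=3.0)

# The projection onto the steering direction grows by exactly alpha.
unit = v / np.linalg.norm(v)
projection_gain = (steered - resid) @ unit
print(projection_gain)
```

The same `cosine` helper is what an orthogonality check against embedding vectors would use; a learned vector would replace `v` while keeping the norm and hook fixed.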

mechanistic-interpretability · activation-steering · gpt-2 · residual-stream · negative-results

Read paper →

Paper 3 · Live

Frequency in All Directions

Is there a frequency direction?

Two more domains, the full GPT-2 family, and one question: is there a single frequency direction in the residual stream? No — the prior is clustered, the one direction that bridges domains is the morphological (demonym) attractor, and scale cures frequency bias but not frequency absence.

›The phenomenon generalizes: languages and currencies replicate the inverted ICL gradient
›No single frequency direction — the prior is structured/clustered (subspace energy 0.45)
›The demonym (morphological) attractor is the bridge across anti-aligned semantic domains
›Scale splits the prior: frequency bias clears by 345M; frequency absence survives to 1.5B
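This page does not define "subspace energy", so the sketch below shows only one plausible way to quantify whether several per-domain directions collapse onto a single axis: the share of total squared norm captured by the top singular direction of the stacked unit vectors. The metric, the function name, and the toy inputs are assumptions of this sketch and may differ from the paper's definition.

```python
import numpy as np

def top_direction_energy(directions):
    """Share of squared norm captured by the best single shared direction.

    directions: (k, d_model), one candidate frequency direction per domain.
    Returns 1.0 when all directions are collinear (sign is ignored) and
    1/k when they are mutually orthogonal.
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    s = np.linalg.svd(unit, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

# Hypothetical toy inputs, not measured directions.
collinear = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [-3.0, 0.0, 0.0]])
orthogonal = np.eye(3)

energy_collinear = top_direction_energy(collinear)
energy_orthogonal = top_direction_energy(orthogonal)
print(energy_collinear, energy_orthogonal)
```

Under this reading, a value well below 1 but above 1/k would be consistent with a clustered prior rather than a single direction or fully independent ones.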

mechanistic-interpretability · gpt-2 · generalization · in-context-learning · residual-stream

Read paper →

Paper 4 · Live

The Morphology Circuit That Isn't

One circuit, every attractor

I went looking for a separate circuit behind the morphological (demonym) attractor and there isn't one — the same heads that amplify famous cities amplify country adjectives too. The frequency-amplification circuit is attractor-agnostic, which explains why Paper 3's demonym vector was the bridge across domains.

›Shared circuit: demonym and semantic attractors use the same heads L8H11/L9H8/L10H0 (effect-vector cosine 0.90)
›No depth gap — Paper 3's early-demonym crossover was a measurement artifact
›One nuance: the demonym attractor uniquely recruits induction-style attention (−41%)
›Clean head-ablation blocked by entanglement — bias and retrieval are inseparable (Paper 2, again)
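A shared-circuit claim of this kind can be checked by comparing per-head effect maps for the two attractor types. The sketch below assumes each map is a (layer, head) grid of patching effects and compares them by cosine similarity and by their top-ranked heads. The effect values are made up for illustration and are not the paper's measurements; the function names are hypothetical.

```python
import numpy as np

def effect_cosine(effects_a, effects_b):
    """Cosine similarity between two flattened (layer, head) effect maps."""
    a, b = effects_a.ravel(), effects_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_heads(effects, k=3):
    """The k (layer, head) pairs with the largest absolute effect."""
    order = np.argsort(np.abs(effects).ravel())[::-1][:k]
    return [tuple(int(i) for i in np.unravel_index(idx, effects.shape))
            for idx in order]

# Illustrative 12x12 effect maps (hypothetical numbers).
semantic = np.zeros((12, 12))
semantic[8, 11], semantic[9, 8], semantic[10, 0] = 1.0, 1.5, 0.8

demonym = np.zeros((12, 12))
demonym[8, 11], demonym[9, 8], demonym[10, 0] = 0.9, 1.2, 1.0
demonym[5, 5] = 0.3  # a small effect present in only one map

similarity = effect_cosine(semantic, demonym)
shared = set(top_heads(semantic)) & set(top_heads(demonym))
print(similarity, sorted(shared))
```

A high cosine together with an identical top-head set is what "same circuit" would look like in this representation; a head that matters for only one attractor shows up as a coordinate that lowers the cosine.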

mechanistic-interpretability · gpt-2 · circuits · path-patching · negative-results

Read paper →

Paper 5 · Live

The Resolution Ladder

The bias comes apart — one rung at a time

Heads couldn't separate bias from retrieval (Papers 2, 4). Directions couldn't either (Papers 2, 3). SAE features — the finest lens available — finally do it: currency-demonym errors sign-flip under a 3-feature ablation with zero collateral. But only that slice. The semantic early-dominance cases survive every resolution tried across five papers, and India actively backfires under the fix.

›First bias–retrieval separation in the series: currency-demonym errors sign-flip under 3-feature SAE ablation (Denmark, Sweden, Norway genuinely corrected)
›The separation is scoped: currencies-only — the same features have literally zero effect on language-domain demonyms
›Semantic early-dominance is unreached at every rung — heads, directions, features — and India backfires (+0.96) under the intervention meant to fix it
›The causal features are not the geometric direction (avg |cos| ≈ 0.12): levers and signatures are different objects
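To make the feature-ablation step concrete, here is a minimal sketch of ablating a handful of SAE features in a residual vector, using a randomly initialized toy autoencoder rather than a trained one. The ReLU encoder form, the practice of adding the reconstruction error back, and all names and shapes are assumptions of this sketch, not details confirmed by the paper. The second helper computes the mean absolute cosine between the ablated features' decoder rows and a reference direction, the kind of comparison behind the "levers versus signatures" point.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Feature activations: ReLU of an affine map of the residual vector."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    """Reconstruction of the residual vector from feature activations."""
    return f @ W_dec + b_dec

def ablate_features(x, feature_ids, W_enc, b_enc, W_dec, b_dec):
    """Zero the chosen features; add the SAE error back so nothing else changes."""
    ids = np.asarray(feature_ids, dtype=int)
    f = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(f, W_dec, b_dec)
    f_ablated = f.copy()
    f_ablated[..., ids] = 0.0
    return sae_decode(f_ablated, W_dec, b_dec) + error

def mean_abs_cosine(W_dec, feature_ids, direction):
    """Mean |cos| between chosen decoder rows and a reference direction."""
    rows = W_dec[np.asarray(feature_ids, dtype=int)]
    cos = rows @ direction / (np.linalg.norm(rows, axis=1) * np.linalg.norm(direction))
    return float(np.mean(np.abs(cos)))

# Toy SAE with random weights (hypothetical; a real SAE would be trained).
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)

f = sae_encode(x, W_enc, b_enc)
ids = np.argsort(f)[-3:]  # the three most active features on this input
patched = ablate_features(x, ids, W_enc, b_enc, W_dec, b_dec)
alignment = mean_abs_cosine(W_dec, ids, direction=rng.normal(size=d_model))
print(patched, alignment)
```

Because the error term is added back, the patched vector differs from the original only by the removed features' decoder contributions, which is what lets collateral effects be attributed to those features alone.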

mechanistic-interpretability · gpt-2 · sparse-autoencoders · sae-features · causal-ablation · negative-results

Read paper →

Companion Tools

The Frequency Prior Explorer · New

Open app

Interactive companion app for the whole series. Walk the Resolution Ladder — heads → directions → SAE features — run simulated causal interventions on real measured data, trace logit-lens trajectories per country, and browse every verified Paper 5 finding with plain-language explainers.

React · Recharts · SAE Features · Interactive · No-install

Attn Flow

Open app

Browser-native attention flow visualizer ported from a C terminal tool. Watch probability mass move through GPT-2's residual stream in real time — particle simulation showing token competition at each layer, color-coded by token identity.

Canvas 2D · GPT-2 · Visualization · No-install

All experiments run on GPT-2 Small via TransformerLens. Code in new_experiments/ — steering scripts, logit lens tools, battery runner, and scale sweep utilities.