Research
The Frequency Prior Trilogy
Three papers on how GPT-2 Small encodes, amplifies, and yields to training-frequency priors — and what that reveals about the limits of mechanistic intervention.
Frequency Wins
Diagnosing the lesion
Ask GPT-2 Small for the capital of India with worked examples in context, and the correct answer leads through layer eight, then gets overwritten by Mumbai at layer nine. A frequency prior, amplified by an identified retrieval circuit, beats the in-context evidence.
- ICL accuracy peaks at n=1 (79%), then degrades monotonically to 55% at n=5
- Circuit: heads L9H8, L8H11, L10H0 perform both retrieval and frequency amplification
- Two failure modes: late-override (Australia, Canada) vs. early-dominance (India, Switzerland, South Africa)
- Scale sweep: persistent errors dissolve at GPT-2 XL, pointing to capacity rather than architecture
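The layer-by-layer picture above (correct answer ahead through layer eight, overwritten at layer nine) can be read off a logit-lens trace. The sketch below is a minimal illustration, not the papers' tooling: it assumes you already have, for each layer, the logit of the correct token minus the logit of the competing token. The function name `override_layer` and the example values are hypothetical.

```python
from typing import Optional, Sequence


def override_layer(logit_diff: Sequence[float]) -> Optional[int]:
    """Return the first layer from which the correct answer trails for good.

    logit_diff[i] is assumed to be (correct-token logit) minus
    (competitor-token logit), read out with a logit lens after layer i.
    Returns None if the correct answer is still ahead at the final layer.
    """
    flip = None
    for layer, diff in enumerate(logit_diff):
        if diff < 0:
            if flip is None:
                flip = layer  # candidate start of the final losing streak
        else:
            flip = None  # correct answer regained the lead, so reset
    return flip


# Hypothetical per-layer values for illustration only (not measured data).
example = [0.1, 0.3, 0.4, 0.6, 0.9, 1.1, 1.0, 0.8, 0.5, -0.7, -1.2, -1.5]
print(override_layer(example))  # prints 9
```

Reporting the start of the final losing streak, rather than the first negative value, avoids flagging a brief early dip that the model later recovers from.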
Steering the Prior
Why activation steering mostly fails
Paper 1 diagnosed the lesion. This paper administered the indicated treatment and reports the trial honestly. The mechanistically-derived steering vector corrects one country in five, at triple the tolerable dose, while a black-box learned vector quietly fixes the cases the interpretable one cannot.
- Difference-of-means vector at L8 resid_post: stable, real, nearly orthogonal to embeddings
- Corrects Switzerland at α=3.0, three times the upper limit of the safe operating window
- Hypothesis inverted: late-override countries resist; early-dominance countries partially yield
- Learned vector (same norm, same hook) corrects Australia and Canada at safe doses
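As a rough illustration of the steering arithmetic, here is a minimal sketch on plain Python lists. It makes several assumptions not stated in this summary: that the difference-of-means vector is the mean activation of one prompt set minus the mean of a contrasting set, that steering adds α times the vector to the activation at the hook, and that "same norm" means rescaling one vector to the other's Euclidean norm. All function names are hypothetical, and the numbers are toy values, not data from the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(set_a: Sequence[Sequence[float]],
                        set_b: Sequence[Sequence[float]]) -> Vector:
    """Mean activation of set_a minus mean activation of set_b."""
    mean_a, mean_b = mean_vector(set_a), mean_vector(set_b)
    return [a - b for a, b in zip(mean_a, mean_b)]


def norm(v: Sequence[float]) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


def rescale_to_norm(v: Sequence[float], target_norm: float) -> Vector:
    """Rescale v to a given norm, e.g. to match another steering vector."""
    current = norm(v)
    if current == 0:
        raise ValueError("cannot rescale a zero vector")
    return [x * target_norm / current for x in v]


def steer(activation: Sequence[float], direction: Sequence[float],
          alpha: float) -> Vector:
    """Add alpha times the steering direction to one activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy 2-dimensional activations for illustration only.
set_a = [[1.0, 2.0], [3.0, 4.0]]
set_b = [[0.0, 0.0], [2.0, 0.0]]
direction = difference_of_means(set_a, set_b)
steered = steer([0.0, 0.0], direction, alpha=3.0)
print(direction, steered)
```

In a real run the same addition would happen inside a forward hook at the chosen residual-stream position; the list version only shows the arithmetic.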
Frequency in All Directions
Does the mechanism generalize?
A task battery spanning languages, currencies, chemical elements, and authors — and an unplanned discovery that complicates the taxonomy. The question is now whether there is a frequency prior direction in GPT-2's residual stream at all, or only a family of mode-specific directions.
- Languages and currencies replicate the inverted ICL gradient from Paper 1
- New attractor class discovered: morphological/demonym (Brazil → "Brazilian")
- Transfer test: do Paper 2 steering vectors carry over across domains?
- Three attractor classes: semantic-prominence, morphological, exemplar-copy
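One cheap way to probe the "single direction or family of directions" question is to compare per-domain directions by cosine similarity. This is offered only as a sketch of a complementary check, not as the transfer test the paper runs; the domain labels, function names, and vectors below are hypothetical placeholders.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_cosines(
    directions: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Cosine similarity for every unordered pair of named directions."""
    return {
        (a, b): cosine(directions[a], directions[b])
        for a, b in combinations(directions, 2)
    }


# Toy 2-dimensional directions for illustration only (not measured data).
toy_directions = {
    "capitals": [1.0, 0.0],
    "currencies": [1.0, 1.0],
    "elements": [0.0, 2.0],
}
similarities = pairwise_cosines(toy_directions)
for pair, value in similarities.items():
    print(pair, round(value, 3))
```

Values near 1 across all pairs would be consistent with one shared direction, while values near 0 would point toward separate mode-specific directions. In a high-dimensional residual stream random vectors are already close to orthogonal, so a baseline against random directions would be needed before drawing conclusions.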
Apart from the scale sweep, all experiments run on GPT-2 Small via TransformerLens. Code is in new_experiments/: steering scripts, logit lens tools, battery runner, and scale sweep utilities.