DJC
Dennis J. Carroll

Research

The Frequency Prior Trilogy

Three papers on how GPT-2 Small encodes, amplifies, and yields to training-frequency priors — and what that reveals about the limits of mechanistic intervention.

Model: GPT-2 Small (124M parameters)
Platform: TransformerLens; all results from real-model inference

Paper 1 · Live

Frequency Wins

Diagnosing the lesion

Ask GPT-2 for the capital of India with worked examples in context, and the correct answer leads through layer eight — then gets overwritten by Mumbai at layer nine. A frequency prior, amplified by an identified retrieval circuit, beats in-context evidence at scale.

  • ICL accuracy peaks at n=1 (79%), degrades monotonically to 55% at n=5
  • Circuit: heads L9H8, L8H11, L10H0 do both retrieval and frequency amplification
  • Two failure modes: late-override (Australia, Canada) vs. early-dominance (India, Switzerland, South Africa)
  • Scale sweep: persistent errors dissolve at GPT-2 XL — capacity, not architecture
mechanistic-interpretability · gpt-2 · in-context-learning · activation-patching · logit-lens
Read paper →
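
The layer-by-layer reading in this summary (correct answer ahead through layer eight, overridden at layer nine) is the kind of trace a logit lens produces. The sketch below shows that readout in NumPy on synthetic activations rather than on GPT-2. The function names, the toy unembedding matrix, and the 12-layer trace are illustrative assumptions, not the paper's code or data, and the layer norm here omits the learned scale and bias of GPT-2's final LayerNorm.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale or bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def logit_lens_diff(resid, W_U, correct_id, rival_id):
    """Per-layer logit(correct) minus logit(rival), read off the residual stream.

    resid: [n_layers, d_model] residual stream at the final token position
    W_U:   [d_model, vocab] unembedding matrix
    """
    logits = layer_norm(resid) @ W_U
    return logits[:, correct_id] - logits[:, rival_id]

def first_override_layer(diff):
    """First layer where the rival leads after the correct answer led earlier, else None."""
    led = False
    for layer, d in enumerate(diff):
        if d > 0:
            led = True
        elif d < 0 and led:
            return layer
    return None

def toy_trace(n_layers=12, override_at=9):
    """Synthetic stand-in for a real cache: the rival direction grows at `override_at`."""
    correct_dir = np.array([1.0, -1.0, 0.0, 0.0])
    rival_dir = np.array([0.0, 0.0, 1.0, -1.0])
    W_U = np.stack([correct_dir, rival_dir], axis=1)  # [d_model=4, vocab=2]
    resid = np.stack([
        correct_dir + (2.0 if layer >= override_at else 0.5) * rival_dir
        for layer in range(n_layers)
    ])
    return resid, W_U

resid, W_U = toy_trace()
diff = logit_lens_diff(resid, W_U, correct_id=0, rival_id=1)
print("override layer:", first_override_layer(diff))
```

With a real model, `resid` would be gathered from cached residual-stream activations at each layer and `W_U` would be the model's own unembedding. The toy values are chosen only to mimic the late-override pattern.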
Paper 2 · Live

Steering the Prior

Why activation steering mostly fails

Paper 1 diagnosed the lesion. This paper administers the indicated treatment and reports the trial honestly. The mechanistically derived steering vector corrects one country in five, at triple the tolerable dose, while a black-box learned vector quietly fixes the cases the interpretable one cannot.

  • Difference-of-means vector at L8 resid_post: stable, real, nearly orthogonal to embeddings
  • Corrects Switzerland at α=3.0, three times the upper limit of the safe operating window
  • Hypothesis inverted: late-override countries resist; early-dominance countries partially yield
  • Learned vector (same norm, same hook) corrects Australia and Canada at safe doses
mechanistic-interpretability · activation-steering · gpt-2 · residual-stream · negative-results
Read paper →
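
The difference-of-means construction and the α-scaled intervention named in this summary can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic arrays, not the paper's steering scripts. The helper names are hypothetical, and the choice of contrast groups (here just "positive" and "negative" activation sets) is an assumption, since the summary does not say which prompts were contrasted.

```python
import numpy as np

def diff_of_means(acts_pos, acts_neg):
    """Steering direction: mean activation over one prompt set minus the mean over another.

    acts_pos, acts_neg: [n_prompts, d_model] activations collected at one hook point.
    """
    return acts_pos.mean(axis=0) - acts_neg.mean(axis=0)

def match_norm(v, reference):
    """Rescale v to the L2 norm of `reference`, so two vectors can be compared at equal dose."""
    return v * (np.linalg.norm(reference) / np.linalg.norm(v))

def steer(resid, v, alpha):
    """Add alpha * v to every position of a residual-stream activation [n_pos, d_model]."""
    return resid + alpha * v

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic demo: the two sets differ only by a known shift along one axis.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 4))
shift = np.array([2.0, 0.0, 0.0, 0.0])
v = diff_of_means(base + shift, base)

# A second, hypothetical vector rescaled to the same norm for a like-for-like comparison.
other = match_norm(np.array([0.0, 3.0, 4.0, 0.0]), v)

print("steering vector:", np.round(v, 6))
print("cosine with other vector:", round(cosine(v, other), 6))
print("steered activation:", steer(np.zeros((1, 4)), v, alpha=3.0))
```

In TransformerLens, assuming its standard hook naming, the layer-8 activation would be read and modified at `blocks.8.hook_resid_post`. Matching norms before comparing two vectors keeps α meaningful as a dose across both.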
Paper 3 · Coming Soon

Frequency in All Directions

Does the mechanism generalize?

A task battery spanning languages, currencies, chemical elements, and authors, plus an unplanned discovery that complicates the taxonomy. The question is now whether GPT-2's residual stream contains a single frequency-prior direction at all, or only a family of mode-specific directions.

  • Languages and currencies replicate the inverted ICL gradient from Paper 1
  • New attractor class discovered: morphological/demonym (Brazil → "Brazilian")
  • Transfer test: do Paper 2 steering vectors carry over across domains?
  • Three attractor classes: semantic-prominence, morphological, exemplar-copy
mechanistic-interpretability · gpt-2 · generalization · in-context-learning

All experiments run on GPT-2 Small via TransformerLens. Code in new_experiments/ — steering scripts, logit lens tools, battery runner, and scale sweep utilities.

© 2026 Dennis J. Carroll. All rights reserved.