MechInterp | Bayesian Framework

Research Framework

Quantifying Uncertainty in
Circuit Identification

Moving beyond point estimates. Treating circuit discovery as Bayesian inference (updating the probability of each hypothesis as evidence accumulates, via Bayes' theorem) over functional hypotheses, delivering robust posterior distributions (the probability of each hypothesis after observing the data) and rigorous validation frameworks.

Phase 1: The Core Problem

From Point Estimates to Posteriors

The Challenge: Current interpretability methods often label a component with absolute certainty (e.g., "this is an induction head," an attention head that implements in-context learning by copying patterns seen earlier in the sequence), ignoring polysemanticity (a single neuron or circuit responding to multiple, unrelated features) and noise.

The Approach: We propose treating circuit discovery as Bayesian inference. Instead of asking "Is this an X?", we compute a probability distribution over all candidate functional hypotheses, updating it as evidence accumulates rather than committing to a single answer.

Bayes' Theorem for Circuit Hypothesis
$$P(\mathcal{H} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{H}) \cdot P(\mathcal{H})}{P(\mathcal{D})}$$
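As an illustrative plug-in of numbers (uniform priors of 0.25 over four hypotheses, and the evidence strengths used in the live demo below), the posterior for the induction-head hypothesis works out to:

$$P(\mathcal{H}_{\text{ind}} \mid \mathcal{D}) = \frac{0.80 \cdot 0.25}{0.80 \cdot 0.25 + 0.40 \cdot 0.25 + 0.30 \cdot 0.25 + 0.10 \cdot 0.25} = \frac{0.20}{0.40} = 0.50$$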

Interactive Comparison

Hypothesis Confidence Distribution

Viewing the probability distribution over functional roles, \(P(\mathcal{H}_i \mid \mathcal{D})\):

  • Induction Head: 55%
  • Previous Token: 25%
  • Duplicate Token: 15%
  • Noise: 5%
Interactive Controls

Adjust Prior Beliefs (prior probability: your initial belief about each hypothesis before observing any data; priors encode domain knowledge or assumptions).

Drag the sliders to set prior weights for each hypothesis. Values are auto-normalized, with a uniform default of 25% per hypothesis.
Live Inference Demo

Adjust Evidence, Watch Posteriors Update

Simulate how different types of observed evidence shift the posterior distribution in real time.

Evidence strengths (likelihoods, \(P(\mathcal{D} \mid \mathcal{H}_i)\): how probable the observed data is, assuming a particular hypothesis is true):

  • Induction Head: 0.80
  • Previous Token: 0.40
  • Duplicate Token: 0.30
  • Noise: 0.10

The live posterior distribution, \(P(\mathcal{H} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mathcal{H}) \cdot P(\mathcal{H})\), is recomputed for all four hypotheses as the sliders move, along with its entropy as a summary of remaining uncertainty.
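The demo's update rule fits in a few lines of Python. This is a minimal sketch, not library code: the helper names are ours, and the numbers mirror the default uniform priors and the demo's evidence strengths.

```python
import math

def posterior(priors, likelihoods):
    """Bayes' rule over a discrete hypothesis set: normalize prior x likelihood."""
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(unnorm)  # the evidence P(D)
    return [u / z for u in unnorm]

def entropy_bits(dist):
    """Shannon entropy of the posterior, in bits; higher = more uncertain."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

hypotheses = ['induction', 'prev_token', 'duplicate', 'noise']
priors = [0.25, 0.25, 0.25, 0.25]       # uniform prior (demo default)
likelihoods = [0.80, 0.40, 0.30, 0.10]  # demo evidence strengths

post = posterior(priors, likelihoods)
print(dict(zip(hypotheses, post)))  # induction dominates: 0.20 / 0.40 = 0.50
print(entropy_bits(post))
```

Because the prior is uniform, the posterior is just the normalized likelihood vector; non-uniform priors would reweight it before normalization.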

The Inference Framework

How we operationalize Bayesian inference for mechanistic interpretability. This workflow shifts the focus from finding a single explanation to mapping the landscape of possibilities.

Define Hypotheses

Formalize a set of functional candidates (e.g., Induction, Previous Token, Translation, Noise). These form the prior space \(\{\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_n\}\): the complete set of hypotheses and their initial probability weights before any data is observed.

hypotheses = ['induction', 'prev_token', 'duplicate', 'noise']

Observe Activations

Collect activation data across diverse inputs. Compute \(P(\mathcal{D} \mid \mathcal{H}_i)\) to update our beliefs.

activations = model.run(inputs)
likelihoods = compute_likelihood(activations)

Compute Posterior

Generate a distribution: \(P(\mathcal{H} \mid \mathcal{D})\). Result: "60% Induction, 30% Copy, 10% Noise".

posterior = normalize(likelihoods * prior)
# {'induction': 0.60, 'copy': 0.30, ...}
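End to end, the three steps above compose into a short pipeline. The sketch below is illustrative only: `toy_model_run` and the Gaussian activation model stand in for a real transformer and a real likelihood model, and every name here is our assumption, not an established API.

```python
import math
import random

HYPOTHESES = ['induction', 'prev_token', 'duplicate', 'noise']

# Step 1: define hypotheses and priors (uniform here; could encode domain knowledge)
priors = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}

# Hypothetical activation model: the mean activation each hypothesis predicts
PREDICTED_MEAN = {'induction': 0.9, 'prev_token': 0.5, 'duplicate': 0.4, 'noise': 0.0}
SIGMA = 0.3

def toy_model_run(n_inputs, seed=0):
    """Stand-in for model.run(inputs): synthetic activations from an induction-like head."""
    rng = random.Random(seed)
    return [rng.gauss(0.9, SIGMA) for _ in range(n_inputs)]

def log_likelihood(activations, hypothesis):
    """Step 2: log P(D | H) under an assumed Gaussian activation model."""
    mu = PREDICTED_MEAN[hypothesis]
    const = math.log(SIGMA * math.sqrt(2 * math.pi))
    return sum(-0.5 * ((a - mu) / SIGMA) ** 2 - const for a in activations)

def compute_posterior(activations):
    """Step 3: P(H | D) ∝ P(D | H) · P(H), normalized in log space for stability."""
    logs = {h: log_likelihood(activations, h) + math.log(priors[h]) for h in HYPOTHESES}
    m = max(logs.values())
    unnorm = {h: math.exp(v - m) for h, v in logs.items()}
    z = sum(unnorm.values())
    return {h: u / z for h, u in unnorm.items()}

activations = toy_model_run(50)
print({h: round(p, 3) for h, p in compute_posterior(activations).items()})
```

Working in log space matters in practice: with many observations, raw likelihood products underflow to zero long before the posterior ratios become meaningless.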
Phase 2: Validation

Posterior Predictive Checks

Question: How do we distinguish a real mechanism from a statistical fluke?

The Deliverable: A validation framework analogous to hierarchical modeling checks. We simulate data from our inferred "circuit" and compare it to the real model's behavior.

Select Validation Scenario

  • Scenario A: Coincidental Correlation. High variance, poor predictive power.
  • Scenario B: True Mechanism. Tight clustering, strong predictive fit.

Predictive Check: Actual vs. Simulated

The check plots model predictions \(\hat{y}\) (x-axis) against actual observations \(y\) (y-axis); a tight fit along the diagonal yields PASS: Strong Correlation.
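A minimal version of this check can be sketched as follows. The test statistic (correlation between the inferred circuit's simulated behavior and the real model's behavior), the 0.9 cutoff, and all function names are our illustrative choices, not a prescribed procedure.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between predicted and observed behavior."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def predictive_check(simulate, actual, n_inputs=200, threshold=0.9, seed=0):
    """Simulate behavior from the inferred circuit and compare it to the real model.

    A true mechanism should reproduce the model's outputs (high correlation);
    a coincidental correlation should not. `threshold` is an illustrative cutoff."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-1, 1) for _ in range(n_inputs)]
    y_hat = [simulate(x) for x in inputs]
    y = [actual(x) for x in inputs]
    r = pearson_r(y_hat, y)
    return r, r >= threshold

rng = random.Random(1)

# Scenario B: the inferred circuit truly explains the behavior (small noise)
true_mech = lambda x: 2.0 * x + 0.05 * rng.gauss(0, 1)
r, passed = predictive_check(lambda x: 2.0 * x, true_mech)
print(f"Scenario B: r = {r:.3f}, PASS = {passed}")

# Scenario A: a coincidental correlation with high variance fails the check
fluke = lambda x: 2.0 * x + 2.0 * rng.gauss(0, 1)
r, passed = predictive_check(lambda x: 2.0 * x, fluke)
print(f"Scenario A: r = {r:.3f}, PASS = {passed}")
```

Richer versions of this check would replicate the simulation many times and report a Bayesian p-value for the chosen discrepancy statistic rather than a single correlation threshold.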

Phase 1 Deliverables

  • [+] Formal definitions of priors over functional hypotheses
  • [+] Computational pipeline for posterior calculation
  • [+] Shift from binary labeling to probability distributions

Phase 2 Deliverables

  • [+] Suite of posterior predictive checks
  • [+] Statistical tests for "Signal vs. Noise" distinction
  • [+] Validated hierarchical models for circuit types