MechInterp | Bayesian Framework

Research Framework

Quantifying Uncertainty in Circuit Identification

Moving beyond point estimates: we treat circuit discovery as Bayesian inference over functional hypotheses, producing a posterior distribution over each component's role and a validation framework for testing it.

Phase 1: The Core Problem

From Point Estimates to Posteriors

The Challenge: Current interpretability methods often assign a component a single label (e.g., "Induction Head") as if it were certain, ignoring polysemanticity and noise.

The Approach: We propose treating circuit discovery as Bayesian inference. Instead of asking "Is this an X?", we compute a probability distribution over a defined set of candidate functional hypotheses.

Bayes' Theorem for Circuit Hypothesis
P(Hypothesis | Data) ∝ P(Data | Hypothesis) × P(Hypothesis)
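
Written out with the normalizing constant, for a finite set of candidate hypotheses H_1, ..., H_K and data D (this indexing is our notation, introduced only for illustration), the proportionality becomes:

```latex
P(H_k \mid D) = \frac{P(D \mid H_k)\, P(H_k)}{\sum_{j=1}^{K} P(D \mid H_j)\, P(H_j)}
```

The denominator is what forces the posterior over the K hypotheses to sum to 1.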

Hypothesis Confidence Distribution

Posterior probability distribution over functional roles:

  • Induction Head: 55%
  • Previous Token: 25%
  • Duplicate Token: 15%
  • Noise: 5%

The Inference Framework

This section describes how we operationalize Bayesian inference for mechanistic interpretability. The workflow shifts the focus from finding a single explanation to mapping the landscape of possibilities.

Define Hypotheses

Formalize a set of functional candidates (e.g., Induction, Previous Token, Duplicate Token, Noise). These define the hypothesis space over which the prior P(Hypothesis) is assigned.

hypotheses = ['induction', 'prev_token', 'duplicate', 'noise']
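
A prior over this hypothesis set can be as simple as a dictionary of probabilities. The sketch below assumes a uniform prior as a starting point; `uniform_prior` and `validate_prior` are hypothetical helper names introduced here, not part of the original pipeline.

```python
hypotheses = ['induction', 'prev_token', 'duplicate', 'noise']

def uniform_prior(names):
    """Assign equal prior mass to every hypothesis (an assumed default)."""
    return {name: 1.0 / len(names) for name in names}

def validate_prior(prior, tol=1e-9):
    """Check that the prior is a proper distribution with no zero-mass entry."""
    if any(p <= 0 for p in prior.values()):
        raise ValueError("every hypothesis needs strictly positive prior mass")
    if abs(sum(prior.values()) - 1.0) > tol:
        raise ValueError("prior must sum to 1")
    return prior

prior = validate_prior(uniform_prior(hypotheses))
```

A zero-mass hypothesis can never recover probability after an update, which is why the check rejects it; an informed prior could replace the uniform one as long as it passes the same validation.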

Observe Activations

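Collect activation data across diverse inputs. These are the evidence: scoring them under each hypothesis gives the likelihood P(Data | Hypothesis) that updates our beliefs.

activations = model.run(inputs)
likelihoods = compute_likelihood(activations, hypotheses)  # one P(Data | Hypothesis) per hypothesis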
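
One way to make the likelihood step concrete is to summarize each input by a scalar score and ask how probable the observed scores are under each hypothesis. The sketch below assumes a Gaussian noise model and a hypothetical predicted mean per hypothesis; `gaussian_log_likelihood`, `compute_log_likelihoods`, `sigma`, and every numeric value are illustrative assumptions, not the source's method.

```python
import math

def gaussian_log_likelihood(observations, predicted_mean, sigma):
    """Sum over observations of log N(x | predicted_mean, sigma^2)."""
    const = -math.log(sigma) - 0.5 * math.log(2 * math.pi)
    return sum(const - 0.5 * ((x - predicted_mean) / sigma) ** 2
               for x in observations)

def compute_log_likelihoods(observations, predicted_means, sigma=0.1):
    """Return log P(Data | Hypothesis) for every hypothesis."""
    return {name: gaussian_log_likelihood(observations, mean, sigma)
            for name, mean in predicted_means.items()}

# Hypothetical scalar scores, one per input (for instance, attention mass
# on the token an induction head would attend to). Values are made up.
scores = [0.82, 0.79, 0.91, 0.75]

# Assumed mean score each hypothesis would predict; purely illustrative.
predicted_means = {'induction': 0.8, 'prev_token': 0.3,
                   'duplicate': 0.5, 'noise': 0.25}

log_likelihoods = compute_log_likelihoods(scores, predicted_means)
best = max(log_likelihoods, key=log_likelihoods.get)
```

Working in log space keeps the product over many inputs from underflowing to zero, which matters once the number of inputs grows beyond a handful.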

Compute Posterior

Generate a distribution: P(Hypothesis | Data). Example result: "60% Induction, 30% Previous Token, 10% Noise".

posterior = normalize(likelihoods * prior)  # prior holds P(Hypothesis) for each entry in hypotheses
# e.g. {'induction': 0.60, 'prev_token': 0.30, ...}
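
The normalization step can be sketched in a few lines. This version assumes the likelihoods are available as log-likelihoods and that the prior is strictly positive; the function name `compute_posterior` and the numeric inputs are hypothetical.

```python
import math

def compute_posterior(log_likelihoods, prior):
    """Combine log P(Data | H) with P(H) and normalize to get P(H | Data)."""
    # Unnormalized log posterior; prior values must be > 0 for math.log.
    log_unnorm = {h: log_likelihoods[h] + math.log(prior[h])
                  for h in log_likelihoods}
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_unnorm.values())
    weights = {h: math.exp(v - peak) for h, v in log_unnorm.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

# Made-up log-likelihoods and a uniform prior, for illustration only.
example_log_likelihoods = {'induction': -2.0, 'prev_token': -3.0,
                           'duplicate': -4.0, 'noise': -6.0}
example_prior = {h: 0.25 for h in example_log_likelihoods}

posterior = compute_posterior(example_log_likelihoods, example_prior)
```

Subtracting the largest log value first leaves the result unchanged after normalization but keeps the exponentials in a safe numeric range, even when every log-likelihood is very negative.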

Phase 2: Validation

Posterior Predictive Checks

Question: How do we distinguish a real mechanism from a statistical fluke?

The Deliverable: A validation framework analogous to the posterior predictive checks used in hierarchical modeling. We simulate data from our inferred "circuit" and compare it to the real model's behavior.
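
The simulate-and-compare idea can be sketched as a posterior predictive p-value: the fraction of simulated datasets whose summary statistic is at least as large as the observed one. Everything named here (`posterior_predictive_p_value`, `simulate_from_circuit`, the Gaussian simulator, and the scores) is a hypothetical stand-in. A full check would also draw the circuit's parameters from their posterior before each simulation, whereas this sketch keeps them fixed.

```python
import random
import statistics

def posterior_predictive_p_value(observed, simulate, statistic,
                                 n_sims=1000, seed=0):
    """Fraction of simulated datasets whose statistic is >= the observed one."""
    rng = random.Random(seed)
    t_obs = statistic(observed)
    n_exceed = 0
    for _ in range(n_sims):
        replicated = simulate(rng, len(observed))
        if statistic(replicated) >= t_obs:
            n_exceed += 1
    return n_exceed / n_sims

def simulate_from_circuit(rng, n):
    """Hypothetical stand-in for scores the inferred circuit would produce."""
    return [rng.gauss(0.8, 0.1) for _ in range(n)]

# Made-up observed scores from the real model, for illustration only.
observed_scores = [0.82, 0.79, 0.91, 0.75, 0.84, 0.77]

p_value = posterior_predictive_p_value(
    observed_scores, simulate_from_circuit, statistics.mean)
```

A p-value close to 0 or 1 means the observed statistic sits in the tail of what the inferred circuit predicts, which is evidence of misfit; values away from the extremes are consistent with the hypothesis under that statistic. Repeating the check with other statistics (variance, maximum) probes different ways the hypothesis could fail.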

Validation Scenarios

  • Scenario A: Coincidental Correlation. High variance, poor predictive power.
  • Scenario B: True Mechanism. Tight clustering, strong predictive fit.

Figure: Predictive Check: Actual vs. Simulated. X-axis: Model Prediction (Hypothesis). Y-axis: Actual Observation. Result shown: PASS (strong correlation).

Phase 1 Deliverables

  • Formal definitions of functional hypotheses and their priors
  • Computational pipeline for posterior calculation
  • Shift from binary labeling to probability distributions

Phase 2 Deliverables

  • Suite of posterior predictive checks
  • Statistical tests for "Signal vs. Noise" distinction
  • Validated hierarchical models for circuit types