Quantifying Uncertainty in Circuit Identification
Moving beyond point estimates: we treat circuit discovery as Bayesian inference over functional hypotheses, producing posterior distributions and a validation framework for checking them.
From Point Estimates to Posteriors
The Challenge: Current interpretability methods often assign a component a single label (e.g., "Induction Head") with no stated uncertainty, ignoring polysemanticity and noise.
The Approach: We propose treating circuit discovery as Bayesian inference. Instead of asking "Is this an X?", we compute the probability distribution over all potential functional hypotheses.
P(Hypothesis | Data) ∝ P(Data | Hypothesis) × P(Hypothesis)
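For a finite set of candidate hypotheses, the proportionality above becomes an equality once the right-hand side is normalized. The symbols H_i (the i-th functional hypothesis) and D (the observed activation data) are notation introduced here for illustration:

```latex
P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{\sum_{j} P(D \mid H_j)\, P(H_j)}
```

The denominator sums over every hypothesis in the candidate set, so the posterior probabilities add up to 1.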
[Interactive chart: Hypothesis Confidence Distribution, showing the posterior probability distribution over functional roles]
The Inference Framework
How we operationalize Bayesian inference for mechanistic interpretability. This workflow shifts the focus from finding a single explanation to mapping the landscape of possibilities.
Define Hypotheses
Formalize a set of functional candidates (e.g., Induction, Previous Token, Translation, Noise). These form the hypothesis space over which the prior is defined.
Observe Activations
Collect activation data across diverse inputs. Treat these as evidence: the likelihood of the observed activations under each hypothesis is what updates our prior beliefs.
likelihoods = compute_likelihood(activations)
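One way a likelihood step like this could look is sketched below. It is not the project's actual pipeline. It assumes each hypothesis predicts a mean value for a scalar summary of the component's activations, with Gaussian noise. The function name compute_log_likelihood, the table HYPOTHESIS_MEANS, the value of SIGMA, and the example scores are all hypothetical placeholders.

```python
import math

# Hypothetical: the mean score each hypothesis predicts for a scalar summary
# of the component's activations. These numbers are illustrative, not measured.
HYPOTHESIS_MEANS = {"induction": 0.8, "copy": 0.5, "noise": 0.0}
SIGMA = 0.3  # assumed shared noise scale


def gaussian_log_pdf(x, mean, sigma):
    """Log density of a normal distribution evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mean) ** 2 / (2 * sigma**2)


def compute_log_likelihood(scores):
    """Return log P(Data | Hypothesis) per hypothesis.

    Observations are assumed independent, so their log densities add.
    """
    return {
        name: sum(gaussian_log_pdf(x, mean, SIGMA) for x in scores)
        for name, mean in HYPOTHESIS_MEANS.items()
    }


# Made-up scores for one component across four inputs.
scores = [0.75, 0.9, 0.6, 0.85]
log_likelihoods = compute_log_likelihood(scores)
best = max(log_likelihoods, key=log_likelihoods.get)
```

Working in log space avoids numerical underflow when many observations are multiplied together.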
Compute Posterior
Generate a distribution: P(Hypothesis | Data). Result: "60% Induction, 30% Copy, 10% Noise".
# {'induction': 0.60, 'copy': 0.30, ...}
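The posterior step can be sketched as a normalization of prior times likelihood. This is a minimal illustration assuming a small discrete hypothesis set. The function name compute_posterior, the uniform prior, and the example likelihood values are assumptions, chosen so that the result is 0.6, 0.3 and 0.1 for three hypotheses.

```python
import math


def compute_posterior(log_likelihoods, priors):
    """Combine log-likelihoods with priors and normalize to a posterior.

    Both arguments are dicts keyed by hypothesis name; priors must be > 0.
    """
    log_post = {h: log_likelihoods[h] + math.log(priors[h]) for h in priors}
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post.values())
    weights = {h: math.exp(v - m) for h, v in log_post.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Illustrative inputs: a uniform prior and made-up likelihood values.
priors = {"induction": 1 / 3, "copy": 1 / 3, "noise": 1 / 3}
log_likelihoods = {
    "induction": math.log(0.6),
    "copy": math.log(0.3),
    "noise": math.log(0.1),
}
posterior = compute_posterior(log_likelihoods, priors)
```

With a uniform prior the posterior is just the normalized likelihood. A non-uniform prior would shift mass toward hypotheses considered more plausible before seeing the data.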
Posterior Predictive Checks
Question: How do we distinguish a real mechanism from a statistical fluke?
The Deliverable: A validation framework analogous to hierarchical modeling checks. We simulate data from our inferred "circuit" and compare it to the real model's behavior.
[Interactive chart: Predictive Check, Actual vs. Simulated. Series: Model Prediction (Hypothesis) and Actual Observation. Example status: PASS, strong correlation.]
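A simulate-and-compare check of this kind might be sketched as follows. The sketch assumes a discrete posterior over hypotheses and a simple Gaussian generative model for each one. The posterior weights, means, noise scale, and observed scores are hypothetical, and the one-sided p-value on the sample mean is one simple choice of test statistic, not the project's stated method.

```python
import random
import statistics

# Hypothetical posterior and per-hypothesis generative parameters.
POSTERIOR = {"induction": 0.6, "copy": 0.3, "noise": 0.1}
MEANS = {"induction": 0.8, "copy": 0.5, "noise": 0.0}
SIGMA = 0.3


def simulate_dataset(rng, n):
    """Draw a hypothesis from the posterior, then simulate n scores from it."""
    names = list(POSTERIOR)
    h = rng.choices(names, weights=[POSTERIOR[k] for k in names])[0]
    return [rng.gauss(MEANS[h], SIGMA) for _ in range(n)]


def predictive_p_value(observed, statistic=statistics.mean, n_reps=2000, seed=0):
    """Fraction of simulated datasets whose statistic is >= the observed one."""
    rng = random.Random(seed)
    t_obs = statistic(observed)
    t_rep = [statistic(simulate_dataset(rng, len(observed))) for _ in range(n_reps)]
    return sum(t >= t_obs for t in t_rep) / n_reps


observed = [0.75, 0.9, 0.6, 0.85]  # made-up scores from the real model
p_value = predictive_p_value(observed)

# Data far outside what any hypothesis predicts should give an extreme p-value.
p_mismatch = predictive_p_value([3.0, 3.0, 3.0, 3.0])
```

A p-value near 0 or 1 means the inferred circuit rarely reproduces the observed statistic, which flags a poor fit. Values away from both extremes are consistent with the model, although they do not prove it is correct.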
Phase 1 Deliverables
- [+] Formal definitions of functional hypotheses and their priors
- [+] Computational pipeline for posterior calculation
- [+] Shift from binary labeling to probability distributions
Phase 2 Deliverables
- [+] Suite of posterior predictive checks
- [+] Statistical tests for "Signal vs. Noise" distinction
- [+] Validated hierarchical models for circuit types