Quantifying Uncertainty in
Circuit Identification
Moving beyond point estimates. Treating circuit discovery as Bayesian inference (updating the probability of each hypothesis as evidence accumulates, via Bayes' theorem) over functional hypotheses, delivering robust posterior distributions (the probability of each hypothesis after observing the data, combining prior and likelihood) and rigorous validation frameworks.
From Point Estimates to Posteriors
The Challenge: Current interpretability methods often label a component (e.g., "Induction Head", an attention head that implements in-context learning by copying patterns seen earlier in the sequence) with absolute certainty, ignoring polysemanticity (a single neuron or circuit responding to multiple, unrelated features, making interpretation ambiguous) and noise.
The Approach: We propose treating circuit discovery as Bayesian inference. Instead of asking "Is this an X?", we compute a probability distribution (probabilities assigned to every candidate outcome, summing to 1) over all potential functional hypotheses.
Interactive Comparison
Hypothesis Confidence Distribution
Viewing probability distribution over functional roles.
Posterior Probability Distribution — \(P(\mathcal{H}_i \mid \mathcal{D})\)
Adjust Prior Beliefs (priors: your initial belief about each hypothesis before observing any data, encoding domain knowledge or assumptions)
Drag the sliders to set prior weights for each hypothesis. Values are auto-normalized.
Adjust Evidence, Watch Posteriors Update
Simulate how different types of observed evidence shift the posterior distribution in real time.
Evidence Strengths (likelihoods: \(P(\mathcal{D} \mid \mathcal{H})\), how probable the observed data is assuming a particular hypothesis is true)
Live Posterior Distribution: \(P(\mathcal{H} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mathcal{H})\,P(\mathcal{H})\), the core output of Bayesian inference, combining prior beliefs with new evidence.
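The interactive demo above can be sketched in a few lines. This is a minimal illustration, not the page's actual implementation: the hypothesis names and slider values below are hypothetical placeholders.

```python
import math

def bayes_update(priors, likelihoods):
    """Combine prior weights with likelihoods P(D|H) into a posterior.

    Raw slider values are accepted: priors are normalized first, then
    the unnormalized posterior P(D|H) * P(H) is renormalized to sum to 1.
    """
    total = sum(priors.values())
    priors = {h: p / total for h, p in priors.items()}
    unnorm = {h: likelihoods[h] * priors[h] for h in priors}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

def entropy_bits(dist):
    """Shannon entropy of a distribution in bits; 0 means certainty."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical slider settings, mirroring the demo.
priors = {"induction": 2.0, "copy": 1.0, "noise": 1.0}
likelihoods = {"induction": 0.8, "copy": 0.3, "noise": 0.1}
posterior = bayes_update(priors, likelihoods)
```

Dragging a likelihood slider corresponds to changing one entry of `likelihoods` and recomputing; the entropy readout shrinks as the posterior concentrates on a single hypothesis.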
Entropy: —
The Inference Framework
How we operationalize Bayesian inference for mechanistic interpretability. This workflow shifts the focus from finding a single explanation to mapping the landscape of possibilities.
Define Hypotheses
Formalize a set of functional candidates (e.g., Induction, Previous Token, Translation, Noise). These form the prior space \(\{\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_n\}\): the complete set of hypotheses and their initial probability weights before any data is observed.
Observe Activations
Collect activation data across diverse inputs. Compute \(P(\mathcal{D} \mid \mathcal{H}_i)\) to update our beliefs.
likelihoods = compute_likelihood(activations)
Compute Posterior
Generate a distribution: \(P(\mathcal{H} \mid \mathcal{D})\). Result: "60% Induction, 30% Copy, 10% Noise".
# {'induction': 0.60, 'copy': 0.30, ...}
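The three-step workflow can be sketched end to end. Everything concrete here is an assumption for illustration: the activation templates, the isotropic-Gaussian likelihood, and the `sigma` width are hypothetical stand-ins for whatever likelihood model a real pipeline would fit.

```python
import numpy as np

# Hypothetical activation templates: the feature vector each
# functional hypothesis predicts the component should produce.
TEMPLATES = {
    "induction": np.array([0.9, 0.1, 0.0]),
    "copy":      np.array([0.1, 0.9, 0.0]),
    "noise":     np.array([0.33, 0.33, 0.34]),
}

def compute_likelihood(activations, sigma=0.2):
    """P(D | H_i) under an isotropic Gaussian around each template."""
    return {
        h: float(np.exp(-np.sum((activations - t) ** 2) / (2 * sigma ** 2)))
        for h, t in TEMPLATES.items()
    }

def posterior(activations, priors):
    """Bayes' rule: normalize likelihood * prior over all hypotheses."""
    likelihoods = compute_likelihood(activations)
    unnorm = {h: likelihoods[h] * priors[h] for h in priors}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

obs = np.array([0.85, 0.15, 0.0])  # observed feature vector
uniform = {h: 1 / 3 for h in TEMPLATES}
print(posterior(obs, uniform))
```

With an observation close to the induction template, the posterior concentrates on "induction" while still assigning residual mass to the alternatives, which is exactly the shift from binary labeling to a distribution.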
Posterior Predictive Checks
Question: How do we distinguish a real mechanism from a statistical fluke?
The Deliverable: A validation framework analogous to hierarchical modeling checks. We simulate data from our inferred "circuit" and compare it to the real model's behavior.
Select Validation Scenario
Predictive Check: Actual vs. Simulated
PASS: Strong Correlation
Model Prediction \(\hat{y}\)
Actual Observation \(y\)
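A minimal version of such a check might look like the following. The correlation threshold, the sample size, and the synthetic "faithful circuit" data are all hypothetical; a real check would simulate from the inferred circuit and score against the model's actual outputs.

```python
import numpy as np

def posterior_predictive_check(simulated, actual, r_threshold=0.9):
    """Compare outputs simulated from the inferred circuit with the real
    model's outputs; PASS when Pearson correlation clears the threshold."""
    r = np.corrcoef(simulated, actual)[0, 1]
    return {"r": float(r), "pass": bool(r >= r_threshold)}

# Hypothetical scenario: a faithful circuit tracks the real model's
# outputs up to small noise, so the check should pass.
rng = np.random.default_rng(0)
actual = rng.normal(size=200)              # real model's outputs
simulated = actual + rng.normal(scale=0.1, size=200)  # circuit's outputs
result = posterior_predictive_check(simulated, actual)
```

A statistical fluke would instead produce simulated outputs uncorrelated with the model's behavior, failing the check even if it looked plausible on the original dataset.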
Phase 1 Deliverables
- [+] Formal definitions of functional hypothesis priors
- [+] Computational pipeline for posterior calculation
- [+] Shift from binary labeling to probability distributions
Phase 2 Deliverables
- [+] Suite of posterior predictive checks
- [+] Statistical tests for "Signal vs. Noise" distinction
- [+] Validated hierarchical models for circuit types