MechInterp | Bayesian Framework

Research Framework

Quantifying Uncertainty in
Circuit Identification

Moving beyond point estimates. Treating circuit discovery as Bayesian inference (updating the probability of each hypothesis as evidence accumulates, via Bayes' theorem) over functional hypotheses, delivering robust posterior distributions (the probability of each hypothesis after observing the data) and rigorous validation frameworks.

Phase 1: The Core Problem

From Point Estimates to Posteriors

The Challenge: Current interpretability methods often label a component with absolute certainty (e.g., "this is an induction head," an attention head that implements in-context learning by copying patterns seen earlier in the sequence), ignoring polysemanticity (a single neuron or circuit responding to multiple, unrelated features) and noise.

The Approach: We propose treating circuit discovery as Bayesian inference. Instead of asking "Is this an X?", we compute a probability distribution over all candidate functional hypotheses, updating it as evidence accumulates rather than committing to a single answer.

Bayes' Theorem for Circuit Hypothesis
$$P(\mathcal{H} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{H}) \cdot P(\mathcal{H})}{P(\mathcal{D})}$$
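As an illustrative plug-in of numbers (uniform priors of 0.25 over four hypotheses, and the evidence strengths used in the live demo below), the posterior for the induction-head hypothesis works out to:

$$P(\mathcal{H}_{\text{ind}} \mid \mathcal{D}) = \frac{0.80 \cdot 0.25}{0.80 \cdot 0.25 + 0.40 \cdot 0.25 + 0.30 \cdot 0.25 + 0.10 \cdot 0.25} = \frac{0.20}{0.40} = 0.50$$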

Interactive Comparison

Hypothesis Confidence Distribution

Viewing the probability distribution over functional roles, \(P(\mathcal{H}_i \mid \mathcal{D})\):

  • Induction Head: 55%
  • Previous Token: 25%
  • Duplicate Token: 15%
  • Noise: 5%
Interactive Controls

Adjust Prior Beliefs (prior probability: your initial belief about each hypothesis before observing any data; priors encode domain knowledge or assumptions).

Drag the sliders to set prior weights for each hypothesis. Values are auto-normalized, with a uniform default of 25% per hypothesis.
Live Inference Demo

Adjust Evidence, Watch Posteriors Update

Simulate how different types of observed evidence shift the posterior distribution in real time.

Evidence strengths (likelihoods, \(P(\mathcal{D} \mid \mathcal{H}_i)\): how probable the observed data is, assuming a particular hypothesis is true):

  • Induction Head: 0.80
  • Previous Token: 0.40
  • Duplicate Token: 0.30
  • Noise: 0.10

The live posterior distribution, \(P(\mathcal{H} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mathcal{H}) \cdot P(\mathcal{H})\), is recomputed for all four hypotheses as the sliders move, along with its entropy as a summary of remaining uncertainty.
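The demo's update rule fits in a few lines of Python. This is a minimal sketch, not library code: the helper names are ours, and the numbers mirror the default uniform priors and the demo's evidence strengths.

```python
import math

def posterior(priors, likelihoods):
    """Bayes' rule over a discrete hypothesis set: normalize prior x likelihood."""
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(unnorm)  # the evidence P(D)
    return [u / z for u in unnorm]

def entropy_bits(dist):
    """Shannon entropy of the posterior, in bits; higher = more uncertain."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

hypotheses = ['induction', 'prev_token', 'duplicate', 'noise']
priors = [0.25, 0.25, 0.25, 0.25]       # uniform prior (demo default)
likelihoods = [0.80, 0.40, 0.30, 0.10]  # demo evidence strengths

post = posterior(priors, likelihoods)
print(dict(zip(hypotheses, post)))  # induction dominates: 0.20 / 0.40 = 0.50
print(entropy_bits(post))
```

Because the prior is uniform, the posterior is just the normalized likelihood vector; non-uniform priors would reweight it before normalization.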

The Inference Framework

How we operationalize Bayesian inference for mechanistic interpretability. This workflow shifts the focus from finding a single explanation to mapping the landscape of possibilities.

Define Hypotheses

Formalize a set of functional candidates (e.g., Induction, Previous Token, Translation, Noise). These form the prior space \(\{\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_n\}\): the complete set of hypotheses and their initial probability weights before any data is observed.

hypotheses = ['induction', 'prev_token', 'duplicate', 'noise']

Observe Activations

Collect activation data across diverse inputs. Compute \(P(\mathcal{D} \mid \mathcal{H}_i)\) to update our beliefs.

activations = model.run(inputs)
likelihoods = compute_likelihood(activations)

Compute Posterior

Generate a distribution: \(P(\mathcal{H} \mid \mathcal{D})\). Result: "60% Induction, 30% Copy, 10% Noise".

posterior = normalize(likelihoods * prior)
# {'induction': 0.60, 'copy': 0.30, ...}
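End to end, the three steps above compose into a short pipeline. The sketch below is illustrative only: `toy_model_run` and the Gaussian activation model stand in for a real transformer and a real likelihood model, and every name here is our assumption, not an established API.

```python
import math
import random

HYPOTHESES = ['induction', 'prev_token', 'duplicate', 'noise']

# Step 1: define hypotheses and priors (uniform here; could encode domain knowledge)
priors = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}

# Hypothetical activation model: the mean activation each hypothesis predicts
PREDICTED_MEAN = {'induction': 0.9, 'prev_token': 0.5, 'duplicate': 0.4, 'noise': 0.0}
SIGMA = 0.3

def toy_model_run(n_inputs, seed=0):
    """Stand-in for model.run(inputs): synthetic activations from an induction-like head."""
    rng = random.Random(seed)
    return [rng.gauss(0.9, SIGMA) for _ in range(n_inputs)]

def log_likelihood(activations, hypothesis):
    """Step 2: log P(D | H) under an assumed Gaussian activation model."""
    mu = PREDICTED_MEAN[hypothesis]
    const = math.log(SIGMA * math.sqrt(2 * math.pi))
    return sum(-0.5 * ((a - mu) / SIGMA) ** 2 - const for a in activations)

def compute_posterior(activations):
    """Step 3: P(H | D) ∝ P(D | H) · P(H), normalized in log space for stability."""
    logs = {h: log_likelihood(activations, h) + math.log(priors[h]) for h in HYPOTHESES}
    m = max(logs.values())
    unnorm = {h: math.exp(v - m) for h, v in logs.items()}
    z = sum(unnorm.values())
    return {h: u / z for h, u in unnorm.items()}

activations = toy_model_run(50)
print({h: round(p, 3) for h, p in compute_posterior(activations).items()})
```

Working in log space matters in practice: with many observations, raw likelihood products underflow to zero long before the posterior ratios become meaningless.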
Phase 2: Validation

Posterior Predictive Checks

Question: How do we distinguish a real mechanism from a statistical fluke?

The Deliverable: A validation framework analogous to hierarchical modeling checks. We simulate data from our inferred "circuit" and compare it to the real model's behavior.

Select Validation Scenario

  • Scenario A: Coincidental Correlation. High variance, poor predictive power.
  • Scenario B: True Mechanism. Tight clustering, strong predictive fit.

Predictive Check: Actual vs. Simulated

The check plots model predictions \(\hat{y}\) (x-axis) against actual observations \(y\) (y-axis); a tight fit along the diagonal yields PASS: Strong Correlation.
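A minimal version of this check can be sketched as follows. The test statistic (correlation between the inferred circuit's simulated behavior and the real model's behavior), the 0.9 cutoff, and all function names are our illustrative choices, not a prescribed procedure.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between predicted and observed behavior."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def predictive_check(simulate, actual, n_inputs=200, threshold=0.9, seed=0):
    """Simulate behavior from the inferred circuit and compare it to the real model.

    A true mechanism should reproduce the model's outputs (high correlation);
    a coincidental correlation should not. `threshold` is an illustrative cutoff."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-1, 1) for _ in range(n_inputs)]
    y_hat = [simulate(x) for x in inputs]
    y = [actual(x) for x in inputs]
    r = pearson_r(y_hat, y)
    return r, r >= threshold

rng = random.Random(1)

# Scenario B: the inferred circuit truly explains the behavior (small noise)
true_mech = lambda x: 2.0 * x + 0.05 * rng.gauss(0, 1)
r, passed = predictive_check(lambda x: 2.0 * x, true_mech)
print(f"Scenario B: r = {r:.3f}, PASS = {passed}")

# Scenario A: a coincidental correlation with high variance fails the check
fluke = lambda x: 2.0 * x + 2.0 * rng.gauss(0, 1)
r, passed = predictive_check(lambda x: 2.0 * x, fluke)
print(f"Scenario A: r = {r:.3f}, PASS = {passed}")
```

Richer versions of this check would replicate the simulation many times and report a Bayesian p-value for the chosen discrepancy statistic rather than a single correlation threshold.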

Phase 1 Deliverables

  • [+] Formal definitions of priors over functional hypotheses
  • [+] Computational pipeline for posterior calculation
  • [+] Shift from binary labeling to probability distributions

Phase 2 Deliverables

  • [+] Suite of posterior predictive checks
  • [+] Statistical tests for "Signal vs. Noise" distinction
  • [+] Validated hierarchical models for circuit types