NeurIPS 2025

12 Dec 2025

Attended NeurIPS 2025 in San Diego and presented three papers at the Mechanistic Interpretability Workshop:

How adversarial attacks systematically exploit interference between features represented in superposition, providing a mechanistic explanation for attack transferability and class-specific vulnerability.
How data correlations shape the geometric arrangement of superposed features, giving rise to semantically rich structures such as the ordered circles and semantic clusters observed in language models.
ContextBench, a benchmark for methods that generate targeted, linguistically fluent inputs to activate specific latent features or elicit model behaviours.

Favourite talks from the conference included Chris Olah’s account of Anthropic’s interpretability journey and Been Kim’s brilliant summary of 15 years of interpretability in 15 minutes.