Edward Stevinson

Hi, I’m Ed, a research scientist and PhD candidate at the CIRCLE group at Imperial College London, where I work on mechanistic interpretability and adversarial robustness. My research centres on understanding the representation geometry of neural networks and how this shapes adversarial vulnerability.

Find me on Twitter, Google Scholar, GitHub and LinkedIn. Please reach out by email if you want to talk about any research!

news

Jun 11, 2026	Our paper looking at how mechanistic knowledge can predict vulnerabilities was accepted to the Mechanistic Interpretability workshop at ICML 2026.
May 23, 2026	Our paper was accepted to the Pluralistic Alignment and Trustworthy AI for Good workshops at ICML 2026.
May 08, 2026	Recognised as a Gold Reviewer at ICML 2026
Apr 30, 2026	Paper on how superposition creates adversarial vulnerability accepted at ICML 2026 – read it on arXiv
Jan 26, 2026	Our paper on feature geometry was accepted at ICLR 2026.
Jan 26, 2026	Our LASR Labs paper, ContextBench, was accepted at ICLR 2026
Dec 12, 2025	Two spotlight papers at the Mechanistic Interpretability workshop, NeurIPS 2025
Sep 15, 2025	Outstanding paper award at NeSy 2025 for our paper on probabilistic neuro-symbolic robustness verification
Mar 30, 2025	Our paper on the ViT-Prisma toolkit, an open-source mechanistic interpretability library for vision models, was accepted at the MIV workshop at CVPR 2025
Nov 25, 2024	2nd place in the Apart AI safety hackathon for work on detecting adversarial prompt vulnerabilities
May 22, 2024	Took the opposition in an EA debate at Imperial entitled Is AI an Existential Risk?

selected publications

ICML
Adversarial Vulnerability from Interference Between Features in Superposition

Edward Stevinson, Lucas Prieto, Melih Barsbey, and Tolga Birdal

In International Conference on Machine Learning (ICML), 2026

Why do adversarial examples exist, and why do they transfer between models? Existing explanations appeal to high-dimensional geometry, non-robust patterns in the input, and decision boundary structure, but none provides a representation-level mechanism that explains why specific perturbations succeed and why attacks transfer between models. In this paper, we show that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, vulnerability can arise from superposition – the phenomenon where networks represent more concepts than they have dimensions, forcing non-orthogonal representation and thus interference. This interference causes perturbations targeting one representation to affect others, creating vulnerabilities determined by interference patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. The resulting attacks are predictable: PGD-discovered perturbations align with theoretically optimal perturbations derived from the interference geometry. Models trained on similar data develop similar interference patterns, explaining attack transferability. We then show that successful attacks on image classifiers exhibit the structure predicted by our proposed mechanism. These findings reveal that adversarial vulnerability can be a byproduct of networks’ representational compression, complementing existing explanations based on data properties or architectural factors.
@inproceedings{stevinson2026superposition,
  title = {Adversarial Vulnerability from Interference Between Features in Superposition},
  author = {Stevinson, Edward and Prieto, Lucas and Barsbey, Melih and Birdal, Tolga},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2026},
}
ICLR
ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation

Edward Stevinson, Robert Graham, Leo Richter, Alexander Chia, Joseph Miller, and Joseph Isaac Bloom

In International Conference on Learning Representations (ICLR), 2026

Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench – a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.
@inproceedings{stevinson2026contextbench,
  title = {ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation},
  author = {Stevinson, Edward and Graham, Robert and Richter, Leo and Chia, Alexander and Miller, Joseph and Bloom, Joseph Isaac},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2026},
}
ICLR
From Data Statistics to Feature Geometry: How Correlations Shape Superposition

Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, and Pedro A.M. Mediano

In International Conference on Learning Representations (ICLR), 2026

A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non-linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag-of-Words Superposition (BOWS), a controlled setting to encode binary bag-of-words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co-activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition.
@inproceedings{stevinson2026bows,
  title = {From Data Statistics to Feature Geometry: How Correlations Shape Superposition},
  author = {Prieto, Lucas and Stevinson, Edward and Barsbey, Melih and Birdal, Tolga and Mediano, Pedro A.M.},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2026},
}