Two papers accepted at ICLR 2026:
ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation — A benchmark for methods that generate linguistically fluent inputs to activate specific latent features or elicit model behaviours. We develop two novel enhancements to Evolutionary Prompt Optimisation — LLM assistance and diffusion-model inpainting — and achieve state-of-the-art performance in balancing elicitation and fluency.
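To give a flavour of the general idea behind evolutionary prompt optimisation, here is a toy sketch. It is not the paper's method: the elicitation and fluency objectives below are hypothetical stand-ins (a target-word check and a vocabulary-overlap score) for the real latent-activation and language-model-fluency measurements.

```python
import random

# Toy evolutionary prompt optimisation: mutate a population of candidate
# prompts and select under a combined elicitation + fluency objective.
# Both objectives here are simplistic stand-ins, not the paper's.

TARGET = "weather"
VOCAB = ["tell", "me", "about", "the", "weather", "today", "please", "now"]

def elicitation(prompt):
    # Stand-in for measuring latent activation / behaviour elicitation.
    return 1.0 if TARGET in prompt.split() else 0.0

def fluency(prompt):
    # Stand-in for an LM-based fluency score: fraction of in-vocab words.
    words = prompt.split()
    return sum(w in VOCAB for w in words) / max(len(words), 1)

def score(prompt, lam=0.5):
    # Balance elicitation against fluency via a trade-off weight lam.
    return elicitation(prompt) + lam * fluency(prompt)

def mutate(prompt, rng):
    # Replace one randomly chosen word with a random vocabulary word.
    words = prompt.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)

def evolve(seed_prompt, generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [seed_prompt] * pop_size
    for _ in range(generations):
        children = [mutate(p, rng) for p in population]
        # Keep the highest-scoring prompts from parents and children.
        population = sorted(population + children, key=score, reverse=True)[:pop_size]
    return population[0]

best = evolve("tell me about xqz zzz")
```

The paper's enhancements would slot into this loop: LLM assistance as a smarter mutation operator, and diffusion-model inpainting as a way of proposing fluent replacements in context.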
Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition — We introduce the Bag-of-Words Superposition (BOWS) framework, revealing that semantically rich structures observed in language models (ordered circles, semantic clusters) can emerge from linear superposition driven by data correlations and compression, without requiring non-linear computation.
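As a rough illustration of the claim — that purely linear compression of correlated data can place correlated features in similar directions — the following toy example (not the BOWS framework itself) generates bag-of-words samples in which two "words" co-occur, compresses them with PCA, and compares the resulting feature directions:

```python
import numpy as np

# Toy illustration: linear compression (PCA) of correlated bag-of-words
# data gives co-occurring words geometrically similar feature directions.
# This is a hypothetical sketch, not the paper's BOWS framework.

rng = np.random.default_rng(0)
n_samples, n_words, n_dims = 5000, 6, 3

# Words 0 and 1 co-occur via a shared latent "topic"; words 2-5 are
# independent background noise (each present with probability 0.1).
topic = rng.random(n_samples) < 0.5
X = (rng.random((n_samples, n_words)) < 0.1).astype(float)
X[:, 0] = np.maximum(X[:, 0], topic)
X[:, 1] = np.maximum(X[:, 1], topic)

# Compress to n_dims via PCA: each word gets a row of the loading matrix
# as its feature direction in the compressed space.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:n_dims].T  # shape (n_words, n_dims)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

corr_sim = cos(W[0], W[1])   # correlated pair: near-parallel directions
indep_sim = cos(W[2], W[3])  # independent pair: no forced alignment
```

Under superposition, the correlated pair is pushed into nearly the same direction by the compression alone — no non-linear computation is involved — which is the mechanism the paper studies at scale.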