NeurIPS 2023 (Spotlight)
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models
Zimmermann*, R. S., Klein*, T. and Brendel, W.
arXiv, 2023
We compare the mechanistic interpretability of vision models differing with respect to scale, architecture, training paradigm and dataset size and find that none of these design choices have any significant effect on the interpretability of individual units. We release a dataset of unit-wise interpretability scores that enables research on automated alignment.