Publications

NeurIPS 2023 (Spotlight)
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

Zimmermann*, R. S., Klein*, T., and Brendel, W.

NeurIPS, 2023

We compare the mechanistic interpretability of vision models that differ in scale, architecture, training paradigm, and dataset size, and find that none of these design choices has a significant effect on the interpretability of individual units. We release a dataset of unit-wise interpretability scores that enables research on automated alignment.

Access the paper

ICML 2023 (Oral)
Provably Learning Object-Centric Representations

Brady*, J., Zimmermann*, R. S., Sharma*, Y., Schölkopf, B., von Kügelgen, J., and Brendel, W.

ICML, 2023

We analyze when object-centric representations can be learned without supervision and introduce two assumptions, compositionality and irreducibility, under which ground-truth object representations can provably be identified.

Access the paper

NeurIPS 2022
Increasing Confidence in Adversarial Robustness Evaluations

Zimmermann, R. S., Brendel, W., Tramèr, F., and Carlini, N.

NeurIPS, 2022

We propose a test that enables researchers to find flawed adversarial robustness evaluations. Passing our test produces compelling evidence that the attacks used have sufficient power to evaluate the model’s robustness.

Access the paper

Workshop
Score-Based Generative Classifiers

Zimmermann, R. S., Schott, L., Song, Y., Dunn, B. A., and Klindt, D. A.

arXiv, 2021

We evaluate score-based generative models as classifiers on CIFAR-10 and find that they achieve good accuracy and likelihoods but offer no adversarial robustness.

Access the workshop paper

ICLR 2021
Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization

Borowski*, J., Zimmermann*, R. S., Schepers, J., Geirhos, R., Wallis, T. S. A., Bethge, M., and Brendel, W.

ICLR, 2021

Using human psychophysical experiments, we show that natural images can be significantly more informative for interpreting neural network activations than synthetic feature visualizations.

Access the paper

Workshop
Natural images are more informative for interpreting CNN activations than synthetic feature visualizations

Borowski*, J., Zimmermann*, R. S., Schepers, J., Geirhos, R., Wallis, T. S. A., Bethge, M., and Brendel, W.

NeurIPS 2020 Workshop: Shared Visual Representations in Human & Machine Intelligence, 2020

Using human psychophysical experiments, we show that natural images can be significantly more informative for interpreting neural network activations than synthetic feature visualizations.

Access the workshop paper