Papers
arxiv:2507.07776

SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples

Published on Jul 10, 2025
Authors:
,
,
,
,
,
,

Abstract

An open-source framework called SCOOTER is introduced to evaluate unrestricted adversarial attacks through large-scale human studies and provide statistical insights into imperceptibility, revealing discrepancies between human perception and automated vision systems.

AI-generated summary

Unrestricted adversarial attacks aim to fool computer vision models without being constrained by ell_p-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: (i) best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; (ii) the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; (iii) open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; (iv) an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.07776 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.07776 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.07776 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.