AutoAttacker: A reinforcement learning approach for black-box adversarial attacks

Title: AutoAttacker: A reinforcement learning approach for black-box adversarial attacks
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Tsingenopoulos, Ilias; Preuveneers, Davy; Joosen, Wouter
Conference Name: 2019 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)
Date Published: June 2019
Keywords: adversarial example discovery, adversarial-machine-learning, AutoAttacker, Black Box Security, black-box adversarial attacks, black-box model, black-box-attack, classifier attack, composability, cryptography, data mining, learning (artificial intelligence), machine learning model, Metrics, pattern classification, perturbed inputs, pubcrawl, reinforcement learning, reinforcement-learning, resilience, Resiliency, white-box access
Abstract: Recent research has shown that machine learning models are susceptible to adversarial examples, allowing attackers to trick a model into producing an incorrect output. Adversarial examples are commonly constructed or discovered using gradient-based methods that require white-box access to the model. In most real-world AI system deployments, complete access to the machine learning model is an unrealistic threat model. However, it is possible for an attacker to construct adversarial examples even in the black-box case, where we assume only query access to the model, with a variety of approaches, each with its own advantages and shortcomings. We introduce AutoAttacker, a novel reinforcement learning framework in which agents learn, by querying the black-box model, to extract its underlying decision behaviour and to undermine it. AutoAttacker is a first-of-its-kind framework that uses reinforcement learning, assumes nothing about the differentiability or structure of the underlying function, and is thus robust to common defenses such as gradient obfuscation or adversarial training. Finally, when the model output is not differentiable or descriptive, as in hard-label binary classification, most methods cease to operate and require either an approximation of the gradient or another approach altogether. Our approach, however, maintains the capability to function when the output descriptiveness diminishes.
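To make the query-only threat model in the abstract concrete, the sketch below shows an attack that uses nothing but hard-label queries against a stand-in classifier. The `BlackBoxModel` and `random_search_attack` names, the random-direction search loop, and the toy linear model are illustrative assumptions for exposition; they are not AutoAttacker's actual reinforcement learning agent or environment.

```python
import numpy as np

# Hypothetical stand-in for the target model: the attacker gets hard labels only,
# mirroring the black-box, low-descriptiveness setting described in the abstract.
class BlackBoxModel:
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=dim)

    def query(self, x):
        # Only a hard label (0 or 1) is exposed; no gradients, no scores.
        return int(x @ self.w > 0.0)


def random_search_attack(model, x, budget=1000, step=0.1, seed=1):
    """Toy query-based attack: sample random perturbation directions and grow
    the search radius until the hard label flips. This stands in for a learned
    policy; it is not the paper's RL method."""
    rng = np.random.default_rng(seed)
    source_label = model.query(x)
    eps = step
    for _ in range(budget):
        direction = rng.normal(size=x.shape)
        direction /= np.linalg.norm(direction)
        candidate = x + eps * direction
        if model.query(candidate) != source_label:  # "reward": label flipped
            return candidate
        eps *= 1.01  # slowly expand the search radius
    return None


if __name__ == "__main__":
    model = BlackBoxModel(dim=20)
    x = np.random.default_rng(2).normal(size=20)
    adv = random_search_attack(model, x)
    print("source label  :", model.query(x))
    print("attack found  :", adv is not None)
    if adv is not None:
        print("adv label     :", model.query(adv))
        print("perturbation L2:", np.linalg.norm(adv - x))
```

In the paper's framing, the random sampler above would be replaced by an RL agent whose actions are perturbations and whose reward reflects progress toward a misclassification, learned purely from query feedback.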
DOI: 10.1109/EuroSPW.2019.00032
Citation Key: tsingenopoulos_autoattacker_2019