Title | Semi-black-box Attacks Against Speech Recognition Systems Using Adversarial Samples |
Publication Type | Conference Paper |
Year of Publication | 2019 |
Authors | Wu, Yi, Liu, Jian, Chen, Yingying, Cheng, Jerry |
Conference Name | 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN) |
Keywords | adversarial attacks, adversarial samples, adversary-expected transcript texts, automatic speech recognition systems, Black Box Security, Computational modeling, Deep Neural Network, deep neural networks, genetic algorithms, gradient descent algorithm, gradient methods, gradient-independent genetic algorithm, Hidden Markov models, high attack success rate, Kaldi, neural nets, Perturbation methods, security of data, security vulnerabilities, semi-black-box attacks, Speech recognition, targeted ASR systems, white-box attacks
Abstract | As automatic speech recognition (ASR) systems have been integrated into a diverse set of devices around us in recent years, their security vulnerabilities have become an increasing concern for the public. Existing studies have demonstrated that deep neural networks (DNNs), acting as the computation core of ASR systems, are vulnerable to deliberately designed adversarial attacks. Based on the gradient descent algorithm, existing studies have successfully generated adversarial samples that can disturb ASR systems and produce the transcript texts expected by adversaries. Most of this research simulates white-box attacks, which require knowledge of all the components in the targeted ASR system. In this work, we propose the first semi-black-box attack against the ASR system Kaldi. Requiring only partial information from Kaldi and none from the DNN, we can embed malicious commands into a single audio clip based on a gradient-independent genetic algorithm. The crafted audio clip can be recognized as the embedded malicious commands by Kaldi while remaining unnoticeable to humans. Experiments show that our attack achieves a high attack success rate with unnoticeable perturbations to three types of audio clips (pop music, pure music, and human commands) without the need for the underlying DNN model's parameters or architecture.
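Note | The abstract describes a gradient-independent genetic algorithm that perturbs an audio clip until a black-box ASR system emits a target transcript. The sketch below is a minimal illustration of that general approach, not the authors' Kaldi-specific implementation; the `transcribe` query, the fitness weighting, and all hyperparameters are assumptions made for illustration only.

```python
# Illustrative sketch of a gradient-free genetic attack on a black-box ASR system.
# NOT the paper's implementation: `transcribe` is a hypothetical black-box query,
# and the fitness function / hyperparameters are assumed values.
import numpy as np

def transcribe(audio: np.ndarray) -> str:
    """Hypothetical black-box ASR query (e.g., a Kaldi decoding pipeline). Stub here."""
    return ""

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two transcripts (rolling-array version)."""
    dp = np.arange(len(b) + 1)
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return int(dp[-1])

def fitness(audio, delta, target, alpha=1e-3):
    """Lower is better: distance to the target transcript plus a perturbation penalty."""
    return edit_distance(transcribe(audio + delta), target) + alpha * np.abs(delta).max()

def genetic_attack(audio, target, pop_size=20, generations=100,
                   eps=0.005, mutate_rate=0.05, elite=4, seed=0):
    """Evolve a bounded perturbation until the ASR output matches `target`."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-eps, eps, size=(pop_size, audio.size))
    for _ in range(generations):
        scores = np.array([fitness(audio, d, target) for d in pop])
        pop = pop[np.argsort(scores)]           # best candidate first (elitism)
        if scores.min() == 0:                   # target transcript reached
            break
        parents = pop[:elite]
        children = []
        while len(children) < pop_size - elite:
            a, b = parents[rng.integers(elite, size=2)]
            mask = rng.random(audio.size) < 0.5          # uniform crossover
            child = np.where(mask, a, b)
            mut = rng.random(audio.size) < mutate_rate   # random mutation
            child[mut] += rng.uniform(-eps, eps, mut.sum())
            children.append(np.clip(child, -eps, eps))
        pop = np.vstack([parents, children])
    return audio + pop[0]
```

The perturbation bound `eps` stands in for the paper's "unnoticeable to humans" constraint; in practice one would evaluate candidates against the real recognizer and tune the population size and mutation rate to the query budget.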
DOI | 10.1109/DySPAN.2019.8935789 |
Citation Key | wu_semi-black-box_2019 |