Biblio

Filters: Keyword is Speech recognition
2023-09-08
Miao, Yu.  2022.  Construction of Computer Big Data Security Technology Platform Based on Artificial Intelligence. 2022 Second International Conference on Advanced Technologies in Intelligent Control, Environment, Computing & Communication Engineering (ICATIECE). :1–4.
Artificial intelligence (AI) technology has developed rapidly in recent years. It is an intelligent system that can perform tasks without human intervention. AI can be used for various purposes, such as speech recognition and face recognition, and it can be used for good or bad purposes, depending on how it is implemented. This paper discusses the application of AI in data security technology and its advantages over traditional security methods. We focus on the good use of AI by analyzing the impact of AI on the development of big data security technology. AI can enhance security technology through machine learning algorithms, which can analyze large amounts of data and identify patterns that humans cannot readily detect. The AI-based computer big data security technology platform described in this paper is a system that can identify and prevent malicious programs. The system must be able to detect all types of threats, including viruses, worms, Trojans and spyware. It should also be able to monitor network activity and respond quickly in the event of an attack.
2023-07-21
Churaev, Egor, Savchenko, Andrey V.  2022.  Multi-user facial emotion recognition in video based on user-dependent neural network adaptation. 2022 VIII International Conference on Information Technology and Nanotechnology (ITNT). :1–5.
In this paper, the multi-user video-based facial emotion recognition is examined in the presence of a small data set with the emotions of end users. By using the idea of speaker-dependent speech recognition, we propose a novel approach to solve this task if labeled video data from end users is available. During the training stage, a deep convolutional neural network is trained for user-independent emotion classification. Next, this classifier is adapted (fine-tuned) on the emotional video of a concrete person. During the recognition stage, the user is identified based on face recognition techniques, and an emotional model of the recognized user is applied. It is experimentally shown that this approach improves the accuracy of emotion recognition by more than 20% for the RAVDESS dataset.
Avula, Himaja, R, Ranjith, S Pillai, Anju.  2022.  CNN based Recognition of Emotion and Speech from Gestures and Facial Expressions. 2022 6th International Conference on Electronics, Communication and Aerospace Technology. :1360–1365.
The major mode of communication between hearing-impaired or mute people and others is sign language. Previously, most sign language recognition systems were designed simply to recognize hand signs and convey them as text. The proposed model, however, tries to provide speech to the mute. First, hand gestures for sign language recognition and facial emotions are trained using a CNN (Convolutional Neural Network), and then an emotion-to-speech model is trained. Finally, hand gestures and facial emotions are combined to produce the emotion and speech.
Abbasi, Nida Itrat, Song, Siyang, Gunes, Hatice.  2022.  Statistical, Spectral and Graph Representations for Video-Based Facial Expression Recognition in Children. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :1725–1729.
Child facial expression recognition is a relatively less investigated area within affective computing. Children’s facial expressions differ significantly from those of adults; thus, it is necessary to develop emotion recognition frameworks that are more objective, descriptive and specific to this target user group. In this paper we propose the first approach that (i) constructs a video-level heterogeneous graph representation for facial expression recognition in children, and (ii) predicts children’s facial expressions using automatically detected Action Units (AUs). To this end, we construct three separate length-independent representations at video level, namely statistical, spectral and graph, for detailed multi-level facial behaviour decoding (AU activation status, AU temporal dynamics and spatio-temporal AU activation patterns, respectively). Our experimental results on the LIRIS Children Spontaneous Facial Expression Video Database demonstrate that combining these three feature representations provides the highest accuracy for expression recognition in children.
2023-06-09
Liu, Luchen, Lin, Xixun, Zhang, Peng, Zhang, Lei, Wang, Bin.  2022.  Learning Common Dependency Structure for Unsupervised Cross-Domain NER. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :8347–8351.
The unsupervised cross-domain NER task addresses the case where data in a new domain are fully unlabeled. It leverages labeled data from a source domain to predict entities in an unlabeled target domain. Since training models on a large domain corpus is time-consuming, in this paper we consider an alternative that introduces syntactic dependency structure. Such information is more accessible and can be shared between sentences from different domains. We propose a novel framework with a dependency-aware GNN (DGNN) to learn these common structures from the source domain and adapt them to the target domain, alleviating the data scarcity issue and bridging the domain gap. Experimental results show that our method outperforms state-of-the-art methods.
2023-03-31
Li, Yunchen, Luo, Da.  2022.  Adversarial Audio Detection Method Based on Transformer. 2022 International Conference on Machine Learning and Intelligent Systems Engineering (MLISE). :77–82.
Speech recognition technology has been applied to all aspects of our daily life, but it faces many security issues. One of the major threats is adversarial audio examples, which may tamper with the recognition results of an automatic speech recognition (ASR) system. In this paper, we propose an adversarial detection framework to detect adversarial audio examples. The method is based on the transformer self-attention mechanism. Spectrogram features are extracted from the audio and divided into patches. Position information is embedded and the patches are then fed into a transformer encoder. Experimental results show that the method achieves good performance, with a detection accuracy above 96.5% under white-box attacks, black-box attacks, and noisy conditions. Even when detecting adversarial examples generated by unknown attacks, it achieves satisfactory results.
2023-01-05
Omman, Bini, Eldho, Shallet Mary T.  2022.  Speech Emotion Recognition Using Bagged Support Vector Machines. 2022 International Conference on Computing, Communication, Security and Intelligent Systems (IC3SIS). :1–4.
Speech emotion recognition is one of the most promising and exciting problems in the area of human-computer interaction. It has been studied and analysed over several decades. It is the process of classifying or identifying the emotions embedded in the speech signal. Current challenges in speech emotion recognition when a single estimator is used include the difficulty of building and training it using HMMs and neural networks, low detection accuracy, and high computational power and time requirements. In this work we performed emotion classification on two corpora: the Berlin EmoDB and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). A mixture of spectral features was extracted from them, which was further processed and reduced to the required feature set. When compared to single estimators, ensemble learning has been shown to provide superior overall performance. We propose a bagged ensemble model consisting of support vector machines with a Gaussian kernel as a possible algorithm for the problem at hand. In the paper, ensemble learning algorithms constitute a dominant and state-of-the-art approach for obtaining maximum overall performance.
2022-06-06
Silva, J. Sá, Saldanha, Ruben, Pereira, Vasco, Raposo, Duarte, Boavida, Fernando, Rodrigues, André, Abreu, Madalena.  2019.  WeDoCare: A System for Vulnerable Social Groups. 2019 International Conference on Computational Science and Computational Intelligence (CSCI). :1053–1059.
One of the biggest problems in current society is people's safety. Safety measures and mechanisms are especially important in the case of vulnerable social groups, such as migrants, the homeless, and victims of domestic and/or sexual violence. To cope with this problem, an increasing number of personal alarm systems have appeared on the market, most of them based on panic buttons. Nevertheless, none of them has gained widespread acceptance, mainly because of limited Human-Computer Interaction. In the context of this work, we developed an innovative mobile application that recognizes an attack through speech and gesture recognition. This paper describes the system and presents its features, some of them based on the emerging concept of Human-in-the-Loop Cyber-physical Systems and new concepts of Human-Computer Interaction.
2022-03-09
Chandankhede, Pankaj H., Titarmare, Abhijit S., Chauhvan, Sarang.  2021.  Voice Recognition Based Security System Using Convolutional Neural Network. 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). :738–743.
This review presents a speech recognition technique based on the planned analysis and use of a neural network and the Google API, using the characteristics of speech. A multifactor security system is pioneered for the authentication and identification of vocal modalities. The project pursues a distinct strategy: a structure of independent convolution layers, with separate convolutions involving the spectrum and Mel-frequency cepstral coefficients. The review covers the statistical analysis of sound using scaled-up and scaled-down spectrograms; in addition, the Google Speech-to-Text API converts speech to a pass code, which is cross-verified for extended security. Our study shows that the incorporated methodology and the results obtained indicate the direction of research in this area and encouraged us to advance in this field.
2022-01-31
Zhang, Yun, Li, Hongwei, Xu, Guowen, Luo, Xizhao, Dong, Guishan.  2021.  Generating Audio Adversarial Examples with Ensemble Substituted Models. ICC 2021 - IEEE International Conference on Communications. :1–6.
The rapid development of machine learning technology has prompted the application of Automatic Speech Recognition (ASR). However, studies have shown that state-of-the-art ASR technologies are still vulnerable to various attacks, which severely undermines the stability of ASR. In general, most of the existing attack techniques for ASR models are based on white-box scenarios, where the adversary uses adversarial samples to generate a substituted model corresponding to the target model. By contrast, there are fewer attack schemes in the black-box scenario. Moreover, no scheme considers the problem of how to construct the architecture of the substituted models. In this paper, we point out that constructing a good substituted model architecture is crucial to the effectiveness of the attack, as it helps to generate a more sophisticated set of adversarial examples. We evaluate the performance of different substituted models through comprehensive experiments, and find that ensemble substituted models achieve the best attack effect. The experiments show that our approach achieves an attack success rate of over 80% (a 2% improvement compared to the latest work) while maintaining the authenticity of the original sample well.
2022-01-25
Saleem, Summra, Dilawari, Aniqa, Khan, Usman Ghani.  2021.  Spoofed Voice Detection using Dense Features of STFT and MDCT Spectrograms. 2021 International Conference on Artificial Intelligence (ICAI). :56–61.
Attestation of audio signals for recognition of forgery in voice is a challenging task. In this research work, a deep convolutional neural network (CNN) is utilized to detect audio operations, i.e. pitch-shifted and amplitude-varied signals. Short-time Fourier transform (STFT) and Modified Discrete Cosine Transform (MDCT) features are chosen for audio processing, and their plotted patterns are fed to the CNN. Experimental results show that our model can successfully distinguish tampered signals to facilitate audio authentication on the TIMIT dataset. The proposed CNN architecture can distinguish spoofed voices with shifted pitch with an accuracy of 97.55% and with varied amplitude with an accuracy of 98.85%.
2022-01-10
M, Babu, R, Hemchandhar, D, Harish Y., S, Akash, K, Abhishek Todi.  2021.  Voice Prescription with End-to-End Security Enhancements. 2021 6th International Conference on Communication and Electronics Systems (ICCES). :1–8.

Recent analysis indicates that more than 250,000 people in the United States of America (USA) die every year because of medical errors. A World Health Organisation (WHO) report states that 2.6 million deaths occur due to medical and prescription errors. Many of the errors relate to the wrong drug or dosage being administered by caregivers to patients due to indecipherable handwriting, drug interactions, confusing drug names, etc. The adoption of mobile-based speech recognition applications will eliminate these errors. This allows physicians to narrate the prescription instead of writing it. The application can be accessed through smartphones and can be used easily by everyone. An application program interface has been created for handling requests. Natural language processing is used to read text, interpret it and determine the important words for generating prescriptions. The patient data is stored and used according to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) guidelines. The SMS4-BSK encryption scheme is used to secure data transmission over Wireless LAN.

2021-07-08
Hou, Dai, Han, Hao, Novak, Ed.  2020.  TAES: Two-factor Authentication with End-to-End Security against VoIP Phishing. 2020 IEEE/ACM Symposium on Edge Computing (SEC). :340–345.
In the current state of communication technology, the abuse of VoIP has led to the emergence of telecommunications fraud. We urgently need an end-to-end identity authentication mechanism to verify the identity of the caller. This paper proposes an end-to-end, dual identity authentication mechanism to solve the problem of telecommunications fraud. Our first technique is to use the Hermes algorithm of data transmission technology on an unknown voice channel to transmit the certificate, thereby authenticating the caller's phone number. Our second technique uses voice-print recognition technology and a Gaussian mixture model (a general background probabilistic model) to establish a model of the speaker to verify the caller's voice to ensure the speaker's identity. Our solution is implemented on the Android platform, and simultaneously tests and evaluates transmission efficiency and speaker recognition. Experiments conducted on Android phones show that the error rate of the voice channel transmission signature certificate is within 3.247 %, and the certificate signature verification mechanism is feasible. The accuracy of the voice-print recognition is 72%, making it effective as a reference for identity authentication.
2021-06-28
Wei, Wenqi, Liu, Ling, Loper, Margaret, Chow, Ka-Ho, Gursoy, Mehmet Emre, Truex, Stacey, Wu, Yanzhao.  2020.  Adversarial Deception in Deep Learning: Analysis and Mitigation. 2020 Second IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA). :236–245.
The burgeoning success of deep learning has raised security and privacy concerns as more and more tasks are accompanied by sensitive data. Adversarial attacks in deep learning have emerged as one of the dominating security threats to a range of mission-critical deep learning systems and applications. This paper takes a holistic view to characterize adversarial examples in deep learning by studying their adverse effect, and presents an attack-independent countermeasure with three original contributions. First, we provide a general formulation of adversarial examples and elaborate on the basic principle for adversarial attack algorithm design. Then, we evaluate 15 adversarial attacks with a variety of evaluation metrics to study their adverse effects and costs. We further conduct three case studies to analyze the effectiveness of adversarial examples and to demonstrate their divergence across attack instances. We take advantage of the instance-level divergence of adversarial examples and propose a strategic input transformation teaming defense. The proposed defense methodology is attack-independent and capable of auto-repairing and auto-verifying the prediction decision made on the adversarial input. We show that the strategic input transformation teaming defense can achieve high defense success rates and is more robust, with high attack prevention success rates and low benign false-positive rates, compared to existing representative defense methods.
2021-05-13
Lit, Yanyan, Kim, Sara, Sy, Eric.  2021.  A Survey on Amazon Alexa Attack Surfaces. 2021 IEEE 18th Annual Consumer Communications Networking Conference (CCNC). :1–7.
Since its launch in 2014, Alexa, Amazon's versatile cloud-based voice service, has become active in over 100 million households worldwide [1]. Alexa's user-friendly, personalized vocal experience offers customers a more natural way of interacting with cutting-edge technology by letting them dictate commands directly to the assistant. The Alexa service is now more accessible than ever, available on hundreds of millions of devices from not only Amazon but also third-party device manufacturers. Unfortunately, that success has also been a source of concern and controversy. The success of Alexa is based on its effortless usability, but that in turn has led to a lack of sufficient security. This paper surveys various attacks against the Amazon Alexa ecosystem, including attacks against the frontend voice capturing and the cloud backend voice command recognition and processing. Overall, we have identified six attack surfaces covering the lifecycle of Alexa voice interaction, which spans several stages including voice data collection, transmission, processing and storage. We also discuss potential mitigation solutions for each attack surface to improve the security and privacy of Alexa and other voice assistants.
2020-12-11
Huang, Y., Wang, Y.  2019.  Multi-format speech perception hashing based on time-frequency parameter fusion of energy zero ratio and frequency band variance. 2019 3rd International Conference on Electronic Information Technology and Computer Engineering (EITCE). :243–251.

To solve the problems of existing speech content authentication algorithms, such as support for only a single format, lack of generality, low security, and low accuracy of small-scale tamper detection and localization, a multi-format speech perception hashing scheme based on time-frequency parameter fusion of energy zero ratio and frequency band variance is proposed. First, the algorithm preprocesses the speech signal and calculates the short-time logarithmic energy, zero-crossing rate and frequency band variance of each speech fragment. It then calculates the energy-to-zero ratio of each frame and fuses the time-frequency features by mean filtering, and the hash of the time-frequency parameters is constructed by the difference hashing method. Finally, the hash sequence is scrambled, preserving its length, with a logistic chaotic map to improve the security of the hash sequence during transmission. Experiments show that the proposed algorithm offers robustness, discrimination and key dependence.
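
As a rough illustration of the pipeline described in this abstract, the sketch below computes two of the per-frame features (short-time log energy and zero-crossing rate), builds a difference hash from one feature sequence, and scrambles the hash with a permutation derived from a logistic map. It is not the authors' implementation: the function names, the frame length, the map parameter `r = 3.99`, the key value and the use of a sort-based permutation are all assumptions made for illustration, and the frequency band variance and feature fusion steps are omitted.

```python
import math
from typing import List, Tuple


def frame_features(frame: List[float]) -> Tuple[float, float]:
    """Return (short-time log energy, zero-crossing rate) for one frame."""
    energy = sum(s * s for s in frame)
    log_energy = math.log10(energy + 1e-12)  # small offset avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / max(len(frame) - 1, 1)
    return log_energy, zcr


def difference_hash(values: List[float]) -> List[int]:
    """Bit i is 1 when the feature rises from frame i to frame i + 1."""
    return [1 if b > a else 0 for a, b in zip(values, values[1:])]


def logistic_permutation(n: int, key: float, r: float = 3.99) -> List[int]:
    """Derive a permutation of range(n) from a logistic map seeded by key in (0, 1)."""
    x = key
    sequence = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        sequence.append(x)
    return sorted(range(n), key=lambda i: sequence[i])


def scramble(bits: List[int], key: float) -> List[int]:
    perm = logistic_permutation(len(bits), key)
    return [bits[p] for p in perm]


def unscramble(bits: List[int], key: float) -> List[int]:
    perm = logistic_permutation(len(bits), key)
    restored = [0] * len(bits)
    for dst, src in enumerate(perm):
        restored[src] = bits[dst]
    return restored


# Synthetic signal standing in for speech (assumption: 8 frames of 200 samples).
samples = [math.sin(0.0002 * n * n) * (1.0 + n / 1600.0) for n in range(1600)]
frames = [samples[i:i + 200] for i in range(0, len(samples), 200)]
log_energies = [frame_features(f)[0] for f in frames]
hash_bits = difference_hash(log_energies)

KEY = 0.3141  # hypothetical secret key shared by sender and receiver
protected = scramble(hash_bits, KEY)
```

Because the permutation depends only on the key and the hash length, a receiver holding the same key can undo the scrambling with `unscramble(protected, KEY)`, while the scrambled sequence keeps the length of the original hash.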

2020-09-14
Ma, Zhuo, Liu, Yang, Liu, Ximeng, Ma, Jianfeng, Li, Feifei.  2019.  Privacy-Preserving Outsourced Speech Recognition for Smart IoT Devices. IEEE Internet of Things Journal. 6:8406–8420.
Most of the current intelligent Internet of Things (IoT) products take neural network-based speech recognition as the standard human-machine interaction interface. However, the traditional speech recognition frameworks for smart IoT devices always collect and transmit voice information in the form of plaintext, which may cause the disclosure of user privacy. Due to the wide utilization of speech features as biometric authentication, the privacy leakage can cause immeasurable losses to personal property and privacy. Therefore, in this paper, we propose an outsourced privacy-preserving speech recognition framework (OPSR) for smart IoT devices in the long short-term memory (LSTM) neural network and edge computing. In the framework, a series of additive secret sharing-based interactive protocols between two edge servers are designed to achieve lightweight outsourced computation. And based on the protocols, we implement the neural network training process of LSTM for intelligent IoT device voice control. Finally, combined with the universal composability theory and experiment results, we theoretically prove the correctness and security of our framework.
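
The additive secret sharing primitive that this abstract builds on can be sketched in a few lines. The example below is a generic illustration and not the OPSR protocol itself: the modulus, the function names and the two-server addition example are assumptions chosen for clarity.

```python
import secrets
from typing import Tuple

MODULUS = 2 ** 32  # hypothetical ring size chosen for illustration


def share(value: int) -> Tuple[int, int]:
    """Split value into two additive shares modulo MODULUS."""
    first = secrets.randbelow(MODULUS)
    second = (value - first) % MODULUS
    return first, second


def reconstruct(share_a: int, share_b: int) -> int:
    """Recombine the two shares into the original value."""
    return (share_a + share_b) % MODULUS


def add_shares(x_share: int, y_share: int) -> int:
    """Each server adds the shares it holds locally, with no communication."""
    return (x_share + y_share) % MODULUS


# The device splits two private values between edge servers A and B.
x_a, x_b = share(1200)
y_a, y_b = share(34)

sum_a = add_shares(x_a, y_a)  # computed on server A
sum_b = add_shares(x_b, y_b)  # computed on server B

total = reconstruct(sum_a, sum_b)
```

Each share on its own is uniformly random, so neither server learns the inputs, yet the recombined result equals the plaintext sum. Multiplication and the nonlinear functions inside an LSTM need interactive protocols between the servers, which is the part the paper designs.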
2020-09-11
Shekhar, Heemany, Moh, Melody, Moh, Teng-Sheng.  2019.  Exploring Adversaries to Defend Audio CAPTCHA. 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). :1155–1161.
CAPTCHA is a web-based authentication method used by websites to distinguish between humans (valid users) and bots (attackers). Audio captcha is an accessible captcha meant for visually disabled users, such as color-blind, blind, or near-sighted users. Firstly, this paper analyzes how secure current audio captchas are against attacks using machine learning (ML) and deep learning (DL) models. Each audio captcha is made up of five, seven or ten random digits [0-9] spoken one after the other, along with varying background noise throughout the length of the audio. If the ML or DL model is able to correctly identify all spoken digits in the correct order of occurrence in a single audio captcha, we consider that captcha to be broken and the attack to be successful. Throughout the paper, accuracy refers to the attack model's success at breaking audio captchas. The higher the attack accuracy, the more insecure the audio captchas are. In our baseline experiments, we found that attack models could break audio captchas that had no background noise or medium background noise, with any number of spoken digits, with nearly 99% to 100% accuracy. In contrast, audio captchas with high background noise were relatively more secure, with an attack accuracy of 85%. Secondly, we propose that the concepts of adversarial example algorithms can be used to create a new kind of audio captcha that is more resilient to attacks. We found that even after retraining the models on the new adversarial audio data, the attack accuracy remained as low as 25% to 36%. Lastly, we explore the benefits of creating adversarial audio captchas through different algorithms such as the Basic Iterative Method (BIM) and DeepFool. We found that as long as the attacker has less than 45% of the samples from each kind of adversarial audio dataset, the defense will be successful at preventing attacks.
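
The Basic Iterative Method named in this abstract repeatedly takes a small signed-gradient step and clips the result so it stays within a fixed distance of the original input. The sketch below shows that update on a toy linear classifier, not on the audio models used in the paper: the weights, input vector, step size `alpha`, budget `eps` and helper names are all hypothetical.

```python
from typing import List


def sign(v: float) -> int:
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def score(x: List[float], w: List[float], b: float) -> float:
    """Linear classifier score; a positive score means class +1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def bim_attack(x: List[float], w: List[float], label: int,
               eps: float, alpha: float, steps: int) -> List[float]:
    """Basic Iterative Method against a linear model (label is +1 or -1)."""
    x_adv = list(x)
    for _ in range(steps):
        # For a margin-based loss on a linear model, the input gradient is
        # proportional to -label * w, so only its sign is needed.
        grad_sign = [sign(-label * wi) for wi in w]
        x_adv = [xa + alpha * g for xa, g in zip(x_adv, grad_sign)]
        # Clip each coordinate back into the eps-ball around the original x.
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
    return x_adv


# Hypothetical feature vector and classifier parameters.
w = [0.5, -1.0, 2.0]
b = 0.1
x = [1.0, 0.0, 0.5]

x_adv = bim_attack(x, w, label=1, eps=0.3, alpha=0.1, steps=5)
```

For a linear model the gradient sign never changes, so the iterations simply walk to the edge of the `eps` budget. With a deep network the gradient is recomputed at every step, which is what distinguishes BIM from a single-step attack.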
Zhang, Yang, Gao, Haichang, Pei, Ge, Luo, Sainan, Chang, Guoqin, Cheng, Nuo.  2019.  A Survey of Research on CAPTCHA Designing and Breaking Techniques. 2019 18th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :75–84.
The Internet plays an increasingly important role in people's lives, but it also brings security problems. CAPTCHA, which stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart, has been widely used as a security mechanism. This paper outlines the scientific and technological progress in both the design and attack of CAPTCHAs related to these three CAPTCHA categories. It first presents a comprehensive survey of recent developments for each CAPTCHA type in terms of usability, robustness and their weaknesses and strengths. Second, it summarizes the attack methods for each category. In addition, the differences between the three CAPTCHA categories and the attack methods will also be discussed. Lastly, this paper provides suggestions for future research and proposes some problems worthy of further study.
2020-09-04
Wu, Yi, Liu, Jian, Chen, Yingying, Cheng, Jerry.  2019.  Semi-black-box Attacks Against Speech Recognition Systems Using Adversarial Samples. 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN). :1–5.
As automatic speech recognition (ASR) systems have been integrated into a diverse set of devices around us in recent years, their security vulnerabilities have become an increasing concern for the public. Existing studies have demonstrated that deep neural networks (DNNs), acting as the computation core of ASR systems, are vulnerable to deliberately designed adversarial attacks. Based on the gradient descent algorithm, existing studies have successfully generated adversarial samples that can disturb ASR systems and produce transcript texts chosen by adversaries. Most of this research simulated white-box attacks, which require knowledge of all the components in the targeted ASR systems. In this work, we propose the first semi-black-box attack against the ASR system Kaldi. Requiring only partial information from Kaldi and none from the DNN, we can embed malicious commands into a single audio clip using a gradient-independent genetic algorithm. The crafted audio clip is recognized by Kaldi as the embedded malicious commands while remaining unnoticeable to humans. Experiments show that our attack can achieve a high attack success rate with unnoticeable perturbations to three types of audio clips (pop music, pure music, and human command) without needing the underlying DNN model parameters and architecture.
Taori, Rohan, Kamsetty, Amog, Chu, Brenton, Vemuri, Nikita.  2019.  Targeted Adversarial Examples for Black Box Audio Systems. 2019 IEEE Security and Privacy Workshops (SPW). :15–20.
The application of deep recurrent networks to audio transcription has led to impressive gains in automatic speech recognition (ASR) systems. Many have demonstrated that small adversarial perturbations can fool deep neural networks into incorrectly predicting a specified target with high confidence. Current work on fooling ASR systems has focused on white-box attacks, in which the model architecture and parameters are known. In this paper, we adopt a black-box approach to adversarial generation, combining the approaches of both genetic algorithms and gradient estimation to solve the task. We achieve an 89.25% targeted attack similarity, with a 35% targeted attack success rate, after 3000 generations while maintaining 94.6% audio file similarity.
2020-08-28
Khomytska, Iryna, Teslyuk, Vasyl.  2019.  Mathematical Methods Applied for Authorship Attribution on the Phonological Level. 2019 IEEE 14th International Conference on Computer Sciences and Information Technologies (CSIT). 3:7–11.

The proposed combination of statistical methods has proved efficient for authorship attribution. The complex analysis method based on the proposed combination of statistical methods has made it possible to minimize the number of phoneme groups by which the authorial differentiation of texts has been done.

2020-08-10
Kwon, Hyun, Yoon, Hyunsoo, Park, Ki-Woong.  2019.  Selective Poisoning Attack on Deep Neural Network to Induce Fine-Grained Recognition Error. 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE). :136–139.

Deep neural networks (DNNs) provide good performance for image recognition, speech recognition, and pattern recognition. However, a poisoning attack is a serious threat to a DNN's security. A poisoning attack reduces the accuracy of a DNN by adding malicious training data during the DNN training process. In some situations, such as military applications, it may be necessary to reduce the accuracy of only a chosen class in the model. For example, an attacker may want to intentionally prevent a UAV from correctly recognizing only nuclear-related facilities. In this paper, we propose a selective poisoning attack that reduces the accuracy of only a chosen class in the model. The proposed method reduces the accuracy of the chosen class by training on malicious data corresponding to that class, while maintaining the accuracy of the remaining classes. For the experiments, we used TensorFlow as the machine learning library and MNIST and CIFAR10 as datasets. Experimental results show that the proposed method can reduce the accuracy of the chosen class to 43.2% and 55.3% on MNIST and CIFAR10, respectively, while maintaining the accuracy of the remaining classes.

2020-07-30
Deeba, Farah, Tefera, Getenet, Kun, She, Memon, Hira.  2019.  Protecting the Intellectual Properties of Digital Watermark Using Deep Neural Network. 2019 4th International Conference on Information Systems Engineering (ICISE). :91–95.

Recent advances in artificial intelligence, machine learning and deep neural networks (DNNs) have led to robust applications. In areas such as image processing, speech recognition, and natural language processing, DNN algorithms have overcome many drawbacks; in particular, trained DNN models have made it easy for researchers to produce state-of-the-art results. However, sharing these trained models is always challenging because of security and protection concerns. We performed extensive experiments to analyze watermarking in DNNs. We propose a DNN model for digital watermarking that addresses the intellectual property of deep neural networks, watermark embedding, and owner verification. This model can generate watermarks that withstand possible attacks (fine-tuning and train-to-embed). The approach is tested on a standard dataset and is shown to be robust to the above counter-watermark attacks. Our model accurately and instantly verifies the ownership of all remotely deployed deep learning models without affecting model accuracy on standard data.

2020-06-04
Shang, Jiacheng, Wu, Jie.  2019.  Enabling Secure Voice Input on Augmented Reality Headsets using Internal Body Voice. 2019 16th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON). :1–9.

Voice-based input is usually used as the primary input method for augmented reality (AR) headsets because of its immersive AR experience and good recognition performance. However, recent research has shown that an attacker can inject inaudible voice commands into devices that lack voice verification. Even if we secure voice input with voice verification techniques, an attacker can easily steal the victim's voice using low-cost handy recorders and replay it to voice-based applications. To defend against voice-spoofing attacks, AR headsets should be able to determine whether the voice comes from the person who is using the headset. Existing voice-spoofing defense systems are designed for smartphone platforms. Because of the special locations of microphones and loudspeakers on AR headsets, existing solutions are hard to implement on them. To address this challenge, in this paper we propose a voice-spoofing defense system for AR headsets that leverages both the internal body propagation and the air propagation of human voices. Experimental results show that our system can successfully accept normal users with an average accuracy of 97% and defend against two types of attacks with an average accuracy of at least 98%.