Spoofing detection via simultaneous verification of audio-visual synchronicity and transcription
Title | Spoofing detection via simultaneous verification of audio-visual synchronicity and transcription |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Schönherr, L., Zeiler, S., Kolossa, D. |
Conference Name | 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) |
Keywords | acoustic coupling, Acoustic signal processing, acoustic speaker recognition systems, audio-visual speaker recognition, audio-visual synchronicity, audio-visual systems, bimodal replay attack, coupled hidden Markov models, Cyber-physical systems, data synchronicity, feature extraction, Hidden Markov models, Human Behavior, liveness detection, multimodal biometrics, pubcrawl, replayed synthesized utterances, resilience, Resiliency, Scalability, security of data, speaker recognition, Speech recognition, spoofing attacks, spoofing detection, Streaming media, text-dependent spoofing detection, Training, visualization |
Abstract | Acoustic speaker recognition systems are very vulnerable to spoofing attacks via replayed or synthesized utterances. One possible countermeasure is audio-visual speaker recognition. Nevertheless, adding the visual stream alone does not prevent spoofing attacks completely; it only provides further information for assessing the authenticity of the utterance. Many systems consider the audio and video modalities independently and can easily be spoofed by imitating only a single modality or by a bimodal replay attack with a victim's photograph or video. Therefore, we propose the simultaneous verification of the data synchronicity and the transcription in a challenge-response setup. We use coupled hidden Markov models (CHMMs) for text-dependent spoofing detection and introduce new features that provide information about the transcription of the utterance and the synchronicity of both streams. We evaluate the features for various spoofing scenarios and show that combining them leads to more robust recognition, also in comparison to the baseline method. Additionally, by evaluating the data on unseen speakers, we show the spoofing detection to be applicable in speaker-independent use cases. |
URL | https://ieeexplore.ieee.org/document/8268990 |
DOI | 10.1109/ASRU.2017.8268990 |
Citation Key | schonherr_spoofing_2017 |
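As a rough illustration of the synchronicity idea described in the abstract, the sketch below scores how well an audio energy envelope tracks a lip-openness trace via zero-lag normalized cross-correlation. This is only a toy proxy under assumed inputs, not the paper's coupled-HMM features; the function name and signals are hypothetical.

```python
import numpy as np

def sync_score(audio_energy, lip_openness):
    """Zero-lag normalized cross-correlation between an audio energy
    envelope and a lip-openness trace sampled at the same frame rate.
    Scores near 1 suggest the streams move together; scores near 0
    suggest a desynchronized (possibly spoofed) pairing.
    Toy proxy only, not the CHMM-based features from the paper."""
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-12)
    v = (lip_openness - lip_openness.mean()) / (lip_openness.std() + 1e-12)
    return float(np.mean(a * v))

# Toy frames: a synchronous pair vs. a temporally scrambled video stream,
# standing in for a live recording vs. a bimodal replay attack.
rng = np.random.default_rng(0)
audio = np.abs(np.sin(np.linspace(0, 6 * np.pi, 200))) + 0.05 * rng.standard_normal(200)
video_sync = audio + 0.05 * rng.standard_normal(200)
video_replay = rng.permutation(video_sync)

live_score = sync_score(audio, video_sync)
spoof_score = sync_score(audio, video_replay)
```

On this toy data the live pairing scores far higher than the scrambled one, which is the intuition behind rejecting replays whose streams are not synchronous.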