Biblio
Most anti-collusion audio fingerprinting schemes aim to identify colluders from illegally redistributed audio copies. However, the loss caused by the redistributed versions is unavoidable. In this letter, a novel fingerprinting scheme is proposed to eliminate the motivation for collusion attacks. The audio signal is transformed to the frequency domain by the Fourier transform, and the frequency-domain coefficients are reversed to different degrees according to the fingerprint sequence. Unlike other fingerprinting schemes, the proposed method modifies the coefficients of the host media excessively in order to significantly reduce the quality of the colluded version, while imperceptibility is well preserved. Experiments show that the colluded audio cannot be reused because of its poor quality. In addition, the proposed method can also resist other common attacks. The copyright risks and losses caused by illegal redistribution are effectively avoided, which is significant for protecting the copyright of audio.
An acoustic fingerprint is a condensed digital signature of an audio signal that is used for audio sample identification. A fingerprint is the pattern of a voice or audio sample. A large number of algorithms have been developed for generating such acoustic fingerprints. These algorithms enable systems that perform song searching, song identification, and song duplication detection. In this study, a comprehensive survey of already developed algorithms is conducted. Four major music fingerprinting algorithms are evaluated to identify and analyze the potential hurdles that can affect their results. Since background and environmental noise reduce the efficiency of music fingerprinting algorithms, a behavioral analysis of the fingerprinting algorithms is performed using audio samples in different languages and under different environmental conditions. The results of music fingerprint classification are more successful when deep learning techniques are used for classification. The acoustic feature modeling and music fingerprinting algorithms are tested using the standard datasets iKala, MusicBrainz, and MIR-1K.
Indoor localization has been a popular research subject in recent years. Usually, object localization using sound involves devices on the objects that acquire data from stationary sound sources, or external sensors that localize the objects when they generate sounds. Indoor localization systems using microphones have traditionally relied on several microphones, which limits cost efficiency and increases the space the systems require. In this paper, the goal is to investigate whether a stationary system can localize a silent object in a room with only one microphone and ambient noise as the information carrier. A subtraction method is combined with a fingerprint technique to define and distinguish the noise absorption characteristic of the silent object in the frequency domain for different object positions. The absorption characteristics at several positions of the object are taken as comparison references, serving as fingerprints of known positions for an object. The experimental results verify that the tentative idea is feasible and that noise-signal-based lateral localization of silent objects can be achieved.
Voice user interfaces can offer intuitive interaction with our devices, but usability and audio quality could be further improved if multiple devices could collaborate to provide a distributed voice user interface. To ensure that users' voices are not shared with unauthorized devices, it is however necessary to design an access management system that adapts to the users' needs. Prior work has demonstrated that a combination of audio fingerprinting and fuzzy cryptography yields a robust pairing of devices without sharing the information that they record. However, the robustness of these systems is partially based on the extensive duration of the recordings that are required to obtain the fingerprint. This paper analyzes methods for robust generation of acoustic fingerprints in short periods of time, which enable the responsive pairing of devices according to changes in the acoustic scenery and can be integrated into other typical speech processing tools.
Over the years, technology has reshaped the world's perception of security concerns. To tackle security problems, we propose a system capable of detecting security alerts. The system covers audio events that occur as outliers against a background of unusual activity. This ambiguous behaviour can be handled by auditory classification. In this paper, we discuss two techniques for extracting features from sound data: time-based and signal-based features. The first technique preserves the time-series nature of sound, while the second focuses on signal characteristics. A convolutional neural network is applied to categorize the sounds. Since the major aim of the research is security challenges, we generated surveillance-related data in addition to using available datasets such as UrbanSound8K and ESC-50. We achieved 94.6% accuracy for the proposed methodology on the self-generated dataset. The improved accuracy on the locally prepared dataset demonstrates the novelty of the research.
To protect users' privacy from leakage, more and more mobile devices employ biometric-based authentication approaches, such as fingerprint, face recognition, and voiceprint authentication, to enhance privacy protection. However, these approaches are vulnerable to replay attacks. Although state-of-the-art solutions use liveness verification to combat such attacks, existing approaches are sensitive to the ambient environment, such as ambient light and surrounding audible noise. To this end, we explore liveness verification for user authentication that leverages users' lip movements, which are robust to noisy environments. In this paper, we propose a lip reading-based user authentication system, LipPass, which extracts unique behavioral characteristics of users' speaking lips by leveraging the built-in audio devices on smartphones. We first investigate the Doppler profiles of acoustic signals caused by users' speaking lips and find that different individuals have unique lip movement patterns. To characterize the lip movements, we propose a deep learning-based method to extract efficient features from the Doppler profiles, and employ Support Vector Machine and Support Vector Domain Description to construct binary classifiers and spoofer detectors for user identification and spoofer detection, respectively. Afterwards, we develop a binary tree-based authentication approach that accurately identifies each individual by applying these binary classifiers and spoofer detectors to the registered users. In extensive experiments involving 48 volunteers in four real environments, LipPass achieves 90.21% accuracy in user identification and 93.1% accuracy in spoofer detection.
An ideal audio retrieval method should be not only highly efficient in identifying an audio track from a massive audio dataset, but also robust to any distortion. Unfortunately, no audio retrieval method is robust to all types of distortion. An audio retrieval method depends on both the audio fingerprint and the retrieval strategy, and especially on how they are combined. We argue that the Sampling and Counting Method (SC), a state-of-the-art audio retrieval method, would be a promising step towards an ideal audio retrieval method if it could be made robust to time-stretch and pitch-stretch. Towards this objective, this paper proposes a turning point alignment method that makes SC resistant to time-stretch, which in turn makes Philips and Philips-like fingerprints resistant to time-stretch. Experimental results show that our approach can resist time-stretch from 70% to 130%, which is on a par with the state-of-the-art methods. It also marginally improves the retrieval performance under various noise distortions.
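As background for the Philips-like fingerprints mentioned above, the following is a minimal sketch of how such fingerprint bits are commonly derived from band energies: each bit is the sign of an energy difference taken across adjacent bands and adjacent frames. The function names `philips_bits` and `bit_error_rate` and the toy energy values are illustrative assumptions; the sketch does not implement the turning point alignment or the SC method proposed in the paper.

```python
from typing import List


def philips_bits(energy: List[List[float]]) -> List[List[int]]:
    """Derive Philips-like sub-fingerprint bits from band energies.

    energy[n][m] is the energy of frame n in band m (hypothetical input).
    For frames n >= 1, bit (n, m) is 1 when
    (E[n][m] - E[n][m+1]) - (E[n-1][m] - E[n-1][m+1]) > 0, else 0.
    """
    bits = []
    for n in range(1, len(energy)):
        row = []
        for m in range(len(energy[n]) - 1):
            diff = (energy[n][m] - energy[n][m + 1]) - (
                energy[n - 1][m] - energy[n - 1][m + 1]
            )
            row.append(1 if diff > 0 else 0)
        bits.append(row)
    return bits


def bit_error_rate(a: List[List[int]], b: List[List[int]]) -> float:
    """Fraction of differing bits between two equally shaped bit blocks."""
    total = sum(len(row) for row in a)
    errors = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return errors / total


# Toy example: 3 frames, 3 bands (made-up numbers, not measured data).
toy_energy = [[4.0, 1.0, 2.0], [1.0, 3.0, 2.0], [2.0, 2.0, 5.0]]
toy_bits = philips_bits(toy_energy)
```

Two fingerprint blocks are then typically compared by their bit error rate, with a low rate indicating a likely match.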
Indoor localization of unknown acoustic events with MEMS microphone arrays has huge potential in applications such as home assisted living and surveillance. This article presents an Angle of Arrival (AoA) fingerprinting method for use in Wireless Acoustic Sensor Networks (WASNs) with low-profile microphone arrays. In a first research phase, acoustic measurements are performed in an anechoic room to evaluate two computationally efficient time domain delay-based AoA algorithms: one based on dot product calculations and another based on dot products with a PHAse Transform (PHAT). The evaluation of the algorithms is conducted with two sound events: white noise and a female voice. The algorithms are able to calculate the AoA with Root Mean Square Errors (RMSEs) of 3.5° for white noise and 9.8° to 16° for female vocal sounds. In the second research phase, an AoA fingerprinting algorithm is developed for acoustic event localization. The proposed solution is experimentally verified in a room of 4.25 m by 9.20 m with 4 acoustic sensor nodes. Acoustic fingerprints of white noise, recorded along a predefined grid in the room, are used to localize white noise and vocal sounds. The localization errors are evaluated using one node at a time, resulting in mean localization errors between 0.65 m and 0.98 m for white noise and between 1.18 m and 1.52 m for vocal sounds.
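A delay-based AoA estimate of the kind described above can be illustrated for a single microphone pair: the lag that maximizes the dot product between the two channels gives the time difference of arrival, which a far-field model converts into an angle. This is a minimal sketch under assumed values (speed of sound 343 m/s, 48 kHz sampling, 5 cm spacing); the function names are hypothetical, the PHAT weighting is omitted, and it is not the article's array geometry or algorithm.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def best_lag(x: Sequence[float], y: Sequence[float], max_lag: int) -> int:
    """Return the lag in samples that maximizes the dot product of x and shifted y.

    A positive result means the signal appears later in y than in x.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(x)):
            j = i + lag
            if 0 <= j < len(y):
                score += x[i] * y[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def angle_of_arrival(lag: int, fs: float, spacing: float) -> float:
    """Far-field angle in degrees from broadside for a two-microphone pair."""
    s = SPEED_OF_SOUND * lag / (fs * spacing)
    s = max(-1.0, min(1.0, s))  # clamp so asin stays defined for noisy lags
    return math.degrees(math.asin(s))


# Toy example: an impulse that reaches the second microphone 3 samples later.
mic_a = [0.0] * 32
mic_b = [0.0] * 32
mic_a[10] = 1.0
mic_b[13] = 1.0
lag = best_lag(mic_a, mic_b, max_lag=5)
angle = angle_of_arrival(lag, fs=48000.0, spacing=0.05)
```

In a fingerprinting setup, angles estimated this way at known grid positions would serve as the reference database against which a new measurement is matched.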
Acoustic speaker recognition systems are very vulnerable to spoofing attacks via replayed or synthesized utterances. One possible countermeasure is audio-visual speaker recognition. Nevertheless, the addition of the visual stream alone does not prevent spoofing attacks completely and only provides further information to assess the authenticity of the utterance. Many systems consider audio and video modalities independently and can easily be spoofed by imitating only a single modality or by a bimodal replay attack with a victim's photograph or video. Therefore, we propose the simultaneous verification of the data synchronicity and the transcription in a challenge-response setup. We use coupled hidden Markov models (CHMMs) for a text-dependent spoofing detection and introduce new features that provide information about the transcriptions of the utterance and the synchronicity of both streams. We evaluate the features for various spoofing scenarios and show that the combination of the features leads to a more robust recognition, also in comparison to the baseline method. Additionally, by evaluating the data on unseen speakers, we show the spoofing detection to be applicable in speaker-independent use-cases.
In this work, a measurement system based on acoustic resonance is developed that can be used for the classification of materials. Acoustic inspection methods, used for screening containers in the field and for identifying defective pills, are highly significant in the fields of health, security, and protection. However, such techniques are constrained by costly instrumentation, offline analysis, and complexities associated with the physical coupling of the transducer holder. A simple, non-destructive, and highly cost-effective technique based on acoustic resonance has therefore been formulated here for quick data acquisition and analysis of the acoustic signatures of liquids, for identifying and classifying their constituents. In this system, two ceramic-coated piezoelectric transducers are attached to the two ends of a V-shaped glass rod; one acts as the transmitter and the other as the receiver. The transmitter generates sound with the help of a white noise generator. The pick-up transducer at the other end of the V-shaped glass rod detects the transmitted signal. The recording is done with an Arduino interfaced to a computer. The FFTs of the recorded signals are analyzed, and the resonant frequencies observed for water, water+salt, and water+sugar are 4.8 kHz, 6.8 kHz, and 3.2 kHz, respectively. A different resonant frequency is observed for each sample, which shows that the developed prototype effectively classifies the materials.
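The analysis step described above, finding the resonant frequency in the spectrum and mapping it to a sample type, can be sketched as follows. This is an illustrative sketch only: it uses a naive DFT on a synthetic tone, the sampling rate and signal length are assumptions, and nearest-reference matching against the three frequencies reported in the abstract is a stand-in for whatever decision rule the authors used.

```python
import cmath
import math
from typing import List

# Resonant frequencies reported in the abstract, in Hz.
REFERENCE_HZ = {"water": 4800.0, "water+salt": 6800.0, "water+sugar": 3200.0}


def dft_magnitudes(signal: List[float]) -> List[float]:
    """Magnitudes of DFT bins 0..N/2, computed naively (fine for short signals)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(acc))
    return mags


def peak_frequency(signal: List[float], fs: float) -> float:
    """Frequency in Hz of the strongest non-DC bin."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * fs / len(signal)


def classify(freq_hz: float) -> str:
    """Label whose reference frequency is closest to the measured peak."""
    return min(REFERENCE_HZ, key=lambda name: abs(REFERENCE_HZ[name] - freq_hz))


# Synthetic stand-in for a recording: a pure 6.8 kHz tone at an assumed 32 kHz rate.
fs = 32000.0
tone = [math.sin(2 * math.pi * 6800.0 * t / fs) for t in range(320)]
measured = peak_frequency(tone, fs)
label = classify(measured)
```

With real recordings excited by white noise, the spectrum would be noisy, so averaging several spectra before picking the peak would likely be needed.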
This work presents a novel method to estimate naturally expressed emotions in speech through binary acoustic modeling. Standard acoustic features are mapped to a binary value representation, and a support vector regression model is used to correlate them with the three continuous emotional dimensions. Three different sets of speech features, two based on spectral parameters and one on prosody, are compared on the VAM corpus, a set of spontaneous dialogues from a German TV talk show. The regression analysis, in terms of correlation coefficient and mean absolute error, shows that the binary key modeling is able to successfully capture speaker emotion characteristics. The proposed algorithm obtains results comparable to those reported in the literature while relying on a much smaller set of acoustic descriptors. Furthermore, we also report preliminary results based on the combination of the binary models, which brings further performance improvements.
In the field of scene understanding, researchers have mainly focused on using video/images to extract different elements in a scene. The computational as well as monetary cost associated with such implementations is high. This paper proposes a low-cost system which uses sound-based techniques in order to jointly perform localization as well as fingerprinting of the sound sources. A network of embedded nodes is used to sense the sound inputs. Phase-based sound localization and Support-Vector Machine classification are used to locate and classify elements of the scene, respectively. The fusion of all this data presents a complete “picture” of the scene. The proposed concepts are applied to a vehicular-traffic case study. Experiments show that the system has a fingerprinting accuracy of up to 97.5%, localization error less than 4 degrees and scene prediction accuracy of 100%.
This paper presents a novel and efficient audio signal recognition algorithm with limited computational complexity. Because the audio recognition system will be used in real-world environments with high background noise, conventional speech recognition techniques are not directly applicable, since they perform poorly in these environments. We therefore introduce a new audio recognition algorithm that is optimized for mechanical sounds such as a car horn or a telephone ring. It is a hybrid time-frequency approach that uses an acoustic fingerprint to recognize audio signal patterns. The limited computational complexity is achieved through efficient use of the time domain and the frequency domain in two different processing phases, detection and recognition, respectively. The transition between these two phases is carried out through a finite state machine (FSM) model. Simulation results show that the algorithm effectively recognizes audio signals in a noisy environment.
Recently, the demand for more robust protection against unauthorized use of mobile devices has been growing rapidly. This paper presents a novel biometric modality, Transient Evoked Otoacoustic Emission (TEOAE), for mobile security. Prior works have investigated TEOAE for biometrics in a setting where an individual is to be identified among a pre-enrolled identity gallery. However, this limits the applicability to the mobile environment, where attacks in most cases come from imposters previously unknown to the system. Therefore, we employ an unsupervised learning approach based on an Autoencoder Neural Network to tackle this blind recognition problem. The learning model is trained on a generic dataset and used to verify an individual in a random population. We also introduce a framework for a mobile biometric system that considers practical application. Experiments show the merits of the proposed method, and system performance is further evaluated by cross-validation, achieving an average EER of 2.41%.
This article presents results of the recognition of acoustic fingerprints from a noise source using spectral characteristics of the signal. Principal Components Analysis (PCA) is applied to reduce the dimensionality of the extracted features, and a classifier is then implemented using the k-nearest neighbors (KNN) method to identify the pattern of the audio signal. This classifier is compared with an Artificial Neural Network (ANN) implementation. A filtering system must be applied to the acquired signals to reduce the 60 Hz noise generated by imperfections in the acquisition system. The methods described in this paper were used for vessel recognition.
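The k-nearest neighbors step named above can be sketched in a few lines: a query feature vector is assigned the majority label among its k closest training vectors. This is a minimal sketch on made-up two-dimensional features with hypothetical labels; the PCA reduction, the 60 Hz filtering, and the article's actual features are not reproduced here.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(
    train: List[Sequence[float]],
    labels: List[str],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Majority label among the k training vectors closest to query (Euclidean)."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up reduced features for two hypothetical vessel classes.
features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
classes = ["vessel_a", "vessel_a", "vessel_a", "vessel_b", "vessel_b", "vessel_b"]
prediction = knn_predict(features, classes, [0.2, 0.3], k=3)
```

In a full pipeline, the vectors passed to `knn_predict` would be the PCA-projected spectral features rather than raw values.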
In this paper, an edit detection method for forensic audio analysis is proposed. It develops and improves a previous method through changes in the signal processing chain and a novel detection criterion. As with the original method, electrical network frequency (ENF) analysis is central to the novel edit detector, for it allows monitoring of anomalous variations of the ENF related to audio edit events. Working in an unsupervised manner, the edit detector compares the extent of ENF variations, centered at its nominal frequency, with a variable threshold that defines the upper limit for normal variations observed in unedited signals. The ENF variations caused by edits in the signal are likely to exceed the threshold, providing a mechanism for their detection. The proposed method is evaluated in both qualitative and quantitative terms via two distinct annotated databases. Results are reported for the originally noisy database signals as well as for versions of them further degraded under controlled conditions. A comparative performance evaluation, in terms of equal error rate (EER) detection, reveals that, for one of the tested databases, the EER improves from 7% with the original method to 4% with the new edit detection method. When the signals are amplitude clipped or corrupted by broadband background noise, the performance figures of the novel method follow the same profile as those of the original method.
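The detection criterion described above, flagging ENF variations that exceed a limit for normal behaviour, can be illustrated with a deliberately simplified sketch. It assumes an ENF track has already been estimated per frame, uses a 50 Hz nominal frequency, and applies a fixed hypothetical threshold of 0.1 Hz, whereas the paper uses a variable threshold; the function name and the sample values are illustrative only.

```python
from typing import List


def flag_edits(
    enf_hz: List[float],
    nominal_hz: float = 50.0,   # assumed grid frequency; 60 Hz in some regions
    threshold_hz: float = 0.1,  # hypothetical fixed limit, not the paper's variable one
) -> List[int]:
    """Indices of frames whose ENF deviates from nominal by more than the threshold."""
    return [
        i for i, f in enumerate(enf_hz) if abs(f - nominal_hz) > threshold_hz
    ]


# Made-up per-frame ENF estimates with two abnormal jumps.
enf_track = [50.01, 49.98, 50.4, 50.0, 49.7]
suspect_frames = flag_edits(enf_track)
```

Frames returned by `flag_edits` would be candidates for closer inspection rather than confirmed edits.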