Visible to the public Biblio

Filters: Keyword is speech processing  [Clear All Filters]
2023-01-13
Taneja, Vardaan, Chen, Pin-Yu, Yao, Yuguang, Liu, Sijia.  2022.  When Does Backdoor Attack Succeed in Image Reconstruction? A Study of Heuristics vs. Bi-Level Solution ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :4398—4402.
Recent studies have demonstrated the lack of robustness of image reconstruction networks to test-time evasion attacks, posing security risks and potential for misdiagnoses. In this paper, we evaluate how vulnerable such networks are to training-time poisoning attacks for the first time. In contrast to image classification, we find that trigger-embedded basic backdoor attacks on these models executed using heuristics lead to poor attack performance. Thus, it is non-trivial to generate backdoor attacks for image reconstruction. To tackle the problem, we propose a bi-level optimization (BLO)-based attack generation method and investigate its effectiveness on image reconstruction. We show that BLO-generated back-door attacks can yield a significant improvement over the heuristics-based attack strategy.
2022-11-18
Li, Pengzhen, Koyuncu, Erdem, Seferoglu, Hulya.  2021.  Respipe: Resilient Model-Distributed DNN Training at Edge Networks. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :3660–3664.
The traditional approach to distributed deep neural network (DNN) training is data-distributed learning, which partitions and distributes data to workers. This approach, although has good convergence properties, has high communication cost, which puts a strain especially on edge systems and increases delay. An emerging approach is model-distributed learning, where a training model is distributed across workers. Model-distributed learning is a promising approach to reduce communication and storage costs, which is crucial for edge systems. In this paper, we design ResPipe, a novel resilient model-distributed DNN training mechanism against delayed/failed workers. We analyze the communication cost of ResPipe and demonstrate the trade-off between resiliency and communication cost. We implement ResPipe in a real testbed consisting of Android-based smartphones, and show that it improves the convergence rate and accuracy of training for convolutional neural networks (CNNs).
2022-10-20
Butora, Jan, Fridrich, Jessica.  2020.  Steganography and its Detection in JPEG Images Obtained with the "TRUNC" Quantizer. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :2762—2766.
Many portable imaging devices use the operation of "trunc" (rounding towards zero) instead of rounding as the final quantizer for computing DCT coefficients during JPEG compression. We show that this has rather profound consequences for steganography and its detection. In particular, side-informed steganography needs to be redesigned due to the different nature of the rounding error. The steganographic algorithm J-UNIWARD becomes vulnerable to steganalysis with the JPEG rich model and needs to be adjusted for this source. Steganalysis detectors need to be retrained since a steganalyst unaware of the existence of the trunc quantizer will experience 100% false alarm.
2022-05-06
Bai, Zilong, Hu, Beibei.  2021.  A Universal Bert-Based Front-End Model for Mandarin Text-To-Speech Synthesis. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :6074–6078.
The front-end text processing module is considered as an essential part that influences the intelligibility and naturalness of a Mandarin text-to-speech system significantly. For commercial text-to-speech systems, the Mandarin front-end should meet the requirements of high accuracy and low time latency while also ensuring maintainability. In this paper, we propose a universal BERT-based model that can be used for various tasks in the Mandarin front-end without changing its architecture. The feature extractor and classifiers in the model are shared for several sub-tasks, which improves the expandability and maintainability. We trained and evaluated the model with polyphone disambiguation, text normalization, and prosodic boundary prediction for single task modules and multi-task learning. Results show that, the model maintains high performance for single task modules and shows higher accuracy and lower time latency for multi-task modules, indicating that the proposed universal front-end model is promising as a maintainable Mandarin front-end for commercial applications.
2021-11-30
Subramanian, Vinod, Pankajakshan, Arjun, Benetos, Emmanouil, Xu, Ning, McDonald, SKoT, Sandler, Mark.  2020.  A Study on the Transferability of Adversarial Attacks in Sound Event Classification. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :301–305.
An adversarial attack is an algorithm that perturbs the input of a machine learning model in an intelligent way in order to change the output of the model. An important property of adversarial attacks is transferability. According to this property, it is possible to generate adversarial perturbations on one model and apply it the input to fool the output of a different model. Our work focuses on studying the transferability of adversarial attacks in sound event classification. We are able to demonstrate differences in transferability properties from those observed in computer vision. We show that dataset normalization techniques such as z-score normalization does not affect the transferability of adversarial attacks and we show that techniques such as knowledge distillation do not increase the transferability of attacks.
Li, Gangqiang, Wu, Sissi Xiaoxiao, Zhang, Shengli, Li, Qiang.  2020.  Detect Insider Attacks Using CNN in Decentralized Optimization. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :8758–8762.
This paper studies the security issue of a gossip-based distributed projected gradient (DPG) algorithm, when it is applied for solving a decentralized multi-agent optimization. It is known that the gossip-based DPG algorithm is vulnerable to insider attacks because each agent locally estimates its (sub)gradient without any supervision. This work leverages the convolutional neural network (CNN) to perform the detection and localization of the insider attackers. Compared to the previous work, CNN can learn appropriate decision functions from the original state information without preprocessing through artificially designed rules, thereby alleviating the dependence on complex pre-designed models. Simulation results demonstrate that the proposed CNN-based approach can effectively improve the performance of detecting and localizing malicious agents, as compared with the conventional pre-designed score-based model.
2021-10-12
Yang, Howard H., Arafa, Ahmed, Quek, Tony Q. S., Vincent Poor, H..  2020.  Age-Based Scheduling Policy for Federated Learning in Mobile Edge Networks. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :8743–8747.
Federated learning (FL) is a machine learning model that preserves data privacy in the training process. Specifically, FL brings the model directly to the user equipments (UEs) for local training, where an edge server periodically collects the trained parameters to produce an improved model and sends it back to the UEs. However, since communication usually occurs through a limited spectrum, only a portion of the UEs can update their parameters upon each global aggregation. As such, new scheduling algorithms have to be engineered to facilitate the full implementation of FL. In this paper, based on a metric termed the age of update (AoU), we propose a scheduling policy by jointly accounting for the staleness of the received parameters and the instantaneous channel qualities to improve the running efficiency of FL. The proposed algorithm has low complexity and its effectiveness is demonstrated by Monte Carlo simulations.
2021-06-24
Tsaknakis, Ioannis, Hong, Mingyi, Liu, Sijia.  2020.  Decentralized Min-Max Optimization: Formulations, Algorithms and Applications in Network Poisoning Attack. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :5755–5759.
This paper discusses formulations and algorithms which allow a number of agents to collectively solve problems involving both (non-convex) minimization and (concave) maximization operations. These problems have a number of interesting applications in information processing and machine learning, and in particular can be used to model an adversary learning problem called network data poisoning. We develop a number of algorithms to efficiently solve these non-convex min-max optimization problems, by combining techniques such as gradient tracking in the decentralized optimization literature and gradient descent-ascent schemes in the min-max optimization literature. Also, we establish convergence to a first order stationary point under certain conditions. Finally, we perform experiments to demonstrate that the proposed algorithms are effective in the data poisoning attack.
2021-05-13
Li, Xu, Zhong, Jinghua, Wu, Xixin, Yu, Jianwei, Liu, Xunying, Meng, Helen.  2020.  Adversarial Attacks on GMM I-Vector Based Speaker Verification Systems. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :6579—6583.
This work investigates the vulnerability of Gaussian Mixture Model (GMM) i-vector based speaker verification systems to adversarial attacks, and the transferability of adversarial samples crafted from GMM i-vector based systems to x-vector based systems. In detail, we formulate the GMM i-vector system as a scoring function of enrollment and testing utterance pairs. Then we leverage the fast gradient sign method (FGSM) to optimize testing utterances for adversarial samples generation. These adversarial samples are used to attack both GMM i-vector and x-vector systems. We measure the system vulnerability by the degradation of equal error rate and false acceptance rate. Experiment results show that GMM i-vector systems are seriously vulnerable to adversarial attacks, and the crafted adversarial samples are proved to be transferable and pose threats to neural network speaker embedding based systems (e.g. x-vector systems).
2021-05-05
Elvira, Clément, Herzet, Cédric.  2020.  Short and Squeezed: Accelerating the Computation of Antisparse Representations with Safe Squeezing. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :5615—5619.
Antisparse coding aims at spreading the information uniformly over representation coefficients and can be expressed as the solution of an ℓ∞-norm regularized problem. In this paper, we propose a new methodology, coined "safe squeezing", accelerating the computation of antisparse representations. The idea consists in identifying saturated entries of the solution via simple tests and compacting their contribution to achieve some form of dimensionality reduction. Numerical experiments show that the proposed approach leads to significant computational gain.
2021-03-29
Zhang, S., Ma, X..  2020.  A General Difficulty Control Algorithm for Proof-of-Work Based Blockchains. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :3077–3081.
Designing an efficient difficulty control algorithm is an essential problem in Proof-of-Work (PoW) based blockchains because the network hash rate is randomly changing. This paper proposes a general difficulty control algorithm and provides insights for difficulty adjustment rules for PoW based blockchains. The proposed algorithm consists a two-layer neural network. It has low memory cost, meanwhile satisfying the fast-updating and low volatility requirements for difficulty adjustment. Real data from Ethereum are used in the simulations to prove that the proposed algorithm has better performance for the control of the block difficulty.
2021-01-20
Zarazaga, P. P., Bäckström, T., Sigg, S..  2020.  Acoustic Fingerprints for Access Management in Ad-Hoc Sensor Networks. IEEE Access. 8:166083—166094.

Voice user interfaces can offer intuitive interaction with our devices, but the usability and audio quality could be further improved if multiple devices could collaborate to provide a distributed voice user interface. To ensure that users' voices are not shared with unauthorized devices, it is however necessary to design an access management system that adapts to the users' needs. Prior work has demonstrated that a combination of audio fingerprinting and fuzzy cryptography yields a robust pairing of devices without sharing the information that they record. However, the robustness of these systems is partially based on the extensive duration of the recordings that are required to obtain the fingerprint. This paper analyzes methods for robust generation of acoustic fingerprints in short periods of time to enable the responsive pairing of devices according to changes in the acoustic scenery and can be integrated into other typical speech processing tools.

2021-01-18
Huang, Y., Wang, S., Wang, Y., Li, H..  2020.  A New Four-Dimensional Chaotic System and Its Application in Speech Encryption. 2020 Information Communication Technologies Conference (ICTC). :171–175.
Traditional encryption algorithms are not suitable for modern mass speech situations, while some low-dimensional chaotic encryption algorithms are simple and easy to implement, but their key space often small, leading to poor security, so there is still a lot of room for improvement. Aiming at these problems, this paper proposes a new type of four-dimensional chaotic system and applies it to speech encryption. Simulation results show that the encryption scheme in this paper has higher key space and security, which can achieve the speech encryption goal.
2020-12-11
Huang, Y., Wang, Y..  2019.  Multi-format speech perception hashing based on time-frequency parameter fusion of energy zero ratio and frequency band variance. 2019 3rd International Conference on Electronic Information Technology and Computer Engineering (EITCE). :243—251.

In order to solve the problems of the existing speech content authentication algorithm, such as single format, ununiversal algorithm, low security, low accuracy of tamper detection and location in small-scale, a multi-format speech perception hashing based on time-frequency parameter fusion of energy zero ratio and frequency band bariance is proposed. Firstly, the algorithm preprocesses the processed speech signal and calculates the short-time logarithmic energy, zero-crossing rate and frequency band variance of each speech fragment. Then calculate the energy to zero ratio of each frame, perform time- frequency parameter fusion on time-frequency features by mean filtering, and the time-frequency parameters are constructed by difference hashing method. Finally, the hash sequence is scrambled with equal length by logistic chaotic map, so as to improve the security of the hash sequence in the transmission process. Experiments show that the proposed algorithm is robustness, discrimination and key dependent.

2020-10-05
Su, Jinsong, Zeng, Jiali, Xiong, Deyi, Liu, Yang, Wang, Mingxuan, Xie, Jun.  2018.  A Hierarchy-to-Sequence Attentional Neural Machine Translation Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 26:623—632.

Although sequence-to-sequence attentional neural machine translation (NMT) has achieved great progress recently, it is confronted with two challenges: learning optimal model parameters for long parallel sentences and well exploiting different scopes of contexts. In this paper, partially inspired by the idea of segmenting a long sentence into short clauses, each of which can be easily translated by NMT, we propose a hierarchy-to-sequence attentional NMT model to handle these two challenges. Our encoder takes the segmented clause sequence as input and explores a hierarchical neural network structure to model words, clauses, and sentences at different levels, particularly with two layers of recurrent neural networks modeling semantic compositionality at the word and clause level. Correspondingly, the decoder sequentially translates segmented clauses and simultaneously applies two types of attention models to capture contexts of interclause and intraclause for translation prediction. In this way, we can not only improve parameter learning, but also well explore different scopes of contexts for translation. Experimental results on Chinese-English and English-German translation demonstrate the superiorities of the proposed model over the conventional NMT model.

2020-09-08
Kassim, Sarah, Megherbi, Ouerdia, Hamiche, Hamid, Djennoune, Saïd, Bettayeb, Maamar.  2019.  Speech encryption based on the synchronization of fractional-order chaotic maps. 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). :1–6.
This work presents a new method of encrypting and decrypting speech based on a chaotic key generator. The proposed scheme takes advantage of the best features of chaotic systems. In the proposed method, the input speech signal is converted into an image which is ciphered by an encryption function using a chaotic key matrix generated from a fractional-order chaotic map. Based on a deadbeat observer, the exact synchronization of system used is established, and the decryption is performed. Different analysis are applied for analyzing the effectiveness of the encryption system. The obtained results confirm that the proposed system offers a higher level of security against various attacks and holds a strong key generation mechanism for satisfactory speech communication.
2020-08-28
Khomytska, Iryna, Teslyuk, Vasyl.  2019.  Mathematical Methods Applied for Authorship Attribution on the Phonological Level. 2019 IEEE 14th International Conference on Computer Sciences and Information Technologies (CSIT). 3:7—11.

The proposed combination of statistical methods has proved efficient for authorship attribution. The complex analysis method based on the proposed combination of statistical methods has made it possible to minimize the number of phoneme groups by which the authorial differentiation of texts has been done.

2020-08-03
Dai, Haipeng, Liu, Alex X., Li, Zeshui, Wang, Wei, Zhang, Fengmin, Dong, Chao.  2019.  Recognizing Driver Talking Direction in Running Vehicles with a Smartphone. 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems (MASS). :10–18.
This paper addresses the fundamental problem of identifying driver talking directions using a single smartphone, which can help drivers by warning distraction of having conversations with passengers in a vehicle and enable safety enhancement. The basic idea of our system is to perform talking status and direction identification using two microphones on a smartphone. We first use the sound recorded by the two microphones to identify whether the driver is talking or not. If yes, we then extract the so-called channel fingerprint from the speech signal and classify it into one of three typical driver talking directions, namely, front, right and back, using a trained model obtained in advance. The key novelty of our scheme is the proposition of channel fingerprint which leverages the heavy multipath effects in the harsh in-vehicle environment and cancels the variability of human voice, both of which combine to invalidate traditional TDoA, DoA and fingerprint based sound source localization approaches. We conducted extensive experiments using two kinds of phones and two vehicles for four phone placements in three representative scenarios, and collected 23 hours voice data from 20 participants. The results show that our system can achieve 95.0% classification accuracy on average.
2020-07-06
Chai, Yadeng, Liu, Yong.  2019.  Natural Spoken Instructions Understanding for Robot with Dependency Parsing. 2019 IEEE 9th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). :866–871.
This paper presents a method based on syntactic information, which can be used for intent determination and slot filling tasks in a spoken language understanding system including the spoken instructions understanding module for robot. Some studies in recent years attempt to solve the problem of spoken language understanding via syntactic information. This research is a further extension of these approaches which is based on dependency parsing. In this model, the input for neural network are vectors generated by a dependency parsing tree, which we called window vector. This vector contains dependency features that improves performance of the syntactic-based model. The model has been evaluated on the benchmark ATIS task, and the results show that it outperforms many other syntactic-based approaches, especially in terms of slot filling, it has a performance level on par with some state of the art deep learning algorithms in recent years. Also, the model has been evaluated on FBM3, a dataset of the RoCKIn@Home competition. The overall rate of correctly understanding the instructions for robot is quite good but still not acceptable in practical use, which is caused by the small scale of FBM3.
2019-12-16
Karve, Shreya, Nagmal, Arati, Papalkar, Sahil, Deshpande, S. A..  2018.  Context Sensitive Conversational Agent Using DNN. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA). :475–478.
We investigate a method of building a closed domain intelligent conversational agent using deep neural networks. A conversational agent is a dialog system intended to converse with a human, with a coherent structure. Our conversational agent uses a retrieval based model that identifies the intent of the input user query and maps it to a knowledge base to return appropriate results. Human conversations are based on context, but existing conversational agents are context insensitive. To overcome this limitation, our system uses a simple stack based context identification and storage system. The conversational agent generates responses according to the current context of conversation. allowing more human-like conversations.
2019-11-25
Jawad, Ameer K., Abdullah, Hikmat N., Hreshee, Saad S..  2018.  Secure speech communication system based on scrambling and masking by chaotic maps. 2018 International Conference on Advance of Sustainable Engineering and its Application (ICASEA). :7–12.
As a result of increasing the interest in developing the communication systems that use public channels for transmitting information, many channel problems are raised up. Among these problems, the important one should be addressed is the information security. This paper presents a proposed communication system with high security uses two encryption levels based on chaotic systems. The first level is chaotic scrambling, while the second one is chaotic masking. This configuration increases the information security since the key space becomes too large. The MATLAB simulation results showed that the Segmental Spectral Signal to Noise Ratio (SSSNR) of the first level (chaotic scrambling) is reduced by -5.195 dB comparing to time domain scrambling. Furthermore, in the second level (chaotic masking), the SSSNR is reduced by -20.679 dB. It is also showed that when the two levels are combined, the overall reduction obtained is -21.755 dB.
2018-05-02
Yao, Y., Xiao, B., Wu, G., Liu, X., Yu, Z., Zhang, K., Zhou, X..  2017.  Voiceprint: A Novel Sybil Attack Detection Method Based on RSSI for VANETs. 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). :591–602.

Vehicular Ad Hoc Networks (VANETs) enable vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications that bring many benefits and conveniences to improve the road safety and drive comfort in future transportation systems. Sybil attack is considered one of the most risky threats in VANETs since a Sybil attacker can generate multiple fake identities with false messages to severely impair the normal functions of safety-related applications. In this paper, we propose a novel Sybil attack detection method based on Received Signal Strength Indicator (RSSI), Voiceprint, to conduct a widely applicable, lightweight and full-distributed detection for VANETs. To avoid the inaccurate position estimation according to predefined radio propagation models in previous RSSI-based detection methods, Voiceprint adopts the RSSI time series as the vehicular speech and compares the similarity among all received time series. Voiceprint does not rely on any predefined radio propagation model, and conducts independent detection without the support of the centralized infrastructure. It has more accurate detection rate in different dynamic environments. Extensive simulations and real-world experiments demonstrate that the proposed Voiceprint is an effective method considering the cost, complexity and performance.

2018-02-06
Badii, A., Faulkner, R., Raval, R., Glackin, C., Chollet, G..  2017.  Accelerated Encryption Algorithms for Secure Storage and Processing in the Cloud. 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP). :1–6.

The objective of this paper is to outline the design specification, implementation and evaluation of a proposed accelerated encryption framework which deploys both homomorphic and symmetric-key encryptions to serve the privacy preserving processing; in particular, as a sub-system within the Privacy Preserving Speech Processing framework architecture as part of the PPSP-in-Cloud Platform. Following a preliminary study of GPU efficiency gains optimisations benchmarked for AES implementation we have addressed and resolved the Big Integer processing challenges in parallel implementation of bilinear pairing thus enabling the creation of partially homomorphic encryption schemes which facilitates applications such as speech processing in the encrypted domain on the cloud. This novel implementation has been validated in laboratory tests using a standard speech corpus and can be used for other application domains to support secure computation and privacy preserving big data storage/processing in the cloud.

2015-05-06
Shimauchi, S., Ohmuro, H..  2014.  Accurate adaptive filtering in square-root Hann windowed short-time fourier transform domain. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :1305-1309.

A novel short-time Fourier transform (STFT) domain adaptive filtering scheme is proposed that can be easily combined with nonlinear post filters such as residual echo or noise reduction in acoustic echo cancellation. Unlike normal STFT subband adaptive filters, which suffers from aliasing artifacts due to its poor prototype filter, our scheme achieves good accuracy by exploiting the relationship between the linear convolution and the poor prototype filter, i.e., the STFT window function. The effectiveness of our scheme was confirmed through the results of simulations conducted to compare it with conventional methods.

2015-05-04
Rafii, Z., Coover, B., Jinyu Han.  2014.  An audio fingerprinting system for live version identification using image processing techniques. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :644-648.

Suppose that you are at a music festival checking on an artist, and you would like to quickly know about the song that is being played (e.g., title, lyrics, album, etc.). If you have a smartphone, you could record a sample of the live performance and compare it against a database of existing recordings from the artist. Services such as Shazam or SoundHound will not work here, as this is not the typical framework for audio fingerprinting or query-by-humming systems, as a live performance is neither identical to its studio version (e.g., variations in instrumentation, key, tempo, etc.) nor it is a hummed or sung melody. We propose an audio fingerprinting system that can deal with live version identification by using image processing techniques. Compact fingerprints are derived using a log-frequency spectrogram and an adaptive thresholding method, and template matching is performed using the Hamming similarity and the Hough Transform.