2020-10-05
Su, Jinsong, Zeng, Jiali, Xiong, Deyi, Liu, Yang, Wang, Mingxuan, Xie, Jun.  2018.  A Hierarchy-to-Sequence Attentional Neural Machine Translation Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 26:623-632.

Although sequence-to-sequence attentional neural machine translation (NMT) has achieved great progress recently, it is confronted with two challenges: learning optimal model parameters for long parallel sentences and fully exploiting different scopes of context. In this paper, partially inspired by the idea of segmenting a long sentence into short clauses, each of which can be easily translated by NMT, we propose a hierarchy-to-sequence attentional NMT model to handle these two challenges. Our encoder takes the segmented clause sequence as input and explores a hierarchical neural network structure to model words, clauses, and sentences at different levels, particularly with two layers of recurrent neural networks modeling semantic compositionality at the word and clause levels. Correspondingly, the decoder sequentially translates the segmented clauses and simultaneously applies two types of attention models to capture interclause and intraclause context for translation prediction. In this way, we not only improve parameter learning but also better exploit different scopes of context for translation. Experimental results on Chinese-English and English-German translation demonstrate the superiority of the proposed model over the conventional NMT model.
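
As a concrete reading of the encoder description, here is a minimal PyTorch sketch (an illustrative assumption, not the authors' code) of the two-level structure: a word-level GRU encodes each clause and a clause-level GRU composes the clause summaries. The two attention models and the decoder are omitted, and all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Word-level RNN per clause, then a clause-level RNN over clause vectors."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)    # word level
        self.clause_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)  # clause level

    def forward(self, clauses):
        # clauses: list of (1, clause_len) LongTensors, one per segmented clause
        word_annotations, clause_vecs = [], []
        for clause in clauses:
            out, h = self.word_rnn(self.embed(clause))
            word_annotations.append(out)   # attended by the intraclause attention
            clause_vecs.append(h[-1])      # final state summarizes the clause
        clause_seq = torch.stack(clause_vecs, dim=1)          # (1, n_clauses, hid)
        clause_annotations, _ = self.clause_rnn(clause_seq)   # for interclause attention
        return word_annotations, clause_annotations

enc = HierarchicalEncoder(vocab_size=1000)
clauses = [torch.randint(0, 1000, (1, 5)), torch.randint(0, 1000, (1, 7))]
words, clause_ctx = enc(clauses)  # word-level and clause-level annotations
```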

2018-03-19
Rocha, A., Scheirer, W. J., Forstall, C. W., Cavalcante, T., Theophilo, A., Shen, B., Carvalho, A. R. B., Stamatatos, E..  2017.  Authorship Attribution for Social Media Forensics. IEEE Transactions on Information Forensics and Security. 12:5–33.

The veil of anonymity provided by smartphones with pre-paid SIM cards, public Wi-Fi hotspots, and distributed networks like Tor has drastically complicated the task of identifying users of social media during forensic investigations. In some cases, the text of a single posted message will be the only clue to an author's identity. How can we accurately predict who that author might be when the message may never exceed 140 characters on a service like Twitter? For the past 50 years, linguists, computer scientists, and scholars of the humanities have been jointly developing automated methods to identify authors based on the style of their writing. All authors possess peculiarities of habit that influence the form and content of their written works. These characteristics can often be quantified and measured using machine learning algorithms. In this paper, we provide a comprehensive review of the methods of authorship attribution that can be applied to the problem of social media forensics. Furthermore, we examine emerging supervised learning-based methods that are effective for small sample sizes, and provide step-by-step explanations for several scalable approaches as instructional case studies for newcomers to the field. We argue that there is a significant need in forensics for new authorship attribution algorithms that can exploit context, can process multi-modal data, and are tolerant of incomplete knowledge of the space of all possible authors at training time.
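
As a concrete example of the kind of supervised method the survey reviews, here is a hedged scikit-learn sketch of a standard character n-gram baseline for short messages; the toy corpus and parameter values are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

messages = ["u up? lol", "Dear sir, please find attached.", "lol no way m8"]
authors  = ["alice", "bob", "alice"]

# Character n-grams capture sub-word habits (punctuation, spelling, spacing)
# that survive even in very short posts such as tweets.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(messages, authors)
print(model.predict(["lol ok u win"]))  # most likely 'alice' on this toy data
```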

2017-12-20
Yamaguchi, M., Kikuchi, H..  2017.  Audio-CAPTCHA with distinction between random phoneme sequences and words spoken by multi-speaker. 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). :3071–3076.

Audio-CAPTCHAs prevent malicious bots from attacking Web services while providing Web accessibility for visually impaired persons. Most conventional methods employ statistical noise to distort the sounds and require users to remember and spell out words, which is difficult and laborious for humans. In this paper, we exploit the difficulty of speaker-independent recognition for ASR machines instead of distortion with statistical noise. Our scheme synthesizes varied voices by changing the speed, pitch, and native language of the speakers. Moreover, we employ semantic identification problems that distinguish between random phoneme sequences and meaningful words, releasing users from remembering and spelling words, which improves human accuracy and usability. We also evaluated our scheme in several experiments.
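
A minimal sketch of the voice-variation step, assuming librosa for the signal processing and a hypothetical TTS output file; the perturbation ranges are illustrative assumptions, not the paper's settings.

```python
import random
import librosa
import soundfile as sf

y, sr = librosa.load("captcha_prompt.wav", sr=None)  # hypothetical TTS output

# Randomize speaking rate and pitch per challenge so speaker-independent
# ASR degrades while the prompt stays intelligible to humans.
rate = random.uniform(0.8, 1.2)                 # speed factor
steps = random.uniform(-3, 3)                   # pitch shift in semitones
y = librosa.effects.time_stretch(y, rate=rate)
y = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)

sf.write("captcha_challenge.wav", y, sr)
```
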
2017-02-21
K. Naruka, O. P. Sahu.  2015.  "An improved speech enhancement approach based on combination of compressed sensing and Kalman filter". 2015 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC). :1-5.

This paper reviews some existing speech enhancement techniques and proposes a new method for enhancing speech by combining compressed sensing and Kalman filter approaches. The approach reconstructs the noisy speech signal using the Compressive Sampling Matching Pursuit (CoSaMP) algorithm and further enhances it with a Kalman filter. The performance of the proposed method is evaluated and compared with that of existing techniques in terms of intelligibility and quality measures of speech. The proposed algorithm shows improved performance compared to spectral subtraction, MMSE, Wiener filter, signal subspace, and Kalman filter methods in terms of WSS, LLR, SegSNR, SNRloss, PESQ, and overall quality.
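
For background on the reconstruction stage, here is a hedged numpy sketch of the standard CoSaMP iteration (Needell and Tropp's algorithm) that the paper builds on; the sparsity level, iteration count, and dictionary are illustrative assumptions, and the subsequent Kalman filtering stage is omitted.

```python
import numpy as np

def cosamp(Phi, y, K, n_iter=10):
    """Recover a K-sparse vector x with y ~= Phi @ x."""
    m, n = Phi.shape
    x = np.zeros(n)
    residual = y.astype(float).copy()
    for _ in range(n_iter):
        proxy = Phi.T @ residual                    # correlate residual with atoms
        omega = np.argsort(np.abs(proxy))[-2 * K:]  # 2K strongest atoms
        support = np.union1d(omega, np.flatnonzero(x)).astype(int)
        b = np.zeros(n)
        b[support] = np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]
        keep = np.argsort(np.abs(b))[-K:]           # prune to the K largest entries
        x = np.zeros(n)
        x[keep] = b[keep]
        residual = y - Phi @ x
    return x
```

In the proposed pipeline, the speech reconstructed this way would then be passed through a Kalman filter for further enhancement.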

2017-02-14
A. Chouhan, S. Singh.  2015.  "Real time secure end to end communication over GSM network". 2015 International Conference on Energy Systems and Applications. :663-668.

The GSM network is the most widely used communication network for mobile phones in the world. However, the security of voice communication is a major issue in the GSM network. This paper proposes a technique for secure end-to-end communication over the GSM network. The voice signal is encrypted in real time using digital techniques and transmitted over the GSM network; at the receiver end, the corresponding decoding algorithm is used to recover the original speech signal. Because the GSM speech transcoding process severely distorts an encrypted signal that does not possess the characteristics of a speech signal, it is not possible to use standard modem techniques over the GSM speech channel. The user may choose an appropriate algorithm and hardware platform as per requirements.
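
The key constraint here is that GSM transcoding destroys non-speech-like signals, so ordinary ciphertext cannot simply be modulated onto the voice channel. As one illustration of a speech-domain alternative (a classic voice-scrambling approach, not necessarily the authors' scheme), here is a numpy sketch of keyed time-segment scrambling, which keeps the transmitted signal speech-like; frame length and key handling are hypothetical.

```python
import numpy as np

def scramble(speech, frame_len=160, key=1234):
    """Permute short speech frames with a keyed permutation; stays speech-like."""
    n = len(speech) // frame_len
    frames = speech[: n * frame_len].reshape(n, frame_len)
    perm = np.random.default_rng(key).permutation(n)
    return frames[perm].reshape(-1)

def unscramble(speech, frame_len=160, key=1234):
    """Invert the keyed permutation at the receiver end."""
    n = len(speech) // frame_len
    frames = speech[: n * frame_len].reshape(n, frame_len)
    perm = np.random.default_rng(key).permutation(n)
    out = np.empty_like(frames)
    out[perm] = frames
    return out.reshape(-1)
```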

2015-05-06
Shimauchi, S., Ohmuro, H..  2014.  Accurate adaptive filtering in square-root Hann windowed short-time Fourier transform domain. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :1305-1309.

A novel short-time Fourier transform (STFT) domain adaptive filtering scheme is proposed that can easily be combined with nonlinear post-filters such as residual echo or noise reduction in acoustic echo cancellation. Unlike normal STFT subband adaptive filters, which suffer from aliasing artifacts due to their poor prototype filter, our scheme achieves good accuracy by exploiting the relationship between linear convolution and that poor prototype filter, i.e., the STFT window function. The effectiveness of our scheme was confirmed through simulations comparing it with conventional methods.
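
For orientation, here is a hedged numpy sketch of the general setting the paper improves on: a short multi-frame NLMS adaptive filter running independently in each STFT bin. The tap count, step size, and array shapes are illustrative assumptions, not the authors' scheme.

```python
import numpy as np

def stft_nlms(far_stft, mic_stft, taps=4, mu=0.5, eps=1e-8):
    """Per-bin multiframe NLMS. Inputs: (n_frames, n_bins) complex STFTs."""
    n_frames, n_bins = far_stft.shape
    w = np.zeros((n_bins, taps), dtype=complex)   # one short filter per frequency bin
    err = np.zeros_like(mic_stft)
    for t in range(n_frames):
        for k in range(n_bins):
            x = far_stft[max(0, t - taps + 1): t + 1, k][::-1]  # recent frames first
            x = np.pad(x, (0, taps - len(x)))
            e = mic_stft[t, k] - np.vdot(w[k], x)               # cancellation error
            w[k] += mu * np.conj(e) * x / (np.vdot(x, x).real + eps)
            err[t, k] = e
    return err
```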

2015-05-05
Jialing Mo, Qiang He, Weiping Hu.  2014.  An adaptive threshold de-noising method based on EEMD. Signal Processing, Communications and Computing (ICSPCC), 2014 IEEE International Conference on. :209-214.

In view of the difficulty of selecting the wavelet basis and decomposition level for wavelet-based de-noising methods, this paper proposes an adaptive de-noising method based on Ensemble Empirical Mode Decomposition (EEMD). An autocorrelation and cross-correlation method is used to adaptively find the signal-to-noise boundary layer of the EEMD. The noise-dominant layers are then filtered out directly, and the signal-dominant layers are threshold de-noised. Finally, the de-noised signal is reconstructed from the de-noised layer components. By using EEMD, this method solves the mode-mixing problem of Empirical Mode Decomposition (EMD) while retaining the advantages of wavelet thresholding. In this paper, we focus on analyzing and verifying the correctness of the adaptive determination of the noise-dominant layers. Simulation results show that this de-noising method is efficient and adapts well.
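
A minimal sketch of the pipeline, assuming the community PyEMD package (the authors do not specify an implementation): decompose with EEMD, drop the noise-dominant layers, threshold the rest, and reconstruct. The fixed boundary index and the universal soft threshold here are simplified stand-ins for the paper's autocorrelation-based boundary test and thresholding rule.

```python
import numpy as np
from PyEMD import EEMD  # community EMD-signal package (an assumption)

def eemd_denoise(noisy, boundary=3):
    """boundary: index of the first signal-dominant IMF (found adaptively in the paper)."""
    imfs = EEMD().eemd(noisy)                # IMFs ordered from high to low frequency
    clean = np.zeros_like(noisy, dtype=float)
    for i, imf in enumerate(imfs):
        if i < boundary:
            continue                         # noise-dominant layers filtered out
        # universal soft threshold as a stand-in for the paper's rule
        thr = np.median(np.abs(imf)) / 0.6745 * np.sqrt(2 * np.log(len(imf)))
        clean += np.sign(imf) * np.maximum(np.abs(imf) - thr, 0.0)
    return clean
```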

2015-05-04
Luque, J., Anguera, X..  2014.  On the modeling of natural vocal emotion expressions through binary key. Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European. :1562-1566.

This work presents a novel method to estimate naturally expressed emotions in speech through binary acoustic modeling. Standard acoustic features are mapped to a binary-value representation, and a support vector regression model is used to correlate them with the three continuous emotional dimensions. Three different sets of speech features, two based on spectral parameters and one on prosody, are compared on the VAM corpus, a set of spontaneous dialogues from a German TV talk show. The regression analysis, in terms of correlation coefficient and mean absolute error, shows that the binary key modeling successfully captures speaker emotion characteristics. The proposed algorithm obtains results comparable to those reported in the literature while relying on a much smaller set of acoustic descriptors. Furthermore, we also report preliminary results based on combining the binary models, which brings further performance improvements.
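
A hedged toy sketch of the overall shape of the method: acoustic features are binarized against a codebook and a support vector regressor maps the binary vectors to one continuous emotion dimension. The random stand-in features, the anchor-based binarization, and all names are illustrative assumptions, not the paper's binary-key construction.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 40))    # stand-in for per-utterance acoustic features
valence = rng.uniform(-1, 1, size=100)   # stand-in for one VAM emotion dimension

anchors = features[rng.choice(100, 16, replace=False)]  # toy binary-key codebook

def binary_key(x):
    # one bit per anchor: is this utterance closer to the anchor than average?
    d = np.linalg.norm(anchors - x, axis=1)
    return (d < d.mean()).astype(float)

X = np.array([binary_key(f) for f in features])
model = SVR().fit(X, valence)            # one SVR per emotional dimension
print(model.predict(X[:3]))
```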

Rafii, Z., Coover, B., Jinyu Han.  2014.  An audio fingerprinting system for live version identification using image processing techniques. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :644-648.

Suppose that you are at a music festival checking out an artist, and you would like to quickly learn about the song being played (e.g., title, lyrics, album, etc.). If you have a smartphone, you could record a sample of the live performance and compare it against a database of existing recordings from the artist. Services such as Shazam or SoundHound will not work here, as this is not the typical framework for audio fingerprinting or query-by-humming systems: a live performance is neither identical to its studio version (e.g., variations in instrumentation, key, tempo, etc.) nor is it a hummed or sung melody. We propose an audio fingerprinting system that can handle live version identification by using image processing techniques. Compact fingerprints are derived using a log-frequency spectrogram and an adaptive thresholding method, and template matching is performed using the Hamming similarity and the Hough transform.
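
A hedged numpy/scipy sketch of the fingerprinting steps named in the abstract: a log-frequency spectrogram, adaptive (local median) thresholding into a binary image, and Hamming similarity for matching. All parameters are illustrative, and the paper's exact transform and Hough-based alignment are omitted.

```python
import numpy as np
from scipy import signal, ndimage

def fingerprint(audio, sr, n_bins=64):
    f, t, S = signal.spectrogram(audio, fs=sr, nperseg=1024, noverlap=512)
    logS = np.log1p(np.abs(S))
    # pool linear frequency bins into roughly log-spaced bands
    edges = np.unique(np.geomspace(1, len(f) - 1, n_bins + 1).astype(int))
    bands = np.array([logS[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])])
    # adaptive threshold: compare each cell to its local median
    return bands > ndimage.median_filter(bands, size=(5, 21))

def hamming_similarity(fp_a, fp_b):
    n = min(fp_a.shape[1], fp_b.shape[1])
    return np.mean(fp_a[:, :n] == fp_b[:, :n])   # fraction of agreeing bits
```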

Ghatak, S., Lodh, A., Saha, E., Goyal, A., Das, A., Dutta, S..  2014.  Development of a keyboardless social networking website for visually impaired: SocialWeb. Global Humanitarian Technology Conference - South Asia Satellite (GHTC-SAS), 2014 IEEE. :232-236.

Over the past decade, we have witnessed a huge upsurge in social networking, which continues to touch and transform our lives to the present day. Social networks help us communicate with acquaintances and friends with whom we share similar interests on a common platform. Globally, there are more than 200 million visually impaired people. Visual impairment has many issues associated with it, but the one that stands out is the lack of safe, accessible content for entertainment and socializing. This paper deals with the development of a keyboardless social networking website for the visually impaired. The term keyboardless signifies minimal use of the keyboard: the user explores the contents of the website using assistive technologies like screen readers and speech-to-text (STT) conversion, which provides a user-friendly experience for the target audience. As soon as a user with minimal computer proficiency opens the website, the screen reader identifies the username and password fields. The user speaks the username, which is entered via STT conversion (using the Web Speech API). The control then moves to the password field, and similarly the user's password is obtained and matched against the one saved in the website database. The concept of acoustic fingerprinting has been implemented to validate the passwords of registered users and foil the intentions of malicious attackers. On a successful password match, the user can enjoy the services of the website without further hassle. Once the access obstacles associated with social networking sites are resolved and the proper technologies are put in place, social networking sites can be a rewarding, fulfilling, and enjoyable experience for visually impaired people.

Moussallam, M., Daudet, L..  2014.  A general framework for dictionary based audio fingerprinting. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :3077-3081.

Fingerprint-based audio recognition systems must address concurrent objectives: fingerprints must be both robust to distortions and discriminative, while their dimension must remain small to allow fast comparison. This paper proposes to restate these objectives as a penalized sparse representation problem. On top of this dictionary-based approach, we propose a structured sparsity model in the form of a probabilistic distribution over the sparse support. A practical suboptimal greedy algorithm is then presented and evaluated on robustness and recognition tasks. We show that some existing methods can be seen as particular cases of this algorithm and that the general framework allows reaching other points of a Pareto-like continuum.
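
A hedged sketch of the greedy sparse-decomposition step underlying such a framework: plain matching pursuit over a fixed random dictionary, where the selected atom indices serve as a compact fingerprint. The paper's structured (probabilistic) support model is not reproduced here, and the dictionary is an illustrative assumption.

```python
import numpy as np

def matching_pursuit(D, x, n_atoms=8):
    """D: (dim, n_total) dictionary with unit-norm columns; returns the sparse support."""
    residual = x.astype(float).copy()
    support = []
    for _ in range(n_atoms):
        corr = D.T @ residual
        k = int(np.argmax(np.abs(corr)))   # greedily pick the best-matching atom
        support.append(k)
        residual -= corr[k] * D[:, k]      # subtract its contribution
    return sorted(set(support))            # sparse support = compact fingerprint

rng = np.random.default_rng(0)
D = rng.normal(size=(128, 512))
D /= np.linalg.norm(D, axis=0)             # normalize atoms
x = rng.normal(size=128)
print(matching_pursuit(D, x))
```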

2014-09-26
Bursztein, E., Bethard, S., Fabry, C., Mitchell, J.C., Jurafsky, D..  2010.  How Good Are Humans at Solving CAPTCHAs? A Large Scale Evaluation. Security and Privacy (SP), 2010 IEEE Symposium on. :399-413.

Captchas are designed to be easy for humans but hard for machines. However, most recent research has focused only on making them hard for machines. In this paper, we present what is, to the best of our knowledge, the first large-scale evaluation of captchas from the human perspective, with the goal of assessing how much friction captchas present to the average user. For the purpose of this study, we asked workers from Amazon's Mechanical Turk and an underground captcha-breaking service to solve more than 318,000 captchas issued from the 21 most popular captcha schemes (13 image schemes and 8 audio schemes). Analysis of the resulting data reveals that captchas are often difficult for humans, with audio captchas being particularly problematic. We also find some demographic trends indicating, for example, that non-native speakers of English are slower in general and less accurate on English-centric captcha schemes. Evidence from a week's worth of eBay captchas (14,000,000 samples) suggests that the solving accuracies found in our study are close to real-world values, and that improving audio captchas should become a priority, as nearly 1% of all captchas are delivered as audio rather than images. Finally, our study also reveals that it is more effective for an attacker to use Mechanical Turk to solve captchas than an underground service.