Latest in eess.AS

327 results (search took 0.11s)
A spelling correction model for end-to-end speech recognition (Feb 19 2019). Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component ...
Low-Latency Deep Clustering For Speech Separation (Feb 19 2019). This paper proposes a low algorithmic latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: a) the usage of long-short-term-memory (LSTM) networks instead of their bidirectional variant ...
Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping (Feb 18 2019). Self-attention network, an attention-based feedforward neural network, has recently shown the potential to replace recurrent neural networks (RNNs) in a variety of NLP tasks. However, it is not clear if the self-attention network could be a good alternative ...
Deep-learning inversion: a next generation seismic velocity-model building method (Feb 17 2019). Seismic velocity is one of the most important parameters used in seismic exploration. Accurate velocity models are key prerequisites for reverse-time migration and other high-resolution seismic imaging techniques. Such velocity information has traditionally ...
An improved uncertainty propagation method for robust i-vector based speaker recognition (Feb 15 2019). The performance of automatic speaker recognition systems degrades when facing distorted speech data containing additive noise and/or reverberation. Statistical uncertainty propagation has been introduced as a promising paradigm to address this challenge. ...
Multimodal music information processing and retrieval: survey and future challenges (Feb 14 2019). Towards improving the performance in various music information processing tasks, recent studies exploit different modalities able to capture diverse aspects of music. Such modalities include audio recordings, symbolic music scores, mid-level representations, ...
Recurrent Neural Networks with Stochastic Layers for Acoustic Novelty Detection (Feb 13 2019). In this paper, we adapt Recurrent Neural Networks with Stochastic Layers, which are the state-of-the-art for generating text, music and speech, to the problem of acoustic novelty detection. By integrating uncertainty into the hidden states, this type ...
Enhanced Robot Speech Recognition Using Biomimetic Binaural Sound Source Localization (Feb 13 2019). Inspired by the behavior of humans talking in noisy environments, we propose an embodied embedded cognition approach to improve automatic speech recognition (ASR) systems for robots in challenging environments, such as with ego noise, using binaural sound ...
Improving performance and inference on audio classification tasks using capsule networks (Feb 13 2019). Classification of audio samples is an important part of many auditory systems. Deep learning models based on the Convolutional and the Recurrent layers are state-of-the-art in many such tasks. In this paper, we approach audio classification tasks using ...
Multitask Learning for Polyphonic Piano Transcription, a Case Study (Feb 12 2019). Viewing polyphonic piano transcription as a multitask learning problem, where we need to simultaneously predict onsets, intermediate frames and offsets of notes, we investigate the performance impact of additional prediction targets, using a variety of ...
FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks (Feb 12 2019). Deep dilated temporal convolutional networks (TCN) have been proved to be very effective in sequence modeling. In this paper we propose several improvements of TCN for end-to-end approach to monaural speech separation, which consists of 1) multi-scale ...
Adversarial Generation of Time-Frequency Features with application in audio synthesis (Feb 11 2019). Time-frequency (TF) representations provide powerful and intuitive features for the analysis of time series such as audio. But still, generative modeling of audio in the TF domain is a subtle matter. Consequently, neural audio synthesis widely relies ...
A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (Feb 11 2019). In a typical voice conversion system, vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach ...
Performance Advantages of Deep Neural Networks for Angle of Arrival Estimation (Feb 10 2019). The problem of estimating the number of sources and their angles of arrival from a single antenna array observation has been an active area of research in the signal processing community for the last few decades. When the number of sources is large, the ...
Generative Moment Matching Network-based Random Modulation Post-filter for DNN-based Singing Voice Synthesis and Neural Double-tracking (Feb 09 2019). This paper proposes a generative moment matching network (GMMN)-based post-filter that provides inter-utterance pitch variation for deep neural network (DNN)-based singing voice synthesis. The natural pitch variation of a human singing voice leads to ...
Machine learning and chord based feature engineering for genre prediction in popular Brazilian music (Feb 08 2019). Music genre can be hard to describe: many factors are involved, such as style, music technique, and historical context. Some genres even have overlapping characteristics. Looking for a better understanding of how music genres are related to musical harmonic ...
Speaker diarisation using 2D self-attentive combination of embeddings (Feb 08 2019). Speaker diarisation systems often cluster audio segments using speaker embeddings such as i-vectors and d-vectors. Since different types of embeddings are often complementary, this paper proposes a generic framework to improve performance by combining ...
Speech enhancement with variational autoencoders and alpha-stable distributions (Feb 08 2019). This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the framework of variational autoencoders. The noise model remains unsupervised because we do not assume prior knowledge ...
Hide and Speak: Deep Neural Networks for Speech Steganography (Feb 07 2019). Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as the carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this ...
Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification (Feb 07 2019). The performance of speaker verification degrades significantly when the test speech is corrupted by interference speakers. Speaker diarization does well to separate speakers if the speakers are not temporally overlapped. However, if multi-talkers speak at ...
Conv-codes: Audio Hashing For Bird Species Classification (Feb 07 2019). In this work, we propose a supervised, convex representation based audio hashing framework for bird species classification. The proposed framework utilizes archetypal analysis, a matrix factorization technique, to obtain convex-sparse representations ...
End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification (Feb 07 2019). In recent years, speaker verification has been primarily performed using deep neural networks that are trained to output embeddings from input features such as spectrograms or filterbank energies. Therefore, studies have been conducted to design various ...
End-to-end Anchored Speech Recognition (Feb 06 2019). Voice-controlled household devices, like Amazon Echo or Google Home, face the problem of performing speech recognition of device-directed speech in the presence of interfering background speech, i.e., background noise and interfering speech from another ...
Centroid-based deep metric learning for speaker recognition (Feb 06 2019). Speaker embedding models that utilize neural networks to map utterances to a space where distances reflect similarity between speakers have driven recent progress in the speaker recognition task. However, there is still a significant performance gap between ...
Unsupervised Polyglot Text To Speech (Feb 06 2019). We present a TTS neural network that is able to produce speech in multiple languages. The proposed network is able to transfer a voice, which was presented as a sample in a source language, into one of several target languages. Training is done without ...
Transfer Learning From Sound Representations For Anger Detection in Speech (Feb 06 2019). In this work, we train fully convolutional networks to detect anger in speech. Since training these deep architectures requires large amounts of data and the size of emotion datasets is relatively small, we use transfer learning. However, unlike previous ...
Polyphonic Music Composition with LSTM Neural Networks and Reinforcement Learning (Feb 05 2019). In the domain of algorithmic music composition, machine learning-driven systems eliminate the need for carefully hand-crafting rules for composition. In particular, the capability of recurrent neural networks to learn complex temporal patterns lends itself ...
A variance modeling framework based on variational autoencoders for speech enhancement (Feb 05 2019). In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised non-negative matrix factorization ...
An Ensemble SVM-based Approach for Voice Activity Detection (Feb 05 2019). Voice activity detection (VAD), used as the front end of speech enhancement, speech and speaker recognition algorithms, determines the overall accuracy and efficiency of the algorithms. Therefore, a VAD with low complexity and high accuracy is highly ...
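For context on the VAD entry above: learned detectors like this SVM ensemble are usually compared against a trivial energy-based baseline. The following is a minimal sketch of such a baseline, not the paper's method; the frame layout and the -40 dB threshold are illustrative assumptions.

```python
import math

def energy_vad(frames, threshold_db=-40.0):
    # frame-level energy VAD: a frame is marked "speech" when its mean
    # log energy exceeds a fixed threshold (threshold is an assumption)
    decisions = []
    for frame in frames:
        energy = sum(x * x for x in frame) / len(frame)
        level_db = 10 * math.log10(energy) if energy > 0 else -120.0
        decisions.append(level_db > threshold_db)
    return decisions

# a loud frame vs a near-silent frame
print(energy_vad([[0.5, -0.4, 0.3], [0.001, 0.001, -0.001]]))  # → [True, False]
```

Real systems add smoothing (hangover) over the raw frame decisions; this sketch only shows the per-frame test.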
Active Acoustic Source Tracking Exploiting Particle Filtering and Monte Carlo Tree Search (Feb 04 2019). In this paper, we address the task of active acoustic source tracking as part of robotic path planning. It denotes the planning of sequences of robotic movements to enhance tracking results of acoustic sources, e.g., talking humans, by fusing observations ...
Overlap-Add Windows with Maximum Energy Concentration for Speech and Audio Processing (Feb 04 2019). Processing of speech and audio signals with time-frequency representations require windowing methods which allow perfect reconstruction of the original signal and where processing artifacts have a predictable behavior. The most common approach for this ...
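The perfect-reconstruction requirement mentioned in the overlap-add abstract above is commonly checked via the constant overlap-add (COLA) property: the hop-shifted copies of the analysis window must sum to a constant. A minimal sketch with a standard periodic Hann window at 50% overlap (not the paper's optimized windows):

```python
import math

def hann(n):
    # periodic Hann window of length n
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]

def ola_sum(window, hop, length):
    # overlap-add of hop-shifted window copies across a buffer
    out = [0.0] * length
    n = len(window)
    for start in range(0, length - n + 1, hop):
        for k in range(n):
            out[start + k] += window[k]
    return out

N, hop = 8, 4
s = ola_sum(hann(N), hop, 64)
interior = s[N:-N]                  # ignore partially covered edges
print(min(interior), max(interior))  # both ~1.0: COLA holds at 50% overlap
```

With this window and hop, cos(θ) + cos(θ + π) cancels, so the shifted windows sum exactly to 1 in the fully overlapped interior.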
Deep Autotuner: A Data-Driven Approach to Natural-Sounding Pitch Correction for Singing Voice in Karaoke Performances (Feb 03 2019). We describe a machine-learning approach to pitch correcting a solo singing performance in a karaoke setting, where the solo voice and accompaniment are on separate tracks. The proposed approach addresses the situation where no musical score of the vocals ...
Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models (Feb 02 2019). Acoustic-to-word (A2W) models that allow direct mapping from acoustic signals to word sequences are an appealing approach to end-to-end automatic speech recognition due to their simplicity. However, prior works have shown that modelling A2W typically ...
FurcaNet: An end-to-end deep gated convolutional, long short-term memory, deep neural networks for single channel speech separation (Feb 02 2019). Deep gated convolutional networks have been proved to be very effective in single channel speech separation. However current state-of-the-art framework often considers training the gated convolutional networks in time-frequency (TF) domain. Such an approach ...
Is CQT more suitable for monaural speech separation than STFT? an empirical study (Feb 02 2019). Short-time Fourier transform (STFT) is used as the front end of many popular successful monaural speech separation methods, such as deep clustering (DPCL), permutation invariant training (PIT) and their various variants. Since the frequency component ...
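As background to the STFT front end compared above, a magnitude STFT can be computed directly from its definition: window each frame, take a DFT, keep the non-negative bins. This is a didactic sketch with a naive O(n²) DFT, not an efficient or paper-specific implementation:

```python
import cmath, math

def stft_mag(signal, n_fft, hop):
    # magnitude STFT: Hann window + naive DFT (for clarity, not speed)
    win = [0.5 - 0.5 * math.cos(2 * math.pi * k / n_fft) for k in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + k] * win[k] for k in range(n_fft)]
        spec = [abs(sum(frame[k] * cmath.exp(-2j * math.pi * f * k / n_fft)
                        for k in range(n_fft)))
                for f in range(n_fft // 2 + 1)]   # non-negative bins only
        frames.append(spec)
    return frames

# a pure tone at DFT bin 4 of a 16-point transform peaks in that bin
tone = [math.sin(2 * math.pi * 4 * t / 16) for t in range(64)]
mag = stft_mag(tone, 16, 8)
peak_bin = max(range(9), key=lambda f: mag[0][f])
print(peak_bin)  # → 4
```

The CQT discussed in the paper replaces these linearly spaced bins with geometrically spaced ones, trading frequency resolution for better low-frequency coverage.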
Exploring attention mechanism for acoustic-based classification of speech utterances into system-directed and non-system-directed (Feb 01 2019). Voice controlled virtual assistants (VAs) are now available in smartphones, cars, and standalone devices in homes. In most cases, the user needs to first "wake-up" the VA by saying a particular word/phrase every time he or she wants the VA to do something. ...
Image reconstruction enhancement via masked regularization (Jan 31 2019). Image reconstruction based on an edge-sparsity assumption has become popular in recent years. Many methods of this type are capable of reconstructing nearly perfect edge-sparse images using limited data. In this paper, we present a method to improve the ...
End-to-End Probabilistic Inference for Nonstationary Audio Analysis (Jan 31 2019). A typical audio signal processing pipeline includes multiple disjoint analysis stages, including calculation of a time-frequency representation followed by spectrogram-based feature analysis. We show how time-frequency analysis and nonnegative matrix ...
Discriminate natural versus loudspeaker emitted speech (Jan 31 2019). In this work, we address a novel, but potentially emerging, problem of discriminating natural human voices from those played back by any kind of audio device in the context of interactions with an in-house voice user interface. The tackled problem may ...
Automated Image Analysis and Contiguity Estimation for Liquid Phase Sintered Tungsten Heavy Alloys (Jan 29 2019). In this study an automated software model using digital image processing techniques is proposed for extracting the image characteristics and contiguity of liquid phase sintered tungsten heavy alloys. The developed model takes a typical image as input ...
Applying Visual Domain Style Transfer and Texture Synthesis Techniques to Audio - Insights and Challenges (Jan 29 2019). Style transfer is a technique for combining two images based on the activations and feature statistics in a deep learning neural network architecture. This paper studies the analogous task in the audio domain and takes a critical look at the problems ...
Detection of a Signal in Colored Noise: A Random Matrix Theory Based Analysis (Jan 28 2019). This paper investigates the classical statistical signal processing problem of detecting a signal in the presence of colored noise with an unknown covariance matrix. In particular, we consider a scenario where p possible m-dimensional signal-plus-noise ...
A Convolutional Neural Network model based on Neutrosophy for Noisy Speech Recognition (Jan 27 2019). Convolutional neural networks are sensitive to unknown noisy condition in the test phase and so their performance degrades for the noisy data classification task including noisy speech recognition. In this research, a new convolutional neural network ...
Unsupervised speech representation learning using WaveNet autoencoders (Jan 25 2019). We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, ...
Speech Separation Using Gain-Adapted Factorial Hidden Markov Models (Jan 22 2019). We present a new probabilistic graphical model which generalizes factorial hidden Markov models (FHMM) for the problem of single-channel speech separation (SCSS) in which we wish to separate the two speech signals $X(t)$ and $V(t)$ from a single recording ...
Non-linear time compression of clear and normal speech at high rates (Jan 22 2019). We compare a series of time compression methods applied to normal and clear speech. First we evaluate a linear (uniform) method applied to these styles as well as to naturally-produced fast speech. We found, in line with the literature, that unprocessed ...
Towards Universal End-to-End Affect Recognition from Multilingual Speech by ConvNets (Jan 19 2019). We propose an end-to-end affect recognition approach using a Convolutional Neural Network (CNN) that handles multiple languages, with applications to emotion and personality recognition from speech. We lay the foundation of a universal model that is trained ...
Using DNNs to Detect Materials in a Room based on Sound Absorption (Jan 17 2019). The materials of surfaces in a room play an important role in shaping the auditory experience within them. Different materials absorb energy at different levels. The level of absorption also varies across frequencies. This paper investigates how cues ...
Edge-masked CT image reconstruction from limited data (Jan 16 2019). This paper presents an iterative inversion algorithm for computed tomography image reconstruction that performs well in terms of accuracy and speed using limited data. The computational method combines an image domain technique and statistical reconstruction ...
Real-time separation of non-stationary sound fields on spheres (Jan 16 2019). The sound field separation methods can separate the target field from the interfering noises, facilitating the study of the acoustic characteristics of the target source, which is placed in a noisy environment. However, most of the existing sound field ...
AI Pipeline - bringing AI to you. End-to-end integration of data, algorithms and deployment tools (Jan 15 2019). Next generation of embedded Information and Communication Technology (ICT) systems are interconnected collaborative intelligent systems able to perform autonomous tasks. Training and deployment of such systems on Edge devices however require a fine-grained ...
A linear programming approach to the tracking of partials (Jan 15 2019). A new approach to the tracking of sinusoidal chirps using linear programming is proposed. It is demonstrated that the classical algorithm of McAulay and Quatieri is greedy and exhibits exponential complexity for long searches, while approaches based on ...
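For orientation on the partial-tracking entry above: the McAulay-Quatieri baseline it criticizes links spectral peaks frame to frame greedily. A toy sketch of such greedy nearest-frequency matching follows; the 50 Hz jump limit and the example frequencies are illustrative assumptions, and the paper's linear-programming formulation replaces exactly this greedy step.

```python
def greedy_match(prev_peaks, cur_peaks, max_jump=50.0):
    # greedy frame-to-frame peak continuation in the spirit of
    # McAulay-Quatieri: each track claims the closest unclaimed peak
    # within max_jump Hz (an assumed parameter)
    pairs, used = [], set()
    for p in prev_peaks:
        best, best_dist = None, max_jump
        for i, c in enumerate(cur_peaks):
            if i not in used and abs(c - p) <= best_dist:
                best, best_dist = i, abs(c - p)
        if best is not None:
            used.add(best)
            pairs.append((p, cur_peaks[best]))
    return pairs

print(greedy_match([440.0, 880.0], [442.0, 878.0, 1320.0]))
# → [(440.0, 442.0), (880.0, 878.0)]; the 1320 Hz peak starts a new track
```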
Exploring Transfer Learning for Low Resource Emotional TTS (Jan 14 2019). During the last few years, spoken language technologies have known a big improvement thanks to Deep Learning. However Deep Learning-based algorithms require amounts of data that are often difficult and costly to gather. Particularly, modeling the variability ...
Machine learning for the recognition of emotion in the speech of couples in psychotherapy using the Stanford Suppes Brain Lab Psychotherapy Dataset (Jan 14 2019). The automatic recognition of emotion in speech can inform our understanding of language, emotion, and the brain. It also has practical application to human-machine interactive systems. This paper examines the recognition of emotion in naturally occurring ...
A survey on acoustic sensing (Jan 11 2019). The rise of Internet-of-Things (IoT) has brought many new sensing mechanisms. Among these mechanisms, acoustic sensing attracts much attention in recent years. Acoustic sensing exploits acoustic sensors beyond their primary uses, namely recording and ...
Presence-absence estimation in audio recordings of tropical frog communities (Jan 08 2019). One non-invasive way to study frog communities is by analyzing long-term samples of acoustic material containing calls. This immense task has been optimized by the development of Machine Learning tools to extract ecological information. We explored a ...
Sinusoidal wave generating network based on adversarial learning and its application: synthesizing frog sounds for data augmentation (Jan 07 2019). Simulators that generate observations based on theoretical models can be important tools for development, prediction, and assessment of signal processing algorithms. In order to design these simulators, painstaking effort is required to construct mathematical ...
Extraction of digital wavefront sets using applied harmonic analysis and deep neural networks (Jan 05 2019). Microlocal analysis provides deep insight into singularity structures and is often crucial for solving inverse problems, predominately, in imaging sciences. Of particular importance is the analysis of wavefront sets and the correct extraction of those. ...
Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning (Jan 05 2019, updated Jan 11 2019). For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance ...
AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection (Jan 05 2019). Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual ...
Learning Sound Event Classifiers from Web Audio with Noisy Labels (Jan 04 2019). As sound event classification moves towards larger datasets, issues of label noise become inevitable. Web sites can supply large volumes of user-contributed audio and metadata, but inferring labels from this metadata introduces errors due to unreliable ...
Introduction to Voice Presentation Attack Detection and Recent Advances (Jan 04 2019). Over the past few years significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker recognition (ASV). This includes the development of new speech corpora, standard evaluation protocols and advancements ...
Feature reinforcement with word embedding and parsing information in neural TTS (Jan 03 2019). In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework. The proposed method utilizes the multiple input encoder to take three levels of text information, i.e., phoneme sequence, ...
End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking (Jan 02 2019). Recently, phase processing is attracting increasing interest in the speech enhancement community. Some researchers integrate a phase estimation module into speech enhancement models by using complex-valued short-time Fourier transform (STFT) spectrogram based ...
A Multiversion Programming Inspired Approach to Detecting Audio Adversarial Examples (Dec 26 2018). Adversarial examples (AEs) are crafted by adding human-imperceptible perturbations to inputs such that a machine-learning based classifier incorrectly labels them. They have become a severe threat to the trustworthiness of machine learning. While AEs ...
Design of generalized fractional order gradient descent method (Dec 24 2018). This paper focuses on the convergence problem of the emerging fractional order gradient descent method, and proposes three solutions to overcome the problem. In fact, the general fractional gradient method cannot converge to the real extreme point of ...
Multi-Domain Processing via Hybrid Denoising Networks for Speech Enhancement (Dec 21 2018). We present a hybrid framework that leverages the trade-off between temporal and frequency precision in audio representations to improve the performance of speech enhancement task. We first show that conventional approaches using specific representations ...
Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms (Dec 20 2018, updated Jan 17 2019). We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure ...
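The Fréchet distance underlying FAD is the 2-Wasserstein distance between Gaussians fitted to embedding statistics; in one dimension it has the closed form d² = (μ₁ - μ₂)² + (σ₁ - σ₂)². A sketch of this 1-D special case follows (FAD itself uses multivariate embedding statistics, where the variance term involves a matrix square root; the toy data is illustrative):

```python
import math

def frechet_gaussian_1d(mu1, sigma1, mu2, sigma2):
    # squared 2-Wasserstein (Frechet) distance between two 1-D Gaussians
    return (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2

def gaussian_stats(xs):
    # fit mean and standard deviation to a sample
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

# toy "embedding" statistics for a reference set and an enhanced set
ref = [0.10, 0.20, 0.15, 0.25, 0.30]
enh = [0.50, 0.60, 0.55, 0.65, 0.70]
fad_1d = frechet_gaussian_1d(*gaussian_stats(ref), *gaussian_stats(enh))
print(round(fad_1d, 3))  # → 0.16 (pure mean shift; the spreads match)
```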
Detecting the Trend in Musical Taste over the Decade -- A Novel Feature Extraction Algorithm to Classify Musical Content with Simple Features (Dec 19 2018). This work proposes a novel feature selection algorithm to classify Songs into different groups. Classification of musical content is often a non-trivial job and still relatively less explored area. The main idea conveyed in this article is to come up ...
BandNet: A Neural Network-based, Multi-Instrument Beatles-Style MIDI Music Composition Machine (Dec 18 2018). In this paper, we propose a recurrent neural network (RNN)-based MIDI music composition machine that is able to learn musical knowledge from existing Beatles' songs and generate music in the style of the Beatles with little human intervention. In the ...
Persian phonemes recognition using PPNet (Dec 17 2018). In this paper a new approach for recognition of Persian phonemes on the PCVC speech dataset is proposed. Nowadays deep neural networks are playing the main role in classification tasks. However, the best results in speech recognition are not as good as human ...
Persian Vowel recognition with MFCC and ANN on PCVC speech dataset (Dec 17 2018). In this paper a new method for recognition of consonant-vowel phonemes combination on a new Persian speech dataset titled as PCVC (Persian Consonant-Vowel Combination) is proposed which is used to recognize Persian phonemes. In PCVC dataset, there are ...
Quaternion Convolutional Neural Networks for Detection and Localization of 3D Sound Events (Dec 17 2018). Learning from data in the quaternion domain enables us to exploit internal dependencies of 4D signals and treating them as a single entity. One of the models that perfectly suits with quaternion-valued data processing is represented by 3D acoustic signals ...
Circular Statistics-based low complexity DOA estimation for hearing aid application (Dec 17 2018). The proposed Circular statistics-based Inter-Microphone Phase difference estimation Localizer (CIMPL) method is tailored toward binaural hearing aid systems with microphone arrays in each unit. The method utilizes the circular statistics (circular mean ...
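The circular mean used by phase-difference localizers like the one above cannot be taken as an arithmetic average, because phase wraps at ±π; it is instead computed on the unit circle. A minimal sketch (the example phases are illustrative, not from the paper):

```python
import cmath, math

def circular_mean(angles):
    # average phase angles as unit vectors on the circle, so that
    # wrap-around at +/-pi is handled correctly
    return cmath.phase(sum(cmath.exp(1j * a) for a in angles))

# phases clustered around +/-pi: the naive arithmetic mean is about
# 1.03 rad, far from the cluster, while the circular mean stays near pi
phases = [math.pi - 0.10, -math.pi + 0.10, math.pi - 0.05]
mean_dir = circular_mean(phases)
```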
Learning to Generate Music with BachProp (Dec 17 2018). As deep learning advances, algorithms of music composition increase in performance. However, most of the successful models are designed for specific musical structures. Here, we present BachProp, an algorithmic composer that can generate music scores ...
Voiceprint recognition of Parkinson patients based on deep learning (Dec 17 2018). More than 90% of the Parkinson Disease (PD) patients suffer from vocal disorders. Speech impairment is already indicator of PD. This study focuses on PD diagnosis through voiceprint features. In this paper, a method based on Deep Neural Network (DNN) ...
Evaluation of an open-source implementation of the SRP-PHAT algorithm within the 2018 LOCATA challenge (Dec 14 2018). This short paper presents an efficient, flexible implementation of the SRP-PHAT multichannel sound source localization method. The method is evaluated on the single-source tasks of the LOCATA 2018 development dataset, and an associated Matlab toolbox ...
MorpheuS: generating structured music with constrained patterns and tension (Dec 12 2018). Automatic music generation systems have gained in popularity and sophistication as advances in cloud computing have enabled large-scale complex computations such as deep models and optimization algorithms on personal devices. Yet, they still face an important ...
FPUAS: Fully Parallel UFANS-based End-to-End Acoustic System with 10x Speed Up (Dec 12 2018, updated Dec 18 2018). A lightweight end-to-end acoustic system is crucial in the deployment of text-to-speech tasks. Finding one that produces good audios with small time latency and fewer errors remains a problem. In this paper, we propose a new non-autoregressive, fully ...
To Reverse the Gradient or Not: An Empirical Comparison of Adversarial and Multi-task Learning in Speech Recognition (Dec 09 2018, updated Dec 13 2018). Transcribed datasets typically contain speaker identity for each instance in the data. We investigate two ways to incorporate this information during training: Multi-Task Learning and Adversarial Learning. In multi-task learning, the goal is speaker prediction; ...
The USTC-NEL Speech Translation system at IWSLT 2018 (Dec 06 2018). This paper describes the USTC-NEL system to the speech translation task of the IWSLT Evaluation 2018. The system is a conventional pipeline system which contains 3 modules: speech recognition, post-processing and machine translation. We train a group ...
Multichannel reconstruction from nonuniform samples with application to image recovery (Dec 05 2018). The multichannel trigonometric reconstruction from uniform samples was proposed recently. It not only makes use of multichannel information about the signal but is also capable to generate various kinds of interpolation formulas according to the types ...
Intensity Particle Flow SMC-PHD Filter For Audio Speaker Tracking (Dec 04 2018). Non-zero diffusion particle flow Sequential Monte Carlo probability hypothesis density (NPF-SMC-PHD) filtering has been recently introduced for multi-speaker tracking. However, the NPF does not consider the missing detection which plays a key role in ...
Localization and Tracking of an Acoustic Source using a Diagonal Unloading Beamforming and a Kalman Filter (Dec 04 2018). We present the signal processing framework and some results for the IEEE AASP challenge on acoustic source localization and tracking (LOCATA). The system is designed for the direction of arrival (DOA) estimation in single-source scenarios. The proposed ...
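For orientation on the entry above: the Kalman-filter stage of such a localization-and-tracking pipeline can be reduced to a scalar filter over noisy DOA measurements. This is a hedged sketch with a random-walk state model and assumed process/measurement variances, not the paper's actual system:

```python
def kalman_track(measurements, q=0.01, r=1.0):
    # scalar Kalman filter; q (process variance) and r (measurement
    # variance) are illustrative assumptions, not values from the paper
    x, p = measurements[0], 1.0   # initial state estimate and variance
    track = [x]
    for z in measurements[1:]:
        p += q                    # predict: random-walk state model
        k = p / (p + r)           # Kalman gain
        x += k * (z - x)          # correct with the new DOA measurement
        p *= (1 - k)
        track.append(x)
    return track

noisy_doa = [30.0, 32.0, 29.0, 31.0, 30.5, 28.5, 30.0]   # degrees
smoothed = kalman_track(noisy_doa)
```

The smoothed track spans a visibly narrower range than the raw measurements, which is the behavior the tracking stage relies on.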
SwishNet: A Fast Convolutional Neural Network for Speech, Music and Noise Classification and Segmentation (Dec 01 2018). Speech, Music and Noise classification/segmentation is an important preprocessing step for audio processing/indexing. To this end, we propose a novel 1D Convolutional Neural Network (CNN) - SwishNet. It is a fast and lightweight architecture that operates ...
Naive Dictionary On Musical Corpora: From Knowledge Representation To Pattern Recognition (Nov 29 2018). In this paper, we propose and develop the novel idea of treating musical sheets as literary documents in the traditional text analytics parlance, to fully benefit from the vast amount of research already existing in statistical text mining and topic modelling. ...
UFANS: U-shaped Fully-Parallel Acoustic Neural Structure For Statistical Parametric Speech Synthesis With 20X Faster (Nov 28 2018). Neural networks with Auto-regressive structures, such as Recurrent Neural Networks (RNNs), have become the most appealing structures for acoustic modeling of parametric text to speech synthesis (TTS) in recent studies. Despite the prominent capacity to ...
Large-scale Speaker Retrieval on Random Speaker Variability Subspace (Nov 27 2018). This paper describes a fast speaker search system to retrieve segments of the same voice identity in the large-scale data. Locality Sensitive Hashing (LSH) is a fast nearest neighbor search algorithm and the recent study shows that LSH enables quick retrieval ...
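The LSH referenced in the entry above is commonly realized for cosine similarity with random hyperplane projections: nearby embeddings agree on most sign bits, so their hash codes collide with high probability. A small sketch, not the paper's system; the dimensions, vectors, and seed are illustrative assumptions:

```python
import random

def make_planes(n_bits, dim, seed=0):
    # random Gaussian hyperplanes; each contributes one hash bit
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(vec, planes):
    # sign of the projection onto each hyperplane; vectors separated by
    # a small angle tend to agree on most bits
    return [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

planes = make_planes(16, 4)          # illustrative sizes
a = [1.0, 0.2, 0.1, 0.0]             # stand-ins for speaker embeddings
b = [0.9, 0.25, 0.05, 0.02]          # close to a
c = [-1.0, 0.5, -0.3, 0.8]           # far from a
code_a, code_b, code_c = (lsh_code(v, planes) for v in (a, b, c))
```

Retrieval then buckets embeddings by their codes, so a query only compares against candidates whose codes are close in Hamming distance.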
DONUT: CTC-based Query-by-Example Keyword Spotting (Nov 26 2018). Keyword spotting--or wakeword detection--is an essential feature for hands-free operation of modern voice-controlled devices. With such devices becoming ubiquitous, users might want to choose a personalized custom wakeword. In this work, we present DONUT, ...
Interpretable Convolutional Filters with SincNet (Nov 23 2018). Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations, that are progressively obtained by combining simpler ones. Nevertheless, ...
Learning pronunciation from a foreign language in speech synthesis networks (Nov 23 2018). Although there are more than 65,000 languages in the world, the pronunciations of many phonemes sound similar across the languages. When people learn a foreign language, their pronunciation often reflect their native language's characteristics. That motivates ...
Differentiable Consistency Constraints for Improved Deep Speech Enhancement (Nov 20 2018). In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks ...
Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision (Nov 20 2018). This paper presents methods of making use of text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic modeling method ...
Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM (Nov 20 2018). Audio Sentiment Analysis is a popular research area which extends the conventional text-based sentiment analysis to depend on the effectiveness of acoustic features extracted from speech. However, current progress on audio sentiment analysis mainly focuses ...
Harmonic Recomposition using Conditional Autoregressive Modeling (Nov 18 2018). We demonstrate a conditional autoregressive pipeline for efficient music recomposition, based on methods presented in van den Oord et al. (2017). Recomposition (Casal & Casey, 2010) focuses on reworking existing musical pieces, adhering to structure at ...
Representation Mixing for TTS Synthesis (Nov 17 2018, updated Nov 24 2018). Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as ...
The Intrinsic Memorability of Everyday Sounds (Nov 17 2018). Our aural experience plays an integral role in the perception and memory of the events in our lives. Some of the sounds we encounter throughout the day stay lodged in our minds more easily than others; these, in turn, may serve as powerful triggers of ...
Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units (Nov 17 2018). Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag the polyphonic audio recordings, we propose to use Connectionist Temporal Classification (CTC) loss function on the top of Convolutional Recurrent Neural Network ...
Exploring Tradeoffs in Models for Low-latency Speech Enhancement (Nov 16 2018). We explore a variety of neural networks configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ...
Protecting Voice Controlled Systems Using Sound Source Identification Based on Acoustic Cues (Nov 16 2018). Over the last few years, a rapidly increasing number of Internet-of-Things (IoT) systems that adopt voice as the primary user input have emerged. These systems have been shown to be vulnerable to various types of voice spoofing attacks. Existing defense ...