Latest in cs.sd

A spelling correction model for end-to-end speech recognition (Feb 19 2019): Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component ...
Low-Latency Deep Clustering For Speech Separation (Feb 19 2019): This paper proposes a low algorithmic latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: a) the usage of long short-term memory (LSTM) networks instead of their bidirectional variant ...
Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping (Feb 18 2019): Self-attention network, an attention-based feedforward neural network, has recently shown the potential to replace recurrent neural networks (RNNs) in a variety of NLP tasks. However, it is not clear if the self-attention network could be a good alternative ...
An improved uncertainty propagation method for robust i-vector based speaker recognition (Feb 15 2019): The performance of automatic speaker recognition systems degrades when facing distorted speech data containing additive noise and/or reverberation. Statistical uncertainty propagation has been introduced as a promising paradigm to address this challenge. ...
Multimodal music information processing and retrieval: survey and future challenges (Feb 14 2019): Towards improving the performance in various music information processing tasks, recent studies exploit different modalities able to capture diverse aspects of music. Such modalities include audio recordings, symbolic music scores, mid-level representations, ...
Recurrent Neural Networks with Stochastic Layers for Acoustic Novelty Detection (Feb 13 2019): In this paper, we adapt Recurrent Neural Networks with Stochastic Layers, which are the state-of-the-art for generating text, music and speech, to the problem of acoustic novelty detection. By integrating uncertainty into the hidden states, this type ...
Enhanced Robot Speech Recognition Using Biomimetic Binaural Sound Source Localization (Feb 13 2019): Inspired by the behavior of humans talking in noisy environments, we propose an embodied embedded cognition approach to improve automatic speech recognition (ASR) systems for robots in challenging environments, such as with ego noise, using binaural sound ...
Improving performance and inference on audio classification tasks using capsule networks (Feb 13 2019): Classification of audio samples is an important part of many auditory systems. Deep learning models based on the Convolutional and the Recurrent layers are state-of-the-art in many such tasks. In this paper, we approach audio classification tasks using ...
Multitask Learning for Polyphonic Piano Transcription, a Case Study (Feb 12 2019): Viewing polyphonic piano transcription as a multitask learning problem, where we need to simultaneously predict onsets, intermediate frames and offsets of notes, we investigate the performance impact of additional prediction targets, using a variety of ...
Puppet Dubbing (Feb 12 2019): Dubbing puppet videos to make the characters (e.g. Kermit the Frog) convincingly speak a new speech track is a popular activity with many examples of well-known puppets speaking lines from films or singing rap songs. But manually aligning puppet mouth ...
FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks (Feb 12 2019): Deep dilated temporal convolutional networks (TCN) have been proved to be very effective in sequence modeling. In this paper we propose several improvements of TCN for an end-to-end approach to monaural speech separation, which consists of 1) multi-scale ...
Adversarial Generation of Time-Frequency Features with application in audio synthesis (Feb 11 2019): Time-frequency (TF) representations provide powerful and intuitive features for the analysis of time series such as audio. But still, generative modeling of audio in the TF domain is a subtle matter. Consequently, neural audio synthesis widely relies ...
A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (Feb 11 2019): In a typical voice conversion system, a vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach ...
Performance Advantages of Deep Neural Networks for Angle of Arrival Estimation (Feb 10 2019): The problem of estimating the number of sources and their angles of arrival from a single antenna array observation has been an active area of research in the signal processing community for the last few decades. When the number of sources is large, the ...
Generative Moment Matching Network-based Random Modulation Post-filter for DNN-based Singing Voice Synthesis and Neural Double-tracking (Feb 09 2019): This paper proposes a generative moment matching network (GMMN)-based post-filter that provides inter-utterance pitch variation for deep neural network (DNN)-based singing voice synthesis. The natural pitch variation of a human singing voice leads to ...
Machine learning and chord based feature engineering for genre prediction in popular Brazilian music (Feb 08 2019): Music genre can be hard to describe: many factors are involved, such as style, music technique, and historical context. Some genres even have overlapping characteristics. Looking for a better understanding of how music genres are related to musical harmonic ...
Speaker diarisation using 2D self-attentive combination of embeddings (Feb 08 2019): Speaker diarisation systems often cluster audio segments using speaker embeddings such as i-vectors and d-vectors. Since different types of embeddings are often complementary, this paper proposes a generic framework to improve performance by combining ...
Speech enhancement with variational autoencoders and alpha-stable distributions (Feb 08 2019): This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the framework of variational autoencoders. The noise model remains unsupervised because we do not assume prior knowledge ...
Hide and Speak: Deep Neural Networks for Speech Steganography (Feb 07 2019): Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as the carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this ...
Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification (Feb 07 2019): The performance of speaker verification degrades significantly when the test speech is corrupted by interference speakers. Speaker diarization does well to separate speakers if the speakers are temporally overlapped. However, if multi-talkers speak at ...
Conv-codes: Audio Hashing For Bird Species Classification (Feb 07 2019): In this work, we propose a supervised, convex representation based audio hashing framework for bird species classification. The proposed framework utilizes archetypal analysis, a matrix factorization technique, to obtain convex-sparse representations ...
End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification (Feb 07 2019): In recent years, speaker verification has been primarily performed using deep neural networks that are trained to output embeddings from input features such as spectrograms or filterbank energies. Therefore, studies have been conducted to design various ...
End-to-end Anchored Speech Recognition (Feb 06 2019): Voice-controlled household devices, like Amazon Echo or Google Home, face the problem of performing speech recognition of device-directed speech in the presence of interfering background speech, i.e., background noise and interfering speech from another ...
Centroid-based deep metric learning for speaker recognition (Feb 06 2019): Speaker embedding models that utilize neural networks to map utterances to a space where distances reflect similarity between speakers have driven recent progress in the speaker recognition task. However, there is still a significant performance gap between ...
Unsupervised Polyglot Text To Speech (Feb 06 2019): We present a TTS neural network that is able to produce speech in multiple languages. The proposed network is able to transfer a voice, which was presented as a sample in a source language, into one of several target languages. Training is done without ...
Transfer Learning From Sound Representations For Anger Detection in Speech (Feb 06 2019): In this work, we train fully convolutional networks to detect anger in speech. Since training these deep architectures requires large amounts of data and the size of emotion datasets is relatively small, we use transfer learning. However, unlike previous ...
Polyphonic Music Composition with LSTM Neural Networks and Reinforcement Learning (Feb 05 2019): In the domain of algorithmic music composition, machine learning-driven systems eliminate the need for carefully hand-crafting rules for composition. In particular, the capability of recurrent neural networks to learn complex temporal patterns lends itself ...
A variance modeling framework based on variational autoencoders for speech enhancement (Feb 05 2019): In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised non-negative matrix factorization ...
An Ensemble SVM-based Approach for Voice Activity Detection (Feb 05 2019): Voice activity detection (VAD), used as the front end of speech enhancement, speech and speaker recognition algorithms, determines the overall accuracy and efficiency of the algorithms. Therefore, a VAD with low complexity and high accuracy is highly ...
Deep Autotuner: A Data-Driven Approach to Natural-Sounding Pitch Correction for Singing Voice in Karaoke Performances (Feb 03 2019): We describe a machine-learning approach to pitch correcting a solo singing performance in a karaoke setting, where the solo voice and accompaniment are on separate tracks. The proposed approach addresses the situation where no musical score of the vocals ...
Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models (Feb 02 2019): Acoustic-to-word (A2W) models that allow direct mapping from acoustic signals to word sequences are an appealing approach to end-to-end automatic speech recognition due to their simplicity. However, prior works have shown that modelling A2W typically ...
FurcaNet: An end-to-end deep gated convolutional, long short-term memory, deep neural networks for single channel speech separation (Feb 02 2019): Deep gated convolutional networks have been proved to be very effective in single channel speech separation. However, current state-of-the-art frameworks often consider training the gated convolutional networks in the time-frequency (TF) domain. Such an approach ...
Is CQT more suitable for monaural speech separation than STFT? an empirical study (Feb 02 2019): Short-time Fourier transform (STFT) is used as the front end of many popular successful monaural speech separation methods, such as deep clustering (DPCL), permutation invariant training (PIT) and their various variants. Since the frequency component ...
End-to-End Probabilistic Inference for Nonstationary Audio Analysis (Jan 31 2019): A typical audio signal processing pipeline includes multiple disjoint analysis stages, including calculation of a time-frequency representation followed by spectrogram-based feature analysis. We show how time-frequency analysis and nonnegative matrix ...
Discriminate natural versus loudspeaker emitted speech (Jan 31 2019): In this work, we address a novel, but potentially emerging, problem of discriminating the natural human voices and those played back by any kind of audio devices in the context of interactions with in-house voice user interface. The tackled problem may ...
Applying Visual Domain Style Transfer and Texture Synthesis Techniques to Audio - Insights and Challenges (Jan 29 2019): Style transfer is a technique for combining two images based on the activations and feature statistics in a deep learning neural network architecture. This paper studies the analogous task in the audio domain and takes a critical look at the problems ...
A Convolutional Neural Network model based on Neutrosophy for Noisy Speech Recognition (Jan 27 2019): Convolutional neural networks are sensitive to unknown noisy conditions in the test phase, and so their performance degrades for noisy data classification tasks, including noisy speech recognition. In this research, a new convolutional neural network ...
Speech Separation Using Gain-Adapted Factorial Hidden Markov Models (Jan 22 2019): We present a new probabilistic graphical model which generalizes factorial hidden Markov models (FHMM) for the problem of single-channel speech separation (SCSS) in which we wish to separate the two speech signals $X(t)$ and $V(t)$ from a single recording ...
Using DNNs to Detect Materials in a Room based on Sound Absorption (Jan 17 2019): The materials of surfaces in a room play an important role in shaping the auditory experience within it. Different materials absorb energy at different levels. The level of absorption also varies across frequencies. This paper investigates how cues ...
Real-time separation of non-stationary sound fields on spheres (Jan 16 2019): The sound field separation methods can separate the target field from the interfering noises, facilitating the study of the acoustic characteristics of the target source, which is placed in a noisy environment. However, most of the existing sound field ...
AI Pipeline - bringing AI to you. End-to-end integration of data, algorithms and deployment tools (Jan 15 2019): The next generation of embedded Information and Communication Technology (ICT) systems are interconnected collaborative intelligent systems able to perform autonomous tasks. Training and deployment of such systems on Edge devices however require a fine-grained ...
A linear programming approach to the tracking of partials (Jan 15 2019): A new approach to the tracking of sinusoidal chirps using linear programming is proposed. It is demonstrated that the classical algorithm of McAulay and Quatieri is greedy and exhibits exponential complexity for long searches, while approaches based on ...
Exploring Transfer Learning for Low Resource Emotional TTS (Jan 14 2019): During the last few years, spoken language technologies have seen a big improvement thanks to Deep Learning. However, Deep Learning-based algorithms require amounts of data that are often difficult and costly to gather. Particularly, modeling the variability ...
Machine learning for the recognition of emotion in the speech of couples in psychotherapy using the Stanford Suppes Brain Lab Psychotherapy Dataset (Jan 14 2019): The automatic recognition of emotion in speech can inform our understanding of language, emotion, and the brain. It also has practical application to human-machine interactive systems. This paper examines the recognition of emotion in naturally occurring ...
A survey on acoustic sensing (Jan 11 2019): The rise of Internet-of-Things (IoT) has brought many new sensing mechanisms. Among these mechanisms, acoustic sensing has attracted much attention in recent years. Acoustic sensing exploits acoustic sensors beyond their primary uses, namely recording and ...
Presence-absence estimation in audio recordings of tropical frog communities (Jan 08 2019): One non-invasive way to study frog communities is by analyzing long-term samples of acoustic material containing calls. This immense task has been optimized by the development of Machine Learning tools to extract ecological information. We explored a ...
Sinusoidal wave generating network based on adversarial learning and its application: synthesizing frog sounds for data augmentation (Jan 07 2019): Simulators that generate observations based on theoretical models can be important tools for development, prediction, and assessment of signal processing algorithms. In order to design these simulators, painstaking effort is required to construct mathematical ...
Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning (Jan 05 2019, Jan 11 2019): For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance ...
AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection (Jan 05 2019): Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual ...
Learning Sound Event Classifiers from Web Audio with Noisy Labels (Jan 04 2019): As sound event classification moves towards larger datasets, issues of label noise become inevitable. Web sites can supply large volumes of user-contributed audio and metadata, but inferring labels from this metadata introduces errors due to unreliable ...
Introduction to Voice Presentation Attack Detection and Recent Advances (Jan 04 2019): Over the past few years significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker recognition (ASV). This includes the development of new speech corpora, standard evaluation protocols and advancements ...
Feature reinforcement with word embedding and parsing information in neural TTS (Jan 03 2019): In this paper, we propose a feature reinforcement method under the sequence-to-sequence neural text-to-speech (TTS) synthesis framework. The proposed method utilizes the multiple input encoder to take three levels of text information, i.e., phoneme sequence, ...
Deep Speech Enhancement for Reverberated and Noisy Signals using Wide Residual Networks (Jan 03 2019): This paper proposes a deep speech enhancement method which exploits the high potential of residual connections in a wide neural network architecture, a topology known as Wide Residual Network. This is supported on single dimensional convolutions computed ...
End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking (Jan 02 2019): Recently, phase processing is attracting increasing interest in the speech enhancement community. Some researchers integrate phase estimations module into speech enhancement models by using complex-valued short-time Fourier transform (STFT) spectrogram based ...
A Multiversion Programming Inspired Approach to Detecting Audio Adversarial Examples (Dec 26 2018): Adversarial examples (AEs) are crafted by adding human-imperceptible perturbations to inputs such that a machine-learning based classifier incorrectly labels them. They have become a severe threat to the trustworthiness of machine learning. While AEs ...
Multi-Domain Processing via Hybrid Denoising Networks for Speech Enhancement (Dec 21 2018): We present a hybrid framework that leverages the trade-off between temporal and frequency precision in audio representations to improve the performance of speech enhancement task. We first show that conventional approaches using specific representations ...
Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms (Dec 20 2018, Jan 17 2019): We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure ...
Detecting the Trend in Musical Taste over the Decade -- A Novel Feature Extraction Algorithm to Classify Musical Content with Simple Features (Dec 19 2018): This work proposes a novel feature selection algorithm to classify songs into different groups. Classification of musical content is often a non-trivial job and still a relatively less explored area. The main idea conveyed in this article is to come up ...
BandNet: A Neural Network-based, Multi-Instrument Beatles-Style MIDI Music Composition Machine (Dec 18 2018): In this paper, we propose a recurrent neural network (RNN)-based MIDI music composition machine that is able to learn musical knowledge from existing Beatles' songs and generate music in the style of the Beatles with little human intervention. In the ...
Persian phonemes recognition using PPNet (Dec 17 2018): In this paper, a new approach for recognition of Persian phonemes on the PCVC speech dataset is proposed. Nowadays deep neural networks are playing a main role in classification tasks. However, the best results in speech recognition are not as good as human ...
Persian Vowel recognition with MFCC and ANN on PCVC speech dataset (Dec 17 2018): In this paper, a new method for recognition of consonant-vowel phoneme combinations on a new Persian speech dataset titled PCVC (Persian Consonant-Vowel Combination) is proposed, which is used to recognize Persian phonemes. In PCVC dataset, there are ...
Quaternion Convolutional Neural Networks for Detection and Localization of 3D Sound Events (Dec 17 2018): Learning from data in the quaternion domain enables us to exploit internal dependencies of 4D signals and treat them as a single entity. One of the models that perfectly suits quaternion-valued data processing is represented by 3D acoustic signals ...
Circular Statistics-based low complexity DOA estimation for hearing aid application (Dec 17 2018): The proposed Circular statistics-based Inter-Microphone Phase difference estimation Localizer (CIMPL) method is tailored toward binaural hearing aid systems with microphone arrays in each unit. The method utilizes the circular statistics (circular mean ...
Learning to Generate Music with BachProp (Dec 17 2018): As deep learning advances, algorithms of music composition increase in performance. However, most of the successful models are designed for specific musical structures. Here, we present BachProp, an algorithmic composer that can generate music scores ...
Voiceprint recognition of Parkinson patients based on deep learning (Dec 17 2018): More than 90% of Parkinson's Disease (PD) patients suffer from vocal disorders. Speech impairment is already an indicator of PD. This study focuses on PD diagnosis through voiceprint features. In this paper, a method based on Deep Neural Network (DNN) ...
Evaluation of an open-source implementation of the SRP-PHAT algorithm within the 2018 LOCATA challenge (Dec 14 2018): This short paper presents an efficient, flexible implementation of the SRP-PHAT multichannel sound source localization method. The method is evaluated on the single-source tasks of the LOCATA 2018 development dataset, and an associated Matlab toolbox ...
MorpheuS: generating structured music with constrained patterns and tension (Dec 12 2018): Automatic music generation systems have gained in popularity and sophistication as advances in cloud computing have enabled large-scale complex computations such as deep models and optimization algorithms on personal devices. Yet, they still face an important ...
FPUAS: Fully Parallel UFANS-based End-to-End Acoustic System with 10x Speed Up (Dec 12 2018, Dec 18 2018): A lightweight end-to-end acoustic system is crucial in the deployment of text-to-speech tasks. Finding one that produces good audios with small time latency and fewer errors remains a problem. In this paper, we propose a new non-autoregressive, fully ...
To Reverse the Gradient or Not: An Empirical Comparison of Adversarial and Multi-task Learning in Speech Recognition (Dec 09 2018, Dec 13 2018): Transcribed datasets typically contain speaker identity for each instance in the data. We investigate two ways to incorporate this information during training: Multi-Task Learning and Adversarial Learning. In multi-task learning, the goal is speaker prediction; ...
The USTC-NEL Speech Translation system at IWSLT 2018 (Dec 06 2018): This paper describes the USTC-NEL system to the speech translation task of the IWSLT Evaluation 2018. The system is a conventional pipeline system which contains 3 modules: speech recognition, post-processing and machine translation. We train a group ...
Intensity Particle Flow SMC-PHD Filter For Audio Speaker Tracking (Dec 04 2018): Non-zero diffusion particle flow Sequential Monte Carlo probability hypothesis density (NPF-SMC-PHD) filtering has been recently introduced for multi-speaker tracking. However, the NPF does not consider the missing detection which plays a key role in ...
Localization and Tracking of an Acoustic Source using a Diagonal Unloading Beamforming and a Kalman Filter (Dec 04 2018): We present the signal processing framework and some results for the IEEE AASP challenge on acoustic source localization and tracking (LOCATA). The system is designed for the direction of arrival (DOA) estimation in single-source scenarios. The proposed ...
SwishNet: A Fast Convolutional Neural Network for Speech, Music and Noise Classification and Segmentation (Dec 01 2018): Speech, Music and Noise classification/segmentation is an important preprocessing step for audio processing/indexing. To this end, we propose a novel 1D Convolutional Neural Network (CNN) - SwishNet. It is a fast and lightweight architecture that operates ...
Naive Dictionary On Musical Corpora: From Knowledge Representation To Pattern Recognition (Nov 29 2018): In this paper, we propose and develop the novel idea of treating musical sheets as literary documents in the traditional text analytics parlance, to fully benefit from the vast amount of research already existing in statistical text mining and topic modelling. ...
UFANS: U-shaped Fully-Parallel Acoustic Neural Structure For Statistical Parametric Speech Synthesis With 20X Faster (Nov 28 2018): Neural networks with Auto-regressive structures, such as Recurrent Neural Networks (RNNs), have become the most appealing structures for acoustic modeling of parametric text to speech synthesis (TTS) in recent studies. Despite the prominent capacity to ...
DONUT: CTC-based Query-by-Example Keyword Spotting (Nov 26 2018): Keyword spotting--or wakeword detection--is an essential feature for hands-free operation of modern voice-controlled devices. With such devices becoming ubiquitous, users might want to choose a personalized custom wakeword. In this work, we present DONUT, ...
Learning pronunciation from a foreign language in speech synthesis networks (Nov 23 2018): Although there are more than 6,500 languages in the world, the pronunciations of many phonemes sound similar across the languages. When people learn a foreign language, their pronunciation often reflects their native language's characteristics. That motivates ...
Differentiable Consistency Constraints for Improved Deep Speech Enhancement (Nov 20 2018): In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks ...
Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision (Nov 20 2018): This paper presents methods of making use of text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic modeling method ...
Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM (Nov 20 2018): Audio Sentiment Analysis is a popular research area which extends the conventional text-based sentiment analysis to depend on the effectiveness of acoustic features extracted from speech. However, current progress on audio sentiment analysis mainly focuses ...
Harmonic Recomposition using Conditional Autoregressive Modeling (Nov 18 2018): We demonstrate a conditional autoregressive pipeline for efficient music recomposition, based on methods presented in van den Oord et al. (2017). Recomposition (Casal & Casey, 2010) focuses on reworking existing musical pieces, adhering to structure at ...
Representation Mixing for TTS Synthesis (Nov 17 2018, Nov 24 2018): Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as ...
The Intrinsic Memorability of Everyday Sounds (Nov 17 2018): Our aural experience plays an integral role in the perception and memory of the events in our lives. Some of the sounds we encounter throughout the day stay lodged in our minds more easily than others; these, in turn, may serve as powerful triggers of ...
Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units (Nov 17 2018): Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag the polyphonic audio recordings, we propose to use Connectionist Temporal Classification (CTC) loss function on top of Convolutional Recurrent Neural Network ...
Exploring Tradeoffs in Models for Low-latency Speech Enhancement (Nov 16 2018): We explore a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ...
Protecting Voice Controlled Systems Using Sound Source Identification Based on Acoustic Cues (Nov 16 2018): Over the last few years, a rapidly increasing number of Internet-of-Things (IoT) systems that adopt voice as the primary user input have emerged. These systems have been shown to be vulnerable to various types of voice spoofing attacks. Existing defense ...
Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization (Nov 16 2018): In this paper we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is based on a well-established multichannel local Gaussian modeling framework. We propose to use a neural network for modeling the speech ...
AclNet: efficient end-to-end audio classification CNN (Nov 16 2018): We propose an efficient end-to-end convolutional neural network architecture, AclNet, for audio classification. When trained with our data augmentation and regularization, we achieved state-of-the-art performance on the ESC-50 corpus with 85.65% accuracy. ...
HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty (Nov 15 2018): The way we perceive a sound depends on many aspects: its ecological frequency, acoustic features, typicality, and most notably, its identified source. In this paper, we present the HCU400: a dataset of 402 sounds ranging from easily identifiable everyday ...
Comprehensive evaluation of statistical speech waveform synthesis (Nov 15 2018, Dec 11 2018): Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted ...
Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems (Nov 15 2018): Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as the Lombard effect. Current speech enhancement systems based on deep learning do not usually take into account this change in the speaking style, ...
On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement (Nov 15 2018): Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in ...
Exploring RNN-Transducer for Chinese Speech Recognition (Nov 13 2018): End-to-end approaches have drawn much attention recently for significantly simplifying the construction of an automatic speech recognition (ASR) system. RNN transducer (RNN-T) is one of the popular end-to-end methods. Previous studies have shown that ...
Stream attention-based multi-array end-to-end speech recognition (Nov 12 2018): Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in far-field robustness. Taking advantage of all the information that each array shares and contributes is crucial in this task. Motivated by the advances ...
Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition (Nov 10 2018): Conventional deep neural network (DNN)-based speech enhancement (SE) approaches aim to minimize the mean square error (MSE) between enhanced speech and clean reference. The MSE-optimized model may not directly improve the performance of an automatic speech ...
Native Language Identification using i-vector (Nov 09 2018): The task of determining a speaker's native language based only on their speech in a second language is known as Native Language Identification or NLI. Due to its increasing applications in various domains of speech signal processing, this has emerged ...
Speech Enhancement Based on Reducing the Detail Portion of Speech Spectrograms in Modulation Domain via Discrete Wavelet Transform (Nov 08 2018): In this paper, we propose a novel speech enhancement (SE) method by exploiting the discrete wavelet transform (DWT). This new method reduces the amount of fast time-varying portion, viz. the DWT-wise detail component, in the spectrogram of speech signals ...
Class-conditional embeddings for music source separation (Nov 07 2018): Isolating individual instruments in a musical mixture has a myriad of potential applications, and seems imminently achievable given the levels of performance reached by recent deep learning methods. While most musical source separation techniques learn ...
Reconstructing Speech Stimuli From Human Auditory Cortex Activity Using a WaveNet Approach (Nov 06 2018, Nov 08 2018): The superior temporal gyrus (STG) region of cortex critically contributes to speech recognition. In this work, we show that a proposed WaveNet, with limited available data, is able to reconstruct speech stimuli from STG intracranial recordings. We further ...
SDR - half-baked or well done? (Nov 06 2018): In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their ...