Latest in cs.sd

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (Apr 18 2019). We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking ...
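SpecAugment's masking operations are simple enough to sketch directly. The following is a minimal illustration of frequency and time masking on a log-mel spectrogram; the mask widths, counts, and zero fill value are placeholder assumptions, not the paper's settings.

```python
# Hedged sketch of SpecAugment-style frequency/time masking on a
# log-mel spectrogram of shape (num_mel_bins, num_frames).
# F and T are illustrative maximum mask widths, not the published values.
import numpy as np

def spec_augment(spec, F=15, T=40, num_freq_masks=1, num_time_masks=1):
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    rng = np.random.default_rng()
    for _ in range(num_freq_masks):
        f = rng.integers(0, F + 1)                 # mask width in mel channels
        f0 = rng.integers(0, max(1, n_mels - f))   # mask start
        spec[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):
        t = rng.integers(0, T + 1)                 # mask width in frames
        t0 = rng.integers(0, max(1, n_frames - t))
        spec[:, t0:t0 + t] = 0.0
    return spec
```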
Inspecting and Interacting with Meaningful Music Representations using VAE (Apr 18 2019). Variational Autoencoders (VAEs) have already achieved great results on image generation and recently made promising progress on music generation. However, the generation process is still quite difficult to control, in the sense that the learned latent representations ...
Regression and Classification for Direction-of-Arrival Estimation with Convolutional Recurrent Neural Networks (Apr 17 2019). We present a novel learning-based approach to estimate the direction-of-arrival (DOA) of a sound source using a convolutional recurrent neural network (CRNN) trained via regression on synthetic data and Cartesian labels. We also describe an improved method ...
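Since the abstract mentions regression on Cartesian labels, one small supporting step is converting a predicted Cartesian direction vector back to azimuth and elevation. A hedged sketch of that conversion, not taken from the paper:

```python
# Convert a predicted Cartesian direction vector (x, y, z) to
# azimuth/elevation in degrees. Conventions (axes, zero azimuth)
# are assumptions for illustration.
import numpy as np

def cartesian_to_doa(xyz):
    x, y, z = xyz
    azimuth = np.degrees(np.arctan2(y, x))                 # angle in x-y plane
    elevation = np.degrees(np.arcsin(z / np.linalg.norm(xyz)))
    return azimuth, elevation
```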
Deep Filtering: Signal Extraction Using Complex Time-Frequency Filters (Apr 17 2019). Signal extraction from a single-channel mixture with additional undesired signals is most commonly performed using time-frequency (TF) masks. Typically, the mask is estimated with a deep neural network (DNN) and applied element-wise to the complex mixture ...
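For context, the conventional masking pipeline that this paper contrasts with looks roughly like the sketch below; `estimate_mask` stands in for a trained DNN and is hypothetical.

```python
# Hedged sketch of element-wise TF masking: estimate a real-valued mask
# from the magnitude spectrogram and apply it to the complex mixture STFT.
import numpy as np
from scipy.signal import stft, istft

def mask_and_reconstruct(mixture, fs, estimate_mask):
    f, t, X = stft(mixture, fs=fs, nperseg=512)   # complex mixture STFT
    mask = estimate_mask(np.abs(X))               # e.g. values in [0, 1]
    _, enhanced = istft(mask * X, fs=fs, nperseg=512)
    return enhanced
```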
MOSNet: Deep Learning based Objective Assessment for Voice Conversion (Apr 17 2019). Existing objective evaluation metrics for voice conversion (VC) do not always correlate well with human perception. Therefore, training VC models with such criteria may not effectively improve the naturalness and similarity of converted speech. In this ...
Few Shot Speaker Recognition using Deep Neural Networks (Apr 17 2019). The recent advances in deep learning are mostly driven by the availability of large amounts of training data. However, such data are not always available for specific tasks such as speaker recognition, where the collection of large amounts of data ...
Audio-Text Sentiment Analysis using Deep Robust Complementary Fusion of Multi-Features and Multi-Modalities (Apr 17 2019). Sentiment analysis research has developed rapidly in the last decade and has attracted widespread attention from academia and industry, most of it based on text. However, information in the real world usually comes in different modalities. ...
RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification (Apr 17 2019). Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, the utilization of raw waveforms is in its preliminary phase, requiring further investigation. ...
A Multi-Task Learning Framework for Overcoming the Catastrophic Forgetting in Automatic Speech Recognition (Apr 17 2019). Recently, data-driven Automatic Speech Recognition (ASR) systems have achieved state-of-the-art results, and transfer learning is often used to adapt existing systems to a target domain, e.g., by fine-tuning or retraining. However, in ...
Hard Sample Mining for the Improved Retraining of Automatic Speech Recognition (Apr 17 2019). Retraining with ever more new training data from the target domain is an effective way to improve the performance of existing Automatic Speech Recognition (ASR) systems. Recently, the Deep Neural Network (DNN) has become a successful model ...
Expediting TTS Synthesis with Adversarial Vocoding (Apr 16 2019). Recent approaches in text-to-speech (TTS) synthesis employ neural network strategies to vocode perceptually-informed spectrogram representations directly into listenable waveforms. Such vocoding procedures create a computational bottleneck in modern TTS ...
Audio-Visual Model Distillation Using Acoustic Images (Apr 16 2019). In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and a novel audio data modality, namely acoustic images. Previous models learn audio representations from raw signals or spectral ...
Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering (Apr 16 2019). Speech separation has been very successful with deep learning techniques. Substantial effort has been reported based on approaches over the spectrogram, which is well known as the standard time-and-frequency cross-domain representation for speech signals. ...
Co-Separating Sounds of Visual Objects (Apr 16 2019). Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this ...
Joined Audio-Visual Speech Enhancement and Recognition in the Cocktail Party: The Tug Of War Between Enhancement and Recognition Losses (Apr 16 2019). In this paper we propose an end-to-end LSTM-based model that performs single-channel speech enhancement and phone recognition in a cocktail party scenario where visual information of the target speaker is available. In the speech enhancement phase, the ...
Audio Denoising with Deep Network Priors (Apr 16 2019). We present a method for audio denoising that combines processing done in both the time domain and the time-frequency domain. Given a noisy audio clip, the method trains a deep neural network to fit this signal. Since the fitting is only partly successful ...
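The fit-the-noisy-signal idea echoes the deep-prior line of work and can be sketched as follows. The architecture, learning rate, and step count are illustrative assumptions; the premise is that early stopping fits structured signal before noise.

```python
# Speculative sketch of a deep-prior denoiser: fit a small conv net to the
# noisy clip from a fixed random input and stop early. Layer sizes and the
# number of steps are invented for illustration.
import torch
import torch.nn as nn

def denoise_by_prior(noisy, steps=500):
    noisy = torch.as_tensor(noisy, dtype=torch.float32).view(1, 1, -1)
    net = nn.Sequential(
        nn.Conv1d(1, 32, 15, padding=7), nn.ReLU(),
        nn.Conv1d(32, 32, 15, padding=7), nn.ReLU(),
        nn.Conv1d(32, 1, 15, padding=7),
    )
    z = torch.randn_like(noisy)                 # fixed random input
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):                      # early stopping is the key
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(z), noisy)
        loss.backward()
        opt.step()
    return net(z).detach().view(-1)
```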
Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks (Apr 16 2019). For our submission to the ZeroSpeech 2019 challenge, we apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis. Unsupervised discrete subword modelling could be useful for studies of phonetic ...
Spoof detection using x-vector and feature switching (Apr 16 2019). Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either through logical access, such as speech synthesis and voice conversion, or through physical access, such as replaying a pre-recorded utterance. Inspired ...
I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences (Apr 16 2019). The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such a joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks ...
RHR-Net: A Residual Hourglass Recurrent Neural Network for Speech Enhancement (Apr 15 2019). Most current speech enhancement models use spectrogram features that require an expensive transformation and result in phase information loss. Previous work has overcome these issues by using convolutional networks to learn long-range temporal correlations ...
Are Nearby Neighbors Relatives?: Diagnosing Deep Music Embedding Spaces (Apr 15 2019). Deep neural networks have frequently been used to directly learn representations useful for a given task from raw input data. In terms of overall performance metrics, machine learning solutions employing deep representations have frequently been reported ...
Semantic query-by-example speech search using visual grounding (Apr 15 2019). A number of recent studies have started to investigate how speech systems can be trained on untranscribed speech by leveraging accompanying images at training time. Examples of tasks include keyword prediction and within- and across-mode retrieval. Here ...
Singing voice synthesis based on convolutional neural networks (Apr 15 2019). The present paper describes singing voice synthesis based on convolutional neural networks (CNNs). Singing voice synthesis systems based on deep neural networks (DNNs) are currently being proposed and are improving the naturalness of synthesized singing ...
Proximal binaural sound can induce subjective frisson (Apr 15 2019). Sound frisson is a subjective experience wherein people tend to perceive the feeling of chills in addition to a physiological response, such as goosebumps. Multiple examples of frisson-inducing sounds have been reported in large online communities, ...
SpeechYOLO: Detection and Localization of Speech Objects (Apr 14 2019). In this paper, we propose to apply object detection methods from the vision domain to the speech recognition domain by treating audio fragments as objects. More specifically, we present SpeechYOLO, which is inspired by the YOLO algorithm for object detection ...
A robust DOA estimation method for a linear microphone array under reverberant and noisy environments (Apr 14 2019). A robust method for a linear array is proposed to address the difficulty of direction-of-arrival (DOA) estimation in reverberant and noisy environments. A direct-path dominance test based on onset detection is utilized to extract time-frequency bins ...
Towards Vulnerability Analysis of Voice-Driven Interfaces and Countermeasures for Replay (Apr 13 2019). Fake audio detection is expected to become an important research area in the field of smart speakers such as Google Home, Amazon Echo, and the chatbots developed for these platforms. This paper presents the replay attack vulnerability of voice-driven interfaces ...
Unsupervised Singing Voice Conversion (Apr 13 2019). We present a deep learning method for singing voice conversion. The proposed network is not conditioned on the text or on the notes, and it directly converts the audio of one singer to the voice of another. Training is performed without any form of supervision: ...
Audio Compression Using Graph-based Transform (Apr 13 2019). The graph-based transform is one of the recent transform coding methods and has been used successfully in state-of-the-art data decorrelation applications. In this paper, we propose a Graph-based Transform (GT) for audio compression. Hence, we introduce ...
End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning (Apr 13 2019). End-to-end text-to-speech (TTS) has shown great success on large quantities of paired text and speech data. However, laborious data collection remains difficult for at least 95% of the world's languages, which hinders the development of TTS in ...
Low-Latency Speaker-Independent Continuous Speech Separation (Apr 13 2019). Speaker-independent continuous speech separation (SI-CSS) is the task of converting a continuous audio stream, which may contain overlapping voices of unknown speakers, into a fixed number of continuous signals, each of which contains no overlapping speech ...
Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders (Apr 12 2019). Generative models have thrived in computer vision, enabling unprecedented image processes. Yet the results in audio remain less advanced. Our project targets real-time sound synthesis from a reduced set of high-level parameters, including semantic controls ...
Examining the Mapping Functions of Denoising Autoencoders in Music Source Separation (Apr 12 2019). The goal of this work is to investigate what music source separation approaches based on neural networks learn from the data. We examine the mapping functions of neural networks that are based on the denoising autoencoder (DAE) model and conditioned ...
Unsupervised Speech Domain Adaptation Based on Disentangled Representation Learning for Robust Speech Recognition (Apr 12 2019). In general, the performance of automatic speech recognition (ASR) systems is significantly degraded by the mismatch between training and test environments. Recently, a deep-learning-based image-to-image translation technique to translate an image ...
DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging (Apr 12 2019). Speech sounds are produced by the coordinated movement of the speaking organs. There are several available methods to model the relation between articulatory movements and the resulting speech signal. The reverse problem is often called acoustic-to-articulatory ...
RNN-based speech synthesis using a continuous sinusoidal model (Apr 12 2019). Recently, in statistical parametric speech synthesis, we proposed a continuous sinusoidal model (CSM) using continuous F0 (contF0) in combination with Maximum Voiced Frequency (MVF), which successfully achieved state-of-the-art vocoder performance (e.g. ...
Building a mixed-lingual neural TTS system with only monolingual data (Apr 12 2019). When deploying a Chinese neural text-to-speech (TTS) synthesis system, one of the challenges is to synthesize Chinese utterances with English phrases or words embedded. This paper looks into the problem in the encoder-decoder framework when only monolingual ...
Direct speech-to-speech translation with a sequence-to-sequence model (Apr 12 2019). We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning ...
The Sound of Motions (Apr 11 2019). Sounds originate from object motions and vibrations of the surrounding air. Inspired by the fact that humans are capable of interpreting sound sources from how objects move visually, we propose a novel system that explicitly captures such motion cues for the ...
A Simple Baseline for Audio-Visual Scene-Aware Dialog (Apr 11 2019). The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful ...
Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems (Apr 11 2019, updated Apr 14 2019). The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with ...
STC Antispoofing Systems for the ASVspoof2019 Challenge (Apr 11 2019). This paper describes the Speech Technology Center (STC) antispoofing systems submitted to the ASVspoof 2019 challenge. ASVspoof 2019 is the extended version of the previous challenges and includes two evaluation conditions: a logical access use-case scenario ...
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization (Apr 10 2019, updated Apr 16 2019). Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model suffers from the limitation that ...
Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces (Apr 10 2019). When using ultrasound video as input, Deep Neural Network-based Silent Speech Interfaces usually rely on the whole image to estimate the spectral parameters required for the speech synthesis step. Although this approach is quite straightforward, and it ...
Expectation-Maximization for Speech Source Separation Using Convolutive Transfer Function (Apr 10 2019). This paper addresses the problem of under-determined speech source separation from multichannel microphone signals, i.e. the convolutive mixtures of multiple sources. The time-domain signals are first transformed to the short-time Fourier transform (STFT) ...
A Compact and Discriminative Feature Based on Auditory Summary Statistics for Acoustic Scene Classification (Apr 10 2019). One of the biggest challenges of acoustic scene classification (ASC) is to find proper features to better represent and characterize environmental sounds. Environmental sounds generally involve more sound sources while exhibiting less structure in temporal ...
Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events (Apr 10 2019). In this paper, we propose a new strategy for acoustic scene classification (ASC), namely recognizing acoustic scenes by identifying distinct sound events. This differs from existing strategies, which focus on characterizing global acoustical distributions ...
Audio-noise Power Spectral Density Estimation Using Long Short-term Memory (Apr 10 2019). We propose a method using a long short-term memory (LSTM) network to estimate the noise power spectral density (PSD) of single-channel audio signals represented in the short-time Fourier transform (STFT) domain. An LSTM network common to all frequency ...
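One plausible reading of "an LSTM network common to all frequency" is a single LSTM shared across frequency bands, applied to each bin's magnitude track independently. A speculative sketch under that assumption, with invented layer sizes:

```python
# Speculative sketch: one LSTM shared across all STFT bins, run over each
# bin's log-magnitude sequence to predict a non-negative noise PSD.
import torch
import torch.nn as nn

class PerBandNoisePSD(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, log_mag):                      # (batch, frames, bins)
        b, t, k = log_mag.shape
        x = log_mag.permute(0, 2, 1).reshape(b * k, t, 1)  # one sequence per bin
        h, _ = self.lstm(x)
        psd = torch.exp(self.out(h))                 # exp keeps the PSD positive
        return psd.reshape(b, k, t).permute(0, 2, 1)  # back to (batch, frames, bins)
```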
RawNet: Fast End-to-End Neural Vocoder (Apr 10 2019). Neural-network-based vocoders have recently demonstrated a powerful ability to synthesize high-quality speech. These models usually generate samples by conditioning on spectral features, such as the Mel-spectrum. However, these features are extracted ...
A Framework for Multi-f0 Modeling in SATB Choir Recordings (Apr 10 2019). Fundamental frequency (f0) modeling is an important but relatively unexplored aspect of choir singing. Performance evaluation as well as auditory analysis of singing, whether individually or in a choir, often depend on extracting f0 contours for the singing ...
From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings (Apr 10 2019). Producing a large amount of annotated speech data for training ASR systems remains difficult for the more than 95% of the world's languages that are low-resourced. However, we note that human babies start to learn language from sounds (or phonetic ...
Neuralogram: A Deep Neural Network Based Representation for Audio Signals (Apr 10 2019). We propose the Neuralogram, a deep neural network based representation for understanding audio signals which, as the name suggests, transforms an audio signal into a dense, compact representation based upon embeddings learned via a neural architecture. ...
An Interactive Musical Prediction System with Mixture Density Recurrent Neural Networks (Apr 10 2019). This paper is about creating digital musical instruments where a predictive neural network model is integrated into the interactive system. Rather than predicting symbolic music (e.g., MIDI notes), we suggest predicting future control data from the ...
Distributed Deep Learning Strategies For Automatic Speech Recognition (Apr 10 2019). In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000), ...
A New GAN-based End-to-End TTS Training Algorithm (Apr 09 2019). End-to-end, autoregressive model-based TTS has shown significant performance improvements over conventional systems. However, autoregressive module training is affected by exposure bias, i.e., the mismatch between the different distributions of real ...
Exploiting Syntactic Features in a Parsed Tree to Improve End-to-End TTS (Apr 09 2019). End-to-end TTS, which can predict speech directly from a given sequence of graphemes or phonemes, has shown improved performance over conventional TTS. However, its predictive capability is still limited by the acoustic/phonetic coverage of the ...
ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection (Apr 09 2019, updated Apr 14 2019). ASVspoof, now in its third edition, is a series of community-led challenges which promote the development of countermeasures to protect automatic speaker verification (ASV) from the threat of spoofing. Advances in the 2019 edition include: (i) a consideration ...
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion (Apr 09 2019). Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, ...
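The CycleGAN machinery this model improves on rests on a cycle-consistency constraint, which is compact enough to sketch. The weight `lam` is a conventional placeholder, not the paper's value.

```python
# Minimal sketch of the CycleGAN cycle-consistency loss: two generators
# G (X -> Y) and F (Y -> X) trained so that F(G(x)) ~ x and G(F(y)) ~ y,
# on top of the usual adversarial losses (omitted here).
import torch

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    loss_x = torch.mean(torch.abs(F(G(x)) - x))   # X -> Y -> X round trip
    loss_y = torch.mean(torch.abs(G(F(y)) - y))   # Y -> X -> Y round trip
    return lam * (loss_x + loss_y)
```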
Ensemble Models for Spoofing Detection in Automatic Speaker Verification (Apr 09 2019). Detecting spoofing attempts against automatic speaker verification (ASV) systems is challenging, especially when using only one modeling approach. For robustness, we use both deep neural networks and traditional machine learning models and combine them as ...
Crossmodal Voice Conversion (Apr 09 2019). Humans are able to imagine a person's voice from the person's appearance and to imagine the person's appearance from his/her voice. In this paper, we make the first attempt to develop a method that can convert speech into a voice that matches an input face ...
Speech Enhancement with Wide Residual Networks in Reverberant Environments (Apr 09 2019). This paper proposes a speech enhancement method which exploits the high potential of residual connections in a Wide Residual Network architecture. It builds on one-dimensional convolutions computed along the time dimension, which is a powerful ...
Probability density distillation with generative adversarial networks for high-quality parallel waveform generation (Apr 09 2019). This paper proposes an effective probability density distillation (PDD) algorithm for WaveNet-based parallel waveform generation (PWG) systems. Recently proposed teacher-student frameworks in PWG systems have successfully achieved real-time generation ...
Audio Classification of Bit-Representation Waveform (Apr 08 2019). This paper investigates waveform representation for audio signal classification. Recently, the number of studies on audio waveform classification, such as acoustic event detection and music genre classification, has been increasing. Most studies on audio waveform ...
Deep Learning the EEG Manifold for Phonological Categorization from Active Thoughts (Apr 08 2019). Speech-related Brain Computer Interfaces (BCI) aim primarily at finding an alternative vocal communication pathway for people with speaking disabilities. As a step towards full decoding of imagined speech from active thoughts, we present a BCI system ...
Exploring Methods for the Automatic Detection of Errors in Manual Transcription (Apr 08 2019). The quality of data plays an important role in most deep learning tasks. In the speech community, transcription of speech recordings is indispensable. Since transcriptions are usually produced manually, automatically finding errors in manual transcriptions ...
Unsupervised Feature Learning for Environmental Sound Classification Using Cycle Consistent Generative Adversarial Network (Apr 08 2019). In this paper we propose a novel environmental sound classification approach incorporating unsupervised feature learning from a codebook via the spherical $K$-Means++ algorithm and a new architecture for high-level data augmentation. The audio signal is transformed ...
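Spherical k-means clusters unit-normalized features by cosine similarity. As a rough approximation of the codebook step, using standard k-means on normalized vectors rather than a true spherical variant, a learner might look like the sketch below.

```python
# Approximate sketch of spherical k-means codebook learning: normalize
# feature vectors to the unit sphere, cluster them, and encode each frame
# by its cosine-nearest codeword. A true spherical variant would also
# re-normalize centroids at every iteration.
import numpy as np
from sklearn.cluster import KMeans

def spherical_codebook(features, k=128):
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    km = KMeans(n_clusters=k, n_init=10).fit(unit)
    codebook = km.cluster_centers_
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
    codes = np.argmax(unit @ codebook.T, axis=1)   # cosine-nearest codeword
    return codebook, codes
```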
Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation (Apr 08 2019). We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram ...
Audio Source Separation via Multi-Scale Learning with Dilated Dense U-Nets (Apr 08 2019). Modern audio source separation techniques rely on optimizing sequence model architectures, such as 1D-CNNs, on mixture recordings to generalize well to unseen mixtures. Specifically, recent focus is on time-domain based architectures such as Wave-U-Net ...
Completely Unsupervised Phoneme Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models (Apr 08 2019). Producing a large annotated speech corpus for training ASR systems remains difficult for the more than 95% of the world's languages that are low-resourced, but collecting a relatively large unlabeled data set for such languages is more achievable. This ...
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram (Apr 08 2019, updated Apr 10 2019). Recent advances in neural-network-based text-to-speech have reached human-level naturalness in synthetic speech. The present sequence-to-sequence models can directly map text to mel-spectrogram acoustic features, which are convenient for modeling, but ...
Bayesian Subspace Hidden Markov Model for Acoustic Unit Discovery (Apr 08 2019). This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings, given a set of labeled recordings from other languages. Our approach may be described by the following two-step procedure: first, the ...
Duration robust sound event detection (Apr 08 2019). Task 4 of the DCASE2018 challenge demonstrated that substantially more research is needed for real-world application of sound event detection. Analyzing the challenge results, it can be seen that most successful models are biased towards predicting long ...
A Statistical Investigation of Long Memory in Language and Music (Apr 08 2019). Representation and learning of long-range dependencies is a central challenge confronted in modern applications of machine learning to sequence data. Yet despite the prominence of this issue, the basic problem of measuring long-range dependence, either ...
Direct Modelling of Speech Emotion from Raw Speech (Apr 08 2019, updated Apr 09 2019). Speech emotion recognition is a challenging task and heavily depends on hand-engineered acoustic features, which are typically crafted to echo human perception of speech signals. However, a filter bank that is designed from perceptual evidence is not ...
Adversarial Audio: A New Information Hiding Method and Backdoor for DNN-based Speech Recognition Models (Apr 08 2019). Audio is an important medium in people's daily life, and hidden information can be embedded into audio for covert communication. Current audio information hiding techniques can be roughly classified into time-domain-based and transform-domain-based techniques. ...
Temporal Convolution for Real-time Keyword Spotting on Mobile Devices (Apr 08 2019). Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in the field of deep learning have led to wide adoption of convolutional neural networks (CNNs) in KWS systems due to their exceptional ...
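A minimal sketch of the temporal-convolution idea for KWS: treat MFCC coefficients as channels so that convolutions slide along time only, keeping the model cheap. All layer sizes here are invented for illustration and do not reflect the paper's architecture.

```python
# Hypothetical temporal-convolution keyword spotter. Input is an MFCC
# matrix of shape (batch, n_mfcc, frames); convolutions run over time only.
import torch
import torch.nn as nn

class TemporalConvKWS(nn.Module):
    def __init__(self, n_mfcc=40, n_keywords=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=9, padding=4),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=9, padding=4),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                 # average over time
        )
        self.fc = nn.Linear(64, n_keywords)

    def forward(self, mfcc):
        return self.fc(self.net(mfcc).squeeze(-1))   # (batch, n_keywords)
```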
Improved Speaker-Dependent Separation for CHiME-5 Challenge (Apr 08 2019). This paper summarizes several follow-up contributions for improving our submitted NWPU speaker-dependent system for the CHiME-5 challenge, which aims to solve the problem of multi-channel, highly-overlapped conversational speech recognition in a dinner-party ...
Bayesian Non-Parametric Multi-Source Modelling Based Determined Blind Source Separation (Apr 08 2019). This paper proposes a determined blind source separation method using Bayesian non-parametric modelling of sources. Conventionally, source signals are separated from a given set of mixture signals by modelling them using non-negative matrix factorization ...
Time Domain Audio Visual Speech Separation (Apr 07 2019). Audio-visual multi-modal modeling has been demonstrated to be effective in many speech-related tasks, such as speech recognition and speech enhancement. This paper introduces a new time-domain audio-visual architecture for target speaker extraction from ...
Speech Model Pre-training for End-to-End Spoken Language Understanding (Apr 07 2019). Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without ...
VAE-based regularization for deep speaker embedding (Apr 07 2019). Deep speaker embedding has achieved state-of-the-art performance in speaker recognition. A potential problem is that these embedded vectors (called 'x-vectors') are not Gaussian, causing performance degradation with the widely used PLDA back-end scoring. In this ...
MCE 2018: The 1st Multi-target Speaker Detection and Identification Challenge Evaluation (Apr 07 2019). The Multi-target Challenge aims to assess how well current speech technology is able to determine whether or not a recorded utterance was spoken by one of a large number of blacklisted speakers. It is a form of multi-target speaker detection based on ...
VoiceID Loss: Speech Enhancement for Speaker Verification (Apr 07 2019). In this paper, we propose the VoiceID loss, a novel loss function for training a speech enhancement model to improve the robustness of speaker verification. In contrast to commonly used loss functions for speech enhancement, such as the L2 loss, the VoiceID ...
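As described, the loss trains the enhancer through a speaker model rather than against a clean reference. A speculative sketch of that wiring; the function names and the cross-entropy choice are assumptions, not the paper's definitions.

```python
# Speculative sketch: the enhancer's output is scored by a speaker model,
# and the speaker-identity loss is backpropagated into the enhancer.
import torch
import torch.nn.functional as F

def voiceid_style_loss(enhancer, verifier, noisy, speaker_labels):
    # verifier parameters are assumed frozen (requires_grad=False);
    # gradients still flow through its input back to the enhancer
    logits = verifier(enhancer(noisy))           # speaker posteriors
    return F.cross_entropy(logits, speaker_labels)
```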
Spoken Language Intent Detection using Confusion2Vec (Apr 07 2019). Decoding a speaker's intent is a crucial part of spoken language understanding (SLU). The presence of noise or errors in the text transcriptions, as in real-life scenarios, makes the task more challenging. In this paper, we address spoken language intent ...
Spatio-Temporal Attention Pooling for Audio Scene Classification (Apr 06 2019). Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that ...
Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data (Apr 06 2019). This paper introduces Taco-VC, a novel architecture for voice conversion (VC) based on the Tacotron synthesizer, a sequence-to-sequence model with attention. Most current prosody-preserving VC systems suffer from target similarity and quality ...
Large Margin Softmax Loss for Speaker Verification (Apr 06 2019). In neural network based speaker verification, the speaker embedding is expected to be discriminative between speakers while the intra-speaker distance should remain small. A variety of loss functions have been proposed to achieve this goal. In this paper, ...
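One widely known member of this loss family is the additive margin softmax (AM-softmax), sketched below for illustration; the paper surveys several variants and does not necessarily favor this one. Margin and scale values are conventional placeholders.

```python
# Illustrative AM-softmax logits: cosine similarity between L2-normalized
# embeddings and class weights, with an additive margin subtracted on the
# true class before scaling. Feed the result to F.cross_entropy.
import torch
import torch.nn.functional as F

def am_softmax_logits(embeddings, weight, labels=None, margin=0.2, scale=30.0):
    cos = F.normalize(embeddings) @ F.normalize(weight).t()
    if labels is not None:                        # training mode
        onehot = F.one_hot(labels, cos.size(1)).float()
        cos = cos - margin * onehot               # margin on the true class only
    return scale * cos
```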
Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion (Apr 06 2019). Grapheme-to-phoneme (G2P) conversion is an important task in automatic speech recognition and text-to-speech systems. Recently, G2P conversion has been viewed as a sequence-to-sequence task and modeled by RNN- or CNN-based encoder-decoder frameworks. However, ...
Towards Generalized Speech Enhancement with Generative Adversarial Networks (Apr 06 2019). The speech enhancement task usually consists of removing additive noise or reverberation that partially masks spoken utterances, affecting their intelligibility. However, little attention is drawn to other, perhaps more aggressive, signal distortions like ...
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks (Apr 06 2019). Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, ...
ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems (Apr 06 2019). This paper introduces a new database of voice recordings with the goal of supporting research on vulnerabilities and protection of voice-controlled systems (VCSs). In contrast to prior efforts, the proposed database contains both genuine voice commands ...
Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition (Apr 05 2019). In this work, we continue our research on the i-vector extractor for speaker verification (SV) and optimize its architecture for fast and effective discriminative training. We were motivated by the computational and memory requirements caused by the large ...
Jasper: An End-to-End Convolutional Neural Acoustic Model (Apr 05 2019). In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. ...
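Those five ingredients compose naturally into a residual sub-block, sketched here with invented channel counts and kernel size rather than the published configuration.

```python
# Hedged sketch of a Jasper-style sub-block using exactly the listed
# ingredients: 1D convolution, batch norm, ReLU, dropout, and a residual
# connection. Sizes are illustrative, not the paper's settings.
import torch
import torch.nn as nn

class JasperStyleBlock(nn.Module):
    def __init__(self, channels=256, kernel=11, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)
        self.bn = nn.BatchNorm1d(channels)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                         # x: (batch, channels, time)
        y = self.drop(torch.relu(self.bn(self.conv(x))))
        return torch.relu(x + y)                  # residual connection
```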