Latest in cs.SD

Telephonetic: Making Neural Language Models Robust to ASR and Semantic Noise (Jun 13 2019)
Speech processing systems rely on robust feature extraction to handle phonetic and semantic variations found in natural language. While techniques exist for desensitizing features to common noise patterns produced by Speech-to-Text (STT) and Text-to-Speech ...

(A) Data in the Life: Authorship Attribution of Lennon-McCartney Songs (Jun 12 2019)
The songwriting duo of John Lennon and Paul McCartney, the two founding members of the Beatles, composed some of the most popular and memorable songs of the last century. Despite having authored songs under the joint credit agreement of Lennon-McCartney, ...

Toward Interpretable Music Tagging with Self-Attention (Jun 12 2019)
Self-attention is an attention mechanism that learns a representation by relating different positions in the sequence. The transformer, a sequence model based solely on self-attention, and its variants have achieved state-of-the-art results in many ...

Deep Learning based Emotion Recognition System Using Speech Features and Transcriptions (Jun 11 2019)
This paper proposes a speech emotion recognition method based on speech features and speech transcriptions (text). Speech features such as the Spectrogram and Mel-frequency Cepstral Coefficients (MFCC) help retain emotion-related low-level characteristics ...
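The spectrogram and MFCC features named in the entry above are standard low-level audio descriptors. As a minimal sketch (not taken from the paper), they can be computed with librosa; the file name and all frame parameters below are illustrative assumptions.

```python
# Illustrative only: extracting the features named above with librosa.
# The file name and frame parameters are assumptions, not paper settings.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Log-mel spectrogram: keeps the time-frequency energy distribution.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)

# MFCCs: a compact, decorrelated summary of the spectral envelope.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames), (13, frames)
```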
Word-level Speech Recognition with a Dynamic Lexicon (Jun 10 2019)
We propose a direct-to-word sequence model with a dynamic lexicon. Our word network constructs word embeddings dynamically from character-level tokens. The word network can be integrated seamlessly with arbitrary sequence models including Connectionist ...

Estimation of 2D Velocity Model using Acoustic Signals and Convolutional Neural Networks (Jun 10 2019)
Estimating the parameters of a system from indirect measurements over the same system is a problem that occurs in many fields of engineering, known as the inverse problem. It also arises in the field of underwater acoustics, especially in media that ...

Transfer Learning for Ultrasound Tongue Contour Extraction with Different Domains (Jun 10 2019)
Medical ultrasound technology is widely used in routine clinical applications such as disease diagnosis and treatment, as well as in other applications like real-time monitoring of human tongue shapes and motions as visual feedback in second language training. ...

Using generative modelling to produce varied intonation for speech synthesis (Jun 10 2019)
Unlike human speakers, typical text-to-speech (TTS) systems are unable to produce multiple distinct renditions of a given sentence. This has previously been addressed by adding explicit external control. In contrast, generative models are able to capture ...

BowNet: Dilated Convolution Neural Network for Ultrasound Tongue Contour Extraction (Jun 10 2019)
Ultrasound imaging is safe, relatively affordable, and capable of real-time performance. One application of this technology is to visualize and characterize human tongue shape and motion during real-time speech in order to study healthy or impaired speech ...

"Did You Hear That?" Learning to Play Video Games from Audio Cues (Jun 10 2019, revised Jun 11 2019)
Game-playing AI research has long focused on learning to play video games from visual input or symbolic information. However, humans benefit from a wider array of sensors which we utilise in order to navigate the world around us. In particular, ...

Deep Learning-Based Automatic Downbeat Tracking: A Brief Review (Jun 10 2019)
As an important format of multimedia, music fills almost everyone's life. Automatically analyzing music is a significant step toward satisfying people's need for music retrieval and music recommendation in an effortless way. Among such tasks, downbeat tracking has ...

DCASE 2019: CNN depth analysis with different channel inputs for Acoustic Scene Classification (Jun 10 2019)
The objective of this technical report is to describe the framework used in Task 1, Acoustic scene classification (ASC), of the DCASE 2019 challenge. The presented approach is based on Log-Mel spectrogram representations and VGG-based Convolutional Neural ...

Deep Unsupervised Drum Transcription (Jun 09 2019)
We introduce DrummerNet, a drum transcription system that is trained in an unsupervised manner. DrummerNet does not require any ground-truth transcription, and with the data-scalability of deep neural networks, it learns from a large unlabelled dataset. ...

Deep Music Analogy Via Latent Representation Disentanglement (Jun 09 2019)
Analogy is a key solution to automated music generation, featured by its ability to generate both natural and creative pieces based on only a few examples. In general, an analogy is made by partially transferring the music abstractions, i.e., high-level ...

rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method (Jun 09 2019)
This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech ...
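For a flavor of the energy-based first pass that such methods build on, here is a minimal frame-energy segment detector in numpy. It is a deliberate simplification under assumed parameters (frame length, hop, threshold ratio), not the published rVAD algorithm.

```python
# Minimal sketch: flag frames whose energy exceeds a noise-relative
# threshold. A simplification for illustration, not the actual rVAD method.
import numpy as np

def high_energy_frames(x, frame_len=400, hop=160, ratio=3.0):
    """Boolean mask of frames whose energy exceeds `ratio` times
    a crude noise-floor estimate."""
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.array([
        np.sum(x[i * hop : i * hop + frame_len] ** 2) for i in range(n_frames)
    ])
    noise_floor = np.percentile(energy, 10)  # crude noise estimate
    return energy > ratio * noise_floor

x = np.random.randn(16000)  # stand-in for one second of 16 kHz audio
mask = high_energy_frames(x)
print(mask.mean())  # fraction of frames flagged as high-energy
```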
Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis (Jun 08 2019)
Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing ...

Audio tagging with noisy labels and minimal supervision (Jun 07 2019)
This paper introduces Task 2 of the DCASE2019 Challenge, titled "Audio tagging with noisy labels and minimal supervision". This task was hosted on the Kaggle platform as "Freesound Audio Tagging 2019". The task evaluates systems for multi-label audio ...

Singing voice separation: a study on training data (Jun 06 2019)
In recent years, singing voice separation systems have shown increased performance due to the use of supervised training. The design of training datasets is known to be a crucial factor in the performance of such systems. We investigate how the characteristics ...

GIBBONR: An R package for the detection and classification of acoustic signals using machine learning (Jun 06 2019)
Recent improvements in recording technology, data storage and battery life have led to an increased interest in the use of passive acoustic monitoring for a variety of research questions. One of the main obstacles in implementing wide-scale acoustic ...

Efficient Full-Rank Spatial Covariance Estimation Using Independent Low-Rank Matrix Analysis for Blind Source Separation (Jun 06 2019)
In this paper, we propose a new algorithm that efficiently separates a directional source and diffuse background noise based on independent low-rank matrix analysis (ILRMA). ILRMA is one of the state-of-the-art techniques of blind source separation (BSS) ...

Complex Evolution Recurrent Neural Networks (ceRNNs) (Jun 05 2019)
Unitary Evolution Recurrent Neural Networks (uRNNs) have three attractive properties: (a) the unitary property, (b) the complex-valued nature, and (c) their efficient linear operators. The literature so far does not address the question: how critical is the unitary ...

Automated Activity Recognition of Construction Equipment Using a Data Fusion Approach (Jun 05 2019)
Automated monitoring of construction operations, especially operations of equipment and machines, is an essential step toward cost estimating and planning of construction projects. In recent years, a number of methods have been suggested for recognizing activities ...

Dilated Convolution with Dilated GRU for Music Source Separation (Jun 04 2019)
Stacked dilated convolutions used in WaveNet have been shown to be effective for generating high-quality audio. By replacing pooling/striding with dilation in convolution layers, they can preserve high-resolution information and still reach distant locations. ...
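The dilation trick described in the entry above is easy to see in code: a PyTorch sketch stacking Conv1d layers with exponentially increasing dilation shows that the sequence length (temporal resolution) is preserved while the receptive field grows. Channel counts and depth are arbitrary choices, not the paper's architecture.

```python
# Sketch of the dilation idea: exponentially increasing dilation grows
# the receptive field without pooling or striding, so temporal
# resolution is preserved.
import torch
import torch.nn as nn

layers = []
for d in (1, 2, 4, 8):  # dilation doubles at each layer
    layers.append(nn.Conv1d(16, 16, kernel_size=3, dilation=d, padding=d))
stack = nn.Sequential(*layers)

x = torch.randn(1, 16, 1024)  # (batch, channels, time)
print(stack(x).shape)         # time length preserved: (1, 16, 1024)
# Receptive field: 1 + 2 * (1 + 2 + 4 + 8) = 31 samples per output frame.
```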
Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation (Jun 04 2019)
Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations ...

MelNet: A Generative Model for Audio in the Frequency Domain (Jun 04 2019)
Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably ...

A Review of Language and Speech Features for Cognitive-Linguistic Assessment (Jun 04 2019)
It is widely accepted that information derived from analyzing speech (the acoustic signal) and language production (words and sentences) serves as a useful window into the health of an individual's cognitive ability. In fact, most neuropsychological batteries ...

ShEMO -- A Large-Scale Validated Database for Persian Speech Emotion Detection (Jun 04 2019)
This paper introduces a large-scale, validated database for Persian called the Sharif Emotional Speech Database (ShEMO). The database includes 3000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data extracted from online radio plays. ...

A Surprising Density of Illusionable Natural Speech (Jun 03 2019)
Recent work on adversarial examples has demonstrated that most natural inputs can be perturbed to fool even state-of-the-art machine learning systems. But does this happen for humans as well? In this work, we investigate what fraction of natural instances ...

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion (Jun 03 2019)
End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of ...

Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN (Jun 03 2019)
Text-to-speech (TTS) acoustic models map linguistic features into an acoustic representation out of which an audible waveform is generated. The latest and most natural TTS systems build a direct mapping between the linguistic and waveform domains, like SampleRNN. ...

Voice Mimicry Attacks Assisted by Automatic Speaker Verification (Jun 03 2019)
In this work, we simulate a scenario where a publicly available ASV system is used to enhance mimicry attacks against another closed-source ASV system. Specifically, ASV technology is used to perform a similarity search between the voices of recruited ...

Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS (Jun 03 2019)
Neural TTS has demonstrated strong capabilities to generate human-like speech with high quality and naturalness, but its generalization to out-of-domain texts remains a challenging task with regard to the design of attention-based sequence-to-sequence ...

Continual Learning of New Sound Classes using Generative Replay (Jun 03 2019)
Continual learning consists of incrementally training a model on a sequence of datasets and testing on the union of all datasets. In this paper, we examine continual learning for the problem of sound classification, in which we wish to refine already ...

MUSICNTWRK: data tools for music theory, analysis and composition (Jun 03 2019)
We present the API for MUSICNTWRK, a Python library for pitch class set and rhythmic sequence classification and manipulation, the generation of networks in generalized music and sound spaces, deep learning algorithms for timbre recognition, and the ...

From Speech Chain to Multimodal Chain: Leveraging Cross-modal Data Augmentation for Semi-supervised Learning (Jun 03 2019)
The most common way for humans to communicate is by speech. But perhaps a language system cannot know what it is communicating without a connection to the real world through image perception. In fact, humans perceive these multiple sources of information together ...

Evaluating Non-aligned Musical Score Transcriptions with MV2H (Jun 03 2019)
The original MV2H metric was designed to evaluate systems which transcribe an input audio (or MIDI) piece into a complete musical score. However, it requires both the transcribed score and the ground truth score to be time-aligned with the input. Some ...

An acoustic model of a Helmholtz resonator under a grazing turbulent boundary layer (Jun 02 2019)
Acoustic models of resonant duct systems with turbulent flow depend on fitted constants based on expensive experimental test series. We introduce a new model of a resonant cavity, flush-mounted in a duct or flat plate, under grazing turbulent flow. Based ...

Super-resolution of Time-series Labels for Bootstrapped Event Detection (Jun 01 2019)
Solving real-world problems, particularly with deep learning, relies on the availability of abundant, quality data. In this paper we develop a novel framework that maximises the utility of time-series datasets that contain only small quantities of expertly-labelled ...

What does a Car-ssette tape tell? (May 31 2019)
Captioning has attracted much attention in image and video understanding, while little work examines audio captioning. This paper contributes a manually-annotated dataset of car scenes, extending a previously published hospital audio captioning dataset. ...

Increasing Compactness Of Deep Learning Based Speech Enhancement Models With Parameter Pruning And Quantization Techniques (May 31 2019)
Most recent studies on deep learning based speech enhancement (SE) have focused on improving denoising performance. However, successful SE applications require striking a desirable balance between denoising performance and computational cost in real scenarios. ...

Real-Time Adversarial Attacks (May 31 2019)
In recent years, many efforts have demonstrated that modern machine learning algorithms are vulnerable to adversarial attacks, where small, but carefully crafted, perturbations on the input can make them fail. While these attack methods are very effective, ...

Lattice-based lightly-supervised acoustic model training (May 30 2019)
In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing ...

Musical Composition Style Transfer via Disentangled Timbre Representations (May 30 2019)
Music creation involves not only composing the different parts (e.g., melody, chords) of a musical work but also arranging/selecting the instruments to play the different parts. While the former has received increasing attention, the latter has not been ...

Speaker Anonymization Using X-vector and Neural Waveform Models (May 30 2019)
The social media revolution has produced a plethora of web services to which users can easily upload and share multimedia documents. Despite the popularity and convenience of such services, the sharing of such inherently personal data, including speech ...

A Music Classification Model based on Metric Learning and Feature Extraction from MP3 Audio Files (May 30 2019)
The development of models for learning music similarity and feature extraction from audio media files is an increasingly important task for the entertainment industry. This work proposes a novel music classification model based on metric learning and ...

A new definition of the distortion matrix for an audio-to-score alignment system (May 29 2019)
In this paper we present a new definition of the distortion matrix for a score following framework based on DTW. The proposal consists of arranging the score information in a sequence of note combinations and learning a spectral pattern for each combination ...
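Since the entry above builds on a DTW-based score follower, a generic DTW alignment over a given distortion matrix looks roughly as follows. The cost matrix here is random and merely stands in for the paper's learned spectral-pattern distortions.

```python
# Minimal DTW sketch for audio-to-score alignment: given a distortion
# (cost) matrix between score states and audio frames, accumulate costs
# and backtrack the cheapest path.
import numpy as np

def dtw_path(cost):
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    return path[::-1]

cost = np.random.rand(20, 30)  # distortion: 20 score states x 30 frames
print(dtw_path(cost)[:5])
```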
Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR (May 29 2019)
In this paper, we present Hitachi and Paderborn University's joint effort for automatic speech recognition (ASR) in a dinner party scenario. The main challenges of ASR systems for dinner party recordings obtained by multiple microphone arrays are (1) ...

Texture Selection for Automatic Music Genre Classification (May 28 2019)
Music Genre Classification is the problem of associating genre-related labels to digitized music tracks. It has applications in the organization of commercial and personal music collections. Often, music tracks are described as a set of timbre-inspired ...

SignalTrain: Profiling Audio Compressors with Deep Neural Networks (May 28 2019, revised May 30 2019)
In this work we present a data-driven approach for predicting the behavior of (i.e., profiling) a given non-linear audio signal processing effect (henceforth "audio effect"). Our objective is to learn a mapping function that maps the unprocessed audio ...
Automatic Quality Control and Enhancement for Voice-Based Remote Parkinson's Disease Detection (May 28 2019, revised May 31 2019)
The performance of voice-based Parkinson's disease (PD) detection systems degrades when there is an acoustic mismatch between training and operating conditions, caused mainly by degradation in test signals. In this paper, we address this mismatch by considering ...

Two-level Explanations in Music Emotion Recognition (May 28 2019)
Current ML models for music emotion recognition, while generally working quite well, do not give meaningful or intuitive explanations for their predictions. In this work, we propose a 2-step procedure to arrive at spectrogram-level explanations that connect ...

Ensemble-based cover song detection (May 28 2019)
Audio-based cover song detection has received much attention in the MIR community in recent years. To date, the most popular formulation of the problem has been to compare the audio signals of two tracks and to make a binary decision based on this ...

Demonstration of PerformanceNet: A Convolutional Neural Network Model for Score-to-Audio Music Generation (May 28 2019)
We present in this paper PerformanceNet, a neural network model we proposed recently to achieve score-to-audio music generation. The model learns to convert a music piece from the symbolic domain to the audio domain, assigning performance-level attributes ...

Towards robust audio spoofing detection: a detailed comparison of traditional and learned features (May 28 2019)
Automatic speaker verification, like every other biometric system, is vulnerable to spoofing attacks. Using only a few minutes of recorded voice of a genuine client of a speaker verification system, attackers can develop a variety of spoofing attacks ...

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion (May 28 2019)
We present an unsupervised end-to-end training scheme in which we discover discrete subword units from speech without using any labels. The discrete subword units are learned under an ASR-TTS autoencoder reconstruction setting, where an ASR-Encoder is trained ...

UWB-NTIS Speaker Diarization System for the DIHARD II 2019 Challenge (May 27 2019)
In this paper, we present the system developed by the team from the New Technologies for the Information Society (NTIS) research center of the University of West Bohemia in Pilsen for the Second DIHARD Speech Diarization Challenge. The base of our system ...

CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition (May 27 2019)
Automatic speech recognition (ASR) systems are on an exciting path toward becoming simpler and more practical with the rise of various end-to-end models. However, most of them neglect the positioning of token boundaries from continuous ...

EG-GAN: Cross-Language Emotion Gain Synthesis based on Cycle-Consistent Adversarial Networks (May 27 2019)
Despite remarkable contributions from existing emotional speech synthesizers, we find that these methods are based on Text-to-Speech systems or limited by aligned speech pairs, which hampers pure emotion gain synthesis. Meanwhile, few studies have ...

Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks (May 27 2019)
We propose an end-to-end deep learning approach for generating real-time facial animation from audio alone. Specifically, our deep architecture employs a deep bidirectional long short-term memory network and an attention mechanism to discover the latent representations ...

Transcribing Content from Structural Images with Spotlight Mechanism (May 27 2019)
Transcribing content from structural images, e.g., writing notes from music scores, is a challenging task, as not only should the content objects be recognized, but the internal structure should also be preserved. Existing image recognition methods mainly ...

Auditory Separation of a Conversation from Background via Attentional Gating (May 26 2019)
We present a model for separating a set of voices out of a sound mixture containing an unknown number of sources. Our Attentional Gating Network (AGN) uses a variable attentional context to specify which speakers in the mixture are of interest. The attentional ...

Reconstructing faces from voices (May 25 2019)
Voice profiling aims at inferring various human parameters from speech, e.g. gender, age, etc. In this paper, we address the challenge posed by a subtask of voice profiling: reconstructing someone's face from their voice. The task is designed to ...

Fast computation of loudness using a deep neural network (May 24 2019)
The present paper introduces a deep neural network (DNN) for predicting the instantaneous loudness of a sound from its time waveform. The DNN was trained using the output of a more complex model, called the Cambridge loudness model. While a modern PC ...
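The training setup described above (fitting a fast DNN to the outputs of a slower reference model) is a form of model distillation. A toy sketch under stand-in assumptions: the `teacher_loudness` function below is a placeholder, not the actual Cambridge loudness model, and the network shape is arbitrary.

```python
# Sketch of regressing a small "student" network onto the outputs of a
# more complex reference model (stand-in teacher, not the real one).
import torch
import torch.nn as nn

def teacher_loudness(frames):  # hypothetical stand-in for the complex model
    return frames.pow(2).mean(dim=1, keepdim=True).log1p()

student = nn.Sequential(nn.Linear(400, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(200):                  # regress student onto teacher
    frames = torch.randn(32, 400)     # random waveform frames
    loss = nn.functional.mse_loss(student(frames), teacher_loudness(frames))
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```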
Self-supervised audio representation learning for mobile devices (May 24 2019)
We explore self-supervised models that can be potentially deployed on mobile devices to learn general purpose audio representations. Specifically, we propose methods that exploit the temporal context in the spectrogram domain. One method estimates the ...
Disentangled Feature for Weakly Supervised Multi-class Sound Event Detection (May 24 2019, revised Jun 03 2019)
We propose a disentangled feature for weakly supervised multi-class sound event detection (SED), which helps ameliorate the performance and the training efficiency of a class-wise attention based detection system by the introduction of more class-wise prior ...

A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement (May 23 2019, revised May 24 2019)
Single-channel speech enhancement with deep neural networks (DNNs) has shown promising performance and is thus being intensively studied. In this paper, instead of applying the mean squared error (MSE) as the loss function during DNN training for speech ...
FastSpeech: Fast, Robust and Controllable Text to Speech (May 22 2019, revised May 29 2019)
Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using ...
Une ou deux composantes ? La réponse de la diffusion en ondelettes [One or two components? The answer from wavelet scattering] (May 21 2019)
With the aim of constructing a biologically plausible model of machine listening, we study the representation of a multicomponent stationary signal by a wavelet scattering network. First, we show that renormalizing second-order nodes by their first-order ...

Bayesian Pitch Tracking Based on the Harmonic Model (May 21 2019)
Fundamental frequency is one of the most important characteristics of speech and audio signals. Harmonic model-based fundamental frequency estimators offer higher estimation accuracy and robustness against noise than the widely used autocorrelation-based ...
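The harmonic-model idea behind such estimators can be illustrated with a simple harmonic-summation scorer: each candidate f0 is scored by the spectral energy at its first few harmonics. This sketch is generic, omits the paper's Bayesian tracking, and all parameters are illustrative.

```python
# Illustrative harmonic-summation pitch estimator: score each candidate
# f0 by the spectral magnitude at its harmonics and pick the best.
import numpy as np

def harmonic_pitch(x, sr, f0_min=60.0, f0_max=400.0, n_harm=5):
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    candidates = np.arange(f0_min, f0_max, 1.0)
    scores = [
        sum(spec[np.argmin(np.abs(freqs - h * f0))]
            for h in range(1, n_harm + 1))
        for f0 in candidates
    ]
    return candidates[int(np.argmax(scores))]

sr = 16000
t = np.arange(2048) / sr
x = sum(np.sin(2 * np.pi * 110 * h * t) / h for h in range(1, 4))  # 110 Hz
print(harmonic_pitch(x, sr))  # close to 110.0
```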
A multi-room reverberant dataset for sound event localization and detection (May 21 2019, revised May 24 2019)
This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when ...

DNN-Based Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement (May 21 2019)
Multi-frame approaches for single-microphone speech enhancement, e.g., the multi-frame minimum-variance-distortionless-response (MVDR) filter, are able to exploit speech correlations across neighboring time frames. In contrast to single-frame approaches ...
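For reference, the classical MVDR solution minimizes the output noise power subject to a distortionless constraint: w = R_n^{-1} d / (d^H R_n^{-1} d). The numpy sketch below uses a random Hermitian noise covariance and steering vector; in the multi-frame variant discussed above, d would be the inter-frame speech correlation vector rather than a spatial steering vector.

```python
# Generic MVDR filter sketch with random, illustrative inputs.
import numpy as np

rng = np.random.default_rng(0)
N = 4                                  # filter length (frames or mics)
d = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # steering vector
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
R_n = A @ A.conj().T + np.eye(N)       # Hermitian PD noise covariance

# w = R_n^{-1} d / (d^H R_n^{-1} d)
Rinv_d = np.linalg.solve(R_n, d)
w = Rinv_d / (d.conj() @ Rinv_d)

print(np.allclose(w.conj() @ d, 1.0))  # distortionless constraint holds
```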
Effective parameter estimation methods for an ExcitNet model in generative text-to-speech systems (May 21 2019)
In this paper, we propose a high-quality generative text-to-speech (TTS) system using an effective spectrum and excitation estimation method. Our previous research verified the effectiveness of the ExcitNet-based speech generation model in a parametric ...

Parallel Neural Text-to-Speech (May 21 2019, revised Jun 05 2019)
In this work, we propose a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and obtains about 46.7 times speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality using a WaveNet vocoder. ...
Robust sound event detection in bioacoustic sensor networks (May 20 2019)
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, ...

Independent Vector Analysis with more Microphones than Sources (May 20 2019, revised May 22 2019)
We extend frequency-domain blind source separation based on independent vector analysis to the case where there are more microphones than sources. The signal is modelled as non-Gaussian sources in a Gaussian background. The proposed algorithm is based ...

Human Vocal Sentiment Analysis (May 19 2019)
In this paper, we use several techniques for conventional vocal feature extraction (MFCC, STFT) alongside deep-learning approaches such as CNNs, together with context-level analysis based on the textual data, and combine the different approaches for improved ...

A comprehensive study of speech separation: spectrogram vs waveform separation (May 17 2019)
Speech separation has been studied widely for single-channel close-talk recordings over the past few years; developed solutions are mostly in the frequency domain. Recently, a raw audio waveform separation network (TasNet) was introduced for single-channel data, ...

Dance Hit Song Prediction (May 17 2019)
Record companies invest billions of dollars in new talent around the globe each year. Gaining insight into what actually makes a hit song would provide tremendous benefits for the music industry. In this research we tackle this question by focusing on ...

Weakly-Supervised Temporal Localization via Occurrence Count Learning (May 17 2019)
We propose a novel model for temporal detection and localization which allows the training of deep neural networks using only counts of event occurrences as training labels. This powerful weakly-supervised framework alleviates the burden of the imprecise ...
CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network (May 17 2019, revised Jun 04 2019)
The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is ...

End-to-end Adaptation with Backpropagation through WFST for On-device Speech Recognition System (May 17 2019)
An on-device DNN-HMM speech recognition system works efficiently with a limited vocabulary in the presence of a variety of predictable noise. In such a case, vocabulary and environment adaptation is highly effective. In this paper, we propose a novel ...

The Audio Auditor: Participant-Level Membership Inference in Voice-Based IoT (May 17 2019)
Voice interfaces and assistants implemented by various services have become increasingly sophisticated, powered by the increased availability of data. However, users' audio data needs to be guarded while enforcing data-protection regulations, such as the ...

Learning discriminative features in sequence training without requiring framewise labelled data (May 16 2019)
In this work, we try to answer two questions: Can deeply learned features with discriminative power benefit an ASR system's robustness to acoustic variability? And how can they be learned without requiring framewise labelled sequence training data? As existing ...

Multi Web Audio Sequencer: Collaborative Music Making (May 16 2019)
Recent advancements in web-based audio systems have enabled sufficiently accurate timing control and real-time sound processing capabilities. Numerous specialized music tools, as well as digital audio workstations, are now accessible from browsers. Features ...

Effective Sentence Scoring Method using Bidirectional Language Model for Speech Recognition (May 16 2019)
In automatic speech recognition, many studies have shown performance improvements using language models (LMs). Recent studies have tried to use bidirectional LMs (biLMs) instead of conventional unidirectional LMs (uniLMs) for rescoring the $N$-best list ...
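N-best rescoring itself is simple to sketch: combine each hypothesis's acoustic score with a weighted LM score and re-rank. The hypotheses, scores, and weight below are made-up toy values; in the setup above, a biLM would supply `lm_score`.

```python
# Toy N-best rescoring: re-rank hypotheses by AM score plus weighted
# LM score. All values are illustrative.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    am_score: float   # acoustic model log-likelihood
    lm_score: float   # (bi)LM log-probability of the sentence

def rescore(nbest, lm_weight=0.5):
    return sorted(nbest,
                  key=lambda h: h.am_score + lm_weight * h.lm_score,
                  reverse=True)

nbest = [
    Hypothesis("recognize speech", -10.2, -4.1),
    Hypothesis("wreck a nice beach", -9.8, -7.9),
]
print(rescore(nbest)[0].text)  # "recognize speech" wins after rescoring
```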
Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech (May 16 2019, revised May 21 2019)
Rapid population aging has stimulated the development of assistive devices that provide personalized medical support to those in need who suffer from various etiologies. One prominent clinical application is a computer-assisted speech training system ...
End-to-End Multi-Channel Speech Separation (May 15 2019, revised May 28 2019)
The end-to-end approach for single-channel speech separation has been studied recently and has shown promising results. This paper extends the previous approach and proposes a new end-to-end model for multi-channel speech separation. The primary contributions ...

A general-purpose deep learning approach to model time-varying audio effects (May 15 2019)
Audio processors whose parameters are modified periodically over time are often referred to as time-varying or modulation-based audio effects. Most existing methods for modeling these types of effect units are optimized to a very specific circuit and ...

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models (May 15 2019)
Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). ...