Latest in eess.AS

A Realistic Face-to-Face Conversation System based on Deep Neural Networks (Aug 21 2019). To improve the experience of face-to-face conversation with an avatar, this paper presents a novel conversation system. It is composed of two sequence-to-sequence models, for listening and speaking respectively, and a Generative Adversarial Network (GAN) ...
From Text to Sound: A Preliminary Study on Retrieving Sound Effects to Radio Stories (Aug 20 2019). Sound effects play an essential role in producing high-quality radio stories but require enormous labor cost to add. In this paper, we address the problem of automatically adding sound effects to radio stories with a retrieval-based model. However, directly ...
A Microphone Array and Voice Algorithm based Smart Hearing Aid (Aug 20 2019). Approximately 6.2% of the world's population (466 million people) suffer from disabling hearing impairment [1]. Hearing impairment impacts negatively on one's education, financial success [2][3], cognitive development in childhood [4], including increased ...
Prosodic Phrase Alignment for Machine Dubbing (Aug 20 2019). Dubbing is a type of audiovisual translation where dialogues are translated and enacted so that they give the impression that the media is in the target language. It requires a careful alignment of dubbed recordings with the lip movements of performers ...
AI for Earth: Rainforest Conservation by Acoustic Surveillance (Aug 20 2019). Saving rainforests is key to halting adverse climate change. In this paper, we introduce an innovative solution built on acoustic surveillance and machine learning technologies to help rainforest conservation. In particular, we propose new convolutional ...
Fuzzy C-Means Clustering and Sonification of HRV Features (Aug 19 2019). Linear and non-linear measures of heart rate variability (HRV) are widely investigated as non-invasive indicators of health. Stress has a profound impact on heart rate, and different meditation techniques have been found to modulate heartbeat rhythm. ...
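Since the entry above names fuzzy c-means as its clustering method, here is a minimal, self-contained sketch of the standard algorithm; the data, feature dimensionality, and parameters below are hypothetical placeholders, not the authors' setup.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means. X: (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)              # soft memberships per sample
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                      # avoid division by zero
        # u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

# Hypothetical demo with two synthetic 2-D "HRV feature" clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
centers, U = fuzzy_c_means(X, n_clusters=2)
print(centers)  # one center near (0, 0), the other near (2, 2)
```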
Salient Speech Representations Based on Cloned Networks (Aug 19 2019). We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning ...
Two-Staged Acoustic Modeling Adaption for Robust Speech Recognition by the Example of German Oral History Interviews (Aug 19 2019). In automatic speech recognition, often little training data is available for specific challenging tasks, but training of state-of-the-art automatic speech recognition systems requires large amounts of annotated speech. To address this issue, we propose ...
Audio query-based music source separation (Aug 19 2019). In recent years, music source separation has been one of the most intensively studied research areas in music information retrieval. Improvements in deep learning have led to big progress in music source separation performance. However, most of the previous ...
Efficient Context Aggregation for End-to-End Speech Enhancement Using a Densely Connected Convolutional and Recurrent Network (Aug 18 2019). In speech enhancement, an end-to-end deep neural network converts a noisy speech signal to a clean speech directly in the time domain without time-frequency transformation or mask estimation. However, aggregating contextual information from a high-resolution ...
Music Transcription Based on Bayesian Piece-Specific Score Models Capturing Repetitions (Aug 18 2019). Most work on models for music transcription has focused on describing local sequential dependence of notes in musical scores and failed to capture their global repetitive structure, which can be a useful guide for transcribing music. Focusing on the rhythm, ...
Onset detection: A new approach to QBH system (Aug 17 2019). Query by Humming (QBH) is a system that provides a user with the song(s) which the user hums to the system. Current QBH methods require the extraction of onset and pitch information in order to track similarity with various versions of different songs. ...
JVS corpus: free Japanese multi-speaker voice corpus (Aug 17 2019). Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only ...
Survey on Deep Neural Networks in Speech and Vision Systems (Aug 16 2019). This survey presents a review of state-of-the-art deep neural network architectures, algorithms, and systems in vision and speech applications. Recent advances in deep artificial neural network algorithms and architectures have spurred rapid innovation ...
Towards Generating Ambisonics Using Audio-Visual Cue for Virtual Reality (Aug 16 2019). Ambisonics, i.e., full-sphere surround sound, is quintessential with 360-degree visual content to provide a realistic virtual reality (VR) experience. While 360-degree visual content capture gained a tremendous boost recently, the estimation of corresponding ...
Sub-Spectrogram Segmentation for Environmental Sound Classification via Convolutional Recurrent Neural Network and Score Level Fusion (Aug 16 2019). Environmental Sound Classification (ESC) is an important and challenging problem, and feature representation is a critical and even decisive factor in ESC. Feature representation ability directly affects the accuracy of sound classification. Therefore, ...
Learning Sub-Sampling and Signal Recovery with Applications in Ultrasound Imaging (Aug 15 2019). Limitations on bandwidth and power consumption impose strict bounds on data rates of diagnostic imaging systems. Consequently, the design of suitable (i.e. task- and data-aware) compression and reconstruction techniques has attracted considerable attention ...
Speaker Verification Using Simple Temporal Features and Pitch Synchronous Cepstral Coefficients (Aug 15 2019). Speaker verification is the process by which a speaker's claim of identity is tested against the claimed speaker's voice. Speaker verification is done by the use of some parameters (features) from the speaker's voice which can be used to differentiate ...
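The cepstral coefficients this entry relies on are computed from the log-magnitude spectrum; below is a generic real-cepstrum sketch. The paper's pitch-synchronous variant would place analysis frames at pitch periods, which this fixed-frame example does not attempt.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13, eps=1e-10):
    """Low-quefrency real cepstrum of one analysis frame."""
    spectrum = np.abs(np.fft.rfft(frame))
    log_spectrum = np.log(spectrum + eps)   # log-magnitude spectrum
    cepstrum = np.fft.irfft(log_spectrum)   # back to the quefrency domain
    return cepstrum[:n_coeffs]              # keep the first coefficients

# Synthetic voiced-like frame: a 100 Hz pulse train at 16 kHz, Hann-windowed.
fs, n = 16000, 512
t = np.arange(n) / fs
frame = np.sign(np.sin(2 * np.pi * 100 * t)) * np.hanning(n)
print(real_cepstrum(frame))
```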
Conditional LSTM-GAN for Melody Generation from Lyrics (Aug 15 2019). Melody generation from lyrics has been a challenging research issue in the field of artificial intelligence and music, which enables learning and discovering the latent relationship between interesting lyrics and accompanying melody. Unfortunately, the limited ...
State-of-the-art Speech Recognition using EEG and Towards Decoding of Speech Spectrum From EEG (Aug 14 2019). In this paper we first demonstrate continuous noisy speech recognition using electroencephalography (EEG) signals on English vocabulary using different types of state-of-the-art end-to-end automatic speech recognition (ASR) models; we further provide ...
Interleaved Multitask Learning for Audio Source Separation with Independent Databases (Aug 14 2019). Deep Neural Network-based source separation methods usually train independent models to optimize for the separation of individual sources. Although this can lead to good performance for well-defined targets, it can also be computationally expensive. The ...
Components Loss for Neural Networks in Mask-Based Speech Enhancement (Aug 14 2019). Estimating time-frequency domain masks for single-channel speech enhancement using deep learning methods has recently become a popular research field with promising results. In this paper, we propose a novel components loss (CL) for the training of neural ...
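The snippet truncates before defining the proposed components loss, so the sketch below shows only the generic mask-based enhancement step such losses train: apply a time-frequency mask to the noisy STFT and resynthesize. The oracle ideal ratio mask here stands in for a network's estimate and is purely illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance_with_mask(noisy, clean, fs=16000, nperseg=512):
    """Oracle mask-based enhancement: mask the noisy STFT, then inverse STFT."""
    _, _, N = stft(noisy, fs, nperseg=nperseg)       # noisy spectrogram
    _, _, S = stft(clean, fs, nperseg=nperseg)       # clean spectrogram
    irm = np.abs(S) / (np.abs(S) + np.abs(N - S) + 1e-10)   # ideal ratio mask
    _, enhanced = istft(irm * N, fs, nperseg=nperseg)
    return enhanced

# Toy demo: a 440 Hz tone in white noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.3 * rng.standard_normal(16000)
out = enhance_with_mask(noisy, clean)
```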
RTF-steered binaural MVDR beamforming incorporating multiple external microphones (Aug 13 2019). The binaural minimum-variance distortionless-response (BMVDR) beamformer is a well-known noise reduction algorithm that can be steered using the relative transfer function (RTF) vector of the desired speech source. Exploiting the availability of an external ...
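For readers unfamiliar with MVDR, the weights have a closed form given a steering (RTF) vector d and a noise covariance R; the sketch below computes them for a hypothetical 4-microphone array. Estimating the RTF itself, which is the entry's actual topic, is not shown.

```python
import numpy as np

def mvdr_weights(R_noise, d):
    """w = R^{-1} d / (d^H R^{-1} d): distortionless toward d, minimum noise power."""
    Rinv_d = np.linalg.solve(R_noise, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# Hypothetical 4-mic example with a random Hermitian positive-definite covariance.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
R = A @ A.conj().T + 4 * np.eye(4)
d = np.ones(4, dtype=complex)            # toy steering vector
w = mvdr_weights(R, d)
print(np.allclose(w.conj() @ d, 1.0))    # the distortionless constraint holds
```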
IMS-Speech: A Speech to Text Tool (Aug 13 2019). We present IMS-Speech, a web based tool for German and English speech transcription aiming to facilitate research in various disciplines which require access to lexical information in spoken language materials. This tool is based on modern open ...
End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning (Aug 13 2019). This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech. We propose to train an end-to-end system conditioned on speaker embeddings and further improved by transfer learning from clean speech. ...
Estimating & Mitigating the Impact of Acoustic Environments on Machine-to-Machine Signalling (Aug 13 2019). The advance of technology for transmitting Data-over-Sound in various IoT and telecommunication applications has led to the concept of machine-to-machine over-the-air acoustic signalling. Reverberation can have a detrimental effect on such machine-to-machine ...
Personal VAD: Speaker-Conditioned Voice Activity Detection (Aug 12 2019). In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, ...
Committee Draft of JPEG XL Image Coding System (Aug 12 2019). JPEG XL is a practical approach focused on scalable web distribution and efficient compression of high-quality images. It provides various benefits compared to existing image formats: 60% size reduction at equivalent subjective quality; fast, parallelizable ...
A Study on Angular Based Embedding Learning for Text-independent Speaker Verification (Aug 12 2019). Learning a good speaker embedding is important for many automatic speaker recognition tasks, including verification, identification and diarization. The embeddings learned by softmax are not discriminative enough for open-set verification tasks. Angular ...
Emotion Dependent Facial Animation from Affective Speech (Aug 11 2019). In human-to-computer interaction, facial animation in synchrony with affective speech can deliver more naturalistic conversational agents. In this paper, we present a two-stage deep learning approach for affective speech driven facial shape animation. ...
Unsupervised Stemming based Language Model for Telugu Broadcast News Transcription (Aug 10 2019). In Indian languages, native speakers are able to understand new words formed by either combining or modifying root words with tense and/or gender. Due to data insufficiency, an Automatic Speech Recognition (ASR) system may not accommodate all the words ...
Multi-modality Latent Interaction Network for Visual Question Answering (Aug 10 2019). Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual ...
Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants (Aug 09 2019). Voice-enabled interactions provide more human-like experiences in many popular IoT systems. Cloud-based speech analysis services extract useful information from voice input using speech recognition techniques. The voice signal is a rich resource that ...
The role of cue enhancement and frequency fine-tuning in hearing impaired phone recognition (Aug 09 2019). A speech-based hearing test is designed to identify the susceptible error-prone phones for an individual hearing-impaired (HI) ear. Only tokens robust at the experimental noise levels were chosen for the test. The noise-robustness of tokens is measured ...
Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling (Aug 09 2019). This research addresses the problem of acoustic modeling of low-resource languages for which transcribed training data is absent. The goal is to learn robust frame-level feature representations that can be used to identify and distinguish subword-level ...
Challenging the Boundaries of Speech Recognition: The MALACH Corpus (Aug 09 2019). There has been huge progress in speech recognition over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-hour subset of a large ...
ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection (Aug 09 2019). This paper introduces a new dataset called "ToyADMOS" designed for anomaly detection in machine operating sounds (ADMOS). To the best of our knowledge, no large-scale datasets are available for ADMOS, although large-scale datasets have contributed to recent ...
Monitor-Based Runtime Assurance for Temporal Logic Specifications (Aug 08 2019). This paper introduces the safety controller architecture as a runtime assurance mechanism for system specifications expressed as safety properties in Linear Temporal Logic (LTL). The safety controller has three fundamental components: a performance controller, ...
Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition (Aug 08 2019). In this paper, we explore various approaches for semi-supervised learning in an end-to-end automatic speech recognition (ASR) framework. The first step in our approach involves training a seed model on the limited amount of labelled data. Additional unlabelled ...
Universal Adversarial Audio Perturbations (Aug 08 2019). We demonstrate the existence of universal adversarial perturbations, which can fool a family of audio processing architectures, for both targeted and untargeted attacks. To the best of our knowledge, this is the first study on generating universal adversarial ...
Self-supervised Attention Model for Weakly Labeled Audio Event Classification (Aug 07 2019). We describe a novel weakly labeled Audio Event Classification approach based on a self-supervised attention model. The weakly labeled framework is used to eliminate the need for an expensive data labeling procedure, and self-supervised attention is deployed ...
Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoder (Aug 07 2019). Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform ...
Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition (Aug 07 2019). Convolutional neural networks (CNN) are widely used for speech emotion recognition (SER). In such cases, the short-time Fourier transform (STFT) spectrogram is the most popular choice for representing speech, which is fed as input to the CNN. However, ...
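As context for the STFT baseline this entry contrasts against, here is the usual log-magnitude STFT spectrogram used as CNN input; the librosa parameters and the synthetic test tone are illustrative choices, not the authors'.

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)  # any mono signal

S = np.abs(librosa.stft(y, n_fft=512, hop_length=128))   # magnitude STFT
log_spec = librosa.amplitude_to_db(S, ref=np.max)        # (freq_bins, frames)
print(log_spec.shape)                                    # fed to the CNN like an image
```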
Understanding Optical Music Recognition (Aug 07 2019). For over 50 years, researchers have been trying to teach computers to read music notation, referred to as Optical Music Recognition (OMR). However, this field is still difficult to access for new researchers, especially those without a significant musical ...
Viterbi Extraction tutorial with Hidden Markov Toolkit (Aug 07 2019). An algorithm used to extract HMM parameters is revisited. Most parts of the extraction process are taken from the implemented Hidden Markov Toolkit (HTK) program named HInit. The algorithm itself shows a few variations compared to implementations in other domains. ...
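A generic log-domain Viterbi decoder, the dynamic program at the core of the frame-to-state alignment HInit performs; HTK specifics such as stream weights and state tying are omitted, and the toy model below is hypothetical.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """log_A: (S, S) transition log-probs; log_B: (T, S) frame log-likelihoods;
    log_pi: (S,) initial log-probs. Returns (best state path, its log score)."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]                 # best score ending in each state
    psi = np.zeros((T, S), dtype=int)         # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A       # (previous state, current state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace the backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta.max())

# Toy 2-state example.
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.2], [0.1, 0.8], [0.2, 0.7]])
print(viterbi(log_A, log_B, log_pi))
```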
Practical Speech Recognition with HTK (Aug 06 2019). The practical aspects of developing an Automatic Speech Recognition System (ASR) with HTK are reviewed. Steps are explained concerning hardware, software, libraries, applications and computer programs used. The common procedure to rapidly apply speech ...
An End-to-End Text-independent Speaker Verification Framework with a Keyword Adversarial Network (Aug 06 2019). This paper presents an end-to-end text-independent speaker verification framework by jointly considering the speaker embedding (SE) network and automatic speech recognition (ASR) network. The SE network learns to output an embedding vector which distinguishes ...
Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation (Aug 06 2019). This article describes a probabilistic formulation of a Weighted Power minimization Distortionless response convolutional beamformer (WPD). The WPD unifies a weighted prediction error based dereverberation method (WPE) and a minimum power distortionless ...
Acceleration of rank-constrained spatial covariance matrix estimation for blind speech extraction (Aug 06 2019). In this paper, we propose new accelerated update rules for rank-constrained spatial covariance model estimation, which efficiently extracts a directional target source in diffuse background noise. The naive update rule requires heavy computation such ...
Two-stage Training for Chinese Dialect Recognition (Aug 06 2019). In this paper, we present a two-stage language identification (LID) system based on a shallow ResNet14 followed by a simple 2-layer recurrent neural network (RNN) architecture, which was used for Xunfei (iFlyTek) Chinese Dialect Recognition Challenge ...
Triplet Based Embedding Distance and Similarity Learning for Text-independent Speaker Verification (Aug 06 2019). Speaker embeddings are becoming increasingly popular in the text-independent speaker verification task. In this paper, we propose two improvements during the training stage. Both improvements are based on triplets, because the training stage and the evaluation stage ...
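The triplet objective this entry builds on is compact enough to state in code; the Euclidean distance and margin below are conventional defaults, not necessarily the paper's choices.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor/positive share a speaker; negative is a different speaker.
    Each argument: (batch, embedding_dim)."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)   # same-speaker distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)   # cross-speaker distance
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Toy check: positives close to the anchor, negatives far away -> near-zero loss.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 64))
print(triplet_loss(a, a + 0.01, a + 5.0))
```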
Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Aug 06 2019). In this paper, we propose an end-to-end Korean singing voice synthesis system from lyrics and a symbolic melody using the following three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution ...
Presenting the Acoustic Sounds for Wellbeing Dataset and Baseline Classification Results (Aug 05 2019). The field of sound healing includes ancient practices coming from a broad range of cultures. Across such practices there is a variety of instrumentation utilised. Practitioners suggest the ability of sound to target both mental and even physical health ...
Robust Over-the-Air Adversarial Examples Against Automatic Speech Recognition Systems (Aug 05 2019). Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples. These can induce the ASR to produce arbitrary transcriptions in response to any type of audio signal, be it speech, environmental sounds, or music. However, ...
V2S attack: building DNN-based voice conversion from automatic speaker verification (Aug 05 2019). This paper presents a new voice impersonation attack using voice conversion (VC). Enrolling personal voices for automatic speaker verification (ASV) offers natural and flexible biometric authentication systems. Basically, the ASV systems do not include ...
Cross-lingual Text-independent Speaker Verification using Unsupervised Adversarial Discriminative Domain Adaptation (Aug 05 2019). Speaker verification systems often degrade significantly when there is a language mismatch between training and testing data. Being able to improve a cross-lingual speaker verification system using unlabeled data can greatly increase the robustness of the ...
Sound Event Detection in Multichannel Audio using Convolutional Time-Frequency-Channel Squeeze and Excitation (Aug 04 2019). In this study, we introduce a convolutional time-frequency-channel "Squeeze and Excitation" (tfc-SE) module to explicitly model inter-dependencies between the time-frequency domain and multiple channels. The tfc-SE module consists of two parts: tf-SE ...
Probabilistic Permutation Invariant Training for Speech Separation (Aug 04 2019). Single-microphone, speaker-independent speech separation is normally performed through two steps: (i) separating the specific speech sources, and (ii) determining the best output-label assignment to find the separation error. The second step is the main ...
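The output-label assignment step the entry refers to is what permutation invariant training resolves: evaluate the loss under every output-to-speaker assignment and keep the minimum. A minimal sketch, with MSE as an illustrative criterion:

```python
import numpy as np
from itertools import permutations

def upit_loss(estimates, references):
    """estimates, references: (n_sources, n_samples) arrays.
    Returns the MSE under the best output-to-speaker assignment."""
    n = len(references)
    return min(
        np.mean((estimates[list(perm)] - references) ** 2)
        for perm in permutations(range(n))
    )

# Toy check: swapped outputs still yield a near-zero loss.
refs = np.stack([np.sin(np.linspace(0, 6, 100)), np.cos(np.linspace(0, 6, 100))])
print(upit_loss(refs[::-1].copy(), refs))
```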
LSTM Based Music Generation System (Aug 02 2019). Traditionally, music was treated as an analogue signal and was generated manually. In recent years, technology has made it possible to generate a suite of music automatically without any human intervention. To accomplish this task, we need to ...
SANTLR: Speech Annotation Toolkit for Low Resource Languages (Aug 02 2019). While low resource speech recognition has attracted a lot of attention from the speech community, there are few tools available to facilitate low resource speech collection. In this work, we present SANTLR: Speech Annotation Toolkit for Low Resource ...
Multilingual Speech Recognition with Corpus Relatedness Sampling (Aug 02 2019). Multilingual acoustic models have been successfully applied to low-resource speech recognition. Most existing works have combined many small corpora together and pretrained a multilingual model by sampling from each corpus uniformly. The model is eventually ...
High-Level Control of Drum Track Generation Using Learned Patterns of Rhythmic Interaction (Aug 02 2019). Spurred by the potential of deep learning, computational music generation has gained renewed academic interest. A crucial issue in music generation is that of user control, especially in scenarios where the music generation process is conditioned on existing ...
A local direct method for module identification in dynamic networks with correlated noise (Aug 02 2019). The identification of local modules in dynamic networks with known topology has recently been addressed by formulating conditions for arriving at consistent estimates of the module dynamics, under the assumption of having disturbances that are uncorrelated ...
Sound source detection, localization and classification using consecutive ensemble of CRNN models (Aug 02 2019). In this paper, we describe our method for DCASE2019 task3: Sound Event Localization and Detection (SELD). We use four CRNN SELDnet-like single output models which run in a consecutive manner to recover all possible information of occurring events. We ...
The Efficiency of Generalized Nash and Variational Equilibria (Aug 02 2019). Shared-constraint games are noncooperative $N$-player games where players are coupled through a common coupling constraint. It is known that such games admit two kinds of equilibria -- generalized Nash equilibria (GNE) and variational equilibria (VE) ...
Learning Joint Acoustic-Phonetic Word Embeddings (Aug 01 2019). Most speech recognition tasks pertain to mapping words across two modalities: acoustic and orthographic. In this work, we suggest learning encoders that map variable-length, acoustic or phonetic, sequences that represent words into fixed-dimensional vectors ...
Multi-Scale Learned Iterative Reconstruction (Aug 01 2019). Model-based learned iterative reconstruction methods have recently been shown to outperform classical reconstruction methods. Applicability of these methods to large scale inverse problems is however limited by the available memory for training and extensive ...
Quantifying Cochlear Implant Users' Ability for Speaker Identification using CI Auditory Stimuli (Jul 31 2019). Speaker recognition is a biometric modality that uses underlying speech information to determine the identity of the speaker. Speaker Identification (SID) under noisy conditions is one of the challenging topics in the field of speech processing, specifically ...
Personalizing ASR for Dysarthric and Accented Speech with Limited Data (Jul 31 2019). Automatic speech recognition (ASR) systems have dramatically improved over the last few years. ASR systems are most often trained from 'typical' speech, which means that underrepresented groups don't experience the same level of improvement. In this paper, ...
Augmenting Music Sheets with Harmonic Fingerprints (Jul 31 2019). Conventional Music Notation (CMN) is the well-established foundation for the written communication of musical information, such as rhythm, harmony, or timbre. However, CMN suffers from the complexity of its visual encoding and the need for extensive training ...
Marine Mammal Species Classification using Convolutional Neural Networks and a Novel Acoustic Representation (Jul 30 2019). Research into automated systems for detecting and classifying marine mammals in acoustic recordings is expanding internationally due to the necessity to analyze large collections of data for conservation purposes. In this work, we present a Convolutional ...
Fast and Robust 3-D Sound Source Localization with DSVD-PHAT (Jul 29 2019). This paper introduces a variant of the Singular Value Decomposition with Phase Transform (SVD-PHAT), named Difference SVD-PHAT (DSVD-PHAT), to achieve robust Sound Source Localization (SSL) in noisy conditions. Experiments are performed on a Baxter robot ...
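SVD-PHAT and its variants build on the PHAT-weighted generalized cross-correlation; the sketch below shows plain GCC-PHAT delay estimation for one microphone pair, which is the signal-level quantity these localization methods operate on. Parameters and signals are illustrative.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Lag (seconds) maximizing the PHAT-weighted cross-correlation of x and y."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy check: y is x delayed by 40 samples at 16 kHz -> prints about -0.0025.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = np.concatenate((np.zeros(40), x[:-40]))
print(gcc_phat(x, y, 16000))
```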
Multi-Frame Cross-Entropy Training for Convolutional Neural Networks in Speech Recognition (Jul 29 2019). We introduce Multi-Frame Cross-Entropy training (MFCE) for convolutional neural network acoustic models. Recognizing that, similar to RNNs, CNNs are by nature sequence models that take variable-length inputs, we propose to take as input to the CNN a part ...
MIRaGe: Multichannel Database Of Room Impulse Responses Measured On High-Resolution Cube-Shaped Grid In Multiple Acoustic Conditions (Jul 29 2019). We introduce a database of multi-channel recordings performed in an acoustic lab with adjustable reverberation time. The recordings provide information about room impulse responses (RIR) for various positions of a loudspeaker. In particular, the main ...
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion (Jul 29 2019). Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability ...
Dilated FCN: Listening Longer to Hear Better (Jul 27 2019). Deep neural network solutions have emerged as a new and powerful paradigm for speech enhancement (SE). The capabilities to capture long context and extract multi-scale patterns are crucial to design effective SE networks. Such capabilities, however, are ...
Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion (Jul 27 2019). We present a modification to the spectrum differential based direct waveform modification for voice conversion (DIFFVC) so that it can be directly applied as a waveform generation module to voice conversion models. The recently proposed DIFFVC avoids ...
On the Use/Misuse of the Term 'Phoneme' (Jul 26 2019). The term 'phoneme' lies at the heart of speech science and technology, and yet it is not clear that the research community fully appreciates its meaning and implications. In particular, it is suspected that many researchers use the term in a casual sense ...
Localization Uncertainty in Time-Intensity Stereophonic Reproduction (Jul 26 2019). This paper studies the effects of inter-channel time and level differences in stereophonic reproduction on perceived localization uncertainty. Towards this end, a computational model of localization uncertainty is proposed first. The model calculates ...
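The two stereo cues this entry studies, inter-channel time and level differences, can be measured directly from a stereo frame; a minimal sketch, using the cross-correlation peak as a crude time-difference estimator:

```python
import numpy as np

def itd_ild(left, right, fs):
    """Inter-channel time difference (s) and level difference (dB) of one frame."""
    rms_l = np.sqrt(np.mean(left ** 2)) + 1e-12
    rms_r = np.sqrt(np.mean(right ** 2)) + 1e-12
    ild_db = 20 * np.log10(rms_l / rms_r)                # level difference
    cc = np.correlate(left, right, mode='full')          # all lags
    itd_s = (np.argmax(cc) - (len(right) - 1)) / fs      # lag of the peak
    return itd_s, ild_db

# Toy check: right channel delayed by 8 samples and attenuated by half.
rng = np.random.default_rng(0)
left = rng.standard_normal(1024)
right = 0.5 * np.concatenate((np.zeros(8), left[:-8]))
print(itd_ild(left, right, 48000))   # about (-1.7e-4 s, +6 dB)
```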
Robust On-Line ADP-based Solution of a Class of Hierarchical Nonlinear Differential Game (Jul 26 2019). In this paper, a hierarchical one-leader-multi-followers game for a class of continuous-time nonlinear systems with disturbance is investigated by a novel policy iteration reinforcement learning technique in which the game model consists both of the ...
Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement (Jul 26 2019). Performance of learning based Automatic Speech Recognition (ASR) is susceptible to noise, especially when it is introduced in the testing data while not present in the training data. This work focuses on a feature enhancement for noise robust end-to-end ...
Interactive Lungs Auscultation with Reinforcement Learning Agent (Jul 25 2019). Performing precise auscultation to examine the respiratory system normally requires the presence of an experienced doctor. With most recent advances in machine learning and artificial intelligence, automatic detection of pathological ...
Cross-Attention End-to-End ASR for Two-Party Conversations (Jul 24 2019). We present an end-to-end speech recognition model that learns interaction between two speakers based on the turn-changing information. Unlike conventional speech recognition models, our model exploits two speakers' history of conversational-context information ...
A neural network based post-filter for speech-driven head motion synthesis (Jul 24 2019, updated Jul 25 2019). Despite the fact that neural networks are widely used for speech-driven head motion synthesis, it is well-known that the output of neural networks is noisy or discontinuous due to the limited capability of deep neural networks in predicting human motion. ...
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder (Jul 24 2019). In this paper, we present a novel technique for non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian ...
Log Complex Color for Visual Pattern Recognition of Total Sound (Jul 23 2019). While traditional audio visualization methods depict amplitude intensities vs. time, such as in a time-frequency spectrogram, and while some may use complex phase information to augment the amplitude representation, such as in a reassigned spectrogram, ...
Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features (Jul 23 2019). Deep clustering (DC) and utterance-level permutation invariant training (uPIT) have been demonstrated to be promising for speaker-independent speech separation. DC is usually formulated as a two-step process: embedding learning and embedding clustering, which ...
NONOTO: A Model-agnostic Web Interface for Interactive Music Composition by Inpainting (Jul 23 2019). Inpainting-based generative modeling allows for stimulating human-machine interactions by letting users perform stylistically coherent local editions to an object using a statistical model. We present NONOTO, a new interface for interactive music generation ...
EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings (Jul 23 2019). Despite remarkable advances in emotion recognition, systems remain severely restrained by either the essentially limited property of the employed single modality, or the synchronous presence of all involved multiple modalities. Motivated by this, we propose ...
LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization (Jul 23 2019). More and more neural network approaches have achieved considerable improvement upon submodules of the speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional ...
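The clustering stage the entry refers to can be made concrete in a few lines: given per-segment speaker embeddings and a similarity matrix (plain cosine here, where the paper learns an LSTM-based similarity instead), spectral clustering assigns a speaker index to each segment.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(embeddings, n_speakers):
    """embeddings: (n_segments, dim). Returns a speaker label per segment."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(e @ e.T, 0.0, 1.0)      # cosine similarity in [0, 1]
    return SpectralClustering(n_clusters=n_speakers,
                              affinity='precomputed').fit_predict(affinity)

# Toy demo: two well-separated synthetic "speakers", ten segments each.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (10, 32)), rng.normal(5, 0.1, (10, 32))])
print(cluster_segments(emb, 2))
```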
On Modeling ASR Word Confidence (Jul 22 2019). We present a new method for computing ASR word confidences that effectively mitigates ASR errors for diverse downstream applications, improves the word error rate of the 1-best result, and allows better comparison of scores across different models. We ...
A Deep Neural Network for Short-Segment Speaker Recognition (Jul 22 2019). Today's interactive devices such as smartphone assistants and smart speakers often deal with short-duration speech segments. As a result, speaker recognition systems integrated into such devices will be much better suited with models capable of performing ...
Bilevel Optimization, Deep Learning and Fractional Laplacian Regularization with Applications in Tomography (Jul 22 2019). In this work we consider a generalized bilevel optimization framework for solving inverse problems. We introduce the fractional Laplacian as a regularizer to improve the reconstruction quality, and compare it with total variation regularization. We emphasize ...
ML Estimation and CRBs for Reverberation, Speech and Noise PSDs in Rank-Deficient Noise-Field (Jul 22 2019). Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power ...
Crowdsourcing a Dataset of Audio Captions (Jul 22 2019). Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable ...