Latest in cs.SD

Sound Localization and Separation in Three-dimensional Space Using a Single Microphone with a Metamaterial Enclosure (Aug 22 2019). Conventional approaches to sound localization and separation are based on microphone arrays in artificial systems. Inspired by the selective perception of the human auditory system, we design a multi-source listening system which can separate simultaneous ...
Coarse-to-fine Optimization for Speech Enhancement (Aug 21 2019). In this paper, we propose coarse-to-fine optimization for the task of speech enhancement. Cosine similarity loss [1] has proven to be an effective metric to measure the similarity of speech signals. However, due to the large variance of the enhanced speech ...
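The cosine similarity loss this abstract builds on fits in a few lines; the following numpy sketch is illustrative only (function and signal names are not from the paper):

```python
import numpy as np

def cosine_similarity_loss(enhanced, clean, eps=1e-8):
    """Negative cosine similarity between two time-domain signals."""
    num = np.dot(enhanced, clean)
    den = np.linalg.norm(enhanced) * np.linalg.norm(clean) + eps
    return -num / den

clean = np.array([1.0, 0.0, -1.0, 0.0])
loss_same = cosine_similarity_loss(clean, clean)    # near -1: identical signals
loss_flip = cosine_similarity_loss(-clean, clean)   # near +1: phase-inverted signal
```

Note that cosine similarity is invariant to positive rescaling of the enhanced signal, so it constrains waveform shape rather than absolute level.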
A Realistic Face-to-Face Conversation System based on Deep Neural Networks (Aug 21 2019). To improve the experience of face-to-face conversation with an avatar, this paper presents a novel conversation system. It is composed of two sequence-to-sequence models, for listening and speaking respectively, and a Generative Adversarial Network (GAN) ...
From Text to Sound: A Preliminary Study on Retrieving Sound Effects to Radio Stories (Aug 20 2019). Sound effects play an essential role in producing high-quality radio stories but require enormous labor cost to add. In this paper, we address the problem of automatically adding sound effects to radio stories with a retrieval-based model. However, directly ...
A Microphone Array and Voice Algorithm based Smart Hearing Aid (Aug 20 2019). Approximately 6.2% of the world's population (466 million people) suffer from disabling hearing impairment [1]. Hearing impairment impacts negatively on one's education, financial success [2][3], and cognitive development in childhood [4], including increased ...
Prosodic Phrase Alignment for Machine Dubbing (Aug 20 2019). Dubbing is a type of audiovisual translation where dialogues are translated and enacted so that they give the impression that the media is in the target language. It requires a careful alignment of dubbed recordings with the lip movements of performers ...
AI for Earth: Rainforest Conservation by Acoustic Surveillance (Aug 20 2019). Saving rainforests is key to halting adverse climate change. In this paper, we introduce an innovative solution built on acoustic surveillance and machine learning technologies to help rainforest conservation. In particular, we propose new convolutional ...
Fuzzy C-Means Clustering and Sonification of HRV Features (Aug 19 2019). Linear and non-linear measures of heart rate variability (HRV) are widely investigated as non-invasive indicators of health. Stress has a profound impact on heart rate, and different meditation techniques have been found to modulate heartbeat rhythm. ...
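For readers unfamiliar with fuzzy c-means, the membership update at its core is compact; this numpy sketch (with fixed centers, purely illustrative and not from the paper) shows how each point receives soft memberships that sum to one:

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-9):
    """Fuzzy c-means membership matrix: soft assignment of each point in X
    to each cluster center, with fuzzifier m."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps  # (n, c)
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)                                         # rows sum to 1

X = np.array([[0.0, 0.0], [10.0, 10.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
U = fcm_memberships(X, centers)   # each point belongs almost entirely to its own cluster
```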
Salient Speech Representations Based on Cloned Networks (Aug 19 2019). We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning ...
Two-Staged Acoustic Modeling Adaption for Robust Speech Recognition by the Example of German Oral History Interviews (Aug 19 2019). In automatic speech recognition, often little training data is available for specific challenging tasks, but training of state-of-the-art automatic speech recognition systems requires large amounts of annotated speech. To address this issue, we propose ...
Audio query-based music source separation (Aug 19 2019). In recent years, music source separation has been one of the most intensively studied research areas in music information retrieval. Improvements in deep learning have led to big progress in music source separation performance. However, most of the previous ...
Efficient Context Aggregation for End-to-End Speech Enhancement Using a Densely Connected Convolutional and Recurrent Network (Aug 18 2019). In speech enhancement, an end-to-end deep neural network converts a noisy speech signal to clean speech directly in the time domain without time-frequency transformation or mask estimation. However, aggregating contextual information from a high-resolution ...
Music Transcription Based on Bayesian Piece-Specific Score Models Capturing Repetitions (Aug 18 2019). Most work on models for music transcription has focused on describing local sequential dependence of notes in musical scores and failed to capture their global repetitive structure, which can be a useful guide for transcribing music. Focusing on the rhythm, ...
Onset detection: A new approach to QBH system (Aug 17 2019). Query by Humming (QBH) is a system that provides a user with the song(s) which the user hums to the system. The current QBH method requires the extraction of onset and pitch information in order to track similarity with various versions of different songs. ...
JVS corpus: free Japanese multi-speaker voice corpus (Aug 17 2019). Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only ...
Survey on Deep Neural Networks in Speech and Vision Systems (Aug 16 2019). This survey presents a review of state-of-the-art deep neural network architectures, algorithms, and systems in vision and speech applications. Recent advances in deep artificial neural network algorithms and architectures have spurred rapid innovation ...
Towards Generating Ambisonics Using Audio-Visual Cue for Virtual Reality (Aug 16 2019). Ambisonics, i.e., full-sphere surround sound, is essential for 360-degree visual content to provide a realistic virtual reality (VR) experience. While 360-degree visual content capture has gained a tremendous boost recently, the estimation of corresponding ...
Sub-Spectrogram Segmentation for Environmental Sound Classification via Convolutional Recurrent Neural Network and Score Level Fusion (Aug 16 2019). Environmental Sound Classification (ESC) is an important and challenging problem, and feature representation is a critical and even decisive factor in ESC. Feature representation ability directly affects the accuracy of sound classification. Therefore, ...
Speaker Verification Using Simple Temporal Features and Pitch Synchronous Cepstral Coefficients (Aug 15 2019). Speaker verification is the process by which a speaker's claim of identity is tested against a claimed speaker by his or her voice. Speaker verification is done by the use of some parameters (features) from the speaker's voice which can be used to differentiate ...
Conditional LSTM-GAN for Melody Generation from Lyrics (Aug 15 2019). Melody generation from lyrics has been a challenging research issue in the field of artificial intelligence and music, enabling models to learn and discover latent relationships between interesting lyrics and accompanying melody. Unfortunately, the limited ...
State-of-the-art Speech Recognition using EEG and Towards Decoding of Speech Spectrum From EEG (Aug 14 2019). In this paper we first demonstrate continuous noisy speech recognition using electroencephalography (EEG) signals on English vocabulary using different types of state-of-the-art end-to-end automatic speech recognition (ASR) models; we further provide ...
Interleaved Multitask Learning for Audio Source Separation with Independent Databases (Aug 14 2019). Deep Neural Network-based source separation methods usually train independent models to optimize for the separation of individual sources. Although this can lead to good performance for well-defined targets, it can also be computationally expensive. The ...
Components Loss for Neural Networks in Mask-Based Speech Enhancement (Aug 14 2019). Estimating time-frequency domain masks for single-channel speech enhancement using deep learning methods has recently become a popular research field with promising results. In this paper, we propose a novel components loss (CL) for the training of neural ...
RTF-steered binaural MVDR beamforming incorporating multiple external microphones (Aug 13 2019). The binaural minimum-variance distortionless-response (BMVDR) beamformer is a well-known noise reduction algorithm that can be steered using the relative transfer function (RTF) vector of the desired speech source. Exploiting the availability of an external ...
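The MVDR weights behind beamformers like the BMVDR have a standard closed form, w = R⁻¹d / (dᴴR⁻¹d), with d the RTF steering vector and R the noise covariance; a toy numpy sketch (all values illustrative, not from the paper):

```python
import numpy as np

def mvdr_weights(R_noise, rtf):
    """MVDR beamformer weights w = R^{-1} d / (d^H R^{-1} d), steered by RTF vector d."""
    Rinv_d = np.linalg.solve(R_noise, rtf)
    return Rinv_d / (rtf.conj() @ Rinv_d)

rtf = np.array([1.0 + 0.0j, 0.8 - 0.2j, 0.5 + 0.4j])     # toy 3-mic RTF vector
R = 0.1 * np.eye(3) + 0.05 * np.outer(rtf, rtf.conj())   # toy Hermitian noise covariance
w = mvdr_weights(R, rtf)
distortionless = w.conj() @ rtf   # the constraint w^H d = 1: target passes undistorted
```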
IMS-Speech: A Speech to Text Tool (Aug 13 2019). We present IMS-Speech, a web-based tool for German and English speech transcription aiming to facilitate research in various disciplines which require access to lexical information in spoken language materials. This tool is based on modern open ...
End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning (Aug 13 2019). This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech. We propose to train an end-to-end system conditioned on speaker embeddings and further improved by transfer learning from clean speech. ...
Estimating & Mitigating the Impact of Acoustic Environments on Machine-to-Machine Signalling (Aug 13 2019). The advance of technology for transmitting Data-over-Sound in various IoT and telecommunication applications has led to the concept of machine-to-machine over-the-air acoustic signalling. Reverberation can have a detrimental effect on such machine-to-machine ...
A Study on Angular Based Embedding Learning for Text-independent Speaker Verification (Aug 12 2019). Learning a good speaker embedding is important for many automatic speaker recognition tasks, including verification, identification and diarization. The embeddings learned by softmax are not discriminative enough for open-set verification tasks. Angular ...
Emotion Dependent Facial Animation from Affective Speech (Aug 11 2019). In human-to-computer interaction, facial animation in synchrony with affective speech can deliver more naturalistic conversational agents. In this paper, we present a two-stage deep learning approach for affective speech driven facial shape animation. ...
Multi-modality Latent Interaction Network for Visual Question Answering (Aug 10 2019). Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual ...
Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants (Aug 09 2019). Voice-enabled interactions provide more human-like experiences in many popular IoT systems. Cloud-based speech analysis services extract useful information from voice input using speech recognition techniques. The voice signal is a rich resource that ...
The role of cue enhancement and frequency fine-tuning in hearing impaired phone recognition (Aug 09 2019). A speech-based hearing test is designed to identify the susceptible error-prone phones for an individual hearing-impaired (HI) ear. Only robust tokens at the experiment's noise levels had been chosen for the test. The noise-robustness of tokens is measured ...
Challenging the Boundaries of Speech Recognition: The MALACH Corpus (Aug 09 2019). There has been huge progress in speech recognition over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-hour subset of a large ...
ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection (Aug 09 2019). This paper introduces a new dataset called "ToyADMOS" designed for anomaly detection in machine operating sounds (ADMOS). To the best of our knowledge, no large-scale datasets are available for ADMOS, although large-scale datasets have contributed to recent ...
Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition (Aug 08 2019). In this paper, we explore various approaches for semi-supervised learning in an end-to-end automatic speech recognition (ASR) framework. The first step in our approach involves training a seed model on the limited amount of labelled data. Additional unlabelled ...
Universal Adversarial Audio Perturbations (Aug 08 2019). We demonstrate the existence of universal adversarial perturbations, which can fool a family of audio processing architectures, for both targeted and untargeted attacks. To the best of our knowledge, this is the first study on generating universal adversarial ...
Self-supervised Attention Model for Weakly Labeled Audio Event Classification (Aug 07 2019). We describe a novel weakly labeled Audio Event Classification approach based on a self-supervised attention model. The weakly labeled framework is used to eliminate the need for an expensive data labeling procedure, and self-supervised attention is deployed ...
Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoder (Aug 07 2019). Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform ...
Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition (Aug 07 2019). Convolutional neural networks (CNN) are widely used for speech emotion recognition (SER). In such cases, the short-time Fourier transform (STFT) spectrogram is the most popular choice for representing speech, which is fed as input to the CNN. However, ...
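Since several entries here contrast alternatives to the STFT spectrogram, a bare-bones magnitude spectrogram is worth spelling out; this numpy sketch (window and hop sizes illustrative) recovers the frequency of a pure tone:

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape (n_fft//2 + 1, n_frames)

fs = 16000
t = np.arange(fs) / fs
spec = stft_mag(np.sin(2 * np.pi * 440 * t))   # 1 s of a 440 Hz tone
peak_hz = np.fft.rfftfreq(512, 1 / fs)[spec.mean(axis=1).argmax()]   # nearest bin to 440 Hz
```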
Understanding Optical Music Recognition (Aug 07 2019). For over 50 years, researchers have been trying to teach computers to read music notation, referred to as Optical Music Recognition (OMR). However, this field is still difficult to access for new researchers, especially those without a significant musical ...
Viterbi Extraction tutorial with Hidden Markov Toolkit (Aug 07 2019). An algorithm used to extract HMM parameters is revisited. Most parts of the extraction process are taken from the Hidden Markov Toolkit (HTK) program named HInit. The algorithm itself shows a few variations compared to implementations in other domains. ...
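The Viterbi algorithm at the heart of HInit-style parameter extraction is compact enough to show directly; this is a generic log-domain sketch, not HTK's implementation:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Most likely HMM state path for a discrete observation sequence (log domain)."""
    T, N = len(obs), len(log_pi)
    delta = np.zeros((T, N))            # best log-probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: from state i to state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Two states, each strongly preferring to emit its own symbol and to stay put.
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log([[0.9, 0.1], [0.1, 0.9]])
log_pi = np.log([0.5, 0.5])
path = viterbi(log_A, log_B, log_pi, [0, 0, 1, 1])   # -> [0, 0, 1, 1]
```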
Practical Speech Recognition with HTK (Aug 06 2019). The practical aspects of developing an Automatic Speech Recognition System (ASR) with HTK are reviewed. Steps are explained concerning hardware, software, libraries, applications and computer programs used. The common procedure to rapidly apply speech ...
An End-to-End Text-independent Speaker Verification Framework with a Keyword Adversarial Network (Aug 06 2019). This paper presents an end-to-end text-independent speaker verification framework by jointly considering the speaker embedding (SE) network and automatic speech recognition (ASR) network. The SE network learns to output an embedding vector which distinguishes ...
Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation (Aug 06 2019). This article describes a probabilistic formulation of a Weighted Power minimization Distortionless response convolutional beamformer (WPD). The WPD unifies a weighted prediction error based dereverberation method (WPE) and a minimum power distortionless ...
Acceleration of rank-constrained spatial covariance matrix estimation for blind speech extraction (Aug 06 2019). In this paper, we propose new accelerated update rules for rank-constrained spatial covariance model estimation, which efficiently extracts a directional target source in diffuse background noise. The naive update rule requires heavy computation such ...
Triplet Based Embedding Distance and Similarity Learning for Text-independent Speaker Verification (Aug 06 2019). Speaker embeddings are becoming increasingly popular in the text-independent speaker verification task. In this paper, we propose two improvements during the training stage. The improvements are both based on triplets, because the training stage and the evaluation stage ...
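A triplet objective of the kind this abstract refers to is simple to write down; a minimal numpy sketch with squared Euclidean distances (margin and embeddings illustrative, not from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pulling the anchor toward the positive (same speaker)
    and away from the negative (different speaker)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])            # same speaker: already close to the anchor
n = np.array([0.0, 1.0])            # different speaker: far from the anchor
satisfied = triplet_loss(a, p, n)   # zero: margin already respected
violated = triplet_loss(a, n, p)    # positive: wrong ordering gets penalized
```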
Adversarially Trained End-to-end Korean Singing Voice Synthesis System (Aug 06 2019). In this paper, we propose an end-to-end Korean singing voice synthesis system from lyrics and a symbolic melody using the following three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution ...
Presenting the Acoustic Sounds for Wellbeing Dataset and Baseline Classification Results (Aug 05 2019). The field of sound healing includes ancient practices coming from a broad range of cultures. Across such practices there is a variety of instrumentation utilised. Practitioners suggest the ability of sound to target both mental and even physical health ...
Robust Over-the-Air Adversarial Examples Against Automatic Speech Recognition Systems (Aug 05 2019). Automatic speech recognition (ASR) systems can be fooled by targeted adversarial examples. These can induce the ASR to produce arbitrary transcriptions in response to any type of audio signal, be it speech, environmental sounds, or music. However, ...
V2S attack: building DNN-based voice conversion from automatic speaker verification (Aug 05 2019). This paper presents a new voice impersonation attack using voice conversion (VC). Enrolling personal voices for automatic speaker verification (ASV) offers natural and flexible biometric authentication systems. Basically, the ASV systems do not include ...
Cross-lingual Text-independent Speaker Verification using Unsupervised Adversarial Discriminative Domain Adaptation (Aug 05 2019). Speaker verification systems often degrade significantly when there is a language mismatch between training and testing data. Being able to improve cross-lingual speaker verification systems using unlabeled data can greatly increase the robustness of the ...
Sound Event Detection in Multichannel Audio using Convolutional Time-Frequency-Channel Squeeze and Excitation (Aug 04 2019). In this study, we introduce a convolutional time-frequency-channel "Squeeze and Excitation" (tfc-SE) module to explicitly model inter-dependencies between the time-frequency domain and multiple channels. The tfc-SE module consists of two parts: tf-SE ...
Probabilistic Permutation Invariant Training for Speech Separation (Aug 04 2019). Single-microphone, speaker-independent speech separation is normally performed through two steps: (i) separating the specific speech sources, and (ii) determining the best output-label assignment to find the separation error. The second step is the main ...
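Utterance-level permutation invariant training, the baseline this paper builds its probabilistic variant on, reduces to a minimum over source assignments; a small numpy sketch (toy signals, not the paper's setup):

```python
import numpy as np
from itertools import permutations

def pit_loss(est, ref):
    """Utterance-level PIT: minimum mean-squared error over all
    output-to-source assignments, returned with the best permutation."""
    n = len(ref)
    pair = np.array([[np.mean((est[i] - ref[j]) ** 2) for j in range(n)]
                     for i in range(n)])                  # pair[i, j]: estimate i vs reference j
    perms = list(permutations(range(n)))
    losses = [pair[np.arange(n), list(p)].mean() for p in perms]
    best = int(np.argmin(losses))
    return losses[best], perms[best]

ref = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
est = ref[::-1] + 0.01             # network outputs arrive in swapped order
loss, perm = pit_loss(est, ref)    # PIT recovers the swap: perm == (1, 0)
```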
LSTM Based Music Generation System (Aug 02 2019). Traditionally, music was treated as an analogue signal and was generated manually. In recent years, technology has become able to generate a suite of music automatically without any human intervention. To accomplish this task, we need to ...
SANTLR: Speech Annotation Toolkit for Low Resource Languages (Aug 02 2019). While low resource speech recognition has attracted a lot of attention from the speech community, there are few tools available to facilitate low resource speech collection. In this work, we present SANTLR: Speech Annotation Toolkit for Low Resource ...
Multilingual Speech Recognition with Corpus Relatedness Sampling (Aug 02 2019). Multilingual acoustic models have been successfully applied to low-resource speech recognition. Most existing works have combined many small corpora together and pretrained a multilingual model by sampling from each corpus uniformly. The model is eventually ...
High-Level Control of Drum Track Generation Using Learned Patterns of Rhythmic Interaction (Aug 02 2019). Spurred by the potential of deep learning, computational music generation has gained renewed academic interest. A crucial issue in music generation is that of user control, especially in scenarios where the music generation process is conditioned on existing ...
Sound source detection, localization and classification using consecutive ensemble of CRNN models (Aug 02 2019). In this paper, we describe our method for DCASE2019 task 3: Sound Event Localization and Detection (SELD). We use four CRNN SELDnet-like single-output models which run in a consecutive manner to recover all possible information of occurring events. We ...
Learning Joint Acoustic-Phonetic Word Embeddings (Aug 01 2019). Most speech recognition tasks pertain to mapping words across two modalities: acoustic and orthographic. In this work, we suggest learning encoders that map variable-length, acoustic or phonetic, sequences that represent words into fixed-dimensional vectors ...
Quantifying Cochlear Implant Users' Ability for Speaker Identification using CI Auditory Stimuli (Jul 31 2019). Speaker recognition is a biometric modality that uses underlying speech information to determine the identity of the speaker. Speaker Identification (SID) under noisy conditions is one of the challenging topics in the field of speech processing, specifically ...
Personalizing ASR for Dysarthric and Accented Speech with Limited Data (Jul 31 2019). Automatic speech recognition (ASR) systems have dramatically improved over the last few years. ASR systems are most often trained from 'typical' speech, which means that underrepresented groups don't experience the same level of improvement. In this paper, ...
Augmenting Music Sheets with Harmonic Fingerprints (Jul 31 2019). Conventional Music Notation (CMN) is the well-established foundation for the written communication of musical information, such as rhythm, harmony, or timbre. However, CMN suffers from the complexity of its visual encoding and the need for extensive training ...
Marine Mammal Species Classification using Convolutional Neural Networks and a Novel Acoustic Representation (Jul 30 2019). Research into automated systems for detecting and classifying marine mammals in acoustic recordings is expanding internationally due to the necessity to analyze large collections of data for conservation purposes. In this work, we present a Convolutional ...
Fast and Robust 3-D Sound Source Localization with DSVD-PHAT (Jul 29 2019). This paper introduces a variant of the Singular Value Decomposition with Phase Transform (SVD-PHAT), named Difference SVD-PHAT (DSVD-PHAT), to achieve robust Sound Source Localization (SSL) in noisy conditions. Experiments are performed on a Baxter robot ...
Multi-Frame Cross-Entropy Training for Convolutional Neural Networks in Speech Recognition (Jul 29 2019). We introduce Multi-Frame Cross-Entropy training (MFCE) for convolutional neural network acoustic models. Recognizing that, like RNNs, CNNs are by nature sequence models that take variable-length inputs, we propose to take as input to the CNN a part ...
MIRaGe: Multichannel Database Of Room Impulse Responses Measured On High-Resolution Cube-Shaped Grid In Multiple Acoustic Conditions (Jul 29 2019). We introduce a database of multi-channel recordings performed in an acoustic lab with adjustable reverberation time. The recordings provide information about room impulse responses (RIR) for various positions of a loudspeaker. In particular, the main ...
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion (Jul 29 2019). Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability ...
Dilated FCN: Listening Longer to Hear Better (Jul 27 2019). Deep neural network solutions have emerged as a new and powerful paradigm for speech enhancement (SE). The capabilities to capture long context and extract multi-scale patterns are crucial to designing effective SE networks. Such capabilities, however, are ...
On the Use/Misuse of the Term 'Phoneme' (Jul 26 2019). The term 'phoneme' lies at the heart of speech science and technology, and yet it is not clear that the research community fully appreciates its meaning and implications. In particular, it is suspected that many researchers use the term in a casual sense ...
Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement (Jul 26 2019). The performance of learning-based Automatic Speech Recognition (ASR) is susceptible to noise, especially when it is introduced in the testing data while not present in the training data. This work focuses on feature enhancement for noise-robust end-to-end ...
Interactive Lungs Auscultation with Reinforcement Learning Agent (Jul 25 2019). Performing a precise auscultation to examine the respiratory system normally requires the presence of an experienced doctor. With the most recent advances in machine learning and artificial intelligence, automatic detection of pathological ...
Cross-Attention End-to-End ASR for Two-Party Conversations (Jul 24 2019). We present an end-to-end speech recognition model that learns interaction between two speakers based on turn-changing information. Unlike conventional speech recognition models, our model exploits the two speakers' history of conversational-context information ...
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder (Jul 24 2019). In this paper, we present a novel technique for non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian ...
Log Complex Color for Visual Pattern Recognition of Total Sound (Jul 23 2019). While traditional audio visualization methods depict amplitude intensities vs. time, such as in a time-frequency spectrogram, and while some may use complex phase information to augment the amplitude representation, such as in a reassigned spectrogram, ...
Speech, Head, and Eye-based Cues for Continuous Affect Prediction (Jul 23 2019). Continuous affect prediction involves the discrete time-continuous regression of affect dimensions. Dimensions to be predicted often include arousal and valence. Continuous affect prediction researchers are now embracing multimodal model input. This provides ...
Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features (Jul 23 2019). Deep clustering (DC) and utterance-level permutation invariant training (uPIT) have been demonstrated to be promising for speaker-independent speech separation. DC is usually formulated as a two-step process: embedding learning and embedding clustering, which ...
NONOTO: A Model-agnostic Web Interface for Interactive Music Composition by Inpainting (Jul 23 2019). Inpainting-based generative modeling allows for stimulating human-machine interactions by letting users perform stylistically coherent local editions to an object using a statistical model. We present NONOTO, a new interface for interactive music generation ...
Multisensory Learning Framework for Robot Drumming (Jul 23 2019). The hype about sensorimotor learning is currently reaching high fever, thanks to the latest advancements in deep learning. In this paper, we present an open-source framework for collecting large-scale, time-synchronised synthetic data from highly disparate ...
EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings (Jul 23 2019). Despite remarkable advances in emotion recognition, systems are severely restrained by either the essentially limited properties of the employed single modality, or the required synchronous presence of all involved multiple modalities. Motivated by this, we propose ...
LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization (Jul 23 2019). More and more neural network approaches have achieved considerable improvement upon submodules of speaker diarization systems, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional ...
On Modeling ASR Word Confidence (Jul 22 2019). We present a new method for computing ASR word confidences that effectively mitigates ASR errors for diverse downstream applications, improves the word error rate of the 1-best result, and allows better comparison of scores across different models. We ...
A Deep Neural Network for Short-Segment Speaker Recognition (Jul 22 2019). Today's interactive devices such as smart-phone assistants and smart speakers often deal with short-duration speech segments. As a result, speaker recognition systems integrated into such devices will be much better suited with models capable of performing ...
ML Estimation and CRBs for Reverberation, Speech and Noise PSDs in Rank-Deficient Noise-Field (Jul 22 2019). Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power ...
Crowdsourcing a Dataset of Audio Captions (Jul 22 2019). Audio captioning is a novel field of multi-modal translation and is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires considerable ...
Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder (Jul 21 2019). In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module ...
Sound Search by Text Description or Vocal Imitation? (Jul 19 2019). Searching for sounds by text labels is often difficult, as text descriptions cannot describe the audio content in detail. Query by vocal imitation bridges this gap and provides a novel way to search for sounds. Several algorithms for sound search by vocal imitation ...
Data Augmentation for Instrument Classification Robust to Audio Effects (Jul 19 2019). Reusing recorded sounds (sampling) is a key component in Electronic Music Production (EMP), which has been present since its early days and is at the core of genres like hip-hop or jungle. Commercial and non-commercial services allow users to obtain collections ...
Language Modelling for Sound Event Detection with Teacher Forcing and Scheduled Sampling (Jul 19 2019). A sound event detection (SED) method typically takes as input a sequence of audio frames and predicts the activities of sound events in each frame. In real-life recordings, the sound events exhibit some temporal structure: for instance, a "car horn" ...
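Teacher forcing and scheduled sampling differ only in which token is fed back to the model at each step; a minimal sketch of that choice (names illustrative, not from the paper):

```python
import random

def next_input(ground_truth, model_prediction, teacher_prob):
    """Scheduled sampling: feed the ground-truth label with probability
    teacher_prob, otherwise feed back the model's own previous prediction."""
    return ground_truth if random.random() < teacher_prob else model_prediction

# teacher_prob = 1.0 is pure teacher forcing; 0.0 always feeds predictions back.
forced = [next_input("gt", "pred", 1.0) for _ in range(5)]   # all "gt"
free = [next_input("gt", "pred", 0.0) for _ in range(5)]     # all "pred"
```

In training, teacher_prob is typically decayed from 1 toward 0 so the model gradually learns to consume its own outputs.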
DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis (Jul 19 2019). This paper proposes novel algorithms for speaker embedding using subjective inter-speaker similarity based on deep neural networks (DNNs). Although conventional DNN-based speaker embedding such as a $d$-vector can be applied to multi-speaker modeling ...
Batch Uniformization for Minimizing Maximum Anomaly Score of DNN-based Anomaly Detection in Sounds (Jul 19 2019). Use of an autoencoder (AE) as a normal model is a state-of-the-art technique for unsupervised anomaly detection in sounds (ADS). The AE is trained to minimize the sample mean of the anomaly score of normal sounds in a mini-batch. One problem with this ...
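The contrast between the usual mean objective and the per-batch maximum that batch uniformization targets can be seen with a toy reconstruction-error score (the stand-in "autoencoder" below is purely illustrative):

```python
import numpy as np

def anomaly_scores(batch, reconstruct):
    """Per-sample anomaly score: squared reconstruction error of a normal model."""
    return np.array([np.sum((x - reconstruct(x)) ** 2) for x in batch])

reconstruct = lambda x: np.ones_like(x)          # toy model that only knows the all-ones pattern
batch = [np.ones(4), np.ones(4), np.zeros(4)]    # last sample is anomalous
scores = anomaly_scores(batch, reconstruct)      # [0, 0, 4]
mean_objective = scores.mean()   # standard AE training target
max_objective = scores.max()     # worst case, which batch uniformization addresses
```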
Leveraging Knowledge Bases And Parallel Annotations For Music Genre Translation (Jul 18 2019). Prevalent efforts have been put into automatically inferring genres of musical items. Yet, the proposed solutions often rely on simplifications and fail to address the diversity and subjectivity of music genres. Accounting for these has, though, many benefits ...
Forward-Backward Decoding for Regularizing End-to-End TTS (Jul 18 2019). Neural end-to-end TTS can generate very high-quality synthesized speech, even close to human recordings for in-domain text. However, it performs unsatisfactorily when scaled to challenging test sets. One concern is that the encoder-decoder ...
Automatic vocal tract landmark localization from midsagittal MRI data (Jul 18 2019). The various speech sounds of a language are obtained by varying the shape and position of the articulators surrounding the vocal tract. Analyzing their variability is crucial for understanding speech production, diagnosing speech and swallowing disorders ...
HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods (Jul 17 2019). In this paper, we present a method called HODGEPODGE for large-scale detection of sound events using weakly labeled, synthetic, and unlabeled data proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 ...
Conversational Help for Task Completion and Feature Discovery in Personal Assistants (Jul 16 2019). Intelligent Personal Assistants (IPAs) have become widely popular in recent times. Most of the commercial IPAs today support a wide range of skills including Alarms, Reminders, Weather Updates, Music, News, Factual Question-Answering, etc. The list ...