Latest in cs.SD
A spelling correction model for end-to-end speech recognition (Feb 19 2019)
Attention-based sequence-to-sequence models for speech recognition jointly train an acoustic model, language model (LM), and alignment mechanism using a single neural network and require only parallel audio-text pairs. Thus, the language model component ...

Low-Latency Deep Clustering For Speech Separation (Feb 19 2019)
This paper proposes a low algorithmic latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: a) the usage of long short-term memory (LSTM) networks instead of their bidirectional variant ...

Puppet Dubbing (Feb 12 2019)
Dubbing puppet videos to make the characters (e.g. Kermit the Frog) convincingly speak a new speech track is a popular activity, with many examples of well-known puppets speaking lines from films or singing rap songs. But manually aligning puppet mouth ...

A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (Feb 11 2019)
In a typical voice conversion system, a vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach ...

Hide and Speak: Deep Neural Networks for Speech Steganography (Feb 07 2019)
Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as the carrier. Traditionally, digital signal processing techniques, such as least-significant-bit encoding, were used for hiding messages. In this ...

Conv-codes: Audio Hashing For Bird Species Classification (Feb 07 2019)
In this work, we propose a supervised, convex-representation-based audio hashing framework for bird species classification. The proposed framework utilizes archetypal analysis, a matrix factorization technique, to obtain convex-sparse representations ...

End-to-end Anchored Speech Recognition (Feb 06 2019)
Voice-controlled household devices, like Amazon Echo or Google Home, face the problem of performing speech recognition of device-directed speech in the presence of interfering background speech, i.e., background noise and interfering speech from another ...

Centroid-based deep metric learning for speaker recognition (Feb 06 2019)
Speaker embedding models that utilize neural networks to map utterances to a space where distances reflect similarity between speakers have driven recent progress in the speaker recognition task. However, there is still a significant performance gap between ...

Unsupervised Polyglot Text To Speech (Feb 06 2019)
We present a TTS neural network that is able to produce speech in multiple languages. The proposed network is able to transfer a voice, which was presented as a sample in a source language, into one of several target languages. Training is done without ...

An Ensemble SVM-based Approach for Voice Activity Detection (Feb 05 2019)
Voice activity detection (VAD), used as the front end of speech enhancement, speech recognition, and speaker recognition algorithms, determines the overall accuracy and efficiency of those algorithms. Therefore, a VAD with low complexity and high accuracy is highly ... (a minimal illustrative VAD sketch follows this group of entries)

Discriminate natural versus loudspeaker emitted speech (Jan 31 2019)
In this work, we address a novel, but potentially emerging, problem of discriminating natural human voices from those played back by audio devices in the context of interactions with an in-house voice user interface. The tackled problem may ...
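As a concrete illustration of the VAD front end mentioned in the ensemble-SVM entry above, the sketch below marks frames as speech or non-speech by thresholding per-frame log energy. It is a minimal stand-in, not the paper's ensemble-SVM classifier, and the frame length, hop size, and threshold are arbitrary assumptions.

    # Minimal energy-threshold VAD sketch (not the paper's ensemble-SVM method).
    import numpy as np

    def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
        # Slice the waveform into overlapping frames.
        n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
        frames = np.stack([signal[i * hop:i * hop + frame_len]
                           for i in range(n_frames)])
        # Log energy per frame, normalised so the loudest frame is 0 dB.
        energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
        energy_db -= energy_db.max()
        return energy_db > threshold_db   # True = speech-like frame

    # Usage: quiet noise with a louder tonal burst in the middle.
    x = 0.01 * np.random.randn(16000)
    x[6000:10000] += np.sin(2 * np.pi * 220 * np.arange(4000) / 16000)
    print(energy_vad(x).astype(int))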
Real-time separation of non-stationary sound fields on spheres (Jan 16 2019)
Sound field separation methods can separate the target field from interfering noises, facilitating the study of the acoustic characteristics of a target source placed in a noisy environment. However, most of the existing sound field ...

A linear programming approach to the tracking of partials (Jan 15 2019)
A new approach to the tracking of sinusoidal chirps using linear programming is proposed. It is demonstrated that the classical algorithm of McAulay and Quatieri is greedy and exhibits exponential complexity for long searches, while approaches based on ...

Exploring Transfer Learning for Low Resource Emotional TTS (Jan 14 2019)
During the last few years, spoken language technologies have seen large improvements thanks to deep learning. However, deep-learning-based algorithms require amounts of data that are often difficult and costly to gather. In particular, modeling the variability ...

A survey on acoustic sensing (Jan 11 2019)
The rise of the Internet of Things (IoT) has brought many new sensing mechanisms. Among these mechanisms, acoustic sensing has attracted much attention in recent years. Acoustic sensing exploits acoustic sensors beyond their primary uses, namely recording and ...

Persian phonemes recognition using PPNet (Dec 17 2018)
In this paper a new approach for recognition of Persian phonemes on the PCVC speech dataset is proposed. Nowadays deep neural networks play the main role in classification tasks. However, the best results in speech recognition are not as good as human ...

Learning to Generate Music with BachProp (Dec 17 2018)
As deep learning advances, algorithms for music composition increase in performance. However, most of the successful models are designed for specific musical structures. Here, we present BachProp, an algorithmic composer that can generate music scores ...

The USTC-NEL Speech Translation system at IWSLT 2018 (Dec 06 2018)
This paper describes the USTC-NEL system for the speech translation task of the IWSLT Evaluation 2018. The system is a conventional pipeline system which contains 3 modules: speech recognition, post-processing and machine translation. We train a group ...

DONUT: CTC-based Query-by-Example Keyword Spotting (Nov 26 2018)
Keyword spotting, or wakeword detection, is an essential feature for hands-free operation of modern voice-controlled devices. With such devices becoming ubiquitous, users might want to choose a personalized custom wakeword. In this work, we present DONUT, ...

Representation Mixing for TTS Synthesis (Nov 17 2018; revised Nov 24 2018)
Recent character- and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as ...

The Intrinsic Memorability of Everyday Sounds (Nov 17 2018)
Our aural experience plays an integral role in the perception and memory of the events in our lives. Some of the sounds we encounter throughout the day stay lodged in our minds more easily than others; these, in turn, may serve as powerful triggers of ...

Exploring Tradeoffs in Models for Low-latency Speech Enhancement (Nov 16 2018)
We explore a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ... (a minimal sketch of mask-based enhancement follows this list)

AclNet: efficient end-to-end audio classification CNN (Nov 16 2018)
We propose an efficient end-to-end convolutional neural network architecture, AclNet, for audio classification. When trained with our data augmentation and regularization, we achieved state-of-the-art performance on the ESC-50 corpus with 85.65% accuracy. ...

Comprehensive evaluation of statistical speech waveform synthesis (Nov 15 2018; revised Dec 11 2018)
Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted ...

Exploring RNN-Transducer for Chinese Speech Recognition (Nov 13 2018)
End-to-end approaches have drawn much attention recently for significantly simplifying the construction of an automatic speech recognition (ASR) system. The RNN transducer (RNN-T) is one of the popular end-to-end methods. Previous studies have shown that ...

Native Language Identification using i-vector (Nov 09 2018)
The task of determining a speaker's native language based only on their speech in a second language is known as Native Language Identification, or NLI. Due to its increasing applications in various domains of speech signal processing, this has emerged ...

Class-conditional embeddings for music source separation (Nov 07 2018)
Isolating individual instruments in a musical mixture has a myriad of potential applications, and seems imminently achievable given the levels of performance reached by recent deep learning methods. While most musical source separation techniques learn ...

SDR - half-baked or well done? (Nov 06 2018)
In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their ... (a generic SDR computation sketch follows below)
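To make "spectrogram-mask-based speech enhancement" (the low-latency enhancement entry above) concrete, here is a minimal sketch: take the STFT of a noisy mixture, multiply it by a time-frequency mask, and invert. For illustration the mask is an oracle ideal ratio mask computed from known clean and noise signals; the systems described above instead estimate the mask with a neural network from the mixture alone. The test signals, FFT size, and use of SciPy are assumptions, not taken from the paper.

    # Minimal sketch of spectrogram-mask-based enhancement (illustrative only).
    # An oracle ideal ratio mask is built from known clean/noise signals and
    # applied to the mixture STFT; real systems predict the mask with a network.
    import numpy as np
    from scipy.signal import stft, istft

    fs = 16000
    t = np.arange(fs) / fs
    clean = np.sin(2 * np.pi * 440 * t)        # stand-in "speech"
    noise = 0.5 * np.random.randn(fs)          # stand-in noise
    mix = clean + noise

    _, _, S = stft(clean, fs=fs, nperseg=512)  # clean spectrogram
    _, _, N = stft(noise, fs=fs, nperseg=512)  # noise spectrogram
    _, _, X = stft(mix, fs=fs, nperseg=512)    # mixture spectrogram

    mask = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-12)   # ideal ratio mask
    _, enhanced = istft(mask * X, fs=fs, nperseg=512)    # masked re-synthesis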
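The final entry concerns SDR as an evaluation measure. As a generic illustration of the kind of metric involved (not necessarily the exact definition analyzed in that paper), a scale-invariant SDR between an estimate and a reference can be computed in a few lines; the function name and test signals below are our own.

    # Generic scale-invariant SDR in dB (illustrative; the paper's exact
    # definition may differ).
    import numpy as np

    def si_sdr(estimate, reference):
        # Project the estimate onto the reference to isolate the target part.
        alpha = np.dot(estimate, reference) / np.dot(reference, reference)
        target = alpha * reference
        error = estimate - target
        return 10.0 * np.log10(np.sum(target ** 2) / (np.sum(error ** 2) + 1e-12))

    # Usage: a lightly corrupted copy of a reference signal scores roughly 20 dB.
    ref = np.random.randn(16000)
    est = ref + 0.1 * np.random.randn(16000)
    print(round(si_sdr(est, ref), 2))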