Latest in eess.AS

Similarity measures for vocal-based drum sample retrieval using deep convolutional auto-encoders (Feb 14 2018). The expressive nature of the voice provides a powerful medium for communicating sonic ideas, motivating recent research on methods for query by vocalisation. Meanwhile, deep learning methods have demonstrated state-of-the-art results for matching vocal ...
Completely Distributed Power Allocation using Deep Neural Network for Device to Device communication Underlaying LTE (Feb 08 2018). Device to device (D2D) communication underlaying LTE can be used to distribute the traffic load of eNBs. However, a conventional D2D link is controlled by an eNB, which still places a burden on the eNB. We propose a completely distributed power allocation ...
A Divide and Conquer Strategy for Musical Noise-free Speech Enhancement in Adverse Environments (Feb 07 2018). A divide and conquer strategy for enhancing noisy speech in adverse environments with low SNR is presented in this paper, where the overall speech enhancement system is divided into two separate steps. The first step is based ...
Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition (Feb 07 2018). The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios. Differences in speaker accents are a significant source of such mismatch. The traditional approach to deal with multiple ...
Recognition of Acoustic Events Using Masked Conditional Neural Networks (Feb 07 2018). Automatic feature extraction using neural networks has achieved remarkable success for images, but for sound recognition these models are usually modified to fit the multi-dimensional temporal representation of the audio signal in spectrograms. ...
Learning from Past Mistakes: Improving Automatic Speech Recognition Output via Noisy-Clean Phrase Context Modeling (Feb 07 2018). Automatic speech recognition (ASR) systems lack joint optimization during decoding over the acoustic, lexical and language models; for instance, the ASR will often prune words due to acoustics using short-term context, prior to rescoring with long-term ...
A Generative Model for Natural Sounds Based on Latent Force Modelling (Feb 02 2018). Recent advances in analysis of subband amplitude envelopes of natural sounds have resulted in convincing synthesis, showing subband amplitudes to be a crucial component of perception. Probabilistic latent variable analysis is particularly revealing, but ...
Deep Predictive Models in Interactive Music (Jan 31 2018). Automatic music generation is a compelling task where much recent progress has been made with deep learning models. In this paper, we ask how these models can be integrated into interactive music systems; how can they encourage or enhance the music making ...
Highly-Reverberant Real Environment database: HRRE (Jan 29 2018). Speech recognition in highly-reverberant real environments remains a major challenge. An evaluation dataset for this task is needed. This report describes the generation of the Highly-Reverberant Real Environment database (HRRE). This database contains ...
CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition (Jan 24 2018). ASR (automatic speech recognition) systems like Siri, Alexa, Google Voice or Cortana have become quite popular recently. One of the key techniques enabling the practical use of such systems in people's daily lives is deep learning. Though deep learning ...
Learning audio and image representations with bio-inspired trainable feature extractors (Jan 02 2018). Recent advancements in pattern recognition and signal processing concern the automatic learning of data representations from labeled training samples. Typical approaches are based on deep learning and convolutional neural networks, which require large ...
A Light-Weight Multimodal Framework for Improved Environmental Audio Tagging (Dec 27 2017). The lack of strong labels has severely limited state-of-the-art fully supervised audio tagging systems from scaling to larger datasets. Meanwhile, audio-visual learning models based on unlabeled videos have been successfully applied to audio tagging, ...
Multiple Instance Deep Learning for Weakly Supervised Audio Event Detection (Dec 27 2017). State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine resolution annotations are too expensive to obtain. In ...
Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation (Dec 12 2017). The task of estimating the maximum number of concurrent speakers from single channel mixtures is important for various audio-based applications, such as blind source separation, speaker diarisation, audio surveillance or auditory scene classification. ...
Visual Features for Context-Aware Speech Recognition (Dec 01 2017). Automatic transcriptions of consumer-generated multi-media content such as "Youtube" videos still exhibit high word error rates. Such data typically occupies a very broad domain, has been recorded in challenging conditions, with cheap hardware and a focus ...
Wavenet based low rate speech coding (Dec 01 2017). Traditional parametric coding of speech achieves low rates but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit ...
HoME: a Household Multimodal Environment (Nov 29 2017). We introduce HoME: a Household Multimodal Environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts ...
Onsets and Frames: Dual-Objective Piano Transcription (Oct 30 2017). We consider the problem of transcribing polyphonic piano music with an emphasis on generalizing to unseen instruments. We use deep neural networks and propose a novel approach that predicts onsets and frames using both CNNs and LSTMs. This model predicts ...
Lip2AudSpec: Speech reconstruction from silent lip movements video (Oct 26 2017). In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use the auditory spectrogram as the spectral representation of speech, together with its corresponding sound generation method, resulting in a more ...
End-to-end DNN Based Speaker Recognition Inspired by i-vector and PLDA (Oct 06 2017; revised Jan 08 2018). Recently several end-to-end speaker verification systems based on deep neural networks (DNNs) have been proposed. These systems have been proven to be competitive for text-dependent tasks as well as for text-independent tasks with short utterances. However, ...
Linear Computer-Music through Sequences over Galois Fields (Sep 19 2017). It is shown how binary sequences can be associated with automatic composition of monophonic pieces. We are concerned with the composition of e-music from finite field structures. The information at the input may be either random or information from a ...
Understanding MIDI: A Painless Tutorial on Midi Format (May 15 2017). A short overview demystifying the MIDI audio format is presented. The goal is to explain the file structure and how the instructions are used to produce a music signal, for both monophonic and polyphonic signals.
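As a minimal illustration of the file structure this tutorial covers (a sketch, not code from the paper): every Standard MIDI File begins with a fixed-size "MThd" header chunk holding the format type, track count, and timing division, which can be unpacked with Python's struct module.

```python
import struct

def parse_midi_header(data: bytes):
    """Parse the 14-byte MThd header chunk of a Standard MIDI File.

    Layout (all big-endian): 4-byte chunk id "MThd", 4-byte length
    (always 6), then three 16-bit fields: format, number of tracks,
    and time division (ticks per quarter note when the top bit is 0).
    """
    chunk_id, length = struct.unpack(">4sI", data[:8])
    if chunk_id != b"MThd" or length != 6:
        raise ValueError("not a Standard MIDI File header")
    fmt, ntracks, division = struct.unpack(">HHH", data[8:14])
    return fmt, ntracks, division

# Build a minimal header: format 0, one track, 480 ticks per quarter note.
header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, 480)
print(parse_midi_header(header))  # (0, 1, 480)
```

Track data follows in "MTrk" chunks with the same id-plus-length framing, which is what makes the format straightforward to walk chunk by chunk.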