Similarity measures for vocal-based drum sample retrieval using deep convolutional auto-encodersFeb 14 2018The expressive nature of the voice provides a powerful medium for communicating sonic ideas, motivating recent research on methods for query by vocalisation. Meanwhile, deep learning methods have demonstrated state-of-the-art results for matching vocal ... More
A Divide and Conquer Strategy for Musical Noise-free Speech Enhancement in Adverse EnvironmentsFeb 07 2018A divide and conquer strategy for enhancement of noisy speech in adverse environments involving lower levels of SNR is presented in this paper, where the total system of speech enhancement is divided into two separate steps. The first step is based ... More
Joint Modeling of Accents and Acoustics for Multi-Accent Speech RecognitionFeb 07 2018The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios. Differences in speaker accents are a significant source of such mismatch. The traditional approach to deal with multiple ... More
Recognition of Acoustic Events Using Masked Conditional Neural NetworksFeb 07 2018Automatic feature extraction using neural networks has accomplished remarkable success for images, but for sound recognition, these models are usually modified to fit the nature of the multi-dimensional temporal representation of the audio signal in spectrograms. ... More
Learning from Past Mistakes: Improving Automatic Speech Recognition Output via Noisy-Clean Phrase Context ModelingFeb 07 2018Automatic speech recognition (ASR) systems lack joint optimization during decoding over the acoustic, lexical and language models; for instance the ASR will often prune words due to acoustics using short-term context, prior to rescoring with long-term ... More
A Generative Model for Natural Sounds Based on Latent Force ModellingFeb 02 2018Recent advances in analysis of subband amplitude envelopes of natural sounds have resulted in convincing synthesis, showing subband amplitudes to be a crucial component of perception. Probabilistic latent variable analysis is particularly revealing, but ... More
Deep Predictive Models in Interactive MusicJan 31 2018Automatic music generation is a compelling task where much recent progress has been made with deep learning models. In this paper, we ask how these models can be integrated into interactive music systems; how can they encourage or enhance the music making ... More
Highly-Reverberant Real Environment database: HRREJan 29 2018Speech recognition in highly-reverberant real environments remains a major challenge. An evaluation dataset for this task is needed. This report describes the generation of the Highly-Reverberant Real Environment database (HRRE). This database contains ... More
CommanderSong: A Systematic Approach for Practical Adversarial Voice RecognitionJan 24 2018ASR (automatic speech recognition) systems like Siri, Alexa, Google Voice or Cortana have become quite popular recently. One of the key techniques enabling the practical use of such systems in people's daily life is deep learning. Though deep learning ... More
Expectation Learning for Adaptive Crossmodal Stimuli AssociationJan 23 2018The human brain is able to learn, generalize, and predict crossmodal stimuli. Learning by expectation fine-tunes crossmodal processing at different levels, thus enhancing our power of generalization and adaptation in highly dynamic environments. In this ... More
Learning audio and image representations with bio-inspired trainable feature extractorsJan 02 2018Recent advancements in pattern recognition and signal processing concern the automatic learning of data representations from labeled training samples. Typical approaches are based on deep learning and convolutional neural networks, which require large ... More
A Light-Weight Multimodal Framework for Improved Environmental Audio TaggingDec 27 2017The lack of strong labels has severely limited the scaling of state-of-the-art fully supervised audio tagging systems to larger datasets. Meanwhile, audio-visual learning models based on unlabeled videos have been successfully applied to audio tagging, ... More
Multiple Instance Deep Learning for Weakly Supervised Audio Event DetectionDec 27 2017State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine resolution annotations are too expensive to obtain. In ... More
Classification vs. Regression in Supervised Learning for Single Channel Speaker Count EstimationDec 12 2017The task of estimating the maximum number of concurrent speakers from single channel mixtures is important for various audio-based applications, such as blind source separation, speaker diarisation, audio surveillance or auditory scene classification. ... More
Wavenet based low rate speech codingDec 01 2017Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit ... More
HoME: a Household Multimodal EnvironmentNov 29 2017We introduce HoME: a Household Multimodal Environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts ... More
Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice ExtractionOct 31 2017The state of the art in music source separation employs neural networks trained in a supervised fashion on multi-track databases to estimate the sources from a given mixture. With only few datasets available, often extensive data augmentation is used ... More
Onsets and Frames: Dual-Objective Piano TranscriptionOct 30 2017We consider the problem of transcribing polyphonic piano music with an emphasis on generalizing to unseen instruments. We use deep neural networks and propose a novel approach that predicts onsets and frames using both CNNs and LSTMs. This model predicts ... More
End-to-end DNN Based Speaker Recognition Inspired by i-vector and PLDAOct 06 2017Jan 08 2018Recently several end-to-end speaker verification systems based on deep neural networks (DNNs) have been proposed. These systems have been proven to be competitive for text-dependent tasks as well as for text-independent tasks with short utterances. However, ... More
Linear Computer-Music through Sequences over Galois FieldsSep 19 2017It is shown how binary sequences can be associated with automatic composition of monophonic pieces. We are concerned with the composition of e-music from finite field structures. The information at the input may be either random or information from a ... More
Speech Dereverberation Using Nonnegative Convolutive Transfer Function and Spectro temporal ModelingSep 16 2017This paper presents two single channel speech dereverberation methods to enhance the quality of speech signals that have been recorded in an enclosed space. For both methods, the room acoustics are modeled using a nonnegative approximation of the convolutive ... More
Object-Based Audio RenderingAug 23 2017Apparatus and methods are disclosed for performing object-based audio rendering on a plurality of audio objects which define a sound scene, each audio object comprising at least one audio signal and associated metadata. The apparatus comprises: a plurality ... More
Classical Music Composition Using State Space ModelsAug 12 2017Algorithmic composition of music has a long history and with the development of powerful deep learning methods, there has recently been increased interest in exploring algorithms and models to create art. We explore the utility of state space models, ... More
Understanding MIDI: A Painless Tutorial on Midi FormatMay 15 2017A short overview demystifying the MIDI audio format is presented. The goal is to explain the file structure and how the instructions are used to produce a music signal, for both monophonic and polyphonic signals.
Note Value Recognition for Piano Transcription Using Markov Random FieldsMar 23 2017Jul 07 2017This paper presents a statistical method for use in music transcription that can estimate score times of note onsets and offsets from polyphonic MIDI performance signals. Because performed note durations can deviate largely from score-indicated values, ... More
A Comparison of deep learning methods for environmental soundMar 20 2017Environmental sound detection is a challenging application of machine learning because of the noisy nature of the signal, and the small amount of (labeled) data that is typically available. This work thus presents a comparison of several state-of-the-art ... More
An Information-theoretic Approach to Machine-oriented Music SummarizationDec 07 2016Applying generic media-agnostic summarization to music allows for higher efficiency in automatic processing, storage, and communication of datasets while also alleviating copyright issues. This process has already been proven useful in the context of ... More
Towards computer-assisted understanding of dynamics in symphonic musicDec 07 2016Many people enjoy classical symphonic music. Its diverse instrumentation makes for a rich listening experience. This diversity adds to the conductor's expressive freedom to shape the sound according to their imagination. As a result, the same piece may ... More
Segmental Convolutional Neural Networks for Detection of Cardiac Abnormality With Noisy Heart Sound RecordingsDec 06 2016Heart diseases constitute a global health burden, and the problem is exacerbated by the error-prone nature of listening to and interpreting heart sounds. This motivates the development of automated classification to screen for abnormal heart sounds. Existing ... More
FMA: A Dataset For Music AnalysisDec 06 2016We present a new music dataset that can be used for several music analysis tasks. Our major goal is to go beyond the existing limitations of available music datasets, which are either the small size of datasets with raw audio tracks, the availability ... More
An algorithm to assign musical prime commas to every prime number to allow construction of a complete notation for frequencies in free Just IntonationDec 05 2016Musical frequencies in Just Intonation are comprised of rational numbers. The structure of rational numbers is determined by prime factorisations. Just Intonation frequencies can be split into two components. The larger component uses only integer powers ... More
Algorithmic Songwriting with ALYSIADec 04 2016This paper introduces ALYSIA: Automated LYrical SongwrIting Application. ALYSIA is based on a machine learning model using Random Forests, and we discuss its success at pitch and rhythm prediction. Next, we show how ALYSIA was used to create original ... More
DeepBach: a Steerable Model for Bach chorales generationDec 03 2016The composition of polyphonic chorale music in the style of J.S Bach has represented a major challenge in automatic music composition over the last decades. The art of Bach chorales composition involves combining four-part harmony with characteristic ... More
FRIDA: FRI-Based DOA Estimation for Arbitrary Array LayoutsDec 02 2016In this paper we present FRIDA---an algorithm for estimating directions of arrival of multiple wideband sound sources. FRIDA combines multi-band information coherently and achieves state-of-the-art resolution at extremely low signal-to-noise ratios. It ... More
A Non Linear Approach towards Automated Emotion Analysis in Hindustani MusicDec 01 2016In North Indian Classical Music, raga forms the basic structure over which individual improvisation is performed by an artist based on his/her creativity. The Alap is the opening section of a typical Hindustani Music (HM) performance, where the raga ... More
A Non Linear Multifractal Study to Illustrate the Evolution of Tagore Songs Over a CenturyDec 01 2016The works of Rabindranath Tagore have been sung by various artistes over generations spanning almost 100 years. There are a few songs which were popular in the early years and have been able to retain their popularity over the years while some others ... More
Learning Features of Music from ScratchNov 29 2016We introduce a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written ... More
Getting Closer to the Essence of Music: The Con Espressione ManifestoNov 29 2016This text offers a personal and very subjective view on the current situation of Music Information Research (MIR). Motivated by the desire to build systems with a somewhat deeper understanding of music than the ones we currently have, I try to sketch ... More
Learning Filter Banks Using Deep Learning For Acoustic SignalsNov 29 2016Designing appropriate features for acoustic event recognition tasks is an active field of research. Expressive features should both improve the performance of the tasks and also be interpretable. Currently, heuristically designed features based on the ... More
Understanding Audio Pattern Using Convolutional Neural Network From Raw WaveformsNov 29 2016One key step in audio signal processing is to transform the raw signal into representations that are efficient for encoding the original information. Traditionally, people transform the audio into spectral representations, as a function of frequency, ... More
Fast Wavenet Generation AlgorithmNov 29 2016This paper presents an efficient implementation of the Wavenet generation process called Fast Wavenet. Compared to a naive implementation that has complexity O(2^L) (L denotes the number of layers in the network), our proposed approach removes redundant ... More
Deep attractor network for single-microphone speaker separationNov 27 2016Despite the overwhelming success of deep learning in various speech processing tasks, the problem of separating simultaneous speakers in a mixture remains challenging. Two major difficulties in such systems are the arbitrary source permutation and unknown ... More
Invariant Representations for Noisy Speech RecognitionNov 27 2016Modern automatic speech recognition (ASR) systems need to be robust under acoustic variability arising from environmental, speaker, channel, and recording conditions. Ensuring such robustness to variability is a challenge in modern day neural network-based ... More
SISO and SIMO Accompaniment Cancellation for Live Solo Recordings Based on Short-Time ERB-Band Wiener Filtering and Spectral SubtractionNov 27 2016Research in collaborative music learning is subject to unresolved problems demanding new technological solutions. One such problem is the suppression of the accompaniment in a live recording of a performance during practice, which can be for the purposes ... More
Fast Chirplet Transform feeding CNN, application to orca and bird bioacousticsNov 26 2016Advanced soundscape analysis and machine listening require efficient time-frequency decompositions. The recent scattering theory offers a robust hierarchical convolutional decomposition; nevertheless, its kernels need to be fixed. The CNN can ... More
MOMOS-MT: Mobile Monophonic System for Music TranscriptionNov 22 2016Music holds a significant cultural role in social identity and in the encouragement of socialization. Technology, by the destruction of physical and cultural distance, has led to many changes in musical themes and the complete loss of forms. Yet, it ... More
Robust end-to-end deep audiovisual speech recognitionNov 21 2016Speech is one of the most effective ways of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly useful when the ... More
Decision-Based Transcription of Jazz Guitar Solos Using a Harmonic Bident Analysis Filter Bank and Spectral Distribution WeightingNov 20 2016Jazz guitar solos are improvised melody lines played on one instrument on top of a chordal accompaniment (comping). As the improvisation happens spontaneously, a reference score is non-existent, only a lead sheet. There are situations, however, when one ... More
Deep Clustering and Conventional Networks for Music Separation: Stronger TogetherNov 18 2016Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks. However, little is known ... More
Grammar Argumented LSTM Neural Networks with Note-Level Encoding for Music CompositionNov 16 2016Creating any aesthetically pleasing piece of art, like music, has been a long time dream for artificial intelligence research. Based on recent success of long-short term memory (LSTM) on sequence learning, we put forward a novel system to reflect the ... More
Composing Music with Grammar Argumented Neural Networks and Note-Level EncodingNov 16 2016Dec 07 2016Creating aesthetically pleasing pieces of art, including music, has been a long-term goal for artificial intelligence research. Despite recent successes of long-short term memory (LSTM) recurrent neural networks (RNNs) in sequential learning, LSTM neural ... More
Detecting tala Computationally in Polyphonic Context - A Novel ApproachNov 16 2016In the North Indian Music System (NIMS), the tabla is mostly used as percussive accompaniment for vocal music in polyphonic compositions. The human auditory system uses perceptual grouping of musical elements and easily filters the tabla component, thereby decoding ... More
Detection of north atlantic right whale upcalls using local binary patterns in a two-stage strategyNov 15 2016In this paper, we investigate the effectiveness of two-stage classification strategies in detecting North Atlantic right whale upcalls. Time-frequency measurements of data from passive acoustic monitoring devices are evaluated as images. Vocalization ... More
Audio Event and Scene Recognition: A Unified Approach using Strongly and Weakly Labeled DataNov 12 2016Nov 23 2016In this paper we propose a novel learning framework called Supervised and Weakly Supervised Learning where the goal is to learn simultaneously from weakly and strongly labeled data. Strongly labeled data can be simply understood as fully supervised data ... More
Landmark-based consonant voicing detection on multilingual corporaNov 10 2016This paper tests the hypothesis that distinctive feature classifiers anchored at phonetic landmarks can be transferred cross-lingually without loss of accuracy. Three consonant voicing classifiers were developed: (1) manually selected acoustic features ... More
Song From PI: A Musically Plausible Network for Pop Music GenerationNov 10 2016We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers ... More
Noise reduction combining microphone and piezoelectric deviceNov 10 2016It is often required to extract the sound of an objective instrument played in concert with other instruments. Microphone array is one of the effective ways to enhance a sound from a specific direction. However it is not effective in an echoic room such ... More
VR 'SPACE OPERA': Mimetic Spectralism in an Immersive Starlight Audification SystemNov 09 2016This paper describes a system designed as part of an interactive VR opera, which immerses a real-time composer and an audience (via a network) in the historical location of Gobeklitepe, in southern Turkey during an imaginary scenario set in the Pre-Pottery ... More
Automatic recognition of child speech for robotic applications in noisy environmentsNov 08 2016Automatic speech recognition (ASR) allows a natural and intuitive interface for robotic educational applications for children. However there are a number of challenges to overcome to allow such an interface to operate robustly in realistic settings, including ... More
Domain Adaptation For Formant Estimation Using Deep LearningNov 06 2016In this paper we present a domain adaptation technique for formant estimation using a deep network. We first train a deep learning network on a small read speech dataset. We then freeze the parameters of the trained network and use several different datasets ... More
Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization with Spatial Sparsity RegularizationNov 03 2016This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A Gaussian mixture model (GMM) is adopted, whose components correspond to all the possible candidate ... More
Frame Theory for Signal Processing in PsychoacousticsNov 03 2016This review chapter aims to strengthen the link between frame theory and signal processing tasks in psychoacoustics. On the one side, the basic concepts of frame theory are presented and some proofs are provided to explain those concepts in some detail. ... More
The Intelligent Voice 2016 Speaker Recognition SystemNov 02 2016This paper presents the Intelligent Voice (IV) system submitted to the NIST 2016 Speaker Recognition Evaluation (SRE). The primary emphasis of SRE this year was on developing speaker recognition technology which is robust for novel languages that are ... More
Enhanced Factored Three-Way Restricted Boltzmann Machines for Speech DetectionNov 01 2016In this letter, we propose enhanced factored three way restricted Boltzmann machines (EFTW-RBMs) for speech detection. The proposed model incorporates conditional feature learning by multiplying the dynamical state of the third unit, which allows a modulation ... More
SoundNet: Learning Sound Representations from Unlabeled VideoOct 27 2016We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled ... More
Voice Conversion using Convolutional Neural NetworksOct 27 2016The human auditory system is able to distinguish the vocal source of thousands of speakers, yet not much is known about what features the auditory system uses to do this. Fourier Transforms are capable of capturing the pitch and harmonic structure of ... More
Automatic measurement of vowel duration via structured predictionOct 26 2016A key barrier to making phonetic studies scalable and replicable is the need to rely on subjective, manual annotation. To help meet this challenge, a machine learning algorithm was developed for automatic measurement of a widely used phonetic measure: ... More
A model of infant speech perception and learningOct 19 2016Infant speech perception and learning is modeled using Echo State Network classification and Reinforcement Learning. Ambient speech for the modeled infant learner is created using the speech synthesizer Vocaltractlab. An auditory system is trained to ... More
A Bayesian Approach to Estimation of Speaker Normalization ParametersOct 19 2016In this work, a Bayesian approach to speaker normalization is proposed to compensate for the degradation in performance of a speaker independent speech recognition system. The speaker normalization method proposed herein uses the technique of vocal tract ... More
A multi-task learning model for malware classification with useful file access pattern from API call sequenceOct 19 2016Based on API call sequences, semantic-aware and machine learning (ML) based malware classifiers can be built for malware detection or classification. Previous works concentrate on crafting and extracting various features from malware binaries, disassembled ... More
Acoustic Reflector Localization: Novel Image Source Reversion and Direct Localization MethodsOct 18 2016Acoustic reflector localization is an important issue in audio signal processing, with direct applications in spatial audio, scene reconstruction, and source separation. Several methods have recently been proposed to estimate the 3D positions of acoustic ... More
Improving Short Utterance PLDA Speaker Verification using SUV Modelling and Utterance Partitioning ApproachOct 17 2016This paper analyses the short utterance probabilistic linear discriminant analysis (PLDA) speaker verification with utterance partitioning and short utterance variance (SUV) modelling approaches. Experimental studies have found that instead of using single ... More
Making Mainstream Synthesizers with CsoundOct 16 2016For more than the past twenty years, Csound has been one of the leaders in the world of the computer music research, implementing innovative synthesis methods and making them available beyond the academic environments from which they often arise, and ... More
Semi-Supervised Source Localization on Multiple-Manifolds with Distributed MicrophonesOct 15 2016The problem of source localization with ad hoc microphone networks in noisy and reverberant enclosures, given a training set of prerecorded measurements, is addressed in this paper. The training set is assumed to consist of a limited number of labelled ... More
Non-negative matrix factorization-based subband decomposition for acoustic source localizationOct 15 2016A novel non-negative matrix factorization (NMF) based subband decomposition in frequency spatial domain for acoustic source localization using a microphone array is introduced. The proposed method decomposes source and noise subband and emphasises source ... More
Tonal consonance parameters expose a hidden order in musicOct 14 2016Consonance is related to the perception of pleasantness arising from the combination of sounds and has been approached quantitatively using mathematical relations, physics, information theory and psychoacoustics. Tonal consonance is present in timbre, ... More
A Geometrical-Statistical approach to outlier removal for TDOA measumentsOct 14 2016The curse of outlier measurements in estimation problems is a well-known issue in a variety of fields. Therefore, outlier removal procedures, which enable the identification of spurious measurements within a set, have been developed for many different ... More
Voice Conversion from Non-parallel Corpora Using Variational Auto-encoderOct 13 2016We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or ... More
Dictionary Update for NMF-based Voice Conversion Using an Encoder-Decoder NetworkOct 13 2016In this paper, we propose a dictionary update method for Nonnegative Matrix Factorization (NMF) with high dimensional data in a spectral conversion (SC) task. Voice conversion has been widely studied due to its potential applications such as personalized ... More
A Survey of Voice Translation Methodologies - Acoustic Dialect DecoderOct 13 2016Speech translation has always been about giving source text or audio input and waiting for the system to give translated output in the desired form. In this paper, we present the Acoustic Dialect Decoder (ADD) - a voice to voice ear-piece translation device. ... More
RAVEN X High Performance Data Mining Toolbox for Bioacoustic Data AnalysisOct 12 2016The objective of this work is to integrate high performance computing (HPC) technologies and bioacoustics data-mining capabilities by offering a MATLAB-based toolbox called Raven-X. Raven-X will provide a hardware-independent solution for processing large ... More
Maximum entropy models for generation of expressive musicOct 12 2016In the context of contemporary monophonic music, expression can be seen as the difference between a musical performance and its symbolic representation, i.e. a musical score. In this paper, we show how Maximum Entropy (MaxEnt) models can be used to generate ... More
DNN based Speaker Recognition on Short UtterancesOct 11 2016This paper investigates the effects of limited speech data in the context of speaker verification using deep neural network (DNN) approach. Being able to reduce the length of required speech data is important to the development of speaker verification ... More
Investigation of Synthetic Speech Detection Using Frame- and Segment-Specific Importance WeightingOct 10 2016Speaker verification systems are vulnerable to spoofing attacks, which present a major problem in their real-life deployment. To date, most of the proposed synthetic speech detectors (SSDs) have weighted the importance of different segments of speech ... More
Domain adaptation based Speaker Recognition on Short UtterancesOct 10 2016Oct 11 2016This paper explores how in- and out-domain probabilistic linear discriminant analysis (PLDA) speaker verification behaves when enrolment and verification lengths are reduced. Experimental studies have found that when full-length utterance is used for ... More
A Music-generating System Inspired by the Science of Complex Adaptive SystemsOct 08 2016This paper presents NetWorks (NW), an interactive music generation system that uses a hierarchically clustered scale free network to generate music that ranges from orderly to chaotic. NW was inspired by the Honing Theory of creativity, according to which ... More
An Automatic System for Acoustic Microphone Geometry Calibration based on Minimal SolversOct 07 2016In this paper, robust detection, tracking and geometry estimation methods are developed and combined into a system for estimating time-difference estimates, microphone localization and sound source movement. No assumptions on the 3D locations of the microphones ... More
A Joint Detection-Classification Model for Audio Tagging of Weakly Labelled DataOct 06 2016Audio tagging aims to assign one or several tags to an audio clip. Most of the datasets are weakly labelled, which means only the tags of the clip are known, without knowing the occurrence time of the tags. The labeling of an audio clip is often based ... More
Divide-and-Conquer based Ensemble to Spot Emotions in Speech using MFCC and Random ForestOct 05 2016Besides spoken words, speech signals also carry information about speaker gender, age, and emotional state which can be used in a variety of speech analysis applications. In this paper, a divide and conquer strategy for ensemble classification has been ... More
Monaural Multi-Talker Speech Recognition using Factorial Speech Processing ModelsOct 05 2016A Pascal challenge entitled monaural multi-talker speech recognition was developed, targeting the problem of robust automatic speech recognition against speech-like noise, which significantly degrades the performance of automatic speech recognition systems. ... More
Speech Enhancement via Two-Stage Dual Tree Complex Wavelet Packet Transform with a Speech Presence Probability EstimatorOct 03 2016Oct 04 2016In this paper, a two-stage dual tree complex wavelet packet transform (DTCWPT) based speech enhancement algorithm has been proposed, in which a speech presence probability (SPP) estimator and a generalized minimum mean squared error (MMSE) estimator are ... More
On the Modeling of Musical Solos as Complex NetworksOct 03 2016Notes in a musical piece are building blocks employed in non-random ways to create melodies. It is the "interaction" among a limited number of notes that allows constructing the variety of musical compositions that have been written in centuries and within ... More
Very Deep Convolutional Neural Networks for Raw WaveformsOct 01 2016Learning acoustic models directly from the raw waveform data with minimal processing is challenging. Current waveform-based models have generally used very few (~2) convolutional layers, which might be insufficient for building high-level discriminative ... More
Optimal spectral transportation with application to music transcriptionSep 30 2016Oct 10 2016Many spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionary ... More
Adaptive dictionary based approach for background noise and speaker classification and subsequent source separationSep 30 2016Oct 28 2016A judicious combination of dictionary learning methods, block sparsity and a source recovery algorithm is used in a hierarchical manner to identify the noises and the speakers from a noisy conversation between two people. Conversations are simulated using ... More
Hearing in a shoe-box : Binaural source position and wall absorption estimation using virtually supervised learningSep 30 2016This paper introduces a new framework for supervised sound source localization referred to as virtually-supervised learning. An acoustic shoe-box room simulator is used to generate a large number of binaural single-source audio scenes. These scenes are ... More
Phase Unmixing : Multichannel Source Separation with Magnitude ConstraintsSep 30 2016We consider the problem of estimating the phases of K mixed complex signals from a multichannel observation, when the mixing matrix and signal magnitudes are known. This problem can be cast as a non-convex quadratically constrained quadratic program which ... More
Rectified binaural ratio: A complex T-distributed feature for robust sound localizationSep 30 2016Most existing methods in binaural sound source localization rely on some kind of aggregation of phase- and level-difference cues in the time-frequency plane. While different aggregation schemes exist, they are often heuristic and suffer in adverse noise ... More