The singing voice is one of the most salient parts of music. Singing voice-related tasks are therefore important for many areas of Music Information Retrieval (MIR), such as cover song identification, music recommendation, and singer identification. In this thesis, we focus exclusively on singing voice detection (SVD) and gender classification, two of the most popular such tasks. In parallel, researchers use source separation (SS), a technique for separating the components of a mixture, to obtain cleaner vocal signals, bass signals, and so on. We use SS as a pre-processing step for singing voice-related tasks, expecting it to improve their results. To validate this idea, we employ the most popular open-source musical source separation models: Demucs, Open-Unmix, and Spleeter. Using several models allows us to observe how their properties affect the results of singing voice-related tasks. As datasets, we use the Richard Wagner Ring opera dataset and the Smule dataset, which let us assess the robustness of our methods. Our main contributions are an understanding of how SS models can improve singing voice-related tasks, an analysis of their robustness across different scenarios, and a comparison of how the different SS models perform on these tasks, showing which SS models are better suited to which task. According to the results of the singing voice detection experiments, using SS models is mostly beneficial, and Spleeter outperforms all other SS models. We also investigate the longest classifier errors to identify their possible causes.
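As an illustration of this pre-processing idea, the sketch below separates the vocal stem with Spleeter and then applies a simple frame-level vocal activity heuristic. This is a minimal sketch only: the RMS-energy threshold detector, the file paths, and the threshold value are illustrative assumptions, not the classifiers or data used in the thesis.

```python
# Minimal sketch: source separation as a pre-processing step for
# singing voice detection (SVD). The energy-threshold detector below is a
# stand-in for a trained classifier; paths and threshold are illustrative.
import numpy as np
import librosa
from spleeter.separator import Separator

# 1) Separate the mixture into vocals + accompaniment with Spleeter (2-stem model).
separator = Separator("spleeter:2stems")
separator.separate_to_file("mixture.wav", "separated/")  # writes separated/mixture/vocals.wav

# 2) Load the separated vocal stem.
vocals, sr = librosa.load("separated/mixture/vocals.wav", sr=22050, mono=True)

# 3) Frame-level vocal activity from RMS energy (placeholder for a real SVD model).
rms = librosa.feature.rms(y=vocals, frame_length=2048, hop_length=512)[0]
threshold = 0.02  # illustrative value; a trained classifier would replace this
voiced_frames = rms > threshold

# 4) Print a simple frame-by-frame vocal/no-vocal decision.
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=512)
for t, v in zip(times, voiced_frames):
    print(f"{t:6.2f}s  {'vocal' if v else 'no vocal'}")
```

The same pipeline structure applies when Spleeter is swapped for Demucs or Open-Unmix; only the separation call changes.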
In this study, we propose a novel probabilistic model for separating clean speech signals from noisy mixtures by decomposing the mixture spectrograms into a structured speech part and a more flexible residual part. The main novelty of our model is that it uses a family of heavy-tailed distributions, the so-called α-stable distributions, to model the residual signal. We develop an expectation-maximization algorithm for parameter estimation and a Monte Carlo scheme for posterior estimation of the clean speech. Our experiments show that the proposed method outperforms relevant factorization-based algorithms by a significant margin.
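One plausible way to write such a decomposition, given here only as an illustrative sketch (the notation, the NMF-structured speech variance, and the residual parameterization are assumptions and may differ from the paper's exact likelihood), models each time-frequency coefficient of the mixture as the sum of a structured speech term and a heavy-tailed residual:

```latex
% Illustrative sketch (assumed notation): x_{ft} mixture STFT coefficient,
% s_{ft} speech part, r_{ft} residual part.
\begin{align}
  x_{ft} &= s_{ft} + r_{ft}, \\
  s_{ft} &\sim \mathcal{N}_c\!\Big(0,\; \textstyle\sum_{k} w_{fk} h_{kt}\Big)
    && \text{(speech: complex Gaussian with NMF-structured variance)}, \\
  r_{ft} &\sim \mathcal{S}\alpha\mathcal{S}_c(\sigma_{f}), \quad 0 < \alpha < 2
    && \text{(residual: isotropic complex symmetric $\alpha$-stable)}.
\end{align}
```

For α < 2, the symmetric α-stable residual has heavier tails than a Gaussian, which is what gives the residual part its flexibility relative to the structured speech part.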