Open Access System for Information Sharing

Thesis

Accurate Attractor Detection Using the Mahalanobis Distance in Embedding-Based Speech Separation

Title
Accurate Attractor Detection Using the Mahalanobis Distance in Embedding-Based Speech Separation
Authors
윤상호
Date Issued
2022
Publisher
Pohang University of Science and Technology (POSTECH)
Abstract
In the past decade, research on speech recognition has grown rapidly because it is highly accessible and provides experiences similar to human interaction. Speech recognition extracts time-frequency characteristics from the target speaker's voice and analyzes the characteristics of each speaker. However, in a multi-speaker environment, noise or the voice of an interfering speaker significantly degrades recognition of the target speaker. Therefore, speech separation has been considered an essential preprocessing step for speech recognition. Several speech separation algorithms have been proposed, and they can be classified into two groups according to how speaker features are generated. Hand-crafted feature methods are based on voice characteristics defined by engineers. These methods produce fairly accurate separated speech when the mixture contains known speakers. However, they have difficulty separating mixtures that include unknown speakers, whose features are not defined. On the other hand, feature-learning methods are based on deep neural networks that automatically learn speakers' voice characteristics. These models separate both known and unknown speakers well.

However, feature-learning methods also have several problems. Most notably, the quality of separated speech can be degraded by the label permutation problem, the output dimension mismatch problem, and the speaker tracing problem. In speech separation, it does not matter which speaker is assigned to which output, so the task is independent of the order of the labels. However, if the output order of the labels changes during training, the label permutation problem occurs, which degrades learning performance. Furthermore, when features are learned with a deep neural network, the number of speakers to be separated is fixed; since the number of speakers in the input mixture can vary, a mismatch between the fixed output dimension and the actual number of speakers occurs, which is referred to as the output dimension mismatch problem. In addition, to improve the quality of separated speech, separation algorithms usually split the mixture into smaller segments and separate each segment, so the label permutation problem can occur at every segment. As a result, the speaker tracing problem arises: there is no guarantee that the same speaker's voice is consistently produced at the same output.

To solve these problems, permutation invariant training (PIT) and utterance-level permutation invariant training (uPIT) have been proposed. Both are multi-class regression-based methods that learn masks to be applied directly to the mixture speech. PIT and uPIT solve the label permutation problem by considering all label permutations during training. However, since the permutation of the output masks cannot be guaranteed, they do not solve the output dimension mismatch problem and only partially solve the speaker tracing problem. To address these problems, the deep attractor network (DANet) was proposed. DANet is an embedding-based method that learns an embedding space in which each speaker can be distinguished. The goal is to spread the segments of the mixture speech in the embedding space and learn to cluster them by speaker. The center of each speaker's cluster in the embedding space is called an attractor, and the network learns embeddings that cluster tightly around it.
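As a concrete illustration of the attractor idea described above, the following minimal NumPy sketch shows how a per-speaker attractor can be computed as the mask-weighted mean of embedded time-frequency bins during training. The array shapes, names, and the use of an ideal binary mask are illustrative assumptions, not the dissertation's code.

    import numpy as np

    def compute_attractors(embeddings, ideal_masks):
        """Attractor per speaker as the ideal-mask-weighted mean of embeddings.

        embeddings:  (N, D) embedding vector for each time-frequency bin
        ideal_masks: (N, C) ideal binary assignment of each bin to a speaker
        returns:     (C, D) one attractor per speaker
        """
        # Sum the embeddings assigned to each speaker, then normalize by bin count.
        weighted_sum = ideal_masks.T @ embeddings               # (C, D)
        counts = ideal_masks.sum(axis=0, keepdims=True).T       # (C, 1)
        return weighted_sum / np.maximum(counts, 1e-8)

    # Toy example: 6 bins, 4-dimensional embeddings, 2 speakers.
    V = np.random.randn(6, 4)
    Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]], dtype=float)
    attractors = compute_attractors(V, Y)                       # shape (2, 4)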
However, since the attractor is found through k-means clustering at test time, a center mismatch problem occurs in which it is difficult to find the attractor accurately. To alleviate the center mismatch problem, the anchored deep attractor network (ADANet) was proposed; it sets reference points called anchors, expanding the number of candidate attractors. However, many speech segments remain difficult to separate even after training, so separation performance is still limited.

More generally, embedding-based methods still have several problems that degrade separation accuracy. First, it is difficult to find the proper locations of the T-F bins for various speakers and sentences; therefore, each speaker should be separated using accurately distinguishable attractors in the high-dimensional embedding space. Second, in embedding-based methods the scale of the separated speech is changed, so training should take the scale of the speech into account to reduce distortion of the separated speech. Third, embedding-based methods rely on clustering accuracy to obtain the separated speech by masking the scattered data in the embedding space, which requires mathematically accurate modeling.

Therefore, this dissertation proposes accurate attractor detection using the Mahalanobis distance to overcome the limitations of existing methods. First, we mitigate the overlap between speakers' data by accurately finding the attractors of the scattered data and applying a method that considers the distribution of the embedded data: instead of assigning binary or Euclidean-distance-based weights for speaker assignment (indicating which speaker each speech segment belongs to), weights are assigned using the Mahalanobis distance, which accounts for the distribution. Second, the loss function incorporates the scale-invariant signal-to-distortion ratio (SI-SDR), maintaining the scale consistency of the separated speech; separation accuracy is improved by generating a scale factor through projection in the vector space, based on the scale difference between the original and separated signals, and applying it during training. Third, the masking function is accurately modeled by treating the Mahalanobis distance as if it were a physical distance, yielding high clustering accuracy.

Experimental results show that the proposed algorithm achieves better separation accuracy than existing methods. Compared to the best existing method, the proposed algorithm achieved 15.11 dB, an improvement of about 4.9%. At the same time, it achieved a clustering accuracy of 99.06%, about 5.33% higher than ADANet, the best-performing embedding-based method. Although it shows a slightly lower SI-SDR in a noise-free environment, it remains robust in noisy environments. In addition, it achieves the best separation accuracy with significantly fewer parameters. From the experimental results, I conclude that the proposed method provides the best separation accuracy and high-quality separated speech even with a small number of network parameters.
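The two ingredients highlighted in the abstract, distribution-aware speaker weights based on the Mahalanobis distance and a scale-invariant SDR term, can be sketched as follows. This is a minimal illustrative sketch under assumed conventions (a softmax over negative squared Mahalanobis distances for the weights, and projection of the estimate onto the reference for SI-SDR); the function names, per-speaker covariances, and other details are assumptions, not the dissertation's implementation.

    import numpy as np

    def mahalanobis_sq(x, mean, cov_inv):
        """Squared Mahalanobis distance from each row of x to a cluster mean."""
        diff = x - mean                                          # (N, D)
        return np.einsum('nd,de,ne->n', diff, cov_inv, diff)

    def soft_assignments(embeddings, attractors, covs):
        """Distribution-aware speaker weights for each embedded bin.

        Softmax over negative squared Mahalanobis distances, so a bin is
        weighted by how likely it is under each speaker's spread rather
        than by a binary or Euclidean-distance rule.
        """
        d2 = np.stack([mahalanobis_sq(embeddings, a, np.linalg.inv(c))
                       for a, c in zip(attractors, covs)], axis=1)   # (N, C)
        logits = -d2
        logits -= logits.max(axis=1, keepdims=True)              # numerical stability
        w = np.exp(logits)
        return w / w.sum(axis=1, keepdims=True)

    def si_sdr(estimate, reference, eps=1e-8):
        """Scale-invariant SDR: project the estimate onto the reference signal."""
        alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
        target = alpha * reference                               # scaled reference
        noise = estimate - target                                # residual distortion
        return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))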
URI
http://postech.dcollection.net/common/orgView/200000600802
https://oasis.postech.ac.kr/handle/2014.oak/117253
Article Type
Thesis
Files in This Item:
There are no files associated with this item.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
