TK980 : Speech Emotion Recognition Using Data Augmentation Method by Cycle-Generative Adversarial Networks
Thesis > Central Library of Shahrood University > Electrical Engineering > PhD > 2023
Authors:
Arash Shilandari [Author], Hosein Marvi [Supervisor], Hossein Khosravi [Advisor]
Abstract: The limited availability of labeled data is a significant obstacle to developing accurate speech emotion recognition (SER) systems. Data augmentation has emerged as an effective way to increase the amount of training data. In this thesis, we propose using cycle-consistent generative adversarial networks (Cycle-GANs) for data augmentation in SER systems. We design an adversarial network for each of the five emotions considered, so that the generated data follow a distribution similar to the original data within each emotion class while remaining distinct from the other classes. The networks are trained adversarially to generate feature vectors resembling those in the training set, and the generated vectors are then added to the original data. To mitigate vanishing gradients and produce high-quality samples, we train the Cycle-GANs with the Wasserstein divergence instead of the common cross-entropy loss. We evaluate the quality of the generated data using support vector machines and deep neural networks as classifiers. The recognition accuracy, measured by unweighted average recall, reaches approximately 83.33%, surpassing the baseline methods. Additionally, we explore emotion conversion in speech signals, transforming neutral speech into happiness and other emotional classes by applying Cycle-GANs to the frequency spectrum of the signal. This approach uses a PatchGAN model as the discriminator and a ResNet as the generator within a Cycle-GAN. By testing various configurations, we determine the optimal number of ResNet blocks in the generator. We run emotion-conversion simulations on different pairs of emotions, altering the emotion in the frequency spectrum of the speech signal, and report the results.
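As an illustration of the unweighted average recall (UAR) used above to report recognition accuracy, the following is a minimal sketch in plain Python. The function name and the toy labels are assumptions made for this example and are not taken from the thesis. UAR is the mean of the per-class recalls, so every emotion class counts equally regardless of how many samples it has.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class has equal weight."""
    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # correctly recognized samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (hypothetical labels, not thesis data).
y_true = ["anger", "anger", "anger", "anger", "sadness", "sadness"]
y_pred = ["anger", "anger", "anger", "sadness", "sadness", "anger"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the recall is 0.75 for anger and 0.5 for sadness, so the UAR is 0.625 while plain accuracy is about 0.667. The gap shows why UAR is often preferred when emotion classes are imbalanced.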
The proposed network effectively converts emotions in speech signals and is competitive with advanced speech emotion conversion systems. Moreover, we address feature selection, since it remains uncertain whether feature reduction methods improve speech recognition systems. We discuss feature selection as a means to augment data in a speech emotion recognition system. Experiments are performed in Python on four commonly used databases: EMODB, eNTERFACE05, SAVEE, and IEMOCAP. Data analysis is conducted for sadness, anger, happiness, and neutrality. We introduce a data augmentation network and employ two networks that select combined features from the FDR and LDA algorithms in a two-step process. By incorporating feedback from the classification network, this approach optimizes the speech recognition system in terms of both data quantity and feature dimensionality. Our findings indicate that principal component analysis is more efficient for correlated data, while the LDA algorithm performs better for low-dimensional data. Furthermore, Fisher's method is more effective than principal component analysis at reducing the feature size. Emotion classification is performed using a support vector machine, which highlights the value of using LDA and FDR together in the adversarial data system to filter features while preserving the emotional information needed for classification.
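To make the FDR-based filtering step more concrete, the following is a minimal sketch in plain Python. It assumes that FDR stands for the Fisher discriminant ratio and uses the common pairwise form (squared difference of class means divided by the sum of class variances, summed over class pairs). The abstract does not give the exact definition used in the thesis, so the formula, the function names, and the toy data are all assumptions.

```python
from itertools import combinations
from statistics import mean, pvariance


def fisher_discriminant_ratio(values, labels):
    """Pairwise FDR of one feature: sum over class pairs of
    (mean_a - mean_b)^2 / (var_a + var_b). Assumed definition."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    score = 0.0
    for a, b in combinations(sorted(groups), 2):
        num = (mean(groups[a]) - mean(groups[b])) ** 2
        den = pvariance(groups[a]) + pvariance(groups[b])
        if den > 0:
            score += num / den
        elif num > 0:
            # Zero within-class spread with distinct means: perfectly separable.
            score += float("inf")
    return score


def rank_features(X, labels):
    """Return feature indices sorted by decreasing FDR, plus the scores."""
    n_features = len(X[0])
    scores = [
        fisher_discriminant_ratio([row[j] for row in X], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy feature matrix (hypothetical): rows are utterances, columns are features.
X = [[0.0, 5.0], [1.0, 3.0], [10.0, 4.0], [11.0, 6.0]]
labels = ["neutral", "neutral", "happiness", "happiness"]

order, scores = rank_features(X, labels)
print(order, scores)
```

Here the first feature separates the two classes far better than the second, so it ranks first. In a two-step scheme like the one described, the top-ranked features would then be passed on to LDA. How the thesis combines the two steps is not specified in this record.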
Keywords: #systemincorporated
Keeping place: Central Library of Shahrood University