Speech emotion recognition using deep feedforward neural network

Bibliographic Details
Main Authors: Alghifari, Muhammad Fahreza, Gunawan, Teddy Surya, Kartiwi, Mira
Format: Article
Language: English
Published: IAES 2018
Subjects:
Online Access: http://irep.iium.edu.my/62495/
http://irep.iium.edu.my/62495/7/62495%20Speech%20emotion%20recognition%20SCOPUS.pdf
http://irep.iium.edu.my/62495/13/62495_Speech%20emotion%20recognition%20using%20deep%20feedforward%20neural%20network_article.pdf
Description
Summary: Speech emotion recognition (SER) is currently a research hotspot because it is both challenging and promising. The objective of this research is to use deep neural networks (DNNs) to recognize emotion in human speech. First, the chosen speech features, Mel-frequency cepstral coefficients (MFCCs), were extracted from raw audio data. Second, the extracted features were fed into the DNN to train the network. The trained network was then tested on a set of labelled emotional speech audio, and the recognition rate was evaluated. Based on the accuracy rate, the number of MFCCs, neurons, and layers was adjusted for optimization. In addition, a custom-made database is introduced and validated using the optimized network. The optimum configuration for SER is 13 MFCCs, 12 neurons, and 2 layers for 3 emotions, and 25 MFCCs, 21 neurons, and 4 layers for 4 emotions, achieving a total recognition rate of 96.3% for 3 emotions and 97.1% for 4 emotions.
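
To make the reported configurations concrete, the following is a minimal sketch of a feedforward network whose input, hidden, and output sizes match the numbers in the summary. It is not the authors' implementation. Several things are assumptions made for illustration only: that the "neurons" figure is a per-layer count, that "layers" means hidden layers, the tanh and softmax activations, the random untrained weights, and all function names. MFCC extraction and training are not shown, and the input is a placeholder vector rather than real audio features.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_network(n_mfcc, n_neurons, n_hidden_layers, n_emotions, seed=0):
    """Build an untrained network: n_mfcc inputs, equal-width hidden layers,
    and one output unit per emotion class (layout is an assumption)."""
    rng = random.Random(seed)
    sizes = [n_mfcc] + [n_neurons] * n_hidden_layers + [n_emotions]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def dense(x, layer):
    """Affine transform W x + b for a single input vector."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(network, mfcc_vector):
    """Forward pass: tanh on hidden layers (assumed), softmax on the output."""
    h = list(mfcc_vector)
    for layer in network[:-1]:
        h = [math.tanh(v) for v in dense(h, layer)]
    return softmax(dense(h, network[-1]))


# Sizes reported in the summary for the 3-emotion case:
# 13 MFCCs, 12 neurons, 2 layers.
net3 = build_network(n_mfcc=13, n_neurons=12, n_hidden_layers=2, n_emotions=3)

# Placeholder input, NOT real MFCCs extracted from audio.
dummy_mfcc = [0.1 * i for i in range(13)]

probs3 = forward(net3, dummy_mfcc)
predicted_class = max(range(len(probs3)), key=probs3.__getitem__)
```

Under the same assumptions, the 4-emotion case would be built with build_network(n_mfcc=25, n_neurons=21, n_hidden_layers=4, n_emotions=4). Because the weights are random, the class probabilities carry no meaning until the network is trained on labelled speech.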