This is a leaderboard for multimodal emotion recognition on the IEMOCAP dataset. The modality abbreviations are A: Acoustic, T: Text, V: Visual; please include the modality in brackets after the model name. All models must use the standard five emotion categories and are evaluated with the standard leave-one-session-out (LOSO) protocol. See the papers for references.
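As a rough illustration of that protocol: IEMOCAP contains five recorded sessions, and leave-one-session-out evaluation trains on four sessions, tests on the held-out one, and averages over the five folds. The sketch below assumes hypothetical `load_session` and `train_and_evaluate` helpers and is not tied to any particular model.

```python
# Minimal LOSO sketch over IEMOCAP's 5 sessions.
SESSIONS = [1, 2, 3, 4, 5]

def loso_evaluation(load_session, train_and_evaluate):
    """load_session(i) -> iterable of utterances; train_and_evaluate -> fold accuracy."""
    fold_accuracies = []
    for test_session in SESSIONS:
        # Train on the other four sessions, test on the held-out one.
        train_data = [u for s in SESSIONS if s != test_session for u in load_session(s)]
        test_data = list(load_session(test_session))
        fold_accuracies.append(train_and_evaluate(train_data, test_data))
    # Report the mean accuracy across the five folds.
    return sum(fold_accuracies) / len(fold_accuracies)
```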
These leaderboards are used to track progress in multimodal emotion recognition.
The proposed model outperforms previous state-of-the-art methods at assigning samples to one of four emotion categories on the IEMOCAP dataset, with accuracies ranging from 68.8% to 71.8%.
It is shown that lighter machine-learning models trained on a few hand-crafted features can achieve performance comparable to the current deep-learning-based state-of-the-art method for emotion recognition.
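As a loose sketch of what such a lighter model could look like (the specific features and classifier below are assumptions, not the paper's exact recipe), a linear SVM over standardized utterance-level hand-crafted features is enough to convey the idea:

```python
# Hypothetical lightweight baseline: linear SVM over hand-crafted features
# (e.g. pitch, energy, MFCC statistics) already extracted into a matrix X_train.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_light_classifier(X_train, y_train):
    # Standardize the features, then fit a linear-kernel SVM.
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    model.fit(X_train, y_train)
    return model
```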
Surprisingly, DFF-ATMF also achieves new state-of-the-art results on the IEMOCAP dataset, indicating that the proposed fusion strategy generalizes well to multimodal emotion recognition.
This work proposes an emotion recognition system based on the auditory and visual modalities: a convolutional neural network extracts features from the speech, while a 50-layer deep residual network is used for the visual modality.
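A minimal sketch of that audio-visual setup, assuming spectrogram input for the audio branch, face frames for the visual branch, and fusion by concatenation (layer sizes are assumptions):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class AudioVisualNet(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # Audio branch: small 2D CNN over (1, freq, time) speech spectrograms.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Visual branch: 50-layer residual network with its final FC removed.
        self.visual_cnn = resnet50()
        self.visual_cnn.fc = nn.Identity()
        self.classifier = nn.Linear(64 + 2048, num_classes)

    def forward(self, spectrogram, frame):
        a = self.audio_cnn(spectrogram)   # (B, 64)
        v = self.visual_cnn(frame)        # (B, 2048)
        return self.classifier(torch.cat([a, v], dim=1))
```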
An LSTM-based model is proposed that enables utterances to capture contextual information from surrounding utterances in the same video, aiding classification and yielding a 5-10% performance improvement over the state of the art together with strong generalizability.
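A minimal sketch of that idea: utterance-level feature vectors from one video pass through a bidirectional LSTM so each utterance is classified with its surrounding context (dimensions are assumptions):

```python
import torch.nn as nn

class ContextualLSTM(nn.Module):
    def __init__(self, feat_dim=100, hidden=64, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, utterances):           # (B, num_utterances, feat_dim)
        context, _ = self.lstm(utterances)   # (B, num_utterances, 2 * hidden)
        return self.classifier(context)      # one emotion logit vector per utterance
```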
This approach is the first to use the multiple data modalities offered by IEMOCAP for more robust and accurate emotion detection, and it is hoped that it will help improve the quality of emotion detection systems in the future.
A new method based on recurrent neural networks keeps track of the individual party states throughout the conversation and uses this information for emotion classification, outperforming the state of the art by a significant margin on two different datasets.
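A simplified sketch of the party-state idea: one GRU cell updates a per-speaker state with each new utterance, and the emotion is predicted from the speaking party's updated state. This is a loose illustration of the mechanism, not the paper's full model:

```python
import torch
import torch.nn as nn

class PartyStateTracker(nn.Module):
    def __init__(self, feat_dim=100, hidden=64, num_classes=4):
        super().__init__()
        self.hidden = hidden
        self.cell = nn.GRUCell(feat_dim, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, utterances, speakers, num_speakers=2):
        # utterances: (T, feat_dim); speakers: list of speaker indices, one per utterance.
        states = [torch.zeros(1, self.hidden) for _ in range(num_speakers)]
        logits = []
        for t, spk in enumerate(speakers):
            # Update only the current speaker's state, then classify from it.
            states[spk] = self.cell(utterances[t:t + 1], states[spk])
            logits.append(self.classifier(states[spk]))
        return torch.cat(logits, dim=0)       # (T, num_classes)
```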
This work explores different neural networks to improve the accuracy of emotion recognition and finds that a (CNN+RNN) + 3D-CNN multi-model architecture, which processes audio spectrograms and the corresponding video frames, gives an emotion prediction accuracy of 54.0% among 4 emotions and 71.75% among 3 emotions on the IEMOCAP dataset.
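A minimal sketch of that combination, assuming a 2D CNN + GRU over the audio spectrogram and a small 3D CNN over the stacked video frames, fused by concatenation (all sizes are assumptions):

```python
import torch
import torch.nn as nn

class CnnRnn3DCnn(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # Audio: 2D CNN over the spectrogram, then a GRU over the time axis.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.audio_rnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
        # Video: 3D CNN over stacked RGB frames.
        self.video_cnn = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(64 + 16, num_classes)

    def forward(self, spectrogram, frames):
        # spectrogram: (B, 1, freq, time); frames: (B, 3, T, H, W)
        a = self.audio_cnn(spectrogram)        # (B, 32, freq', time)
        a = a.mean(dim=2).transpose(1, 2)      # (B, time, 32) after pooling over frequency
        _, h = self.audio_rnn(a)               # h: (1, B, 64)
        v = self.video_cnn(frames)             # (B, 16)
        return self.classifier(torch.cat([h[-1], v], dim=1))
```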
A novel feature fusion strategy proceeds in a hierarchical fashion, first fusing the modalities two by two and only then fusing all three modalities; it outperforms conventional feature concatenation by 1%, which amounts to a 5% reduction in error rate.
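A minimal sketch of that hierarchical scheme, assuming one acoustic, one textual, and one visual feature vector per utterance (dimensions and the fusion layers are assumptions): the three bimodal pairs are fused first and only then combined into a trimodal representation.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=64, num_classes=4):
        super().__init__()
        # One small fusion layer per modality pair: (A, T), (A, V), (T, V).
        self.pair_fusion = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU()) for _ in range(3)]
        )
        # Final fusion of the three bimodal representations.
        self.trimodal = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, a, t, v):                 # each: (B, dim)
        at = self.pair_fusion[0](torch.cat([a, t], dim=1))
        av = self.pair_fusion[1](torch.cat([a, v], dim=1))
        tv = self.pair_fusion[2](torch.cat([t, v], dim=1))
        fused = self.trimodal(torch.cat([at, av, tv], dim=1))
        return self.classifier(fused)
```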
This paper presents the effort for the audio-video based sub-challenge of the Emotion Recognition in the Wild (EmotiW) 2018 challenge, which requires participants to assign each video clip a single emotion label from the six universal emotions.