predict the intensity of 10 categorical emotions
3260 papers • 126 benchmarks • 313 datasets
predict the intensity of 10 categorical emotions
(Image credit: Papersgraph)
These leaderboards are used to track progress in speech-emotion-recognition
Use these libraries to find speech-emotion-recognition models and implementations
No subtasks available.
A convolutional neural network for computing a high-resolution depth map given a single RGB image with the help of transfer learning, which outperforms state-of-the-art on two datasets and also produces qualitatively better results that capture object boundaries more faithfully.
The superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, is shown, suggesting that the HRNet is a stronger backbone for computer vision problems.
This paper proposes a network that maintains high-resolution representations through the whole process of human pose estimation and empirically demonstrates the effectiveness of the network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
A simple modification is introduced to augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from thehigh-resolution convolution, which leads to stronger representations, evidenced by superior results.
It is found that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator's input.
These latent diffusion models achieve new state of the art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
A new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs) is presented, which significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.
This work addresses the large number of samples typically required and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
An image cascade network (ICNet) that incorporates multi-resolution branches under proper label guidance to address the challenging task of real-time semantic segmentation is proposed and in-depth analysis of the framework is provided.
An adaptive gradient clipping technique is developed which overcomes instabilities in batch normalization, and a significantly improved class of Normalizer-Free ResNets is designed, achieving significantly better performance when finetuning on ImageNet.
Adding a benchmark result helps the community track progress.