3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in Multi-Modal Classification.
Use these libraries to find Multi-Modal Classification models and implementations.
This paper identifies two main causes for the performance drop observed when jointly trained multi-modal networks underperform their best uni-modal counterparts: first, multi-modal networks are often prone to overfitting due to their increased capacity; and second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal.
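A minimal sketch of the per-modality weighting idea that follows from this observation, assuming a two-stream (video/audio) classifier; the encoders, dimensions, and loss weights below are illustrative placeholders, not the paper's actual method.

```python
# Sketch only: separate per-modality heads let each stream's loss be weighted
# independently instead of relying on a single joint objective.
import torch
import torch.nn as nn

class TwoStreamClassifier(nn.Module):
    def __init__(self, video_dim=512, audio_dim=128, num_classes=10):
        super().__init__()
        self.video_head = nn.Linear(video_dim, num_classes)              # video-only head
        self.audio_head = nn.Linear(audio_dim, num_classes)              # audio-only head
        self.joint_head = nn.Linear(video_dim + audio_dim, num_classes)  # fused head

    def forward(self, v, a):
        return (self.video_head(v),
                self.audio_head(a),
                self.joint_head(torch.cat([v, a], dim=-1)))

def blended_loss(logits_v, logits_a, logits_j, target, w_v=0.3, w_a=0.3, w_j=0.4):
    """Weight each modality's loss separately so a fast-overfitting stream can
    be down-weighted; the weights here are arbitrary placeholders."""
    ce = nn.functional.cross_entropy
    return w_v * ce(logits_v, target) + w_a * ce(logits_a, target) + w_j * ce(logits_j, target)
```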
This paper presents a novel multi-modal approach that fuses images and text descriptions to improve multi-modal classification performance in real-world scenarios, and evaluates it against two well-known multi-modal fusion strategies, namely early fusion and late fusion.
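For reference, a minimal sketch of what these two baseline strategies typically look like; the feature dimensions and the equal-weight averaging in the late-fusion head are assumptions, not the paper's exact setup.

```python
# Early fusion: combine features before classification.
# Late fusion: classify each modality separately, then combine predictions.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate image and text features, then classify the joint vector."""
    def __init__(self, img_dim=2048, txt_dim=768, num_classes=20):
        super().__init__()
        self.classifier = nn.Linear(img_dim + txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        return self.classifier(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    """Run a classifier per modality, then average the class scores."""
    def __init__(self, img_dim=2048, txt_dim=768, num_classes=20):
        super().__init__()
        self.img_clf = nn.Linear(img_dim, num_classes)
        self.txt_clf = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        # A learned or validation-tuned weighting is also common here.
        return 0.5 * self.img_clf(img_feat) + 0.5 * self.txt_clf(txt_feat)
```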
Integration of heterogeneous and high-dimensional data (e.g., multiomics) is becoming increasingly important. Existing multimodal classification algorithms mainly focus on improving performance by exploiting the complementarity of different modalities. However, conventional approaches offer little support for trustworthy multimodal fusion, especially in safety-critical applications (e.g., medical diagnosis). To address this issue, we propose a novel trustworthy multimodal classification algorithm termed Multimodal Dynamics, which dynamically evaluates both the feature-level and modality-level informativeness for different samples and thus integrates multiple modalities in a trustworthy manner. Specifically, a sparse gating mechanism is introduced to capture the information variation of each within-modality feature, and the true class probability is employed to assess the classification confidence of each modality. A transparent fusion algorithm based on this dynamic informativeness estimation strategy is then derived. To the best of our knowledge, this is the first work to jointly model both feature and modality variation across samples to provide trustworthy fusion in multi-modal classification. Extensive experiments are conducted on multimodal medical classification datasets, in which the superior performance and trustworthiness of our algorithm are clearly validated against state-of-the-art methods.
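A rough sketch of the two components described above, under simplifying assumptions: a learned sparse gate over within-modality features and a per-modality confidence score (here the maximum softmax probability stands in for the paper's true class probability) used to weight the fusion. None of the module names or shapes below come from the paper.

```python
# Per-modality branch with a feature-level gate, plus a confidence-weighted
# fusion of the branch logits.
import torch
import torch.nn as nn

class GatedModality(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(in_dim, in_dim), nn.Sigmoid())  # per-feature gate
        self.encoder = nn.Linear(in_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        g = self.gate(x)                                   # feature informativeness in [0, 1]
        logits = self.classifier(torch.relu(self.encoder(g * x)))
        conf = logits.softmax(dim=-1).max(dim=-1, keepdim=True).values  # proxy confidence
        return logits, conf, g

def confidence_weighted_fusion(branch_outputs):
    """Weight each modality's logits by its normalized confidence."""
    logits = torch.stack([o[0] for o in branch_outputs], dim=0)  # (M, B, C)
    confs = torch.stack([o[1] for o in branch_outputs], dim=0)   # (M, B, 1)
    weights = confs / confs.sum(dim=0, keepdim=True)
    return (weights * logits).sum(dim=0)                         # (B, C)
```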
A Hindi-English code-mixed dataset is developed for multi-modal sarcasm detection and humor classification in conversational dialog, and a novel attention-rich neural architecture for utterance classification is proposed.
A plug-and-play loss function method is proposed, in which the feature space for each label is adaptively learned from training-set statistics; it yields remarkable performance improvements over the baselines, demonstrating its superiority in reducing the modality bias problem.
UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound, and it exhibits a few intriguing properties that its modality-specific counterparts do not have.
The Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) is proposed by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation.
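A conceptual sketch of how a contrastive term and a masked-reconstruction term can be combined into one objective; the loss balance, tensor shapes, and function names below are assumptions, not CAV-MAE's actual implementation.

```python
# Combining an InfoNCE-style audio-visual contrastive loss with an MAE-style
# masked reconstruction loss into a single training objective.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Pull paired audio/video clips together, push mismatched pairs apart."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def masked_reconstruction_loss(pred_patches, true_patches, mask):
    """Mean squared error computed only on masked patches (mask == 1)."""
    per_patch = ((pred_patches - true_patches) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def joint_objective(audio_emb, video_emb, pred, target, mask, lam=0.01):
    # lam balances the two terms; the value here is an arbitrary placeholder.
    return masked_reconstruction_loss(pred, target, mask) + lam * contrastive_loss(audio_emb, video_emb)
```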
This work proposes a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL), which applies a single model to multiple heterogeneous fashion tasks and is therefore much more parameter-efficient.
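A hedged illustration of the parameter-efficiency idea, i.e. one shared backbone with small task-specific heads; FAME-ViL's actual adapter design is more elaborate than this sketch, and the class and argument names below are invented for illustration.

```python
# One shared encoder; only the lightweight per-task heads add parameters.
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, task_num_classes: dict):
        super().__init__()
        self.backbone = backbone                      # one shared vision-and-language encoder
        self.heads = nn.ModuleDict({                  # small task-specific heads
            task: nn.Linear(feat_dim, n) for task, n in task_num_classes.items()
        })

    def forward(self, inputs, task: str):
        features = self.backbone(inputs)              # shared representation
        return self.heads[task](features)             # task-specific prediction
```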