3260 papers • 126 benchmarks • 313 datasets
Multimodal deep learning combines information from multiple modalities, such as text, images, audio, and video, to make more accurate and comprehensive predictions. Deep neural networks are trained on data that spans several of these modalities and learn to predict from the combined signal. A key challenge is how to combine the modalities effectively: common techniques include fusing the features extracted from each modality and using attention mechanisms to weight each modality's contribution according to its importance for the task at hand (see the sketch below). Applications include image captioning, speech recognition, natural language processing, and autonomous driving. By drawing on several sources of information at once, multimodal models can be more accurate and more robust in real-world settings where multiple types of data are present.
(Image credit: Papersgraph)
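To make the fusion idea concrete, here is a minimal, hypothetical PyTorch sketch of attention-weighted fusion: each modality's feature vector is projected to a shared space, scored, and combined by a softmax-weighted sum before classification. The class name `AttentionFusion`, the dimensions, and the modality names are illustrative and not taken from any particular paper.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical sketch: attention-weighted fusion of modality features."""
    def __init__(self, dims: dict, fused_dim: int = 256, num_classes: int = 10):
        super().__init__()
        # project each modality (e.g. "text": 768, "image": 2048) to a shared space
        self.proj = nn.ModuleDict({m: nn.Linear(d, fused_dim) for m, d in dims.items()})
        self.score = nn.Linear(fused_dim, 1)          # one relevance score per modality
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, feats: dict) -> torch.Tensor:
        # feats maps modality name -> (batch, dim) feature tensor
        z = torch.stack([self.proj[m](x) for m, x in feats.items()], dim=1)  # (B, M, D)
        w = torch.softmax(self.score(z), dim=1)       # attention weights over modalities
        fused = (w * z).sum(dim=1)                    # weighted sum of modality features
        return self.head(fused)

# usage sketch:
# model = AttentionFusion({"text": 768, "image": 2048}, num_classes=5)
# logits = model({"text": text_feats, "image": image_feats})
```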
These leaderboards are used to track progress in Multimodal Deep Learning
Use these libraries to find Multimodal Deep Learning models and implementations
A zero-initialized attention mechanism with zero gating is proposed, which adaptively injects new instructional cues into LLaMA while effectively preserving its pre-trained knowledge on traditional vision and language tasks, demonstrating the superior generalization capacity of the approach.
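As a rough illustration of the zero-gated injection idea (not the paper's implementation), the sketch below adds learnable adapter cues to frozen hidden states through a tanh-activated gate that is initialized to zero, so the base model's behavior is unchanged at the start of training. The names `ZeroGatedAdapter` and `prompt_len` are hypothetical.

```python
import torch
import torch.nn as nn

class ZeroGatedAdapter(nn.Module):
    """Hypothetical sketch: zero-initialized, gated injection of adapter cues.

    The gate starts at zero, so the frozen layer's output passes through
    unchanged at initialization; training gradually opens the gate.
    """
    def __init__(self, dim: int = 512, prompt_len: int = 10, num_heads: int = 8):
        super().__init__()
        # dim must be divisible by num_heads
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))      # zero-initialized gating scalar

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim) activations from a frozen transformer layer
        prompt = self.prompt.unsqueeze(0).expand(hidden.size(0), -1, -1)
        cues, _ = self.attn(hidden, prompt, prompt)   # hidden attends to adapter prompts
        return hidden + torch.tanh(self.gate) * cues  # gated residual injection
```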
A novel framework for evaluating multimodal deep learning models with respect to their language understanding and generalization abilities is introduced, in which artificial data is automatically generated according to the experimenter's specifications.
Fine-grained image classification is a challenging task due to the hierarchical coarse-to-fine-grained distribution present in the data. Generally, object parts are used to discriminate between classes in fine-grained datasets; however, not all parts are beneficial or indispensable. In recent years, natural language descriptions have been used to obtain information about the discriminative parts of an object. This paper leverages natural language descriptions and proposes a strategy for learning a joint representation of descriptions and images using a two-branch network with multiple layers to improve fine-grained classification. Extensive experiments show that our approach yields significant accuracy improvements on the fine-grained image classification task. Furthermore, our method achieves new state-of-the-art results on the CUB-200-2011 dataset.
A multimodal neural network is designed that learns both from word embeddings, computed on text extracted by OCR, and from the image, boosting pure-image accuracy by 3% on Tobacco3482 and on RVL-CDIP augmented with the new QS-OCR text dataset, even without clean text information.
This paper leverages recent progress on Convolutional Neural Networks (CNNs) and proposes a novel RGB-D architecture for object recognition that is composed of two separate CNN processing streams, one for each modality, which are consecutively combined with a late fusion network.
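A minimal sketch of such a two-stream, late-fusion design is shown below. It substitutes ResNet-18 backbones for the networks used in the paper and is only meant to illustrate the structure: separate per-modality streams followed by a fusion classifier over the concatenated features.

```python
import torch
import torch.nn as nn
from torchvision import models

class RGBDLateFusion(nn.Module):
    """Hypothetical sketch: two CNN streams (RGB, depth) joined by a late-fusion MLP."""
    def __init__(self, num_classes: int = 51):
        super().__init__()
        self.rgb_stream = models.resnet18(weights=None)    # stand-in backbone
        self.depth_stream = models.resnet18(weights=None)
        feat_dim = self.rgb_stream.fc.in_features
        self.rgb_stream.fc = nn.Identity()                 # keep features, drop classifier
        self.depth_stream.fc = nn.Identity()
        self.fusion = nn.Sequential(                       # late fusion of both streams
            nn.Linear(2 * feat_dim, 512), nn.ReLU(), nn.Linear(512, num_classes)
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # depth maps are usually colorized/replicated to 3 channels beforehand
        feats = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.fusion(feats)
```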
This work introduces a multimodal dataset for robust autonomous driving with long-range perception, and trains unimodal and multimodal baseline models for 3D object detection.
A novel model architecture is suggested that combines three feature sets capturing visual content and motion to predict importance scores; it improves the state-of-the-art results on SumMe while being on par with the state of the art on the TVSum dataset.
A novel, efficient, modular, and scalable framework for content-based visual media retrieval is proposed that leverages deep learning, works flexibly on both images and videos conjointly, and includes an efficient comparison and filtering metric for retrieval.
A multimodal deep learning system, dubbed HyMNet, is introduced that combines fundus images with cardiometabolic risk factors, specifically age and gender, to improve hypertension detection, concluding that diabetes acts as a confounding variable when distinguishing hypertensive cases.
The Bi-Bimodal Fusion Network (BBFN) is proposed, a novel end-to-end network that performs fusion (relevance increment) and separation (difference increment) on pairwise modality representations and significantly outperforms the SOTA.