3260 papers • 126 benchmarks • 313 datasets
Multimodal generation refers to the process of generating outputs that incorporate multiple modalities, such as images, text, and sound. This can be done using deep learning models that are trained on data that includes multiple modalities, allowing the models to generate output that is informed by more than one type of data. For example, a multimodal generation model could be trained to generate captions for images that incorporate both text and visual information. The model could learn to identify objects in the image and generate descriptions of them in natural language, while also taking into account contextual information and the relationships between the objects in the image. Multimodal generation can also be used in other applications, such as generating realistic images from textual descriptions or generating audio descriptions of video content. By combining multiple modalities in this way, multimodal generation models can produce more accurate and comprehensive output, making them useful for a wide range of applications.
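To make the encoder-decoder pattern described above concrete, the sketch below pairs a stand-in vision encoder with an autoregressive text decoder for image captioning. All names, dimensions, and module choices (e.g. TinyCaptioner, patch_proj) are illustrative assumptions, not a reference to any particular model from the list below.

```python
# Minimal sketch of an image-conditioned caption generator: a stand-in vision
# encoder produces patch features that condition an autoregressive text decoder.
# Dimensions and module choices are illustrative; positional encodings omitted.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, patch_dim=768):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)      # project image patch features
        self.token_emb = nn.Embedding(vocab_size, d_model)   # embed caption tokens
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, caption_tokens):
        # patch_feats: (B, N, patch_dim) pre-extracted image features
        # caption_tokens: (B, T) ids of the caption so far (teacher forcing)
        memory = self.patch_proj(patch_feats)                 # visual context
        tgt = self.token_emb(caption_tokens)                  # text so far
        T = caption_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)                           # next-token logits

logits = TinyCaptioner()(torch.randn(2, 49, 768), torch.randint(0, 10000, (2, 12)))
```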
These leaderboards are used to track progress in Multimodal Generation
Use these libraries to find Multimodal Generation models and implementations
No subtasks available.
Finite scalar quantization (FSQ) is proposed, where each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets.
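Because the quantization rule is stated explicitly, a short sketch can make it concrete: bound each latent dimension and round it to one of a few fixed values, so the implicit codebook is the Cartesian product of the per-dimension level sets. The snippet below is a simplified illustration (odd level counts only, with a standard straight-through gradient trick); it is not the authors' reference implementation.

```python
# Simplified FSQ sketch: each latent dimension is squashed into a bounded range
# and rounded to one of a few fixed integer levels; the implicit codebook is the
# Cartesian product of the per-dimension level sets (here 5*5*5*5 = 625 codes).
# Odd level counts are assumed for simplicity; even counts need a half-step
# offset, which this sketch omits.
import torch

def fsq(z: torch.Tensor, levels=(5, 5, 5, 5)) -> torch.Tensor:
    # z: (..., len(levels)) unbounded latents from the encoder
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2                      # 5 levels -> values {-2, -1, 0, 1, 2}
    bounded = torch.tanh(z) * half          # squash each dimension into (-half, half)
    quantized = torch.round(bounded)        # snap to the nearest allowed level
    # Straight-through estimator: quantized values in the forward pass,
    # gradients flow through `bounded` in the backward pass.
    return bounded + (quantized - bounded).detach()

codes = fsq(torch.randn(4, 4, requires_grad=True))
```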
A map that takes a content code, derived from a face image, and a randomly chosen style code to an anime image is shown, which guarantees the map is diverse -- a very wide range of anime can be produced from a single content code.
This paper comprehensively reviews existing efforts that integrate the RAG technique into AIGC scenarios, introduces the benchmarks for RAG, discusses the limitations of current RAG systems, and suggests potential directions for future research.
This work introduces a novel continual architecture search (CAS) approach that continually evolves the model parameters during the sequential training of several tasks without losing performance on previously learned tasks, thus enabling lifelong learning.
This paper introduces two novel multimodal datasets, the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K), and incorporates specific rules as supervisory signals within the datasets to facilitate the accountability of multimodal systems in rejecting human requests.
This work proposes a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities, as well as presenting the first benchmark containing 5,652 tasks and 79,089 multimedia steps.
Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, is designed, with which it is discovered that a joint image-text representation space is effective for semantically consistent image-text pair generation.
An activity domain generation framework is proposed which creates novel ADL appearances (novel domains) from different existing activity modalities (source domains) inferred from video training data, resulting in models far less susceptible to changes in data distributions.
This work unifies the discrete diffusion process for multimodal signals by proposing a unified transition matrix, and designs a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation.
Generating photos satisfying multiple constraints finds broad utility in the content creation industry. A key hurdle to accomplishing this task is the need for paired data consisting of all modalities (i.e., constraints) and their corresponding output. Moreover, existing methods need retraining using paired data across all modalities to introduce a new condition. This paper proposes a solution to this problem based on denoising diffusion probabilistic models (DDPMs). Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models. Since each sampling step in the DDPM follows a Gaussian distribution, we show that there exists a closed-form solution for generating an image given various constraints. Our method can unite multiple diffusion models trained on multiple sub-tasks and conquer the combined task through our proposed sampling strategy. We also introduce a novel reliability parameter that allows using different off-the-shelf diffusion models trained across various datasets during sampling time alone to guide it to the desired outcome satisfying multiple constraints. We perform experiments on various standard multimodal tasks to demonstrate the effectiveness of our approach. More details can be found at: https://nithin-gk.github.io/projectpages/Multidiff
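One way to read the sampling strategy described in this abstract is: at every reverse step, fuse the noise predictions of several constraint-specific diffusion models using per-model reliability weights, then apply a standard DDPM update. The code below is only that loose reading under assumed interfaces (each model(x, t, cond) returns predicted noise; alpha/beta schedules are precomputed); it is not the paper's closed-form derivation.

```python
# Illustrative sketch of sampling-time fusion of several constraint-specific
# diffusion models via reliability weights. The model interface
# (model(x, t, cond) -> predicted noise) and the simple weighted average are
# assumptions for illustration, not the paper's exact closed-form solution.
import torch

@torch.no_grad()
def multi_constraint_sample(models, conditions, reliabilities,
                            betas, alphas, alpha_bars, shape, device="cpu"):
    x = torch.randn(shape, device=device)                 # start from pure noise
    w = torch.tensor(reliabilities, device=device)
    w = w / w.sum()                                        # normalised reliability weights
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        # Fuse the per-constraint noise estimates with reliability weights.
        eps = sum(wi * m(x, t_batch, c) for wi, m, c in zip(w, models, conditions))
        # Standard DDPM posterior mean computed from the fused noise estimate.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x
```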