Large-scale pre-trained foundation models have become an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from low computational efficiency and from the linguistic signal being overwhelmed by long visual sequences during cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability on vision-language and video-language tasks. The code and pre-trained models are available at https://github.com/alibaba/AliceMind.
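To make the skip-connection idea concrete, below is a minimal PyTorch sketch of a fusion block in which the visual sequence bypasses a stack of cheap text-side co-attention layers and is re-injected only in a periodic joint ("connected") attention layer. All module names, layer counts and hyper-parameters here are illustrative assumptions; they are not taken from the paper or the AliceMind repository.

```python
# Illustrative sketch only; not the authors' implementation.
import torch
import torch.nn as nn


class AsymmetricCoAttnLayer(nn.Module):
    """Text tokens attend to themselves and to the (unchanged) visual tokens;
    the visual sequence skips this layer entirely."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.norm2(text), vision, vision)[0]
        return text + self.ffn(self.norm3(text))


class ConnectedAttnLayer(nn.Module):
    """Re-injects the skipped visual tokens: joint self-attention over the
    concatenation of the original vision features and the fused text features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor):
        joint = torch.cat([vision, text], dim=1)  # the skipped visual tokens rejoin here
        j = self.norm1(joint)
        joint = joint + self.attn(j, j, j)[0]
        joint = joint + self.ffn(self.norm2(joint))
        n_v = vision.size(1)
        return joint[:, :n_v], joint[:, n_v:]  # updated vision, updated text


class SkipConnectedFusionBlock(nn.Module):
    """s cheap text-side layers (vision skipped) followed by one joint layer."""

    def __init__(self, dim: int, s: int = 2):
        super().__init__()
        self.asym = nn.ModuleList(AsymmetricCoAttnLayer(dim) for _ in range(s))
        self.connected = ConnectedAttnLayer(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor):
        for layer in self.asym:
            text = layer(text, vision)  # vision passes through untouched (the skip)
        return self.connected(text, vision)


# Example: fuse 4 text tokens with 9 image patches of width 512.
block = SkipConnectedFusionBlock(dim=512, s=2)
vision_out, text_out = block(torch.randn(1, 4, 512), torch.randn(1, 9, 512))
```

The intuition captured here is that the expensive joint attention over the long visual sequence runs only once every few layers, while the intermediate layers update the (short) text sequence against fixed visual keys, which is one plausible reading of how skip-connections can give both efficiency and a stronger linguistic signal.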
Authors: Ji Zhang, Ming Yan, Haiyang Xu, Jiabo Ye, Chenliang Li, Bin Bi, Songfang Huang, Feiran Huang, Junfeng Tian, Wei Wang, Hehong Chen, Zheng-da Cao