UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream vision-and-language (V+L) tasks with joint multimodal embeddings.
Zhe Gan, Linjie Li, Licheng Yu, Yen-Chun Chen, Jingjing Liu, Yu Cheng, Ahmed El Kholy, Faisal Ahmed