The experiments show that GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg in a zero-shot setting, and demonstrate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks.
Jianwei Yang, Chun-yue Li, Jianfeng Gao, Xueyan Zou, Hao Zhang, Feng Li