3260 papers • 126 benchmarks • 313 datasets
Visual Prompting is the task of adapting computer vision models with prompts, inspired by the success of text prompting in NLP. The approach uses a small number of visual prompts to quickly turn an unlabeled dataset into a deployed model, significantly reducing development time for both individual projects and enterprise solutions.
(Image credit: Papersgraph)
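As a concrete illustration of the pixel-space flavor of visual prompting, the sketch below trains only a learnable border perturbation around the input image while a pre-trained classifier stays frozen. This is a minimal sketch assuming PyTorch and a torchvision backbone; the prompt shape, learning rate, and the reuse of the source label space are illustrative choices, not taken from any particular paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Frozen pre-trained backbone; only the visual prompt is trained.
model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

# A learnable "border" prompt: a full-image perturbation masked to a 16-pixel frame.
pad = 16
prompt = nn.Parameter(torch.zeros(1, 3, 224, 224))
mask = torch.ones(1, 1, 224, 224)
mask[:, :, pad:-pad, pad:-pad] = 0            # keep only the border active

optimizer = torch.optim.Adam([prompt], lr=0.1)

def prompted_logits(images):
    # images: [B, 3, 224, 224], already normalized for the backbone
    return model(images + prompt * mask)

# One illustrative training step; random tensors stand in for a real dataloader.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))          # target labels mapped into the source label space
loss = nn.functional.cross_entropy(prompted_logits(images), labels)
loss.backward()
optimizer.step()
```

How target classes are matched to source classes is itself a design choice; ILM-VP below makes that mapping an explicit, iteratively updated step.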
The Segment Anything Model (SAM) is introduced: a new task, model, and dataset for image segmentation, and its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results.
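For reference, a minimal point-prompt query with the released SAM code might look like the sketch below. It assumes the official segment_anything package; the checkpoint filename, image, and click coordinates are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (path is a placeholder) and wrap it in a predictor.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for an RGB image (H, W, 3)
predictor.set_image(image)

# A single foreground click (label 1) serves as the visual prompt.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,                        # SAM returns several candidate masks
)
best_mask = masks[np.argmax(scores)]
```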
This paper builds on an encoder-decoder architecture and develops a versatile prompt encoder that supports a variety of prompts, such as strokes, boxes, and points, and can additionally take an arbitrary number of reference image segments as context.
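A toy version of such a prompt encoder, reduced to the core idea of mapping heterogeneous prompts (points, boxes, reference masks) into a shared token space, is sketched below. This is a conceptual PyTorch sketch, not the paper's architecture; all layer choices and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Embed clicks, boxes, and reference masks into a shared prompt-token space."""

    def __init__(self, dim=256):
        super().__init__()
        self.point_fc = nn.Linear(2, dim)        # (x, y) -> one token
        self.box_fc = nn.Linear(4, dim)          # (x1, y1, x2, y2) -> one token
        self.mask_conv = nn.Sequential(          # reference mask -> one token
            nn.Conv2d(1, dim, kernel_size=16, stride=16),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, points=None, boxes=None, ref_masks=None):
        tokens = []
        if points is not None:                   # [N, 2] normalized coordinates
            tokens.append(self.point_fc(points))
        if boxes is not None:                    # [M, 4] normalized coordinates
            tokens.append(self.box_fc(boxes))
        if ref_masks is not None:                # [K, 1, H, W] float masks
            tokens.append(self.mask_conv(ref_masks).flatten(1))
        return torch.cat(tokens, dim=0)          # variable-length sequence of prompt tokens

# e.g. PromptEncoder()(points=torch.rand(3, 2), boxes=torch.rand(1, 4)) -> a [4, 256] token sequence
```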
This work proposes a new VP method, termed Class-wise Adversarial Visual Prompting (C-AVP), to generate class-wise visual prompts so as to not only leverage the strengths of ensemble prompts but also optimize their interrelations to improve model robustness.
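The class-wise idea can be sketched as one additive prompt per class, where each class's score is read from the frozen model prompted with that class's own prompt. This is a rough sketch of the concept under the assumption of a frozen classifier f, not the authors' implementation or their robustness objective.

```python
import torch
import torch.nn as nn

num_classes, C, H, W = 10, 3, 32, 32
prompts = nn.Parameter(torch.zeros(num_classes, C, H, W))   # one learnable prompt per class

def class_wise_scores(f, x):
    # For each class c, prompt the batch with prompt_c and keep the logit of class c.
    scores = []
    for c in range(num_classes):
        logits = f(x + prompts[c])       # frozen model on the prompted input
        scores.append(logits[:, c])
    return torch.stack(scores, dim=1)    # [B, num_classes]; only the prompts receive gradients
```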
This paper takes inspiration from the pre-training and prompt-tuning protocol widely used in NLP and proposes a new visual prompting model, named Explicit Visual Prompting (EVP), which freezes a pre-trained model and then learns task-specific knowledge using a few extra parameters.
Visual prompting proves surprisingly effective, providing a new perspective on adapting pre-trained models in vision; it is particularly effective for CLIP, robust to distribution shift, and achieves performance competitive with standard linear probes.
This paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce an output image consistent with the given examples. It shows that posing this problem as simple image inpainting is surprisingly effective.
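The inpainting formulation amounts to assembling a grid-shaped 'visual prompt' image whose missing cell the model must fill in. Below is a minimal sketch of the canvas construction only (NumPy; the inpainting model itself, e.g. a masked image model, is assumed and not shown).

```python
import numpy as np

def build_prompt_canvas(example_in, example_out, query_in):
    """Arrange [example input | example output; query input | blank] into one image.

    All inputs are HxWx3 uint8 arrays of the same size; the blank bottom-right
    quadrant is the region an inpainting model is asked to fill with the query output.
    """
    h, w, _ = example_in.shape
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=np.uint8)
    canvas[:h, :w] = example_in
    canvas[:h, w:] = example_out
    canvas[h:, :w] = query_in
    # canvas[h:, w:] stays blank -> the masked region to be inpainted
    mask = np.zeros((2 * h, 2 * w), dtype=bool)
    mask[h:, w:] = True
    return canvas, mask
```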
A new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), is proposed; it automatically re-maps the source labels to the target labels and progressively improves the target-task accuracy of VP.
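The mapping step can be sketched as follows: after each training round, every target class is re-assigned to the source class that the prompted model predicts most often for it. This is a simplified, greedy sketch of the idea, assuming a frozen source classifier f and a dataloader that already applies the current visual prompt; the paper's actual mapping procedure is more refined.

```python
import torch

@torch.no_grad()
def remap_labels(f, prompted_loader, num_target_classes, num_source_classes):
    # Count how often each source class is predicted for samples of each target class.
    counts = torch.zeros(num_target_classes, num_source_classes)
    for images, targets in prompted_loader:     # images already carry the current visual prompt
        preds = f(images).argmax(dim=1)         # predicted source-class indices
        for t, s in zip(targets.tolist(), preds.tolist()):
            counts[t, s] += 1
    # Greedy mapping: each target class takes its most frequently predicted source class.
    return counts.argmax(dim=1)                 # tensor of length num_target_classes
```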
A simple and effective visual prompting method is presented for adapting pre-trained models to downstream recognition tasks; it sets a new record of 82.8% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.6%.
This paper proposes a novel text-visual prompting (TVP) framework that incorporates optimized perturbation patterns (which the authors call 'prompts') into both the visual inputs and the textual features of a TVG model, and shows that TVP enables effective co-training of the vision and language encoders in a 2D TVG model and improves cross-modal feature fusion using only low-complexity, sparse 2D visual features.
The experiments show that GPT-4V with SoM outperforms the state-of-the-art fully fine-tuned referring segmentation model on RefCOCOg in a zero-shot setting, and demonstrate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks.
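Set-of-Mark (SoM) prompting overlays numbered marks on segmented regions so that a multimodal model can refer to regions by number. Below is a minimal sketch of the overlay step only, using Pillow; the segmentation masks are assumed to come from any segmentation model (e.g. SAM above), and the mark styling is far simpler than in the paper.

```python
import numpy as np
from PIL import Image, ImageDraw

def draw_marks(image, masks):
    """Overlay a numeric mark at the centroid of each binary mask.

    image: HxWx3 uint8 array; masks: list of HxW boolean arrays.
    The returned PIL image can be sent to a multimodal model together with a
    text prompt that refers to regions by their numbers.
    """
    out = Image.fromarray(image.copy())
    draw = ImageDraw.Draw(out)
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue
        cx, cy = int(xs.mean()), int(ys.mean())
        draw.text((cx, cy), str(idx), fill=(255, 0, 0))   # default font; real SoM uses larger styled labels
    return out
```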