The PSG task abstracts a given image into a scene graph whose nodes are grounded by panoptic segmentation masks.
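As a concrete illustration, here is a minimal, hypothetical sketch of the structure a PSG model outputs: nodes are panoptic segments (a class label plus a pixel mask) rather than bounding boxes, and relations are labelled directed edges between segments. The class and predicate names are illustrative only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Segment:
    """A panoptic segment: one 'thing' or 'stuff' region of the image."""
    category: str      # e.g. "person", "grass" (illustrative labels)
    mask: np.ndarray   # boolean HxW mask of pixels belonging to this segment

@dataclass
class Relation:
    """A directed edge between two segments, labelled with a predicate."""
    subject: int       # index into the segment list
    predicate: str     # e.g. "standing-on" (illustrative)
    object: int        # index into the segment list

# A panoptic scene graph is just segments plus relations over them.
H, W = 4, 4
segments = [
    Segment("person", np.zeros((H, W), dtype=bool)),
    Segment("grass",  np.ones((H, W), dtype=bool)),
]
relations = [Relation(subject=0, predicate="standing-on", object=1)]
```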
These leaderboards are used to track progress in Panoptic Scene Graph Generation.
Use these libraries to find Panoptic Scene Graph Generation models and implementations.
No subtasks available.
This work analyzes the role of motifs, regularly appearing substructures in scene graphs, and introduces Stacked Motif Networks, a new architecture designed to capture higher-order motifs in scene graphs; it improves on the previous state of the art by an average relative improvement of 3.6% across evaluation settings.
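The stacked-context idea lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration: detected objects are treated as a sequence and passed through stacked bidirectional LSTMs, so each object's representation absorbs the global context that recurring motifs provide before predicates are scored. All dimensions and the pairing logic are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MotifContext(nn.Module):
    """Sketch of stacked context encoding over detected objects."""

    def __init__(self, obj_dim=512, hidden=256, num_predicates=50):
        super().__init__()
        self.obj_ctx = nn.LSTM(obj_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.edge_ctx = nn.LSTM(2 * hidden, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.rel_head = nn.Linear(4 * hidden, num_predicates)

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, obj_dim) detector features
        obj_reps, _ = self.obj_ctx(obj_feats)    # contextualised objects
        edge_reps, _ = self.edge_ctx(obj_reps)   # second stage for edges
        # score one subject-object pair (indices 0 and 1) as an example
        pair = torch.cat([edge_reps[:, 0], edge_reps[:, 1]], dim=-1)
        return self.rel_head(pair)               # predicate logits

logits = MotifContext()(torch.randn(2, 5, 512))  # -> shape (2, 50)
```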
A hybrid learning procedure is developed that integrates end-task supervised learning with tree-structure reinforcement learning, where the former's evaluation result serves as a self-critic for the latter's structure exploration.
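The self-critic coupling can be made concrete in a few lines. The sketch below, a simplification over assumed tensor inputs rather than the paper's exact API, shows the standard self-critical REINFORCE form: the end-task score of the greedily decoded structure acts as the baseline, so a sampled structure is only rewarded when it beats the model's own greedy choice.

```python
import torch

def self_critic_loss(log_prob_sampled, quality_sampled, quality_greedy):
    """REINFORCE with a self-critical baseline.

    log_prob_sampled: log-probability of the sampled structure
    quality_sampled:  end-task evaluation score of the sampled structure
    quality_greedy:   end-task score of the greedy structure (the baseline)
    """
    advantage = quality_sampled - quality_greedy          # self-critical reward
    return -(advantage.detach() * log_prob_sampled).mean()  # policy-gradient loss
```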
This work explicitly models objects and their relationships using scene graphs, a visually grounded graphical structure of an image, and proposes a novel end-to-end model that generates such a structured scene representation from an input image.
A high-quality PVSG dataset is contributed, consisting of 400 videos (289 third-person and 111 egocentric) with a total of 150K frames labeled with panoptic segmentation masks as well as fine-grained temporal scene graphs.
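For readers unfamiliar with this style of annotation, a hypothetical record showing what per-video labels of this form might look like is given below; the field names and layout are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical shape of one PVSG-style annotation record.
video_annotation = {
    "video_id": "0001",
    "view": "third-person",  # or "egocentric"
    "frames": [
        # per-frame panoptic segmentation: a map of pixel -> segment id
        {"frame_id": 0, "panoptic_mask": "0001/000000.png"},
    ],
    "relations": [
        # temporal scene graph edge: (subject segment, predicate,
        # object segment) holding over a frame span
        {"subject": 2, "predicate": "holding", "object": 5, "span": [0, 120]},
    ],
}
```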
The proposed HiLo framework lets different network branches specialize in low- and high-frequency relations, enforces their consistency, and fuses the results; it is the first to propose an explicitly unbiased Panoptic Scene Graph generation method.
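A minimal sketch of the two-branch idea, under simplifying assumptions, follows: one head serves high-frequency (head) predicates, the other low-frequency (tail) predicates, a KL term keeps their distributions consistent, and inference averages the two. This illustrates the principle, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchRelHead(nn.Module):
    """Sketch: specialized high/low-frequency branches with consistency."""

    def __init__(self, feat_dim=256, num_predicates=56):
        super().__init__()
        self.high_freq = nn.Linear(feat_dim, num_predicates)
        self.low_freq = nn.Linear(feat_dim, num_predicates)

    def forward(self, pair_feats):
        hi, lo = self.high_freq(pair_feats), self.low_freq(pair_feats)
        # consistency: discourage the two branches from diverging arbitrarily
        consistency = F.kl_div(F.log_softmax(hi, -1),
                               F.softmax(lo, -1), reduction="batchmean")
        fused = (hi.softmax(-1) + lo.softmax(-1)) / 2  # simple fusion
        return fused, consistency
```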
Panoptic scene graph generation (PSG) is introduced, a new task that requires the model to generate a more comprehensive scene graph representation based on panoptic segmentations rather than rigid bounding boxes.
A novel framework is presented: Pair then Relation (Pair-Net), which uses a Pair Proposal Network (PPN) to learn and filter sparse pairwise relationships between subjects and objects, achieving over 10% absolute gains compared to the PSGFormer baseline.
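The pair-then-relation decomposition is easy to sketch. The hypothetical module below scores every subject-object pair from projected features, keeps the top-k sparse pairs, and returns their indices for a downstream relation classifier; layer sizes and k are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class PairProposalNetwork(nn.Module):
    """Sketch: score all subject-object pairs, keep only the sparse top-k."""

    def __init__(self, feat_dim=256, k=100):
        super().__init__()
        self.subj_proj = nn.Linear(feat_dim, feat_dim)
        self.obj_proj = nn.Linear(feat_dim, feat_dim)
        self.k = k

    def forward(self, obj_feats):
        # obj_feats: (num_objects, feat_dim) per-segment features
        n = obj_feats.size(0)
        scores = self.subj_proj(obj_feats) @ self.obj_proj(obj_feats).T
        scores.fill_diagonal_(float("-inf"))     # no self-relations
        k = min(self.k, n * n - n)               # at most all off-diagonal pairs
        top = scores.flatten().topk(k).indices
        # decode flat indices back to (subject, object) index pairs
        return torch.stack([top // n, top % n], dim=1)  # (k, 2)
```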
The Vision-Language Prompting (VLPrompt) model is proposed, which acquires vision information from images and language information from LLMs and, through an attention-based prompter network, achieves precise relation prediction.
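As a rough illustration of the prompter idea, the sketch below fuses visual pair features with language features via standard cross-attention before predicting predicates; names, dimensions, and the single-layer design are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PrompterFusion(nn.Module):
    """Sketch: visual pair features attend over language features."""

    def __init__(self, dim=256, num_predicates=56):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                batch_first=True)
        self.classifier = nn.Linear(dim, num_predicates)

    def forward(self, vision_feats, language_feats):
        # vision_feats: (batch, num_pairs, dim) visual pair features
        # language_feats: (batch, num_tokens, dim), e.g. encoded LLM text
        fused, _ = self.cross_attn(query=vision_feats,
                                   key=language_feats,
                                   value=language_feats)
        return self.classifier(fused)  # predicate logits per pair
```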
A novel framework named ADTrans is proposed to adaptively transfer biased predicate annotations into informative and unified ones, to ensure consistency and accuracy during the transfer process, and to learn unbiased prototypes of predicates with different intensities.
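One ingredient of prototype-based annotation transfer can be sketched briefly: represent each predicate by the mean embedding of its samples and treat samples that sit closer to another predicate's prototype as candidates for relabeling. Everything below, including the nearest-prototype rule, is an illustrative assumption, not the paper's procedure.

```python
import torch

def predicate_prototypes(embeddings, labels, num_predicates):
    """Sketch: nearest-prototype candidates for annotation transfer.

    embeddings: (num_samples, dim) relation embeddings
    labels:     (num_samples,) current predicate annotations
    Assumes every predicate has at least one labeled sample.
    """
    protos = torch.stack([embeddings[labels == p].mean(0)
                          for p in range(num_predicates)])
    dists = torch.cdist(embeddings, protos)   # (num_samples, num_predicates)
    return dists.argmin(dim=1)                # candidate transferred labels
```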