Grounded Situation Recognition aims to produce a structured image summary that describes the primary activity (verb), its relevant entities (nouns), and their bounding-box groundings.
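The output described above can be pictured as a small structured record. The sketch below is purely illustrative (the class name, roles, and coordinates are made up, not from any dataset schema); it only shows the verb / role-to-noun / role-to-box shape of a grounded situation frame:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# A bounding box as (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]

@dataclass
class GroundedSituation:
    """Illustrative container for one grounded situation frame."""
    verb: str                                                      # primary activity, e.g. "riding"
    roles: Dict[str, Optional[str]] = field(default_factory=dict)  # role -> noun (None if unfilled)
    boxes: Dict[str, Optional[Box]] = field(default_factory=dict)  # role -> grounding (None if not visible)

frame = GroundedSituation(
    verb="riding",
    roles={"agent": "man", "vehicle": "horse", "place": "field"},
    boxes={
        "agent": (40.0, 20.0, 180.0, 300.0),
        "vehicle": (30.0, 120.0, 260.0, 360.0),
        "place": None,  # some roles (e.g. "place") may have no box
    },
)
print(frame.verb)            # -> riding
print(frame.roles["agent"])  # -> man
```

Note that a role can carry a noun without a grounding, which is why the noun map and the box map are kept separate.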
These leaderboards are used to track progress in Grounded Situation Recognition.
Use these libraries to find Grounded Situation Recognition models and implementations.
A novel approach in which the two processes of activity classification and entity estimation are interactive and complementary, achieving state of the art on all evaluation metrics on the SWiG dataset.
This paper studies semantic sparsity in situation recognition, the task of producing structured summaries of what is happening in images, including activities, objects and the roles objects play within the activity.
A model based on Graph Neural Networks is proposed that efficiently captures joint dependencies between roles using neural networks defined on a graph, significantly outperforming existing work as well as multiple baselines.
A Joint Situation Localizer is proposed and it is found that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite with relative gains between 8% and 32%.
Situation Recognition (SR) is a fine-grained action recognition task where the model is expected not only to predict the salient action of the image, but also to predict the values of all semantic roles associated with that action. Predicting semantic roles is very challenging: a vast variety of entities can fill a given semantic role. Existing work has focused on dependency-modelling architectures to address this issue. Inspired by the success of query-based visual reasoning (e.g., Visual Question Answering), we propose to address semantic role prediction as a query-based visual reasoning problem. However, existing query-based reasoning methods do not handle inter-dependent queries, which is a unique requirement of semantic role prediction in SR. To the best of our knowledge, we therefore propose the first set of methods to address inter-dependent queries in query-based visual reasoning. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves outstanding performance on the Situation Recognition task. Furthermore, leveraging query inter-dependency, our methods improve upon a state-of-the-art method that answers queries separately. Our code: https://github.com/thilinicooray/context-aware-reasoning-for-sr
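The inter-dependent-query idea above can be sketched with plain dot-product attention: role queries first exchange information with each other (so each query is conditioned on the others) before reading from the image. This is a minimal numpy sketch under assumed shapes, not the paper's architecture; all names and dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention: each query returns a mix of the values."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values

rng = np.random.default_rng(0)
img_feats = rng.standard_normal((49, 64))    # e.g. a 7x7 grid of image features
role_queries = rng.standard_normal((3, 64))  # one query per semantic role

# Step 1: queries attend to each other -> inter-dependency between roles.
ctx_queries = attend(role_queries, role_queries, role_queries)

# Step 2: context-aware queries read from the image (cross-attention).
role_repr = attend(ctx_queries, img_feats, img_feats)
print(role_repr.shape)  # -> (3, 64)
```

Answering queries separately would skip step 1; the contrast the abstract draws is exactly between these two variants.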
The attention mechanism of the model enables accurate verb classification by effectively capturing high-level semantic features of an image, and allows the model to flexibly deal with the complicated, image-dependent relations between entities for improved noun classification and localization.
A novel two-stage framework that exploits bidirectional relations between verbs and roles is proposed, and extensive experimental results show that it outperforms other state-of-the-art methods under various metrics.
A novel SituFormer for GSR consisting of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM), a transformer-based semantic role detection model that detects all roles in parallel.
A cross-attention-based Transformer known as ClipSitu XTF outperforms the existing state of the art by a large margin of 14.1% on semantic role labelling (value) for top-1 accuracy on the imSitu dataset.
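The core mechanism behind such cross-attention role labelling can be sketched as follows: a query built from verb and role embeddings attends over image patch features, and a linear head scores the noun vocabulary. This is only a schematic sketch, assuming random stand-in features; the actual ClipSitu XTF architecture, dimensions, and CLIP backbone are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, num_nouns = 64, 10

patch_feats = rng.standard_normal((50, d))  # stand-in for image patch embeddings
verb_emb = rng.standard_normal(d)           # embedding of the predicted verb
role_emb = rng.standard_normal(d)           # embedding of one semantic role

# Verb-conditioned role query: the same role can mean different things
# under different verbs, so the query mixes both embeddings.
query = (verb_emb + role_emb)[None, :]

# Cross-attention: the role query reads from the image patches.
scores = query @ patch_feats.T / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
role_feat = weights @ patch_feats           # shape (1, d)

# A linear head scores the noun vocabulary; top-1 is the predicted noun value.
W = rng.standard_normal((d, num_nouns))     # hypothetical classifier weights
pred_noun = int(np.argmax(role_feat @ W))
print(0 <= pred_noun < num_nouns)  # -> True
```

Top-1 accuracy, the metric quoted above, simply checks whether this argmax matches the annotated noun for each role.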