[1] Collaborative Transformers for Grounded Situation Recognition
[2] Group Contextualization for Video Recognition
[3] Video as Conditional Graph Hierarchy for Multi-Granular Question Answering
[4] Rethinking the Two-Stage Framework for Grounded Situation Recognition
[5] Grounded Situation Recognition with Transformers
[6] Token Shift Transformer for Video Classification
[7] Spatial-Temporal Transformer for Dynamic Scene Graph Generation
[8] From Show to Tell: A Survey on Deep Learning-Based Image Captioning
[9] Understanding and Evaluating Racial Biases in Image Captioning
[10] Dynamic Head: Unifying Object Detection Heads with Attentions
[11] Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources
[12] Towards Accurate Text-based Image Captioning with Content Diversity Exploration
[13] Visual Semantic Role Labeling for Video Understanding
[14] Robust and Accurate Object Detection via Adversarial Learning
[15] LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
[16] Training data-efficient image transformers & distillation through attention
[17] Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network
[18] DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition
[19] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[20] Deformable DETR: Deformable Transformers for End-to-End Object Detection
[21] HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation
[22] Attention-Based Context Aware Reasoning for Situation Recognition
[23] End-to-End Object Detection with Transformers
[24] Temporal Pyramid Network for Action Recognition
[25] Grounded Situation Recognition
[26] Counterfactual Samples Synthesizing for Robust Visual Question Answering
[27] PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection
[28] Meshed-Memory Transformer for Image Captioning
[29] EfficientDet: Scalable and Efficient Object Detection
[30] Mixture-Kernel Graph Attention Network for Situation Recognition
[31] Generating Long Sequences with Sparse Transformers
[32] Cross-Modal Self-Attention Network for Referring Image Segmentation
[33] Relation-Aware Graph Attention Network for Visual Question Answering
[34] MUREL: Multimodal Relational Reasoning for Visual Question Answering
[35] Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression
[36] Counterfactual Critic Multi-Agent Training for Scene Graph Generation
[37] Learning to Transfer: Generalizable Attribute Learning with Multitask Neural Model Search
[38] Graph R-CNN for Scene Graph Generation
[39] Personalized clothing recommendation combining user social circle and fashion style consistency
[40] Linguistically-Informed Self-Attention for Semantic Role Labeling
[41] GNAS: A Greedy Neural Architecture Search Method for Multi-Attribute Learning
[43] MovieGraphs: Towards Understanding Human-Centric Situations from Videos
[44] Deep Semantic Role Labeling with Self-Attention
[45] A Closer Look at Spatiotemporal Convolutions for Action Recognition
[46] Neural Motifs: Scene Graph Parsing with Global Context
[47] Situation Recognition with Graph Neural Networks
[48] Scene Graph Generation from Objects, Phrases and Region Captions
[49] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
[50] Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images
[51] Attention is All you Need
[52] On the Selection of Anchors and Targets for Video Hyperlinking
[53] Video eCommerce++: Toward Large Scale Online Video Advertising
[54] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
[55] Neural Message Passing for Quantum Chemistry
[56] Recurrent Models for Situation Recognition
[57] Scene Graph Generation by Iterative Message Passing
[58] Large-Scale Image Retrieval with Attentive Deep Local Features
[59] Single Image Action Recognition Using Semantic Body Part Actions
[60] Feature Pyramid Networks for Object Detection
[61] Commonly Uncommon: Semantic Sparsity in Situation Recognition
[62] Self-Critical Sequence Training for Image Captioning
[63] SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning
[64] Video eCommerce: Towards Online Video Advertising
[65] Context-aware Image Tweet Modelling and Recommendation
[66] Semi-Supervised Classification with Graph Convolutional Networks
[68] Situation Recognition: Visual Semantic Role Labeling for Image Understanding
[69] Image Captioning with Semantic Attention
[70] Rethinking the Inception Architecture for Computer Vision
[71] Gated Graph Sequence Neural Networks
[72] You Only Look Once: Unified, Real-Time Object Detection
[75] The Berkeley FrameNet Project
[77] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[78] VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video hyperlinking
[79] Kaiming He, Bharath Hariharan, and Serge Belongie
[81] From TreeBank to PropBank
[82] Frame semantics for text understanding
[83] Action Recognition with Improved Trajectories (International Conference on Computer Vision, 2013)
[84] attentions
Attention Refinement. MM ’22, October 10–14, 2022, Lisboa, Portugal