[1] Collaborative Transformers for Grounded Situation Recognition
[2] Group Contextualization for Video Recognition
[3] Video as Conditional Graph Hierarchy for Multi-Granular Question Answering
[4] Rethinking the Two-Stage Framework for Grounded Situation Recognition
[5] Grounded Situation Recognition with Transformers
[6] Token Shift Transformer for Video Classification
[7] Spatial-Temporal Transformer for Dynamic Scene Graph Generation
[8] From Show to Tell: A Survey on Deep Learning-Based Image Captioning
[9] Understanding and Evaluating Racial Biases in Image Captioning
[10] Dynamic Head: Unifying Object Detection Heads with Attentions
[11] Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources
[12] Towards Accurate Text-based Image Captioning with Content Diversity Exploration
[13] Visual Semantic Role Labeling for Video Understanding
[14] Robust and Accurate Object Detection via Adversarial Learning
[15] LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
[16] Training data-efficient image transformers & distillation through attention
[17] Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network
[18] DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition
[19] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[20] Deformable DETR: Deformable Transformers for End-to-End Object Detection
[21] HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation
[22] Attention-Based Context Aware Reasoning for Situation Recognition
[23] End-to-End Object Detection with Transformers
[24] Temporal Pyramid Network for Action Recognition
[25] Grounded Situation Recognition
[26] Counterfactual Samples Synthesizing for Robust Visual Question Answering
[27] PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection
[28] Meshed-Memory Transformer for Image Captioning
[29] EfficientDet: Scalable and Efficient Object Detection
[30] Mixture-Kernel Graph Attention Network for Situation Recognition
[31] Generating Long Sequences with Sparse Transformers
[32] Cross-Modal Self-Attention Network for Referring Image Segmentation
[33] Relation-Aware Graph Attention Network for Visual Question Answering
[34] MUREL: Multimodal Relational Reasoning for Visual Question Answering
[35] Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression
[36] Counterfactual Critic Multi-Agent Training for Scene Graph Generation
[37] Learning to Transfer: Generalizable Attribute Learning with Multitask Neural Model Search
[38] Graph R-CNN for Scene Graph Generation
[39] Personalized clothing recommendation combining user social circle and fashion style consistency
[40] Linguistically-Informed Self-Attention for Semantic Role Labeling
[41] GNAS: A Greedy Neural Architecture Search Method for Multi-Attribute Learning
[43] MovieGraphs: Towards Understanding Human-Centric Situations from Videos
[44] Deep Semantic Role Labeling with Self-Attention
[45] A Closer Look at Spatiotemporal Convolutions for Action Recognition
[46] Neural Motifs: Scene Graph Parsing with Global Context
[47] Situation Recognition with Graph Neural Networks
[48] Scene Graph Generation from Objects, Phrases and Region Captions
[49] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
[50] Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images
[51] Attention is All you Need
[52] On the Selection of Anchors and Targets for Video Hyperlinking
[53] Video eCommerce++: Toward Large Scale Online Video Advertising
[54] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
[55] Neural Message Passing for Quantum Chemistry
[56] Recurrent Models for Situation Recognition
[57] Scene Graph Generation by Iterative Message Passing
[58] Large-Scale Image Retrieval with Attentive Deep Local Features
[59] Single Image Action Recognition Using Semantic Body Part Actions
[60] Feature Pyramid Networks for Object Detection
[61] Commonly Uncommon: Semantic Sparsity in Situation Recognition
[62] Self-Critical Sequence Training for Image Captioning
[63] SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning
[64] Video eCommerce: Towards Online Video Advertising
[65] Context-aware Image Tweet Modelling and Recommendation
[66] Semi-Supervised Classification with Graph Convolutional Networks
[68] Situation Recognition: Visual Semantic Role Labeling for Image Understanding
[69] Image Captioning with Semantic Attention
[70] Rethinking the Inception Architecture for Computer Vision
[71] Gated Graph Sequence Neural Networks
[72] You Only Look Once: Unified, Real-Time Object Detection
[75] The Berkeley FrameNet Project
[77] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[78] VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video hyperlinking
[79] Kaiming He, Bharath Hariharan, and Serge Belongie
[81] From TreeBank to PropBank
[82] Frame semantics for text understanding
[83] Action Recognition with Improved Trajectories (International Conference on Computer Vision, 2013)
[84] attentions
Attention Refinement. MM ’22, October 10–14, 2022, Lisboa, Portugal