Retrieval of similar human poses from images or videos
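At its simplest, the task can be sketched as nearest-neighbor search over normalized 2D keypoint vectors. The sketch below is a minimal, illustrative baseline (all names are hypothetical, not from any listed paper): poses are centered and scaled so that retrieval is invariant to translation and scale, then compared by Euclidean distance.

```python
import numpy as np

def normalize_pose(keypoints):
    """Center a (J, 2) array of 2D joint keypoints and scale it to unit
    norm, making comparisons invariant to translation and scale."""
    centered = keypoints - keypoints.mean(axis=0)
    scale = np.linalg.norm(centered)
    return centered / (scale + 1e-8)

def retrieve_similar(query, gallery, k=3):
    """Return indices of the k gallery poses closest to the query pose."""
    q = normalize_pose(query).ravel()
    dists = [np.linalg.norm(q - normalize_pose(g).ravel()) for g in gallery]
    return np.argsort(dists)[:k]
```

Real systems replace the hand-crafted normalization with a learned, often view-invariant, embedding, as the papers below describe.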
These leaderboards are used to track progress in Pose Retrieval.
Use these libraries to find Pose Retrieval models and implementations.
An approach that learns a compact, view-invariant embedding space from 2D joint keypoints alone, without explicitly predicting 3D poses, and uses probabilistic embeddings to model the uncertainty inherent in the 2D input.
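The probabilistic-embedding idea can be illustrated with a toy Monte Carlo matching score: each pose maps to a Gaussian (mean plus variance), and similarity is estimated by sampling from both distributions. This is only a sketch of the concept; the function and the sigmoid matching head are illustrative stand-ins, not the paper's actual model.

```python
import numpy as np

def matching_probability(mu_a, var_a, mu_b, var_b, n_samples=200, seed=0):
    """Monte Carlo estimate of how likely two probabilistic pose
    embeddings match: draw samples from each Gaussian and average a
    sigmoid of the (shifted) negative squared distance. The sigmoid is a
    toy stand-in for a learned matching head."""
    rng = np.random.default_rng(seed)
    za = mu_a + np.sqrt(var_a) * rng.normal(size=(n_samples, mu_a.size))
    zb = mu_b + np.sqrt(var_b) * rng.normal(size=(n_samples, mu_b.size))
    d2 = np.sum((za - zb) ** 2, axis=1)
    return float(np.mean(1.0 / (1.0 + np.exp(d2 - 1.0))))
```

Larger variances express higher input uncertainty (e.g. ambiguous 2D projections), which softens the matching score between such poses.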
This work introduces a fast-search method that approximates an exhaustive search over a joint objective function for simultaneously retrieving the object category, a CAD model, and the pose of an object, given an approximate 3D bounding box.
An automatic method is presented for annotating images of indoor scenes with the CAD models of their objects by relying on RGB-D scans; a 'cloning procedure' is introduced that identifies objects with the same geometry and annotates them with the same CAD models.
A new captioning dataset named FixMyPose is introduced, along with strong cross-attention baseline models (unimodal/multimodal, RL, multilingual); the baselines are shown to be competitive with other models when evaluated on other image-difference datasets.
This work bridges the domain gap by efficiently transfer-learning from both domain-specific and task-specific source models, and applies the resulting state-of-the-art character pose estimator to the novel task of pose-guided illustration retrieval.
Human pose estimation (HPE) from RGB and depth images has recently experienced a push for viewpoint-invariant and scale-invariant pose retrieval methods. Current methods fail to generalize to unconventional viewpoints due to the lack of viewpoint-invariant data at training time, and existing datasets do not provide multiple-viewpoint observations, focusing mostly on frontal views. In this work, we introduce PanopTOP, a fully automatic framework for the generation of semi-synthetic RGB and depth samples with 2D and 3D ground truth of pedestrian poses from multiple arbitrary viewpoints. Starting from the Panoptic Dataset [15], we use the PanopTOP framework to generate the PanopTOP31K dataset, consisting of 31K images of 23 different subjects recorded from diverse and challenging viewpoints, including the top view. Finally, we provide baseline results and cross-validation tests for our dataset, demonstrating how it is possible to generalize from the semi-synthetic to the real-world domain. The dataset and the code will be made publicly available upon acceptance.
This work introduces TriBERT -- a transformer-based architecture, inspired by ViLBERT, that enables contextual feature learning across three modalities (vision, pose, and audio) through flexible co-attention -- along with a learned visual tokenization scheme based on spatial attention that leverages weak supervision to allow granular cross-modal interactions for the visual and pose modalities.
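The core of a co-attention block is cross-modal attention: tokens from one modality form the queries while tokens from another supply the keys and values. The single-head NumPy sketch below is illustrative only (simplified to use one shared projection-free matrix for keys and values), not TriBERT's actual implementation.

```python
import numpy as np

def co_attention(tokens_mod_a, tokens_mod_b):
    """Single-head cross-modal attention: each token from modality A
    (e.g. pose) attends over all tokens from modality B (e.g. audio).
    Projections are omitted for brevity; B's tokens serve directly as
    keys and values."""
    d = tokens_mod_a.shape[-1]
    scores = tokens_mod_a @ tokens_mod_b.T / np.sqrt(d)
    # Numerically stable softmax over modality B's tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens_mod_b  # one contextualized output per A-token
```

In a full tri-modal model, such blocks are applied between each pair of modalities and stacked with feed-forward layers.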
This work proposes category-level pose estimation by learning an alignment metric in an embedding space, using a contrastive loss with a dynamic margin and a continuous pose-label space; it achieves state-of-the-art performance on PASCAL3D and OccludedPASCAL3D and surpasses competing methods on KITTI3D in a cross-dataset evaluation setting.
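A contrastive loss with a dynamic margin can be sketched as follows: instead of a fixed margin, the required separation between two embeddings grows with the distance between their continuous pose labels. This is a minimal toy formulation, assuming a scalar pose distance and a hypothetical scale factor `alpha`; it is not the paper's exact loss.

```python
import numpy as np

def dynamic_margin_contrastive(emb_a, emb_b, pose_dist, alpha=1.0):
    """Toy pairwise contrastive loss with a margin proportional to the
    pose-label distance: identical poses (pose_dist == 0) are pulled
    together, while dissimilar poses must stay at least
    alpha * pose_dist apart in embedding space."""
    d = np.linalg.norm(emb_a - emb_b)
    if pose_dist > 0:
        margin = alpha * pose_dist
        return max(0.0, margin - d)  # push apart until the margin is met
    return d ** 2                    # pull matching poses together
```

Because the margin varies continuously with the pose labels, the embedding space inherits the metric structure of the pose space rather than a binary same/different split.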