3260 papers • 126 benchmarks • 313 datasets
This is the task of detecting 3D objects from a single monocular image (as opposed to LiDAR-based methods). It is most commonly studied in the context of autonomous driving. (Image credit: Orthographic Feature Transform for Monocular 3D Object Detection)
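Concretely, a monocular 3D detector must predict a full 3D box (center, dimensions, yaw) from a single image, and the predicted box can be related back to the image through the camera intrinsics. The sketch below, with illustrative KITTI-style conventions and example intrinsic values, shows how a 3D box is projected into pixel coordinates:

```python
import numpy as np

def box3d_corners(center, dims, yaw):
    """Return the 8 corners of a 3D box in camera coordinates.

    center: (x, y, z) box center; dims: (w, h, l); yaw: rotation about
    the camera's vertical (y) axis -- a KITTI-style convention.
    """
    w, h, l = dims
    # Corner offsets in the box's local frame.
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2
    y = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    return (R @ np.vstack([x, y, z])).T + np.asarray(center)

def project(points, K):
    """Project Nx3 camera-frame points with a 3x3 intrinsic matrix K."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Example intrinsics (illustrative, roughly KITTI-like focal lengths).
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0,   0.0,   1.0]])
corners = box3d_corners(center=(1.0, 1.5, 20.0), dims=(1.6, 1.5, 3.9), yaw=0.3)
uv = project(corners, K)   # 8x2 pixel coordinates of the box corners
```

Monocular methods differ mainly in how they recover the depth component of `center`, which is not directly observable from a single image.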
These leaderboards are used to track progress in 3D object detection from monocular images.
Use these libraries to find 3D-object-detection-from-monocular-images models and implementations.
This work proposes VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting, which achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D, with a simple design, compact model size, and high efficiency.
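The voting idea can be illustrated outside of any network: each point predicts an offset toward its object's center, and the resulting votes are clustered into candidate detections. The following is a toy NumPy sketch of that grouping step (with synthetic points and known offsets, not the learned VoteNet pipeline):

```python
import numpy as np

def cluster_votes(votes, radius=0.5):
    """Greedy clustering of 3D center votes: each cluster of nearby votes
    becomes one candidate object center (toy stand-in for vote grouping)."""
    remaining = list(range(len(votes)))
    centers = []
    while remaining:
        seed = votes[remaining[0]]
        members = [i for i in remaining
                   if np.linalg.norm(votes[i] - seed) < radius]
        centers.append(votes[members].mean(axis=0))
        remaining = [i for i in remaining if i not in members]
    return np.array(centers)

# Synthetic surface points from two objects; each point "votes" by adding
# a (here: known, lightly noised) offset toward its object's center.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0, 1.0], [4.0, 0.0, 1.0]])
points = np.concatenate([c + rng.normal(0, 0.3, (50, 3)) for c in true_centers])
offsets = np.concatenate([true_centers[i] - points[50 * i:50 * (i + 1)]
                          for i in range(2)])
votes = points + offsets + rng.normal(0, 0.05, points.shape)
centers = cluster_votes(votes)   # approximately the two true centers
```

In VoteNet itself the offsets are regressed by a deep point set network and the grouped votes are pooled into box proposals; the sketch only shows why voting concentrates evidence at object centers.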
This work quantifies the impact introduced by each sub-task, finds that localization error is the dominant factor restricting monocular 3D detection, and investigates the underlying causes of these localization errors.
The orthographic feature transform is introduced, which escapes the image domain by mapping image-based features into an orthographic 3D space, allowing holistic reasoning about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful.
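The core of this transform can be sketched as follows: each cell of a ground-plane (bird's-eye) grid is projected into the image through the intrinsics, and the image feature at that pixel is pulled into the cell. This is a simplified nearest-neighbour sketch with made-up intrinsics and grid values; the paper itself pools over each voxel's projected footprint using integral images:

```python
import numpy as np

def orthographic_transform(feat, K, grid_x, grid_z, y=1.0):
    """Pull image features onto a ground-plane (bird's-eye) grid.

    feat: HxWxC image feature map; K: 3x3 intrinsics;
    grid_x / grid_z: 1D arrays of lateral / forward coordinates (metres);
    y: assumed height of the ground plane below the camera.
    """
    H, W, C = feat.shape
    bev = np.zeros((len(grid_z), len(grid_x), C))
    for i, z in enumerate(grid_z):
        for j, x in enumerate(grid_x):
            u, v, w = K @ np.array([x, y, z])
            u, v = int(round(u / w)), int(round(v / w))
            if 0 <= u < W and 0 <= v < H:
                bev[i, j] = feat[v, u]   # nearest-neighbour sampling
    return bev

# Illustrative intrinsics and a random feature map.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 32.0],
              [0.0,   0.0,  1.0]])
feat = np.random.default_rng(0).normal(size=(64, 128, 8))
bev = orthographic_transform(feat, K,
                             grid_x=np.linspace(-10, 10, 40),
                             grid_z=np.linspace(5, 40, 70))
```

Once features live on this metric grid, downstream reasoning (e.g. box regression) operates in a space where object size no longer depends on distance from the camera.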
Experiments on challenging, real-world imagery from ScanNet show that ROCA significantly improves on the state of the art, from 9.5% to 17.6% in retrieval-aware CAD alignment accuracy.
This work proposes MonoDTR, a novel end-to-end depth-aware transformer network for monocular 3D object detection that outperforms previous state-of-the-art monocular-based methods and achieves real-time detection.
This paper introduces the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR, and modifies the vanilla transformer to be depth-aware, guiding the whole detection process with contextual depth cues.
M3D-RPN is able to significantly improve the performance of both monocular 3D Object Detection and Bird's Eye View tasks within the KITTI urban autonomous driving dataset, while efficiently using a shared multi-class model.
This paper proposes Depth EquiVarIAnt NeTwork (DEVIANT), a neural network built with existing scale equivariant steerable blocks that achieves state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information.
GUP Net is proposed to tackle the error amplification problem at both the inference and training stages; it infers more reliable object depth than existing methods and outperforms state-of-the-art image-based monocular 3D detectors.
GrooMeD-NMS addresses the mismatch between training and inference pipelines and, therefore, forces the network to select the best 3D box in a differentiable manner.
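A common way to make box selection differentiable is to replace hard suppression with a smooth score decay. The sketch below is a Gaussian soft-NMS on axis-aligned 2D boxes, which illustrates the general idea but is not GrooMeD-NMS's grouped matrix formulation:

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def soft_nms(boxes, scores, sigma=0.5):
    """Gaussian soft-NMS: instead of deleting overlapping boxes, decay
    their scores smoothly, keeping the operation (sub)differentiable."""
    boxes = np.asarray(boxes, float)
    scores = np.asarray(scores, float)
    order = scores.argsort()[::-1]           # process high scores first
    boxes, scores = boxes[order], scores[order].copy()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            o = iou(boxes[i], boxes[j])
            scores[j] *= np.exp(-o * o / sigma)   # smooth suppression
    return boxes, scores

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
_, new_scores = soft_nms(boxes, scores)
# The heavily overlapping second box is down-weighted, not discarded;
# the distant third box keeps its score.
```

Because scores are decayed rather than zeroed out, a loss applied after this step can propagate gradients through the selection, which is the training/inference mismatch GrooMeD-NMS targets.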