Context and Geometry Aware Voxel Transformer for Semantic Scene Completion (2024-05-22)

TL;DR

A novel context and geometry aware voxel transformer that extends deformable cross-attention from 2D to 3D pixel space, so that points with similar image coordinates can be differentiated by their depth coordinates. The resulting network outperforms approaches that use temporal images as inputs or much larger image backbone networks.

Abstract

Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared context-independent queries across various input images. Such queries fail to capture the distinctions among inputs, whose focal regions vary, and may result in undirected feature aggregation in cross-attention. Additionally, without depth information, points projected onto the image plane may share the same 2D position or similar sampling points in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It utilizes a context aware query generator to initialize context-dependent queries tailored to individual input images, effectively capturing their unique characteristics and aggregating information within the region of interest. Furthermore, it extends deformable cross-attention from 2D to 3D pixel space, enabling the differentiation of points with similar image coordinates based on their depth coordinates. Building upon this module, we introduce a neural network named CGFormer to achieve semantic scene completion. CGFormer also leverages multiple 3D representations (i.e., voxel and TPV) to boost the semantic and geometric representation abilities of the transformed 3D volume from both local and global perspectives. Experimental results demonstrate that CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining mIoU of 16.87 and 20.05 and IoU of 45.99 and 48.07, respectively. Remarkably, CGFormer even outperforms approaches employing temporal images as inputs or much larger image backbone networks.
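The depth ambiguity described above can be illustrated with a plain pinhole projection. The following is a minimal sketch, not CGFormer's implementation: the intrinsic matrix K, the helper project_to_pixel_space, and the two sample points are all hypothetical values chosen for illustration. It shows that two points on the same camera ray land on the same 2D pixel, while a 3D pixel coordinate (u, v, d) that keeps the depth still separates them.

```python
import numpy as np

# Hypothetical pinhole intrinsics (focal lengths and principal point are made up).
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])


def project_to_pixel_space(points_cam, K):
    """Map camera-frame points of shape (N, 3) to (u, v, d) rows.

    (u, v) are image-plane pixel coordinates and d is the depth along
    the optical axis, i.e. a "3D pixel space" coordinate.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    depth = points_cam[:, 2]
    uvw = points_cam @ K.T                # homogeneous pixel coordinates
    uv = uvw[:, :2] / depth[:, None]      # perspective divide
    return np.concatenate([uv, depth[:, None]], axis=1)


# Two points on the same camera ray: the far one is the near one scaled by 4.
near = np.array([1.0, 0.5, 5.0])
far = 4.0 * near

uvd = project_to_pixel_space(np.stack([near, far]), K)

# In 2D the two points are indistinguishable ...
same_2d = bool(np.allclose(uvd[0, :2], uvd[1, :2]))
# ... but the depth coordinate separates them.
different_depth = not bool(np.isclose(uvd[0, 2], uvd[1, 2]))

print(uvd)
print("same 2D position:", same_2d, "| different depth:", different_depth)
```

A purely 2D sampling scheme would read the same feature-map location for both points, which is why attending in (u, v, d) rather than (u, v) is one way to tell them apart.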


