[1] Planning-oriented Autonomous Driving
[2] BEVPoolv2: A Cutting-edge Implementation of BEVDet Toward Deployment
[3] PlanT: Explainable Planning Transformers via Object-Level Representations
[4] Model-Based Imitation Learning for Urban Driving
[5] Motion Transformer with Global Intention Localization and Local Movement Refinement
[6] BEVStereo: Enhancing Depth Estimation in Multi-view 3D Object Detection with Dynamic Temporal Stereo
[7] Delving Into the Devils of Bird’s-Eye-View Perception: A Review, Evaluation and Recipe
[8] Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer
[9] ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning
[10] MMFN: Multi-Modal-Fusion-Net for End-to-End Driving
[11] IBISCape: A Simulated Benchmark for Multi-modal SLAM Systems Evaluation in Large-scale Dynamic Environments
[12] BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection
[13] Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline
[14] TransFuser: Imitation With Transformer-Based Sensor Fusion for Autonomous Driving
[15] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
[16] Cross-view Transformers for Real-time Map-view Semantic Segmentation
[17] M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation
[18] BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
[19] Learning from All Vehicles
[20] PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark
[21] PETR: Position Embedding Transformation for Multi-View 3D Object Detection
[22] DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
[23] A ConvNet for the 2020s
[24] BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View
[25] AFDetV2: Rethinking the Necessity of the Second Stage for Object Detection from Point Clouds
[26] GRI: General Reinforced Imitation and its Application to Vision-Based Autonomous Driving
[27] Structured Bird’s-Eye-View Traffic Scene Understanding from Onboard Images
[28] NEAT: Neural Attention Fields for End-to-End Autonomous Driving
[29] End-to-End Urban Driving by Imitating a Reinforcement Learning Coach
[30] RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection
[31] Learning to Drive from a World on Rails
[32] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
[33] LiDAR R-CNN: An Efficient and Universal 3D Object Detector
[34] Categorical Depth Distribution Network for Monocular 3D Object Detection
[35] MP3: A Unified Model to Map, Perceive, Predict and Plan
[36] LookOut: Diverse Multi-Future Prediction and Planning for Self-Driving
[37] Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection
[38] From Goals, Waypoints & Paths To Long Term Human Trajectory Forecasting
[39] IDE-Net: Interactive Driving Event and Pattern Extraction From Human Data
[40] Fighting Copycat Agents in Behavioral Cloning from Observation Histories
[41] Deformable DETR: Deformable Transformers for End-to-End Object Detection
[42] Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations
[43] DSDNet: Deep Structured Self-Driving Network
[44] Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
[45] Center-based 3D Object Detection and Tracking
[46] MANTRA: Memory Augmented Networks for Multiple Trajectory Prediction
[47] End-to-End Object Detection with Transformers
[48] TITAN: Future Forecast Using Action Priors
[49] RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
[50] PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
[52] PyTorch: An Imperative Style, High-Performance Deep Learning Library
[53] End-to-End Model-Free Reinforcement Learning for Urban Driving Using Implicit Affordances
[54] STD: Sparse-to-Dense 3D Object Detector for Point Cloud
[55] End-To-End Interpretable Neural Motion Planner
[56] BASNet: Boundary-Aware Salient Object Detection
[57] Exploring the Limitations of Behavior Cloning for Autonomous Driving
[58] PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud
[59] ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
[60] 3D-LaneNet: End-to-End 3D Multiple Lane Detection
[61] Orthographic Feature Transform for Monocular 3D Object Detection
[62] SECOND: Sparsely Embedded Convolutional Detection
[63] R³Net: Recurrent Residual Refinement Network for Saliency Detection
[64] Conditional Affordance Learning for Driving in Urban Environments
[65] Path Aggregation Network for Instance Segmentation
[66] VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
[67] Decoupled Weight Decay Regularization
[68] CARLA: An Open Urban Driving Simulator
[69] End-to-End Driving Via Conditional Imitation Learning
[70] A Stagewise Refinement Model for Detecting Salient Objects in Images
[71] PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume
[72] FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
[73] DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection
[74] End to End Learning for Self-Driving Cars
[75] Deep Residual Learning for Image Recognition
[76] Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
[77] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
[78] U-Net: Convolutional Networks for Biomedical Image Segmentation
[80] FlowNet: Learning Optical Flow with Convolutional Networks
[81] On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
[82] Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
[83] Off-Road Obstacle Avoidance through End-to-End Learning
[85] Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling
[86] Towards Capturing the Temporal Dynamics for Trajectory Prediction: A Coarse-to-Fine Approach
[87] Multi-Agent Trajectory Prediction by Combining Egocentric and Allocentric Views
[88] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[89] MMCV: OpenMMLab Computer Vision Foundation
[91] ALVINN: An Autonomous Land Vehicle in a Neural Network
Figure 1. Visualization of the predictions from different decoder layers. Larger and brighter dots come from deeper layers. The refined trajectory notices the jay-walker and leads to an emergency stop during the lane change; it leaves more room for merging, which results in safer and smoother driving; and for the jay-walker occluded by a nearby vehicle, it decelerates compared to the original trajectory.