Point Transformer V3: Simpler, Faster, Stronger (2023-12-15)

TL;DR

This paper presents Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of mechanisms that matter little to overall performance after scaling. For example, it replaces the precise KNN neighbor search with an efficient serialized neighbor mapping of point clouds organized in specific patterns.

Abstract

This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3× increase in processing speed and a 10× improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.
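
To make the idea of a serialized neighbor mapping concrete, here is a minimal sketch, not the authors' implementation. It assumes a Z-order (Morton) curve as one example of a "specific pattern"; the abstract does not name the pattern. The function names (`morton_code`, `serialize`, `patch_neighbors`), the parameters `grid_size` and `bits`, and the edge-padding rule are all hypothetical choices made for illustration. Points are quantized to a grid, sorted by their curve key, and then cut into fixed-size patches, so that each point's neighbors are simply the other points in its patch and no KNN query is needed.

```python
import numpy as np


def morton_code(grid: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the bits of integer (x, y, z) grid coordinates into a Z-order key."""
    grid = grid.astype(np.int64)
    code = np.zeros(len(grid), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            # Bit b of this axis lands at position 3*b + axis of the key.
            bit = (grid[:, axis] >> b) & 1
            code |= bit << (3 * b + axis)
    return code


def serialize(points: np.ndarray, grid_size: float = 0.05, bits: int = 10) -> np.ndarray:
    """Return the permutation that orders points along the Z-order curve."""
    # Quantize coordinates to non-negative integer grid cells.
    grid = np.floor((points - points.min(axis=0)) / grid_size).astype(np.int64)
    if grid.max() >= (1 << bits):
        raise ValueError("scene does not fit in the grid; raise bits or grid_size")
    return np.argsort(morton_code(grid, bits), kind="stable")


def patch_neighbors(order: np.ndarray, patch_size: int) -> np.ndarray:
    """Cut the serialized order into patches of patch_size point indices.

    The last patch is padded by repeating the final index (an assumption
    made here for simplicity).
    """
    pad = (-len(order)) % patch_size
    padded = np.pad(order, (0, pad), mode="edge")
    return padded.reshape(-1, patch_size)


# Usage: four points on a unit grid, grouped into patches of two.
points = np.array([[0, 0, 0], [1, 1, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
order = serialize(points, grid_size=1.0)
patches = patch_neighbors(order, patch_size=2)
```

In this sketch the cost of finding neighbors is one sort plus a reshape, which is why the patch size (the receptive field) can grow without the per-point search cost that KNN would incur. The trade-off is that points adjacent on the curve are only approximately the nearest points in space.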

Authors

Wanli Ouyang

Xiaoyang Wu

Hengshuang Zhao
