Evaluation on target datasets for 3D Point Cloud Classification without any training
These leaderboards are used to track progress in Training-free 3D Point Cloud Classification.
Use these libraries to find Training-free 3D Point Cloud Classification models and implementations.
Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP), which learns to match images with their corresponding texts in open-vocabulary settings, has shown inspiring performance on 2D visual recognition. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts. Specifically, we encode a point cloud by projecting it onto multi-view depth maps and aggregate the view-wise zero-shot predictions in an end-to-end manner, which achieves efficient knowledge transfer from 2D to 3D. We further design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot 3D knowledge into CLIP pre-trained in 2D. By just fine-tuning the adapter under few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe that the knowledge of PointCLIP and that of classical 3D-supervised networks are complementary: via a simple ensemble during inference, PointCLIP yields favorable performance gains over state-of-the-art 3D networks. PointCLIP is therefore a promising alternative for effective 3D point cloud understanding in low-data regimes with marginal resource cost. We conduct thorough experiments on ModelNet10, ModelNet40 and ScanObjectNN to demonstrate the effectiveness of PointCLIP. Code is available at https://github.com/ZrrSkywalker/PointCLIP.
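The projection step can be illustrated with a minimal sketch: a single orthographic view rendered in plain NumPy. This is only an illustration of the idea, not PointCLIP's actual rendering pipeline, which uses multiple views and its own depth-map conventions; the resolution and the assumption that coordinates are normalized to [-1, 1] are choices made here for simplicity.

```python
import numpy as np

def point_cloud_to_depth_map(points, resolution=64):
    """Orthographically project a point cloud (N, 3) with coordinates
    normalized to [-1, 1] onto the xy-plane. Each pixel keeps the largest
    z value landing on it (background stays 0; z assumed non-negative)."""
    # map x, y from [-1, 1] to integer pixel indices
    xy = ((points[:, :2] + 1) / 2 * (resolution - 1)).astype(int)
    xy = np.clip(xy, 0, resolution - 1)
    depth = np.zeros((resolution, resolution))
    for (x, y), z in zip(xy, points[:, 2]):
        depth[y, x] = max(depth[y, x], z)
    return depth
```

The resulting depth map can then be treated as an ordinary image and fed to CLIP's visual encoder, which is the core of the 2D-to-3D transfer described above.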
This paper is the first to combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection, demonstrating generalization ability for unified 3D open-world learning.
We present a Non-parametric Network for 3D point cloud analysis, Point-NN, which consists of purely non-learnable components: farthest point sampling (FPS), k-nearest neighbors (k-NN), and pooling operations, with trigonometric functions. Surprisingly, it performs well on various 3D tasks, requiring no parameters or training, and even surpasses existing fully trained models. Starting from this basic non-parametric model, we propose two extensions. First, Point-NN can serve as a base architectural framework to construct Parametric Networks by simply inserting linear layers on top. Given the superior non-parametric foundation, the derived Point-PN exhibits a high performance-efficiency trade-off with only a few learnable parameters. Second, Point-NN can be regarded as a plug-and-play module for the already trained 3D models during inference. Point-NN captures the complementary geometric knowledge and enhances existing methods for different 3D benchmarks without re-training. We hope our work may cast a light on the community for understanding 3D point clouds with non-parametric methods. Code is available at https://github.com/ZrrSkywalker/Point-NN.
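As a concrete illustration, farthest point sampling (FPS), one of the non-learnable components listed above, can be written in a few lines of NumPy. This is a minimal sketch of the generic algorithm, not Point-NN's actual implementation; the choice to start from the first point is an assumption made here for determinism.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedily select k points from an (N, D) array, each new point
    maximizing its distance to the set chosen so far (start at index 0)."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=int)          # chosen[0] = 0 by construction
    dist = np.full(n, np.inf)                # running distance to chosen set
    for i in range(1, k):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)           # distance to nearest chosen point
        chosen[i] = int(np.argmax(dist))     # farthest remaining point
    return points[chosen]
```

Because FPS, k-NN, and pooling involve no learned weights, a pipeline built from them needs no training, which is what makes the "no parameters or training" claim above possible.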
Experimental results show that CLIP2Point is effective at transferring CLIP knowledge to 3D vision, and a novel Gated Dual-Path Adapter (GDPA) enables an ensemble of CLIP and CLIP2Point, efficiently adapting the pre-trained knowledge to downstream tasks.
CALIP is a free-lunch enhancement method that boosts CLIP's zero-shot performance via a parameter-free attention module; inserting a small number of linear layers into this attention module further verifies its robustness under few-shot settings.
ULIP is introduced to learn a unified representation of image, text, and 3D point cloud by pre-training with object triplets from the three modalities, using a pre-trained vision-language model that has already learned a common visual and textual space by training with massive image-text pairs.
ViT-Lens provides a unified solution for representation learning over a growing set of modalities, with two appealing benefits: (i) it exploits the pretrained ViT effectively across tasks and domains in a data-efficient regime; (ii) emergent downstream capabilities on novel modalities arise from the modality-aligned space.
This paper introduces Point-GN, a novel non-parametric network for efficient and accurate 3D point cloud classification that outperforms existing non-parametric methods and matches the performance of fully trained models, all with zero learnable parameters.