Open3DIS

Open-Vocabulary 3D Instance Segmentation
with 2D Mask Guidance

CVPR 2024

1VinAI Research 2MIT-IBM Watson AI Lab 3Posts & Telecommunications Inst. of Tech. 4UMass Amherst
*Equal contribution

Abstract

TL;DR: We propose Open3DIS, a method for 3D instance segmentation with open-vocabulary queries.

We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task. Recent advancements in Open-Vocabulary scene understanding have made significant strides in this area by employing class-agnostic 3D instance proposal networks for object localization and learning queryable features for each 3D mask. While these methods produce high-quality instance proposals, they struggle with identifying small-scale and geometrically ambiguous objects. The key idea of our method is a new module that aggregates 2D instance masks across frames and maps them to geometrically coherent point cloud regions as high-quality object proposals, addressing the above limitations. These are then combined with 3D class-agnostic instance proposals to cover a wide range of real-world objects. To validate our approach, we conducted experiments on three prominent datasets: ScanNet200, S3DIS, and Replica, demonstrating significant performance gains over state-of-the-art approaches in segmenting objects across diverse categories.





Method

A pre-trained class-agnostic 3D Instance Segmenter proposes initial 3D objects, while a 2D Instance Segmenter generates masks for video frames. Our 2D-guide-3D Instance Proposal Module combines superpoints and 2D instance masks to produce additional 3D proposals, which are integrated with the initial ones. Finally, the Pointwise Feature Extraction module correlates instance-aware point cloud CLIP features from multiview images with text embeddings to produce the final instance masks.
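To make the last step concrete, here is a minimal sketch of open-vocabulary classification of 3D proposals, assuming per-point CLIP features and text embeddings have already been computed; all names (classify_instances, point_feats, etc.) are illustrative, not the released API.

import numpy as np

def classify_instances(point_feats, instance_masks, text_embs):
    """Assign each 3D proposal an open-vocabulary label by matching
    pooled per-point CLIP features against text embeddings.

    point_feats:    (N, D) per-point CLIP features, L2-normalized
    instance_masks: (K, N) boolean masks, one per 3D proposal
    text_embs:      (C, D) CLIP text embeddings, L2-normalized
    """
    labels, scores = [], []
    for mask in instance_masks:
        # Pool the CLIP features of the points belonging to this instance.
        feat = point_feats[mask].mean(axis=0)
        feat /= np.linalg.norm(feat) + 1e-8
        sims = text_embs @ feat                # cosine similarity per class
        labels.append(int(sims.argmax()))
        scores.append(float(sims.max()))
    return np.array(labels), np.array(scores)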

(Top) The 2D-G-3DIP module uses 2D per-frame instance masks and 3D superpoints to generate per-frame 3D proposals. (Bottom) Our proposed hierarchical merging: the per-frame proposals are treated as point cloud regions and merged across multiple views, yielding the final augmented 3D proposals. Both steps are sketched below.
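The following sketch illustrates the two steps under simplifying assumptions (pinhole camera, precomputed superpoints). The thresholds tau, depth_eps, and iou_thr are illustrative, and the greedy IoU merge is a simple stand-in for the paper's hierarchical merging, not its exact procedure.

import numpy as np

def lift_mask_to_superpoints(points, superpoint_ids, mask2d, depth,
                             K, world2cam, tau=0.5, depth_eps=0.05):
    """Map one 2D instance mask to a 3D proposal over superpoints.
    points: (N, 3) world-frame point cloud; superpoint_ids: (N,) ids;
    mask2d: (H, W) bool; depth: (H, W); K: (3, 3) intrinsics;
    world2cam: (4, 4) extrinsics."""
    H, W = mask2d.shape
    # Project every point into the frame.
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (world2cam @ pts_h.T).T[:, :3]
    z = cam[:, 2]
    uv = (K @ cam.T).T
    valid = z > 1e-6
    u = np.zeros(len(points), dtype=int)
    v = np.zeros(len(points), dtype=int)
    u[valid] = np.round(uv[valid, 0] / z[valid]).astype(int)
    v[valid] = np.round(uv[valid, 1] / z[valid]).astype(int)
    visible = valid & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Occlusion check against the frame's depth map.
    idx = np.where(visible)[0]
    visible[idx] &= np.abs(depth[v[idx], u[idx]] - z[idx]) < depth_eps
    in_mask = np.zeros(len(points), dtype=bool)
    idx = np.where(visible)[0]
    in_mask[idx] = mask2d[v[idx], u[idx]]
    # Keep superpoints whose visible points mostly fall inside the mask.
    keep = [sp for sp in np.unique(superpoint_ids[visible])
            if in_mask[visible & (superpoint_ids == sp)].mean() > tau]
    return np.isin(superpoint_ids, keep)      # (N,) boolean 3D proposal

def merge_proposals(proposals, iou_thr=0.8):
    """Greedily merge per-frame 3D proposals whose point sets overlap
    heavily, a simplified stand-in for hierarchical cross-view merging."""
    merged = []
    for p in proposals:
        for i, m in enumerate(merged):
            inter = np.logical_and(p, m).sum()
            union = np.logical_or(p, m).sum()
            if union > 0 and inter / union > iou_thr:
                merged[i] = np.logical_or(p, m)   # fuse into existing region
                break
        else:
            merged.append(p.copy())
    return merged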



Demo: Instance-Aware Point Cloud Features



Quantitative Results

ScanNet200

OV-3DIS results on ScanNet200. Our proposed method achieves the highest AP, outperforming previous methods on all metrics. The best results are in bold. [Using ISBNet or Mask3D as the 3D backbone makes no significant difference; the 3D experiments below use ISBNet.] (Updated Mar. 19, 2024.)

\[ \begin{array}{llcccccc} \hline \textbf{Method} & \textbf{3D Backbone} & \textbf{AP} & \textbf{AP$_{50}$} & \textbf{AP$_{25}$} & \textbf{AP$_{head}$} & \textbf{AP$_{com}$} & \textbf{AP$_{tail}$} \\ \hline \text{OpenScene} & \text{DBScan} & 2.8 & 7.8 & 18.6 & 2.7 & 3.1 & 2.6 \\ \text{OpenScene} & \text{Mask3D} & 11.7 & 15.2 & 17.8 & 13.4 & 11.6 & 9.9 \\ \text{SAM3D (SAM)} & \text{--} & 6.1 & 14.2 & 21.3 & 7.0 & 6.2 & 4.6 \\ \text{OVIR-3D (G-SAM)} & \text{--} & 13.0 & 24.9 & 32.3 & 14.4 & 12.7 & 11.7 \\ \text{OpenIns3D (Synthetic-Scene)} & \text{Mask3D} & 8.8 & 10.3 & 14.4 & 16.0 & 6.5 & 4.2 \\ \text{OpenMask3D} & \text{Mask3D} & 15.4 & 19.9 & 23.1 & 17.1 & 14.1 & 14.9 \\ \hline \textbf{Ours} \text{ (G-SAM)} & \text{--} & 18.2 & 26.1 & 31.4 & 18.9 & 16.5 & 19.2 \\ \textbf{Ours} \text{ (3D)} & \text{ISBNet} & 18.6 & 23.1 & 27.3 & 24.7 & 16.9 & 13.3 \\ \textbf{Ours} \text{ (3D)} & \text{Mask3D} & 18.9 & 24.3 & 28.3 & 23.9 & 17.4 & 15.3 \\ \textbf{Ours} \text{ (3D + G-SAM)} & \text{ISBNet} & \textbf{23.7} & \textbf{29.4} & 32.8 & \textbf{27.8} & 21.2 & 21.8 \\ \textbf{Ours} \text{ (3D + G-SAM)} & \text{Mask3D} & \textbf{23.7} & 29.2 & \textbf{33.1} & 26.4 & \textbf{22.5} & \textbf{21.9} \\ \hline \end{array} \]

Class-agnostic 3DIS evaluation results on ScanNet200 (updated Mar. 29, 2024; not in the Dec. 17 arXiv version). *: unofficial.

\[ \begin{array}{lcccccc} \hline \textbf{Method} & \textbf{AP} & \textbf{AP$_{50}$} & \textbf{AP$_{25}$} & \textbf{AR} & \textbf{AR$_{50}$} & \textbf{AR$_{25}$} \\ \hline \text{Superpoint} & 5.0 & 12.7 & 38.9 & \text{--} & \text{--} & \text{--} \\ \text{DBSCAN} & 1.6 & 5.5 & 32.1 & \text{--} & \text{--} & \text{--} \\ \text{OVIR-3D* (Detic)} & 14.4 & 27.5 & 38.8 & \text{--} & \text{--} & \text{--} \\ \text{Mask Clustering (CropFormer)} & 17.4 & 33.3 & 46.7 & \text{--} & \text{--} & \text{--} \\ \hline \textbf{Ours} \text{ (3D)} & 40.2 & 50.0 & 54.6 & 66.8 & 80.4 & 87.4 \\ \textbf{Ours} \text{ (G-SAM)} & 29.7 & 45.2 & 56.8 & 49.0 & 70.0 & 83.2 \\ \textbf{Ours} \text{ (SAM)} & 31.5 & 45.3 & 51.1 & 61.2 & 87.1 & 97.5 \\ \textbf{Ours} \text{ (3D + G-SAM)} & 34.6 & 43.1 & 48.5 & 66.2 & 81.6 & 91.4 \\ \textbf{Ours} \text{ (3D + SAM)} & 41.5 & 51.6 & 56.3 & 74.8 & 90.9 & 97.8 \\ \hline \end{array} \]

ScanNet++

Class-agnostic 3DIS evaluation results on ScanNet++. Using all frames matches the OpenMask3D sampling rate of 10. We evaluate on all 1,554 instance-mask classes, not the 100-class semantic subset (updated Mar. 29, 2024; not in the Dec. 17 arXiv version).

\[ \begin{array}{lccccccl} \hline \textbf{Method} & \textbf{AP} & \textbf{AP$_{50}$} & \textbf{AP$_{25}$} & \textbf{AR} & \textbf{AR$_{50}$} & \textbf{AR$_{25}$} & \textbf{Note} \\ \hline \text{ISBNet (3D)} & 6.2 & 10.1 & 16.2 & 10.9 & 16.9 & 25.2 & \text{pretrained on ScanNet200} \\ \hline \text{SAM3D} & 7.2 & 14.2 & 29.4 & \text{--} & \text{--} & \text{--} & \\ \text{SAM-guided Graph Cut} & 12.9 & 25.3 & 43.6 & \text{--} & \text{--} & \text{--} & \\ \text{Segment3D} & 12.0 & 22.7 & 37.8 & \text{--} & \text{--} & \text{--} & \\ \text{SAI3D (SAM)} & 17.1 & 31.1 & 49.5 & \text{--} & \text{--} & \text{--} & \\ \text{SAMPro3D (SAM)} & 18.9 & 33.7 & 51.6 & \text{--} & \text{--} & \text{--} & \\ \hline \textbf{Ours} \text{ (SAM)} & 18.5 & 33.5 & 44.3 & 35.6 & 63.7 & 82.7 & \text{100 frames per scene} \\ \textbf{Ours} \text{ (SAM)} & 20.7 & 38.6 & 47.1 & 40.8 & 75.7 & 91.8 & \text{all frames per scene} \\ \hline \end{array} \]



Note: To assess the quality of class-agnostic masks in the 2D setting, we use all masks generated by the 2D-G-3DIP module without any postprocessing, which typically yields high recall at the cost of precision. For 3D masks, we select the top 100 masks from ISBNet by confidence score. To evaluate open-vocabulary capability, the class-agnostic masks are then postprocessed by keeping the top-k (k roughly between 300 and 600) masks with the highest CLIP scores; following OpenMask3D, the final confidence score is set to 1.0. Tuning a dynamic threshold can yield higher precision.
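A minimal sketch of this postprocessing step, assuming per-mask CLIP scores are already available; the function name and the fixed k are illustrative.

import numpy as np

def postprocess_masks(masks, clip_scores, k=500):
    """Keep the top-k class-agnostic masks ranked by their best CLIP
    score (k roughly between 300 and 600, per the note above).

    masks:       (M, N) boolean candidate masks
    clip_scores: (M,) best text-similarity score per mask
    """
    order = np.argsort(-clip_scores)[:k]
    kept = masks[order]
    # Following OpenMask3D, the reported confidence is fixed to 1.0.
    conf = np.ones(len(kept))
    return kept, conf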



Qualitative Results



Single Image Lifting

3D instance proposals produced from a single view

ScanNet200 Benchmark

3D instance proposal results on ScanNet200. The orange circle indicates the focus region.

Open-Vocabulary Exploration

Open-Vocabulary exploration on ARKitScenes (Left) and ScanNet200 (Right)

Related Work

OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

OpenScene: 3D Scene Understanding with Open Vocabularies

BibTeX


@inproceedings{nguyen2023open3dis,
  title={Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance},
  author={Phuc D. A. Nguyen and Tuan Duc Ngo and Evangelos Kalogerakis and Chuang Gan and Anh Tran and Cuong Pham and Khoi Nguyen},
  year={2024},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}
}