SPDC: A Super-Point and Point Combining Based Dual-Scale Contrastive Learning Network for Point Cloud Semantic Segmentation

May 24, 2023

Dr. Shuai Zhang (first author), Weihong Huang, Yiping Chen (corresponding author), Shuhang Zhang (corresponding author), Wuming Zhang, Jonathan Li
Figure: SPDC architecture with dual-scale contrastive learning (super-point and point levels)
Abstract
Semantic segmentation of point clouds is one of the fundamental tasks in point cloud processing and the basis for many downstream tasks. Deep learning has become the dominant approach, but most existing 3D deep learning models require large amounts of point cloud data, and annotating that data carries significant time and economic costs. To reduce the amount of annotated data that semantic segmentation requires for training, this paper proposes a Super-point-level and Point-level Dual-scale Contrastive learning network (SPDC). Because contrastive learning is difficult to train and its feature extraction can be insufficient, we introduce super-point maps to assist the network in feature extraction: a pre-trained super-point generation network converts the original point cloud into a super-point map. A dynamic data augmentation (DDA) module is designed for the super-point maps to enable super-point-level contrastive learning. The extracted super-point-level features are then mapped back to the original point-level scale, where a second round of contrastive learning is conducted against the original point features. The feature extraction network shares parameters across both views, and to reduce the number of parameters we use the lightweight DGCNN encoder combined with self-attention as the backbone. We also pre-train the backbone in a few-shot setting so that the network converges more easily. Analogous to CutMix, we design a new point cloud data augmentation method called PointObjectMix (POM), which addresses sample imbalance while preserving the overall characteristics of the objects in the scene. Experiments on the S3DIS dataset yield 63.3% mIoU, and extensive ablation experiments verify the effectiveness of each module in our method. The results show that our method outperforms the best unsupervised networks.
Publication
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLVIII-1/W1-2023, pp. 571-578

Highlights

  • SPDC Network: A dual-scale contrastive learning framework operating at both super-point and point levels to reduce reliance on annotated data for point cloud semantic segmentation.
  • Super-Point Assisted Learning: Introduces super-point maps to expand the receptive field and provide multi-scale feature representation, addressing insufficient feature extraction in standard contrastive learning.
  • PointObjectMix (POM): Novel object-level data augmentation method analogous to CutMix, solving sample imbalance while preserving structural semantic information of point cloud objects.
  • Dynamic Data Augmentation (DDA): Learnable augmentation module for super-point maps using MLP and noise signals to generate diverse positive samples for contrastive learning.
  • State-of-the-Art Performance: Achieves 63.3% mIoU on the S3DIS dataset in a fully self-supervised setting (0% labels), outperforming existing unsupervised methods (e.g., CrossPoint 58.4%, PointSmile 58.9%) and performing on par with the weakly supervised PointMatch (63.4% with 0.1% labels).
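To make the DDA highlight concrete, the sketch below shows one way a learnable noise-to-affine augmentation could look. This is a minimal numpy illustration under assumptions of mine, not the paper's implementation: the function name, layer sizes, and the random stand-in weights are all hypothetical (in SPDC the MLP weights would be trained end-to-end with the contrastive loss).

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_augment(superpoints, hidden=16, noise_dim=4):
    """Hypothetical sketch of the DDA idea: a small MLP maps a Gaussian
    noise vector to the parameters of an affine transform (a 3x3 matrix A
    and a translation t), which is applied to the super-point coordinates
    to produce a new positive view. Weights here are random stand-ins."""
    z = rng.normal(size=noise_dim)                     # noise signal
    w1 = rng.normal(scale=0.1, size=(noise_dim, hidden))
    w2 = rng.normal(scale=0.1, size=(hidden, 12))      # 9 params for A, 3 for t
    h = np.tanh(z @ w1)                                # MLP hidden layer
    params = h @ w2
    A = np.eye(3) + params[:9].reshape(3, 3)           # near-identity affine map
    t = params[9:]
    return superpoints @ A.T + t                       # augmented positive view
```

Because the transform is generated rather than drawn from a fixed list, every forward pass yields a slightly different positive sample, which is the property the contrastive stage relies on.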

Methodology

The proposed SPDC framework consists of two main channels:

1. Pre-training Channel (Few-shot Learning)

  • PointObjectMix (POM): Mixes objects from different scenes to create new training samples, balancing categories while maintaining geometric integrity
  • Lightweight Backbone: Uses DGCNN with EdgeConv for local feature extraction plus Self-attention for global context modeling
  • Few-shot Pre-training: Trains on less than 10% of ScanNetV2 data (100 scenes) to provide initial weights for downstream tasks
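The POM step above can be sketched as follows. This is a simplified numpy illustration under my own assumptions (the function name and the centroid-based placement are hypothetical; a real implementation would also handle collisions and sampling density), but it captures the key difference from CutMix-style region swaps: whole objects are transplanted, so their geometry stays intact.

```python
import numpy as np

def point_object_mix(scene_a, labels_a, scene_b, labels_b, target_class):
    """Hypothetical sketch of PointObjectMix (POM): transplant every point of
    one object class from scene B into scene A, so under-represented classes
    gain samples while each object's overall shape is preserved."""
    mask = labels_b == target_class            # points belonging to the donor object
    donor_pts = scene_b[mask]
    donor_lbl = labels_b[mask]
    # Shift the donor object so its centroid lands near scene A's centroid;
    # a real implementation would pick a free, collision-checked location.
    offset = scene_a.mean(axis=0) - donor_pts.mean(axis=0)
    donor_pts = donor_pts + offset
    mixed_pts = np.concatenate([scene_a, donor_pts], axis=0)
    mixed_lbl = np.concatenate([labels_a, donor_lbl], axis=0)
    return mixed_pts, mixed_lbl
```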

2. Self-supervised Channel (Contrastive Learning)

  • Super-Point Generation: Converts point clouds into super-point maps using a lightweight PointNet-based network with learnable association mapping
  • Dynamic Data Augmentation (DDA): Generates augmented views via learnable affine transformations parameterized by MLPs and Gaussian noise
  • Dual-Scale Contrastive Learning:
    • Super-point level: Contrastive learning between augmented super-point features (U¹ᴬ vs U²)
    • Point level: Contrastive learning between original point features and back-projected super-point features (Uᴾ vs U¹ᴬᴾ)
    • Uses NT-Xent loss with cosine similarity
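The NT-Xent loss named in the last bullet is the standard normalized-temperature cross-entropy from SimCLR; a minimal numpy version is sketched below. The function name and temperature value are my own choices, but the structure (cosine similarity via L2-normalized dot products, self-pairs excluded, row i of each view forming the positive pair) matches the loss as described.

```python
import numpy as np

def nt_xent(feats_1, feats_2, tau=0.1):
    """NT-Xent loss over two batches of features (one row per super-point or
    point): row i of feats_1 and row i of feats_2 are a positive pair, all
    other rows in the combined batch act as negatives."""
    z1 = feats_1 / np.linalg.norm(feats_1, axis=1, keepdims=True)
    z2 = feats_2 / np.linalg.norm(feats_2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)          # 2N x d, unit rows
    n = len(z1)
    sim = z @ z.T / tau                           # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                # exclude self-similarity
    # each sample's positive partner sits n rows away: i <-> i + n
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

Pulling the two augmented views of the same super-point together while pushing all other samples apart is what lets the network learn discriminative features without any labels.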

Experimental Results

Dataset: S3DIS (Stanford 3D Indoor Scene Dataset) - 271 rooms, 13 categories

Comparison with Self-supervised Methods:

Method        Supervision   mIoU (%)
DGCNN         100%          56.1
CrossPoint    0%            58.4
PointSmile    0%            58.9
PointMatch    0.1%          63.4
SPDC (ours)   0%            63.3

Ablation Study Results:

Configuration       mIoU (%)
No pre-training     51.3
No super-point      56.7
No self-attention   59.6
Full SPDC           63.3

Key findings:

  • The pre-training channel provides crucial initialization for the downstream task (51.3% → 63.3% mIoU without vs. with pre-training)
  • Super-point features significantly improve segmentation by expanding the receptive field (56.7% → 63.3% mIoU)
  • The self-attention layer enhances global context modeling (59.6% → 63.3% mIoU)
About the Author

Dr. Shuai Zhang, PhD Student

I am currently a Ph.D. candidate at the Ai4City-Lab, Urban Governance and Design Thrust, Society Hub, The Hong Kong University of Science and Technology (Guangzhou), under the supervision of Prof. Wufan Zhao and Prof. Yuan Liu. Prior to this, I obtained my Master’s degree from the School of Geospatial Engineering and Science, Sun Yat-sen University, where I was advised by Prof. Wuming Zhang and Prof. Yiping Chen.

My research focuses on 3D visual perception, intelligent interpretation and processing of point cloud data, and multi-modal urban foundation models. I am particularly interested in bridging geometric understanding with semantic reasoning in large-scale urban environments, with an emphasis on open-vocabulary learning, training-free paradigms, and cross-modal fusion between 2D and 3D data.

My goal is to develop scalable, interpretable, and generalizable AI systems for urban analysis, enabling applications such as digital twin construction, urban scene understanding, and intelligent infrastructure management.