SPTNet: Sparse Convolution and Transformer Network for Woody and Foliage Components Separation From Point Clouds

Mar 18, 2024

Dr. Shuai Zhang (first author), Yiping Chen (corresponding author), Biao Wang, Dong Pan, Wuming Zhang (corresponding author), Aiguang Li
[Figure: SPTNet architecture overview showing SpConv and Transformer blocks]
Abstract
The separation of woody and foliage components is beneficial for estimating the physical parameters of forests. However, many current methods incur high computational costs, rely on extensive prior knowledge, and generalize poorly across light detection and ranging (LiDAR) sensors and tree species. In this article, SPTNet, a network that combines sparse convolution (SpConv) and transformer blocks, is proposed for separating the woody and foliage components of tree point clouds. The SpConv block enables efficient and effective local feature extraction, while the transformer block compensates for the limited global feature extraction of SpConv blocks. Point feature extraction blocks, called the morphological detection coefficient (MDC) and normal difference operator (NDO), were specifically developed to aid the segmentation task, and distinct adaptive radius strategies are implemented for each geometric feature block to minimize the need for a priori knowledge. Eight datasets covering different tree species were used to evaluate the method, including a simulated larch dataset; the remaining datasets consist of real trees spanning seven distinct species, along with a large tropical tree dataset. Our experimental results demonstrate that our method attains state-of-the-art performance across all datasets. Notably, SPTNet obtains an overall classification accuracy (OA) of 94.69% and a mean intersection-over-union (mIoU) of 89.96% on the large tropical dataset, which encompasses 15 tree species, outperforming FWCNN, the current leading branch and leaf separation approach, by 0.43% OA and 0.72% mIoU.
Publication
IEEE Transactions on Geoscience and Remote Sensing, vol. 62, 5702718

Highlights

  • Novel Hybrid Architecture: First network to combine Sparse Convolution (SpConv) with Transformer blocks for tree component separation, balancing local feature extraction efficiency with global context modeling capabilities.
  • Pure Geometry-Based Approach: Uses only XYZ coordinates without relying on radiometric features such as laser return intensity (LRI), ensuring sensor-agnostic generalization across different LiDAR devices and wavelengths.
  • Adaptive Geometric Features: Introduces two innovative feature blocks—Morphological Detection Coefficient (MDC) and Normal Difference Operator (NDO)—with adaptive radius determination strategies (minimum entropy and Otsu methods) to eliminate manual parameter tuning.
  • State-of-the-Art Performance: Achieves 94.69% OA and 89.96% mIoU on large tropical tree dataset (15 species), outperforming FWCNN by 0.43% OA and 0.72% mIoU.
  • High Efficiency: Training requires only ~8 seconds per epoch and converges in 48 epochs (vs. 15-45 s/epoch for competing methods), with a model less than 1/3 the size of PointNet++.

Methodology

The proposed SPTNet consists of three main components:

  1. Geometric Feature Extraction (Preprocessing):

    • MDC (Morphological Detection Coefficient): Characterizes local point distribution patterns using eigenvalues of covariance matrix. Values > 0.5 indicate linear distribution (woody), < 0.5 indicate planar distribution (foliage). Adaptive radius determined via minimum entropy principle.
    • NDO (Normal Difference Operator): Measures normal vector variations within neighborhoods. Woody components show larger normal differences (cylindrical structure) vs. foliage (planar structure). Adaptive radius determined via Otsu thresholding.
    • Input Representation: 7-dimensional vector (XYZ + MDC + NDO) organized in voxel-hash structure.
  2. Network Architecture (UNet-based Encoder-Decoder):

    • Encoder: 5 SpConv blocks (local feature extraction) + 5 Transformer blocks (global context aggregation)
      • SpConv Block: 3 SubMConv layers + 1 SpConv downsampling layer with skip connections
      • Transformer Block: Multi-head self-attention (8 heads) with skip connections from SpConv features
    • Decoder: 4 inverse SpConv blocks for upsampling and dimensionality reduction
    • Output: Linear layer for binary classification (woody vs. foliage)
  3. Data Organization:

    • Voxelization: 0.02m voxel size with average pooling for point-to-voxel mapping
    • Sparse Convolution: Exploits data sparsity to reduce computational cost while preserving geometric details
    • Augmentation: Random sampling (density variation) + rotation (< 45°) for training robustness
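The paper's exact MDC formula is not reproduced here; as an illustrative sketch, a coefficient with the stated behavior (values above 0.5 for linear woody neighborhoods, below 0.5 for planar foliage) can be built from the eigenvalues of the local covariance matrix, in the style of standard linearity features. The function names, the formula (λ1 − λ2)/λ1, and the synthetic neighborhoods below are assumptions for illustration, not the authors' implementation; the NDO would be computed analogously from normal-vector differences within the adaptive neighborhood.

```python
import numpy as np

def local_eigenvalues(points, center, radius):
    """Sorted (descending) eigenvalues of the covariance matrix of the
    points lying within `radius` of `center`."""
    nbrs = points[np.linalg.norm(points - center, axis=1) < radius]
    return np.sort(np.linalg.eigvalsh(np.cov(nbrs.T)))[::-1]

def mdc(points, center, radius):
    """Hypothetical linearity-style coefficient: values near 1 suggest a
    linear (woody) neighborhood, values near 0 a planar or scattered
    (foliage) one, matching the paper's >0.5 / <0.5 decision rule."""
    l1, l2, l3 = local_eigenvalues(points, center, radius)
    return (l1 - l2) / l1

# Synthetic neighborhoods: a thin line (branch-like) and a thin plane (leaf-like).
rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, (500, 1))
line = np.hstack([t, 0.01 * rng.standard_normal((500, 2))])
plane = np.hstack([rng.uniform(-1, 1, (500, 2)), 0.01 * rng.standard_normal((500, 1))])

print(mdc(line, np.zeros(3), 0.5))   # close to 1 -> woody-like
print(mdc(plane, np.zeros(3), 0.5))  # close to 0 -> foliage-like
```

Per-point values of this kind, together with an NDO-style feature, would then be appended to the XYZ coordinates and organized in the voxel-hash structure described above.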

Datasets

Comprehensive validation across diverse forest environments:

| Dataset  | Type      | Species    | Samples  | Characteristics             |
|----------|-----------|------------|----------|-----------------------------|
| Spruce   | Real      | Picea      | Multiple | Coniferous                  |
| Aspen    | Real      | Populus    | Multiple | Broadleaf                   |
| Poplar   | Real      | Populus    | Multiple | Broadleaf                   |
| Birch    | Real      | Betula     | Multiple | Broadleaf                   |
| Pine     | Real      | Pinus      | Multiple | Coniferous                  |
| Maple    | Real      | Acer       | Multiple | Broadleaf                   |
| Larch    | Simulated | Larix      | 100      | Synthetic (Helios++)        |
| Tropical | Real      | 15 species | 61 trees | Cameroon, 8.7-53.6 m height |
  • Data Sources: Canada, Finland, Cameroon (real); OnyxTree simulation (synthetic)
  • Sensors: Riegl VZ-1000, various TLS devices
  • Annotation: Manual labeling for woody/foliage components

Experimental Results

Quantitative Performance (Eight Datasets):

| Dataset  | OA (%) | FA (%) | WA (%) | F1 Score | mIoU (%) |
|----------|--------|--------|--------|----------|----------|
| Spruce   | 92.20  | 99.18  | 97.62  | 0.923    | 93.70    |
| Aspen    | 90.80  | 98.67  | 89.50  | 0.907    | 92.95    |
| Poplar   | 86.35  | 98.78  | 88.80  | 0.855    | 85.70    |
| Birch    | 98.05  | 99.74  | 90.81  | 0.912    | 96.92    |
| Pine     | 96.01  | 99.95  | 99.64  | 0.959    | 92.20    |
| Maple    | 97.34  | 99.51  | 97.29  | 0.973    | 94.82    |
| Larch    | 92.43  | 97.21  | 78.66  | 0.791    | 78.21    |
| Tropical | 94.69  | 97.10  | 91.09  | 0.896    | 89.96    |

Comparison with State-of-the-Art Methods:

| Method | Tropical OA (%) | Tropical mIoU (%) | Birch OA (%) | Birch mIoU (%) |
|--------|-----------------|-------------------|--------------|----------------|
| LWCLF  | 92.45           | 87.23             | 94.12        | 89.45          |
| LeWoS  | 91.88           | 86.91             | 93.56        | 88.76          |
| FWCNN  | 94.26           | 89.24             | 96.78        | 95.18          |
| SPTNet | 94.69           | 89.96             | 98.05        | 96.92          |

Efficiency Comparison:

| Method            | Training Time/Epoch | Convergence Epochs | Model Size | Inference Time (50k points) |
|-------------------|---------------------|--------------------|------------|-----------------------------|
| PointNet++        | ~15 s               | 245                | Large      | >10,000 ms                  |
| Point Transformer | ~45 s               | 90                 | Large      | >60,000 ms                  |
| PointBert         | ~15 s               | 102                | Large      | >60,000 ms                  |
| PointMLP          | ~10 s               | 198                | Medium     | ~8,000 ms                   |
| SPTNet            | ~8 s                | 48                 | Small      | ~3,391 ms                   |

Key Findings

  1. Ablation Studies:

    • SpConv vs. Transformer: Removing SpConv blocks causes larger accuracy drop than removing Transformer blocks, indicating local geometric features are crucial for woody/foliage discrimination.
    • Feature Importance: NDO features contribute more than MDC features; combining both yields optimal results (OA improvement of 2-3% over using coordinates only).
    • Backbone Comparison: SPTNet outperforms PointNet++, DGCNN, Point Transformer, and PointBert on all test datasets while maintaining 3-6× faster training speed.
  2. Robustness Analysis:

    • Occlusion Resistance: Maintains >90% OA even with 40% random point masking and degrades gracefully at 60-80% masking (at 80% masking, OA drops to ~92% and mIoU to ~77%).
    • Cross-Species Generalization: Successfully handles both coniferous (Pine: 99.64% WA) and broadleaf species, as well as complex tropical trees with irregular structures.
  3. Adaptive vs. Fixed Radius:

    • Adaptive radius strategies (minimum entropy for MDC, Otsu for NDO) eliminate manual parameter tuning while achieving comparable or better accuracy than manually optimized fixed radii (0.1-0.5m range tested).
  4. Sensor Independence:

    • Pure geometry-based approach (no radiometric features) enables seamless application to data from different LiDAR sensors without recalibration or retraining for specific wavelength characteristics.
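The minimum-entropy radius strategy can be sketched along the lines of classical optimal-neighborhood selection: for each candidate radius, compute the linearity/planarity/scattering probabilities from local covariance eigenvalues, and keep the radius whose Shannon entropy over those three probabilities is smallest (i.e., whose dimensionality is least ambiguous). The feature definitions, the candidate range (matching the 0.1-0.5 m range tested in the paper), and the function names below are assumptions for illustration, not the authors' exact formulation; the Otsu variant for NDO would replace the entropy criterion with Otsu thresholding on the feature histogram.

```python
import numpy as np

def dimensionality_entropy(points, center, radius):
    """Shannon entropy of the linearity/planarity/scattering probabilities
    for the neighborhood of `center` at the given radius; low entropy means
    the neighborhood is unambiguously 1-D, 2-D, or 3-D."""
    nbrs = points[np.linalg.norm(points - center, axis=1) < radius]
    if len(nbrs) < 4:
        return np.inf  # too few neighbors to estimate a covariance reliably
    l1, l2, l3 = np.sort(np.linalg.eigvalsh(np.cov(nbrs.T)))[::-1]
    lin, pla, sca = (l1 - l2) / l1, (l2 - l3) / l1, l3 / l1  # sum to 1
    eps = 1e-12  # guard against log(0)
    return -sum(p * np.log(p + eps) for p in (lin, pla, sca))

def adaptive_radius(points, center, candidates=np.linspace(0.1, 0.5, 9)):
    """Minimum-entropy radius selection: pick the candidate radius whose
    neighborhood has the most distinct dimensionality."""
    return min(candidates, key=lambda r: dimensionality_entropy(points, center, r))
```

Selecting the radius per point this way is what removes the manually tuned fixed radius from the feature-extraction stage.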

Limitations and Future Work

  • Voxel Resolution Sensitivity: Performance depends on voxel size selection (0.02m used); smaller voxels improve accuracy but increase computational cost.
  • Scale Constraints: Currently tested at single-tree level; extension to plot-scale processing with multiple trees and ground points requires further development.
  • Occlusion Handling: While robust to moderate occlusion (up to 60%), severe occlusion (>80%) significantly degrades performance due to loss of neighborhood geometric information.
  • Future Directions: Extension to airborne LiDAR data, semantic segmentation of urban scenes, and integration with instance segmentation for complete forest inventory workflows.
About the First Author

I am currently a Ph.D. candidate at the Ai4City-Lab, Urban Governance and Design Thrust, Society Hub, The Hong Kong University of Science and Technology (Guangzhou), under the supervision of Prof. Wufan Zhao and Prof. Yuan Liu. Prior to this, I obtained my Master’s degree from the School of Geospatial Engineering and Science, Sun Yat-sen University, where I was advised by Prof. Wuming Zhang and Prof. Yiping Chen.

My research focuses on 3D visual perception, intelligent interpretation and processing of point cloud data, and multi-modal urban foundation models. I am particularly interested in bridging geometric understanding with semantic reasoning in large-scale urban environments, with an emphasis on open-vocabulary learning, training-free paradigms, and cross-modal fusion between 2D and 3D data.

My goal is to develop scalable, interpretable, and generalizable AI systems for urban analysis, enabling applications such as digital twin construction, urban scene understanding, and intelligent infrastructure management.