<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LiDAR | The personal website of Shuai Zhang</title><link>https://www.shuaizhang-hkust.cn/tags/lidar/</link><atom:link href="https://www.shuaizhang-hkust.cn/tags/lidar/index.xml" rel="self" type="application/rss+xml"/><description>LiDAR</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 08 May 2026 00:00:00 +0000</lastBuildDate><image><url>https://www.shuaizhang-hkust.cn/media/icon_hu_1e527f1804e84236.png</url><title>LiDAR</title><link>https://www.shuaizhang-hkust.cn/tags/lidar/</link></image><item><title>UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition</title><link>https://www.shuaizhang-hkust.cn/publications/preprint/unid-shift/</link><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><guid>https://www.shuaizhang-hkust.cn/publications/preprint/unid-shift/</guid><description>&lt;h2 id="highlights"&gt;Highlights&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified Multimodal Framework&lt;/strong&gt;: Joint 2D-3D semantic segmentation achieving high single-domain accuracy and strong cross-domain generalization through structured feature interaction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared-Private Decomposition&lt;/strong&gt;: Explicit disentanglement of modality-invariant and modality-specific representations, improving semantic alignment and interpretability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SAM + SPTNet Dual-Branch Design&lt;/strong&gt;: Integrates a SAM-based vision encoder with a sparse convolution-transformer backbone to combine semantic richness and geometric precision (a schematic sketch follows this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SOTA Performance&lt;/strong&gt;: 81.0% mIoU on nuScenes validation, 81.2% on nuScenes test, and 71.8% on SemanticKITTI, outperforming 2DPASS, CSFNet, and other multimodal fusion baselines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-Domain Robustness&lt;/strong&gt;: 74.5% mIoU on nuScenes USA→Singapore cross-domain benchmark, demonstrating strong generalization under distribution shifts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interpretable Fusion&lt;/strong&gt;: Shared Attention Fusion (SAF), combined with Gram alignment and decorrelation regularization, ensures stable optimization and meaningful feature separation.&lt;/li&gt;
&lt;/ul&gt;
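&lt;p&gt;Before the fuller method overview below, the following is a minimal, self-contained PyTorch schematic of this dual-branch layout. The two encoder modules are plain stand-ins for the SAM image encoder and the SPTNet point backbone (their interfaces and dimensions are assumptions of this sketch, not the released code).&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Minimal schematic of the dual-branch layout: a 2D image branch and a 3D
# point branch each produce per-token features that later stages decompose
# and fuse. The two encoders below are toy stand-ins, not SAM or SPTNet.
import torch
import torch.nn as nn


class Toy2DBranch(nn.Module):
    """Stand-in for the SAM-based image encoder (per-pixel features)."""

    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=3, padding=1)

    def forward(self, images):
        feat = self.conv(images)                 # (B, C, H, W)
        return feat.flatten(2).transpose(1, 2)   # (B, H*W, C) token layout


class Toy3DBranch(nn.Module):
    """Stand-in for the SPTNet sparse-conv/transformer point backbone."""

    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points_xyz):
        return self.mlp(points_xyz)              # (B, N_points, C)


images = torch.randn(2, 3, 64, 96)      # toy RGB batch
points = torch.randn(2, 4096, 3)        # toy LiDAR coordinates
feat_2d = Toy2DBranch()(images)         # per-pixel features from the 2D branch
feat_3d = Toy3DBranch()(points)         # per-point features from the 3D branch
&lt;/code&gt;&lt;/pre&gt;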
&lt;h2 id="method-overview"&gt;Method Overview&lt;/h2&gt;
&lt;p&gt;The proposed UniD-Shift framework adopts a dual-branch architecture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;3D Branch&lt;/strong&gt;: An SPTNet backbone (sparse convolution + transformer) extracts hierarchical geometric features from LiDAR point clouds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2D Branch&lt;/strong&gt;: A SAM-based vision encoder provides semantically rich visual representations from RGB images.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared-Private Decomposition&lt;/strong&gt;: Features from both modalities are decomposed into shared (modality-invariant semantics) and private (modality-specific) components.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared Attention Fusion (SAF)&lt;/strong&gt;: The shared components are fused via cross-attention, with the 3D branch providing the queries (3D→2D), to produce a consistent multimodal representation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regularized Training&lt;/strong&gt;: Gram matrix alignment encourages consistency between the shared representations, while a decorrelation loss promotes independence between the shared and private subspaces (see the sketch after this list).&lt;/li&gt;
&lt;/ol&gt;
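&lt;p&gt;As a rough illustration of steps 3–5, the sketch below shows one way the shared-private decomposition, Shared Attention Fusion, and the two regularizers could be wired up in PyTorch. The projections, attention configuration, and loss forms are assumptions made for this sketch (for example, channel-wise Gram matrices are used so the 3D and 2D token counts need not match); it is not the authors' implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Illustrative sketch of the shared-private decomposition, the Shared
# Attention Fusion (SAF) step, and the Gram-alignment / decorrelation
# regularizers described above. Dimensions, projections, and loss forms are
# assumptions for this sketch, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedPrivateDecomposition(nn.Module):
    """Split one modality's features into shared and private components."""

    def __init__(self, dim):
        super().__init__()
        self.to_shared = nn.Linear(dim, dim)   # modality-invariant subspace
        self.to_private = nn.Linear(dim, dim)  # modality-specific subspace

    def forward(self, x):
        return self.to_shared(x), self.to_private(x)


class SharedAttentionFusion(nn.Module):
    """Cross-attention over shared components (3D features act as queries)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, shared_3d, shared_2d):
        # shared_3d: (B, N_points, C) queries; shared_2d: (B, N_pixels, C) keys/values.
        fused, _ = self.attn(shared_3d, shared_2d, shared_2d)
        return fused


def gram_alignment_loss(shared_3d, shared_2d):
    # Channel-wise Gram matrices are independent of token count, so the 3D and
    # 2D shared features can be compared even when N_points differs from N_pixels.
    g3 = torch.matmul(shared_3d.transpose(1, 2), shared_3d) / shared_3d.shape[1]
    g2 = torch.matmul(shared_2d.transpose(1, 2), shared_2d) / shared_2d.shape[1]
    return F.mse_loss(g3, g2)


def decorrelation_loss(shared, private):
    # Penalize cross-correlation between the shared and private subspaces of
    # one modality, pushing them toward (approximate) independence.
    s = shared - shared.mean(dim=1, keepdim=True)
    p = private - private.mean(dim=1, keepdim=True)
    cross = torch.matmul(s.transpose(1, 2), p) / s.shape[1]
    return cross.pow(2).mean()


# Toy usage with random stand-ins for the SPTNet (3D) and SAM (2D) features.
B, n_pts, n_pix, C = 2, 4096, 1024, 64
feat_3d = torch.randn(B, n_pts, C)   # stand-in for SPTNet point features
feat_2d = torch.randn(B, n_pix, C)   # stand-in for SAM image features

decomp_3d = SharedPrivateDecomposition(C)
decomp_2d = SharedPrivateDecomposition(C)
saf = SharedAttentionFusion(C)

s3, p3 = decomp_3d(feat_3d)
s2, p2 = decomp_2d(feat_2d)
fused = saf(s3, s2)                  # consistent per-point multimodal features

reg = gram_alignment_loss(s3, s2) + decorrelation_loss(s3, p3) + decorrelation_loss(s2, p2)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this sketch only the shared components enter the fusion step, while each modality keeps its private component, mirroring the separation that steps 3 and 4 above describe.&lt;/p&gt;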
&lt;h2 id="key-results"&gt;Key Results&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;UniD-Shift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;nuScenes Validation&lt;/td&gt;
&lt;td&gt;mIoU (%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nuScenes Test&lt;/td&gt;
&lt;td&gt;mIoU (%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SemanticKITTI Validation&lt;/td&gt;
&lt;td&gt;mIoU (%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nuScenes USA→Singapore&lt;/td&gt;
&lt;td&gt;mIoU (%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The framework achieves competitive computational efficiency with a &lt;strong&gt;240 ms&lt;/strong&gt; inference latency on SemanticKITTI, demonstrating a practical balance between accuracy and speed.&lt;/p&gt;
&lt;h2 id="citation"&gt;Citation&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bibtex" data-lang="bibtex"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nc"&gt;@misc&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;zhang2026unidshift&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{Shuai Zhang and Zhecheng Shi and Zhuxiao Li and Jing Ou and Tengxi Wang and Yuan Liu and Wufan Zhao}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;year&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{2026}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;eprint&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{2605.07356}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;archivePrefix&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{arXiv}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;primaryClass&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{cs.CV}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;{https://arxiv.org/abs/2605.07356}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description></item></channel></rss>