SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation

Aodi Wu1,2, Jianhong Zuo3, Zeyuan Zhao1,2, Xubo Luo1,2, Ruisuo Wang2, Xue Wan2
1 University of Chinese Academy of Sciences
2 Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences
3 Nanjing University of Aeronautics and Astronautics

SpaceSense-Bench provides high-fidelity simulated observations for spacecraft perception, combining synchronized RGB images, depth maps, LiDAR point clouds, dense part labels, and accurate 6-DoF poses.

Abstract

Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative navigation of target spacecraft, yet acquiring large-scale real data in orbit remains prohibitively expensive. Existing synthetic datasets, moreover, suffer from limited target diversity, single-modality sensing, and incomplete ground-truth annotations. To bridge these gaps, we present SpaceSense-Bench, a large-scale multi-modal benchmark for spacecraft perception encompassing 136 satellite models with approximately 70 GB of data. Each frame provides time-synchronized 1024×1024 RGB images, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with dense 7-class part-level semantic labels at both the pixel and point level as well as accurate 6-DoF pose ground truth. The dataset is generated through a high-fidelity space simulation built in Unreal Engine 5 and a fully automated pipeline covering data acquisition, multi-stage quality control, and conversion to mainstream formats. Comprehensive benchmarks on object detection, 2D semantic segmentation, RGB-LiDAR fusion 3D point cloud segmentation, monocular depth estimation, and orientation estimation reveal two key findings: (i) perceiving small-scale components such as thrusters and omni-antennas and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research.

Data Generation Pipeline

Data generation pipeline of SpaceSense-Bench

The pipeline consists of four stages: (1) 3D asset library construction and part decomposition, (2) high-fidelity space scene setup in UE5, (3) adaptive trajectory planning and multi-sensor synchronized capture, and (4) automated ground-truth generation, quality control, and mainstream format export.
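Stage (3), adaptive trajectory planning with synchronized capture, can be sketched as sampling viewpoints on a sphere around the target and orienting the camera toward it. The function names and axis conventions below are illustrative assumptions, not the pipeline's actual implementation:

```python
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Rotation whose +Z camera axis points from cam_pos toward the target."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    # Rows are the camera axes expressed in world coordinates.
    return np.stack([right, true_up, forward])

def sample_orbit_viewpoints(radius, n=64, seed=0):
    """Uniformly sample n camera positions on a sphere of the given radius."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n, 3))          # isotropic Gaussian ...
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # ... normalized = uniform on sphere
    return radius * v
```

An adaptive planner would additionally vary `radius` with target size so the spacecraft fills a consistent fraction of the 1024×1024 frame.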

Dataset Samples

Multi-modal data samples from SpaceSense-Bench

Each column shows one satellite. From top to bottom: RGB image with 6-DoF pose axes overlay, seven-class semantic segmentation mask, LiDAR point cloud with per-point semantic labels, and colorized depth map.
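Millimeter-precision depth maps are commonly stored as 16-bit unsigned integers in millimeters, with 0 marking pixels with no return. The encoding below is an assumption about the on-disk format for illustration, not a documented property of the dataset:

```python
import numpy as np

def encode_depth_mm(depth_m):
    """Quantize metric depth (float, meters) to uint16 millimeters."""
    mm = np.round(np.clip(depth_m, 0.0, 65.535) * 1000.0)
    return mm.astype(np.uint16)

def decode_depth_mm(depth_u16):
    """Recover metric depth in meters; invalid pixels (0) become NaN."""
    depth = depth_u16.astype(np.float64) / 1000.0
    depth[depth_u16 == 0] = np.nan
    return depth
```

The uint16 range caps representable depth at 65.535 m at 1 mm resolution; longer ranges would need a coarser scale factor or a wider integer type.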

Dataset Highlights

136

Satellite models with diverse geometries and structures

~70 GB

Large-scale benchmark data generated in a high-fidelity simulator

3 Modalities

1024×1024 RGB, millimeter-precision depth, and 256-beam LiDAR

7 Classes

Dense part-level semantic labels at both pixel and point levels

6-DoF

Accurate relative pose annotations for each frame

UE5 Pipeline

Automated generation, quality control, and conversion workflow

Visual Overview

Benchmark Tasks and Findings

Supported Tasks

  • Object detection
  • 2D semantic segmentation
  • RGB-LiDAR fusion 3D point cloud segmentation
  • Monocular depth estimation
  • Orientation estimation

Key Findings

  • Small components such as thrusters and omni-antennas remain difficult to perceive reliably.
  • Zero-shot transfer to completely unseen spacecraft is still a major open challenge.
  • Increasing the number and diversity of training satellites substantially improves generalization.
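The scaling finding suggests an evaluation protocol with a fixed set of held-out spacecraft and nested training subsets of growing size. A hypothetical sketch of such a split over the 136 model IDs (the fractions and subset sizes are illustrative, not the paper's protocol):

```python
import random

def scaling_splits(model_ids, test_frac=0.2, train_sizes=(8, 32, 64, 108), seed=0):
    """Hold out unseen satellites, then nest growing training subsets
    inside the remainder so larger runs strictly contain smaller ones."""
    ids = sorted(model_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    test, pool = ids[:n_test], ids[n_test:]
    assert max(train_sizes) <= len(pool), "largest subset must fit in the pool"
    return test, {k: pool[:k] for k in train_sizes}
```

Nesting the subsets keeps the scaling curve monotone in data rather than confounded by resampling.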

(a) 2D Semantic Segmentation — per-class IoU (%)

| Model | Backbone | aAcc | mIoU | body | solar | dish | omni | payload | thruster | adapter |
|---|---|---|---|---|---|---|---|---|---|---|
| FCN | ResNet-50 | 99.16 | 41.30 | 68.1 | 82.9 | 30.7 | 1.0 | 19.0 | 12.8 | 16.1 |
| DeepLabV3+ | ResNet-50 | 99.19 | 43.70 | 68.1 | 83.2 | 38.2 | 1.0 | 21.5 | 17.7 | 20.2 |
| SegFormer | MiT-B3 | 99.27 | 45.14 | 71.2 | 87.6 | 39.7 | 2.9 | 22.6 | 20.1 | 17.3 |
| Mask2Former | Swin-B | 99.28 | 45.63 | 71.4 | 88.6 | 26.3 | 1.9 | 19.1 | 24.6 | 33.3 |
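The per-class IoU metric reported above is intersection over union computed from a confusion matrix between predicted and ground-truth label maps. A minimal sketch:

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU and mIoU from integer label maps via a confusion matrix."""
    mask = gt < num_classes  # ignore void / out-of-range labels
    cm = np.bincount(num_classes * gt[mask].astype(int) + pred[mask].astype(int),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - tp   # pred pixels + gt pixels - overlap
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    return iou, np.nanmean(iou)
```

Classes absent from both prediction and ground truth yield NaN and are excluded from the mean rather than counted as zero.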

(b) Object Detection (YOLO26) — per-class AP@0.5 (%)

| Model | Scale | Prec. | mAP50 | body | solar | dish | omni | payload | thruster | adapter |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLO26 | Nano | 48.8 | 33.9 | 89.0 | 78.0 | 19.1 | 3.6 | 8.6 | 8.4 | 30.5 |
| YOLO26 | Small | 53.4 | 37.1 | 89.4 | 80.4 | 19.7 | 7.9 | 8.8 | 10.8 | 42.8 |
| YOLO26 | Medium | 54.6 | 39.0 | 91.3 | 82.3 | 21.6 | 6.4 | 10.0 | 16.5 | 45.1 |
| YOLO26 | Large | 52.6 | 39.5 | 90.5 | 81.9 | 27.1 | 6.0 | 10.8 | 15.3 | 45.2 |
| YOLO26 | XLarge | 56.1 | 41.3 | 91.0 | 82.5 | 23.7 | 8.0 | 9.1 | 23.3 | 51.6 |
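AP@0.5 counts a detection as a true positive when its box overlaps a ground-truth box of the same class with IoU of at least 0.5. The underlying box-IoU computation, sketched for axis-aligned `(x1, y1, x2, y2)` boxes:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0
```

The low thruster and omni numbers follow directly from this criterion: tiny boxes shift IoU sharply with even small localization errors, so few predictions clear the 0.5 threshold.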

(c) 3D Point Cloud Segmentation (PMFNet, RGB+LiDAR) — per-class IoU (%)

| Model | Backbone | mAcc | mIoU | body | solar | dish | omni | payload | thruster | adapter |
|---|---|---|---|---|---|---|---|---|---|---|
| PMFNet | ResNet-34 | 57.5 | 42.4 | 68.8 | 85.8 | 51.7 | 8.9 | 21.9 | 25.2 | 34.2 |
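RGB-LiDAR fusion methods of this kind typically project each LiDAR point into the image to sample per-point color or features. A minimal pinhole-projection sketch, assuming points already expressed in the camera frame and an illustrative intrinsic matrix `K` (not the benchmark's actual calibration):

```python
import numpy as np

def project_points(points_cam, K):
    """Project camera-frame 3D points to pixel coordinates with intrinsics K."""
    z = points_cam[:, 2]
    valid = z > 0                      # keep only points in front of the camera
    uvw = (K @ points_cam[valid].T).T  # homogeneous image coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective divide
    return uv, valid
```

After projection, bilinear sampling of the RGB image (or a 2D feature map) at `uv` gives each surviving point its appearance feature for fusion.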

(d) Monocular Depth Estimation (Depth Anything V2, zero-shot)

| Model | Backbone | AbsRel ↓ | SqRel ↓ | RMSE (m) ↓ | RMSElog ↓ | δ<1.25 ↑ | Spearman ↑ |
|---|---|---|---|---|---|---|---|
| DA-V2 | ViT-S | 0.0236 | 0.0317 | 0.747 | 0.0319 | 99.77% | 0.555 |
| DA-V2 | ViT-B | 0.0227 | 0.0304 | 0.746 | 0.0312 | 99.77% | 0.578 |
| DA-V2 | ViT-L | 0.0223 | 0.0304 | 0.757 | 0.0307 | 99.77% | 0.602 |
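The columns above are the standard monocular depth metrics, computed over valid (positive, finite) pixels. A compact reference implementation:

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel, SqRel, RMSE, RMSElog, and delta<1.25 over valid pixels."""
    m = np.isfinite(gt) & (gt > 0) & np.isfinite(pred) & (pred > 0)
    p, g = pred[m], gt[m]
    abs_rel = np.mean(np.abs(p - g) / g)
    sq_rel = np.mean((p - g) ** 2 / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return abs_rel, sq_rel, rmse, rmse_log, delta1
```

Since zero-shot affine-invariant predictors output relative depth, an alignment step (e.g., least-squares scale and shift against ground truth) is normally applied before these metrics; the Spearman column is rank-based and needs no such alignment.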

(e) Orientation Estimation (Orient Anything, DINOv2-Large, zero-shot)

| Model | Backbone | MAAE ↓ | Median ↓ | <10° ↑ | <20° ↑ | <30° ↑ | <45° ↑ |
|---|---|---|---|---|---|---|---|
| Orient-Any. | DINOv2-L | 12.75° | 10.56° | 53.7% | 78.2% | 91.7% | 98.8% |
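A common way to score orientation predictions is the geodesic angle between predicted and ground-truth rotations; the accuracy columns then threshold this angle. A sketch of that error (the benchmark's exact MAAE definition may aggregate per-axis angles differently):

```python
import numpy as np

def geodesic_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

For example, a prediction off by a pure 30° spin about one axis scores exactly 30°, so it would count toward the <45° bucket but not <20°.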

BibTeX

@article{wu2026spacesensebench,
  title={SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation},
  author={Wu, Aodi and Zuo, Jianhong and Zhao, Zeyuan and Luo, Xubo and Wang, Ruisuo and Wan, Xue},
  year={2026},
  url={https://arxiv.org/abs/2603.09320}
}