SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing

Aodi Wu1,2, Haodong Han1,2, Xubo Luo1,2, Ruisuo Wang2, Shan He2, Xue Wan2
1 University of Chinese Academy of Sciences    2 Technology and Engineering Center for Space Utilization, CAS
SpaceMind Overview
SpaceMind operates as a VLM-based decision-control hub that perceives through visual sensors, reasons about the current situation, and issues motion and sensor-control commands. It supports three task types across both a high-fidelity UE5 simulation (5 satellites, 174 runs) and a physical laboratory setup (2 satellites, 18 runs).

Abstract

Autonomous on-orbit servicing demands embodied agents that perceive through visual sensors, reason about 3D spatial situations, and execute multi-phase tasks over extended horizons. We present SpaceMind, a modular and self-evolving vision-language model (VLM) agent framework that decomposes knowledge, tools, and reasoning into three independently extensible dimensions: skill modules with dynamic routing, Model Context Protocol (MCP) tools with configurable profiles, and injectable reasoning-mode skills. An MCP-Redis interface layer enables the same codebase to operate across simulation and physical hardware without modification, and a Skill Self-Evolution mechanism distills operational experience into persistent skill files without model fine-tuning.
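To make the decoupling idea concrete, here is a minimal sketch of a backend-agnostic tool interface. All names here are hypothetical and not taken from the SpaceMind codebase, and an in-process queue stands in for the Redis message bus: the agent publishes abstract tool calls, an environment-specific worker consumes them, and switching from simulation to hardware changes only the worker, never the agent code.

```python
from queue import Queue

class MessageBus:
    """Stand-in for the Redis channel: the agent publishes tool-call
    requests and reads back responses, never touching the backend."""
    def __init__(self):
        self.requests = Queue()
        self.responses = Queue()

    def call_tool(self, name, **kwargs):
        self.requests.put({"tool": name, "args": kwargs})
        worker_step(self)  # in a real system the worker runs in its own process
        return self.responses.get()

# Hypothetical backends: same tool vocabulary, different execution targets.
def sim_execute(tool, args):
    return f"[UE5 sim] {tool}({args})"

def hw_execute(tool, args):
    return f"[robot] {tool}({args})"

BACKEND = sim_execute  # swap to hw_execute with zero agent-code changes

def worker_step(bus):
    # The worker pops one request and dispatches it to whichever backend is active.
    msg = bus.requests.get()
    bus.responses.put(BACKEND(msg["tool"], msg["args"]))

bus = MessageBus()
result = bus.call_tool("move_toward", target="satellite_3", speed=0.2)
print(result)
```

The point of the pattern is that `call_tool` and the tool names form the stable contract; only the function bound to `BACKEND` knows whether commands drive a simulator or real actuators.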

We validate SpaceMind through 192 closed-loop runs across five satellites, three task types, and two environments (a UE5 simulation and a physical laboratory), deliberately including degraded conditions to stress-test robustness. Under nominal conditions, all reasoning modes achieve 90–100% navigation success; under degradation, only the Prospective mode succeeds in search-and-approach tasks where the other modes fail. In a self-evolution study, the agent recovers in four of six groups after learning from a single failed episode, including a recovery from complete failure to 100% success and an inspection score improving from 12 to 59 out of 100. Real-world validation confirms zero-code-modification transfer to a physical robot with 100% rendezvous success.

Key Results

192
Closed-loop runs across 5 satellites, 3 task types, and 2 environments
90–100%
Navigation success under nominal conditions across all reasoning modes
4/6
Groups recover from failure through self-evolution after a single episode

Architecture

SpaceMind Architecture
SpaceMind architecture. The VLM decision core receives visual observations and context from the skill layer, reasons through one of three modes (Standard, ReAct, Prospective), and issues tool calls via MCP. The Redis message bus decouples the agent from environment-specific backends, enabling the same codebase to operate across UE5 simulation and physical hardware.
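The "injectable reasoning-mode" design described above can be sketched as a simple strategy registry. This is an illustrative sketch only, with hypothetical names; SpaceMind's actual prompts and control flow are not reproduced here. Each mode is a function with a common signature, so a new mode is added by registering it, leaving the decision loop untouched.

```python
# Hypothetical sketch of injectable reasoning modes (names invented for illustration).

def standard(obs):
    # Single-pass: map the observation directly to an action.
    return {"mode": "Standard", "action": f"act_on({obs})"}

def react(obs):
    # Interleave an explicit thought with the action, ReAct-style.
    return {"mode": "ReAct", "thought": f"reason about {obs}",
            "action": f"act_on({obs})"}

def prospective(obs):
    # Look one step ahead before committing, e.g. anticipating degraded sensing.
    return {"mode": "Prospective",
            "lookahead": f"predicted outcome of act_on({obs})",
            "action": f"act_on({obs})"}

REASONING_MODES = {"standard": standard, "react": react, "prospective": prospective}

def decide(mode_name, obs):
    # The core loop only knows the registry; modes are injected by registration.
    return REASONING_MODES[mode_name](obs)

print(decide("prospective", "target occluded")["mode"])
```

Under this pattern, swapping reasoning behavior is a configuration choice rather than a code change, which matches the paper's claim that reasoning modes are independently extensible.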

Video

Coming soon

A demo video showcasing SpaceMind operating in both UE5 simulation and the physical laboratory will be available here.

BibTeX

@article{wu2026spacemind,
  title={SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing},
  author={Wu, Aodi and Han, Haodong and Luo, Xubo and Wang, Ruisuo and He, Shan and Wan, Xue},
  journal={Acta Astronautica},
  year={2026},
  note={Under review}
}