Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision–language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence.
To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning.
Inspired by the Socratic Method—which fosters knowledge acquisition through a structured process of questioning rather than direct instruction—we designed SocraticAgent.
It is a self-play multi-agent system comprising a Reasoner (pure-text logic) and a Perceiver (visual interface). They engage in a multi-turn dialogue that alternates between reasoning and visual inspection.
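The Reasoner/Perceiver exchange can be pictured as a simple loop. The following is a minimal sketch, not the authors' implementation: the function signatures, the `Trace` container, the `max_turns` budget, and the toy agents are all assumptions made for illustration. The only property taken from the text is that the Reasoner works on text alone while the Perceiver is the one that looks at the image.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; the source does not specify these signatures.
# The Reasoner sees only text (the question plus the dialogue so far) and
# returns either ("ask", visual_query) or ("answer", final_answer).
ReasonerStep = Tuple[str, str]
Reasoner = Callable[[str, List[Tuple[str, str]]], ReasonerStep]
# The Perceiver has access to the image and answers one visual query.
Perceiver = Callable[[object, str], str]


@dataclass
class Trace:
    question: str
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (query, evidence)
    answer: Optional[str] = None


def socratic_dialogue(image, question: str, reasoner: Reasoner,
                      perceiver: Perceiver, max_turns: int = 4) -> Trace:
    """Alternate text-only reasoning with visual inspection until an answer."""
    trace = Trace(question=question)
    for _ in range(max_turns):
        kind, text = reasoner(question, trace.turns)
        if kind == "answer":
            trace.answer = text
            return trace
        # The Reasoner asked for evidence; only the Perceiver touches the image.
        evidence = perceiver(image, text)
        trace.turns.append((text, evidence))
    return trace  # turn budget exhausted; answer stays None


# Toy stand-ins for the two agents (canned behaviour, no real model calls).
def toy_reasoner(question, turns):
    if len(turns) == 0:
        return ("ask", "Is this an airport scene with aircraft and jet bridges?")
    if len(turns) == 1:
        return ("ask", "Is any gate position with a jet bridge empty?")
    return ("answer", "yes" if turns[-1][1] == "yes" else "no")


def toy_perceiver(image, query):
    return "yes"  # a real Perceiver would inspect the image


trace = socratic_dialogue(None, "Is a gate with a jet bridge available?",
                          toy_reasoner, toy_perceiver)
```

In this sketch the recorded `turns` play the role of the synthesized reasoning trace: each entry pairs a language-driven query with the visual evidence returned for it.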
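To enhance and generalize the RS-EoT capability, we employ a two-stage progressive reinforcement learning (RL) pipeline: RL on fine-grained grounding tasks first, to strengthen RS-EoT capabilities, followed by RL on RS VQA, to generalize to broader understanding scenarios.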
Assuming a recently landed aircraft, is there an available gate with a jet bridge for it?
Remote Sensing Evidence-of-Thought Generation
Alright, I will reason in a self Q&A style and give the final reply. Begin with a global check to confirm this is an airport scene and that a recently landed aircraft is present, establishing the overall context before focusing on gates and jet bridges.
To determine if a recently landed aircraft can connect to a jet bridge, the key is whether any gate position is empty; if so, the aircraft can park there. I will ask a targeted question.
Global view confirms an airport with multiple aircraft and jet bridges. Targeted check indicates at least one empty jet bridge gate position. Therefore, a recently landed aircraft could park at that gate.