Asking like Socrates: Socrates helps VLMs understand
remote sensing images

Run Shao1,2,* Ziyu Li1,* Zhaoyang Zhang1 Linrui Xu1 Xinran He2 Hongyuan Yuan1,2 Bolei He2 Yongxing Dai2 Yiming Yan3 Yijun Chen3 Wang Guo1 Haifeng Li1,†
1School of Geosciences and Info-Physics, Central South University 2Baidu Inc. 3School of Earth Sciences, Zhejiang University
*Equal contribution †Corresponding author

Rethinking Reasoning in Remote Sensing

Figure 1: Illustration of the pseudo reasoning problem and our RS-EoT solution. (a) Existing models show pseudo reasoning (thinking without gain). (b) We attribute this to the "Glance Effect". (c) Our model, RS-EoT-7B, successfully solves the task by iteratively reasoning and seeking visual evidence.

Abstract

Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision–language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence.

To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning.

Methodology: Asking like Socrates

1. SocraticAgent

We designed SocraticAgent after the Socratic Method, which fosters knowledge acquisition through structured questioning rather than direct instruction.

It is a self-play multi-agent system comprising a Reasoner (pure-text logic) and a Perceiver (visual interface). They engage in a multi-turn dialogue:

  • The Reasoner poses incremental perceptual questions.
  • The Perceiver provides grounded visual answers.
  • A self-play prompting mechanism ensures detailed, step-by-step reasoning traces without "jumping" to conclusions.
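The dialogue described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `Reasoner` and `Perceiver` callables, the `"ask"`/`"answer"` protocol, the `Trace` container, and the `max_turns` limit are all hypothetical names chosen for this example. In a real system each callable would wrap a model call; here they are replaced by toy stubs so the control flow can be run as-is.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (perceptual question, visual answer)

# Hypothetical interfaces (assumptions, not from the paper):
# the Reasoner sees only text and returns ("ask", question) or ("answer", final).
Reasoner = Callable[[str, List[Turn]], Tuple[str, str]]
# the Perceiver sees the image plus one question and returns a grounded answer.
Perceiver = Callable[[object, str], str]


@dataclass
class Trace:
    task: str
    turns: List[Turn] = field(default_factory=list)
    final_answer: Optional[str] = None


def socratic_dialogue(task: str, image: object, reasoner: Reasoner,
                      perceiver: Perceiver, max_turns: int = 8) -> Trace:
    """Alternate text-only reasoning with visual inspection until an answer is given."""
    trace = Trace(task=task)
    for _ in range(max_turns):
        action, text = reasoner(task, trace.turns)
        if action == "answer":
            trace.final_answer = text
            break
        # The Reasoner never touches the image; it only reads the Perceiver's reply.
        trace.turns.append((text, perceiver(image, text)))
    return trace


# Toy stand-ins so the loop can be exercised without any model.
def scripted_reasoner(task: str, turns: List[Turn]) -> Tuple[str, str]:
    script = ["Is this scene an airport?", "Is any jet bridge gate empty?"]
    if len(turns) < len(script):
        return "ask", script[len(turns)]
    return "answer", ("Yes" if turns[-1][1].startswith("Yes") else "No")


def canned_perceiver(image: object, question: str) -> str:
    return f"Yes. (stub reply to: {question})"


demo = socratic_dialogue("Is a gate with a jet bridge available?", None,
                         scripted_reasoner, canned_perceiver)
```

Keeping the Reasoner text-only forces every visual fact to enter the trace through an explicit question, which is what makes the resulting trace inspectable turn by turn.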

2. Two-Stage Progressive RL

To enhance and generalize the RS-EoT capability, we employ a progressive Reinforcement Learning pipeline:

Stage 1
RL-Grounding ("Iron Sharpens Iron"):
We first train on fine-grained grounding tasks. Since these tasks demand precise visual evidence, they specifically strengthen the iterative evidence-seeking behavior.
Stage 2
RL-VQA (Generalization):
We then generalize to broader VQA scenarios. To prevent reward hacking on simple datasets, we introduce a novel multiple-choice reconstruction strategy with graded rewards.
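To make the two reward signals concrete, here is a hedged sketch. The page does not specify the reward functions, so everything below is an assumption made for illustration: an IoU-based score stands in for the grounding reward of Stage 1, and for Stage 2 an open-ended question is rebuilt as a multiple-choice prompt whose options carry hand-assigned partial credit. The function names, the credit values, and the option format are hypothetical.

```python
import random
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (assumed Stage 1 reward)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_choices(question: str, scored_options: Dict[str, float],
                  rng: random.Random) -> Tuple[str, Dict[str, float]]:
    """Rebuild a QA pair as a shuffled multiple-choice prompt.

    scored_options maps option text to a credit in [0, 1]; the correct
    answer carries 1.0 and near-misses may carry partial credit (assumption).
    """
    labels = "ABCDEFGH"
    if len(scored_options) > len(labels):
        raise ValueError("too many options for the available labels")
    texts = list(scored_options)
    rng.shuffle(texts)  # shuffle so the correct label carries no positional cue
    pairs = list(zip(labels, texts))
    prompt = question + "\n" + "\n".join(f"{label}. {text}" for label, text in pairs)
    credit_by_label = {label: scored_options[text] for label, text in pairs}
    return prompt, credit_by_label


def graded_reward(predicted_label: str, credit_by_label: Dict[str, float]) -> float:
    """Return the credit of the chosen option; unknown labels earn nothing."""
    return credit_by_label.get(predicted_label.strip().upper(), 0.0)


prompt, credits = build_choices(
    "How many aircraft are parked at the terminal?",
    {"7": 1.0, "6": 0.5, "2": 0.0, "15": 0.0},  # illustrative credits only
    random.Random(0),
)
```

Under these assumptions, a binary yes/no question that could be guessed at chance becomes a harder selection problem, and a near-miss is distinguished from an arbitrary wrong answer.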

Qualitative Analysis

Input Query

Assuming a recently landed aircraft, is there an available gate with a jet bridge for it?

Case 1 Image

RS-EoT-7B Reasoning

Remote Sensing Evidence-of-Thought Generation

<think>

Alright, I will reason in a self Q&A style and give the final reply. Begin with a global check to confirm this is an airport scene and that a recently landed aircraft is present, establishing the overall context before focusing on gates and jet bridges.

Q: Is this scene an airport with visible aircraft, runways, or taxiways?
Let’s look at the image. Yes, this scene is an airport. I can see multiple aircraft parked on a large paved area...

To determine if a recently landed aircraft can connect to a jet bridge, the key is whether any gate position is empty; if so, the aircraft can park there. I will ask a targeted question.

Q: Is there at least one empty jet bridge gate position visible?
Let’s look at the image. Yes, there is at least one empty jet bridge gate position visible. In the upper central part...

Global view confirms an airport with multiple aircraft and jet bridges. Targeted check indicates at least one empty jet bridge gate position. Therefore, a recently landed aircraft could park at that gate.

</think>
Final Answer
Yes
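A trace in the format shown above can be split into its question and evidence cycles for inspection. The parser below is a small sketch that assumes the layout of this one example: a `<think>` block in which each line starting with `Q:` is followed by a line holding the visual answer, with the final reply after `</think>`. The function name `parse_trace` is hypothetical, and real model outputs may deviate from this layout.

```python
import re
from typing import List, Tuple


def parse_trace(output: str) -> Tuple[List[Tuple[str, str]], str]:
    """Return ([(question, visual_answer), ...], text_after_think_block)."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return [], output.strip()
    lines = [ln.strip() for ln in match.group(1).splitlines() if ln.strip()]
    pairs = []
    for i, line in enumerate(lines):
        if line.startswith("Q:"):
            # The answer is taken to be the next non-empty line, unless that
            # line is itself another question.
            has_answer = i + 1 < len(lines) and not lines[i + 1].startswith("Q:")
            pairs.append((line[2:].strip(), lines[i + 1] if has_answer else ""))
    return pairs, output[match.end():].strip()


sample = (
    "<think>\nStart with a global check.\n\n"
    "Q: Is this scene an airport?\nYes, this scene is an airport.\n\n"
    "Q: Is any jet bridge gate empty?\nYes, at least one is empty.\n\n"
    "Therefore the aircraft could park there.\n</think>\nYes"
)
cycles, final = parse_trace(sample)
```

Counting the entries in `cycles` gives a rough measure of how many evidence-seeking rounds a response contains.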

Citation

Coming soon