A Survey on Remote Sensing Multimodal Large Language Models: Framework, Core Technologies, and Future Perspectives
Abstract: In recent years, the rapid development of Multimodal Large Language Models (MLLMs) and their applications in remote sensing have garnered significant attention. Remote sensing MLLMs achieve deep integration of visual features and semantic information through bridging mechanisms between large language models and vision models, combined with joint training strategies. This integration facilitates a paradigm shift in intelligent remote sensing interpretation, from shallow semantic matching to higher-level understanding based on world knowledge. In this study, we systematically review the research progress on the applications of MLLMs in remote sensing, specifically examining the development of Remote Sensing MLLMs (RS-MLLMs), to provide a foundation for future research directions. We first clarify the concept of RS-MLLMs and review their development in chronological order. We then provide a detailed analysis and statistical summary of the proposed architectures, training methods, applicable tasks, and corresponding benchmark datasets, and introduce remote sensing agents. Finally, we summarize the research status of RS-MLLMs and discuss future research directions.
Keywords:
- Large Language Model
- Multimodal Large Language Model
- Remote Sensing Multimodal Large Language Model
- Vision-Language Model
- Remote Sensing Agent
Table 1. The specific structure of RS-MLLMs
RS-MLLM | Multimodal encoder | Multimodal feature projector | Pretrained LLM | Training hardware
AeroLite[57] | CLIP ViT-L/14 | A two-layer MLP | LLaMA3.2-3B | 4090
Aquila[55] | Aquila-CLIP ConvNext (A-CCN) | SFI | (MDA)-LLM (based on LLaMA3) | 4×A800
Aquila-plus[56] | CLIP ConvNeXt-L | Mask spatial feature extractor | Vicuna | -
CDChat[49] | CLIP ViT-L/14 | A two-layer MLP (GELU) | Vicuna-v1.5-7B | 3×A100
ChangeChat[48] | CLIP ViT/14 | A two-layer MLP | Vicuna-v1.5 | L20 (48 GB)
EagleVision[58] | Baseline detector | Attribute disentangle | InternLM2.5-7B-Chat, etc. | 8×A100
EarthDial[59] | Adaptive high resolution + Data fusion + InternViT-300M | A simple MLP | Phi-3-mini | 8×A100 (80 GB)
EarthGPT[12] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | A linear layer | LLaMA2-13B | 16×A100
EarthGPT-X[60] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L + Hybrid signals mutual understanding | Vision-to-language modality-align projection | LLaMA2-13B | 8×A100 (80 GB)
EarthMarker[61] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | A linear layer | LLaMA2-13B | 8×A100 (80 GB)
GeoChat[42] | CLIP ViT-L/14 | A two-layer MLP (GELU) | Vicuna-v1.5-7B | -
GeoGround[29] | CLIP ViT | A two-layer MLP | Vicuna-v1.5 | 8×V100 (32 GB)
GeoLLaVA-8K[62] | CLIP ViT-L/14 + A two-step tokens compression module | A linear layer | Vicuna-v1.5-7B | -
IFShip[46] | CLIP ViT-L/14 | A four-layer MLP (GELU) | Vicuna-13B | -
LHRS-Bot[11] | CLIP ViT-L/14 | Vision perceiver | LLaMA2-7B | 8×V100 (32 GB)
LHRS-Bot-Nova[63] | SigLIP-L/14 | Vision perceiver | LLaMA3-8B | 8×H100
Popeye[47] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | Alignment projection | LLaMA-7B | -
RingMoGPT[30] | EVA-CLIP ViT-g/14 | A Q-Former + a linear layer | Vicuna-13B | 8×A100 (80 GB)
RS-CapRet[45] | CLIP ViT-L/14 | Three linear layers | LLaMA2-7B | -
RSGPT[10] | EVA-G | A Q-Former + a linear layer | Vicuna-7B, Vicuna-13B | 8×A100
RS-LLaVA[64] | CLIP ViT-L | A two-layer MLP (GELU) | Vicuna-v1.5-7B, Vicuna-v1.5-13B | 2×A6000 (48 GB)
RSUniVLM[65] | SigLIP-400M | A two-layer MLP | QWen2-0.5B | 4×A40 (40 GB)
SkyEyeGPT[31] | EVA-CLIP | A linear layer | LLaMA2 | 4×3090
SkySenseGPT[44] | CLIP ViT-L/14 | A two-layer MLP | Vicuna-v1.5 | 4×A100 (40 GB)
Spectral-LLaVA[66] | SpectralGPT[34] (encoder only) | A linear layer | LLaMA3 | -
TEOChat[50] | CLIP ViT-L/14 | A two-layer MLP | LLaMA2 | A4000 (16 GB)
UniRS[13] | SigLIP + A change extraction module | A downsampling module + an MLP | Sheared-LLaMA (3B) | 4×4090 (24 GB)
VHM[43] | CLIP ViT-L/14 | A two-layer MLP | Vicuna-v1.5-7B | 16×A100 (80 GB)
Note: Italicized names indicate models that were not given a formal name or abbreviation in the original paper.
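Most entries in Table 1 follow the same LLaVA-style composition: a (typically frozen) vision encoder, a lightweight projector that maps patch features into the word-embedding space, and a pretrained LLM that consumes the projected visual tokens together with the text tokens. The sketch below is a minimal, illustrative PyTorch rendering of that pattern; the class name, module choices, and dimensions are placeholders rather than the implementation of any specific model in the table.

```python
import torch
import torch.nn as nn

class RSMLLMSketch(nn.Module):
    """Illustrative encoder-projector-LLM composition; not any specific model."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., a frozen CLIP ViT-L/14
        self.projector = nn.Sequential(           # "a two-layer MLP (GELU)" pattern from Table 1
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                            # e.g., Vicuna-v1.5-7B or LLaMA2

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        # Patch features from the vision encoder are projected into the LLM's
        # embedding space and prepended to the text token embeddings.
        with torch.no_grad():                     # vision backbone is usually kept frozen
            patch_feats = self.vision_encoder(image)       # (B, N_patches, vision_dim)
        vision_tokens = self.projector(patch_feats)         # (B, N_patches, llm_dim)
        inputs_embeds = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)        # autoregressive training/decoding
```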
Table 2. Datasets used for training RS-MLLMs
RS-MLLM | Instruction tuning | Other training stages
AeroLite[57] | - | RSSCN7, DLRSD, iSAID, LoveDA, WHU, UCM-Captions, Sydney-Captions
Aquila[55] | FIT-RS | CapERA, UCM-Captions, Sydney-Captions, NWPU-Captions, RSICD, RSITMD, RSVQA-HR, RSVQA-LR, WHU_RS19
Aquila-plus[56] | Aquila-plus-100K | -
CDChat[49] | LEVIR-CD, SYSU-CD | LEVIR-CD, SYSU-CD
ChangeChat[48] | ChangeChat-87k | -
EagleVision[58] | EVAttrs-95K | EVAttrs-95K
EarthDial[59] | EarthDial-Instruct | EarthDial-Instruct
EarthGPT[12] | MMRS-1M | LAION-400M, COCO Caption
EarthGPT-X[60] | M-RSVP | -
EarthMarker[61] | RSVP-3M | COCO Caption, RSVP-3M, RefCOCO, RefCOCO+
GeoChat[42] | RS multimodal instruction following dataset | -
GeoGround[29] | refGeo | FAIR1M, DIOR, DOTA
GeoLLaVA-8K[62] | SuperRS-VQA, HighRS-VQA | -
IFShip[46] | TITANIC-FGS | -
LHRS-Bot[11] | LLaVA complex reasoning dataset, NWPU, RSITMD, LHRS-Instruct | LHRS-Align
LHRS-Bot-Nova[63] | Multi-task instruction dataset | LHRS-Align-Recap, LHRS-Instruct, LHRS-Instruct-Plus, LRV-Instruct
Popeye[47] | MMShip | COCO Caption
RingMoGPT[30] | Instruction-tuning dataset | Image-text pre-training dataset
RS-CapRet[45] | - | RSICD, UCM-Captions, Sydney-Captions, NWPU-Captions
RSGPT[10] | - | RSICap
RS-LLaVA[64] | RS-Instructions | -
RSUniVLM[65] | RSUniVLM-Instruct-1.2M | RSUniVLM-Resampled
SkyEyeGPT[31] | SkyEye-968k | SkyEye-968k
SkySenseGPT[44] | FIT-RS, NWPU-Captions, UCM-Captions, RSITMD, EarthVQA, Floodnet-VQA, RSVQA-LR, DOTA, DIOR, FAIR1M | -
Spectral-LLaVA[66] | BigEarthNet-v2 | fMoW, BigEarthNet-v1
TEOChat[50] | TEOChatlas | -
UniRS[13] | GeoChat-Instruct, LEVIR-CC, EAR | -
VHM[43] | VersaD-Instruct, VariousRS-Instruct, HnstD | VersaD
Note: Since the focus is on the instruction tuning stage, datasets used in all other training stages are merged into the third column. Italicized names indicate datasets that were not given a formal name or abbreviation in the original paper.
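Most of the instruction tuning corpora in Table 2 package each image with one or more rounds of dialogue in the conversation format popularized by LLaVA. The snippet below is a hypothetical sample meant only to illustrate that structure; the field names, file name, and answers are invented, and the exact schema differs across datasets.

```python
import json

# Hypothetical instruction-tuning sample in the LLaVA-style conversation format.
sample = {
    "id": "example_0001",
    "image": "dota_patch_0012.png",   # invented image tile name
    "conversations": [
        {"from": "human", "value": "<image>\nHow many airplanes are parked on the apron?"},
        {"from": "gpt", "value": "There are four airplanes parked on the apron."},
        {"from": "human", "value": "What type of scene is shown in this image?"},
        {"from": "gpt", "value": "The image shows an airport scene."},
    ],
}

print(json.dumps(sample, indent=2, ensure_ascii=False))
```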
Table 3. The applicable tasks of RS-MLLMs and their corresponding benchmark datasets
RS-MLLM | RSIC | RSVQA | RSVG | RSSC
AeroLite[57] | Sydney-Captions[74], UCM-Captions[74] | – | – | –
Aquila[55] | RSICD[75], Sydney-Captions[74], UCM-Captions[74], FIT-RS[44] | RSVQA-LR[76], RSVQA-HR[76], FIT-RS[44] | – | –
EarthDial[59] | NWPU-Captions[77], RSICD[75], RSITMD-Captions[78], Sydney-Captions[74], UCM-Captions[74] | RSVQA-LR[76], RSVQA-HR[76] | – | AID[79], UCMerced[80], WHU-RS19[81], BigEarthNet[82], xBD Set 1[83], fMoW[84]
EarthGPT[12] | NWPU-Captions[77] | CRSVQA[85], RSVQA-HR[76] | DIOR-RSVG[86] | NWPU-RESISC45[87], CLRS[88], NaSC-TG2[89]
GeoChat[42] | – | RSVQA-LR[76], RSVQA-HR[76] | GeoChat*[42] | AID[79], UCMerced[80]
GeoGround[29] | – | – | DIOR-RSVG[86], RSVG[90], GeoChat*[42], VRSBench*[91], AVVG[29] | –
LHRS-Bot[11] | – | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], EuroSAT[93], METER-ML[94], fMoW[84]
LHRS-Bot-Nova[63] | – | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], EuroSAT[93], METER-ML[94], fMoW[84]
RingMoGPT[30] | DOTA-Cap[30], DIOR-Cap[30], NWPU-Captions[77], RSICD[75], Sydney-Captions[74], UCM-Captions[74] | HRVQA[95] | – | AID[79], NWPU-RESISC45[87], UCMerced[80], WHU-RS19[81]
RS-CapRet[45] | NWPU-Captions[77], RSICD[75], Sydney-Captions[74], UCM-Captions[74] | – | – | –
RSGPT[10] | RSIEval[10], UCM-Captions[74], Sydney-Captions[74], RSICD[75] | RSIEval[10], RSVQA-LR[76], RSVQA-HR[76] | – | –
RS-LLaVA[64] | UCM-Captions[74], UAV[96] | RSVQA-LR[76], RSIVQA-DOTA[97] | – | –
RSUniVLM[65] | – | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], VRSBench[91] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92]
SkyEyeGPT[31] | UCM-Captions[74], CapERA[98] | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | –
TEOChat[50] | – | RSVQA-LR[76], RSVQA-HR[76] | – | AID[79], UCMerced[80]
UniRS[13] | – | RSVQA-LR[76], RSVQA-HR[76], CRSVQA[85] | – | –
VHM[43] | – | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], METER-ML[94]
Note: * indicates that the test set was modified. RSIC: remote sensing image captioning; RSVQA: remote sensing visual question answering; RSVG: remote sensing visual grounding; RSSC: remote sensing scene classification.
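A single RS-MLLM typically covers the four task families in Table 3 by changing only the instruction, with the image supplied through a special placeholder token in the prompt. The templates below are generic illustrations of that idea, not the exact prompts (or task tokens) used by any of the models listed above.

```python
# Generic, illustrative instructions for the four task families of Table 3;
# real systems use their own carefully designed, often task-tokenized prompts.
TASK_PROMPTS = {
    "RSIC": "<image>\nDescribe this remote sensing image in one sentence.",
    "RSVQA": "<image>\nAre there any residential buildings in this image? Answer yes or no.",
    "RSVG": "<image>\nOutput the bounding box of the storage tank closest to the river.",
    "RSSC": "<image>\nClassify the scene. Choose one of: airport, farmland, forest, harbor.",
}

for task, prompt in TASK_PROMPTS.items():
    print(f"[{task}] {prompt}")
```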
Table 4. Performance of RS-MLLMs on various remote sensing image captioning datasets
Dataset | Model | BLEU-1↑ | BLEU-2↑ | BLEU-3↑ | BLEU-4↑ | METEOR↑ | ROUGE-L↑ | CIDEr↑ | SPICE↑
NWPU-Captions[77] | EarthDial[59] | – | – | – | – | 0.806 | 0.400 | – | –
NWPU-Captions[77] | EarthGPT[12] | 0.871 | 0.787 | 0.716 | 0.655 | 0.445 | 0.782 | 1.926 | 0.322
NWPU-Captions[77] | EarthMarker[61] | 0.844 | 0.731 | 0.629 | 0.543 | 0.375 | 0.700 | 1.629 | 0.268
NWPU-Captions[77] | RS-CapRet[45] | 0.871 | 0.787 | 0.717 | 0.656 | 0.436 | 0.776 | 1.929 | 0.311
RSICD[75] | Aquila[55] | 0.746 | – | – | – | – | – | – | –
RSICD[75] | EarthDial[59] | – | – | – | – | 0.562 | 0.276 | – | –
RSICD[75] | RS-CapRet[45] | 0.741 | 0.622 | 0.529 | 0.455 | 0.376 | 0.649 | 2.605 | 0.484
RSICD[75] | RSGPT[10] | 0.703 | 0.542 | 0.440 | 0.368 | 0.301 | 0.533 | 1.029 | –
RSICD[75] | SkyEyeGPT[31] | 0.867 | 0.767 | 0.673 | 0.600 | 0.354 | 0.626 | 0.837 | –
RSICD[75] | RingMoGPT[30] | – | – | – | – | 0.343 | 0.616 | 2.758 | –
UCM-Captions[74] | AeroLite[57] | 0.934 | – | – | 0.796 | 0.498 | 0.880 | – | –
UCM-Captions[74] | Aquila[55] | 0.883 | – | – | – | – | – | – | –
UCM-Captions[74] | EarthDial[59] | – | – | – | – | 0.514 | 0.342 | – | –
UCM-Captions[74] | RS-CapRet[45] | 0.843 | 0.779 | 0.722 | 0.670 | 0.472 | 0.817 | 3.548 | 0.525
UCM-Captions[74] | RSGPT[10] | 0.861 | 0.791 | 0.723 | 0.657 | 0.422 | 0.783 | 3.332 | –
UCM-Captions[74] | SkyEyeGPT[31] | 0.907 | 0.857 | 0.816 | 0.784 | 0.462 | 0.795 | 2.368 | –
UCM-Captions[74] | RS-LLaVA[64] | 0.900 | 0.849 | 0.803 | 0.760 | 0.492 | 0.858 | 3.556 | –
UCM-Captions[74] | RingMoGPT[30] | – | – | – | – | 0.499 | 0.833 | 3.593 | –
Sydney-Captions[74] | AeroLite[57] | 0.919 | – | – | 0.759 | 0.475 | 0.837 | – | –
Sydney-Captions[74] | Aquila[55] | 0.834 | – | – | – | – | – | – | –
Sydney-Captions[74] | EarthDial[59] | – | – | – | – | 0.573 | 0.410 | – | –
Sydney-Captions[74] | RS-CapRet[45] | 0.787 | 0.700 | 0.628 | 0.564 | 0.388 | 0.707 | 2.392 | 0.434
Sydney-Captions[74] | RSGPT[10] | 0.823 | 0.753 | 0.686 | 0.622 | 0.414 | 0.748 | 2.731 | –
Sydney-Captions[74] | SkyEyeGPT[31] | 0.919 | 0.856 | 0.809 | 0.774 | 0.466 | 0.777 | 1.811 | –
Sydney-Captions[74] | RingMoGPT[30] | – | – | – | – | 0.421 | 0.734 | 2.888 | –
FIT-RS[44] | Aquila[55] | 0.351 | – | – | – | – | – | – | –
FIT-RS[44] | GeoChat[42] | 0.088 | – | – | – | – | – | – | –
FIT-RS[44] | SkySenseGPT[44] | 0.273 | – | – | – | – | – | – | –
Note: All results are quoted from the corresponding original papers; bold marks the best result.
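The captioning scores in Table 4 are standard n-gram and consensus metrics. As a rough illustration of how such numbers are produced, the sketch below computes corpus-level BLEU-4 with NLTK over tokenized captions; in practice METEOR, ROUGE-L, CIDEr, and SPICE are usually obtained from the pycocoevalcap toolkit. The captions used here are made-up examples.

```python
# Corpus-level BLEU-4 over tokenized captions; references and candidates are invented.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [  # per image: a list of reference captions, each a list of tokens
    [["many", "planes", "are", "parked", "at", "the", "airport"],
     ["several", "airplanes", "parked", "near", "the", "terminal"]],
]
hypotheses = [  # per image: one generated caption as a list of tokens
    ["many", "planes", "parked", "at", "the", "airport"],
]

bleu4 = corpus_bleu(
    references, hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.3f}")
```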
Table 5. Performance of RS-MLLMs on various remote sensing visual question answering datasets (%)
Model | RSVQA-LR[76] Presence | RSVQA-LR[76] Compare | RSVQA-LR[76] Rural/Urban | RSVQA-LR[76] Avg | RSVQA-HR[76] Presence | RSVQA-HR[76] Compare | RSVQA-HR[76] Avg | CRSVQA[85] Avg | FIT-RS[44] Avg
Aquila[55] | 92.72 | – | – | – | 92.64 | – | – | – | 83.87
EarthDial[59] | 92.58 | 92.75 | 94.00 | 92.70 | 58.89 | 83.11 | 72.45 | – | –
EarthGPT[12] | – | – | – | – | 62.77 | 79.53 | 72.06 | 82.00 | –
GeoChat[42] | 91.09 | 90.33 | 94.00 | 90.70 | 59.02 | 83.16 | – | – | 53.47
LHRS-Bot[11] | 89.07 | 88.51 | 90.00 | 89.19 | 92.57 | 92.53 | 92.55 | – | –
LHRS-Bot-Nova[63] | 89.00 | 90.71 | 89.11 | 89.61 | 91.68 | 92.44 | 92.06 | – | –
RSGPT[10] | 91.17 | 91.70 | 94.00 | 92.29 | 90.92 | 90.02 | 90.47 | – | –
RS-LLaVA[64] | 92.27 | 91.37 | 95.00 | 88.10 | – | – | – | – | –
RSUniVLM[65] | 92.00 | 91.51 | 92.65 | 92.05 | 90.81 | 90.88 | 90.85 | – | –
SkyEyeGPT[31] | 88.93 | 88.63 | 75.00 | 84.19 | 80.00 | 80.13 | 82.56 | – | –
SkySenseGPT[44] | 95.00 | 91.07 | 92.00 | 92.69 | 69.14 | 84.14 | 76.64 | – | 79.76
TEOChat[50] | 91.70 | 92.70 | 94.00 | – | 67.50 | 81.10 | – | – | –
UniRS[13] | 91.81 | 93.23 | 93.00 | 92.63 | 59.29 | 84.05 | 73.15 | 86.67 | –
VHM[43] | 91.17 | 89.89 | 88.00 | 89.33 | 64.00 | 83.50 | 73.75 | – | –
Note: All results are quoted from the corresponding original papers; bold marks the best result.
Table 6. Performance of RS-MLLMs on various remote sensing visual grounding datasets (%)
Table 7. Performance of RS-MLLMs on various remote sensing scene classification datasets (%)
Model | UCMerced[80] | AID[79] | NWPU-RESISC45[87] | CLRS[88] | NaSC-TG2[89] | WHU-RS19[81] | SIRI-WHU[92] | EuroSAT[93] | METER-ML[94] | fMoW[84]
EarthDial[59] | 92.42 | 88.76 | – | – | – | 96.21 | – | – | – | 70.03
EarthGPT[12] | – | – | 93.84 | 77.37 | 74.72 | – | – | – | – | –
EarthMarker[61] | 86.52 | 77.97 | – | – | – | – | – | – | – | –
GeoChat[42] | 84.43 | 72.03 | – | – | – | – | – | – | – | –
LHRS-Bot[11] | – | 91.26 | 83.94 | – | – | 93.17 | 62.66 | 51.40 | 69.81 | 56.56
LHRS-Bot-Nova[63] | – | 88.32 | 86.80 | – | – | 95.63 | 74.75 | 63.54 | 70.05 | 57.11
RingMoGPT[30] | 86.48 | 97.94 | 96.47 | – | – | 97.71 | – | – | – | –
RSUniVLM[65] | – | 81.18 | 86.86 | – | – | 84.91 | 68.13 | – | – | –
SkyEyeGPT[31] | 60.95 | 26.30 | – | – | – | – | – | – | – | –
TEOChat[50] | 86.30 | 80.90 | – | – | – | – | – | – | – | –
VHM[43] | – | 91.70 | 94.54 | – | – | 95.80 | 70.88 | – | 72.74 | –
Note: All results are quoted from the corresponding original papers; bold marks the best result.
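The scene classification accuracies in Table 7 are typically obtained by prompting the model with the candidate class names and matching its free-form reply against the ground-truth label; the exact prompt wording and answer-matching rules vary from paper to paper. The sketch below illustrates one plausible version of that protocol; model.generate is a hypothetical placeholder for whatever inference API a given RS-MLLM exposes, and the containment-based matching rule is deliberately simplistic.

```python
# `model.generate` is a hypothetical placeholder for an RS-MLLM inference call.
def classify_scene(model, image, class_names):
    prompt = ("<image>\nClassify the scene of this remote sensing image. "
              f"Choose one of: {', '.join(class_names)}. Answer with the class name only.")
    reply = model.generate(image, prompt).strip().lower()
    for name in class_names:                 # simple containment match against the reply
        if name.lower() in reply:
            return name
    return None                              # reply did not mention any candidate class


def accuracy(model, samples, class_names):
    """samples: iterable of (image, ground_truth_label) pairs; returns accuracy in %."""
    correct = sum(classify_scene(model, img, class_names) == label
                  for img, label in samples)
    return 100.0 * correct / len(samples)
```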
[1] WANG Qiao and LIU Sihan. Research and implementation of national environmental remote sensing monitoring system[J]. Journal of Remote Sensing, 2016, 20(5): 1161–1169. doi: 10.11834/jrs.20166201.
[2] AN Liqiang, ZHANG Jingfa, MONTEIRO R, et al. A review and prospective research of earthquake damage assessment and remote sensing[J]. National Remote Sensing Bulletin, 2024, 28(4): 860–884. doi: 10.11834/jrs.20232093.
[3] ZHANG Wangfei, CHEN Erxue, LI Zengyuan, et al. Review of applications of radar remote sensing in agriculture[J]. Journal of Radars, 2020, 9(3): 444–461. doi: 10.12000/JR20051.
[4] LI Yansheng, DANG Bo, ZHANG Yongjun, et al. Water body classification from high-resolution optical remote sensing imagery: Achievements and perspectives[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2022, 187: 306–327. doi: 10.1016/j.isprsjprs.2022.03.013.
[5] LI Yansheng, WEI Fanyi, ZHANG Yongjun, et al. HS2P: Hierarchical spectral and structure-preserving fusion network for multimodal remote sensing image cloud and shadow removal[J]. Information Fusion, 2023, 94: 215–228. doi: 10.1016/j.inffus.2023.02.002.
[6] CHEN Yongqi, FENG Shou, ZHAO Chunhui, et al. High-resolution remote sensing image change detection based on Fourier feature interaction and multiscale perception[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5539115. doi: 10.1109/TGRS.2024.3500073.
[7] YANG Guang and LIU Xiangnan. The present research condition and development trend of remotely sensed imagery interpretation[J]. Remote Sensing for Land & Resources, 2004(2): 7–10, 15. doi: 10.3969/j.issn.1001-070X.2004.02.002.
[8] ZHAO W X, ZHOU Kun, LI Junyi, et al. A survey of large language models[J]. arXiv preprint arXiv: 2303.18223, 2023.
[9] YIN Shukang, FU Chaoyou, ZHAO Sirui, et al. A survey on multimodal large language models[J]. National Science Review, 2024, 11(12): nwae403. doi: 10.1093/nsr/nwae403.
[10] HU Yuan, YUAN Jianlong, WEN Congcong, et al. RSGPT: A remote sensing vision language model and benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 224: 272–286. doi: 10.1016/j.isprsjprs.2025.03.028.
[11] MUHTAR D, LI Zhenshi, GU Feng, et al. LHRS-Bot: Empowering remote sensing with VGI-enhanced large multimodal language model[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 440–457. doi: 10.1007/978-3-031-72904-1_26.
[12] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. EarthGPT: A universal multimodal large language model for multisensor image comprehension in remote sensing domain[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5917820. doi: 10.1109/TGRS.2024.3409624.
[13] LI Yujie, XU Wenjia, LI Guangzuo, et al. UniRS: Unifying multi-temporal remote sensing tasks through vision language models[J]. arXiv preprint arXiv: 2412.20742, 2024.
[14] VOUTILAINEN A. A syntax-based part-of-speech analyser[C]. The 7th Conference of the European Chapter of the Association for Computational Linguistics, Dublin, Ireland, 1995.
[15] BRILL E and RESNIK P. A rule-based approach to prepositional phrase attachment disambiguation[C]. The 15th International Conference on Computational Linguistics, Kyoto, Japan, 1994.
[16] HINTON G E, OSINDERO S, and TEH Y W. A fast learning algorithm for deep belief nets[J]. Neural Computation, 2006, 18(7): 1527–1554. doi: 10.1162/neco.2006.18.7.1527.
[17] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
[18] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. https://openai.com/index/language-unsupervised/, 2018.
[19] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[EB/OL]. https://openai.com/index/better-language-models/, 2019.
[20] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 159.
[21] OpenAI. GPT-4 technical report[J]. arXiv preprint arXiv: 2303.08774, 2023.
[22] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint arXiv: 2302.13971, 2023.
[23] BAI Jinze, BAI Shuai, CHU Yunfei, et al. Qwen technical report[J]. arXiv preprint arXiv: 2309.16609, 2023.
[24] DeepSeek-AI. DeepSeek-V3 technical report[J]. arXiv preprint arXiv: 2412.19437, 2024.
[25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
[26] LIU Haotian, LI Chunyuan, WU Qingyang, et al. Visual instruction tuning[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1516.
[27] LIN Bin, YE Yang, ZHU Bin, et al. Video-LLaVA: Learning united visual representation by alignment before projection[C]. 2024 Conference on Empirical Methods in Natural Language Processing, Miami, USA, 2024: 5971–5984.
[28] KOH J Y, FRIED D, and SALAKHUTDINOV R R. Generating images with multimodal language models[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 939.
[29] ZHOU Yue, LAN Mengcheng, LI Xiang, et al. GeoGround: A unified large vision-language model for remote sensing visual grounding[J]. arXiv preprint arXiv: 2411.11904, 2024.
[30] WANG Peijin, HU Huiyang, TONG Boyuan, et al. RingMoGPT: A unified remote sensing foundation model for vision, language, and grounded tasks[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5611320. doi: 10.1109/TGRS.2024.3510833.
[31] ZHAN Yang, XIONG Zhitong, and YUAN Yuan. SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 221: 64–77. doi: 10.1016/j.isprsjprs.2025.01.020.
[32] ZHANG Yongjun, LI Yansheng, DANG Bo, et al. Multi-modal remote sensing large foundation models: Current research status and future prospect[J]. Acta Geodaetica et Cartographica Sinica, 2024, 53(10): 1942–1954. doi: 10.11947/j.AGCS.2024.20240019.
[33] HONG Danfeng, HAN Zhu, YAO Jing, et al. SpectralFormer: Rethinking hyperspectral image classification with transformers[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5518615. doi: 10.1109/TGRS.2021.3130716.
[34] HONG Danfeng, ZHANG Bing, LI Xuyang, et al. SpectralGPT: Spectral remote sensing foundation model[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(8): 5227–5244. doi: 10.1109/TPAMI.2024.3362475.
[35] FULLER A, MILLARD K, and GREEN J R. CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 241.
[36] WANG Yi, ALBRECHT C M, BRAHAM N A A, et al. Decoupling common and unique representations for multimodal self-supervised learning[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 286–303. doi: 10.1007/978-3-031-73397-0_17.
[37] ZHANG Liangpei, ZHANG Lefei, and YUAN Qiangqiang. Large remote sensing model: Progress and prospects[J]. Geomatics and Information Science of Wuhan University, 2023, 48(10): 1574–1581. doi: 10.13203/j.whugis20230341.
[38] CHIANG W L, LI Zhuohan, LIN Zi, et al. Vicuna: An open-source Chatbot impressing GPT-4 with 90%* ChatGPT quality[EB/OL]. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
[39] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning[J]. arXiv preprint arXiv: 2501.12948, 2025.
[40] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: A visual language model for few-shot learning[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1723.
[41] LI Junnan, LI Dongxu, SAVARESE S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]. The 40th International Conference on Machine Learning, Honolulu, USA, 2023: 814.
[42] KUCKREJA K, DANISH M S, NASEER M, et al. GeoChat: Grounded large vision-language model for remote sensing[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 27831–27840. doi: 10.1109/CVPR52733.2024.02629.
[43] PANG Chao, WENG Xingxing, WU Jiang, et al. VHM: Versatile and honest vision language model for remote sensing image analysis[C]. The 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 6381–6388. doi: 10.1609/aaai.v39i6.32683.
[44] LUO Junwei, PANG Zhen, ZHANG Yongjun, et al. SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding[J]. arXiv preprint arXiv: 2406.10100, 2024.
[45] SILVA J D, MAGALHÃES J, TUIA D, et al. Large language models for captioning and retrieving remote sensing images[J]. arXiv preprint arXiv: 2402.06475, 2024.
[46] GUO Mingning, WU Mengwei, SHEN Yuxiang, et al. IFShip: Interpretable fine-grained ship classification with domain knowledge-enhanced vision-language models[J]. Pattern Recognition, 2025, 166: 111672. doi: 10.1016/j.patcog.2025.111672.
[47] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. Popeye: A unified visual-language model for multisource ship detection from remote sensing imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024, 17: 20050–20063. doi: 10.1109/JSTARS.2024.3488034.
[48] DENG Pei, ZHOU Wenqian, and WU Hanlin. ChangeChat: An interactive model for remote sensing change analysis via multimodal instruction tuning[C]. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025: 1–5. doi: 10.1109/ICASSP49660.2025.10890620.
[49] NOMAN M, AHSAN N, NASEER M, et al. CDChat: A large multimodal model for remote sensing change description[J]. arXiv preprint arXiv: 2409.16261, 2024.
[50] IRVIN J A, LIU E R, CHEN J C, et al. TEOChat: A large vision-language assistant for temporal earth observation data[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025.
[51] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. The 38th International Conference on Machine Learning, 2021: 8748–8763.
[52] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
[53] SUN Quan, FANG Yuxin, WU L, et al. EVA-CLIP: Improved training techniques for CLIP at scale[J]. arXiv preprint arXiv: 2303.15389, 2023.
[54] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. The 9th International Conference on Learning Representations, 2021.
[55] LU Kaixuan, ZHANG Ruiqian, HUANG Xiao, et al. Aquila: A hierarchically aligned visual-language model for enhanced remote sensing image comprehension[J]. arXiv preprint arXiv: 2411.06074, 2024.
[56] LU Kaixuan. Aquila-plus: Prompt-driven visual-language models for pixel-level remote sensing image understanding[J]. arXiv preprint arXiv: 2411.06142, 2024.
[57] ZI Xing, NI Tengjun, FAN Xianjing, et al. AeroLite: Tag-guided lightweight generation of aerial image captions[J]. arXiv preprint arXiv: 2504.09528, 2025.
[58] JIANG Hongxiang, YIN Jihao, WANG Qixiong, et al. EagleVision: Object-level attribute multimodal LLM for remote sensing[J]. arXiv preprint arXiv: 2503.23330, 2025.
[59] SONI S, DUDHANE A, DEBARY H, et al. EarthDial: Turning multi-sensory earth observations to interactive dialogues[C]. The Computer Vision and Pattern Recognition Conference, Nashville, USA, 2025: 14303–14313.
[60] ZHANG Wei, CAI Miaoxin, NING Yaqian, et al. EarthGPT-X: Enabling MLLMs to flexibly and comprehensively understand multi-source remote sensing imagery[J]. arXiv preprint arXiv: 2504.12795, 2025.
[61] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. EarthMarker: A visual prompting multimodal large language model for remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5604219. doi: 10.1109/TGRS.2024.3523505.
[62] WANG Fengxiang, CHEN Mingshuo, LI Yueying, et al. GeoLLaVA-8K: Scaling remote-sensing multimodal large language models to 8K resolution[J]. arXiv preprint arXiv: 2505.21375, 2025.
[63] LI Zhenshi, MUHTAR D, GU Feng, et al. LHRS-Bot-Nova: Improved multimodal large language model for remote sensing vision-language interpretation[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 227: 539–550. doi: 10.1016/j.isprsjprs.2025.06.003.
[64] BAZI Y, BASHMAL L, AL RAHHAL M M, et al. RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery[J]. Remote Sensing, 2024, 16(9): 1477. doi: 10.3390/rs16091477.
[65] LIU Xu and LIAN Zhouhui. RSUniVLM: A unified vision language model for remote sensing via granularity-oriented mixture of experts[J]. arXiv preprint arXiv: 2412.05679, 2024.
[66] KARANFIL E, IMAMOGLU N, ERDEM E, et al. A vision-language framework for multispectral scene representation using language-grounded features[J]. arXiv preprint arXiv: 2501.10144, 2025.
[67] ZHANG Hao, LI Feng, LIU Shilong, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection[C]. The 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.
[68] ZHAI Xiaohua, MUSTAFA B, KOLESNIKOV A, et al. Sigmoid loss for language image pre-training[C]. The IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 11941–11952. doi: 10.1109/ICCV51070.2023.01100.
[69] ZHANG Jihai, QU Xiaoye, ZHU Tong, et al. CLIP-MoE: Towards building mixture of experts for CLIP with diversified multiplet upcycling[J]. arXiv preprint arXiv: 2409.19291, 2024.
[70] WANG Weihan, LV Qingsong, YU Wenmeng, et al. CogVLM: Visual expert for pretrained language models[C]. The 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 3860.
[71] LIU Haotian, LI Chunyuan, LI Yuheng, et al. Improved baselines with visual instruction tuning[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 26286–26296. doi: 10.1109/CVPR52733.2024.02484.
[72] KAUFMANN T, WENG P, BENGS V, et al. A survey of reinforcement learning from human feedback[J]. Transactions on Machine Learning Research, in press, 2025.
[73] HU E J, SHEN Yelong, WALLIS P, et al. LoRA: Low-rank adaptation of large language models[C]. The 10th International Conference on Learning Representations, 2022.
[74] QU Bo, LI Xuelong, TAO Dacheng, et al. Deep semantic understanding of high resolution remote sensing image[C]. 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 2016: 1–5. doi: 10.1109/CITS.2016.7546397.
[75] LU Xiaoqiang, WANG Binqiang, ZHENG Xiangtao, et al. Exploring models and data for remote sensing image caption generation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56(4): 2183–2195. doi: 10.1109/TGRS.2017.2776321.
[76] LOBRY S, MARCOS D, MURRAY J, et al. RSVQA: Visual question answering for remote sensing data[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(12): 8555–8566. doi: 10.1109/TGRS.2020.2988782.
[77] CHENG Qimin, HUANG Haiyan, XU Yuan, et al. NWPU-captions dataset and MLCA-Net for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5629419. doi: 10.1109/TGRS.2022.3201474.
[78] YUAN Zhiqiang, ZHANG Wenkai, FU Kun, et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4404119. doi: 10.1109/TGRS.2021.3078451.
[79] XIA Guisong, HU Jingwen, HU Fan, et al. AID: A benchmark data set for performance evaluation of aerial scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(7): 3965–3981. doi: 10.1109/TGRS.2017.2685945.
[80] YANG Yi and NEWSAM S. Bag-of-visual-words and spatial extensions for land-use classification[C]. The 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, USA, 2010: 270–279. doi: 10.1145/1869790.1869829.
[81] DAI Dengxin and YANG Wen. Satellite image classification via two-layer sparse coding with biased image representation[J]. IEEE Geoscience and Remote Sensing Letters, 2011, 8(1): 173–176. doi: 10.1109/LGRS.2010.2055033.
[82] SUMBUL G, CHARFUELAN M, DEMIR B, et al. BigEarthNet: A large-scale benchmark archive for remote sensing image understanding[C]. IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 2019: 5901–5904. doi: 10.1109/IGARSS.2019.8900532.
[83] GUPTA R, GOODMAN B, PATEL N, et al. Creating xBD: A dataset for assessing building damage from satellite imagery[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, USA, 2019: 10–17.
[84] CHRISTIE G, FENDLEY N, WILSON J, et al. Functional map of the world[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6172–6180. doi: 10.1109/CVPR.2018.00646.
[85] ZHANG Meimei, CHEN Fang, and LI Bin. Multistep question-driven visual question answering for remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 4704912. doi: 10.1109/TGRS.2023.3312479.
[86] ZHAN Yang, XIONG Zhitong, and YUAN Yuan. RSVG: Exploring data and models for visual grounding on remote sensing data[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5604513. doi: 10.1109/TGRS.2023.3250471.
[87] CHENG Gong, HAN Junwei, and LU Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art[J]. Proceedings of the IEEE, 2017, 105(10): 1865–1883. doi: 10.1109/JPROC.2017.2675998.
[88] LI Haifeng, JIANG Hao, GU Xin, et al. CLRS: Continual learning benchmark for remote sensing image scene classification[J]. Sensors, 2020, 20(4): 1226. doi: 10.3390/s20041226.
[89] ZHOU Zhuang, LI Shengyang, WU Wei, et al. NaSC-TG2: Natural scene classification with Tiangong-2 remotely sensed imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 3228–3242. doi: 10.1109/JSTARS.2021.3063096.
[90] SUN Yuxi, FENG Shanshan, LI Xutao, et al. Visual grounding in remote sensing images[C]. The 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022: 404–412. doi: 10.1145/3503161.3548316.
[91] LI Xiang, DING Jian, and ELHOSEINY M. VRSBench: A versatile vision-language benchmark dataset for remote sensing image understanding[C]. The 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 106.
[92] ZHU Qiqi, ZHONG Yanfei, ZHAO Bei, et al. Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery[J]. IEEE Geoscience and Remote Sensing Letters, 2016, 13(6): 747–751. doi: 10.1109/LGRS.2015.2513443.
[93] HELBER P, BISCHKE B, DENGEL A, et al. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(7): 2217–2226. doi: 10.1109/JSTARS.2019.2918242.
[94] ZHU B, LUI N, IRVIN J, et al. METER-ML: A multi-sensor earth observation benchmark for automated methane source mapping[C]. The 2nd Workshop on Complex Data Challenges in Earth Observation (CDCEO 2022) Co-Located with 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence (IJCAI-ECAI 2022), Vienna, Austria, 2022: 33–43.
[95] LI Kun, VOSSELMAN G, and YANG M Y. HRVQA: A visual question answering benchmark for high-resolution aerial images[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2024, 214: 65–81. doi: 10.1016/j.isprsjprs.2024.06.002.
[96] HOXHA G and MELGANI F. A novel SVM-based decoder for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5404514. doi: 10.1109/TGRS.2021.3105004.
[97] ZHENG Xiangtao, WANG Binqiang, DU Xingqian, et al. Mutual attention inception network for remote sensing visual question answering[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5606514. doi: 10.1109/TGRS.2021.3079918.
[98] BASHMAL L, BAZI Y, AL RAHHAL M M, et al. CapERA: Captioning events in aerial videos[J]. Remote Sensing, 2023, 15(8): 2139. doi: 10.3390/rs15082139.
[99] SUN Xian, WANG Peijin, LU Wanxuan, et al. RingMo: A remote sensing foundation model with masked image modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5612822. doi: 10.1109/TGRS.2022.3194732.
[100] BI Hanbo, FENG Yingchao, TONG Boyuan, et al. RingMoE: Mixture-of-modality-experts multi-modal foundation models for universal remote sensing image interpretation[J]. arXiv preprint arXiv: 2504.03166, 2025.
[101] CHEN Hao and SHI Zhenwei. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection[J]. Remote Sensing, 2020, 12(10): 1662. doi: 10.3390/rs12101662.
[102] SHI Qian, LIU Mengxi, LI Shengchen, et al. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5604816. doi: 10.1109/TGRS.2021.3085870.
[103] LIU Chenyang, ZHAO Rui, CHEN Hao, et al. Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5633520. doi: 10.1109/TGRS.2022.3218921.
[104] XIA Guisong, BAI Xiang, DING Jian, et al. DOTA: A large-scale dataset for object detection in aerial images[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 3974–3983. doi: 10.1109/CVPR.2018.00418.
[105] LI Ke, WAN Gang, CHENG Gong, et al. Object detection in optical remote sensing images: A survey and a new benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 159: 296–307. doi: 10.1016/j.isprsjprs.2019.11.023.
[106] ZHAO Yuanxin, ZHANG Mi, YANG Bingnan, et al. LuoJiaHOG: A hierarchy oriented geo-aware image caption dataset for remote sensing image-text retrieval[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 222: 130–151. doi: 10.1016/j.isprsjprs.2025.02.009.
[107] YUAN Zhenghang, XIONG Zhitong, MOU Lichao, et al. ChatEarthNet: A global-scale image-text dataset empowering vision-language geo-foundation models[J]. Earth System Science Data, 2025, 17(3): 1245–1263. doi: 10.5194/essd-17-1245-2025.
[108] LI Haodong, ZHANG Xiaofeng, and QU Haicheng. DDFAV: Remote sensing large vision language models dataset and evaluation benchmark[J]. Remote Sensing, 2025, 17(4): 719. doi: 10.3390/rs17040719.
[109] AN Xiao, SUN Jiaxing, GUI Zihan, et al. COREval: A comprehensive and objective benchmark for evaluating the remote sensing capabilities of large vision-language models[J]. arXiv preprint arXiv: 2411.18145, 2024.
[110] ZHOU Yue, FENG Litong, LAN Mengcheng, et al. GeoMath: A benchmark for multimodal mathematical reasoning in remote sensing[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025.
[111] GE Junyao, ZHANG Xu, ZHENG Yang, et al. RSTeller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 226: 146–163. doi: 10.1016/j.isprsjprs.2025.05.002.
[112] DU Siqi, TANG Shengjun, WANG Weixi, et al. Tree-GPT: Modular large language model expert system for forest remote sensing image understanding and interactive analysis[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2023, XLVIII-1-W2-2023: 1729–1736. doi: 10.5194/isprs-archives-XLVIII-1-W2-2023-1729-2023.
[113] GUO Haonan, SU Xin, WU Chen, et al. Remote sensing ChatGPT: Solving remote sensing tasks with ChatGPT and visual models[C]. IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 2024: 11474–11478. doi: 10.1109/IGARSS53475.2024.10640736.
[114] LIU Chenyang, CHEN Keyan, ZHANG Haotian, et al. Change-Agent: Towards interactive comprehensive remote sensing change interpretation and analysis[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5635616. doi: 10.1109/TGRS.2024.3425815.
[115] XU Wenjia, YU Zijian, MU Boyang, et al. RS-Agent: Automating remote sensing tasks through intelligent agents[J]. arXiv preprint arXiv: 2406.07089, 2024.
[116] SINGH S, FORE M, and STAMOULIS D. GeoLLM-Engine: A realistic environment for building geospatial copilots[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, 2024: 585–594. doi: 10.1109/CVPRW63382.2024.00063.
[117] SINGH S, FORE M, and STAMOULIS D. Evaluating tool-augmented agents in remote sensing platforms[J]. arXiv preprint arXiv: 2405.00709, 2024.