A Survey on Remote Sensing Multimodal Large Language Models: Framework, Core Technologies, and Future Perspectives
Abstract: In recent years, the rapid development of Multimodal Large Language Models (MLLMs) and their applications in remote sensing have garnered significant attention. Remote sensing MLLMs achieve deep integration of visual features and semantic information through bridging mechanisms between large language models and vision models, combined with joint training strategies. This integration facilitates a paradigm shift in intelligent remote sensing interpretation, from shallow semantic matching to higher-level understanding based on world knowledge. In this study, we systematically review the research progress on the applications of MLLMs in remote sensing, specifically examining the development of Remote Sensing MLLMs (RS-MLLMs), to provide a foundation for future research directions. We first clarify the concept of RS-MLLMs and review their development in chronological order. We then provide a detailed analysis and statistical summary of the proposed architectures, training methods, applicable tasks, and corresponding benchmark datasets, and introduce remote sensing agents. Finally, we summarize the research status of RS-MLLMs and discuss future research directions.
Keywords:
- Large Language Model
- Multimodal Large Language Model
- Remote Sensing Multimodal Large Language Model
- Vision-Language Model
- Remote Sensing Agent
Table 1. The specific structure of RS-MLLMs
RS-MLLM | Multimodal encoder | Multimodal feature projector | Pretrained LLM | Training hardware
AeroLite[57] | CLIP ViT-L/14 | A two-layer MLP | LLaMA3.2-3B | 4090
Aquila[55] | Aquila-CLIP ConvNext (A-CCN) | SFI | (MDA)-LLM (based on LLaMA3) | 4×A800
Aquila-plus[56] | CLIP ConvNeXt-L | Mask spatial feature extractor | Vicuna | -
CDChat[49] | CLIP ViT-L/14 | A two-layer MLP (GELU) | Vicuna-v1.5-7B | 3×A100
ChangeChat[48] | CLIP ViT/14 | A two-layer MLP | Vicuna-v1.5 | L20 (48 GB)
EagleVision[58] | Baseline detector | Attribute disentangle | InternLM2.5-7B-Chat, etc. | 8×A100
EarthDial[59] | Adaptive high resolution + Data fusion + InternViT-300M | A simple MLP | Phi-3-mini | 8×A100 (80 GB)
EarthGPT[12] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | A linear layer | LLaMA2-13B | 16×A100
EarthGPT-X[60] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L + Hybrid signals mutual understanding | Vision-to-language modality-align projection | LLaMA2-13B | 8×A100 (80 GB)
EarthMarker[61] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | A linear layer | LLaMA2-13B | 8×A100 (80 GB)
GeoChat[42] | CLIP ViT-L/14 | A two-layer MLP (GELU) | Vicuna-v1.5-7B | -
GeoGround[29] | CLIP ViT | A two-layer MLP | Vicuna-v1.5 | 8×V100 (32 GB)
GeoLLaVA-8K[62] | CLIP ViT-L/14 + A two-step tokens compression module | A linear layer | Vicuna-v1.5-7B | -
IFShip[46] | CLIP ViT-L/14 | A four-layer MLP (GELU) | Vicuna-13B | -
LHRS-Bot[11] | CLIP ViT-L/14 | Vision perceiver | LLaMA2-7B | 8×V100 (32 GB)
LHRS-Bot-Nova[63] | SigLIP-L/14 | Vision perceiver | LLaMA3-8B | 8×H100
Popeye[47] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | Alignment projection | LLaMA-7B | -
RingMoGPT[30] | EVA-CLIP ViT-g/14 | A Q-Former + a linear layer | Vicuna-13B | 8×A100 (80 GB)
RS-CapRet[45] | CLIP ViT-L/14 | Three linear layers | LLaMA2-7B | -
RSGPT[10] | EVA-G | A Q-Former + a linear layer | Vicuna-7B, Vicuna-13B | 8×A100
RS-LLaVA[64] | CLIP ViT-L | A two-layer MLP (GELU) | Vicuna-v1.5-7B, Vicuna-v1.5-13B | 2×A6000 (48 GB)
RSUniVLM[65] | SigLIP-400M | A two-layer MLP | QWen2-0.5B | 4×A40 (40 GB)
SkyEyeGPT[31] | EVA-CLIP | A linear layer | LLaMA2 | 4×3090
SkySenseGPT[44] | CLIP ViT-L/14 | A two-layer MLP | Vicuna-v1.5 | 4×A100 (40 GB)
Spectral-LLaVA[66] | SpectralGPT[34] (encoder only) | A linear layer | LLaMA3 | -
TEOChat[50] | CLIP ViT-L/14 | A two-layer MLP | LLaMA2 | A4000 (16 GB)
UniRS[13] | SigLIP + A change extraction module | A downsampling module + an MLP | Sheared-LLaMA (3B) | 4×4090 (24 GB)
VHM[43] | CLIP ViT-L/14 | A two-layer MLP | Vicuna-v1.5-7B | 16×A100 (80 GB)
Note: Italicized names indicate models that were not given a formal name or abbreviation in the original paper.
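Most entries in Table 1 follow the same LLaVA-style composition: a (typically frozen) vision encoder, a lightweight projector that maps patch features into the word-embedding space, and a pretrained LLM that consumes the projected visual tokens together with the text tokens. The sketch below is a minimal, illustrative PyTorch rendering of that pattern; the class name, module choices, and dimensions are placeholders rather than the implementation of any specific model in the table.

```python
import torch
import torch.nn as nn

class RSMLLMSketch(nn.Module):
    """Illustrative encoder-projector-LLM composition; not any specific model."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., a frozen CLIP ViT-L/14
        self.projector = nn.Sequential(           # "a two-layer MLP (GELU)" pattern from Table 1
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                            # e.g., Vicuna-v1.5-7B or LLaMA2

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        # Patch features from the vision encoder are projected into the LLM's
        # embedding space and prepended to the text token embeddings.
        with torch.no_grad():                     # vision backbone is usually kept frozen
            patch_feats = self.vision_encoder(image)       # (B, N_patches, vision_dim)
        vision_tokens = self.projector(patch_feats)         # (B, N_patches, llm_dim)
        inputs_embeds = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)        # autoregressive training/decoding
```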
Table 2. Datasets used for training RS-MLLMs
RS-MLLM | Instruction tuning | Other training stages
AeroLite[57] | - | RSSCN7, DLRSD, iSAID, LoveDA, WHU, UCM-Captions, Sydney-Captions
Aquila[55] | FIT-RS | CapERA, UCM-Captions, Sydney-Captions, NWPU-Captions, RSICD, RSITMD, RSVQA-HR, RSVQA-LR, WHU_RS19
Aquila-plus[56] | Aquila-plus-100K | -
CDChat[49] | LEVIR-CD, SYSU-CD | LEVIR-CD, SYSU-CD
ChangeChat[48] | ChangeChat-87k | -
EagleVision[58] | EVAttrs-95K | EVAttrs-95K
EarthDial[59] | EarthDial-Instruct | EarthDial-Instruct
EarthGPT[12] | MMRS-1M | LAION-400M, COCO Caption
EarthGPT-X[60] | M-RSVP | -
EarthMarker[61] | RSVP-3M | COCO Caption, RSVP-3M, RefCOCO, RefCOCO+
GeoChat[42] | RS multimodal instruction following dataset | -
GeoGround[29] | refGeo | FAIR1M, DIOR, DOTA
GeoLLaVA-8K[62] | SuperRS-VQA, HighRS-VQA | -
IFShip[46] | TITANIC-FGS | -
LHRS-Bot[11] | LLaVA complex reasoning dataset, NWPU, RSITMD, LHRS-Instruct | LHRS-Align
LHRS-Bot-Nova[63] | Multi-task instruction dataset | LHRS-Align-Recap, LHRS-Instruct, LHRS-Instruct-Plus, LRV-Instruct
Popeye[47] | MMShip | COCO Caption
RingMoGPT[30] | Instruction-tuning dataset | Image-text pre-training dataset
RS-CapRet[45] | - | RSICD, UCM-Captions, Sydney-Captions, NWPU-Captions
RSGPT[10] | - | RSICap
RS-LLaVA[64] | RS-Instructions | -
RSUniVLM[65] | RSUniVLM-Instruct-1.2M | RSUniVLM-Resampled
SkyEyeGPT[31] | SkyEye-968k | SkyEye-968k
SkySenseGPT[44] | FIT-RS, NWPU-Captions, UCM-Captions, RSITMD, EarthVQA, Floodnet-VQA, RSVQA-LR, DOTA, DIOR, FAIR1M | -
Spectral-LLaVA[66] | BigEarthNet-v2 | fMoW, BigEarthNet-v1
TEOChat[50] | TEOChatlas | -
UniRS[13] | GeoChat-Instruct, LEVIR-CC, EAR | -
VHM[43] | VersaD-Instruct, VariousRS-Instruct, HnstD | VersaD
Note: Since the focus is on the instruction tuning stage, datasets used in all other training stages are merged into the third column. Italicized names indicate datasets that were not given a formal name or abbreviation in the original paper.
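Most of the instruction tuning corpora in Table 2 package each image with one or more rounds of dialogue in the conversation format popularized by LLaVA. The snippet below is a hypothetical sample meant only to illustrate that structure; the field names, file name, and answers are invented, and the exact schema differs across datasets.

```python
import json

# Hypothetical instruction-tuning sample in the LLaVA-style conversation format.
sample = {
    "id": "example_0001",
    "image": "dota_patch_0012.png",   # invented image tile name
    "conversations": [
        {"from": "human", "value": "<image>\nHow many airplanes are parked on the apron?"},
        {"from": "gpt", "value": "There are four airplanes parked on the apron."},
        {"from": "human", "value": "What type of scene is shown in this image?"},
        {"from": "gpt", "value": "The image shows an airport scene."},
    ],
}

print(json.dumps(sample, indent=2, ensure_ascii=False))
```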
Table 3. The applicable tasks of RS-MLLMs and their corresponding benchmark datasets
RS-MLLM | RSIC | RSVQA | RSVG | RSSC
AeroLite[57] | Sydney-Captions[74], UCM-Captions[74] | – | – | –
Aquila[55] | RSICD[75], Sydney-Captions[74], UCM-Captions[74], FIT-RS[44] | RSVQA-LR[76], RSVQA-HR[76], FIT-RS[44] | – | –
EarthDial[59] | NWPU-Captions[77], RSICD[75], RSITMD-Captions[78], Sydney-Captions[74], UCM-Captions[74] | RSVQA-LR[76], RSVQA-HR[76] | – | AID[79], UCMerced[80], WHU-RS19[81], BigEarthNet[82], xBD Set 1[83], fMoW[84]
EarthGPT[12] | NWPU-Captions[77] | CRSVQA[85], RSVQA-HR[76] | DIOR-RSVG[86] | NWPU-RESISC45[87], CLRS[88], NaSC-TG2[89]
GeoChat[42] | – | RSVQA-LR[76], RSVQA-HR[76] | GeoChat*[42] | AID[79], UCMerced[80]
GeoGround[29] | – | – | DIOR-RSVG[86], RSVG[90], GeoChat*[42], VRSBench*[91], AVVG[29] | –
LHRS-Bot[11] | – | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], EuroSAT[93], METER-ML[94], fMoW[84]
LHRS-Bot-Nova[63] | – | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], EuroSAT[93], METER-ML[94], fMoW[84]
RingMoGPT[30] | DOTA-Cap[30], DIOR-Cap[30], NWPU-Captions[77], RSICD[75], Sydney-Captions[74], UCM-Captions[74] | HRVQA[95] | – | AID[79], NWPU-RESISC45[87], UCMerced[80], WHU-RS19[81]
RS-CapRet[45] | NWPU-Captions[77], RSICD[75], Sydney-Captions[74], UCM-Captions[74] | – | – | –
RSGPT[10] | RSIEval[10], UCM-Captions[74], Sydney-Captions[74], RSICD[75] | RSIEval[10], RSVQA-LR[76], RSVQA-HR[76] | – | –
RS-LLaVA[64] | UCM-Captions[74], UAV[96] | RSVQA-LR[76], RSIVQA-DOTA[97] | – | –
RSUniVLM[65] | – | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], VRSBench[91] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92]
SkyEyeGPT[31] | UCM-Captions[74], CapERA[98] | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | –
TEOChat[50] | – | RSVQA-LR[76], RSVQA-HR[76] | – | AID[79], UCMerced[80]
UniRS[13] | – | RSVQA-LR[76], RSVQA-HR[76], CRSVQA[85] | – | –
VHM[43] | – | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], METER-ML[94]
Note: * indicates that the test set was modified. RSIC: remote sensing image captioning; RSVQA: remote sensing visual question answering; RSVG: remote sensing visual grounding; RSSC: remote sensing scene classification.
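A single RS-MLLM typically covers the four task families in Table 3 by changing only the instruction, with the image supplied through a special placeholder token in the prompt. The templates below are generic illustrations of that idea, not the exact prompts (or task tokens) used by any of the models listed above.

```python
# Generic, illustrative instructions for the four task families of Table 3;
# real systems use their own carefully designed, often task-tokenized prompts.
TASK_PROMPTS = {
    "RSIC": "<image>\nDescribe this remote sensing image in one sentence.",
    "RSVQA": "<image>\nAre there any residential buildings in this image? Answer yes or no.",
    "RSVG": "<image>\nOutput the bounding box of the storage tank closest to the river.",
    "RSSC": "<image>\nClassify the scene. Choose one of: airport, farmland, forest, harbor.",
}

for task, prompt in TASK_PROMPTS.items():
    print(f"[{task}] {prompt}")
```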
Table 4. Performance of RS-MLLMs on various remote sensing image captioning datasets
Dataset | Model | BLEU-1↑ | BLEU-2↑ | BLEU-3↑ | BLEU-4↑ | METEOR↑ | ROUGE-L↑ | CIDEr↑ | SPICE↑
NWPU-Captions[77] | EarthDial[59] | – | – | – | – | 0.806 | 0.400 | – | –
NWPU-Captions[77] | EarthGPT[12] | 0.871 | 0.787 | 0.716 | 0.655 | 0.445 | 0.782 | 1.926 | 0.322
NWPU-Captions[77] | EarthMarker[61] | 0.844 | 0.731 | 0.629 | 0.543 | 0.375 | 0.700 | 1.629 | 0.268
NWPU-Captions[77] | RS-CapRet[45] | 0.871 | 0.787 | 0.717 | 0.656 | 0.436 | 0.776 | 1.929 | 0.311
RSICD[75] | Aquila[55] | 0.746 | – | – | – | – | – | – | –
RSICD[75] | EarthDial[59] | – | – | – | – | 0.562 | 0.276 | – | –
RSICD[75] | RS-CapRet[45] | 0.741 | 0.622 | 0.529 | 0.455 | 0.376 | 0.649 | 2.605 | 0.484
RSICD[75] | RSGPT[10] | 0.703 | 0.542 | 0.440 | 0.368 | 0.301 | 0.533 | 1.029 | –
RSICD[75] | SkyEyeGPT[31] | 0.867 | 0.767 | 0.673 | 0.600 | 0.354 | 0.626 | 0.837 | –
RSICD[75] | RingMoGPT[30] | – | – | – | – | 0.343 | 0.616 | 2.758 | –
UCM-Captions[74] | AeroLite[57] | 0.934 | – | – | 0.796 | 0.498 | 0.880 | – | –
UCM-Captions[74] | Aquila[55] | 0.883 | – | – | – | – | – | – | –
UCM-Captions[74] | EarthDial[59] | – | – | – | – | 0.514 | 0.342 | – | –
UCM-Captions[74] | RS-CapRet[45] | 0.843 | 0.779 | 0.722 | 0.670 | 0.472 | 0.817 | 3.548 | 0.525
UCM-Captions[74] | RSGPT[10] | 0.861 | 0.791 | 0.723 | 0.657 | 0.422 | 0.783 | 3.332 | –
UCM-Captions[74] | SkyEyeGPT[31] | 0.907 | 0.857 | 0.816 | 0.784 | 0.462 | 0.795 | 2.368 | –
UCM-Captions[74] | RS-LLaVA[64] | 0.900 | 0.849 | 0.803 | 0.760 | 0.492 | 0.858 | 3.556 | –
UCM-Captions[74] | RingMoGPT[30] | – | – | – | – | 0.499 | 0.833 | 3.593 | –
Sydney-Captions[74] | AeroLite[57] | 0.919 | – | – | 0.759 | 0.475 | 0.837 | – | –
Sydney-Captions[74] | Aquila[55] | 0.834 | – | – | – | – | – | – | –
Sydney-Captions[74] | EarthDial[59] | – | – | – | – | 0.573 | 0.410 | – | –
Sydney-Captions[74] | RS-CapRet[45] | 0.787 | 0.700 | 0.628 | 0.564 | 0.388 | 0.707 | 2.392 | 0.434
Sydney-Captions[74] | RSGPT[10] | 0.823 | 0.753 | 0.686 | 0.622 | 0.414 | 0.748 | 2.731 | –
Sydney-Captions[74] | SkyEyeGPT[31] | 0.919 | 0.856 | 0.809 | 0.774 | 0.466 | 0.777 | 1.811 | –
Sydney-Captions[74] | RingMoGPT[30] | – | – | – | – | 0.421 | 0.734 | 2.888 | –
FIT-RS[44] | Aquila[55] | 0.351 | – | – | – | – | – | – | –
FIT-RS[44] | GeoChat[42] | 0.088 | – | – | – | – | – | – | –
FIT-RS[44] | SkySenseGPT[44] | 0.273 | – | – | – | – | – | – | –
Note: All results are quoted from the corresponding original papers; bold marks the best result.
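The captioning scores in Table 4 are standard n-gram and consensus metrics. As a rough illustration of how such numbers are produced, the sketch below computes corpus-level BLEU-4 with NLTK over tokenized captions; in practice METEOR, ROUGE-L, CIDEr, and SPICE are usually obtained from the pycocoevalcap toolkit. The captions used here are made-up examples.

```python
# Corpus-level BLEU-4 over tokenized captions; references and candidates are invented.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [  # per image: a list of reference captions, each a list of tokens
    [["many", "planes", "are", "parked", "at", "the", "airport"],
     ["several", "airplanes", "parked", "near", "the", "terminal"]],
]
hypotheses = [  # per image: one generated caption as a list of tokens
    ["many", "planes", "parked", "at", "the", "airport"],
]

bleu4 = corpus_bleu(
    references, hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.3f}")
```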
Table 5. Performance of RS-MLLMs on various remote sensing visual question answering datasets (%)
Model | RSVQA-LR[76] Presence | RSVQA-LR[76] Compare | RSVQA-LR[76] Rural/Urban | RSVQA-LR[76] Avg | RSVQA-HR[76] Presence | RSVQA-HR[76] Compare | RSVQA-HR[76] Avg | CRSVQA[85] Avg | FIT-RS[44] Avg
Aquila[55] | 92.72 | – | – | – | 92.64 | – | – | – | 83.87
EarthDial[59] | 92.58 | 92.75 | 94.00 | 92.70 | 58.89 | 83.11 | 72.45 | – | –
EarthGPT[12] | – | – | – | – | 62.77 | 79.53 | 72.06 | 82.00 | –
GeoChat[42] | 91.09 | 90.33 | 94.00 | 90.70 | 59.02 | 83.16 | – | – | 53.47
LHRS-Bot[11] | 89.07 | 88.51 | 90.00 | 89.19 | 92.57 | 92.53 | 92.55 | – | –
LHRS-Bot-Nova[63] | 89.00 | 90.71 | 89.11 | 89.61 | 91.68 | 92.44 | 92.06 | – | –
RSGPT[10] | 91.17 | 91.70 | 94.00 | 92.29 | 90.92 | 90.02 | 90.47 | – | –
RS-LLaVA[64] | 92.27 | 91.37 | 95.00 | 88.10 | – | – | – | – | –
RSUniVLM[65] | 92.00 | 91.51 | 92.65 | 92.05 | 90.81 | 90.88 | 90.85 | – | –
SkyEyeGPT[31] | 88.93 | 88.63 | 75.00 | 84.19 | 80.00 | 80.13 | 82.56 | – | –
SkySenseGPT[44] | 95.00 | 91.07 | 92.00 | 92.69 | 69.14 | 84.14 | 76.64 | – | 79.76
TEOChat[50] | 91.70 | 92.70 | 94.00 | – | 67.50 | 81.10 | – | – | –
UniRS[13] | 91.81 | 93.23 | 93.00 | 92.63 | 59.29 | 84.05 | 73.15 | 86.67 | –
VHM[43] | 91.17 | 89.89 | 88.00 | 89.33 | 64.00 | 83.50 | 73.75 | – | –
Note: All results are quoted from the corresponding original papers; bold marks the best result.
Table 6. Performance of RS-MLLMs on various remote sensing visual grounding datasets (%)
Table 7. Performance of RS-MLLMs on various remote sensing scene classification datasets (%)
Model | UCMerced[80] | AID[79] | NWPU-RESISC45[87] | CLRS[88] | NaSC-TG2[89] | WHU-RS19[81] | SIRI-WHU[92] | EuroSAT[93] | METER-ML[94] | fMoW[84]
EarthDial[59] | 92.42 | 88.76 | – | – | – | 96.21 | – | – | – | 70.03
EarthGPT[12] | – | – | 93.84 | 77.37 | 74.72 | – | – | – | – | –
EarthMarker[61] | 86.52 | 77.97 | – | – | – | – | – | – | – | –
GeoChat[42] | 84.43 | 72.03 | – | – | – | – | – | – | – | –
LHRS-Bot[11] | – | 91.26 | 83.94 | – | – | 93.17 | 62.66 | 51.40 | 69.81 | 56.56
LHRS-Bot-Nova[63] | – | 88.32 | 86.80 | – | – | 95.63 | 74.75 | 63.54 | 70.05 | 57.11
RingMoGPT[30] | 86.48 | 97.94 | 96.47 | – | – | 97.71 | – | – | – | –
RSUniVLM[65] | – | 81.18 | 86.86 | – | – | 84.91 | 68.13 | – | – | –
SkyEyeGPT[31] | 60.95 | 26.30 | – | – | – | – | – | – | – | –
TEOChat[50] | 86.30 | 80.90 | – | – | – | – | – | – | – | –
VHM[43] | – | 91.70 | 94.54 | – | – | 95.80 | 70.88 | – | 72.74 | –
Note: All results are quoted from the corresponding original papers; bold marks the best result.
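The scene classification accuracies in Table 7 are typically obtained by prompting the model with the candidate class names and matching its free-form reply against the ground-truth label; the exact prompt wording and answer-matching rules vary from paper to paper. The sketch below illustrates one plausible version of that protocol; model.generate is a hypothetical placeholder for whatever inference API a given RS-MLLM exposes, and the containment-based matching rule is deliberately simplistic.

```python
# `model.generate` is a hypothetical placeholder for an RS-MLLM inference call.
def classify_scene(model, image, class_names):
    prompt = ("<image>\nClassify the scene of this remote sensing image. "
              f"Choose one of: {', '.join(class_names)}. Answer with the class name only.")
    reply = model.generate(image, prompt).strip().lower()
    for name in class_names:                 # simple containment match against the reply
        if name.lower() in reply:
            return name
    return None                              # reply did not mention any candidate class


def accuracy(model, samples, class_names):
    """samples: iterable of (image, ground_truth_label) pairs; returns accuracy in %."""
    correct = sum(classify_scene(model, img, class_names) == label
                  for img, label in samples)
    return 100.0 * correct / len(samples)
```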
[1] WANG Qiao and LIU Sihan. Research and implementation of national environmental remote sensing monitoring system[J]. Journal of Remote Sensing, 2016, 20(5): 1161–1169. doi: 10.11834/jrs.20166201.
[2] AN Liqiang, ZHANG Jingfa, MONTEIRO R, et al. A review and prospective research of earthquake damage assessment and remote sensing[J]. National Remote Sensing Bulletin, 2024, 28(4): 860–884. doi: 10.11834/jrs.20232093.
[3] ZHANG Wangfei, CHEN Erxue, LI Zengyuan, et al. Review of applications of radar remote sensing in agriculture[J]. Journal of Radars, 2020, 9(3): 444–461. doi: 10.12000/JR20051.
[4] LI Yansheng, DANG Bo, ZHANG Yongjun, et al. Water body classification from high-resolution optical remote sensing imagery: Achievements and perspectives[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2022, 187: 306–327. doi: 10.1016/j.isprsjprs.2022.03.013.
[5] LI Yansheng, WEI Fanyi, ZHANG Yongjun, et al. HS2P: Hierarchical spectral and structure-preserving fusion network for multimodal remote sensing image cloud and shadow removal[J]. Information Fusion, 2023, 94: 215–228. doi: 10.1016/j.inffus.2023.02.002.
[6] CHEN Yongqi, FENG Shou, ZHAO Chunhui, et al. High-resolution remote sensing image change detection based on Fourier feature interaction and multiscale perception[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5539115. doi: 10.1109/TGRS.2024.3500073.
[7] YANG Guang and LIU Xiangnan. The present research condition and development trend of remotely sensed imagery interpretation[J]. Remote Sensing for Land & Resources, 2004(2): 7–10, 15. doi: 10.3969/j.issn.1001-070X.2004.02.002.
[8] ZHAO W X, ZHOU Kun, LI Junyi, et al. A survey of large language models[J]. arXiv preprint arXiv: 2303.18223, 2023.
[9] YIN Shukang, FU Chaoyou, ZHAO Sirui, et al. A survey on multimodal large language models[J]. National Science Review, 2024, 11(12): nwae403. doi: 10.1093/nsr/nwae403.
[10] HU Yuan, YUAN Jianlong, WEN Congcong, et al. RSGPT: A remote sensing vision language model and benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 224: 272–286. doi: 10.1016/j.isprsjprs.2025.03.028.
[11] MUHTAR D, LI Zhenshi, GU Feng, et al. LHRS-Bot: Empowering remote sensing with VGI-enhanced large multimodal language model[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 440–457. doi: 10.1007/978-3-031-72904-1_26.
[12] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. EarthGPT: A universal multimodal large language model for multisensor image comprehension in remote sensing domain[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5917820. doi: 10.1109/TGRS.2024.3409624.
[13] LI Yujie, XU Wenjia, LI Guangzuo, et al. UniRS: Unifying multi-temporal remote sensing tasks through vision language models[J]. arXiv preprint arXiv: 2412.20742, 2024.
[14] VOUTILAINEN A. A syntax-based part-of-speech analyser[C]. The 7th Conference of the European Chapter of the Association for Computational Linguistics, Dublin, Ireland, 1995.
[15] BRILL E and RESNIK P. A rule-based approach to prepositional phrase attachment disambiguation[C]. The 15th International Conference on Computational Linguistics, Kyoto, Japan, 1994.
[16] HINTON G E, OSINDERO S, and TEH Y W. A fast learning algorithm for deep belief nets[J]. Neural Computation, 2006, 18(7): 1527–1554. doi: 10.1162/neco.2006.18.7.1527.
[17] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
[18] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. https://openai.com/index/language-unsupervised/, 2018.
[19] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[EB/OL]. https://openai.com/index/better-language-models/, 2019.
[20] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 159.
[21] OpenAI. GPT-4 technical report[J]. arXiv preprint arXiv: 2303.08774, 2023.
[22] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint arXiv: 2302.13971, 2023.
[23] BAI Jinze, BAI Shuai, CHU Yunfei, et al. Qwen technical report[J]. arXiv preprint arXiv: 2309.16609, 2023.
[24] DeepSeek-AI. DeepSeek-V3 technical report[J]. arXiv preprint arXiv: 2412.19437, 2024.
[25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
[26] LIU Haotian, LI Chunyuan, WU Qingyang, et al. Visual instruction tuning[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1516.
[27] LIN Bin, YE Yang, ZHU Bin, et al. Video-LLaVA: Learning united visual representation by alignment before projection[C]. 2024 Conference on Empirical Methods in Natural Language Processing, Miami, USA, 2024: 5971–5984.
[28] KOH J Y, FRIED D, and SALAKHUTDINOV R R. Generating images with multimodal language models[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 939.
[29] ZHOU Yue, LAN Mengcheng, LI Xiang, et al. GeoGround: A unified large vision-language model for remote sensing visual grounding[J]. arXiv preprint arXiv: 2411.11904, 2024.
[30] WANG Peijin, HU Huiyang, TONG Boyuan, et al. RingMoGPT: A unified remote sensing foundation model for vision, language, and grounded tasks[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5611320. doi: 10.1109/TGRS.2024.3510833.
[31] ZHAN Yang, XIONG Zhitong, and YUAN Yuan. SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 221: 64–77. doi: 10.1016/j.isprsjprs.2025.01.020.
[32] ZHANG Yongjun, LI Yansheng, DANG Bo, et al. Multi-modal remote sensing large foundation models: Current research status and future prospect[J]. Acta Geodaetica et Cartographica Sinica, 2024, 53(10): 1942–1954. doi: 10.11947/j.AGCS.2024.20240019.
[33] HONG Danfeng, HAN Zhu, YAO Jing, et al. SpectralFormer: Rethinking hyperspectral image classification with transformers[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5518615. doi: 10.1109/TGRS.2021.3130716.
[34] HONG Danfeng, ZHANG Bing, LI Xuyang, et al. SpectralGPT: Spectral remote sensing foundation model[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(8): 5227–5244. doi: 10.1109/TPAMI.2024.3362475.
[35] FULLER A, MILLARD K, and GREEN J R. CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 241.
[36] WANG Yi, ALBRECHT C M, BRAHAM N A A, et al. Decoupling common and unique representations for multimodal self-supervised learning[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 286–303. doi: 10.1007/978-3-031-73397-0_17.
[37] ZHANG Liangpei, ZHANG Lefei, and YUAN Qiangqiang. Large remote sensing model: Progress and prospects[J]. Geomatics and Information Science of Wuhan University, 2023, 48(10): 1574–1581. doi: 10.13203/j.whugis20230341.
[38] CHIANG W L, LI Zhuohan, LIN Zi, et al. Vicuna: An open-source Chatbot impressing GPT-4 with 90%* ChatGPT quality[EB/OL]. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
[39] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning[J]. arXiv preprint arXiv: 2501.12948, 2025.
[40] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: A visual language model for few-shot learning[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1723.
[41] LI Junnan, LI Dongxu, SAVARESE S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]. The 40th International Conference on Machine Learning, Honolulu, USA, 2023: 814.
[42] KUCKREJA K, DANISH M S, NASEER M, et al. GeoChat: Grounded large vision-language model for remote sensing[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 27831–27840. doi: 10.1109/CVPR52733.2024.02629.
[43] PANG Chao, WENG Xingxing, WU Jiang, et al. VHM: Versatile and honest vision language model for remote sensing image analysis[C]. The 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 6381–6388. doi: 10.1609/aaai.v39i6.32683.
[44] LUO Junwei, PANG Zhen, ZHANG Yongjun, et al. SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding[J]. arXiv preprint arXiv: 2406.10100, 2024.
[45] SILVA J D, MAGALHÃES J, TUIA D, et al. Large language models for captioning and retrieving remote sensing images[J]. arXiv preprint arXiv: 2402.06475, 2024.
[46] GUO Mingning, WU Mengwei, SHEN Yuxiang, et al. IFShip: Interpretable fine-grained ship classification with domain knowledge-enhanced vision-language models[J]. Pattern Recognition, 2025, 166: 111672. doi: 10.1016/j.patcog.2025.111672.
[47] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. Popeye: A unified visual-language model for multisource ship detection from remote sensing imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024, 17: 20050–20063. doi: 10.1109/JSTARS.2024.3488034.
[48] DENG Pei, ZHOU Wenqian, and WU Hanlin. ChangeChat: An interactive model for remote sensing change analysis via multimodal instruction tuning[C]. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025: 1–5. doi: 10.1109/ICASSP49660.2025.10890620.
[49] NOMAN M, AHSAN N, NASEER M, et al. CDChat: A large multimodal model for remote sensing change description[J]. arXiv preprint arXiv: 2409.16261, 2024.
[50] IRVIN J A, LIU E R, CHEN J C, et al. TEOChat: A large vision-language assistant for temporal earth observation data[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025.
[51] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. The 38th International Conference on Machine Learning, 2021: 8748–8763.
[52] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
[53] SUN Quan, FANG Yuxin, WU L, et al. EVA-CLIP: Improved training techniques for CLIP at scale[J]. arXiv preprint arXiv: 2303.15389, 2023.
[54] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. The 9th International Conference on Learning Representations, 2021.
[55] LU Kaixuan, ZHANG Ruiqian, HUANG Xiao, et al. Aquila: A hierarchically aligned visual-language model for enhanced remote sensing image comprehension[J]. arXiv preprint arXiv: 2411.06074, 2024.
[56] LU Kaixuan. Aquila-plus: Prompt-driven visual-language models for pixel-level remote sensing image understanding[J]. arXiv preprint arXiv: 2411.06142, 2024.
[57] ZI Xing, NI Tengjun, FAN Xianjing, et al. AeroLite: Tag-guided lightweight generation of aerial image captions[J]. arXiv preprint arXiv: 2504.09528, 2025.
[58] JIANG Hongxiang, YIN Jihao, WANG Qixiong, et al. EagleVision: Object-level attribute multimodal LLM for remote sensing[J]. arXiv preprint arXiv: 2503.23330, 2025.
[59] SONI S, DUDHANE A, DEBARY H, et al. EarthDial: Turning multi-sensory earth observations to interactive dialogues[C]. The Computer Vision and Pattern Recognition Conference, Nashville, USA, 2025: 14303–14313.
[60] ZHANG Wei, CAI Miaoxin, NING Yaqian, et al. EarthGPT-X: Enabling MLLMs to flexibly and comprehensively understand multi-source remote sensing imagery[J]. arXiv preprint arXiv: 2504.12795, 2025.
[61] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. EarthMarker: A visual prompting multimodal large language model for remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5604219. doi: 10.1109/TGRS.2024.3523505.
[62] WANG Fengxiang, CHEN Mingshuo, LI Yueying, et al. GeoLLaVA-8K: Scaling remote-sensing multimodal large language models to 8K resolution[J]. arXiv preprint arXiv: 2505.21375, 2025.
[63] LI Zhenshi, MUHTAR D, GU Feng, et al. LHRS-Bot-Nova: Improved multimodal large language model for remote sensing vision-language interpretation[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 227: 539–550. doi: 10.1016/j.isprsjprs.2025.06.003.
[64] BAZI Y, BASHMAL L, AL RAHHAL M M, et al. RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery[J]. Remote Sensing, 2024, 16(9): 1477. doi: 10.3390/rs16091477.
[65] LIU Xu and LIAN Zhouhui. RSUniVLM: A unified vision language model for remote sensing via granularity-oriented mixture of experts[J]. arXiv preprint arXiv: 2412.05679, 2024.
[66] KARANFIL E, IMAMOGLU N, ERDEM E, et al. A vision-language framework for multispectral scene representation using language-grounded features[J]. arXiv preprint arXiv: 2501.10144, 2025.
[67] ZHANG Hao, LI Feng, LIU Shilong, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection[C]. The 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.
[68] ZHAI Xiaohua, MUSTAFA B, KOLESNIKOV A, et al. Sigmoid loss for language image pre-training[C]. The IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 11941–11952. doi: 10.1109/ICCV51070.2023.01100.
[69] ZHANG Jihai, QU Xiaoye, ZHU Tong, et al. CLIP-MoE: Towards building mixture of experts for CLIP with diversified multiplet upcycling[J]. arXiv preprint arXiv: 2409.19291, 2024.
[70] WANG Weihan, LV Qingsong, YU Wenmeng, et al. CogVLM: Visual expert for pretrained language models[C]. The 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 3860.
[71] LIU Haotian, LI Chunyuan, LI Yuheng, et al. Improved baselines with visual instruction tuning[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 26286–26296. doi: 10.1109/CVPR52733.2024.02484.
[72] KAUFMANN T, WENG P, BENGS V, et al. A survey of reinforcement learning from human feedback[J]. Transactions on Machine Learning Research, in press, 2025.
[73] HU E J, SHEN Yelong, WALLIS P, et al. LoRA: Low-rank adaptation of large language models[C]. The 10th International Conference on Learning Representations, 2022.
[74] QU Bo, LI Xuelong, TAO Dacheng, et al. Deep semantic understanding of high resolution remote sensing image[C]. 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 2016: 1–5. doi: 10.1109/CITS.2016.7546397.
[75] LU Xiaoqiang, WANG Binqiang, ZHENG Xiangtao, et al. Exploring models and data for remote sensing image caption generation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56(4): 2183–2195. doi: 10.1109/TGRS.2017.2776321.
[76] LOBRY S, MARCOS D, MURRAY J, et al. RSVQA: Visual question answering for remote sensing data[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(12): 8555–8566. doi: 10.1109/TGRS.2020.2988782.
[77] CHENG Qimin, HUANG Haiyan, XU Yuan, et al. NWPU-captions dataset and MLCA-Net for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5629419. doi: 10.1109/TGRS.2022.3201474.
[78] YUAN Zhiqiang, ZHANG Wenkai, FU Kun, et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4404119. doi: 10.1109/TGRS.2021.3078451.
[79] XIA Guisong, HU Jingwen, HU Fan, et al. AID: A benchmark data set for performance evaluation of aerial scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(7): 3965–3981. doi: 10.1109/TGRS.2017.2685945.
[80] YANG Yi and NEWSAM S. Bag-of-visual-words and spatial extensions for land-use classification[C]. The 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, USA, 2010: 270–279. doi: 10.1145/1869790.1869829.
[81] DAI Dengxin and YANG Wen. Satellite image classification via two-layer sparse coding with biased image representation[J]. IEEE Geoscience and Remote Sensing Letters, 2011, 8(1): 173–176. doi: 10.1109/LGRS.2010.2055033.
[82] SUMBUL G, CHARFUELAN M, DEMIR B, et al. BigEarthNet: A large-scale benchmark archive for remote sensing image understanding[C]. IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 2019: 5901–5904. doi: 10.1109/IGARSS.2019.8900532.
[83] GUPTA R, GOODMAN B, PATEL N, et al. Creating xBD: A dataset for assessing building damage from satellite imagery[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, USA, 2019: 10–17.
[84] CHRISTIE G, FENDLEY N, WILSON J, et al. Functional map of the world[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6172–6180. doi: 10.1109/CVPR.2018.00646.
[85] ZHANG Meimei, CHEN Fang, and LI Bin. Multistep question-driven visual question answering for remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 4704912. doi: 10.1109/TGRS.2023.3312479.
[86] ZHAN Yang, XIONG Zhitong, and YUAN Yuan. RSVG: Exploring data and models for visual grounding on remote sensing data[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5604513. doi: 10.1109/TGRS.2023.3250471.
[87] CHENG Gong, HAN Junwei, and LU Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art[J]. Proceedings of the IEEE, 2017, 105(10): 1865–1883. doi: 10.1109/JPROC.2017.2675998.
[88] LI Haifeng, JIANG Hao, GU Xin, et al. CLRS: Continual learning benchmark for remote sensing image scene classification[J]. Sensors, 2020, 20(4): 1226. doi: 10.3390/s20041226.
[89] ZHOU Zhuang, LI Shengyang, WU Wei, et al. NaSC-TG2: Natural scene classification with Tiangong-2 remotely sensed imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 3228–3242. doi: 10.1109/JSTARS.2021.3063096.
[90] SUN Yuxi, FENG Shanshan, LI Xutao, et al. Visual grounding in remote sensing images[C]. The 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022: 404–412. doi: 10.1145/3503161.3548316.
[91] LI Xiang, DING Jian, and ELHOSEINY M. VRSBench: A versatile vision-language benchmark dataset for remote sensing image understanding[C]. The 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 106.
[92] ZHU Qiqi, ZHONG Yanfei, ZHAO Bei, et al. Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery[J]. IEEE Geoscience and Remote Sensing Letters, 2016, 13(6): 747–751. doi: 10.1109/LGRS.2015.2513443.
[93] HELBER P, BISCHKE B, DENGEL A, et al. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(7): 2217–2226. doi: 10.1109/JSTARS.2019.2918242.
[94] ZHU B, LUI N, IRVIN J, et al. METER-ML: A multi-sensor earth observation benchmark for automated methane source mapping[C]. The 2nd Workshop on Complex Data Challenges in Earth Observation (CDCEO 2022) Co-Located with 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence (IJCAI-ECAI 2022), Vienna, Austria, 2022: 33–43.
[95] LI Kun, VOSSELMAN G, and YANG M Y. HRVQA: A visual question answering benchmark for high-resolution aerial images[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2024, 214: 65–81. doi: 10.1016/j.isprsjprs.2024.06.002.
[96] HOXHA G and MELGANI F. A novel SVM-based decoder for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5404514. doi: 10.1109/TGRS.2021.3105004.
[97] ZHENG Xiangtao, WANG Binqiang, DU Xingqian, et al. Mutual attention inception network for remote sensing visual question answering[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5606514. doi: 10.1109/TGRS.2021.3079918.
[98] BASHMAL L, BAZI Y, AL RAHHAL M M, et al. CapERA: Captioning events in aerial videos[J]. Remote Sensing, 2023, 15(8): 2139. doi: 10.3390/rs15082139.
[99] SUN Xian, WANG Peijin, LU Wanxuan, et al. RingMo: A remote sensing foundation model with masked image modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5612822. doi: 10.1109/TGRS.2022.3194732.
[100] BI Hanbo, FENG Yingchao, TONG Boyuan, et al. RingMoE: Mixture-of-modality-experts multi-modal foundation models for universal remote sensing image interpretation[J]. arXiv preprint arXiv: 2504.03166, 2025.
[101] CHEN Hao and SHI Zhenwei. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection[J]. Remote Sensing, 2020, 12(10): 1662. doi: 10.3390/rs12101662.
[102] SHI Qian, LIU Mengxi, LI Shengchen, et al. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5604816. doi: 10.1109/TGRS.2021.3085870.
[103] LIU Chenyang, ZHAO Rui, CHEN Hao, et al. Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5633520. doi: 10.1109/TGRS.2022.3218921.
[104] XIA Guisong, BAI Xiang, DING Jian, et al. DOTA: A large-scale dataset for object detection in aerial images[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 3974–3983. doi: 10.1109/CVPR.2018.00418.
[105] LI Ke, WAN Gang, CHENG Gong, et al. Object detection in optical remote sensing images: A survey and a new benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 159: 296–307. doi: 10.1016/j.isprsjprs.2019.11.023.
[106] ZHAO Yuanxin, ZHANG Mi, YANG Bingnan, et al. LuoJiaHOG: A hierarchy oriented geo-aware image caption dataset for remote sensing image-text retrieval[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 222: 130–151. doi: 10.1016/j.isprsjprs.2025.02.009.
[107] YUAN Zhenghang, XIONG Zhitong, MOU Lichao, et al. ChatEarthNet: A global-scale image-text dataset empowering vision-language geo-foundation models[J]. Earth System Science Data, 2025, 17(3): 1245–1263. doi: 10.5194/essd-17-1245-2025.
[108] LI Haodong, ZHANG Xiaofeng, and QU Haicheng. DDFAV: Remote sensing large vision language models dataset and evaluation benchmark[J]. Remote Sensing, 2025, 17(4): 719. doi: 10.3390/rs17040719.
[109] AN Xiao, SUN Jiaxing, GUI Zihan, et al. COREval: A comprehensive and objective benchmark for evaluating the remote sensing capabilities of large vision-language models[J]. arXiv preprint arXiv: 2411.18145, 2024.
[110] ZHOU Yue, FENG Litong, LAN Mengcheng, et al. GeoMath: A benchmark for multimodal mathematical reasoning in remote sensing[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025.
[111] GE Junyao, ZHANG Xu, ZHENG Yang, et al. RSTeller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 226: 146–163. doi: 10.1016/j.isprsjprs.2025.05.002.
[112] DU Siqi, TANG Shengjun, WANG Weixi, et al. Tree-GPT: Modular large language model expert system for forest remote sensing image understanding and interactive analysis[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2023, XLVIII-1-W2-2023: 1729–1736. doi: 10.5194/isprs-archives-XLVIII-1-W2-2023-1729-2023.
[113] GUO Haonan, SU Xin, WU Chen, et al. Remote sensing ChatGPT: Solving remote sensing tasks with ChatGPT and visual models[C]. IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 2024: 11474–11478. doi: 10.1109/IGARSS53475.2024.10640736.
[114] LIU Chenyang, CHEN Keyan, ZHANG Haotian, et al. Change-Agent: Towards interactive comprehensive remote sensing change interpretation and analysis[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5635616. doi: 10.1109/TGRS.2024.3425815.
[115] XU Wenjia, YU Zijian, MU Boyang, et al. RS-Agent: Automating remote sensing tasks through intelligent agents[J]. arXiv preprint arXiv: 2406.07089, 2024.
[116] SINGH S, FORE M, and STAMOULIS D. GeoLLM-Engine: A realistic environment for building geospatial copilots[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, 2024: 585–594. doi: 10.1109/CVPRW63382.2024.00063.
[117] SINGH S, FORE M, and STAMOULIS D. Evaluating tool-augmented agents in remote sensing platforms[J]. arXiv preprint arXiv: 2405.00709, 2024.