遥感多模态大语言模型:架构、关键技术和未来展望

许文嘉 于睿卿 薛铭浩 张源奔 魏智威 张柘 彭木根

引用本文: 许文嘉, 于睿卿, 薛铭浩, 等. 遥感多模态大语言模型:架构、关键技术和未来展望[J]. 雷达学报(中英文), 待出版. doi: 10.12000/JR25088
Citation: XU Wenjia, YU Ruiqing, XUE Minghao, et al. A survey on remote sensing multimodal large language models: Framework, core technologies, and future perspectives[J]. Journal of Radars, in press. doi: 10.12000/JR25088


DOI: 10.12000/JR25088 CSTR: 32380.14.JR25088
基金项目: 国家自然科学基金(62301063),目标认知与应用技术重点实验室开放基金(2023-CXPT-LC-005),微波成像技术国家重点实验室开放基金(70323006)
    Author biographies:

    XU Wenjia, Ph.D., Associate Professor. Main research interests: integrated communication and remote sensing, and intelligent remote sensing interpretation.

    YU Ruiqing, Ph.D. Main research interest: remote sensing multimodal large language models.

    XUE Minghao, M.S. Main research interest: remote sensing multimodal large language models.

    ZHANG Yuanben, Ph.D., Associate Research Fellow. Main research interests: decision intelligence for spatio-temporal data, digital Earth, and digital twins.

    WEI Zhiwei, Ph.D., Lecturer. Main research interests: geographic visualization, LLM4Urban, spatio-temporal knowledge graphs, and 3D reconstruction.

    ZHANG Zhe, Ph.D., Research Fellow. Main research interests: novel-system SAR imaging and signal processing techniques.

    PENG Mugen, Ph.D., Professor. Main research interests: wireless mobile communications and low-Earth-orbit information and communication networks.

    Corresponding author:

    XU Wenjia, xuwenjia@bupt.edu.cn

  • 责任主编:张帆 Corresponding Editor: ZHANG Fan
  • Chinese Library Classification (CLC): TN957.51; TP753

A Survey on Remote Sensing Multimodal Large Language Models: Framework, Core Technologies, and Future Perspectives

Funds: The National Natural Science Foundation of China (62301063), The Open Fund of the Key Laboratory of Target Cognition and Application Technology (2023-CXPT-LC-005), The Open Fund of the National Key Laboratory of Microwave Imaging Technology (70323006)
  • Abstract: In recent years, the integration of artificial intelligence with remote sensing has become a research frontier, and the rapid development of Multimodal Large Language Models (MLLMs) brings new opportunities and challenges to intelligent remote sensing interpretation. By building bridging mechanisms between large language models and vision models and adopting joint training, remote sensing MLLMs deeply fuse visual features and semantic information of the remote sensing domain, effectively pushing intelligent interpretation from shallow semantic matching toward high-level world-knowledge understanding. This paper systematically reviews research on MLLMs in the remote sensing field to provide a basis for new research directions. Specifically, it first defines the concept of the Remote Sensing Multimodal Large Language Model (RS-MLLM) and traces the development of RS-MLLMs. It then elaborates on the model architectures, training methods, applicable tasks, and corresponding benchmark datasets of RS-MLLMs, and introduces remote sensing agents. Finally, it discusses the current state of research on RS-MLLMs and future development directions.

     

  • 图  1  典型的多模态大语言模型架构

    Figure  1.  The typical MLLM architecture

    图  2  代表性遥感多模态大语言模型和遥感智能体的发展时间线

    Figure  2.  Timeline of representative RS-MLLMs and remote sensing agents

    图  3  RS-Agent架构

    Figure  3.  RS-Agent architecture

    表  1  遥感多模态大语言模型具体结构

    Table  1.   The specific structure of RS-MLLM

    RS-MLLM | Multimodal encoder | Multimodal feature projector | Pretrained LLM | Training hardware
    AeroLite[57] | CLIP ViT-L/14 | A two-layer MLP | LLaMA3.2-3B | 4090
    Aquila[55] | Aquila-CLIP ConvNext (A-CCN) | SFI (MDA) | LLM (based on LLaMA3) | 4×A800
    Aquila-plus[56] | CLIP ConvNeXt-L | Mask spatial feature extractor | Vicuna | -
    CDChat[49] | CLIP ViT-L/14 | A two-layer MLP (GELU) | Vicuna-v1.5-7B | 3×A100
    ChangeChat[48] | CLIP ViT/14 | A two-layer MLP | Vicuna-v1.5 | L20 (48 GB)
    EagleVision[58] | Baseline detector | Attribute disentangle | InternLM2.5-7B-Chat, etc. | 8×A100
    EarthDial[59] | Adaptive high resolution + Data fusion + InternViT-300M | A simple MLP | Phi-3-mini | 8×A100 (80 GB)
    EarthGPT[12] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | A linear layer | LLaMA2-13B | 16×A100
    EarthGPT-X[60] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L + Hybrid signals mutual understanding | Vision-to-language modality-align projection | LLaMA2-13B | 8×A100 (80 GB)
    EarthMarker[61] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | A linear layer | LLaMA2-13B | 8×A100 (80 GB)
    GeoChat[42] | CLIP ViT-L/14 | A two-layer MLP (GELU) | Vicuna-v1.5-7B | -
    GeoGround[29] | CLIP ViT | A two-layer MLP | Vicuna-v1.5 | 8×V100 (32 GB)
    GeoLLaVA-8K[62] | CLIP ViT-L/14 + A two-step token compression module | A linear layer | Vicuna-v1.5-7B | -
    IFShip[46] | CLIP ViT-L/14 | A four-layer MLP (GELU) | Vicuna-13B | -
    LHRS-Bot[11] | CLIP ViT-L/14 | Vision perceiver | LLaMA2-7B | 8×V100 (32 GB)
    LHRS-Bot-Nova[63] | SigLIP-L/14 | Vision perceiver | LLaMA3-8B | 8×H100
    Popeye[47] | DINOv2 ViT-L/14 + CLIP ConvNeXt-L | Alignment projection | LLaMA-7B | -
    RingMoGPT[30] | EVA-CLIP ViT-g/14 | A Q-Former + A linear layer | Vicuna-13B | 8×A100 (80 GB)
    RS-CapRet[45] | CLIP ViT-L/14 | Three linear layers | LLaMA2-7B | -
    RSGPT[10] | EVA-G | A Q-Former + A linear layer | Vicuna-7B, Vicuna-13B | 8×A100
    RS-LLaVA[64] | CLIP ViT-L | A two-layer MLP (GELU) | Vicuna-v1.5-7B, Vicuna-v1.5-13B | 2×A6000 (48 GB)
    RSUniVLM[65] | SigLIP-400M | A two-layer MLP | Qwen2-0.5B | 4×A40 (40 GB)
    SkyEyeGPT[31] | EVA-CLIP | A linear layer | LLaMA2 | 4×3090
    SkySenseGPT[44] | CLIP ViT-L/14 | A two-layer MLP | Vicuna-v1.5 | 4×A100 (40 GB)
    Spectral-LLaVA[66] | SpectralGPT[34] (encoder only) | A linear layer | LLaMA3 | -
    TEOChat[50] | CLIP ViT-L/14 | A two-layer MLP | LLaMA2 | A4000 (16 GB)
    UniRS[13] | SigLIP + A change extraction module | A downsampling module + An MLP | Sheared-LLaMA (3B) | 4×4090 (24 GB)
    VHM[43] | CLIP ViT-L/14 | A two-layer MLP | Vicuna-v1.5-7B | 16×A100 (80 GB)
    Note: Italicized components are those not given a formal name or abbreviation in the original paper.
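    As Table 1 shows, most RS-MLLMs follow the same connector pattern: a visual encoder (typically a CLIP-family ViT or ConvNeXt), a lightweight projector such as a linear layer or a two-layer MLP, and a pretrained LLM. The PyTorch sketch below illustrates only this general pattern; the module names, dimensions, and the inputs_embeds interface are illustrative assumptions, not the implementation of any model listed above.

# A minimal PyTorch sketch of the encoder-projector-LLM pattern summarized in Table 1.
# All names and dimensions are illustrative assumptions, not taken from any specific RS-MLLM.
import torch
import torch.nn as nn

class TwoLayerProjector(nn.Module):
    """LLaVA-style two-layer MLP that maps visual tokens into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):            # (batch, num_patches, vision_dim)
        return self.proj(visual_tokens)          # (batch, num_patches, llm_dim)

class ToyRSMLLM(nn.Module):
    """Vision encoder -> projector -> LLM; often only the projector is trained from scratch."""
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a CLIP ViT-L/14 backbone (often frozen)
        self.projector = TwoLayerProjector(vision_dim, llm_dim)
        self.llm = llm                            # e.g. a Vicuna/LLaMA-style decoder

    def forward(self, image, text_embeds):
        visual_tokens = self.vision_encoder(image)               # (B, N, vision_dim), assumed shape
        visual_embeds = self.projector(visual_tokens)             # (B, N, llm_dim)
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)   # prepend image tokens to the text
        return self.llm(inputs_embeds=inputs)                     # assumes an HF-style LLM interface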

    表  2  遥感多模态大语言模型训练时使用的数据集

    Table  2.   Datasets used for training RS-MLLMs

    RS-MLLM | Instruction tuning | Other training stages
    AeroLite[57] | - | RSSCN7, DLRSD, iSAID, LoveDA, WHU, UCM-Captions, Sydney-Captions
    Aquila[55] | FIT-RS | CapERA, UCM-Captions, Sydney-Captions, NWPU-Captions, RSICD, RSITMD, RSVQA-HR, RSVQA-LR, WHU_RS19
    Aquila-plus[56] | Aquila-plus-100K | -
    CDChat[49] | LEVIR-CD, SYSU-CD | LEVIR-CD, SYSU-CD
    ChangeChat[48] | ChangeChat-87k | -
    EagleVision[58] | EVAttrs-95K | EVAttrs-95K
    EarthDial[59] | EarthDial-Instruct | EarthDial-Instruct
    EarthGPT[12] | MMRS-1M | LAION-400M, COCO Caption
    EarthGPT-X[60] | M-RSVP | -
    EarthMarker[61] | RSVP-3M | COCO Caption, RSVP-3M, RefCOCO, RefCOCO+
    GeoChat[42] | RS multimodal instruction following dataset | -
    GeoGround[29] | refGeo | FAIR1M, DIOR, DOTA
    GeoLLaVA-8K[62] | SuperRS-VQA, HighRS-VQA | -
    IFShip[46] | TITANIC-FGS | -
    LHRS-Bot[11] | LLaVA complex reasoning dataset, NWPU, RSITMD, LHRS-Instruct | LHRS-Align
    LHRS-Bot-Nova[63] | Multi-task instruction dataset | LHRS-Align-Recap, LHRS-Instruct, LHRS-Instruct-Plus, LRV-Instruct
    Popeye[47] | MMShip | COCO Caption
    RingMoGPT[30] | Instruction-tuning dataset | Image-text pre-training dataset
    RS-CapRet[45] | - | RSICD, UCM-Captions, Sydney-Captions, NWPU-Captions
    RSGPT[10] | - | RSICap
    RS-LLaVA[64] | RS-Instructions | -
    RSUniVLM[65] | RSUniVLM-Instruct-1.2M | RSUniVLM-Resampled
    SkyEyeGPT[31] | SkyEye-968k | SkyEye-968k
    SkySenseGPT[44] | FIT-RS, NWPU-Captions, UCM-Captions, RSITMD, EarthVQA, Floodnet-VQA, RSVQA-LR, DOTA, DIOR, FAIR1M | -
    Spectral-LLaVA[66] | BigEarthNet-v2 | fMoW, BigEarthNet-v1
    TEOChat[50] | TEOChatlas | -
    UniRS[13] | GeoChat-Instruct, LEVIR-CC, EAR | -
    VHM[43] | VersaD-Instruct, VariousRS-Instruct, HnstD | VersaD
    Note: Since the focus is on the instruction-tuning stage, datasets used in all other training stages are merged into the third column. Italicized datasets are those not given a formal name or abbreviation in the original paper.
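    Most instruction-tuning corpora listed in Table 2 consist of image-grounded conversation records. The snippet below shows a hypothetical sample in the LLaVA-style conversation format that several of these datasets loosely follow; the field names and file path are assumptions for illustration, not the schema of any particular dataset.

# A hypothetical LLaVA-style instruction-tuning record; keys and values are illustrative only.
sample = {
    "id": "demo_0001",
    "image": "images/airport_000123.png",
    "conversations": [
        {"from": "human", "value": "<image>\nHow many airplanes are parked near the terminal?"},
        {"from": "gpt", "value": "There are four airplanes parked near the terminal building."},
    ],
}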

    表  3  遥感多模态大语言模型的适用任务及其对应的基准数据集

    Table  3.   The applicable tasks of RS-MLLMs and their corresponding benchmark datasets

    RS-MLLM | RSIC | RSVQA | RSVG | RSSC
    AeroLite[57] | Sydney-Captions[74], UCM-Captions[74] | - | - | -
    Aquila[55] | RSICD[75], Sydney-Captions[74], UCM-Captions[74], FIT-RS[44] | RSVQA-LR[76], RSVQA-HR[76], FIT-RS[44] | - | -
    EarthDial[59] | NWPU-Captions[77], RSICD[75], RSITMD-Captions[78], Sydney-Captions[74], UCM-Captions[74] | RSVQA-LR[76], RSVQA-HR[76] | - | AID[79], UCMerced[80], WHU-RS19[81], BigEarthNet[82], xBD Set 1[83], fMoW[84]
    EarthGPT[12] | NWPU-Captions[77] | CRSVQA[85], RSVQA-HR[76] | DIOR-RSVG[86] | NWPU-RESISC45[87], CLRS[88], NaSC-TG2[89]
    GeoChat[42] | - | RSVQA-LR[76], RSVQA-HR[76] | GeoChat*[42] | AID[79], UCMerced[80]
    GeoGround[29] | - | - | DIOR-RSVG[86], RSVG[90], GeoChat*[42], VRSBench*[91], AVVG[29] | -
    LHRS-Bot[11] | - | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], EuroSAT[93], METER-ML[94], fMoW[84]
    LHRS-Bot-Nova[63] | - | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], EuroSAT[93], METER-ML[94], fMoW[84]
    RingMoGPT[30] | DOTA-Cap[30], DIOR-Cap[30], NWPU-Captions[77], RSICD[75], Sydney-Captions[74], UCM-Captions[74] | HRVQA[95] | - | AID[79], NWPU-RESISC45[87], UCMerced[80], WHU-RS19[81]
    RS-CapRet[45] | NWPU-Captions[77], RSICD[75], Sydney-Captions[74], UCM-Captions[74] | - | - | -
    RSGPT[10] | RSIEval[10], UCM-Captions[74], Sydney-Captions[74], RSICD[75] | RSIEval[10], RSVQA-LR[76], RSVQA-HR[76] | - | -
    RS-LLaVA[64] | UCM-Captions[74], UAV[96] | RSVQA-LR[76], RSIVQA-DOTA[97] | - | -
    RSUniVLM[65] | - | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], VRSBench[91] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92]
    SkyEyeGPT[31] | UCM-Captions[74], CapERA[98] | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86], RSVG[90] | -
    TEOChat[50] | - | RSVQA-LR[76], RSVQA-HR[76] | - | AID[79], UCMerced[80]
    UniRS[13] | - | RSVQA-LR[76], RSVQA-HR[76], CRSVQA[85] | - | -
    VHM[43] | - | RSVQA-LR[76], RSVQA-HR[76] | DIOR-RSVG[86] | AID[79], WHU-RS19[81], NWPU-RESISC45[87], SIRI-WHU[92], METER-ML[94]
    Note: * indicates that the test set has been modified.

    表  4  遥感多模态大语言模型在不同的遥感图像描述数据集上的性能表现

    Table  4.   Performance of RS-MLLMs on various remote sensing image captioning datasets

    Dataset Model BLEU-1↑ BLEU-2↑ BLEU-3↑ BLEU-4↑ METEOR↑ ROUGE-L↑ CIDEr↑ SPICE↑
    NWPU-Captions[77]
    EarthDial[59] 0.806 0.400
    EarthGPT[12] 0.871 0.787 0.716 0.655 0.445 0.782 1.926 0.322
    EarthMarker[61] 0.844 0.731 0.629 0.543 0.375 0.700 1.629 0.268
    RS-CapRet[45] 0.871 0.787 0.717 0.656 0.436 0.776 1.929 0.311
    RSICD[75] Aquila[55] 0.746
    EarthDial[59] 0.562 0.276
    RS-CapRet[45] 0.741 0.622 0.529 0.455 0.376 0.649 2.605 0.484
    RSGPT[10] 0.703 0.542 0.440 0.368 0.301 0.533 1.029
    SkyEyeGPT[31] 0.867 0.767 0.673 0.600 0.354 0.626 0.837
    RingMoGPT[30] 0.343 0.616 2.758
    UCM-Captions[74]
    AeroLite[57] 0.934 0.796 0.498 0.880
    Aquila[55] 0.883
    EarthDial[59] 0.514 0.342
    RS-CapRet[45] 0.843 0.779 0.722 0.670 0.472 0.817 3.548 0.525
    RSGPT[10] 0.861 0.791 0.723 0.657 0.422 0.783 3.332
    SkyEyeGPT[31] 0.907 0.857 0.816 0.784 0.462 0.795 2.368
    RS-LLaVA[64] 0.900 0.849 0.803 0.760 0.492 0.858 3.556
    RingMoGPT[30] 0.499 0.833 3.593
    Sydney-Captions[74]
    AeroLite[57] 0.919 0.759 0.475 0.837
    Aquila[55] 0.834
    EarthDial[59] 0.573 0.410
    RS-CapRet[45] 0.787 0.700 0.628 0.564 0.388 0.707 2.392 0.434
    RSGPT[10] 0.823 0.753 0.686 0.622 0.414 0.748 2.731
    SkyEyeGPT[31] 0.919 0.856 0.809 0.774 0.466 0.777 1.811
    RingMoGPT[30] 0.421 0.734 2.888
    FIT-RS[44] Aquila[55] 0.351
    GeoChat[42] 0.088
    SkySenseGPT[44] 0.273
    Note: All results are taken from the corresponding original papers; bold indicates the best result.
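    The captioning metrics in Table 4 score a generated caption against one or more reference captions using n-gram overlap (BLEU, ROUGE-L, CIDEr) or word-alignment measures (METEOR, SPICE). As a minimal illustration, the sketch below computes BLEU-4 with NLTK; published results usually rely on the pycocoevalcap toolkit with its own tokenization, so numbers from this sketch are not directly comparable to the table.

# Illustrative BLEU-4 computation for image captioning; naive whitespace tokenization is assumed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "many airplanes are parked next to a long terminal".split(),
    "several planes are parked beside the airport terminal".split(),
]
hypothesis = "several airplanes are parked near the terminal".split()

bleu4 = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4 = {bleu4:.3f}")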

    表  5  遥感多模态大语言模型在不同的遥感视觉问答数据集上的性能表现(%)

    Table  5.   Performance of RS-MLLMs on various remote sensing visual question answering datasets (%)

    Model RSVQA-LR[76] dataset RSVQA-HR[76] dataset CRSVQA[85] dataset FIT-RS[44] dataset
    Presence Compare Rural/Urban Avg Presence Compare Avg Avg Avg
    Aquila[55] 92.72 92.64 83.87
    EarthDial[59] 92.58 92.75 94.00 92.70 58.89 83.11 72.45
    EarthGPT[12] 62.77 79.53 72.06 82.00
    GeoChat[42] 91.09 90.33 94.00 90.70 59.02 83.16 53.47
    LHRS-Bot[11] 89.07 88.51 90.00 89.19 92.57 92.53 92.55
    LHRS-Bot-Nova[63] 89.00 90.71 89.11 89.61 91.68 92.44 92.06
    RSGPT[10] 91.17 91.70 94.00 92.29 90.92 90.02 90.47
    RS-LLaVA[64] 92.27 91.37 95.00 88.10
    RSUniVLM[65] 92.00 91.51 92.65 92.05 90.81 90.88 90.85
    SkyEyeGPT[31] 88.93 88.63 75.00 84.19 80.00 80.13 82.56
    SkySenseGPT[44] 95.00 91.07 92.00 92.69 69.14 84.14 76.64 79.76
    TEOChat[50] 91.70 92.70 94.00 67.50 81.10
    UniRS[13] 91.81 93.23 93.00 92.63 59.29 84.05 73.15 86.67
    VHM[43] 91.17 89.89 88.00 89.33 64.00 83.50 73.75
    Note: All results are taken from the corresponding original papers; bold indicates the best result.
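    The per-type accuracies and the Avg column in Table 5 are generally exact-match accuracies computed per question type and then averaged; the exact answer-normalization protocol varies across papers. A simplified sketch under that assumption:

# Per-type VQA accuracy and their unweighted mean, under a simple exact-match assumption.
from collections import defaultdict

def vqa_accuracy(samples):
    """samples: list of dicts with 'type', 'prediction', 'answer' keys (illustrative format)."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        total[s["type"]] += 1
        if s["prediction"].strip().lower() == s["answer"].strip().lower():
            correct[s["type"]] += 1
    per_type = {t: 100.0 * correct[t] / total[t] for t in total}
    per_type["Avg"] = sum(per_type.values()) / len(per_type)
    return per_type

demo = [
    {"type": "Presence", "prediction": "yes", "answer": "Yes"},
    {"type": "Presence", "prediction": "no", "answer": "Yes"},
    {"type": "Compare", "prediction": "more", "answer": "more"},
]
print(vqa_accuracy(demo))  # {'Presence': 50.0, 'Compare': 100.0, 'Avg': 75.0}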

    表  6  遥感多模态大语言模型在不同的遥感视觉定位数据集上的性能表现(%)

    Table  6.   Performance of RS-MLLMs on various remote sensing visual grounding datasets (%)

    Model | DIOR-RSVG[86] Pr@0.5 | RSVG[90] Pr@0.5
    EarthGPT[12] | 76.65 | -
    GeoGround[29] | 77.73 | 26.65
    LHRS-Bot[11] | 88.10 | 73.45
    LHRS-Bot-Nova[63] | 92.87 | 81.85
    RSUniVLM[65] | 72.47 | -
    SkyEyeGPT[31] | 88.59 | 70.50
    VHM[43] | 56.17 | -
    Note: All results are taken from the corresponding original papers; bold indicates the best result.
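    Pr@0.5 in Table 6 is the fraction of referring expressions whose predicted bounding box overlaps the ground-truth box with an Intersection-over-Union (IoU) of at least 0.5. A minimal sketch, assuming axis-aligned (x1, y1, x2, y2) pixel boxes and one prediction per query:

# IoU and Pr@0.5 for visual grounding; the (x1, y1, x2, y2) box format is an assumption.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_at_05(predictions, ground_truths):
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)

print(precision_at_05([(10, 10, 50, 50)], [(12, 12, 48, 52)]))  # one near-match -> 100.0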

    表  7  遥感多模态大语言模型在不同的遥感场景分类数据集上的性能表现(%)

    Table  7.   Performance of RS-MLLMs on various remote sensing scene classification datasets (%)

    Model UCMerced[80] AID[79] NWPU-RESISC45[87] CLRS[88] NaSC-TG2[89] WHU-RS19[81] SIRI-WHU[92] EuroSAT[93] METER-ML[94] fMoW[84]
    EarthDial[59] 92.42 88.76 96.21 70.03
    EarthGPT[12] 93.84 77.37 74.72
    EarthMarker[61] 86.52 77.97
    GeoChat[42] 84.43 72.03
    LHRS-Bot[11] 91.26 83.94 93.17 62.66 51.40 69.81 56.56
    LHRS-Bot-Nova[63] 88.32 86.80 95.63 74.75 63.54 70.05 57.11
    RingMoGPT[30] 86.48 97.94 96.47 97.71
    RSUniVLM[65] 81.18 86.86 84.91 68.13
    SkyEyeGPT[31] 60.95 26.30
    TEOChat[50] 86.30 80.90
    VHM[43] 91.70 94.54 95.80 70.88 72.74
    Note: All results are taken from the corresponding original papers; bold indicates the best result.
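    Because RS-MLLMs answer in free-form text, the scene-classification accuracies in Table 7 are usually obtained by mapping the generated answer back to one of the candidate class names before scoring. The sketch below uses simple substring matching as that mapping; individual papers may use stricter parsing or constrained prompting.

# Mapping free-form answers to class labels by substring matching; a simplified assumption.
def classify_from_answer(answer, class_names):
    answer = answer.lower()
    for name in class_names:
        if name.lower() in answer:
            return name
    return None

def scene_accuracy(answers, labels, class_names):
    preds = [classify_from_answer(a, class_names) for a in answers]
    correct = sum(p == l for p, l in zip(preds, labels))
    return 100.0 * correct / len(labels)

classes = ["airport", "farmland", "forest", "harbor"]
print(scene_accuracy(["This image shows an airport with two runways."], ["airport"], classes))  # 100.0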
  • [1] 王桥, 刘思含. 国家环境遥感监测体系研究与实现[J]. 遥感学报, 2016, 20(5): 1161–1169. doi: 10.11834/jrs.20166201.

    WANG Qiao and LIU Sihan. Research and implementation of national environmental remote sensing monitoring system[J]. Journal of Remote Sensing, 2016, 20(5): 1161–1169. doi: 10.11834/jrs.20166201.
    [2] 安立强, 张景发, MONTEIRO R, 等. 地震灾害损失评估与遥感技术现状和展望[J]. 遥感学报, 2024, 28(4): 860–884. doi: 10.11834/jrs.20232093.

    AN Liqiang, ZHANG Jingfa, MONTEIRO R, et al. A review and prospective research of earthquake damage assessment and remote sensing[J]. National Remote Sensing Bulletin, 2024, 28(4): 860–884. doi: 10.11834/jrs.20232093.
    [3] 张王菲, 陈尔学, 李增元, 等. 雷达遥感农业应用综述[J]. 雷达学报, 2020, 9(3): 444–461. doi: 10.12000/JR20051.

    ZHANG Wangfei, CHEN Erxue, LI Zengyuan, et al. Review of applications of radar remote sensing in agriculture[J]. Journal of Radars, 2020, 9(3): 444–461. doi: 10.12000/JR20051.
    [4] LI Yansheng, DANG Bo, ZHANG Yongjun, et al. Water body classification from high-resolution optical remote sensing imagery: Achievements and perspectives[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2022, 187: 306–327. doi: 10.1016/j.isprsjprs.2022.03.013.
    [5] LI Yansheng, WEI Fanyi, ZHANG Yongjun, et al. HS2P: Hierarchical spectral and structure-preserving fusion network for multimodal remote sensing image cloud and shadow removal[J]. Information Fusion, 2023, 94: 215–228. doi: 10.1016/j.inffus.2023.02.002.
    [6] CHEN Yongqi, FENG Shou, ZHAO Chunhui, et al. High-resolution remote sensing image change detection based on Fourier feature interaction and multiscale perception[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5539115. doi: 10.1109/TGRS.2024.3500073.
    [7] 杨桄, 刘湘南. 遥感影像解译的研究现状和发展趋势[J]. 国土资源遥感, 2004(2): 7–10, 15. doi: 10.3969/j.issn.1001-070X.2004.02.002.

    YANG Guang and LIU Xiangnan. The present research condition and development trend of remotely sensed imagery interpretation[J]. Remote Sensing for Land & Resources, 2004(2): 7–10, 15. doi: 10.3969/j.issn.1001-070X.2004.02.002.
    [8] ZHAO W X, ZHOU Kun, LI Junyi, et al. A survey of large language models[J]. arXiv preprint arXiv: 2303.18223, 2023.
    [9] YIN Shukang, FU Chaoyou, ZHAO Sirui, et al. A survey on multimodal large language models[J]. National Science Review, 2024, 11(12): nwae403. doi: 10.1093/nsr/nwae403.
    [10] HU Yuan, YUAN Jianlong, WEN Congcong, et al. RSGPT: A remote sensing vision language model and benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 224: 272–286. doi: 10.1016/j.isprsjprs.2025.03.028.
    [11] MUHTAR D, LI Zhenshi, GU Feng, et al. LHRS-Bot: Empowering remote sensing with VGI-enhanced large multimodal language model[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 440–457. doi: 10.1007/978-3-031-72904-1_26.
    [12] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. EarthGPT: A universal multimodal large language model for multisensor image comprehension in remote sensing domain[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5917820. doi: 10.1109/TGRS.2024.3409624.
    [13] LI Yujie, XU Wenjia, LI Guangzuo, et al. UniRS: Unifying multi-temporal remote sensing tasks through vision language models[J]. arXiv preprint arXiv: 2412.20742, 2024.
    [14] VOUTILAINEN A. A syntax-based part-of-speech analyser[C]. The 7th Conference of the European Chapter of the Association for Computational Linguistics, Dublin, Ireland, 1995.
    [15] BRILL E and RESNIK P. A rule-based approach to prepositional phrase attachment disambiguation[C]. The 15th International Conference on Computational Linguistics, Kyoto, Japan, 1994.
    [16] HINTON G E, OSINDERO S, and TEH Y W. A fast learning algorithm for deep belief nets[J]. Neural Computation, 2006, 18(7): 1527–1554. doi: 10.1162/neco.2006.18.7.1527.
    [17] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
    [18] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. https://openai.com/index/language-unsupervised/, 2018.
    [19] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[EB/OL]. https://openai.com/index/better-language-models/, 2019.
    [20] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 159.
    [21] OpenAI. GPT-4 technical report[J]. arXiv preprint arXiv: 2303.08774, 2023.
    [22] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint arXiv: 2302.13971, 2023.
    [23] BAI Jinze, BAI Shuai, CHU Yunfei, et al. Qwen technical report[J]. arXiv preprint arXiv: 2309.16609, 2023.
    [24] DeepSeek-AI. DeepSeek-V3 technical report[J]. arXiv preprint arXiv: 2412.19437, 2024.
    [25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [26] LIU Haotian, LI Chunyuan, WU Qingyang, et al. Visual instruction tuning[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1516.
    [27] LIN Bin, YE Yang, ZHU Bin, et al. Video-LLaVA: Learning united visual representation by alignment before projection[C]. 2024 Conference on Empirical Methods in Natural Language Processing, Miami, USA, 2024: 5971–5984.
    [28] KOH J Y, FRIED D, and SALAKHUTDINOV R R. Generating images with multimodal language models[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 939.
    [29] ZHOU Yue, LAN Mengcheng, LI Xiang, et al. GeoGround: A unified large vision-language model for remote sensing visual grounding[J]. arXiv preprint arXiv: 2411.11904, 2024.
    [30] WANG Peijin, HU Huiyang, TONG Boyuan, et al. RingMoGPT: A unified remote sensing foundation model for vision, language, and grounded tasks[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5611320. doi: 10.1109/TGRS.2024.3510833.
    [31] ZHAN Yang, XIONG Zhitong, and YUAN Yuan. SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 221: 64–77. doi: 10.1016/j.isprsjprs.2025.01.020.
    [32] 张永军, 李彦胜, 党博, 等. 多模态遥感基础大模型: 研究现状与未来展望[J]. 测绘学报, 2024, 53(10): 1942–1954. doi: 10.11947/j.AGCS.2024.20240019.

    ZHANG Yongjun, LI Yansheng, DANG Bo, et al. Multi-modal remote sensing large foundation models: Current research status and future prospect[J]. Acta Geodaetica et Cartographica Sinica, 2024, 53(10): 1942–1954. doi: 10.11947/j.AGCS.2024.20240019.
    [33] HONG Danfeng, HAN Zhu, YAO Jing, et al. SpectralFormer: Rethinking hyperspectral image classification with transformers[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5518615. doi: 10.1109/TGRS.2021.3130716.
    [34] HONG Danfeng, ZHANG Bing, LI Xuyang, et al. SpectralGPT: Spectral remote sensing foundation model[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(8): 5227–5244. doi: 10.1109/TPAMI.2024.3362475.
    [35] FULLER A, MILLARD K, and GREEN J R. CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders[C]. The 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 241.
    [36] WANG Yi, ALBRECHT C M, BRAHAM N A A, et al. Decoupling common and unique representations for multimodal self-supervised learning[C]. The 18th European Conference on Computer Vision, Milan, Italy, 2024: 286–303. doi: 10.1007/978-3-031-73397-0_17.
    [37] 张良培, 张乐飞, 袁强强. 遥感大模型: 进展与前瞻[J]. 武汉大学学报(信息科学版), 2023, 48(10): 1574–1581. doi: 10.13203/j.whugis20230341.

    ZHANG Liangpei, ZHANG Lefei, and YUAN Qiangqiang. Large remote sensing model: Progress and prospects[J]. Geomatics and Information Science of Wuhan University, 2023, 48(10): 1574–1581. doi: 10.13203/j.whugis20230341.
    [38] CHIANG W L, LI Zhuohan, LIN Zi, et al. Vicuna: An open-source Chatbot impressing GPT-4 with 90%* ChatGPT quality[EB/OL]. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
    [39] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning[J]. arXiv preprint arXiv: 2501.12948, 2025.
    [40] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: A visual language model for few-shot learning[C]. The 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1723.
    [41] LI Junnan, LI Dongxu, SAVARESE S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]. The 40th International Conference on Machine Learning, Honolulu, USA, 2023: 814.
    [42] KUCKREJA K, DANISH M S, NASEER M, et al. GeoChat: Grounded large vision-language model for remote sensing[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 27831–27840. doi: 10.1109/CVPR52733.2024.02629.
    [43] PANG Chao, WENG Xingxing, WU Jiang, et al. VHM: Versatile and honest vision language model for remote sensing image analysis[C]. The 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 6381–6388. doi: 10.1609/aaai.v39i6.32683.
    [44] LUO Junwei, PANG Zhen, ZHANG Yongjun, et al. SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding[J]. arXiv preprint arXiv: 2406.10100, 2024.
    [45] SILVA J D, MAGALHÃES J, TUIA D, et al. Large language models for captioning and retrieving remote sensing images[J]. arXiv preprint arXiv: 2402.06475, 2024.
    [46] GUO Mingning, WU Mengwei, SHEN Yuxiang, et al. IFShip: Interpretable fine-grained ship classification with domain knowledge-enhanced vision-language models[J]. Pattern Recognition, 2025, 166: 111672. doi: 10.1016/j.patcog.2025.111672.
    [47] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. Popeye: A unified visual-language model for multisource ship detection from remote sensing imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024, 17: 20050–20063. doi: 10.1109/JSTARS.2024.3488034.
    [48] DENG Pei, ZHOU Wenqian, and WU Hanlin. ChangeChat: An interactive model for remote sensing change analysis via multimodal instruction tuning[C]. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025: 1–5. doi: 10.1109/ICASSP49660.2025.10890620.
    [49] NOMAN M, AHSAN N, NASEER M, et al. CDChat: A large multimodal model for remote sensing change description[J]. arXiv preprint arXiv: 2409.16261, 2024.
    [50] IRVIN J A, LIU E R, CHEN J C, et al. TEOChat: A large vision-language assistant for temporal earth observation data[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025.
    [51] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. The 38th International Conference on Machine Learning, 2021: 8748–8763.
    [52] LIU Zhuang, MAO Hanzi, WU Chaoyuan, et al. A ConvNet for the 2020s[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11966–11976. doi: 10.1109/CVPR52688.2022.01167.
    [53] SUN Quan, FANG Yuxin, WU L, et al. EVA-CLIP: Improved training techniques for CLIP at scale[J]. arXiv preprint arXiv: 2303.15389, 2023.
    [54] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]. The 9th International Conference on Learning Representations, 2021.
    [55] LU Kaixuan, ZHANG Ruiqian, HUANG Xiao, et al. Aquila: A hierarchically aligned visual-language model for enhanced remote sensing image comprehension[J]. arXiv preprint arXiv: 2411.06074, 2024.
    [56] LU Kaixuan. Aquila-plus: Prompt-driven visual-language models for pixel-level remote sensing image understanding[J]. arXiv preprint arXiv: 2411.06142, 2024.
    [57] ZI Xing, NI Tengjun, FAN Xianjing, et al. AeroLite: Tag-guided lightweight generation of aerial image captions[J]. arXiv preprint arXiv: 2504.09528, 2025.
    [58] JIANG Hongxiang, YIN Jihao, WANG Qixiong, et al. EagleVision: Object-level attribute multimodal LLM for remote sensing[J]. arXiv preprint arXiv: 2503.23330, 2025.
    [59] SONI S, DUDHANE A, DEBARY H, et al. EarthDial: Turning multi-sensory earth observations to interactive dialogues[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2025: 14303–14313.
    [60] ZHANG Wei, CAI Miaoxin, NING Yaqian, et al. EarthGPT-X: Enabling MLLMs to flexibly and comprehensively understand multi-source remote sensing imagery[J]. arXiv preprint arXiv: 2504.12795, 2025.
    [61] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. EarthMarker: A visual prompting multimodal large language model for remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5604219. doi: 10.1109/TGRS.2024.3523505.
    [62] WANG Fengxiang, CHEN Mingshuo, LI Yueying, et al. GeoLLaVA-8K: Scaling remote-sensing multimodal large language models to 8K resolution[J]. arXiv preprint arXiv: 2505.21375, 2025.
    [63] LI Zhenshi, MUHTAR D, GU Feng, et al. LHRS-Bot-Nova: Improved multimodal large language model for remote sensing vision-language interpretation[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 227: 539–550. doi: 10.1016/j.isprsjprs.2025.06.003.
    [64] BAZI Y, BASHMAL L, AL RAHHAL M M, et al. RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery[J]. Remote Sensing, 2024, 16(9): 1477. doi: 10.3390/rs16091477.
    [65] LIU Xu and LIAN Zhouhui. RSUniVLM: A unified vision language model for remote sensing via granularity-oriented mixture of experts[J]. arXiv preprint arXiv: 2412.05679, 2024.
    [66] KARANFIL E, IMAMOGLU N, ERDEM E, et al. A vision-language framework for multispectral scene representation using language-grounded features[J]. arXiv preprint arXiv: 2501.10144, 2025.
    [67] ZHANG Hao, LI Feng, LIU Shilong, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection[C]. The 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.
    [68] ZHAI Xiaohua, MUSTAFA B, KOLESNIKOV A, et al. Sigmoid loss for language image pre-training[C]. The IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 11941–11952. doi: 10.1109/ICCV51070.2023.01100.
    [69] ZHANG Jihai, QU Xiaoye, ZHU Tong, et al. CLIP-MoE: Towards building mixture of experts for CLIP with diversified multiplet upcycling[J]. arXiv preprint arXiv: 2409.19291, 2024.
    [70] WANG Weihan, LV Qingsong, YU Wenmeng, et al. CogVLM: Visual expert for pretrained language models[C]. The 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 3860.
    [71] LIU Haotian, LI Chunyuan, LI Yuheng, et al. Improved baselines with visual instruction tuning[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 26286–26296. doi: 10.1109/CVPR52733.2024.02484.
    [72] KAUFMANN T, WENG P, BENGS V, et al. A survey of reinforcement learning from human feedback[J]. Transactions on Machine Learning Research, in press, 2025.
    [73] HU E J, SHEN Yelong, WALLIS P, et al. LoRA: Low-rank adaptation of large language models[C]. The 10th International Conference on Learning Representations, 2022.
    [74] QU Bo, LI Xuelong, TAO Dacheng, et al. Deep semantic understanding of high resolution remote sensing image[C]. 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 2016: 1–5. doi: 10.1109/CITS.2016.7546397.
    [75] LU Xiaoqiang, WANG Binqiang, ZHENG Xiangtao, et al. Exploring models and data for remote sensing image caption generation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56(4): 2183–2195. doi: 10.1109/TGRS.2017.2776321.
    [76] LOBRY S, MARCOS D, MURRAY J, et al. RSVQA: Visual question answering for remote sensing data[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(12): 8555–8566. doi: 10.1109/TGRS.2020.2988782.
    [77] CHENG Qimin, HUANG Haiyan, XU Yuan, et al. NWPU-captions dataset and MLCA-Net for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5629419. doi: 10.1109/TGRS.2022.3201474.
    [78] YUAN Zhiqiang, ZHANG Wenkai, FU Kun, et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4404119. doi: 10.1109/TGRS.2021.3078451.
    [79] XIA Guisong, HU Jingwen, HU Fan, et al. AID: A benchmark data set for performance evaluation of aerial scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(7): 3965–3981. doi: 10.1109/TGRS.2017.2685945.
    [80] YANG Yi and NEWSAM S. Bag-of-visual-words and spatial extensions for land-use classification[C]. The 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, USA, 2010: 270–279. doi: 10.1145/1869790.1869829.
    [81] DAI Dengxin and YANG Wen. Satellite image classification via two-layer sparse coding with biased image representation[J]. IEEE Geoscience and Remote Sensing Letters, 2011, 8(1): 173–176. doi: 10.1109/LGRS.2010.2055033.
    [82] SUMBUL G, CHARFUELAN M, DEMIR B, et al. BigEarthNet: A large-scale benchmark archive for remote sensing image understanding[C]. IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 2019: 5901–5904. doi: 10.1109/IGARSS.2019.8900532.
    [83] GUPTA R, GOODMAN B, PATEL N, et al. Creating xBD: A dataset for assessing building damage from satellite imagery[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, USA, 2019: 10–17.
    [84] CHRISTIE G, FENDLEY N, WILSON J, et al. Functional map of the world[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 6172–6180. doi: 10.1109/CVPR.2018.00646.
    [85] ZHANG Meimei, CHEN Fang, and LI Bin. Multistep question-driven visual question answering for remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 4704912. doi: 10.1109/TGRS.2023.3312479.
    [86] ZHAN Yang, XIONG Zhitong, and YUAN Yuan. RSVG: Exploring data and models for visual grounding on remote sensing data[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5604513. doi: 10.1109/TGRS.2023.3250471.
    [87] CHENG Gong, HAN Junwei, and LU Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art[J]. Proceedings of the IEEE, 2017, 105(10): 1865–1883. doi: 10.1109/JPROC.2017.2675998.
    [88] LI Haifeng, JIANG Hao, GU Xin, et al. CLRS: Continual learning benchmark for remote sensing image scene classification[J]. Sensors, 2020, 20(4): 1226. doi: 10.3390/s20041226.
    [89] ZHOU Zhuang, LI Shengyang, WU Wei, et al. NaSC-TG2: Natural scene classification with Tiangong-2 remotely sensed imagery[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 3228–3242. doi: 10.1109/JSTARS.2021.3063096.
    [90] SUN Yuxi, FENG Shanshan, LI Xutao, et al. Visual grounding in remote sensing images[C]. The 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022: 404–412. doi: 10.1145/3503161.3548316.
    [91] LI Xiang, DING Jian, and ELHOSEINY M. VRSBench: A versatile vision-language benchmark dataset for remote sensing image understanding[C]. The 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 106.
    [92] ZHU Qiqi, ZHONG Yanfei, ZHAO Bei, et al. Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery[J]. IEEE Geoscience and Remote Sensing Letters, 2016, 13(6): 747–751. doi: 10.1109/LGRS.2015.2513443.
    [93] HELBER P, BISCHKE B, DENGEL A, et al. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(7): 2217–2226. doi: 10.1109/JSTARS.2019.2918242.
    [94] ZHU B, LUI N, IRVIN J, et al. METER-ML: A multi-sensor earth observation benchmark for automated methane source mapping[C]. The 2nd Workshop on Complex Data Challenges in Earth Observation (CDCEO 2022) Co-Located with 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence (IJCAI-ECAI 2022), Vienna, Austria, 2022: 33–43.
    [95] LI Kun, VOSSELMAN G, and YANG M Y. HRVQA: A visual question answering benchmark for high-resolution aerial images[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2024, 214: 65–81. doi: 10.1016/j.isprsjprs.2024.06.002.
    [96] HOXHA G and MELGANI F. A novel SVM-based decoder for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5404514. doi: 10.1109/TGRS.2021.3105004.
    [97] ZHENG Xiangtao, WANG Binqiang, DU Xingqian, et al. Mutual attention inception network for remote sensing visual question answering[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5606514. doi: 10.1109/TGRS.2021.3079918.
    [98] BASHMAL L, BAZI Y, AL RAHHAL M M, et al. CapERA: Captioning events in aerial videos[J]. Remote Sensing, 2023, 15(8): 2139. doi: 10.3390/rs15082139.
    [99] SUN Xian, WANG Peijin, LU Wanxuan, et al. RingMo: A remote sensing foundation model with masked image modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5612822. doi: 10.1109/TGRS.2022.3194732.
    [100] BI Hanbo, FENG Yingchao, TONG Boyuan, et al. RingMoE: Mixture-of-modality-experts multi-modal foundation models for universal remote sensing image interpretation[J]. arXiv preprint arXiv: 2504.03166, 2025.
    [101] CHEN Hao and SHI Zhenwei. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection[J]. Remote Sensing, 2020, 12(10): 1662. doi: 10.3390/rs12101662.
    [102] SHI Qian, LIU Mengxi, LI Shengchen, et al. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5604816. doi: 10.1109/TGRS.2021.3085870.
    [103] LIU Chenyang, ZHAO Rui, CHEN Hao, et al. Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5633520. doi: 10.1109/TGRS.2022.3218921.
    [104] XIA Guisong, BAI Xiang, DING Jian, et al. DOTA: A large-scale dataset for object detection in aerial images[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 3974–3983. doi: 10.1109/CVPR.2018.00418.
    [105] LI Ke, WAN Gang, CHENG Gong, et al. Object detection in optical remote sensing images: A survey and a new benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 159: 296–307. doi: 10.1016/j.isprsjprs.2019.11.023.
    [106] ZHAO Yuanxin, ZHANG Mi, YANG Bingnan, et al. LuoJiaHOG: A hierarchy oriented geo-aware image caption dataset for remote sensing image-text retrieval[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 222: 130–151. doi: 10.1016/j.isprsjprs.2025.02.009.
    [107] YUAN Zhenghang, XIONG Zhitong, MOU Lichao, et al. ChatEarthNet: A global-scale image-text dataset empowering vision-language geo-foundation models[J]. Earth System Science Data, 2025, 17(3): 1245–1263. doi: 10.5194/essd-17-1245-2025.
    [108] LI Haodong, ZHANG Xiaofeng, and QU Haicheng. DDFAV: Remote sensing large vision language models dataset and evaluation benchmark[J]. Remote Sensing, 2025, 17(4): 719. doi: 10.3390/rs17040719.
    [109] AN Xiao, SUN Jiaxing, GUI Zihan, et al. COREval: A comprehensive and objective benchmark for evaluating the remote sensing capabilities of large vision-language models[J]. arXiv preprint arXiv: 2411.18145, 2024.
    [110] ZHOU Yue, FENG Litong, LAN Mengcheng, et al. GeoMath: A benchmark for multimodal mathematical reasoning in remote sensing[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025.
    [111] GE Junyao, ZHANG Xu, ZHENG Yang, et al. RSTeller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 226: 146–163. doi: 10.1016/j.isprsjprs.2025.05.002.
    [112] DU Siqi, TANG Shengjun, WANG Weixi, et al. Tree-GPT: Modular large language model expert system for forest remote sensing image understanding and interactive analysis[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2023, XLVIII-1-W2-2023: 1729–1736. doi: 10.5194/isprs-archives-XLVIII-1-W2-2023-1729-2023.
    [113] GUO Haonan, SU Xin, WU Chen, et al. Remote sensing ChatGPT: Solving remote sensing tasks with ChatGPT and visual models[C]. IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 2024: 11474–11478. doi: 10.1109/IGARSS53475.2024.10640736.
    [114] LIU Chenyang, CHEN Keyan, ZHANG Haotian, et al. Change-Agent: Towards interactive comprehensive remote sensing change interpretation and analysis[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5635616. doi: 10.1109/TGRS.2024.3425815.
    [115] XU Wenjia, YU Zijian, MU Boyang, et al. RS-Agent: Automating remote sensing tasks through intelligent agents[J]. arXiv preprint arXiv: 2406.07089, 2024.
    [116] SINGH S, FORE M, and STAMOULIS D. GeoLLM-Engine: A realistic environment for building geospatial copilots[C]. The IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, 2024: 585–594. doi: 10.1109/CVPRW63382.2024.00063.
    [117] SINGH S, FORE M, and STAMOULIS D. Evaluating tool-augmented agents in remote sensing platforms[J]. arXiv preprint arXiv: 2405.00709, 2024.
Publication history
  • Received: 2025-05-12
  • Revised: 2025-07-22
  • Published online: 2025-09-01
