Graph-Temporal Deep Learning for Urban Image Modeling From Multimodal Discourse Interaction: A Case Study of Hefei

Abstract

This paper presents a computational framework for modeling urban image from multimodal discourse interaction. The framework integrates cross-modal representation learning, graph-based semantic modeling, and temporal sequence prediction. Images and texts are embedded into a unified semantic space using the CLIP and BLIP-2 models, enabling high-fidelity multimodal representation. A knowledge graph of urban discourse is constructed and modeled with a Graph Attention Network (GAT) to capture semantic relationships among multiple agents, and a Temporal Fusion Transformer (TFT) learns both long-term dependencies and local feature dynamics. To enhance interpretability, a variable selection network identifies the dominant multimodal features shaping urban image evolution. The dataset, drawn from Hefei’s urban discourse, contains 18,426 text entries and 9,307 paired images; the evaluation adopts a fixed 7:2:1 split, with CLIP and BLIP-2 as embedding baselines. Experimental results show high semantic alignment between text–image pairs (e.g., 0.87 for “Hefei Metro Expansion” and “Metro Station,” 0.89 for “USTC Research Breakthrough” and “USTC Campus”), strong knowledge graph relations (e.g., 0.82 for the USTC–High-tech Zone linkage), and accurate temporal forecasting, with an RMSE of 0.061 compared with 0.083 for LSTM and 0.079 for GRU. The findings indicate that the proposed graph-temporal multimodal framework provides an interpretable, data-driven methodology for quantifying and analyzing urban image formation.
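As a rough illustration of the two kinds of scores reported above, the sketch below computes a text–image alignment score and a forecasting RMSE on toy data. It assumes that alignment is measured as cosine similarity between embedding vectors, which is the usual choice for CLIP-style embeddings but is not stated in the abstract. The function names and all vectors and series are hypothetical and are not taken from the Hefei dataset.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rmse(predicted, observed):
    """Root mean squared error between two equal-length series."""
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical 3-dimensional embeddings standing in for a text and an image
# vector (real CLIP embeddings have hundreds of dimensions).
text_vec = [0.6, 0.8, 0.0]
image_vec = [0.8, 0.6, 0.0]
alignment = cosine_similarity(text_vec, image_vec)

# Hypothetical forecast and observed values of an urban image indicator.
forecast = [0.50, 0.62, 0.71]
actual = [0.53, 0.58, 0.71]
error = rmse(forecast, actual)

print(f"alignment={alignment:.2f}, rmse={error:.3f}")
```

Under this reading, an alignment score close to 1 means the text and image embeddings point in nearly the same direction, and a lower RMSE means the forecast tracks the observed series more closely.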

Authors

  • Lixia Xu, Anhui International Studies University, Hefei 231200, China

DOI:

https://doi.org/10.31449/inf.v50i8.11578

Published

03/23/2026

How to Cite

Xu, L. (2026). Graph-Temporal Deep Learning for Urban Image Modeling From Multimodal Discourse Interaction: A Case Study of Hefei. Informatica, 50(8). https://doi.org/10.31449/inf.v50i8.11578