Graph-Temporal Deep Learning for Urban Image Modeling From Multimodal Discourse Interaction: A Case Study of Hefei
Abstract
This paper presents a computational framework for urban image modeling driven by multimodal discourse interaction. The framework integrates cross-modal representation learning, graph-based semantic modeling, and temporal sequence prediction. Images and texts are embedded into a unified semantic space using the CLIP and BLIP-2 models, so that text and image content can be compared directly. A knowledge graph of urban discourse is constructed and modeled with a Graph Attention Network (GAT) to capture semantic relationships among multiple agents, while a Temporal Fusion Transformer (TFT) learns both long-term dependencies and local feature dynamics. To enhance interpretability, a variable selection network identifies the dominant multimodal features shaping urban image evolution. Experiments on Hefei’s urban discourse show high semantic alignment between text–image pairs (e.g., 0.87 for “Hefei Metro Expansion” and “Metro Station,” 0.89 for “USTC Research Breakthrough” and “USTC Campus”), strong knowledge graph relations (e.g., 0.82 for the USTC–High-tech Zone linkage), and accurate temporal forecasting, with an RMSE of 0.061. The dataset contains 18,426 text entries and 9,307 paired images, and the evaluation adopts a fixed 7:2:1 split with CLIP and BLIP-2 as embedding baselines. In comparison, LSTM and GRU models yield RMSE values of 0.083 and 0.079, respectively. The findings confirm that the proposed graph-temporal multimodal framework provides an interpretable, data-driven methodology for quantifying and analyzing urban image formation.
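The two kinds of scores reported in the abstract can be illustrated with a minimal sketch. It assumes that text–image alignment is measured as cosine similarity between embedding vectors and that forecasting error is the standard root-mean-square error; the abstract does not state the exact formulas. All vectors and series below are hypothetical toy values, not data from the paper, and the function names are chosen for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted series."""
    if len(y_true) != len(y_pred):
        raise ValueError("series must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical 4-dimensional embeddings standing in for a text entry and
# its paired image after projection into a shared semantic space.
text_vec = [0.20, 0.70, 0.10, 0.60]
image_vec = [0.25, 0.65, 0.05, 0.70]
alignment = cosine_similarity(text_vec, image_vec)

# Hypothetical observed and forecast values of an urban image indicator.
observed = [0.50, 0.55, 0.60, 0.58]
forecast = [0.52, 0.53, 0.63, 0.55]
error = rmse(observed, forecast)

print(f"alignment={alignment:.3f}, rmse={error:.3f}")
```

Under these assumptions, an alignment score closer to 1 means the text and image embeddings point in nearly the same direction, and a lower RMSE means the forecast tracks the observed series more closely.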
DOI: https://doi.org/10.31449/inf.v50i8.11578
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







