MGC-SIFT: A Multimodal Graph-Based Color SIFT Descriptor for Content-Based Image Retrieval
Abstract
Content-Based Image Retrieval (CBIR) systems critically depend on discriminative yet efficient feature representations to retrieve relevant images from large-scale databases. However, many existing handcrafted and graph-based methods face limitations in scalability and in jointly modeling multimodal information such as color, texture, and spatial relationships. To address these challenges, this paper proposes a novel feature extraction framework termed Multimodal Graph Color SIFT (MGC-SIFT). In the proposed approach, color-augmented SIFT descriptors extracted in the YCbCr color space are organized as a graph of local keypoints, over which Graph Neural Networks (GNNs) are applied to model inter-keypoint spatial relationships. An attention mechanism is incorporated to emphasize discriminative keypoint regions, while proxy-based learning is employed to improve representation compactness and retrieval efficiency. The effectiveness of MGC-SIFT is evaluated on four benchmark datasets (Corel-1K, COIL-20, Oxford-102 Flowers, and UC-Merced Land Use) covering natural scenes, controlled object images, fine-grained categories, and aerial imagery. Experimental evaluation using standard CBIR metrics, including mean Average Precision (mAP), Precision@k, Recall@k, F1-score@k, and Accuracy@k, demonstrates that the proposed method achieves consistent and competitive retrieval performance across heterogeneous datasets and remains robust under image degradation. Ablation studies further confirm the complementary contributions of color augmentation, graph-based modeling, attention mechanisms, and proxy-based learning. In addition, runtime and memory analyses indicate that proxy-based learning significantly reduces retrieval latency, supporting scalable image retrieval. Overall, the proposed MGC-SIFT framework provides a robust and interpretable multimodal representation for CBIR by explicitly modeling joint color-spatial dependencies at the local keypoint level, offering a practical solution for scalable image retrieval in real-world applications.
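As a concrete illustration of the color-augmentation step, the sketch below computes SIFT descriptors per YCbCr channel and concatenates them into a single 384-D color SIFT vector, one common way to color-augment SIFT. It uses OpenCV; the function name and the choice to detect keypoints once on the luma channel are illustrative assumptions, not the paper's exact procedure.

```python
import cv2
import numpy as np

def ycbcr_color_sift(image_bgr, max_keypoints=500):
    """Detect SIFT keypoints on luma and describe each YCbCr channel."""
    # Note: OpenCV's conversion yields channel order (Y, Cr, Cb).
    ycc = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    sift = cv2.SIFT_create(nfeatures=max_keypoints)
    # Detect once on the luma channel so descriptors from all three
    # channels stay aligned at the same keypoint locations.
    keypoints = sift.detect(ycc[:, :, 0], None)
    per_channel = []
    for c in range(3):
        _, desc = sift.compute(ycc[:, :, c], keypoints)
        per_channel.append(desc)
    positions = np.array([kp.pt for kp in keypoints], dtype=np.float32)
    # (N, 2) keypoint coordinates and (N, 384) concatenated descriptors.
    return positions, np.hstack(per_channel)
```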
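The graph construction and message-passing stage can be pictured as follows. This is a minimal plain-PyTorch sketch under assumed design choices (k-nearest-neighbor spatial connectivity, one round of mean aggregation, softmax attention pooling over keypoints); the names knn_edges and KeypointGNN are hypothetical.

```python
import torch
import torch.nn as nn

def knn_edges(xy, k=8):
    """Connect each keypoint to its k spatially nearest neighbours."""
    dist = torch.cdist(xy, xy)                      # (N, N) pairwise distances
    dist.fill_diagonal_(float("inf"))               # exclude self-matches
    nbrs = dist.topk(k, largest=False).indices      # (N, k) neighbour ids
    src = torch.arange(xy.size(0), device=xy.device).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])     # (2, N*k) edge index

class KeypointGNN(nn.Module):
    """One round of mean-aggregation message passing + attention pooling."""
    def __init__(self, in_dim=384, hidden=256):
        super().__init__()
        self.msg = nn.Linear(in_dim, hidden)
        self.att = nn.Linear(hidden, 1)             # per-keypoint relevance score

    def forward(self, x, edge_index):
        src, dst = edge_index
        agg = x.new_zeros(x.size(0), self.msg.out_features)
        agg.index_add_(0, dst, self.msg(x[src]))    # sum incoming messages
        deg = torch.bincount(dst, minlength=x.size(0)).clamp(min=1)
        h = torch.relu(agg / deg.unsqueeze(1))      # mean over neighbours
        w = torch.softmax(self.att(h), dim=0)       # attention over keypoints
        return (w * h).sum(dim=0)                   # single image embedding
```

In this sketch, the keypoint positions from the color SIFT step would feed knn_edges, while the 384-D descriptors form the node features x; the attention weights realize the "emphasize discriminative keypoint regions" idea at pooling time.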
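Proxy-based learning replaces exhaustive pair or triplet mining with comparisons against a small set of learnable class proxies, which is what makes training and retrieval cheaper at scale. Below is a minimal Proxy-NCA-style loss, offered as a sketch of the general technique rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

class ProxyNCALoss(torch.nn.Module):
    """Each class is summarised by one learnable proxy; a batch is compared
    against C proxies instead of against all other training samples."""
    def __init__(self, num_classes, dim, scale=8.0):
        super().__init__()
        self.proxies = torch.nn.Parameter(torch.randn(num_classes, dim))
        self.scale = scale                       # temperature on cosine similarity

    def forward(self, embeddings, labels):
        z = F.normalize(embeddings, dim=1)
        p = F.normalize(self.proxies, dim=1)
        logits = self.scale * z @ p.t()          # (B, C) similarities to proxies
        return F.cross_entropy(logits, labels)   # pull to own proxy, push from rest
```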
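The retrieval metrics follow their standard definitions; for reference, Precision@k and per-query Average Precision (whose mean over all queries gives mAP) can be computed as:

```python
import numpy as np

def precision_at_k(ranked_labels, query_label, k):
    """Fraction of the top-k retrieved items that share the query's label."""
    return float(np.mean(np.asarray(ranked_labels[:k]) == query_label))

def average_precision(ranked_labels, query_label):
    """AP for one query; mAP is the mean of this value over all queries."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    prec = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((prec * rel).sum() / rel.sum())
```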
DOI: https://doi.org/10.31449/inf.v50i1.10558