Integration of Multiscale Fusion and Cross-scale Attention Refinement for Enhanced Target Detection Using MSFNet

Abstract

Object recognition across varying scales remains a persistent challenge in computer vision, especially in scenes with occlusion, low contrast, and diverse spatial resolutions. Conventional convolutional neural networks with fixed receptive fields often fail to capture both fine-grained details and high-level contextual cues. This study develops a scale-adaptive detection framework to overcome these limitations. The proposed MSFNet (Multiscale Fusion Network) uses a Dual-Stream Convolutional Backbone to extract low-level and high-level features in parallel. A Scale-Adaptive Feature Fusion Module (SAFFM) integrates the multiscale representations through dynamic, scale-aware weighting, and a Cross-Scale Attention Refinement (CSAR) module enhances discriminative features while suppressing irrelevant or redundant information. The architecture runs end to end and is optimized for both detection accuracy and real-time inference speed. On MS COCO 2017 and PASCAL VOC 2012, MSFNet achieves 47.3% AP and 81.5% mAP, respectively. On the COCO benchmark, it exceeds Faster R-CNN, YOLOv5, and RetinaNet by 3.8%, 4.5%, and 3.2% AP, respectively. MSFNet thus provides a scalable, accurate, and computationally efficient approach to multiscale object recognition, enabling deployment in real-time applications such as autonomous driving, intelligent surveillance, and remote sensing.
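
The abstract does not specify how SAFFM computes its dynamic, scale-aware weights. As a rough illustration of the general idea only, the sketch below fuses per-scale feature vectors with input-dependent softmax weights. Everything in it is an assumption made for illustration: the function names `softmax` and `fuse_scales` are hypothetical, features are plain Python lists already resized to a common length, and each scale is scored by its mean activation as a stand-in for whatever learned gating the actual module uses.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features: Sequence[Sequence[float]]) -> List[float]:
    """Fuse same-length per-scale feature vectors with input-dependent weights.

    Assumption: each scale is scored by its mean activation (a placeholder
    for a learned gating layer), and the scores are normalized with softmax
    so the per-scale weights are positive and sum to 1.
    """
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all scales must be resized to the same length first")
    scores = [sum(f) / length for f in features]
    weights = softmax(scores)
    # Weighted sum across scales, computed position by position.
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(length)]


if __name__ == "__main__":
    fine = [0.2, 0.8, 0.1, 0.5]    # hypothetical high-resolution features
    coarse = [0.6, 0.6, 0.6, 0.6]  # hypothetical low-resolution features
    print(fuse_scales([fine, coarse]))
```

Because the weights are recomputed from each input, the contribution of every scale changes with the content of the features, which is the property the phrase "dynamic, scale-aware weighting" suggests. A real module would operate on multi-channel feature maps and learn its scoring function.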

Author Biography

Xiaofang Liao, South China Business College, Guangdong University of Foreign Studies

Intelligent Information Research Institute

Authors

  • Xiaofang Liao, South China Business College, Guangdong University of Foreign Studies
  • Xinnan Liu, Guangdong Technology College

DOI:

https://doi.org/10.31449/inf.v49i37.9896

Published

12/24/2025

How to Cite

Liao, X., & Liu, X. (2025). Integration of Multiscale Fusion and Cross-scale Attention Refinement for Enhanced Target Detection Using MSFNet. Informatica, 49(37). https://doi.org/10.31449/inf.v49i37.9896