Integration of Multiscale Fusion and Cross-scale Attention Refinement for Enhanced Target Detection Using MSFNet

Abstract

Object recognition across varying scales remains a persistent challenge in computer vision, especially in scenes with occlusion, low contrast, and diverse spatial resolutions. Conventional convolutional neural networks with fixed receptive fields often fail to capture both fine-grained details and high-level contextual cues. This study develops a scale-adaptive detection framework to overcome these limitations. The proposed MSFNet (Multiscale Fusion Network) employs a Dual-Stream Convolutional Backbone to extract low-level and high-level features in parallel. A Scale-Adaptive Feature Fusion Module (SAFFM) integrates the multiscale representations through dynamic, scale-aware weighting, and a Cross-Scale Attention Refinement (CSAR) module enhances discriminative features while suppressing irrelevant or redundant information. The architecture operates end to end and is optimized for both detection accuracy and real-time inference speed. In experiments, MSFNet achieves 47.3% AP on MS COCO 2017 and 81.5% mAP on PASCAL VOC 2012. On the COCO benchmark, it exceeds Faster R-CNN, YOLOv5, and RetinaNet by +3.8%, +4.5%, and +3.2% AP, respectively. MSFNet thus provides a scalable, accurate, and computationally efficient approach to multiscale object recognition, suitable for real-time applications such as autonomous driving, intelligent surveillance, and remote sensing.
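The abstract does not specify how SAFFM computes its scale-aware weights, so the following is only a minimal sketch of one common way to realize dynamic weighting across scales: each scale receives a score, the scores are normalized with a softmax, and the feature maps are combined as a weighted sum. The function names (`softmax`, `fuse_scales`), the scoring rule (mean absolute activation), and the toy feature vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features):
    """Fuse per-scale feature vectors with softmax-normalized weights.

    `features` is a list of equal-length vectors, one per scale, assumed
    to be already resized to a common shape. The scoring rule below
    (mean absolute activation) is a hypothetical stand-in for whatever
    learned scoring the actual SAFFM module uses.
    """
    scores = [sum(abs(v) for v in f) / len(f) for f in features]
    weights = softmax(scores)
    length = len(features[0])
    fused = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(length)
    ]
    return weights, fused


# Toy inputs: a "fine" and a "coarse" feature vector (illustrative values).
fine = [0.2, 0.9, 0.1, 0.4]
coarse = [0.5, 0.5, 0.5, 0.5]
weights, fused = fuse_scales([fine, coarse])
```

Because the weights sum to one, every fused value is a convex combination of the corresponding per-scale values. In a trained network the scores would typically come from a small learned sub-network rather than a fixed statistic.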

Author Biography

  • Xiaofang Liao, South China Business College Guangdong University of Foreign Studies
    Intelligent Information Research Institute

Authors

  • Xiaofang Liao, South China Business College Guangdong University of Foreign Studies
  • Xinnan Liu, Guangdong Technology College

DOI:

https://doi.org/10.31449/inf.v49i37.9896

Published

12/24/2025

How to Cite

Integration of Multiscale Fusion and Cross-scale Attention Refinement for Enhanced Target Detection Using MSFNet. (2025). Informatica, 49(37). https://doi.org/10.31449/inf.v49i37.9896