PVT-EfficientNet Dual Encoder with Mamba Head for Efficient Underwater Image Classification
Abstract
Identification and classification of marine organisms remain challenging due to pose variation, partial occlusions, and underwater imaging conditions. Existing Convolutional Neural Network (CNN) and Transformer models often struggle to capture long-range context while maintaining computational efficiency. This research proposes a new architecture, PVT Fusion Mamba, which integrates PVT-v2-B2 and EfficientNet-B0 in a dual-encoder backbone, followed by a hierarchical fusion neck and a Mamba-based classification head. This architecture enables effective multi-scale feature integration and efficient long-range dependency modeling with linear complexity, while dynamically emphasizing organism-relevant features and suppressing background noise. We conducted experiments using the ROUD Dataset across ten marine organism classes. An extensive ablation study confirmed the synergistic effect of the dual-encoder fusion, demonstrating that the combined PVT Fusion Mamba architecture significantly outperforms its single-encoder counterparts (EfficientNet-B0 and PVT-v2-B2) in terms of convergence speed and final accuracy. Furthermore, in comparative studies against models such as ResNet50 and YOLOv8, our proposed architecture achieved superior performance. The PVT Fusion Mamba architecture attained state-of-the-art accuracy of 98.6% with an optimized validation loss of 0.062 (at configuration C1 = 96, C2 = 288, C3 = 192). Analysis of the confusion matrix reveals excellent classification performance, with most errors occurring only between morphologically similar species. The results demonstrate that PVT Fusion Mamba successfully overcomes the limitations of previous methods, achieving superior accuracy and robustness with reduced computational cost compared to established deep learning models.
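The abstract attributes linear-complexity long-range modeling to the Mamba-based head. As a rough illustration of why such a head scales linearly with sequence length, the toy below runs a scalar, input-dependent (selective) state-space recurrence in a single pass over the sequence. It is a simplified sketch, not the authors' implementation: the function name `selective_scan`, the scalar weights `w_delta`, `w_b`, `w_c`, and the decay constant `a` are all hypothetical, and a real Mamba block uses vector states, learned projections, and a hardware-aware parallel scan.

```python
import math


def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a=1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective state-space scan (illustrative assumption only).

    h_t = exp(-delta_t * a) * h_{t-1} + delta_t * b_t * x_t
    y_t = c_t * h_t
    where delta_t, b_t and c_t all depend on the current input x_t.
    """
    h = 0.0
    ys = []
    for x in xs:  # one pass over the sequence: O(L) time, O(1) state
        delta = softplus(w_delta * x)  # input-dependent step size, always > 0
        decay = math.exp(-delta * a)   # lies in (0, 1) when a > 0
        b = w_b * x                    # input-dependent input gate
        c = w_c * x                    # input-dependent output gate
        h = decay * h + delta * b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    # Hypothetical token sequence, e.g. one channel of flattened fused features.
    print(selective_scan([0.2, -0.1, 0.4, 0.0, 0.3]))
```

Because the hidden state `h` is updated once per token and carries information forward from every earlier position, the cost grows linearly with the number of tokens, in contrast to the quadratic cost of full self-attention over the same sequence.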
DOI: https://doi.org/10.31449/inf.v50i13.12034
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







