PVT-EfficientNet Dual Encoder with Mamba Head for Efficient Underwater Image Classification

Aufaclav Zatu Kusuma Frisky; Ari Dwi Hartanto; Hanum Khairana Fatmah; Fadillah Siva; Waffiq Maaroja; Novelio Putra Indarto; Mohammad Akbar Ghifari Tuasikal; Jia Ching Wang

doi:10.31449/inf.v50i13.12034

Abstract

Identification and classification of marine organisms remain challenging due to pose variation, partial occlusions, and underwater imaging conditions. Existing Convolutional Neural Network (CNN) and Transformer models often struggle to capture long-range context while maintaining computational efficiency. This research proposes PVT Fusion Mamba, an architecture that integrates PVT-v2-B2 and EfficientNet-B0 in a dual-encoder backbone, followed by a hierarchical fusion neck and a Mamba-based classification head. The architecture enables effective multi-scale feature integration and efficient long-range dependency modeling with linear complexity, while dynamically emphasizing organism-relevant features and suppressing background noise. We conducted experiments on the ROUD Dataset across ten marine organism classes. An extensive ablation study confirmed the synergistic effect of the dual-encoder fusion: the combined PVT Fusion Mamba architecture significantly outperforms its single-encoder counterparts (EfficientNet-B0 and PVT-v2-B2) in both convergence speed and final accuracy. In comparative studies against models such as ResNet50 and YOLOv8, the proposed architecture also achieved superior performance. PVT Fusion Mamba attained a state-of-the-art accuracy of 98.6% with an optimized validation loss of 0.062 (at configuration C1 = 96, C2 = 288, C3 = 192). Analysis of the confusion matrix reveals excellent classification performance, with most errors occurring only between morphologically similar species. The results demonstrate that PVT Fusion Mamba overcomes the limitations of previous methods, achieving superior accuracy and robustness at reduced computational cost compared to established deep learning models.
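
The abstract attributes the head's efficiency to long-range dependency modeling with linear complexity. As a rough illustration only, and not the authors' implementation, the Python sketch below shows the kind of per-step state recurrence that state-space models such as Mamba build on. The function name `selective_scan`, the scalar hidden state, and the per-step parameters `a`, `b`, and `c` are assumptions made for this example; a real Mamba head works on multi-channel features with learned, input-dependent parameters.

```python
from typing import List, Sequence


def selective_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Toy scalar state-space scan (hypothetical helper, not from the paper).

    For each step t:
        h_t = a_t * h_{t-1} + b_t * x_t   (state update)
        y_t = c_t * h_t                   (readout)
    The loop visits every token exactly once, so the cost grows
    linearly with the sequence length.
    """
    h = 0.0  # hidden state carried across the whole sequence
    y: List[float] = []
    for x_t, a_t, b_t, c_t in zip(x, a, b, c):
        h = a_t * h + b_t * x_t  # a_t controls how much past context is kept
        y.append(c_t * h)
    return y


if __name__ == "__main__":
    tokens = [1.0, 0.0, 0.0, 0.0]
    # Constant decay: the first token still influences later outputs.
    print(selective_scan(tokens, [0.5] * 4, [1.0] * 4, [1.0] * 4))
    # Per-step decay of 0.0 at step 2 discards the accumulated context.
    print(selective_scan(tokens, [0.5, 0.0, 0.5, 0.5], [1.0] * 4, [1.0] * 4))
```

Because `a`, `b`, and `c` may differ at every step, the recurrence can keep or drop earlier context depending on the current input. This loosely mirrors the abstract's description of emphasizing organism-relevant features while suppressing background noise, although the paper's actual mechanism is not specified here.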

Author Biographies

Aufaclav Zatu Kusuma Frisky, Department of Computer Science and Electronics, Universitas Gadjah Mada

Assistant Professor, Department of Computer Science and Electronics, Universitas Gadjah Mada

Ari Dwi Hartanto, Department of Mathematics, Universitas Gadjah Mada

Assistant Professor, Department of Mathematics, Universitas Gadjah Mada

Hanum Khairana Fatmah, Department of Computer Science and Electronics, Universitas Gadjah Mada

Master’s student in Computer Science, Department of Computer Science and Electronics, Universitas Gadjah Mada

Fadillah Siva, Department of Computer Science and Electronics, Universitas Gadjah Mada

Master’s student in Computer Science, Department of Computer Science and Electronics, Universitas Gadjah Mada

Waffiq Maaroja, Department of Computer Science and Electronics, Universitas Gadjah Mada

Master’s student in Computer Science, Department of Computer Science and Electronics, Universitas Gadjah Mada

Novelio Putra Indarto, Department of Computer Science and Electronics, Universitas Gadjah Mada

Master’s student in Electronics and Instrumentation, Department of Computer Science and Electronics, Universitas Gadjah Mada

Mohammad Akbar Ghifari Tuasikal, Department of Computer Science and Electronics, Universitas Gadjah Mada

Master’s student in Computer Science, Department of Computer Science and Electronics, Universitas Gadjah Mada

Jia Ching Wang, Department of Computer Science and Information Engineering, National Central University

Professor, Department of Computer Science and Information Engineering, National Central University

