Efficient Transformer Architectures for Diabetic Retinopathy Classification from Fundus Images: DR-MobileViT, DR-EfficientFormer, and DR-SwinTiny
Abstract
Diabetic retinopathy (DR) is a prevalent cause of vision loss, and screening for it requires efficient diagnostic tools, particularly in resource-limited settings. This study presents three lightweight transformer-based models (DR-MobileViT, DR-EfficientFormer, and DR-SwinTiny) for automated DR classification from fundus images (APTOS 2019: 3,662 images; Messidor-2: 1,748 images). Images are preprocessed by resizing to 224×224 pixels and applying CLAHE enhancement. The models use compact architectures (1.8–3.5M parameters), are initialized with ImageNet pretrained weights, and are trained with the AdamW optimizer and data augmentation. DR-MobileViT integrates convolutional and transformer layers, DR-EfficientFormer employs a dimension-consistent design, and DR-SwinTiny uses shifted window attention. Evaluated on the APTOS 2019 and Messidor-2 datasets, the models achieve quadratic weighted kappa (QWK) scores up to 0.89 and areas under the ROC curve (AUC) up to 0.95. They approach the performance of the top-performing CNN ensembles from the APTOS 2019 challenge (which exceed 40M parameters) while reducing inference time to 10–15 ms/image on an NVIDIA P100 GPU and computational overhead by over 90%. These results indicate their potential for scalable, point-of-care DR screening and for early detection in underserved regions.
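For readers unfamiliar with the QWK metric reported above, the following is a minimal sketch of how quadratic weighted kappa is commonly computed for ordinal grades. It is not taken from the paper: the function name `quadratic_weighted_kappa`, the default of five DR grades (0 to 4, as in APTOS 2019), and the handling of the degenerate single-class case are assumptions made for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Quadratic weighted kappa between two lists of integer grades.

    Assumes grades are integers in [0, num_classes - 1] and num_classes >= 2;
    5 is an assumed default matching the 0-4 DR grading scale.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    k = num_classes
    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    # Marginal histograms of true and predicted grades.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: grows with the squared grade distance.
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two marginals.
            weighted_expected += w * hist_true[i] * hist_pred[j] / n

    if weighted_expected == 0.0:
        # Assumed convention: all labels and predictions fall in one class.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A score of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Because the penalty is quadratic in grade distance, confusing grade 0 with grade 4 costs far more than confusing adjacent grades, which is why the metric suits ordinal DR severity scales.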
DOI: https://doi.org/10.31449/inf.v49i29.8695

This work is licensed under a Creative Commons Attribution 3.0 License.