Ensemble Feature Fusion of VGG16, ResNet50, and Vision Transformer for Pneumonia Detection in Chest X-ray Images
Abstract
This study proposes a novel heterogeneous ensemble deep learning architecture for pneumonia classification from chest X-ray images that integrates two pretrained convolutional neural networks (CNNs), VGG16 and ResNet50, with a fine-tuned vision transformer (ViT). The model employs a feature-level fusion strategy: deep local spatial features extracted by the CNN backbones are concatenated and fed into the ViT, which captures global contextual relationships via self-attention. This design addresses the limitations of standalone CNN and ViT models by combining their complementary strengths. Ablation studies and experimental evaluations show that the ensemble model outperforms the individual CNN and ViT baselines, achieving an accuracy of 98.5%, precision of 98.7%, recall of 98.3%, F1-score of 98.5%, and an area under the receiver operating characteristic curve (AUC-ROC) of 0.99 on the pneumonia X-ray dataset. The architecture balances detailed local feature extraction with holistic global context modelling, offering a robust and efficient solution for medical image classification.
DOI: https://doi.org/10.31449/inf.v50i12.9647
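The feature-level fusion described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say how the CNN feature maps are turned into tokens, what dimensions are used, or how the ViT is configured. The sketch assumes both backbones produce one feature vector per spatial location, concatenates them channel-wise, and applies a single self-attention step with identity projections. The function names (`fuse_features`, `self_attention`) and the toy values are hypothetical.

```python
import math


def fuse_features(feat_a, feat_b):
    """Concatenate two feature maps channel-wise, location by location.

    Each map is a list of tokens (one per spatial location); each token is a
    list of channel values. Assumption: both maps cover the same locations.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature maps must have the same number of locations")
    return [a + b for a, b in zip(feat_a, feat_b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplification: queries, keys and values are the tokens themselves.
    A trained transformer would use learned projections instead.
    Returns (outputs, attention_weights).
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this location to every location, including itself.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output token is a weighted mix of all fused tokens.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, tokens)) for c in range(d)])
    return outputs, weights


# Toy stand-ins for backbone outputs: 4 spatial locations each.
vgg_like = [[0.1, 0.4], [0.3, 0.2], [0.0, 0.5], [0.6, 0.1]]  # 2 channels
resnet_like = [[0.2, 0.1, 0.0], [0.5, 0.3, 0.1],
               [0.4, 0.4, 0.2], [0.1, 0.0, 0.3]]  # 3 channels

fused = fuse_features(vgg_like, resnet_like)  # 4 tokens, 2 + 3 channels each
attended, attn = self_attention(fused)
```

Concatenation keeps the local descriptors from both backbones side by side at each location, while the attention step lets every location draw on all others, which is the local-plus-global combination the abstract refers to. In a full model a classification head would follow the attended tokens.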
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







