A Hybrid OCR-XGBoost-Transformer Pipeline for Resume Parsing with Spatial-Semantic Integration
Abstract
This study addresses the automation of resume information extraction with a hybrid Artificial Intelligence (AI) framework that integrates Optical Character Recognition (OCR), machine learning, and deep learning techniques. The system operates in three stages: text extraction with PaddleOCR, resume section classification with XGBoost, and semantic entity recognition with a Transformer-based Named Entity Recognition (NER) model. The dataset consists of 200 French resumes collected in PDF format and annotated with ten resume section classes and multiple named entities. Evaluation used standard multi-class classification metrics: accuracy, precision, recall, and F1-score. Experimental results show that XGBoost achieved 96.5% accuracy in section classification, while the Transformer model attained 82% accuracy in semantic entity extraction. By pairing section classification with entity recognition, the pipeline captures both the spatial and the semantic structure of resumes, offering improved accuracy and adaptability over traditional parsing approaches.
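The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not state how per-class scores were averaged, so macro averaging is an assumption here, and the function name `macro_metrics` and the section labels in the example are hypothetical, not taken from the study's dataset.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1.

    Macro averaging (unweighted mean over classes) is an assumption;
    the abstract only lists the metric names.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        # Report 0.0 for a class with no predictions or no true samples.
        return num / den if den else 0.0

    precisions = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    f1s = [safe_div(2 * p * r, p + r) for p, r in zip(precisions, recalls)]

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical section labels, for illustration only.
y_true = ["education", "education", "experience", "skills"]
y_pred = ["education", "experience", "experience", "skills"]
scores = macro_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

With imbalanced section classes, macro averaging weights every class equally, so a rare section that is often misclassified lowers the score more than it would under plain accuracy.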
DOI: https://doi.org/10.31449/inf.v49i12.9588
This work is licensed under a Creative Commons Attribution 3.0 License.








