MORSE-QE: A Morphology-Aware, Embedding-Driven Framework with Root Extraction for Arabic Dialectal Query Expansion
Abstract
Information retrieval (IR) for Arabic, a diglossic and morphologically complex language, poses major difficulties because dialectal queries often fail to retrieve documents written in Modern Standard Arabic (MSA). This paper introduces MORSE-QE (Morphology-aware, Optimized, Root-driven Semantic Expansion for Query Enhancement), a four-stage approach that combines rule-based morphological processing and embedding-based expansion in parallel. The stages are: (i) dialect-to-MSA normalization using curated lexicons, (ii) root extraction via AlKhalil Morpho Sys 2, (iii) semantic expansion with AraVec embeddings, and (iv) root-driven filtering to reduce morphological noise. Experiments were conducted on the QADI dataset (2,000 dialectal queries) and the TREC 2001 Arabic Corpus (383,872 MSA documents), using Mean Average Precision (MAP), Precision@10, and Root Recall as evaluation metrics. MORSE-QE achieves MAP gains of 15–18% over neural baselines (Deep Averaging Networks, DANs, and Deep Median Networks, DMNs) on QADI and 15% over RM3 on TREC 2001, and raises Root Recall from 65% (DMNs) to 88%. Ablation studies show that dialect normalization and root-based filtering contribute relative MAP improvements of 19.5% and 28.9%, respectively. These results demonstrate that MORSE-QE provides a scalable and interpretable solution for bridging dialectal and morphological gaps in Arabic IR.
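The four stages named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the dictionaries DIALECT_TO_MSA, ROOTS and NEIGHBOURS are tiny hypothetical stand-ins for the curated lexicons, AlKhalil Morpho Sys 2 and AraVec, none of which is actually called, and the filtering rule (keep a candidate only if it shares a root with a query term) is one plausible reading of "root-driven filtering", not necessarily the authors' rule.

```python
from typing import Dict, List, Optional

# Hypothetical toy resources, for illustration only.
# Stand-in for the curated dialect-to-MSA lexicons (stage i).
DIALECT_TO_MSA: Dict[str, str] = {
    "عايز": "أريد",    # Egyptian "I want" -> MSA
    "عربية": "سيارة",  # Egyptian "car" -> MSA
}
# Stand-in for a root extractor such as AlKhalil Morpho Sys 2 (stage ii).
ROOTS: Dict[str, str] = {
    "أريد": "رود",
    "سيارة": "سير",
    "سيارات": "سير",
    "مسير": "سير",
    "طائرة": "طير",
}
# Stand-in for nearest neighbours from AraVec embeddings (stage iii).
NEIGHBOURS: Dict[str, List[str]] = {
    "سيارة": ["سيارات", "طائرة", "مسير"],
}


def normalize(query: List[str]) -> List[str]:
    """Stage (i): map dialectal tokens to MSA; unknown tokens pass through."""
    return [DIALECT_TO_MSA.get(tok, tok) for tok in query]


def extract_root(token: str) -> Optional[str]:
    """Stage (ii): look up the root, or None if the toy table lacks it."""
    return ROOTS.get(token)


def expand(query: List[str]) -> List[str]:
    """Stages (iii) and (iv): add embedding neighbours, then keep only
    those whose root matches the root of some normalized query term
    (assumed filtering rule)."""
    msa = normalize(query)
    query_roots = {r for r in map(extract_root, msa) if r is not None}
    expanded = list(msa)
    for tok in msa:
        for cand in NEIGHBOURS.get(tok, []):
            if extract_root(cand) in query_roots and cand not in expanded:
                expanded.append(cand)
    return expanded
```

With these toy tables, calling expand on the two dialectal tokens in DIALECT_TO_MSA yields the two MSA forms plus the two neighbours that share the root of "سيارة"; the neighbour with a different root ("طائرة") is filtered out. A real system would replace each dictionary with the corresponding resource.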
DOI: https://doi.org/10.31449/inf.v49i20.9628
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







