Semi-Automatic Construction and Benchmarking of a Word-Segmented Corpus for Lao Using LLMs and Transformer Models

Ha Tien Nguyen, Thongphan Palongve, Cuong Quy Nguyen, Kien Trung Le

Abstract


Word segmentation is a fundamental task in Natural Language Processing (NLP), particularly for continuous-script languages such as Lao and Thai, where the absence of spaces between words makes boundary detection difficult. In this paper, we present the first publicly released Lao word-segmentation corpus, together with a semi-automatic construction pipeline that combines automatic pre-annotation with human verification. Pre-annotation was generated primarily with GPT-4o (May 2024 snapshot), with a smaller subset produced using GPT-5o; all outputs were then corrected through systematic verification by trained native annotators. A senior Lao linguist served as adjudicator to resolve difficult cases and enforce consistency. The final resource comprises 10,000 training sentences and a 1,000-sentence gold-standard test set. To demonstrate its utility, we benchmarked Lao word segmentation with an XLM-RoBERTa transformer model, which achieved a boundary-level F1 score of 0.75, surpassing the widely used LaoNLP baseline (0.71). This corpus fills a critical resource gap for Lao and provides a reproducible foundation for future research on word segmentation, POS tagging, machine translation, and other downstream tasks in low-resource language processing.
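As an illustration of the metric reported above, the following is a minimal sketch of one common way to compute a boundary-level F1 score. It assumes that a boundary is a character offset at which one word ends and the next begins, that the sentence-final position is not counted, and that counts are micro-averaged over the test set. The abstract does not state the exact definition used, and the function names `boundaries` and `boundary_f1` are hypothetical, so this should be read as an assumption rather than the authors' evaluation script.

```python
from typing import List, Set, Tuple


def boundaries(words: List[str]) -> Set[int]:
    """Return the character offsets where a word ends and another begins.

    The end of the sentence is excluded, since every segmentation
    trivially agrees on it (an assumption of this sketch).
    """
    offsets: Set[int] = set()
    position = 0
    for word in words[:-1]:
        position += len(word)
        offsets.add(position)
    return offsets


def boundary_f1(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged boundary precision, recall, and F1 over all sentences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    true_pos = false_pos = false_neg = 0
    for gold_words, pred_words in zip(gold, pred):
        # Both segmentations must cover exactly the same character sequence.
        if "".join(gold_words) != "".join(pred_words):
            raise ValueError("gold and predicted sentences differ in their characters")
        gold_b = boundaries(gold_words)
        pred_b = boundaries(pred_words)
        true_pos += len(gold_b & pred_b)
        false_pos += len(pred_b - gold_b)
        false_neg += len(gold_b - pred_b)
    predicted = true_pos + false_pos
    expected = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with placeholder Latin strings standing in for Lao text.
gold_example = [["ab", "cd", "e"]]   # gold boundaries at offsets 2 and 4
pred_example = [["ab", "c", "de"]]   # predicted boundaries at offsets 2 and 3
scores = boundary_f1(gold_example, pred_example)
```

In the toy example, one of the two predicted boundaries matches the gold segmentation, so precision, recall, and F1 are all 0.5. Scoring boundaries rather than whole words gives partial credit to a prediction that places some of a sentence's cuts correctly.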




DOI: https://doi.org/10.31449/inf.v49i27.11195

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.