Enhanced Hate Speech Detection in Indonesian-English Code-Mixed Texts Using XLM-RoBERTa
Abstract
DOI: https://doi.org/10.31449/inf.v49i21.7713
