Gender Classification on Twitter Based on Feeds and User Descriptions Using XLNet-FastText
Abstract
Gender falsification on social media is an increasingly troubling problem: users often hide their true gender identity or pose as members of a different gender. This can lead to negative consequences, including the spread of disinformation, discrimination, and online security risks. To address this problem, this research proposes a text-classification approach to identifying gender falsification in social media text. The method extracts linguistic features from text, such as word usage, sentence structure, and language patterns, that can provide clues to the author's gender. Specifically, this research introduces a transformer-based approach that uses XLNet augmented with additional FastText embeddings. The modification is made in the embedding layer and is intended to improve XLNet's understanding of text context during classification. For gender classification based on Twitter feeds, baseline XLNet achieves accuracy, precision, recall, and F1-score of 0.704, 0.770, 0.598, and 0.674, respectively, while XLNet-FastText achieves 0.714, 0.770, 0.609, and 0.680. For gender classification based on user account descriptions, baseline XLNet achieves accuracy, precision, recall, and F1-score of 0.705, 0.771, 0.598, and 0.674, respectively, while XLNet-FastText achieves 0.724, 0.751, 0.6324, and 0.686.
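The reported F1-scores can be related to the reported precision and recall through the standard harmonic-mean definition. The sketch below is not the authors' evaluation code; it only recomputes F1 from the precision and recall values quoted in the abstract, and the names `f1_score` and `reported` are illustrative assumptions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the abstract.
# The dictionary layout is an assumption made for this illustration.
reported = {
    ("feeds", "XLNet"): (0.770, 0.598, 0.674),
    ("feeds", "XLNet-FastText"): (0.770, 0.609, 0.680),
    ("descriptions", "XLNet"): (0.771, 0.598, 0.674),
    ("descriptions", "XLNet-FastText"): (0.751, 0.6324, 0.686),
}

for (task, model), (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{task:12s} {model:15s} recomputed F1 = {f1:.3f} (reported {f1_reported:.3f})")
```

Because the published precision and recall are themselves rounded, the recomputed values can differ from the reported ones in the third decimal place.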
DOI: https://doi.org/10.31449/inf.v48i20.5761
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







