A novel term weighting scheme for imbalanced text classification
Abstract
High-dimensional features are the main problem in the text domain. When the classes are also imbalanced, classifier performance degrades further. Moreover, in this setting it is very difficult to improve performance by addressing the imbalance with oversampling. In this paper, we propose a new term weighting scheme that combines term frequency with an average of the inverse document frequency factor. We denote our scheme TFmeanIDF. The proposed method has high potential for imbalanced, high-dimensional text domains and requires no feature selection or oversampling. Extensive comparisons on 7 datasets validate the advantages of TFmeanIDF in terms of the F1 score obtained with widely used base classifiers such as logistic regression and Support Vector Machines. We found that the F1 score of the minority class is higher than with the baseline term weighting schemes. Using TFmeanIDF as the term weighting scheme gives promising results for both logistic regression and Support Vector Machines.
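The abstract does not state the TFmeanIDF formula, so the following is only a minimal sketch of one possible reading: the "average of the inverse document frequency factor" is assumed here to be the mean of smoothed IDF values computed separately within each class, so that a small class contributes as much to a term's weight as a large one. The function names `mean_idf` and `tf_mean_idf`, the add-one smoothing, and the toy corpus are all illustrative assumptions, not the authors' definition.

```python
import math
from collections import Counter


def mean_idf(docs, labels):
    """Mean of per-class smoothed IDF values for each term (assumed reading)."""
    classes = sorted(set(labels))
    vocab = {term for doc in docs for term in doc}
    idf_sum = dict.fromkeys(vocab, 0.0)
    for c in classes:
        # Each document of class c is reduced to its set of distinct terms.
        class_docs = [set(doc) for doc, y in zip(docs, labels) if y == c]
        n_c = len(class_docs)
        df = Counter(term for doc in class_docs for term in doc)
        for term in vocab:
            # Add-one smoothing (an assumption) keeps the log finite when df = 0.
            idf_sum[term] += math.log((n_c + 1) / (df[term] + 1)) + 1.0
    # Every class gets the same say in the average, regardless of its size.
    return {term: total / len(classes) for term, total in idf_sum.items()}


def tf_mean_idf(doc, midf):
    """Weight each term of one tokenised document by raw TF times mean IDF."""
    tf = Counter(doc)
    return {term: count * midf[term] for term, count in tf.items() if term in midf}


# Toy imbalanced corpus: one minority ("spam") and three majority ("ham") documents.
docs = [
    ["cheap", "pills", "cheap"],
    ["meeting", "today"],
    ["meeting", "notes"],
    ["project", "meeting"],
]
labels = ["spam", "ham", "ham", "ham"]

midf = mean_idf(docs, labels)
weights = tf_mean_idf(docs[0], midf)
print(weights)
```

Under these assumptions, a term confined to the minority class ("cheap") receives a larger mean IDF than a term that appears in every majority document ("meeting"), which is the kind of behaviour a weighting scheme aimed at minority-class F1 would want. The resulting dictionaries could then be vectorised and passed to a classifier such as logistic regression or an SVM.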
DOI: https://doi.org/10.31449/inf.v46i2.3523
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







