Categorization of Event Clusters from Twitter Using Term Weighting Schemes
Abstract
A real-world event is commonly represented on Twitter as a collection of repetitive and noisy text messages posted by different users. Term weighting is a popular pre-processing step for text classification, especially when the size of the dataset is limited. In this paper, we propose a new term weighting scheme and a modification to an existing one and compare them with many state-of-the-art methods using three popular classifiers. We create a labelled Twitter dataset of events for exhaustive cross-validation experiments and use another Twitter event dataset for cross-corpus tests. The proposed schemes are among the best performers in many experiments, with the proposed modification significantly improving the performance of the original scheme. We create two majority voting based classifiers that further enhance the F1-scores of the best individual schemes.
DOI:
https://doi.org/10.31449/inf.v45i3.3063
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







