Ensemble-Based Text Classification for Spam Detection

This research proposes an ensemble-based approach to spam detection in digital communication, addressing the escalating challenge posed by unsolicited messages, commonly known as spam. The exponential growth of online platforms has made effective information filtering systems necessary for maintaining security and efficiency. The proposed approach involves three main components: feature extraction, classifier selection, and decision fusion. Word embeddings are explored as the feature extraction technique for representing text messages effectively. Multiple classifiers, including recurrent neural network (RNN) variants such as LSTM and GRU, are evaluated to identify the best performers for spam detection. The ensemble model combines the strengths of the individual classifiers to achieve higher accuracy, precision, and recall. The approach is evaluated with widely accepted metrics on benchmark datasets to assess its generalizability and robustness. The experimental results demonstrate that the ensemble-based approach outperforms individual classifiers, offering an efficient solution for combating spam messages. Integrating this approach into existing spam filtering systems can contribute to improved online communication, better user experience, and enhanced cybersecurity, mitigating the impact of spam in the digital landscape.


Introduction
The pervasive expansion of digital communication platforms has revolutionized global connectivity, enabling seamless information exchange and unprecedented interactivity [1]. However, this growth has also ushered in a persistent and escalating challenge: the proliferation of unsolicited and often malicious messages, commonly referred to as spam. These intrusive messages not only disrupt efficient communication but also pose substantial risks to the security and integrity of online interactions [2]. Consequently, effective spam detection mechanisms have become imperative to sustain the safety, efficiency, and user experience of digital communication channels.

In response to the mounting threat of spam, this research introduces a comprehensive ensemble-based approach to spam detection. The approach addresses the intricate dynamics of spam identification by leveraging the collective power of diverse classifiers within a unified framework [3]. It comprises three fundamental components: feature extraction, classifier selection, and decision fusion. At the heart of the approach lies the adoption of advanced feature extraction techniques, specifically word embeddings [4]. These techniques harness the semantic nuances of language to transform text messages into dense vector representations, enabling more effective spam detection [5]. Concurrently, a spectrum of classifiers is evaluated, including Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures. This assessment seeks to identify the combination of classifiers that discerns spam messages most accurately.

A central tenet of our research is the strategic combination of individual classifier outputs through an ensemble model. This collaborative approach capitalizes on the inherent strengths of diverse classifiers, resulting in heightened accuracy, precision, and recall in spam detection [6]. To gauge the efficacy of the proposed method, extensive experimentation is conducted using established metrics and benchmark datasets. This evaluation assesses the generalizability and robustness of the approach across various contexts and data distributions. The results provide evidence that the ensemble-based approach significantly surpasses the performance of individual classifiers in combating spam messages. Integrating the approach into existing spam filtering systems can improve communication, enhance user experience, and strengthen cyber security. This research is a significant stride towards mitigating the pervasive impact of spam in the contemporary digital realm.

The contribution of the work is as follows:
1. The synthesis of recent literature reinforces the interdisciplinary nature of the proposed ensemble-based approach, which harnesses deep learning, ensemble methods, and context-awareness to mitigate the menace of spam in digital communication.

System model
The proposed approach holds significant potential for real-world applications, particularly in the domain of spam detection. In practical scenarios, its impact lies in enhancing the accuracy and reliability of spam detection systems. By integrating diverse deep learning architectures, including AlexNet, VGG-16, ResNet-50, and an ensemble of Recurrent Neural Networks (Ens_RNN), the model can capture both intricate visual features and temporal dependencies within the data. This combination addresses the multifaceted nature of spam, which manifests in various forms, including image-based spam and evolving text patterns.

One key improvement over existing spam detection systems is the inherent flexibility of the ensemble approach. Combining different neural network architectures allows a more holistic understanding of the diverse characteristics of spam content. This flexibility is particularly beneficial when adapting to new and emerging spam patterns, keeping the system robust against evolving spam techniques. The use of recurrent neural networks also improves detection accuracy where sequential patterns or temporal dependencies play a crucial role, such as in identifying phishing attempts or evolving spam campaigns.

The novelty of our research lies in the integration of both convolutional and recurrent neural network architectures within an ensemble framework. While ensemble methods themselves are not novel, the innovation in our approach lies in the effective combination of diverse models, each specialized in capturing specific aspects of spam content. This combination enhances the overall performance of the system. Furthermore, the explicit consideration of temporal dependencies through an ensemble of recurrent neural networks is a novel contribution, as it addresses a critical aspect often overlooked in traditional spam detection systems.

The proposed ensemble-based spam detection approach follows a straightforward and systematic workflow to identify and block spam messages in digital communication. It involves several key stages. First, a diverse dataset containing both spam and legitimate messages is collected and cleaned. Irrelevant characters are removed, and messages are transformed into a machine-readable format, which prepares the data for analysis. Next, different learning algorithms, referred to here as "detectives," are selected and trained. These detectives learn from the dataset to recognize patterns that distinguish spam from legitimate messages. The detectives' decisions are then combined through a group decision-making process, similar to teamwork: if most detectives agree that a message is spam, the system classifies it as spam. Context and emotional cues are also considered by analyzing the situation, the sender, and the emotional tone of messages using sentiment analysis. This enhances the system's ability to differentiate between types of messages.

To ensure the system's effectiveness, regular testing and evaluation measure how well the detectives and the group decision are performing. This helps identify areas for improvement and fine-tuning. Once the system proves effective, it can be integrated into email or messaging platforms. Continuous monitoring keeps it up to date and adaptive to changing spam patterns. Feedback from users plays a vital role in refining the system: mistakes such as labelling a legitimate message as spam are learned from and used to improve the system over time. The system's impact is assessed by measuring the number of spam messages detected and evaluating its overall accuracy. Findings are documented to share insights and contribute to the improvement of email and messaging systems.

In essence, the ensemble-based spam detection approach combines data processing, intelligent analysis, teamwork among algorithms, context understanding, user feedback, and continuous improvement to create a robust and reliable defence against spam messages in digital communication.
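The group decision described above can be illustrated with a minimal sketch. The function names, the tie-break rule, and the 0.5 threshold are assumptions made for illustration; the source does not specify them.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by most classifiers.

    Ties fall back to "legitimate" (an assumed, conservative tie-break
    that avoids blocking a message without a clear majority).
    """
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "legitimate"
    return ranked[0][0]


def soft_vote(spam_probs, weights=None, threshold=0.5):
    """Average per-classifier spam probabilities, optionally weighted.

    Returns the fused label and the fused spam score.
    """
    if weights is None:
        weights = [1.0] * len(spam_probs)
    score = sum(w * p for w, p in zip(weights, spam_probs)) / sum(weights)
    label = "spam" if score >= threshold else "legitimate"
    return label, score


# Three hypothetical classifiers voting on one message.
hard_label = majority_vote(["spam", "spam", "legitimate"])
soft_label, soft_score = soft_vote([0.9, 0.8, 0.2])
```

Hard voting only needs class labels, while soft voting needs calibrated probabilities but lets a confident classifier outweigh two uncertain ones.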

A. Preprocessing
The initial phase of the project involves the collection and preparation of data, a critical step for the effectiveness of the proposed ensemble-based spam detection approach. A diverse dataset encompassing both spam and legitimate text messages is curated. These messages are manually labelled as either "spam" or "legitimate" to establish a reliable ground truth for model training and evaluation. The collected dataset undergoes a cleaning process in which noise, special characters, and irrelevant details are removed. To ensure consistent analysis, all text is converted to lowercase, and common words with little semantic content (stopwords) are excluded. Tokenization splits the text into meaningful units, which can be words or smaller subword components. Word embeddings, here Word2Vec, then convert words into numerical vectors that capture their semantics. Finally, the dataset is split into distinct subsets: the training set is used to fit the model, the validation set assists in parameter tuning, and the test set provides a final assessment of the model's capabilities. This data collection and preprocessing phase lays the groundwork for subsequent stages, contributing to the overall accuracy and efficiency of the ensemble-based spam detection approach.
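The cleaning and splitting steps above can be sketched as follows. The stopword list, the 70/15/15 split ratios, and the function names are illustrative assumptions, not values taken from the study; the Word2Vec step is omitted here.

```python
import random
import re

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "you", "your", "for"}


def preprocess(message):
    """Lowercase, strip special characters, tokenize on whitespace, drop stopwords."""
    text = message.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def split_dataset(samples, train=0.7, val=0.15, seed=0):
    """Shuffle and split into train / validation / test subsets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


tokens = preprocess("WIN a FREE prize!!! Click now")
train_set, val_set, test_set = split_dataset(range(20))
```

Fixing the shuffle seed keeps the three subsets reproducible across runs, which matters when several classifiers are compared on the same test set.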

B. Tsallis entropy-based segmentation
Tsallis entropy-based segmentation is a novel way to improve the accuracy and resilience of text classification. Its core notion is Tsallis entropy, a generalization of Shannon entropy. The method uses the information dynamics and irregularities of a text to better capture its patterns, dividing the text into meaningful segments that may represent distinct categories or themes. Combining segmentation and categorization in this way may help address the complexity and diversity of textual information.

Combining Tsallis entropy-based segmentation with text categorization requires several phases. To maintain consistency, the text data is first preprocessed using tokenization, stopword removal, and stemming [12]. Tsallis entropy is then calculated for each segment to characterize its linguistic properties. In text categorization, Tsallis entropy helps identify linguistic patterns linked with the various classes. Higher Tsallis entropy values in some segments may suggest complexity or divergence, indicating unique content. This information helps the classification algorithm choose a category or label for each text segment. The approach may improve the accuracy and interpretability of sentiment analysis, topic modelling, and content categorization. The fundamental properties of Tsallis entropy complement standard text categorization, enabling more nuanced and effective processing of textual data.

Shannon redefined entropy to measure uncertainty based on the information content of a system:

$$S = -\sum_{i} p_i \ln p_i$$

Shannon entropy is additive. For two independent systems $A$ and $B$:

$$S(A + B) = S(A) + S(B)$$

Using a general entropy construction and concepts from fractal theory, Tsallis entropy extends this to non-extensive systems:

$$S_q = \frac{1 - \sum_{i} p_i^{q}}{q - 1}$$

where $q$ is the entropic index, which indicates the degree of non-extensiveness, and $p_i$ is the probability of occurrence of state $i$. For a system composed of two statistically independent subsystems, the entropy follows the pseudo-additive rule:

$$S_q(A + B) = S_q(A) + S_q(B) + (1 - q)\, S_q(A)\, S_q(B)$$

Tsallis entropy can also be used to determine the ideal threshold for an image. Consider a grayscale image with $L$ grey levels in the range $\{0, 1, \ldots, L - 1\}$ and probability distribution $p_i = p_0, p_1, \ldots, p_{L-1}$. Tsallis multilevel thresholding selects the thresholds that maximize the pseudo-additive combination of the Tsallis entropies of the resulting classes.
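As a worked illustration of the entropy computation for a text segment, the sketch below evaluates Tsallis entropy over a segment's token distribution. Treating each distinct token as a state with probability equal to its relative frequency is an assumption made for this example; the source does not state how $p_i$ is obtained for text. The helper names are hypothetical.

```python
import math
from collections import Counter


def token_probabilities(tokens):
    """Relative frequency of each distinct token in a segment."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(probs):
    """S = -sum(p_i * ln p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def tsallis_entropy(probs, q):
    """S_q = (1 - sum(p_i ** q)) / (q - 1); reduces to Shannon entropy as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(probs)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


def joint_independent(pa, pb):
    """Joint distribution of two statistically independent systems."""
    return [x * y for x in pa for y in pb]


# A repetitive segment scores lower than a varied one.
repetitive = token_probabilities(["free", "free", "free", "prize"])
varied = token_probabilities(["meeting", "moved", "to", "friday"])
s_rep = tsallis_entropy(repetitive, q=2.0)
s_var = tsallis_entropy(varied, q=2.0)

# Numerical check of pseudo-additivity for independent systems A and B.
pa, pb, q = [0.5, 0.5], [0.25, 0.75], 2.0
lhs = tsallis_entropy(joint_independent(pa, pb), q)
rhs = (tsallis_entropy(pa, q) + tsallis_entropy(pb, q)
       + (1 - q) * tsallis_entropy(pa, q) * tsallis_entropy(pb, q))
```

The entropic index `q` is a free parameter in this sketch; its value would need to be tuned for the data at hand.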

C. Non-linear data augmentation
Non-linear data augmentation is a technique applied to enhance the performance and generalization ability of text categorization models. It involves creating new instances of text data by applying various non-linear transformations that preserve the semantics and meaning of the original text [13]. This approach aims to diversify the training data, making the model more robust and capable of handling variations in language usage and expression.
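The source does not specify which transformations are used. As one common meaning-preserving transformation, the sketch below substitutes synonyms at random; the synonym table, the replacement rate, and the function name are all hypothetical and stand in for whatever transformations the study applied.

```python
import random

# Hypothetical, hand-written synonym table used only for illustration.
SYNONYMS = {
    "free": ["complimentary", "gratis"],
    "win": ["earn", "gain"],
    "prize": ["reward", "award"],
}


def augment(tokens, rate=0.5, seed=None):
    """Return a new token list with some tokens swapped for a synonym.

    Tokens without an entry in SYNONYMS are always kept unchanged,
    so the message length and word order are preserved.
    """
    rng = random.Random(seed)
    augmented = []
    for tok in tokens:
        options = SYNONYMS.get(tok)
        if options and rng.random() < rate:
            augmented.append(rng.choice(options))
        else:
            augmented.append(tok)
    return augmented


original = ["win", "a", "free", "prize", "today"]
variant = augment(original, rate=1.0, seed=0)
```

Each augmented message keeps the label of the original, so the training set grows without additional manual labelling.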

D. Ensemble feature extraction
Ensemble feature extraction with Word2Vec combines the strengths of ensemble methods with the semantic information carried by Word2Vec word embeddings. The combination is designed to improve the representation of textual data across a range of natural language processing tasks. Its foundation is Word2Vec's ability to map words to dense, contextually informed vectors that capture semantic relationships.

The process unfolds as follows. First, Word2Vec embeddings are obtained from a pre-trained model, giving each word in the corpus a high-dimensional vector that reflects its meaning. An ensemble of diverse feature extraction methods is then applied to these embeddings, including averaging, weighted averaging, and stacking, among others. The outcome is a set of feature representations for each text fragment, each obtained through a distinct extraction mechanism. During classifier training, these features serve as input. The classifiers can be trained for a range of natural language processing objectives, such as sentiment analysis, text classification, or named entity recognition. At prediction time, the outputs of the classifiers are combined through ensemble methods: majority voting, weighted voting, or stacking. This aggregate decision draws on the complementary views captured by the ensemble feature extraction process.

The strength of ensemble feature extraction with Word2Vec comes from combining the semantic detail of the embeddings with the multiple viewpoints provided by ensemble strategies. This improves both representation and resilience, potentially leading to better model performance and broader applicability. As with any advanced approach, practical considerations include computational cost and the need for careful hyperparameter tuning.

The selection of classifiers and feature extraction techniques in this study was guided by their efficacy in addressing the complexities of the medical imaging datasets under investigation. AlexNet, VGG-16, and ResNet-50, known for their success in image classification tasks, were chosen for their ability to capture intricate features in medical images. Their deep, hierarchical architectures allow relevant features to be extracted automatically, without manual engineering. Additionally, an ensemble of Recurrent Neural Networks (Ens_RNN) was introduced to capture temporal dependencies within the data, an essential consideration in medical time series. The ensemble approach was deemed appropriate to enhance model robustness by leveraging the diversity of the individual models.

Regarding ensemble methods, a straightforward averaging approach was chosen for its simplicity and effectiveness in maintaining model diversity. While alternative strategies such as bagging and boosting were considered, the diverse nature of the chosen base models rendered more complex ensemble methods unnecessary. The decision was guided by a desire for a transparent and interpretable methodology. To assess the performance of the models, a comprehensive set of metrics was employed: accuracy, precision, recall, specificity, false positive rate (FPR), and false negative rate (FNR). This choice was motivated by the nuanced nature of medical data, where different types of classification errors can have different consequences. By articulating these methodological choices, this paper aims to provide clarity and transparency in our approach, facilitating a deeper understanding and reproducibility of the results.
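The embedding pooling step described in this section (averaging, weighted averaging, and stacking) can be sketched as below. The two-dimensional toy vectors stand in for pre-trained Word2Vec embeddings, and the per-token weights are arbitrary; both are assumptions for illustration, as is the reading of "stacking" as feature concatenation.

```python
# Toy 2-d vectors standing in for pre-trained Word2Vec embeddings (assumption).
EMBEDDINGS = {
    "free": [1.0, 0.0],
    "prize": [0.0, 1.0],
    "meeting": [-1.0, 0.0],
}


def embed(tokens):
    """Look up each token's vector, skipping out-of-vocabulary tokens."""
    return [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]


def mean_pool(vectors):
    """Unweighted average of the word vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def weighted_pool(vectors, weights):
    """Weighted average, e.g. with per-token importance weights."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]


def stack_features(*feature_sets):
    """Concatenate several feature vectors into one longer vector."""
    return [x for features in feature_sets for x in features]


vectors = embed(["free", "prize", "unknownword"])
avg = mean_pool(vectors)
wavg = weighted_pool(vectors, weights=[3.0, 1.0])
combined = stack_features(avg, wavg)
```

Each pooled representation can feed its own classifier, or the stacked vector can feed a single one; which arrangement the study used is not stated.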

E. Classification using ensemble RNN:
We propose an ensemble approach that combines the LSTM, Bi-LSTM, and GRU deep learning architectures.

LSTM-GRU classifier: This network mitigates the vanishing gradient problem by adding a memory unit, known as a cell, that can judge whether information is useful or not. Three gates are arranged in a cell: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. The cell functionality is defined as follows:
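For reference, the standard LSTM cell is commonly written as shown here. This is the textbook formulation and is assumed, not confirmed, to match the one used in this work; the symbols $x_t$ (input at step $t$), $W$, $U$ (weight matrices), $b$ (biases), and $\tilde{c}_t$ (candidate cell state) follow common convention.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The forget gate scales the previous cell state and the input gate scales the new candidate, so the cell state is updated additively, which is what lets gradients survive over long sequences.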

Performance analyses
The ensemble-based text classification approach for spam detection is compared with SVM [14], RF [15], and NB [16] using several performance metrics. These metrics provide insight into the model's accuracy, precision, and recall, and into its ability to handle different aspects of the classification task.
• Accuracy: The proportion of correctly classified messages out of the total messages in the dataset. It provides an overall measure of the model's correctness.
• Precision: The proportion of true positive predictions (correctly identified spam) out of all positive predictions (both true positives and false positives). Precision is particularly relevant when the cost of false positives is high.
• Recall (Sensitivity): The proportion of true positive predictions out of all actual positive instances. Recall is valuable when the cost of false negatives (missed spam) is a concern.
• Specificity: The proportion of true negative predictions (correctly identified legitimate messages) out of all actual negative instances. It measures how reliably legitimate messages are kept out of the spam class.
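The metrics above, together with the false positive and false negative rates reported in the results, all follow from the four confusion-matrix counts. A minimal sketch, assuming "spam" is the positive class (the function names are illustrative):

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Derive the evaluation metrics from the confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty classes

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),  # equals 1 - specificity
        "fnr": ratio(fn, fn + tp),  # equals 1 - recall
    }


y_true = ["spam", "spam", "spam", "legitimate", "legitimate",
          "legitimate", "legitimate", "legitimate"]
y_pred = ["spam", "spam", "legitimate", "legitimate", "legitimate",
          "legitimate", "legitimate", "spam"]
results = classification_metrics(*confusion_counts(y_true, y_pred))
```

In this toy example one spam message is missed and one legitimate message is flagged, so accuracy is 0.75 while specificity is 0.8 and recall is 2/3.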

A. Dataset description
The SpamDetectionDataset was collected from various online platforms, including social media, emails, and online forums. The dataset was curated to include a diverse range of text messages, encompassing both legitimate content and unsolicited messages commonly known as "spam." The dataset was compiled for the purpose of developing and evaluating an ensemble-based text classification approach for spam detection.

Conclusions
This research has introduced and demonstrated the efficacy of an ensemble-based approach for tackling the persistent and escalating challenge of spam detection in digital communication. As the online landscape continues to expand, effective information filtering systems that safeguard security and optimize efficiency become increasingly critical. The approach focuses on three key components: feature extraction, classifier selection, and decision fusion. Word embedding techniques represent text messages, forming the foundation for subsequent analysis. The evaluation of multiple classifiers, including RNN models such as LSTM and GRU, has enabled the identification of the best performers. Combining these classifiers into an ensemble model capitalizes on their strengths, resulting in higher accuracy, precision, and recall for spam detection. Through extensive experimentation and benchmarking on widely accepted datasets, the robustness and applicability of the approach have been established. The ensemble-based technique consistently outperforms individual classifiers, offering a pragmatic solution to the challenge of spam messages.

Integrating this approach into existing spam filtering systems is expected to improve the quality of online communication, the user experience, and cyber security, fortifying the digital landscape against the intrusive and disruptive impact of spam. In a world where digital communication is central, the demonstrated effectiveness of this ensemble-based approach is a promising step towards safer, more efficient, and user-centric online interactions. Future work in this domain may further refine and extend the approach, continuing to bolster the fight against the ever-evolving threat of spam.
The workflow of the text classification approach is shown in Fig 1.

Figure 1: Workflow of text classification

Table 1: Literature contributions to spam detection and classification

In the field of text classification, several related works focus on improving accuracy and performance. The literature survey covers advances in spam detection, text classification, and ensemble methods over the last five years. Recent research has highlighted the potential of deep learning models, ensemble techniques, and innovative feature extraction methods, shaping the groundwork for the proposed ensemble-based approach to spam detection.

The transformative impact of deep learning on text classification is evident in breakthrough models such as BERT (Devlin et al., 2019) and in the diverse architectures explored by Chen et al. (2020). These studies underline the significance of contextual understanding and feature extraction, both pivotal for the success of our ensemble approach. L. Shi et al. (2021) focus on email spammers and introduce graph embedding for detection, which aligns with the decision fusion and context-awareness of the proposed approach. F. M. Couto et al. (2019) demonstrate a deep learning approach for detecting spam on Twitter, offering insights into social media-specific spam characteristics; this exploration of diverse platforms broadens the scope of the proposed approach. While focused on cyberbullying, M. M. Zulfikar et al. (2020) highlight the role of sentiment analysis in detection, which relates to the sentiment-based analysis in the ensemble decision fusion strategy. The detection of malicious URLs (Gupta & Soni, 2020) aligns conceptually with spam detection, reinforcing the importance of algorithm selection and evaluation. Additionally, Maatuk and Abbass (2020) highlight the contextual nuances of spam detection in online social networks, mirroring the decision fusion component's emphasis on context-aware analysis.

These related works contribute to the advancement of text classification by exploring various deep learning architectures, transfer learning, ensemble techniques, and other machine learning algorithms. They provide valuable insights and benchmark results, inspiring further research in this critical domain.

Here, $\sigma$ is the sigmoid non-linear function and $\tanh$ is the hyperbolic tangent non-linear function. The per-gate weight matrices are learnable weights. $\odot$ refers to element-wise multiplication. $c_t$ and $c_{t-1}$ denote the cell state at time steps $t$ and $t-1$, $h_t$ and $h_{t-1}$ denote the hidden state at time steps $t$ and $t-1$, and $t$ denotes the $t$-th time step.

Table 4: Comparison of precision

Table 9: Overall comparative analysis

Across varying dataset sizes, Ens_RNN consistently outperforms its counterparts, achieving high accuracy, precision, recall, and specificity while maintaining low false positive and false negative rates. The accuracy comparison (Table 3, Fig 2) shows Ens_RNN starting at 97% for 2000 samples and steadily improving to 98.6% for 10,000 samples. Precision values (Table 4, Fig 3) show Ens_RNN reaching 99.7% for 10,000 samples, while SVM, RF, and NB maintain relatively stable precision levels. Ens_RNN's recall rates (Table 5, Fig 4) consistently exceed those of the other methods, reflecting its strong ability to identify and classify spam messages. Specificity values (Table 6, Fig 5) further highlight Ens_RNN's reliability in classifying legitimate messages, starting at 98.8% for 2000 samples and maintaining this level. The comparison of false positive rates (FPR) (Table 7, Fig 6) underscores Ens_RNN's capability to reduce false positives, in contrast with an increasing trend in FPR for the other methods. Additionally, the analysis of false negative rates (FNR) (Table 8, Fig 7) shows Ens_RNN's consistency in minimizing misclassifications of actual spam messages. In summary, Ens_RNN emerges as a robust and effective solution for spam detection, consistently outperforming traditional methods across multiple performance metrics and affirming its potential to enhance the reliability and efficiency of spam detection in diverse digital communication channels.