Feature Extraction and Classification of Text Data by Combining Two-Stage Feature Selection Algorithm and Improved Machine Learning Algorithm



Introduction
As information technology develops, the Internet has generated massive amounts of text data, especially in fields such as medicine, finance, and journalism. These text data contain a wealth of information and knowledge that is significant for improving business decision-making, market analysis, disease diagnosis, and so on. However, because these data are large and complex, effectively extracting useful information from them and classifying text accurately has become a challenging problem [1][2][3]. The core of text classification lies in accurately and efficiently identifying and classifying large amounts of unlabeled text data, which directly affects the quality and application effect of information extraction. Firstly, text data contain a large amount of redundant information, which is not only irrelevant to the classification task but also interferes with the judgment of the classifier and reduces classification accuracy. Secondly, the feature distribution of text data is often uneven, which makes it difficult for traditional classification algorithms to maintain stable and efficient performance across different types of datasets [4][5]. To this end, a Two-stage Feature Selection (TFS) algorithm that fuses Information Gain (IG) and improved Minimum Redundancy Maximum Relevance (MRMR) is proposed, and a Fourier hybrid kernel function is introduced to enhance the effect of the Support Vector Machine (SVM) in text classification. Through these innovations, the research aims to process large-scale text data more efficiently and to improve classification accuracy. This has practical value for information processing and decision support in medical diagnosis, news topic analysis, and market trend forecasting. The study consists of four parts. The first part summarizes the relevant research results and shortcomings of feature extraction at home and abroad. The second part proposes the fusion of TFS and improved machine learning algorithms. The third part analyzes the experimental results of the proposed algorithm and includes a discussion related to current research. The fourth part summarizes the experimental results, points out the shortcomings of the research, and proposes future research directions.
In the field of machine learning, SVM has become one of the core technologies of text classification due to its excellent classification performance. The performance of SVM depends largely on the quality of feature selection and extraction. Feature selection is important when dealing with large-scale text data, and effective feature extraction is essential to improve classification accuracy and efficiency [6]. Some of the relevant studies are as follows. Ahmed Y A et al. proposed a weighted MRMR algorithm for better estimating the significance of features in data captured from cyberattacks. This technique combined enhanced weighted MRMR with term frequency-inverse document frequency and further incorporated an improved entropy approach, which was used to evaluate the weights of the features generated by the algorithm. The results showed that the proposed algorithm performed well [7]. Jiménez-Cordero et al. proposed an MRMR-based embedded feature selection method that trades off complexity against classification accuracy. The algorithm used duality theory to reformulate the min-max problem and solved it using off-the-shelf nonlinear optimization software. Experiments on public datasets demonstrated the effectiveness and practicability of the method [8]. Wang et al. proposed an SVM kernel function selection mechanism. First, the types of kernel function best suited to the given data were chosen. Then, SVMs with these kernel types were used for classification. The results verified the superiority of the mechanism [9]. Sun et al. proposed a feature selection algorithm for multi-label data with missing labels. A multi-label uncertainty measure based on fuzzy neighborhood entropy was proposed, and the MRMR algorithm was improved to evaluate the candidate features. The results showed that this algorithm selected important features with better classification performance [10].
Jia et al. proposed an improved barnacle pairing optimizer combined with an SVM algorithm. Gaussian mutation and a logic model were used to improve the optimizer from different perspectives, and the results showed better performance than the other comparison methods. In addition, the model showed significant superiority over other classifiers [11]. Yin et al. proposed an SVM algorithm based on simulated annealing for the identification of different motion patterns. First, simulated annealing was used to obtain the optimal SVM parameters. Then, the MRMR algorithm was used for feature extraction, and the classifier was trained with five-fold cross-validation. The results showed that the accuracy of the algorithm was 98% [12]. Bansal et al. proposed a hybrid MRMR feature selection technique using a multi-objective method for automatic sign language recognition. The MRMR algorithm was used as a preprocessor to remove redundant and irrelevant features, and a multi-class SVM was used as the classifier. The results showed that more accurate classification was achieved with a smaller feature vector [13]. Zhou et al. proposed a feature selection method based on Mutual Information (MI) and correlation coefficients. In this method, the correlation coefficient was first introduced and then combined with MI to measure the relationship between features. To select low-redundancy features effectively, minimization was also used in the evaluation criteria. The results showed that the proposed method had good feature classification ability [14].

Reference | Authors | Method | Shortcomings
[7] | Ahmed et al. | Selecting ransomware attack features through a weighted MRMR algorithm. | No involvement in the field of text classification; lack of further research.
[8] | Jiménez-Cordero et al. | Selecting features from the dataset using an embedded feature selection method based on MRMR. | No involvement in the field of text classification; lack of further research.
[9] | Wang et al. | An SVM kernel function selection mechanism was proposed for bearing fault diagnosis. | Using a single kernel function may not match the data distribution.
[10] | Sun et al. | A fuzzy neighborhood entropy based MRMR algorithm was proposed for feature selection. | No involvement in the field of text classification; lack of further research.
[11] | Jia et al. | Using an SVM algorithm based on an improved barnacle pairing optimizer for high-dimensional data testing. | Lack of consideration for data redundancy issues.
[13] | Bansal et al. | Sign language feature selection is performed using a hybrid MRMR feature selection technique, and classification is performed using a multi-class SVM. | Using a single kernel function may not match the data distribution.
[14] | Zhou et al. | Using a feature selection method based on MI and correlation coefficients. | There is still room for optimization in handling redundant features.
Table 1 presents recent research findings and their shortcomings. In summary, although many scholars have studied SVM and feature selection in machine learning and applied them in many fields, existing methods still face the problems of high redundancy, data sparsity, and insufficient classification accuracy when processing large-scale text data. To solve the redundancy problem in feature selection, a TFS that fuses IG and improved MRMR is proposed. To further improve text classification, an improved SVM algorithm based on a Fourier hybrid kernel function is proposed. This study has a positive effect on improving the accuracy and processing efficiency of text classification [15][16][17][18][19].
Previous studies have addressed the issue of feature redundancy, but there is still room for optimization and improvement. Some studies have focused on feature redundancy but neglected the optimization of classification algorithms. Others have used a single kernel function in the classification algorithm, which may not match the data distribution. Both feature redundancy and algorithm optimization need to be considered to achieve accurate classification results. Compared to previous studies, this research considers not only high data redundancy but also the correlation between features and the semantic relationships of the context, which helps the TFS algorithm select text features more accurately. The classification algorithm employs a hybrid kernel function based on the Fourier kernel function, which overcomes the limitations of a single kernel function. The proposed approach is therefore better suited to text classification than previous methods.

Text data feature extraction and classification by integrating two-stage feature selection and machine learning algorithms
To improve text classification and reduce feature redundancy, a fused TFS and an improved machine learning algorithm are proposed. Firstly, a TFS based on the IG and MRMR algorithms is proposed. On this basis, an improved SVM algorithm is further proposed.

Text data feature extraction and classification based on two-stage feature selection algorithm
In text classification tasks, it is crucial to select the right features. This process mainly involves removing secondary words and retaining highly expressive keywords, which reduces the complexity of the feature space and prevents high dimensionality from degrading classification performance. In this study, an IG-MRMR TFS is used to fuse IG and MRMR. The feature words selected by the IG-MRMR algorithm are used to vectorize the text, and the resulting vectors are passed to the SVM for text classification, as shown in Figure 1.
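As a rough illustration of the first stage, the sketch below scores terms by information gain in its common textbook form, that is, the class entropy minus the entropy conditioned on a term's presence or absence. The toy documents, the function names `entropy` and `information_gain`, and the set-of-words document representation are assumptions made for this example, not the study's implementation.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)]."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = 0.0
    for part in (with_term, without_term):
        if part:  # skip an empty side to avoid dividing by zero
            conditional += len(part) / n * entropy(list(Counter(part).values()))
    return entropy(list(Counter(labels).values())) - conditional


# Toy corpus (hypothetical): each document is a set of words.
docs = [{"free", "offer"}, {"free", "win"}, {"meeting", "notes"}, {"project", "notes"}]
labels = ["spam", "spam", "ham", "ham"]
ig_scores = {t: information_gain(docs, labels, t) for t in ("free", "notes", "win")}
ranked = sorted(ig_scores, key=ig_scores.get, reverse=True)
```

Terms that appear in only one class ("free", "notes") receive the maximum score here, while "win", which appears in a single document, scores lower. Note that this score depends only on document counts, which is the limitation of IG discussed in the text.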
In equation (1), the final probability term is that of category $C$ under the condition that the feature word is absent. In the process of feature screening, the IG algorithm focuses too much on the number of documents and ignores the importance of word frequency, which weakens the predictive and representational ability of the selected features. In addition, IG considers not only the presence of feature words but also their absence; it focuses mainly on the role of features in classification and ignores the distribution of features between and within categories. Therefore, the feature set selected by IG needs to be further optimized. The MRMR algorithm is a filtering method using spatial search, which calculates the relevance and redundancy of features through MI. Figure 2 illustrates this feature selection process.
In TFS, the starting point is the preliminary feature word set $T_1$ screened by the IG algorithm, which contains $n$ features. After IG filtering, there is still redundancy among the feature words in this subset, so a second round of feature extraction is necessary. The task at this stage is to apply the MRMR criterion to the $n$ feature words and select a more optimized feature subset $S$ from $T_1$. This process is based on maximum relevance $D$ and minimum redundancy $R$, as calculated in equation (2).
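For reference, the MRMR criteria are commonly written as below. This is the standard formulation from the feature selection literature; it is assumed, not confirmed by the text, that equations (2) and (3) take this form. The symbols $f_i$ (a feature word in $S$) and $c$ (the class variable) are notation introduced for this sketch.

```latex
\begin{align}
\max D(S, c), \qquad & D = \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c) \\
\min R(S), \qquad & R = \frac{1}{|S|^{2}} \sum_{f_i, f_j \in S} I(f_i; f_j) \\
\max \Phi(D, R), \qquad & \Phi = D - R
\end{align}
```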
In equation (2), $\max D$ and $\min R$ represent the maximum relevance and minimum redundancy, $|S|$ represents the number of selected feature words, and $I(\cdot\,;\cdot)$ is the MI between its two arguments. In equation (3), $D$ is relevance and $R$ is redundancy. When processing text data, the large number of feature words often makes computing the MI between them time-consuming. The MRMR strategy therefore takes a step-by-step iterative approach to identify the ideal combination of features $S$. If $k-1$ features have already been selected to form a subset $S_{k-1}$, the next feature is chosen from the features in $T_1$ that have not yet been selected, following the rule in equation (4).
To optimize the selection of feature subsets, an improved MRMR TFS is further proposed, which mainly increases the weight of the relationship between features and categories. By introducing the class difference degree $a$, the improved algorithm can more accurately evaluate the distribution and influence of features in different categories. It combines inter-class dispersion $AC$ and intra-class coupling degree $DC$ to measure the distribution of feature words across different categories of documents and the uniformity within the same category of documents, respectively. In this way, the representation of a feature can be enhanced by increasing its prominence in a particular category and ensuring an even distribution across the documents within that class. Equation (5) shows the relevant calculations.
In equation (6), $a$ represents the class difference degree, which is computed on a logarithmic scale and scaled by a constant. This calculation is applied to the MRMR algorithm, as shown in equation (7).
A significant difference indicates that the feature words are primarily present in one category, making them highly identifiable to that category.Conversely, a small difference suggests that the feature words are common across multiple categories and are not enough to distinguish between categories with certainty.Logarithmic processing helps maintain data characteristics and the relationship between features and categories, while reducing data size and ensuring stability.In summary, the MRMR algorithm steps are shown in Figure 3.
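The step-by-step selection described above can be sketched as a greedy loop. The example below implements plain MRMR (relevance minus mean redundancy, both measured by MI on discrete features) and deliberately omits the class difference weighting of the improved algorithm, whose exact formula is not reproduced here. The feature names, toy data, and function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """MI (in bits) between two equally long sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy MRMR: features maps a name to one discrete value per document."""
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant
    while len(selected) < min(k, len(features)):
        best, best_score = None, -math.inf
        for f in features:
            if f in selected:
                continue
            # Mean MI between the candidate and the already selected features.
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            score = relevance[f] - redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected


# Toy data (hypothetical): 1 means the word occurs in the document.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "free": [1, 1, 1, 0, 0, 0, 0, 0],
    "free_dup": [1, 1, 1, 0, 0, 0, 0, 0],  # exact copy of "free"
    "win": [1, 0, 1, 1, 0, 0, 1, 0],
}
chosen = mrmr_select(features, labels, k=2)
```

In this toy case a pure relevance ranking would pick "free" and its duplicate, whereas the redundancy penalty makes the second pick "win", which is less relevant on its own but adds new information.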

Application of Fourier mixed kernel function in SVM text classification algorithm
To enhance SVM's performance in text classification, an SVM text classification algorithm with a Fourier hybrid kernel function is further introduced. In the text classification task, features are usually feature words or n-grams, forming a large number of text vectors.
The SVM algorithm maps the input vectors to a higher-dimensional space, identifies a hyperplane that separates the data, and maximizes the margin between the hyperplane and the data points to enhance classification accuracy. Linear SVMs cover the linearly separable and linearly inseparable cases; linearly separable means that the data can be divided directly by a hyperplane. For binary classification data on a 2D plane, if a line can divide the two classes, that line is the hyperplane. Finding the maximum-margin hyperplane leads to equation (8), which is a convex quadratic programming problem with constraints.
To simplify the calculation, the Lagrangian multiplier method is applied to transform the problem into its dual. Setting the partial derivatives of $L$ with respect to $w$ and $b$ to zero yields the expressions for $w$ and $b$. Substituting these into $L(w, b, a)$ gives equation (9), from which the classification model is constructed.
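The derivation outlined above follows the standard hard-margin SVM, which can be written as below. This is the textbook form, given under the assumption that equations (8) and (9) correspond to the primal and dual problems; the training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ and the sample count $N$ are notation introduced for this sketch.

```latex
\begin{align}
& \min_{w, b} \ \tfrac{1}{2} \lVert w \rVert^{2}
  \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N \\
& L(w, b, a) = \tfrac{1}{2} \lVert w \rVert^{2}
  - \sum_{i=1}^{N} a_i \left[ y_i (w \cdot x_i + b) - 1 \right] \\
& \frac{\partial L}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{N} a_i y_i x_i,
  \qquad
  \frac{\partial L}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{N} a_i y_i = 0 \\
& \max_{a} \ \sum_{i=1}^{N} a_i
  - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j y_i y_j (x_i \cdot x_j)
  \quad \text{s.t.} \quad a_i \ge 0, \ \sum_{i=1}^{N} a_i y_i = 0
\end{align}
```

The dual depends on the data only through inner products $x_i \cdot x_j$, which is what allows a kernel function to be substituted in the non-linear case.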
In reality, most data are non-linear and cannot be classified directly by linear methods. SVM handles this by mapping the data to a high-dimensional space. The kernel function is used for the inner product operations, which avoids explicit computation in the high-dimensional space and the curse of dimensionality. The kernel function must meet the Mercer condition. SVMs with kernel functions can also be solved using the Lagrangian multiplier method, as shown in equation (10).
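A small numerical check can make the role of the kernel concrete. The sketch below uses a homogeneous degree-2 polynomial kernel on 2-D inputs, chosen only because its explicit feature map is short; it is an illustrative assumption and not one of the kernels evaluated in this study.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel on 2-D inputs: (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit 3-D feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, y)    # computed in the original 2-D space
k_explicit = dot(phi(x), phi(y))   # computed in the mapped 3-D space
```

Both values agree, so the kernel returns the inner product in the mapped space without ever constructing the mapped vectors.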
In equation (10), $K(\cdot, \cdot)$ denotes the kernel function. Next, the Fourier kernel function is introduced. In practical use, alongside the universal Gaussian and polynomial kernels, this function performs well in specific fields and learns effectively. It has two main forms, and the one-dimensional Fourier kernel function corresponding to the two types is detailed in equation (12).
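One form of the one-dimensional Fourier kernel that is widely used in the SVM literature, and that has a single parameter $q$, is shown below. It is given here as an assumption about the kind of kernel meant; the text does not confirm that equation (12) has exactly this form.

```latex
K(x, y) = \frac{1 - q^{2}}{2\left(1 - 2q\cos(x - y) + q^{2}\right)}, \qquad 0 < q < 1
```

With this form the kernel takes its largest value when $x = y$ and decreases as $\cos(x - y)$ decreases, at a rate governed by $q$.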

$$K(x, y) = \prod_{k=1}^{n} K([x]_k, [y]_k)$$
As above, the corresponding one-dimensional and $n$-dimensional Fourier kernel functions are shown in Figure 6. As a local kernel, the Fourier kernel function is characterized by an amplitude that is adjusted by the single parameter $q$, which provides an effective learning mechanism for text classification. The Fourier kernel attenuates gradually near the test point, which helps with the sparse distribution of data in high-dimensional spaces. However, choosing the right value of $q$ is critical, as an inappropriate value can make the attenuation near the test point too rapid. To optimize performance, the principle of linearly weighted combination of kernel functions is adopted. This method combines different kernel functions and aims to improve the accuracy and efficiency of text classification. The specific combination and parameter adjustment are shown in equation (14).
$$K_{mix} = a K_1 + (1 - a) K_2, \quad 0 \le a \le 1 \qquad (14)$$
In equation (14), $K_{mix}$ represents the hybrid kernel function, which combines the respective characteristics of the two single kernels $K_1$ and $K_2$ that satisfy the Mercer condition, and $a$ denotes the influence of these two single kernels. To construct a hybrid kernel with better performance, it is proposed to combine the polynomial kernel (as the global kernel) and the Fourier kernel (as the local kernel) to integrate the advantages of the two. At the same time, the combination of the polynomial kernel and the widely used Gaussian kernel is also considered, so that the classification effects of the two hybrid kernels can be compared, as shown in equation (15).
In equation (15), $a$ $(0 \le a \le 1)$ is the weight coefficient, which balances the combined effect of the two kernel functions. The Fourier kernel is prioritized for its easily adjusted parameter $q$ and its gradual attenuation away from the test point. Based on the principle of combined kernels, the proposed Fourier hybrid kernel function is a linear weighting of the Fourier kernel and the polynomial kernel, which conforms to Mercer's theorem and is therefore a valid kernel function for SVMs. Overall, the process of improving the SVM algorithm is shown in Figure 7.
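The linear weighting of a global and a local kernel can be sketched as follows. Several choices here are assumptions made for illustration: the Fourier kernel uses a commonly cited single-parameter form multiplied across dimensions, the weight `a` is applied to the polynomial kernel and `1 - a` to the Fourier kernel, and the polynomial offset `coef0` and the sample points are arbitrary.

```python
import math


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Global kernel: (x . y + coef0)^degree."""
    return (sum(u * v for u, v in zip(x, y)) + coef0) ** degree


def fourier_kernel(x, y, q=0.5):
    """Local kernel: product over dimensions of a one-parameter Fourier kernel."""
    k = 1.0
    for u, v in zip(x, y):
        k *= (1 - q * q) / (2 * (1 - 2 * q * math.cos(u - v) + q * q))
    return k


def hybrid_kernel(x, y, a=0.25, q=0.5, degree=3):
    """Linear weighting: a * polynomial + (1 - a) * Fourier, with 0 <= a <= 1."""
    return a * polynomial_kernel(x, y, degree) + (1 - a) * fourier_kernel(x, y, q)


def gram_matrix(samples, kernel):
    """Pairwise kernel values for a list of sample vectors."""
    return [[kernel(u, v) for v in samples] for u in samples]


samples = [(0.0, 0.1), (0.2, 0.0), (0.5, 0.4)]  # hypothetical 2-D inputs
gram = gram_matrix(samples, hybrid_kernel)
```

Because a non-negative weighted sum of two Mercer kernels is again a Mercer kernel, the resulting Gram matrix could in principle be handed to an SVM solver that accepts precomputed kernels.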

Figure 7: Process of improving the SVM algorithm. Flowchart steps: linear weighting of the Fourier kernel function and polynomial kernel function; building a text classifier based on SVM; model training; optimizing some parameters using the grid search method; evaluation; end.

Figure 7 shows the preprocessed data being input into the SVM algorithm, followed by the selection of the kernel function. The selected Fourier and polynomial kernel functions are linearly weighted to construct a text classification model. The model is trained on the training partition of the dataset, and parameters are selected using the grid search method. Finally, the model is evaluated on the test set.
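The grid search step can be sketched as an exhaustive loop over candidate parameter pairs. In the example below, `toy_score` is a hypothetical stand-in for a cross-validated F1 score, and the grids are one possible reading of the settings reported in the experiments (penalty parameter from 1 to 100 in steps of 10, kernel weight in steps of 0.1); the upper bound of 0.9 for the weight is an assumption.

```python
from itertools import product


def grid_search(score_fn, c_values, a_values):
    """Return the (C, a) pair with the highest score, plus that score."""
    best_params, best_score = None, float("-inf")
    for c, a in product(c_values, a_values):
        score = score_fn(c, a)
        if score > best_score:
            best_params, best_score = (c, a), score
    return best_params, best_score


def toy_score(c, a):
    """Hypothetical score surface that peaks at C = 41, a = 0.3."""
    return 1.0 - abs(c - 41) / 100 - abs(a - 0.3)


c_values = list(range(1, 101, 10))                    # 1, 11, ..., 91
a_values = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
best_params, best_score = grid_search(toy_score, c_values, a_values)
```

In a real run, the scoring function would train the SVM with the given parameters and return the validation metric, so the cost grows with the product of the grid sizes.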

Text classification results analysis based on two-stage feature selection and improved machine learning
In this study, three datasets and their parameter configurations are first identified. Subsequently, feature selection and classification results are analyzed for these datasets. Finally, a variety of kernel functions are analyzed in depth, and the SVM classification performance of these kernel functions is evaluated in detail with the proposed algorithm.

Results analysis of IG-MRMR two-stage feature selection algorithm under different datasets
Experiments are conducted using the LING-SPAM, IMDB, and Cornell datasets. The text data are pre-processed by removing stop words, punctuation, and special characters, which filters out noisy feature items, reduces the feature dimension, alleviates the burden on the classifier, and improves text classification accuracy. 70% of the data form the training set and 30% the test set, the classifier is an SVM model using a Gaussian kernel, and the experimental environment is Python. To evaluate the effect of the IG-MRMR algorithm in extracting feature subsets, accuracy and the F1 value are used as evaluation indexes. Higher accuracy indicates better feature selection, and a higher F1 value indicates a better balance of precision and recall and thus more effective feature selection.
Five methods are applied to the IMDB dataset: Chi-Square (CHI), MI, IG, IG-MRMR, and the IG-MRMR TFS. Feature subsets from 10 to 100 dimensions are selected at intervals of 10. Among the first 20-dimensional feature subsets selected by each method, the numbers of extracted words are 15, 14, 16, 16, and 18, respectively, and the priority order within each feature subset also differs. The accuracy results are presented in Figure 8. The numbers of features required to reach an accuracy of 0.82 are 60, 63, 59, 46, and 40, respectively. This shows that the IG-MRMR two-stage algorithm has the best prediction effect while using fewer feature words, and has the highest classification accuracy for the same number of features. To evaluate the influence of a larger feature dimension on the different feature selection algorithms, the range of feature subset dimensions is increased to 100 to 1000, at intervals of 100. A comparison of the five methods is shown in Figure 9. The F1 values of all algorithms begin to decrease when the number of features exceeds 390, indicating that the key features have been extracted and that additional features reduce the classification effect. In Figure 9(b), the IG-MRMR TFS algorithm shows an advantage, with an average F1 value about 1% to 2% higher than that of the other algorithms, which corresponds to about 18 more correctly classified articles and shows its efficient and accurate feature selection ability.
The same five algorithms are applied to the Cornell dataset, with feature dimensions again set between 10 and 100. The analysis focuses on the first 20-dimensional feature subsets extracted by each algorithm, in which the number of extracted evaluation words ranges from 15 to 17, as shown in Figure 10(a). To further explore the effect of a larger feature dimension on classification, the experimental range is extended to 100 to 1000 dimensions at intervals of 100, as shown in Figure 10(b). Figure 10(a) shows that, at an accuracy of 0.76, the numbers of features required by the five algorithms are about 57, 60, 59, 55, and 40, respectively. The IG-MRMR TFS requires the fewest features, and its accuracy is higher than that of the other algorithms for the same number of features. Figure 10(b) shows that the classification effect is best when the number of features is close to 285. As the number of features increases, the classification effectiveness of all algorithms gradually decreases, which suggests that the additional features contain more words with weak representation ability. The IG-MRMR TFS algorithm shows a significant decrease only after the number of features exceeds 700; its F1 value is on average 2% higher than that of the other methods, and the number of correctly classified texts increases by about 18.
Next, experiments with the five algorithms are carried out on the LING-SPAM dataset, whose feature words mainly focus on advertising-related words. Feature words of 10 to 100 dimensions are selected for comparison of classification effects, and the detailed results are shown in Figure 11(a). For a more comprehensive understanding of the classification performance of the feature subsets, the feature dimension is further extended to 100 to 1000, and the classification results of the five feature selection algorithms are compared in Figure 11(b). The IG-MRMR TFS algorithm requires significantly fewer features than the other methods while maintaining accuracy. At the same time, for the same number of features, the accuracy of IG-MRMR TFS is generally higher than that of the other feature selection algorithms. Figure 11(b) shows that most of the algorithms have reached 0.96 at 100 features, which means that the words with strong representational ability in the dataset are mainly concentrated in the first 100 dimensions. The F1 value of IG-MRMR TFS peaks when the feature dimension is about 680. Its average accuracy is 1% higher than that of IG-MRMR, corresponding to about 6 more correctly classified articles, and 2% higher than that of the single-stage IG and CHI algorithms, corresponding to about 14 more correctly classified articles, which shows its advantage in accurate feature selection.
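The two evaluation indexes used in this section can be computed from the confusion counts of a binary classifier. The sketch below uses the standard definitions of accuracy, precision, recall, and F1; the label vectors and the function name `binary_metrics` are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Hypothetical ground truth and predictions for eight documents.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
```

On this toy input there are three true positives, three true negatives, one false positive, and one false negative, so all four measures equal 0.75.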

Text classification results analysis based on two-stage feature selection algorithm
To ensure data standardization, preprocessing removes stop words, punctuation, and special characters, and the words of the processed corpus are vectorized using the term frequency-inverse document frequency method; the weight of each word in the text is calculated and normalized. 60% of the dataset is the training set and 40% is the test set. Parameter selection uses a grid search to determine the penalty parameter $C$ of the SVM (ranging from 1 to 100, adjusted every 10), and the exponent $d$ of the polynomial kernel is set to 3. The kernel weight $a$ of the hybrid kernel function starts at 0.1 with a step size of 0.1. The experimental platform uses Python 3.6. 5-fold cross-validation is adopted, and F1 is the evaluation index.
The IMDB dataset is selected to compare the performance of the proposed algorithm with other kernel functions. The dataset comprises 2000 reviews of films and television programs, with an equal number of positive and negative reviews. The document frequency algorithm is used as the feature selection algorithm to process the dataset. Considering the excellent performance of the Fourier kernel function, the weight coefficient $a$ in the hybrid kernel function is set to 0.25, and the results are shown in Figure 12. As can be observed in Figure 13, the classification effect first increases and then decreases. When the feature dimension is about 400, the IG-MRMR two-stage algorithm shows excellent classification performance. As the number of features increases further, the effect of TFS decreases significantly, which shows that the weaker feature words added to the selected feature subset interfere with classification. Figures 13(a) and 13(b) show that, with any kernel function, the SVM using IG-MRMR TFS generally achieves a higher F1 value than the SVM using the IG method, confirming the effectiveness of IG-MRMR. The combination of the Fourier hybrid kernel function and the IG-MRMR two-stage algorithm is on average 1% to 3% higher in F1 value than other combinations, and the number of correctly classified texts increases by 20 to 45.
The experimental corpus selected for further analysis is the Cornell film and television review corpus, and the comparative method is the SVM algorithm, as shown in Table 2. Table 2 shows that the research method has higher accuracy and larger F1 values (P<0.05) compared to the benchmark method. Specifically, the accuracy of the research method is 96.57%, which is 23.11% higher than that of the SVM algorithm. These results demonstrate the high performance of the research method and the benefit of its optimizations.
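The vectorization step described at the start of this section can be sketched as follows. The example uses one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency, followed by L2 normalization); the exact weighting used in the experiments is not specified, so this variant, the toy documents, and the function name `tfidf_vectors` are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized documents."""
    n = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        vec = [(tf[t] / length) * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm if norm else 0.0 for v in vec])
    return vocab, vectors


# Hypothetical pre-processed reviews (stop words already removed).
docs = [["great", "film", "great"], ["boring", "film"], ["great", "cast"]]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` has unit length, so documents of different lengths become comparable before they are passed to the classifier.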

Discussion
In text feature classification, achieving higher accuracy in feature selection involves considering data redundancy, correlation between features, and semantic relationships in context. Fan Y et al. studied selection algorithms based on label correlation and feature redundancy to improve the effectiveness of text feature selection. The results indicated that the proposed method had relatively high selection accuracy [20]. That work addresses data redundancy and correlation but does not study contextual semantic relationships, so its feature selection can be optimized further; the present study addresses this gap. During feature selection, Zhou H et al. weighted the MI redundancy term through correlation coefficients and adopted the minimization principle. Experiments showed that the proposed method had good feature classification performance [21]. This reference is comparable with the proposed method, but it likewise does not examine the contextual semantic relationships involved in feature selection. This study explores this aspect, resulting in a more effective feature selection process in which both accuracy and F1 value are high.

Conclusion
To reduce feature redundancy in text classification and improve SVM performance, a TFS algorithm based on IG and improved MRMR is proposed. Additionally, to further improve the effect of SVM in text classification, an SVM text classification algorithm based on a Fourier mixed kernel function is introduced. The study found that the IG-MRMR TFS algorithm had the best prediction accuracy with fewer feature words on the LING-SPAM, IMDB, and Cornell datasets, and achieved the highest classification accuracy for the same number of features. On the IMDB dataset, the algorithm required only 40 features to achieve an accuracy of 0.82, fewer than the other algorithms. On the LING-SPAM dataset, it outperformed the single-stage algorithms IG and CHI by 2%, correctly classifying about 14 more articles. Furthermore, when the number of features exceeded 390, the F1 value of all algorithms began to decrease, indicating that the key features had been extracted and additional features were reducing the classification effect. In this case, the IG-MRMR TFS algorithm maintained its advantage, with an average F1 value 1% to 2% higher than other algorithms and about 18 more correctly classified texts. Compared with the benchmark method, the research method exhibits higher accuracy: 96.57%, which is 23.11% higher than that of the SVM algorithm. However, the study has some shortcomings. The second-stage feature selection of the current algorithm may need improvement, and the IG algorithm can be further optimized in the future. Additionally, while the Fourier kernel function shows superiority, future studies can consider more efficient local kernel functions to enhance classification performance. Moreover, when dealing with complex real-world problems such as uneven data distribution, the research method may have limited generalization ability. Future work can focus on optimizing the algorithm through feature learning and multi-level feature learning to improve its performance.

Funding
The research is supported by: Graduate Education Reform Project of Henan Province, Achievements of the Henan Province Higher Education Teaching Reform Research and Practice Project (Graduate Education), (No. 2023SJGLX365Y).
[9] Wang et al.: An SVM kernel function selection mechanism was proposed for bearing fault diagnosis.

Figure 1: Text classification process based on two-stage feature selection

Figure 2: Block diagram of feature selection algorithm

Given the subset $S_{k-1}$, the next task is to extract the next feature from the pool of features $\{T_1 - S_{k-1}\}$ that have not yet been selected. The rules followed in the selection process are described in equation (4).

Figure 8: 10 to 100 dimensional results for different algorithms on IMDB datasets

Figure 9: Comparison of F1 values of different algorithms on IMDB datasets

Figure 10: Different algorithms in Cornell data set

Figure 11: Different algorithms in LING-SPAM data set

Figure 12: Comparison of multiple kernel functions

Table 1: Research status and shortcomings of related works

In equation (1), $m$ is the number of different categories in the text data. Figure 2 illustrates this feature selection process.

Here, $n$ and $m$ denote the total numbers, and $f(t)$ and its average are the number of documents and the average number of documents for the feature words. A higher value of the dispersion $AC$ is the favorable case for a feature word, and a lower value of the intra-class coupling $DC$ indicates that the feature word represents the class $C$ more effectively. Next, the MRMR algorithm considers the MI of feature words in all categories, fine-tunes the weight of the MI by introducing the class difference degree, and selects the two largest class difference degree values for processing, as detailed in equation (6).

Table 2: Comparison of classification effects