Fuzzy Clustering and Kernel PCA-Based High-Dimensional Imbalanced Data Integration with Octree Encoding

Qin Wang

Abstract


Due to the high-dimensional and unbalanced characteristics of national economic accounting data, there is a large amount of redundant information in the data, which will lead to problems such as boundary shift and integration overfitting shift when integrating the data, and will increase the difficulty of subsequent data integration. For this reason, a fuzzy clustering-based method for integrating high-dimensional unbalanced data of national accounts is proposed. Using the kernel principal component analysis method to reduce the dimensionality of high-dimensional imbalanced national economic accounting data, in order to reduce the complexity and sparsity of the data while preserving the main information of the original data as much as possible. Use fuzzy clustering algorithm for data clustering. Fuzzy clustering allows data points to belong to multiple clusters simultaneously, with each cluster having a membership measure that represents the strength of the relationship between data points and each cluster. Introducing deviation maximization for optimizing fuzzy clustering methods to ensure that the distance between each data point and its cluster center is as large as possible, while ensuring that the distance between data points within the same cluster is as small as possible. Based on text free grammar rules and conversion functions, convert national economic accounting data into hesitant fuzzy language data and obtain the optimal data attribute weight vector. Calculate the distance between different categories and the minimum distance, and determine the repulsion phenomenon between unknown and known classes through the objective function. Using Lagrange multipliers to solve the objective function and obtain the optimal clustering center. According to the optimal clustering center, complete the clustering of national economic accounting data and obtain different categories of national economic accounting data. According to the experimental results, the data integration imbalance of the proposed method ranges from 1.68% to 32.85%, and the total number of samples fluctuates between 139 and 5136. The three indicators of the integrated data are all greater than 0.88. Through actual coding cases, the coding ability of our method for highdimensional imbalanced data in national economic accounting has been verified.


Full Text:

PDF


DOI: https://doi.org/10.31449/inf.v49i2.8267

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.