Parallelization of K-Means and Spatial Join Algorithms on Heterogeneous Platforms Using Apache Spark and GPU Integration for Enhanced AI Information Management
Abstract
In the era of artificial intelligence informatization, real-time mining of massive spatial data has become a core bottleneck for intelligent decision-making, and existing methods suffer from poor computational performance and high complexity. This study therefore proposes a solution based on heterogeneous computing platforms. The approach uses Apache Spark to build a hybrid system that integrates Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The K-means algorithm is parallelized through Spark's resilient distributed datasets (RDDs) and broadcast variables, with optimizations to both the initial cluster centre selection and the new centre determination steps. Upper and lower bound constraints, together with group filtering techniques, are introduced to reduce computational complexity. For the spatial join algorithm, the study achieves efficient spatial data mining and dynamic load balancing through spatial index partitioning and a Compute Unified Device Architecture (CUDA) dynamic parallelization strategy. Experiments showed that the parallelized K-means algorithm achieved significantly improved speedup across different data dimensions. On 90-dimensional data in particular, the speedup ratio reached 45.32 times, and the execution efficiency was 0.31 times higher than that of Spark MLlib. The parallel spatial join algorithm achieved its best performance with 1,500 partitions, completing computations in just 37.5 seconds. Its maximum data mining accuracy reached 94.2%, exceeding DBSCAN and GeoSpark by 3.7% and 4.4%, respectively. The proposed method effectively addresses the real-time processing problem of spatial data mining in artificial intelligence information management, providing scalable technical support for scenarios such as smart cities and autonomous driving.
DOI: https://doi.org/10.31449/inf.v49i30.10302
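The K-means parallelization pattern described in the abstract (shared cluster centres, per-partition assignment, then a merge to compute new centres) can be illustrated with a minimal sketch. This is plain Python that only mimics the map and reduce structure; it does not use Spark, GPUs, or the bound-based filtering, and it is not the authors' implementation. The function names `assign_partition`, `merge`, and `kmeans_step`, as well as the sample data, are hypothetical.

```python
from math import dist

def assign_partition(points, centers):
    """Map side: for one partition, return {cluster: (coordinate sums, count)}."""
    partial = {}
    for p in points:
        # Nearest centre by Euclidean distance; centers plays the role
        # of a read-only broadcast value shared by every partition.
        k = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        sums, n = partial.get(k, ([0.0] * len(p), 0))
        partial[k] = ([s + x for s, x in zip(sums, p)], n + 1)
    return partial

def merge(a, b):
    """Reduce side: combine two partial results cluster by cluster."""
    out = dict(a)
    for k, (sums, n) in b.items():
        if k in out:
            s0, n0 = out[k]
            out[k] = ([x + y for x, y in zip(s0, sums)], n0 + n)
        else:
            out[k] = (sums, n)
    return out

def kmeans_step(partitions, centers):
    """One iteration: map each partition, merge the partials, recompute centres."""
    total = {}
    for part in partitions:
        total = merge(total, assign_partition(part, centers))
    new_centers = []
    for k, c in enumerate(centers):
        if k in total:
            sums, n = total[k]
            new_centers.append(tuple(s / n for s in sums))
        else:
            new_centers.append(c)  # an empty cluster keeps its old centre
    return new_centers

# Hypothetical data: two partitions of 2-D points and two initial centres.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [(1.0, 1.0), (9.0, 9.0)]
new_centers = kmeans_step(partitions, centers)
```

Because each partition returns only per-cluster sums and counts, the amount of data merged per iteration depends on the number of clusters and dimensions rather than on the number of points, which is the usual motivation for this structure in a distributed setting.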
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







