A Hybrid MPI/OpenMP Parallelization of K-Means Algorithms Accelerated Using the Triangle Inequality
Wojciech Kwedlo , Pawel J. Czochanski
AbstractThe standard formulation of the K -means clustering (Lloyd's method) performs many unnecessary distance calculations. In this paper, we focus on four approaches that use the triangle inequality to avoid unnecessary distance calculations. These approaches are Drake's, Elkan's, Annulus, and Yinyang algorithms. We propose a hybrid MPI/OpenMP parallelization of these algorithms in which the dataset and the corresponding data structures storing bounds on distances are evenly divided among MPI processes. Then, in the assignment step of a K -means iteration, each MPI process computes the assignment of its portion of data using OpenMP threads. In the update step of the iteration, the cluster centroids are computed using a hierarchical all-reduce operation. In the computational experiments, we compared the strong scalability of these four algorithms with the scalability of Lloyd's algorithm, parallelized using the same approach. The results indicate that all four algorithms maintain an advantage in computing time over Lloyd's algorithm. A comparison with two software packages, whose sources are publicly available, in the same computing environment, shows that our implementations are more efficient.
|Journal series||IEEE Access, ISSN 2169-3536, (N/A 100 pkt)|
|Publication size in sheets||0.85|
|Keywords in English||Clustering , K-means , triangle inequality , MPI , OpenMP , hybrid parallelization|
|ASJC Classification||; ;|
|Internal identifier||ROC 19-20|
|Score||= 100.0, 24-03-2020, ArticleFromJournal|
|Publication indicators||: 2018 = 1.718; : 2018 = 4.098 (2) - 2018=4.54 (5)|
* presented citation count is obtained through Internet information analysis and it is close to the number calculated by the Publish or Perish system.