K-means: critical analysis on the techniques used to determine the optimal value of k in high-dimensional datasets

Gikera, Rufus Kinyua

dc.contributor.author	Gikera, Rufus Kinyua
dc.date.accessioned	2025-02-07T06:46:30Z
dc.date.available	2025-02-07T06:46:30Z
dc.date.issued	2024-12
dc.description	Thesis submitted in partial fulfillment of the requirements for the award of the degree of Doctor of Philosophy (Computer Science) in the School of Pure and Applied Sciences of Kenyatta University, December 2024. Supervisors: Prof. Elizaphan Maina; Prof. Jonathan Mwaura; Eng. Dr. Shadrack Mambo
dc.description.abstract	Clustering is one of the main goals of exploratory data analysis. It has a long and rich history in a variety of fields. The methods used to perform clustering have evolved over time. Among these methods, k-means remains the most popular clustering algorithm because of its ability to adapt to new examples and to scale up to large datasets. It is also easy to understand and implement, and is computationally faster and more efficient than other algorithms. However, with k-means, selecting the correct k-hyperparameter, i.e. the number of clusters in a dataset, has been a long-standing challenge and has a significant effect on the clustering results. Although a number of k-hyperparameter tuning techniques for high-dimensional space clustering have been proposed to help in the selection of the correct k-value, these techniques still face performance limitations across a variety of high-dimensional datasets and dimensionality reduction methods. This makes the k-hyperparameter tuning problem intractable and an open research challenge. In light of this, this research first investigates the existing k-hyperparameter tuning techniques in high-dimensional space clustering through a literature review. Second, the dimensionality reduction methods used with high-dimensional spaces are investigated via the same process. The results of the first two steps provide key findings and a conceptual framework that acts as the road map and foundation for the subsequent empirical investigations in the third step. These investigations are guided by a comprehensive methodology based on mixed research methods for validation triangulation. Experiments are conducted on techniques that demonstrate methodological rigour and novelty, across a variety of datasets and dimensionality reduction methods. An empirical research design guides the process of conducting these experiments.
The insights from the analysis of the experimental data show the significance of the feature extraction process as a critical leverage point in effective k-hyperparameter tuning in high dimensions. This guides the implementation of a novel generalizable technique, through a multi-methodological system development methodology. This technique is then validated against the existing ones, using similar metrics, in order to evaluate its effectiveness. Statistical significance tests, using ANOVA and the Kruskal-Wallis H statistic, demonstrate that the new technique is superior. This is also evidenced by the improved internal index scores, the cluster visualizations, and the presence of shorter whiskers and higher median (Q2) values in the box-and-whisker plots, across a variety of datasets. The new technique handles a variety of datasets, using an improved self-adapting autoencoder based on an unsupervised transfer learning strategy and a thoughtful configuration of both the architectural and training-related hyperparameter settings. This makes it effective in handling the data sparsity and curse of dimensionality limitations inherent in high-dimensional spaces. Future research aims at evaluating its efficacy in wider application domains, including a further comparative analysis of hybrid sets of the best-performing dimensionality reduction methods.
dc.description.sponsorship	Kenyatta University
dc.identifier.uri	https://ir-library.ku.ac.ke/handle/123456789/29533
dc.language.iso	en
dc.publisher	Kenyatta University
dc.title	K-means: critical analysis on the techniques used to determine the optimal value of k in high-dimensional datasets
dc.type	Thesis

Files

Original bundle

Name: Fulltext thesis.pdf
Size: 7.59 MB
Format: Adobe Portable Document Format

Collections

PHD-Department of Computing & Information Technology