High Dimensional Clustering
COSA (Clustering Objects on Subsets of Attributes)
Datasets used in cluster analysis have become larger and more complex, employing more features (i.e., high dimensionality). Many of these features may be irrelevant for identifying latent clusters. Researchers sometimes assume that irrelevant features in a dataset will simply be ignored by the analysis; for most clustering algorithms, however, this is not the case. In fact, irrelevant features can mask existing clusters in noisy data.
Clustering algorithms therefore need to adapt to provide high quality cluster solutions (i.e., objects having high similarity within, and low similarity between clusters).
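The within/between similarity criterion above can be made concrete with a small NumPy sketch: given a toy partition, compare the average pairwise distance inside clusters to the average distance across clusters (the data and labels here are illustrative, not from any particular study).

```python
import numpy as np

# Toy data: two well-separated 2-D clusters (labels assumed known here).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(5.0, 0.5, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

# Full matrix of pairwise Euclidean distances.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

same = labels[:, None] == labels[None, :]
np.fill_diagonal(same, False)                     # exclude self-distances
within = D[same].mean()                           # average distance inside clusters
between = D[labels[:, None] != labels[None, :]].mean()

print(within < between)                           # a good partition: small within, large between
```

A high-quality solution is exactly one where `within` is small relative to `between`; most cluster-validity indices (e.g., the silhouette score) are refinements of this comparison.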
A variety of approaches have been used to address clustering high dimensional data:
- Principal Component Analysis (PCA): PCA creates linear combinations of similar features prior to cluster analysis in order to reduce dimensionality and feature overlap. Some issues with this approach include:
- PCA does not actually remove any of the original attributes from consideration; information from irrelevant dimensions is preserved.
- Interpretation of the clusters can be difficult leading to poor or limited insight.
- Linear combinations (weighted averaging) reduce variability in the resultant features, making it more difficult to distinguish between clusters.
- PCA is best suited to datasets where most of the dimensions are relevant to the clustering task, but many are highly correlated or redundant.
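A minimal NumPy sketch of PCA as clustering preprocessing, under the redundant-features scenario just described: the data are centred and projected onto the top singular vectors. The synthetic dataset (two informative dimensions plus noisy copies of them) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# 50 objects: 2 informative dimensions plus 4 noisy copies of each pair,
# i.e., 10 columns that are highly correlated / redundant.
base = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(6, 1, (25, 2))])
X = np.hstack([base] + [base + rng.normal(0, 0.1, base.shape) for _ in range(4)])

# PCA via SVD: centre the data, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T      # each new feature is a linear combination of ALL originals

print(X.shape, X_reduced.shape)  # (50, 10) (50, 2)
```

Note that each reduced feature mixes every original attribute; this is why PCA works well when the dimensions are redundant, as here, but cannot discard a truly irrelevant dimension.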
Another approach to dealing with high dimensional data in cluster analysis is known as “Feature Selection”.
- Feature Selection: Feature selection involves finding a subset of dimensions on which to perform clustering by removing irrelevant and redundant dimensions.
- Benefit: interpretation is enhanced when looking at a subset of features, since there are fewer variables.
- Drawback: Feature selection seeks groups of objects that all cluster on the same subset of attributes. Consider the following example: in a customer satisfaction study, two groups of respondents cluster on “likely to recommend” and “purchase volume”, while another two groups are similar with respect to “overall satisfaction” and “environmental concern”. If all clusters are constrained to group on the same features, identifying two of these groups will be problematic; simply eliminating any of the four variables may aid in identifying some clusters while complicating identification of others. Note also that as datasets include more and more features, the chance that different clusters reside in different subspaces of attributes grows.
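A minimal sketch of filter-style feature selection, using a simple variance threshold on synthetic data (both the dataset and the 0.5 cutoff are illustrative assumptions, not a general rule):

```python
import numpy as np

rng = np.random.default_rng(2)

# 4 features: the first two carry cluster structure, the last two are pure noise.
signal = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
noise = rng.normal(0, 0.5, (60, 2))
X = np.hstack([signal, noise])

# Filter-style selection: keep features whose variance exceeds a threshold.
# (The 0.5 cutoff is an illustrative choice for these simulated scales.)
variances = X.var(axis=0)
keep = variances > 0.5
X_selected = X[:, keep]

print(keep)          # the two signal columns survive; the noise columns are dropped
```

This works when all clusters live in the same retained subspace; in the customer satisfaction example above, any single global subset would still sacrifice some of the four groups.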
- COSA: A final approach, which addresses the drawback of feature selection, is COSA (Clustering Objects on Subsets of Attributes). COSA not only allows clusters to form on subsets of attributes; different clusters may in fact be based on different attributes. The COSA approach therefore provides a flexible and effective way to perform cluster analysis on high-dimensional data.
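The core idea can be sketched as per-cluster attribute weights that grow where a cluster's within-spread is small. This is a simplified illustration of the principle only, not the actual COSA algorithm of Friedman and Meulman; the inverse-dispersion weighting and the synthetic groups are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two groups that cluster on DIFFERENT attribute subsets:
# group A is tight on features 0-1, group B is tight on features 2-3.
A = np.hstack([rng.normal(0, 0.2, (20, 2)), rng.normal(0, 2.0, (20, 2))])
B = np.hstack([rng.normal(0, 2.0, (20, 2)), rng.normal(5, 0.2, (20, 2))])
groups = {"A": A, "B": B}

# Simplified COSA-style weighting: each cluster gets its own attribute weights,
# larger where that cluster's within-cluster dispersion is small.
# (Inverse standard deviation is an illustrative choice, not the exact COSA rule.)
weights = {}
for name, G in groups.items():
    w = 1.0 / G.std(axis=0)
    weights[name] = w / w.sum()          # normalise weights within each cluster
    print(name, np.round(weights[name], 2))
```

Because the weights are computed per cluster, group A ends up emphasising features 0-1 while group B emphasises features 2-3, which is exactly the behaviour a single global feature subset cannot provide.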