Feature selection for unsupervised problems: the case of clustering
Last Updated on January 14, 2023 by Editorial Team
Author(s): Kevin Berlemont, PhD
Originally published on Towards AI.
With the massive growth of data over the last decade, selecting the right features has become a major challenge. A well-known remedy in data processing is dimensionality reduction, which removes redundant and irrelevant features that would otherwise degrade performance. These methods fall into two groups: feature extraction/construction and feature selection. In feature extraction, the dimensionality of the data is reduced by deriving new features from the original ones; Principal Component Analysis and Singular Value Decomposition are classic examples. Feature selection, on the other hand, tries to select a subset, ideally small, of relevant features. This approach is needed when a dataset has a large number of features and the goal is to reduce computational complexity and obtain generalizable models.
Feature selection approaches usually require class labels to determine whether a feature is relevant. But when class labels are unknown, as in clustering, how can a feature be judged relevant? Feature selection methods fall into four categories:
- Filter methods try to select an optimal feature subset according to the general characteristics of the data rather than the output of a learning algorithm. In general, filters score a subset of features using specific evaluation criteria.
- Wrapper methods need a learner to evaluate the goodness of candidate feature subsets. They are therefore computationally more expensive, but they improve the performance of the specific learning algorithm they wrap.
- Hybrid methods try to get the advantages of both approaches above by combining them in a two-stage process.
- Embedded methods build feature selection directly into the learning algorithm. However, they often do not outperform wrappers.
Next, I will describe specific feature selection methods for all of these different categories, highlighting when and how to use them.
Filter methods select features according to the characteristics of the features themselves: they directly evaluate the statistical properties of the features in the data. One proposed filter approach measures the dependency between features with a variance-based metric, the maximal information compression index (MICI). This approach groups features into clusters in a way similar to the k-nearest neighbor algorithm: in each iteration, the k nearest features are found for each feature based on the MICI. The feature that forms the most compact subset is then selected, and the procedure is repeated until all features are selected or discarded.
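The MICI itself has a compact form: it is the smaller eigenvalue of the 2x2 covariance matrix of a feature pair, so it drops to zero when one feature is a linear function of the other. A minimal NumPy sketch (the sample data and helper name are illustrative, not from the paper):

```python
import numpy as np

def mici(x, y):
    """Maximal information compression index: the smallest eigenvalue of the
    2x2 covariance matrix of two features (zero when one feature is a
    linear function of the other)."""
    cov = np.cov(x, y)                 # 2x2 covariance matrix of the pair
    eigvals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    return eigvals[0]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a                    # linearly dependent on a
c = rng.normal(size=200)     # independent feature

print(mici(a, b))  # ~0: b carries no information beyond a
print(mici(a, c))  # clearly positive: c is not redundant with a
```

A feature whose MICI with an already-selected feature is near zero is redundant and can be discarded.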
Another filtering method selects features using the Pearson correlation coefficient. First, all pairwise correlations between features are computed. Then, the feature with the highest average dependency on the other features is removed. This process is repeated until the desired number of features is reached.
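A minimal sketch of this correlation filter in NumPy (the function name and toy data are illustrative, not from a specific library):

```python
import numpy as np

def correlation_filter(X, n_keep):
    """Iteratively drop the feature with the highest average absolute
    Pearson correlation to the remaining features."""
    keep = list(range(X.shape[1]))
    corr = np.abs(np.corrcoef(X, rowvar=False))
    while len(keep) > n_keep:
        sub = corr[np.ix_(keep, keep)]
        # average dependency on the *other* features (exclude the self-correlation of 1)
        avg = (sub.sum(axis=1) - 1) / (len(keep) - 1)
        keep.pop(int(np.argmax(avg)))
    return keep

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([x,
               x + 0.01 * rng.normal(size=(100, 1)),  # near-duplicate of column 0
               rng.normal(size=(100, 2))])            # two independent columns
kept = correlation_filter(X, 3)
print(kept)  # one of the two near-duplicate columns should be dropped
```

Because the two duplicated columns are almost perfectly correlated, one of them has the highest average dependency and is removed first.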
As these two examples show, filter methods are general because they don't rely on a specific learning algorithm. However, their clustering performance is usually lower than that of wrapper methods, which are the focus of the next section.
Wrappers Approach for K-means
In this section, I will focus on the K-means algorithm for clustering, as wrapper approaches are specific to the algorithm picked. For more details about other models, such as evolutionary algorithms, I recommend the following paper.
K-means is one of the most popular clustering algorithms in data science, but one of its main deficiencies is that it treats all features as equally important. When many features are irrelevant, the quality of the clustering therefore degrades. In this context, it is useful to give certain features more importance by weighting them.
The convex K-means algorithm improves the standard algorithm by integrating an adaptive feature-weighting scheme into K-means. It iteratively determines the optimal weights of a feature set by minimizing the average within-cluster distance. One caveat is that the search relies on gradient descent and can therefore get stuck in a local optimum.
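To make the weighting idea concrete, here is a simplified sketch that alternates K-means on weighted data with a dispersion-based weight update. This is an illustration of the general scheme, not the exact convex K-means update from the paper, and it assumes scikit-learn for the K-means step:

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_kmeans(X, k, n_iter=10, beta=2.0, seed=0):
    """Illustrative feature-weighted K-means: alternate between clustering
    the weighted data and updating weights so that features with small
    within-cluster dispersion receive larger weights."""
    n, d = X.shape
    w = np.full(d, 1.0 / d)  # start with equal weights
    for _ in range(n_iter):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X * np.sqrt(w))
        labels = km.labels_
        # per-feature within-cluster dispersion
        D = np.zeros(d)
        for c in range(k):
            Xc = X[labels == c]
            D += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        inv = (1.0 / (D + 1e-12)) ** (1.0 / (beta - 1.0))
        w = inv / inv.sum()  # normalize so the weights sum to 1
    return labels, w

rng = np.random.default_rng(0)
# one informative feature separating two clusters; the second is pure noise
informative = np.r_[rng.normal(0, 0.3, 100), rng.normal(5, 0.3, 100)]
noise = rng.normal(0, 3.0, 200)
labels, w = weighted_kmeans(np.c_[informative, noise], k=2)
print(w)  # the weight on the informative feature should dominate
```

The informative feature has a small within-cluster spread, so its weight grows across iterations while the noisy feature is progressively downweighted.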
Another well-known feature-weighting approach for K-means is attribute-weighted clustering, in which each feature can have a different weight in each cluster. The goal is then to minimize the sum of the weighted within-cluster distances. This method and its variants have been very successful at clustering, but they depend heavily on the hyperparameter that keeps the weights at a reasonable level.
In embedded approaches, feature selection is performed as part of the learning process itself. For performance and interpretability, embedded approaches usually rely on a sparse learning algorithm: cluster labels are first found with a clustering algorithm, which turns the unsupervised feature selection problem into a supervised one.
One of the earliest sparse-learning feature selection methods is multi-cluster feature selection (MCFS). In the first step, the intrinsic structure of the data is explored using spectral analysis in order to measure the correlation between features. In the second step, the importance of each feature is quantified using an L1-regularized regression model. The last step selects the specified number of features with the highest coefficients from the previous stage. This approach has proven effective for feature selection in clustering but is computationally expensive.
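The three steps can be sketched with scikit-learn's spectral embedding and Lasso regression; the `mcfs` helper and its parameter values are illustrative simplifications of the original method:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.linear_model import Lasso

def mcfs(X, n_features, n_components=2, alpha=0.01):
    """Sketch of multi-cluster feature selection (MCFS):
    1) a spectral embedding captures the cluster structure,
    2) an L1-regularized regression relates each embedding dimension
       back to the original features,
    3) features are ranked by their largest absolute coefficient."""
    Y = SpectralEmbedding(n_components=n_components).fit_transform(X)
    scores = np.zeros(X.shape[1])
    for j in range(n_components):
        coef = Lasso(alpha=alpha).fit(X, Y[:, j]).coef_
        scores = np.maximum(scores, np.abs(coef))
    return np.argsort(scores)[::-1][:n_features]

rng = np.random.default_rng(0)
# two clusters separated along the first feature; the other three are noise
X = np.c_[np.r_[rng.normal(0, 0.3, 60), rng.normal(4, 0.3, 60)],
          rng.normal(size=(120, 3))]
selected = mcfs(X, n_features=1)
print(selected)  # the informative feature should rank first
```

The L1 penalty zeroes out the coefficients of the noise features, so only the feature that explains the embedding, i.e., the cluster structure, survives the ranking.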
The previous method follows the conventional sparse-learning approach: cluster labels are generated by a clustering algorithm before the problem is transformed into supervised feature selection. However, this approach tends to produce suboptimal feature subsets. To address this, embedded unsupervised feature selection embeds feature selection directly into the clustering algorithm, without the transformation: it applies K-means, minimizing the reconstruction error to obtain the cluster labels and select the features simultaneously. One must be careful about heterogeneity between clusters with this approach, however, as it otherwise tends to select non-discriminative features.
In recent years, hybrid approaches to feature selection have become very popular. One example is a two-stage process that first eliminates redundant features using an entropy-based measure and a fuzzy evaluation index, and then selects an optimal feature subset using the scatter trace criterion.
Another popular hybrid method, the BFK algorithm, combines a Bayesian network with K-means. It first carries out the wrapper step by running K-means over a range of cluster counts and keeping the clustering with the highest value of the Silhouette index. In the filter stage, the feature subset is selected with a Bayesian network that treats each cluster as a class and each feature as a node. One caveat: if the cluster structure is not well determined by the Silhouette index, the second stage of the method will suffer.
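The wrapper stage, choosing the number of clusters by Silhouette index, can be sketched with scikit-learn; the helper name and toy blobs are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 8), seed=0):
    """Wrapper step of the BFK idea: run K-means for each candidate k and
    keep the clustering with the highest silhouette index."""
    best = None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best  # (silhouette, k, labels)

rng = np.random.default_rng(0)
# three well-separated blobs in 2D
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 5, 10)])
score, k, labels = best_k_by_silhouette(X)
print(k)  # expect 3 for three well-separated blobs
```

The caveat above applies here directly: if no candidate k yields a clearly higher silhouette, the chosen clustering is unreliable and the downstream filter stage inherits that uncertainty.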
Feature selection is an important technique in data processing that helps reduce the complexity of data and improve the performance of learning algorithms. It can be categorized into four main approaches: filters, wrappers, hybrids, and embedded. Filters select features according to the characteristics of the data, while wrappers use a learning algorithm to evaluate the goodness of the feature subsets. Hybrid methods combine filters and wrappers, while embedded methods embed feature selection directly into the learning algorithm.
Clustering, as an unsupervised problem, is harder than classification because different evaluation measures will report different levels of goodness for the same set of clusters. It is thus difficult to develop a comprehensive evaluation measure, but doing so could enable efficient search algorithms for feature selection.
-  https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
-  https://towardsdatascience.com/understanding-singular-value-decomposition-and-its-application-in-data-science-388a54be95d
-  Mitra, Pabitra, C. A. Murthy, and Sankar K. Pal. “Unsupervised feature selection using feature similarity.” IEEE Transactions on Pattern Analysis and Machine Intelligence 24.3 (2002): 301–312.
-  Fop, Michael, Thomas Brendan Murphy, and Luca Scrucca. “Model-based clustering with sparse covariance matrices.” Statistics and Computing 29.4 (2019): 791–819.
-  Modha, Dharmendra S., and W. Scott Spangler. “Feature weighting in k-means clustering.” Machine Learning 52.3 (2003): 217–237.
-  Pal, Sankar K., Rajat K. De, and Jayanta Basak. “Unsupervised feature evaluation: A neuro-fuzzy approach.” IEEE Transactions on Neural Networks 11.2 (2000): 366–376.