Soft Clustering for HDBSCAN\* ============================= Soft clustering is a new (and still somewhat experimental) feature of the hdbscan library. It takes advantage of the fact that the condensed tree is a kind of smoothed density function over data points, and the notion of exemplars for clusters. If you want to better understand how soft clustering works please refer to :any:`soft_clustering_explanation`. Let's consider the digits dataset from sklearn. We can project the data into two dimensions to visualize it via t-SNE. .. code:: python from sklearn import datasets from sklearn.manifold import TSNE import matplotlib.pyplot as plt import seaborn as sns import numpy as np .. code:: python digits = datasets.load_digits() data = digits.data projection = TSNE().fit_transform(data) plt.scatter(*projection.T, **plot_kwds) .. image:: images/soft_clustering_3_1.png Now we import hdbscan and then cluster in the full 64 dimensional space. It is important to note that, if we wish to use the soft clustering we should use the ``prediction_data=True`` option for HDBSCAN. This will ensure we generate the extra data required that will allow soft clustering to work. .. code:: python import hdbscan .. code:: python clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True).fit(data) color_palette = sns.color_palette('Paired', 12) cluster_colors = [color_palette[x] if x >= 0 else (0.5, 0.5, 0.5) for x in clusterer.labels_] cluster_member_colors = [sns.desaturate(x, p) for x, p in zip(cluster_colors, clusterer.probabilities_)] plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25) .. image:: images/soft_clustering_6_1.png Certainly a number of clusters were found, but the data is fairly noisy in 64 dimensions, so there are a number of points that have been classified as noise. We can generate a soft clustering to get more information about some of these noise points. To generate a soft clustering for all the points in the original dataset we use the :py:func:`~hdbscan.prediction.all_points_membership_vectors` function which takes a clusterer object. If we wanted to get soft cluster membership values for a set of new unseen points we could use :py:func:`~hdbscan.prediction.membership_vector` instead. The return value is a two-dimensional numpy array. Each point of the input data is assigned a vector of probabilities of being in a cluster. For a first pass we can visualize the data looking at what the *most likely* cluster was, by coloring according to the ``argmax`` of the probability vector (i.e. the cluster for which a given point has the highest probability of being in). .. code:: python soft_clusters = hdbscan.all_points_membership_vectors(clusterer) color_palette = sns.color_palette('Paired', 12) cluster_colors = [color_palette[np.argmax(x)] for x in soft_clusters] plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_colors, alpha=0.25) .. image:: images/soft_clustering_8_1.png This fills out the clusters nicely -- we see that there were many noise points that are most likely to belong to the clusters we would expect; we can also see where things have gotten confused in the middle, and there is a mix of cluster assignments. We are still only using part of the information however; we can desaturate according to the actual probability value for the most likely cluster. .. code:: python color_palette = sns.color_palette('Paired', 12) cluster_colors = [sns.desaturate(color_palette[np.argmax(x)], np.max(x)) for x in soft_clusters] plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_colors, alpha=0.25) .. image:: images/soft_clustering_10_1.png We see that many points actually have a low probability of being in the cluster -- indeed the soft clustering applies *within* a cluster, so only the very cores of each cluster have high probabilities. In practice desaturating is a fairly string treatment; visually a lot will look gray. We could apply a function and put a lower limit on the desaturation that meets better with human visual perception, but that is left as an exercise for the reader. Instead we'll explore what else we can learn about the data from these cluster membership probabilities. An interesting question is which points have high likelihoods for *two* clusters (and low likelihoods for the other clusters). .. code:: python def top_two_probs_diff(probs): sorted_probs = np.sort(probs) return sorted_probs[-1] - sorted_probs[-2] # Compute the differences between the top two probabilities diffs = np.array([top_two_probs_diff(x) for x in soft_clusters]) # Select out the indices that have a small difference, and a larger total probability mixed_points = np.where((diffs < 0.001) & (np.sum(soft_clusters, axis=1) > 0.5))[0] .. code:: python colors = [(0.75, 0.1, 0.1) if x in mixed_points else (0.5, 0.5, 0.5) for x in range(data.shape[0])] plt.scatter(*projection.T, s=50, linewidth=0, c=colors, alpha=0.5) .. image:: images/soft_clustering_13_1.png We can look at a few of these and see that many are, indeed, hard to classify (even for humans). It also seems that 8 was not assigned a cluster and is seen as a mixture of other clusters. .. code:: python fig = plt.figure() for i, image in enumerate(digits.images[mixed_points][:16]): ax = fig.add_subplot(4,4,i+1) ax.imshow(image) plt.tight_layout() .. image:: images/soft_clustering_15_0.png There is, of course, a lot more analysis that can be done from here, but hopefully this provides sufficient introduction to what can be achieved with soft clustering.