Basic Usage of HDBSCAN* for Clustering
We have some data, and we want to cluster it. How exactly do we do that,
and what do the results look like? If you are very familiar with sklearn
and its API, particularly for clustering, then you can probably skip
this tutorial – hdbscan
implements exactly this API, so you can use
it just as you would any other sklearn clustering algorithm. If, on the
other hand, you aren’t that familiar with sklearn, fear not, and read
on. Let’s start with the simplest case first – we have data in a nice
tidy dataframe format.
The Simple Case
Let’s generate some data with, say, 2000 samples and 10 features. We can put it in a dataframe for a nice clean table view of it.
from sklearn.datasets import make_blobs
import pandas as pd
blobs, labels = make_blobs(n_samples=2000, n_features=10)
pd.DataFrame(blobs).head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.370804 | 8.487688 | 4.631243 | -10.181475 | 9.146487 | -8.070935 | -1.612017 | -2.418106 | -8.975390 | -1.769952 |
| 1 | -4.092931 | 8.409841 | 3.362516 | -9.748945 | 9.556615 | -9.240307 | -2.038291 | -3.129068 | -7.109673 | -0.993827 |
| 2 | -4.604753 | 9.616391 | 4.631508 | -11.166361 | 10.888212 | -8.427564 | -3.929517 | -4.563951 | -8.886373 | -1.995063 |
| 3 | -6.889866 | -7.801482 | -6.974958 | -8.570025 | 5.438101 | -5.097457 | -4.941206 | -5.926394 | -10.145152 | 0.219269 |
| 4 | 5.339728 | 2.791309 | 0.611464 | -2.929875 | -7.694973 | 7.776050 | -1.218101 | 0.408141 | -4.563975 | -1.309128 |
So now we need to import the hdbscan library.
import hdbscan
Now, to cluster we need to generate a clustering object.
clusterer = hdbscan.HDBSCAN()
We can then use this clustering object and fit it to the data we have. This will return the clusterer object back to you – just in case you want to do some method chaining.
clusterer.fit(blobs)
HDBSCAN(algorithm='best', alpha=1.0, approx_min_span_tree=True,
gen_min_span_tree=False, leaf_size=40, memory=Memory(cachedir=None),
metric='euclidean', min_cluster_size=5, min_samples=None, p=None)
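Because fit() returns the clusterer, the usual sklearn idioms apply here too. As a minimal sketch (assuming the standard sklearn fit_predict shorthand that clustering estimators following this API provide), you could fit and collect the labels in one step:
labels = hdbscan.HDBSCAN().fit_predict(blobs)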
At this point we are actually done! We’ve done the clustering! But where
are the results? How do I get the clusters? The clusterer object knows,
and stores the result in an attribute labels_.
clusterer.labels_
array([2, 2, 2, ..., 2, 2, 0])
So it is an array of integers. What are we to make of that? It is an array with an integer for each data sample. Samples that are in the same cluster get assigned the same number. Cluster labels start at 0 and count upward. We can thus determine the number of clusters found by checking what the largest cluster label is.
clusterer.labels_.max()
2
So we have a total of three clusters, with labels 0, 1, and 2.
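As a quick sanity check – a small sketch using numpy, not part of the original example – you can also count how many samples landed in each label:
import numpy as np

unique_labels, counts = np.unique(clusterer.labels_, return_counts=True)
dict(zip(unique_labels, counts))
(If a label of -1 turns up in these counts, those samples are noise points – more on that next.)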
Importantly, HDBSCAN is noise aware – it has a notion of data samples
that are not assigned to any cluster. This is handled by assigning these
samples the label -1. But wait, there’s more. The hdbscan library
implements soft clustering, where each data point is assigned a cluster
membership score ranging from 0.0 to 1.0. A score of 0.0 represents a
sample that is not in the cluster at all (all noise points will get this
score) while a score of 1.0 represents a sample that is at the heart of
the cluster (note that this is not the spatial centroid notion of core).
You can access these scores via the probabilities_ attribute.
clusterer.probabilities_
array([ 0.83890858, 1. , 0.72629904, ..., 0.79456452,
0.65311137, 0.76382928])
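For example – a small illustrative sketch, assuming pandas is still imported as pd from earlier – you can line the membership scores up against the cluster labels to see how strongly each point belongs to its cluster:
results = pd.DataFrame({
    'cluster': clusterer.labels_,             # -1 marks noise points
    'membership': clusterer.probabilities_,   # 0.0 for noise, up to 1.0 at the heart of a cluster
})
results.groupby('cluster')['membership'].describe()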
What about different metrics?
That is all well and good, but even for data embedded in a vector
space you may not want to treat the distance between data points as pure
Euclidean distance. What can we do in that case? We are still in good
shape, since hdbscan supports a wide variety of metrics, which you
can set when creating the clusterer object. For example we can do the
following:
clusterer = hdbscan.HDBSCAN(metric='manhattan')
clusterer.fit(blobs)
clusterer.labels_
array([1, 1, 1, ..., 1, 1, 0])
What metrics are supported? Because we simply steal metric computations from sklearn, we get a large number of metrics readily available.
hdbscan.dist_metrics.METRIC_MAPPING
{'braycurtis': hdbscan.dist_metrics.BrayCurtisDistance,
'canberra': hdbscan.dist_metrics.CanberraDistance,
'chebyshev': hdbscan.dist_metrics.ChebyshevDistance,
'cityblock': hdbscan.dist_metrics.ManhattanDistance,
'dice': hdbscan.dist_metrics.DiceDistance,
'euclidean': hdbscan.dist_metrics.EuclideanDistance,
'hamming': hdbscan.dist_metrics.HammingDistance,
'haversine': hdbscan.dist_metrics.HaversineDistance,
'infinity': hdbscan.dist_metrics.ChebyshevDistance,
'jaccard': hdbscan.dist_metrics.JaccardDistance,
'kulsinski': hdbscan.dist_metrics.KulsinskiDistance,
'l1': hdbscan.dist_metrics.ManhattanDistance,
'l2': hdbscan.dist_metrics.EuclideanDistance,
'mahalanobis': hdbscan.dist_metrics.MahalanobisDistance,
'manhattan': hdbscan.dist_metrics.ManhattanDistance,
'matching': hdbscan.dist_metrics.MatchingDistance,
'minkowski': hdbscan.dist_metrics.MinkowskiDistance,
'p': hdbscan.dist_metrics.MinkowskiDistance,
'pyfunc': hdbscan.dist_metrics.PyFuncDistance,
'rogerstanimoto': hdbscan.dist_metrics.RogersTanimotoDistance,
'russellrao': hdbscan.dist_metrics.RussellRaoDistance,
'seuclidean': hdbscan.dist_metrics.SEuclideanDistance,
'sokalmichener': hdbscan.dist_metrics.SokalMichenerDistance,
'sokalsneath': hdbscan.dist_metrics.SokalSneathDistance,
'wminkowski': hdbscan.dist_metrics.WMinkowskiDistance}
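Some of these metrics take extra parameters. For Minkowski distance, for instance, the order p needs to be supplied; it corresponds to the p argument visible in the clusterer repr earlier. A small sketch (the value 1.5 is just an arbitrary illustrative choice):
clusterer = hdbscan.HDBSCAN(metric='minkowski', p=1.5)
clusterer.fit(blobs)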
Distance matrices
What if you don’t have a nice set of points in a vector space, but only
have a pairwise distance matrix providing the distance between each pair
of points? This is a common situation. Perhaps you have a complex custom
distance measure; perhaps you have strings and are using Levenshtein
distance, etc. Again, this is all fine, as hdbscan supports a special
metric called precomputed. If you create the clusterer with the
metric set to precomputed then the clusterer will assume that,
rather than being handed a vector of points in a vector space, it is
receiving an all-pairs distance matrix.
from sklearn.metrics import pairwise_distances
distance_matrix = pairwise_distances(blobs)
clusterer = hdbscan.HDBSCAN(metric='precomputed')
clusterer.fit(distance_matrix)
clusterer.labels_
array([1, 1, 1, ..., 1, 1, 2])
Note that this result only appears different due to a different labelling order for the clusters.
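To make the string example above concrete, here is a hypothetical sketch: a plain-Python Levenshtein distance, an all-pairs matrix built from a made-up word list, and a clusterer run with metric='precomputed'. The word list, the levenshtein helper, and the min_cluster_size value are illustrative choices rather than anything from the library, and with so few points the result may well be mostly noise – the point is only the mechanics.
import numpy as np
import hdbscan

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

words = ['apple', 'apples', 'appleton', 'banana', 'bananas', 'bandana']
n = len(words)
distance_matrix = np.zeros((n, n))  # all-pairs Levenshtein distances
for i in range(n):
    for j in range(i + 1, n):
        distance_matrix[i, j] = distance_matrix[j, i] = levenshtein(words[i], words[j])

clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
clusterer.fit(distance_matrix)
clusterer.labels_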