# Parameter Selection for HDBSCAN*

While the HDBSCAN class has a large number of parameters that can be set on initialization, in practice only a small number of them have a significant effect on clustering. We will first consider those major parameters and how one may go about choosing them effectively. With that out of the way we’ll look at the remaining parameters and see what their effects are; many just affect performance for various use cases.

## Selecting `min_cluster_size`


The primary parameter affecting the resulting clustering is
`min_cluster_size`. Ideally this is a relatively intuitive parameter
to select: set it to the smallest size grouping that you wish to
consider a cluster. It can have slightly non-obvious effects, however.
Let’s consider the digits dataset from sklearn. We can project the data
into two dimensions to visualize it via t-SNE.
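The snippets in this section refer to `data` and `projection` without defining them. A minimal sketch of how they could be produced, assuming scikit-learn's `load_digits` and `TSNE`; the helper names `load_digits_data` and `project_2d` are illustrative and not part of hdbscan:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE


def load_digits_data():
    """Return the digits data as an (n_samples, 64) array."""
    return load_digits().data


def project_2d(data, random_state=None):
    """Embed `data` in two dimensions with t-SNE, for plotting only."""
    return TSNE(n_components=2, random_state=random_state).fit_transform(data)
```

Calling `data = load_digits_data()` followed by `projection = project_2d(data)` gives the two names used below. Clustering is run on `data`; `projection` is only used to draw the scatter plots.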

If we cluster this data in the full 64-dimensional space with hdbscan, we
can see some effects from varying `min_cluster_size`.

We start with a `min_cluster_size` of 15.

```
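import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns

# `data` is the 64-dimensional digits array; `projection` is its 2-D t-SNE embedding.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)
color_palette = sns.color_palette('Paired', 12)
# Noise points (negative labels) are drawn in grey.
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
# Desaturate each point according to its cluster membership probability.
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)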
```

Increasing the `min_cluster_size` to 30 reduces the number of
clusters, merging some together. This is a result of HDBSCAN*
reoptimizing which flat clustering provides greater stability under a
slightly different notion of what constitutes a cluster.

```
clusterer = hdbscan.HDBSCAN(min_cluster_size=30).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)
```
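To put numbers on the changes seen in the plots, one can count the clusters and the share of noise points for each setting. A minimal sketch, assuming noise is labelled `-1` in `clusterer.labels_` (consistent with the grey colouring of negative labels above); `summarize_labels` and `sweep_min_cluster_size` are hypothetical helper names:

```python
def summarize_labels(labels):
    """Return (number of clusters, fraction of points labelled as noise)."""
    labels = list(labels)
    n_noise = sum(1 for label in labels if label == -1)
    n_clusters = len({label for label in labels if label != -1})
    return n_clusters, n_noise / len(labels)


def sweep_min_cluster_size(data, sizes, **hdbscan_kwargs):
    """Fit HDBSCAN once per size and summarize each resulting labelling."""
    # Imported here so summarize_labels can be used without hdbscan installed.
    import hdbscan

    results = {}
    for size in sizes:
        clusterer = hdbscan.HDBSCAN(min_cluster_size=size, **hdbscan_kwargs)
        results[size] = summarize_labels(clusterer.fit(data).labels_)
    return results
```

With `data` loaded as before, `sweep_min_cluster_size(data, [15, 30, 60])` returns one `(n_clusters, noise_fraction)` pair per setting, which makes it easier to compare runs than reading the scatter plots alone.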

Doubling the `min_cluster_size` again to 60 gives us just two clusters,
the really core clusters. This is somewhat as expected, but surely
some of the other clusters that we had previously had more than 60
members? Why are they being considered noise? The answer is that
HDBSCAN* has a second parameter, `min_samples`. The implementation
defaults this value (if it is unspecified) to whatever
`min_cluster_size` is set to. We can recover some of our original
clusters by explicitly providing `min_samples` at the original value
of 15. The first snippet below uses the default; the second fixes
`min_samples` at 15.

```
clusterer = hdbscan.HDBSCAN(min_cluster_size=60).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)
```

```
clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=15).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)
```

As you can see, this recovers something much closer to
our original clustering, only now with some of the smaller clusters
pruned out. Thus `min_cluster_size` does behave more closely to our
intuitions, but only if we fix `min_samples`. If you wish to explore
different `min_cluster_size` settings with a fixed `min_samples`
value, especially for larger datasets, you can cache the hard
computation and recompute only the relatively cheap flat cluster
extraction by using the `memory` parameter, which makes use of `joblib`.
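A sketch of that caching workflow, assuming `memory` is given a directory path (the hdbscan documentation describes it as accepting either a path string or a `joblib.Memory` instance). The helper name `refit_with_cache` and the temporary cache directory are illustrative choices:

```python
import tempfile


def refit_with_cache(data, sizes, min_samples=15, cache_dir=None):
    """Refit with several min_cluster_size values, sharing one cache directory."""
    import hdbscan  # third-party dependency, imported where it is used

    # Assumption: any writable directory will do; a temp dir is used by default.
    cache_dir = cache_dir or tempfile.mkdtemp(prefix="hdbscan_cache_")
    labels_by_size = {}
    for size in sizes:
        clusterer = hdbscan.HDBSCAN(
            min_cluster_size=size,
            min_samples=min_samples,  # held fixed so cached work can be reused
            memory=cache_dir,
        )
        labels_by_size[size] = clusterer.fit(data).labels_
    return labels_by_size
```

Because `min_samples` stays the same across iterations, only the first fit should need the expensive computation; how much time this saves will depend on the dataset.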

## Selecting `min_samples`


Since we have seen that `min_samples` clearly has a dramatic effect on
clustering, the question becomes: how do we select this parameter? The
simplest intuition for what `min_samples` does is that it provides a
measure of how conservative you want your clustering to be. The larger
the value of `min_samples` you provide, the more conservative the
clustering: more points will be declared as noise, and clusters will be
restricted to progressively more dense areas. We can see this in
practice by leaving the `min_cluster_size` at 60, but reducing
`min_samples` to 1.

```
clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=1).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)
```


Now most points are clustered, and there are far fewer noise points.
Steadily increasing `min_samples` will, as we saw in the examples
above, make the clustering progressively more conservative, culminating
in the earlier example where `min_samples` defaulted to 60 and we had
only two clusters with most points declared as noise.
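One way to see why larger values are more conservative: HDBSCAN* estimates density through a core distance, the distance from a point to its k-th nearest neighbour with k governed by `min_samples`, and compares points using the mutual reachability distance. The sketch below is a pure-Python illustration of those two definitions, not the library's implementation; conventions differ on whether a point counts as its own neighbour, and here it does not.

```python
import math


def core_distance(points, index, k):
    """Distance from points[index] to its k-th nearest neighbour (self excluded)."""
    others = sorted(
        math.dist(points[index], p) for i, p in enumerate(points) if i != index
    )
    return others[k - 1]


def mutual_reachability(points, a, b, k):
    """max(core distance of a, core distance of b, plain distance between a and b)."""
    return max(
        core_distance(points, a, k),
        core_distance(points, b, k),
        math.dist(points[a], points[b]),
    )


# A tight group of four points plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
for k in (1, 2, 3):
    cores = [round(core_distance(points, i, k), 2) for i in range(len(points))]
    print(f"k={k}: core distances {cores}")
```

Raising k can never decrease a core distance, so points in sparse regions end up effectively further from everything else and are more readily left as noise, while points inside dense groups are affected much less.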

## Selecting `alpha`


A further parameter that affects the resulting clustering is `alpha`.
In practice it is best not to mess with this parameter; ultimately it
is part of the `RobustSingleLinkage` code, but flows naturally into
HDBSCAN*. If, for some reason, `min_samples` is not providing you
what you need, stop, rethink things, and try again with `min_samples`.
If you still need to play with another parameter (and you shouldn’t),
then you can try setting `alpha`. The `alpha` parameter provides a
slightly different approach to determining how conservative the
clustering is. By default `alpha` is set to 1.0. Increasing `alpha`
will make the clustering more conservative, but on a much tighter scale,
as we can see by setting `alpha` to 1.3.

```
clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=15, alpha=1.3).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)
```