Perform HDBSCAN clustering from vector array or distance matrix.
HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications
with Noise. Performs DBSCAN over varying epsilon values and integrates
the result to find a clustering that gives the best stability over epsilon.
This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN),
and makes it more robust to parameter selection.
min_cluster_size : int, optional (default=5)
The minimum size of clusters; single linkage splits that contain
fewer points than this will be considered points "falling out" of a
cluster rather than a cluster splitting into two new clusters.
min_samples : int, optional (default=None)
The number of samples in a neighbourhood for a point to be
considered a core point.
metric : string or callable, optional (default='euclidean')
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.pairwise_distances for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square.
p : int, optional (default=None)
p value to use if using the minkowski metric.
alpha : float, optional (default=1.0)
A distance scaling parameter as used in robust single linkage.
See [3] for more information.
algorithm : string, optional (default='best')
Exactly which algorithm to use; hdbscan has variants specialised
for different characteristics of the data. By default this is set
to 'best' which chooses the "best" algorithm given the nature of
the data. You can force other options if you believe you know
better. Options are:
best
generic
prims_kdtree
prims_balltree
boruvka_kdtree
boruvka_balltree
leaf_size : int, optional (default=40)
If using a space tree algorithm (kdtree, or balltree) the number
of points in a leaf node of the tree. This does not alter the
resulting clustering, but may have an effect on the runtime
of the algorithm.
memory : instance of joblib.Memory or string, optional
Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
approx_min_span_tree : bool, optional (default=True)
Whether to accept an only approximate minimum spanning tree.
For some algorithms this can provide a significant speedup, but
the resulting clustering may be of marginally lower quality.
If you are willing to sacrifice speed for correctness you may want
to explore this; in general this should be left at the default True.
gen_min_span_tree : bool, optional (default=False)
Whether to generate the minimum spanning tree with regard
to mutual reachability distance for later analysis.
core_dist_n_jobs : int, optional (default=4)
Number of parallel jobs to run in core distance computations (if
supported by the specific algorithm). For core_dist_n_jobs
below -1, (n_cpus + 1 + core_dist_n_jobs) are used.
cluster_selection_method : string, optional (default='eom')
The method used to select clusters from the condensed tree. The
standard approach for HDBSCAN* is to use an Excess of Mass algorithm
to find the most persistent clusters. Alternatively you can instead
select the clusters at the leaves of the tree; this provides the
most fine grained and homogeneous clusters. Options are:
eom
leaf
allow_single_cluster : bool, optional (default=False)
By default HDBSCAN* will not produce a single cluster; setting this
to True will override this and allow single cluster results in
the case that you feel this is a valid result for your dataset.
prediction_data : boolean, optional (default=False)
Whether to generate extra cached data for predicting labels or
membership vectors for new unseen points later. If you wish to
persist the clustering object for later re-use you probably want
to set this to True.
match_reference_implementation : bool, optional (default=False)
There exist some interpretational differences between this
HDBSCAN* implementation and the original authors' reference
implementation in Java. These can result in very minor differences
in clustering results. Setting this flag to True will, at some
performance cost, ensure that the clustering results match the
reference implementation.
probabilities_ : ndarray, shape (n_samples,)
The strength with which each sample is a member of its assigned
cluster. Noise points have probability zero; points in clusters
have values assigned proportional to the degree that they
persist as part of the cluster.
cluster_persistence_ : ndarray, shape (n_clusters,)
A score of how persistent each cluster is. A score of 1.0 represents
a perfectly stable cluster that persists over all distance scales,
while a score of 0.0 represents a perfectly ephemeral cluster. These
scores can be used to gauge the relative coherence of the clusters
output by the algorithm.
minimum_spanning_tree_ : MinimumSpanningTree object
The minimum spanning tree of the mutual reachability graph generated
by HDBSCAN. Note that this is not generated by default and will only
be available if gen_min_span_tree was set to True on object creation.
Even then in some optimized cases a tree may not be generated.
outlier_scores_ : ndarray, shape (n_samples,)
Outlier scores for clustered points; the larger the score the more
outlier-like the point. Useful as an outlier detection technique.
Based on the GLOSH algorithm by Campello, Moulavi, Zimek and Sander.
exemplars_ : list
A list of exemplar points for clusters. Since HDBSCAN supports
arbitrary shapes for clusters we cannot provide a single cluster
exemplar per cluster. Instead a list is returned with each element
of the list being a numpy array of exemplar points for a cluster;
these points are the "most representative" points of the cluster.
relative_validity_ : float
A fast approximation of the Density Based Cluster Validity (DBCV)
score [4]. The only difference, and the speed, comes from the fact
that this relative_validity_ is computed using the mutual-reachability
minimum spanning tree, i.e. minimum_spanning_tree_,
instead of the all-points minimum spanning tree used in the
reference. This score might not be an objective measure of the
goodness of clustering. It may only be used to compare results
across different choices of hyper-parameters, and is therefore only a
relative score.
Return the clustering that would be equivalent to running DBSCAN* for a particular cut_distance (or epsilon).
DBSCAN* can be thought of as DBSCAN without the border points. As such these results may differ slightly
from sklearn's implementation of DBSCAN in the non-core points.
This can also be thought of as a flat clustering derived from a constant height cut through the single
linkage tree.
This represents the result of selecting a cut value for robust single linkage
clustering. The min_cluster_size allows the flat clustering to declare noise
points (and clusters smaller than min_cluster_size).
Create data that caches intermediate results used for predicting
the label of new/unseen points. This data is only useful if
you are intending to use functions from hdbscan.prediction.
Provide an approximate representative point for a given cluster.
Note that this technique assumes a euclidean metric for speed of
computation. For more general metrics use the weighted_cluster_medoid
method, which is slower but can work with the metric the model was
trained with.
Provide an approximate representative point for a given cluster.
Note that this technique can be very slow and memory intensive for
large clusters. For faster results use the weighted_cluster_centroid
method which is faster, but assumes a euclidean metric.
Perform robust single linkage clustering from a vector array
or distance matrix.
Robust single linkage is a modified version of single linkage that
attempts to be more robust to noise. Specifically the goal is to
more accurately approximate the level set tree of the unknown
probability density function from which the sample data has
been drawn.
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or
array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if
metric='precomputed'.
cut : float
The reachability distance value to cut the cluster hierarchy at
to derive a flat cluster labelling.
k : int, optional (default=5)
Reachability distances will be computed with regard to the k
nearest neighbors.
alpha : float, optional (default=np.sqrt(2))
Distance scaling for reachability distance computation. Reachability
distance is computed as
$\max\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ \tfrac{1}{\alpha} d(a,b)\}$.
gamma : int, optional (default=5)
Ignore any clusters in the flat clustering with size less than gamma,
and declare points in such clusters as noise points.
metric : string or callable, optional (default='euclidean')
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.pairwise_distances for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square.
metric_params : dict, optional (default={})
Keyword parameter arguments for calling the metric (for example
the p value if using the minkowski metric).
algorithm : string, optional (default='best')
Exactly which algorithm to use; hdbscan has variants specialised
for different characteristics of the data. By default this is set
to 'best' which chooses the "best" algorithm given the nature of
the data. You can force other options if you believe you know
better. Options are:
small
small_kdtree
large_kdtree
large_kdtree_fastcluster
core_dist_n_jobs : int, optional (default=4)
Number of parallel jobs to run in core distance computations (if
supported by the specific algorithm). For core_dist_n_jobs
below -1, (n_cpus + 1 + core_dist_n_jobs) are used.
leaf_separation : float, optional (default=1)
How far apart to space the final leaves of the
dendrogram.
log_size : boolean, optional (default=False)
Use log scale for the 'size' of clusters (i.e. number of
points in the cluster at a given lambda value).
max_rectangles_per_icicle : int, optional
To simplify the plot this method will only emit
max_rectangles_per_icicle bars per branch of the dendrogram.
This ensures that we don't suffer from massive overplotting in
cases with a lot of data points.
Data associated with bars in a bar plot:
bar_centers : x coordinate centers for bars
bar_tops : heights of bars in lambda scale
bar_bottoms : y coordinate of bottoms of bars
bar_widths : widths of the bars (in x coord scale)
bar_bounds : a 4-tuple of [left, right, bottom, top]
giving the bounds on a full set of
cluster bars
Data associated with cluster splits:
line_xs : x coordinates for horizontal dendrogram lines
line_ys : y coordinates for horizontal dendrogram lines
Use matplotlib to plot an 'icicle plot' dendrogram of the condensed tree.
Effectively this is a dendrogram where the width of each cluster bar is
equal to the number of points (or log of the number of points) in the cluster
at the given lambda value. Thus bars narrow as points progressively drop
out of clusters. To make the effect more apparent the bars are also colored
according to the number of points (or log of the number of points).
leaf_separation : float, optional (default=1)
How far apart to space the final leaves of the
dendrogram.
cmap : string or matplotlib colormap, optional (default='viridis')
The matplotlib colormap to use to color the cluster bars.
select_clusters : boolean, optional (default=False)
Whether to draw ovals highlighting which cluster
bars represent the clusters that were selected by
HDBSCAN as the final clusters.
label_clusters : boolean, optional (default=False)
If select_clusters is True then this determines
whether to draw text labels on the clusters.
selection_palette : list of colors, optional (default=None)
If not None, and at least as long as
the number of clusters, draw ovals
in colors iterating through this palette.
This can aid in cluster identification
when plotting.
axis : matplotlib axis or None, optional (default=None)
The matplotlib axis to render to. If None then a new axis
will be generated. The rendered axis will be returned.
colorbar : boolean, optional (default=True)
Whether to draw a matplotlib colorbar displaying the range
of cluster sizes as per the colormap.
log_size : boolean, optional (default=False)
Use log scale for the 'size' of clusters (i.e. number of
points in the cluster at a given lambda value).
max_rectangles_per_icicle : int, optional
To simplify the plot this method will only emit
max_rectangles_per_icicle bars per branch of the dendrogram.
This ensures that we don't suffer from massive overplotting in
cases with a lot of data points.
Return a NetworkX DiGraph object representing the condensed tree.
Edge weights in the graph are the lambda values at which child nodes
'leave' the parent cluster.
Nodes have a size attribute attached giving the number of points
that are in the cluster (or 1 if it is a singleton point) at the
point of cluster creation (fewer points may be in the cluster at
larger lambda values).
Return a pandas dataframe representation of the condensed tree.
Each row of the dataframe corresponds to an edge in the tree.
The columns of the dataframe are parent, child, lambda_val
and child_size.
The parent and child are the ids of the
parent and child nodes in the tree. Node ids less than the number
of points in the original dataset represent individual points, while
ids greater than the number of points are clusters.
The lambda_val value is the value (1/distance) at which the child
node leaves the cluster.
The child_size is the number of points in the child node.
Return a flat clustering from the single linkage hierarchy.
This represents the result of selecting a cut value for robust single linkage
clustering. The min_cluster_size allows the flat clustering to declare noise
points (and clusters smaller than min_cluster_size).
The dendrogram can be hard to read when the original
observation matrix from which the linkage is derived
is large. Truncation is used to condense the dendrogram.
There are several modes:
None/'none'
No truncation is performed (Default).
'lastp'
The last p non-singleton clusters formed in the linkage are the only
non-leaf nodes in the linkage; they correspond to rows
Z[n-p-2:end] in Z. All other non-singleton clusters are
contracted into leaf nodes.
'level'/'mtica'
No more than p levels of the dendrogram tree are displayed.
This corresponds to Mathematica(TM) behavior.
p : int, optional
The p parameter for truncate_mode.
vary_line_width : boolean, optional
Draw downward branches of the dendrogram with line thickness that
varies depending on the size of the cluster.
cmap : string or matplotlib colormap, optional (default='viridis')
The matplotlib colormap to use to color the cluster bars.
A value of 'none' will result in black bars.
colorbar : boolean, optional (default=True)
Whether to draw a matplotlib colorbar displaying the range
of cluster sizes as per the colormap.
Return a numpy array representation of the single linkage tree.
This representation conforms to the scipy.cluster.hierarchy notion
of a single linkage tree, and can be used with all the associated
scipy tools. Please see the scipy documentation for more details
on the format.
Return a pandas dataframe representation of the single linkage tree.
Each row of the dataframe corresponds to an edge in the tree.
The columns of the dataframe are parent, left_child,
right_child, distance and size.
The parent, left_child and right_child are the ids of the
parent and child nodes in the tree. Node ids less than the number
of points in the original dataset represent individual points, while
ids greater than the number of points are clusters.
The distance value is the distance at which the child nodes merge to form
the parent node.
The size is the number of points in the parent node.
Return a Pandas dataframe of the minimum spanning tree.
Each row is an edge in the tree; the columns are from,
to, and distance giving the two vertices of the edge
which are indices into the dataset, and the distance
between those datapoints.
Compute the density separation between two clusters. This is the minimum
distance between pairs of points, one from internal nodes of MSTs of each cluster.
X : array (n_samples, n_features) or (n_samples, n_samples)
The input data of the clustering. This can be the data, or, if
metric is set to precomputed the pairwise distance matrix used
for the clustering.
labels : array (n_samples,)
The label array output by the clustering, providing an integral
cluster label to each data point, with -1 for noise points.
cluster_id1 : integer
The first cluster label to compute separation between.
cluster_id2 : integer
The second cluster label to compute separation between.
internal_nodes1 : array
The vertices of the MST for cluster_id1 that were internal vertices.
internal_nodes2 : array
The vertices of the MST for cluster_id2 that were internal vertices.
core_distances1 : array (size of cluster_id1,)
The all-points-core_distances of all points in the cluster
specified by cluster_id1.
core_distances2 : array (size of cluster_id2,)
The all-points-core_distances of all points in the cluster
specified by cluster_id2.
metric : string
The metric used to compute distances for the clustering (and
to be re-used in computing distances for mr distance). If
set to precomputed then X is assumed to be the precomputed
distance matrix between samples.
Compute pairwise distances for all the points of a cluster.
If metric is ‘precomputed’ then assume X is a distance matrix for the full
dataset. Note that in this case you must pass in ‘d’ the dimension of the
dataset.
X : array (n_samples, n_features) or (n_samples, n_samples)
The input data of the clustering. This can be the data, or, if
metric is set to precomputed the pairwise distance matrix used
for the clustering.
labels : array (n_samples,)
The label array output by the clustering, providing an integral
cluster label to each data point, with -1 for noise points.
cluster_id : integer
The cluster label for which to compute the distances.
metric : string
The metric used to compute distances for the clustering (and
to be re-used in computing distances for mr distance). If
set to precomputed then X is assumed to be the precomputed
distance matrix between samples.
d : integer (or None)
The number of features (dimension) of the dataset. This need only
be set in the case of metric being set to precomputed, where
the ambient dimension of the data is unknown to the function.
Compute the ‘internal’ minimum spanning tree given a matrix of mutual
reachability distances. Given a minimum spanning tree the ‘internal’
graph is the subgraph induced by vertices of degree greater than one.
The pairwise mutual reachability distances, inferred to be the edge
weights of a complete graph. Since MSTs are computed per cluster
this is the all-points-mutual-reachability for points within a single
cluster.
An array listing the indices of the internal nodes of the MST
internal_edges : array (?, 3)
An array of internal edges in weighted edge list format; that is
an edge is an array of length three listing the two vertices
forming the edge and weight of the edge.
X : array (n_samples, n_features) or (n_samples, n_samples)
The input data of the clustering. This can be the data, or, if
metric is set to precomputed the pairwise distance matrix used
for the clustering.
labels : array (n_samples,)
The label array output by the clustering, providing an integral
cluster label to each data point, with -1 for noise points.
metric : string, optional (default='euclidean')
The metric used to compute distances for the clustering (and
to be re-used in computing distances for mr distance). If
set to precomputed then X is assumed to be the precomputed
distance matrix between samples.
d : integer (or None), optional (default=None)
The number of features (dimension) of the dataset. This need only
be set in the case of metric being set to precomputed, where
the ambient dimension of the data is unknown to the function.
per_cluster_scores : boolean, optional (default=False)
Whether to return the validity index for individual clusters.
Defaults to False with the function returning a single float
value for the whole clustering.
mst_raw_dist : boolean, optional (default=False)
If True, the MSTs are constructed solely via 'raw' distances (depending on the given metric, e.g. euclidean distances)
instead of using mutual reachability distances. Thus setting this parameter to True avoids using 'all-points-core-distances' at all.
This is advantageous specifically in the case of elongated clusters that lie in close proximity to each other <citation needed>.
validity_index : float
The density based cluster validity index for the clustering. This
is a numeric value between -1 and 1, with higher values indicating
a 'better' clustering.
per_cluster_validity_index : array (n_clusters,)
The cluster validity index of each individual cluster as an array.
The overall validity index is the weighted average of these values.
Only returned if per_cluster_scores is set to True.
Predict soft cluster membership vectors for all points in the
original dataset the clusterer was trained on. This function is more
efficient because it makes use of the fact that all points are already in the
condensed tree, and processes them in bulk.
clusterer : HDBSCAN
A clustering object that has been fit to the data and
either had prediction_data=True set, or called the
generate_prediction_data method after the fact.
This method does not work if the clusterer was trained
with metric='precomputed'.
Predict the cluster label of new points. The returned labels
will be those of the original clustering found by clusterer,
and therefore are not (necessarily) the cluster labels that would
be found by clustering the original data combined with
points_to_predict, hence the ‘approximate’ label.
If you simply wish to assign new points to an existing clustering
in the ‘best’ way possible, this is the function to use. If you
want to predict how points_to_predict would cluster with
the original data under HDBSCAN the most efficient existing approach
is to simply recluster with the new point(s) added to the original dataset.
clusterer : HDBSCAN
A clustering object that has been fit to the data and
either had prediction_data=True set, or called the
generate_prediction_data method after the fact.
points_to_predict : array, or array-like (n_samples, n_features)
The new data points to predict cluster labels for. They should
have the same dimensionality as the original dataset over which
clusterer was fit.
Predict the outlier score of new points. The returned scores
will be based on the original clustering found by clusterer,
and therefore are not (necessarily) the outlier scores that would
be found by clustering the original data combined with
points_to_predict, hence the ‘approximate’ label.
If you simply wish to calculate the outlier scores for new points
in the ‘best’ way possible, this is the function to use. If you
want to predict the outlier score of points_to_predict with
the original data under HDBSCAN the most efficient existing approach
is to simply recluster with the new point(s) added to the original dataset.
clusterer : HDBSCAN
A clustering object that has been fit to the data and
either had prediction_data=True set, or called the
generate_prediction_data method after the fact.
points_to_predict : array, or array-like (n_samples, n_features)
The new data points to predict outlier scores for. They should
have the same dimensionality as the original dataset over which
clusterer was fit.
Predict soft cluster membership. The result produces a vector
for each point in points_to_predict that gives a probability that
the given point is a member of a cluster for each of the selected clusters
of the clusterer.
clusterer : HDBSCAN
A clustering object that has been fit to the data and
either had prediction_data=True set, or called the
generate_prediction_data method after the fact.
points_to_predict : array, or array-like (n_samples, n_features)
The new data points to predict cluster membership for. They should
have the same dimensionality as the original dataset over which
clusterer was fit.