API Reference

Major classes are HDBSCAN and RobustSingleLinkage.

HDBSCAN

class hdbscan.hdbscan_.HDBSCAN(min_cluster_size=5, min_samples=None, metric='euclidean', alpha=1.0, p=None, algorithm='best', leaf_size=40, memory=Memory(cachedir=None), approx_min_span_tree=True, gen_min_span_tree=False, core_dist_n_jobs=4, allow_single_cluster=False, match_reference_implementation=False, **kwargs)

Perform HDBSCAN clustering from vector array or distance matrix.

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

min_cluster_size : int, optional (default=5)

The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.

min_samples : int, optional (default=None)

The number of samples in a neighbourhood for a point to be considered a core point.

metric : string, or callable, optional (default='euclidean')

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square.

p : int, optional (default=None)

p value to use if using the minkowski metric.

alpha : float, optional (default=1.0)

A distance scaling parameter as used in robust single linkage. See [3] for more information.

algorithm : string, optional (default='best')

Exactly which algorithm to use; hdbscan has variants specialised for different characteristics of the data. By default this is set to best which chooses the “best” algorithm given the nature of the data. You can force other options if you believe you know better. Options are:

best

generic

prims_kdtree

prims_balltree

boruvka_kdtree

boruvka_balltree

leaf_size : int, optional (default=40)

If using a space tree algorithm (kdtree or balltree), the number of points in a leaf node of the tree. This does not alter the resulting clustering, but may have an effect on the runtime of the algorithm.

memory : Instance of joblib.Memory or string (optional)

Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

approx_min_span_tree : bool, optional (default=True)

Whether to accept an approximate minimum spanning tree. For some algorithms this can provide a significant speedup, but the resulting clustering may be of marginally lower quality. If you are willing to sacrifice speed for correctness, you may want to set this to False; in general it should be left at the default of True.

gen_min_span_tree : bool, optional (default=False)

Whether to generate the minimum spanning tree with regard to mutual reachability distance for later analysis.

core_dist_n_jobs : int, optional (default=4)

Number of parallel jobs to run in core distance computations (if supported by the specific algorithm). For core_dist_n_jobs below -1, (n_cpus + 1 + core_dist_n_jobs) are used.
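The negative-value rule for core_dist_n_jobs can be illustrated with a small sketch. The helper name effective_n_jobs is hypothetical and not part of hdbscan; treating -1 as "all CPUs" follows the same formula and the usual joblib convention, and clamping to at least one worker is an assumption made here for safety.

```python
def effective_n_jobs(core_dist_n_jobs: int, n_cpus: int) -> int:
    """Resolve a joblib-style job count to a concrete worker count.

    Hypothetical helper for illustration only; not part of hdbscan.
    """
    if core_dist_n_jobs == 0:
        raise ValueError("core_dist_n_jobs must be non-zero")
    if core_dist_n_jobs < 0:
        # Formula from the parameter description:
        # n_cpus + 1 + core_dist_n_jobs.
        # Clamping to at least one worker is an assumption.
        return max(1, n_cpus + 1 + core_dist_n_jobs)
    # Positive values are taken literally.
    return core_dist_n_jobs


print(effective_n_jobs(4, n_cpus=8))   # 4
print(effective_n_jobs(-1, n_cpus=8))  # 8
print(effective_n_jobs(-2, n_cpus=8))  # 7
```

On an 8-core machine, core_dist_n_jobs=-2 would therefore leave one core free.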

allow_single_cluster : bool, optional (default=False)

By default HDBSCAN* will not produce a single cluster. Setting this to True overrides that behaviour and allows single-cluster results, for cases where you believe this is a valid result for your dataset.

match_reference_implementation : bool, optional (default=False)

There exist some interpretational differences between this HDBSCAN* implementation and the original authors' reference implementation in Java. This can result in very minor differences in clustering results. Setting this flag to True will, at some performance cost, ensure that the clustering results match the reference implementation.

**kwargs : optional

Arguments passed to the distance metric

labels_ : ndarray, shape (n_samples, ): Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
probabilities_ : ndarray, shape (n_samples, ): The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.
cluster_persistence_ : ndarray, shape (n_clusters, ): A score of how persistent each cluster is. A score of 1.0 represents a perfectly stable cluster that persists over all distance scales, while a score of 0.0 represents a perfectly ephemeral cluster. These scores can be used to gauge the relative coherence of the clusters output by the algorithm.
condensed_tree_ : CondensedTree object: The condensed tree produced by HDBSCAN. The object has methods for converting to pandas, networkx, and plotting.
single_linkage_tree_ : SingleLinkageTree object: The single linkage tree produced by HDBSCAN. The object has methods for converting to pandas, networkx, and plotting.
minimum_spanning_tree_ : MinimumSpanningTree object: The minimum spanning tree of the mutual reachability graph generated by HDBSCAN. Note that this is not generated by default and will only be available if gen_min_span_tree was set to True on object creation. Even then, in some optimized cases a tree may not be generated.
outlier_scores_ : ndarray, shape (n_samples, ): Outlier scores for clustered points; the larger the score the more outlier-like the point. Useful as an outlier detection technique. Based on the GLOSH algorithm by Campello, Moulavi, Zimek and Sander.
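As a rough illustration of how labels_ and probabilities_ might be read together, the sketch below works on small hand-written arrays rather than real clustering output. The array values and the 0.7 membership threshold are assumptions chosen only for the example.

```python
import numpy as np

# Hypothetical stand-ins for clusterer.labels_ and clusterer.probabilities_.
labels = np.array([0, 0, 1, -1, 1, 0, -1, 2, 2])
probabilities = np.array([1.0, 0.8, 0.9, 0.0, 0.6, 0.3, 0.0, 1.0, 0.7])

# Noise points carry the label -1 and a membership strength of zero.
noise_mask = labels == -1
n_noise = int(noise_mask.sum())

# Count distinct clusters, ignoring the noise label.
n_clusters = len(np.unique(labels[~noise_mask]))

# Keep only points that belong strongly to their cluster.
# The 0.7 cut-off is arbitrary and chosen for illustration.
confident = np.flatnonzero((probabilities >= 0.7) & ~noise_mask)

print(n_clusters, n_noise)  # 3 2
print(confident)
```

Filtering on probabilities_ in this way is one option for extracting cluster cores; the appropriate threshold depends on the data.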

[1]	Campello, R. J., Moulavi, D., & Sander, J. (2013, April). Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 160-172). Springer Berlin Heidelberg.

[2]	Campello, R. J., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1), 5.

[3]	Chaudhuri, K., & Dasgupta, S. (2010). Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems (pp. 343-351).

fit(X, y=None)

Perform HDBSCAN clustering from features or distance matrix.

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples): A feature array, or array of distances between samples if metric='precomputed'.

self : object: Returns self
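When metric='precomputed', X must be a square matrix of pairwise distances. The sketch below builds such a matrix with NumPy alone, using Manhattan distance as an arbitrary choice; the data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))  # 50 samples, 3 features (synthetic)

# Pairwise Manhattan distances via broadcasting:
# diff has shape (50, 50, 3), D has shape (50, 50).
diff = X[:, None, :] - X[None, :, :]
D = np.abs(diff).sum(axis=-1)

# A valid precomputed input is square and symmetric, with a zero diagonal.
print(D.shape)                      # (50, 50)
print(np.allclose(D, D.T))          # True
print(np.allclose(np.diag(D), 0))   # True
```

A matrix like D would then be passed in place of the feature array, for example HDBSCAN(metric='precomputed').fit(D).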

fit_predict(X, y=None)

Performs clustering on X and returns cluster labels.

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples): A feature array, or array of distances between samples if metric='precomputed'.

y : ndarray, shape (n_samples, ): cluster labels

RobustSingleLinkage

class hdbscan.robust_single_linkage_.RobustSingleLinkage(cut=0.4, k=5, alpha=1.4142135623730951, gamma=5, metric='euclidean', algorithm='best', core_dist_n_jobs=4, **kwargs)

Perform robust single linkage clustering from a vector array or distance matrix.

Robust single linkage is a modified version of single linkage that attempts to be more robust to noise. Specifically, the goal is to more accurately approximate the level set tree of the unknown probability density function from which the sample data has been drawn.

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

cut : float

The reachability distance value at which to cut the cluster hierarchy to derive a flat cluster labelling.

k : int, optional (default=5)

Reachability distances will be computed with regard to the k nearest neighbors.

alpha : float, optional (default=np.sqrt(2))

Distance scaling for reachability distance computation. Reachability distance is computed as $\max \{ \mathrm{core}_k(a), \mathrm{core}_k(b), \frac{1}{\alpha} d(a,b) \}$.
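The reachability distance formula can be sketched directly with NumPy. This is an illustrative brute-force computation, not the library's implementation; the function name reachability_matrix is hypothetical, Euclidean distance is assumed, and the core distance is taken here as the distance to the k-th nearest neighbour excluding the point itself, which is an assumption about the convention.

```python
import numpy as np


def reachability_matrix(X, k=5, alpha=np.sqrt(2)):
    """Brute-force reachability distances (hypothetical helper).

    Requires more than k samples.
    """
    # Pairwise Euclidean distances d(a, b), shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # core_k(a): distance to the k-th nearest neighbour.
    # Column 0 of each sorted row is the point itself (distance 0),
    # so column k is the k-th neighbour excluding self (assumption).
    core = np.sort(dist, axis=1)[:, k]

    # max{ core_k(a), core_k(b), d(a, b) / alpha } for every pair.
    return np.maximum(np.maximum(core[:, None], core[None, :]), dist / alpha)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
R = reachability_matrix(X, k=5)

print(R.shape)               # (20, 20)
print(np.allclose(R, R.T))   # True
```

Larger values of alpha shrink the contribution of the raw distance d(a,b), so the core distances dominate more often.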

gamma : int, optional (default=5)

Ignore any clusters in the flat clustering with size less than gamma, and declare points in such clusters as noise points.
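The effect of gamma can be sketched as a post-processing step on a flat labelling. The helper drop_small_clusters is hypothetical and only mirrors the description above; whether the library renumbers the surviving clusters afterwards is not specified here, so this sketch leaves their labels unchanged.

```python
import numpy as np


def drop_small_clusters(labels, gamma=5):
    """Relabel clusters with fewer than gamma points as noise (-1).

    Hypothetical helper for illustration; not part of hdbscan.
    """
    labels = np.asarray(labels).copy()
    # Count members of each non-noise cluster.
    ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    small = ids[counts < gamma]
    # Points in undersized clusters are declared noise.
    labels[np.isin(labels, small)] = -1
    return labels


raw = np.array([0] * 6 + [1] * 2 + [2] * 5)
print(drop_small_clusters(raw, gamma=5))
# [ 0  0  0  0  0  0 -1 -1  2  2  2  2  2]
```

Cluster 2 survives because its size equals gamma, and only sizes strictly less than gamma are discarded.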

metric : string, or callable, optional (default='euclidean')

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square.

algorithm : string, optional (default='best')

Exactly which algorithm to use; hdbscan has variants specialised for different characteristics of the data. By default this is set to best which chooses the “best” algorithm given the nature of the data. You can force other options if you believe you know better. Options are:

small

small_kdtree

large_kdtree

large_kdtree_fastcluster

core_dist_n_jobs : int, optional (default=4)

Number of parallel jobs to run in core distance computations (if supported by the specific algorithm). For core_dist_n_jobs below -1, (n_cpus + 1 + core_dist_n_jobs) are used.

labels_ : ndarray, shape (n_samples, )

Cluster labels for each point. Noisy samples are given the label -1.

cluster_hierarchy_ : SingleLinkageTree object

The single linkage tree produced during clustering. This object provides several methods for:

Plotting

Generating a flat clustering

Exporting to NetworkX

Exporting to Pandas

[1]	Chaudhuri, K., & Dasgupta, S. (2010). Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems (pp. 343-351).

fit(X, y=None)

Perform robust single linkage clustering from features or distance matrix.

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples): A feature array, or array of distances between samples if metric='precomputed'.

self : object: Returns self

fit_predict(X, y=None)

Performs clustering on X and returns cluster labels.

X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples): A feature array, or array of distances between samples if metric='precomputed'.

y : ndarray, shape (n_samples, ): cluster labels