Python Package lumbermark Reference

lumbermark Python Package

class lumbermark.Lumbermark(n_clusters=2, *, min_cluster_size=10, min_cluster_factor=0.25, skip_leaves='auto', M=5, metric='l2', quitefastmst_params={}, verbose=False)

Lumbermark [1] is a fast and robust divisive clustering algorithm which identifies a specified number of clusters.

It iteratively chops off sizeable limbs that are joined by protruding segments of a dataset’s mutual reachability minimum spanning tree.

The use of a mutual reachability distance [2] pulls peripheral points farther away from each other. When combined with the deadwood package, it can act as an outlier detector.
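To make the effect of the mutual reachability distance concrete, here is a minimal sketch in pure Python. It assumes the convention of [2] in which the core distance of a point is the distance to its M-th nearest neighbour (the point itself excluded), and the mutual reachability distance is the maximum of the original distance and the two core distances. The function names are hypothetical, and the exact definition used by deadwood and lumbermark may differ in detail. Under this convention, M = 1 leaves the distances unchanged, which agrees with the description of the M parameter.

```python
import math

def core_distances(points, M):
    """Distance from each point to its M-th nearest neighbour (itself excluded)."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[M - 1])
    return cores

def mutual_reachability(points, M):
    """Full matrix of mutual reachability distances (assumed convention, see text)."""
    n = len(points)
    cores = core_distances(points, M)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = max(math.dist(points[i], points[j]), cores[i], cores[j])
            D[i][j] = D[j][i] = d
    return D

# Four points on a unit square plus one far-away (peripheral) point
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
D = mutual_reachability(points, M=2)

# Distances inside the dense square are unchanged, whereas the peripheral
# point ends up farther from its nearest neighbour than it originally was.
print(D[0][1], math.dist(points[0], points[1]))
print(D[3][4], math.dist(points[3], points[4]))
```

This brute-force version needs quadratic time and memory and is meant for illustration only.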

Once the spanning tree is determined (which takes between \(\Omega(n \log n)\) and \(O(n^2)\) time), the Lumbermark algorithm runs in \(O(kn)\) time, where \(k\) is the number of clusters sought. Memory use is \(O(n)\).

As with all distance-based methods (including k-means and DBSCAN), data preprocessing and feature engineering (e.g., feature scaling, feature selection, dimensionality reduction) might lead to more meaningful results.
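As an illustration of why feature scaling can matter, the following sketch standardizes each column to zero mean and unit standard deviation and compares Euclidean distances before and after. The data and the helper name standardize are made up for this example; it is one possible preprocessing step among many, not a recommendation specific to lumbermark.

```python
import math
import statistics

def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation.

    Assumes that no column is constant (otherwise the division fails).
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical data: (height in metres, income in dollars)
rows = [(1.60, 30000.0), (1.90, 30500.0), (1.62, 60000.0)]
scaled = standardize(rows)

# Ratio of the distance between rows 0 and 1 to the distance between rows 0 and 2
raw_ratio = math.dist(rows[0], rows[1]) / math.dist(rows[0], rows[2])
scaled_ratio = math.dist(scaled[0], scaled[1]) / math.dist(scaled[0], scaled[2])
print(raw_ratio, scaled_ratio)
```

On the raw data, income dominates the distance and the large height difference between rows 0 and 1 is practically invisible; after standardization, both features contribute on a comparable scale.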

Parameters:
n_clusters : int, default=2

The number of clusters to detect.

min_cluster_size : int, default=10

Minimum cluster size.

min_cluster_factor : float in [0, 1], default=0.25

Output cluster sizes will not be smaller than min_cluster_factor*n_points/n_clusters.

skip_leaves : bool or 'auto', default='auto'

Whether the MST leaves should be omitted from cluster size counting; "auto" selects True if M > 0.

M : int, default=5

Smoothing factor for the mutual reachability distance [2]. M = 0 and M = 1 select the original distance as given by the metric parameter; see deadwood.MSTBase.

metric : str, default='l2'

The metric used to compute the linkage; see deadwood.MSTBase for more details. Defaults to the Euclidean distance.

quitefastmst_params : dict, default={}

Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".

verbose : bool, default=False

Whether to print diagnostic messages and progress information to stderr.
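As a worked example of the bound stated for min_cluster_factor, the sketch below evaluates min_cluster_factor*n_points/n_clusters for a hypothetical dataset. The helper name is made up for illustration, and how this bound interacts with min_cluster_size is not specified here, so the snippet computes the factor-based bound only.

```python
def factor_based_size_bound(n_points, n_clusters, min_cluster_factor=0.25):
    """Lower bound on cluster sizes implied by min_cluster_factor."""
    return min_cluster_factor * n_points / n_clusters

# Hypothetical setting: 1000 points, 4 clusters, default factor of 0.25
bound = factor_based_size_bound(n_points=1000, n_clusters=4)
print(bound)  # 0.25 * 1000 / 4 = 62.5
```

In other words, with perfectly balanced clusters of 250 points each, the default factor permits clusters down to a quarter of that size.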

Attributes:
labels_ : ndarray

Detected cluster labels.

It is an integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object.

n_clusters_ : int

The actual number of clusters detected by the algorithm.

n_samples_ : int

The number of points in the dataset.

n_features_ : int or None

The number of features in the dataset.
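A label vector in the format described for labels_ can be summarized with standard tools. The sketch below uses a hand-written list as a stand-in for a real result (an ndarray could be converted with its tolist() method first); the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical label vector: labels[i] is the cluster ID of the i-th object
labels = [0, 0, 1, 2, 1, 0, 2, 2, 2]

# Number of objects in each cluster
cluster_sizes = Counter(labels)

# Indices of the objects belonging to each cluster
members = {}
for i, label in enumerate(labels):
    members.setdefault(label, []).append(i)

print(dict(cluster_sizes))  # {0: 3, 1: 2, 2: 4}
print(members[1])           # [2, 4]
```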

References

[1]

M. Gagolewski, Lumbermark, in preparation, 2026, TODO

[2]

R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, https://doi.org/10.1007/978-3-642-37456-2_14

[3]

M. Gagolewski, A. Cena, M. Bartoszuk, Ł. Brzozowski, Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, https://doi.org/10.1007/s00357-024-09483-1

fit(X, y=None)

Performs a cluster analysis of a dataset.

Parameters:
X : object

Typically a matrix or a data frame with n_samples rows and n_features columns; see deadwood.MSTBase.fit_predict for more details.

y : None

Ignored.

Returns:
self : lumbermark.Lumbermark

The object that the method was called on.

Notes

Refer to the labels_ and n_clusters_ attributes for the result.
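A minimal usage sketch based only on the constructor, fit method, and attributes documented above. The helper names and the synthetic data are made up for this example, and the packages are imported inside the function so that the data generator can be used on its own; it assumes that lumbermark and numpy are installed and that a two-dimensional NumPy array is an acceptable input matrix.

```python
import random

def make_two_blobs(n_per_blob=50, seed=123):
    """Generate two well-separated Gaussian blobs in the plane (synthetic data)."""
    rng = random.Random(seed)
    blob_a = [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(n_per_blob)]
    blob_b = [[rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)] for _ in range(n_per_blob)]
    return blob_a + blob_b

def cluster_with_lumbermark(points, n_clusters=2):
    """Fit a Lumbermark model and return (labels_, n_clusters_)."""
    # Lazy imports: only needed when the clustering is actually run
    import numpy as np
    import lumbermark

    X = np.asarray(points, dtype=float)  # n_samples rows, n_features columns
    model = lumbermark.Lumbermark(n_clusters=n_clusters)
    model.fit(X)  # returns the model itself; results are stored as attributes
    return model.labels_, model.n_clusters_
```

With the package installed, `labels, k = cluster_with_lumbermark(make_two_blobs())` gives the label vector and the number of clusters detected.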