Python Package lumbermark Reference


lumbermark Python Package

class lumbermark.Lumbermark(n_clusters=2, *, min_cluster_size=10, min_cluster_factor=0.25, skip_leaves='auto', M=5, metric='l2', quitefastmst_params={}, verbose=False)

Lumbermark [1] is a fast and robust divisive clustering algorithm which identifies a specified number of clusters.

It iteratively chops off sizeable limbs that are joined by protruding segments of a dataset’s mutual reachability minimum spanning tree.

The use of a mutual reachability distance [2] pulls peripheral points farther away from each other. When combined with the deadwood package, it can act as an outlier detector.
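As a rough illustration of the mutual reachability distance, the sketch below computes it by brute force with NumPy. It assumes that the core distance of a point is the distance to its M-th nearest neighbour (not counting the point itself), a convention under which M = 1 leaves the original distances unchanged. The function name `mutual_reachability` is hypothetical; it is not part of lumbermark or deadwood, whose exact definition may differ.

```python
import numpy as np


def mutual_reachability(X, M):
    """Brute-force O(n^2) mutual reachability distances (illustration only)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    if M <= 0:
        return D  # no smoothing requested
    # Assumed core distance: distance to the M-th nearest neighbour.
    # Column 0 of each sorted row is the point itself (distance 0).
    core = np.sort(D, axis=1)[:, M]
    # d_M(i, j) = max(d(i, j), core(i), core(j)) for i != j.
    R = np.maximum(D, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(R, 0.0)
    return R


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
D = mutual_reachability(X, 0)
R1 = mutual_reachability(X, 1)
R5 = mutual_reachability(X, 5)
print(np.allclose(R1, D))     # checks that M = 1 reproduces the plain distances
print(bool((R5 >= D).all()))  # checks that smoothing never shrinks a distance
```

Points in sparse regions have large core distances, so their distances to all other points are inflated, which is one way to read the "pulls peripheral points farther away" remark.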

Once the minimum spanning tree has been computed (which takes between \(\Omega(n \log n)\) and \(O(n^2)\) time), the Lumbermark algorithm itself runs in \(O(kn)\) time, where \(n\) is the number of points and \(k\) is the number of clusters sought. Memory use is \(O(n)\).

As with all distance-based methods (this includes k-means and DBSCAN as well), applying data preprocessing and feature engineering techniques (e.g., feature scaling, feature selection, dimensionality reduction) might lead to more meaningful results.
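A minimal sketch of one such preprocessing step, feature scaling, is given below. It uses plain NumPy and synthetic data; the helper name `standardize` is hypothetical, and whether standardisation is appropriate depends on the dataset at hand.

```python
import numpy as np


def standardize(X):
    """Centre each column and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns unscaled
    return (X - mu) / sigma


rng = np.random.default_rng(1)
# Two features on very different scales (synthetic data).
X_raw = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
X_scaled = standardize(X_raw)
print(X_scaled.std(axis=0))
```

Without scaling, Euclidean distances on `X_raw` would be dominated by the second column.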

Parameters:

n_clusters (int, default=2): The number of clusters to detect.
min_cluster_size (int, default=10): Minimal cluster size.
min_cluster_factor (float in [0, 1], default=0.25): Output cluster sizes will not be smaller than min_cluster_factor * n_points / n_clusters.
skip_leaves (bool or 'auto', default='auto'): Whether the MST leaves should be omitted from cluster size counting; "auto" selects True if M > 0.
M (int, default=5): Smoothing factor for the mutual reachability distance [2]. M = 0 and M = 1 select the original distance as given by the metric parameter; see deadwood.MSTBase.
metric (str, default='l2'): The metric used to compute the linkage; see deadwood.MSTBase for more details. Defaults to the Euclidean distance.
quitefastmst_params (dict, default={}): Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".
verbose (bool, default=False): Whether to print diagnostic messages and progress information onto stderr.
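To make the relative size bound concrete, the following short calculation evaluates min_cluster_factor * n_points / n_clusters for a hypothetical dataset of 1000 points split into 4 clusters. The numbers are made up for illustration, and how this bound interacts with min_cluster_size is not stated here, so the two are shown side by side rather than combined.

```python
# Hypothetical problem size; the parameter values are the documented defaults
# except for n_clusters.
n_points = 1000
n_clusters = 4
min_cluster_size = 10
min_cluster_factor = 0.25

# Relative bound from the min_cluster_factor description.
relative_bound = min_cluster_factor * n_points / n_clusters

print(relative_bound)    # 62.5
print(min_cluster_size)  # 10
```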

Attributes:

labels_ (ndarray)

Detected cluster labels.

It is an integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object.

n_clusters_ (int)

The actual number of clusters detected by the algorithm.

n_samples_ (int)

The number of points in the dataset.

n_features_ (int or None)

The number of features in the dataset.
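The sketch below shows one way to work with a label vector of the form described for labels_. The array here is a hand-written stand-in rather than the output of an actual fit, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for labels_: cluster IDs between 0 and n_clusters_ - 1.
labels = np.array([0, 0, 1, 2, 1, 0, 2, 2, 2])
n_clusters = int(labels.max()) + 1

# Number of points assigned to each cluster.
sizes = np.bincount(labels, minlength=n_clusters)

# Indices of the points that belong to each cluster.
members = [np.flatnonzero(labels == c) for c in range(n_clusters)]

print(sizes)       # [3 2 4]
print(members[1])  # [2 4]
```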

References

[1] M. Gagolewski, Lumbermark, in preparation, 2026, TODO

[2] R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, https://doi.org/10.1007/978-3-642-37456-2_14

[3] M. Gagolewski, A. Cena, M. Bartoszuk, Ł. Brzozowski, Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, https://doi.org/10.1007/s00357-024-09483-1

fit(X, y=None)

Performs a cluster analysis of a dataset.

Parameters:

X (object): Typically a matrix or a data frame with n_samples rows and n_features columns; see deadwood.MSTBase.fit_predict for more details.
y (None): Ignored.

Returns:

self (lumbermark.Lumbermark): The object that the method was called on.

Notes

Refer to the labels_ and n_clusters_ attributes for the result.
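A minimal usage sketch follows, based only on the signature and attributes documented on this page. The data are synthetic, and the import is guarded so that the snippet still runs (and simply skips the clustering step) when lumbermark is not installed.

```python
import numpy as np

rng = np.random.default_rng(123)
# Two well-separated Gaussian blobs (synthetic data).
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=8.0, scale=1.0, size=(100, 2)),
])

try:
    import lumbermark
except ImportError:
    lumbermark = None  # package not installed; the rest is skipped

labels = None
if lumbermark is not None:
    model = lumbermark.Lumbermark(n_clusters=2, M=5)
    # fit() returns the model itself, so the call can be chained.
    labels = model.fit(X).labels_
    print(model.n_clusters_)
    print(np.unique(labels, return_counts=True))
```

Because n_clusters_ is described as the actual number of clusters detected, it may be worth comparing it against the requested n_clusters after fitting.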