Python Package lumbermark Reference
Lumbermark [1] is a fast and robust divisive clustering algorithm which identifies a specified number of clusters.
lumbermark Python Package
- class lumbermark.Lumbermark(n_clusters=2, *, min_cluster_size=10, min_cluster_factor=0.25, skip_leaves='auto', M=5, metric='l2', quitefastmst_params={}, verbose=False)
Lumbermark [1] is a fast and robust divisive clustering algorithm which identifies a specified number of clusters.
It iteratively chops off sizeable limbs that are joined by protruding segments of a dataset’s mutual reachability minimum spanning tree.
The use of a mutual reachability distance [2] pulls peripheral points farther away from each other. When combined with the deadwood package, it can act as an outlier detector.
Once the spanning tree is determined (which takes between \(\Omega(n \log n)\) and \(O(n^2)\) time), the Lumbermark algorithm runs in \(O(kn)\) time, where \(k\) is the number of clusters sought. Memory use is \(O(n)\).
As with all distance-based methods (including k-means and DBSCAN), applying data preprocessing and feature engineering techniques (e.g., feature scaling, feature selection, dimensionality reduction) might lead to more meaningful results.
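The mutual reachability distance mentioned above can be illustrated with a small pure-Python sketch. It assumes that the core distance of a point is the distance to its M-th nearest neighbour (the point itself excluded), so that M = 0 and M = 1 both reduce to the original distance; the exact convention and tie handling used by the library may differ. The function names `core_distances` and `mutual_reachability` are illustrative and not part of lumbermark or deadwood, and the quadratic loops are meant for reading, not for speed.

```python
import math


def core_distances(points, M):
    """Distance from each point to its M-th nearest neighbour (self excluded).

    Assumed convention: M <= 0 yields a core distance of zero.
    """
    if M <= 0:
        return [0.0] * len(points)
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[M - 1])
    return cores


def mutual_reachability(points, M):
    """Full matrix of max(d(i, j), core(i), core(j)) with a zero diagonal."""
    core = core_distances(points, M)
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = max(math.dist(points[i], points[j]), core[i], core[j])
            d[i][j] = d[j][i] = value
    return d


# Three nearby points and one peripheral point on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
plain = mutual_reachability(points, M=1)     # same as the Euclidean distances
smoothed = mutual_reachability(points, M=2)  # peripheral point is pushed away
```

In this toy dataset, the distance between the third and the fourth point grows from 8 to 9 when M = 2, because the peripheral point's second nearest neighbour is 9 units away.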
- Parameters:
- n_clusters : int
The number of clusters to detect.
- min_cluster_size : int, default=10
Minimal cluster size.
- min_cluster_factor : float in [0,1], default=0.25
Output cluster sizes will not be smaller than min_cluster_factor*n_points/n_clusters.
- skip_leaves : bool, default='auto'
Whether the MST leaves should be omitted from cluster size counting; "auto" selects True if M > 0.
- M : int, default=5
Smoothing factor for the mutual reachability distance [2]. M = 0 and M = 1 select the original distance as given by the metric parameter; see deadwood.MSTBase.
- metric : str, default='l2'
The metric used to compute the linkage; see deadwood.MSTBase for more details. Defaults to the Euclidean distance.
- quitefastmst_params : dict
Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".
- verbose : bool, default=False
Whether to print diagnostic messages and progress information onto stderr.
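As a worked example of the min_cluster_factor bound described in the parameter list, the following sketch evaluates min_cluster_factor*n_points/n_clusters for a given dataset size. The helper name `factor_size_bound` is hypothetical. How the library rounds this value, and how it combines it with min_cluster_size, is not stated here, so the sketch only computes the raw quantity.

```python
def factor_size_bound(n_points, n_clusters, min_cluster_factor=0.25):
    """Raw lower bound on cluster sizes implied by min_cluster_factor."""
    if not 0.0 <= min_cluster_factor <= 1.0:
        raise ValueError("min_cluster_factor must lie in [0, 1]")
    if n_clusters < 1:
        raise ValueError("n_clusters must be positive")
    return min_cluster_factor * n_points / n_clusters


# 1000 points split into 4 clusters with the default factor of 0.25.
bound = factor_size_bound(n_points=1000, n_clusters=4)
```

With the default factor, each of the four clusters is thus expected to hold at least a quarter of the 250 points that a perfectly balanced split would give it.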
- Attributes:
- labels_ : ndarray
Detected cluster labels.
It is an integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object.
- n_clusters_ : int
The actual number of clusters detected by the algorithm.
- n_samples_ : int
The number of points in the dataset.
- n_features_ : int or None
The number of features in the dataset.
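Since labels_ holds one cluster ID per object, with IDs ranging from 0 to n_clusters_ - 1, cluster sizes can be tallied directly from it. The sketch below uses a plain list standing in for labels_ (an ndarray could be converted with its tolist() method first); `cluster_sizes` is an illustrative helper, not a function provided by the package.

```python
from collections import Counter


def cluster_sizes(labels, n_clusters):
    """Number of objects assigned to each cluster ID 0 .. n_clusters - 1."""
    counts = Counter(labels)
    return [counts.get(k, 0) for k in range(n_clusters)]


# Stand-in for a fitted model's labels_ with n_clusters_ == 3.
labels = [0, 0, 1, 2, 1, 0]
sizes = cluster_sizes(labels, 3)
```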
References
[2] R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, https://doi.org/10.1007/978-3-642-37456-2_14
[3] M. Gagolewski, A. Cena, M. Bartoszuk, Ł. Brzozowski, Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, https://doi.org/10.1007/s00357-024-09483-1
- fit(X, y=None)
Performs a cluster analysis of a dataset.
- Parameters:
- X : object
Typically a matrix or a data frame with n_samples rows and n_features columns; see deadwood.MSTBase.fit_predict for more details.
- y : None
Ignored.
- Returns:
- self : lumbermark.Lumbermark
The object that the method was called on.
Notes
Refer to the labels_ and n_clusters_ attributes for the result.
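Because fit returns the object it was called on, the fitting call and the attribute lookups can be chained. The helper below is a minimal sketch of that pattern; it relies only on the fit method and the labels_ and n_clusters_ attributes documented above, and the name `fit_and_summarise` is hypothetical.

```python
def fit_and_summarise(model, X):
    """Fit an estimator with the interface documented above and return
    its cluster labels together with the number of clusters found."""
    fitted = model.fit(X)  # documented to return the estimator itself
    return fitted.labels_, fitted.n_clusters_
```

With the package installed, a call such as fit_and_summarise(lumbermark.Lumbermark(n_clusters=3), X) would be the expected way to use it, where X is a matrix or a data frame as described for fit.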