dtaidistance.clustering.kmeans

(requires version 2.2.0 or higher)

Time series clustering using k-means and Barycenter averages.

author:Wannes Meert
copyright:Copyright 2020-2022 KU Leuven, DTAI Research Group.
license:Apache License, Version 2.0, see LICENSE for details.
class dtaidistance.clustering.kmeans.KMeans(k, max_it=10, max_dba_it=10, thr=0.0001, drop_stddev=None, nb_prob_samples=None, dists_options=None, show_progress=True, initialize_with_kmedoids=False, initialize_with_kmeanspp=True, initialize_sample_size=None)

K-means clustering algorithm for time series using Dynamic Barycenter Averaging.

Parameters:
  • k – Number of components
  • max_it – Maximal interations for K-means
  • max_dba_it – Maximal iterations for the Dynamic Barycenter Averaging.
  • thr – Convergence is achieved if the averaging iterations differ less then this threshold
  • drop_stddev – When computing the average series per cluster, ignore the instances that are further away than stddev*drop_stddev from the prototype (this is a gradual effect, the algorithm starts with drop_stddev is 3). This is related to robust k-means approaches that use trimming functions.
  • nb_prob_samples – Probabilistically sample best path this number of times.
  • dists_options
  • show_progress
  • initialize_with_kmedoids – Cluster a sample of the dataset first using K-medoids.
  • initialize_with_kmeanspp – Use k-means++
  • initialize_sample_size – How many samples to use for initialization with K-medoids or K-means++. Defaults are k*20 for K-medoid and 2+log(k) for k-means++.
fit(series, use_c=False, use_parallel=True, monitor_distances=None)

Perform K-means clustering.

Parameters:
  • series – Container with series
  • use_c – Use the C-library (only available if package is compiled)
  • use_parallel – Use multipool for parallelization
  • monitor_distances – This function is called with two arguments: (1) a list of (cluster, distance) for each instance; (2) a boolean indicating whether the clustering has been stopped or not. From this one can compute inertia or other metrics to monitor the clustering. If the boolean argument is true, this is the final assignment. If this function returns True, the clustering continues, if False is returned the clustering is stopped.
Returns:

cluster indices, number of iterations If the number of iterations is equal to max_it, the clustering did not converge.

fit_fast(series, monitor_distances=None)
kmeansplusplus_centers(series, use_c=False)

Better initialization for K-Means.

Arthur, D., and S. Vassilvitskii. “k-means++: the, advantages of careful seeding. In, SODA’07: Proceedings of the.” eighteenth annual ACM-SIAM symposium on Discrete, algorithms.

Procedure (in paper):

  • 1a. Choose an initial center c_1 uniformly at random from X.
  • 1b. Choose the next center c_i , selecting c_i = x′∈X with probability D(x’)^2/sum(D(x)^2, x∈X).
  • 1c. Repeat Step 1b until we have chosen a total of k centers.
  • (2-4. Proceed as with the standard k-means algorithm.)

Extension (in conclusion):

  • Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased φ as much as possible.

Detail (in original code):

  • numLocalTries==2+log(k)
Parameters:
  • series
  • use_c
Returns:

kmedoids_centers(series, use_c=False)
plot(filename=None, axes=None, ts_height=10, bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, show_ts_label=None, show_tr_label=None, cmap='viridis_r', ts_color=None)