(requires version 2.2.0 or higher)

Time series clustering using k-means and Barycenter averages.

author:Wannes Meert
copyright:Copyright 2020-2022 KU Leuven, DTAI Research Group.
license:Apache License, Version 2.0, see LICENSE for details.
class dtaidistance.clustering.kmeans.KMeans(k, max_it=10, max_dba_it=10, thr=0.0001, drop_stddev=None, nb_prob_samples=None, dists_options=None, show_progress=True, initialize_with_kmedoids=False, initialize_with_kmeanspp=True, initialize_sample_size=None)

K-means clustering algorithm for time series using Dynamic Barycenter Averaging.

  • k – Number of components
  • max_it – Maximal interations for K-means
  • max_dba_it – Maximal iterations for the Dynamic Barycenter Averaging.
  • thr – Convergence is achieved if the averaging iterations differ less then this threshold
  • drop_stddev – When computing the average series per cluster, ignore the instances that are further away than stddev*drop_stddev from the prototype (this is a gradual effect, the algorithm starts with drop_stddev is 3). This is related to robust k-means approaches that use trimming functions.
  • nb_prob_samples – Probabilistically sample best path this number of times.
  • dists_options
  • show_progress
  • initialize_with_kmedoids – Cluster a sample of the dataset first using K-medoids.
  • initialize_with_kmeanspp – Use k-means++
  • initialize_sample_size – How many samples to use for initialization with K-medoids or K-means++. Defaults are k*20 for K-medoid and 2+log(k) for k-means++.
fit(series, use_c=False, use_parallel=True)

Perform K-means clustering.

  • series – Container with series
  • use_c – Use the C-library (only available if package is compiled)
  • use_parallel – Use multipool for parallelization

cluster indices, number of iterations If the number of iterations is equal to max_it, the clustering did not converge.

kmeansplusplus_centers(series, use_c=False)

Better initialization for K-Means.

Arthur, D., and S. Vassilvitskii. “k-means++: the, advantages of careful seeding. In, SODA’07: Proceedings of the.” eighteenth annual ACM-SIAM symposium on Discrete, algorithms.

Procedure (in paper):

  • 1a. Choose an initial center c_1 uniformly at random from X.
  • 1b. Choose the next center c_i , selecting c_i = x′∈X with probability D(x’)^2/sum(D(x)^2, x∈X).
  • 1c. Repeat Step 1b until we have chosen a total of k centers.
  • (2-4. Proceed as with the standard k-means algorithm.)

Extension (in conclusion):

  • Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased φ as much as possible.

Detail (in original code):

  • numLocalTries==2+log(k)
  • series
  • use_c

kmedoids_centers(series, use_c=False)
plot(filename=None, axes=None, ts_height=10, bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, show_ts_label=None, show_tr_label=None, cmap='viridis_r', ts_color=None)