Welcome to DTAIDistance’s documentation!¶
Library for time series distances (e.g. Dynamic Time Warping) used in the DTAI Research Group. The library offers a pure Python implementation and a faster implementation in C. The C implementation has only Cython as a dependency. It is compatible with Numpy and Pandas and implemented to avoid unnecessary data copy operations.
Source available on https://github.com/wannesm/dtaidistance.
Installation¶
From PyPI¶
This packages is available on PyPI (requires Python 3):
$ pip install dtaidistance
This requires OpenMP to be available on your system. If this is not the case, use:
$ pip install --global-option=--noopenmp dtaidistance
A version compiled without OpenMP (OMP) might raise an exception when parallelization is required. To avoid this exception, you can force the method to use Python’s multiprocessing library for parallelization by providing the –use_mp=True option.
Depending on your system, this might not install the C version. To guarantee installation of the C extensions (which enable much faster DTW alignment), follow the instructions in the “From Source” section below.
Troubleshooting:
If the C-library is not available after compilation you can try the following steps to identify the problem:
- Call the
dtw.try_import_c(verbose=True)
function that will print the status of the package. - Reinstall with
pip install -v --upgrade --force-reinstall --no-build-isolation --no-binary dtaidistance dtaidistance
and calldtw.try_import_c(verbose=True)
again. The--no-build-isolation
is present to use your already installed versions of Cython and Numpy instead of downloading recent versions in an isolation build environment (PEP 517). When you are using an older version of Numpy, the pre-compiled package might trigger binary incompatibility errors.
Troubleshootimg (OMP):
If the OMP library is not detected during compilation, parallel execution in c is not available. If OMP is installed but not found, there is probably an issue with the options given to the compiler. A few variations are available to try alternative options:
# To include -lgomp (when using GOMP instead of OMP)
$ pip install --global-option=--forcegnugcc dtaidistance
# To include -lomp:
$ pip install --global-option=--forcellvm dtaidistance
# To remove the -Xpreprocessor option (can be combined with the above):
$ pip install --global-option=--noxpreprocessor dtaidistance
If problems persist, consider using the Anaconda.org Python environment (see next section) for which precompiled versions are available.
From Conda / Anaconda¶
This package is available on anaconda.org (incuding precompiled binary versions for Linux, Macos, and Windows):
$ conda install -c conda-forge dtaidistance
From Github¶
If you want to install the latest, unreleased version using pip:
$ pip install git+https://github.com/wannesm/dtaidistance.git#egg=dtaidistance
This requires OpenMP to be available on your system. If this is not the case, use:
$ pip install --global-option=--noopenmp git+https://github.com/wannesm/dtaidistance.git#egg=dtaidistance
From source¶
The library can also be compiled and/or installed directly from source.
- Download the source from https://github.com/wannesm/dtaidistance
- Compile the C extensions:
python3 setup.py build_ext --inplace
- Install into your site-package directory:
python3 setup.py install
This requires OpenMP to be available on your system. If this is not the case, use:
$ python3 setup.py --noopenmp build_ext --inplace
In case OpenMP is available but the compiler is unable to detect the library, a few options are available to change the compiler arguments:
--forcegnugcc
: Include the -lgomp argument--forcellvm
: Include the -lomp argument--noxpreprocessor
: Remove the -Xpreprocessor argumentpython3 setup.py -h
: To see al options
Without Numpy
Most of the dtaidistance package works just fine without Numpy. It is required at installation because most deployments require Numpy support (to feed Numpy arrays as input) and therefore the package needs to be compiled with Numpy support.
If you want to remove the Numpy dependency, remove it from pyproject.toml
file.
From C¶
A number of algorithms (DTW, Barycenter averaging) are implemented in C.
They can be called directly from C source code as they do not rely on
Python. All files can be found in dtaidistance/lib/DTAIDistanceC/DTAIDistanceC/
.
An example Makefile and XCode project are available. Example usage can be seen
in the dd_benchmark.c
, dd_tests_dtw.c
, and dd_tests_matrix.c
files.
For example:
$ gcc -c -o DTAIDistanceC/dd_benchmark.o DTAIDistanceC/dd_benchmark.c -Wall -g -Xpreprocessor -fopenmp
$ gcc -c -o DTAIDistanceC/dd_dtw_openmp.o DTAIDistanceC/dd_dtw_openmp.c -Wall -g -Xpreprocessor -fopenmp
$ gcc -c -o DTAIDistanceC/dd_ed.o DTAIDistanceC/dd_ed.c -Wall -g -Xpreprocessor -fopenmp
$ gcc -o dd_benchmark DTAIDistanceC/dd_benchmark.o DTAIDistanceC/dd_dtw.o DTAIDistanceC/dd_dtw_openmp.o DTAIDistanceC/dd_ed.o -Wall -g -Xpreprocessor -fopenmp -lomp
$ ./dd_benchmark
Benchmarking ...
OpenMP is supported
Creating result array of size 17997000
Execution time = 7.000000
Dynamic Time Warping (DTW)¶
from dtaidistance import dtw
from dtaidistance import dtw_visualisation as dtwvis
import numpy as np
s1 = np.array([0., 0, 1, 2, 1, 0, 1, 0, 0, 2, 1, 0, 0])
s2 = np.array([0., 1, 2, 3, 1, 0, 0, 0, 2, 1, 0, 0, 0])
path = dtw.warping_path(s1, s2)
dtwvis.plot_warping(s1, s2, path, filename="warp.png")

DTW Distance Measure Between Two Time Series¶
Only the distance measure based on two sequences of numbers:
from dtaidistance import dtw
s1 = [0, 0, 1, 2, 1, 0, 1, 0, 0]
s2 = [0, 1, 2, 0, 0, 0, 0, 0, 0]
distance = dtw.distance(s1, s2)
print(distance)
The fastest version (30-300 times) uses c directly but requires an array
as input (with the double type), and (optionally) also prunes computations
by setting max_dist
to the Euclidean upper bound:
from dtaidistance import dtw
import array
s1 = array.array('d',[0, 0, 1, 2, 1, 0, 1, 0, 0])
s2 = array.array('d',[0, 1, 2, 0, 0, 0, 0, 0, 0])
d = dtw.distance_fast(s1, s2, use_pruning=True)
Or you can use a numpy array (with dtype double or float):
from dtaidistance import dtw
import numpy as np
s1 = np.array([0, 0, 1, 2, 1, 0, 1, 0, 0], dtype=np.double)
s2 = np.array([0.0, 1, 2, 0, 0, 0, 0, 0, 0], dtype=np.double)
d = dtw.distance_fast(s1, s2, use_pruning=True)
Check the __doc__
for information about the available arguments:
print(dtw.distance.__doc__)
DTW Complexity and Early-Stopping¶
The distance
function has linear space complexity but quadratic
time complexity. To reduce the time complexity, a number of options
are available. The most used approach across DTW implementations is
to use a window that indicates the maximal shift that is allowed (also
known as a Sakoe-Chiba band).
This reduces the complexity to the product of window size and
largest sequence length:
window
: Only allow for shifts up to this amount away from the two diagonals.
A number of other options are foreseen to early stop some or all paths the dynamic programming algorithm is exploring:
max_dist
: Avoid computing partial paths that will be larger than this value. If no solution is found that is smaller or equal to this value, then return infinity.use_pruning
: A good way of pruning partial paths is to setmax_dist
to the Euclidean upper bound. If this option is set to true, this is done automatically.max_step
: Do not allow steps larger than this value, replace them with infinity.max_length_diff
: Return infinity if difference in length of two sequences is larger than this value.
DTW Tuning¶
A number of options are foreseen to tune how the cost is computed:
penalty
: Penalty to add if compression or expansion is applied (on top of the distance).psi
: Up topsi
number of start and end points of a sequence can be ignored if this would lead to a lower distance. This is also called psi-relaxation (for cyclical sequences) [2].
DTW and keep all warping paths¶
If, next to the distance, you also want the full matrix to see all possible warping paths (also called the accumulated cost matrix):
from dtaidistance import dtw
s1 = [0, 0, 1, 2, 1, 0, 1, 0, 0]
s2 = [0, 1, 2, 0, 0, 0, 0, 0, 0]
distance, paths = dtw.warping_paths(s1, s2)
print(distance)
print(paths)
The matrix with all warping paths (or accumulated cost matrix) can be visualised as follows:
from dtaidistance import dtw
from dtaidistance import dtw_visualisation as dtwvis
import random
import numpy as np
x = np.arange(0, 20, .5)
s1 = np.sin(x)
s2 = np.sin(x - 1)
random.seed(1)
for idx in range(len(s2)):
if random.random() < 0.05:
s2[idx] += (random.random() - 0.5) / 2
d, paths = dtw.warping_paths(s1, s2, window=25, psi=2)
best_path = dtw.best_path(paths)
dtwvis.plot_warpingpaths(s1, s2, paths, best_path)

Notice the psi
parameter that relaxes the matching at the beginning
and end. In this example this results in a perfect match even though the
sine waves are slightly shifted.
DTW between multiple Time series¶
To compute the DTW distance measures between all sequences in a list of
sequences, use the method dtw.distance_matrix
. You can speed up the
computation by using the dtw.distance_matrix_fast
method that tries
to run all algorithms in C. Also parallelization can be activated using
the parallel
argument.
The distance_matrix
and distance_matrix_fast
methods expect a
list of lists/arrays:
from dtaidistance import dtw
import numpy as np
timeseries = [
np.array([0, 0, 1, 2, 1, 0, 1, 0, 0], dtype=np.double),
np.array([0.0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0]),
np.array([0.0, 0, 1, 2, 1, 0, 0, 0])]
ds = dtw.distance_matrix_fast(timeseries)
or a matrix (in case all time series have the same length):
from dtaidistance import dtw
import numpy as np
timeseries = np.array([
[0.0, 0, 1, 2, 1, 0, 1, 0, 0],
[0.0, 1, 2, 0, 0, 0, 0, 0, 0],
[0.0, 0, 1, 2, 1, 0, 0, 0, 0]])
ds = dtw.distance_matrix_fast(timeseries)
The result is stored in a matrix representation. Since only the upper
triangular matrix is required, this representation uses more memory then necessary.
This behaviour can be deactivated by setting the argument compact
to
true. The method will then return a 1-dimensional array with all results.
This array represents the concatenation of all upper triangular rows.
DTW between multiple time series, limited to block¶
You can instruct the computation to only fill part of the distance measures matrix. For example to distribute the computations over multiple computing nodes, or to only compare source time series to target time series.
from dtaidistance import dtw
import numpy as np
timeseries = np.array([
[0., 0, 1, 2, 1, 0, 1, 0, 0],
[0., 1, 2, 0, 0, 0, 0, 0, 0],
[1., 2, 0, 0, 0, 0, 0, 1, 1],
[0., 0, 1, 2, 1, 0, 1, 0, 0],
[0., 1, 2, 0, 0, 0, 0, 0, 0],
[1., 2, 0, 0, 0, 0, 0, 1, 1]])
ds = dtw.distance_matrix_fast(timeseries, block=((1, 4), (3, 5)))
The output in this case will be:
# 0 1 2 3 4 5
[[ inf inf inf inf inf inf] # 0
[ inf inf inf 1.4142 0.0000 inf] # 1
[ inf inf inf 2.2360 1.7320 inf] # 2
[ inf inf inf inf 1.4142 inf] # 3
[ inf inf inf inf inf inf] # 4
[ inf inf inf inf inf inf]] # 5
Especially for blocks the matrix representation uses a lot of unnecesary
memory. This can be avoided by setting the compact
argument to true:
from dtaidistance import dtw
import numpy as np
timeseries = np.array([
[0., 0, 1, 2, 1, 0, 1, 0, 0],
[0., 1, 2, 0, 0, 0, 0, 0, 0],
[1., 2, 0, 0, 0, 0, 0, 1, 1],
[0., 0, 1, 2, 1, 0, 1, 0, 0],
[0., 1, 2, 0, 0, 0, 0, 0, 0],
[1., 2, 0, 0, 0, 0, 0, 1, 1]])
ds = dtw.distance_matrix_fast(timeseries, block=((1, 4), (3, 5)), compact=True)
The result will now be:
[1.4142 0.0000 2.2360 1.7320 1.4142]
DTW based on shape¶
If you are interested in comparing only the shape, and not the absolute differences and offset, you need to transform the data first.
z-normalization
Z-normalize is the most popular transformation. This can be achieved
using the SciPy zscore
function:
import numpy as np
a = np.array([0.1, 0.3, 0.2, 0.1])
from scipy import stats
az = stats.zscore(a)
# az = array([-0.90453403, 1.50755672, 0.30151134, -0.90453403])
Differencing
Z-normalization has the disadvantage that constant baselines are not necessarily at the same level. The causes a small error but it accumulates over a long distance. To avoid this, use differencing (see the clustering K-means documentation for a visual example).
series = dtaidistance.preprocessing.differencing(series, smooth=0.1)
Multi-dimensionsal DTW¶
Compare two multi-dimensional sequences.
Assumes the first dimension of the data structure to be the sequence item index (or time series index).
For example, two 2-dimensional series with five timesteps:
from dtaidistance import dtw_ndim
series1 = np.array([[0, 0], # first 2-dim point at t=0
[0, 1], # second 2-dim point at t=1
[2, 1],
[0, 1],
[0, 0]], dtype=np.double)
series2 = np.array([[0, 0],
[2, 1],
[0, 1],
[0, .5],
[0, 0]], dtype=np.double)
d = dtw_ndim.distance(series1, series2)
This method returns the dependent DTW (DTW_D) distance between two n-dimensional sequences. If you want to compute the independent DTW (DTW_I) distance, use the 1-dimensional version:
dtw_i = 0
for dim in range(ndim):
dtw_i += dtw.distance(s1[:,dim], s2[:,dim])
Euclidean Distance (ED)¶
from dtaidistance import ed
s1 = np.array([0., 0, 1, 2, 1, 0, 1, 0, 0, 2, 1, 0, 0])
s2 = np.array([0., 1, 2, 3, 1, 0, 0, 0, 2, 1, 0, 0, 0])
distance = ed.distance(s1, s2)
print(distance)
Different lenghts¶
The Euclidean distance also handles sequences of different lengths by comparing the last element of the shortest series to the remaining elements in the longer series. This is compatible with Euclidean distance being used as an upper bound for DTW.
Clustering¶
Clustering is used to find groups of similar instances (e.g. time series, sequences). Such a clustering can be used to:
- Identify typical regimes or modes of the source being monitored (see for example the cobras package).
- Identify anomalies, outliers or abnormal behaviour (see for example the anomatools package).

Two possible strategies for time series clustering are:
Agglomerative clustering¶
A distance matrix can be used for time series clustering. You can use
existing methods such as scipy.cluster.hierarchy.linkage
or one of
two included clustering methods (the latter is a wrapper for the SciPy
linkage method).
# Custom Hierarchical clustering
model1 = clustering.Hierarchical(dtw.distance_matrix_fast, {})
cluster_idx = model1.fit(timeseries)
# Keep track of full tree by using the HierarchicalTree wrapper class
model2 = clustering.HierarchicalTree(model1)
cluster_idx = model2.fit(timeseries)
# You can also pass keyword arguments identical to instantiate a Hierarchical object
model2 = clustering.HierarchicalTree(dists_fun=dtw.distance_matrix_fast, dists_options={})
cluster_idx = model2.fit(timeseries)
# SciPy linkage clustering
model3 = clustering.LinkageTree(dtw.distance_matrix_fast, {})
cluster_idx = model3.fit(timeseries)
For models that keep track of the full clustering tree
(HierarchicalTree
or LinkageTree
), the tree is available in model.linkage
and
can be visualised (see figure at top of this page):
model2.plot("hierarchy.png")
A number of options are also available to tune the layout of the figure. You can also pass your
own set of axes. The only assumption is that the tree is printed to ax[0]
and the
time series to ax[1]
.
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 10))
show_ts_label = lambda idx: "ts-" + str(idx)
model.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
show_tr_label=True, ts_label_margin=-10,
ts_left_margin=10, ts_sample_length=1)
K-Means DBA clustering¶
K-means clustering for time series requires an averaging strategy for time series. One possibility is DTW Barycenter Averaging (DBA).
Example:
For example, to cluster the Trace dataset by Davide Roverso.
model = KMeans(k=4, max_it=10, max_dba_it=10, dists_options={"window": 40})
cluster_idx, performed_it = model.fit(series, use_c=True, use_parallel=False)

DTW Barycenter Averaging:
If you only want to run DTW Barycenter Averaging once or multiple times:
new_center = dtw_barycenter.dba(series, center, use_c=True)
new_center = dtw_barycenter.dba_loop(series, center, max_it=10, thr=0.0001, use_c=True)
Example with differencing:
For the Trace example above, the clustering is not perfect because the different series have slightly different baselines that cannot be corrected with normalization. This causes an accumulated error that is larger than the subtle sine wave in one of the types of series. A possible solution is to apply differencing on the signals to focus on the changes in the series. Additionally, we also apply a low-pass filter the avoid accumulation of noise.
series = dtaidistance.preprocessing.differencing(series, smooth=0.1)
model = KMeans(k=4, max_it=10, max_dba_it=10, dists_options={"window": 40})
cluster_idx, performed_it = model.fit(series, use_c=True, use_parallel=False)

K-Medoids clustering¶
The distance matrix can also be used for k-medoid time series clustering.
The kmedoids
class from the pyclustering package supports
a distance matrix as input. It is wrapped in the dtaidistance.clustering.medoids.KMedoids
class.
from dtaidistance import dtw, clustering
s = np.array([
[0., 0, 1, 2, 1, 0, 1, 0, 0],
[0., 1, 2, 0, 0, 0, 0, 0, 0],
[1., 2, 0, 0, 0, 0, 0, 1, 1],
[0., 0, 1, 2, 1, 0, 1, 0, 0],
[0., 1, 2, 0, 0, 0, 0, 0, 0],
[1., 2, 0, 0, 0, 0, 0, 1, 1],
[1., 2, 0, 0, 0, 0, 0, 1, 1]])
model = clustering.KMedoids(dtw.distance_matrix_fast, {}, k=3)
cluster_idx = model.fit(s)
model.plot("kmedoids.png")

Active semi-supervised clustering¶
The recommended method for perform active semi-supervised clustering using DTAIDistance is to use the COBRAS for time series clustering: https://github.com/ML-KULeuven/cobras. COBRAS is a library for semi-supervised time series clustering using pairwise constraints, which natively supports both dtaidistance.dtw and kshape.
Subsequence¶
Subsequence search is to match the best occurance of a short time serise in a longer series.
DTW subsequence alignment¶
Given a series:

And a query:

We can find the best occurence(s) as follows:
from dtaidistance.subsequence.dtw import subsequence_alignment
from dtaidistance import dtw_visualisation as dtwvis
sa = subsequence_alignment(query, series)
match = sa.best_match()
startidx, endidx = match.segment
dtwvis.plot_warpingpaths(query, series, sa.warping_paths(), match.path, figure=fig)
The resultig match is

If we compare the best match with the query we see they are similar. The best match is only a little bit compressed.

If you want to find all matches (or the k best):
fig, ax = dtwvis.plot_warpingpaths(query, series, sa.warping_paths(), path=-1)
for kmatch in sa.kbest_matches(9):
dtwvis.plot_warpingpaths_addpath(ax, kmatch.path)

DTW subsequence search¶
Similar to using alignment, we can also iterate over a sequence of series or windows to search for the best match:
from dtaidistance.subsequence.dtw import subsequence_search
k = 3
s = []
w = 22
ws = int(np.floor(w/2))
wn = int(np.floor((len(series) - (w - ws)) / ws))
si, ei = 0, w
for i in range(wn):
s.append(series[si:ei])
si += ws
ei += ws
sa = subsequence_search(query, s)
best = sa.kbest_matches(k=k)
When setting k, the search is pruned to early abandon comparisons that will not improve on the top k best matches.
In the result one can observe that the choice of windows has an impact on where the best matches are found. Whereas the previous alignment method does not require a window size or a shift, here matches are limited to the windows that are given. The advantage of this method is that it can be used also if the windows are not from one continuous series (e.g. periods with missing data, multiple sources).
The best three windows are visualized below. The gray vertical lines indicate the windows, the red verical lines the three best windows.

DTW Local Concurrences¶
In some case we are not interested in searching for a query but to find any or all subsequences that are similar between two series. This is used for example to identify that parts of two series are similar but not necessarily the entire series. Or when comparing a series to itself it produces subsequences (of arbitrary length) that frequenty reappear in the series.
For example below, we can see that one heartbeat in ECG is a common pattern. Sometimes a sequence a few heartbeats is similar to another sequence of heartbeats.
lc = local_concurrences(series, None, estimate_settings=0.7) # second is None to compare to self
# The parameters tau, delta, delta_factor are estimated based on series
paths = []
for match in lc.kbest_matches(k=100, minlen=20, buffer=10):
paths.append(match.path)
fig, ax = dtwvis.plot_warpingpaths(series, series, lc.wp, path=-1)
for path in paths:
dtwvis.plot_warpingpaths_addpath(ax, path)

Sequences¶
For time series, it is assumed that it is a sequence of numerical values. If this is not the case, the same basic algorithm, dynamic programming, can still be used to find the globally optimal sequence alignment. The only difference is that it requires a custom cost function. In this toolbox the Needleman-Wunsch algorithm is available that works on sequences in general.
Needleman-Wunsch sequence alignment¶
s1 = "GATTACA"
s2 = "GCATGCU"
from dtaidistance import alignment
value, scores, paths = alignment.needleman_wunsch(s1, s2)
algn, s1a, s2a = alignment.best_alignment(matrix, s1, s2, gap='-')
This will result in the following alignment:
s1a = 'G-ATTACA'
s2a = 'GCA-TGCU'
The matrix representing all possible optimal alignments (paths
) and their
cost (scores
) is
— | G | A | T | T | A | C | A | |
---|---|---|---|---|---|---|---|---|
— | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 |
G | -1 | ↖ 1 | ← 0 | ← -1 | ← -2 | ←↖ -3 | ← -4 | ← -5 |
C | -2 | ↑ 0 | ↖ 0 | ↖ 1 | ← 0 | ← -1 | ← -2 | ← -3 |
A | -3 | ↑ -1 | ↑↖ -1 | ↑ 0 | ↖ 2 | ← 1 | ← 0 | ← -1 |
T | -4 | ↑ -2 | ↑↖ -2 | ↑ -1 | ↑↖ 1 | ↖ 1 | ←↖ 0 | ←↖ -1 |
G | -5 | ↑ -3 | ↑↖ -3 | ↖ -1 | ↑ 0 | ↑↖ 0 | ↖ 0 | ↖ -1 |
C | -6 | ↑ -4 | ↖ -2 | ↑ -2 | ↑ -1 | ↑↖ -1 | ↖ 1 | ← 0 |
U | -7 | ↑ -5 | ↑ -3 | ↖ -1 | ←↑ -2 | ↑↖ -2 | ↑ 0 | ↖ 0 |
If you want to use a custom distance between (some) symbols, you can provide a custom function
using the substitution
argument to needleman_wunsch
. A wrapper is available to translate
a dictionary to a function with:
substitution_cost = {('A','G'): 2, ('G', 'A'): 3}
substitution = alignment.make_substitution_fn(substitution_cost)
value, scores, paths = alignment.needleman_wunsch(s1, s2, substitution=substitution)
Changelist¶
Version 2.3¶
- Subsequence search and local concurrences
- Parallellization improvements in C-code for >8 threads (thanks to Erlend Kvinge Jørgensen)
Version 2.2¶
- DTW Barycenter Averaging
- K-means DBA clustering
Version 2.1¶
- Various improvements in the C code
- K-medoids clustering
Version 2.0¶
- Numpy is now an optional dependency, also to compile the C library (only Cython is required).
- Small optimizations throughout the C code to improve speed.
- The consistent use of ssize_t instead of int allows for larger data structures on 64 bit machines and be more compatible with Numpy.
- The parallelization is now implemented directly in C (included if OpenMP is installed).
- The max_dist argument turned out to be similar to Silva and Batista’s work on PrunedDTW [7]. The toolbox now implements a version that is equal to PrunedDTW since it prunes more partial distances. Additionally, a use_pruning argument is added to automatically set max_dist to the Euclidean distance, as suggested by Silva and Batista, to speed up the computation.
- Support in the C library for multi-dimensional sequences in the dtaidistance.dtw_ndim package.
dtaidistance.dtw¶
Dynamic Time Warping (DTW)
author: | Wannes Meert |
---|---|
copyright: | Copyright 2017-2022 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
dtaidistance.dtw.
best_path
(paths, row=None, col=None, use_max=False)¶ Compute the optimal path from the nxm warping paths matrix.
Parameters: - row – If given, start from this row (instead of lower-right corner)
- col – If given, start from this column (instead of lower-right corner)
Returns: Array of (row, col) representing the best path
-
dtaidistance.dtw.
best_path2
(paths)¶ Compute the optimal path from the nxm warping paths matrix.
-
dtaidistance.dtw.
distance
(s1, s2, window=None, max_dist=None, max_step=None, max_length_diff=None, penalty=None, psi=None, use_c=False, use_pruning=False, only_ub=False)¶ Dynamic Time Warping.
This function keeps a compact matrix, not the full warping paths matrix.
Uses dynamic programming to compute:
- wps[i, j] = (s1[i]-s2[j])**2 + min(
- wps[i-1, j ] + penalty, // vertical / insertion / expansion wps[i , j-1] + penalty, // horizontal / deletion / compression wps[i-1, j-1]) // diagonal / match
dtw = sqrt(wps[-1, -1])
Parameters: - s1 – First sequence
- s2 – Second sequence
- window – Only allow for maximal shifts from the two diagonals smaller than this number. It includes the diagonal, meaning that an Euclidean distance is obtained by setting window=1.
- max_dist – Stop if the returned values will be larger than this value
- max_step – Do not allow steps larger than this value
- max_length_diff – Return infinity if length of two series is larger
- penalty – Penalty to add if compression or expansion is applied
- psi –
Psi relaxation parameter (ignore start and end of matching). If psi is a single integer, it is used for both start and end relaxations of both series. If psi is a 4-tuple, it is used as the psi-relaxation for
(begin series1, end series1, begin series2, end series2)Useful for cyclical series.
- use_c – Use fast pure c compiled functions
- use_pruning – Prune values based on Euclidean distance. This is the same as passing ub_euclidean() to max_dist
- only_ub – Only compute the upper bound (Euclidean).
Returns: DTW distance
-
dtaidistance.dtw.
distance_fast
(s1, s2, window=None, max_dist=None, max_step=None, max_length_diff=None, penalty=None, psi=None, use_pruning=False, only_ub=False)¶ Same as
distance()
but with different defaults to chose the fast C-based version of the implementation (use_c = True).Note: the series are expected to be arrays of the type
double
. Thusnumpy.array([1,2,3], dtype=numpy.double)
orarray.array('d', [1,2,3])
-
dtaidistance.dtw.
distance_matrix
(s, max_dist=None, use_pruning=False, max_length_diff=None, window=None, max_step=None, penalty=None, psi=None, block=None, compact=False, parallel=False, use_c=False, use_mp=False, show_progress=False, only_triu=False)¶ Distance matrix for all sequences in s.
Parameters: - s – Iterable of series
- max_dist – see
distance()
- use_pruning – Prune values based on Euclidean distance. This is the same as passing ub_euclidean() to max_dist
- max_length_diff – see
distance()
- window – see
distance()
- max_step – see
distance()
- penalty – see
distance()
- psi – see
distance()
- block – Only compute block in matrix. Expects tuple with begin and end, e.g. ((0,10),(20,25)) will only compare rows 0:10 with rows 20:25.
- compact – Return the distance matrix as an array representing the upper triangular matrix.
- parallel – Use parallel operations
- use_c – Use c compiled Python functions
- use_mp – Force use Multiprocessing for parallel operations (not OpenMP)
- show_progress – Show progress using the tqdm library. This is only supported for the pure Python version (thus not the C-based implementations).
- only_triu – Only compute upper traingular matrix of warping paths. This is useful if s1 and s2 are the same series and the matrix would be mirrored around the diagonal.
Returns: The distance matrix or the condensed distance matrix if the compact argument is true
-
dtaidistance.dtw.
distance_matrix_fast
(s, max_dist=None, use_pruning=False, max_length_diff=None, window=None, max_step=None, penalty=None, psi=None, block=None, compact=False, parallel=True, use_mp=False, only_triu=False)¶ Same as
distance_matrix()
but with different defaults to choose the fast parallized C version (use_c = True and parallel = True).This method uses the C-compiled version of the DTW algorithm and uses parallelization. By default this is the OMP C parallelization. If the OMP functionality is not available the parallelization is changed to use Python’s multiprocessing library.
-
dtaidistance.dtw.
distances_array_to_matrix
(dists, nb_series, block=None, only_triu=False)¶ Transform a condensed distances array to a full matrix representation.
The upper triangular matrix will contain all the distances.
-
dtaidistance.dtw.
lb_keogh
(s1, s2, window=None, max_dist=None, max_step=None, max_length_diff=None)¶ Lowerbound LB_KEOGH
-
dtaidistance.dtw.
ub_euclidean
(s1, s2)¶ See ed.euclidean_distance
-
dtaidistance.dtw.
warp
(from_s, to_s, path=None, **kwargs)¶ Warp a function to optimally match a second function.
Parameters: - from_s – First sequence
- to_s – Second sequence
- path – (Optional) Path to use wrap the ‘from_s’ sequence to the ‘to_s’ sequence If provided, this function will use it. If not provided, this function will calculate it using the warping_path function
- kwargs – Same options as
warping_paths()
.
-
dtaidistance.dtw.
warping_amount
(path)¶ Returns the number of compressions and expansions performed to obtain the best path. Can be used as a metric for the amount of warping.
Parameters: path – path to be tested :returns number of compressions or expansions
-
dtaidistance.dtw.
warping_path
(from_s, to_s, **kwargs)¶ Compute warping path between two sequences.
-
dtaidistance.dtw.
warping_path_fast
(from_s, to_s, **kwargs)¶ Compute warping path between two sequences.
-
dtaidistance.dtw.
warping_path_penalty
(s1, s2, penalty_post=0, **kwargs)¶ Dynamic Time Warping with an alternative penalty.
This function supports two different penalties. The traditional DTW penalty penalty is used in the matrix during calculation of the warping path (see
distance()
).The second penalty penalty_post measures the amount of warping. This penalty doesn’t affect the warping path and is added to the DTW distance after the warping for every compression or expansion.
Same options as
warping_paths()
Parameters: - s1 – First sequence
- s2 – Second sequence
- penalty_post – Penalty to be added after path calculation, for compression/extension
:returns [DTW distance, best path, DTW distance between 2 path elements, DTW matrix]
-
dtaidistance.dtw.
warping_path_prob
(from_s, to_s, avg, use_c=True, **kwargs)¶ Compute warping path between two sequences.
-
dtaidistance.dtw.
warping_paths
(s1, s2, window=None, max_dist=None, use_pruning=False, max_step=None, max_length_diff=None, penalty=None, psi=None, psi_neg=True, use_c=False, use_ndim=False)¶ Dynamic Time Warping.
The full matrix of all warping paths (or accumulated cost matrix) is built.
Parameters: - s1 – First sequence
- s2 – Second sequence
- window – see
distance()
- max_dist – see
distance()
- use_pruning – Prune values based on Euclidean distance. This is the same as passing ub_euclidean() to max_dist
- max_step – see
distance()
- max_length_diff – see
distance()
- penalty – see
distance()
- psi – see
distance()
- psi_neg – Replace values that should be skipped because of psi-relaxation with -1.
- use_c – Use the C implementation instead of Python
- use_ndim – The input series is >1 dimensions. Use cost = EuclideanDistance(s1[i], s2[j])
Returns: (DTW distance, DTW matrix)
-
dtaidistance.dtw.
warping_paths_affinity
(s1, s2, window=None, only_triu=False, penalty=None, psi=None, psi_neg=True, gamma=1, tau=0, delta=0, delta_factor=1, exp_avg=None, use_c=False)¶ Dynamic Time Warping warping paths using an affinity/similarity matrix instead of a distance matrix.
The full matrix of all warping paths (or accumulated cost matrix) is built.
Parameters: - s1 – First sequence
- s2 – Second sequence
- window – see
distance()
- only_triu – Only compute upper traingular matrix of warping paths. This is useful if s1 and s2 are the same series and the matrix would be mirrored around the diagonal.
- penalty – see
distance()
- psi – see
distance()
- psi_neg – Replace values that should be skipped because of psi-relaxation with -1.
Returns: (DTW distance, DTW matrix)
-
dtaidistance.dtw.
warping_paths_affinity_fast
(s1, s2, window=None, only_triu=False, penalty=None, psi=None, psi_neg=True, gamma=1, tau=0, delta=0, delta_factor=1, exp_avg=None, compact=False, use_ndim=False)¶ Fast C version of
warping_paths()
.- Additional parameters:
param compact: Return a compact warping paths matrix. Size is ((l1 + 1), min(l2 + 1, abs(l1 - l2) + 2*window + 1)). This option is meant for internal use. For more details, see the C code.
-
dtaidistance.dtw.
warping_paths_fast
(s1, s2, window=None, max_dist=None, use_pruning=False, max_step=None, max_length_diff=None, penalty=None, psi=None, psi_neg=True, compact=False, use_ndim=False)¶ Fast C version of
warping_paths()
.- Additional parameters:
param compact: Return a compact warping paths matrix. Size is ((l1 + 1), min(l2 + 1, abs(l1 - l2) + 2*window + 1)). This option is meant for internal use. For more details, see the C code.
dtaidistance.dtw_visualisation¶
Dynamic Time Warping (DTW) visualisations.
author: | Wannes Meert |
---|---|
copyright: | Copyright 2017-2022 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
dtaidistance.dtw_visualisation.
plot_average
(s1, s2, avg, path1, path2, filename=None, fig=None, ax=None)¶ Plot how s1 and s2 relate to the avg.
Parameters: - s1 – Seq 1.
- s2 – Seq 2.
- path – Average sequence.
- filename – Filename path (optional).
- fig – Matplotlib Figure object
- ax – Matplotlib axes.Axes object
Returns: Figure, axes.Axes
-
dtaidistance.dtw_visualisation.
plot_warp
(from_s, to_s, new_s, path, filename=None, fig=None, axs=None)¶ Plot the warped sequence and its relation to the original sequence and the target sequence.
Parameters: - from_s – From sequence.
- to_s – To sequence.
- new_s – Warped version of from sequence.
- path – Optimal warping path.
- filename – Filename path (optional).
- fig – Matplotlib Figure object
- axs – Array of Matplotlib axes.Axes objects (length == 3)
Returns: Figure, list[Axes]
-
dtaidistance.dtw_visualisation.
plot_warping
(s1, s2, path, filename=None, fig=None, axs=None, series_line_options=None, warping_line_options=None)¶ Plot the optimal warping between to sequences.
Parameters: - s1 – From sequence.
- s2 – To sequence.
- path – Optimal warping path.
- filename – Filename path (optional).
- fig – Matplotlib Figure object
- axs – Array of Matplotlib axes.Axes objects (length == 2)
- series_line_options – Dictionary of options to pass to matplotlib plot None will not pass any options
- warping_line_options – Dictionary of options to pass to matplotlib ConnectionPatch None will use {‘linewidth’: 0.5, ‘color’: ‘orange’, ‘alpha’: 0.8}
Returns: Figure, list[Axes]
-
dtaidistance.dtw_visualisation.
plot_warping_single_ax
(s1, s2, path, filename=None, fig=None, ax=None)¶ Plot the optimal warping between to sequences.
Parameters: - s1 – From sequence.
- s2 – To sequence.
- path – Optimal warping path.
- filename – Filename path (optional).
- fig – Matplotlib Figure object
- ax – Matplotlib axes.Axes object
Returns: Figure, Axes
-
dtaidistance.dtw_visualisation.
plot_warpingpaths
(s1, s2, paths, path=None, filename=None, shownumbers=False, showlegend=False, figure=None, matshow_kwargs=None)¶ Plot the warping paths matrix.
Parameters: - s1 – Series 1
- s2 – Series 2
- paths – Warping paths matrix
- path – Path to draw (typically this is the best path)
- filename – Filename for the image (optional)
- shownumbers – Show distances also as numbers
- showlegend – Show colormap legend
- figure – Matplotlib Figure object
Returns: Figure, Axes
dtaidistance.dtw_ndim¶
Dynamic Time Warping (DTW) for N-dimensional series.
author: | Wannes Meert |
---|---|
copyright: | Copyright 2017-2022 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
dtaidistance.dtw_ndim.
distance
(s1, s2, window=None, max_dist=None, max_step=None, max_length_diff=None, penalty=None, psi=None, use_c=False, use_pruning=False, only_ub=False)¶ (Dependent) Dynamic Time Warping using multidimensional sequences.
Assumes the first dimension to be the sequence item index, and the second dimension to be the series index (thus timestep).
Example:
s1 = np.array([[0, 0], [0, 1], [2, 1], [0, 1], [0, 0]], dtype=np.double) s2 = np.array([[0, 0], [2, 1], [0, 1], [0, .5], [0, 0]], dtype=np.double) d = distance(s1, s2)
See
dtaidistance.dtw.distance()
for parameters.This method returns the dependent DTW (DTW_D) [1] distance between two n-dimensional sequences. If you want to compute the independent DTW (DTW_I) distance, use the 1-dimensional version:
dtw_i = 0 for dim in range(ndim): dtw_i += dtw.distance(s1[:,dim], dtw.distance(s2[:,dim])
Note: If you are using the C-optimized code, the above snippet will trigger a copy operation to guarantee the arrays to be C-ordered and will thus create time and memory overhead. This can be avoided by storing the dimensions as separate arrays or by flipping the array dimensions and use dtw.distance(s1[dim,:], dtw.distance(s2[dim,:]).
[1] M. Shokoohi-Yekta, B. Hu, H. Jin, J. Wang, and E. Keogh. Generalizing dtw to the multi-dimensional case requires an adaptive approach. Data Mining and Knowledge Discovery, 31:1–31, 2016.
-
dtaidistance.dtw_ndim.
distance_fast
(s1, s2, window=None, max_dist=None, max_step=None, max_length_diff=None, penalty=None, psi=None, use_pruning=False, only_ub=False)¶ Fast C version of
distance()
.Note: the series are expected to be arrays of the type
double
. Thusnumpy.array([[1,1],[2,2],[3,3]], dtype=numpy.double)
-
dtaidistance.dtw_ndim.
distance_matrix
(s, ndim=None, max_dist=None, use_pruning=False, max_length_diff=None, window=None, max_step=None, penalty=None, psi=None, block=None, compact=False, parallel=False, use_c=False, use_mp=False, show_progress=False, only_triu=False)¶ Distance matrix for all n-dimensional sequences in s.
This method returns the dependent DTW (DTW_D) [1] distance between two n-dimensional sequences. If you want to compute the independent DTW (DTW_I) distance, use the 1-dimensional version and sum the distance matrices:
dtw_i = dtw.distance_matrix(series_sep_dim[0]) for dim in range(1, ndim): dtw_i += dtw.distance_matrix(series_sep_dim(dim)
Where series_sep_dim is a datastructure that returns a list of the sequences that represents the i-th dimension of each sequence in s.
Parameters: - s – Iterable of series
- window – see
distance()
- max_dist – see
distance()
- max_step – see
distance()
- max_length_diff – see
distance()
- penalty – see
distance()
- psi – see
distance()
- block – Only compute block in matrix. Expects tuple with begin and end, e.g. ((0,10),(20,25)) will only compare rows 0:10 with rows 20:25.
- compact – Return the distance matrix as an array representing the upper triangular matrix.
- parallel – Use parallel operations
- use_c – Use c compiled Python functions
- use_mp – Use Multiprocessing for parallel operations (not OpenMP)
- show_progress – Show progress using the tqdm library. This is only supported for the pure Python version (thus not the C-based implementations).
- only_triu – Only fill the upper triangle
Returns: The distance matrix or the condensed distance matrix if the compact argument is true
[1] M. Shokoohi-Yekta, B. Hu, H. Jin, J. Wang, and E. Keogh. Generalizing dtw to the multi-dimensional case requires an adaptive approach. Data Mining and Knowledge Discovery, 31:1–31, 2016.
-
dtaidistance.dtw_ndim.
distance_matrix_fast
(s, ndim=None, max_dist=None, max_length_diff=None, window=None, max_step=None, penalty=None, psi=None, block=None, compact=False, parallel=True, only_triu=False)¶ Fast C version of
distance_matrix()
.
-
dtaidistance.dtw_ndim.
ub_euclidean
(s1, s2)¶ Euclidean (dependent) distance between two n-dimensional sequences. Supports different lengths.
If the two series differ in length, compare the last element of the shortest series to the remaining elements in the longer series.
Parameters: - s1 – Sequence of numbers, 1st dimension is sequence, 2nd dimension is n-dimensional value vector.
- s2 – Sequence of numbers, 1st dimension is sequence, 2nd dimension is n-dimensional value vector.
Returns: Euclidean distance
-
dtaidistance.dtw_ndim.
warping_path
(from_s, to_s, **kwargs)¶ Compute warping path between two sequences.
-
dtaidistance.dtw_ndim.
warping_paths
(*args, **kwargs)¶ Dynamic Time Warping (keep full matrix) using multidimensional sequences.
See
dtaidistance.dtw.warping_paths()
for parameters.
-
dtaidistance.dtw_ndim.
warping_paths_fast
(*args, **kwargs)¶ Dynamic Time Warping (keep full matrix) using multidimensional sequences.
See
dtaidistance.dtw.warping_paths()
for parameters.
dtaidistance.dtw_barycenter¶
Dynamic Time Warping (DTW) Barycenter
author: | Wannes Meert |
---|---|
copyright: | Copyright 2020-2022 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
dtaidistance.dtw_barycenter.
dba
(s, c, mask=None, samples=None, use_c=False, nb_initial_samples=None, **kwargs)¶ DTW Barycenter Averaging.
F. Petitjean, A. Ketterlin, and P. Gan ̧carski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678–693, 2011.
Parameters: - s – Container of sequences
- c – Initial averaging sequence. If none is given, the first one is used (unless if nb_initial_samples is set). Better performance can be achieved by starting from an informed starting point (Petitjean et al. 2011).
- mask – Boolean array with the series in s to use. If None, use all.
- nb_initial_samples – If c is None, and this argument is not None, select nb_initial_samples samples and select the series closest to all other samples as c.
- use_c – Use a fast C implementation instead of a Python version.
- kwargs – Arguments for dtw.distance
Returns: Bary-center of length len(c).
-
dtaidistance.dtw_barycenter.
dba_loop
(s, c=None, max_it=10, thr=0.001, mask=None, keep_averages=False, use_c=False, nb_initial_samples=None, nb_prob_samples=None, **kwargs)¶ Loop around the DTW Barycenter Averaging (DBA) method until convergence.
Parameters: - s – Container of sequences
- c – Initial averaging sequence. If none is given, the first one is used (unless if nb_initial_samples is set). Better performance can be achieved by starting from an informed starting point (Petitjean et al. 2011).
- max_it – Maximal number of calls to DBA.
- thr – Convergence if the DBA is changing less than this value.
- mask – Boolean array with the series in s to use. If None, use all.
- keep_averages – Keep all DBA values (for visualisation or debugging).
- nb_initial_samples – If c is None, and this argument is not None, select nb_initial_samples samples and select the series closest to all other samples as c.
- nb_prob_samples – Probabilistically sample the best path instead of the deterministic version.
- use_c – Use a fast C implementation instead of a Python version.
- kwargs – Arguments for dtw.distance
dtaidistance.ed¶
Euclidean Distance (ED)
author: | Wannes Meert |
---|---|
copyright: | Copyright 2020 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
dtaidistance.ed.
distance
(s1, s2)¶ Euclidean distance between two sequences. Supports different lengths.
If the two series differ in length, compare the last element of the shortest series to the remaining elements in the longer series. This is compatible with Euclidean distance being used as an upper bound for DTW.
Parameters: - s1 – Sequence of numbers
- s2 – Sequence of numbers
Returns: Euclidean distance
clustering¶
dtaidistance.clustering.hierarchical¶
Time series clustering using hierarchical clustering.
author: | Wannes Meert |
---|---|
copyright: | Copyright 2017-2022 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
class
dtaidistance.clustering.hierarchical.
BaseTree
(**kwargs)¶ Base Tree abstract class.
Returns a datastructure compatible with the Scipy clustering methods:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
-
get_linkage
(node)¶
-
maxnode
¶
-
plot
(filename=None, axes=None, ts_height=10, bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, show_ts_label=None, show_tr_label=None, cmap='viridis_r', ts_color=None)¶ Plot the hierarchy and time series.
Parameters: - filename – If a filename is passed, the image is written to this file.
- axes – If a axes array is passed the image is added to this figure. Expects axes[0] and axes[1] to be present.
- ts_height – Height of a time series
- bottom_margin – Margin on bottom
- top_margin – Margin on top
- ts_left_margin – Margin on left of time series image
- ts_sample_length – Space between two points in the time series
- tr_label_margin – Margin between tree split and label
- tr_left_margin – Left margin for tree
- ts_label_margin – Margin between start of series and label
- show_ts_label – Show label indices. Boolean, callable or subscriptable object. If it is a callable object, the index of the time series will be given and the return string will be printed.
- show_tr_label – Show tree distances. Boolean, callable or subscriptable object. If it is a callable object, the index of the time series will be given and the return string will be printed.
- cmap – Matplotlib colormap name
- ts_color – function that takes the index and returns a color (compatible with the matplotlib.color color argument)
-
to_dot
()¶
-
-
class
dtaidistance.clustering.hierarchical.
Hierarchical
(dists_fun, dists_options, max_dist=inf, merge_hook=None, order_hook=None, show_progress=True)¶ Hierarchical clustering.
Note: This method first computes the entire distance matrix. This is not ideal for extremely large data sets.
Parameters: - dists_fun – Function to compute pairwise distance matrix between set of series.
- dists_options – Arguments to pass to dists_fun.
- max_dist – Do not merge or cluster series that are further apart than this.
- merge_hook – Function that is called when two series are clustered. The function definition is def merge_hook(from_idx, to_idx, distance), where idx is the index of the series.
- order_hook – Function that is called to decide on the next idx out of all shortest distances
- show_progress – Use a tqdm progress bar
Returns: Cluster indices
-
fit
(series)¶ Merge sequences.
Parameters: series – Sequence over series. Returns: Dictionary with as keys the prototype indicices and as values all the indicides of the series in that cluster.
-
plot
(*args, **kwargs)¶
-
class
dtaidistance.clustering.hierarchical.
HierarchicalTree
(model=None, **kwargs)¶ Wrapper to keep track of the full tree that represents the hierarchical clustering.
The linkage tree is available in self.linkage.
Parameters: model – Clustering object. For example of class Hierarchical
. If no model is given, the arguments are identical to those of classHierarchical
.-
fit
(series, *args, **kwargs)¶ Fit a hierarchical clustering tree.
All arguments are passed when calling the model past to __init__. The linkage tree is also available in self.linkage.
Parameters: series – Sequence over time series Returns: Linkage datastructure
-
get_linkage
(node)¶
-
maxnode
¶
-
plot
(filename=None, axes=None, ts_height=10, bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, show_ts_label=None, show_tr_label=None, cmap='viridis_r', ts_color=None)¶ Plot the hierarchy and time series.
Parameters: - filename – If a filename is passed, the image is written to this file.
- axes – If a axes array is passed the image is added to this figure. Expects axes[0] and axes[1] to be present.
- ts_height – Height of a time series
- bottom_margin – Margin on bottom
- top_margin – Margin on top
- ts_left_margin – Margin on left of time series image
- ts_sample_length – Space between two points in the time series
- tr_label_margin – Margin between tree split and label
- tr_left_margin – Left margin for tree
- ts_label_margin – Margin between start of series and label
- show_ts_label – Show label indices. Boolean, callable or subscriptable object. If it is a callable object, the index of the time series will be given and the return string will be printed.
- show_tr_label – Show tree distances. Boolean, callable or subscriptable object. If it is a callable object, the index of the time series will be given and the return string will be printed.
- cmap – Matplotlib colormap name
- ts_color – function that takes the index and returns a color (compatible with the matplotlib.color color argument)
-
to_dot
()¶
-
-
class
dtaidistance.clustering.hierarchical.
Hooks
¶ -
static
create_orderhook
(weights)¶
-
static
create_weighthook
(weights, series)¶
-
static
-
class
dtaidistance.clustering.hierarchical.
LinkageTree
(dists_fun, dists_options=None, method='complete')¶ Hierarchical clustering using the Scipy linkage function.
The linkage tree is available in self.linkage.
This is the same but faster algorithm as available in Hierarchical (~10 times faster). But with less options to steer the clustering (e.g. no possibility to give weights). It still computes the entire distance matrix first and is thus not ideal for extremely large data sets.
Parameters: - dists_fun – Distance funcion, e.g. dtw.distance
- dists_options – Options passed to dists_fun
- method – Linkage method (see scipy.cluster.hierarchy.linkage)
-
fit
(series)¶ Fit a hierarchical clustering tree.
The linkage tree is also available in self.linkage.
Parameters: series – Sequence over time series Returns: Linkage datastructure
-
get_linkage
(node)¶
-
maxnode
¶
-
plot
(filename=None, axes=None, ts_height=10, bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, show_ts_label=None, show_tr_label=None, cmap='viridis_r', ts_color=None)¶ Plot the hierarchy and time series.
Parameters: - filename – If a filename is passed, the image is written to this file.
- axes – If a axes array is passed the image is added to this figure. Expects axes[0] and axes[1] to be present.
- ts_height – Height of a time series
- bottom_margin – Margin on bottom
- top_margin – Margin on top
- ts_left_margin – Margin on left of time series image
- ts_sample_length – Space between two points in the time series
- tr_label_margin – Margin between tree split and label
- tr_left_margin – Left margin for tree
- ts_label_margin – Margin between start of series and label
- show_ts_label – Show label indices. Boolean, callable or subscriptable object. If it is a callable object, the index of the time series will be given and the return string will be printed.
- show_tr_label – Show tree distances. Boolean, callable or subscriptable object. If it is a callable object, the index of the time series will be given and the return string will be printed.
- cmap – Matplotlib colormap name
- ts_color – function that takes the index and returns a color (compatible with the matplotlib.color color argument)
-
to_dot
()¶
dtaidistance.clustering.kmeans¶
(requires version 2.2.0 or higher)
Time series clustering using k-means and Barycenter averages.
author: | Wannes Meert |
---|---|
copyright: | Copyright 2020-2022 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
class
dtaidistance.clustering.kmeans.
KMeans
(k, max_it=10, max_dba_it=10, thr=0.0001, drop_stddev=None, nb_prob_samples=None, dists_options=None, show_progress=True, initialize_with_kmedoids=False, initialize_with_kmeanspp=True, initialize_sample_size=None)¶ K-means clustering algorithm for time series using Dynamic Barycenter Averaging.
Parameters: - k – Number of components
- max_it – Maximal interations for K-means
- max_dba_it – Maximal iterations for the Dynamic Barycenter Averaging.
- thr – Convergence is achieved if the averaging iterations differ less then this threshold
- drop_stddev – When computing the average series per cluster, ignore the instances that are further away than stddev*drop_stddev from the prototype (this is a gradual effect, the algorithm starts with drop_stddev is 3). This is related to robust k-means approaches that use trimming functions.
- nb_prob_samples – Probabilistically sample best path this number of times.
- dists_options –
- show_progress –
- initialize_with_kmedoids – Cluster a sample of the dataset first using K-medoids.
- initialize_with_kmeanspp – Use k-means++
- initialize_sample_size – How many samples to use for initialization with K-medoids or K-means++. Defaults are k*20 for K-medoid and 2+log(k) for k-means++.
-
fit
(series, use_c=False, use_parallel=True)¶ Perform K-means clustering.
Parameters: - series – Container with series
- use_c – Use the C-library (only available if package is compiled)
- use_parallel – Use multipool for parallelization
Returns: cluster indices, number of iterations If the number of iterations is equal to max_it, the clustering did not converge.
-
fit_fast
(series)¶
-
kmeansplusplus_centers
(series, use_c=False)¶ Better initialization for K-Means.
Arthur, D., and S. Vassilvitskii. “k-means++: the, advantages of careful seeding. In, SODA’07: Proceedings of the.” eighteenth annual ACM-SIAM symposium on Discrete, algorithms.Procedure (in paper):
- 1a. Choose an initial center c_1 uniformly at random from X.
- 1b. Choose the next center c_i , selecting c_i = x′∈X with probability D(x’)^2/sum(D(x)^2, x∈X).
- 1c. Repeat Step 1b until we have chosen a total of k centers.
- (2-4. Proceed as with the standard k-means algorithm.)
Extension (in conclusion):
- Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased φ as much as possible.
Detail (in original code):
- numLocalTries==2+log(k)
Parameters: - series –
- use_c –
Returns:
-
kmedoids_centers
(series, use_c=False)¶
-
plot
(filename=None, axes=None, ts_height=10, bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, show_ts_label=None, show_tr_label=None, cmap='viridis_r', ts_color=None)¶
dtaidistance.clustering.medoids¶
Time series clustering using medoid-based methods.
author: | Wannes Meert |
---|---|
copyright: | Copyright 2020 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
class
dtaidistance.clustering.medoids.
KMedoids
(dists_fun, dists_options, k=None, initial_medoids=None, show_progress=True)¶ KMedoids using the PyClustering package.
Novikov, A., 2019. PyClustering: Data Mining Library. Journal of Open Source Software, 4(36), p.1230. Available at: http://dx.doi.org/10.21105/joss.01230.
-
fit
(series)¶
-
plot
(filename=None, axes=None, ts_height=10, bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, show_ts_label=None, show_tr_label=None, cmap='viridis_r', ts_color=None)¶
-
-
class
dtaidistance.clustering.medoids.
Medoids
(dists_fun, dists_options, k, show_progress=True)¶ Parameters: - dists_fun –
- dists_options –
- show_progress –
-
plot
(filename=None, axes=None, ts_height=10, bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, show_ts_label=None, show_tr_label=None, cmap='viridis_r', ts_color=None)¶
subsequence¶
dtaidistance.subsequence.dtw¶
(requires version 2.3.0 or higher)
DTW-based subsequence matching.
author: | Wannes Meert |
---|---|
copyright: | Copyright 2021-2022 KU Leuven, DTAI Research Group. |
license: | Apache License, Version 2.0, see LICENSE for details. |
-
class
dtaidistance.subsequence.dtw.
LocalConcurrences
(series1, series2=None, gamma=1, tau=0, delta=0, delta_factor=1, only_triu=False, penalty=None, window=None)¶ Version identification based on local concurrences.
Find recurring patterns across two time series. Used to identify whether one time series is a version of another. If the two time series are the same one, it can be used to find typical or frequent patterns in a time series.
Based on 7.3.2 Identification Procedure in Fundamentals of Music Processing, Meinard Müller, Springer, 2015.
Different from the original formulation, D_tau is introduced based on the given delta factor. This makes the penalty less sensitive to the cumulative effect of the paths in the self-similarity matrix S:
- S_tau(n,m) = S(n,m) if S(n,m) >= tau (with tau >= 0)
- delta if S(n,m) < tau (with delta <= 0)
And for the accumulated score matrix D:
- D_tau(n,m) = max(0,
- df * D_tau(n−1,m−1) + S_tau(n,m), df * D_tau(n−1,m) + S_tau(n,m), df * D_tau(n,m−1) + S_tau(n,m))
where df = 1 if S(n,m) >= tau and df=delta_factor (<=1) otherwise,
Parameters: - series1 – First time series.
- series2 – Second time series. If empty, series1 is used and compared with itself.
- gamma – Affinity transformation exp(-gamma*(s1[i] - s2[j])**2), should be >0
- tau – threshold parameter, should be >= 0
- delta – penalty parameter, should be <= 0
- delta_factor – penalty factor parameter, should be <= 1
- only_triu – Only consider upper triangular matrix in warping paths.
-
align
()¶ Returns:
-
best_match
()¶
-
best_path
(row, col)¶
-
estimate_settings_from_std
(series, tau_std=0.33)¶ Parameters: - series –
- tau_std – Set tau to differences larger than tau_std time standard deviation of the given series (default is 0.33, or reject differences that are larger than the deviation wrt to the mean of 75% of the values in the series, assuming a normal distribution).
Returns:
-
kbest_matches
(k=1, minlen=2, buffer=0)¶ Yields the next best match. Stops at k matches (use None for all matches).
Parameters: - k – Number of matches to yield. None is all matches.
- minlen – Consider only matches of length longer than minlen
- buffer – Matches cannot be closer than buffer to each other.
Returns: Yield an LCMatch object
-
reset
()¶
-
wp
¶
-
class
dtaidistance.subsequence.dtw.
SAMatch
(idx, alignment)¶ SubsequenceAlignment match
-
distance
¶ DTW distance of match.
This value is dependent on the length of the query. Use the value property when comparing queries of different lengths.
-
path
¶ Matched path in series
-
segment
¶ Matched segment in series.
-
value
¶ Normalized DTW distance of match.
Normalization is the DTW distance divided by the query length.
-
-
class
dtaidistance.subsequence.dtw.
SSMatch
(idx, ss)¶ Found match by SubsequenceSearch.
The match is identified by the idx property, which is the index of the matched series in the original list of series. The distance property returns the DTW distance between the query and the series at index idx.
-
distance
¶ DTW distance.
-
value
¶ Normalized DTW distance.
-
-
class
dtaidistance.subsequence.dtw.
SubsequenceAlignment
(query, series, penalty=0.1, use_c=False)¶ Subsequence alignment using DTW. Find where the query occurs in the series.
Based on Fundamentals of Music Processing, Meinard Müller, Springer, 2015.
Example: query = np.array([1., 2, 0]) series = np.array([1., 0, 1, 2, 1, 0, 2, 0, 3, 0, 0]) sa = subsequence_search(query, series) mf = sa.matching_function() sa.kbest_matches(k=2)
Parameters: - query – Subsequence to search for
- series – Long sequence in which to search
- penalty – Penalty for non-diagonal matching
- use_c – Use the C-based DTW function if available
-
align
()¶
-
align_fast
()¶
-
best_match
()¶
-
best_match_fast
()¶
-
get_match
(idx)¶
-
kbest_matches
(k=1, overlap=0)¶ Yields the next best match. Stops at k matches (use None for all matches).
Parameters: - k – Number of matches to yield. None is all matches.
- overlap – Matches cannot overlap unless overlap > 0.
Returns: Yield an SAMatch object
-
kbest_matches_fast
(k=1, overlap=0)¶
-
matching_function
()¶ The matching score for each end-point of a possible match.
-
matching_function_bestpath
(idx)¶ Indices in series for best path for match in matching function at idx.
Parameters: idx – Index in matching function Returns: List of (row, col)
-
matching_function_endpoint
(idx)¶ Index in series for end of match in matching function at idx.
Parameters: idx – Index in matching function Returns: Index in series
-
matching_function_segment
(idx)¶ Matched segment in series.
-
matching_function_startpoint
(idx)¶ Index in series for start of match in matching function at idx.
Parameters: idx – Index in matching function Returns: Index in series
-
reset
()¶
-
warping_paths
()¶ Get matrix with all warping paths.
If the aligmnent was computed using a compact, the paths are first copied into a full warping paths matrix.
Returns: Numpy matrix of size (len(query)+1) * (len(series)+1)
-
class
dtaidistance.subsequence.dtw.
SubsequenceSearch
(query, s, dists_options=None)¶ Search the best matching (subsequence) time series compared to a given time series.
Parameters: - query – Time series to search for
- s – Iterator over time series to perform search on. This can be for example windows over a long time series.
- dists_options – Options passed on to dtw.distance
-
align
(k=None)¶
-
align_fast
(k=None)¶
-
best_match
()¶
-
best_match_fast
()¶
-
kbest_matches
(k=1)¶
-
kbest_matches_fast
(k=1)¶
-
reset
()¶
-
dtaidistance.subsequence.dtw.
local_concurrences
(series1, series2=None, gamma=1, tau=0, delta=0, delta_factor=1, estimate_settings=None, only_triu=False, penalty=None, window=None)¶ Local concurrences, see LocalConcurrences.
Parameters: - series1 –
- series2 –
- gamma – Affinity transformation exp(-gamma*(s1[i] - s2[j])**2)
- tau – threshold parameter
- delta – penalty parameter Should be negative. Added instead of the affinity score (if score below tau threshold parameter).
- delta_factor – multiply cumulative score (e.g. by 0.5). This is useful to have the same impact at different locations in the warping paths matrix, which is cumulative (and thus typically large in one corner and small in the opposite corner).
- estimate_settings – Estimate tau, delta, delta_factor from given series. Will be passed as tau_std to estimate_settings_from_std.
- only_triu – Only compute the upper traingle matrix values. Useful to avoid redundant computations when series1 is equal to series2 (or equivalently if series2 is None).
- penalty – Penalty that is added when dynamic programming is using moving vertically or horizontally through the matrix instead of diagonally. Used to prefer diagonal paths.
Returns:
-
dtaidistance.subsequence.dtw.
subsequence_alignment
(query, series, use_c=False)¶ See SubsequenceAligment.
Parameters: - query –
- series –
Returns:
-
dtaidistance.subsequence.dtw.
subsequence_search
(query, series, dists_options=None)¶ See SubsequenceSearch.
Parameters: - query – Time series to search for
- series – Iterator over time series to perform search on. This can be for example windows over a long time series.
- dists_options – Options passed on to dtw.distance
Returns: SubsequenceSearch object