Sequences

For time series, it is assumed that it is a sequence of numerical values. If this is not the case, the same basic algorithm, dynamic programming, can still be used to find the globally optimal sequence alignment. The only difference is that it requires a custom cost function. In this toolbox the Needleman-Wunsch algorithm is available that works on sequences in general.

Needleman-Wunsch sequence alignment

s1 = "GATTACA"
s2 = "GCATGCU"
from dtaidistance import alignment
value, scores, paths = alignment.needleman_wunsch(s1, s2)
algn, s1a, s2a = alignment.best_alignment(matrix, s1, s2, gap='-')

This will result in the following alignment:

s1a = 'G-ATTACA'
s2a = 'GCA-TGCU'

The matrix representing all possible optimal alignments (paths) and their cost (scores) is

  G A T T A C A
0 -1 -2 -3 -4 -5 -6 -7
G -1 ↖ 1 ← 0 ← -1 ← -2 ←↖ -3 ← -4 ← -5
C -2 ↑ 0 ↖ 0 ↖ 1 ← 0 ← -1 ← -2 ← -3
A -3 ↑ -1 ↑↖ -1 ↑ 0 ↖ 2 ← 1 ← 0 ← -1
T -4 ↑ -2 ↑↖ -2 ↑ -1 ↑↖ 1 ↖ 1 ←↖ 0 ←↖ -1
G -5 ↑ -3 ↑↖ -3 ↖ -1 ↑ 0 ↑↖ 0 ↖ 0 ↖ -1
C -6 ↑ -4 ↖ -2 ↑ -2 ↑ -1 ↑↖ -1 ↖ 1 ← 0
U -7 ↑ -5 ↑ -3 ↖ -1 ←↑ -2 ↑↖ -2 ↑ 0 ↖ 0

If you want to use a custom distance between (some) symbols, you can provide a custom function using the substitution argument to needleman_wunsch. A wrapper is available to translate a dictionary to a function with:

substitution_cost = {('A','G'): 2, ('G', 'A'): 3}
substitution = alignment.make_substitution_fn(substitution_cost)
value, scores, paths = alignment.needleman_wunsch(s1, s2, substitution=substitution)