Motivation

Inexact matching

Algorithms

**Recovering structural similarity**Some diseases are caused by lack of parts of the protein sequences in an organism. In this case, we would like to produce sequences with the same functionality, then insert them into the organism, to compensate its lack. Finding the sequences with the same functionality is, finally, finding similar sequences. This technique is used in Drug Design.

**Comparison of a new sequence vs. a database**When biologists discover a new protein sequence, they would like to compare it to a database of already discovered sequences and find those sequences which are

*similar*(does not have to be exactly identical) to the newly discovered one. Sequences in database are known, their features are defined and described. Then, if the new sequence is similar to some from database, it means that they have similar structure; therefore, they have similar functionality and similar features.**Detection of pattern**Sometimes, biologists need to distribute sequences into classes or families. If some class is specified by one or more sub-sequences (patterns), then to define whether a sequence belongs to the class is just to find a sequence segment that is similar to the pattern.

**Evolution**Given two sequences it is possible to define evolution process, to find segments, which are predisposed to some kind of mutations, or to find segments, which preserve their structures. This might allow to predict the future gene mutation and to suppose the possible source of disease.

**Gene mutations**The concept of

*similarity*between biological sequences is different from mathematical approach. From a biological point of view, two amino acids are*similar*if they have same structure (size and shape). They don't have to be identical. This phenomena is called*substitution*and caused by gene mutation. There are more two kind of mutation:*deletion*and*insertion*. Deletion is disappearance of one or more amino acids from protein sequence. On the contrary, insertion is presence of one or more new amino acids in protein sequence.**Global alignment**Given two sequences define what is the similarity between them.

**Motivation for protein sequence alignment
**Sequence comparison is widespread in Biology and serves different
purposes. This usage is based on the fact that the functionality of protein
is defined by its structure, and the protein structure is defined by the
sequence of amino acids.

Back to index

**Inexact matching
**Since the databases are large and each sequence is long, biologists
need an efficient way to compare pairs of sequences. There are three models
for inexact matching:

*example:*

AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP |:| ||:|:| : |: |::| | ||:|:: : : :|: || AGSPGSSGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKGEMGP [-------------------alignment--------------------]

*example:*

AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP ||:|:| : |: |::| | ||:|:: : : :| SGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKG [-------------alignment----------------]

**Local alignment**Given two sequences define most similar substrings of the two.

*example:*

AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRSDAR ||:|:| : |: |::| | ||:|:: | PGSSGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKGEMGP [-------------alignment-----------]

To be able compare pairs of sequences and to decide which of
comparison better, we need to suggest an absolute way to calculate the
similarity between protein sequences. Given two sequences, *Edit distance*
will return the cost of transformation from one sequence into second.

*S*is the alphabet.*S**is the set of all finite sequences of symbols from*S*.*a,b,c*are variables denoting individual symbols.*A,B,C*are variables denoting sequences from*S**.*|A|*denotes the length of the sequence*A*.*A(i,j)*denotes a subsequence from the*i-th*symbol to the*j-th*symbol in the sequence*A*(including bounds).*a[i]*denotes a*i-th*symbol in the sequence*A*.

**Definition:**Substituting*a[i]*(in*A*) by*b[j]*(in*B*). Also*(a,a)*denote a match.

`Sub(a[i],b[j]) or (a[i],b[j]) `

**Definition:**Inserting the character*b[j]*in sequence*A*in place`i.`

I(i,b[j]) or (i,b[j])

**Definition:**-Deleting the symbol*a[i]*in sequence*A*.

D(a[i],-) or (a[i],-)

**
**Notice, that the problem is symmetric: a deletion in A can be seen
as an insertion in B, and vice versa.

- Each editing operation will be assigned a weight. These weights represent
difference between two characters or penalties for insertions/deletions.

**Definition:**- a transformation of sequence__T(A,B)__*A*to sequence*B*will be the set of editing operations applied to one of the sequences iteratively, which transform*A*into*B*.**Definition:**- the weight of transformation will be defined as the total sum of the weights of the editing operations in__W(T(A,B))__`T(A,B).`**Definition:**- editing distance:__D(A,B)__

** D(A,B) = min _{T(A,B)}{W(T(A,B))}**

Editing distance D(A,B) is a metric. It has the following characteristics:

- D(A,B) = 0 => A = B

- D(A,B) = D(B,A)

- for each C, D(A,B) <= D(A,C) + D(C,B)

**Example of Editing Distance**The sequences are

*A*= "industry" ( |A| = 8, A(2,5) = ndus, a[7] = r) and*B*= "interest" ( |B| = 8, B(3,8) = terest, b[2] = n). The editing operations for*A->B*are:

D(a[3],-) = inustry D(a[3],-) = instry Sub(a[6],b[7]) = instrs D(a[3],-) = intrs I(4,b[4]) = inters I(6,b[6]) = interes I(8,b[8]) = interest

The alignment of this two sequence is:

INDUST-R-Y IN--TEREST

Back to index

**Dynamic Programming**

Dynamic programming is the method for solving different sophisticated
programming problems. Dynamic programming solves problems by combining
solutions to subproblems. It is similar to calculating a recursive
formula
where each step of the calculation is based on the results of previous
steps.
So, we need to keep only the last results and not the whole result list.

Using dynamic programming, the problem of finding edit distance given two
sequences of length *n* and *m* can be solved with time complexity
*O(n*m)* and space complexity *O(n+m)*.

Let's consider the problem as a matrix (edit distance matrix), where sequence
*A* is located on *X* axis and sequence *B* is located on
*Y* axis. The matrix dimensions are *(n+1)x(m+1)*. We add one
row and one column for initial conditions.

Then, the optimal edit distance will be presented as a path into the matrix,
where each cell *(i+1,j+1)* contains the value of edit distance between
subsequences *A(1,j)* and *B(1,j)*. Obviously, the edit distance
between sequences A and B is located in the cell *(n+1,m+1)*.

The recursive formula of edit distance is:

D(i-1,j) + W(a[i],-) D(i,j) = min D(i,j-1) + W(i,b[j]) D((i-1,j-1) + W(a[i],b[j])

Initial conditions:

D(0,0) = 0 j D(0,j) = Sum W(-,b[k]) k=1 j D(j,0) = Sum W(a[k],-) k=1

The calculation is performed by filling the matrix according to the
recursive definition. Each cell in the matrix stands for *D(i,j)*
and is the function of the close cells. The three-way choice in the minimization
formula for *D(i,j)* leads to the following pattern of dependencies
between matrix elements:

D(i-1,j-1) Sub |
D(i-1,j) Ins |

D(i,j-1) Del |
D(i,j) |

** Example of Edit Distance Matrix:**Lets consider two sequences

W(Sub(x,x)) = 0 W(Sub(x,y)) = 1 (x != y) W(Ins) = 1 W(Del) = 1 ---------------------- --------------- | "A" A G C A C A C A | | "A" AGCACACA | ---------------------- --------------- |"B" 0 1 2 3 4 5 6 7 8 | |"B" | |A 1 0 1 2 3 4 5 6 7 | |A \_ | |C 2 1 1 1 2 3 4 5 6 | |C \ | |A 3 2 2 2 1 2 3 4 5 | |A \ | |C 4 3 3 2 2 1 2 3 4 | |C \ | |A 5 4 4 3 2 2 1 2 3 | |A \ | |C 6 5 5 4 3 2 2 1 2 | |C \ | |T 7 6 6 5 4 3 3 2 2 | |T | | |A 8 7 7 6 5 4 3 3 2 | |A \ | ---------------------- ---------------

We can trace our way up to find the optional alignment.

Notice that a diagonal line means substitution or match, a vertical line
means deletion, a horizontal line means insertion. Thus this part
indicates
the edit operation protocol of the optimal alignment D(A,B)=2. Note that
in some cases, the minimal choice is not unique, and different paths could
have been drawn, which indicate an alternative optimal alignment.

The alignment in this case is:

A: AGCACAC-A B: A-CACACTA

Complexity analisys of dynamic programming algorithm:

-__Time Complexity__*O(n*m)*the evaluation of each cell requires a constant numbers of operations, hence constructing*D(A,B)*request*n*m*O(1)*computations.-__Space Complexity__*O(n+m)*since it's enough to store the last row in memory (not included alignment tracing).

To be more specific, given two sequences

**Definition**will be defined as the maximal similarity between`E(i,j)`*A(1,i)*and*B(1,j)*which doesn't necessarily use all literal in the given prefixes.**Recursive definition**

E(i-1,j) + W(a[i],-) E(i,j) = max E(i,j-1) + W(i,b[j]) E((i-1,j-1) + W(a[i],b[j]) 0

**Initial conditions:**

E(0,0) = 0 E(0,j) = 0 for each j E(j,0) = 0 for each j

*E(n,m)* will be the local similarity between the two sequences,
denoted by *E(A,B)*. We must make sure the weight will be the way
that certain segments will fall meaning:

W(-,a) <= 0 W(a,-) <= 0 Exista,bfromSsuch that W(a,b) <= 0

Back to index