Sequence alignment

Protein sequence comparison

Biological Sequence Comparison

The lecture introduces the problem of comparing biological sequences as strings of letters allowing insertions, deletions and substitutions between them. We will consider the reason of such assumptions, called gene mutation. Also we will consider several advanced algorithms for comparing biological sequences.

Motivation for protein sequence alignment
Sequence comparison is widespread in Biology and serves different purposes. This usage is based on the fact that the functionality of protein is defined by its structure, and the protein structure is defined by the sequence of amino acids.

Recovering structural similarity
Some diseases are caused by lack of parts of the protein sequences in an organism. In this case, we would like to produce sequences with the same functionality, then insert them into the organism, to compensate its lack. Finding the sequences with the same functionality is, finally, finding similar sequences. This technique is used in Drug Design.
Comparison of a new sequence vs. a database
When biologists discover a new protein sequence, they would like to compare it to a database of already discovered sequences and find those sequences which are similar (does not have to be exactly identical) to the newly discovered one. Sequences in database are known, their features are defined and described. Then, if the new sequence is similar to some from database, it means that they have similar structure; therefore, they have similar functionality and similar features.
Detection of pattern
Sometimes, biologists need to distribute sequences into classes or families. If some class is specified by one or more sub-sequences (patterns), then to define whether a sequence belongs to the class is just to find a sequence segment that is similar to the pattern.
Evolution
Given two sequences it is possible to define evolution process, to find segments, which are predisposed to some kind of mutations, or to find segments, which preserve their structures. This might allow to predict the future gene mutation and to suppose the possible source of disease.
Gene mutations
The concept of similarity between biological sequences is different from mathematical approach. From a biological point of view, two amino acids are similar if they have same structure (size and shape). They don't have to be identical. This phenomena is called substitution and caused by gene mutation. There are more two kind of mutation: deletion and insertion. Deletion is disappearance of one or more amino acids from protein sequence. On the contrary, insertion is presence of one or more new amino acids in protein sequence.

Back to index

Inexact matching
Since the databases are large and each sequence is long, biologists need an efficient way to compare pairs of sequences. There are three models for inexact matching:

Global alignment
Given two sequences define what is the similarity between them.

example:

 AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP
 |:|   ||:|:|  :   |: |::| |  ||:|::     : : :|: ||          
 AGSPGSSGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKGEMGP

 [-------------------alignment--------------------]

Ends free
Given two sequences define what is the similarity between them, when the maximal similarity could result by shifting one of the strings, and comparing the two until the end of the shorter string.

example:

 AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP
       ||:|:|  :   |: |::| |  ||:|::     : : :|
       SGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKG

       [-------------alignment----------------]

Local alignment
Given two sequences define most similar substrings of the two.

example:

 AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRSDAR
       ||:|:|  :   |: |::| |  ||:|::     |
    PGSSGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKGEMGP

       [-------------alignment-----------]

Definition of Edit distance
To be able compare pairs of sequences and to decide which of comparison better, we need to suggest an absolute way to calculate the similarity between protein sequences. Given two sequences, Edit distance will return the cost of transformation from one sequence into second.

Definition of notation

S is the alphabet.
S* is the set of all finite sequences of symbols from S.
a,b,c are variables denoting individual symbols.
A,B,C are variables denoting sequences from S*.
|A| denotes the length of the sequence A.
A(i,j) denotes a subsequence from the i-th symbol to the j-th symbol in the sequence A (including bounds).
a[i] denotes a i-th symbol in the sequence A.

Definition of editing operation

Definition: Substituting a[i] (in A) by b[j] (in B). Also (a,a) denote a match.

Sub(a[i],b[j]) or (a[i],b[j])

Definition: Inserting the character b[j] in sequence A in place i.

I(i,b[j]) or (i,b[j])

Definition: -Deleting the symbol a[i] in sequence A.

D(a[i],-) or (a[i],-)

Notice, that the problem is symmetric: a deletion in A can be seen as an insertion in B, and vice versa.

Each editing operation will be assigned a weight. These weights represent difference between two characters or penalties for insertions/deletions.

Definition: T(A,B) - a transformation of sequence A to sequence B will be the set of editing operations applied to one of the sequences iteratively, which transform A into B.

Definition: W(T(A,B)) - the weight of transformation will be defined as the total sum of the weights of the editing operations in T(A,B).

Definition: D(A,B) - editing distance:

D(A,B) = min_T(A,B){W(T(A,B))}

Editing distance D(A,B) is a metric. It has the following characteristics:

D(A,B) = 0 => A = B
D(A,B) = D(B,A)
for each C, D(A,B) <= D(A,C) + D(C,B)

Example of Editing Distance
The sequences are A = "industry" ( |A| = 8, A(2,5) = ndus, a[7] = r) and B = "interest" ( |B| = 8, B(3,8) = terest, b[2] = n). The editing operations for A->B are:

   D(a[3],-) = inustry
   D(a[3],-) = instry
   Sub(a[6],b[7]) = instrs
   D(a[3],-) = intrs
   I(4,b[4]) = inters
   I(6,b[6]) = interes
   I(8,b[8]) = interest

The alignment of this two sequence is:

   INDUST-R-Y
   IN--TEREST

Back to index

Algorithms

O(n)

Dynamic Programming
Dynamic programming is the method for solving different sophisticated programming problems. Dynamic programming solves problems by combining solutions to subproblems. It is similar to calculating a recursive formula where each step of the calculation is based on the results of previous steps. So, we need to keep only the last results and not the whole result list.
Using dynamic programming, the problem of finding edit distance given two sequences of length n and m can be solved with time complexity O(n*m) and space complexity O(n+m).
Let's consider the problem as a matrix (edit distance matrix), where sequence A is located on X axis and sequence B is located on Y axis. The matrix dimensions are (n+1)x(m+1). We add one row and one column for initial conditions.
Then, the optimal edit distance will be presented as a path into the matrix, where each cell (i+1,j+1) contains the value of edit distance between subsequences A(1,j) and B(1,j). Obviously, the edit distance between sequences A and B is located in the cell (n+1,m+1).
The recursive formula of edit distance is:

                     D(i-1,j) + W(a[i],-)
        D(i,j) = min D(i,j-1) + W(i,b[j])
                     D((i-1,j-1) + W(a[i],b[j])

Initial conditions:

        D(0,0) = 0
                  j
        D(0,j) = Sum W(-,b[k])
                 k=1
                  j
        D(j,0) = Sum W(a[k],-)
                 k=1

The calculation is performed by filling the matrix according to the recursive definition. Each cell in the matrix stands for D(i,j) and is the function of the close cells. The three-way choice in the minimization formula for D(i,j) leads to the following pattern of dependencies between matrix elements:

D(i-1,j-1) Sub	D(i-1,j) Ins
D(i,j-1) Del	D(i,j)

Example of Edit Distance Matrix:
Lets consider two sequences A = "AGCACACA" and B = "ACACACTA". Define the simple cost function:

    W(Sub(x,x)) = 0
    W(Sub(x,y)) = 1 (x != y)
    W(Ins) = 1
    W(Del) = 1

 ----------------------   ---------------
|  "A" A G C A C A C A | |  "A" AGCACACA |
 ----------------------   ---------------
|"B" 0 1 2 3 4 5 6 7 8 | |"B"            |
|A   1 0 1 2 3 4 5 6 7 | |A     \_       |
|C   2 1 1 1 2 3 4 5 6 | |C       \      |
|A   3 2 2 2 1 2 3 4 5 | |A        \     |
|C   4 3 3 2 2 1 2 3 4 | |C         \    |
|A   5 4 4 3 2 2 1 2 3 | |A          \   |
|C   6 5 5 4 3 2 2 1 2 | |C           \  |
|T   7 6 6 5 4 3 3 2 2 | |T           |  |
|A   8 7 7 6 5 4 3 3 2 | |A            \ |
 ----------------------   ---------------

We can trace our way up to find the optional alignment.
Notice that a diagonal line means substitution or match, a vertical line means deletion, a horizontal line means insertion. Thus this part indicates the edit operation protocol of the optimal alignment D(A,B)=2. Note that in some cases, the minimal choice is not unique, and different paths could have been drawn, which indicate an alternative optimal alignment.
The alignment in this case is:

  A: AGCACAC-A
  B: A-CACACTA

Complexity analisys of dynamic programming algorithm:

Time Complexity - O(n*m) the evaluation of each cell requires a constant numbers of operations, hence constructing D(A,B) request n*m*O(1) computations.
Space Complexity - O(n+m) since it's enough to store the last row in memory (not included alignment tracing).

Local similarity
The alignments so far tried to find the difference or the similarity between two sequences. In this section we will present an algorithm that, given two sequences, attempts to find the most similar sub-sequence.
To be more specific, given two sequences A and B, the algorithm finds four indices i,j,k and l so that the similarity between subsequence A(i,j) and B(k,l) is maximal over all substrings. This is called Local Similarity between two sequences.

Definition E(i,j) will be defined as the maximal similarity between A(1,i) and B(1,j) which doesn't necessarily use all literal in the given prefixes.
Recursive definition

                     E(i-1,j) + W(a[i],-)
        E(i,j) = max E(i,j-1) + W(i,b[j])
                     E((i-1,j-1) + W(a[i],b[j])
                     0

Initial conditions:

        E(0,0) = 0
        E(0,j) = 0 for each j
        E(j,0) = 0 for each j

E(n,m) will be the local similarity between the two sequences, denoted by E(A,B). We must make sure the weight will be the way that certain segments will fall meaning:

    W(-,a) <= 0
    W(a,-) <= 0
    Exist a,b from S such that W(a,b) <= 0

Gap (insertion/deletion) penalties.
Opening Gap Penalty Opening gap is a penalty for the initiation of the gap in sequence or in structure. To make the match more significant you can try to make the gap penalty larger. It will decrease the number of gaps.
Extension Gap Penalty This penalty is applied for encreasing already existing by one residue. If you don't like long gaps, just increase the extension gap penalty. As well as in the opening gap penalty case, increasing an extension gap penalty may increase the significance of the match.
Comparison (substitution/match) table.
Though sequences comparison is a mathematical task, its application in molecular biology should take into account the biological aspect of the problem. That is, weights of substitution, deletion and insertion should reflect actual similarities in proteins. Therefore significant research efforts have been done in order to construct comparison tables to answer that requirement. A series of such tables, PAM (Percent Accepted Mutation), have been constructed by Dayhoff and Dayhoff. These amino acid weight matrices are used to help detect distant similarity between proteins. These give weights to the different possible pairs of aligned residues. The weights can be positive or negative. You are allowed to choose the particular PAM matrix by specifying a value between 1 and 500. Roughly, low PAM values (e.g. 40 or so) will be best suited for finding short regions of very strong similarity while high values (e.g. 250 or more) will be better suited for finding longer, weaker matches.

Back to index