Protein sequence comparison
Motivation
Inexact matching
Algorithms
Biological Sequence Comparison
The lecture introduces the problem of comparing biological sequences
as strings of letters allowing insertions, deletions and substitutions
between them. We will consider the reason of such assumptions, called gene
mutation. Also we will consider several advanced algorithms for comparing
biological sequences.
Motivation for protein sequence alignment
Sequence comparison is widespread in Biology and serves different
purposes. This usage is based on the fact that the functionality of protein
is defined by its structure, and the protein structure is defined by the
sequence of amino acids.
- Recovering structural similarity
Some diseases are caused by lack of parts of the protein sequences
in an organism. In this case, we would like to produce sequences with the
same functionality, then insert them into the organism, to compensate its
lack. Finding the sequences with the same functionality is, finally, finding
similar sequences. This technique is used in Drug Design.
- Comparison of a new sequence vs. a database
When biologists discover a new protein sequence, they would
like to compare it to a database of already discovered sequences and find
those sequences which are similar (does not have to be exactly identical)
to the newly discovered one. Sequences in database are known, their features
are defined and described. Then, if the new sequence is similar to some
from database, it means that they have similar structure; therefore, they
have similar functionality and similar features.
- Detection of pattern
Sometimes, biologists need to distribute sequences into classes
or families. If some class is specified by one or more sub-sequences (patterns),
then to define whether a sequence belongs to the class is just to find
a sequence segment that is similar to the pattern.
- Evolution
Given two sequences it is possible to define evolution process,
to find segments, which are predisposed to some kind of mutations, or to
find segments, which preserve their structures. This might allow to predict
the future gene mutation and to suppose the possible source of disease.
- Gene mutations
The concept of similarity between biological
sequences is different from mathematical approach. From a biological point
of view, two amino acids are similar if they have same structure
(size and shape). They don't have to be identical. This phenomena is called
substitution and caused by gene mutation. There are more two kind
of mutation: deletion and insertion. Deletion is disappearance
of one or more amino acids from protein sequence. On the contrary, insertion
is presence of one or more new amino acids in protein sequence.
Back to index
Inexact matching
Since the databases are large and each sequence is long, biologists
need an efficient way to compare pairs of sequences. There are three models
for inexact matching:
- Global alignment
Given two sequences define what is the similarity between them.
example:
AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP
|:| ||:|:| : |: |::| | ||:|:: : : :|: ||
AGSPGSSGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKGEMGP
[-------------------alignment--------------------]
Ends free
Given two sequences define what is the similarity between them, when
the maximal similarity could result by shifting one of the strings, and
comparing the two until the end of the shorter string.
example:
AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP
||:|:| : |: |::| | ||:|:: : : :|
SGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKG
[-------------alignment----------------]
- Local alignment
Given two sequences define most similar substrings of the two.
example:
AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRSDAR
||:|:| : |: |::| | ||:|:: |
PGSSGAPGQRGEPGPQGHAGAPGPPGPPGSDG-----SPGGKGEMGP
[-------------alignment-----------]
Definition of Edit distance
To be able compare pairs of sequences and to decide which of
comparison better, we need to suggest an absolute way to calculate the
similarity between protein sequences. Given two sequences, Edit distance
will return the cost of transformation from one sequence into second.
Definition of notation
- S is the alphabet.
- S* is the set of all finite sequences of symbols from S.
- a,b,c are variables denoting individual symbols.
- A,B,C are variables denoting sequences from S*.
- |A| denotes the length of the sequence A.
- A(i,j) denotes a subsequence from the i-th symbol to
the j-th symbol in the sequence A (including bounds).
- a[i] denotes a i-th symbol in the sequence A.
Definition of editing operation
- Definition: Substituting a[i] (in
A) by b[j] (in B). Also (a,a) denote a match.
Sub(a[i],b[j]) or (a[i],b[j])
- Definition: Inserting the character b[j] in sequence
A in place i.
I(i,b[j]) or (i,b[j])
- Definition: -Deleting the symbol a[i] in sequence A.
D(a[i],-) or (a[i],-)
Notice, that the problem is symmetric: a deletion in A can be seen
as an insertion in B, and vice versa.
- Each editing operation will be assigned a weight. These weights represent
difference between two characters or penalties for insertions/deletions.
- Definition: T(A,B) -
a transformation of sequence A to sequence B will be the
set of editing operations applied to one of the sequences iteratively,
which transform A into B.
- Definition: W(T(A,B)) -
the weight of transformation will be defined as the total sum of the weights
of the editing operations in T(A,B).
- Definition: D(A,B) -
editing distance:
D(A,B) = minT(A,B){W(T(A,B))}
Editing distance D(A,B) is a metric. It has the following characteristics:
- D(A,B) = 0 => A = B
- D(A,B) = D(B,A)
- for each C, D(A,B) <= D(A,C) + D(C,B)
- Example of Editing Distance
The sequences are A = "industry" ( |A| = 8, A(2,5)
= ndus, a[7] = r) and B = "interest" ( |B| = 8, B(3,8)
= terest, b[2] = n). The editing operations for A->B are:
D(a[3],-) = inustry
D(a[3],-) = instry
Sub(a[6],b[7]) = instrs
D(a[3],-) = intrs
I(4,b[4]) = inters
I(6,b[6]) = interes
I(8,b[8]) = interest
The alignment of this two sequence is:
INDUST-R-Y
IN--TEREST
Back to index
Algorithms
There are numbers of algorithms using for string comparing and string
matching. It is possible to show, that given the string with length n
and pattern with length m, we can find the pattern in the string
in time O(n). But for biological sequences these algorithms are not
useful
because of allowing insertions and deletions. This section introduces several
algorithms for biological sequence comparison.
Dynamic Programming
Dynamic programming is the method for solving different sophisticated
programming problems. Dynamic programming solves problems by combining
solutions to subproblems. It is similar to calculating a recursive
formula
where each step of the calculation is based on the results of previous
steps.
So, we need to keep only the last results and not the whole result list.
Using dynamic programming, the problem of finding edit distance given two
sequences of length n and m can be solved with time complexity
O(n*m) and space complexity O(n+m).
Let's consider the problem as a matrix (edit distance matrix), where sequence
A is located on X axis and sequence B is located on
Y axis. The matrix dimensions are (n+1)x(m+1). We add one
row and one column for initial conditions.
Then, the optimal edit distance will be presented as a path into the matrix,
where each cell (i+1,j+1) contains the value of edit distance between
subsequences A(1,j) and B(1,j). Obviously, the edit distance
between sequences A and B is located in the cell (n+1,m+1).
The recursive formula of edit distance is:
D(i-1,j) + W(a[i],-)
D(i,j) = min D(i,j-1) + W(i,b[j])
D((i-1,j-1) + W(a[i],b[j])
Initial conditions:
D(0,0) = 0
j
D(0,j) = Sum W(-,b[k])
k=1
j
D(j,0) = Sum W(a[k],-)
k=1
The calculation is performed by filling the matrix according to the
recursive definition. Each cell in the matrix stands for D(i,j)
and is the function of the close cells. The three-way choice in the minimization
formula for D(i,j) leads to the following pattern of dependencies
between matrix elements:
D(i-1,j-1) Sub |
D(i-1,j) Ins |
D(i,j-1) Del |
D(i,j) |
Example of Edit Distance Matrix:
Lets consider two sequences A = "AGCACACA"
and B = "ACACACTA". Define the simple cost function:
W(Sub(x,x)) = 0
W(Sub(x,y)) = 1 (x != y)
W(Ins) = 1
W(Del) = 1
---------------------- ---------------
| "A" A G C A C A C A | | "A" AGCACACA |
---------------------- ---------------
|"B" 0 1 2 3 4 5 6 7 8 | |"B" |
|A 1 0 1 2 3 4 5 6 7 | |A \_ |
|C 2 1 1 1 2 3 4 5 6 | |C \ |
|A 3 2 2 2 1 2 3 4 5 | |A \ |
|C 4 3 3 2 2 1 2 3 4 | |C \ |
|A 5 4 4 3 2 2 1 2 3 | |A \ |
|C 6 5 5 4 3 2 2 1 2 | |C \ |
|T 7 6 6 5 4 3 3 2 2 | |T | |
|A 8 7 7 6 5 4 3 3 2 | |A \ |
---------------------- ---------------
We can trace our way up to find the optional alignment.
Notice that a diagonal line means substitution or match, a vertical line
means deletion, a horizontal line means insertion. Thus this part
indicates
the edit operation protocol of the optimal alignment D(A,B)=2. Note that
in some cases, the minimal choice is not unique, and different paths could
have been drawn, which indicate an alternative optimal alignment.
The alignment in this case is:
A: AGCACAC-A
B: A-CACACTA
Complexity analisys of dynamic programming algorithm:
- Time Complexity - O(n*m) the evaluation of each
cell requires a constant numbers of operations, hence constructing D(A,B)
request n*m*O(1) computations.
- Space Complexity - O(n+m) since it's enough to
store the last row in memory (not included alignment tracing).
Local similarity
The alignments so far tried to find the difference or the similarity
between two sequences. In this section we will present an algorithm
that, given two sequences, attempts to find the most similar
sub-sequence.
To be more specific, given two sequences A and B, the algorithm
finds four indices i,j,k and l so that the similarity
between
subsequence A(i,j) and B(k,l) is maximal over all substrings.
This is called Local Similarity between two sequences.
- Definition E(i,j) will be defined as the maximal similarity
between A(1,i) and B(1,j) which doesn't necessarily use all
literal in the given prefixes.
- Recursive definition
E(i-1,j) + W(a[i],-)
E(i,j) = max E(i,j-1) + W(i,b[j])
E((i-1,j-1) + W(a[i],b[j])
0
E(0,0) = 0
E(0,j) = 0 for each j
E(j,0) = 0 for each j
E(n,m) will be the local similarity between the two sequences,
denoted by E(A,B). We must make sure the weight will be the way
that certain segments will fall meaning:
W(-,a) <= 0
W(a,-) <= 0
Exist a,b from S such that W(a,b) <= 0
Gap (insertion/deletion) penalties.
Opening Gap Penalty
Opening gap is a penalty for the initiation of the gap in sequence or in
structure. To make the match more significant you can try to make the gap
penalty larger. It will decrease the number of gaps.
Extension Gap Penalty
This penalty is applied for encreasing already existing by one residue. If you
don't like long gaps, just increase the extension gap penalty. As well as in
the opening gap penalty case, increasing an extension gap penalty may increase
the significance of the match.
Comparison (substitution/match) table.
Though sequences comparison is a mathematical task, its application
in molecular biology should take into account the biological aspect of
the problem. That is, weights of substitution, deletion and insertion should
reflect actual similarities in proteins. Therefore significant research
efforts have been done in order to construct comparison tables to answer
that requirement. A series of such tables, PAM (Percent Accepted Mutation),
have been constructed by Dayhoff and Dayhoff. These amino acid weight matrices
are used to help detect distant similarity between proteins. These give
weights to the different possible pairs of aligned residues. The weights
can be positive or negative. You are allowed to choose the particular PAM
matrix by specifying a value between 1 and 500. Roughly, low PAM values
(e.g. 40 or so) will be best suited for finding short regions of very strong
similarity while high values (e.g. 250 or more) will be better suited for
finding longer, weaker matches.
Back to index