Introduction to Protein Comparison

3-D Protein Matching

Problem definition
Geometric Hashing
Flexible Matching

Detection of a-priori unknown substructure which are almost isometric modulo the 3-D rigid motion group.< >

Problem definition:
Given A known Database of molecules, each as a set ( linearly ordered, or not) of  points representing Atoms , and given a new Molecule M: Find all the molecules Mi in the DB, and  rigid Transformations Ti : R3 ->  R3  so that a matching from ( a part of ) the Atoms of M to those of Mi will give an RMSD (root mean square distance) bellow a given limit L.


Back to Index.




Back to Index.

Geometric Hashing:

the 2D case:
For each ordered pair of points (a,b) in the plane there exists an unique Coordinate system (Frame of Reference) such that its center is on the first and its x axis is the vector from b to a. If we take two such points in a model, and calculate the coordinates of all the other points in this reference frame, then , these coordinates will be invariant under any translation, rotation ,similarity, or a combination of these three kinds of transformations. More over, these coordinates will only dependent on choosing the right base pair ,i.e. if we show only some of the points of the model (partial overlapping) then, given the right base pair, we will still identify the remaining points. It seems logical, then, to store these frames of reference, and coordinates within them, in some Fast Retrieval Data Structure, so that given a new scene, they will be used immediately, so that we will concentrate only on transformations likely to occur.
This gives us the GH Algorithm for the 2D case:
we will divide the work into two main stages:

The Application of Geometric Hashing to molecular biology.

The method stated above can be applied in the 3D case. This is conceptually no problem, the only difference is that for determining a frame of reference uniquely in R3 we need basis triplets, i.e. 3 nonlinear points (one for the origin, another will determine the unit X axis, and the third will span a plane ( with orientation) which will give the Y and Z axes.)
So we may take the above scheme, and replace all the basis pairs to basis triplets.
The difference will, of course, result in higher complexity: The Preprocessing stage will now have an O(N4 *M) time complexity, While the Recognition stage will take O(S4) time.

To reduce the complexity , we can :

  1. take only triplets that are not 'too small' ( for numeric reasons) or 'too big' ( these have less chance of appearing in an a molecule with a partial match to the input).
  2. In the case of linear polymers (which is the dominant case in biological macro-molecules) we can use the linearity , and look only at triplets which are in the right order in the sequence, or even give some heuristic ( such as Ci , Ci+2 , Ci+4  , for example, or like taking the Ca , C , and N atoms from each amino acid.) that will reduce the complexity to linear ( at the risk of losing some potential solutions).
In the case of Proteins, which is our main concern in this course, we can significantly reduce complexity of the GH algorithm by using secondary structure features, such a Alpha Helices or Beta strands. These can be approximated with straight lines, and 2 non parallel straight lines in R3 form a basis.

Since these features usually have at least several dozens of Amino Acids, using them , plus all the "unrelated" Ca atoms can reduce the input size in 1-2 orders of magnitude, thus greatly cutting both the worst case and the mean time complexity.
Back to Index.

The Problem of Non rigid bodies, specifically rotary joints:

Geometric Hashing with Hinges - the solution: Back to Index.