Rigid and flexible matching

Introduction to Protein Comparison

3-D Protein Matching

Problem
Problem definition
Applications
Examples
Geometric Hashing
Flexible Matching

Problem:
Detection of a priori unknown substructures that are almost isometric modulo the 3-D rigid motion group.

Problem definition:
Given a known database of molecules, each represented as a set (linearly ordered or not) of points representing atoms, and given a new molecule M: find all the molecules Mi in the DB, and rigid transformations Ti : R³ -> R³, such that under Ti a matching from (a part of) the atoms of M to those of Mi gives an RMSD (root mean square distance) below a given limit L.
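The acceptance test in the definition above can be sketched as follows. This is a minimal illustration, assuming the matching is already given as a list of point pairs and that the rigid transformation has already been applied to one side; the names `rmsd`, `is_acceptable` and `matched` are hypothetical and do not come from the lecture.

```python
import math

def rmsd(pairs):
    """Root mean square distance over matched (p, q) point pairs in R^3."""
    if not pairs:
        raise ValueError("need at least one matched pair")
    total = sum(math.dist(p, q) ** 2 for p, q in pairs)
    return math.sqrt(total / len(pairs))

def is_acceptable(pairs, limit):
    """True if the match is below the given RMSD limit L."""
    return rmsd(pairs) < limit

# Hypothetical matched atoms: (atom of M after applying Ti, atom of Mi).
matched = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    ((1.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ((0.0, 2.0, 0.0), (0.0, 2.0, 1.0)),
    ((3.0, 0.0, 0.0), (3.0, 0.0, 0.0)),
]

print(rmsd(matched), is_acceptable(matched, limit=1.0))
```

Note that the hard part of the problem is finding the matching and the transformation; the RMSD only scores a candidate once both are known.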

Applications:

structural comparison of proteins
receptor ligand docking


Examples

1dgaC - 1svy :

1tt1 - 1tf1:


Geometric Hashing:

the 2D case:

For each ordered pair of points (a, b) in the plane there exists a unique coordinate system (frame of reference) whose origin is at the first point and whose x axis is the vector from b to a. If we take two such points in a model and calculate the coordinates of all the other points in this reference frame, then these coordinates are invariant under any translation, rotation, or scaling of the model, and under any combination of the three (that is, under any similarity transformation). Moreover, these coordinates depend only on choosing the right basis pair: if only some of the points of the model are visible (partial overlap), then, given the right basis pair, we can still identify the remaining points. It therefore seems logical to store these frames of reference, and the coordinates within them, in some fast-retrieval data structure, so that given a new scene they can be looked up immediately and we can concentrate only on the transformations that are likely to occur.
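The invariance claimed above can be checked numerically. The sketch below follows the convention stated in the text (origin at the first point, x axis pointing from b to a, with the length of that vector taken as the unit); the function names and the sample points are hypothetical.

```python
import math

def frame_coords(a, b, p):
    """Coordinates of p in the frame with origin a and x axis from b to a."""
    ux, uy = a[0] - b[0], a[1] - b[1]   # x axis; its length is the unit length
    n2 = ux * ux + uy * uy
    if n2 == 0:
        raise ValueError("basis points must be distinct")
    dx, dy = p[0] - a[0], p[1] - a[1]
    # The y axis is the x axis rotated by +90 degrees: (-uy, ux).
    return ((dx * ux + dy * uy) / n2, (dy * ux - dx * uy) / n2)

def similarity(p, angle, scale, shift):
    """Rotate by `angle`, scale uniformly, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return (scale * (c * p[0] - s * p[1]) + shift[0],
            scale * (s * p[0] + c * p[1]) + shift[1])

a, b, p = (1.0, 1.0), (3.0, 1.0), (2.0, 3.0)
before = frame_coords(a, b, p)

# Apply the same arbitrary similarity transformation to all three points.
moved = [similarity(q, angle=0.7, scale=2.5, shift=(4.0, -3.0)) for q in (a, b, p)]
after = frame_coords(*moved)

print(before, after)
```

The two coordinate pairs agree up to floating-point error, which is what makes them usable as hash keys once they are rounded to a grid.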
This gives us the GH algorithm for the 2D case.
We divide the work into two main stages:

Preprocessing stage:

Recognition stage:

Extend the match list (if only a part of the scene was used in the preceding stages), i.e. take the model, map it point by point onto the scene under the calculated transformation, and add all consistent matches to the match list.

(model, Transformation [,match list])
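The detailed steps of the two stages are not reproduced here, so the following is only a minimal sketch of one common way to organize them, not the lecture's exact procedure. It assumes a plain dictionary as the hash table, a fixed grid cell size for quantizing coordinates, and integer sample data; all names (`preprocess`, `vote`, `recognize`, `house`, `bar`) are hypothetical. Computing the transformation and extending the match list are left out.

```python
from collections import defaultdict
from itertools import permutations

def frame_coords(a, b, p):
    """Coordinates of p in the frame with origin a and x axis from b to a."""
    ux, uy = a[0] - b[0], a[1] - b[1]
    n2 = ux * ux + uy * uy          # assumes the basis points are distinct
    dx, dy = p[0] - a[0], p[1] - a[1]
    return ((dx * ux + dy * uy) / n2, (dy * ux - dx * uy) / n2)

def quantize(c, cell):
    """Map coordinates to a grid cell so that nearby points share a key."""
    return (round(c[0] / cell), round(c[1] / cell))

def preprocess(models, cell=0.25):
    """For every model and ordered basis pair, hash all other points."""
    table = defaultdict(list)
    for name, pts in models.items():
        for i, j in permutations(range(len(pts)), 2):
            for k, p in enumerate(pts):
                if k not in (i, j):
                    key = quantize(frame_coords(pts[i], pts[j], p), cell)
                    table[key].append((name, i, j))
    return table

def vote(table, scene, i, j, cell=0.25):
    """Votes per (model, basis) entry for one scene basis pair (i, j)."""
    votes = defaultdict(int)
    for k, p in enumerate(scene):
        if k not in (i, j):
            key = quantize(frame_coords(scene[i], scene[j], p), cell)
            for entry in table.get(key, ()):
                votes[entry] += 1
    return votes

def recognize(table, scene, cell=0.25):
    """Try every scene basis pair and keep the entry with the most votes."""
    best = (0, None, None)
    for i, j in permutations(range(len(scene)), 2):
        for entry, count in vote(table, scene, i, j, cell).items():
            if count > best[0]:
                best = (count, entry, (i, j))
    return best

models = {
    "house": [(0, 0), (4, 0), (4, 3), (0, 3), (2, 5)],
    "bar": [(0, 0), (6, 0), (3, 0), (3, 4), (3, 7)],
}
table = preprocess(models)

# "house" rotated by 90 degrees and shifted, one point missing, one clutter point.
scene = [(10, 2), (10, 6), (7, 6), (5, 4), (20, 20)]
print(recognize(table, scene))
```

The result is a vote count, a (model, basis) entry and the scene basis that produced it; a fuller version would derive the transformation from the two matched bases and then verify it against the whole scene.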

The Application of Geometric Hashing to molecular biology.

The method stated above can be applied in the 3D case. Conceptually this poses no problem; the only difference is that, to determine a frame of reference uniquely in R³, we need basis triplets, i.e. 3 non-collinear points (one for the origin, another to determine the unit X axis, and a third to span, together with the first two, an oriented plane, which gives the Y and Z axes).
So we may take the above scheme and replace all the basis pairs with basis triplets.
The difference will, of course, result in higher complexity: the preprocessing stage now has O(N⁴·M) time complexity, while the recognition stage takes O(S⁴) time.
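A frame built from a basis triplet can be sketched as below. The fourth powers arise because there are on the order of N³ ordered triplets and each one requires re-expressing the remaining points. The sketch assumes an orthonormal frame (unit-length axes, so it is invariant under rigid motion but not under scaling); the helper names and the test motion are hypothetical.

```python
import math

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def unit(v):
    n = math.sqrt(dot(v, v))
    if n < 1e-12:
        raise ValueError("degenerate (zero-length) vector")
    return tuple(c / n for c in v)

def frame_from_triplet(a, b, c):
    """Origin a, x axis towards b, z axis normal to the oriented plane (a, b, c)."""
    x = unit(sub(b, a))
    z = unit(cross(x, sub(c, a)))   # raises ValueError for collinear points
    y = cross(z, x)                 # completes a right-handed frame
    return a, (x, y, z)

def coords_in_frame(frame, p):
    origin, axes = frame
    d = sub(p, origin)
    return tuple(dot(d, axis) for axis in axes)

def rigid_motion(p):
    """Hypothetical test motion: 90 degree rotation about z, then a translation."""
    return (-p[1] + 5.0, p[0] - 1.0, p[2] + 2.0)

a, b, c, p = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0), (1.0, 1.0, 1.0)
before = coords_in_frame(frame_from_triplet(a, b, c), p)
after = coords_in_frame(
    frame_from_triplet(rigid_motion(a), rigid_motion(b), rigid_motion(c)),
    rigid_motion(p),
)
print(before, after)
```

Both calls give the same coordinates for p, since the frame moves rigidly with the triplet.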

To reduce the complexity, we can:

take only triplets that are not 'too small' (for numeric reasons) or 'too big' (these have less chance of appearing in a molecule with a partial match to the input).
In the case of linear polymers (which is the dominant case in biological macromolecules) we can exploit the linearity and look only at triplets that are in the right order in the sequence, or even use some heuristic (such as Ci, Ci+2, Ci+4, for example, or taking the Cα, C, and N atoms from each amino acid) that will reduce the complexity to linear (at the risk of losing some potential solutions).
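Both restrictions can be sketched as triplet generators. The thresholds, the step of 2 and the sample chain are hypothetical choices for illustration only.

```python
import math
from itertools import combinations

def size_filtered_triplets(points, min_side, max_side):
    """Yield triplets (i < j < k) whose three side lengths lie within bounds."""
    for i, j, k in combinations(range(len(points)), 3):
        sides = (math.dist(points[i], points[j]),
                 math.dist(points[j], points[k]),
                 math.dist(points[i], points[k]))
        if all(min_side <= s <= max_side for s in sides):
            yield (i, j, k)

def sequential_triplets(n, step=2):
    """Triplets (i, i+step, i+2*step) along a chain of n ordered atoms."""
    return [(i, i + step, i + 2 * step) for i in range(n - 2 * step)]

# Hypothetical chain of 10 atoms laid out along a helix-like curve.
chain = [(math.cos(i), math.sin(i), 0.5 * i) for i in range(10)]

filtered = list(size_filtered_triplets(chain, min_side=1.0, max_side=3.0))
sequential = sequential_triplets(len(chain))
print(math.comb(len(chain), 3), len(filtered), len(sequential))
```

The sequential heuristic yields a number of bases that grows linearly with the chain length, whereas the unrestricted count grows cubically.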

In the case of proteins, which are our main concern in this course, we can significantly reduce the complexity of the GH algorithm by using secondary structure features, such as alpha helices or beta strands. These can be approximated by straight lines, and 2 non-parallel straight lines in R³ form a basis.

Since these features usually have at least several dozen amino acids, using them, plus all the "unrelated" Cα atoms, can reduce the input size by 1-2 orders of magnitude, thus greatly cutting both the worst-case and the mean time complexity.
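One way to turn two non-parallel lines into a frame of reference is sketched below. The specific construction (origin at the point of the first line closest to the second, x axis along the first line, z axis along the common perpendicular) is an assumption made for illustration, as is treating each line as directed, for example by the sequence order of the helix or strand it approximates.

```python
import math

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def unit(v):
    n = math.sqrt(dot(v, v))
    if n < 1e-12:
        raise ValueError("degenerate (zero-length) vector")
    return tuple(c / n for c in v)

def frame_from_lines(p1, d1, p2, d2):
    """Frame from line 1 (point p1, direction d1) and line 2 (point p2, direction d2)."""
    x = unit(d1)
    z = unit(cross(d1, d2))         # raises ValueError for parallel lines
    y = cross(z, x)
    # Parameter t of the point on line 1 that is closest to line 2.
    w = sub(p1, p2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    t = (b * e - c * d) / (a * c - b * b)
    origin = tuple(p + t * q for p, q in zip(p1, d1))
    return origin, (x, y, z)

# Hypothetical axes of two secondary structure elements.
origin, axes = frame_from_lines((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                                (3.0, 0.0, 2.0), (0.0, 1.0, 0.0))
print(origin, axes)
```

With such frames, each pair of secondary structure elements plays the role that a basis triplet of atoms plays in the point-based scheme.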

The Problem of Non rigid bodies, specifically rotary joints:

Statement of Problem:
Given a database of known molecules, each represented as a (possibly linearly ordered) set of points in R³ together with possible rotary joints, and an input molecule ("scene"), find the best choice (or set of choices) of:
a molecule in the DB, a rigid transformation, and a set of rotations (about the chosen molecule's joints), such that the chosen molecule, under the transformation and the bending of the joints, matches the input molecule most closely.
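To make "bending a joint" concrete, the sketch below rotates one part of a molecule about a hinge. It assumes that a rotary joint is a rotation about a fixed axis (an axle) through a hinge point, and uses Rodrigues' rotation formula; the function names, the choice of part and the sample points are hypothetical.

```python
import math

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def rotate_about_axis(p, axis_point, axis_dir, angle):
    """Rotate p by `angle` about the line through axis_point along axis_dir."""
    n = math.sqrt(dot(axis_dir, axis_dir))
    k = tuple(c / n for c in axis_dir)      # unit vector along the axle
    v = sub(p, axis_point)
    c, s = math.cos(angle), math.sin(angle)
    kxv, kv = cross(k, v), dot(k, v)
    # Rodrigues' formula: v*cos + (k x v)*sin + k*(k.v)*(1 - cos)
    r = tuple(v[i] * c + kxv[i] * s + k[i] * kv * (1.0 - c) for i in range(3))
    return tuple(a + b for a, b in zip(axis_point, r))

def bend(points, part, axis_point, axis_dir, angle):
    """Rotate only the points whose indices are in `part`; the rest stay fixed."""
    return [rotate_about_axis(p, axis_point, axis_dir, angle) if i in part else p
            for i, p in enumerate(points)]

molecule = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
bent = bend(molecule, part={2, 3}, axis_point=(0.0, 0.0, 1.0),
            axis_dir=(0.0, 0.0, 1.0), angle=math.pi / 2)
print(bent)
```

Distances inside each part are unchanged by the bend; only distances between points on opposite sides of the joint change, which is why each part can still be matched rigidly on its own.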

We are interested in a scheme that "votes" for the consistent parts of a molecule together, rather than as separate pieces, since this cuts down post-processing and improves the quality of the solution (we can lower the bound, which allows more possible answers).

Geometric Hashing with Hinges - the solution:

Again, the algorithm has two stages:

Preprocessing stage:


Recognition stage:

Check whether the basis triplet and the axle in it are consistent with the scene; if so, "cast a vote" (increment a counter) for the quartet (model, part, axle, frame of reference) and match the point in the scene to the point in the model.

RMSD

Give (model, part, transformation [, match list]) as output.

As before, extending the match list simply means the following.
For each part of the recognized model:


This is done for a single model, but can just as easily be done for M models in a DB, in which case the complexity of both the preprocessing and the recognition stages (assuming a good hash table) is linear in M.
Also, the above method works in the special case where no joints are bent, so "limited over-jointing" will not prevent the wanted model from being identified. This remark is deliberately cautious: placing a joint in a molecule means that many of the ("non-local") basis triplets are not entered into the hash table and therefore do not influence the identification, so "heavy over-jointing" will cause the algorithm to perform poorly.
