### Introduction to Protein Comparison #### 3-D Protein Matching

Problem
Problem definition
Applications
Examples
Geometric Hashing
Flexible Matching

Problem:
Detection of a-priori unknown substructure which are almost isometric modulo the 3-D rigid motion group.< >

Problem definition:
Given A known Database of molecules, each as a set ( linearly ordered, or not) of  points representing Atoms , and given a new Molecule M: Find all the molecules Mi in the DB, and  rigid Transformations Ti : R3 ->  R3  so that a matching from ( a part of ) the Atoms of M to those of Mi will give an RMSD (root mean square distance) bellow a given limit L.

Applications:

•    structural comparison of proteins
•   receptor ligand docking
Back to Index.

Examples

• 1dgaC - 1svy : • 1tt1 - 1tf1: Geometric Hashing:

the 2D case:
For each ordered pair of points (a,b) in the plane there exists an unique Coordinate system (Frame of Reference) such that its center is on the first and its x axis is the vector from b to a. If we take two such points in a model, and calculate the coordinates of all the other points in this reference frame, then , these coordinates will be invariant under any translation, rotation ,similarity, or a combination of these three kinds of transformations. More over, these coordinates will only dependent on choosing the right base pair ,i.e. if we show only some of the points of the model (partial overlapping) then, given the right base pair, we will still identify the remaining points. It seems logical, then, to store these frames of reference, and coordinates within them, in some Fast Retrieval Data Structure, so that given a new scene, they will be used immediately, so that we will concentrate only on transformations likely to occur.
This gives us the GH Algorithm for the 2D case:
we will divide the work into two main stages:
• Preprocessing stage:

• For each model (2D "molecule") we iterate on all pairs of its points ("Atoms") .
For each basis pair we:
Compute the associated frame of reference.
For each point in each model we:
Compute its coordinates in the above frame of reference Use the coordinates as a key for inserting an Entry to a Hash table.
The Entry will include:
{
Model name.
Frame of Reference ( the basis pair, not the f.o.r. itself)
Identity of point in model
}
The complexity of this stage is O( N3 * M) where M is the number of models and N the number of points in each model.
This stage requires no knowledge of the "Scene" (the input "molecule") and can be done off line.

• Recognition stage:

• Given a scene:
For each ordered pair of points:
Compute the unique frame of reference.
For each point in scene:
compute its coordinates in the above frame of reference, and look up any entries in the hash table under these coordinates.
These entries represent points that match the above scene point under the given transformation. For each entry found:
check if the basis pair in it is consistent:
if so, "cast a vote" (raise a counter) for the pair (model,f.o.r.) and match the point on the scene to the point in the model.
After finishing going through the basis pairs, cluster the highest - ranking matches by transformation.
For each cluster:
calculate the transformation (by a least RMSD criterion) given from all matched basis pairs .
For each model:
Extend the match list (if you took only a part of the scene for the preceding stages) ,i.e. take the model, map it point by point to the scene under the calculated transformation, add all consistent matches to the match list.
give (model, Transformation [,match list]) as output.

#### The Application of Geometric Hashing to molecular biology.

The method stated above can be applied in the 3D case. This is conceptually no problem, the only difference is that for determining a frame of reference uniquely in R3 we need basis triplets, i.e. 3 nonlinear points (one for the origin, another will determine the unit X axis, and the third will span a plane ( with orientation) which will give the Y and Z axes.)
So we may take the above scheme, and replace all the basis pairs to basis triplets.
The difference will, of course, result in higher complexity: The Preprocessing stage will now have an O(N4 *M) time complexity, While the Recognition stage will take O(S4) time.

To reduce the complexity , we can :

1. take only triplets that are not 'too small' ( for numeric reasons) or 'too big' ( these have less chance of appearing in an a molecule with a partial match to the input).
2. In the case of linear polymers (which is the dominant case in biological macro-molecules) we can use the linearity , and look only at triplets which are in the right order in the sequence, or even give some heuristic ( such as Ci , Ci+2 , Ci+4  , for example, or like taking the Ca , C , and N atoms from each amino acid.) that will reduce the complexity to linear ( at the risk of losing some potential solutions).
In the case of Proteins, which is our main concern in this course, we can significantly reduce complexity of the GH algorithm by using secondary structure features, such a Alpha Helices or Beta strands. These can be approximated with straight lines, and 2 non parallel straight lines in R3 form a basis.

Since these features usually have at least several dozens of Amino Acids, using them , plus all the "unrelated" Ca atoms can reduce the input size in 1-2 orders of magnitude, thus greatly cutting both the worst case and the mean time complexity.
Back to Index.

The Problem of Non rigid bodies, specifically rotary joints:

Molecules , when they bond with other molecules, tend to change their shape.
In some cases, these changes can be approximated with rotational movements around fixed hinges.
This gives us ( C.S. people) the motivation to try and solve the above problem(s) with an option for movements of parts of the molecule around fixed hinges.
So, if we have some prior information we can try and give an answer for the molecule matching problem that will have better correspondence  to the biological (wanted) solution.

Statement of Problem:
Given a database of known molecules, each represented as a (possibly linearly ordered) set of points in R3 and possible rotary joints , and an input molecule ( "scene") find the best (set of) choice of:
Molecule in the DB , Rigid transformation and a set of rotations ( on the chosen molecule's joints) ,such that the chosen molecule, under the transformation and the bending of joints, will match the input molecule most closely.

We are interested in a Scheme that will "vote" for consistent parts of a molecule together, not as different pieces, since this would cut down Post processing, and improve the quality of the solution (we can lower the bound, which means more possible answers.)

Geometric Hashing with Hinges - the solution:
Let us look in the simple case of one hinge first: in this case, we have 2 moving parts.
For each (un-occluded) part, there is a rigid Transformation from the model to the scene, that gives us the part's (invariant) frame of reference.
Also, in this local frame of reference, the translation from the origin to the axle is invariant.
This gives us the motivation to use local f.o.r.s (basis triplets) and vote for quartets {model, part, axle, f.o.r.}.

Again , we have 2 stages to the algorithm:

• Preprocessing stage:

• for each part we iterate on all basis triplets of its points.
For each basis triplet we:
Compute the associated frame of reference.
Compute the translation to the Axle (No need for rotation...)
For each point in part we:
Compute its coordinates in the above frame of reference
Use the coordinates as a key for inserting an Entry to a Hash table.
The Entry will include: {
Model name. (if more then one Model)
Part Name.
Frame of Reference
Translation (of f.o.r) to Axle.
Identity of point in model
}
The complexity of this stage is O( N4) where N is the number of points in each part.
Again, this stage requires no knowledge of the "Scene" and is done off line.
• Recognition stage:

• Given a scene:
For each ordered triplet of points (in a "small enough" region):
Compute the unique frame of reference.
For each point in scene:
Compute its coordinates in the above frame of reference, and look up any entries in the hash table under these coordinates.
For each entry found:
check if the basis triplet and the Axle in it are consistent with the scene: if so , "cast a vote" (raise a counter)  for the quartet  (model, part, axle, f.o.r.) and match the point in the scene to the point in the model.
After finishing going through the triplets, cluster the highest - ranking matches by transformation and Axle.
For each cluster ,calculate the transformation (by a least RMSD criterion) given from all matched triplets.
For a specific part, Extend the match list (if you took only a part of the scene for the preceding stages).

give (model, part, Transformation [match list]) as output.

As before, Extending the match list simply means this:
For each part of the recognized  model:

take each point (not already matched) and transform its coordinates, then check in the input for a point with "close enough" coordinates: when (an unmatched) one is found, match it to the one in the model.
The complexity of this stage is O(S4) where S is the number of points in the scene.
The generalization to the case of a linear chain of parts is obvious, but the generalization to the case where we have several hinges on the same part or several parts on the same Axle (some general "graph") give rise to questions of weighting, i.e. the same triplet should vote to several hinges, should its vote be the same as a triplet voting for only 2 Axles? for 1 ?

This is done for a single model, but can just as easily be done for M models in a DB, where the complexity of both the preprocessing and the recognition stages (assuming a good hash table)  will be linear in M.
Also ,the above method will work in the special case of no bending of joints, so that "limited over jointing" will not stop from identifying the wanted model.The last remark is cautious, since fixing a joint in a molecule means that a lot of the ("non local") basis triplets will not be entered to the hash table and therefore will not influence the identification, so that "greatly over jointing", will cause the algorithm to function poorly.

Back to Index.