written by Ram Nathaniel <ramn@math.tau.ac.il>
Table of contents:
- What is FilterKit
- About this document
- The Filter Architecture
- Using the filters in a script
- File Format
- Classes
- Skeleton
What is FilterKit

FilterKit is a software architecture, combined with a set of utilities and classes, built on top of the GAMB++ library, which provides a framework for geometric algorithms in molecular biology. The main idea behind the design of the architecture is to separate the algorithm into many phases, where each phase resides in a different process and the processes are not order dependent. The motivation for this approach is the need for different filters, together with the fact that we can never be sure in what order these filters should be applied.

Advantages of working with the FilterKit:
- Flexibility in rearranging the algorithm.
- Debugging: a filter can be debugged separately, without the need to restart the whole algorithm every time. This enables debugging with the real data rather than a small test set, which makes the debugging far more reliable.
- Traceable transformations: since a transformation that survived the last filter is logged in the output files of all the filters, one can get the score of a transformation after each step simply by running 'grep -n [transformation id]' over the output files and looking at the score at each step. This allows a better understanding of the algorithm and therefore rapid improvement.
- The filters are built so that one can run a filter as a daemon which lies in the background and provides services to the main program, communicating through FIFOs (named pipes). This allows the algorithm to run several things in parallel (see the Using the filters in a script section for more details).

Disadvantages of working with the FilterKit:
- A slight overhead is kept for every transformation along the way, in order to enable tracing backward.
- All transformations are written to and read from a file at each phase. This hurts the efficiency of the program slightly.

About this document

This document is meant to help the author of a new application in the GAMBA group blend in with the filter architecture. It is also intended for the advanced user of such applications who wants to take advantage of the flexibility of the architecture without recompiling the modules (see the Using the filters in a script section).

The Filter Architecture
The Filter Architecture is a software architecture designed to enhance the flexibility of the development process of new applications in the GAMBA group. The architecture mainly takes the form of a giant pipeline in which the data travels through the stages of the geometric hashing algorithm, and then passes through a number of filters designed to get rid of some of the false positives the algorithm introduced.

The filters can also be used as semi-daemon programs which stay alive until the main program sends a TERMINATE signal through stdin. For more details see the Using the filters in a script section.

In the Skeleton section one can find a basic empty filter which demonstrates the usage of the architecture.

Using the filters in a script
The main advantage of the filter architecture is the ability to use the filters in a shell script. In such cases the user can call the different filters in any order he or she pleases. A filter can be introduced at any point in the algorithm, since the file format of the input is the same as the file format of the output. For example, if a user has four different filters A, B, C and D, a script which uses them might be:

Matching_Program | A | B | C | D > output_file

or:

Matching_Program | B | A | A | C | D | B > output_file

A sophisticated user might want to use one of the filters as a parallel task which does part of the work. For example, if we want to perform clustering several times during our algorithm, we can invoke the clustering filter to stay in the background and wait, so that every time we want to use it we can send it a collection of transformations and get back the clustered set. The shell script would look something like:

mknod instream p
mknod outstream p
Clustering_filter < instream > outstream &
Main_Program < outstream > instream
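The FIFO plumbing above can be exercised on its own. The sketch below assumes a POSIX shell; since no real filter binary is available here, a plain `cat` stands in for Clustering_filter purely to show the data flow (`mkfifo name` has the same effect as `mknod name p`):

```shell
#!/bin/sh
# Create the two named pipes (equivalent to: mknod instream p).
mkfifo instream outstream

# Stand-in daemon; a real script would start e.g.
#   Clustering_filter < instream > outstream &
cat < instream > outstream &

# Send one batch of transformations, ending with the TERMINATE keyword.
printf 'HEADER demo\nTRANS 1 id1 0.9\nEND\nTERMINATE\n' > instream

# Collect whatever the background filter emitted.
result=$(cat outstream)
wait

rm -f instream outstream
printf '%s\n' "$result"
```

With a real filter, Main_Program would keep both FIFOs open and send END after each batch, reserving TERMINATE for shutdown.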
File Format

The files which are in use throughout the pipeline have the following format:

Notation: all fields marked with $ are mandatory; all other fields are optional. The bold type marks the keywords that appear at the beginning of each line, indicating the type of the line.

The first part is the header, which is copied to the output file as is. This part may contain any information that makes it easier for the user to understand the context in which the file was created (e.g. date, name, molecules, ...).

HEADER (free text)

The params part specifies the structure of the TRANS record: here appear all the optional fields which are in use, such as MolName, PartNumber, etc., and whether or not there are DETAIL records.

PARAMS (indication which optional fields will be used)

A basic record containing information about the transformation:

TRANS ($ Result Number) ($ Result ID) ($ score1) (MoleculeName) (PartNumber) ($ Rotx, roty, rotz, transx, transy, transz) ($ score1) (score2) (size)

Optional detail records containing information about the actual matching list for this transformation. The first field, Result Number, indicates the relevant transformation:

DETAIL ($ Result Number) ($ model index) ($ scene index) (dist)

Ends this set of inputs. This does not necessarily end the program after it finishes processing this input: another session may follow.

END

Ends everything. The program may terminate once it has processed everything before this.

TERMINATE
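Putting the record types together, a minimal input file for a filter might look like the sketch below. All names, scores and indices are invented for illustration; the field layout follows the TRANS and DETAIL lines described above:

```
HEADER created by Matching_Program, molecules: 1abc vs. 2xyz
PARAMS MolName DETAIL
TRANS 1 1abc_17 0.87 1abc 0.12 0.30 1.57 4.2 -1.0 0.5 0.87 0.91 23
DETAIL 1 5 12 1.3
DETAIL 1 6 14 0.8
END
TERMINATE
```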
Skeleton

The skeleton filter is an empty filter which provides a good starting point for writing a new filter. In this file you can see a typical usage of the FileReader and MyTrans classes. This filter can be used in a pipeline or as a daemon (see the Using the filters in a script section).
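The control flow such a skeleton implements, copy the header, process each record, flush on END, exit on TERMINATE, can be sketched in shell. This is only an illustrative stand-in for the real skeleton (which uses the FileReader and MyTrans classes); the pass-through "processing" here is a placeholder where a real filter would score or drop TRANS records:

```shell
#!/bin/sh
# Minimal pass-through filter honoring the file format's control keywords.
# Reads records on stdin and writes them unchanged to stdout.
filter_loop() {
  while IFS= read -r line; do
    case $line in
      TERMINATE)
        printf '%s\n' "$line"   # propagate shutdown downstream, then stop
        return 0 ;;
      END)
        printf '%s\n' "$line"   # end of one session; another may follow
        ;;
      *)
        printf '%s\n' "$line"   # HEADER/PARAMS/TRANS/DETAIL pass through
        ;;
    esac
  done
}

# Usage in a pipeline (commented out so the definition stands alone):
#   Matching_Program | ./skeleton.sh | Next_Filter > output_file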