KabatMan - Query language

Data in the database is accessed using a SQL-like query language. The main deviation between the language used here and SQL is that clauses within the WHERE statement are combined in reverse polish notation. This is described in detail below. Also, since there is only one table in the database, there is no FROM statement.

Some simple example queries are shown below.

The query language uses three statements: SET, SELECT and WHERE. The SELECT and WHERE commands take you into a mode where clauses are specified with no introductionary SELECT or WHERE keyword. Conversely, the SET command needs to be specified on each line where variables are to be set and must be given before any SELECT or WHERE statement.

The QUIT or EXIT command is used to leave the program.


The SET Statement

This allows variables to be set. These are used to control the operation of the program. The SET keyword must be given at the start of each line containing variables to be set and must be given before the SELECT and WHERE statements. More than one variable/value pair may be specified on the line. The syntax is thus:

      SET variable value [variable value ...]
    

(Square brackets indicate optional items; you don't type them!)

The following variables may be set:

   Variable     Values               Notes
   --------     ------               -----
   LOOPS        KABAT, ABM, CHOTHIA  Sets the definitions to use for 
                                     the CDRs
   INFO         integer              Sets the information level for 
                                     messages
   INSERTS      ON, OFF              Specifies whether CDRs should be 
                                     displayed with or without alignment
                                     dashes
   VARIABILITY  real (0...100)       Specifies a sequence identity
                                     threshold (see below).
   HTML         ON, OFF              Adds some HTML codes to the
                                     output to make the results
                                     look nicer over the web
                                     (Defaults to ON)
   URL          string               Specify the URL
   CANONICAL    string               Specify the canonical class
                                     definition file (DEFAULT,
                                     AUTO or STRICT)

The VARIABILITY variable allows one to specify that only sequences with less than a specified sequence homology be displayed. After all WHERE sub-clauses have been executed, KabatMan will do a pairwise comparison of all the remaining antibodies and reject one from each pair which has a percentage sequence identity greater than this value. For each sequence mismatch, a penalty of 100/MeanLength % is scored; missing chains do not score a penalty; deletions in one sequence with respect to the other are given a double penalty (i.e. 200/MeanLength). By default, VARIABILITY is set to 0.0 which is used to indicate `off' (this has the same effect as setting it to 100, but the program will run faster since the variability code will not be called).

N.B. The variability option is SLOW if there are more than a couple of hundred hits to be processed. For example, with 1124 hits (using the query SELECT light WHERE complete = t), takes 4 minutes on a 100MHz R4000 SG Indy.

The option to specify the URL allows you to specfiy a different URL string to be used when selecting the URLL or URLH fields. You may also give the command SET URL DEFAULT to restore the default URL. If you use this command, you must specify the format in the correct fashion. It must start with href= and end with the HTML GET method query. The actual accession code and anchor tags will be appended to the string.

The default string is:
href="http://immuno.bme.nwu.edu/scripts/noninter.tcl?qt="
If this doesn't mean anything to you, then DO NOT use this option!

The SET CANONICAL command allows the canonical definitions (key residue) to be specified. By default, a modified version of the definitions used by AbM are used. Other options are SET CANONICAL AUTO (for my automatically generated definitions) and SET CANONICAL STRICT (for strict definitions from the papers by Chothia et al.). See the Web page for using the Chothia assignment software for further details.


The SELECT Statement

Specifies what data should be extracted from the database. Each field may be separated by a space and/or a comma. Parameters are placed in parentheses.

The select statement may extend over more than one line.

The following fields may be displayed:

   Property        Notes
   --------        -------------------------
   NAME            Name of antibody
   ANTIGEN         Antigen (or blank string)
   L1              Sequence of CDR L1
   L2              Sequence of CDR L2
   L3              Sequence of CDR L3
   H1              Sequence of CDR H1
   H2              Sequence of CDR H2
   H3              Sequence of CDR H3
   CLASS           Light chain class (derived from file or filename)
   SOURCE          Animal source (derived from filename)
   REFerence       Publication reference
   LENgth(loop)    Length of specified CDR
   RESidue(pos)    Amino acid at specified residue
   PIRfile         Create a PIR style output                     
   LIGHT           The light chain sequence
   HEAVY           The heavy chain sequence
   CANonical(loop) The Chothia canonical class for a loop
   IDLight         The accession code for the light chain
   IDHeavy         The accession code for the heavy chain
   URLLight        An HTML hypertext link to the light chain raw data
   URLHeavy        An HTML hypertext link to the heavy chain raw data
   SUBGroup(chain) The subgroup of a human light or heavy chain
   DATe            The year of the earliest reference
   LFR1            Sequence of light framework 1
   LFR2            Sequence of light framework 2
   LFR3            Sequence of light framework 3
   LFR4            Sequence of light framework 4
   HFR1            Sequence of heavy framework 1
   HFR2            Sequence of heavy framework 2
   HFR3            Sequence of heavy framework 3
   HFR4            Sequence of heavy framework 4

(The required parts of field names are in capital letters)


The WHERE Statement

Specifies filters to be applied to the data in order to select a subset which is to be displayed.

Each clause is specified in conventional order notation using the form:

        property  test  value

The following properties are supported:

   Property        Type     Notes
   --------        ----     -----
   NAME            string   The antibody name
   ANTIGEN         string   The antigen (if specified)
   L1              string   Sequence of CDR L1
   L2              string   Sequence of CDR L2
   L3              string   Sequence of CDR L3
   H1              string   Sequence of CDR H1
   H2              string   Sequence of CDR H2
   H3              string   Sequence of CDR H3
   CLASS           string   Light chain class (derived from file or name)
   SOURCE          string   Animal source (derived from filename)
   REFerence       string   Publication reference
   LENgth(loop)    integer  Length of specified CDR
   RESidue(pos)    char     Amino acid at specified position
   COMPLETE        bool     True of both chains are present
   VARiability     integer  Not implemented
   LIGHT           string   The light chain sequence
   HEAVY           string   The heavy chain sequence
   CANonical(loop) string   Chothia class name or ? if unmatched
   IDLight         nnnnnn   The accession code for the light chain
   IDHeavy         nnnnnn   The accession code for the heavy chain
   SUBGroup(chain) string   The subgroup of a human light or heavy chain
   DATe            The year of the earliest reference
   LFR1            Sequence of light framework 1
   LFR2            Sequence of light framework 2
   LFR3            Sequence of light framework 3
   LFR4            Sequence of light framework 4
   HFR1            Sequence of heavy framework 1
   HFR2            Sequence of heavy framework 2
   HFR3            Sequence of heavy framework 3
   HFR4            Sequence of heavy framework 4

(The required parts of field names are in capital letters) nnnnnn represents a 6-digit accession code.

The following tests may be made:

   Test                   Property type  Notes
   ----                   -------------  -----
   =, ==, eq              all            Exact match (NOTE 1)
   !=, <>, ne             all            Mismatch
   <, lt                  integer        Less than
   <=, le                 integer        Less than or equal
   >, gt                  integer        Greater than
   >=, ge                 integer        Greater than or equal
   cont,like,sim,inc,sub  string         Sub-string match (NOTES 2,3)

NOTE 1: The exact match is not recommended for use with strings; it is much safer to use the substring match.
NOTE 2: These test words are all equivalent and are the minimum string required. Thus `contains' may be used in place of `cont'; `includes' in place of `inc', etc.
NOTE 3: Sub-string matches on the CDRs, LIGHT and HEAVY properties ignore dashes in the sequence allowing one to search for portions of sequence while ignoring gaps left for insertions.

String values do not need to be placed in inverted commas unless they include white space or inverted commas. Thus one can use the string "HYHEL-5'CL" to include a single inverted comma or 'a "name"' to include double inverted commas. Blank strings may be specified using '' or "".

Boolean values are entered as `true' or `false' (only the first letter, t or f, is actually considered).

All comparisons ignore case.


Reverse Polish Notation

WHERE clauses are combined using Reverse Polish Notation (RPN). Each clause generates a set which matches the specified test and places the set on a stack. The logical operators AND, OR and NOT cause operations to be performed on this stack of sets.

   Stack        Operator        Stack    Notes
   -----        --------        -----    -----
   1 2          AND             1        Performs a logical AND
   1 2          OR              1        Performs a logical OR
   1            NOT             1        Performs a logical NOT

The stack has a maximum depth of 10 items. This should be more than enough for all likely queries providing that the logical operators are not all placed at the end of the statement. The maximum stack depth is set at 10. In the unlikely event that you should exceed this, it is almost always possible to re-phrase your query to reduce stack usage. If you really need more stack depth than this, please EMail me.

If the concept of stacks and RPN is still unclear to you, please read the RPN Introduction.


Examples

In the examples below, keywords are given in upper case, but this is only done for clarity. The database is completely case-insensitive.

1. Find all complete antibodies where the antigen is known with loop lengths:

SELECT name,antigen,length(l1),length(l2),length(l3),
       length(h1),length(h2),length(h3)
WHERE  antigen ne '' complete eq true AND

2. Get the sequences of all complete mouse antibodies which bind to lysozyme, display the results in PIR format:

SELECT pir
WHERE  source   includes mouse
       antigen  includes lysozyme   AND
       complete eq       true       AND

3. Find all antibodies with 11 residue CDR-L1s and a proline at the sixth position:

SELECT name, l1
WHERE  len(l1) eq 11 res(L29) eq P AND

4. Find all complete antibodies with the sequence Ser-Ala-Ser-Ser-Ser in the light chain:

Note that there must be no spaces in the sequence

SELECT name, light
WHERE  complete = t
       light includes SASSS AND