Data in the database is accessed using a SQL-like query language. The main deviation between the language used here and SQL is that clauses within the WHERE statement are combined in reverse polish notation. This is described in detail below. Also, since there is only one table in the database, there is no FROM statement.
Some simple example queries are shown below.
The query language uses three statements: SET, SELECT and WHERE. The SELECT and WHERE commands take you into a mode where clauses are specified with no introductionary SELECT or WHERE keyword. Conversely, the SET command needs to be specified on each line where variables are to be set and must be given before any SELECT or WHERE statement.
The QUIT or EXIT command is used to leave the program.
This allows variables to be set. These are used to control the operation of the program. The SET keyword must be given at the start of each line containing variables to be set and must be given before the SELECT and WHERE statements. More than one variable/value pair may be specified on the line. The syntax is thus:
SET variable value [variable value ...]
(Square brackets indicate optional items; you don't type them!)
The following variables may be set:
Variable Values Notes -------- ------ ----- LOOPS KABAT, ABM, CHOTHIA Sets the definitions to use for the CDRs INFO integer Sets the information level for messages INSERTS ON, OFF Specifies whether CDRs should be displayed with or without alignment dashes VARIABILITY real (0...100) Specifies a sequence identity threshold (see below). HTML ON, OFF Adds some HTML codes to the output to make the results look nicer over the web (Defaults to ON) URL string Specify the URL CANONICAL string Specify the canonical class definition file (DEFAULT, AUTO or STRICT)
The VARIABILITY variable allows one to specify that only sequences with less than a specified sequence homology be displayed. After all WHERE sub-clauses have been executed, KabatMan will do a pairwise comparison of all the remaining antibodies and reject one from each pair which has a percentage sequence identity greater than this value. For each sequence mismatch, a penalty of 100/MeanLength % is scored; missing chains do not score a penalty; deletions in one sequence with respect to the other are given a double penalty (i.e. 200/MeanLength). By default, VARIABILITY is set to 0.0 which is used to indicate `off' (this has the same effect as setting it to 100, but the program will run faster since the variability code will not be called).
N.B. The variability option is SLOW if there are more than a couple of hundred hits to be processed. For example, with 1124 hits (using the query SELECT light WHERE complete = t), takes 4 minutes on a 100MHz R4000 SG Indy.
The option to specify the URL allows you to specfiy a different URL string to be used when selecting the URLL or URLH fields. You may also give the command SET URL DEFAULT to restore the default URL. If you use this command, you must specify the format in the correct fashion. It must start with href= and end with the HTML GET method query. The actual accession code and anchor tags will be appended to the string.
The default string is:
If this doesn't mean anything to you, then DO NOT use this option!
The SET CANONICAL command allows the canonical definitions (key residue) to be specified. By default, a modified version of the definitions used by AbM are used. Other options are SET CANONICAL AUTO (for my automatically generated definitions) and SET CANONICAL STRICT (for strict definitions from the papers by Chothia et al.). See the Web page for using the Chothia assignment software for further details.
Specifies what data should be extracted from the database. Each field may be separated by a space and/or a comma. Parameters are placed in parentheses.
The select statement may extend over more than one line.
The following fields may be displayed:
Property Notes -------- ------------------------- NAME Name of antibody ANTIGEN Antigen (or blank string) L1 Sequence of CDR L1 L2 Sequence of CDR L2 L3 Sequence of CDR L3 H1 Sequence of CDR H1 H2 Sequence of CDR H2 H3 Sequence of CDR H3 CLASS Light chain class (derived from file or filename) SOURCE Animal source (derived from filename) REFerence Publication reference LENgth(loop) Length of specified CDR RESidue(pos) Amino acid at specified residue PIRfile Create a PIR style output LIGHT The light chain sequence HEAVY The heavy chain sequence CANonical(loop) The Chothia canonical class for a loop IDLight The accession code for the light chain IDHeavy The accession code for the heavy chain URLLight An HTML hypertext link to the light chain raw data URLHeavy An HTML hypertext link to the heavy chain raw data SUBGroup(chain) The subgroup of a human light or heavy chain DATe The year of the earliest reference LFR1 Sequence of light framework 1 LFR2 Sequence of light framework 2 LFR3 Sequence of light framework 3 LFR4 Sequence of light framework 4 HFR1 Sequence of heavy framework 1 HFR2 Sequence of heavy framework 2 HFR3 Sequence of heavy framework 3 HFR4 Sequence of heavy framework 4
(The required parts of field names are in capital letters)
Specifies filters to be applied to the data in order to select a subset which is to be displayed.
Each clause is specified in conventional order notation using the form:
property test value
The following properties are supported:
Property Type Notes -------- ---- ----- NAME string The antibody name ANTIGEN string The antigen (if specified) L1 string Sequence of CDR L1 L2 string Sequence of CDR L2 L3 string Sequence of CDR L3 H1 string Sequence of CDR H1 H2 string Sequence of CDR H2 H3 string Sequence of CDR H3 CLASS string Light chain class (derived from file or name) SOURCE string Animal source (derived from filename) REFerence string Publication reference LENgth(loop) integer Length of specified CDR RESidue(pos) char Amino acid at specified position COMPLETE bool True of both chains are present VARiability integer Not implemented LIGHT string The light chain sequence HEAVY string The heavy chain sequence CANonical(loop) string Chothia class name or ? if unmatched IDLight nnnnnn The accession code for the light chain IDHeavy nnnnnn The accession code for the heavy chain SUBGroup(chain) string The subgroup of a human light or heavy chain DATe The year of the earliest reference LFR1 Sequence of light framework 1 LFR2 Sequence of light framework 2 LFR3 Sequence of light framework 3 LFR4 Sequence of light framework 4 HFR1 Sequence of heavy framework 1 HFR2 Sequence of heavy framework 2 HFR3 Sequence of heavy framework 3 HFR4 Sequence of heavy framework 4
(The required parts of field names are in capital letters) nnnnnn represents a 6-digit accession code.
The following tests may be made:
Test Property type Notes ---- ------------- ----- =, ==, eq all Exact match (NOTE 1) !=, <>, ne all Mismatch <, lt integer Less than <=, le integer Less than or equal >, gt integer Greater than >=, ge integer Greater than or equal cont,like,sim,inc,sub string Sub-string match (NOTES 2,3)
NOTE 1: The exact match is not recommended for use with strings; it is
much safer to use the substring match.
NOTE 2: These test words are all equivalent and are the minimum string required. Thus `contains' may be used in place of `cont'; `includes' in place of `inc', etc.
NOTE 3: Sub-string matches on the CDRs, LIGHT and HEAVY properties ignore dashes in the sequence allowing one to search for portions of sequence while ignoring gaps left for insertions.
String values do not need to be placed in inverted commas unless they include white space or inverted commas. Thus one can use the string "HYHEL-5'CL" to include a single inverted comma or 'a "name"' to include double inverted commas. Blank strings may be specified using '' or "".
Boolean values are entered as `true' or `false' (only the first letter, t or f, is actually considered).
All comparisons ignore case.
WHERE clauses are combined using Reverse Polish Notation (RPN). Each clause generates a set which matches the specified test and places the set on a stack. The logical operators AND, OR and NOT cause operations to be performed on this stack of sets.
Stack Operator Stack Notes ----- -------- ----- ----- 1 2 AND 1 Performs a logical AND 1 2 OR 1 Performs a logical OR 1 NOT 1 Performs a logical NOT
The stack has a maximum depth of 10 items. This should be more than enough for all likely queries providing that the logical operators are not all placed at the end of the statement. The maximum stack depth is set at 10. In the unlikely event that you should exceed this, it is almost always possible to re-phrase your query to reduce stack usage. If you really need more stack depth than this, please EMail me.
If the concept of stacks and RPN is still unclear to you, please read the RPN Introduction.
In the examples below, keywords are given in upper case, but this is only done for clarity. The database is completely case-insensitive.
1. Find all complete antibodies where the antigen is known with loop lengths:
SELECT name,antigen,length(l1),length(l2),length(l3), length(h1),length(h2),length(h3) WHERE antigen ne '' complete eq true AND
2. Get the sequences of all complete mouse antibodies which bind to lysozyme, display the results in PIR format:
SELECT pir WHERE source includes mouse antigen includes lysozyme AND complete eq true AND
3. Find all antibodies with 11 residue CDR-L1s and a proline at the sixth position:
SELECT name, l1 WHERE len(l1) eq 11 res(L29) eq P AND
4. Find all complete antibodies with the sequence Ser-Ala-Ser-Ser-Ser in the light chain:
Note that there must be no spaces in the sequence
SELECT name, light WHERE complete = t light includes SASSS AND