How to use CATH lists

CATH is a hierarchical classification of protein domains. It is accessible over the web at http://www.biochem.ucl.ac.uk/bsm/cath/

However, it is probably most useful when accessed as raw data files which allow you to gather information on domains in an automated fashion. The raw data files are available from ftp://ftp.biochem.ucl.ac.uk/pub/cathdata/

Unfortunately the file formats are not documented, so here is some information about them.

CATH lists

The CATH lists themselves (e.g. caths.list.v2.3) are pretty straightforward. They simply contain a domain identifier (domid) followed by the C,A,T,H,S,N and I numbers, the number of amino acids in the domain and the resolution (999.0 for NMR structures and 1000.0 for PDB files which have been obsoleted). Look at the CATH web pages for an explanation of what C,A,T,H,S,N and I mean.

The domid is the 4-character PDB code followed by the chain name and the domain number. If there is no chain name then 0 is used; if there is only one domain in the chain then 0 is used for the domain number.
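As a minimal sketch of the domid layout described above, the following Perl splits an identifier into its three parts. The subroutine name split_domid is made up for illustration, and the pattern assumes the 6-character form (4-character PDB code, 1-character chain name, 1-digit domain number); identifiers in any other layout are rejected.

```perl
use strict;
use warnings;

# Split a CATH domain identifier such as '10gsA2' into its parts.
# Assumption: 4-character PDB code, 1-character chain name and a
# single-digit domain number, as described in the text.
# Returns an empty list (undef in scalar context) if the layout differs.
sub split_domid
{
    my ($domid) = @_;
    my ($pdb, $chain, $domnum) = $domid =~ /^(\w{4})(\w)(\d)$/
        or return;
    return ($pdb, $chain, $domnum);
}

my ($pdb, $chain, $domnum) = split_domid('10gsA2');
print "PDB code: $pdb, chain: $chain, domain: $domnum\n";
```

Remember that a chain of 0 means the chain has no name, and a domain number of 0 means the chain contains a single domain.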

The following lines of Perl may be useful to you:

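$ndom = 0;
while(<>)
{
    # Columns: domid, C, A, T, H, S, N, I, number of residues, resolution
    ($d, @fields) = split;
    ($c{$d}, $a{$d}, $t{$d}, $h{$d}, $s{$d}, $n{$d}, $i{$d}, $nres{$d}, $resol{$d}) = @fields;
    $domid[$ndom++] = $d;    # keep the domain identifiers in file order
}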

You now have an array @domid containing the list of domain identifiers and hashes indexed by domain identifier with the CATH data.
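Once the data are loaded, a common first step is to ignore entries whose resolution is one of the two flag values (999.0 for NMR, 1000.0 for obsoleted files). The sketch below shows one way to do that. It stores one record per domain rather than the parallel hashes used above, purely to keep the example self-contained. The subroutine names and the three input lines are invented for illustration and are not real CATH entries.

```perl
use strict;
use warnings;

# Parse CATH list lines into a hash of records keyed by domain identifier.
sub read_cath_list
{
    my (@lines) = @_;
    my %cath;
    foreach my $line (@lines)
    {
        my ($d, @f) = split ' ', $line;
        next unless @f == 9;    # skip lines without the 10 expected columns
        my %rec;
        @rec{qw(c a t h s n i nres resol)} = @f;
        $cath{$d} = \%rec;
    }
    return %cath;
}

# True if the resolution is a real value rather than one of the flag
# values (999.0 for NMR structures, 1000.0 for obsoleted PDB files).
sub has_real_resolution
{
    my ($resol) = @_;
    return $resol < 999.0;
}

# Made-up lines for illustration only; these are not real CATH entries.
my @example = (
    "1xxxA0 1 10 20 30 1 1 1 150 1.8",
    "2yyy00 2 40 50 60 1 1 1  90 999.0",
    "3zzzB1 3 30 70 80 1 1 1 210 1000.0",
);

my %cath = read_cath_list(@example);
foreach my $d (sort keys %cath)
{
    my $r = $cath{$d};
    next unless has_real_resolution($r->{resol});
    print "$d  CATH $r->{c}.$r->{a}.$r->{t}.$r->{h}  resolution $r->{resol}\n";
}
```

With the made-up input above, only the first domain is printed because the other two carry flag values.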

Reps lists

The reps lists give you a single example of a protein in each family at the respective classification level.

The reps lists have essentially the same format as the main CATH list, but have an extra column after the domain ID which just contains a '-'. I am not sure what that column is for!

The following lines of Perl may be useful:

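$ndom = 0;
while(<>)
{
    # Columns: domid, '-', C, A, T, H, S, N, I, number of residues, resolution
    ($d, $dash, @fields) = split;
    ($c{$d}, $a{$d}, $t{$d}, $h{$d}, $s{$d}, $n{$d}, $i{$d}, $nres{$d}, $resol{$d}) = @fields;
    $domid[$ndom++] = $d;    # keep the domain identifiers in file order
}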

You now have an array @domid containing the list of domain identifiers and hashes indexed by domain identifier with the CATH data.
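Since the only difference between the two formats is the extra '-' column, a single routine can read both kinds of file. This is a sketch under that assumption; the name split_cath_line and the two input lines are invented for illustration.

```perl
use strict;
use warnings;

# Split one line from either a CATH list or a reps list into
# (domid, C, A, T, H, S, N, I, number of residues, resolution).
sub split_cath_line
{
    my ($line) = @_;
    my ($d, @f) = split ' ', $line;
    shift @f if @f && $f[0] eq '-';    # drop the extra reps-list column
    return ($d, @f);
}

# Made-up lines for illustration only; these are not real CATH entries.
my @from_list = split_cath_line("1xxxA0 1 10 20 30 1 1 1 150 1.8");
my @from_reps = split_cath_line("1xxxA0 - 1 10 20 30 1 1 1 150 1.8");
print "Same fields from both formats\n" if "@from_list" eq "@from_reps";
```

This relies on the C number never being '-', so the test on the second column cannot misfire on a main CATH list.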

The domain file

The nightmare file is the domain file (e.g. domall.v2.3)! This describes the domain boundaries for each protein, so it tells you what a domain ID in the other files actually means in terms of the PDB file.

If a chain has only one domain, then it doesn't appear in the domain file.

The following describes the format of a line:

Column 1 contains the domid, but with the domain number always set to zero: all domains for that chain will be described on the one line. The next column looks something like 'D02' and describes the number of domains (in this case 2); the next column (e.g. 'F01') is the number of small protein fragments not associated with a domain.

We then get the description of the first domain. A domain may be made up of non-contiguous segments. The next column tells you the number of segments in the first domain. The following 3 columns give you the start residue of the first segment (chain label, residue number, insert code or - for none), followed by the end of the segment in the next 3 columns. These 6 columns then repeat for any additional segments. The number of segments and the segment boundaries then repeat for any additional domains.

Finally we get the non-domain-associated fragments, described if column 3 was anything other than 'F00'. This description consists of 6 columns describing the first and last residue of a fragment, followed by a seventh column giving the number of residues in the fragment in parentheses. This 7-column section repeats for each fragment.

Here is a simple example with two 1-segment domains and no fragments:

15c8H0 D02 F00  1  H    1 - H  113 -  1  H  114 - H  226 - 
|      |   |    |  |        |         |  |        |
|      |   |    |  |        |         |  |        Chain/resnum/insert-code of domain end
|      |   |    |  |        |         |  Chain/resnum/insert-code of domain start
|      |   |    |  |        |         Number of segments in second domain (1)
|      |   |    |  |        Chain/resnum/insert-code of domain end
|      |   |    |  Chain/resnum/insert-code of domain start
|      |   |    Number of segments in first domain (1)
|      |   Number of fragments (0)
|      Number of domains (2)
Domain ID

Here is a more complex example with a 2-segment domain, a 1-segment domain and a fragment:

10gsA0 D02 F01  2  A    2 - A   78 -  A  187 - A  208 -  1  A   79 - A 186 -  A  209 - A  209 - (1) 
|      |   |    |  |        |         |        |         |  |        |        |        |        |
|      |   |    |  |        |         |        |         |  |        |        |        |        Number of residues in fragment
|      |   |    |  |        |         |        |         |  |        |        |        End of fragment
|      |   |    |  |        |         |        |         |  |        |        Start of fragment
|      |   |    |  |        |         |        |         |  |        End of domain 2
|      |   |    |  |        |         |        |         |  Start of domain 2
|      |   |    |  |        |         |        |         Number of segments in domain 2
|      |   |    |  |        |         |        End of domain 1 segment 2
|      |   |    |  |        |         Start of domain 1 segment 2
|      |   |    |  |        End of domain 1 segment 1
|      |   |    |  Start of domain 1 segment 1
|      |   |    Number of segments in domain 1
|      |   Number of fragments
|      Number of domains
Domain identifier
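
The format described above can be read by splitting the line on white space and consuming the columns in order. The following is a minimal sketch rather than a tested production parser: the subroutine name parse_domall_line and the layout of the returned data structure are my own choices, and it does no checking for truncated lines beyond the first three columns.

```perl
use strict;
use warnings;

# Parse one line of the domain file. Returns a reference to a hash with:
#   id        - the domain identifier from column 1
#   domains   - one array per domain, each holding that domain's segments
#   fragments - the fragments not associated with a domain
# Each segment or fragment is a hash whose 'start' and 'end' entries are
# [chain label, residue number, insert code]; fragments also have 'nres'.
# Returns undef if the first three columns do not look like a domain line.
sub parse_domall_line
{
    my ($line) = @_;
    my @tok = split ' ', $line;
    return unless @tok >= 3;

    my ($id, $dtok, $ftok) = splice(@tok, 0, 3);
    my ($ndom)  = $dtok =~ /^D(\d+)$/ or return;
    my ($nfrag) = $ftok =~ /^F(\d+)$/ or return;

    my @domains;
    for (1 .. $ndom)
    {
        my $nseg = shift @tok;    # number of segments in this domain
        my @segments;
        for (1 .. $nseg)
        {
            my @start = splice(@tok, 0, 3);
            my @end   = splice(@tok, 0, 3);
            push @segments, { start => \@start, end => \@end };
        }
        push @domains, \@segments;
    }

    my @fragments;
    for (1 .. $nfrag)
    {
        my @start  = splice(@tok, 0, 3);
        my @end    = splice(@tok, 0, 3);
        my ($nres) = shift(@tok) =~ /\((\d+)\)/;    # e.g. '(1)'
        push @fragments, { start => \@start, end => \@end, nres => $nres };
    }

    my %rec = (id => $id, domains => \@domains, fragments => \@fragments);
    return \%rec;
}

# The more complex example line from above
my $rec = parse_domall_line(
    '10gsA0 D02 F01  2  A    2 - A   78 -  A  187 - A  208 -  1  A   79 - A 186 -  A  209 - A  209 - (1)');

foreach my $n (0 .. $#{ $rec->{domains} })
{
    my @ranges = map { sprintf("%s-%s", $_->{start}[1], $_->{end}[1]) }
                 @{ $rec->{domains}[$n] };
    print "Domain ", $n + 1, ": @ranges\n";
}
```

For the 10gsA0 line this lists residues 2-78 and 187-208 for domain 1 and residues 79-186 for domain 2. Splitting on white space works here because every column, including the '-' used for a missing insert code, is separated from its neighbours by at least one space in the examples shown.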

Utilities

Here are useful utilities to use with the CATH data: