How to use CATH Lists

CATH is a hierarchical classification of protein domains, accessible at cathdb.info.

However, it is probably most useful when accessed as raw data files which allow you to gather information on domains in an automated fashion. The raw data files are available from ftp://ftp.biochem.ucl.ac.uk/pub/cathdata/

Unfortunately the file formats are not documented, so here is some information about the CATH file formats.

CATH Lists

The CATH lists themselves (e.g. caths.list.v2.3) are pretty straightforward. They simply contain a domain identifier (domid) followed by the C,A,T,H,S,N and I numbers, the number of amino acids in the domain and the resolution (999.0 for NMR structures and 1000.0 for PDB files which have been obsoleted). Look at the CATH web pages for an explanation of what C,A,T,H,S,N and I mean.

The domid is the 4-character PDB code followed by the chain name and the domain number. If there is no chain name, 0 is used in its place; if there is only one domain in the chain, 0 is used for the domain number.

The following lines of Perl may be useful to you:

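$ndom = 0;
while(<>)
{
    ($d) = split;    # The first field is the domain identifier
    # All fields: domid, C, A, T, H, S, N, I, number of residues, resolution
    ($junk, $c{$d}, $a{$d}, $t{$d}, $h{$d}, $s{$d}, $n{$d}, $i{$d}, $nres{$d}, $resol{$d}) = split;
    $domid[$ndom++] = $d;
}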

Reps Lists

The reps lists give you a single example of a protein in each family at the respective classification level.

The Reps lists are essentially the same format as the main CATH list, but have an extra column after the domain ID which just has a - in it. Not sure of the function of that!

The following lines of Perl may be useful:

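$ndom = 0;
while(<>)
{
    ($d) = split;    # The first field is the domain identifier
    # The second $junk skips the extra column containing the -
    ($junk, $junk, $c{$d}, $a{$d}, $t{$d}, $h{$d}, $s{$d}, $n{$d}, $i{$d}, $nres{$d}, $resol{$d}) = split;
    $domid[$ndom++] = $d;
}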

You now have an array @domid containing the list of domain identifiers and hashes indexed by domain identifier with the CATH data.

The domain file

The nightmare file is the domain file!! (e.g. domall.v2.3). This describes the domain boundaries for each protein so it tells you what a Domain ID in the other file actually means in terms of the PDB file.

If a chain has only one domain, then it doesn't appear in the domain file.

The following describes the format of a line:

Column 1 contains the domid, but with the domain number always set to zero: all domains for that chain will be described on the one line. The next column looks something like 'D02' and describes the number of domains (in this case 2); the next column (e.g. 'F01') is the number of small protein fragments not associated with a domain.

We then get the description of the first domain. A domain may be made up of non-contiguous segments. The next column tells you the number of segments in the first domain. The following 3 columns give you the start residue of the first segment (chain label, residue number, insert code or - for none), followed by the end of the segment in the next 3 columns. These 6 columns then repeat for any additional segments. The number of segments and the segment boundaries then repeat for any additional domains.

Finally we get the non-domain-associated fragments, which are described if column 3 was anything other than 'F00'. This description consists of 6 columns describing the first and last residue of a fragment, followed by the number of residues in the fragment in parentheses. This 7-column section repeats for each fragment.

Here is a simple example with 2 1-segment domains and no fragments:

15c8H0 D02 F00  1  H    1 - H  113 -  1  H  114 - H  226 - 
|      |   |    |  |        |         |  |        |
|      |   |    |  |        |         |  |        Chain/resnum/insert-code of domain end
|      |   |    |  |        |         |  Chain/resnum/insert-code of domain start
|      |   |    |  |        |         Number of segments in second domain (1)
|      |   |    |  |        Chain/resnum/insert-code of domain end
|      |   |    |  Chain/resnum/insert-code of domain start
|      |   |    Number of segments in first domain (1)
|      |   Number of fragments (0)
|      Number of domains (2)
Domain ID

Here is a more complex example with a 2-segment domain, a 1-segment domain and a fragment:

10gsA0 D02 F01  2  A    2 - A   78 -  A  187 - A  208 -  1  A   79 - A 186 -  A  209 - A  209 - (1) 
|      |   |    |  |        |         |        |         |  |        |        |        |        |
|      |   |    |  |        |         |        |         |  |        |        |        |        Number of residues in fragment
|      |   |    |  |        |         |        |         |  |        |        |        End of fragment
|      |   |    |  |        |         |        |         |  |        |        Start of fragment
|      |   |    |  |        |         |        |         |  |        End of domain 2
|      |   |    |  |        |         |        |         |  Start of domain 2
|      |   |    |  |        |         |        |         Number of segments in domain 2
|      |   |    |  |        |         |        End of domain 1 segment 2
|      |   |    |  |        |         Start of domain 1 segment 2
|      |   |    |  |        End of domain 1 segment 1
|      |   |    |  Start of domain 1 segment 1
|      |   |    Number of segments in domain 1
|      |   Number of fragments
|      Number of domains
Domain identifier


Here are useful utilities to use with the CATH data:

  • A script to read a CATH domain file and write out a simple-to-read version. Information about fragments not associated with domains is ignored.
  • A script based on the above, to generate split domain files from PDB files and the domain list. To use this script, you will also need to download my getpdb program. This is a C program and comes as a gzipped tar file. [Installation instructions]

Preparing Proteins for Crystallography

How much protein do I need?
10mg is sufficient for initial crystallization trials. Restricted trials are also possible with 2-5mg. Ideally you want as much as you can get, so that the same preparation used for the crystallization trials is also available for growing the actual crystals. (A different prep. might need slightly different conditions.)
What concentration should the protein solution be?
10-15mg/ml is typical, but crystals have been grown using 5-70mg/ml. Absolute minimum is 1mg/ml. If you know that your protein aggregates at 10mg/ml, then there is no point in producing a solution of this strength!
How pure must the preparation be?
Quite simply as pure as you can get it! Ideally >98% and at least >90%. Purities as low as 80% have been known to produce crystals, but your chances are much lower. Make use of CD spectra, dynamic light scattering (DLS), gel filtration and EM techniques to check for aggregation, denaturation, etc.
What tags and trimmings should I use?
There is no simple answer. Use what works, try everything!
What buffer should I use?
Ideally water so as not to interfere with conditions used for crystallization trials. However, most proteins have poor solubility in pure water. Avoid high salt concentrations - around 20mM is typically OK (certainly less than 100mM), but this depends on the salt type. TRIS is the best choice. Avoid phosphate and ammonium sulphate if possible as these can cause problems.
What pH should I use?
pH4-9 is typical, but start at the pI of the protein. This is one of the parameters that will be varied during crystallization trials.
Where can I get more information?
The Hampton Research web site has lots of useful information on crystallization and preparing samples for crystallography.

Multi-dimensional Arrays in C

This page tries to explain a common misconception about using multi-dimensional arrays in C.

What you CAN do with C

C allows you to create a 2D array simply by using the square bracket nomenclature. For example:

int array[MAXI][MAXJ];

It then allows you to use a 'slice' of that array as if it were a 1D array. For example:

for(i=0; i<MAXI; i++)
   printline(array[i], MAXJ);

where the function printline() takes a pointer:

void printline(int *line, int len)

Here is a complete sample program which demonstrates this.

What you CAN'T do with C

So, a logical extension of this is that if you wanted to provide a routine which printed the whole array rather than a line at a time, you should be able to do:

int array[MAXI][MAXJ];
printarray(array, MAXI, MAXJ);

with the printarray() function taking a pointer-to-pointer:

void printarray(int **array, int maxx, int maxy)

However, this does not work! The compiler should complain that array is of the wrong type. You could get around this by casting it to int **, but if you do that the program core dumps. Here's the complete source so you can try it for yourself. (Define CAST if you want to try it with the cast.)

Fix the array sizes

The reason this is the case is that, despite what you might think (and what many C books imply), the first dimension of the 2D array is not an array of pointers. Instead the whole 2D array exists as a single block of data and the compiler simply calculates offsets to access a particular cell. When you pass the 2D array like this, the function you call doesn't know what the offsets are.

The answer, therefore, is that instead of passing the 2D array with a pointer-to-pointer, you have to pass it as an array with dimensions specified in the function:

printarray(array, MAXI, MAXJ);

with the function defined as:

void printarray(int array[MAXI][MAXJ], int maxx, int maxy)

Here is some sample code that works properly.

Use malloc

This is all very well and is absolutely fine if you are doing this in a standalone piece of C source code. But what happens if you wish to place the printarray() function in a library so it can be used by many programs which need to pass it arrays of different sizes? It won't work!

The answer is to use malloc() to allocate your 2D arrays so that the first dimension genuinely is an array of pointers:

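if((array = (int **)malloc(MAXI * sizeof(int *)))==NULL)
   exit(1);                 /* Out of memory: handle as appropriate */
for(i=0; i<MAXI; i++)
   if((array[i] = (int *)malloc(MAXJ * sizeof(int)))==NULL)
      exit(1);              /* Out of memory: handle as appropriate */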

then you can call the routine with:

printarray(array, MAXI, MAXJ);

and define it as:

void printarray(int **array, int maxx, int maxy)

Here is a sample program which demonstrates this.

My bioplib libraries provide routines Array2D() and Array3D() which do all the allocation for you.

Another example

Further proof of the way that C 'tricks' you into thinking that arrays without the square brackets are the same as pointers comes from another example, this time using a 1D array.

One often reads that if you have an array which has been defined as:

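int array1d[MAXJ];

then, if you use just array1d, you have a pointer to the array. This is not strictly the case! In most expressions the name is converted to a pointer to the first element, but no pointer variable is actually stored anywhere.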

Suppose we have the printarray() routine used above which prints out the contents of a 2D array, but we want to use it to print a 1D array. If the name of the 1D array is indeed a pointer, then all we should need to do is get a reference to this pointer (i.e. a pointer-to-pointer) using & and pass that. i.e.:

int array1d[MAXJ];
printarray(&array1d, 1, MAXJ);

Once again, however, this doesn't work. The compiler will complain about incompatible pointer types, because the type of &array1d is int (*)[MAXJ] (a pointer to the whole array), not int **. If you cast &array1d to int ** and then run the program, it core dumps. You can try the code.

As before, the answer is to create the 1D array with malloc():

int *array1d;
if((array1d = (int *)malloc(MAXJ * sizeof(int)))==NULL)
   exit(1);                 /* Out of memory: handle as appropriate */
printarray(&array1d, 1, MAXJ);

You can download the complete example code and try it yourself.