www.bioinf.org.uk : Dr. Andrew C.R. Martin's Group

Apologies - the RAID controller on our main server has failed. This affects our database and other services - most of these WILL NOT WORK! We are working to restore this as soon as possible.

Cluster is a program for performing cluster analysis on an arbitrary set of vectors. Dynamic memory allocation is used throughout so the data is limited only by the memory (or swap space) of you computer. Seven cluster analysis algorithms are implemented: Ward's minimum variance method, Single linkage, Complete linkage, Group average, McQuitty's method, Gower's median method, Centroid method.

It is heavily based on a FORTRAN program by F. Murtagh, ESA/ESO/STECF, Garching, which is available in STATLIB. See ftp://unix.hensa.ac.uk/mirrors/statlib/multi/hcl. This original FORTRAN code was re-implemented in C such that it is only necessary to provide a file of vectors for clustering.

A simple input file for cluster is:

-8.172 -9.271 22.292 -7.711 -9.179 19.162 -6.312 -8.998 19.678 -7.708 -9.701 17.728 -7.139 -11.205 21.771 -6.503 -11.361 23.084 -5.104 -11.990 22.966 -4.928 -13.035 22.336 -7.404 -12.188 24.061

You do not need to specify the number of rows (vectors) or the number of columns (dimensions of each vector).

You supply a set of vectors as input, these are then clustered by one of the algorithms the program implements. The output is in 2 forms, a cluster table and a dendogram. The table shows the "SEQ NOS" i.e. the sequence of vectors you supplied (1 refers to the first vector, 2 to the second, etc. The "2CL" column then shows you a clustering into 2 clusters, "3CL" into three, etc.

The dendogram shows you the same clustering graphically. Note that the branch lengths are not proportional to the scale on the left hand side which shows you the distances between the clusters as they are merged (or the loss of information content if using the default Ward's method. Note also that only one merge occurs per step so the value on the left may be occur more than once when two merges should have occurred at exactly the same time.

e.g. If you try the dataset:

1 5 8 2 3 8 1 0 5 3 6 9 1 9 9 6 3 8 2 9 7

clusters 1 & 7 and 2 & 6 are both clustered with a value of 2.5 - really the dendogram should show them merging at the same time.

Note that I said "clusters" in the previous line, not "vectors" This is potentially the most confusing thing about the output. The numbers that appear along the bottom of the dendogram are the cluster numbers when N vectors are placed in N clusters. So in the example dataset I just gave you with 7 vectors, when there are 7 clusters, the cluster number isn't necessarilly the same as the vector number. Here is the cluster table for that example (Ward's method):

SEQ NOS 2CL 3CL 4CL 5CL 6CL 7CL 8CL 9CL ------- --- --- --- --- --- --- --- --- ---- 1 1 1 1 1 1 1 2 1 1 1 1 1 7 3 1 3 3 3 3 3 4 1 1 1 5 5 5 5 2 2 2 2 2 2 6 1 1 4 4 4 4 7 2 2 2 2 6 6

So for 7 clusters, Cluster 1 is indeed vector 1, cluster 7 is vector 2, cluster 2 is vector 5 and cluster 6 is vector 7.

A couple of other useful sites relating to clustering are:

Cluster is freely available for use by not-for-profit organisations. Commerical use is also permitted providing the author is informed. It is supplied as a gzipped tar file of C source file.