bioinf.org.uk - Prof. Andrew C.R. Martin's group at UCL

Description

Cluster is a program for performing cluster analysis on an arbitrary set of vectors. Dynamic memory allocation is used throughout so the data is limited only by the memory (or swap space) of you computer. Seven cluster analysis algorithms are implemented: Ward's minimum variance method, Single linkage, Complete linkage, Group average, McQuitty's method, Gower's median method, Centroid method.

It is heavily based on a FORTRAN program by F. Murtagh, ESA/ESO/STECF, Garching, which is available in STATLIB. See http://ftp.uni-bayreuth.de/math/statlib/multi/hcl. This original FORTRAN code was re-implemented in C such that it is only necessary to provide a file of vectors for clustering.

A simple input file for cluster is:

 -8.172  -9.271  22.292
 -7.711  -9.179  19.162
 -6.312  -8.998  19.678
 -7.708  -9.701  17.728
 -7.139 -11.205  21.771
 -6.503 -11.361  23.084
 -5.104 -11.990  22.966
 -4.928 -13.035  22.336
 -7.404 -12.188  24.061

You do not need to specify the number of rows (vectors) or the number of columns (dimensions of each vector).

A brief explanation of the output...

You supply a set of vectors as input, these are then clustered by one of the algorithms the program implements. The output is in 2 forms, a cluster table and a dendogram. The table shows the "SEQ NOS" i.e. the sequence of vectors you supplied (1 refers to the first vector, 2 to the second, etc. The "2CL" column then shows you a clustering into 2 clusters, "3CL" into three, etc.

The dendogram shows you the same clustering graphically. Note that the branch lengths are not proportional to the scale on the left hand side which shows you the distances between the clusters as they are merged (or the loss of information content if using the default Ward's method. Note also that only one merge occurs per step so the value on the left may be occur more than once when two merges should have occurred at exactly the same time.

e.g. If you try the dataset:

clusters 1 & 7 and 2 & 6 are both clustered with a value of 2.5 - really the dendogram should show them merging at the same time.

Note that I said "clusters" in the previous line, not "vectors" This is potentially the most confusing thing about the output. The numbers that appear along the bottom of the dendogram are the cluster numbers when N vectors are placed in N clusters. So in the example dataset I just gave you with 7 vectors, when there are 7 clusters, the cluster number isn't necessarilly the same as the vector number. Here is the cluster table for that example (Ward's method):

     SEQ NOS 2CL 3CL 4CL 5CL 6CL 7CL 8CL 9CL
     ------- --- --- --- --- --- --- --- --- ----
          1   1   1   1   1   1   1
          2   1   1   1   1   1   7
          3   1   3   3   3   3   3
          4   1   1   1   5   5   5
          5   2   2   2   2   2   2
          6   1   1   4   4   4   4
          7   2   2   2   2   6   6

So for 7 clusters, Cluster 1 is indeed vector 1, cluster 7 is vector 2, cluster 2 is vector 5 and cluster 6 is vector 7.

Other sites

A couple of other useful sites relating to clustering are:

Licence

Cluster is freely available for use by all.

Download

Cluster is supplied as a gzipped tar file of C source file.

Download