

Cluster-Buster Trainer

  Cluster-Trainer Introduction
    Cluster-Trainer is an auxiliary program for use with Cluster-Buster. It estimates optimal 'motif weights' to use with Cluster-Buster. These weights can also be interpreted as abundances of the motifs (occurrences per kb). Given a set of DNA sequences and a set of motif definitions, Cluster-Trainer estimates how abundant each motif is in the sequences, and the average distance between neighboring motifs.
  1. Download Cluster-Trainer by clicking on one of these links and saving the file on your computer:
    Cluster-Trainer executable for Linux (Redhat 7.2/7.3)
    Cluster-Trainer executable for Sun (Solaris 8)
    Cluster-Trainer executable for SGI/IRIX
  2. Set execute permission for the file by typing chmod +x ctrain-linux (or whatever you saved it as).
  3. Cluster-Trainer is now ready to run.
  Alternatively, to compile Cluster-Trainer from source:
  1. Download the Cluster-Trainer source code.
  2. Uncompress: gunzip ctrain-src.tar.gz
  3. Un-archive: tar -xvf ctrain-src.tar
  4. Change directory: cd ctrain-src
  5. Compile (cross your fingers): make
Unfortunately, the source code does not compile successfully on all systems. Suggestions for making the code more portable would be greatly appreciated!

Cluster-Trainer requires two inputs: a file of DNA sequences in the standard FASTA format (here is an example), and a file of motifs. Any non-alphabetic characters in the sequences are ignored, and any alphabetic characters except A, C, G, T (uppercase or lowercase) are converted to 'n' and forbidden from matching motifs.
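The character handling described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not Cluster-Trainer's actual code; the function name `normalize_sequence` is made up for this example.

```python
def normalize_sequence(raw):
    """Drop non-alphabetic characters; turn letters other than A/C/G/T into 'n'."""
    cleaned = []
    for ch in raw:
        if not ch.isalpha():
            continue  # digits, whitespace, '-', '*' and similar are ignored
        # A, C, G, T keep their case; every other letter becomes 'n'
        cleaned.append(ch if ch in "ACGTacgt" else "n")
    return "".join(cleaned)


print(normalize_sequence("ACGT 12 nnRY-acgu*"))  # prints ACGTnnnnacgn
```

Positions holding 'n' would then be excluded when matching motifs.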

The motif file should contain matrices in the following format:

0  4 2 14
12 0 0 8
8  0 1 11
20 0 0 0
13 1 1 5
The rows of each matrix correspond to successive positions of the motif, from 5' to 3', and the columns indicate the frequencies of A, C, G, and T, respectively, in each position. These frequencies are usually obtained from alignments of protein-binding sites.
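As a rough illustration of how such a matrix can be read, the Python sketch below parses the rows shown above and reports the most frequent nucleotide at each position. The helper names `parse_matrix` and `consensus` are hypothetical, and the sketch handles only the numeric rows of a single matrix.

```python
EXAMPLE = """\
0  4 2 14
12 0 0 8
8  0 1 11
20 0 0 0
13 1 1 5
"""


def parse_matrix(text):
    """Return one [A, C, G, T] list of counts per motif position (5' to 3')."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError("expected 4 columns (A C G T): %r" % line)
        rows.append([float(x) for x in fields])
    return rows


def consensus(rows):
    """Most frequent nucleotide at each position (first one wins a tie)."""
    return "".join("ACGT"[max(range(4), key=row.__getitem__)] for row in rows)


matrix = parse_matrix(EXAMPLE)
print(len(matrix), "positions, consensus", consensus(matrix))
```

For the example matrix every row sums to 20 aligned sites, and the consensus is TATAA.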

Cluster-Trainer attempts to find the motif weights that maximize a 'score'. (This score is a log likelihood ratio comparing the positive hypothesis, that the sequences contain the motifs at the given abundances, with the null hypothesis of random, independent nucleotides.) The program uses a technique called 'Expectation-Maximization' (E-M), which starts with a random set of weights and iteratively changes them so as to improve the score. Since this technique is only guaranteed to find a local maximum, the program performs several trials from different starting points. If most trials give roughly the same answer, and the rest give lower-scoring answers, then the program has found a reproducibly good set of weights, even if it is not the true global optimum.
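The idea of repeating a local optimization from several random starting points can be sketched as follows. This is not the E-M procedure itself: a toy hill-climber on a made-up one-dimensional score with two peaks stands in for one trial, and all names (`toy_score`, `local_search`, `run_trials`) are assumptions for illustration.

```python
import math
import random


def toy_score(x):
    """Made-up score with a local peak near x = 2 and a higher peak near x = -2."""
    return math.exp(-(x - 2) ** 2) + 2 * math.exp(-(x + 2) ** 2)


def local_search(score, x, step=0.5, min_step=1e-4):
    """Stand-in for one trial: climb until no neighboring point scores higher."""
    best = score(x)
    while step >= min_step:
        moved = False
        for candidate in (x - step, x + step):
            s = score(candidate)
            if s > best:
                x, best, moved = candidate, s, True
                break
        if not moved:
            step /= 2  # refine the search around the current point
    return x, best


def run_trials(score, n_trials=10, seed=0, agree_tol=1e-3):
    """Run several trials from random starts; report the best and how many agree."""
    rng = random.Random(seed)
    results = [local_search(score, rng.uniform(-5, 5)) for _ in range(n_trials)]
    best_x, best_score = max(results, key=lambda r: r[1])
    n_agree = sum(1 for _, s in results if best_score - s <= agree_tol)
    return best_x, best_score, n_agree


x, s, n = run_trials(toy_score)
print("best x = %.3f, score = %.3f, %d of 10 trials agree" % (x, s, n))
```

Trials that end on the lower peak report a lower score, which is the pattern to look for when judging whether the best answer is reproducible.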

Cluster-Trainer prints the best set of weights that it finds (motif abundances per kb), the corresponding score, and the corresponding average distance between neighboring motifs. These parameters can be fed directly to Cluster-Buster.

The program's behavior can be modified with the following options. The defaults are designed to give sensible results in most cases.

Help: print documentation.
Filename for writing the motif matrices together with their weights, in a format usable as input to Cluster-Buster.
Number of trials to perform.
-f: Specify a fixed value for the average distance between neighboring motifs. This forces the sum of the motif abundances to remain fixed.
-r: Range in bp for counting local nucleotide abundances. Abundances of A, C, G, and T vary significantly along natural DNA sequences. The program estimates the local nucleotide abundances at each position in the sequence by counting them up to this distance in both directions. These abundances form the null hypothesis for the score calculation. If this parameter is made too low, there is a danger that the null hypothesis will match the sequences too well and all the motifs will be assigned very low weights.
Stop each trial when the score improves by less than this amount.
Mask lowercase letters in the sequences (i.e. forbid motifs from matching them). Lowercase letters are often used to indicate repetitive regions.
Force all sequences to be considered. By default, at each E-M iteration, any sequence which would contribute negatively to the score is ignored. The default behavior allows for experimental error or biological heterogeneity in the collection of sequences.
Assume motif clusters go all the way to the ends of sequences. By default, the program considers that the motifs may only be present in some sub-region of each sequence. The default behavior is reasonable when the exact boundaries of the functional region/motif cluster are not known, and a larger region of sequence was taken in the hope of including the entire regulatory region, perhaps flanked by non-regulatory DNA.
Initial guess for the average distance between neighboring motifs. This provides a rough guide for the choice of random starting weights for each trial.
Pseudocount. This value gets added to all entries in the motif matrices. Pseudocounts are a standard way of estimating underlying frequencies from a limited number of observations. If your matrices contain probabilities rather than counts, you should probably set this parameter to zero.
Verbose: print information for each iteration of E-M within each trial. By default a single dot is printed at the start of each iteration.
Example usage: ctrain-linux -r10000 -f35 mymotifs myseqs.fa
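To illustrate the pseudocount option described above, here is a minimal Python sketch that adds a pseudocount to every matrix entry and then normalizes each row to probabilities. The normalization step and the value 1.0 are assumptions made for this example; the text above does not state the program's default or its exact arithmetic.

```python
def add_pseudocount(counts, pseudocount=1.0):
    """Add `pseudocount` to every entry, then normalize each row to sum to 1.

    The default of 1.0 is an illustrative value, not the program's default.
    """
    probs = []
    for row in counts:
        smoothed = [c + pseudocount for c in row]
        total = sum(smoothed)
        if total == 0:
            raise ValueError("row sums to zero; cannot normalize")
        probs.append([v / total for v in smoothed])
    return probs


row = [[20, 0, 0, 0]]  # fourth row of the example matrix: A in all 20 sites
print(add_pseudocount(row, 1.0))  # C, G, T each get 1/24 instead of 0
print(add_pseudocount(row, 0.0))  # no smoothing: C, G, T stay at exactly 0
```

With a pseudocount of zero, a nucleotide never seen at a position gets probability zero, so a single mismatch there rules out a match entirely. That is reasonable for matrices that already hold probabilities, but harsh for counts from a small alignment.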

  Problems & Fixes
    Cluster-Trainer may assign excessively high weights to motifs that resemble 'low-complexity sequence' (e.g. GC-rich or AT-rich motifs). If this problem occurs, you could try masking low-complexity regions in the sequences using programs such as nseg or dust before applying Cluster-Trainer. Alternatively, omit the problematic motifs from training. If all the motifs receive excessively low weights, you could try increasing the -r parameter, or fixing the average distance between neighboring motifs to a plausible value such as 35 with the -f option.
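    The effect of the -r range on the null hypothesis can be seen in a small Python sketch. It counts nucleotides within a given distance of a position, in both directions. Whether the program includes the position itself, and how it treats sequence ends, are assumptions here; the function name `local_abundances` is hypothetical.

```python
def local_abundances(seq, pos, radius):
    """Fractions of A, C, G, T within `radius` bases of `pos`, in both directions.

    Assumptions: the position itself is counted, and the window is simply
    truncated at the sequence ends.
    """
    window = seq[max(0, pos - radius): pos + radius + 1].upper()
    counts = {base: window.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        return {base: 0.25 for base in "ACGT"}  # assumed fallback: uniform
    return {base: counts[base] / total for base in "ACGT"}


# A short G-rich stretch embedded in AT-rich sequence
seq = "AT" * 20 + "GGGGGGGG" + "AT" * 20
middle = 44  # inside the run of G's

print(local_abundances(seq, middle, 3)["G"])   # narrow window: G fraction is 1.0
print(local_abundances(seq, middle, 40)["G"])  # wide window: G fraction is 8/81
```

    With the narrow window, the background model already predicts G with certainty, so a G-rich motif cannot score better than the null hypothesis at that site. A wider range keeps the background from fitting such short stretches too closely.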
Comments and questions to Martin Frith
