Cluster-Buster Overview

    Cluster-Buster is our third-generation program for finding clusters of pre-specified motifs in nucleotide sequences. Its main application is detecting sequences that regulate gene transcription, such as enhancers and silencers, but other types of biological regulation may also be mediated by motif clusters. Cluster-Buster may be used via our web server or downloaded for use on your local computer. We also provide a downloadable program, Cluster-Trainer, for estimating optimal motif weights and gap parameters for Cluster-Buster.
    Martin C Frith, Michael C Li, and Zhiping Weng (2003). Nucleic Acids Research, 31(13):3666-8. (Abstract)
Here is some supporting data for our publication, which demonstrates that Cluster-Buster overcomes a fundamental problem of hidden Markov model algorithms.
  Sample Output
    This example shows output of Cluster-Buster applied to GenBank sequence AY007685, which contains the human TERT gene encoding the catalytic subunit of telomerase. The first diagram shows an overview of motif cluster locations in the sequence, along with protein-coding regions (CDS) annotated in the GenBank record:

Homo sapiens telomerase catalytic subunit (TERT) and sodium channel-like protein genes, complete cds.

KEY: motif cluster (green bars), protein-coding region (CDS)

Next, detailed information for each cluster is printed. Here are the details for the second strongest cluster, corresponding to the second tallest green bar in the overview:

Cluster: 21602 to 22627    Score: 25

Motif       Position        Strand  Score  Sequence
ERE         21687 to 21700  -       12.7   agatcagcctgacc
V$LYF1_01   21736 to 21744  -       8.55   tttgggagg
V$PITX2_Q2  21748 to 21758  -       10.4   tgtaatcccag
V$E12_Q6    21770 to 21780  -       7.49   gccaggtgcag
V$GNCF_01   21839 to 21856  +       8.72   atggagttcaatttcccc
V$E12_Q6    21946 to 21956  -       7.91   aacaggtggtc
V$E12_Q6    22018 to 22028  -       8.38   ggcagatggca
V$PITX2_Q2  22399 to 22409  -       10.5   tgtaatcccag
V$LYF1_01   22534 to 22542  -       8.16   tttaggagg
V$PITX2_Q2  22546 to 22556  -       10.2   tgtaatcccag
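
If you want to post-process motif rows like those listed for this cluster, the following Python sketch may help. It assumes each row has exactly the layout seen in this sample (motif name, "start to end", strand, score, sequence, separated by whitespace); the actual output of the program may differ, and the names MotifHit and parse_hit are hypothetical helpers, not part of Cluster-Buster.

```python
import re
from typing import NamedTuple


class MotifHit(NamedTuple):
    """One motif row from a cluster listing (hypothetical record type)."""
    motif: str
    start: int
    end: int
    strand: str
    score: float
    sequence: str


# Row layout assumed from the sample: name, "start to end", strand, score, sequence.
HIT_PATTERN = re.compile(
    r"^(\S+)\s+(\d+)\s+to\s+(\d+)\s+([+-])\s+([-+]?\d+(?:\.\d+)?)\s+([acgtnACGTN]+)$"
)


def parse_hit(line: str) -> MotifHit:
    """Parse one motif row; raise ValueError if it does not fit the assumed layout."""
    match = HIT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised motif line: {line!r}")
    motif, start, end, strand, score, sequence = match.groups()
    return MotifHit(motif, int(start), int(end), strand, float(score), sequence)


# Three rows copied from the sample cluster.
rows = [
    "ERE 21687 to 21700 - 12.7 agatcagcctgacc",
    "V$LYF1_01 21736 to 21744 - 8.55 tttgggagg",
    "V$GNCF_01 21839 to 21856 + 8.72 atggagttcaatttcccc",
]
hits = [parse_hit(row) for row in rows]

# Pick the highest-scoring motif among the parsed rows.
best = max(hits, key=lambda hit: hit.score)
print(best.motif, best.start, best.end, best.score)  # prints: ERE 21687 21700 12.7
```

Positions in the sample are inclusive, so each sequence length equals end - start + 1, which gives a quick sanity check on parsed rows.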
    Q: What do the scores mean?
A: The scores are log likelihood ratios. The cluster score is log [ prob(cluster sequence given that it's a cluster of real sites) / prob(cluster sequence given that it's random DNA) ]. The motif score is log [ prob(motif sequence given that it's a real site) / prob(motif sequence given that it's random DNA) ]. The higher the better.
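
As a rough illustration of the motif score formula, the Python sketch below computes a log likelihood ratio for a short site. Everything in it is an assumption made for illustration: the three-position motif matrix, the uniform background frequencies, and the use of the natural logarithm (the base is not stated here). It does not reproduce Cluster-Buster's own motif matrices or its cluster score.

```python
import math

# Hypothetical 3-position motif: probability of each base at each position
# (illustrative values only, not taken from any real motif).
MOTIF = [
    {"a": 0.7, "c": 0.1, "g": 0.1, "t": 0.1},
    {"a": 0.1, "c": 0.1, "g": 0.7, "t": 0.1},
    {"a": 0.1, "c": 0.7, "g": 0.1, "t": 0.1},
]

# Assumed model of "random DNA": uniform base frequencies.
BACKGROUND = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}


def motif_log_likelihood_ratio(site, motif=MOTIF, background=BACKGROUND):
    """Return log[ P(site | real site) / P(site | random DNA) ], natural log."""
    if len(site) != len(motif):
        raise ValueError("site length must equal motif length")
    score = 0.0
    for base, column in zip(site.lower(), motif):
        # Positions are treated as independent, so per-position log ratios add up.
        score += math.log(column[base] / background[base])
    return score


good_site = motif_log_likelihood_ratio("agc")  # matches the preferred base everywhere
poor_site = motif_log_likelihood_ratio("ttt")  # matches it nowhere
print(good_site > 0, poor_site < 0)  # prints: True True
```

A positive score means the site is more probable under the motif model than under the background, and a negative score means the opposite, which is why higher is better.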
Q: How high is high enough?
A: Unfortunately there's no easy answer. You could try running Cluster-Buster on some control sequences (matched for GC content, etc.) and seeing what scores you get.
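
One possible way to build control sequences with matched GC content is to shuffle the bases of the sequence of interest, as in the Python sketch below. This is only an assumption about what a suitable control might look like: shuffling preserves base composition exactly but destroys dinucleotide and higher-order structure, so real genomic controls may be preferable. The function names are hypothetical, and running Cluster-Buster on the controls is not shown.

```python
import random


def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.lower()
    return (sequence.count("g") + sequence.count("c")) / len(sequence)


def shuffled_control(sequence, rng):
    """Return a random permutation of the sequence (same base composition)."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


# Seeded generator so the controls are reproducible between runs.
rng = random.Random(0)

# Short stand-in sequence; in practice this would be the sequence being analysed.
original = "agatcagcctgacctttgggaggtgtaatcccag"
controls = [shuffled_control(original, rng) for _ in range(5)]

# Every control has exactly the same GC content as the original.
print(all(gc_content(c) == gc_content(original) for c in controls))  # prints: True
```

Scoring a set of such controls gives an empirical picture of what cluster scores arise by chance at that base composition, against which the scores on the real sequence can be compared.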
    Programming and documentation: Martin C Frith
Website authoring and design: Michael C Li
Comments and questions to Martin Frith
