Clover: Cis-eLement OVERrepresentation
Introduction
Clover is a program for identifying functional sites in DNA sequences. If you give it a set of DNA sequences that share a common function, it will compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which if any of the motifs are statistically overrepresented in the sequence set.
Publication
Martin C Frith, Yutao Fu, Liqun Yu, Jiang-Fan Chen, Ulla Hansen, Zhiping Weng (2004). Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Research 32(4):1372-81.
Here are the data sets studied in the paper.
Changes
2-21-2005: Made the source code slightly more portable.
2-16-2005: Fixed the source code for fussy compilers.
3-3-2004: Fixed crashes on some kinds of bad input.
10-26-2003: Added minor error checking. SGI version available.
Get Clover
- Download Clover by clicking on one of these links, and saving the file on your computer:
  Clover executable for Linux (RedHat 7.2/7.3)
  Clover executable for Sun (Solaris 8)
  Clover executable for SGI/IRIX
  Clover executable for Mac OS X (Oct 26 2003 version)
- Set execute permission for the file by typing chmod +x clover-linux (or whatever you saved it as).
- Clover is now ready to run.

- Download the Clover source code.
- Uncompress: gunzip clover-src.tar.gz
- Un-archive: tar -xvf clover-src.tar
- Change directory: cd clover-src
- Compile (cross your fingers): make
Unfortunately the source code doesn't compile successfully on all systems. We'd love to hear your suggestions for making it more portable.
Get a Motif Library
Another source of motifs is TRANSFAC: the commercial nature of this database prevents us from providing it directly.
Get Background Sequences
- Human chromosome 20 (44.1% C+G) - finished sequence
- Sequences 2000 bp upstream of human genes (49.8% C+G) - from UCSC 08-Jul-2003
- Human CpG islands (68.8% C+G, median length = 557 bp) - from UCSC 14-Apr-2003
- Mouse chromosome 19 (42.8% C+G) - NCBI Build 30
- Sequences 2000 bp upstream of mouse genes (47.8% C+G) - from UCSC 25-Apr-2003
- Drosophila chromosome 2 arm R (43.5% C+G) - from BDGP Release 3
These files need to be uncompressed (using gunzip) before using them with Clover.
Required Input
>TATA
0 0 0 10
10 0 0 0
0 0 0 10
10 0 0 0
>E-box
1 20 1 1
(etc)
Each motif begins with a title line containing the character '>' followed by the motif's name. Subsequent lines represent successive positions of the motif, from 5' to 3', and the columns contain counts of A, C, G, and T, respectively, observed at each position. These numbers typically come from an alignment of several binding sites for a transcription factor.
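To make the file layout concrete, here is a minimal Python sketch of a reader for this format. It is not part of Clover; the function names parse_motifs and consensus are hypothetical, and the sketch assumes whitespace-separated counts with exactly four columns per row.

```python
def parse_motifs(text):
    """Parse motif matrices: a '>' title line, then one row of A C G T counts per position."""
    motifs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            motifs[name] = []
        else:
            if name is None:
                raise ValueError("count row found before any '>' title line")
            counts = [float(x) for x in line.split()]
            if len(counts) != 4:
                raise ValueError(f"expected 4 columns (A C G T), got {len(counts)}")
            motifs[name].append(counts)
    return motifs


def consensus(rows):
    """Most frequent letter at each position, read 5' to 3'."""
    return "".join("ACGT"[row.index(max(row))] for row in rows)


example = """>TATA
0 0 0 10
10 0 0 0
0 0 0 10
10 0 0 0
"""
motifs = parse_motifs(example)
tata_consensus = consensus(motifs["TATA"])
```

Reading the consensus back from the matrix is a quick way to check that the columns are in A, C, G, T order.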
Raw Scores and P-values
Clover will compare each motif in turn to the sequence set, and calculate a
"raw score" indicating how strongly the motif is present in the sequence set.
Raw scores by themselves are hard to interpret, so Clover provides options
(which we recommend you use) to determine the statistical significance of the
raw scores. Four ways of determining statistical significance are available. The
first involves providing Clover with one or more files of background DNA
sequences. Each background file should contain sequences in FASTA format, with
total length much greater than the target sequence set. For each background set,
Clover will repeatedly extract random fragments matched by length to the target
sequences, and calculate raw scores for these fragments. The proportion of times
that the raw score of a fragment set exceeds or equals the raw score of the
target set, e.g. 0.02, is called a P-value.
The second way of determining statistical significance is to repeatedly
shuffle the letters within each target sequence, and use these shuffled sequence
sets as controls.
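The two control strategies described above can be sketched in Python as follows. This is only an illustration of the idea, not Clover's implementation: all function names are hypothetical, toy_raw_score is a stand-in (it just counts occurrences of the string TATA) because the real raw score is not defined here, and the background is a single random string rather than a FASTA file.

```python
import random


def empirical_p_value(target_score, control_scores):
    """Proportion of control scores that exceed or equal the target score."""
    hits = sum(1 for s in control_scores if s >= target_score)
    return hits / len(control_scores)


def random_fragments(background, lengths, rng):
    """Extract one random background fragment per target sequence, matched by length."""
    fragments = []
    for n in lengths:
        start = rng.randrange(len(background) - n + 1)
        fragments.append(background[start:start + n])
    return fragments


def shuffle_sequence(seq, rng):
    """Shuffle the letters within one sequence."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def p_value_from_controls(targets, make_control, score_fn, n_controls, rng):
    """Score the targets, then score n_controls control sets and compare."""
    target_score = score_fn(targets)
    control_scores = [score_fn(make_control(rng)) for _ in range(n_controls)]
    return empirical_p_value(target_score, control_scores)


def toy_raw_score(seqs):
    # Stand-in for a real raw score (assumption): count TATA occurrences.
    return sum(seq.count("TATA") for seq in seqs)


rng = random.Random(1)
targets = ["GGTATAAAGG", "CCTATATACC"]
background = "".join(rng.choice("ACGT") for _ in range(5000))
lengths = [len(t) for t in targets]

p_background = p_value_from_controls(
    targets, lambda r: random_fragments(background, lengths, r),
    toy_raw_score, 1000, rng)
p_shuffle = p_value_from_controls(
    targets, lambda r: [shuffle_sequence(t, r) for t in targets],
    toy_raw_score, 1000, rng)
```

With 1000 control sets the smallest non-zero P-value that can be reported is 0.001, so the number of controls limits the resolution of the estimate.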
Advice
In our experience to date, the use of background sequence sets works best. However, it is necessary to choose the background sets carefully: they should ideally come from the same taxonomic group as the target sequences, and have similar repetitive element and GC content. We like to cover our bases by using multiple background sets, e.g. for human target sequences, we might use a human chromosome, a set of human CpG islands, and a set of human gene upstream regions as backgrounds. The methods that randomize nucleotides and dinucleotides tend to report motifs that lie in Alus and other common repetitive elements as significant. You should avoid including orthologous sequences from closely related species, e.g. human and mouse, as that will artefactually boost the significance of motifs in these sequences.
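One way to check whether a background set has GC content similar to the target sequences is to compute the C+G fraction of each. The Python sketch below is an assumption-laden illustration, not a Clover feature: read_fasta and gc_fraction are hypothetical helper names, and letters other than A, C, G and T (such as N) are simply ignored.

```python
def read_fasta(lines):
    """Return the sequences from FASTA-formatted lines, one string per record."""
    seqs = []
    current = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current:
                seqs.append("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        seqs.append("".join(current))
    return seqs


def gc_fraction(sequences):
    """Fraction of C+G among the A, C, G, T letters (case-insensitive)."""
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(base) for base in "ACGT")
    return gc / total if total else 0.0


records = read_fasta([">a", "AC", "GT", ">b", "GG"])
fraction = gc_fraction(records)
```

For real files, read_fasta can be given an open file handle, e.g. `with open("myseqs.fa") as f: seqs = read_fasta(f)`, and the result compared with the C+G percentages listed for the background sets.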
Output
score = log[ prob(sequence|motif) / prob(sequence|random) ]
Details are printed for motif instances with score >= some threshold, by default 6.
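As an illustration of the score formula, the following Python sketch computes a log-likelihood ratio for one candidate site and scans a sequence for sites at or above a threshold. Several things here are assumptions rather than facts about Clover: the natural logarithm, a uniform random model (0.25 per base), a pseudocount of 1 added to each matrix entry, forward-strand scanning only, and the function names site_score and scan.

```python
import math


def site_score(site, counts, pseudocount=1.0):
    """log[ prob(site|motif) / prob(site|random) ] with a uniform random model (assumption)."""
    if len(site) != len(counts):
        raise ValueError("site length must equal motif width")
    score = 0.0
    for base, row in zip(site.upper(), counts):
        # Estimate the motif probability of this base from the counts (A, C, G, T order).
        total = sum(row) + 4 * pseudocount
        p_motif = (row["ACGT".index(base)] + pseudocount) / total
        score += math.log(p_motif / 0.25)
    return score


def scan(sequence, counts, threshold, pseudocount=1.0):
    """Return (position, score) for every window scoring >= threshold (forward strand only)."""
    width = len(counts)
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if set(window) <= set("ACGT"):  # skip windows containing masked or ambiguous letters
            s = site_score(window, counts, pseudocount)
            if s >= threshold:
                hits.append((i, s))
    return hits


tata = [[0, 0, 0, 10], [10, 0, 0, 0], [0, 0, 0, 10], [10, 0, 0, 0]]
hits = scan("GGTATAGG", tata, threshold=4.0)
```

Positions returned by scan are 0-based offsets into the sequence. A short matrix like this one cannot reach a high score, because each perfectly matching position contributes at most about log(4) under these assumptions.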
Options
| Help: print documentation. | |
| Number of randomized/control raw scores to calculate for comparison with each target raw score. | |
| Score threshold for printing locations of significant motifs. This parameter doesn't affect raw score and | |
| Perform sequence (nucleotide) shuffles. | |
| Perform dinucleotide randomizations. | |
| Perform motif shuffles. | |
| Mask (convert to 'n') any lowercase letters in the target sequences (and background sequences, if any). Lowercase letters are often used to indicate repetitive elements. | |
| Verbose: print per-sequence scores for significant motifs. When calculating a motif's raw score, preliminary scores are first obtained for the motif compared to each sequence, and these are then combined to form the overall raw score. The | |
| Pseudocount to add to each entry of the motif matrices. Pseudocounts are a widely used technique, with a theoretical underpinning in Bayesian statistics, for estimating underlying frequencies from a limited number of counts. If your matrices contain probabilities rather than counts, you should probably set the pseudocount to zero. | |
| Seed for the random number generator (default = 1). |
clover -t 0.05 mymotifs myseqs.fa background1.fa background2.fa
Good luck finding those motifs!
Suggestions to: Martin Frith
Last modified: Sunday, 20-Feb-2005 21:58:05 EST

