Cister Instructions

Cister predicts regulatory regions in DNA sequences by searching for clusters of cis-elements. You can just give her a DNA sequence, select which types of cis-elements you want to search for, and go!

These instructions are for using Cister on the Web. There is also a downloadable version that can be run on the command line; see the download page.

Explanation of Output

The results are displayed as a plot like this.

The colored lines indicate probabilities that regulatory factors bind to cis-elements at these positions. The black curve indicates the overall probability of being within a cluster of cis-elements bound by their factors. Each color corresponds to a different kind of binding site, as described in the key. Lines in the upper half of the plot indicate cis-elements on the direct strand, and lines in the lower half refer to the complementary strand. In this example, which is the output for the whole genome of the SV40 virus, Cister correctly identifies the promoter at the start of the genome, and makes no false positive predictions.

The program also produces a table of high-scoring cis-elements, like this:

Possible functional cis-elements
type  position  strand  sequence         probability
Sp1   93        +       taggggcgggact    0.85
Sp1   72        +       taggggcgggatg    0.65
LSF   272       +       gctggttctttccgc  0.56
Sp1   50        +       aatgggcggaact    0.36
Ets   305       -       aagttcctctt      0.28
Sp1   356       -       cgccaggcctccg    0.22
Ets   53        +       gggcggaactg      0.22
Sp1   39        +       atggggcggagaa    0.14
Sp1   60        +       actgggcggagtt    0.13
Ets   143       -       ttctgcctgct      0.11
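If you copy this table out of the results page for further analysis, a short script can read it. The following is a minimal sketch, assuming the table has been saved as whitespace-separated text with the five columns shown above; the names `Hit` and `parse_hits` are hypothetical and are not part of Cister.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    kind: str           # cis-element type, e.g. "Sp1"
    position: int       # coordinate reported in the table
    strand: str         # "+" or "-"
    sequence: str       # matched sequence
    probability: float  # probability reported in the table


def parse_hits(text: str) -> List[Hit]:
    """Read rows of 'type position strand sequence probability'."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything malformed.
        if len(fields) != 5 or fields[0] == "type":
            continue
        kind, position, strand, sequence, probability = fields
        hits.append(Hit(kind, int(position), strand, sequence, float(probability)))
    return hits


table = """type position strand sequence probability
Sp1 93 + taggggcgggact 0.85
LSF 272 + gctggttctttccgc 0.56
Ets 305 - aagttcctctt 0.28
"""
hits = parse_hits(table)
# Keep only the rows whose probability is above 0.5.
strong = [h for h in hits if h.probability > 0.5]
print([h.kind for h in strong])
```

The three rows in `table` are taken from the example output above; the 0.5 cut-off is an arbitrary value chosen for illustration.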


Sequence Input Format

Cister understands GenBank format, and will display annotated protein coding regions (CDS) in the output. Alternatively, FASTA format (with comment lines starting with ">") or raw sequence can be entered; digits, spaces and newlines are ignored. Maximum sequence length: 100 kb (download Cister if you want to analyze longer sequences).
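To check in advance what Cister will see when you paste FASTA or raw sequence, the cleaning rule above can be imitated locally. This is a minimal sketch of that rule only, not Cister's own parser; `clean_raw_sequence` and `MAX_WEB_LENGTH` are hypothetical names, and GenBank-format input is not handled.

```python
MAX_WEB_LENGTH = 100_000  # the 100 kb limit quoted above for the web form


def clean_raw_sequence(text: str) -> str:
    """Drop '>' comment lines, then remove digits, spaces and newlines."""
    kept = []
    for line in text.splitlines():
        if line.startswith(">"):  # FASTA comment line
            continue
        kept.append("".join(ch for ch in line
                            if not ch.isdigit() and not ch.isspace()))
    return "".join(kept)


example = """>example
  1 gggcggaact gggcggagtt
 21 aggggcggga
"""
seq = clean_raw_sequence(example)
print(seq)
print(len(seq), len(seq) <= MAX_WEB_LENGTH)
```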

GenBank Identifiers

Instead of pasting a sequence, you can enter a GenBank identifier: for example, an accession number (e.g. NC_001669), an 'accession.version' number (e.g. NC_001669.1), or a GI number (e.g. 9628421). Please note: you may want to check that your identifier refers to a promoter sequence. For example, GenBank accessions from Affymetrix chips may refer to mRNA sequences, which don't include the promoter region.

Set Subsequence

You may limit the search to a subsequence by entering its start and end coordinates. (The first nucleotide in the sequence has coordinate 1.) The default values of the From and To fields are the start and end of the sequence, respectively.
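Because the coordinates are 1-based, they differ by one from the 0-based indices used in most programming languages. The sketch below shows the conversion, assuming that the end coordinate is inclusive (the text above does not say so explicitly); `subsequence` is a hypothetical helper, not part of Cister.

```python
from typing import Optional


def subsequence(seq: str, start: Optional[int] = None,
                end: Optional[int] = None) -> str:
    """Return seq[start..end] using 1-based coordinates, end inclusive (assumed)."""
    # Defaults mirror the From and To fields: the whole sequence.
    start = 1 if start is None else start
    end = len(seq) if end is None else end
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    return seq[start - 1:end]


print(subsequence("acgtacgt", 2, 4))  # nucleotides 2, 3 and 4
print(subsequence("acgtacgt"))        # defaults give the whole sequence
```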

Format for User-defined Cis-elements

Cis-elements can be entered as TRANSFAC-style matrices, which look like this:

NA   AML-1a
XX
DE   runt-factor AML-1
XX
BF   T02256; AML1a; Species: human, Homo sapiens.
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
03      4     14      1     38      T
04      0      0     57      0      G
05      1      0     55      1      G
06      1      4      0     52      T

You can cut-and-paste these directly from the TRANSFAC website. All lines except the name line (beginning with 'NA') and the position-specific nucleotide frequency lines (beginning with digits) are ignored. The name line is required, and should be above the base frequency lines.
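The reading rule just described (keep the 'NA' line and the lines beginning with digits, ignore everything else) can be sketched as follows. This illustrates the rule as stated on this page and is not Cister's own parser; `parse_transfac` is a hypothetical name, and the optional strand-weight line is not handled.

```python
from typing import Dict, List, Tuple

Row = Tuple[float, float, float, float]  # counts for A, C, G, T


def parse_transfac(text: str) -> Dict[str, List[Row]]:
    """Collect the count rows that follow each 'NA' name line."""
    matrices: Dict[str, List[Row]] = {}
    name = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NA":
            name = " ".join(fields[1:])
            matrices[name] = []
        elif fields[0].isdigit():
            if name is None:
                raise ValueError("frequency line found before any NA line")
            # fields[0] is the position index; the next four are A, C, G, T.
            a, c, g, t = (float(x) for x in fields[1:5])
            matrices[name].append((a, c, g, t))
        # All other lines (XX, DE, BF, P0, ...) are ignored.
    return matrices


example = """NA   AML-1a
XX
DE   runt-factor AML-1
XX
P0      A      C      G      T
01      5      1      2     49      T
02      2      2     52      1      G
"""
print(parse_transfac(example))
```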

Alternatively, each cis-element can have a title line starting with ">" and then the name of the element, followed by 4 numbers per line describing nucleotide frequencies at each position in the cis-element. For example:

>element1
14 4 2 0
0 0 12 8
8 8 1 3
20 0 0 0
3 3 13 1
10 0 10 0
3 3 6 8
>element2
13 1 1 5
...

These numbers might come from a multiple alignment of experimentally determined cis-elements. The first column indicates the number of adenines observed in each position, the second column the number of cytosines, the third column the number of guanines, and the fourth column the number of thymines. Gaps in the cis-element may be indicated by entering a single "b" on a line. Cister will use background nucleotide frequencies at these positions. This option allows users to specify multipartite cis-elements. In addition, if a transcription factor is known to occlude several bases adjacent to its sequence-specific binding site from binding other factors, this steric hindrance can be modelled by specifying a number of "b" (blank / background) positions above and below the sequence-specific portion of the cis-element definition.
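As an illustration of this second format, including the "b" lines, here is a minimal parsing sketch. It is not Cister's own code: `parse_elements` is a hypothetical name, a background position is represented as `None`, and the optional strand-weight line is not handled.

```python
from typing import Dict, List, Optional, Tuple

Row = Optional[Tuple[float, float, float, float]]  # (A, C, G, T), or None for "b"


def parse_elements(text: str) -> Dict[str, List[Row]]:
    """Parse '>name' title lines followed by 'A C G T' count lines or 'b'."""
    elements: Dict[str, List[Row]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            elements[name] = []
        elif name is None:
            raise ValueError("count line found before any title line")
        elif line == "b":
            elements[name].append(None)  # gap: use background frequencies here
        else:
            counts = [float(x) for x in line.split()]
            if len(counts) != 4:
                raise ValueError("expected 4 numbers per line: " + line)
            elements[name].append((counts[0], counts[1], counts[2], counts[3]))
    return elements


# A made-up two-part element: two specific positions separated by one gap.
example = """>demo
14 4 2 0
b
0 0 12 8
"""
print(parse_elements(example))
```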

The 2 formats can be mixed. Optionally, there may be an extra line following the name line (in either format) specifying weights for the cis-element on each strand (see the download page for more details).

Parameters

Cister detects cis-element clusters by using a statistical model (a hidden Markov model) of what it expects these clusters to look like. The more closely this model matches real clusters, the better Cister will do. The parameters allow the user to vary some aspects of the model, and it is quite possible that different model parameters are suitable for different types of cis-element cluster.

a
The distance between neighboring cis-elements within a cluster is assumed to be geometrically distributed with mean a.
b
The number of cis-elements in a cluster is assumed to be geometrically distributed with mean b.
g
The distance between regulatory cis-element clusters is assumed to be geometrically distributed with mean g.

These 3 parameters should be chosen to resemble what you expect to find in a real functional cis-element cluster. Since the distributions are all geometric, the median is about 70% of the mean.
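The 70% figure follows from the fact that, for a geometric distribution with a large mean, the median is close to the mean multiplied by ln 2 (about 0.693). The sketch below checks this numerically. It assumes a geometric distribution on the values 1, 2, 3, ...; whether Cister counts from 0 or from 1 is not stated here, and the means used are arbitrary illustrative values rather than Cister defaults.

```python
import math


def geometric_median(mean: float) -> int:
    """Median of a geometric distribution on 1, 2, 3, ... with the given mean."""
    if mean <= 1:
        raise ValueError("mean must be greater than 1")
    p = 1.0 / mean  # per-step stopping probability
    # Smallest m with 1 - (1 - p)**m >= 0.5
    return math.ceil(-math.log(2) / math.log(1.0 - p))


for mean in (20, 35, 1000):
    median = geometric_median(mean)
    print(mean, median, round(median / mean, 3))
```

In practice this means that if you expect a typical (median) spacing of d bases, a mean of roughly d / 0.7 is a reasonable starting value.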

The background states are programmed to represent the local abundances of the 4 bases in the query sequence. Examining local abundances accounts for the biological reality of heterogeneous base composition, and prevents, for example, many spurious GC-rich motifs being detected in a part of the sequence that happens to be generally GC-rich.

w
The base abundances are counted in a window of length 2w+1 around each point in the query sequence.
Motif probability threshold
Cis-element predictions are displayed only if their posterior probability is above this threshold.
Pseudocount
This value will be added to all the counts in the cis-element matrices.
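These three settings can be illustrated with a short sketch. It is not Cister's implementation: the function names are hypothetical, positions are 0-based, and the window is simply truncated at the ends of the sequence, which is an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple


def local_base_counts(seq: str, pos: int, w: int) -> Dict[str, int]:
    """Count a, c, g, t in the window of length 2w+1 centred on seq[pos]."""
    lo = max(0, pos - w)             # truncate at the left end (assumption)
    hi = min(len(seq), pos + w + 1)  # truncate at the right end (assumption)
    counts = Counter(seq[lo:hi].lower())
    return {base: counts[base] for base in "acgt"}


def add_pseudocount(matrix: List[Tuple[float, ...]],
                    pseudocount: float) -> List[Tuple[float, ...]]:
    """Add the pseudocount to every count in a cis-element matrix."""
    return [tuple(c + pseudocount for c in row) for row in matrix]


def passes_threshold(probability: float, threshold: float) -> bool:
    """A prediction is displayed only if its probability is above the threshold."""
    return probability > threshold


print(local_base_counts("aacgggcgtt", pos=5, w=2))   # window is "gggcg"
print(add_pseudocount([(0, 0, 57, 0)], 1.0))
print(passes_threshold(0.85, 0.1), passes_threshold(0.05, 0.1))
```

A pseudocount keeps positions with a zero count, such as the (0, 0, 57, 0) row above, from ruling out a base entirely.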

Algorithm

Cister uses the technique of posterior decoding, with this hidden Markov model:

Citation:

Frith, M. C., Hansen, U. and Weng, Z.
Detection of cis-element clusters in higher eukaryotic DNA
Bioinformatics 2001 Oct;17(10):878-889.

If you use cis-element matrices from TRANSFAC, please cite:
Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I. and Schacherer, F.
TRANSFAC: an integrated system for gene expression regulation
Nucleic Acids Res. 28, 316-319 (2000).
