From CAGT

REPFIND Help


Changes

3-9-2004: Fixed bug in downloadable REPFIND, where the -b option didn't work.

Output

REPFIND produces a graphical display of the repeats that it finds, followed by a textual summary of each repeat cluster. An example output graphic, for the 3'UTR of the Xpat gene of the frog Xenopus laevis, is shown below. The bars mark each position of the repeated word within the sequence; the colored bars indicate the repeats that form the strongest cluster. The different colors serve only to distinguish repeats that lie close together and have no further meaning. On many computers you can save the image by right-clicking it and selecting the appropriate menu option; other systems should offer an equivalent way to save it.

[Figure: example graphical output of REPFIND]

Interpretation of P-values

A P-value of 1e-5 means that such a concentration of that particular repeated word would be expected to occur by chance only once in 10^5 clusters of the individual repeat being examined. Because REPFIND tests all possible words, we would also like to know how often any word would reach this P-value by chance in a random sequence. The following graph shows how often repeat clusters with particular P-values occur, per kb, in pseudorandom sequence. For example, a P-value of 1e-5 occurs on average once in 10,000 bp by chance.

[Figure: graph of REPFIND P-values]
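
As a rough worked example of the rate quoted above (once in 10,000 bp for a P-value of 1e-5, i.e. 0.1 per kb), the sketch below scales a per-kb rate to the length of your own sequence. The function names are hypothetical and are not part of REPFIND. Treating chance clusters as Poisson events is an assumption made here for illustration, not something REPFIND is stated to do.

```python
import math


def expected_chance_clusters(length_bp: int, rate_per_kb: float) -> float:
    """Expected number of chance clusters in a sequence of length_bp,
    given a rate per kb read off the graph of P-values."""
    return rate_per_kb * length_bp / 1000.0


def prob_at_least_one(expected: float) -> float:
    """Probability of one or more chance clusters.
    ASSUMPTION: chance clusters behave as Poisson events."""
    return 1.0 - math.exp(-expected)


# A 2,000 bp 3' UTR at 0.1 chance clusters per kb (the P = 1e-5 example)
lam = expected_chance_clusters(2000, 0.1)
print(lam, prob_at_least_one(lam))
```

Under these assumptions, a 2,000 bp sequence would be expected to contain about 0.2 chance clusters at P = 1e-5, or roughly an 18% chance of seeing at least one.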

Input

This section describes some of the fields found on the REPFIND Input Form.

Sequence Input Format

Digits, spaces, newlines, and fasta-style comments beginning with ">" will be ignored.
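
If you want to see what a pasted sequence looks like after this clean-up, the following minimal sketch applies the same rules. It is an illustration, not REPFIND's own code. The function name is hypothetical, and it assumes that a fasta-style comment occupies the whole line that starts with ">".

```python
def clean_sequence(raw: str) -> str:
    """Remove FASTA-style comment lines, digits, and whitespace."""
    kept = []
    for line in raw.splitlines():
        # ASSUMPTION: a comment is a whole line whose first
        # non-blank character is ">"
        if line.lstrip().startswith(">"):
            continue
        kept.append(
            "".join(ch for ch in line if not ch.isdigit() and not ch.isspace())
        )
    return "".join(kept)


pasted = """>example 3' UTR fragment
  1 acgtacgtac gtacgtacgt
 21 ttgg
"""
print(clean_sequence(pasted))
```

This means sequence copied from a GenBank flat file, with its position numbers and blocks of ten bases, can be pasted without editing.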

GenBank Identifiers

Enter a GenBank identifier: an accession number (e.g. X72340), an 'accession.version' number (e.g. X72340.1), or a GI number (e.g. 312302).
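
The three identifier forms can be told apart by shape alone. The sketch below is illustrative only: the function name is hypothetical and the regular expressions are deliberately loose assumptions. They do not reproduce GenBank's full accession rules or the way REPFIND parses this field.

```python
import re

# ASSUMPTION: loose, illustrative patterns; real accession formats are
# more tightly specified by NCBI.
_ACCESSION = re.compile(r"[A-Za-z]{1,6}_?\d{5,}")
_ACCESSION_VERSION = re.compile(r"[A-Za-z]{1,6}_?\d{5,}\.\d+")


def classify_identifier(text: str) -> str:
    """Return 'gi', 'accession.version', 'accession', or 'unknown'."""
    ident = text.strip()
    if ident.isascii() and ident.isdigit():
        return "gi"  # all digits, e.g. 312302
    if _ACCESSION_VERSION.fullmatch(ident):
        return "accession.version"  # e.g. X72340.1
    if _ACCESSION.fullmatch(ident):
        return "accession"  # e.g. X72340
    return "unknown"


for example in ("X72340", "X72340.1", "312302"):
    print(example, "->", classify_identifier(example))
```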

Set Subsequence

You may limit the search to a subsequence by entering its start and end coordinates. (The first nucleotide in the sequence has coordinate 1.) The default values of the From and To fields are the start and end of the sequence, respectively.
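
Because coordinates start at 1, they differ from the 0-based indexing used by many programming languages. The sketch below shows the conversion in Python. The function name is hypothetical, and it assumes that the To coordinate is inclusive, which the description above suggests but does not state outright.

```python
from typing import Optional


def subsequence(seq: str, start: int = 1, end: Optional[int] = None) -> str:
    """Extract seq[start..end] using 1-based coordinates.
    ASSUMPTION: the end coordinate is inclusive.
    Defaults cover the whole sequence, like the From and To fields."""
    if end is None:
        end = len(seq)
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    # 1-based inclusive [start, end] maps to the 0-based slice [start-1, end)
    return seq[start - 1:end]


print(subsequence("ACGTACGT", 3, 5))  # nucleotides 3 through 5
```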

P-value cutoff

Only repeat clusters with P-values lower than this cutoff will be displayed.

Low Complexity Filter

Real nucleotide sequences often contain so-called low complexity sequence, meaning tracts of predominantly one nucleotide, dinucleotide tandem repeats, and the like. Since they probably do not correspond to the type of signal that you are looking for with REPFIND, they may be masked out with the program dust, which is widely used with other sequence analysis tools, such as BLAST, for similar reasons.

Statistical background

To calculate how unlikely a repeat is, REPFIND needs to know how abundant the nucleotides within the repeated word are. By default these abundances are obtained from the input sequence. However, they may also be obtained from databases of Xenopus, human, and S. cerevisiae 3' UTRs that we have compiled. By default, the abundances of dinucleotides are used (a first-order Markov model), thus accounting for, e.g., the reduced abundance of CpG relative to what the individual abundances of C and G would predict. Alternatively, you may select the use of mononucleotides up to hexanucleotides by choosing a Markov model of order zero to five. Since a 5th-order Markov model requires the frequencies of 4^6 = 4096 hexanucleotides, the dataset used for counting them should contain many more base pairs than this to give meaningful results.
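
To make the role of the Markov order concrete, the following sketch estimates the probability of a word from k-mer counts in a background sequence. It is a textbook-style illustration under stated assumptions, not REPFIND's implementation. All function names are hypothetical, the background is assumed to be uppercase A/C/G/T, and the step from word probability to a cluster P-value is not covered.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def parameter_count(order: int) -> int:
    """Number of (order+1)-mers whose frequencies the model needs."""
    return 4 ** (order + 1)


def word_probability(word: str, background: str, order: int = 1) -> float:
    """Probability of word under a Markov model of the given order,
    with frequencies estimated from background (illustrative sketch)."""
    if len(word) < order:
        raise ValueError("word must be at least as long as the model order")
    p = 1.0
    if order > 0:
        # Probability of the first `order` bases, from their raw frequency
        lo = kmer_counts(background, order)
        p = lo[word[:order]] / sum(lo.values())
    hi = kmer_counts(background, order + 1)
    for i in range(order, len(word)):
        context = word[i - order:i]
        # Conditional probability of the next base given its context
        total = sum(hi[context + base] for base in "ACGT")
        if total == 0:
            return 0.0  # context never seen in the background
        p *= hi[context + word[i]] / total
    return p


background = "ACGT" * 10
print(word_probability("AC", background, order=0))  # mononucleotide model
print(word_probability("AC", background, order=1))  # dinucleotide model
print(parameter_count(5))  # hexanucleotides needed at order five
```

In this toy background every A is followed by C, so the dinucleotide model rates "AC" as more likely than the mononucleotide model does. The same effect, in the opposite direction, is what lowers the expected frequency of CpG-containing words in real sequence.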
