REPFIND Help
Changes
3-9-2004: Fixed bug in downloadable REPFIND, where the -b option didn't work.
Output
REPFIND produces a graphical display of the repeats that it finds, followed by a textual summary of each repeat cluster. An example output graphic, for the 3'UTR of the Xpat gene of the frog Xenopus laevis, is shown below. The bars mark each position of the repeated word within the sequence; the colored bars indicate the repeats that form the strongest cluster. The different colors serve only to distinguish repeats that are close together and have no other meaning. On many computers you can save the image by right-clicking on it and selecting the appropriate menu option; on others, use whatever image-saving function your browser provides.

Interpretation of P-values
A P-value of 1e-5 means that such a concentration of that particular repeated word would be expected to occur by chance only once in 10^5 clusters of the individual repeat being examined. Because REPFIND tests all possible words, we would also like to know how often any word would achieve this P-value by chance in a random sequence. The following graph shows how often repeat clusters with particular P-values occur, per kb, in pseudorandom sequence. For example, a P-value of 1e-5 occurs on average once in 10,000 bp by chance.
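To put the background rate in terms of your own sequence, the small sketch below scales the figure quoted above (one cluster at P = 1e-5 per 10,000 bp of pseudorandom sequence) to a given length. It assumes that the chance rate grows linearly with sequence length and that your sequence behaves like the pseudorandom background; the function name is hypothetical and is not part of REPFIND.

```python
# Rate quoted in the text: about one chance cluster at P = 1e-5 per 10,000 bp.
CHANCE_RATE_PER_BP = 1 / 10_000


def expected_chance_clusters(sequence_length_bp, rate_per_bp=CHANCE_RATE_PER_BP):
    """Expected number of chance clusters, assuming the rate scales linearly with length."""
    if sequence_length_bp < 0:
        raise ValueError("sequence length must be non-negative")
    return sequence_length_bp * rate_per_bp


if __name__ == "__main__":
    # A 2,000 bp 3' UTR is one fifth of 10,000 bp.
    print(expected_chance_clusters(2000))
```

Under these assumptions, a cluster at this P-value in a 2,000 bp sequence would arise by chance in roughly one sequence out of five, so it is worth weighing a hit against the length searched.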

Input
This section describes some of the fields found on the REPFIND Input Form.
Sequence Input Format
Digits, spaces, newlines, and fasta-style comments beginning with ">" will be ignored.
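If you want to preview what will remain of a pasted sequence, the following minimal sketch applies the same rules. It is not REPFIND's own parser: the function name is hypothetical, and it assumes that a fasta-style comment occupies the whole line that begins with ">".

```python
def clean_sequence(raw: str) -> str:
    """Drop fasta-style comment lines, digits, and whitespace from pasted input."""
    kept = []
    for line in raw.splitlines():
        # Assumption: a comment is a whole line whose first non-blank character is ">".
        if line.lstrip().startswith(">"):
            continue
        kept.append("".join(ch for ch in line if not ch.isdigit() and not ch.isspace()))
    return "".join(kept)


if __name__ == "__main__":
    pasted = ">example record\n   1 ACGT ACGT\n  11 GGTT\n"
    print(clean_sequence(pasted))
```

This means that sequence copied from a numbered GenBank flat-file display can be pasted in without first removing the coordinates.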
GenBank Identifiers
For example, enter a GenBank accession number (e.g. X72340), an 'accession.version' number (e.g. X72340.1), or a GI number (e.g. 312302).
Set Subsequence
You may limit the search to a subsequence by entering its start and end coordinates. (The first nucleotide in the sequence has coordinate 1.) The default values of the From and To fields are the start and end of the sequence, respectively.
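If you post-process sequences yourself, note that the 1-based coordinates used here differ from the 0-based slicing of many programming languages. The sketch below shows the conversion in Python. It assumes that the To coordinate is inclusive, which the default (the end of the sequence) suggests; the function name is hypothetical.

```python
def subsequence(seq: str, start: int, end: int) -> str:
    """Return seq[start..end], with 1-based coordinates and an inclusive end (assumed)."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= len(seq)")
    # Shift only the start: a 1-based inclusive end equals a 0-based exclusive end.
    return seq[start - 1:end]


if __name__ == "__main__":
    print(subsequence("ACGTAC", 2, 4))  # nucleotides 2 through 4
```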
P-value cutoff
Only repeat clusters with P-values lower than this cutoff will be displayed.
Low Complexity Filter
Real nucleotide sequences often contain low-complexity sequence: tracts of predominantly one nucleotide, dinucleotide tandem repeats, and the like. Since these probably do not correspond to the type of signal that you are looking for with REPFIND, they may be masked out with the program dust, which is widely used for the same reason with other sequence analysis tools, such as BLAST.
Statistical background
To calculate how unlikely a repeat is, REPFIND needs to know how abundant the nucleotides within the repeated word are. By default these abundances are obtained from the input sequence. However, they may also be obtained from databases of Xenopus, human, and S. cerevisiae 3' UTRs that we have compiled. By default, the abundances of dinucleotides are used (a first-order Markov model), thus accounting for, e.g., the reduced abundance of CpG relative to C and G. Alternatively, you may select the use of mononucleotides up to hexanucleotides by selecting a Markov model of order zero to five. Since a 5th order Markov model requires frequencies of 4^6 = 4096 hexanucleotides, the dataset used for counting them should contain many more than this number of base pairs to give meaningful results.
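To make the role of the Markov order concrete, here is a minimal sketch of how the probability of a word at a fixed position can be estimated from (order+1)-nucleotide counts in a background sequence. It illustrates the general technique only and is not REPFIND's implementation; the function names are hypothetical, no pseudocounts are applied, and only the letters A, C, G, and T are considered.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_probability(word: str, background: str, order: int) -> float:
    """Probability of `word` under a Markov model of the given order,
    with all frequencies estimated from `background`."""
    if order < 0 or len(word) < order:
        raise ValueError("need order >= 0 and len(word) >= order")
    p = 1.0
    if order > 0:
        # Probability of the first `order` nucleotides, from their k-mer frequency.
        start = kmer_counts(background, order)
        p = start[word[:order]] / sum(start.values())
    longer = kmer_counts(background, order + 1)
    for i in range(order, len(word)):
        context = word[i - order:i]
        # Times this context was followed by any nucleotide in the background.
        context_total = sum(longer[context + n] for n in "ACGT")
        if context_total == 0:
            return 0.0  # context never observed; a real tool might use pseudocounts
        p *= longer[context + word[i]] / context_total
    return p


if __name__ == "__main__":
    background = "ACGT" * 10
    print(word_probability("AC", background, 0))  # mononucleotide model
    print(word_probability("AC", background, 1))  # dinucleotide model
    print(4 ** (5 + 1))  # distinct hexanucleotides needed at order 5
```

In this toy background every A is followed by C, so the dinucleotide model assigns "AC" a higher probability than the mononucleotide model does. The final line reproduces the 4^6 count mentioned above, which is why higher orders need a much larger background dataset.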

