Using PromoSer

Output

PromoSer reports the number of accessions for which a promoter could be identified, which may differ from the number of accessions requested. The output report explains why a promoter could not be returned. Possible reasons include: a quality threshold set high enough to exclude lower-quality predictions; an accession number that PromoSer did not analyze (probably because it is newer than the current release or belongs to an unsupported organism); an accession that PromoSer analyzed but excluded (e.g., synthetic probes); or an accession that could not be assigned confidently to any cluster.

PromoSer's output consists of a summary report displayed in the browser, and a FASTA file that is downloadable by clicking the displayed result ID (that ID can be used to retrieve the results again later from the main screen, within 24 hours). The summary report contains the software version, the genomic releases used to construct the promoters, the options chosen and a table with the following fields:

  • Accession: The accession number supplied, or the sequence identifier used in the input screen. It is hyperlinked to the cluster viewer, which shows the context and structure of the cluster to which the entry has been assigned. If the "exclude entries that map to the same cluster" option is chosen, the IDs shown are those of the clusters rather than of any specific entry. The ID may be flagged with a * and/or a !. A * marks a heuristically assigned entry: if an mRNA sequence cannot be aligned to the genome reliably, PromoSer tries to assign it to the smallest cluster (in terms of genomic extent) that completely encloses the aligned portions of the sequence. A ! marks an ambiguous mapping: if an mRNA maps to more than one genomic locus with scores too close to favor one over the other, all those alignments are kept and treated separately (in terms of clustering and TSS prediction), and other sources of evidence may be needed to resolve the ambiguity.
  • Seq.: A sequential id used to distinguish multiple promoters for the same accession number, starting from 0. Note that most options under "alternative promoters" select one specific promoter of all the predicted alternatives.
  • Orgn.: The organism to which the entry belongs.
  • Chrom.: The chromosome to which the entry maps.
  • Strand: The strand to which the entry maps.
  • TSS: The genomic position of the transcription start site, given on the + strand of the chromosome to which the entry maps.
  • Effective Length: The length of the extracted sequence. This may be less than the total requested length for genes on the edges of chromosomes or if any "stop" option is selected.
  • Quality: The quality score for the reported TSS, as described in the TSS Quality and Support section.
  • Supporting Sequences: The number of sequences that support the reported TSS prediction.
  • Distance from TSS to 5' end: The genomic distance in bases between the reported TSS and the mapped 5' end of the entry. In some cases, the reported extension distance is negative. This indicates that the given entry belongs to a cluster of sequences with multiple identified transcription start sites, some of which are actually downstream of the 5' end of the mRNA specified by this accession number.
  • Upstream overlap: If the promoter sequence runs into the locus of a cluster upstream of the current entry and no "stop" options are selected, then this column indicates the amount (in bases) of this overlap and the strand on which the most overlap is found. An overlap with a cluster on the same strand indicates that the reported number of bases from the start of the promoter sequence in the FASTA file may belong to the 3'UTR (and upstream) of another gene. An overlap with an opposite strand cluster indicates the sequence may be for the 5'UTR (and downstream) of another gene.

The FASTA file contains the actual promoters. Each promoter is annotated with a line that indicates the accession number, the sequential id, the coordinates of the sequence relative to the TSS (the TSS is at position +1 and there is no position 0), the coordinate of the TSS on the chromosome (always referring to the + strand), the organism, chromosome, start and end coordinates of the extracted promoter (the coordinates are always in terms of the + strand and start from 0) and the strand on which the gene lies. A * and/or ! flag may follow if applicable (as explained above).

>NM_000399|0|Promoter: -1000 to 50|TSS: 63920729|Region: Human chr10:63920680-63921730 (-)|
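
The annotation line above can be split into its fields with a small parser. The following is a minimal sketch, not part of PromoSer: the regular expression is inferred from the single example header shown here, the function name `parse_promoter_header` is hypothetical, and the position of the optional * and ! flags (assumed to follow the final `|`) is a guess.

```python
import re

# Field layout inferred from the one example header in this section.
# Assumption: optional '*' / '!' flags come after the trailing '|'.
HEADER_RE = re.compile(
    r"^>(?P<accession>[^|]+)\|(?P<seq>\d+)\|"
    r"Promoter: (?P<rel_start>-?\d+) to (?P<rel_end>-?\d+)\|"
    r"TSS: (?P<tss>\d+)\|"
    r"Region: (?P<organism>\S+) (?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+) "
    r"\((?P<strand>[+-])\)\|(?P<flags>[*!]*)\s*$"
)

INT_FIELDS = ("seq", "rel_start", "rel_end", "tss", "start", "end")


def parse_promoter_header(line):
    """Parse one annotation line into a dict; raise ValueError if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized header: %r" % line)
    record = match.groupdict()
    for key in INT_FIELDS:
        record[key] = int(record[key])
    return record


example = (">NM_000399|0|Promoter: -1000 to 50|TSS: 63920729|"
           "Region: Human chr10:63920680-63921730 (-)|")
record = parse_promoter_header(example)
print(record["accession"], record["chrom"], record["end"] - record["start"])
```

For the example header, the region spans 1050 bases, which matches the 1000 upstream plus 50 downstream bases given in the "Promoter" field.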

List of GenBank accession IDs

Enter a list of the accession IDs for which you would like to look for promoters. Separate the IDs with commas or enter each on a separate line. Please do not include the version number in the accession (i.e., use NM_091834 and not NM_091834.2). We ask for accessions without version numbers to emphasize that PromoSer attempts to localize transcriptional units rather than specific sequences. If you do include a version number, it is simply ignored and the latest version of the sequence is always used. Please limit your query to 2000 accessions per request.

Note: PromoSer currently supports the following organisms only: Homo sapiens (human), Mus musculus (mouse) and Rattus norvegicus (rat). mRNA sequences from other organisms are not considered by PromoSer (this is particularly relevant for Rattus rattus sequences). Accession IDs for genomic gene records are recognized by PromoSer and can be searched for. Such records sometimes contain several genes and may end up being associated with several clusters, some of which might not be the gene given in the record's annotation.

Tip: If you are using for example the NCBI Entrez service to select a certain set of sequences you may copy the NCBI summary listing, e.g.

1: NM_025221 Links

Homo sapiens potassium channel-interacting protein 4 (KCNIP4), transcript variant 1, mRNA
gi|28373063|ref|NM_025221.4|[28373063]


2: NM_147183 Links

Homo sapiens potassium channel-interacting protein 4 (KCNIP4), transcript variant 4, mRNA
gi|28373062|ref|NM_147183.2|[28373062]


3: NM_147182 Links

Homo sapiens potassium channel-interacting protein 4 (KCNIP4), transcript variant 3, mRNA
gi|28373061|ref|NM_147182.2|[28373061]


4: NM_147181 Links

Homo sapiens potassium channel-interacting protein 4 (KCNIP4), transcript variant 2, mRNA
gi|28373060|ref|NM_147181.2|[28373060]
and paste it directly into PromoSer. As long as the accessions are clearly distinct and have no version, they will be identified and retrieved.
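
If you prefer to clean such a listing yourself before submitting it, the accessions can be pulled out with a regular expression. This is a minimal sketch under stated assumptions: the pattern (one or two capital letters, an optional underscore, then 5 to 9 digits) is a simplification that covers IDs like those shown above and is not PromoSer's own recognition rule, and `extract_accessions` is a hypothetical helper name.

```python
import re

# Assumption: a simplified accession pattern; an optional ".version" suffix
# is matched but left out of the captured group.
ACCESSION_RE = re.compile(r"\b([A-Z]{1,2}_?\d{5,9})(?:\.\d+)?\b")
MAX_ACCESSIONS = 2000  # per-request limit stated in this section


def extract_accessions(text):
    """Return unique, version-free accession IDs in order of first appearance."""
    seen = {}
    for match in ACCESSION_RE.finditer(text):
        seen.setdefault(match.group(1), None)
    accessions = list(seen)
    if len(accessions) > MAX_ACCESSIONS:
        raise ValueError("more than %d accessions in one request" % MAX_ACCESSIONS)
    return accessions


listing = (
    "1: NM_025221 Links\n"
    "Homo sapiens potassium channel-interacting protein 4 (KCNIP4), "
    "transcript variant 1, mRNA\n"
    "gi|28373063|ref|NM_025221.4|[28373063]\n"
    "2: NM_147183 Links\n"
    "gi|28373062|ref|NM_147183.2|[28373062]\n"
)
print(",".join(extract_accessions(listing)))
```

The result is a comma-separated list without version numbers, which is one of the two input layouts described above.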

List of FASTA records

Note: Using sequence searches to locate promoters is not the preferred way to use PromoSer. If you use this feature please be patient as sequence searching can be slow (sometimes very slow).

PromoSer can accept a multi-FASTA format list of sequences as an input. Those sequences will be mapped onto the genome and matched to clusters that overlap their alignments by 95% of their length or more. This is a best effort guess and not as rigorous as the clustering process used by PromoSer for the fully processed accessions. In particular, if a sequence is much longer than any of the homologous sequences already processed by PromoSer then it will either be associated with the wrong cluster or no cluster at all.

The header line of FASTA sequences must be formatted in a special way:

  • The organism to which the sequence belongs must be the first part of the annotation. Common names (Human, Mouse, Rat), Latin names, and their abbreviations (hs for Homo sapiens, mm for Mus musculus, rn for Rattus norvegicus) are all recognized as valid. If the organism for a sequence cannot be determined, or if the header is missing entirely, the sequence is assumed to be human.
  • The identifier must consist only of alphanumeric characters and the underscore. All other characters will be stripped off.
  • Only the first 20 characters of an identifier are considered (not counting the organism's name).
  • If multiple sequences are given, their identifiers must be distinct.

    For example, all of the following header lines are valid:

    >Homo sapiens early growth res 2
    acacac...
    >mm seq1
    gatata....
    >rat
    ctaaag....
    

    There is a limit of 25 sequences per request and 50,000 bases of total sequence. Each individual sequence is limited to 25,000 bases.
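
The header rules and size limits above can be checked locally before a request is submitted. The sketch below is illustrative only: `split_header` and `check_fasta_request` are hypothetical names, and the choice to drop spaces from the identifier before truncating it to 20 characters is an assumption about how "stripped off" is applied.

```python
import re

# Organism names and abbreviations listed in the text, mapped to one code.
ORGANISMS = {
    "human": "hs", "homo sapiens": "hs", "hs": "hs",
    "mouse": "mm", "mus musculus": "mm", "mm": "mm",
    "rat": "rn", "rattus norvegicus": "rn", "rn": "rn",
}
MAX_SEQUENCES = 25
MAX_TOTAL_BASES = 50_000
MAX_SEQUENCE_BASES = 25_000


def split_header(header):
    """Split a FASTA header (without '>') into (organism code, identifier)."""
    words = header.strip().split()
    organism, rest = "hs", words  # default to human when nothing matches
    for n in (2, 1):  # try two-word Latin names before one-word names
        name = " ".join(words[:n]).lower()
        if len(words) >= n and name in ORGANISMS:
            organism, rest = ORGANISMS[name], words[n:]
            break
    # Assumption: every character other than [A-Za-z0-9_] is dropped.
    identifier = re.sub(r"[^A-Za-z0-9_]", "", "".join(rest))[:20]
    return organism, identifier


def check_fasta_request(records):
    """Return a list of problems for a list of (header, sequence) pairs."""
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append("more than %d sequences" % MAX_SEQUENCES)
    if sum(len(seq) for _, seq in records) > MAX_TOTAL_BASES:
        problems.append("more than %d bases in total" % MAX_TOTAL_BASES)
    identifiers = []
    for header, seq in records:
        if len(seq) > MAX_SEQUENCE_BASES:
            problems.append("sequence too long: %s" % header)
        identifiers.append(split_header(header)[1])
    if len(set(identifiers)) != len(identifiers):
        problems.append("identifiers are not distinct")
    return problems


print(split_header("Homo sapiens early growth res 2"))
```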

    List of genomic loci

    As a high throughput service, PromoSer can be used to extract user specified regions directly from the genome. In this case, no promoter related processing is performed and PromoSer simply acts as a sequence server. There is a limit of 2000 requests and 2Mb of total sequence. The requests must be formatted in a special way. For example:

    hs chr5 4000 5000 - My sequence 1
    mm chrY 200 4000 +
    

    The list contains one request per line. Each request has 5 required columns and a sixth optional one, separated by spaces or tabs. The columns are: organism abbreviation (same as used in FASTA requests), chromosome (as chrXX, e.g. chr1, chrY, chr20 ...), start position for extraction (first position is 0), end position for extraction (end itself is not included), strand (for - strands the coordinates are in terms of the + strand but the results are reverse complemented) and an optional annotation column (only alphanumeric characters and the underscore are allowed). You can check the sizes of the chromosomes used in the current builds in this page.
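
The coordinate convention described above (0-based start, end not included, reverse complement for the - strand) can be illustrated with a short sketch. The function names `parse_locus` and `extract` are hypothetical, and the toy chromosome string stands in for a real genome sequence.

```python
# Complement table for reverse-complementing, preserving case and N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_locus(line):
    """Parse one request line into a dict with the five required columns."""
    parts = line.split(None, 5)  # the sixth field, if any, keeps its spaces
    if len(parts) < 5:
        raise ValueError("expected at least 5 columns: %r" % line)
    organism, chrom, start, end, strand = parts[:5]
    start, end = int(start), int(end)
    if strand not in ("+", "-"):
        raise ValueError("strand must be + or -")
    if not 0 <= start < end:
        raise ValueError("need 0 <= start < end")
    annotation = parts[5].strip() if len(parts) > 5 else ""
    return {"organism": organism, "chrom": chrom, "start": start,
            "end": end, "strand": strand, "annotation": annotation}


def extract(chrom_seq, start, end, strand):
    """Return chrom_seq[start:end]; reverse-complement it for the - strand."""
    piece = chrom_seq[start:end]
    if strand == "-":
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece


request = parse_locus("hs chr5 4000 5000 - My sequence 1")
print(request["end"] - request["start"])   # length of the requested region
print(extract("AACCGGTT", 1, 4, "+"), extract("AACCGGTT", 1, 4, "-"))
```

Because the end position is excluded, the length of a request is simply end minus start, so the first example line asks for 1000 bases.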

    Repeat elements and tandem repeats can be masked in the returned sequence by choosing the appropriate option on the main page.

    Upstream region

    Enter the number of bases required upstream of the TSS, up to but not including the TSS itself. Allowed range is 1..10000 bases.

    Downstream region

    Enter the number of bases required starting from the TSS and downstream. Allowed range is 0..1000 bases.
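
The way the upstream and downstream lengths translate into + strand coordinates can be sketched as follows. This is an illustration rather than PromoSer's code: it assumes the TSS coordinate is 0-based and the region end is exclusive, a convention inferred from the example FASTA header in the Output section, and `promoter_region` is a hypothetical name.

```python
def promoter_region(tss, strand, upstream, downstream):
    """Return (start, end) on the + strand, 0-based with end excluded.

    Assumption: tss is a 0-based + strand coordinate and the downstream
    count includes the TSS base itself.
    """
    if not 1 <= upstream <= 10000:
        raise ValueError("upstream must be in 1..10000")
    if not 0 <= downstream <= 1000:
        raise ValueError("downstream must be in 0..1000")
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the - strand, upstream bases lie at higher + strand coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be + or -")
    return max(start, 0), end  # a region cannot start before the chromosome


print(promoter_region(63920729, "-", 1000, 50))
```

With the values from the example header (TSS 63920729 on the - strand, 1000 bases upstream and 50 downstream), this reproduces the region 63920680-63921730 reported there.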

    TSS quality and support

    A TSS is identified when the 5' ends of a number of sequences map to the same genomic position (within a window of 20bp) with a high quality alignment score. The number of those sequences represents the support for the TSS. If the cluster does not contain any high scoring alignments, the 5'-most position of the entire cluster is named a TSS and assigned quality 0 and support 0. The TSS quality score indicates the composition of the sequences that support this TSS, as follows:

      0: No high scoring alignments support this TSS.
      1: All supporting sequences are ESTs only.
      2: The TSS is supported by one or more mRNAs.
      3: The TSS is supported by one or more RefSeqs.
      4: The TSS is supported by experimental evidence from Eukaryotic Promoter Database (EPD).

    Note that the support score indicates the number of EST sequences for quality level 1 and the number of non-EST supporting sequences for higher levels.
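
The relationship between the evidence types, the quality level and the support score can be expressed compactly. The sketch below is an assumption-laden illustration: the labels "EST", "mRNA", "RefSeq" and "EPD" and the function `tss_quality_and_support` are hypothetical, and counting an EPD entry as one non-EST supporting sequence is a guess.

```python
# Quality level contributed by each kind of supporting evidence.
QUALITY = {"EST": 1, "mRNA": 2, "RefSeq": 3, "EPD": 4}


def tss_quality_and_support(sources):
    """Return (quality, support) for the evidence types supporting one TSS.

    sources lists one label per high-scoring sequence whose 5' end maps
    to the TSS, e.g. ["EST", "EST", "mRNA"].
    """
    if not sources:
        return 0, 0  # no high-scoring alignment supports this TSS
    quality = max(QUALITY[source] for source in sources)
    if quality == 1:
        support = len(sources)  # ESTs only: count the ESTs
    else:
        support = sum(1 for source in sources if source != "EST")
    return quality, support


print(tss_quality_and_support(["EST", "mRNA", "RefSeq"]))
```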

    Repeat elements

    Repeats in the retrieved sequence identified by both RepeatMasker and Tandem Repeat Finder can be masked. You can choose to mask them with an N, mark their sequence using the lower case alphabet or ignore them completely and show all sequence using capital letters.
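
The three masking choices can be illustrated on a toy sequence. This is a sketch only: `mask_repeats` and its mode names ("n", "lower", "ignore") are hypothetical, and the repeat intervals are assumed to be 0-based with the end excluded.

```python
def mask_repeats(sequence, repeats, mode="lower"):
    """Apply one masking mode to the repeat intervals of a sequence.

    repeats is a list of (start, end) pairs, 0-based, end excluded.
    mode "n" replaces repeat bases with N, "lower" writes them in lower
    case, and "ignore" returns the whole sequence in capital letters.
    """
    if mode not in ("n", "lower", "ignore"):
        raise ValueError("unknown mode: %r" % mode)
    bases = list(sequence.upper())
    if mode == "ignore":
        return "".join(bases)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "N" if mode == "n" else bases[i].lower()
    return "".join(bases)


for mode in ("n", "lower", "ignore"):
    print(mode, mask_repeats("ACGTACGT", [(2, 5)], mode))
```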

    Alternative promoters

    Predicted TSSs from non-EST sequences that are more than 20 bases apart are considered cases of alternative promoters. You can choose to return all of these promoters, only the best supported, only the nearest upstream of an accession, only the 5'-most, or only the 3'-most, or you can ignore all extension information. The promoter of the 5'-most TSS is the furthest possible extension of the given accession. The promoter of the 3'-most TSS is the shortest necessary extension of the given accession. These can be considered the most aggressive and the most conservative extensions, respectively. Each TSS is identified from a group of sequences that share approximately the same 5'-most genomic position; the number of these sequences is the support for the TSS. If you choose to ignore all extension information, PromoSer considers the aligned position of the 5' end as the only TSS.
    The best supported TSS is the one with the highest support score. If several TSSs have the same highest score, only the 5'-most of them is returned. The nearest upstream TSS is the one that maps nearest but upstream to wherever the identified 5'-end of an accession aligns. This is a useful option if the promoter associated with a specific variant of some transcript is required.
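
The selection rules above can be sketched as a small function. It is a hedged illustration, not PromoSer's implementation: `choose_tss` and its option names are hypothetical, positions are + strand coordinates, and treating a TSS located exactly at the 5' end as "upstream" is an assumption. The "ignore all extension information" option is left out because it does not select among predicted TSSs.

```python
def choose_tss(candidates, strand, option, five_prime_end=None):
    """Select among alternative TSSs given as (position, support) pairs."""
    # Orient positions along the transcript so that a smaller key always
    # means "further 5'", whichever strand the gene lies on.
    sign = 1 if strand == "+" else -1

    def key(candidate):
        return sign * candidate[0]

    if option == "all":
        return sorted(candidates, key=key)
    if option == "5prime":
        return [min(candidates, key=key)]
    if option == "3prime":
        return [max(candidates, key=key)]
    if option == "best":
        # Highest support first; ties are broken in favor of the 5'-most.
        return [min(candidates, key=lambda c: (-c[1], key(c)))]
    if option == "nearest_upstream":
        limit = sign * five_prime_end
        upstream = [c for c in candidates if key(c) <= limit]
        return [max(upstream, key=key)] if upstream else []
    raise ValueError("unknown option: %r" % option)


candidates = [(100, 3), (200, 5), (300, 5)]
print(choose_tss(candidates, "+", "best"), choose_tss(candidates, "-", "best"))
```

Note how the tie between the two TSSs with support 5 resolves differently on each strand, because the 5'-most position is the lowest coordinate on the + strand and the highest on the - strand.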

    Excluding some results

    In some cases, a given sequence will map to more than one genomic locus with nearly equal high scores. PromoSer keeps all alignments within 1% of the score of the best alignment as possible alternatives. In both the summary table and the FASTA file, those alignments are flagged with an exclamation mark (!). If needed, all such results can be excluded from the output.
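
The 1% rule can be illustrated with a few lines of code. This sketch assumes positive alignment scores and reads "within 1%" as relative to the best score; `keep_alternatives` and the locus labels are hypothetical.

```python
def keep_alternatives(alignments, tolerance=0.01):
    """Keep alignments scoring within `tolerance` of the best one.

    alignments is a list of (locus, score) pairs with positive scores.
    Returns (kept, ambiguous); ambiguous is True when more than one
    alignment survives, which corresponds to the ! flag.
    """
    if not alignments:
        return [], False
    best = max(score for _, score in alignments)
    kept = [(locus, score) for locus, score in alignments
            if score >= best * (1 - tolerance)]
    return kept, len(kept) > 1


hits = [("chr1:100", 1000), ("chr7:500", 995), ("chr9:1", 900)]
print(keep_alternatives(hits))
```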

    In some other cases, a sequence can only be partially aligned to the genome, or not align well enough to be considered reliable. PromoSer will not use such alignments in either clustering or TSS prediction. It will however try to guess the correct cluster this alignment may belong to. Those guessed alignments are flagged with a * in both the summary table and the FASTA file. These can also be excluded from the results if desired.

    Note: Gene records that are genomic (given as a DNA sequence rather than an mRNA sequence) will always be included in the results, regardless of the above exclude settings. This is because genomic records associate with any cluster they overlap with on either strand of the chromosome.

    Finally, it may be the case that among the many accession IDs given as input, some map to the same cluster. Those will have identical promoters returned for each of them. If unique promoters are desired the option to exclude duplicate promoters should be selected. In this case, promoters returned will not be associated with any specific accession, but rather with the whole cluster.

    Overlap

    If the range specified for the upstream region happens to overlap with the 3' end of a gene upstream of the gene of interest then you may wish not to retrieve the sequence of the upstream gene as part of the promoter. Choose whether to ignore this case or to consider genes upstream only on the same strand or on both strands. (If the gene were on the opposite strand, the overlap would be with its 5' region.)

    Assembly gaps

    Because of the draft nature of the genome assemblies currently in use, certain spans of chromosomes have not yet been sequenced and are represented in the assembly by a run of N's whose length depends on the reason the gap exists. If such gaps are found in the specified range, you may choose to stop the extracted sequence at the boundary of the nearest upstream gap. This is potentially useful if you are specifically interested in the spatial distribution of signals in the sequence and the uncertainty in the length of the gap affects the analysis. Stopping at a gap's boundary takes precedence over upstream overlapping clusters.

    Retrieval of previous results

    If PromoSer successfully processes your request, it gives an identifier for the results. You can use this identifier to retrieve your results again within one day without rerunning the query. Results will not be retained indefinitely and will be removed periodically.
