
From CAGT

TRDB Help

[Annotations] [Browser] [Clustering] [Data Download] [Distributions] [FASTA] [Flanking Sequences] [GAP] [GFF] [Guest Account] [History] [MREPS] [My Account] [Partitions] [Predicting Polymorphism] [Predicting Transcription Binding Sites] [Primers] [Projects] [Repeats] [Reports] [Sequences] [sets] [sub-pattern strength] [Tools] [TRF]

Annotations (top)

  1. annotation explanation Annotations are additional information associated with a sequence (such as gene or exon information).

    Annotations can be viewed either through the browser (accessible from the same page as distributions) or from the "filter repeats" page, where you can run queries on them.

  2. annotate a sequence

    Interface:
    You can annotate a sequence by selecting "annotate" on the sequences menu (to upload a new file) or by going to the modify menu (to view and maintain files already uploaded).

    If you click on the "modify" link, you are presented (along with the sequence description info) with a list of annotation files (in GFF format), which is initially empty. You can then upload GFF files with different features one at a time. (Note that it is possible to have multiple files with the same features, for example if the files are too large and have been broken into pieces.) After uploading is done, click on the "generate indexes" button. This completes the annotation process. Note that if one of the GFF files is later deleted or a new one is uploaded, the indexes will need to be regenerated.

    Structure:
    There are 2 tables in the database: GFFfiles, GFFFeatures.

    The GFFfiles table contains the names of, and information about, the GFF files uploaded. Note that one of the fields, SequenceId, is a unique identifier of the annotated sequence.

    GFFfiles
    3	Id		int		4	0
    0	Name		varchar		300	1
    0	SequenceId	int		4	0
    0	UserID		int		4	0
    0	Description	varchar		500	1
    0	UploadDate	datetime	8	0
    

    The GFFFeatures table contains the actual data from the uploaded GFF files. Each feature is linked to its descriptor in GFFfiles on GFFFeatures.gffid = GFFfiles.id. The data parser is written in C and supports the GFF standard, version 2.

    GFFFeatures
    1	GFFfile		int		4	0
    0	SeqId		int		4	0
    0	SeqName		varchar		300	1
    0	Source		varchar		200	1
    0	Feature		varchar		200	1
    0	FirstIndex 	int		4	1
    0	LastIndex 	int		4	1
    0	Score		real		4	1
    0	Strand		char		1	1
    0	Frame		int		4	1
    0	Attribute		varchar		8000	1
    0	Comment		varchar		8000	1
    0	id		int		4	0
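
    As a rough illustration of how one GFF version 2 line maps onto the GFFFeatures columns above, here is a minimal Python sketch. It is not the TRDB parser (which is written in C); the class name, function name, and the handling of '.' as "no value" are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GffFeature:
    # Field names loosely mirror the GFFFeatures columns (hypothetical naming).
    seq_name: str
    source: str
    feature: str
    first_index: int
    last_index: int
    score: Optional[float]
    strand: Optional[str]
    frame: Optional[int]
    attribute: Optional[str] = None


def parse_gff2_line(line: str) -> GffFeature:
    """Parse one tab-delimited GFF2 line; '.' is treated as a missing value."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 8:
        raise ValueError("a GFF2 line needs at least 8 tab-separated fields")

    def opt(value):
        return None if value == "." else value

    score = opt(cols[5])
    frame = opt(cols[7])
    return GffFeature(
        seq_name=cols[0],
        source=cols[1],
        feature=cols[2],
        first_index=int(cols[3]),
        last_index=int(cols[4]),
        score=float(score) if score is not None else None,
        strand=opt(cols[6]),
        frame=int(frame) if frame is not None else None,
        attribute=cols[8] if len(cols) > 8 else None,
    )
```

    Missing values become None here, which corresponds to the nullable columns in the table.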
    

  3. downloading annotations You can download annotations that you uploaded yourself from the >sequences>annotate menu.

  4. viewing annotations You can view repeat annotations from the same page as any other field (">sets>view repeats".) To make them viewable, go to "Change Columns" and select the annotations of interest (for example Gene, Intron, etc.) There are a number of predefined annotation fields. If a sequence has annotations associated with it that are not defined in these fields, you can select the "Other Feature" annotation field, and they will be placed there together with any other "undefined" fields.

    Once you refresh the page with annotations selected, you will see one of the following for each annotation: YES means the feature overlaps the repeat; NO means the feature doesn't overlap it, but the << and >> arrows give you links to upstream and downstream information. If you see "--", the sequence from which the repeat originates was never annotated.

    Annotations, just like any other fields, can be filtered. Select the name of the feature from the --field-- filter box, and select one of the following options:

    overlaps - only fetches repeats that overlap that feature

    doesn't overlap - only fetches repeats that do not overlap that feature

    upstream, within# - only fetches repeats that are upstream (on the left) of the feature, within [value] bases.

    downstream, within# - only fetches repeats that are downstream (on the right) of the feature, within [value] bases.

    nearby, within# - only fetches repeats that overlap the feature, or are upstream or downstream of it within [value] bases.
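
    The filter options above can be read as simple interval tests. The following Python sketch shows one possible interpretation; the function names, the inclusive (first, last) coordinates, the way distance is measured, and the reading of "upstream" as "the repeat lies to the left of the feature" are all assumptions, not a description of the TRDB query code.

```python
def overlaps(repeat, feature):
    """True if two inclusive (first, last) intervals share at least one base."""
    return repeat[0] <= feature[1] and feature[0] <= repeat[1]


def matches(repeat, feature, option, within=0):
    """Evaluate one filter option for a repeat against a single feature."""
    if option == "overlaps":
        return overlaps(repeat, feature)
    if option == "doesn't overlap":
        return not overlaps(repeat, feature)
    # Assumed distance: difference between the facing ends of the intervals.
    upstream = repeat[1] < feature[0] and feature[0] - repeat[1] <= within
    downstream = repeat[0] > feature[1] and repeat[0] - feature[1] <= within
    if option == "upstream":
        return upstream
    if option == "downstream":
        return downstream
    if option == "nearby":
        return overlaps(repeat, feature) or upstream or downstream
    raise ValueError("unknown filter option: %r" % option)
```

    For example, with a feature at (160, 200), a repeat at (100, 150) matches "upstream" within 10 bases but not within 5.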

  5. origin of data Annotations for sets in the Public Database project were downloaded from the following locations:

    NCBI
    ensembl.org
    wormbase.org
    genome.ucsc.edu
    yeastgenome.org

Browser (top)

  1. browser explanation You can get to the browser page by selecting a set via the SETS menu and clicking the "BROWSER" button. At the top of the page you are presented with the description of the set. Below the description, there is a box labeled "Browser Options". It contains various options for controlling the browser layout and which data the browser displays.

    First, select the zoom level. This is how much data the browser will try to display, starting from the selected range point. Note: the amount of data displayed might be smaller if the highest index in the set is not high enough to accommodate it.

    Second, click on the range browser to select the starting range. Note: your range selection might be moved backward in order to fit the zoom level you selected. Decrease the zoom level to be able to get closer to the end of the set.

    Some additional controls are:

    1. description - check this to force the browser to display a description stamp in the upper left corner.
    2. annotations - check this to force the browser to display annotations associated with the sequence if some are available.
    3. image width - specifies the width of the image (500 - 50,000).
    4. move - these buttons move the data range sideways. The smallest one moves it 20%, middle one 50% and the largest one 100%.


    Don't forget that you can recenter the image by clicking the base positions of the image. If you change the zoom before doing so, the browser will also zoom in around that point.
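
    The way the zoom level, the starting point, and the move buttons interact can be sketched as follows. This is only an illustration in Python: the function names and the 1-based positions are assumptions, and the real browser may round or clamp differently.

```python
def fit_window(start, zoom, highest_index):
    """Return the (first, last) positions displayed for a requested start."""
    if zoom >= highest_index:
        # Not enough data for the zoom level: show everything there is.
        return 1, highest_index
    # Move the start backward if the window would run past the end of the set.
    start = max(1, min(start, highest_index - zoom + 1))
    return start, start + zoom - 1


def move(start, zoom, highest_index, fraction):
    """Shift the window sideways by a fraction of the zoom (0.2, 0.5 or 1.0)."""
    return fit_window(start + int(zoom * fraction), zoom, highest_index)
```

    A negative fraction shifts the window to the left in this sketch.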

Clustering (top)

  1. compare flanking sequences When clustering tandem repeats TRDB makes use of similarity in the repeating pattern. Once a cluster of repeats is obtained it is possible to execute a flanking sequence comparison to determine if a given set of repeats exhibits significant similarities in the sequences surrounding the repeats. This information can reinforce the notion of likeness between a specific pair of repeats.

    The output of the flanking sequence comparison is presented in two parts: A tabular representation showing all the repeats in the cluster where a flanking sequence similarity with another repeat was detected, and a graphical form showing the details about the flanking sequence similarity for a specific pair of repeats in the cluster.

    The data presented in the tabular form is intended for quick browsing and shows the left and right flank alignment scores. If the repeats are similar on opposite strands the program will display the comparison as RC, or reverse complement. The scores may be artificially high if the flanking sequences have regions of overlap, so the overlap, if any, is also shown. From the tabular form one can access the graphical view for each repeat pair by clicking on the corresponding table cell.

    The graphical view of each repeat pair allows close inspection of the alignment of the repeat flanks. The flanking sequences being compared are shown in red. The matching parts of the flank are shown in blue. Some useful numbers are shown on the image, like the length of the matching segment and its distance from the edge of the repeat. The indices where the matching sequences start and end are also shown. Any fragments of the repeat that may be present in the flanking sequence of the same repeat are shown in yellow to help identify cases where the similarity is caused by a repeat fragment. If the similarity occurs on opposite strands the flanks are reverse-complemented on the second repeat and "RC" is drawn on the left side of the image. Also included on the detail page is the actual alignment of the flanks in graphical form.

  2. creating a cluster manually You don't need to run a clustering algorithm to create a cluster. Simply save a set "as cluster" when making a copy of it and place it into any of the available partitions. This is useful because some tools can only be run on clusters (flanking comparisons, label copies).

  3. label copies This tool labels each copy of a repeat with a unique letter. You can download the label explanation file by clicking the button on the bottom of the page.

  4. clustering explanation

    Clustering can be performed by running the clustering tool against a set. In running this tool you create a partition of the set. Partitions can be accessed from the main menu at any time. In the clustering tool page you provide the following:

    • A name for the new partition.
    • A project to which the partition will be associated.
    • A distance table to use for aligning repeats in the cluster.
    • A distance table for scoring the alignments.
    • A percent cutoff representing how close to a perfect score between two repeats is required for them to be considered neighbors by the clustering algorithm.
    • An algorithm to use. CCA(single linkage) is the default.

    We use repeat profiles to account for variability in the repeat pattern from copy to copy. Clustering is based on distances among repeat profiles.
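
    To make the role of the percent cutoff concrete, here is a small Python sketch of single-linkage clustering, read as "connected components of the neighbor graph". It is a simplification and not the TRDB implementation: the function name is hypothetical, pairwise scores are passed in as plain numbers rather than computed from repeat profiles, and a single perfect score is assumed for all pairs.

```python
def single_linkage(n, scores, perfect, percent_cutoff):
    """Cluster items 0..n-1; scores maps (i, j) pairs to alignment scores."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in scores.items():
        # Two repeats are neighbors if their score is close enough to perfect.
        if 100.0 * score / perfect >= percent_cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

    With single linkage, two repeats end up in the same cluster as soon as a chain of neighbors connects them, even if the two are not neighbors of each other.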

Data Download (top)

  1. download explanation Data download is accessible from the tool menu. It allows you to download the contents of a set in different formats.

    First select the columns you wish to download by selecting them in the box on the left and moving them to the right by clicking the ">>" arrow. You can select multiple columns by holding down the SHIFT key.

    Then select the ordering of the set in the "Order By" selection box. Note that you cannot sort on some fields (such as pattern, profile, and sequence).

    Next, there is a drop-down box for the operating system that you use. Select the appropriate one to avoid incorrect end-of-line characters.

    Finally, choose the data format. If you choose ASCII, make sure you choose the column delimiter that best suits you. Checking the "make first row the column heading" option inserts the list of columns as the first row of data.

    The XML data format encloses data elements in their respective column identifiers. An appropriate DTD document will be posted for download later.

    GFF format is another available option. It uses preset fields, so your field selections do not matter. GFF has become increasingly popular because it is very simple and general. We use GFF format for annotations. You can annotate one set with a GFF file from another set and easily compare the results in the browser.

    FASTA is another preset format, in which the repeats are written as a multiple-sequence FASTA file. Many tools and programs use the FASTA format.

    If you think there are other useful formats that we should implement, make sure to let us know.

    Note: If a character field has length 0, it is simply not displayed. If, for some reason, the resource is no longer present in the database, a "--" marker is displayed. You can think of this as a NULL value indicator.
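
    The ASCII options described above (column delimiter, operating system, heading row, and the "--" marker) fit together roughly as in the Python sketch below. The function name, the dictionary-based rows, and the exact line endings per operating system are assumptions for illustration; they are not taken from the TRDB export code.

```python
# Assumed mapping from the operating system choice to a line terminator.
LINE_ENDINGS = {"unix": "\n", "mac": "\r", "windows": "\r\n"}


def export_ascii(rows, columns, delimiter="\t", system="unix", heading=True):
    """Render rows (dicts keyed by column name) as delimited text."""
    eol = LINE_ENDINGS[system]

    def cell(value):
        # A missing resource is shown as "--"; an empty string stays empty.
        return "--" if value is None else str(value)

    lines = []
    if heading:
        lines.append(delimiter.join(columns))
    for row in rows:
        lines.append(delimiter.join(cell(row.get(c)) for c in columns))
    return "".join(line + eol for line in lines)
```

    Choosing a delimiter that never occurs in the data (for example a tab) avoids the need for quoting.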

Distributions (top)

  1. distributions explanation You can get to the distributions page by selecting a set via the SETS menu and clicking the "VIEW DISTRIBUTIONS" button. At the top of the page you are presented with the description of the set. Below the description, there is a box labeled "Distribution Options". It contains various options for controlling the distributions layout.

    The "Distribution On" drop-down box allows you to choose the column of interest. The option below it (graph or table) lets you choose between a graphical view and a simple frequency table (the default). "Bucketsize" determines the size of a single bucket over which frequency is calculated. You must uncheck the "auto" checkbox if you want to set it manually. You can also manually change the range of data if you want to view a smaller slice of your data (you must uncheck the "full" checkbox to do so). Because the image is normally fixed to a certain width, it becomes practically unreadable when many buckets are shown, because the lines are so close to each other. To fix this problem, choose "dynamic" for the "image width" option. This allows the image to grow. The largest image width is 100,000 pixels (the image is in .png format, which compresses very well). If your selections result in a larger width, an error message is displayed. You should then enlarge the bucket size or decrease your range.

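    A frequency table with a manual bucket size and range can be sketched in a few lines of Python. The function name and the choice to label each bucket by its lower bound are assumptions made for this example, and it expects a non-empty list of integer values.

```python
from collections import Counter


def frequency_table(values, bucket_size, low=None, high=None):
    """Count values per bucket; buckets are labeled by their lower bound."""
    low = min(values) if low is None else low      # "full" range by default
    high = max(values) if high is None else high
    counts = Counter()
    for v in values:
        if low <= v <= high:
            bucket = low + ((v - low) // bucket_size) * bucket_size
            counts[bucket] += 1
    return sorted(counts.items())
```

    Enlarging the bucket size or narrowing the range reduces the number of buckets, which is what keeps a dynamic-width image within its size limit.
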
FASTA (top)

  1. FASTA The FASTA format looks something like this:


    >myseq
    AGTCGTCGCTAGCTAGCTAGCATCGAGTCTTTTCGATCGAG
    CTAGCTAGCTAGCATGTCGCTCGAGCATGTCGCTCATGAGA
    TTTAGCTAGCTAGCATAGCATACGAGCATATCGGTGTCGCT


    The first line starts with a greater than sign ">" and contains a name or other identifier for the sequence. The remaining lines contain the sequence data. The sequence can be in upper or lower case letters. Anything other than letters (numbers for example) is ignored. The sequence size is unlimited.
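
    The rules above translate into a very small reader. The Python sketch below is an illustration only (the function name is hypothetical): it accepts one or more records, keeps upper and lower case letters, and drops everything else from the sequence lines.

```python
def read_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Anything other than letters (digits, spaces, ...) is ignored.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```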

Flanking Sequences (top)

  1. flanking sequences explanation Flanking sequences can be downloaded by using the "Flanking Sequence and Primers Extraction" tool. You are first asked to select a set. After doing so, pick the flanking sequence options. Select the size of the flanking sequences first (50, 100, 200, 350 or 500). Then choose the output format (ASCII or XML). Don't forget to choose the line termination that your system understands (newline for unix, carriage return for mac, or both for windows). Click on the "get flanking sequences" button. You will be taken to the next page, which has a link to your results. Your results might not be available right away, so wait and keep checking on them with the link. The time it takes depends on the size of your set.
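
    Conceptually, extracting the flanks of a repeat is a pair of slices around its indices. The Python sketch below assumes 1-based inclusive repeat indices and a hypothetical function name; it simply returns shorter flanks when the repeat lies near either end of the sequence.

```python
ALLOWED_SIZES = (50, 100, 200, 350, 500)


def flanks(sequence, first, last, size):
    """Return (left, right) flanking sequences of a repeat at first..last."""
    if size not in ALLOWED_SIZES:
        raise ValueError("flank size must be one of %r" % (ALLOWED_SIZES,))
    # Convert the 1-based inclusive indices to Python's 0-based slices.
    left = sequence[max(0, first - 1 - size):first - 1]
    right = sequence[last:last + size]
    return left, right
```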

GAP (top)

  1. GAP detection This tool scans through the sequence and creates an annotation entry each time a run of Ns of a certain minimum length is found. The annotation is stored in the "OTHER FEATURE" track. This makes it possible to find repeats that have big gaps overlapping or adjacent to them. NOTE: just as after uploading a GFF file, you will need to run the index regeneration from the "sequence->modify" menu in order to be able to search through the results.
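
    The scan itself amounts to finding runs of Ns. A minimal Python sketch follows; the function name, the 1-based inclusive output coordinates, and the acceptance of lower case "n" are assumptions made for this example.

```python
import re


def find_gaps(sequence, min_length):
    """Return (first, last) positions of each run of Ns of at least min_length."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    # m.start() is 0-based and m.end() is exclusive, so this yields
    # 1-based inclusive coordinates.
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]
```

    Each returned pair could then be written out as one annotation entry.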

GFF (top)

  1. GFF The GFF (General Feature Format) format looks something like this:

    SEQ1	EMBL	atg	103	105	.	+	0
    SEQ1	EMBL	exon	103	172	.	+	0
    SEQ1	EMBL	splice5	172	173	.	+	.
    SEQ1	netgene	splice5	172	173	0.94	+	.
    SEQ1	genie	sp5-20	163	182	2.3	+	.
    SEQ1	genie	sp5-10	168	177	2.1	+	.
    SEQ2	grail	ATG	17	19	2.1	-	0
    
    Fields are tab delimited. The file size is unlimited. For the latest GFF file specification, consult the GFF Specifications Document at Sanger.

    When exporting data, we use the set name as the sequence id; set the source to the name of the sensor that found the repeats (usually "TRF" for tandem repeats and "IRF" for inverted repeats); set the feature name to "repeat" for tandem repeats or "stem_loop" for inverted repeats; set the start and end indices to the repeat bounds; set the score to the score field; leave the strand and frame fields empty (set to '.'); and put a note with some extra information in the attribute field (for example, the number of copies and the pattern size for tandem repeats).
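
    The export mapping described above can be sketched as a single formatting function. This Python example is illustrative only: the function name, its parameters, and the wording of the note are hypothetical, and the exact attribute text that TRDB writes is not specified here.

```python
def repeat_to_gff(set_name, sensor, first, last, score, inverted=False, note=""):
    """Format one repeat as a tab-delimited GFF line."""
    feature = "stem_loop" if inverted else "repeat"
    fields = [
        set_name,      # sequence id: the set name
        sensor,        # source: e.g. "TRF" or "IRF"
        feature,
        str(first),
        str(last),
        str(score),
        ".",           # strand left empty
        ".",           # frame left empty
    ]
    if note:
        fields.append(note)
    return "\t".join(fields)
```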

Guest Account (top)

  1. guest account The guest account permits browsing of our online database but does not allow storage or copying of any data in the database. You must register to get a private workspace inside the database.

History (top)

  1. set history explanation Sets are often created, filtered and merged. After doing this for a while, you may no longer remember the history of a set. The Set History page lets you view it. You can get to the HISTORY page by selecting a set via the SETS menu and clicking the "VIEW HISTORY" button. At the top of the page you are presented with the description of the set. At the bottom of the page there is an image with a tree-like structure. It shows you the history of the set. Blue boxes indicate sequences, green indicate filters and blue/red circles indicate merges. If you click on one of the tree items, the item information will be extracted into the text area above.

MREPS (top)

  1. MREPS mreps is software for finding all contiguous repetitions (periodicities) in a DNA sequence. It was developed at LORIA, France.

    Note: some changes were made to the alignment routine in TRDB version 2.02 to correctly align repeats found by mreps:

    If the repeat came from mreps, the alignment is performed without insertions/deletions (a very large indel penalty is imposed.) Also, the master pattern is formed by taking the majority letter in each column. Note that because alignment is not performed, the score is not available, so it is just set to 1. The clustering algorithm is unchanged.
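
    Forming a master pattern from the majority letter in each column can be illustrated as follows. This Python sketch is an assumption-laden simplification (hypothetical function name, copies obtained by cutting the repeat into period-length pieces with no indels, ties broken by the letter seen first) and is not the TRDB routine.

```python
from collections import Counter


def majority_pattern(sequence, period):
    """Build a consensus pattern from ungapped copies of length `period`."""
    if not 1 <= period <= len(sequence):
        raise ValueError("period must be between 1 and the sequence length")
    # Cut the repeat into consecutive copies; the last copy may be partial.
    copies = [sequence[i:i + period] for i in range(0, len(sequence), period)]
    pattern = []
    for col in range(period):
        letters = [c[col] for c in copies if col < len(c)]
        pattern.append(Counter(letters).most_common(1)[0][0])
    return "".join(pattern)
```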

My Account (top)

  1. modify user options
    1) Repeats Per Page - indicates how many repeats are displayed on one page while viewing a set (or cluster) of repeats.

    2) Show Repeats In New/Same window indicates whether a new browser window is created for each repeat while viewing it. It may be useful to use multiple windows if you want to compare two repeats next to each other.

  2. view explanation The View menu allows you to change the columns that are displayed when you are viewing repeats and on which you can filter and order them. First select the columns you wish to display by selecting them in the box on the left ("Available Extra Columns") and moving them to the box on the right ("Selected Extra Columns") by clicking the ">>" arrow. You can select multiple columns by holding down the SHIFT key. Some columns (like indices, pattern size and copy number) are there by default and cannot be removed. To save your selection, click the "SAVE COLUMNS" button. To return to the original settings, press the "DEFAULT" button. Now when you view the repeats, you will be able to view/filter/order on all the columns you just selected.

  3. changing personal information This option allows you to modify your contact information. Your name is used to refer to you in the database. You can either type your real name or some optional handle. Your address will only be used if we ever have to contact you by email. Make sure your contact email is correct, because it is also your login name (make sure you don't get locked out).

  4. changing password This option allows you to change your password. Type in the old password and the new password twice. As a reminder, if you forget your password, you can always go to the "password reminder" page. You can get to it from the login page.

  5. changing styles This option allows you to change the way your pages look. It has a set of predefined style sheets that you can choose to render your pages with. Later on we will add more options here for changing background color or image and maybe a feature to allow the user to upload their custom style sheets.

  6. modify alignment options This option allows you to modify your default alignment parameters as well as the length of the flank used by the "compare flanks" tool (see clustering for more information).

Partitions (top)

  1. partition explanation Partitions are produced by running a clustering algorithm against a set of repeats. Clustering is available through the TOOLS menu. After the algorithm is run, partitions contain clusters of similar repeats. The following information is displayed: completion status, cutoff value with which the algorithm was run, number of clusters produced, name of the clustering algorithm and the creation date. The following actions are available for partitions: delete. Clicking on the partition name takes you to the page that displays clusters. If the link is inactive and the status is set to "Pending", the partition is still being processed. Please come back to that page later.

Predicting Polymorphism (top)

  1. predicting polymorphism Polymorphism prediction is based on the method described in the paper F. Denoeud, G. Vergnaud and G. Benson, Predicting human minisatellite polymorphism Genome Research, in press. Prediction values are based on two factors, %GC in the repeat sequence and HistoryR, a value derived from the history reconstruction algorithm described in the paper G. Benson and L. Dong. Reconstructing the Duplication History of a Tandem Repeat. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB-99), 44-53, 1999.

    Essentially, the HistoryR value (a real number between 0 and 1) measures the levels of redundant mutations in the repeat (mutations that appear in the same position in several copies of the repeat), and redundant mutation motifs (the same or similar sets of mutations that appear in the several copies of the repeat). A larger number means more redundancy. Repeats are predicted to be polymorphic if HistoryR >= 0.54 and %GC >= 0.48.
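
    The decision rule itself is a pair of threshold tests. The Python sketch below uses hypothetical function names and treats %GC as a fraction between 0 and 1, matching the 0.48 threshold quoted above; computing HistoryR requires the history reconstruction algorithm and is not shown.

```python
def gc_fraction(sequence):
    """Fraction of G and C letters in a non-empty repeat sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def predicted_polymorphic(history_r, gc):
    """Apply the two thresholds: HistoryR >= 0.54 and %GC >= 0.48."""
    return history_r >= 0.54 and gc >= 0.48
```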

    These criteria were tested in repeats from human chromosomes 21 and 22. Within the set of repeats that were selected by the criteria, 59% had a heterozygosity value >= 0.5 and only 6% were monomorphic, whereas the background rate was 43% with heterozygosity >= 0.5 and 25% monomorphic.

    Polymorphism can be predicted using the %GC and HistoryR calculation by running the "Polymorphism Prediction" tool. Only the set owner can run this tool, as it modifies some fields on the source repeats. Once complete, the results will be stored in the "HistoryR" and "PredictedPolymorphism" columns of the set. You will need to add them to your view if you want to filter on them. Note that a repeat is currently not considered for polymorphism prediction in TRDB if PATSIZE is greater than 300 and COPYNUM is greater than 100. If this is unsuitable for you, please either use the standalone "history" program available on our server or email us so we can address your need.

Predicting Transcription Binding Sites (top)

  1. predicting transcription binding sites The general idea of predicting binding sites in a given DNA sequence is to compare it to a known or agreed-upon binding sequence, hence the idea of a binding site matrix. The matrix contains an agreed-upon sequence with weights for every nucleotide, which allows some variation and lets us calculate a percent similarity for potential binding sites. We have written a program that takes weight matrices and a DNA sequence as input and calculates the locations of potential binding sites by comparing the sequence to the matrix. We use the Aho-Corasick algorithm to search through the sequence for the sites that match the matrix (or matrices).

    The reason for having this tool in the database is that binding sites often occur in a tight cluster, producing a structure similar to a tandem repeat. If a repeat does turn out to be a cluster of these sites, one might assume that they code for the beginning of the transcription of a gene that is located nearby.

    Start by selecting a set and choosing the "transcription factor binding sites prediction" tool. Note: make sure you are the owner of all repeats inside the set, or you will not be allowed to run the tool.

    First, select the matrix you want to use. Right now we have a collection of matrices for human, rat and mouse species. We got the matrices from http://www.at.embnet.org/vbc where they are freely available for download. Currently we do not let people upload their own matrices, but if you have some you would like to add, send them to us and we will load them into our database.

    Second, select the cutoff score. We currently set the lowest limit to 90. If you find that this score is too conservative, email us.

    Once the tool has run, the results will be stored as annotations inside the database. You can filter the set based on the TFBS field as a regular annotation field (see annotations for more info). In addition, the number of sites overlapping a repeat is stored in the "TFBS Count" field.
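
    To give a feel for how a weight matrix and a cutoff score interact, here is a deliberately simplified Python sketch. It is not the TRDB program and does not use Aho-Corasick or the MatInspector similarity formula: the function name, the matrix layout (one dictionary of base weights per position), and the normalization (window weight divided by the best possible weight, times 100) are all assumptions.

```python
def scan(sequence, matrix, cutoff=90.0):
    """Return (position, percent similarity) for windows scoring >= cutoff."""
    best = sum(max(column.values()) for column in matrix)
    width = len(matrix)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Sum the weight of the observed base at each matrix position.
        total = sum(col.get(base, 0.0) for col, base in zip(matrix, window))
        similarity = 100.0 * total / best
        if similarity >= cutoff:
            hits.append((start + 1, similarity))  # 1-based start position
    return hits
```

    Counting how many returned hits overlap a repeat would give a value analogous to a per-repeat site count.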

    REFERENCES 1. K. Quandt, K. Frech, H. Karas, E. Wingender and T. Werner, MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data, Nucleic Acids Research, 1995, Vol. 23, No. 23 pp. 4878-4884.

Primers (top)

  1. primers extraction This product includes software developed by the Whitehead Institute for Biomedical Research.

    This tool outputs all repeat sequences (surrounded by 300 characters of flanking sequences) and prepares a primer3 input file. It then runs the primer3 software on the input file and lets the user download the output file. Primer3 input options are described below:

    PRIMER_EXPLAIN_FLAG (boolean, default 0)
    If this flag is non-0, produce PRIMER_LEFT_EXPLAIN, PRIMER_RIGHT_EXPLAIN, and PRIMER_INTERNAL_OLIGO_EXPLAIN output tags, which are intended to provide information on the number of oligos and primer pairs that Primer3 examined, and statistics on the number discarded for various reasons. If -format_output is set similar information is produced in the user-oriented output.

    PRIMER_PICK_INTERNAL_OLIGO (boolean, default 0)
    If the associated value is non-0, then Primer3 will attempt to pick an internal oligo (hybridization probe to detect the PCR product). This tag is maintained for backward compatibility. Use PRIMER_TASK.

    PRIMER_GC_CLAMP (int, default 0)
    Require the specified number of consecutive Gs and Cs at the 3' end of both the left and right primer. (This parameter has no effect on the internal oligo if one is requested.)

    Primer Size

    • Optimum (default 20)
      Optimum length (in bases) of a primer oligo. Primer3 will attempt to pick primers close to this length.

    • PRIMER_MIN_SIZE (default 18)
      Minimum acceptable length of a primer. Must be greater than 0 and less than or equal to PRIMER_MAX_SIZE.

    • PRIMER_MAX_SIZE (default 27)
      Maximum acceptable length (in bases) of a primer. Currently this parameter cannot be larger than 35. This limit is governed by maximum oligo size for which Primer3's melting-temperature is valid.


    Primer melt Temp

    • Optimum (default 60.0C)
      Optimum melting temperature (Celsius) for a primer oligo. Primer3 will try to pick primers with melting temperatures that are close to this temperature. The oligo melting temperature formula in Primer3 is that given in Rychlik, Spencer and Rhoads, Nucleic Acids Research, vol 18, num 12, pp 6409-6412 and Breslauer, Frank, Bloeker and Marky, Proc. Natl. Acad. Sci. USA, vol 83, pp 3746-3750. Please refer to the former paper for background discussion.

    • Minimum (default 57.0C)
      Minimum acceptable melting temperature(Celsius) for a primer oligo.

    • Maximum (default 63.0C)
      Maximum acceptable melting temperature(Celsius) for a primer oligo.


    Primer GC % content

    • Minimum (default 20.0%)
      Minimum allowable percentage of Gs and Cs in any primer.

    • Optimum (default 50.0%)
      Optimum GC percent. This parameter influences primer selection only if PRIMER_WT_GC_PERCENT_GT or PRIMER_WT_GC_PERCENT_LT are non-0.

    • Maximum (default 80.0%)
      Maximum allowable percentage of Gs and Cs in any primer generated by Primer.


    PRIMER_SELF_ANY (decimal,9999.99, default 8.00)
    The maximum allowable local alignment score when testing a single primer for (local) self-complementarity and the maximum allowable local alignment score when testing for complementarity between left and right primers. Local self-complementarity is taken to predict the tendency of primers to anneal to each other without necessarily causing self-priming in the PCR. The scoring system gives 1.00 for complementary bases, -0.25 for a match of any base (or N) with an N, -1.00 for a mismatch, and -2.00 for a gap. Only single-base-pair gaps are allowed. For example, the alignment

    5' ATCGNA 3'
       || ||
    3' TA-CGT 5'
    
    
    is allowed (and yields a score of 1.75), but the alignment
    5' ATCCGNA 3'
       ||  | |
    3' TA--CGT 5'
    
    is not considered. Scores are non-negative, and a score of 0.00 indicates that there is no reasonable local alignment between two oligos.
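
    The scoring system can be checked by hand against the first example. The Python sketch below (hypothetical function name) scores one given gapped alignment, with the top row written 5' to 3' and the bottom row 3' to 5'. It does not search for the best local alignment as Primer3 does, and it does not enforce the single-base gap restriction.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementarity_score(top, bottom):
    """Score two equal-length aligned rows; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    for a, b in zip(top.upper(), bottom.upper()):
        if a == "-" or b == "-":
            score -= 2.00          # gap
        elif a == "N" or b == "N":
            score -= 0.25          # any base (or N) against an N
        elif COMPLEMENT.get(a) == b:
            score += 1.00          # complementary bases
        else:
            score -= 1.00          # mismatch
    return score
```

    For the allowed alignment above, complementarity_score("ATCGNA", "TA-CGT") gives 1 + 1 - 2 + 1 - 0.25 + 1 = 1.75.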

    PRIMER_SELF_END (decimal 9999.99, default 3.00)
    The maximum allowable 3'-anchored global alignment score when testing a single primer for self-complementarity, and the maximum allowable 3'-anchored global alignment score when testing for complementarity between left and right primers. The 3'-anchored global alignment score is taken to predict the likelihood of PCR-priming primer-dimers, for example

    5' ATGCCCTAGCTTCCGGATG 3'
                 ||| |||||
              3' AAGTCCTACATTTAGCCTAGT 5'
    
    or
    5' AGGCTATGGGCCTCGCGA 3'
                   ||||||
                3' AGCGCTCCGGGTATCGGA 5'
    

    The scoring system is as for the Maximum Complementarity argument. In the examples above the scores are 7.00 and 6.00 respectively. Scores are non-negative, and a score of 0.00 indicates that there is no reasonable 3'-anchored global alignment between two oligos. In order to estimate 3'-anchored global alignments for candidate primers and primer pairs, Primer assumes that the sequence from which to choose primers is presented 5'->3'. It is nonsensical to provide a larger value for this parameter than for the Maximum (local) Complementarity parameter because the score of a local alignment will always be at least as great as the score of a global alignment.

    Max #N's (int, default 0)
    Maximum number of unknown bases (N) allowable in any primer.

Projects (top)

  1. create a project Go to the "create project" page by clicking the "create a project" button on the projects page. Once there, write the name and the description (optional) and click on the "CREATE" button.

  2. modify a project The "modify a project" page is accessible by clicking the modify action in front of the project. This option allows you to do two things. First, you can modify the project's name and description. Second, you can add collaborators to the project. Just type the user's email into the textbox at the bottom and press add. If the user is in our database, (s)he will be added to your project. You can remove the user at any time by clicking the [remove] link in front of the user's name.

  3. projects explanation Projects are holders for sets of repeats and the results of analysis. If you want to generate a new set, you must have an active project you can add it to. You cannot add sets to public projects. The projects page contains a list of projects you created/joined, allows you to perform various actions on them and lets you create new ones. The following information is displayed: project name, project owner, number of users in the project, number of sets in the project. The following actions are available for projects: delete, modify, info. The first two are only available if you are the owner of the project. To create a new project, press the "CREATE A NEW PROJECT" button at the bottom. You can add collaborators to the project by selecting the "modify" action in front of the project that you want to add people to.

  4. other options Delete option allows you to delete a project. Modify option allows you to modify project name, description and other fields. Clicking on info will pop up a window with more detailed information about a project. Note: you are not allowed to delete a project if there are still sets in it. Delete the sets first, and then delete the project.

Repeats (top)

  1. repeat explanation Repeats are tandem repeats found in DNA sequences by the TRF algorithm. In order to view them, you must select a set from the SETS menu and click on the "VIEW REPEATS" button. This will transport you to "sets > view repeats" pages. This page has two main sections: filter section, and viewing section.

  2. filter explanation The filter allows you to filter out repeats you do not want. Select a condition from the FIELD textbox, a comparison operator from the OP textbox and type the desired value into the VALUE textbox. For example, selecting PatternSize > 20 and pressing APPLY will filter out the repeats whose pattern size is less than or equal to 20. Once you have entered a filter, it will be displayed above the selection line. If you want to get rid of a filter, uncheck the checkbox in front of it and press APPLY.

    Note that some filters are unary. If you select the checkboxes filter, you do not need to provide the value. Simply select "checked" or "unchecked" from the OP. Once you select the needed Checkboxes filter, you can remove repeats and sequences by simply unchecking boxes in front of them. You need to press APPLY to apply the changes. The rectangular box with a border on top contains the set information. Pay attention to the number of repeats in the "original set" and always compare it to the number of repeats "after filter" as displayed below the filter. That indicates how many repeats are left after the filters you applied. On the right bottom side of the filter selection there is an ordering criteria selector. You can select which order you want the repeats to be displayed. The "Group by Sequences" check box indicates whether repeats should be grouped by the sequence they came from, or be all grouped together.
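
    The FIELD / OP / VALUE filtering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the database's own code: the dictionary-based repeat records, the OPS table and the apply_filters function are hypothetical names chosen for this example.

```python
import operator

# Hypothetical mapping from the OP selector to Python comparisons.
OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq, "!=": operator.ne,
}

def apply_filters(repeats, filters):
    """Keep only the repeats that satisfy every (field, op, value) filter."""
    kept = []
    for repeat in repeats:
        if all(OPS[op](repeat[field], value) for field, op, value in filters):
            kept.append(repeat)
    return kept

# Made-up repeat records, for illustration only.
repeats = [
    {"PatternSize": 12, "CopyNumber": 8.5},
    {"PatternSize": 20, "CopyNumber": 3.0},
    {"PatternSize": 25, "CopyNumber": 4.2},
]

# PatternSize > 20 drops every repeat with pattern size <= 20.
after_filter = apply_filters(repeats, [("PatternSize", ">", 20)])
print(len(repeats), len(after_filter))  # "original set" vs. "after filter"
```

    Comparing the two printed counts mirrors the comparison between the "original set" and "after filter" numbers shown on the page.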

  3. edit comment Comments are short (up to 300 chars) annotations entered by the user.

    Users can modify comments of repeats of projects of which they are members.

    Users cannot modify comments of repeats which are parts of public projects (only administrators can do that).

    If a set is copied from a public project into a private project, comments can be modified.

  4. view repeats explanation Each row of the repeats table represents a single repeat. The first column, indices, gives the positions where the repeat is located inside the original sequence. If you click on the indices link, a window will pop up with an alignment explanation and a visual representation of the repeat occurrences. "Pattern size" is the length of the consensus pattern. "Copy Number" is the number of copies detected. Other columns are optional and can be added and removed via the "VIEW" menu. Once you add one of these extra columns, you can filter and order by it. Right now there are a number of preselected columns to choose from; other columns may be added in the future. Right above the table there are "pages" links. They tell you how many pages of repeats there are and let you browse the pages by clicking the appropriate page numbers. Right above the repeats there is (usually) a gray box with information on the sequence that the repeats below it came from. If you unselect the "group by sequences" checkbox, this box will not appear.

  5. alignment explanation This page displays information about one repeat and its alignment. Information about the source sequence, annotations and other repeat characteristics is also displayed. Flanking sequences can optionally be displayed. When flanking sequences are selected, the database tries to find fractions of the pattern in them. It will display at most 10 fractions (in yellow) that have a score over 14.

Reports (top)

  1. reports explanation Reports are documents inside the LBI database that you can create to keep track of your work and/or share it with other people. They are designed so that it is easy to pull any kind of information out of the database and store it for further viewing. Your report can have an unlimited amount of text, as well as any number of pictures or records inserted in it (note: records and pictures are stored statically, therefore they will not change inside the report if your data is updated). Create a new report by pressing the "CREATE NEW REPORT" button. Once you have created your report, you can view it by clicking on the link which is also its name. To change the name and abstract of a report or to share it with other people, use the "modify" action. To actually add text and data to it, click on the "edit" action.

  2. create a report This page lets you create a new report by providing its name and abstract (optional). Press CREATE when you have finished filling in the required information and a new empty report will be created.

  3. modify a report This page lets you change the report name and abstract. In addition, you can share this report with other people by typing in their emails.

  4. edit a report When you click on the edit button of a new report you are presented with a single text area. You can start typing your text there. If you need to insert a resource, first save your work, then simply navigate to your resource inside the database and press the "SAVE RESOURCE" button. You will be prompted to select the report to add it to. Once you select the target report, press the "SAVE RESOURCE" button again. Your resource will be added to the end of the selected report. If you need your resource to be in a different place (like higher up in the report), simply move it around with "UP" or "DOWN" buttons. You can delete a resource from a report by pressing the "DELETE RESOURCE" button.

Sequences (top)

  1. sequence explanation Sequences are DNA sequences in FASTA format. The Sequences page contains a list of sequences you uploaded, allows you to perform various actions on them and lets you upload new ones. The following information is displayed: sequence name, number of sequences inside the FASTA file, sequence length (if there is more than one subsequence in the file, this number is the sum of all subsequences), upload date. The following actions are available for sequences: delete, modify, info, process, download, annotate. To upload a new sequence, press the "UPLOAD A NEW SEQUENCE" button at the bottom.
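
    As an illustration of the two numbers reported for a multi-sequence FASTA file (number of sequences and summed length), here is a minimal Python sketch. The function name fasta_summary is hypothetical and the parsing is deliberately simple: it assumes header lines start with ">" and every other non-empty line is sequence data.

```python
def fasta_summary(text):
    """Return (number of records, total sequence length) for FASTA text."""
    count = 0
    total = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue              # skip blank lines
        if line.startswith(">"):
            count += 1            # a header line starts a new subsequence
        else:
            total += len(line)    # sequence data; lengths are summed
    return count, total

example = ">seq1\nACGTACGT\nACGT\n>seq2\nGGCC\n"
print(fasta_summary(example))  # (2, 16)
```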

  2. upload a new sequence The "upload a new sequence" form expects a DNA sequence in FASTA format (note: if you upload a file, make sure it is saved as a regular text file and not something else, like Word or rich text format, etc.). You can also just copy and paste the sequence text into the "Cut and Paste" section. You must also provide the name of the sequence and the name of the organism. Genbank number and description are optional. After you fill in all the necessary information, press the "submit sequence" button. Please wait while the sequence is being uploaded. Afterwards, you should be transported back to the page that displays all your sequences. Your newly uploaded sequence should be on top of the list.

  3. process sequence In order to run the search algorithm on a sequence you need to process it. You can get to the "process sequence" form by clicking the process link in front of the sequence you wish to process. Once there, make sure you enter the name of the future set and which project to add it to. At this point, you must have already created/joined a project. Note, you cannot add anything to public projects. Once you press process, a new set is created and you are automatically transferred to the SETS page. If the set is not yet processed, it will have the status set to PENDING. You cannot do anything with the set until it is finished. You can keep on browsing the site while your set is being processed or simply keep refreshing the page.

  4. other sequence options Delete option allows you to delete a sequence. Modify option allows you to modify sequence name, description and other fields. Clicking on info will pop up a window with more detailed information about the sequence. Note: although deleting a sequence is allowable, some options (like set history) will not display correct information if this is done. You can also maintain annotation files from the "modify" menu (see "annotations" for more info).

  5. download sequence "Download sequences" tool provides the following options:

    First, a range can be selected, to download a part of the sequence. By default, the range is set to the whole length of the sequences. The first input box is the starting index (note we use a one based coordinate system). The second is the length. You cannot enter range exceeding the length of the sequence. (Note: if your file is a multiple sequences FASTA file, you will not be presented with the range selection.)

    Second, there is drop down box for the operating system that you use. Select the appropriate one to avoid funny end of line characters.

    Third, there is a checkbox that lets you mask your sequence with an already processed set (if one exists for this sequence(s)). (Note: you will be warned if the set does not contain any repeats from your sequence(s).) You can either mask the repeats with an "NA" character or use casing (make sure you use upper case masking if your sequence is in lowercase and vice versa).
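
    The range and masking options can be illustrated with a short Python sketch. It is not the tool's implementation: the function names are hypothetical, repeat locations are assumed to be one-based inclusive (first, last) index pairs, and the masking character "N" used in the example is an assumption made for illustration.

```python
def extract_range(sequence, start, length):
    """Return `length` bases beginning at the one-based index `start`."""
    if start < 1 or length < 0 or start - 1 + length > len(sequence):
        raise ValueError("range exceeds the length of the sequence")
    return sequence[start - 1:start - 1 + length]

def mask_repeats(sequence, intervals, mode="N"):
    """Mask one-based inclusive (first, last) intervals.

    mode "lower" or "upper" changes the case of the masked bases;
    any other value is used as the replacement character.
    """
    chars = list(sequence)
    for first, last in intervals:
        for i in range(first - 1, last):
            if mode == "lower":
                chars[i] = chars[i].lower()
            elif mode == "upper":
                chars[i] = chars[i].upper()
            else:
                chars[i] = mode
    return "".join(chars)

seq = "ACGTACGTAC"
print(extract_range(seq, 3, 4))               # GTAC
print(mask_repeats(seq, [(3, 6)], "N"))       # ACNNNNGTAC
print(mask_repeats(seq, [(3, 6)], "lower"))   # ACgtacGTAC
```

    The last line shows why the casing must differ from the sequence's own case: lowercase masking of an already lowercase sequence would leave no visible mark.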

Sets (top)

  1. sets explanation Sets are produced by running the TRF algorithm on a DNA sequence. They are collections of tandem repeats. Sets can be created, merged, copied, deleted and modified. The set is the main unit of the TRDB. Various tools (clustering, data download, etc.) are run on sets.

    "View Repeats" button sends you to the page that lets you view repeats in a set. "View History" allows you to view the set history. "View Distributions" lets you view the way repeats are distributed in a set by various attributes. "Run Tools" is equivalent to going to the TOOLS menu and selecting a set (again, set is the basic unit for most of the tools.)

    If you are the set owner, you can delete and modify sets (modifying changes the basic info like name and description.) All users can copy and merge sets (the result will have to be resaved as a different set.) Note: you need to have created/joined an active project in order to create/copy/merge sets into it.

  2. save or copy a set There are three reasons you might use this option: First, if you just filtered a set, you might want to save your results. Second, you might want to copy a set out of the public database into one of your projects. Finally, you might want to save a set as a cluster (to make some other tools available, like "compare flanks" or "label copies").

    First, select the project you want to add your new set to. Second, enter the name of your new set (a suggested name might already be provided there). Optionally, you might provide an up to 500-character description. Check the box above the set name to save the set as a cluster.

  3. merge sets Merging is the process of combining two sets of repeats into a new set that shares some of the units from both of them (note: the new set may be a join, an intersect, or something else).

    First, select the project to add your new set to. Second, select the type of the merge ("A or B", a join, is the default and the most commonly used). Third, pick the name for your new set (at least 4 characters). Optionally, you may provide a description of the new set (this may be very useful later when you are trying to figure out where that weird set came from).

    There is one more option you should be aware of, and that is the "Positional Merge". This option (unchecked by default) is useful when you want the repeats to be compared positionally, rather than referentially. That means two repeats will be considered the "same" if they came from the same sequence and their indexes are the same (or share a certain positional percentage). If the box is unchecked, repeats are considered the "same" only if they have the same database repeat id. Exercise caution with this option, as similar or redundant repeats may throw it off. This option is generally useful if you processed two sets with different parameters and want to figure out the difference. Another option is to annotate the source sequence with a GFF file made from the first set and then view the second set in the browser with annotations on.
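
    The difference between a referential and a positional merge can be sketched as follows. This is a hedged illustration only: the Repeat record, its field names, the "and" merge type, and the way the shared positional percentage is computed (overlap divided by the longer repeat) are all assumptions made for this example, not the database's actual rules.

```python
from collections import namedtuple

# Hypothetical record; the field names are assumptions for this sketch.
Repeat = namedtuple("Repeat", "repeat_id sequence_id first last")

def merge_by_id(a, b, how="or"):
    """Referential merge: repeats are the same only if their ids match."""
    by_id = {r.repeat_id: r for r in list(a) + list(b)}
    ids_a = {r.repeat_id for r in a}
    ids_b = {r.repeat_id for r in b}
    keep = ids_a | ids_b if how == "or" else ids_a & ids_b
    return [by_id[i] for i in sorted(keep)]

def same_position(x, y, min_share=1.0):
    """Positional test: same sequence and a large enough shared span."""
    if x.sequence_id != y.sequence_id:
        return False
    overlap = min(x.last, y.last) - max(x.first, y.first) + 1
    if overlap <= 0:
        return False
    longer = max(x.last - x.first, y.last - y.first) + 1
    return overlap / longer >= min_share

def positional_intersect(a, b, min_share=1.0):
    """Repeats of A that have a positional match somewhere in B."""
    return [x for x in a if any(same_position(x, y, min_share) for y in b)]

# Two sets from the same sequence (id 10), processed separately,
# so the database ids differ even where the positions agree.
set_a = [Repeat(1, 10, 100, 150), Repeat(2, 10, 300, 340)]
set_b = [Repeat(7, 10, 100, 150), Repeat(8, 10, 310, 340)]

print(len(merge_by_id(set_a, set_b, "or")))            # 4
print(len(merge_by_id(set_a, set_b, "and")))           # 0
print(len(positional_intersect(set_a, set_b)))         # 1
print(len(positional_intersect(set_a, set_b, 0.7)))    # 2
```

    In this made-up data the referential intersect is empty because no ids are shared, while the positional comparison recovers the repeats that occupy the same place in the sequence.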

sets (top)

  1. Set Import SET IMPORT lets you import a set into the database using a native .DAT file (TRF/IRF output). Please note that in the TRDB version of this tool, the pattern is recomputed to produce a better pattern, and a profile is calculated. In IRDB, profile calculation is skipped.

Sets (top)

  1. set comparison This tool compares two sets. Comparison is done only on repeats that came from the same sequence. The tool creates a report, which needs to be refreshed to see the final results.

    Segments are areas covered by repeats (for example two nearby repeats that overlap form one segment).

    Difference sets are also calculated and are stored separately.
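
    The notion of a segment defined above (an area covered by overlapping repeats) corresponds to the standard merging of overlapping intervals. A minimal sketch, assuming repeats are given as inclusive (first, last) index pairs on one sequence and that repeats must share at least one position to be joined; the function name is hypothetical:

```python
def segments(intervals):
    """Merge overlapping (first, last) intervals into covered segments."""
    merged = []
    for first, last in sorted(intervals):
        if merged and first <= merged[-1][1]:
            # Overlaps the current segment: extend it if needed.
            merged[-1][1] = max(merged[-1][1], last)
        else:
            # No overlap: start a new segment.
            merged.append([first, last])
    return [tuple(s) for s in merged]

print(segments([(140, 200), (100, 150), (400, 450)]))
# [(100, 200), (400, 450)]
```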

sub-pattern strength (top)

  1. sub-pattern strength explanation Sub-pattern strength is a measure that can be used to identify repeats with smaller repeating units than the reported pattern size. For example, TRF may report a repeat with pattern size 25 as having a pattern size of 100 because it scores better at that size. In this case the sub-pattern strength can be used to identify the existence of 4 repeating units inside the repeat's pattern. The computation is based on a Discrete Fourier Transform (DFT) of the scores obtained by cyclically aligning a pattern against itself. The strength is the ratio of the highest harmonic (after the first) to the first harmonic. Higher values indicate a stronger signal for the sub-pattern. The harmonic is also given, and indicates how many repeating units are identified at the given strength. For example, a pattern size of 36 with a strong 3rd harmonic indicates a possible pattern of size 12.
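
    The computation can be sketched in Python under simplifying assumptions. This is not the database's implementation: the alignment score for each cyclic shift is assumed here to be a plain count of matching positions (the real scoring scheme is not described on this page), only harmonics up to half the pattern size are examined, and the function names are hypothetical.

```python
import cmath

def cyclic_match_scores(pattern):
    """Score each cyclic shift by counting matching positions (assumption)."""
    n = len(pattern)
    return [sum(pattern[i] == pattern[(i + k) % n] for i in range(n))
            for k in range(n)]

def sub_pattern_strength(pattern):
    """Return (strength, harmonic) from the DFT of the shift scores."""
    scores = cyclic_match_scores(pattern)
    n = len(scores)
    if n < 4:
        raise ValueError("pattern too short to have a second harmonic")
    magnitudes = {}
    for h in range(1, n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * cmath.pi * h * k / n)
                    for k, s in enumerate(scores))
        magnitudes[h] = abs(coeff)
    first = magnitudes[1]
    # Highest harmonic after the first.
    best = max((h for h in magnitudes if h > 1), key=magnitudes.get)
    # A perfectly periodic pattern has a (near) zero first harmonic.
    strength = magnitudes[best] / first if first > 0 else float("inf")
    return strength, best

# Pattern of size 12 made of three ACGT units, with one substitution.
pattern = "ACGTACGTACGA"
strength, harmonic = sub_pattern_strength(pattern)
print(harmonic, round(strength, 3), len(pattern) // harmonic)  # 3 16.0 4
```

    Here the strongest harmonic is the 3rd, which points to 3 repeating units and a possible pattern of size 12 / 3 = 4, in the same way that a strong 3rd harmonic for a pattern of size 36 points to size 12.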

Tools (top)

  1. tools explanation The TOOLS menu gives you access to a number of tools you can run. A TOOL is defined as follows: "an action that affects some part of the database and generates some data which is either stored for later processing or immediately downloaded." Some database objects have ACTIONS associated with them (ex: you can select view or process actions for any sequence while viewing them). ACTIONS are also tools, but they are so closely tied to their objects that it was decided to put them directly on the object pages.

TRF (top)

  1. TRF Tandem Repeats Finder is a program to locate and display tandem repeats in DNA sequences. It was developed by Dr. Gary Benson.


