CARRIE Help

Computational Ascertainment of Regulatory Relationships Inferred from Expression

Background

The CARRIE Web Service can automatically infer a hypothetical transcriptional regulatory network from microarray and promoter sequence data. Given microarray data that show the changes in expression between two conditions, say before and after stimulating some cellular receptor, CARRIE can determine which genes showed changes in gene expression in response to that stimulus, which transcription factors are the most likely to have regulated that expression change, and specifically which transcription factor or factors were most likely to regulate each gene. The resulting network is displayed in an interactive format that includes additional sources of data that can assist in forming hypotheses about the network.

Instructions

Using the CARRIE Web Service is a simple two-step process. In the first step you upload your microarray data and choose significance cutoffs for gene expression changes and transcription factor binding sites. In the second step you view the analysis of the promoters selected from your microarray study and decide which transcription factors are significantly likely to regulate the genes in your network. After completing step two you will be presented with an interactive map of the transcriptional regulatory network inferred from your microarray data.

If you want to know more about something on the page, try clicking on it. Most form or graph elements will raise popup messages about themselves if they are clicked.

This site is best viewed in a modern browser. Version 6.0 or higher of Netscape or Internet Explorer is appropriate, but Mozilla or one of its cousins would be best. Mozilla is a free, modern browser that is available for most operating systems. It offers more features than Internet Explorer and does a better job of meeting the international standards that make advances in web technology possible.

Microarray Data

It is important to make sure that your microarray data has the right file format. Improper formats will cause problems. We request that you format your data in one of two ways:

Pre-Processed Data

If you would like to use your own microarray data analysis, then use the following file format. Each line should contain five items with tabs in between. (Exporting your data from Excel as plain text is fine.) The five items should be as follows:

  1. Gene Accession. This is the unique identifier for the gene. If you would like to use our promoters then you should use the ORF name (such as YAL005C) for yeast genes and the Genbank Accession (such as V00756 or AA039124) for other organisms.
  2. Median Abundance. This is the median expression level for all measurements of this gene. The scale of the numbers is not important.
  3. P-value. This is the probability that the observed change in mRNA abundance from condition one to condition two could have occurred by chance. Smaller values indicate more significant expression changes.
  4. Fold Change. This is the change in mRNA abundance from condition one to condition two. Again the scale of the numbers is not important. Negative numbers are allowed. You will be given an opportunity to specify a fold change cutoff for significant changes. If you do not wish to use fold change as a criterion for significance simply specify a cutoff of 1 (or 0 for logged values).
  5. Gene Name. This will be the name used to identify each gene in the final output. You may specify any name here as long as it does not contain tab characters.

Here is a short example of the Pre-Processed Array Data File Format:
#Acc	Abund	Probability	F-C	Name
YAL008W	40	8.84E-01	1	FUN14
YAL009W	3.5	8.92E-01	1.2	SPO7
YAL010C	7.2	8.52E-01	2	MDM10
YAL011W	5	8.38E-01	4.2	SWC1
YAL012W	20	5.21E-01	1.8	CYS3
YAL013W	8	8.58E-01	3.5	DEP1
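
As an illustration only, the five-column format above can be read and filtered with a few lines of Python. This is a minimal sketch, not CARRIE's own code: the function names `read_preprocessed` and `significant` are hypothetical, and the filtering rule (P-value at or below the cutoff, absolute fold change at or above the cutoff) is an assumption about how the cutoffs might be applied.

```python
import io

def read_preprocessed(handle):
    """Yield one dict per gene from a tab-delimited pre-processed file."""
    for line in handle:
        line = line.rstrip("\r\n")
        # Skip blank lines and header lines that start with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 tab-separated items, got {len(fields)}: {line!r}")
        accession, abundance, p_value, fold_change, name = fields
        yield {
            "accession": accession,
            "abundance": float(abundance),
            "p_value": float(p_value),
            "fold_change": float(fold_change),
            "name": name,
        }

def significant(genes, p_cutoff, fc_cutoff):
    """Assumed rule: keep genes with p <= p_cutoff and |fold change| >= fc_cutoff."""
    return [
        g for g in genes
        if g["p_value"] <= p_cutoff and abs(g["fold_change"]) >= fc_cutoff
    ]

example = (
    "#Acc\tAbund\tProbability\tF-C\tName\n"
    "YAL008W\t40\t8.84E-01\t1\tFUN14\n"
    "YAL011W\t5\t8.38E-01\t4.2\tSWC1\n"
)
genes = list(read_preprocessed(io.StringIO(example)))
selected = significant(genes, p_cutoff=0.9, fc_cutoff=2.0)
```

Raising an error on any line that does not have exactly five items is a deliberate choice in this sketch, since a missing tab is the most likely formatting mistake.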

Un-Processed Data

If you would like to use our microarray data analysis tool, then use this "Un-Processed Data Format". This format is simpler than the "Pre-Processed Array Data Format". The first column specifies the Gene Accession and the last column specifies the Gene Name as before. The columns between the first and last specify the abundance of mRNA for that gene in multiple conditions. If you are using Affymetrix-style data this would be the "Average Difference" value or the like. If you are using two-color microarray chips, then each value would be a background-subtracted "Red" or "Green" intensity.

The first two rows are also important in this file format. The first row should specify the experimental condition (1 for condition 1 and 2 for condition 2) represented by each column. The second row should give a short name for the condition in each column (e.g. Chip1 or Control3).

Here is an example of this file format. The first column values #Groups and #Acc are required. We limit the number of data columns to 20 or fewer in order to prevent unreasonable loads on our server.
#Groups	1	1	1	2	2	2
#Acc	WT1	WT2	WT3	Ex1	Ex2	Ex3	Gene_Info
M86671	6.99509	6.86046	7.05552	7.06867	6.86269	6.76425	My Favorite Gene
J04423	7.1	6.82682	6.83204	6.86906	6.7119	6.86151	actin, beta, cytoplasmic
J04423	6.80759	6.71316	6.8	8.18634	8.32921	9.55781	A kinase anchor protein
M60469	8.18282	8.19393	8.22964	8.11692	8.2	7.27472	inhibin beta-C

Our microarray data analysis uses a "permutation test" to judge the significance of the mRNA abundance changes observed between the two conditions for each gene. Currently we do not adjust each probability to account for the "multiple testing problem". We suggest that you use at least three replicates for each condition in order to give the test acceptable power.
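
To give a feel for how a permutation test works, here is a minimal Python sketch. It is not CARRIE's implementation: the test statistic (absolute difference of group means), the exhaustive enumeration of relabelings, and the function name `permutation_p_value` are all assumptions made for illustration.

```python
from itertools import combinations

def permutation_p_value(values, groups):
    """Two-sided exact permutation p-value for a difference in group means.

    values: abundances for one gene, one per data column.
    groups: matching condition labels (1 or 2), as in the #Groups row.
    """
    first = [i for i, g in enumerate(groups) if g == 1]
    n, n1 = len(values), len(first)
    if n1 == 0 or n1 == n:
        raise ValueError("both conditions need at least one column")
    total = sum(values)

    def abs_mean_difference(indices):
        s1 = sum(values[i] for i in indices)
        return abs(s1 / n1 - (total - s1) / (n - n1))

    observed = abs_mean_difference(first)
    hits = count = 0
    # Try every way of assigning n1 of the n columns to condition 1.
    for relabeling in combinations(range(n), n1):
        count += 1
        # Small tolerance so ties are not lost to floating-point rounding.
        if abs_mean_difference(relabeling) >= observed - 1e-12:
            hits += 1
    return hits / count

row = [6.80759, 6.71316, 6.8, 8.18634, 8.32921, 9.55781]
p = permutation_p_value(row, [1, 1, 1, 2, 2, 2])
```

In this sketch, three replicates per condition give only 20 distinct relabelings, so the smallest two-sided p-value it can report is 2/20 = 0.1. Fewer replicates leave even less room, which is one way to see why replicate count limits the power of such a test.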

Promoters

You can use our promoters if you are working with yeast data or the Affymetrix HGU95 set, the HGU133 set (A, B, and P), or mu11ksubB chips (more info). If you would like to specify your own promoters, make sure that they are in FASTA format and upload them in the supplied box. Please make sure that your FASTA format looks like this:

>M86671	My favorite gene
ACTGCTACACTACACAC...
>M60412	My other favorite gene
CTAGCTAGACTACTACT....

The first line for each promoter should contain the accession and gene name. The accession and gene name can be separated by a tab or by a pipe character "|", but not by spaces. The sequences should be in upper case letters. Lower case letters and 'N's are taken to be filtered-out regions that should be ignored. Your sequences can be split up over multiple lines.
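
If you want to check a promoter file before uploading it, the rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not CARRIE's parser: the function names `read_promoters` and `scannable_fraction` are hypothetical, and treating every character other than upper case A, C, G, or T as masked is our reading of the rule about lower case letters and 'N's.

```python
import re

def read_promoters(lines):
    """Parse FASTA lines into {accession: (gene_name, sequence)}."""
    promoters = {}
    accession = name = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if accession is not None:
                promoters[accession] = (name, "".join(chunks))
            # Split accession from gene name on the first tab or pipe only.
            parts = re.split(r"[\t|]", line[1:], maxsplit=1)
            accession = parts[0].strip()
            name = parts[1].strip() if len(parts) > 1 else ""
            chunks = []
        elif line.strip():
            # Sequences may be split over several lines; join them later.
            chunks.append(line.strip())
    if accession is not None:
        promoters[accession] = (name, "".join(chunks))
    return promoters

def scannable_fraction(sequence):
    """Fraction of bases that are upper case A, C, G, or T (not masked)."""
    if not sequence:
        return 0.0
    return sum(base in "ACGT" for base in sequence) / len(sequence)

fasta = [
    ">M86671\tMy favorite gene",
    "ACTGac",
    "NNAC",
    ">M60412|My other favorite gene",
    "CTAG",
]
promoters = read_promoters(fasta)
```

A promoter whose scannable fraction is close to zero is almost entirely masked, which may be worth checking before you upload the file.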

If you would like to use your own promoters we suggest using Promoser for mouse, rat, or human promoters and the Cold Spring Harbor resource for yeast promoters, as we did.

Output

The CARRIE Web Service site has three pages. The first page allows you to upload your data. The second and third pages show the results.

Page Two: Analysis of Promoters

After the microarray data is analyzed to select genes that show significant changes in gene expression, CARRIE analyzes the promoters of these genes for significant binding sites for each of the transcription factors in the matrix group you selected on Page One. The table on this page displays, for each transcription factor, the probability that the observed number of binding sites would occur by chance in the selected promoters.

Page two provides three features to help you choose transcription factors.

The first is TRANSFAC redundancy reduction. The TRANSFAC collection of matrices contains, in some cases, many matrices for the same transcription factor. TRANSFAC matrices are annotated as belonging to a particular 'Factor', and factors may be assigned overlapping groups of matrices. We have assigned a single unique group identifier to all PSSMs belonging to a given factor, together with the PSSMs belonging to factors with overlapping sets of matrices. Choosing "on" next to "TRANSFAC redundancy reduction" and clicking the "Apply" button limits the ROVER results to the highest scoring PSSM for each group.

The second is matrix clustering. You can determine the similarity of any selected matrices by clicking the "Cluster Now" button. The matrices are compared using the "malign" matrix alignment tool by Martin Frith, and a hierarchical clustering diagram is then produced using the R function "hclust". This clustering can reveal redundancy in the matrix list and potential false positives: transcription factors that may not be biologically relevant but have binding sites similar to those of relevant transcription factors.

The third is the option to apply a multiple testing correction to the ROVER P-values. Choosing "on" next to "Rover P-value Multiple Testing Correction:" applies the Benjamini and Yekutieli (2001) stepped correction for strong control of the False Discovery Rate, as described in: Dudoit, S., Shaffer, J.P., and Boldrick, J.C. (2002) Multiple Testing Corrections in Microarray Experiments, U.C. Berkeley Division of Biostatistics Working Paper Series, Paper 110.
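
For readers who want to see what a Benjamini and Yekutieli correction does to a list of P-values, here is a minimal Python sketch of the commonly used step-up form of that adjustment. How CARRIE itself computes the correction is not documented here, so treat the function name `benjamini_yekutieli` and the details below as assumptions for illustration.

```python
def benjamini_yekutieli(p_values):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Penalty for arbitrary dependence between tests: 1 + 1/2 + ... + 1/m.
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Step up: walk from the largest p-value down to the smallest so that
    # adjusted values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * harmonic / rank)
        adjusted[i] = running_min
    return adjusted

adjusted = benjamini_yekutieli([0.01, 0.04, 0.03])
```

Each adjusted value is at least as large as the raw P-value, so a transcription factor that passes your cutoff after correction also passed it before.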

At this point you should select the transcription factors with significant probabilities to include in your network. Transcription factors that correspond to genes in your microarray data that showed significant expression changes are pre-selected for you and colored in red. This is done because it is likely that if a stimulus changes the expression of a transcription factor gene, then that transcription factor plays some role in the response to that stimulus. This association of transcription factors and genes from the microarray data is done using NCBI LocusLink data, which may not be available for every gene.

Step two of page two gives you an opportunity to specify a gene that you know to be affected in your microarray study. For example, you may have stimulated or repressed a particular receptor protein. If you choose to specify a gene here, it will be included in the final output. Choose "Positive" or "Negative" for receptors that were stimulated or repressed, respectively.

Step three of page two offers the opportunity to include other sources of network data in the final network map. You may choose from a selected set of KEGG regulatory pathways with gene accessions appropriate for specific species, or upload data from KEGG or other sources formatted in KGML, the KEGG XML format. KEGG pathways will not always use gene accessions that correspond to those in your microarray data. BE SURE TO INSPECT KEGG'S ACCESSIONS before uploading a data set. If you upload your own data, be sure to use the entry, component, relation, and subtype tags and their name or entry1/entry2 attributes. Be sure to use organism:accession pairs for entry names as KEGG does (name="sce:YBR083W").
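
One way to inspect the accessions in a KGML file before uploading it is sketched below in Python. The KGML fragment is a hypothetical, heavily trimmed example written for this illustration (the pathway name in particular is made up), and the function name `kgml_accessions` is an assumption; only the entry tag and its name attribute are examined.

```python
import xml.etree.ElementTree as ET

def kgml_accessions(kgml_text):
    """Return (accessions, malformed_names) from the entry tags of KGML text."""
    root = ET.fromstring(kgml_text)
    accessions, malformed = [], []
    for entry in root.iter("entry"):
        # An entry name may list several space-separated identifiers.
        for name in entry.get("name", "").split():
            organism, separator, accession = name.partition(":")
            if separator and organism and accession:
                accessions.append(accession)
            else:
                # Not an organism:accession pair; report it for inspection.
                malformed.append(name)
    return accessions, malformed

# Hypothetical fragment for illustration only.
kgml = (
    '<pathway name="path:sce00000" org="sce">'
    '<entry id="1" name="sce:YBR083W" type="gene"/>'
    '<entry id="2" name="YAL005C" type="gene"/>'
    '<relation entry1="1" entry2="2" type="GErel">'
    '<subtype name="expression"/>'
    '</relation>'
    '</pathway>'
)
accessions, malformed = kgml_accessions(kgml)
missing = [a for a in accessions if a not in {"YAL005C", "YAL008W"}]
```

Here `missing` lists pathway accessions that do not appear in a (made-up) set of microarray accessions, which is the mismatch the warning above is about.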

Page Three: Your Inferred Network

The third page displays a map of the transcriptional regulatory network inferred from your microarray and promoter data. The transcription factors you selected on page two are shown in red. If you specified a stimulated or repressed gene on page two, then that gene will be shown in green. The other genes selected from your microarray data will be shown in white (if CARRIE detected significant binding sites in their promoters for any of the transcription factors you selected). Solid black arrows represent direct regulation of a gene by a transcription factor. Black dashed lines represent the inferred indirect regulation of the transcription factors by the receptor (if specified). Blue arrows and labels are used for the KEGG or other KGML-formatted data included on page two.

The network map is interactive. Passing the mouse over any of the genes will bring up a description, the microarray and transcription factor binding site data, and the Gene Ontology data for that gene (if available). Clicking on a gene will open a new window showing the NCBI LocusLink information for that gene (if it is available).

Downloadable Data Files

The final results are also available in a downloadable zip file. The files have the following names and contents:

  • *.xls: Your processed microarray data
  • *.acc.xls: The genes that met your significance cutoffs in the microarray analysis
  • *.rover.vs.all.xls: A short summary of the significance of transcription factor binding sites for each transcription factor in Excel format.
  • *.html: The web page showing the map of your inferred network.
  • *.png: The picture of your inferred network
  • *.prom.fa: The promoters of the genes with significant gene expression changes.
  • *.bg.prom.fa: The promoters used as examples without significant binding sites.
The raw text files with the .xls extension can be opened in Microsoft Excel or any text editor.

Details

Citing CARRIE


The design and testing of CARRIE is detailed in this article:
Haverty, P.M., Hansen, U., Weng, Z.
Computational Inference of Transcriptional Regulatory Networks from Expression Profiling and Transcription Factor Binding Site Identification
2004, Nucleic Acids Research, Vol. 32, 179-188.
This publication details the CARRIE Web Service specifically:
Haverty, P.M., Frith, M.C., Weng, Z.
CARRIE Web Service: Automated Transcriptional Regulatory Network Inference and Interactive Analysis
2004, Nucleic Acids Research, In Press.

Data Used by CARRIE

Promoters

Yeast: Promoters for 6221 yeast ORFs were downloaded from Cold Spring Harbor on 6/16/2003. The promoters consist of the 1000 bases upstream of the transcription start site. To use these promoters, your array data should identify genes by Systematic Name (e.g. YFL056W).

Affymetrix mu11ksubB: Promoters for 2017 of the 6186 Genbank accessions on this chip were downloaded from Promoser on 7/3/2003. This chip uses a surprisingly large number of EST sequences that have been removed from Genbank or are too short to be placed on the genome accurately. The promoters stretch from 1000 bases upstream to 100 bases downstream of the transcription start site. To use these promoters, your array data should identify genes by Genbank Accession (e.g. M87653).

Affymetrix HG-U95: Promoters for 41076 of the 88522 unique accessions on the A-E chips were downloaded from Promoser on 10/31/2005. The promoters stretch from 1000 bases upstream to 100 bases downstream of the transcription start site. To use these promoters, your array data should identify genes by Genbank Accession (e.g. M99714). These accessions can be obtained from the Affymetrix annotation files.

Affymetrix HG-U133: Promoters for 29912 of the 51344 unique accessions on the A, B, and P 2.0 chips were downloaded from Promoser on 10/31/2005. The promoters stretch from 1000 bases upstream to 100 bases downstream of the transcription start site. To use these promoters, your array data should identify genes by Genbank Accession (e.g. M99714). These accessions can be obtained from the Affymetrix annotation files.
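
The mouse and human promoter sets above span 1000 bases upstream to 100 bases downstream of the transcription start site. If you prepare your own promoters with the same layout, the window can be cut out as in the following Python sketch. The function name `promoter_window`, the 0-based coordinate convention, and counting the start site itself as the first downstream base are all assumptions of this example, not a description of how Promoser works.

```python
# Translation table for complementing a DNA string, preserving case and Ns.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")

def promoter_window(chromosome, tss, strand, upstream=1000, downstream=100):
    """Return the promoter around a transcription start site.

    chromosome: forward-strand sequence as a string.
    tss: 0-based index of the start site on the forward strand.
    strand: '+' or '-'. The result always reads 5' to 3' on the gene's strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
        return chromosome[max(start, 0):end]
    if strand == "-":
        start, end = tss - downstream + 1, tss + upstream + 1
        window = chromosome[max(start, 0):end]
        # Reverse-complement so the window reads in the gene's direction.
        return window.translate(_COMPLEMENT)[::-1]
    raise ValueError("strand must be '+' or '-'")

toy = "A" * 10 + "C" + "G" * 5
plus = promoter_window(toy, 10, "+", upstream=2, downstream=3)
minus = promoter_window(toy, 10, "-", upstream=2, downstream=3)
```

Windows that would run off the start of the sequence are simply truncated in this sketch, so promoters near a sequence edge can come out shorter than 1100 bases.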

Transcription Factor Binding Data

Position-Specific Scoring Matrices (PSSMs) giving the frequencies of A, C, G, and T in known DNA binding sites were obtained from JASPAR and TRANSFAC. The JASPAR set contains 86 matrices representing binding sites for transcription factors from a variety of multicellular eukaryotes. The TRANSFAC set, consisting of 636 families of transcription factors, was obtained from version 7.2 of TRANSFAC Professional. The 623 families of transcription factors in this resource contain 341 human examples, 297 mouse examples, 175 rat examples, 48 fruit fly examples, and 43 yeast examples. (Note: One family may contain examples of transcription factors from multiple species.) A set of 466 matrices representing vertebrate transcription factors is also provided.

JASPAR Reference

Albin Sandelin, Wynand Alkema, Pär Engström, Wyeth Wasserman and Boris Lenhard
JASPAR: an open access database for eukaryotic transcription factor binding profiles Nucleic Acids Res. 2004 Jan; 32(1) Database Issue

TRANSFAC Reference

Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E.
TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003 Jan 1;31(1):374-8.

Contact Info

If you have any questions, problems, or suggestions for improvements, feel free to contact the CARRIE administrator.



Instructions Last Updated: Friday, 04-Nov-2005 23:53:27 EST