
Documentation for Predictome


Predictome is a database of putative links between proteins, predicted from the sequence data of the available genomes of 71 microorganisms.

Summary of Computational Prediction Methods

Proteins A and B are termed "close" if the genes that code for them are within 300 base pairs of each other along the chromosome and are transcribed in the same direction [3]. Two "close" proteins are chromo-linked if their respective orthologs are also "close" in at least one other genome of a different phylogenetic lineage (as annotated in the COGs database).
A and B are fusion-linked if they occur as a single fusion gene in another organism. Domain fusions represented in some genomes by single domain fusion components are split so that each fusion component can be assigned to a different COG. When distinct domains of a fusion protein are in different COGs, those COGs are said to be fusion-linked COGs.
The distribution of a given protein across different phyla can be represented as a profile [2]. When protein A is present in exactly the organisms in which protein B is present, and absent wherever B is absent, i.e. the two share identical phylogenetic profiles, they are phylo-linked. If a phylogenetic pattern is found to be too common (shared by more than 6 COGs), it is ignored.
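The "close" test and the phylogenetic-profile comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Predictome's actual implementation: the `Gene` record, the function names, and the reading of "within 300 base pairs" as the gap between the end of one gene and the start of the next are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def are_close(a: Gene, b: Gene, max_gap: int = 300) -> bool:
    """True if both genes are on the same strand and the intergenic gap
    is at most max_gap base pairs (assumed reading of 'within 300 bp')."""
    if a.strand != b.strand:
        return False
    first, second = sorted((a, b), key=lambda g: g.start)
    gap = second.start - first.end - 1  # negative when the genes overlap
    return gap <= max_gap


def phylo_linked_cogs(
    presence: Dict[str, Set[str]], max_cogs_per_pattern: int = 6
) -> List[Tuple[str, str]]:
    """Pair up COGs whose presence/absence profiles are identical.

    presence maps a COG id to the set of genomes in which it occurs.
    Profiles shared by more than max_cogs_per_pattern COGs are ignored.
    """
    by_profile = defaultdict(list)
    for cog, genomes in presence.items():
        by_profile[frozenset(genomes)].append(cog)

    links = []
    for cogs in by_profile.values():
        if 2 <= len(cogs) <= max_cogs_per_pattern:
            cogs = sorted(cogs)
            links.extend(
                (x, y) for i, x in enumerate(cogs) for y in cogs[i + 1:]
            )
    return sorted(links)
```

For example, with `presence = {"COG1": {"g1", "g2"}, "COG2": {"g1", "g2"}, "COG3": {"g1"}}`, only `COG1` and `COG2` share a profile, so they form the single phylo-link. A chromo-link would additionally require `are_close` to hold for the orthologs in a second genome from a different lineage, which is not modelled here.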

 

In addition, published experimental data of protein-protein interactions from yeast two-hybrid, co-immunoprecipitation and other methods are included in the database.

 

Functional Summaries of Linked Proteins

Using the model that genes and proteins linked by the methods above can share various levels of functional association, Predictome provides a summary of the linked proteins for each protein in the database. There are two tools which provide this:

Gene Ontology (GO) Summary

The Gene Ontology project provides a systematic framework for describing the functions of genes and proteins in cells. The structured vocabulary of GO allows each functional assignment for a gene or protein to be related to others, regardless of whether this relation is only general or very specific. Each protein has one or more assignments of a defined functional term in GO, depending on what is known about it. Each term exists in a hierarchy of terms (e.g. protein -> enzyme -> reductase), so each term represents a set of paths to the root of the hierarchy. For any protein in the database, the paths for its set of interactors in the link network can be computed and then examined for overlap. The overlap between these paths provides a quick view of the functional relatedness of the set of proteins being viewed.
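The path-overlap idea can be illustrated with a small Python sketch. It assumes a toy hierarchy stored as a child-to-parents mapping and uses the example terms from the text rather than real GO identifiers; the function names and data layout are hypothetical and are not taken from Predictome itself.

```python
def paths_to_root(term, parents):
    """Return every path from term up to a root, each as a tuple of terms.

    parents maps a term to the list of its direct parent terms; a term
    with no entry (or an empty list) is treated as a root.
    """
    ups = parents.get(term, [])
    if not ups:
        return [(term,)]
    return [(term,) + path for p in ups for path in paths_to_root(p, parents)]


def shared_terms(proteins, annotations, parents):
    """Terms that lie on at least one root path of every protein given."""
    per_protein = []
    for prot in proteins:
        terms = set()
        for term in annotations[prot]:
            for path in paths_to_root(term, parents):
                terms.update(path)
        per_protein.append(terms)
    return set.intersection(*per_protein) if per_protein else set()


# Toy hierarchy: protein -> enzyme -> reductase, plus a sibling term.
parents = {"reductase": ["enzyme"], "kinase": ["enzyme"], "enzyme": ["protein"]}
annotations = {"P1": ["reductase"], "P2": ["kinase"]}

overlap = shared_terms(["P1", "P2"], annotations, parents)
```

Here `P1` and `P2` have different specific terms, but their paths overlap at `enzyme` and `protein`, which is the kind of shared, more general function the summary is meant to surface.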

Annotation Phrase-Building Summary

While the GO Summary is an efficient summary tool for a large set of linked proteins, it is dependent on the annotation of the genes and proteins in the set. In many cases, the annotation for a given protein is only putative, or very general. A set of proteins may contain a mixture of well-annotated and poorly annotated members. In these cases, it makes more sense to compare the text of the original annotations themselves, using information in the various public sequence and annotation databases (InterPro, GenBank, GeneQuiz, etc.). The phrase-building tool takes a result set of proteins from the Predictome database, compiles their annotations and searches for any common text motifs. Text which is uninformative is screened out in this process. The result is a list of shared "text elements", each of which is some shared functional phrase for the set.
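The documentation does not say how the phrase-building tool finds common text motifs, so the following Python sketch is only one plausible approach: it drops a hand-picked list of uninformative words and reports word n-grams that occur in at least two annotations. The `UNINFORMATIVE` list, the function names, and the thresholds are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical screen list; the real tool's filter is not documented here.
UNINFORMATIVE = {
    "putative", "hypothetical", "probable", "protein",
    "similar", "to", "of", "the", "and",
}


def ngrams(words, n):
    """Set of all runs of n consecutive words, joined by single spaces."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(annotations, max_len=3, min_support=2):
    """Phrases of up to max_len words found in at least min_support
    annotations, longest first and then alphabetical."""
    counts = Counter()
    for text in annotations:
        # Lower-case, split on non-alphanumerics, then screen out
        # uninformative words (so words around them become adjacent).
        words = [
            w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in UNINFORMATIVE
        ]
        seen = set()
        for n in range(1, max_len + 1):
            seen |= ngrams(words, n)
        counts.update(seen)  # each phrase counted once per annotation
    hits = [p for p, c in counts.items() if c >= min_support]
    return sorted(hits, key=lambda p: (-len(p.split()), p))


example = [
    "putative ABC transporter permease",
    "ABC transporter ATP-binding protein",
    "hypothetical protein",
]
phrases = shared_phrases(example)
```

With the three made-up annotations above, the shared text elements are `abc transporter`, `abc`, and `transporter`; the poorly annotated third entry contributes nothing once uninformative words are screened out.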

 

References for methods and data in the database:

1. Eisenberg, D., Marcotte, E.M., Xenarios, I. & Yeates, T.O. Protein function in the post-genomic era. Nature 405, 823-6. (2000).
2. Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. & Yeates, T.O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 96, 4285-8. (1999).
3. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D. & Maltsev, N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96, 2896-901. (1999).
4. Dandekar, T., Snel, B., Huynen, M. & Bork, P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 23, 324-8. (1998).
5. Marcotte, E.M. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751-3. (1999).
6. Yanai, I., Derti, A. & DeLisi, C. Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proc Natl Acad Sci USA 98, 7940-5. (2001).
7. Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98, 4569-74. (2001).
8. Rain, J.C. et al. The protein-protein interaction map of Helicobacter pylori. Nature 409, 211-5. (2001).
9. Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O. & Eisenberg, D. A combined algorithm for genome-wide prediction of protein function. Nature 402, 83-6. (1999).
10. Huynen, M., Snel, B., Lathe, W., 3rd & Bork, P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 10, 1204-10. (2000).
11. Tatusov, R.L. et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 29, 22-8. (2001).
12. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28, 27-30. (2000).
13. Andrade, M.A. et al. Automated genome sequence analysis and annotation. Bioinformatics 15, 391-412. (1999).
14. The Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res 11, 1425-33. (2001).
