

Geneva Help


What is this website for?

GENEVA is a supplementary website for Zheng, Y. et al., published in PLoS Biology. It catalogues SVGs (segmentally variable genes) in several complete microbial genomes. Please read the paper for an in-depth discussion.

What is an SVG?

SVG stands for segmentally variable gene. SVGs are genes in which sequence comparison reveals one or more sizable variable regions interspersed among well-conserved regions. Most current sequence analyses focus on regions that are widely conserved across diverse lineages, so variable regions are sometimes ignored.

Why study SVGs?

Variable domains inside some gene families are known to confer multiple specificities on their members, e.g., the C-5 DNA methyltransferases. Using a bioinformatic approach, we find hundreds of genes with similar patterns of variability along their sequences. The immediate questions are: What kinds of genes tend to have such variable regions? What could the function of those variable regions be? Do these functions have anything in common?

The juxtaposition of well-conserved and variable regions suggests that their functions, if any, may be related. Where the conserved portion has already been assigned a biochemical function, an initial guess at the function of the variable region can be formulated: either binding to a different target molecule, or having a different sequence specificity. These hypotheses can later be tested by biochemists. Although the answer may differ from case to case, we have suggested that what the variable regions have in common is that they may mediate interactions with other molecules.

How can I make use of this website?

On this site you will find many examples of such variable regions in a number of completely sequenced microbial genomes. Click around and see how prevalent SVGs are in the sequence universe. You may find genes whose function interests you. Has the variable region been characterized in that gene family? If so, let us know (zhengyu@bu.edu). If not, and you have some wild guesses, let us know as well. We would be very happy to hear your comments.

Explanation of the result

One needs to be extremely careful when interpreting a variable region inside a gene family. Other factors that can contribute to regional variability include:

  • Inaccurate gene prediction, especially at the N-terminus of prokaryotic genes. Take extra care when the variable region is at the N-terminus.
  • The current method relies on the Taxonomy database at NCBI. Occasionally, when a sequence has not been assigned a taxonomy ID or the database has not been updated, sequences from within a single species are included in the comparison. This increases the bias in the dataset and may produce false positives (see example).
  • It is possible that only the query protein differs from all the other similar genes in a specific region. In those cases, the observed regional variability does not represent real diversity in this region. This can happen for several reasons: (1) that region in the query protein is not functional and is accumulating random mutations; (2) most of the hits are from phylogenetically close species, e.g., E. coli-related species; (3) there are not enough similar sequences in the database to make the comparison.

To address this concern, we also provide the results of hierarchical clustering on the variable regions. Intuitively, when only a few clusters are present (e.g., 2), the variability may be an artifact of the reasons above (see example).

New methodologies incorporating these considerations are in development.

PFAM result

Sequences are searched against PFAM release 7.6 (2002). Since PFAM evolves over time, we suggest running a quick PFAM search if a PFAM domain listed here cannot be found.

Query-anchored BLAST diagram

This diagram shows three kinds of colored blocks:

  • Blue: nongapped HSPs between the query protein sequence and the hit protein sequence.
  • Yellow: unaligned regions of similar length that are anchored by the same set of conserved HSPs. These are the candidate regions to be considered "variable".
  • Grey: unaligned regions of significantly different length (e.g., gap content larger than 0.3, as defined in the paper). These are likely due to segment insertion or deletion in the protein sequences and are not considered candidate variable regions.
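The yellow/grey decision can be illustrated with a minimal sketch. The exact definition of gap content is given in the paper and is not reproduced on this page, so the formula below (length difference divided by the longer length) is only an assumption, and the function names are hypothetical. Only the 0.3 threshold comes from the text above.

```python
def gap_content(len_query, len_hit):
    """Assumed stand-in for gap content: relative length difference.

    This is NOT necessarily the definition used in the paper.
    """
    longest = max(len_query, len_hit)
    if longest == 0:
        return 0.0  # two empty regions have no length difference
    return abs(len_query - len_hit) / longest


def color_unaligned(len_query, len_hit, max_gap_content=0.3):
    """Return 'yellow' for a candidate variable region, 'grey' otherwise."""
    if gap_content(len_query, len_hit) > max_gap_content:
        return "grey"    # likely a segment insertion or deletion
    return "yellow"      # similar length: candidate variable region


print(color_unaligned(50, 45))  # similar lengths
print(color_unaligned(50, 10))  # very different lengths
```

Under this assumed formula, the first call falls below the threshold and is colored yellow, while the second exceeds it and is colored grey.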

Hierarchical clustering on the variable region (not published)

Hierarchical clustering is performed on the collection of corresponding variable regions (regions bounded by the same set of well-conserved regions) among a family of similar genes. Briefly, at each clustering step, the two clusters with the smallest distance are joined. The distance between two variable regions is taken from their percent identity as calculated in the ClustalW report. Cluster-cluster distance is defined as the average of all pairwise distances between the elements of the two clusters. The clustering procedure stops when no distance between any two clusters is beyond 30%. Usually, variable regions from phylogenetically close species are grouped together. (The result is shown for only several genomes.)
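The procedure described above is average-linkage agglomerative clustering, which can be sketched as follows. This is an illustration, not the code used for the site. Two points are assumptions: distance is taken here as 100 minus percent identity, and the 30% cutoff is read as "stop merging once the closest pair of clusters shares 30% average identity or less". The function name and the input format (a dictionary of pairwise identities) are hypothetical.

```python
from itertools import combinations


def average_linkage(ids, identity, min_identity=30.0):
    """Cluster region names by average linkage.

    ids:          list of variable-region names
    identity:     dict mapping frozenset({a, b}) -> percent identity (0-100)
    min_identity: assumed stopping rule (see the text above)
    """
    clusters = [[name] for name in ids]

    def dist(c1, c2):
        # Average of all pairwise distances, with distance = 100 - identity.
        total = sum(100.0 - identity[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]),
        )
        if dist(clusters[i], clusters[j]) >= 100.0 - min_identity:
            break  # remaining clusters are too dissimilar to join
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy data: a/b and c/d are similar within each pair, dissimilar across pairs.
pairs = {("a", "b"): 90, ("c", "d"): 80, ("a", "c"): 10,
         ("a", "d"): 10, ("b", "c"): 10, ("b", "d"): 10}
identity = {frozenset(p): float(v) for p, v in pairs.items()}
print(average_linkage(["a", "b", "c", "d"], identity))
```

With the toy identities, the sketch ends with two clusters, one holding a and b and the other holding c and d. A result with very few clusters is the situation the page warns may indicate an artifact.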

Motif finding in variable regions (not published)

Clusters are reported after the hierarchical clustering procedure on the variable regions. Usually, phylogenetically close species are grouped into the same cluster, which reduces the bias in the data. MEME is then used to look for conserved short motifs that may exist among these distinct clusters: one sequence is chosen from each cluster as a representative, and this dataset is fed into MEME. Reported short motifs may suggest a common function of the variable regions, if any. Be cautious about short motifs at an end close to the conserved regions, because they may simply be extensions of the conserved regions caused by imprecise detection of the variable-region boundaries. (The result is shown for only several genomes.)
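Preparing the MEME input can be sketched as below. The page does not say how the representative of each cluster is chosen, so picking the longest sequence is purely an assumption for illustration, and the function name is hypothetical. The sketch only builds FASTA text; running MEME on it is left to the reader.

```python
def representatives_fasta(clusters, sequences):
    """Build FASTA text with one representative sequence per cluster.

    clusters:  list of lists of region names
    sequences: dict mapping region name -> variable-region sequence
    The longest member is used as the representative (an assumption;
    ties go to the member listed first).
    """
    lines = []
    for members in clusters:
        rep = max(members, key=lambda name: len(sequences[name]))
        lines.append(f">{rep}")
        lines.append(sequences[rep])
    return "\n".join(lines) + "\n"


clusters = [["a", "b"], ["c"]]
sequences = {"a": "MKT", "b": "MKTLL", "c": "GG"}
print(representatives_fasta(clusters, sequences), end="")
```

For the toy input, the output contains two records, b and c, one per cluster. Using a single sequence per cluster keeps closely related species from dominating the motif search.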
