GEMS - Help

Update

GEMS version 1.5 is now available. The new version improves the masking function and can generate two vector files that record the selection of samples and genes individually. A web-based server for GEMS is available at http://genomics10.bu.edu/terrence/gems/

The NAR paper describing the GEMS server is available HERE

If you use GEMS in your research please acknowledge the following paper:

Wu CJ, Kasif S.
GEMS: a web server for biclustering analysis of expression data.
Nucleic Acids Res. 2005 Jul 1;33.

Additional information

This project was presented at the Fourth Annual International Workshop on Bioinformatics and Systems Biology in Kyoto, Japan, on June 3, 2004. Details of the algorithm can be found in Genome Informatics 15(1): 239-248, 2004.

DOWNLOAD PDF DOCUMENT

Introduction

Recent advances in high-throughput profiling of gene expression have catalyzed an explosive growth in functional genomics aimed at the elucidation of genes that are differentially expressed in different tissues or cell types across a range of experimental conditions. Traditional clustering methods such as hierarchical clustering or principal component analysis are difficult to deploy effectively for several of these tasks, since genes rarely exhibit similar expression patterns across a wide range of conditions. Biclustering (also referred to as co-clustering, two-way clustering, projective clustering, or block clustering) of gene expression data is a promising methodology for identifying gene groups that show a coherent expression profile across a subset of conditions. Although biclustering was introduced in statistics in 1974, few robust and efficient solutions exist.

Here we propose a simple but promising new approach for biclustering based on a Gibbs sampling paradigm. Our algorithm is implemented in the program GEMS (Gene Expression Module Sampler). GEMS has been tested on published leukemia data sets, as well as on synthetic data generated to evaluate the effect of noise on the performance of the algorithm. In our preliminary studies we showed that GEMS is a reliable, flexible, and computationally efficient approach for biclustering gene expression data. These biclusters are potential targets for identifying genes that are functionally related or co-regulated by common transcription factors. The samples produced by the algorithm can potentially suggest sub-classes of diseases and can serve as a diagnostic tool.

Gems_fig1c.jpg

Workflow

Gems_workflow.jpg

Download and install

Click the link to download the source code: gems15.cpp

After downloading, run the following commands:

1. Linux/Unix system: GNU GCC compiler version 3 or above preferred.

   $ g++ gems15.cpp -o gems15

2. Windows system: Borland C Compiler 5.5 preferred.

  > bcc32 gems15.cpp -o gems15

Now the GEMS program should be ready to run.

Usage

Input

Three parameters are required: name of file containing the expression data, size constraint alpha, and width constraint w.

  1. Expression data (required): The format is similar to the usual format used in gene expression data sets. All lines are tab-delimited. The first row of the file specifies the names of the first two columns followed by all the sample names. Every other line corresponds to a gene and contains the gene id, the gene name, and all the expression values for that gene. Missing values can be indicated by "NA". GEMS will skip the missing values when mining the optimal bicluster. A sample array file can be downloaded here. The current version of the server accepts an array file with up to 50,000 genes and 512 samples.

    Below is a simple example of the format.

    ID Comment ALL01 ALL02 ALL03 AML01 AML02 AML03
    200001_at gb:X00101 1.1 1.2 1.0 NA 0.3 2.2
    200002_at gb:X00102 20.4 20.4 20.5 3.4 3.6 3.5
    200003_at gb:X00103 -1.3 0 -2 0.6 0.5 0.6

      

  2. Gene weighting scores (optional): Every gene can be assigned a weighting score. The score of an expression bicluster is the sum of the weighting scores of the genes included in it. GEMS will search for the biclusters with the highest scores. If no weighting file is uploaded, the weighting of every gene will be 1. The first line of the file is a header. Every other line corresponds to a gene. Each weighting score should be a non-negative real number. A sample weighting file can be downloaded here.
  3. Sample class labels (optional): In a semi-supervised analysis, the labels of some of the conditions are known and the others are unknown. Associating unknown samples with known ones can help with classification. The current version of GEMS allows users to specify a set of samples in the array file as seeds. These seed samples are kept in the bicluster throughout the Gibbs sampling iterations. The first line of the file is a header. Every other line corresponds to a sample. The class is 1 if the sample is a seed (forced to stay in biclusters), and 0 if the sample is a candidate. Any sample with a class label other than 0 or 1 will be ignored during bicluster mining. A sample label file can be downloaded here.
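
The array format described in item 1 can be checked before it is passed to GEMS. The following is a minimal Python sketch, not part of GEMS itself; the function names read_expression_array and to_value are hypothetical, and the sketch assumes that every non-numeric cell (such as "NA") should be treated as missing. It parses the example rows shown above, written here with explicit tab characters.

```python
import csv
import io
import math


def to_value(cell):
    """Convert one cell to a float; non-numeric cells (e.g. "NA") become NaN."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


def read_expression_array(handle):
    """Parse a tab-delimited expression array into (samples, genes, matrix)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # the first two columns hold the gene id and gene name
    genes, matrix = [], []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # ignore blank lines
        if len(row) != len(header):
            raise ValueError(
                f"{row[0]}: expected {len(header)} fields, got {len(row)}"
            )
        genes.append((row[0], row[1]))
        matrix.append([to_value(cell) for cell in row[2:]])
    return samples, genes, matrix


# The example array from this page, with tabs made explicit.
EXAMPLE = (
    "ID\tComment\tALL01\tALL02\tALL03\tAML01\tAML02\tAML03\n"
    "200001_at\tgb:X00101\t1.1\t1.2\t1.0\tNA\t0.3\t2.2\n"
    "200002_at\tgb:X00102\t20.4\t20.4\t20.5\t3.4\t3.6\t3.5\n"
    "200003_at\tgb:X00103\t-1.3\t0\t-2\t0.6\t0.5\t0.6\n"
)

samples, genes, matrix = read_expression_array(io.StringIO(EXAMPLE))
print(len(genes), "genes x", len(samples), "samples")
```

To check a file on disk, the same function could be called with an open file handle instead of the in-memory string, for example open("samplearray.txt", newline="").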

  

Output

A working report is displayed on STDOUT. For every bicluster extracted, users can choose to generate three files.

  1. Bicluster expression matrix: The expression values of the selected genes in the selected samples. Its format is the same as that of the original expression array data file.
  2. Sample selection vector: A tab-delimited file indicating whether each sample is selected in the bicluster. The first line is a header; every other line corresponds to a sample and contains the name and a label. A sample is labeled 1 if it is selected and 0 if it is not.
  3. Gene selection vector: A tab-delimited file indicating whether each gene is selected in the bicluster. The first line is a header; every other line corresponds to a gene and contains the name and a label. A gene is labeled 1 if it is selected and 0 if it is not.
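
The two selection vectors can be read back to recover the members of a bicluster. Below is a minimal Python sketch under stated assumptions: the function name read_selection_vector is hypothetical, the header text in the example strings is invented for illustration (the actual header written by GEMS is not documented here), and each data line is assumed to hold exactly a name and a 0/1 label separated by a tab, as described above.

```python
import csv
import io


def read_selection_vector(handle):
    """Return the names labeled 1 in a tab-delimited selection vector."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    selected = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete lines
        name, label = row[0], row[1].strip()
        if label == "1":
            selected.append(name)
    return selected


# Made-up vector contents for illustration; real files are produced by -q.
SAMPLE_VECTOR = (
    "Sample\tSelected\n"
    "ALL01\t1\nALL02\t1\nALL03\t1\nAML01\t0\nAML02\t0\nAML03\t0\n"
)
GENE_VECTOR = (
    "Gene\tSelected\n"
    "200001_at\t1\n200002_at\t1\n200003_at\t0\n"
)

selected_samples = read_selection_vector(io.StringIO(SAMPLE_VECTOR))
selected_genes = read_selection_vector(io.StringIO(GENE_VECTOR))
print(len(selected_genes), "genes x", len(selected_samples), "samples selected")
```

The resulting name lists can then be used to pick the matching rows and columns out of the original array file.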

Command Line Options

GEMS accepts the following command-line options:

$ gems15 arrayfile.name -a=???  -w=???  [optional parameters]

Option  Type              Description                                                       Default
-g=?    positive float    Lower limit for the sum of gene weighting scores in a bicluster   1
-c=?    positive integer  Number of biclusters wanted                                       1
-k=?    1, 2, 3, or 4     Method used to mask earlier extracted biclusters:                 1 (no masking)
                            1: no masking
                            2: selected genes in selected samples
                            3: selected samples
                            4: selected genes
-u=?    filename          File containing gene weighting scores                             No weighting (every gene = 1)
-f=?    filename          File containing sample class labels                               No labels (every sample can be a candidate)
-o      (none)            Generate bicluster expression files                               No output
-q      (none)            Generate sample and gene vector files                             No output
-v      (none)            More display (verbose)                                            Off
-r      (none)            Allow redundant biclusters to be reported repeatedly              Only unique biclusters reported
-p=?    1 to 9            Running speed of the program; 9 is fastest, 1 is slowest          5
-e=?    positive integer  Seed number for the random number generator                       10000
-t=?    positive integer  Permutation test; the given value is the number of tests          No permutation
-m=?    float             Value indicating missing data                                     All non-numeric values
-h      (none)            Help message
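
When GEMS is run from a script, it can help to assemble the argument list in one place. The following Python sketch is only an illustration: the function build_gems_command and its keyword names are hypothetical, and it does nothing more than format the flags listed above. It does not run GEMS; the resulting list could be passed to subprocess.run if the gems15 binary is on the PATH.

```python
import shlex


def build_gems_command(arrayfile, alpha, width, *, program="gems15",
                       min_weight=None, n_biclusters=None, mask=None,
                       weights=None, labels=None, speed=None, seed=None,
                       permutations=None, missing=(), write_matrices=False,
                       write_vectors=False, verbose=False, redundant=False):
    """Build the gems15 argument list from the documented options."""
    if mask is not None and mask not in (1, 2, 3, 4):
        raise ValueError("-k accepts 1, 2, 3, or 4")
    if speed is not None and speed not in range(1, 10):
        raise ValueError("-p accepts 1 to 9")
    # The array file, -a, and -w are the three required parameters.
    args = [program, arrayfile, f"-a={alpha}", f"-w={width}"]
    valued = [("-g", min_weight), ("-c", n_biclusters), ("-k", mask),
              ("-u", weights), ("-f", labels), ("-p", speed),
              ("-e", seed), ("-t", permutations)]
    args += [f"{flag}={value}" for flag, value in valued if value is not None]
    # -m may be given several times, once per missing-data value.
    args += [f"-m={marker}" for marker in missing]
    switches = [("-o", write_matrices), ("-q", write_vectors),
                ("-v", verbose), ("-r", redundant)]
    args += [flag for flag, enabled in switches if enabled]
    return args


command = build_gems_command("samplearray.txt", 0.25, 0.5,
                             min_weight=10, n_biclusters=3, mask=2)
print(shlex.join(command))
# prints: gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3 -k=2
```

Returning a list rather than a single string avoids shell quoting problems when file names contain spaces.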


Examples

  • $ gems15 samplearray.txt -a=0.25 -w=0.5

    In the array specified by samplearray.txt, find one bicluster that contains at least 25% of the samples and in which the expression range of each selected gene across the selected samples is <= 0.5.

  • $ gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3

    Similar to the previous example. There should be at least 10 genes in each extracted bicluster. At most three biclusters will be extracted. Every extracted bicluster is unique (no two extracted biclusters are the same).

  • $ gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3 -r

    Similar to the previous example. There should be at least 10 genes in each extracted bicluster. Three biclusters will be extracted. A bicluster can be the same as a previously reported one.

  • $ gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3 -k=2

    Similar to the second example. There should be at least 10 genes in each extracted bicluster. At most three biclusters will be extracted. With masking method 2, the selected genes in the selected samples of each extracted bicluster are masked, so that later biclusters won't overlap with earlier ones.

  • $ gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3 -o

    Similar to the second example. The expression profiles of the three extracted biclusters will be written to the files samplearray.txt.001, samplearray.txt.002, and samplearray.txt.003.

  • $ gems15 samplearray.txt -a=0.25 -w=0.5 -q

    Similar to the first example. The bicluster will be represented by two vector files: samplearray.txt.001.samples and samplearray.txt.001.genes.

  • $ gems15 samplearray.txt -a=0.25 -w=0.5 -u=sampleweight.txt -f=samplelabel.txt -g=8.8

    Similar to the first example. The gene weighting scores in sampleweight.txt and the sample labels in samplelabel.txt will be used. The sum of the weighting scores of the genes in any reported bicluster will be greater than or equal to 8.8.

  • $ gems15 samplearray.txt -a=0.25 -w=0.5 -t=100 -o

    Run a permutation test with 100 iterations. In each iteration, the values in each row are completely shuffled, and GEMS then tries to find the largest bicluster in the permuted matrix. The number of genes in each of the 100 biclusters from the permuted arrays is recorded and written to a file: samplearray.txt.permutation.

  • $ gems15 samplearray.txt -a=0.25 -w=0.5 -m=999 -m=90 -m=-100

    Similar to the first example. Any cell in the array matrix with a value of 999, 90, or -100 will be regarded as missing data.
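
The constraints used in these examples can be illustrated on the small array from the Input section. The Python sketch below is not the GEMS search itself (which uses Gibbs sampling); it only checks a given sample selection against the size constraint (-a) and the width constraint (-w) as they are described in the first example, and shows the per-row shuffling described for the permutation test. All function names are hypothetical, and treating a gene with no observed values in the selected samples as not fitting is an assumption.

```python
import math
import random


def gene_fits(values, selected_cols, width):
    """True if the gene's range over the selected samples is <= width."""
    observed = [values[j] for j in selected_cols if not math.isnan(values[j])]
    if not observed:
        return False  # assumption: a gene with only missing values never fits
    return max(observed) - min(observed) <= width


def genes_within_width(matrix, selected_cols, width):
    """Indices of all genes that satisfy the width constraint."""
    return [i for i, row in enumerate(matrix)
            if gene_fits(row, selected_cols, width)]


def meets_size_constraint(n_selected, n_samples, alpha):
    """True if the selection covers at least a fraction alpha of the samples."""
    return n_selected >= alpha * n_samples


def shuffle_rows(matrix, rng):
    """Return a copy of the matrix with the values in each row shuffled."""
    shuffled = []
    for row in matrix:
        row = list(row)
        rng.shuffle(row)
        shuffled.append(row)
    return shuffled


# Example array from the Input section; NaN marks the "NA" cell.
matrix = [
    [1.1, 1.2, 1.0, math.nan, 0.3, 2.2],
    [20.4, 20.4, 20.5, 3.4, 3.6, 3.5],
    [-1.3, 0.0, -2.0, 0.6, 0.5, 0.6],
]

all_cols = [0, 1, 2]  # ALL01, ALL02, ALL03
aml_cols = [3, 4, 5]  # AML01, AML02, AML03
all_genes = genes_within_width(matrix, all_cols, width=0.5)
aml_genes = genes_within_width(matrix, aml_cols, width=0.5)
size_ok = meets_size_constraint(len(all_cols), n_samples=6, alpha=0.25)
permuted = shuffle_rows(matrix, random.Random(10000))
print(all_genes, aml_genes, size_ok)
```

With the three ALL samples selected, the first two genes stay within a range of 0.5; with the three AML samples selected, the last two genes do, and the missing value in the first gene is skipped rather than counted.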

License

GEMS is open source software; you can redistribute it and/or modify it under the terms of  the GNU General Public License as published by the Free Software Foundation;  either version 2 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;  without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Acknowledgements

This work is supported in part by NSF grants DBI-0239435 and ITR-048715 and NHGRI grant #1R33HG002850-01A1.

Contact

If you have any comments or questions, please contact Chang-Jiun Wu
