GEMS - Help
Update
GEMS version 1.5 is now available. The new version improves the masking
function and can generate two vector files, one for the selected samples
and one for the selected genes. A web-based server for GEMS is available at http://genomics10.bu.edu/terrence/gems/
The NAR paper describing the GEMS server is available HERE
If you use GEMS in your research please acknowledge the following paper:
Wu CJ, Kasif S.
GEMS: a web server for biclustering analysis of expression data.
Nucleic Acids Res. 2005 Jul 1;33.
Additional information
This project was presented at the Fourth
Annual International Workshop on Bioinformatics and Systems Biology in Kyoto,
Japan, on June 3, 2004. Details of the algorithm can be found in Genome
Informatics 15(1): 239-248, 2004.
DOWNLOAD PDF DOCUMENT
Introduction
Recent advances in high-throughput profiling of gene expression have catalyzed explosive growth in functional genomics aimed at elucidating genes that are differentially expressed in different tissues or cell types across a range of experimental conditions. Traditional clustering methods such as hierarchical clustering or principal component analysis are difficult to deploy effectively for several of these tasks, since genes rarely exhibit similar expression patterns across a wide range of conditions. Biclustering (also referred to as co-clustering, two-way clustering, projective clustering, or block clustering) of gene expression data is a promising methodology for identifying gene groups that show a coherent expression profile across a subset of conditions. Although biclustering was introduced in statistics in 1974, few robust and efficient solutions exist. Here we propose a simple but promising new approach to biclustering based on a Gibbs sampling paradigm. Our algorithm is implemented in the program GEMS (Gene Expression Module Sampler). GEMS has been tested on published leukemia data sets, as well as on synthetic data generated to evaluate the effect of noise on the performance of the algorithm. In our preliminary studies we showed that GEMS is a reliable, flexible, and computationally efficient approach for biclustering gene expression data. The resulting biclusters are candidate sets of genes that are functionally related or co-regulated by common transcription factors. The samples selected by the algorithm can potentially suggest sub-classes of diseases and can serve as a diagnostic tool.
Workflow
Download and install
Click the link to download the source code: gems15.cpp
After downloading, run the following commands:
1. Linux/Unix system: GNU GCC compiler, version 3 or above preferred.
$ g++ gems15.cpp -o gems15
2. Windows system: Borland C++ Compiler 5.5 preferred.
> bcc32 gems15.cpp -o gems15
Now the GEMS program should be ready to run.
Usage
Input
Three parameters are required: name of file containing the expression data, size constraint alpha, and width constraint w.
- Expression data (required): The format
is similar to the usual format for gene expression data sets. All lines
are tab-delimited. The first row of the file gives the names of the first
two columns, followed by all the sample names. Every other line corresponds to
a gene and contains the gene ID, the gene name, and all the expression
values for that gene. Missing values can be indicated by "NA"; GEMS
skips missing values when mining the optimal bicluster. A sample array file can be downloaded here.
The current version of the server accepts an array file with up to 50,000 genes and 512
samples.
Below is a simple example of the format.
| ID | Comment | ALL01 | ALL02 | ALL03 | AML01 | AML02 | AML03 |
| 200001_at | gb:X00101 | 1.1 | 1.2 | 1.0 | NA | 0.3 | 2.2 |
| 200002_at | gb:X00102 | 20.4 | 20.4 | 20.5 | 3.4 | 3.6 | 3.5 |
| 200003_at | gb:X00103 | -1.3 | 0 | -2 | 0.6 | 0.5 | 0.6 |
- Gene weighting scores (optional): Every gene can be assigned a weighting score. The score of an
expression bicluster is the sum of the weighting scores of the genes included
in it. GEMS searches for the biclusters with the highest scores. If no
weighting file is uploaded, the weighting of every gene is 1. The
first line of the file is a header. Every other line corresponds to a gene. Each
weighting score should be a non-negative real number. A sample weighting file can be downloaded
here.
- Sample class labels (optional): In a semi-supervised analysis, the labels of some conditions are known and the others are unknown. Associating unknown samples with known ones can help with classification. The current version of GEMS allows users to specify a set of samples in the array file as seeds. These seed samples are kept in the bicluster throughout the Gibbs sampling iterations. The first line of the file is a header. Every other line corresponds to a sample. The class is 1 if the sample is a seed (forced to stay in biclusters) and 0 if the sample is a candidate. Any sample with a class label other than 0 or 1 is ignored during bicluster mining. A sample class label file can be downloaded here.
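The input formats described above can be illustrated with a short Python sketch. It is not part of GEMS. The expression parser follows the format stated in this section. The two-column layouts used for the weighting and label files (name, then value, separated by a tab) and the header names are assumptions, since this page only states that each file has a header line followed by one line per entry. All function names are hypothetical.

```python
# Example array from this page, written as tab-delimited text.
EXAMPLE = "\n".join([
    "ID\tComment\tALL01\tALL02\tALL03\tAML01\tAML02\tAML03",
    "200001_at\tgb:X00101\t1.1\t1.2\t1.0\tNA\t0.3\t2.2",
    "200002_at\tgb:X00102\t20.4\t20.4\t20.5\t3.4\t3.6\t3.5",
    "200003_at\tgb:X00103\t-1.3\t0\t-2\t0.6\t0.5\t0.6",
])


def parse_expression(text, missing=("NA",)):
    """Return (sample_names, genes) from tab-delimited expression data.

    genes maps gene ID -> (gene name, values), where a missing value is None.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    samples = lines[0].split("\t")[2:]   # first two columns: gene ID, gene name
    genes = {}
    for ln in lines[1:]:
        fields = ln.split("\t")
        values = [None if v in missing else float(v) for v in fields[2:]]
        genes[fields[0]] = (fields[1], values)
    return samples, genes


def weight_lines(genes, default=1.0, overrides=None):
    """Build weighting-file lines. ASSUMED layout: header, then 'ID<TAB>score'."""
    overrides = overrides or {}
    out = ["ID\tWeight"]
    for gene_id in genes:
        score = overrides.get(gene_id, default)
        if score < 0:
            raise ValueError("weighting scores must be non-negative")
        out.append(f"{gene_id}\t{score}")
    return out


def label_lines(samples, seeds):
    """Build label-file lines. ASSUMED layout: header, then 'sample<TAB>class'.

    Class 1 marks a seed sample and class 0 marks a candidate sample.
    """
    out = ["Sample\tClass"]
    for name in samples:
        out.append(f"{name}\t{1 if name in seeds else 0}")
    return out


samples, genes = parse_expression(EXAMPLE)
print(samples)
print(genes["200001_at"][1])
print("\n".join(label_lines(samples, seeds={"ALL01", "ALL02"})))
```

The lists returned by `weight_lines` and `label_lines` can be joined with newlines and written to files that are then passed to the `-u` and `-f` options, provided the assumed layouts match what GEMS expects. The downloadable sample files are the reference for the real layout.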
Output
A working report is displayed on STDOUT. For every bicluster extracted, users can choose to generate three files.
- Bicluster expression matrix: The expression
values of the selected genes in the selected samples. Its format is the same as
that of the original expression array data file.
- Sample selection vector: A tab-delimited
file indicating whether each sample is selected in the bicluster. The first
line is a header; every other line corresponds to a sample and contains its name and
a label. A sample is labeled 1 if it is selected and 0 if it is not
selected.
- Gene selection vector: A tab-delimited file indicating whether each gene is selected in the bicluster. The first line is a header; every other line corresponds to a gene and contains its name and a label. A gene is labeled 1 if it is selected and 0 if it is not selected.
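As an illustration of how the two selection vectors might be consumed, the following Python sketch reads a vector and returns the names labeled 1. It is not part of GEMS. The function name and the sample content are hypothetical, and the header text in particular is invented for the example. Only the general layout (tab-delimited, one header line, then a name and a 0/1 label per line) comes from the description above.

```python
def read_selection_vector(text):
    """Return the names labeled 1 in a tab-delimited selection vector."""
    selected = []
    for line in text.splitlines()[1:]:       # the first line is a header
        if not line.strip():
            continue                         # tolerate blank lines
        name, label = line.split("\t")[:2]
        if label.strip() == "1":
            selected.append(name)
    return selected


# Hypothetical content of a sample selection vector file.
sample_vector = "Sample\tSelected\nALL01\t1\nALL02\t1\nALL03\t1\nAML01\t0\n"
print(read_selection_vector(sample_vector))
```

In practice the text would be read from the vector files that GEMS writes, for example with `open(path).read()`.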
Command Line Options
GEMS accepts the following command-line options:
$ gems15 arrayfile.name -a=??? -w=??? [optional parameters]
| Option | Type | Description | Default |
| -g=? | positive float | Lower limit for the sum of gene weighting scores in a bicluster. | 1 |
| -c=? | positive integer | Number of biclusters wanted. | 1 |
| -k=? | 1, 2, 3, or 4 | Method used to mask previously extracted biclusters. 1: no masking; 2: selected genes in selected samples; 3: selected samples; 4: selected genes. | 1 (no masking) |
| -u=? | filename | File containing gene weighting scores. | No weighting (every gene = 1) |
| -f=? | filename | File containing sample class labels. | No labels (every sample can be a candidate) |
| -o | | Generate bicluster expression files. | No output |
| -q | | Generate sample and gene vector files. | No output |
| -v | | More display (verbose). | Off |
| -r | | Allow redundant biclusters to be reported repeatedly. | Only unique biclusters reported |
| -p=? | 1 to 9 | Running speed of the program; 9 is fastest and 1 is slowest. | 5 |
| -e=? | positive integer | Seed for the random number generator. | 10000 |
| -t=? | positive integer | Permutation test; the value is the number of tests. | No permutation |
| -m=? | float | Value indicating missing data. | All non-numeric values |
| -h | | Help message. | |
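For scripted runs, the options in the table can be assembled programmatically. The Python sketch below is one possible wrapper, not part of GEMS. The function and keyword names are hypothetical. Only the flags themselves are taken from the table, and the executable name `gems15` is taken from the compile step.

```python
def gems_command(array_file, alpha, width, *, min_weight=None, count=None,
                 mask=None, weights=None, labels=None,
                 write_matrix=False, write_vectors=False, missing=()):
    """Build an argument list for gems15 from the options documented above."""
    args = ["gems15", array_file, f"-a={alpha}", f"-w={width}"]
    if min_weight is not None:
        args.append(f"-g={min_weight}")
    if count is not None:
        args.append(f"-c={count}")
    if mask is not None:
        if mask not in (1, 2, 3, 4):
            raise ValueError("-k accepts 1, 2, 3, or 4")
        args.append(f"-k={mask}")
    if weights:
        args.append(f"-u={weights}")
    if labels:
        args.append(f"-f={labels}")
    if write_matrix:
        args.append("-o")
    if write_vectors:
        args.append("-q")
    for value in missing:                    # -m may be given more than once
        args.append(f"-m={value}")
    return args


cmd = gems_command("samplearray.txt", 0.25, 0.5, min_weight=10, count=3, mask=2)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` if the compiled `gems15` binary is on the search path.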
Examples
- $ gems15 samplearray.txt -a=0.25 -w=0.5
In the array specified by samplearray.txt, find one bicluster that contains at least 25% of the samples and in which the expression range of each gene across the selected samples is <= 0.5.
- $ gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3
Similar to the previous example, but each extracted bicluster must contain at least 10 genes. At most three biclusters will be extracted. Every extracted bicluster is unique (no two extracted biclusters are the same).
- $ gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3 -r
Similar to the previous example. Each extracted bicluster must contain at least 10 genes. Three biclusters will be extracted. A bicluster can be the same as a previously reported one.
- $ gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3 -k=2
Similar to the second example. Each extracted bicluster must contain at least 10 genes. At most three biclusters will be extracted. The selected genes in the selected samples of each extracted bicluster will be masked so that later biclusters do not overlap with earlier ones.
- $ gems15 samplearray.txt -a=0.25 -w=0.5 -g=10 -c=3 -o
Similar to the second example. The expression profiles of the three extracted biclusters will be written to the files samplearray.txt.001, samplearray.txt.002, and samplearray.txt.003.
- $ gems15 samplearray.txt -a=0.25 -w=0.5 -q
Similar to the first example. The bicluster will be represented by two vector files: samplearray.txt.001.samples and samplearray.txt.001.genes.
- $ gems15 samplearray.txt -a=0.25 -w=0.5 -u=sampleweight.txt -f=samplelabel.txt -g=8.8
Similar to the first example. The gene weighting scores in sampleweight.txt and the sample labels in samplelabel.txt will be used. The sum of the weighting scores of the genes in any reported bicluster will be greater than or equal to 8.8.
- $ gems15 samplearray.txt -a=0.25 -w=0.5 -t=100 -o
Run a permutation test with 100 iterations. In each iteration, the values in each row are completely shuffled, and GEMS then tries to find the largest bicluster in the permuted matrix. The numbers of genes in the 100 biclusters found in the permuted arrays are recorded and written to the file samplearray.txt.permutation.
- $ gems15 samplearray.txt -a=0.25 -w=0.5 -m=999 -m=90 -m=-100
Similar to the first example. Any cell in the array matrix with a value of 999, 90, or -100 will be regarded as missing data.
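To make the meaning of -a and -w in the first example concrete, the following Python sketch checks whether a candidate bicluster meets both constraints. It reflects one reading of this page, not the GEMS source code. The size constraint is taken to mean that the number of selected samples is at least alpha times the total number of samples. The width constraint is taken to mean that, for every selected gene, the maximum minus the minimum over the selected samples is at most w, with missing values skipped. The function name is hypothetical. The values are the ALL01 to ALL03 columns of the example array in the Input section.

```python
def satisfies_constraints(rows, selected_cols, total_cols, alpha, width):
    """Check a candidate bicluster against the assumed -a and -w semantics.

    rows holds one list of values per selected gene (None marks missing data);
    selected_cols holds the column indices of the selected samples.
    """
    # Size constraint (assumed): enough samples are selected.
    if len(selected_cols) < alpha * total_cols:
        return False
    # Width constraint (assumed): each gene's range over the selected samples.
    for values in rows:
        present = [values[i] for i in selected_cols if values[i] is not None]
        if present and max(present) - min(present) > width:
            return False
    return True


gene_1 = [1.1, 1.2, 1.0, None, 0.3, 2.2]      # 200001_at
gene_2 = [20.4, 20.4, 20.5, 3.4, 3.6, 3.5]    # 200002_at
gene_3 = [-1.3, 0.0, -2.0, 0.6, 0.5, 0.6]     # 200003_at
all_cols = [0, 1, 2]                          # ALL01, ALL02, ALL03

print(satisfies_constraints([gene_1, gene_2], all_cols, 6, 0.25, 0.5))
print(satisfies_constraints([gene_1, gene_2, gene_3], all_cols, 6, 0.25, 0.5))
```

Under these assumptions the first two genes form a valid bicluster over the three ALL samples, because their ranges are about 0.2 and 0.1. Adding the third gene breaks the width constraint, because its range over those samples is 2.0.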
License
GEMS is open source software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Acknowledgements
This work is supported in part by NSF grants DBI-0239435 and ITR-048715 and NHGRI grant #1R33HG002850-01A1.
Contact
If you have any comments or questions, please contact Chang-Jiun Wu

