Rankgene: a program to rank genes from expression data
This page is a mirror. Original document is located here.
UPDATE
Rankgene version 1.1 is now available. The new version accepts expression data sets in two formats: a standard Affymetrix-like gene expression data format and the original OC1 format. Check the README file in the package for details.
Introduction
Rankgene is a program for analyzing gene expression data, selecting features, and ranking genes by how well each gene predicts the classification of samples into functional or disease categories. A paper describing Rankgene will be published in Bioinformatics. One useful feature of this program is that the user can choose among eight different ranking criteria. Rankgene uses the six measures of predictability adopted from the popular OC1 decision tree software developed at Johns Hopkins University. In addition, we provide the traditional t-test and a novel, efficient implementation of a one-dimensional support vector machine (SVM) as two new options. The input to Rankgene is a gene expression data set consisting of a set of samples, the expression levels of all the genes across these samples, and the class label for each sample. A typical example would be gene expression measurements from normal or cancerous tissues. Rankgene analyzes the expression values of each gene and ranks the genes by their ability to distinguish between the classes.
Download and Install
Click the link to download: rankgene-1.1.tar.gz
After downloading, run the following commands:
$ gunzip rankgene-1.1.tar.gz
$ tar -xvf rankgene-1.1.tar
$ cd rankgene-1.1
$ make
Operating System | Compiler |
---|---|
Linux (Redhat 7.2/7.3) | gcc 2.96 |
Linux (Redhat 8.0) | gcc 3.2 |
Usage
Input
Rankgene accepts input files containing gene expression data in two formats:
- Standard: This format is similar to the usual format used in gene
expression data sets. All lines are tab-delimited. The first row of
the file specifies the names of the first two columns followed by all
the sample names. Every other line corresponds to a gene. The line
contains the gene id, the gene name, and all the expression values for
that gene. You can indicate missing values either by "?" or "NA". RankGene replaces missing values for a gene with the average value of the other expression values for that gene.
In this format, a separate file specifies the class labels. Each line of this file contains a sample name and a class name, separated by a tab. RankGene will ignore a sample whose class name is "NA".
The files in the data directory in the RankGene package are examples of this format.
Note: We suggest that you use a format where the first column in each line contains the gene id (gene accession number) and the second column contains the gene name. RankGene recognises some standard names for the gene name column (Gene Description and Name) and the gene id column (Gene Accession Number, GID, and Image Id). If you use different names for these columns, please let us know.
Note: RankGene does not accept data sets that contain information apart from the gene expression values. For instance, some Affymetrix data sets contain CALL values or p-values. Please remove these columns from the data set before invoking RankGene.
- OC1: Each line corresponds to a sample or experimental condition. The line contains the expression values of all the genes for that sample followed by the class label of that sample. The elements of the line are separated by commas. RankGene expects an expression value to be a floating point number. Unlike the previous format, the class label must be an integer.
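As an illustration of the standard format and the missing-value rule described above, here is a minimal Python sketch that parses a tab-delimited data file and a class file. It is not RankGene's own code: the function names (`read_standard`, `read_classes`), the toy data, and the fallback of 0.0 for a gene with no known values are assumptions made for this example.

```python
import csv
import io

MISSING = {"?", "NA"}  # the two missing-value markers described above


def read_standard(text):
    """Parse tab-delimited rows: gene id, gene name, then one value per sample."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    samples = header[2:]  # the first two columns are the id and name columns
    genes = []
    for row in body:
        gene_id, gene_name, raw = row[0], row[1], row[2:]
        known = [float(v) for v in raw if v.strip() not in MISSING]
        # Assumption: fall back to 0.0 when every value of a gene is missing.
        mean = sum(known) / len(known) if known else 0.0
        values = [mean if v.strip() in MISSING else float(v) for v in raw]
        genes.append((gene_id, gene_name, values))
    return samples, genes


def read_classes(text):
    """Parse 'sample<TAB>class' lines, skipping samples whose class is NA."""
    labels = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        sample, label = line.split("\t")
        if label.strip() != "NA":
            labels[sample.strip()] = label.strip()
    return labels


# Hypothetical toy data, not taken from the RankGene package.
example = "GID\tName\ts1\ts2\ts3\nG1\talpha\t1.0\t?\t3.0\nG2\tbeta\tNA\t4.0\t6.0\n"
samples, genes = read_standard(example)
print(samples)
print(genes)
```

In the toy data, the missing value of G1 is replaced by the mean of 1.0 and 3.0, and the missing value of G2 by the mean of 4.0 and 6.0.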
Output
RankGene outputs a file containing the genes most predictive of the sample classes based on the measure specified on the command line. The number of genes is specified on the command line. The first few lines of the output file contain some general information about RankGene, the data contained in the input files, and the measure used. Each succeeding line lists the index of a gene in the data file (indices start at 1), the name of the gene, the id of the gene, and the value of the measure for that gene. The genes are sorted in increasing order of the value of the measure.
Command Line Options
RankGene accepts the following command-line options:
- -i: name of the file containing the input data.
- -c: name of the file containing the class labels.
- -o: name of the file to output the results to.
- -n: number of genes to list in the output file. The default value is 100.
- -m: the measure to use to rank genes. This number must range between 1 and 8. The default value is 1. The correspondence between this option and the measures is:
  1. Information gain.
  2. Twoing rule.
  3. Gini index.
  4. Sum minority.
  5. Max minority.
  6. Sum of variances.
  7. t-statistic.
  8. One dimensional SVM.
- -w: the weight parameter for the 1D SVM. This parameter can be any positive number. Its default value is 1. This parameter is used only when the -m option is 8.
- -R: specifies that the input file is in OC1 format. The class file is ignored if this option is set.
- For example, to list the best 500 genes using the t-test and the input files in the data directory:
  ./rankgene -m 7 -n 500 -o data/gene.list -i data/all-aml.txt -c data/all-aml-class.txt
- To do so for the one-dimensional SVM measure with a weight value of 10:
  ./rankgene -m 8 -w 10 -n 500 -o gene.list -i data/all-aml.txt -c data/all-aml-class.txt
Note:
- For the t-statistic, RankGene ranks genes in decreasing order of the statistic. For each gene, it prints two values: the reciprocal of the t-statistic and the t-statistic itself.
- For the one-dimensional SVM, RankGene also prints the expression values corresponding to the two support vectors and their corresponding classes. Please refer to the README file in the same directory for more information on usage.
License
Rankgene is available without a fee for all non-profit and academic institutions. Commercial users are required to obtain a license for Rankgene. The license is required for any use of Rankgene in a profit-making enterprise, and it gives the company all rights to any discoveries or inventions made with Rankgene. There is a modest fee for this license. Please contact Prof. Simon Kasif for details. Rankgene is still evolving, so we would appreciate comments from users.
Contact
If you have any questions, please contact Simon Kasif.
Measures of Predictability
Rankgene supports eight measures for quantifying a gene's ability to distinguish between classes. In the formulae below, k is the total number of classes; n is the total number of expression values; n_l (resp., n_r) is the number of values in the left (resp., right) partition; l_i (resp., r_i) is the number of values that belong to class i in the left (resp., right) partition; and c_i is the class of the ith sample.
- Information gain
- Twoing rule
- Sum minority
- Max minority
- Gini index
- Sum of variances
- t-statistic: Rankgene sorts the genes in decreasing order of the absolute value of the t-statistic for each gene.
- One dimensional support vector machine (SVM): We train an SVM on each gene's expression values. The gene's measure is the function optimised by the SVM training algorithm. Standard SVM training algorithms run in O(n^3) time, where n is the number of training samples. We have developed and implemented an algorithm for training one-dimensional SVMs with linear kernels that runs in O(n log n) time. You can read about the details of this algorithm in this paper.
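To make the t-statistic ranking concrete, here is a small Python sketch that sorts genes in decreasing order of the absolute value of a two-sample t-statistic. The source does not say which variant of the t-test Rankgene uses, so the Welch (unequal variance) form below is an assumption, as are the function names and the toy data.

```python
import math


def t_statistic(values, labels):
    """Welch two-sample t-statistic for one gene (assumed variant).

    Expects exactly two classes with at least two samples each and
    non-zero variance in at least one class.
    """
    first, second = sorted(set(labels))
    a = [v for v, c in zip(values, labels) if c == first]
    b = [v for v, c in zip(values, labels) if c == second]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def rank_by_t(genes, labels):
    """Sort (name, t) pairs by decreasing absolute value of t."""
    scored = [(name, t_statistic(values, labels)) for name, values in genes]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: gene A separates the classes, gene B does not.
labels = [0, 0, 0, 1, 1, 1]
genes = [("B", [1, 5, 3, 2, 6, 4]), ("A", [1, 2, 3, 7, 8, 9])]
ranked = rank_by_t(genes, labels)
for name, t in ranked:
    print(name, round(t, 3))
```

Gene A is listed first because its class means differ by six units with little within-class spread, while gene B's classes overlap heavily.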
Each of the first six measures attempts to capture the best possible reduction in uncertainty (analogous to an increase in predictability) that we can obtain by dividing the full range of expression of a given gene into two intervals (up-regulated, down-regulated).
E.g., Sum-minority is a simple rule where for a given threshold
we test the error obtained by predicting all samples below the
threshold to be in class one (e.g., normal) and above the threshold to be
in class two (e.g., cancer). The sum-minority rule counts the minority
class samples below and above the threshold as errors.
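The sum-minority rule can be sketched in a few lines of Python. This is an illustrative reading of the description above, not Rankgene's implementation: the choice of candidate thresholds (midpoints between consecutive distinct values) and the function names are assumptions.

```python
from collections import Counter


def minority(labels):
    """Count the samples that are not in the most common class (0 if empty)."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_sum_minority(values, labels):
    """Return (fewest errors, threshold) over all candidate thresholds.

    A threshold of None means that no split beats leaving the samples unsplit.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    classes = [c for _, c in pairs]
    best = (minority(classes), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        errors = minority(classes[:i]) + minority(classes[i:])
        if errors < best[0]:
            best = (errors, threshold)
    return best


# Hypothetical toy data: a gene that separates normal from cancer samples.
values = [1, 2, 3, 7, 8, 9]
labels = ["normal"] * 3 + ["cancer"] * 3
print(best_sum_minority(values, labels))
```

A lower score indicates a more predictive gene under this measure; a score of 0 means that some threshold separates the two classes without error.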
For information gain we use the reduction in class entropy resulting from
partitioning the samples into two ranges (below/above
a single threshold) as a measure of predictability.
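As a hedged illustration of the information gain measure, the Python sketch below computes the class entropy before and after splitting at each candidate threshold and keeps the largest reduction. The use of base-2 logarithms, the midpoint thresholds, and the function names are assumptions for this example; the source does not specify these details.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_information_gain(values, labels):
    """Return (largest entropy reduction, threshold) over all candidate splits."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    classes = [c for _, c in pairs]
    n = len(classes)
    base = entropy(classes)
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        left, right = classes[:i], classes[i:]
        # Weighted entropy of the two partitions after the split.
        split = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if base - split > best_gain:
            best_gain = base - split
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_gain, best_threshold


# Hypothetical toy data: a perfectly separating gene with balanced classes.
gain, threshold = best_information_gain([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])
print(gain, threshold)
```

With two balanced classes the entropy before splitting is one bit, so a perfectly separating threshold yields the maximum possible gain of one bit.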
Comparisons of the measures
We have implemented some techniques for comparing and contrasting the lists of predictive genes computed by each measure. We provide links to the comparison results for some publicly available data sets.