Rankgene: a program to rank genes from expression data

This page is a mirror. Original document is located here.

UPDATE

Rankgene version 1.1 is now available. The new version accepts expression data sets in two formats: a standard Affymetrix-like gene expression data format and the original OC1 format. Check the README file in the package for details.

Introduction

Rankgene is a program for analyzing gene expression data, selecting features, and ranking genes based on the predictive power of each gene to classify samples into functional or disease categories. A paper describing Rankgene will be published in Bioinformatics. One useful feature of this program is that the user can choose among eight different ranking criteria. Rankgene uses the six measures of predictability adopted from the popular OC1 decision tree software developed at Johns Hopkins University. In addition, we provide the traditional t-test and a novel, efficient implementation of a one-dimensional support vector machine (SVM) as two new options. Rankgene can be used as a feature selection program to select or rank genes based on their relative predictive power in classifying gene expression data. The input to Rankgene is a gene expression data set consisting of a set of samples, the expression levels of all the genes across these samples, and the class label for each sample. A typical example would be gene expression measurements from normal and cancerous tissues. Rankgene analyzes the expression values of each gene and ranks the genes by their ability to distinguish between the classes.

Download and Install

Click the link to download: rankgene-1.1.tar.gz

After downloading, run the following commands:

   $ gunzip rankgene-1.1.tar.gz

   $ tar -xvf rankgene-1.1.tar

   $ cd rankgene-1.1

   $ make

Now the rankgene program should be ready to run.

Supported operating systems and compilers
Operating System          Compiler
Linux (Redhat 7.2/7.3)    gcc 2.96
Linux (Redhat 8.0)        gcc 3.2
Note that the program should compile on most Linux/Unix systems. Please let us know if you have problems or suggestions when compiling the program.

Usage

Input

Rankgene accepts input files containing gene expression data in two formats:

  1. Standard: This format is similar to the usual format used in gene expression data sets. All lines are tab-delimited. The first row of the file specifies the names of the first two columns followed by all the sample names. Every other line corresponds to a gene. The line contains the gene id, the gene name, and all the expression values for that gene. You can indicate missing values either by "?" or "NA". RankGene replaces missing values for a gene with the average value of the other expression values for that gene.
    In this format, a separate file specifies the class labels. Each line of this file contains a sample name and a class name, separated by a tab. RankGene will ignore a sample whose class name is "NA".
    The files in the data directory in the RankGene package are examples of this format.
    Note: We suggest that you use a format where the first column in each line contains the gene id (gene accession number) and the second column contains the gene name. RankGene recognises some standard names for the gene name column (Gene Description and Name) and the gene id column (Gene Accession Number, GID, and Image Id). If you use different names for these columns, please let us know.
    Note: RankGene does not accept data sets that contain information apart from the gene expression values. For instance, some Affymetrix data sets contain CALL values or p-values. Please remove these columns from the data set before invoking RankGene.
  2. OC1: Each line corresponds to a sample or experimental condition. The line contains the expression values of all the genes for that sample followed by the class label of that sample. The elements of the line are separated by commas. RankGene expects an expression value to be a floating point number. Unlike the previous format, the class label must be an integer.
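
The two input formats described above can be illustrated with a small reader. This is only a sketch in Python, not RankGene's own code: the function names (parse_standard, parse_classes, parse_oc1) are hypothetical, and filling a gene that has no observed values with 0.0 is an assumption, since the text does not say what RankGene does in that case.

```python
MISSING = {"?", "NA"}  # missing-value markers named in the text


def parse_standard(text):
    """Parse the tab-delimited Standard format: a header row, then one row per gene."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    samples = lines[0].split("\t")[2:]  # first two header cells name the id/name columns
    genes = []
    for line in lines[1:]:
        gene_id, gene_name, *raw = line.split("\t")
        known = [float(v) for v in raw if v not in MISSING]
        # Assumption: a gene with no observed values at all is filled with 0.0.
        mean = sum(known) / len(known) if known else 0.0
        # Missing entries take the average of the gene's other values.
        values = [mean if v in MISSING else float(v) for v in raw]
        genes.append((gene_id, gene_name, values))
    return samples, genes


def parse_classes(text):
    """Parse 'sample<TAB>class' lines, skipping samples whose class is NA."""
    labels = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        sample, cls = line.split("\t")
        if cls.strip() != "NA":
            labels[sample] = cls.strip()
    return labels


def parse_oc1(text):
    """Parse OC1 lines: comma-separated floats followed by an integer class label."""
    data = []
    for line in text.splitlines():
        if not line.strip():
            continue
        *values, label = line.split(",")
        data.append(([float(v) for v in values], int(label)))
    return data
```

For instance, a gene row "G1<TAB>gene one<TAB>1.0<TAB>?<TAB>3.0" would be read as the values 1.0, 2.0, 3.0, because the missing entry takes the mean of the other two.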

Output

RankGene outputs a file containing the genes most predictive of the sample classes based on the measure specified on the command line. The number of genes is specified on the command line. The first few lines of the output file contain some general information about RankGene, the data contained in the input files, and the measure used. Each succeeding line lists the index of a gene in the data file (indices start at 1), the name of the gene, the id of the gene, and the value of the measure for that gene. The genes are sorted in increasing order of the value of the measure.

Command Line Options

RankGene accepts the following command-line options:

  • -i : name of the file containing the input data.
  • -c : name of the file containing the class labels.
  • -o : name of the file to output the results to.
  • -n : number of genes to list in the output file. The default value is 100.
  • -m : the measure to use to rank genes. This number must range between 1 and 8. The default value is 1. The correspondence between this option and the measures is:
    1. Information gain.
    2. Twoing rule.
    3. Gini index.
    4. Sum minority.
    5. Max minority.
    6. Sum of variances.
    7. t-statistic.
    8. One dimensional SVM.
  • -w : the weight parameter for the 1D SVM. This parameter can be any positive number. Its default value is 1. This parameter is used only when the -m option is 8.
  • -R : specifies that the input file is in OC1 format. The class file is ignored if this option is set.
  1. For example, to list the best 500 genes using the t-test and the input files in the data directory:
    ./rankgene -m 7 -n 500 -o data/gene.list -i data/all-aml.txt -c data/all-aml-class.txt
  2. To do so for the one-dimensional SVM measure with a weight value of 10:
    ./rankgene -m 8 -w 10 -n 500 -o gene.list -i data/all-aml.txt -c data/all-aml-class.txt

Note:
  1. For the t-statistic, RankGene ranks genes in decreasing order of the absolute value of the statistic. For each gene, it prints out two values: the reciprocal of the t-statistic and the t-statistic itself.
  2. For the one-dimensional SVM, RankGene also prints out the expression values corresponding to the two support vectors and their corresponding classes. Please refer to the README file in the same directory for more information on usage.
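
As an illustration of the t-statistic ranking described in the first note, here is a minimal Python sketch. It is not RankGene's implementation: the page does not say which variant of the t-statistic is used, so Welch's unequal-variance form is an assumption, and the names welch_t and rank_by_t are hypothetical. The sketch assumes exactly two classes, at least two samples per class, and non-zero variance in at least one class.

```python
import math


def welch_t(a, b):
    """Welch's two-sample t-statistic (assumed variant; needs >= 2 values per class)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def rank_by_t(genes, labels, top_n=100):
    """Rank genes by decreasing |t|.

    genes:  dict mapping gene name -> list of expression values (one per sample)
    labels: list with one of two class names per sample, in the same order
    """
    class_0, class_1 = sorted(set(labels))
    scored = []
    for name, values in genes.items():
        a = [v for v, c in zip(values, labels) if c == class_0]
        b = [v for v, c in zip(values, labels) if c == class_1]
        scored.append((name, welch_t(a, b)))
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top_n]


genes = {
    "g1": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],  # classes well separated
    "g2": [1.0, 5.0, 9.0, 2.0, 6.0, 10.0],    # classes overlap heavily
}
labels = ["ALL"] * 3 + ["AML"] * 3
for name, t in rank_by_t(genes, labels):
    # Mirror the two printed values: the reciprocal and the statistic itself.
    print(name, 1.0 / t, t)
```

In this toy data set g1 is listed first because its two classes are far apart relative to their spread, which gives it the larger absolute t-statistic.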

License

Rankgene is available without a fee for all non-profit and academic institutions. Commercial users are required to obtain a license for Rankgene. The license is required for any use of Rankgene in a profit-making enterprise, and it gives the company all rights to any discoveries or inventions made with Rankgene. There is a modest fee for this license. Please contact Prof. Simon Kasif for details. Rankgene is still under active development, so we would appreciate comments from users.

Contact

If you have any questions, please contact Simon Kasif.

Measures of Predictability

Rankgene supports eight measures for quantifying a gene's ability to distinguish between classes. In the formulae below, k is the total number of classes; n is the total number of expression values; n_l (resp., n_r) is the number of values in the left (resp., right) partition; l_i (resp., r_i) is the number of values that belong to class i in the left (resp., right) partition; and c_i is the class of the ith sample.

  1. Information gain [formula image]
  2. Twoing rule [formula image]
  3. Sum minority [formula image]
  4. Max minority [formula image]
  5. Gini index [formula image]
  6. Sum of variances [formula image]
  7. t-statistic: Rankgene sorts the genes in decreasing order of the absolute value of the t-statistic for each gene.
  8. One-dimensional support vector machine (SVM): We train an SVM on each gene's expression values. The gene's measure is the function optimised by the SVM training algorithm. Standard SVM training algorithms run in O(n^3) time, where n is the number of training samples. We have developed and implemented an algorithm for training one-dimensional SVMs with linear kernels that runs in O(n log n) time. You can read about the details of this algorithm in this paper.

Each of the first six measures attempts to capture the best possible reduction in uncertainty (analogous to an increase in predictability) that we can obtain by dividing the full range of expression of a given gene into two intervals (up-regulated, down-regulated).

For example, sum minority is a simple rule where, for a given threshold, we test the error obtained by predicting all samples below the threshold to be in class one (e.g., normal) and all samples above the threshold to be in class two (e.g., cancer). The sum-minority rule counts the minority-class samples below and above the threshold as errors.
For information gain, we use the reduction in class entropy that results from partitioning the samples into two ranges (below/above a single threshold) as a measure of predictability.
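
To make the threshold idea concrete, the following Python sketch scans every candidate threshold for one gene and reports the best information gain and the lowest sum-minority error. It is an illustration under stated assumptions, not Rankgene's or OC1's code: the names entropy, minority and best_split are hypothetical, entropy is computed in bits using the textbook definition, and "minority" is taken to mean every sample outside a partition's majority class.

```python
import math
from collections import Counter


def entropy(counts):
    """Class entropy in bits for a Counter of class -> count."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def minority(counts):
    """Number of samples that are not in the partition's majority class."""
    return sum(counts.values()) - max(counts.values(), default=0)


def best_split(values, labels):
    """Try each threshold between consecutive distinct sorted values.

    Returns (best information gain, lowest sum-minority error).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left, right = Counter(), Counter(label for _, label in pairs)
    parent = entropy(right)
    # With no split at all, the gain is 0 and the error is the overall minority.
    best_gain, best_errors = 0.0, minority(right)
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1   # move sample i from the right partition to the left
        right[label] -= 1
        if pairs[i + 1][0] == value:
            continue       # no threshold can separate two equal values
        n_l = i + 1
        n_r = n - n_l
        gain = parent - (n_l / n) * entropy(left) - (n_r / n) * entropy(right)
        best_gain = max(best_gain, gain)
        best_errors = min(best_errors, minority(left) + minority(right))
    return best_gain, best_errors


# Perfectly separable gene: one threshold between 3 and 10 splits the classes.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

Because the samples are sorted once and the class counts are updated incrementally, the scan itself is linear in the number of samples; the sort dominates the cost.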

Comparisons of the measures

We have implemented some techniques for comparing and contrasting the lists of predictive genes computed by each measure. We provide links to the comparison results for some publicly available data sets.
