
From CAGT

Composition Alignment Source Code Download

This code is provided to illustrate an implementation of composition alignment. It was tested under Microsoft Windows. It is written in ANSI C and should compile on any platform with an appropriate compiler, with no or only minor modifications. Click here to access the code.

Files are as follows:

cau.c - this is the file you should build to create a console executable.
align.h - various data types and helper routines for regular alignment.
comp.alignment.h - various data types and helper routines for composition alignment.
simple.match.function.h - contains alignment routines for simple matching functions (1-3).
complex.match.function.h - contains alignment routines for complex matching functions (4-6).
kbest.local.composition.alignment.h - alternate versions of the local alignment routines that find next-best alignments by blocking later alignment paths from passing through previously reported ones. Not used in this version of the console utility.
in.fa - sample input file (contains random sequences)
run.bat - an example of program invocation

Program usage:

Assuming your executable file is called cau.exe, please use the following syntax:

cau.exe File Match Mismatch Delta Limit isDiNuc MatchFunc IsLocal IsPaired [-m (minscore)] [-r (scoreratio)]

Where: (all weights, penalties, and scores are positive)
File = multiple sequences input file
Match = matching weight
Mismatch = mismatching penalty
Delta = indel penalty
Limit = how many characters can be scrambled at a time
isDiNuc = 1 for nucleotide, 2 for dinucleotide
MatchFunc = function number between 1 and 6
IsLocal = 0 for Global, 1 for Local, 2 (or anything else) for PatternGlobalTextLocal
IsPaired = 0 to align everything against everything, 1 (or anything else) to align pairs from the input file
-m (minscore) = use this switch to provide a minimum compositional score to report an alignment
-r (scoreratio) = use this switch to indicate a minimum composition/basic alignment score ratio to report an alignment
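
To make the argument order concrete, here is a minimal sketch in C of how the nine positional arguments and the two optional switches could be read. It is not taken from cau.c: the names cau_options and parse_cau_args are hypothetical, the numeric types (double for weights and penalties, int for the rest) are assumptions, and the sketch assumes that each switch is followed by its value as a separate argument.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the settings listed above (not from cau.c). */
typedef struct {
    const char *file;     /* File: multiple sequences input file */
    double match;         /* Match: matching weight */
    double mismatch;      /* Mismatch: mismatching penalty */
    double delta;         /* Delta: indel penalty */
    int limit;            /* Limit: characters that can be scrambled at a time */
    int is_dinuc;         /* isDiNuc: 1 nucleotide, 2 dinucleotide */
    int match_func;       /* MatchFunc: 1..6 */
    int is_local;         /* IsLocal: 0 global, 1 local, else PatternGlobalTextLocal */
    int is_paired;        /* IsPaired: 0 all against all, else pairs */
    int has_min_score;    /* set when -m was given */
    double min_score;
    int has_score_ratio;  /* set when -r was given */
    double score_ratio;
} cau_options;

/* Returns 0 on success and -1 on a usage error. argv[0] is the program name. */
int parse_cau_args(int argc, char **argv, cau_options *opt)
{
    int i;

    if (argc < 10)
        return -1;                       /* nine positional arguments are required */

    opt->file       = argv[1];
    opt->match      = atof(argv[2]);
    opt->mismatch   = atof(argv[3]);
    opt->delta      = atof(argv[4]);
    opt->limit      = atoi(argv[5]);
    opt->is_dinuc   = atoi(argv[6]);
    opt->match_func = atoi(argv[7]);
    opt->is_local   = atoi(argv[8]);
    opt->is_paired  = atoi(argv[9]);
    opt->has_min_score = 0;
    opt->min_score = 0.0;
    opt->has_score_ratio = 0;
    opt->score_ratio = 0.0;

    /* All weights and penalties are given as positive numbers. */
    if (opt->match <= 0.0 || opt->mismatch <= 0.0 || opt->delta <= 0.0)
        return -1;
    if (opt->match_func < 1 || opt->match_func > 6)
        return -1;

    /* Optional switches; assumption: the value follows as the next argument. */
    for (i = 10; i < argc; i++) {
        if (strcmp(argv[i], "-m") == 0 && i + 1 < argc) {
            opt->has_min_score = 1;
            opt->min_score = atof(argv[++i]);
        } else if (strcmp(argv[i], "-r") == 0 && i + 1 < argc) {
            opt->has_score_ratio = 1;
            opt->score_ratio = atof(argv[++i]);
        } else {
            return -1;
        }
    }
    return 0;
}
```

With this reading, a call such as cau.exe in.fa 2 1 3 4 1 3 1 0 -m 10 (values chosen only for illustration, not copied from run.bat) would select Match = 2, Mismatch = 1, Delta = 3, Limit = 4, nucleotide mode, match function 3, local alignment, all against all, and a minimum score of 10.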

Note the sequence file should be in FASTA format:

>Name of sequence1
   aggaaacctg ccatggcctc ctggtgagct gtcctcatcc actgctcgct gcctctccag
   atactctgac ccatggatcc cctgggtgca gccaagccac aatggccatg gcgccgctgt
   actcccaccc gccccaccct cctgatcctg ctatggacat ggcctttcca catccctgtg...
>Name of sequence2
   aggaaacctg ccatggcctc ctggtgagct gtcctcatcc actgctcgct gcctctccag
   atactctgac ccatggatcc cctgggtgca gccaagccac aatggccatg gcgccgctgt
   actcccaccc gccccaccct cctgatcctg ctatggacat ggcctttcca catccctgtg...
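
The sample above indents sequence lines and separates them into blocks of ten letters, so a reader has to discard whitespace inside a record. The following is a minimal sketch in C of one way to do that, assuming the file contents are already in memory as a NUL-terminated string. The function name fasta_next is hypothetical and is not part of the distributed code; the sketch keeps only alphabetic characters and assumes that '>' appears only at the start of a header line.

```c
#include <ctype.h>
#include <stddef.h>

/* Reads one FASTA record starting at *cursor (hypothetical helper).
 * Copies the header text after '>' into name and the sequence letters
 * into seq, truncating if a buffer is too small. Advances *cursor past
 * the record. Returns 1 if a record was read, 0 at end of input. */
int fasta_next(const char **cursor, char *name, size_t name_cap,
               char *seq, size_t seq_cap)
{
    const char *p = *cursor;
    size_t n = 0, s = 0;

    /* Skip anything before the next header line. */
    while (*p != '\0' && *p != '>')
        p++;
    if (*p == '\0') {
        *cursor = p;
        return 0;
    }
    p++;                                  /* step over '>' */

    /* Header: the rest of the line. */
    while (*p != '\0' && *p != '\n' && *p != '\r') {
        if (n + 1 < name_cap)
            name[n++] = *p;
        p++;
    }
    if (name_cap > 0)
        name[n] = '\0';

    /* Sequence: every letter up to the next '>' or the end of input;
     * spaces, line breaks, and other characters are dropped. */
    while (*p != '\0' && *p != '>') {
        if (isalpha((unsigned char)*p) && s + 1 < seq_cap)
            seq[s++] = *p;
        p++;
    }
    if (seq_cap > 0)
        seq[s] = '\0';

    *cursor = p;
    return 1;
}
```

Calling fasta_next repeatedly with the same cursor walks through all records in the buffer, which would suit both the all-against-all and the paired modes.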

Note that for PatternGlobalTextLocal alignment, the pattern is the second sequence.

Note match functions are as follows:

  Function 1:
     The simplest function, a constant times the length of the match.
  Function 2:
     Square root of length of extended match times a constant.
  Function 3:
     Log base 2 of length+1 of extended match times a constant.
  Function 4:
     Relative entropy of substring composition with respect to background
     composition times length of extended match times a constant.
  Function 5:
     Relative entropy of substring composition with respect to background
     composition times length of extended match times a constant. For
     length = 1, the normal match value is used instead. This prevents
     two identical sequences composed of only one letter from scoring
     zero if the background is the same letter.
  Function 6:
     Shannon-Jensen entropy of substring composition versus the
     background composition times length of extended match times a
     constant.
