GLAM: Gapless Local Alignment of Multiple sequences
GLAM is a program for discovering functional motifs shared by a set of nucleotide sequences. Examples of functional motifs include transcription factor binding sites, mRNA splicing control elements, signals for mRNA 3'-cleavage and polyadenylation, and anything else you can dream of. GLAM attempts to find these motifs by obtaining the best possible gapless multiple alignment of segments of the sequences. The 'best' alignment is the one that maximizes GLAM's alignment score, which is reported in bits in the output. At most one segment from each sequence is included in the alignment, and some sequences may be excluded if doing so would improve alignment quality. Currently we do not offer a web server, because GLAM is too compute-intensive.
Martin C Frith, Ulla Hansen, John L Spouge, Zhiping Weng (2004). Finding
functional sequence elements by multiple local alignment. Nucleic Acids
Here are the data sets studied in the paper.
- Fixed the source code for fussy compilers. (Thanks: Dan Haft)
- Fixed bug that caused occasional crashes. (Thanks: Szymon Kielbasa)
- Fixed crashes caused by temperature underflow. (Thanks: Yutao Fu)
- Added options -g and -k to find suboptimal alignments.
- Print stars under conserved columns.
- Added option -d to control frequency of width-adjusting moves.
- Print strand information.
- Fixed bug with -l option (lowercase masking).
- Added options -l to filter lowercase letters and -e to turn off the E-value calculation.
- Changed the option to use the modified Lam schedule from -l to -m.
- Changed the default value of -n from 20000 to 10000, so the program appears to run twice as fast.
- Changed the default temperature from 1 to 0.9, since that works better in a variety of tests.
- Switched from Spouge's to Altschul & Gish's edge correction formula, which makes a minor difference to the E-value calculation. Renamed mu_star to H for consistency with our publication.
- Minor changes in the output format, so your parser will be broken.
- Fixed printing of flanking sequences on reverse strands. (Thanks: Kavitha Venkatesan)
- Download GLAM by clicking one of these links, and saving the file on your
GLAM executable for Linux (Red Hat 7.2/7.3)
GLAM executable for Sun (Solaris 8)
GLAM executable for SGI/IRIX
- Set execute permission for the file by typing 'chmod +x glam-linux' (or whatever you saved it as).
- GLAM is now ready to run.
- Download the GLAM source code.
tar -xvf glam-src.tar
- Change directory:
- Compile (cross your fingers): make
Unfortunately the source code doesn't compile successfully on all systems. We'd love to hear your suggestions for making it more portable.
GLAM: Gapless Local Alignment of Multiple sequences
Compiled on Dec 11 2003
Run 1... 10345 iterations
Run 2... 15190 iterations
Run 3... 24749 iterations
Run 4... 20294 iterations
Run 5... 20488 iterations
Run 6... 16930 iterations
Run 7... 21583 iterations
Run 8... 23260 iterations
Run 9... 13733 iterations
Run 10... 14805 iterations
Calculating score distribution...
Calculating random walk parameters...
Best alignment found:
Score: 44.7738 bits Width: 32 Sequences: 5 Runs: 6 E-value: 2.29
FirstSeq 164 GGACTAAGTTACTTAAACTGTTCAGGAGATAC 195 + (8.36)
2ndSeq 244 GGGCATGGTGACCTTTCGCACTCTGGGCATGC 275 + (9.91)
3rdSeq 244 GGTCAAGGTCACCGACAGCAGTAAGGGCTGAC 275 + (11.6)
4thSeq 244 GGGCAAAGTGACTGGACATAGGAGTGGGACAC 275 + (10.4)
LastSeq 92 GGGCAAAGCAACATAGCGGGGTAGGGTCCTCC 61 - (7.88)
** ****** ** * * * ** ** * *
Other alignments:
Score: 43.4623 bits Width: 38 Sequences: 5 Runs: 1 E-value: 5.68
Score: 43.2357 bits Width: 19 Sequences: 5 Runs: 3 E-value: 6.65
glam myseqs.fa
5 sequences in file
Residue abundances: a=0.248002 c=0.251998 g=0.251998 t=0.248002
Pseudocounts: a=0.372002 c=0.377998 g=0.377998 t=0.372002
Max possible alignment width: 500
K: 0.178944 H: 1.25899
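The Pseudocounts values in this output can be checked against the defaults described in the options section: a pseudocount weight of 1.5, multiplied by the background residue abundances. The following is a minimal Python sketch of that arithmetic using the abundances printed above. The variable names are illustrative and are not part of GLAM.

```python
# Residue abundances as printed in the sample output.
abundances = {"a": 0.248002, "c": 0.251998, "g": 0.251998, "t": 0.248002}

# Default pseudocount weight, as stated in the option list.
weight = 1.5

# Default scheme: pseudocount = weight * background abundance.
pseudocounts = {base: weight * freq for base, freq in abundances.items()}

# Uniform scheme: every pseudocount equals weight / 4.
uniform = {base: weight / 4 for base in abundances}

for base in "acgt":
    print(f"{base}: default={pseudocounts[base]:.5f} uniform={uniform[base]:.3f}")
```

The default values computed this way agree with the printed pseudocounts to about five decimal places. The small difference in the last digit is presumably because the printed abundances are themselves rounded.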
GLAM works by starting from a completely random alignment of the sequences, and making small refinements to it over many iterations, in an attempt to find the best possible alignment. Since this procedure does not guarantee finding the optimum alignment, GLAM repeats it 10 times from different starting points (10 runs). The idea is that if several of the runs converge to the same best alignment, we have increased confidence that it is indeed the optimum alignment. In this example 6 out of 10 runs gave the same best alignment.
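The restart strategy can be illustrated with a toy Python sketch. This is not GLAM's algorithm: the score (count of the most common letter per column), the fixed width, the single move type, and the Metropolis acceptance rule are all assumptions chosen to keep the example short. What it does share with the description above is the overall shape: start from a random alignment, refine it until a set number of iterations pass without a new best, repeat from several starting points, and count how many runs agree.

```python
import math
import random
from collections import Counter

def score(seqs, starts, width):
    """Toy score: for each column, the count of the most common letter."""
    total = 0
    for col in range(width):
        letters = Counter(s[p + col] for s, p in zip(seqs, starts))
        total += letters.most_common(1)[0][1]
    return total

def one_run(seqs, width, n_stall, temperature, rng):
    """Refine a random alignment until n_stall iterations pass without a new best."""
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    current = score(seqs, starts, width)
    best, best_starts, stalled = current, tuple(starts), 0
    while stalled < n_stall:
        # Propose moving one randomly chosen sequence to a new random offset.
        i = rng.randrange(len(seqs))
        old = starts[i]
        starts[i] = rng.randrange(len(seqs[i]) - width + 1)
        new = score(seqs, starts, width)
        # Metropolis rule (an assumption): keep improvements, and keep worse
        # moves with probability exp(delta / temperature).
        if new >= current or rng.random() < math.exp((new - current) / temperature):
            current = new
        else:
            starts[i] = old
        if current > best:
            best, best_starts, stalled = current, tuple(starts), 0
        else:
            stalled += 1
    return best, best_starts

def multi_run(seqs, width, runs=10, n_stall=2000, temperature=0.9, seed=1):
    """Run several independent searches; report the best result and how many runs found it."""
    rng = random.Random(seed)
    results = [one_run(seqs, width, n_stall, temperature, rng) for _ in range(runs)]
    tally = Counter(results)
    top = max(tally, key=lambda result: result[0])
    return top, tally[top]

if __name__ == "__main__":
    # Plant one motif in five random 40-letter sequences.
    motif, rng = "GGTCACGT", random.Random(7)
    seqs = []
    for _ in range(5):
        background = "".join(rng.choice("ACGT") for _ in range(40))
        at = rng.randrange(40 - len(motif) + 1)
        seqs.append(background[:at] + motif + background[at + len(motif):])
    (best_score, best_starts), agreeing = multi_run(seqs, len(motif))
    print(f"best score {best_score} at {best_starts}, found by {agreeing} of 10 runs")
```

The number of agreeing runs plays the same role as the Runs field in GLAM's output: the more runs that return the identical alignment, the more plausible it is that the search has found the optimum.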
The score is GLAM's measure of how strong/well-conserved/striking the alignment is: the higher the better. The E-value indicates how often we would expect an alignment of this score or greater to exist among unrelated sequences just by chance. We hope to find E-values lower than 1, but in this example it is only 2.29, so the alignment does not appear to be statistically significant. The stars indicate conserved columns that contribute positively to the score. The numbers in brackets are the marginal scores (in bits) of each segment in the alignment: i.e. the score gained by including this segment in the alignment rather than excluding it. We might feel more confident that segments with higher marginal scores are true motif instances. The marginal scores won't in general sum to the total alignment score.
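For downstream processing, the aligned-segment lines can be parsed into their fields. The sketch below is written in Python and assumes the line layout seen in the sample output: name, start coordinate, residues, end coordinate, strand, and marginal score in parentheses. The `Segment` type and `parse_segment` function are hypothetical helpers, not part of GLAM, and the pattern may need adjusting for other GLAM versions or when flanking residues are printed.

```python
import re
from typing import NamedTuple, Optional

class Segment(NamedTuple):
    name: str
    start: int
    residues: str
    end: int
    strand: str
    marginal_bits: float

# Assumed layout: name, start, residues, end, strand, (marginal score).
LINE = re.compile(
    r"^(\S+)\s+(\d+)\s+([A-Za-z]+)\s+(\d+)\s+([+-])\s+\(([-+0-9.eE]+)\)\s*$"
)

def parse_segment(line: str) -> Optional[Segment]:
    """Return a Segment for an aligned-segment line, or None for any other line."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    name, start, residues, end, strand, marginal = match.groups()
    return Segment(name, int(start), residues, int(end), strand, float(marginal))

if __name__ == "__main__":
    sample = [
        "Score: 44.7738 bits Width: 32 Sequences: 5 Runs: 6 E-value: 2.29",
        "FirstSeq 164 GGACTAAGTTACTTAAACTGTTCAGGAGATAC 195 + (8.36)",
        "3rdSeq 244 GGTCAAGGTCACCGACAGCAGTAAGGGCTGAC 275 + (11.6)",
        "LastSeq 92 GGGCAAAGCAACATAGCGGGGTAGGGTCCTCC 61 - (7.88)",
    ]
    segments = [seg for seg in map(parse_segment, sample) if seg is not None]
    # List segments from highest to lowest marginal score.
    for seg in sorted(segments, key=lambda s: s.marginal_bits, reverse=True):
        print(seg.name, seg.strand, seg.marginal_bits)
```

Note that on the reverse strand the start coordinate is larger than the end coordinate, as in the LastSeq line, so code that computes segment lengths should not assume start is less than end.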
The program glam_logo.pl draws a sequence logo representation of the best alignment in a GLAM output file (in encapsulated PostScript format). After making the program executable (by typing chmod +x glam_logo.pl), run it with a command like this:
glam_logo.pl glam_out_file mypic.eps
- RepeatMasker - AFA Smit & P Green, unpublished.
- nseg - JC Wootton & S Federhen, Methods Enzymol. 1996;266:554-71.
- dust - R Tatusov & D Lipman, unpublished.
These programs have parameters that vary the stringency of masking. It will be necessary to experiment with these parameters to get a balance between masking repetitive elements adequately but not masking too many potential motifs.
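One way to judge masking stringency is to measure how much of each sequence ends up masked. The Python sketch below assumes the masking program was run so that masked residues appear as lowercase letters in a FASTA file, which is the convention GLAM's lowercase-filtering option relies on. The function names are illustrative.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping sequence name to sequence string."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the name.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def masked_fraction(seq):
    """Fraction of letters in seq that are lowercase (soft-masked)."""
    letters = [ch for ch in seq if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.islower() for ch in letters) / len(letters)

if __name__ == "__main__":
    example = ">seq1\nACGTacgtACGT\n>seq2 some description\nacgtacgt\nACGT\n"
    for name, seq in read_fasta(example.splitlines()).items():
        print(f"{name}: {masked_fraction(seq):.0%} masked")
```

If most of a sequence is masked, few positions remain in which a motif can be found, which suggests relaxing the masking parameters.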
|Help: print documentation.|
|This important parameter controls the tradeoff between speed and accuracy. If you don't play around with any other parameters, play around with this one. Each alignment run will continue until n (default = 10000) iterations have passed without improving on the best alignment found so far. We like to set n sufficiently high that at least 3 out of 10 runs converge to the same alignment. Low values of n are adequate when the problem size is small, i.e. when the sequences are short and more importantly there are few of them, but high values of n are needed for large problems. In addition, smaller values of n are sufficient when there is a strong alignment to be found, but larger values are necessary when there isn't, e.g. for finding the optimal alignment of random sequences. You'll have to choose n on a case-by-case basis, but to give some examples we have used n=1000 to align 5 x 500bp sequences, and n=20000 to align 20 x 1000bp sequences. For larger problems it may be impossible to converge reproducibly to the same exact alignment in reasonable time, but in these cases you can check that similar motifs are reproducibly obtained.|
|The number of alignment runs (default = 10).|
|(The digit, not the letter): just examine the direct strand (default = both strands).|
|Minimum alignment width (default = 1).|
|Maximum alignment width (default = 10000).|
|Require every sequence to participate in the alignment.|
|Supply a previous GLAM output file, and exclude the best alignment found previously from being recovered again. The previous GLAM output will be appended to the current output: if a file with multiple such outputs is supplied with -g, all best alignments found previously will be excluded.|
|Prevent all residues participating in previous alignments from participating in this one. The default behavior is that any pair of residues aligned previously may not be aligned this time.|
|(The letter, not the digit): exclude lowercase letters from being aligned. Lowercase letters are often used to indicate repetitive sequence.|
|Verbose: if multiple runs return more than one alignment, print all alignments in full.|
|Print this number of flanking residues, in lowercase, either side of the alignment (default = 0).|
|Pretend that the background residue abundances equal 1/4, instead of estimating them from the input sequences. This option might be a good idea for aligning very short sequences that are mostly covered by the motif.|
|Frequency of width-adjusting versus sequence-adjusting moves (default = 1). When the number of sequences is large compared to the sequence length, GLAM has difficulty widening the alignment, and may return excessively narrow alignments. Increasing the frequency of width-adjusting moves compensates for this problem to some extent. In theory this problem can always be overcome by making the -n parameter sufficiently large.|
|Initial temperature (default = 0.9).|
|Cooling factor: multiply temperature by this amount each iteration (default = 1, i.e. constant temperature).|
|Use the "modified Lam schedule" instead of the default geometric schedule. This is a strategy where the algorithm aims to achieve a target "accept rate", i.e. the rate of altering the alignment versus leaving it unchanged per iteration. In the early phase of the algorithm, the target accept rate decays geometrically from 100% to 44%. In the middle phase it remains constant at 44%. In the final phase it decays geometrically to 0%. Whenever the actual accept rate is higher than the target, the temperature is multiplied by c; whenever it is lower than the target, the temperature is divided by c. If -m is selected, the -n option indicates the total number of iterations per run, not the number of iterations without improvement.|
|Print energy (negative alignment score in nats), accept rate, temperature, width, and number of sequences in the alignment after each iteration.|
|Pseudocount weight (default = 1.5).|
|Uniform pseudocounts: set each pseudocount equal to p/4. Default = p * (background residue abundances).|
|Seed for the random number generator (default = 1).|
|Turn off the E-value calculation, in case it's too slow for you.|
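The modified Lam schedule described in the option list can be sketched in Python as follows. The 44% plateau and the multiply-or-divide rule come from the description above. The phase boundaries (15% and 65% of the run) and the decay constants (560 and 440) are taken from a commonly cited formulation of the modified Lam schedule; whether GLAM uses these exact values is an assumption. The cooling factor 0.999 in the demonstration is also just an illustrative choice.

```python
def target_accept_rate(i, n):
    """Target accept rate at iteration i of n total iterations.

    Phase boundaries and decay constants are assumed, not taken from GLAM.
    """
    frac = i / n
    if frac < 0.15:
        # Early phase: geometric decay from 100% towards 44%.
        return 0.44 + 0.56 * 560.0 ** (-frac / 0.15)
    if frac < 0.65:
        # Middle phase: constant 44%.
        return 0.44
    # Final phase: geometric decay from 44% towards 0%.
    return 0.44 * 440.0 ** (-(frac - 0.65) / 0.35)

def adjust_temperature(temperature, actual_rate, target_rate, c):
    """Multiply by c when accepting too often, divide by c when accepting too rarely."""
    if actual_rate > target_rate:
        return temperature * c
    if actual_rate < target_rate:
        return temperature / c
    return temperature

if __name__ == "__main__":
    n = 10000
    for i in (0, 500, 1500, 5000, 8000, 10000):
        print(f"iteration {i:>5}: target accept rate {target_accept_rate(i, n):.3f}")
    # With c < 1, accepting too often cools the system.
    print(adjust_temperature(0.9, actual_rate=0.8, target_rate=0.44, c=0.999))
```

With a cooling factor below 1, an accept rate above target lowers the temperature and an accept rate below target raises it, so the temperature tracks whatever value keeps the accept rate near the target.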
glam-linux -n5000 -l -v -f5 myseqs.fa
- The E-values become increasingly conservative as the number of sequences increases.
- If the number of sequences is many-fold larger than the sequence length, GLAM has difficulty widening the alignment.
- The E-value calculation aborts when given more than about 730 input sequences.
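Given the last limitation, it can be useful to count the input sequences before launching a run and to turn off the E-value calculation with -e when the count is too high. A minimal Python sketch follows. The threshold of 730 is only the approximate figure quoted above, and the helper names and the `glam` executable name are assumptions for illustration.

```python
# Approximate limit quoted in the list of limitations; treat it as a rough guide.
E_VALUE_SEQUENCE_LIMIT = 730

def count_fasta_records(lines):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in lines if line.startswith(">"))

def build_command(fasta_path, n_seqs, executable="glam", limit=E_VALUE_SEQUENCE_LIMIT):
    """Build a GLAM command line, adding -e when there are too many sequences."""
    command = [executable]
    if n_seqs > limit:
        command.append("-e")  # skip the E-value calculation
    command.append(fasta_path)
    return command

if __name__ == "__main__":
    fasta_lines = [">s1", "ACGT", ">s2", "GGCC", ">s3", "TTAA"]
    n_seqs = count_fasta_records(fasta_lines)
    print(n_seqs, build_command("myseqs.fa", n_seqs))
    print(build_command("big.fa", 1000))
```

The resulting list can be passed to `subprocess.run` to launch the program.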
The Zlab Gene Regulation Hub lists many other motif discovery programs. We believe GLAM possesses a unique combination of advantages: 1) Automatic determination of the alignment width. 2) Calculation of the statistical significance of alignments. 3) Ability to find suboptimal alignments. 4) Robustness and rigor: by default GLAM carries out many alignment runs from different starting points. 5) Flexibility: you can search 1 or 2 DNA strands, place limits on the alignment width, and vary details of the refinement scheme.
Suggestions to: Martin Frith
Last modified: Tuesday, 15-Feb-2005 19:35:31 EST