Why Use GeneDoc??? Read on.

Initial Alignment of Sequences

The Uses of Multiple Sequence Alignments

Multiple sequence alignments are a powerful tool for investigating the relationship between structure and function in biomolecules. Such alignments represent the evolutionary history of the group of sequences (Nicholas and Graves, 1983; Zukerkandl and Pauling, 1965). This evolutionary history is a record of successful mutagenesis experiments carried out by nature on a family of macromolecules. The multiple sequence alignment demonstrates the extent to which specific residues may be changed without destroying the essential structure and function of the macromolecule. At the same time they can identify which residues must be changed to create a new and different function within a similar structural framework (Nicholas et al., 1987). As such, a multiple sequence alignment is the logical starting point for laboratory mutagenesis experiments used to study the relationship between structure and function in families of macromolecules (Eckstein et al., 1996). GeneDoc�s analysis functions are designed to help users derive these kinds of information from multiple sequence alignments and apply it to experimental situations like those described below.

Multiple sequence alignments have been the essential representation of the data in investigations that successfully identified the sequence residues necessary for the correct functioning of different families of transfer RNAs (Nicholas and McClain, 1987; McClain and Nicholas, 1987) and for determining the correct secondary structure of many families of structural RNAs (Gutell et al., 1994). Such alignments also provide a means of testing hypotheses about gene duplication events and the origins of regulatory genes (Nicholas et al., 1995). Multiple sequence alignments also form the basis for databases of motifs, patterns that are diagnostic for membership in particular families of biomolecules (Bairoch et al., 1995) or for identifying the sequence features defining the sites for posttranslational modifications (Rosenquist and Nicholas, 1993; Nicholas and Rosenquist, 1997) as well as predicting the outcome of RNA selection experiments (Schmidt et al., 1996). The carefully constructed alignment of all of the aldehyde dehydrogenases sequences known at the time (Hempel et al., 1993) is still being used to guide site directed mutagenesis studies (Sheikh et al., 1997), as a guide to examining important features in the recently determined crystal structure of the enzyme (Liu et al. 1997) and modeling paralogous members of the same superfamily (Hempel et al., 1997).

Programs for Multiple Sequence Alignments

Thus multiple sequence alignments are a valuable source of information that can be brought to bear in many ways while investigating the properties, characteristics, and functions of macromolecules. However, multiple sequence alignments created with today�s programs suffer from both theoretical and practical shortcomings (Nicholas et al., 1995b; McClure et al., 1994). The ability to easily create multiple sequence alignments that more accurately reflect the evolution of the sequences in the alignment should make multiple sequence alignments an even more useful and powerful guide for experimental studies. GeneDoc�s alignment editing and analysis features are designed to help users improve their alignment and overcome the current limitations in multiple sequence alignment programs. Other GeneDoc analysis features are intended to help you extract information from the alignment to use in planning and intrepreting experimental work. There are currently a variety of approaches being used to create multiple sequence alignments. Each has both theoretical and practical shortcomings. Currently used programs fall into three categories.

1. Multi-dimensional dynamic programming methods, which are able to guarantee that a solution is optimal within the volume of solution space examined. However, memory and computational time requirements for this method scale exponentially, as the product of the lengths of the sequences. This limits the lengths and number of sequences that can be considered, thus limiting a rigorous multiple sequence alignment to relatively small problems.

2. In contrast, heuristic methods, the progressive pairwise programs, require less memory and computational time than the dynamic programming methods. Heuristic methods cannot guarantee that the alignment obtained is optimal in any rigorous sense because they are sensitive to local extrema which effects even alignments among closely related sequences. The multiple sequence alignments produced by these programs are also dependent upon the order in which the sequences are aligned.

3. A new program, SAGA (Sequence Alignment by Genetic Algorithm � Notredame and Higgens, 1996) is proving useful as a means of refining alignments produced by the progressive pairwise programs. The resulting alignments generally match, as they were intended to, MSA alignments. This sets an upper bound to the improvements.

The best current program for creating multiple sequence alignments is the MSA program (Lipman et al., 1989). MSA is superior to the more widely used progressive pairwise methods (Feng and Doolittle, 1987) found in programs such as PileUp (Genetics Computer Group) and CLUSTAL W (Thompson et al., 1994) because MSA rigorously optimizes a well defined alignment criteria. MSA minimizes the weighted sum-of-pairs alignment scores with quasi-natural gap costs (Lipman et al., 1989, Altschul, 1989). Although MSA is generally superior to other programs available today it still has shortcomings.

MSA is severely limited in the number of sequences that it can consider, because it requires memory and processor time proportional to the product of the lengths of the sequences being aligned. On a Cray C-90 where 256 MWords of memory were available as many as 20 phospohlipases A2 of about 125 amino acids have been aligned. For proteins in the 350 to 450 amino acid length range (aspartyl proteases and lipid binding protein superfamily members) the limit was in the range of 6 to 8 sequences. To make alignment of more than three or four short sequences possible and to achieve alignments of the size described above MSA combines information from pairwise sequence alignments and a progressive pairwise multiple sequence alignment to restrict the volume of alignment space explored in such a way that the alignment is no longer guaranteed to be the global (absolute) minimum sum-of-pairs cost alignment for the set of sequences. In some cases no alignment can be found within the restricted volume of the alignment space.

Another limitation of MSA and of the less capable progressive pairwise programs is that they are forced to use biologically unrealistic alignment criteria because computing biologically realistic alignment criteria is too time consuming for the large number of evaluations required during the alignment computation (Lipman et al., 1989, Altschul, 1989).

The problems of current interest can require the alignment of hundreds of sequences simultaneously, making the use of unrestricted multidimensional dynamic programming methods impossible. The MSA program (Lipman, et al., 1989) utilizes the Carrillo-Lipman strategy (Carrillo and Lipman 1988), a multidimensional dynamic programming scheme, to search for an optimal solution within a K-dimensional subset of the alignment space. The particular subset of solution space that MSA searches is determined using an heuristic pairwise progressive algorithm and individually computed alignments of each pair of sequences.

Heuristic progressive pairwise procedures have been developed to create multiple sequence alignments of large numbers of sequences. These are implemented in the widely used Clustal series of programs and the Genetics Computing Group�s PileUp program. The heuristic procedures perform adequately on sets of sequences that are minimally diverged from each other but are virtually sure to produce misleading alignments with more diverged sets of sequences. The main limitations in the heuristic procedures are described below.

The multiple sequence alignments are far from robust and the alignment is dependent on the order in which the sequences are entered into the alignment process. The addition of even one new sequence into the alignment process at an early stage can result in a quite different alignment than would result from aligning the new sequence with an already completed alignment of the other sequences.

· No explicit score is being optimized over the entire alignment. Rather, the alignment is generated in a cumulative fashion by first aligning the most similar sequences, and then aligning each subsequent sequence relative to the current cumulative alignment. Even though the score at each step is optimized, that process can be interpreted as the assumption that new data does not improve our understanding of the relationships among the sequences already aligned, since the new data is not allowed to modify the previously generated incomplete alignment.

· The alignment process is not guided by biologically reasonable alignment criteria. Because of this, phylogenetic analyses bases on these alignments virtually always find phylogenetic trees with a topology identical to the order of joining used in creating the alignment.

· The alignments are strongly problem dependent making it difficult to develop guidelines about how much reliance can be placed in the multiple sequence alignment.

The SAGA (Notredame and Higgins, 1996) program computes sum-of-pairs alignments using a maximization method, the genetic algorithm (Holland, 1975), developed by computer scientists. The basic strategy is to generate arbitrary alignments and to evaluate them by some well defined criteria. The initial alignments can be generated by a variety of methods. In SAGA the initial alignment is generally imported from the Clustal W program. The initial alignment is then modified by introducing random changes. These modified alignments reevaluated and further operations are carried out on a selected subset of the alignments. This includes replacing regions from one alignment by the comparable region from another alignment and then introducing more random changes.

These changes generate a large sample of the possible alignments. Eventually this sampling process finds good alignments and refine them. This leads to high quality alignments, as defined by the criteria used to evaluate the sampled alignments. None the less, there are disadvantages to the SAGA method.

· One result of the complicated customized mutation operators used in the SAGA genetic algorithm is that it is very computationally intensive compared to a simple genetic algorithm or the heuristic pairwise methods. As a program running on a work station, only alignment refinements are practical.

· So far, only sum-of-pairs scoring is available as an alignment criterion.

· SAGA uses a very biased sampling routine to select the alignments to be generated and evaluated. This sampling is dependent on a user input phylogenetic tree. Alignments that are not a good fit to the user specified tree are not likely to be generated and evaluated. Thus the best alignment may be missed.

Editing Sequence Alignments

GeneDoc provides a combination of alignment editing and alignment analysis capabilities intended to help users refine their alignments. These refinements should be directed toward the goal creating an alignment that accurately reflects the evolutionary history of the sequences being aligned. The editing can be directed toward aligning specific sequence residues to reflect structural or biochemical information that could not be incorporated in the initial alignment procedure. Changes can also be made to compensate for the limitations of multiple sequence alignment procedures discussed above. A very powerful guide to alignment editing is to determine whether or not the changes have improved the score of the alignment. The scores are an important guide because they are based on the accumulated knowledge of evolutionary processes incorporated in the empirical log-odds scoring matrices available in GeneDoc (Dayhoff et al., 1978; Henikoff and Henikoff, 1992; States et al., 1991, Altschul, 1991). GeneDoc offers two different ways to compute a score for your alignment or any section of it. These allow you evaluate the changes you make in an alignment using GeneDoc's alignment editing capabilities.

The first is sum-of-pairs scoring of the alignment. Sum-of-pairs scoring is much faster than weighted parsimony scoring. Sum-of-pairs scoring involves scoring all of the alignments between the independent pairs of sequences and adding these scores together to yield the total alignment score. While sum-of-pairs scoring is less than ideal, it results in alignments that are closer to those produced by superposition of three dimensional structures than do alignments produced by the heuristic methods. Sum-of-pairs scoring may be particularly appropriate in some protein engineering context or in the core regions of globular proteins where few if any alignment gaps are required.

Sum-of-pairs scoring is also useful in correcting some of the deficiencies of alignments produced by the progressive pairwise programs. These deficiencies are caused by the failure of the progressive pairwise methods to optimize a well defined alignment criterion and are generally referred to as local minima or local extrema. The most common deficiency, or local minima, is to fail to align a residue that is in fact identical in all of the sequences in the alignment. It is important to use objective criteria for changing an alignment. Ideally, there will be biochemical or structural data that indicates that a sequence residue is conserved during evolution and that the alignment should therefore have a corresponding column of identical residues. If such evidence is not available it is desirable that the changes in the alignment should improve the score for all of the effected alignment positions.

While the most common kind of local minima is failure to align a residue that is in fact identical in all of the sequences in the alignment, it is not the only kind. The user should keep in mind that spurious columns of identical residues can also result from local minima. It is difficult to identify these in the absence of external information about the biochemistry or structure of the molecules whose sequences are being aligned. Once again, comparing the scores of alternative alignments can provide useful guidance. One useful strategy to examine your alignment for artifacts of this kind is to perform the alignment several times while varying the gap penalties, the scoring matrices, the order in which the sequences are aligned, and even the program used to create the alignment. Most often a number of features in the alignment will recur in all of the alignments. These robust features can serve as anchor points in the absence of contradictory structural or biochemical information. The less robust a feature is the more carefully it should be evaluated before it is accepted as a part of the final alignment.

It is worth noting that there are strategies for refining progressive pairwise alignments that eventually converge to the sum-of-pairs alignment (Gotoh, 1996; Hirosawa et al., 1995). These strategies involve removing groups of sequences from the completed alignment and then realigning them to the remaining alignment. A simple form of this strategy was initially proposed by Barton and Sternberg (1987). The current implementation of the Clustal series, Clustal X, has rudimentary capabilities for alignment refinement.

The second is weighted parsimony scoring, an alignment criteria that has long been considered desirable but has been avoided because of its computational requirements (Sankoff and Cedegren, 1983). Weighted parsimony will result in an alignment that is most congruent with a user specified phylogenetic tree relating the sequences. The tree can be derived from whatever source the user prefers such as morphological data or another set of sequences. Weighted parsimony alignments should alleviate some of the theoretical deficiencies of sum-of-pairs scoring. One obvious artifact and deficiency in alignments produced by rigorous sum-of-pairs scoring is forcing the first residue before a gap and the first residue after the gap to coincide, that is to be in the same column of the alignment. This is more or less reasonable if the phylogeny of the sequences indicates that only one insertion or deletion event is necessary to explain the presence of a gap in the sequences where it is observed in the alignment. However, it seems very unlikely when the identical gap is present in rodent sequences and in bacterial sequences while the gap is completely missing from primate sequences and from invertebrate sequences. It is unlikely because these gaps most likely require at least two insertion and deletion events in the evolutionary history of the sequences. Unfortunately this is a common feature of sum-of-pairs alignments. Sometimes this problem is obvious and can be alleviated either by visual inspection and manual adjustment or by recomputing the alignment with a lower penalty for gaps. Manual adjustment carries the risk that the subjective criteria on which it is based are, in fact, incorrect in this instance. Lowering the gap penalty may create an even more inaccurate alignment in other parts of the sequences. Thus using a more sophisticated, evolution based alignment criteria, such as weighted parsimony, is a much better solution. These considerations apply equally to nucleic acid or protein alignments.

Weighted parsimony also alleviates some cases of the extreme prejudice that sum-of-pairs scoring has against including very dissimilar amino acids in the same position of an alignment. For example Aspartic acid and Tryptophan are among the very most dissimilar amino acids according to the Dayhoff PAM 250 similarity matrix (Dayhoff et al., 1978) or the Blosum series of matrices (Henikoff and Henikoff, 1992). Attempts to put them into the same column of an alignment are heavily penalized in sum-of-pairs scoring. However, if the phylogeny of the sequences indicate that the Aspartic acid containing sequences are connected to the Tryptophan containing sequences through nodes that could plausibly have been Histidine and Tyrosine the mutations are unexceptionable since all are observed at least as often as we would expect from random mutations. Weighted parsimony scoring would take into account favorable mutational pathways in the phylogenetic tree relating the sequences while still penalizing dissimilar amino acids not related by a plausible mutational pathway. Such prejudices can lead to the alignment of sequence positions which are not in fact homologous. These misalignments can propagate into adjacent parts of the overall alignment. These problems cannot be aleviated by alterations in the scores and are virtually impossible to detect and correct by visual inspection. The only practical solution is improved alignment criteria. Similar benefits accrue to weighted parsimony scoring for nucleic acid sequences by allowing different weights to be assigned to transitions and transversions.

References Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol., vol. 219, pp. 555 - 565.

Altschul, S.F. 1989. Gap Costs for Multiple Sequence Alignment. J. Theor. Biol., vol. 138, pp. 297 - 309. Bairoch, A., Bucher, P. and Hofmann K., 1995. The PROSITE database, its status in 1995. Nuc. Acids Res. vol. 24, pp. 189 - 196.

Barton, G.J. and Sternberg, M.J. 1987. A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J. Mol. Biol., vol. 198, pp. 327 - 337.

Carrillo, H., and Lipman, D. 1988, SIAM J. Appl. Math., vol. 48, pp. 1073.

Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. 1978. A model of evolutionary change in proteins. In "Atlas of Protein Sequence and Structure" vol. 5(3) M.O. Dayhoff (ed.), National Biomedical Research Foundation, Washington. pp. 345 - 352.

Eckstein, J.W., Beer-Romero, P. and Berdo, I. 1996. Identification of an essential acid residue in Cdc25 protein phosphatase and a general three-dimensional model for a core region in protein phosphatases. Prot. Sci. vol. 5, pp. 5 - 12.

Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. vol. 25, pp. 351 - 360.

Gotoh, O. 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., vol. 264, pp. 823 - 838.

Gutell, R.R., Larsen, N. and Woese, C.R. 1994. Lessons from an Evolving rRNA: 16S and 23S rRNA Structures from a Comparative Perspective. Microbiological Review, vol. 58, pp. 10 - 26.

Hempel, J., Nicholas, H., and Lindahl, R. 1993. Aldehyde dehydrogenases: Widespread structural and functional diversity within a shared framework. Protein Science, vol. 2, pp. 1890 - 1900.

Hempel, J., Perozich, J., Rose, J., Liu, Z.J., Lindahl, R., Wang, B.C. 1997. Aldehyde Dehydrogenase: Catalytic Mechanism and Structural Model for the Entire Family. FEBS Letters (submitted).

Henikoff S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. vol. 89, pp. 10915 - 10919.

Hirosawa, M., Totoki, Y, Hoshida, M., and Ishikawa, M. 1995. Compreshensive study on iterative algorithms of multiple sequence alignment. Comput. Appl. Biosci., vol. 11, pp. 13 - 18.

Holland, J.H. (ed.), Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.

Lipman, D.J., Altschul, S.F., and Kececioglu, J.D. 1989. A tool for multiple sequence alignment, Proc. Natl. Acad. Sci. USA., vol. 86, pp. 4412 - 4415.

Liu, Z.J., Hempel, J., Sun, J. 1997. The First Structure of an Aldehyde Dehydrogenase Reveals Novel Interactions between NAD and the Rossmann Fold. Nature Struct. Biol., vol. 4, pp. 317 - 326.

McClain, W.H. and Nicholas, H.B. Jr. 1987. Discrimination between transfer RNA molecules. J. Mol. Biol., vol. 194, pp 635 - 642.

McClure, M.A., Vasi, T.K., and Fitch, W.M. 1994. Comparative analysis of multiple protein-sequence alignment methods. Mol. Biol. Evol. vol. 11, pp. 571 - 592.

Nicholas, H.B. Jr., and Graves, S.B. 1983. Clustering of transfer RNA by cell type and amino acid specificity. J. Mol. Biol., vol. 171, pp. 111 - 118.

Nicholas, H.B. Jr. and McClain, W.H. 1987. An algorithm for discriminating transfer RNA sequences. Computer Applications in the Biosciences, vol. 3, pp. 177 - 181.

Nicholas, H.B. Jr., Chen, Y-M., and McClain, W.H. 1987. Comparisons of transfer RNA sequences. Computer Applications in the Biosciences, vol. 3, p. 53.

Nicholas, H.B. Jr., Persson, B., Jornvall, H. and Hempel, J. 1995. Ethanol utilization regulatory protein: Profile alignments give no evidence of origin through aldehyde and alcohol dehydrogenase gene fusion. Prot. Sci. vol. 4, pp. 2621 - 2624.

Nicholas, H.B. Jr., Ropelewski, A.J., Deerfield, D.W. II., and Behrmann, J.G. 1995b. A Metric Measure for Comparing Sequence Alignments. Proceedings of the 10th International Conference on Methods in Protein Structure, Eds. M.Z. Atassi and E. Appella, Plenum Press, New York pp. 515 - 525.

Nicholas, H.B. Jr. and Rosenquist, G.L. 1997. A position specific scoring matrix for identifying sites of protein tyrosine sulfation. Manuscript in preparation.

Notredame, C. and Higgins, D.G. 1996. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res ., vol. 24, pp. 1515 - 1524.

Rosenquist, G.L. and Nicholas, H.B. Jr. 1993. Analysis of Sequence Requirements for Protein Tyrosine Sulfation. Protein Science, vol. 2, pp.215 - 222.

Sankoff, D. and Cedegren, R.J. 1983. Simultaneous comparisons of three or more sequences related by a tree. In "Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison." D. Sankoff and J.B. Kruskal (eds.) pp. 253 - 263. Addison-Wesley, Reading, MA

Schmidt, F. J., Cho, B., and Nicholas, H.B., Jr. 1996. RNA Libraries and RNA Recognition, Ann. NY Acad Sci. vol. 782, pp. 526 - 533.

Sheikh, S., Ni, L., Weiner, H. 1997. Mutation of the conserved amino acids of mitochondria aldehyde dehydorgenase. Role of the conserved residues in the mechanism of reaction. Adv. Exp. Med. Biol., vol. 414, pp. 195 - 200.

States, D.J., Gish, W., and Altschul, S.F. 1991. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods, vol. 3, pp. 66 - 70.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, vol. 22, pp. 4673 - 4680.

Zuckerkandl, E. and Pauling, L. 1965 Molecules as documents of evolutionary history. J. Theor. Biol. vol. 8, pp. 357 - 366.