GeneDoc: Analysis and Visualization of Genetic Variation

Karl B. Nicholas1, Hugh B. Nicholas Jr.2, and David W. Deerfield II2

1 Bank of America, 315 Montgomery Street; San Francisco, CA, 94127.

2 Pittsburgh Supercomputing Center; 4400 Fifth Avenue; Pittsburgh, PA, 15213.




GeneDoc provides tools for visualizing, editing, and analyzing multiple sequence alignments of protein and nucleic acid sequences. GeneDoc embeds these tools in an explicitly evolutionary context. This context is most directly expressed as the ability to divide the sequences into groups that reflect the division of superfamilies of genes (and proteins) into distinct families. GeneDoc can analyze and visualize these groups either separately or together. Groups can also be contrasted with one another. GeneDoc's analysis capabilities include statistical tools that allow users to evaluate explicit biological or evolutionary hypotheses expressed in terms of specific groupings of sequences (Nicholas and Graves, 1983; Nicholas and McClain, 1995). The visualization tools are strongly integrated with the analysis tools and present the analysis results in a form that is easy to comprehend and to use in presentations. GeneDoc provides an evolutionary context for alignment editing by evaluating changes to the alignment in terms of explicit evolutionary models. GeneDoc's analysis functions help users discover which sequence residues are important in the structural and functional roles carried out by biological macromolecules.

Editing Tools

GeneDoc's alignment editing features help overcome the current limitations in multiple sequence alignment programs (Nicholas et al., 1995; McClure et al., 1994). Editing can incorporate structural or biochemical information about which residues should be aligned. GeneDoc's alignment scores are based on the accumulated knowledge of evolutionary processes incorporated in the empirical log-odds scoring matrices. GeneDoc provides such matrices for both protein and nucleic acid sequences (Dayhoff et al., 1978; Henikoff and Henikoff, 1992; States et al., 1991; Altschul, 1991). Scores are an objective measure of whether or not specific changes are justified for a given degree of divergence.

GeneDoc offers two different ways to compute a score for any section of your alignment. The first is sum-of-pairs scoring, which scores the alignment between each independent pair of sequences and adds these scores together to yield the total alignment score. While sum-of-pairs scoring is less than ideal, it results in alignments that are closer to those produced by superposition of three dimensional structures than are alignments produced by the heuristic methods. The second is weighted parsimony scoring, an alignment criterion that is more biologically desirable but imposes higher computational requirements (Sankoff and Cedergren, 1983). Weighted parsimony results in the alignment that is most congruent with a user specified phylogenetic tree relating the sequences. Phylogenetic trees for use with weighted parsimony scoring can be imported from either Phylip or Nexus style tree files, or can be built with the graphical tree building interface in GeneDoc. The tree can also be edited in this interface.

GeneDoc has two editing modes that are kept separate from each other to prevent unintended changes in the separate aspects of the alignment. The first mode is alignment editing mode. In this mode, characters in one sequence are moved relative to characters in the other sequences. The overall lengths of the sequences may be changed by either adding or removing gap characters. Gap characters may be added or removed in three ways: in the sequence currently marked by the cursor; in all of the sequences except the one marked by the cursor; or in all of the sequences. "Grab and drag" arrangement allows sequence residues to be moved without necessarily changing the number of gap characters in the sequence. The second mode is residue editing mode, in which the sequence residues may be changed from one value to another. This includes changing one sequence character to another and changing gap characters into sequence characters or vice versa. However, no operation that would change the sum of the sequence characters and gap characters is allowed in this mode.
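The sum-of-pairs scoring described above can be illustrated with a minimal sketch. It is not GeneDoc's code: the scoring function toy_score, its match, mismatch, and gap values, and the function names are hypothetical stand-ins for a lookup in an empirical log-odds matrix, chosen only to keep the example small.

```python
from itertools import combinations

GAP = "-"


def toy_score(a, b):
    """Hypothetical stand-in for a log-odds matrix lookup (values are assumed)."""
    if a == GAP and b == GAP:
        return 0       # positions where both rows hold a gap contribute nothing
    if a == GAP or b == GAP:
        return -4      # assumed flat penalty for a residue aligned with a gap
    return 2 if a == b else -1


def pairwise_score(seq_a, seq_b, score=toy_score):
    """Score the alignment between two rows of a multiple alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(score(a, b) for a, b in zip(seq_a, seq_b))


def sum_of_pairs_score(alignment, start=0, end=None, score=toy_score):
    """Add the pairwise scores of every pair of rows over columns start..end."""
    rows = [row[start:end] for row in alignment]
    return sum(pairwise_score(a, b, score) for a, b in combinations(rows, 2))


alignment = ["ACG-T", "ACGAT", "AC-AT"]
print(sum_of_pairs_score(alignment))        # whole alignment
print(sum_of_pairs_score(alignment, 0, 2))  # first two columns only
```

The start and end arguments mirror the idea of scoring any section of an alignment. Replacing toy_score with a lookup into a real matrix would be the natural next step.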


GeneDoc's visualization capabilities are built around two residue display modes and six shading modes. The two residue display modes are to display all residues and to display only those residues that differ from the master sequence. The master sequence is either the consensus sequence or the first sequence, taken over the whole alignment or over a group within the alignment. These two residue display modes can be combined with any of the six shading modes.

Three of the shading modes are visual displays of widely used analyses of multiple sequence alignments. Conservation mode produces a display that highlights alignment columns that show from 1 to 4 user defined levels of conservation. Quantify mode highlights the 1, 2, or 3 most frequent residues found in each column of the alignment, which focuses attention on the sequence positions that have evolved with a similar pattern of differentiation even though the actual residues at the position may differ. In both conservation and quantify mode the user sets the colors used for the highlighting and determines whether or not to treat conservative substitutions as if they are identical (e.g., I, L, V, M). Physiochemical properties mode analyzes each alignment position in terms of a hierarchical set of amino acid properties similar to those proposed by Dickerson and Geis (1969), and each position is shaded to identify the most exclusive set to which all of the amino acids at that position can be assigned.

The other three shading modes also highlight alignment positions according to an analysis. However, the analysis is either largely (property shading mode) or entirely (structure and manual shading modes) under the control of the user. The property shading mode allows the user to divide the possible sequence residues into an arbitrary number of sets, each assigned its own coloring scheme. The colors can then be applied to those columns where the property identified with the set is conserved, or they can be applied to every residue in the alignment.

The structure shading mode allows users to define an arbitrary number of states that the sequence residues may inhabit and assign colors to each state. Users can import information about protein secondary structure or RNA folding and color specific residues in a particular sequence, a group of sequences, or the entire alignment according to that structural information. GeneDoc has provisions for importing state information from the Protein Structure database (PSdb) (Deerfield and Geigel, 1996) and from DSSP (Kabsch and Sander, 1983), both of which are derived from Brookhaven PDB files. State information may also be imported from many of the structure prediction programs on the EMBL server, as user defined values, or from the reformatted version of the 3D_ALI database (Pascarella and Argos, 1992) available on the GeneDoc web site. User defined values require a file that assigns the residues of a specific sequence to states defined in a file of user created state definitions. The residues in the specific sequence will be highlighted in the corresponding color. This shading may be extended to the other sequences in the alignment or only to those in the same group as the original sequence. It is possible to shade every sequence in the alignment individually in this manner.

Manual shading allows the user to assign specific colors to individual residues with point and click ease.
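The residue display mode that shows only differences from a master sequence, described above, can be sketched as follows. This is an illustration rather than GeneDoc's implementation: the function names, the use of "." to mark residues identical to the master, and the tie-breaking rule in the consensus are all assumptions.

```python
from collections import Counter


def consensus(alignment):
    """Most frequent character in each column; ties go to the first one seen."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*alignment)
    )


def difference_display(alignment, master, same="."):
    """Replace every residue that matches the master sequence with `same`."""
    return [
        "".join(same if residue == m else residue for residue, m in zip(row, master))
        for row in alignment
    ]


alignment = ["MKVLA", "MKILA", "MRVLS"]

# Master sequence taken as the consensus of the whole alignment.
for row in difference_display(alignment, consensus(alignment)):
    print(row)

# Master sequence taken as the first sequence instead.
for row in difference_display(alignment, alignment[0]):
    print(row)
```

Passing only the rows of one group to consensus would give the group-level master sequence mentioned in the text. This sketch counts gap characters like any other character, which a real consensus calculation may not do.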


Many of GeneDoc's analyses are Kolmogorov-Smirnov (K-S) analyses of pairs of cumulative distribution functions (Sokal and Rohlf, 1995). K-S analyses provide a rigorous assessment of whether two distributions are different. The difference can be either in the location or in the shape of the distributions. Thus, K-S tests are more broadly based than more common tests like Student's t test or the F test. The K-S tests use distributions of alignment scores or comparisons of sequences in terms of the percentage of identities between a pair of aligned sequences. Probably the most useful test is the analysis of whether the scores for pairs of sequences within the same group are smaller than the scores for pairs of sequences that are in different groups. A positive result for this test indicates that the grouping categories are systematically reflected in the sequences (Nicholas and Graves, 1983; Nicholas and McClain, 1995).

There are two types of contrast analysis that contrast the sequences within one group with those in the other groups on a position by position basis. The PCR contrast highlights sites that meet two criteria. First, a single residue is completely conserved within the group. Second, this conserved residue does not appear, at that position, in any sequence outside of the group in which it is conserved. The group contrast analysis is less restrictive within the group than is the PCR contrast analysis. In the group contrast analysis, all of the sequence residues at a site within the group are required to have a positive similarity score with each other. Residues outside of the group must have a negative similarity score with every residue from within the group.
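The comparison of within-group and between-group pairs described above can be illustrated with a small sketch that computes pairwise percent identities and the two-sample K-S statistic D, the largest vertical distance between the two empirical cumulative distribution functions. The function names, the choice to skip columns containing a gap, and the use of the two-sided D are assumptions made for illustration. The paper does not state the exact form GeneDoc uses, and the sketch stops short of computing a significance level.

```python
from bisect import bisect_right
from itertools import combinations

GAP = "-"


def percent_identity(seq_a, seq_b):
    """Percent identical residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def split_by_group(alignment, groups):
    """Return (within-group, between-group) lists of pairwise identities."""
    within, between = [], []
    for name_a, name_b in combinations(alignment, 2):
        value = percent_identity(alignment[name_a], alignment[name_b])
        target = within if groups[name_a] == groups[name_b] else between
        target.append(value)
    return within, between


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in set(a) | set(b)
    )


# Hypothetical toy alignment with two groups of two sequences each.
alignment = {
    "a1": "ACGTACGT",
    "a2": "ACGTACGA",
    "b1": "TCGAACCT",
    "b2": "TCGAACCA",
}
groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

within, between = split_by_group(alignment, groups)
print(within, between, ks_statistic(within, between))
```

A value of D near 1 means the two distributions barely overlap, and a value near 0 means they are nearly indistinguishable. Turning D into a probability requires the K-S distribution for the two sample sizes.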


GeneDoc uses GCG's msf file format as its primary file type, using the header region to store information about residue display and shading modes along with a large number of user configuration choices. In addition to the msf files, sequences may be read from or written to Clustal W aln files, Pearson FASTA files, and PIR formatted files. Aligned sequences can also be written to Phylip interleaved files. Graphic results can be sent to the printer or to a Postscript file by using an appropriate printer driver. Highlighted results can also be exported as Windows Enhanced Meta Files or as Macintosh style PICT files.


GeneDoc is a full featured multiple sequence alignment visualization, editing, and analysis tool. It has an easy-to-use point and click user interface with extensive keyboard mapping for advanced users. In addition to the features described above, there are many more features and additional details in the extensive context sensitive help files that come with the program.

Figure 1 shows an alignment of 13 phospholipases A2. The shading for each sequence indicates the secondary structure state of each residue, as derived from the three dimensional coordinates taken from the Brookhaven PDB file that is used to label the sequence. The secondary structure states were computed using the four state PSdb model (Deerfield and Geigel, 1996). The alignment and PSdb files used to create the figure are available on the GeneDoc web site.

GeneDoc version 2.1 runs on any IBM compatible personal computer under Windows 3.1, Windows 95, or Windows NT. It can be obtained at no cost over the World Wide Web. Thanks to Russell Malmberg, a version that runs on DEC Alpha workstations under Windows NT is also available.

GeneDoc has benefited from the comments, suggestions, and error reports of a number of early users.