To make robust phylogenetic inferences, data from several unlinked sources are often required. Commonly, researchers evaluate DNA (or amino acid) sequences from a number of different regions and/or anatomical, morphological, developmental, biochemical, or behavioral characteristics. To conduct phylogenetic analyses of aligned sequences, a researcher must concatenate sequence files (typically in FASTA format) into a single matrix file formatted specifically for the analysis package used. Binary characters, representing inferred insertion/deletion (indel) events, are often appended to the matrix along with non-molecular data.
Indel events are usually incorporated in phylogenetic matrices using the “simple indel coding” algorithm (Simmons and Ochoterena, 2000). The algorithm creates a character for each unique combination of 5′ and 3′ indel termini in an alignment (5′ termini must be preceded by a nucleotide/amino acid sequence and 3′ termini must be followed by a nucleotide/amino acid sequence). For each character, each sequence is assigned a state based on what is contained between the indel termini: (0) nucleotide/amino acid sequence and/or an indel with termini that do not extend up to or beyond both the 5′ and 3′ indel termini; (1) an indel with the exact same combination of termini; (- [inapplicable]) an indel that extends up to or beyond both the 5′ and 3′ indel termini; or (? [missing]) the sequence begins after the 5′ indel terminus or ends before the 3′ indel terminus. Several software implementations are available: gapcode (part of NEXUS Class Library; Lewis, 2003), GapCoder (Young and Healy, 2003; no longer publicly distributed), 2xread (Little, 2005), and SeqState (Müller, 2006). Although useful, these implementations cannot simultaneously code indel events and concatenate data sets, nor can they process sequences along with non-molecular data sets in a straightforward manner. Therefore, we created a program that can code indels, concatenate DNA and amino acid sequences, incorporate non-molecular data, and produce output files compatible with the most widely used analysis programs.
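The state assignments described above can be illustrated with a minimal Python sketch. It is not the 2matrix source code (2matrix is written in Perl); the function names, the 0-based inclusive column coordinates, and the toy alignment are assumptions made for illustration only.

```python
import re


def internal_gaps(seq):
    """Return (first, last, gaps) for one aligned sequence.

    first/last are the columns of the first and last residue; gaps lists
    (start, end) columns (0-based, inclusive) of gap runs that are flanked
    by residues on both sides. Leading and trailing gap runs are excluded.
    """
    first = len(seq) - len(seq.lstrip("-"))
    last = len(seq.rstrip("-")) - 1
    gaps = [
        (m.start(), m.end() - 1)
        for m in re.finditer(r"-+", seq)
        if m.start() > first and m.end() - 1 < last
    ]
    return first, last, gaps


def simple_indel_coding(alignment):
    """Code one character per unique pair of indel termini in the alignment."""
    info = {name: internal_gaps(seq) for name, seq in alignment.items()}
    characters = sorted({gap for _, _, gaps in info.values() for gap in gaps})
    coded = {}
    for name, (first, last, gaps) in info.items():
        states = []
        for start, end in characters:
            if first > start or last < end:
                states.append("?")  # sequence begins after 5' or ends before 3' terminus
            elif (start, end) in gaps:
                states.append("1")  # indel with exactly these termini
            elif any(gs <= start and ge >= end for gs, ge in gaps):
                states.append("-")  # longer indel spanning both termini
            else:
                states.append("0")  # sequence and/or a non-spanning indel
        coded[name] = "".join(states)
    return characters, coded


# Toy alignment (hypothetical data).
alignment = {
    "A": "ACGTACGTAC",
    "B": "AC--ACGTAC",
    "C": "AC----GTAC",
    "D": "----ACGTAC",
}
characters, coded = simple_indel_coding(alignment)
for name, states in coded.items():
    print(name, states)
```

In this toy alignment, two characters are created (columns 2-3 and 2-5). Sequence C is scored as inapplicable for the shorter indel because its own indel spans it, and sequence D is scored as missing for both because it begins after their 5′ termini.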
METHODS AND RESULTS
2matrix is an open-source Perl (5.10) script that concatenates and translates phylogenetic data sets into a variety of useful file formats. It can be executed on any operating system that has a Perl interpreter (e.g., Linux, Mac OS X, and Windows). Perl interpreters are installed, by default, on Linux and Mac OS X. Users of Windows must install a Perl distribution; available Perl distributions and installation instructions can be found at the Perl Programming Language web site (http://www.perl.org/get.html). Once installed, Perl can be accessed by the user via a terminal window.
2matrix accepts DNA and amino acid sequence alignments in FASTA format and non-molecular data in xread or comma-separated value (csv) formats. FASTA is the most widely used format for sequence alignments and is output by most alignment programs. Non-molecular data are often compiled using specialized software (e.g., WinClada, Mesquite) that can export xread files or, in some cases, spreadsheet programs that can export csv files. The csv files accepted by 2matrix must be consistently organized: the first row contains character names; the second row describes character state additivity; the first column contains taxon names; the remaining cells contain the scores of a single character for a given taxon (polymorphic entries can be accommodated). Sample files and detailed information on file formats are provided with the program distribution (Fig. 1).
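The csv layout described above can be read with a short Python sketch such as the following. It only illustrates the row and column arrangement stated in the text; the additivity tokens ("additive", "nonadditive"), the character and taxon names, and the function name are hypothetical, and the exact tokens and polymorphism notation expected by 2matrix should be taken from the sample files in the distribution.

```python
import csv
import io


def read_character_csv(handle):
    """Split a character matrix into names, additivity flags, and scores.

    Layout assumed from the text: row 1 holds character names, row 2 holds
    additivity, column 1 holds taxon names, remaining cells hold scores.
    Cells are kept as raw strings; nothing is interpreted here.
    """
    rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    header, additivity, *taxon_rows = rows
    names = header[1:]
    additive = additivity[1:]
    scores = {row[0]: row[1:] for row in taxon_rows}
    for taxon, cells in scores.items():
        if len(cells) != len(names):
            raise ValueError(f"{taxon}: expected {len(names)} scores, found {len(cells)}")
    return names, additive, scores


# Hypothetical two-character, two-taxon matrix.
example = """,habit,leaf_margin
,additive,nonadditive
Taxon_A,0,1
Taxon_B,2,?
"""
names, additive, scores = read_character_csv(io.StringIO(example))
print(names)
print(additive)
print(scores)
```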
By default, 2matrix implements the “simple indel coding” algorithm (Simmons and Ochoterena, 2000) to create binary characters that describe indel size and distribution throughout each sequence alignment. Optionally, users can prevent indel coding (“-d”), but still concatenate and/or reformat matrices. Nucleotide and amino acid positions in xread and NEXUS output files can, optionally, be named (“-s”) with a stem phrase (one per partition). This facilitates post-analysis data interpretation—particularly if indels have been coded. All 2matrix command-line options are summarized in Table 1.
The output of 2matrix is compatible with popular phylogenetic programs: Garli (NEXUS sensu Garli; Zwickl, 2006), RAxML (extended PHYLIP; Stamatakis, 2006), TNT (xread; Goloboff et al., 2008), and MrBayes (NEXUS sensu MrBayes; Ronquist et al., 2012). RAxML and Garli require additional configuration files to read partitioned data sets; 2matrix outputs these files using default settings. Users should tailor these configuration files to suit their data and analytic needs. NEXUS files formatted specifically for MrBayes and Garli are output by 2matrix when the NEXUS option is selected (“-o n”). The NEXUS file format (Maddison et al., 1997) is not fully or consistently implemented in most programs that use it. As a result, a NEXUS file that can be read correctly by all programs cannot be created. 2matrix targets MrBayes and Garli because of their current popularity. Unfortunately, this comes at the cost of compatibility with other NEXUS-utilizing programs. With slight manual modification, the MrBayes and Garli NEXUS files can be made compatible with other NEXUS-utilizing programs.
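As a rough illustration of what concatenation and partitioned output involve, the Python sketch below joins two aligned partitions, writes a relaxed PHYLIP matrix, and writes a RAxML-style partition file. It is not the 2matrix implementation: the function names and the partition names (rbcL, matK) are hypothetical, and filling taxa that are absent from a partition with "?" is an assumption about how missing data would be handled.

```python
def concatenate(partitions):
    """Join aligned partitions, given as a list of (name, {taxon: sequence})."""
    taxa = sorted({taxon for _, aln in partitions for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    ranges = []
    position = 1  # partition files use 1-based, inclusive coordinates
    for name, aln in partitions:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"partition {name!r} is not aligned")
        length = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon missing from a partition is filled with '?'.
            pieces[taxon].append(aln.get(taxon, "?" * length))
        ranges.append((name, position, position + length - 1))
        position += length
    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, ranges


def to_phylip(matrix):
    """Relaxed PHYLIP: a header line, then one 'name sequence' line per taxon."""
    nchar = len(next(iter(matrix.values())))
    lines = [f"{len(matrix)} {nchar}"]
    lines += [f"{taxon} {seq}" for taxon, seq in matrix.items()]
    return "\n".join(lines) + "\n"


def to_raxml_partitions(ranges, datatype="DNA"):
    """One 'TYPE, name = start-end' line per partition."""
    return "".join(f"{datatype}, {name} = {start}-{end}\n" for name, start, end in ranges)


# Hypothetical partitions with incomplete taxon sampling.
partitions = [
    ("rbcL", {"A": "ACGT", "B": "AC-T"}),
    ("matK", {"A": "GG", "C": "GA"}),
]
matrix, ranges = concatenate(partitions)
phylip_text = to_phylip(matrix)
partition_text = to_raxml_partitions(ranges)
print(phylip_text, end="")
print(partition_text, end="")
```

Recording each partition's column range while concatenating is what allows a partition file to be produced in the same pass as the matrix.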
The 2matrix distribution includes morphological data (csv and xread format) and sequences for three molecular markers (FASTA files) reconstructed from an analysis of basal angiosperms (Doyle and Endress, 2000). To recreate the combined matrix in TNT format, the user should issue the following command from within a terminal window (assuming that all the files are in the user's current directory; users of Windows should omit the “./” preceding the command):
To output the same matrix in NEXUS format, the user should replace “-o x” with “-o n” in the command. If the user wishes to add coded indels to the matrix using the “simple indel coding” algorithm, the “-d” option should be omitted (indels were not coded in the original analysis).
Table 1. Command-line options available in 2matrix.
In addition to the instructions included in the 2matrix distribution's README file (https://github.com/nrsalinas/2matrix/blob/master/README), a complete description of all available options can be viewed by invoking 2matrix without any of the required options (“./2matrix.pl” on Linux and Mac OS X, “2matrix.pl” on Windows; Table 1).
CONCLUSIONS
2matrix is hosted on GitHub (https://github.com/nrsalinas/2matrix) and available for free download (https://github.com/nrsalinas/2matrix/archive/master.zip; this is a direct link to a download of the complete 2matrix distribution) under the General Public License (GPL). It is capable of coding indel events, concatenating sequences, incorporating non-molecular data into matrices, and producing output formatted specifically for the popular analytic programs Garli, MrBayes, RAxML, and TNT. In addition, 2matrix can be used within shell scripts and analysis pipelines. No matter how one chooses to use 2matrix, it is vastly more efficient than manually coding indels and/or concatenating matrices.