2matrix: A Utility for Indel Coding and Phylogenetic Matrix Concatenation
7 January 2014

To make robust phylogenetic inferences, data from several unlinked sources are often required. Commonly, researchers evaluate DNA (or amino acid) sequences from a number of different regions and/or anatomical, morphological, developmental, biochemical, or behavioral characteristics. To conduct phylogenetic analyses of aligned sequences, a researcher must concatenate sequence files (typically in FASTA format) into a single matrix file formatted specifically for the analysis package used. Binary characters, representing inferred insertion/deletion (indel) events, are often appended to the matrix along with non-molecular data.

Indel events are usually incorporated in phylogenetic matrices using the “simple indel coding” algorithm (Simmons and Ochoterena, 2000). The algorithm creates a character for each unique combination of 5′ and 3′ indel termini in an alignment (5′ termini must be preceded by a nucleotide/amino acid sequence and 3′ termini must be followed by a nucleotide/amino acid sequence). For each character, each sequence is assigned a state based on what is contained between the indel termini: (0) nucleotide/amino acid sequence and/or an indel with termini that do not extend up to or beyond both the 5′ and 3′ indel termini; (1) an indel with the exact same combination of termini; (- [inapplicable]) an indel that extends up to or beyond both the 5′ and 3′ indel termini; or (? [missing]) the sequence begins after the 5′ indel terminus or ends before the 3′ indel terminus. Several software implementations are available: gapcode (part of NEXUS Class Library; Lewis, 2003), GapCoder (Young and Healy, 2003; no longer publicly distributed), 2xread (Little, 2005), and SeqState (Müller, 2006). Although useful, these implementations cannot simultaneously code indel events and concatenate data sets, nor can they process sequences along with non-molecular data sets in a straightforward manner. Therefore, we created a program that can code indels, concatenate DNA and amino acid sequences, incorporate non-molecular data, and produce output files compatible with the most widely used analysis programs.
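
The state assignments described above can be illustrated with a minimal Python sketch. It is not the 2matrix source code (2matrix is a Perl script); the function names `internal_gaps` and `simple_indel_coding`, the toy alignment, and the decision to treat only "-" as a gap symbol are assumptions made for illustration.

```python
import re


def internal_gaps(seq):
    """Return (first, last, gaps) for one aligned sequence.

    first/last are the columns of the first and last residue; gaps lists
    the (start, end) columns, inclusive, of every gap run between them.
    Leading and trailing gap runs are deliberately excluded.
    """
    residues = [i for i, c in enumerate(seq) if c != "-"]
    if not residues:
        return None, None, []
    first, last = residues[0], residues[-1]
    core = seq[first:last + 1]
    gaps = [(m.start() + first, m.end() + first - 1)
            for m in re.finditer(r"-+", core)]
    return first, last, gaps


def simple_indel_coding(alignment):
    """Code indels as described in the text (sketch, assumed helper).

    alignment maps taxon name -> aligned sequence (equal lengths).
    Returns the sorted list of indel characters and the coded states.
    """
    info = {name: internal_gaps(seq) for name, seq in alignment.items()}
    # One character per unique combination of 5' and 3' termini.
    characters = sorted({g for _, _, gaps in info.values() for g in gaps})
    coded = {}
    for name, (first, last, gaps) in info.items():
        states = []
        for start, end in characters:
            if first is None or first > start or last < end:
                states.append("?")   # sequence begins after / ends before
            elif (start, end) in gaps:
                states.append("1")   # identical termini
            elif any(s <= start and e >= end for s, e in gaps):
                states.append("-")   # longer indel spanning both termini
            else:
                states.append("0")   # sequence or a non-spanning indel
        coded[name] = "".join(states)
    return characters, coded


toy = {
    "A": "ACGTACGTAC",
    "B": "AC--ACGTAC",
    "C": "AC----GTAC",
    "D": "----ACGTAC",
    "E": "ACGTA--TAC",
}
characters, coded = simple_indel_coding(toy)
for taxon in toy:
    print(taxon, coded[taxon])
```

In this toy alignment three indel characters are created, with termini at columns 2 to 3, 2 to 5, and 5 to 6 (zero-based). Taxon C receives the inapplicable state for the first character because its longer indel spans both termini, and taxon D receives the missing state for the first two characters because its sequence begins after their 5′ termini.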

METHODS AND RESULTS

2matrix is an open-source Perl (5.10) script that concatenates and translates phylogenetic data sets into a variety of useful file formats. It can be executed on any operating system that has a Perl interpreter (e.g., Linux, Mac OS X, and Windows). Perl interpreters are installed, by default, on Linux and Mac OS X. Users of Windows must install a Perl distribution; available distributions and installation instructions can be found at the Perl Programming Language web site (http://www.perl.org/get.html). Once installed, Perl can be accessed by the user via a terminal window.

2matrix accepts DNA and amino acid sequence alignments in FASTA format and non-molecular data in xread or comma-separated value (csv) formats. FASTA is the most widely used format for sequence alignments and is output by most alignment programs. Non-molecular data are often compiled using specialized software (e.g., WinClada, Mesquite) that can export xread files or, in some cases, spreadsheet programs that can export csv files. The csv files accepted by 2matrix must be consistently organized: the first row contains character names; the second row describes character state additivity; the first column contains taxon names; the remaining cells contain the scores of a single character for a given taxon (polymorphic entries can be accommodated). Sample files and detailed information on file formats are provided with the program distribution (Fig. 1).
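
To make the csv layout concrete, the following Python sketch reads a file organized as described above. It is an illustration only, not the parser used by 2matrix: the function name `read_character_csv`, the sample taxa and characters, and in particular the tokens in the additivity row ("unordered", "0 1 2") are hypothetical, because the exact additivity syntax is documented in the program distribution rather than here. Empty cells are treated as missing, which is also an assumption.

```python
import csv
import io


def read_character_csv(handle):
    """Parse a character matrix laid out as described in the text.

    Row 1: character names; row 2: additivity; column 1: taxon names.
    Returns (names, additivity, matrix), where matrix maps each taxon to
    a list of scores: a tuple of states (more than one if polymorphic)
    or None when the cell is missing ("?", "-", or empty).
    """
    rows = [row for row in csv.reader(handle)
            if any(cell.strip() for cell in row)]
    names = [cell.strip() for cell in rows[0][1:]]
    additivity = [cell.strip() for cell in rows[1][1:]]
    matrix = {}
    for row in rows[2:]:
        taxon = row[0].strip()
        scores = []
        for cell in row[1:]:
            states = cell.split()          # polymorphic scores: "0 1"
            if not states or states[0] in ("?", "-"):
                scores.append(None)
            else:
                scores.append(tuple(states))
        if len(scores) != len(names):
            raise ValueError(f"{taxon}: expected {len(names)} scores")
        matrix[taxon] = scores
    return names, additivity, matrix


# Hypothetical sample data; the additivity tokens are placeholders.
sample = """\
,leaf_shape,petal_number
,unordered,0 1 2
Taxon_A,0,2
Taxon_B,0 1,?
Taxon_C,1,-
"""
names, additivity, matrix = read_character_csv(io.StringIO(sample))
print(names)
print(additivity)
print(matrix["Taxon_B"])
```

Keeping each score as a tuple of states means that polymorphic and monomorphic entries can be handled uniformly when the matrix is later written out in another format.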

By default, 2matrix implements the “simple indel coding” algorithm (Simmons and Ochoterena, 2000) to create binary characters that describe indel size and distribution throughout each sequence alignment. Optionally, users can prevent indel coding (“-d”), but still concatenate and/or reformat matrices. Nucleotide and amino acid positions in xread and NEXUS output files can, optionally, be named (“-s”) with a stem phrase (one per partition). This facilitates post-analysis data interpretation, particularly if indels have been coded. All 2matrix command-line options are summarized in Table 1.

Fig. 1.

Example data matrix in csv format. The first column contains taxon names. The remaining columns are used for individual characters. The first row contains character names, the second row indicates additivity (the order of the additive states must be given; non-additive/unordered characters must be indicated), and remaining rows contain taxon scores. Polymorphic scores are separated by spaces. Missing data are indicated by question marks or dashes.


The output of 2matrix is compatible with popular phylogenetic programs: Garli (NEXUS sensu Garli; Zwickl, 2006), RAxML (extended PHYLIP; Stamatakis, 2006), TNT (xread; Goloboff et al., 2008), and MrBayes (NEXUS sensu MrBayes; Ronquist et al., 2012). RAxML and Garli require additional configuration files to read partitioned data sets; 2matrix outputs these files using default settings. Users should tailor these configuration files to suit their data and analytic needs. NEXUS files formatted specifically for MrBayes and Garli are output by 2matrix when the NEXUS option is selected (“-o n”). The NEXUS file format (Maddison et al., 1997) is not fully or consistently implemented in most programs that use it. As a result, a NEXUS file that can be read correctly by all programs cannot be created. 2matrix outputs NEXUS files compatible with MrBayes and Garli because of their current popularity. Unfortunately, this comes at the cost of compatibility with other NEXUS-utilizing programs. With slight manual modification, the MrBayes and Garli NEXUS files can be made compatible with sundry NEXUS-utilizing programs.
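
The concatenation step that underlies partitioned output can be sketched in Python as follows. This is a simplified illustration and does not reproduce the files that 2matrix writes: the function names, the marker and taxon names, and the choice to fill taxa absent from a partition with "?" are all assumptions. The partition lines follow the "DNA, name = start-end" convention used in RAxML partition files, and the matrix is written in a relaxed PHYLIP layout.

```python
def concatenate(partitions):
    """Join aligned partitions into one matrix.

    partitions is a list of (label, {taxon: aligned sequence}).
    Returns (matrix, ranges): matrix maps taxon -> concatenated sequence,
    ranges lists (label, first_column, last_column), one-based.
    Taxa missing from a partition are padded with "?" (assumption).
    """
    taxa = sorted({taxon for _, aln in partitions for taxon in aln})
    matrix = {taxon: "" for taxon in taxa}
    ranges = []
    position = 1
    for label, aln in partitions:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{label}: sequences differ in length")
        length = lengths.pop()
        for taxon in taxa:
            matrix[taxon] += aln.get(taxon, "?" * length)
        ranges.append((label, position, position + length - 1))
        position += length
    return matrix, ranges


def to_phylip(matrix):
    """Write a relaxed PHYLIP matrix: header line, then name and data."""
    nchar = len(next(iter(matrix.values())))
    lines = [f"{len(matrix)} {nchar}"]
    lines += [f"{taxon} {seq}" for taxon, seq in matrix.items()]
    return "\n".join(lines) + "\n"


def partition_lines(ranges, datatype="DNA"):
    """Format column ranges in the RAxML partition-file convention."""
    return [f"{datatype}, {label} = {start}-{end}"
            for label, start, end in ranges]


# Hypothetical two-marker data set with incomplete taxon sampling.
parts = [
    ("marker1", {"Taxon_A": "ACGT", "Taxon_B": "AC-T"}),
    ("marker2", {"Taxon_A": "GGA", "Taxon_C": "GTA"}),
]
matrix, ranges = concatenate(parts)
print(to_phylip(matrix), end="")
print("\n".join(partition_lines(ranges)))
```

Recording the column range of each partition during concatenation is what allows a partition file to be generated alongside the matrix; as noted above, any such file produced with default settings should still be tailored to the data at hand.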

The 2matrix distribution includes morphological data (csv and xread format) and sequences for three molecular markers (FASTA files) reconstructed from an analysis of basal angiosperms (Doyle and Endress, 2000). To recreate the combined matrix in TNT format, the user should issue the following command from within a terminal window (assuming that all the files are in the user's current directory; users of Windows should omit the “./” preceding the command):

e01_01.gif

To output the same matrix in NEXUS format, the user should replace “-o x” with “-o n” in the command. If the user wishes to add coded indels to the matrix using the “simple indel coding” algorithm, the “-d” option should be omitted (indels were not coded in the original analysis).

Table 1.

Command-line options available in 2matrix.

t01_01.gif

In addition to the instructions included in the 2matrix distribution's README file (https://github.com/nrsalinas/2matrix/blob/master/README), a complete description of all available options can be viewed by invoking 2matrix without any of the required options (“./2matrix.pl” on Linux and Mac OS X, “2matrix.pl” on Windows; Table 1).

CONCLUSIONS

2matrix is hosted on GitHub (https://github.com/nrsalinas/2matrix) and available for free download (https://github.com/nrsalinas/2matrix/archive/master.zip; this is a direct link to a download of the complete 2matrix distribution) under the General Public License (GPL). It is capable of coding indel events, concatenating sequences, incorporating non-molecular data into matrices, and producing output formatted specifically for the popular analytic programs Garli, MrBayes, RAxML, and TNT. In addition, 2matrix can be used within shell scripts and analysis pipelines. No matter how one chooses to use 2matrix, it is vastly more efficient than manually coding indels and/or concatenating matrices.

LITERATURE CITED

1. J. A. Doyle and P. K. Endress. 2000. Morphological phylogenetic analysis of basal angiosperms: Comparison and combination with molecular data. International Journal of Plant Sciences 161: S121–S153.

2. P. A. Goloboff, J. S. Farris, and K. C. Nixon. 2008. TNT, a free program for phylogenetic analysis. Cladistics 24: 774–786.

3. P. O. Lewis. 2003. NCL: A C++ class library for interpreting data files in NEXUS format. Bioinformatics (Oxford, England) 19: 2330–2331.

4. D. P. Little. 2005. 2xread: A simple indel coding tool. Available at: http://www.nybg.org/files/scientists/2xread.html [accessed 16 December 2013].

5. D. R. Maddison, D. L. Swofford, and W. P. Maddison. 1997. NEXUS: An extensible file format for systematic information. Systematic Biology 46: 590–621.

6. K. Müller. 2006. Incorporating information from length-mutational events into phylogenetic analysis. Molecular Phylogenetics and Evolution 38: 667–676.

7. F. Ronquist, M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Höhna, B. Larget, et al. 2012. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology 61: 539–542.

8. M. P. Simmons and H. Ochoterena. 2000. Gaps as characters in sequence-based phylogenetic analysis. Systematic Biology 49: 369–381.

9. A. Stamatakis. 2006. RAxML-VI-HPC: Maximum likelihood–based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics (Oxford, England) 22: 2688–2690.

10. N. D. Young and J. Healy. 2003. GapCoder automates the use of indel characters in phylogenetic analysis. BMC Bioinformatics 4: 6.

11. D. J. Zwickl. 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD dissertation, The University of Texas at Austin, Austin, Texas, USA.

Nelson R. Salinas and Damon P. Little "2matrix: A Utility for Indel Coding and Phylogenetic Matrix Concatenation," Applications in Plant Sciences 2(1), (7 January 2014). https://doi.org/10.3732/apps.1300083
Received: 15 October 2013; Accepted: 14 December 2013; Published: 7 January 2014