Major public DNA databases — NCBI GenBank, the DNA DataBank of Japan (DDBJ), and the European Molecular Biology Laboratory (EMBL) — are invaluable biodiversity libraries. Systematists and other biodiversity scientists commonly mine these databases for sequence data to use in phylogenetic studies, but such studies generally use only the taxonomic identity of the sequenced tissue, not the specimen identity. Thus studies that use DNA supermatrices to construct phylogenetic trees with species at the tips typically do not take advantage of the fact that for many individuals in the public DNA databases, several DNA regions have been sampled; and for many species, two or more individuals have been sampled. Thus these studies typically do not make full use of the multigene datasets in public DNA databases to test species coherence and select optimal sequences to represent a species. In this study, we introduce a set of tools developed in the R programming language to construct individual-based trees from NCBI GenBank data and present a set of trees for the genus Carex (Cyperaceae) constructed using these methods. For the more than 770 species for which we found sequence data, our approach recovered an average of 1.85 gene regions per specimen, up to seven for some specimens, and more than 450 species represented by two or more specimens. Depending on the subset of genes analyzed, we found up to 42% of species monophyletic. We introduce a simple tree statistic—the Taxonomic Disparity Index (TDI)—to assist in curating specimen-level datasets and provide code for selecting maximally informative (or, conversely, minimally misleading) sequences as species exemplars. While tailored to the Carex dataset, the approach and code presented in this paper can readily be generalized to constructing individual-level trees from large amounts of data for any species group.
Specimen-level data are at the heart of revisionary taxonomy, but much synthetic work in systematics has focused on development of species-level tools for phylogenetics (e.g. supertree and supermatrix approaches, and gene tree - species tree reconciliation) and monography (e.g. Scratchpads [Smith et al. 2012] and Encyclopedia of Life [Parr et al. 2014]). In the collections community, great strides have been made in databasing, georeferencing, and aggregating specimen data (e.g. BioGeomancer [Guralnick et al. 2006], BRAHMS [ http://herbaria.plants.ox.ac.uk/bol/brahms/], GBIF IPT [Robertson et al. 2014], Symbiota [Gries et al. 2014]), and in tracking specimen duplicates and reconciling annotations across these (e.g. BiSciCol; http://biscicol.blogspot.com/p/home.html). However, the tools needed to extract specimen-level data from major public sequence databases—NCBI GenBank (Benson et al. 2007), the DNA DataBank of Japan (DDBJ), and the European Molecular Biology Laboratory (EMBL)— are not in place. Hereafter we will refer only to NCBI GenBank in this paper, with the understanding that the issues are general across all three databases.
Why aggregate data to the specimen level instead of just focusing on species-level data? There are at least two potential advantages to discriminating specimens in NCBI data. The first is that aggregating data to the specimen level allows us, at least in principle, to test species boundaries under a phylogenetic or genealogical species concept (De Queiroz, 2007; Hausdorf 2011). In the conventional approach to utilizing NCBI data (e.g. Edwards and Smith 2010; Hinchliff and Roalson 2013), which we will refer to in this paper as the aggregate-to-species approach, phylogenetic tips are left at the species level, so the resulting trees offer no information on species boundaries. Related to this is the potential discovery of cryptic taxa, which may lie hidden in NCBI data but undiscovered in downstream analyses that only aggregate data to species. The second advantage is that our approach affords the researcher greater power to detect misidentified specimens and lab errors that would be all but invisible in the aggregate-tospecies approach, without prior expectations about where a given species should fall in the tree.
The genus Carex L. (Cyperaceae) is a particularly challenging group from which to develop a specimen-level NCBI resource. Under the broad circumscription of the genus to include the previously segregated genera Cymophyllus Mack., Kobresia Willd., Schoenoxiphium Nees, and Uncinia Pers. (Global Carex Group 2015), Carex comprises ca. 2000 species, spanning six continents and a remarkably wide range of terrestrial and aquatic ecosystems (Hipp 1998; Reznicek 1990, 1993; Escudero et al. 2012). Four or five major clades have been well characterized in the genus in many studies, but relationships among these clades and fine-scale relationships within clades remain in flux (Waterway and Starr 2007; Waterway et al. 2009; Starr and Ford 2009; Global Carex Group 2015). A more recent study (Starr et al. 2015) utilizes broader sampling of southeast Asian taxa and deeper DNA sampling (4400 base pairs) to further refine this understanding, suggesting that Carex comprises two major alliances and seven major clades. While there have been numerous subgenus- and section-level phylogenetic studies in Carex (Starr et al. 1999; Hendrichs et al. 2004a, b; Roalson and Friar 2004; Ford et al. 2006, 2012; Hipp et al. 2006; Escudero et al. 2008; Dragon and Barrington 2009; Escudero et al. 2009; Jiménez- Mejías et al. 2011; Shekhovtsov et al. 2012; Derieg et al. 2013; Gebauer et al. 2014; Yano et al. 2014; Maguilla et al. 2015; Molina et al. 2015), there has not been a recent effort to summarize these in a single synthetic paper focused on the genus. The most inclusive Cyperaceae phylogeny to date is from analysis of a Cyperaceae supermatrix (Hinchliff and Roalson 2013). This study confirms many of the higher-level relationships identified by others in previous studies but does not provide strong phylogenetic conclusions at fine scales. However, to date, no one has constructed a supermatrix multigene tree to the individual level for Carex. Presumably, this lack of individual-level trees can be attributed to the difficulty of parsing others' data to determine the individual specimens that are the source of the sequences. This problem becomes especially tricky when one considers that multiple studies may use the same individual specimen.
In this study, we provide (1) a set of informatics tools for annotating NCBI data to voucher, working around the limited data structures provided by NCBI for linking sequences to voucher, and for exploring resulting phylogenies for apparent non-monophyly of species or other taxa that may be due to lab error, taxonomic misidentification, or the need for taxonomic revision; and (2) a case study in Carex. We also provide a relatively small number of annotations of the NCBI database based on re-inspection of data by Global Carex Group (GCG) members who are coauthors on this paper; this is presented in the supplement to this paper for Carex researchers who may have used data in the past and would like to know which identifications have been annotated.
Materials and Methods
Downloading and parsing NCBI nucleotide records—The NCBI GenBank database was queried on 18 March 2015 for all nucleotide sequences for which the organism field contained the string “Carex,” “Cymophyllus,” “Kobresia,” “Schoenoxiphium,” or “Uncinia.” Data were exported as NCBI INSDSeq XML ( http://www.insdc.org/). Each sequence was maintained, as in GenBank, as a separate data record, irrespective of the source of genetic data; each data record, then, comprises sequence data plus a variety of metadata: information on the locus sequenced, taxonomic classification of the organism from which the DNA was extracted, and various forms of information on the identity of the specimen sequenced, including collector, collector number, geographic locality, “isolate” and “clone” (in quotes here because these terms are applied differently by different researchers), and other information distinguishing among sources of DNA sequence data. This information was used to identify each individual plant specimen that provided one or multiple sequences in the data set. The XML data were parsed into a flat file using the XML (Lang et al. 2015) and morton packages ( https://github.com/andrew-hipp/morton) in R versions 2.15.3 (‘Security Blanket’) through 3.2.5 (‘Very, Very Secure Dishes’; R Core Team 2015).
Linking nucleotide records to individual specimens—The parsed data table (Supplemental Table S1) was then analyzed to create a unique voucher or specimen code, a label unique to individual plant specimens. The NCBI database was not designed to hold specimen-level data that could be consistently compared across collections, researchers, research studies, or even different genes sampled within a single study. There is a field for voucher, which is often used. However, collector names, number, and collection (museum, herbarium) codes and numbers are not consistently indicated. As a consequence, we had to infer from the metadata associated with each sequence record what the specimen was, and create a code for specimens that could be associated across studies and genes. Collector names, locations, and numbers pertaining to the individual who collected the specimen, as well as collection names and numbers pertaining to the museum or collection housing the specimen, were parsed out of the specimen_voucher field using the parse.specimen function in morton. The parsing rules we used are hard-coded in the parse.specimen function. Even after parsing, however, substantial manual cleanup was required to address inconsistencies in collector and collection names and placement of collector and collection data within different fields. Following manual cleanup of the individual fields, the voucher names were automatically generated by concatenating five fields, if present: (1) collector name, (2) collector number, (3) isolate number, (4) collection (herbarium) name, and (5) collection (herbarium) accession number. Collector name was the only field required to contain information; if information associated with the collector of the plant sample was missing, the author name (s) for the paper in which the sequence was published was used instead and indicated using the label “AUTHOR.” Spaces and punctuation were removed and all characters were changed to lowercase in vouchers to minimize spurious differences among records. Plant names were not included in the voucher name, due to the relatively transient nature of classifications and the fact that identification changes made on specimens are often not communicated to NCBI. Over the 22 yr of sequence deposits represented in our sampling (Fig. 1), changes in taxonomic names and determinations on individual specimens make it likely that at least some individual specimens may have two or more scientific names associated with them. These names may or may not be synonyms; we did not attempt to distinguish the difference in the current study, though this could easily be done. All missing information was omitted from the voucher labels.
Following automated generation of voucher labels, the voucher labels were inspected visually and manually cleaned to differentiate those that were not fully differentiated in the previous step due to missing specimen data or variation in collection information (in some cases collection information was included for a given specimen in one study but not in another). When specimen-specific metadata for sequences were missing, multiple individuals were often erroneously grouped under a single voucher, usually by the primary author of the paper in which these individuals were first published. Our approach was to aggregate individuals to species within a research study if no specimen data were provided, under the assumption that multiple exemplars of a given species within a study are likely to derive from a single specimen, provided duplicate sequences are not present in the study. This assumption may of course result at times in incorrect attribution of specimen identity. When taxa lacking specimen data could not be unambiguously matched to a DNA region within a study, those individuals were excluded from analysis.
An advantage of linking specimens to vouchers became apparent with inspection of ITS and trnL-trnF data. Each of these regions was in many cases sequenced as two separate gene regions (e.g. ITS1 as one region, ITS2 as a separate region) and was therefore present in the data set as halves of a whole gene region. We concatenated any split sequences. Any gap in a sequence left unsequenced — for example, a gap for 5.8S between the first and second internal transcribed spacer (ITS) regions — was filled with Ns. Starr et al. (1999), the seminal study in the use of ITS in Carex, was used to estimate how many base pairs should be inserted for the 5.8S region, and 10 base pairs were removed from that value to avoid introducing spurious gaps during multiple alignment. For trnL- trnF, we inserted gaps by inspection and by reference to a publication in a more distantly related taxon (Bayer and Starr 1998).
Multiple alignment—After cleaning the data set, each gene region was written to a separate FASTA file to be aligned. FASTA files were exported from the INSDSeq XML data with a label indicating the taxon, the collector, and an arbitrary number referencing the voucher. Only specimens attributable to a single organism were exported — any voucher identified to different taxa for different sequences was excluded — and a translation table was exported to relate the FASTA files to the original NCBI metadata. Only the top 12 most represented DNA regions (Fig. 2) were written to FASTA files for inclusion in the final tree. This decision was made based on the observation that the less-represented regions have few individuals common with other regions and are mostly from highly taxonspecific studies, making their inclusion in the final tree problematic; sequences that did come from individuals sequenced for other loci would not be likely to resolve correctly in the phylogenetic trees inferred because there would be an equal likelihood that they would be located in multiple places on the tree. Multiple alignment was performed using MUSCLE v.8 (Edgar 2004a, b). The resulting multiple alignment files were adjusted manually, and sequences that did not align properly were flagged for reverse-complementation or deletion. After editing flagged sequences, multiple alignments were re-aligned using MUSCLE and readjusted manually. Problematic sequences were removed, and for trnL-trnF, data were aligned by four major clades — outgroups + Siderostictae clade; Vignea clade; core Carex clade; and Caricoid clade — prior to profile-to-profile realignment in MUSCLE. Sequence alignments were trimmed lightly to get rid of ragged ends.
Concatenated data matrices—We exported six datasets for analysis, with nicknames indicated in bold:
5-region: all individuals containing any one of the most commonly sampled five regions
12-region: all individuals containing any one of the most commonly sampled 12 regions
ITS scaffold: the top 12 DNA regions for all individuals that had been sampled for ITS; thus all individuals have ITS, and many have additional gene regions as well
ITS+ETS: ITS and ETS for those individuals that have both ITS and ETS (no missing data)
ITS only, ITS+ETS subset: ITS for those individuals that have both ITS and ETS
ETS only, ITS+ETS subset: ETS for those individuals that have both ITS and ETS
Nicknames indicated above are used throughout this manuscript.
Phylogenetic analyses—All analyses presented were conducted in RAxML v. 8 (Stamatakis 2014) using 50–100 fast bootstrap searches followed by slow ML search and optimization (this is the ‘-f a’ option commonly used in RAxML). Trees in all analyses presented were rooted by the outgroups Eriophorum vaginatum L., Scirpus polystachyus F.Muell., Trichophorum alpinum (L.) Pers., and T. cespitosum (L.) Hartm. These outgroups were available for the ITS, ETS, trnL-trnF, matK and rbcL datasets; for outgroups, no effort was made to match individuals. The following analyses were undertaken:
Comparison of datasets—To examine the impact of the number of regions included on the phylogenetic reconstruction, pairwise tree distances calculated using Penny and Hendy's (1985) tree bipartition metric, ignoring branch lengths, were calculated among all trees in the 5-region bootstrap treeset, the 12-region bootstrap treeset, and the ITS scaffold bootstrap treeset, as well as the ML tree for each of these datasets using the dist.topo function in the ape package (Paradis et al. 2004). All bootstrap and ML trees were pruned down to the taxa shared by all individuals shared among the three datasets prior to calculating pairwise tree distances. It was important to leave all taxa in the data matrices and prune after phylogenetic analysis for this portion of the study so that we could assess whether increasing both taxon sampling and percent missing data alters our understanding of relationships around a core group of tips for which we have maximal data. Multidimensional scaling as implemented in the vegan package (Oksanen et al. 2015) was used to visualize tree dissimilarities in two dimensions.
Effects of aggregating to individual: ITS vs. ETS—To evaluate the effects of aggregating data to the individual level, we estimated the ML tree for ITS+ETS; ITS only, ITS+ETS subset; and ETS only, ITS+ETS subset. Nodes on the ITS and ETS bootstrap consensus trees (bipartitions trees in RAxML) were matched back to equivalent nodes on the ITS+ETS tree using the phytools package (Revell 2012); equivalent nodes are defined as nodes on the rooted tree that have identical sets of descendants. Bootstraps for equivalent nodes were plotted for comparison using morton.
Taxon disparity index—We evaluated monophyly of species on our combined datasets by flagging tips based on species names (as reported in NCBI GenBank, ignoring synonymy, names below the species level, and possible misidentifications) and calculating the distribution of a taxon disparity index (TDI), which we introduce in this study. The TDI is defined here as the number of tips in the most restricted clade that includes all individuals of a given species label minus the number of tips of that species label. Unlike the consistency index (CI), the TDI has the desirable property of increasing as even a single outlier increases in phylogenetic distance from the core of the species; whereas a species label might have a CI 0.5 no matter how phylogenetically distant its tips are, TDI increases as species labels that violate monophyly increase in phylogenetic distance from one another. The taxon disparity index might also be formulated as the number of additional steps or the decrease in likelihood required to make all tips with a given species label monophyletic, using a paired-sites test such as the Shimodaira and Hasegawa (1999) test to compare the unconstrained tree from one in which all tips of a particular species label are monophyletic. However, both of these alternatives are computationally much more demanding, as they would require additional tree optimization steps, and it is not clear that they would appreciably alter our interpretation of how pruning tips or concatenating data affect monophyly of species labels. Note that we use “monophyly of species labels” deliberately in this context, because we are not distinguishing in this analysis among non-monophyly due to taxonomic issues (e.g. morphological species not reconciled with genealogical species), nomenclatural discrepancies, specimen misidentifications, or lab error. Note, too, that this analysis differs fundamentally from rogue taxon analysis, which investigates how consistent the placement of tips is among trees in a confidence set (e.g. bootstrap set).
Results
NCBI data, specimen assignment, and matrices—A total of 7994 sequence records encompassing 58 named DNA regions (after manual cleanup by the senior author; Fig. 2) and an additional set of anonymous regions (including SSR loci) were downloaded. Regions were drawn from plastid (37 regions), mitochondrial (three regions), nuclear (13), and nuclear ribosomal (five) DNA. Sequences were deposited in NCBI GenBank between 1993 and 2015 (Fig. 1). In the end, only 68 of 3909 vouchers identified had two taxonomic names associated with them, and none had more than two. Vouchers have an average of 1.85 +/- 0.95 gene regions sequenced per individual and range from one to seven gene regions each (Figs. 3, 4). The top 12 individual gene matrices range from 149 individuals (atpF and trnV-ndhC) to 1766 individuals (ITS), and in data completeness from < 5% gaps (trnV-ndhC) to more than 20% (ITS, ETS 3′ end, trnL-trnF, trnK, rps16; Table S2), with an average of 24% missing data. Much of the missing data was due to uncleaned ragged ends, and some from gaps that might be eliminated by aligning first within clades, then doing profile-to-profile alignment among clades (cf. Global Carex Group 2016, this issue). While we did this for the trnL-trnF data, we found little need to do so for the other matrices, though such an approach could be automated readily using an iterative phylogenetic analysis / multiple alignment approach. The combined 5-region and 12-region data matrices are 2827 individuals × 6228 bp and 3266 individuals × 13106 bp respectively. They represent 773 and 775 named species and comprise 83.5% and 91.1% gap characters respectively. The ITS scaffold matrix, like the ITS matrix itself, comprises 680 species, but a much higher proportion of missing data (90.8% gaps, as with the 12-region dataset; Supplemental Fig. S3). Supplemental tables and figures, including all matrices, are available from the Dryad Digital Repository at http://dx.doi.org/10.5061/dryad.6tn70d.
Phylogenetic analyses and comparison of datasets—Each of the matrices enumerated under “Concatenated Data Matrices” (Methods) was analyzed individually as described in Methods (Fig. S4A–F). Individual genes were analyzed separately in preliminary analyses for this in the course of cleaning the alignments, but separate analyses are not shown here. A tree from analysis of the 5-region matrix was pruned to make a separate tree for each of 20 GCG coauthors (in a few cases, combinations of authors) who had sequence data deposited in NCBI GenBank with their name as a lead or associated author. Each of the authors had an opportunity to review the tips associated with their names and give feedback to the communicating authors, including redetermination of identifications. This resulted in removal of 40 tips and redetermination of 15 vouchers.
All analyses recovered the same major clades found in prior studies of the genus: the Siderostictae, core Carex, Vignea, and Caricoid clades (Global Carex Group 2015 and references therein; Figs. 4, 5). The placement of these clades is not strongly supported, fitting with the uncertainty reported in prior studies (Starr and Ford 2009, Fig. 3 presents a good review), but the overall topology recovered in the 5-region and 12-region trees is essentially the same (Figs. 4, 5), with some within-clade differences particularly in the core Carex clade that are relatively minor compared to some of the within-clade rearrangements between the 5-region and ITSscaffold analyses (Fig. S5).
In the ordination of bootstrap trees for the 12-region, 5-region, and ITS scaffold datasets, pruned down to taxa common to all three datasets, there is strong overlap between the 12-region and ITS scaffold bootstrap sets, and minimal overlap between these and the 5-region bootstrap set (Fig. 6). The distances among the three ML trees are substantially lower than any other distance in the dataset (497–595 branches defining bipartitions that differ among the three ML trees, in comparison to differences of 835–1171 bipartitions for every pairwise comparison involving a bootstrap tree). Thus, the ML trees fall close together in the ordination (Fig. 6), surrounded by the bootstrap trees, rather than each ML tree in its own cloud of bootstrap trees.
Aggregating to individual—Aggregating data to specimen results in a net increase in clade support over not aggregating at all (Fig. 7). Only ITS and ETS were chosen for this investigation because a relatively large number of Carex individuals in NCBI have been sampled for both DNA regions, but we presume the benefits of aggregating to individual may extend beyond ITS and ETS. Looking across data matrices, we find that most species exhibit relatively low taxonomic disparity (TDI <100), with only a small number ranging to a TDI of 3000 or more (Figs. 8, S6), encompassing uncertainty across nearly the entire tree. There are islands of disparity that correspond to increasingly deep phylogenetic bipartitions that a species label might transgress (e.g. all taxa that are errantly split between the Vignea and core Carex clades will have a very similar and large disparity index, irrespective of how many individuals have that label). In linear regression of taxonomic disparity against number of taxa (Fig. S7), number of taxa is a weak predictor of TDI for all datasets (for the 12-gene and 5-gene datasets, r2 = 0.10–0.12, p < 0.001; in all other datasets, the regression coefficients are not significantly different from 0), and we thus consider it a poor corrector to normalize taxonomic disparity. We recommend using TDI in a largely heuristic manner, accompanied by manual inspection of the relevant trees to detect potential misidentifications, gene sampling issues, or taxonomically complicated species (cf. Global Carex Group 2016). Gross misidentifications could easily be weeded out in further analyses, and tools to do so are provided in the morton package.
Of 769 species labels in the 5-region tree, 441 have two or more individuals; of those, 22% of those species labels are monophyletic. In the 12-region tree, 452 of 771 species labels have two or more individuals, 20% of which are monophyletic. In the ITS scaffold tree, 345 of 676 species labels have two or more individuals, and 42% of these named taxa are monophyletic; an additional 31% have a TDI of 10 or lower (Fig. 8; Table S8A–C).
Discussion
In this paper, we introduce a practical approach to aggregating NCBI data to the individual or specimen level rather than the species level and provide a toolkit ( https://github.com/andrew-hipp/morton) to assist in the informatics. Our efforts to identify individual specimens from the database were fairly successful despite the fact that the pieces needed to identify specimens—collector name and number, collection location and accession number, isolate, clone—are not fully atomized or normalized in NCBI and have not always been entered in consistent ways. We have automated numerous steps in extracting these elements, and with additional manual manipulation of the data we arrived at an average of 1.85 +/- 0.95 DNA regions per specimen. We estimate that we spent ca. 100 person-hours manually curating specimen IDs for the 7994 sequence records, a relatively small time investment for a dataset of this magnitude.
Success at identifying individual specimens relied on proper and thorough entry of specimen data, especially voucher information on collection and collector number. The amount of information necessary to distinguish individuals varied from author to author based on the size of their study. In several studies, the only identifying information provided for a sequence was taxonomic identification and the author of the paper in which the sequence was published. In such cases, our automated parsing was unable to distinguish individuals, so manual examination of the data was necessary to tease out individuals. With no further specimen information, we generally assumed that all sequences for a given species in a given study came from a single individual, provided that a single sequence was provided per DNA region for that individual.
Our approach offers at least three benefits over the status quo of aggregating to species, ignoring specimens: (1) assessing species monophyly and reliable identification of tips; (2) allowing redeterminations of specimens to inform otherwise unlinked sequence records; and (3) improving species-level phylogenies throughmore informed concatenation of sequence data and selection of representative individuals.
Assessing species monophyly, and allowing redeterminations of specimens to inform multiple sequence records—The low taxonomic disparity index (TDI) demonstrated by the majority of species provides important data on the validity of those taxa as phylogenetic species, the concordance between morphological classification and phylogenetic classification, and the validity of NCBI data. In the ITS scaffold tree, for example, 40% of 327 species with more than one tip had a TDI of 0, and 71% had a TDI of 10 or lower (Fig. 8; Table S8A–C). It seems unlikely that frequent violations of species monophyly in this study reflect a widespread misunderstanding of species boundaries among practicing caricologists, who are an exceptionally attentive group of taxonomists. Rather, such violations are likely to more often represent misidentifications that would be missed in a traditional dataharvesting approach to phylogenetic analysis from NCBI data, or phylogenetic error due to gene sampling issues or sequencing noise. This argues for taking an approach such as the one presented here as a means to selecting which individual to represent a species in downstream analyses of NCBI data.
Some small values for TDI, however, probably bespeak true non-monophyly of species, discordance between taxonomic species and phylogeny of the genes studied. For example, the non-monophyly of C. deweyana Schwein. (TDI = 4, 4 individuals sampled) in the ITS scaffold tree may reflect the fact that two varieties of C. deweyana — C. deweyana var. deweyana and C. deweyana var. senanensis (Ohwi) T. Koyama — are perhaps better thought of as separate species than as varieties of a single species (Ford et al. 2006). This type of discordance is likely to result in relatively small TDI, as discordance between phylogenetic and taxonomic species will more often involve fine-scale relationships than transgressions of deep phylogenetic divergences. The misclassification of an individual into the wrong species — simple misidentification — or lab errors among close relatives will also cause an increase in TDI and may be difficult to differentiate from discordance between phylogenetic and taxonomic species without additional study. Large TDI values are likely due to lab error or profound errors of identification: 40 species names in the ITS scaffold dataset had a TDI of 100 or greater (Fig. 8), including such distinctive species as C. aurea Nutt. (TDI = 1535, N = 3) and C. gibba Wahlenb. (TDI = 575, N = 21). Of these remarkable examples, one was the subject of an earlier study of divergent paralogues (C. gibba), and is thus known to include problematic sequences (King and Roalson 2008); and one appears to be a misidentification based on habitat affinities (a putative C. aurea growing in a gravel beach, genotyping as C. glareosa; GenBank Accession JN999020, GI:359389285). In such cases, the TDI can be a useful tool for quickly identifying problematic specimens that are probably best removed from analysis.
Improving species-level phylogenies through betterinformed selection of individuals—For species-level phylogenies, rogue-taxon analysis is useful for identifying tips that are phylogenetically unstable due to poor sequencing of loci or genealogical discordance among loci (Aberer et al. 2013). However, rogue taxon analyses cannot help identify whether a given individual is the best representative of the taxon to which it has been ascribed. Our approach, by contrast, can be used to recognize potentially misidentified individuals by noting tips that are outliers with respect to other tips of the same name. In the examples presented above, rank-ordered TDI could be used to guide manual inspection of sequence data to remove individuals that are clearly misplaced or taxa for which placement is ambiguous, based on conflict among placement of individuals. Individuals could then be chosen to minimize missing data and the impact of the individual on apparent monophyly of the species it represents. This approach is complementary to rogue taxon analysis but fundamentally different from it. Rather than aiming at phylogenetic stability and removing problematic branches based on their movement among trees in a bootstrap set, the approach we describe aims at selecting individuals that best represent their species and are sequenced for the largest number of loci.
Generalizability of our approach—We have run our specimen-parsing scripts on NCBI sequence downloads for Euphorbia L. (Euphorbiaceae) and Quercus L. (Fagaceae) to validate that our approach has the potential to work with other datasets. After automated parsing, an individual researcher will still need to work through the data table to clean up collector names, collector numbers, collection names, and accession numbers. This portion of the work is perhaps best shared between a worker who is not familiar with the group and a more experienced researcher who knows many of the collectors and collections. Additionally, it will probably be quickest for an experienced researcher to bin the heterogeneous labels for a given locus to a single locus name; in our dataset, for example, the ITS regions were housed under “contains 18S ribosomal RNA, internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2, and 28S ribosomal RNA,” “contains internal transcribed spacer 1, 5.8S ribosomal RNA, internal transcribed spacer 2,” and 13 other unique names. Rather than writing a rule for assigning locus identity, it is probably easier for an experienced researcher to judge what labels represent the same locus and provide the unified label for this locus. Again, the task is not difficult. In our dataset, there were only 160 unique DNA region descriptions for nearly 8,000 sequence records, and these were easily binned to locus in about an hour. After parsing and cleanup of the collection data and gene region names, concatenation of separated loci (e.g. each of ITS and trnL-trnF when they are sequenced and submitted to NCBI as separate pieces), the mapping of individuals to gene regions, the production of graphical representations of the data, and the generation of summary statistics can be easily automated to facilitate data exploration using scripts provided in the morton package or modifications thereof.
Conclusions and next steps—The Carex dataset constructed for this study is one of the most inclusive to date for this large genus, comprising analyzable datasets between ca. 670 and 790 Carex species (ITS scaffold and 12 / 5-region datasets respectively). It is also the only NCBI-based supermatrix study of which we are aware that puts specimens rather than taxa at the center, and it may thus serve as a model for analyses of other large taxa. The study recovers the four major Carex clades—core Carex, Siderostictae, Vignea, and the Caricoid clade—and demonstrates that there is some topological variation among datasets within clades (Fig. 5). The study also supports the monophyly of approximately 345 morphological species based on the ITS scaffold dataset, 441 based on the 5-region dataset (Fig. 8; Table S8A–C). While the data are not consistently coded for geographic origin of sample, of those monophyletic species with three or more tips, 93% were collected by two or more different collectors, suggesting that this apparent monophyly is not likely to be due simply to collections made from the same population. The approach we present here may thus serve in investigations of species monophyly and genetic coherence beyond single studies.
Our study suggests the need for global databases that integrate specimen and DNA sequence data. While specimen databases have not fully lived up to the dream of data flow between collections and from users to curators, there has been a lot of improvement: users of the Symbiota system can, for example, share annotations among collections and readily incorporate feedback from scientists around the globe. As a next step, we would like to see direct links between sequence data deposited in public DNA databases and specimen data housed in the world's herbaria and zoological museums. Thoughtful integration of these databases or protocols for communication among them would facilitate downstream use of specimen-level sequence data. This in turn would propel taxonomic enterprises worldwide, allowing systematists to annotate sequence data in the same way they annotate specimens, and at the same time.
Acknowledgments.
We thank James Smith for his help in organizing this proceedings, and American Society of Plant Taxonomists and the Botany 2015 organizing committee for supporting the symposium in which the Global Carex Group papers in this issue of Systematic Botany were presented: “Ecological diversification and niche evolution in the rate zone's largest genus: Carex.” Chuck Cannon and two anonymous reviewers provided feedback on a draft of this manuscript. Funding for this work was provided by the National Science Foundation (Award #1255901 to ALH andMJWand Award #1256033 to EHR), including an REU supplement that supported KKP's work.
Literature Cited
The Global Carex Group," Systematic Botany 41(3), 529-539, (26 August 2016). https://doi.org/10.1600/036364416X692505