Shallow genetic differentiation and sampling limitations combine to restrict our understanding of biodiversity and evolution in many species-rich plant groups. Although numerous strategies for obtaining powerful genomic data sets are emerging (reviewed in Lemmon and Lemmon, 2013; McCormack et al., 2013), we remain fundamentally restricted by our access to samples. Other than the adoption of silica gel as a tissue dessicant (Chase and Hills, 1991), samples needed for plant molecular systematics studies are obtained essentially as they were at the beginning of the DNA era (Palmer and Zamir, 1982; Doyle et al., 1985). Researchers still must field-collect the majority of material—a rewarding, but expensive and time-consuming task that often precludes taxonomically rigorous sampling of large groups (>100 species) during the course of a dissertation or 3-yr federally funded project. If we are serious about understanding biodiversity and evolution in species-rich clades, we therefore need a transformative approach to obtaining samples.
Extracting DNA from herbarium specimen tissue is an obvious solution, an idea dating from the earliest days of plant molecular systematics (Rogers and Bendich, 1985). This type of sampling is, however, still viewed by most as a way to supplement an otherwise field-collected data set. Although studies utilizing genomic data sets obtained from herbarium specimens are emerging, most involve the recovery of high-copy organelle and/or rDNA cistron regions (Straub et al., 2012; Stull et al., 2013; Besnard et al., 20l4; Ripma et al., 2014), or are focused on adaptation within a single species (Vandepitte et al., 2014) or genome assembly of a single individual (Staats et al., 2013). Indeed, we are unaware of a study that has performed species delimitation or phylogeny reconstruction using a genome-wide data set obtained exclusively (or even largely) from herbarium material. Sampling exclusively from herbarium material would allow robust taxonomic and geographic sampling to be achieved rapidly, and if this sampling were performed under the guidance of expert taxonomists it would also ensure the strongest link between taxonomy and DNA. Sample sets obtained through this strategy, what we term “next-generation sampling,” could then be subjected to next-generation genotyping and sequencing techniques, as these workflows are presumably applicable to the sheared DNAs obtained from museum specimens (Nachman, 2013; Stull et al., 2013; Burrell et al., 2015). These rich data sets would then allow for the biodiversity and phylogeny of species-rich groups to be rigorously established in a short time.
In this study, we explore the compatibility of this sampling strategy with a genomic single-nucleotide polymorphism (SNP) protocol in the goldenrods (Solidago L., Asteraceae), a genus of ca. 150 currently recognized taxa (Semple and Cook, 2006). Taxonomic uncertainty in Solidago is widely recognized (Fernald, 1950; Nesom, 1993), a problem stemming from a combination of low interspecific genetic divergence (Kress et al., 2005; Fazekas et al., 2008; Schilling et al., 2008; Fazekas et al., 2009; Peirson et al., 2013), polyploidy (Semple, 1992), and species richness. In this study, we attempt to obtain genomic SNP information with a genotyping by sequencing (GBS) approach in a set of 95 herbarium specimens representing three Solidago subsections. These approaches identify SNPs at thousands of points throughout the genome by generating and sequencing a reduced representation library (Narum et al., 2013). Obtaining a genomic data set that carries species-level signal in this difficult genus, using only herbarium material, would be a powerful demonstration of the link between genomics and the expansive incorporation of herbarium material.
Sampling and DNA extraction/assessment—Polyploidy adds additional complexity to GBS data collection and analysis, including reduced per-individual sequencing depth due to increased genome size, the complicating nature of additional gene copies for SNP identification, and the relative lack of sophisticated analytical tools for polyploid data sets. We therefore chose to include diploid samples only in this pilot study. Herbarium tissue was obtained from 95 specimens representing 23 species in three Solidago subsections: Junceae (Rydb.) G. L. Nesom, Squarrosae A. Gray, and Triplinerviae (Torr. & A. Gray) G. L. Nesom (Appendix 1). All material was sampled from specimens at the University of Waterloo Herbarium (WAT), now housed as a unit of the Université de Montréal Herbarium (MT). Diploid mitotic chromosome counts were available for 73 of the 95 specimens (Semple et al., 1981, 1984, 1993; Semple and Chmielewski, 1987; J. Semple, unpublished data), and all exhibited microsatellite profiles indicative of diploidy (i.e., no more than two alleles per locus [J. Beck, unpublished data]). These specimens represented both a wide age range (collected between 1970 and 2010) and a diverse array of drying regimes, from field-based forced air techniques (similar to Blanco et al., 2006) to standard drying cabinets utilizing light bulbs or heaters. Approximately 15 mg of tissue were subjected to a cetyltrimethylammonium bromide (CTAB) protocol modified for 96-well plates (Beck et al., 2012). This high-throughput protocol has a history of yielding DNA quantity/quality sufficient for sequencing and genotyping in both herbarium (Beck et al., 2012, 2014; Alexander et al., 2013) and silica-dried (Rothfels et al., 2013) tissue. Concentration was determined with a Qubit 2.0 fluorometer (Life Technologies, Carlsbad, California, USA), and fragment size distribution was visualized by running 100 ng of extract against a λ DNA-HindIII digest (New England Biolabs, Ipswich, Massachusetts, USA) on a 1% agarose gel.
Library preparation, sequencing, and SNP calling—GBS library preparation (Elshire et al., 2011), sequencing, and SNP calling were performed at the Genomic Diversity Facility (GDF) at Cornell University's Biotechnology Resource Center. Trial libraries for one DNA were generated with three enzymes (ApeKI, EcoT221, PstI). Visual inspection of Experion (Bio-Rad, Hercules, California, USA) traces revealed that all exhibited fragment sizes generally between 150–300 bp. ApeKI was excluded due to the larger fragment pool, and thus lower read depth per fragment, that would result from this five-base recognition enzyme. Of the two six-base recognition enzymes, EcoT221 was then chosen because it exhibited a slightly smaller fragment pool. Libraries prepared from the 95 samples and one blank negative control were sequenced in one lane on an Illumina HiSeq 2500 (Illumina, San Diego, California, USA). Given that a reference genome was not available, the Universal Network-Enabled Analysis Kit (UNEAK) nonreference pipeline (Lu et al., 2013) implemented in TASSEL version 3.0.160 (Glaubitz et al., 2014) was used for tag alignment and subsequent SNP calling. The barcode/sample keyfile and all pipeline XML configuration files are archived at the Dryad Digital Repository ( http://dx.doi.org/10.5061/dryad.16pj5; Beck and Semple, 2015).
Data filtering and multivariate clustering—TASSEL 4.3 was used to produce preliminary SNP data sets by implementing high and low levels of missing data filtering on the total SNP set identified by UNEAK. This filtering and all further analyses excluded four samples (noted in Appendix 1). Two subsection Triplinerviae individuals were placed in other subsections in preliminary analyses, which along with other unpublished results strongly suggests that these are mislabeled DNA samples. Also excluded were two subsection Squarrosae individuals exhibiting low sequence read levels (see below). High filtering recovered SNPs present in 70% of samples, whereas low filtering recovered SNPs present in 30% of samples. Both filtering levels enforced a >1% minor allele frequency. These preliminary data sets were subjected to the multidimensional clustering approach employed in the principal coordinates analysis with modal clustering (PCO-MC) workflow (Reeves and Richards, 2009). This approach identifies the most cohesive groups in a data set by simultaneously considering information on all informative axes of a principal coordinates analysis. These groups are ranked by a “stability value,” which ranges from 0–100 and quantifies the relative density of the group in multidimensional space (Reeves and Richards, 2009). Many clustering approaches are available for the analysis of SNP data (Lawson and Falush, 2012), and we employed PCO-MC based on its computational efficiency and ability to objectively identify and rank clusters. Unlike popular methods such as STRUCTURE (Pritchard et al., 2000) and STRUCTURAMA (Huelsenbeck et al., 2011), PCO-MC does not incorporate a model of within-group Hardy-Weinberg equilibrium, an assumption that is unrealistic for sets of individuals sampled at different times across the range of a species. Instead, PCO-MC identifies groups of individuals with similar genotypes, as genotypic similarity is but one of many secondary criteria that can be used to identify lineages (Mallet, 1995; Hausdorf and Hennig, 2010) under the general lineage concept (de Queiroz, 2007). The correspondence between clusters identified by PCO-MC and morphologically defined species (morphospecies) at both filtering levels was assessed. Cluster/morphospecies correspondence at high and low filtering levels was qualitatively similar in subsection Triplinerviae and generally lower at high filtering in subsections Squarrosae and Junceae. Low-filtered data sets were therefore chosen for subsequent PCO-MC clustering.
Sequencing success and SNP recovery—Extracted DNA concentrations ranged from 15–155 ng/µL (mean: 46.2 ± 23.6), and total DNA yield ranged from 1050–10,850 ng (mean: 3185.9 ± 1665.9) (Appendix 1). Only five samples exhibited DNA yields below the 1.5-µg minimum recommended by the GDF. Gel electrophoresis indicated that all extracts were at least partially sheared, exhibiting fragment sizes between >23 kb and <500 bp ( Appendix S1 (apps.1500014_s1.pdf)). Each extract was given a qualitative score of DNA degradation (1 = mainly large fragments [>23 kb]; 2 = relatively even distribution of large to small fragments; 3 = mainly small fragments [<2 kb]) (Appendix 1, Appendix S1 (apps.1500014_s1.pdf)). These degradation scores were strongly related to specimen age, as all 21 group 1 DNAs (least degraded) were collected since 1992 (Appendix 1). Reduced representation library construction and Illumina sequencing yielded 230,232,173 (100 bp) reads. Of these, 197,917,774 were considered quality reads, exhibiting no N's in the first 72 bases and including both a full barcode and the expected remnant of the restriction cut site (Elshire et al., 2011). These quality reads were then collapsed into 18,947,823 identical sequence tags. The blank sample returned 7604 quality reads, which was 0.003% of the total quality reads and 0.04% of the mean quality reads (2,076,237) per nonblank sample. Two samples were designated as failures by the GDF based on a quality read number <10% of this mean. Overall, quality read number per sample was significantly lower in older specimens (r2 = 0.27, P = 6.8 × 10−8; Fig. 1A). While still significant, the relationship between age and read number was less pronounced in specimens >10 yr old (r2 = 0.080, P = 0.011). There were significant differences between the three DNA degradation categories [one-way ANOVA: F(2,92) = 18.44, P < 0.0001], with category 1 exhibiting more quality reads than categories 2 and 3 (Tukey honestly significant difference [HSD] test). The UNEAK pipeline identified 8470 unfiltered SNPs that were present in at least 10 of the 96 samples (blank included). Missingness, or the percentage of these SNPs exhibiting missing data in a given sample, was significantly higher in older specimens (r2 = 0.22, P = 9.2 × 10−7) (Fig. 1B). There were again significant differences between the three DNA degradation categories [F(2,92) = 20.44, P < 0.0001], with category 1 DNAs exhibiting reduced missingness relative to category 2, which in turn exhibited reduced missingness relative to category 3 (Tukey HSD). Filtering to recover SNPs present in at least 70% of samples resulted in individual data sets of 547 (subsect. Junceae), 185 (subsect. Squarrosae), and 359 (subsect. Triplinerviae) SNPs. Filtering to recover SNPs present in at least 30% of samples resulted in individual data sets of 1633 (subsect. Junceae), 1447 (subsect. Squarrosae), and 2168 (subsect. Triplinerviae) SNPs. Original read data (FASTQ) have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive ( http://www.ncbi.nlm.nih.gov/sra) under BioProject ID PRJNA284163, and filtered subsection-specific HapMap matrices are archived at the Dryad Digital Repository ( http://dx.doi.org/10.5061/dryad.16pj5; Beck and Semple, 2015).
Multivariate clustering—Correspondence between genetic groups identified by PCO-MC multidimensional clustering and morphospecies was strong in subsection Junceae (Fig. 2A). The five most highly ranked, and thus most cohesive in multivariate space, genetic clusters corresponded either to single morphospecies or groups of morphospecies. This result is particularly striking for the widespread species S. missouriensis Nutt. and S. juncea Aiton. In each case, samples from disparate portions of the morphospecies' range (S. missouriensis range shown in Fig. 2D) were identified as belonging to a significant genetic cluster. Also notable is the single incidence of a genetic cluster not corresponding to an entire morphospecies or group of morphospecies. PCO-MC identified a highly ranked cluster comprising all three TN specimens of the rare, strongly disjunct S. gattingeri Chapm. ex A. Gray, while the two samples from MO were not placed in this or any other cluster. This suggests that S. gattingeri comprises two morphologically cryptic species separated by the Mississippi Embayment (Fig. 2D), a hypothesis that is supported by multivariate morphological analyses (J. Semple, unpublished data). Correspondence between genetic clusters and morphospecies was also strong in subsection Triplinerviae (Fig. 2B). The six most cohesive clusters corresponded to single morphospecies (S. gigantea Aiton, S. tortifolia Elliott, and S. elongata Nutt.) or groups of morphospecies. While the four TX specimens of S. juliae G. L. Nesom composed a single cluster, the two AZ S. juliae specimens were not placed in this group. This again suggests the presence of two geographically disjunct species (Fig. 2D). The remaining samples, representing S. altissima L., S. canadensis L., S. lepida DC., and S. brendiae Semple, composed a single cluster. These species can at times be difficult to distinguish (Semple et al., 2013, 2015), and their lack of genetic distinctiveness is not unexpected. Although correspondence was not as strong in subsection Squarrosae, multiple highly ranked clusters corresponded to single morphospecies or putatively closely related morphospecies pairs (Fig. 2C). The most highly ranked cluster comprised all individuals of S. pallida (Porter) Rydb. and S. rigidiuscula (Torr. & A. Gray) Porter, two morphologically similar species that were until recently both part of the S. speciosa s.l. complex (Semple et al., 2012). All individuals of S. erecta Banks ex Pursh, another taxon historically placed in the S. speciosa complex, formed the next most highly ranked cluster with those of S. speciosa Nutt. itself. The third-ranked cluster comprised S. puberula Nutt. and two of three S. pulverulenta Nutt. individuals, two species that until recently were considered northern and southern subspecies of S. puberula s.l. (Semple and Cook, 2006; Fig. 2D). Finally, all three individuals of S. squarrosa Muhl. formed the fifth most highly ranked cluster.
We were able to routinely attain data at >1700 SNPs in a set of herbarium specimens representing 23 species obtained by numerous collectors over a 40-yr time span, and these data carried clear biological signal. Of the 20 strongest clusters identified by PCO-MC, seven comprised all individuals of a single species, two comprised clear geographic subsets of a single species, and three comprised all individuals of potentially sister species. This signal is particularly encouraging given the extremely low sequence divergence among goldenrod species. Schilling et al. (2008) observed <1% sequence divergence among Solidago species at the often highly variable internal transcribed spacer (ITS) region of the nuclear rDNA cistron. Among the eight groups examined in Kress et al. (2005), Solidago harbored the lowest level of diversity at 10 highly variable plastid loci, exhibiting no substitutions at the putatively universal barcoding region psbA-trnH. Fazekas et al. (2008, 2009) examined nine potential barcoding regions in 32 genera and commented that Solidago was one of the two most “intractable” genera. It should also be noted that the inability of these data to recover clusters corresponding to all morphospecies may simply reflect biological reality, as it is unlikely that all currently recognized goldenrod species correspond to genetically cohesive groups (Semple and Cook, 2006). Taken together, these results indicate that the pairing of GBS with next-generation sampling holds considerable promise for species delimitation in large groups.
Recommendations—We were able to consistently recover DNA of sufficient quantity/quality for library construction with a standard CTAB extraction protocol modified for 96-well plates, and the inexpensive and high-throughput nature of this approach pairs well with the large sample sizes we propose. Although specimen age did negatively affect both the number of quality reads and the amount of missing data per sample (Fig. 1), this effect was less pronounced for specimens >10 yr old. This suggests that much of this detrimental effect occurs at the time of collection (drying technique or length of time the sample was held before drying) or during the early years of curation, an insight consistent with studies that have explicitly evaluated the timing of DNA damage (Staats et al., 2011) and shearing (Adams and Sharma, 2010; Neubig et al., 2014) in herbarium material. Sampling could perhaps then be focused on relatively recent specimens if sufficient material is available. Specimen preparation practices and storage conditions have also been shown to exert a strong effect on DNA quality (Ribeiro and Lovato, 2007; Särkinen et al., 2012; Lander et al., 2013; Neubig et al., 2014), and sampling from air-dried material stored in humidity/temperature-controlled facilities should be favored. Following DNA extraction, our data suggest that a qualitative gel-based assessment of DNA degradation can be a strong predictor of downstream success. Regardless, future studies will need to evaluate the timing and degree of herbarium DNA degradation in a range of plant groups, as this process has been shown to proceed at varying rates in different taxa (Neubig et al., 2014).
Future studies could greatly enhance SNP discovery by beginning with low-coverage sequencing of one target species. The reference-aided GBS Discovery pipeline is robust to higher levels of divergence during locus identification and often identifies more SNPs, particularly in diverse data sets. Even a highly fragmentary assembly greatly improves SNP discovery, because short (64 bp) GBS reads can be matched to very small contigs. Genome size should also be considered. A recently examined diploid Solidago species exhibited a 1C-value = 1.02 pg (Kubešová et al., 2010), which is considered a relatively small angiosperm holoploid genome size (Leitch and Leitch, 2013). Genome size estimates across the group of interest should be considered during project design, particularly in the choice of restriction enzyme (Elshire et al., 2011). If funds permit, additional sequencing can be performed to reduce missingness in large-genome taxa (Chen et al., 2013). We also recommend the inclusion of multiple replicate samples to assess the background error rate. This is expected to be particularly important at the low read depths likely to be encountered in studies incorporating large numbers of specimens with varying DNA quality. Regarding analysis, a clear limitation of the cluster analysis of GBS data are the inability to reconstruct the pattern/timing of divergence among inferred lineages (Carstens et al., 2013), and fully leveraging these data for species delimitation and phylogeny reconstruction will require analytical tools that allow species trees to be inferred with the short read data obtained with GBS methods (Cariou et al., 2013; Hipp et al., 2014). These tools will no doubt soon be available (Leaché et al., 2014), as will increasingly longer read lengths of reduced representation libraries. These considerations notwithstanding, we feel strongly that pairing herbarium collections with GBS and other increasingly accessible genomic workflows (Straub et al., 2012; Stull et al., 2013; Weitemier et al., 2014) should be a top priority in plant systematics. Besides allowing for rapid and economical sampling of large groups, next-generation sampling allows specimen selection to be performed in collaboration with group experts. Genomic data sets spanning both species' ranges and intra/interspecific morphological variation can then be used to rigorously test a wide range of hypotheses, thanks to the synergy between big data and big sampling.
 The authors thank K. Hyma and S. Mitchell from the Genomic Diversity Facility (GDF, Biotechnology Resource Center, Cornell University) for crucial guidance, and P. Soltis, P. Wolf, and two anonymous reviewers for comments on an earlier manuscript version. Sequencing at the GDF was supported by the National Institutes of Health (1S10OD010693-01). This work was supported by the Wichita State University Department of Biological Sciences, by a University of Wisconsin–Milwaukee Research Growth Initiative Grant, and Natural Sciences and Engineering Discovery Grants to J.C.S.