Studies focused on genomic and metagenomic questions often struggle to obtain the nucleotide sample(s) ideally suited for successful investigations. For example, organellar genome sequencing projects have employed a variety of techniques to obtain sequence data enriched with the genome of interest. These include laborious centrifugation prior to DNA extraction (e.g., Atherton et al., 2010), genome skimming approaches (e.g., Straub et al., 2012) that rely on the sheer scale of sequencing, long-range PCR (e.g., Uribe-Convers et al., 2014), and hybrid sequence capture techniques (reviewed by Cronn et al., 2012). Nuclear genome sequencing projects face the opposite problem, often wasting valuable sequencing efforts by redundantly sequencing small multicopy organellar genomes or prokaryotic contaminants. Studies focused on metagenomic communities (microbiomes) in eukaryotes deal with an even greater problem, plant nuclear DNA potentially eclipsing small prokaryotic genomes that may occur in low titer. As a result of these issues, the development of approaches that allow one to extract total genomic DNA (gDNA) and subsequently partition DNA from different genomic compartments into enriched fractions has considerable potential for numerous lines of research.
One means by which such fractionation has been done in animal systems involves the use of a DNA methyl-binding domain (Gebhard et al., 2006; Feehery et al., 2013). Various classes of DNA methylation are well known from all forms of life. In eukaryotic nuclear genomes, methylated CpG (methyl-CpG) sites have been particularly well studied, representing a eukaryote-specific form of methylation that is associated with critical epigenetic signaling and gene expression (e.g., Deaton and Bird, 2011). Whereas vertebrate nuclear genomes appear to largely use methyl-CpG (up to 80%; Feng et al., 2010), angiosperm genomes reveal more complex patterns of methylation. The latter also possess extensive cytosine methylation (Feng et al., 2010), which includes methyl-CHG and methyl-CHHG in addition to methyl-CpG (e.g., Feng et al., 2010; Zemach et al., 2010). Furthermore, plant nuclear genomes encode a more complex array of methyl-binding domains (e.g., Springer and Kaeppler, 2005; Feng et al., 2010). In contrast to methylation of nuclear genomes, the prokaryote-derived organellar genomes of eukaryotes have received less attention. In fact, a study has only recently overturned the commonly held view that human mitochondrial DNA (mtDNA) contains methyl-CpG (Hong et al., 2013). Similar work on Arabidopsis thaliana (L.) Heynh., Populus trichocarpa Torr. & A. Gray, and Oryza sativa L. suggests that angiosperm plastomes have ca. 1–1.5% total methylation and that these include less than 0.75% methyl-CpG compared to 22–59% in the same nuclear genomes (Feng et al., 2010). If similar patterns characterize angiosperm mtDNA, then the nuclear and organellar genomes of these organisms appear to be well differentiated in their methyl-CpG composition.
Variation in patterns of DNA methylation between major phyla and genomic compartments therefore offer unique opportunities to partition total gDNA into different source elements. Prior research on animal tissues has shown that the human methyl-CpG-binding domain (MBD2) can be used to partition fragments of DNA into those with methyl-CpG present vs. those lacking appreciable CpG methylation (Gebhard et al., 2006; Feehery et al., 2013). Through this technique, Feehery et al. (2013) demonstrated that gDNA extracted from human and fish tissues can be readily fractionated into methyl-CpG-rich and methyl-CpG-poor elements representing nuclear host and prokaryotic parasite or endosymbiont components, respectively.
We reasoned that an available MBD2 construct might be applicable to fractionate gDNA extracted from plant tissues into high-methyl-CpG nuclear chromosomal fragments vs. low-methyl-CpG elements. The latter classes that we considered include plastid DNA and mtDNA in addition to DNA from prokaryotic parasites or endophytes. If successful, such an approach could reduce the relative cost and time invested in a variety of studies on plant genomes as well as those focused on plant metagenomic communities.
METHODS
To test the potential utility of methyl-CpG capture on plant systems, we sequenced gDNA from A. thaliana, Glycine max (L.) Merr., Leucaena leucocephala (Lam.) de Wit, O. sativa, and Zea mays L. For each sample, three gDNA libraries were sequenced using a MiSeq instrument (Illumina, San Diego, California, USA): (1) untreated gDNA, (2) a fraction depleted of methyl-CpG, and (3) a fraction enriched for methyl-CpG. The input DNA sources for the latter two libraries represent the supernatant and bead fractions derived from the application of the NEB Next Microbiome DNA Enrichment Kit (#E2612S; New England Biolabs, Ipswich, Massachusetts, USA). This approach uses the IgG1 fused to the human methyl-CpG-binding domain (together “MBD2-Fc”) to pull down a methyl-CpG-enriched fraction in the bead-associated element, leaving a methyl-depleted fraction in the supernatant.
Seeds for G. max (Burpee Be Sweet 292), L. leucocephala, and Z. mays (Plantation Products, Norton, Massachusetts, USA) were grown under standard greenhouse or growth chamber conditions in Las Cruces, New Mexico. Voucher information is provided in Appendix 1. Total gDNA was obtained from young leaf material through standard cetyltrimethylammonium bromide (CTAB) extraction (Doyle and Doyle, 1987). Using these standard approaches, DNA from L. leucocephala was highly viscous and difficult to work with, as were standard column-based plasmid-type DNA extractions (e.g., QIAGEN). Therefore, we extracted DNA using Norgen Biotek's new Plant/Fungal DNA Isolation Kit (#26200; Norgen Biotek, Thorold, Ontario, Canada), which yielded a clean sample of L. leucocephala. Arabidopsis thaliana and O. sativa DNA was purchased directly from BioChain Institute (Newark, California, USA).
Due to the critical nature of gDNA quality as well as the quantity of gDNA to MBD2-Fc ratio for effective methyl-CpG capture, DNA quality and quantity were checked using two different methods. NanoDrop 1000 (Thermo Fisher Scientific, Wilmington, Delaware, USA) was used to calculate purity of DNA by A260/A280 as well as concentration. In addition, a double-stranded DNA (dsDNA) assay was also used to calculate concentration using the Qubit dsDNA BR Assay Kit (#Q32850; Life Technologies, Grand Island, New York, USA). Differences in concentration between NanoDrop and Qubit are likely due to single-stranded RNA contamination in genomic preps or other contaminants absorbing at A260. As a result, we used Qubit-derived estimates whenever there was a discrepancy. Enrichments followed manufacturer's recommendations (#E2612S; New England Biolabs). In short, we used a ratio of 1 µg of gDNA for 160-µL protein A magnetic beads and 16-µL MBD2-Fc protein. First, 160-µL protein A magnetic beads were prebound to 16-µL MBD2-Fc protein by incubating with rotation for 10 min at room temperature. Protein A/MBD2-Fc complex was pulled down by magnetic field and washed twice with 1 mL of ice-cold 1× wash/bind buffer to remove unbound MBD2-Fc protein. The beads were resuspended in 160 µL of ice-cold 1× wash/bind buffer. The gDNA was added to the protein A/MBD2-Fc beads and incubated with rotation for 15 min at room temperature. The beads were pulled down by magnetic field and the methyl-CpG-depleted supernatant carefully removed without disturbing the beads. The bead pellets were washed once with ice-cold 1× wash/bind buffer while sitting on a magnetic stand, resuspended in 150 µL 1× TE buffer (pH 8.0), and then treated with Proteinase K to release methyl-CpG-enriched DNA. After proteinase K treatment at 65°C for 20 min, the beads were pulled down, and the supernatant containing the methyl-CpG-enriched fraction was recovered. Both the methyl-CpG-enriched and methyl-CpG-depleted fractions were purified using 1.8× volume of AMPure beads (#A63880; Beckman Coulter, Brea, California, USA) and eluted in 60 µL of 1× TE buffer (pH 8.0).
DNA was sheared to 150–200-bp fragments using a Covaris S2 ultrasonicator (Covaris, Woburn, Massachusetts, USA) and directly used to prepare libraries with the NEBNext Ultra DNA Library Prep Kit (#E7370S; New England Biolabs) and multiplex barcodes (#E7335S and #E7500S; New England Biolabs) with nine rounds of PCR. To keep the amplification of libraries constant, 3–5% of adapter-ligated DNA was used as a template in a 50-µL PCR reaction for untreated (350 ng) and methyl-CpG-enriched fractions, whereas a third of adapter-ligated DNA was used as a template for the methyl-CpG-depleted fraction. The indexed libraries were multiplexed and sequenced (paired-end) on various runs of an Illumina MiSeq instrument using the 300-cycle reagent kit (version 2). Raw FASTQ files were generated using MiSeq Control Software version 2.2.0 (Illumina).
Paired-end sequencing reads were mapped to available relevant organellar genomes (Table 1) for each species using Bowtie 2 version 2.1.0 (Langmead and Salzberg, 2012). A conservative mapping approach was applied, requiring each read to: (1) stringently map (–score-min L,0,−0.6), (2) have its pair occur within a reasonable expected distance (-X 700), and (3) have each pair be properly oriented relative to one another (BamTools version 2.3.0, option “–isProperPair true”; Barnett et al., 2011). The percentage of reads mapping to a genome was calculated as the fraction of read-pairs mapped to the total readpairs in the library. In addition, for the modest-sized high-quality nuclear genome for A. thaliana (TAIR 9; Arabidopsis Genome Initiative, 2000) the data were mapped to the nuclear, chloroplast, and mitochondrial genomes simultaneously using the same approach. Geneious (version 6.8 created by Biomatters [ http://www.geneious.com]) was used for graphical representation of the evenness of organellar genome coverage for Arabidopsis (TAIR 9) and Oryza (build 4; Zhao et al., 2004).
RESULTS
The DNA extracts used here included high-molecular-weight samples that comigrated with lambda DNA marker (#N3019L; New England Biolabs) on an agarose gel and that had a 260/280 ratio of at least 1.80 (Table 2). Hereafter each library sequenced will be referred to as “UT” (untreated DNA), “E” (methyl-enriched), or “D” (methyl-depleted). The resulting Illumina library characteristics and percentage of read-pairs appropriately mapping to the plastome and mitochondrial genome for each library are presented in Table 3 and Fig. 1, and the primary data are available through the National Center for Biotechnology Information Sequence Read Archive (SUB581050).
Relative to untreated gDNA, the methyl-depleted libraries resulted in a 3.2–11.2-fold and 3.4–11.3-fold increase in chloroplast DNA (cpDNA) and mtDNA, respectively. Conversely, the methyl-enriched fraction resulted in a 1.8–31.3-fold and 1.3–29.0-fold decrease in cpDNA and mtDNA reads, respectively.
The mapping of reads to a reference genome that contained both nuclear and organellar DNA sequences for A. thaliana provided additional insight into the characteristics of the methyl-CpG-depleted and methyl-CpG-enriched fractions (Fig. 2). The percentage of reads mapped to the plastome and mitochondrial genomes using this approach (Fig. 2), and the aforementioned mapping only to the organellar genomes (Fig. 1), was essentially identical. Considering the nuclear genome, 74% of the untreated DNA read-pairs mapped to the genome while 94% and 23% mapped for the methyl-enriched and methyl-depleted fractions, respectively. The combined percentage of reads mapping to the three genomes was similar for the starting DNA and methyl-enriched library; however, the methyl-depleted library had ca. 10% fewer read-pairs mapping to any one of the three Arabidopsis genomes (see Discussion).
Table 1.
Plant species, genome accessions, and genome sizes used in this study.
To investigate whether one or more of these classes of libraries (UT, E, or D) was prone to bias in sequencing coverage across genomes, we mapped the A. thaliana and O. sativa libraries to their respective organellar genomes (cpDNA presented in Appendix S1 (apps.1400064_s1.pdf)) and chromosome 1 (selected arbitrarily) using Geneious for easy visualization. We were particularly interested in identifying whether multiple-kilobase segments had been excluded, likely a result of the method failing to pull down the large complete segments through the enrichment process. The resulting percentages of read-pairs mapping were consistent between the Geneious and aforementioned Bowtie 2 approach, and overall read coverage appears to be uniform for all three classes of libraries. For simplicity, the range of findings is presented in the following examples, focusing on the easily visualized plastome ( Appendix S1 (apps.1400064_s1.pdf)). The O. sativa cpDNA genome showed uniform coverage across the plastome for all libraries. For the methyl-CpG-depleted fraction, the lower coverage (mean 31×) likely explains slightly lower uniformity ( Appendix S1A (apps.1400064_s1.pdf)). With A. thaliana, there is nearly uniform read coverage for the UT library and D library except in three AT-rich zones showing lower coverage ( Appendix S1B (apps.1400064_s1.pdf)). This lower coverage is unlikely to be due to the enrichment process, as problems with Illumina coverage in low-complexity AT- and GC-rich regions are a known issue (e.g., Oyola et al., 2012). As with O. sativa, the A. thaliana E library showed the expected decrease in plastome coverage (mean 22×) and expected increase in variation in overall coverage as a result. The methyl-enriched libraries mapped to chromosome 1 revealed only small gaps (typically <500 bp) in coverage across the length (data not shown), which suggests that this approach has not excluded large regions due to failed capture.
Table 2.
Starting genomic DNA characteristics for species in the study.
DISCUSSION
The objective of this study was to investigate whether the MBD2-Fc construct can be used on plant samples to effectively partition total gDNA into methyl-rich and methyl-poor elements that better reflect their genomic origin than untreated DNA extractions. Using the relative representation of organellar (likely methyl-CpG-poor) and nuclear DNA (known methyl-CpG-rich) from DNA extracts of high purity (Table 2) and high molecular weight, the approach showed the predicted enrichment and depletion of methyl-CpG fractions in all cases. However, the changes relative to untreated DNA varied by species, library type, and major lineage. Results from our two monocot samples, both representatives of the economically important Poaceae, showed the greatest fold increase (9.3–11.3-fold) in the representation of organellar DNA in methyl-poor libraries (Table 3). These samples also showed considerable depletion of organellar DNA in the methyl-rich fraction (5.6–20.3-fold). Among the eudicot samples, the results were more variable by library and/or organelle. The Arabidopsis and Leucaena samples had the greatest starting percentage of cpDNA, and the methyl-rich fraction revealed a 23–31-fold decrease in cpDNA contamination. Although the starting concentration of mtDNA reads in any of the samples was low (all <8%), depletion of mtDNA reads in the methyl-rich fraction ranged from just 1.34-fold in Glycine to 29-fold in Leucaena.
Although generally effective, there are several potential causes for the apparent variation in the overall efficacy of MBD2-Fc to partition methyl-rich and methyl-poor genomic regions in our genomic extracts. First, divergence in MBD proteins has been documented between animal and plant systems as well as between monocots and eudicots (e.g., Feng et al., 2010). As a result, MBD2, for some unknown reason, may not bind plant methyl-CpG as effectively as with vertebrate methyl-CpG sites. This issue could be confounded if one or more of the samples was “contaminated” by native plant methyl-binding domains, not removed by DNA purification, that impeded access and binding of MBD2-Fc. Furthermore, limited existing data suggest that there may be variation in the frequency of CpG methylation between monocots and eudicots. Feng et al. (2010) found that chromosomal DNA of the eudicots A. thaliana and P. trichocarpa contained 22–42% methyl-CpG, but their single monocot sample, O. sativa, displayed 59%. Further research on patterns of plant DNA methylation will identify whether there are lineage-specific differences between monocots and eudicots and indicate whether these may be influencing the results presented here. In addition, Gebhard et al. (2006) discussed variation in binding sensitivity associated with salt concentration. which is worth future consideration in the application of this and similar approaches.
Table 3.
DNA sequence and read mapping results. For each species, three libraries were sequenced: untreated DNA (UT), methyl-enriched DNA (E), and methyl-depleted DNA (D). The total reads per library and the number of reads mapping (2× the read-pairs), and percentage of read-pairs mapping for each are listed.
Based on our preliminary results (not shown), it is clear that clean high-molecular-weight DNA is important for the application of MBD2-Fc. With high-molecular-weight DNA, a large fragment with even one CpG island along its length may be bound and pulled down and available for library prep even though other regions may lack methyl-CpG. Without high-molecular-weight DNA, variation in methyl-CpG could easily lead to unintentional partitioning into different fractions from a collinear segment of DNA (e.g., a chromosome). Although such findings could be helpful for studies on methylated vs. unmethylated collinear regions, they were not the objective of the current study.
The visualization of reads mapping to A. thaliana and O. sativa ( Appendix S1 (apps.1400064_s1.pdf)) illustrated that sequencing of either the methyl-enriched or methyl-depleted fractions provided relatively even coverage of sequence, without signs of large regions lacking coverage in any of the three genomes. There appears to be a slight decrease in coverage across low-complexity AT-rich regions in the A. thaliana methyl-depleted library ( Appendix S1B (apps.1400064_s1.pdf)). This result is most likely a slight exacerbation of a known Illumina AT sequencing issue (e.g., Oyola et al., 2012) rather than a representation issue in the actual libraries being sequenced.
Read mapping to the nuclear and organellar genomes of A. thaliana revealed an intriguing result. The total number of read-pairs mapping to any one of these genomes was ca. 10% less in the methyl-depleted library than the untreated or methyl-enriched libraries (Fig. 2). While this may represent stochastic variation among libraries, our preliminary studies of various libraries showed little variation in relative genomic representation between independently generated and sequenced libraries (data not shown). We suspected that the decrease in read mapping with methyl-depleted libraries is, at least in part, due to enrichment of “contaminating” prokaryotic sequences, which would be expected for this approach. This would be consistent with the findings of Feehery et al. (2013), who used this approach to better characterize prokaryotic species inhabiting animal tissues. To roughly investigate this idea, we ran each of the three A. thaliana libraries through MetaPhlAn (Segata et al., 2012), which focuses primarily on human-associated prokaryotic genomes. Running reads that were unmapped to the A. thaliana genomes, we applied the “Sensitive-Local” setting in MetaPhlAn and found nine, two, and 23 prokaryotic genome hits in the UT, E, and D libraries, respectively. The decreased representation in the E library and increase in the D library, from the same starting DNA sample, is consistent with the method having partitioned the prokaryotic contaminants into the D library and suggests that the protocol may be particularly important in plant metagenomic studies (e.g., Zheng et al., 2014).
When comparing the application of MBD2-Fc to other options for genome and microbiome sequencing, there are several issues to consider. These include project objectives, ease of use, cost, time, reproducibility, required starting material, and results. Careful consideration of all these elements should help narrow in on the most logical method for a given study (e.g., Jansen et al., 2005). Whole-genome sequencing of any genomic compartment using unperturbed total gDNA has been the standard for nuclear genome sequencing and even for organellar genomes. This rapid approach is appropriate for many applications and starting materials (e.g., Straub et al., 2012), but a considerable percentage of reads can be wasted on genomic compartments of lesser or no interest to the parent study. The isolation of entire genomic compartments (e.g., nuclei, mitochondria, and/or chloroplasts) prior to DNA extraction from one or more fractions is also a widely applied approach. However, the process is laborious, requires copious fresh starting material (e.g., 20 g in Shi et al., 2012), and the results with plant material have varied, occasionally resulting in little effective change in the total concentration of different fractions (noted by Atherton et al., 2010). In addition, the coverage presented by Shi et al. (2012) is far less uniform (see their Fig. 4) than what is presented for the MBD2-Fc ( Appendix S1 (apps.1400064_s1.pdf)). PCR-based approaches offer a powerful option that can be based on low-quantity and/or low-quality starting material. Nonetheless, coverage tends to be highly variable (e.g., Uribe-Convers et al., 2014), and numerous problems can presumably result from the coamplification of nuclear-encoded organellar genomic elements (e.g., Vieira et al., 2014). Hybridization approaches continue to rise in popularity and can be exceptionally cost-effective, particularly when sequencing a portion of the nuclear genome or most or all of organellar genomes for large numbers of individuals, populations, or species (e.g., Cronn et al., 2012; Stull et al., 2013; Mandel et al., 2014; Mariac et al., 2014; Weitemier et al., 2014). However, when the hybridization resources have not already been developed, the initial setup cost and time may be prohibitive, and variability in probe specificity can lead to limited uniformity of coverage. In addition, hybridization approaches do not differentiate between nuclear-encoded organellar elements vs. their organellar sequence (as is an issue with the PCR approach), but they should not create chimeric sequences that are more commonly associated with long-range PCR.
The methyl-CpG capture approach presented here requires intact high-molecular-weight starting DNA for optimal results, but it does not require large quantities of DNA (we have used as little as 500 ng of starting material for the enrichment and generation of libraries). Our results indicate a high degree of fractionation in many samples and uniform sequencing coverage across genomes. If one is concerned about the relative success of enrichments, real-time quantitative PCR can be used to test for shifts in well-developed cpDNA loci in the different fractions (untreated, methyl-rich, and methyl-poor) prior to costly library prep and sequencing. At ca. US$30 per enrichment, we see the use of MBD2-Fc as a fast and useful approach for modest numbers of samples in sequencing projects of organellar genomes (i.e., when the initial setup of hybridization approaches is not justified) and nuclear genome sequencing projects on plant groups with moderate to high organellar contamination that can lead to substantial losses of genomic sequence data on nontarget genomes. Furthermore, the deep sequencing of methyl-depleted and methyl-enriched gDNA libraries derived from plant tissues may give a more complete picture of the unobserved prokaryotic and eukaryotic coinhabitants. Libraries selectively depleted of plant nuclear DNA should be rich in plant organellar DNA as well as prokaryotic coinhabitants. These libraries may also be enriched for organellar DNA from microscopic eukaryotic coinhabitants (e.g., fungal or algal). In contrast, the methyl-enriched libraries not only should include a greater representation of plant nuclear DNA, but also may contain more nuclear DNA derived from microscopic eukaryotic coinhabitants (e.g., fungal or algal). Thus the use of MBD2-Fc shows considerable potential in plant biology and suggests that future research on the use of plant-specific MBDs is warranted for even greater efficiency of plant DNA partitioning in genomic and metagenomic studies.
LITERATURE CITED
[1] The authors thank Brad Langhorst, Shannon Straub, Kevin Weitemier, and Aaron Liston for assistance with approaches to data analysis and Yanxia Bei, Shannon Straub, and two anonymous reviewers for thoughtful comments and suggestions on the manuscript. This research work was supported by New England Biolabs and the National Science Foundation (Plant Genome Research grant no. 128731 to C.D.B.).