Cottons (Gossypium L. spp.) provide the leading natural fiber for textiles, as well as an important seed product for feed, food, and oil (Campbell et al., 2010). The most widely grown species are the allotetraploids G. hirsutum L. (upland cotton) and G. barbadense L. (Sea Island cotton). These species are both descended from an allopolyploidization event involving an A-genome diploid species, related to modern G. herbaceum L. and G. arboreum L., and D-genome diploid species, related to modern G. raimondii Ulbr. (Percival and Kohel, 1990). Recent developments in next-generation sequencing (NGS) technology have lowered the cost of sequencing per base, and enabled the genotyping by sequencing (GBS) approach for developing informative single-nucleotide polymorphism (SNP) markers in species with large, complex genomes (Elshire et al., 2011), including species without a reference genome sequence (Glaubitz et al., 2014). In this study, we employed a simple and cost-effective GBS approach to identify intraspecific and interspecific SNPs within and between allotetraploid cottons G. hirsutum and G. barbadense.
A major difficulty in the characterization and utilization of SNPs in polyploid species is determining whether a polymorphism detected by short-read NGS is the result of alternative alleles at a single locus or the presence of multiple homeologous loci. To identify markers that were likely to represent alternative alleles at a single orthologous locus, we used the Stacks bioinformatics pipeline (Catchen et al., 2011) as a filter to enrich for codominant markers composed of pairs of alleles that were homozygous in the respective taxa used for comparison (Fig. 1). Given high enough sequence coverage to accurately identify all relevant alleles and loci, the sstacks algorithm implemented by Stacks assigns an aa/bb marker type in this situation (Scenario 1). In contrast, detection of polymorphisms between homeologous loci will likely give rise to an ab/ab marker type prediction (Scenario 2). Detection of a polymorphism between paralogs within a subgenome will give rise to an aa/ab marker type (Scenario 3). Markers that are homozygous in one parent and heterozygous in the other parent will also give rise to an aa/ab marker type prediction (Scenario 4). A host of other combinations will give rise to either ab/ab or aa/bb patterns (not shown).
Given the array of possible scenarios that can give rise to candidate SNPs using GBS, we focused our efforts on markers with a simple aa/bb biallelic pattern. We considered these to be the best candidates for single loci with codominant polymorphisms between cotton accessions that would be useful for downstream applications such as genetic diversity studies, linkage and QTL mapping, genome-wide association studies (GWAS), and marker-assisted selection.
METHODS AND RESULTS
GBS was performed by a method similar to the general strategy outlined previously (Elshire et al., 2011), with major differences. Genomic DNAs from cotton taxa (Table 1) were extracted from liquid N2 flash-frozen seedlings using a protocol described previously (Pepper and Norwood, 2001) with the addition of 1/10th volume of Plant RNA Isolation Aid (Ambion, Austin, Texas, USA) to the initial extraction buffer. Genomic DNA (250 ng) was digested with 10 units of restriction endonucleases HinP1I or BsrGI for 2 h at 37°C. HinP1I is a CpG methylation-sensitive four-base cutter (G|CGC), while BsrGI is a methylation-insensitive six-base cutter (T|GTACA). Adapter ligations were carried out in 50-µL reactions in the presence of 40 pmol of the restriction-enzyme appropriate combination of annealed, 6-bp bar-coded, Illumina-compatible P5 and P7 adapters (Appendix 1) and 1600 cohesive-end ligation units of T4 DNA ligase (New England Biolabs, Ipswich, Massachusetts, USA) for 1 h at 37°C. Because active restriction enzymes were still present in these reactions, a final incubation for 30 min at 37°C was performed to cleave any chimeric ligation products between genomic DNA fragments (the adapters were designed to abolish the HinP1I or BsrGI recognition site upon ligation). All samples were pooled and then purified using MinElute columns (QIAGEN, Valencia, California, USA). Single-copy regions of the plastid genome can be present in >1500 copies per cell in leaf tissue (Zoschke et al., 2007) and may thus represent a major contaminant of GBS libraries. In silico restriction digestion of the G. hirsutum plastid genome (Lee et al., 2006) showed that no fragments were present between 186 bp and 218 bp in HinP1I digests and between 144 bp and 300 bp in BsrGI digests. To exclude fragments originating from the plastid genome, and to achieve complexity reduction, size selection (190–210 bp for HinP1I and 160–290 bp for BsrGI) was performed using 2.5% agarose gels stained with Gel Green (Biotium, Hayward, California, USA). Size-selected fractions were treated with NEBNext Fill-in and ssDNA Isolation Module (New England Biolabs) to obtain single-stranded biotinylated fragments to use as template for PCR amplification with Illumina-compatible primers PCR 1.01 and PCR 2.01 (Appendix 1). PCR cycling conditions consisted of 98°C for 30 s followed by 20–30 cycles of 98°C for 12 s, 65°C for 30 s, 72°C for 30 s, with a final extension time of 1 min at 72°C using Phusion polymerase (New England Biolabs). Because of the very narrow range of fragment sizes that were extracted in the size selection step, 20 cycles of PCR were required for amplification of BsrGI libraries and 30 cycles were required for HinP1I libraries. The samples were purified using Agencourt AMPure XP beads (Beckman Coulter, Brea, California, USA), then quantified using the AccuBlue High Sensitivity Quantitation Kit (Biotium) on the VICTOR X3 Multilabel Plate Reader (PerkinElmer, Akron, Ohio, USA). Samples were diluted to 10 nM, then provided to the Texas A&M AgriLife Genomics & Bioinformatics Services for sequencing on the Illumina GAII and/or HiSeq 2000 instrument (Illumina, San Diego, California, USA).
Seed sources, taxonomy, and preliminary GBS statistics for a set of diploid (A1-27, D5-1) and allotetraploid cottons.
A total of ca. 54 million raw 76-bp paired-end reads were imported into the Geneious bioinformatics package (Drummond et al., 2012) to trim for quality (P = 0.05). The length of all fragments was trimmed to 66 bp for analysis using Stacks ver. 0.998 (Catchen et al., 2011). Sequences were pre-processed using the “process_radtags” script, in which the 6-bp barcodes were sorted and removed, yielding 60-bp fragments. The process radtag options included: -c (clean data) and -q (discard low quality reads). Barcode-sorted FASTQ files were processed in pairwise combinations using the “denovo_map.pl” script. The de novo map parameters included: -n 3 (mismatches allowed between loci during catalog building), -m 3 (minimum number of identical, raw reads required to create a stack), -M 2 (number of mismatches allowed between loci when processing a single individual), and -t (remove, or break up, highly repetitive RAD-Tags during ustacks). Output data files were loaded to MySQL tables (Oracle Corporation, Redwood Shores, California, USA), and SNPs between taxa were annotated in a pairwise manner. We obtained an average of ca. 2 million raw sequences and ca. 15,000 unique ‘stacks’ (loci) for each sample (Table 1).
To examine the partitioning of GBS markers into the gene-rich component of the genome, we used BLASTN (Altschul et al., 1990) to search the 22,862 unique BsrGI stacks and 10,792 unique HinP1I stacks from G. hirsutum TM-1 against predicted coding DNA sequences (CDSs) and transcripts from the diploid G. raimondii (Paterson et al., 2012; Wang et al., 2012) using a significance threshold of <le-6. A very large proportion of HinP1I fragments (∼36–50%) showed some degree of sequence similarity to transcribed regions of the G. raimondii genome (Table 2). This proportion was far lower for BsrGI fragments (∼3–9%). Because HinP1I is methylation sensitive, the HinP1I libraries may be enriched in transcribed regions, which are hypomethylated in plant genomes (Feng et al., 2010). We also examined our GBS libraries for the presence of repetitive DNA using BLAST searches of all BsrGI and HinP1I against the TIGR Brassicaceae Repeat Database ver2_0_0 (Ouyang and Buell, 2004). We considered this database to be an appropriate proxy for cotton repetitive sequences because cotton (Malvaceae) and the Brassicaceae are sister taxa (Soltis et al., 2000). The presence of fragments with similarity to known repetitive elements was extremely low (<2%) in both BsrGI and HinP1I libraries. Organellar DNA contamination was examined using BLAST searches to the G. hirsutum plastid and mitochondrial genomes (Lee et al., 2006; Liu et al., 2013). The presence of fragments with similarity to the plastid and mitochondrial genomes was low in both libraries (Table 2).
BLASTN results (significance value <le-6) using total stacks from Gossypium hirsutum cv. TM-1 or selected aa/bb markers across all taxa as subject.a
For loci that were shared between any two taxa, polymorphisms were identified and categorized based on marker type predictions from Stacks. Across all taxa, totals of 18,073 and 5014 aa/bb pairwise SNP combinations were identified from the BsrGI and HinP1I libraries, respectively (Tables 3 and 4). Data for these marker sets have been submitted to the CottonGen SNP database ( http://www.cottongen.org). To determine the extent of overlap between this SNP collection and those already available on the CottonGen database, a subset of 1475 BsrGI fragments and 551 HinP1I fragments from G. hirsutum cultivar TM-1 were searched against all 183,035 CottonGen SNPs using the batch BLAST tool ( http://www.cottongen.org/tools/batch_blast; searched 3 January 2015). For BsrGI, only two fragments (0.13%) had 100% matches in the CottonGen SNP database, and only 19 fragments (1.3%) had matches to similar sequences in the database (with up to three mismatches). For HinP1I, only nine fragments (1.6%) had a 100% match in the CottonGen SNP database, and 27 fragments (4.9%) were similar up to three mismatches. Thus, the overlap between this and other cotton SNP collections was very low.
The proportion of aa/bb polymorphic loci to total (shared) loci was similar between BsrGI and HinP1I across all combinations of taxa (Fig. 2). In an intraspecific comparison within G. barbadense between cultivated variety Pima 3-79 and landrace K-56 (Peru), 6.7–8.1% of all markers showed aa/bb polymorphisms. In an intraspecific comparison within G. hirsutum between cultivated variety TM-1 and landrace TX-231 (Guatemala), 10.5–11.6% of markers showed aa/bb polymorphisms. An interspecific comparison between G. barbadense Pima 3-79 and G. hirsutum TM-1 showed the highest level of polymorphism (15–16.4%). These values correspond to approximate SNP frequencies of >0.0012–0.002 substitutions per base pair for intraspecific comparisons and >0.0028 substitutions per base pair for interspecific comparisons.
To test the efficacy of selecting aa/bb markers to enrich for orthologous loci SNPs, we employed both in silico and experimental validation. The in silico validation made use of the available A and D diploid genome sequences by determining whether the predicted aa/bb dual homozygous markers had the expected evolutionary pattern for sequences from Scenario 1 (Fig. 1). To perform this analysis, a subset of G. hirsutum TM-1 vs. G. barbadense Pima 3-79 aa/bb markers was searched against the complete genome sequences of G. raimondii (D-genome diploid) (Paterson et al., 2012; Wang et al., 2012) and G. arboreum (A-genome diploid) (Li et al., 2014) using BLASTN at a significance threshold of <1e-6. For each aa/bb marker, the tetraploid sequences and positive hits from diploid genomes were aligned using the map-to-reference tool implemented by the Geneious bioinformatics package (Drummond et al., 2012). Alignments were constructed for 549 HinP1I and 1413 BsrGI markers. Minimum spanning distance (smallest number of mutational changes to transition from one sequence to another) was used to classify the alignments into one of five categories (Fig. 3). Dual-homozygous markers that were polymorphic between the cotton species at a single locus (Fig. 1, Scenario 1) were expected to give rise to the alignment pattern designated Category I. In contrast, polymorphisms between homeologs present in the AT and DT subgenomes (Fig. 1, Scenario 2) were expected to give an alignment similar to Category II (Fig. 3), in which the TM-1 and Pima 3-79 alleles were each most similar to fragments in one of the two diploid species. If a putative marker is actually a polymorphism between paralogs within a given subgenome the expected alignment pattern would be exemplified by that shown for Category III (Fig. 3). For some markers, a likely subgenome identity could be discerned, but the marker showed unresolvable similarities to several paralogs within that subgenome (Category IV). This category may still manifest as aa/bb interspecific polymorphic markers, depending on the presence or absence of HinP1I or BsrGI flanking restriction sites. Finally, for some markers, the likely subgenome of the locus could not be determined (Category V) because of unresolvable similarities to fragments in both the AT or DT subgenomes. Again, these markers may still represent aa/bb markers, depending on flanking restriction sites. For the alignments of HinP1I and BsrGI markers (Table 5), positive BLAST hits were identified in one or both diploid genomes for 99.5% of BsrGI fragments and 98.5% of HinP1I fragments. Alignments for 83.5% of BsrGI markers and 69.4% of HinP1I markers (77.7% of markers overall) had the Category I pattern that was expected of aa/bb dual-homozygous single-locus markers, while only 1.6% of BsrGI markers and 10.7% of HinP1I markers (4.2% overall) had patterns indicating that they represented polymorphisms between homeologous loci from the AT or DT subgenomes (Category II) and only 1.5% of markers had alignment patterns suggesting that the polymorphism was between two paralogs in a given subgenome (Category III). Given that markers in both Categories IV and V can also, in principle, give rise to the aa/bb marker type, depending on flanking restriction sites, our in silico validation rate may actually be as high as 96.2% for BsrGI and 89.3% for HinP1I (94.2% overall).
Numbers of BsrGI shared stacks (loci) and dual homozygous (aa/bb) marker loci across a set of intraspecific and interspecific combinations of Gossypium taxa.
Numbers of HinP1I shared stacks (loci) and dual homozygous (aa/bb) marker loci across a set of intraspecific and interspecific combinations of Gossypium taxa.
Experimental validation was performed using the PCR-based cleaved amplified polymorphic sequence (CAPS) marker method (Konieczny and Ausubel, 1993). Only a small proportion of markers were suitable for experimental validation based on the following criteria: (1) SNP had to be near the middle of the 60-bp sequence to allow for design of flanking primers, (2) flanking sequences had to have suitable G+C content for primer design (30–60%) and lack simple sequence repeats, (3) specific primers had to be designed (using the alignments) to avoid amplification of known paralogs and homeologs, and (4) the SNP had to occur within the recognition site of a commercially available restriction enzyme. Only 22 TM-1 vs. Pima 3-79 markers (three HinP1I and 19 BsrGI) met all of these criteria; all of these fell into Category I (above) when examined in evolutionary alignments. Primer pairs shown in Table 6 were used in PCR amplification with the KAPA3G Plant PCR kit (Kapa Biosystems, Wilmington, Massachusetts, USA), as per the manufacturer's recommended protocol. Amplification products were examined using 4% agarose gel electrophoresis (E-Gel EX, Life Technologies, Grand Island, New York, USA) before and after restriction digestion. Of the 22 CAPS markers, one marker (Bsr1616) yielded multiple PCR amplicons, none of which were of the predicted size. One marker (Bsr18072) showed unexpected partial digestion in both TM-1 and Pima 3-79 accessions. Of those markers that could be definitively scored, 20/21 (96%) showed the predicted pattern of restriction digestion for polymorphic markers that were homozygous within each of the two taxa examined.
Cultivated cottons have complex allotetraploid genomes with high levels of repetitive DNA and a small proportion of gene-encoding DNA (Li et al., 2014). These characteristics greatly complicate efforts to apply GBS approaches. Foremost among these difficulties is the presence of homeologous gene copies (homeologs) inherited from the diploid ancestors. Furthermore, all plant genomes have paralogous regions arising from gene duplication processes other than allotetraploidization (such as tandem duplication). To filter out the confounding polymorphisms between homeologs and between paralogs, we selected for markers with a dual-homozygous aa/bb marker prediction from the Stacks algorithm. The likelihood of such a pattern arising from more than one orthologous locus by a mutational process or by gene conversion was considered to be small compared to the straightforward interpretation of alternative alleles at a single locus. The resulting filtered marker set was highly enriched for markers with evolutionary patterns that were consistent with alternative, codominant alleles at a single locus within a particular subgenome. The application of this filter also reduced the number of markers with sequence similarity to repetitive elements and organellar genomes (Table 5). BLASTN searches against G. raimondii transcript and CDS databases indicated that markers derived from the methylation-sensitive restriction enzyme HinP1I were highly enriched in gene-related sequences (Table 4). Thus we consider this marker set to be highly informative for mapping traits in gene-encoding regions of the genome.
Categorization of marker alignments of aa/bb markers polymorphic between Gossypium hirsutum TM-1 and G. barbadense Pima 3-79. Alignments included TM-1 and 3-97 alleles, along with any BLAST hits (1e-6) to sequenced A- and D-genome diploid species (Paterson et al., 2012; Wang et al., 2012; Liu et al., 2013). The five categories are described in the text and illustrated in Fig. 3.
The total set of 18,073 BsrGI-derived and 5014 HinP1I-derived polymorphic markers selected by this strategy can be used in a variety of applications across a range of taxa. For example, 921 SNPs between the photoperiodic G. barbadense landrace K-56 (Peru) and the photoperiod-independent cultivar Pima 3-79 could be used for mapping the photoperiodism trait. They could also be employed as a resource for marker-assisted conversion of photoperiodic germplasm to photoperiod independence (Percy, 2009). Similarly, our collection included 686 SNPs between the photoperiodic G. hirsutum landrace TX-231 (Guatemala) and the photoperiod-independent cultivar TM-1. Finally, our collection included ca. 2000 SNP markers that can be applied to the TM-1 × Pima 3-79 interspecific recombinant inbred line (RIL) population (Yu et al., 2012) by providing a linkage-based framework for ongoing genome sequencing and chromosome assembly efforts in the allotetraploid cottons G. hirsutum and G. barbadense.
Cleaved amplified polymorphic sequence validation of 22 aa/bb markers that are polymorphic between Gossypium hirsutum TM-1 and G. barbadense Pima 3-79.
These new markers add to existing collections of cotton SNPs developed from: (1) comparative transcriptome sequencing, (2) shallow depth genome sequencing, (3) genome reduction based on restriction site conservation (GR-RSC), detected by (4) Roche 454 pyrosequencing, and (5) genotyping by sequencing (Van Deynze et al., 2009; Byers et al., 2012; Lacape et al., 2012; Rai et al., 2013; Zhu et al., 2014). Because of the small fragment size (50 bp) and intrinsic similarities between orthologs and paralogs, the SNP loci described here may be of limited use for non-sequence-based genotyping approaches (e.g., KASPR, Illumina GoldenGate) (Hyten et al., 2008; Byers et al., 2012). However, this unique GBS approach can facilitate the discovery of large sets of informative markers that can be employed to genotype extensive collections of biological samples and experimental populations (RILs, F2, backcross) using barcoding and multiplex sequencing strategies (Elshire et al., 2011). Currently, a single lane on the Illumina HiSeq 2500 instrument can be used to genotype 96 samples at an average coverage of ca. 2 million reads per sample (the depth of coverage used in this work).
For some experimental purposes, the most flexible and cost-effective approach may be to use a “white list” of marker sequence and polymorphism data, such as that provided here, to design a targeted set of oligonucleotides that capture and enrich selected genomic fragments for resequencing using available NGS technologies. It is important to note that the overall strategy for marker discovery and annotation that we have provided in this study can be extended to any species, including those that are allopolyploid.
 We would like to express our deepest gratitude to James Frelichowski and Jared Harris for their help with cotton germplasm resources. This work was funded by Cotton Incorporated Fellowship 08-380 (to C.J.L.Y.) and specific cooperative agreement 3091-21000-038-03 (A.E.P. and J.Z.Y.). S.K.V. was supported by a CREST Award from the Ministry of Science and Technology of India.