The genus Amaranthus L. (Caryophyllales: Amaranthaceae) has a worldwide distribution with approximately 70 herbaceous plant species that range from aggressive weeds (pigweeds) to cultivated ornamentals, vegetables, and nutritious pseudocereals. The grain amaranths (A. hypochondriacus L., A. caudatus L., and A. cruentus L.) are ancient crops native to the New World that were likely domesticated multiple times from putative progenitor species (A. hybridus L., A. quitensis Kunth, and A. powellii S. Watson; Mallory et al., 2008; Maughan et al., 2011). Amaranths represent a regionally important staple food crop that also played an important cultural role in pre-Columbian civilizations; indeed, the Spaniards recorded that the Aztec emperor Moctezuma II required an annual tribute of amaranth nearly equal to the annual tribute of maize (Sauer, 1967). After the Spanish conquest of the New World, the arriving Europeans actively discouraged the cultivation of the grain amaranth due to its deeply rooted use in indigenous religious ceremonies (Iturbide and Gispert, 1994). Recently, however, the grain amaranths have acquired renewed attention due to the nutritional quality of their seed, which is high in crude protein (reaching 22.5% dry matter basis), fiber (8%), and the essential amino acid lysine (0.73–0.84%) (Bressani et al., 1992).
The primary function of chloroplasts in plants is photosynthetic carbon fixation. Chloroplasts are thought to have originated from a free-living cyanobacteria-like ancestor through an ancient endosymbiotic relationship (Raven and Allen, 2003). The chloroplast genome has evolved into a circular molecule with a quadripartite structure consisting of a small single copy (SSC), a large single copy (LSC), and two copies of inverted repeat (IR) regions. In addition to this conserved quadripartite genome structure, chloroplast genomes have highly conserved gene content and a nearly collinear gene order among most land plants (Jansen et al., 2005). Because of their compact size, lack of recombination, and maternal inheritance (Birky, 2001), chloroplast genomes have served as important DNA sequences for plant identification and the reconstruction of plant phylogenies (Parks et al., 2009; Moore et al., 2010).
In this study, we report a high-quality chloroplast genome assembly of an agronomically important grain amaranth cultivar (A. hypochondriacus ‘Plainsman’), which we use as a reference sequence to reconstruct the chloroplast genome of four additional A. hypochondriacus accessions, as well as one accession each of A. caudatus, A. cruentus, and A. hybridus. We report the genome size, structure, and gene content of the chloroplast genome for each of these accessions and mine their genomes for simple sequence repeats (SSRs), repetitive sequences, single-nucleotide polymorphisms (SNPs), and insertion/deletion polymorphisms (indels). This is the first report of a chloroplast genome published in the genus Amaranthus and will serve as an important reference resource for future phylogenetic studies in the Amaranthaceae family (Schmitz-Linneweber et al., 2001; Li et al., 2014).
METHODS AND RESULTS
Seeds representing all three grain amaranth species and their putative progenitor species (A. hybridus) were obtained from the publicly available U.S. Department of Agriculture National Plant Germplasm System (USDA-NPGS) Amaranthus germplasm collection (http://www.ars-grin.gov/; Table 1). All plant materials were grown in the Life Sciences greenhouses at Brigham Young University (Provo, Utah, USA) in 12-cm pots with one plant per pot using Sunshine Mix II (Sun Gro Horticulture, Agawam, Massachusetts, USA).
To develop a high-quality reference chloroplast genome for Amaranthus, high-molecular-weight DNA from the cultivar A. hypochondriacus ‘Plainsman’ (PI 558499) was extracted from fresh leaf tissue by Amplicon Express (Pullman, Washington, USA). Whole-genome DNA sequencing was performed using PacBio RS II (Pacific Biosciences, Menlo Park, California, USA) long-read technology at the National Center for Genome Resources (Santa Fe, New Mexico, USA). A total of four libraries were constructed using the PacBio 20-kbp protocol (Kim et al., 2014). These libraries were loaded onto 20 single-molecule real-time sequencing (SMRT) cells and sequenced using P6 polymerase and C4 chemistry with 4-h movie times. Sequencing yielded 2,359,408 filtered subreads (13,578,880,181 bp) with a mean read length of 5706 bp. These reads were assembled using FALCON version 0.4 (https://github.com/PacificBiosciences/FALCON), which produced 9494 contigs. Five of these contigs showed high BLASTN homology (E-value = 0) to, and together spanned nearly the entire length of, the chloroplast genomes of other species within the family Amaranthaceae, specifically Spinacia oleracea L. (NC_002202.1) and Beta vulgaris L. subsp. vulgaris (KR230391.1). These contigs were assembled manually relative to the chloroplast sequence of S. oleracea, with putative gaps filled with Ns. Using default parameters, the PacBio-filtered subreads were then mapped back to the draft amaranth chloroplast genome and locally realigned using the read-mapping and local realignment tools found in the NGS Core Tools of the CLC Genomics Workbench (version 8.0.3; QIAGEN, Valencia, California, USA). A total of 20,042 (0.8%) reads were aligned to the draft genome, from which a consensus sequence for the complete amaranth chloroplast genome was extracted.
This reference sequence was validated by resequencing 14 chloroplast gene sequences (Table 2) using amplicon sequencing and by mapping Illumina short reads (see below) from ‘Plainsman’ to the PacBio reference assembly. Both the short-read mapping consensus as well as the amplicon resequencing showed 100% identity to the PacBio-assembled reference genome. The complete A. hypochondriacus reference chloroplast genome has been deposited into GenBank (accession no. KX279888).
To study genetic diversity, total genomic DNA for eight Amaranthus accessions, including ‘Plainsman’ and four additional A. hypochondriacus accessions, and one accession each of A. caudatus, A. cruentus, and A. hybridus (Table 1), was extracted from 30 mg of freeze-dried leaf tissue according to the protocol reported by Todd and Vodkin (1996). Mate-pair libraries with insert sizes of 800 bp were sequenced on the Illumina HiSeq platform (Illumina, San Diego, California, USA) to generate 2 × 100 bp reads (Beijing Genomics Institute, Hong Kong, China). The generated reads were quality trimmed, and six million paired reads were randomly subsampled from each accession and mapped to the ‘Plainsman’ reference chloroplast genome, followed by local realignment using the NGS Core Tools in the CLC Genomics Workbench with default parameters. Consensus sequences for each accession are provided in Appendix S1 (apps.1600063_s1.txt). Across all accessions, the mean depth of coverage and the percentage of total reads that mapped to the ‘Plainsman’ reference were 254× and 6.6%, respectively (Table 1).
The consensus chloroplast genomes for each accession were annotated using CpGAVAS (Liu et al., 2012) with default settings and with S. oleracea (Caryophyllales: Amaranthaceae) set as the reference species (Schmitz-Linneweber et al., 2001). Any problematic annotations were corrected using the Apollo genome editor (Lewis et al., 2002). All eight chloroplast genomes were composed of a single circular double-stranded DNA molecule with the typical quadripartite structure (Fig. 1). Not unexpectedly, all of the chloroplast genomes are similar in length to the ‘Plainsman’ reference genome (150,518 bp, with a 60-bp range among all accessions; Table 1). This is similar to other chloroplast genomes in the Amaranthaceae family (S. oleracea: 150,725 bp and B. vulgaris subsp. vulgaris: 149,637 bp; Schmitz-Linneweber et al., 2001; Li et al., 2014). The ‘Plainsman’ reference chloroplast genome consisted of an 83,873-bp LSC (69-bp range among all accessions), a 17,941-bp SSC (16-bp range among all accessions), and a pair of 24,352-bp IRs (51-bp range among all accessions). The chloroplast genome had a GC content of 37.6% for all eight accessions (Table 1) and consisted of a total of 111 genes, including 72 protein-coding genes, 31 transfer RNA (tRNA) genes, and eight ribosomal RNA (rRNA) genes (Table 2; Appendix S2 (apps.1600063_s2.docx)). Other published chloroplast genomes in the Amaranthaceae family have similar gene counts: S. oleracea is reported to have at least 108 genes and B. vulgaris subsp. vulgaris is reported to have 113 genes (Schmitz-Linneweber et al., 2001; Li et al., 2014). Thirteen genes were repeated in the IR region; one is involved in photosynthesis and the others have self-replication functions (Table 2). Seven genes contained introns, one of which had two introns (Table 2). In the reference genome, the 111 genes constituted a total of 72,598 bp and were composed of 18,299 codons, excluding the stop codons (Appendix S2 (apps.1600063_s2.docx)).
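The reported region lengths and gene counts can be cross-checked with simple arithmetic. The following Python sketch is only an illustration. The class name `Quadripartite` and the dictionary keys are hypothetical and are not part of the annotation pipeline described above. It confirms that the LSC, SSC, and two IR copies sum to the reported genome length, and that the three gene classes sum to the reported gene total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadripartite:
    """Region lengths (bp) of a quadripartite chloroplast genome (hypothetical helper)."""
    lsc: int  # large single copy
    ssc: int  # small single copy
    ir: int   # length of ONE inverted repeat copy

    def total_length(self) -> int:
        # The IR is present twice in the circular molecule.
        return self.lsc + self.ssc + 2 * self.ir


# Values reported in the text for the 'Plainsman' reference genome.
plainsman = Quadripartite(lsc=83_873, ssc=17_941, ir=24_352)
gene_counts = {"protein_coding": 72, "tRNA": 31, "rRNA": 8}

print(plainsman.total_length())   # 150518
print(sum(gene_counts.values()))  # 111
```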
Similar to other land plant chloroplast genomes, the most prevalent amino acid was leucine (10.6%) and the least common was cysteine (1.1%) (Huang et al., 2013, 2014; Qian et al., 2013), with a strong bias toward an A or T at the third codon position (Clegg et al., 1994).
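Amino acid frequencies and third-position base bias of the kind summarized above can be tallied directly from coding sequences. The sketch below is a minimal illustration and not the procedure used to produce Appendix S2. It assumes in-frame coding sequences given as plain strings, uses the standard genetic code, and skips stop codons and any codon containing an ambiguous base. The function name `codon_usage` and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds_list):
    """Return (amino acid frequencies, fraction of codons ending in A or T).

    Stop codons and codons with non-ACGT characters are excluded.
    """
    aa_counts = Counter()
    third_at = 0
    total = 0
    for cds in cds_list:
        seq = cds.upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            aa = CODON_TABLE.get(codon)
            if aa is None or aa == "*":
                continue
            aa_counts[aa] += 1
            total += 1
            if codon[2] in "AT":
                third_at += 1
    if total == 0:
        raise ValueError("no translatable codons found")
    freqs = {aa: n / total for aa, n in aa_counts.items()}
    return freqs, third_at / total


# Toy example (invented sequence): M L L L C, followed by a stop codon.
freqs, at3 = codon_usage(["ATGTTATTACTTTGTTAA"])
print(freqs["L"], at3)
```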
Table 1.
Information on the sequenced amaranth chloroplast genomes and genome features.
Table 2.
Genes contained in the sequenced amaranth chloroplast genomes.
Microsatellites, or simple sequence repeats (SSRs), were identified using MISA (Thiel et al., 2003). SSRs usually have a high mutation rate, are easily genotyped using PCR, and are frequently used as markers for phylogenetic analysis and, when located on nuclear chromosomes, in marker-assisted breeding. We followed standard chloroplast thresholds for identifying SSRs (Sablok et al., 2015), specifically a minimum of 12 repeat units for mono-, six for di-, four for tri-, and three for tetra-, penta-, and hexanucleotide repeats and a minimum distance of 100 bp between compound SSRs. We identified 37 SSRs in the reference genome and 29 to 33 in the other sequenced Amaranthus chloroplast genomes (Table 3). The SSRs were most commonly found in the LSC region of the chloroplast genomes (60–63% of SSRs). The most abundant motifs were mononucleotide runs of A or T. Analysis of the consensus chloroplast genome sequences of the eight sequenced Amaranthus accessions identified repeat number polymorphisms for five, two, and one of the mono-, di-, and tetranucleotide SSRs, respectively. The complete list of the SSRs identified and their positions in the amaranth chloroplast genomes is provided in Appendix S3 (apps.1600063_s3.docx).
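The repeat-unit thresholds above can be illustrated with a short regular-expression scan. This Python sketch is not MISA and makes no claim to reproduce its output. It detects only perfect SSRs, does not model the 100-bp compound-SSR rule, and uses hypothetical names (`MIN_REPEATS`, `find_ssrs`). Motifs that are themselves repeats of a shorter unit (for example "AA" or "ATAT") are skipped so that each run is reported once under its shortest motif.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(unit):
    """True if `unit` is not itself a tandem repeat of a shorter motif."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, end, motif, n_units) tuples; coordinates are 1-based, inclusive."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-base motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        pos = 0
        while pos < len(seq):
            m = pattern.match(seq, pos)
            if m and is_primitive(m.group(1)):
                hits.append((m.start() + 1, m.end(), m.group(1), len(m.group(0)) // k))
                pos = m.end()
            else:
                pos += 1
    return sorted(hits)


# Toy sequence (invented): an A(12) run, an (AT)6 run, and an (AAG)4 run.
toy = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "C" + "AAG" * 4 + "C"
for hit in find_ssrs(toy):
    print(hit)
```

Scanning position by position rather than with non-overlapping `finditer` keeps a skipped non-primitive match (such as "AA" repeated over a homopolymer) from hiding an adjacent genuine repeat.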
Repetitive sequences in chloroplast genomes are considered to play an important role in the rearrangement and stabilization of chloroplast genomes (Milligan et al., 1989; Maréchal et al., 2009). Repeated elements in the Amaranthus chloroplast genomes were identified using REPuter (Kurtz et al., 2001). A minimum repeat size of 30 bp and a sequence identity of ≥90% were used to identify forward and palindromic repeats (no reverse or complement repeats were present). Only a single IR region was included in the analysis to avoid inherent redundancy. The number and size of repeats were conserved among the amaranth chloroplast genomes (Table 4). A total of 34 repeats were found in the A. hypochondriacus ‘Plainsman’ chloroplast genome, including 14 forward repeats and 20 palindromic repeats ranging from 30 to 64 bp in length (Appendix S4 (apps.1600063_s4.docx)).
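The distinction between forward and palindromic repeats can be sketched with a k-mer index. The example below is a deliberate simplification and is not the REPuter algorithm. It finds only exact matches, whereas the analysis above allowed ≥90% identity, and it reports fixed-length seed pairs rather than maximally extended repeats. The function names `revcomp` and `repeat_seeds` are hypothetical. A forward repeat is the same k-mer at two positions, and a palindromic repeat is a k-mer whose reverse complement occurs further downstream.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def repeat_seeds(seq, k=30):
    """Return (forward, palindromic) lists of 0-based (pos_a, pos_b) seed pairs, pos_a < pos_b."""
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)

    forward, palindromic = [], []
    for kmer, positions in index.items():
        # Forward: identical k-mer at two different positions.
        for n, a in enumerate(positions):
            for b in positions[n + 1:]:
                forward.append((a, b))
        # Palindromic: reverse complement of the k-mer found downstream.
        for a in positions:
            for b in index.get(revcomp(kmer), ()):
                if a < b:
                    palindromic.append((a, b))
    return sorted(forward), sorted(palindromic)


# Toy sequence (invented): motif, spacer, motif again, spacer, reverse complement of motif.
toy_repeat = "ACGGATCA" + "TTTT" + "ACGGATCA" + "CCCC" + "TGATCCGT"
fwd, pal = repeat_seeds(toy_repeat, k=8)
print(fwd, pal)
```

For a real genome, one IR copy would be removed before indexing, as described above, so that the inverted repeat itself is not reported as a single very large palindromic repeat.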
SNPs and indels were identified from binary alignment map (BAM) files generated by aligning reads to the ‘Plainsman’ reference genome using BWA-MEM (Li, 2013) for each of the seven resequenced amaranth accessions. SNPs were identified using SAMtools mpileup (Li et al., 2009), and indels were identified using the Genome Analysis Toolkit HaplotypeCaller (GATK-HC) (McKenna et al., 2010). Relative to the ‘Plainsman’ reference chloroplast genome, an average of 210 SNPs and 122 indels were identified across the accessions (Table 5; Appendices S5 (apps.1600063_s5.docx), S6 (apps.1600063_s6.docx)). Indels ranged in size from 2 bp to 75 bp. T/G base substitutions accounted for the highest percentage of all SNPs (18.6%), while G/C base substitutions were the least frequent (2.9%) (Appendix S5 (apps.1600063_s5.docx)).
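A substitution spectrum such as the one summarized above can be tallied from reference and alternate alleles once variants have been called. The Python sketch below is illustrative only. It assumes the calls are already available as (reference, alternate) allele pairs, treats each substitution class as unordered so that T-to-G and G-to-T are pooled, and ignores anything that is not a single-base substitution. The function name `substitution_spectrum` and the example calls are hypothetical. Class labels are sorted alphabetically, so the T/G class of the text appears as "G/T".

```python
from collections import Counter


def substitution_spectrum(snps):
    """Percentage of SNPs in each unordered substitution class (e.g. 'G/T')."""
    counts = Counter()
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Keep only single-base substitutions between unambiguous bases.
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        counts["/".join(sorted((ref, alt)))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: 100.0 * n / total for cls, n in counts.items()}


# Invented example calls; the last entry is a deletion and is ignored.
example_calls = [("T", "G"), ("G", "T"), ("A", "G"), ("C", "G"), ("AT", "A")]
spectrum = substitution_spectrum(example_calls)
print(spectrum)
```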
CONCLUSIONS
We report the assembly and annotation of the first reference-quality chloroplast genome for the genus Amaranthus. Using this reference genome, we reconstructed the chloroplast genome for representatives of the grain amaranths—A. hypochondriacus, A. cruentus, and A. caudatus—and their putative wild ancestor, A. hybridus. The amaranth chloroplast genome retains the quadripartite structure so highly conserved among land plants and is highly conserved at the nucleotide level among the grain amaranths. Nonetheless, several SNPs, indels, and polymorphic SSRs were identified that will serve as invaluable genetic markers for the future genetic improvements of these emerging crop species. The development of a reference chloroplast genome for Amaranthus will also be invaluable for broader comparative studies within Caryophyllales—an order of increasing importance as it contains many halophytic and drought-tolerant species. The availability of a reference chloroplast genome will also open the door to biotechnology applications within the genus such as chloroplast gene transformation. We note that the genus contains not only important grain species, but also several well-known weedy species (e.g., A. retroflexus L., A. viridis L., A. tuberculatus (Moq.) J. D. Sauer, and A. spinosus L.) that are collectively referred to as “pigweeds” and are among the most economically damaging weeds in the United States (Tranel and Trucco, 2009). This reference chloroplast genome will provide an important framework for the accurate assembly of chloroplast genomes of these weedy species, which will be of particular value for monitoring spread and interspecies hybridizations.
Table 3.
Simple sequence repeats (SSRs) in the eight amaranth chloroplast genomes.
Table 4.
Number of repeated sequences by repeat type in the eight amaranth chloroplast genomes.
Table 5.
Number of single-nucleotide polymorphisms (SNPs) and insertions/deletions (indels) in the seven amaranth chloroplast genomes in comparison to the reference chloroplast genome, A. hypochondriacus (PI 558499).
ACKNOWLEDGMENTS
This research was supported by a grant from the Office of Research and Creative Activities at Brigham Young University. The authors gratefully acknowledge D. Brenner (U.S. Department of Agriculture National Plant Germplasm System [USDA-NPGS], Iowa State University, Ames, Iowa, USA) and D. Baltensperger (Texas A&M University, College Station, Texas, USA) for providing seed samples.