Artocarpus J. R. Forst. & G. Forst. (Moraceae) contains approximately 70 species of monoecious trees with a center of diversity in Malesia (Zerega et al., 2010). The genus includes several underutilized crops that can improve food security, most notably breadfruit (A. altilis (Parkinson) Fosberg), a long-lived perennial crop that is low input but high yielding (Jones et al., 2011). Other Artocarpus crops include the pantropically cultivated jackfruit (A. heterophyllus Lam.), crops of regional importance like cempedak (A. integer (Thunb.) Merr.) and terap (A. odoratissimus Blanco), and more than a dozen other species with edible fruits whose potential remains largely unexplored (Zerega et al., 2010). Artocarpus camansi Blanco (breadnut), native to New Guinea, is the diploid wild progenitor of breadfruit and is cultivated throughout the tropics for its large edible seeds (Zerega et al., 2005) (Fig. 1).
Existing genomic resources for breadfruit include nuclear (Witherup et al., 2013) and chloroplast (Gardner et al., 2015) microsatellites as well as transcriptomes of breadfruit and two wild relatives (Laricchia, 2014). In this study, we augment these resources with a low-coverage shotgun assembly of the A. camansi genome.
Recent studies focused on different taxonomic groups have used ultra-shallow sequencing (or “genome skimming”)—with coverage of less than 1×—to assemble portions of genomes for annotation and marker discovery, particularly high-copy sequences such as organellar genomes and repetitive elements (Straub et al., 2011; Blischak et al., 2014). Here, we explore the utility of somewhat deeper but still shallow (17×) genome sequencing for de novo genome assembly and annotation with the goal of phylogenomic marker development, gene discovery, and the detection of whole-genome duplications.
METHODS AND RESULTS
The individual for sequencing was selected for the absence of heterozygosity at 19 nuclear microsatellite loci (Zerega et al., 2015), possibly due to centuries of inbreeding. The individual is the offspring of a tree planted in the Lancetilla Botanical Garden, Honduras. The Honduran tree likely descended from Caribbean material, which in turn was likely descended from seedlings in Mauritius that were collected in the Philippines by the French naturalist Pierre Sonnerat (1748–1814) (Ragone, 1997). Leaf tissue from A. camansi (living collection at McBryde Garden at the National Tropical Botanical Garden, Kalaheo, Hawaii: NTBG 960576.001; voucher: EG149 [CHIC]) was collected and dried on silica. Genomic DNA was extracted using the QIAGEN DNeasy Plant Mini Kit following the manufacturer's protocol (QIAGEN, Valencia, California, USA). Two Illumina TruSeq libraries were prepared: a paired-end library with a mean insert size of 180 bp and a mate-pair library with a mean insert size of 854 bp (Illumina, San Diego, California, USA). The paired-end library was sequenced in a single lane on an Illumina HiSeq 2000 (2×100 bp, aired–end), and the mate-pair library was sequenced in one-half lane. Library preparation and sequencing took place at the Next Generation Sequencing Core Facility, Argonne National Laboratory (Lemont, Illinois, USA).
Assembly and bait sequence development—De-multiplexed sequences were quality trimmed (>Q20 in a 5-bp window) using Trimmomatic (Bolger et al., 2014) and assembled using Ray 2.3.1 with a k-mer size of 31 (Boisvert et al., 2010). We then conducted searches to identify sequences optimized for hybridization-based target enrichment (Hyb-Seq) baits. Hyb-Seq uses oligonucleotide fragments (“baits”) to enrich DNA libraries for the desired portions of the genome (“targets”) (Weitemier et al., 2014). Bait sequences for 333 phylogenetic markers were developed as follows: ESTScan 3.0.3 (Iseli et al., 1999; http://sourceforge.net/projects/estscan/) was run on the A. camansi contigs and the Cannabis sativa L. transcriptome (van Bakel et al., 2011) to predict amino acid sequences. Homologous protein clusters (orthogroups) were circumscribed using reciprocal BLAST searches with these proteomes and the published proteome of Morus notabilis C. K. Schneid. (Moraceae) (He et al., 2013) using Proteinortho version 4 (Lechner et al., 2011). Orthogroups containing exactly one copy per species were subjected to further filtering as follows. Organellar sequences (determined using a BLASTN search against the nonredundant nucleotide [nr/nt] database) (Altschul et al., 1990) and predicted coding sequences residing on contigs smaller than 1000 bp were discarded. To ensure that baits would not hybridize to multiple regions of the genome, the remaining Artocarpus sequences were searched against the original A. camansi contigs using BLASTN, and any query sequence with more than one hit (>90% identity and >150 bp) was discarded. Finally, 333 predicted genes with exons between 500 and 2200 bp were selected. We retained only long exons to maximize the likelihood of hybridization success in divergent taxa, reasoning that longer exons would have a greater likelihood of containing at least a short span of conserved sequence that would hybridize across greater phylogenetic distances.
Sequencing statistics in Artocarpus camansi.
In addition to the markers targeted for their phylogenetic utility, target sequences for 125 additional genes were developed as follows. These sequences were selected based on predicted function only and were not screened for optimal hybridization characteristics such as long exons, high alignment specificity, or low copy number. Protein-coding regions of all scaffolds were predicted with AUGUSTUS (Stanke et al., 2008) using default parameters and allowing partial gene models. These predictions differed from the ESTScan predictions in their tendency to find even small, widely spaced, and partial exons, which is important for resequencing of functional genes, but not necessarily ideal for optimizing hybridization. Putative MADS-box genes were identified using a BLASTP search seeded with sequences from Arabidopsis thaliana (L.) Heynh. (Lamesch et al., 2012) and M. notabilis (He et al., 2013) containing “MADS” in their annotations.
An initial set of genes resulting from the BLASTP search (E-value cutoff 1e-10) was searched against all embryophyte genes in the National Center for Biotechnology Information nonredundant (NR) protein database, and those 98 sequences whose top BLASTP hit was a MADS-box gene were retained. To fill out the bait set, an additional 27 genes homologous to genes involved in biosynthesis of floral volatile compounds were extracted using the same search method, seeded with sequences from Arabidopsis and Fragaria vesca L. that are associated with KEGG pathway map00902 (monoterpenoid biosynthesis) (Kanehisa and Goto, 2000). These genes were of interest because of the role floral volatile compounds play in attracting pollinators in Moraceae (Yu et al., 2015). All 125 target sequences of functional interest were annotated with a BLASTP search against all plant proteins in the NR database, and ortholog hit ratios were calculated by dividing the alignment length by the target sequence length (O'Neil and Emrich, 2013). To tentatively classify the 98 MADS-box bait sequences, corresponding amino acid sequences were aligned using MAFFT (Katoh and Standley, 2013) to the 70 Arabidopsis MADS-box proteins chosen for the most recent comparative study using a member of Rosales (Malus ×domestica Borkh., domesticated apple) (Tian et al., 2015). We used amino acid rather than nucleotide sequences because of the great phylogenetic distances involved in aligning an entire ancient gene family. A maximum-likelihood tree was constructed with RAxML using the “PROTGAMMAAUTO” model setting (Stamatakis, 2006).
Assembly statistics in Artocarpus camansi.
RNA scaffolding, gene annotation, and Ks analysis—To scaffold the genome assembly further before a full genome annotation was performed, we used a transcriptome assembly of A. camansi (seed offspring of NTBG 000501.005, voucher: EG140 [CHIC], BioProject PRJNA311339) (Laricchia, 2014) in L_RNA_scaffolder (Xue et al., 2013). This final assembly was used for all further analyses. Gene prediction on the extended scaffolds was performed using AUGUSTUS (Stanke et al., 2008), guided by alignments from the transcriptome assembly, and limited to complete gene models only. Predicted amino acid sequences were annotated with Swiss-Prot hits and Gene Ontology (GO) terms using Trinotate ( https://trinotate.github.io/). We determined orthology between the AUGUSTUS protein predictions and the published proteomes of M. notabilis and C. sativa with Proteinortho (Lechner et al., 2011), using default settings. Artocarpus proteins were reduced to a nonredundant set using CD-HIT-EST (Fu et al., 2012) to remove sequences that were 98% similar at the nucleotide level along 90% of their length. Chromosome counts support a hypothesis of two whole-genome duplications within Artocarpeae: the one at the root of the tribe, evidenced by a haploid chromosome number of 14 in Clarisia biflora Ruiz & Pav. (Oginuma and Tobe, 1995), and another in the genus Artocarpus, which has a haploid number of 28 (Banerji and Hakim, 1954; Oginuma and Tobe, 1995; Ragone, 2001). By contrast, Morus has a base haploid number of 7, although a polyploid series exists within that genus as well (He et al., 2013) (Fig. 2). To determine the age of possible whole-genome duplication (WGD) in Artocarpus, we selected Proteinortho orthogroups that were single copy in Morus and Cannabis, but multicopy in Artocarpus. We generated pairwise alignments of Artocarpus protein sequences (paralogs) in each of these orthogroups using MAFFT, and back-translated the sequences to nucleotides using PAL2NAL (Suyama et al., 2006). We calculated the synonymous substitution rate (Ks) for each paralog pair with KaKs_Calculator (Zhang et al., 2006) using the GY method (also known as F3 × 4) (Goldman and Yang, 1994). We also used this procedure to calculate Ks for orthogroups that were single copy in all three proteomes to generate three ortholog Ks distributions: Morus-Artocarpus, Morus-Cannabis, and Cannabis-Artocarpus.
Sequencing and assembly results—Sequencing and assembly statistics appear in Tables 1 and 2. Based on the k-mer counting method, the Ray assembler predicted a genome size of 669 Mb, with a peak coverage of 17×. The Ray assembly was 631 Mb (93% of the predicted size by the k-mer method implemented in Ray), and the scaffold N50 was 2426 bp. After additional scaffolding with L_RNA_scaffolder, the scaffold N50 increased to 2574, and the number of scaffolds dropped from 401,899 to 388,956 (Table 2).
Identification of target sequences—Proteinortho found 2041 putative singlecopy orthogroups with exactly one copy in each of the three species included that were nonorganellar and were present on contigs longer than 1000 bp. Of the 617 orthogroups with only a single BLASTN hit in the original A. camansi assembly hit (>90% identity and >150 bp), the 333 selected for bait design were between 504 and 2166 bp, and these totaled 354,171 bp. Comparison of these exons with full gene predictions by AUGUSTUS in IGV revealed that ESTScan tended to find single-exon genes or the longest exon in multiexon genes, which was expected to increase the efficiency of hybridization. GenBank accession numbers, Artocarpus scaffold coordinates, and orthology assignments to M. notabilis and C. sativa are presented in Appendix S1 (apps.1600017_s1.docx). The maximum-likelihood tree (Fig. 3) indicates that of the 98 putative MADS-box sequences, 51 are tentatively assignable to domain MIKC (Type 2) and 47 are tentatively assignable to domain M (Type 1). However, these assignments should be treated with caution due to the fragmentary nature of the assembly. Exons for the putative MADS-box genes and the additional 27 terpenoid synthesis genes totaled 101,871 bp. The median ortholog hit ratio, calculated against the top BLASTP hit, was 0.60, indicating that most sequences likely contained the majority of a gene; 16 had an ortholog hit ratio of 1, indicating complete coding sequences. GenBank accession numbers, Artocarpus scaffold coordinates, and the best BLASTP hit are presented in Appendix S2 (apps.1600017_s2.docx). Fine-scale name assignments based on BLASTP hits should be treated with caution, particularly for the MADS-box genes, as they do not always agree with the phylogenetic placement in the maximum-likelihood tree. The 458 Artocarpus sequences identified here, combined with Morus orthologs for the 333 phylogenetic markers, maximized the available target space for a bait set made up of 20,000 120-mer baits with 3× tiling (MYcroarray, Ann Arbor, Michigan, USA). The use of these bait sequences to capture sequences from taxa representing all Moraceae tribes, including differences in hybridization efficiency at various phylogenetic distances, and between the 333 phylogenetic markers and the other genes, is reported in Johnson et al. (2016).
Gene annotation and Ks analysis—AUGUSTUS predicted 83,061 gene sequences on the extended scaffolds; of these, Trinotate reported annotations for 49,089 sequences, including 22,865 sequences annotated with 1226 unique GO terms. A search for reciprocal best BLASTP hits between the predicted proteins for A. camansi and M. notabilis found 14,089 orthogroups collectively containing 15,634 Artocarpus genes and 14,506 Morus genes, which is 54% of the Morus total. Using the AUGUSTUS protein predictions, the M. notabilis proteome, and the C. sativa proteome, we generated 18,657 orthogroups using Proteinortho, 10,925 of which contained proteins from all three proteomes, and 8,539 were single copy in all three genomes. There were 1391 orthogroups containing multiple sequences from Artocarpus but only one sequence from each of the published genomes (many-to-one orthogroups). The distribution of synonymous substitutions (Ks) between pairs of Artocarpus paralogs in the manyto-one orthogroups reflects evidence of a possible lineage-specific WGD in A. camansi (Fig. 4). We then compared the Ks distribution of Artocarpus paralogs in many-to-one orthogroups to the distribution of Ks between pairs of orthologs in the single-copy orthogroups generated by Proteinortho. The Artocarpus paralog distribution indicates WGD in Artocarpus is more recent than the divergence of Artocarpus from M. notabilis, a member of the sister tribe Moreae (Fig. 4). This is consistent with a hypothesis of WGD events in the tribe Artocarpeae based on published chromosome counts (Fig. 2, discussed above). However, as confirmed chromosome counts exist for only a few of the ca. 70 Artocarpus species, more work is required to arrive at a firm conclusion regarding the precise phylogenetic position of the duplication within the genus.
The raw reads and the Ray assembly were deposited in GenBank under BioProject PRJNA301299. Scaffolds less than 200 bp (n = 5813) were omitted from the GenBank accession, because NCBI WGS does not accept scaffolds shorter than 200 bp. Both draft assemblies, bait sequences, and full-length targets described here have been archived in the Dryad Digital Repository ( http://dx.doi.org/10.5061/dryad.3293r; Gardner et al., 2016). They are also available on the Artocarpus Genomics website, which will host any future updates ( http://sites.northwestern.edu/zerega-lab/research/artocarpus-genomics/).
This study constitutes a substantial increase in the genomic resources available for Artocarpus in particular and Moraceae in general. The 333 phylogenetic markers will enable large-scale phylogenomic studies within the family using a Hyb-Seq (target enrichment) approach. Because sequence capture is anchored by conserved exons but can extend into flanking noncoding regions if read lengths are sufficient (Johnson et al., 2016), these markers can be employed at both deep and shallow phylogenetic scales. The gene predictions provide a resource that can be mined to explore gene families; for example, the MADS-box sequences identified provide a starting point for characterizing this important gene family in Artocarpus. Finally, the shotgun assembly presented here, while fragmentary, covers an estimated 93% of the genome, and represents a first step toward a reference genome assembly for Artocarpus. Thus the study demonstrates how existing reference genomes (Morus, Cannabis, and Arabidopsis) can be leveraged to increase the value of de novo shotgun assemblies from shallow genome sequencing, enabling the inexpensive development of genome-level resources for nonmodel organisms with moderately large genomes.
The authors thank S. Greenwald (Argonne National Laboratory) for library preparation and Illumina sequencing and A. Devault (MYcroarray) for assistance with bait design. This research was funded by National Science Foundation grants to N.J.C.Z. (DEB-0919119) and N.J.W. (DEB-1239992) and a grant to N.J.C.Z. from the Institute for Sustainability and Energy at Northwestern University.