To gain insights into adaptation at the molecular level, attention has turned to the investigation of adaptively relevant genes (candidate genes). For nonmodel organisms, access to candidate genes is limited, and the transfer of primers, e.g., from expressed sequence tag (EST) libraries, if available, is labor-intensive. For instance, the resequencing of 800 genes selected from more than 7000 ESTs from Pinus taeda L. yielded only 70 candidate genes for Abies alba Mill. (Mosca et al., 2012). Because sequencing costs are decreasing rapidly, de novo sequencing in nonmodel organisms is now achievable. To identify candidate genes in de novo sequenced organisms, differential expression profiling (e.g., Street et al., 2006; Huang et al., 2012) can be used, but it requires the sequencing of several samples. The sequencing of a single transcriptome, in contrast, is very cost-effective. However, reducing the resulting data remains challenging. Searching against available databases with BLAST is the standard method, but it produces large volumes of output and is therefore mainly used for annotation only (e.g., Parchman et al., 2010). Here, we present a protocol for the efficient reduction of transcriptomic data down to 283 candidate gene sequences that were used for immediate primer development. The protocol is applicable to species that lack genomic resources. It combines a standard and a specific annotation approach and led to the resequencing of 88 gene fragments in A. alba.
METHODS AND RESULTS
A normalized transcriptome of a 1-yr-old A. alba seedling from the Black Forest (Forest District Calw, Germany; voucher MB-P-001007, Herbarium Marburgense, University of Marburg) was sequenced on a 454 GS FLX Titanium platform (cDNA library preparation: Vertis Biotechnology AG, Freising, Germany; sequencing: Genoscreen, Lille, France). The 454 run yielded 1,521,698 reads with an average length of 359 nucleotides (nt). Trimming and de novo assembly of the raw reads into contigs using Newbler software version 2.3 (454 Life Sciences, Branford, Connecticut, USA) resulted in 25,149 contigs consisting of 381,808 completely and 619,615 partially assembled reads. The contig length was between 100 nt and 2394 nt, with an average length of 498 nt. A total of 484,576 reads remained as singletons (Table 1). Contigs were submitted to the Transcriptome Shotgun Assembly database (TSA) at the National Center for Biotechnology Information (NCBI) (accession no.: JV134525-JV157085).
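The read accounting reported for the assembly can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and interpreting the remainder as reads removed during trimming or placed in other assembler categories is an assumption, not something stated in the text.

```python
# Read counts as reported for the 454 run and the Newbler assembly.
total_reads = 1_521_698
fully_assembled = 381_808
partially_assembled = 619_615
singletons = 484_576

# Reads that ended up in contigs, either completely or partially.
assembled = fully_assembled + partially_assembled

# Reads not covered by the three reported categories (assumed here to be
# reads discarded during trimming or assigned to other assembler categories).
unaccounted = total_reads - assembled - singletons

fraction_assembled = assembled / total_reads
fraction_singletons = singletons / total_reads

print(f"assembled:   {assembled} ({fraction_assembled:.1%})")
print(f"singletons:  {singletons} ({fraction_singletons:.1%})")
print(f"unaccounted: {unaccounted}")
```

Roughly two thirds of the reads contribute to contigs under this accounting, which is the pool from which all candidate contigs are subsequently drawn.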
In the specific approach (Fig. 1), we tested a novel annotation protocol: After a literature survey with the keywords “adaptation,” “candidates,” “drought,” “evolution,” “RT-PCR,” and “selection” in various combinations using the Web of Science database, we selected 5349 unique proteins and downloaded them from UniProt or NCBI (downloaded in November 2011). The proteins were subsequently searched against the contigs from the de novo transcriptome assembly, which were formatted as the reference database using the BLAST+ 2.2.24 toolkit (tBLASTn parameters: softmasking = threshold 15 max_target_seqs 10,000). To increase the reliability of alignments and to avoid amplicons that are too short, only alignments with a length of at least 100 amino acids and an identity of at least 90% were considered further. From the contigs that passed the filter, 157 were selected for primer design. In the standard approach (Fig. 1), contigs were searched against the refseq_protein database (downloaded from NCBI 14 June 2011) with strict BLAST settings (BLASTx parameters: threshold 999, window-size 4, gapopen 32767, gapextend 32767, E-value 1e-20) (Altschul et al., 1990). Gene ontologies (Ashburner et al., 2000) were assigned to contig-protein hits using Blast2GO 2.5.0 (Conesa et al., 2005) and subsequently filtered as described above. To select for well-described proteins, contig sequences were used for primer design if they could be assigned to enzyme IDs with the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Ogata et al., 1999) in the final annotation step. Primers were developed specifying the amplified range according to the contig-protein alignment boundaries using default standard PCR settings of PerlPrimer (version 1.1.12; Marshall, 2004).
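The alignment filter described above (length of at least 100 amino acids, identity of at least 90%) can be sketched in Python as follows. This is a minimal illustration, not the script used in the study. It assumes that the tBLASTn hits were written in BLAST+ tabular format (`-outfmt 6`) with the default column order, so that the subject is the contig, column 3 is percent identity, and column 4 is the alignment length (in amino acids for translated searches). The function names and the example identifiers are hypothetical.

```python
import csv
import io

MIN_LENGTH_AA = 100    # minimum alignment length in amino acids
MIN_IDENTITY = 90.0    # minimum percent identity


def filter_hits(handle, min_length=MIN_LENGTH_AA, min_identity=MIN_IDENTITY):
    """Yield (protein, contig, identity, length) for hits passing both thresholds.

    Assumes default tabular columns: qseqid, sseqid, pident, length, ...
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        protein, contig = row[0], row[1]
        identity, length = float(row[2]), int(row[3])
        if length >= min_length and identity >= min_identity:
            yield protein, contig, identity, length


def contigs_passing(handle):
    """Return the sorted, de-duplicated contig IDs with at least one passing hit."""
    return sorted({contig for _, contig, _, _ in filter_hits(handle)})


# Made-up example hits (hypothetical protein and contig identifiers).
EXAMPLE = "\n".join([
    "P12345\tcontig00001\t95.2\t150\t7\t0\t1\t150\t10\t459\t1e-80\t250",
    "P12345\tcontig00002\t85.0\t200\t30\t0\t1\t200\t1\t600\t1e-70\t220",   # identity too low
    "Q67890\tcontig00003\t92.0\t80\t6\t0\t1\t80\t1\t240\t1e-30\t110",      # alignment too short
    "Q67890\tcontig00001\t90.0\t100\t10\t0\t5\t104\t13\t312\t1e-40\t140",  # passes at both limits
    "A11111\tcontig00004\t99.0\t120\t1\t0\t1\t120\t4\t363\t1e-60\t200",
])

print(contigs_passing(io.StringIO(EXAMPLE)))
```

Collecting contig identifiers in a set means that a contig hit by several proteins is counted once, which matches the idea of selecting contigs, rather than individual alignments, for primer design.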
Primers were tested in a 30 µL PCR reaction with 17.28 µL double-distilled water, 3 µL 10× PCR buffer with MgCl₂ (20 mM), 1.2 µL MgCl₂ (25 mM), 3 µL primer mix (forward and reverse each 2 µM), 1.44 µL dNTPs (each 5 mM), 0.24 µL bovine serum albumin (BSA) (20 mg mL⁻¹), 0.24 µL DreamTaq polymerase (5 U µL⁻¹, Fermentas, St. Leon-Rot, Germany), and 3.6 µL DNA (10 ng µL⁻¹). The PCR was performed with 5 min initial denaturation at 94°C followed by 35 cycles of 45 s denaturation at 94°C, 45 s annealing at 52–59°C, and 45 s elongation at 72°C, and a 10 min final elongation at 72°C. For the amplification test, four samples were randomly chosen for each gene from a set of 80 different silver fir trees that were sampled in May 2011 on Mont Ventoux (44°10′44.35″N, 5°14′32.29″E, France). Amplification was evaluated by electrophoresis in 1% agarose gels. When amplification was too weak, the volume of MgCl₂ was increased to 1.8 µL. When faint ancillary bands appeared, no additional magnesium was added to the master mix. If PCR products occurred as a single band, one sample per gene was chosen for sequence analysis to ensure that the region of interest was amplified (LGC Genomics GmbH, Berlin, Germany). Gene sequences were aligned to the corresponding contigs using the CodonCode Aligner software (default large gap settings) to reveal the location of the introns. The gene sequences were searched against the nr nucleotide database of NCBI (default discontiguous megaBLAST settings, web application).
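The per-reaction volumes listed above sum to the stated 30 µL, and they can be scaled to a master mix for several reactions. The following Python sketch illustrates this. The 10% pipetting excess and the choice to add the DNA template to each tube separately are common laboratory conventions assumed here, not details taken from the protocol, and the function name `master_mix` is hypothetical.

```python
from decimal import Decimal

# Per-reaction volumes in µL, as listed in the protocol.
PER_REACTION = {
    "double-distilled water": Decimal("17.28"),
    "10x PCR buffer": Decimal("3"),
    "MgCl2 (25 mM)": Decimal("1.2"),
    "primer mix": Decimal("3"),
    "dNTPs": Decimal("1.44"),
    "BSA": Decimal("0.24"),
    "Taq polymerase": Decimal("0.24"),
    "DNA template": Decimal("3.6"),
}


def master_mix(n_reactions, excess=Decimal("0.1"), per_sample=("DNA template",)):
    """Scale the shared components for n reactions plus a pipetting excess.

    Components named in per_sample are assumed to be added to each tube
    individually and are therefore left out of the mix.
    """
    factor = Decimal(n_reactions) * (Decimal(1) + excess)
    return {
        name: volume * factor
        for name, volume in PER_REACTION.items()
        if name not in per_sample
    }


total = sum(PER_REACTION.values())
print(f"volume per reaction: {total} µL")

# Four samples were tested per gene, so a mix for four reactions is shown.
for name, volume in master_mix(4).items():
    print(f"{name}: {volume} µL")
```

`Decimal` is used instead of floating-point numbers so that the small volumes add up exactly to the nominal reaction volume.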
Table 1. Statistics of the 454 transcriptome sequencing run and metrics of the Newbler assembly software.
In the specific approach, tBLASTn and subsequent sorting led to 321 contigs. For primer development, 185 contigs were picked. In the standard approach, the initial number of contigs was decreased to one third after the BLASTx step. Approximately half of the hits could be further annotated with gene ontologies. After filtering, 126 contigs were successfully assigned to enzyme IDs and used for primer design (Fig. 1). In combination, 283 different contigs were annotated, and only 28 were annotated with both approaches. Primer testing and sequencing resulted in 88 gene sequences (Table 2). Fifty-seven genes were annotated using the specific approach, and 42 using the standard approach. Eleven were annotated by both approaches. The assembly of the gene sequences and the corresponding cDNA contigs revealed 43 introns in 26 genes. The length of the gene sequences ranged from 262 to 1486 nt. All gene sequences aligned to sequences from the nr nucleotide database (NCBI), with a highest E-value of 5.00e-32. Twelve gene sequences hit organelle DNA (10 chloroplast, one mitochondrial, and one ribosomal). The remaining 76 are involved in the biosynthesis of different compounds (21), regulation (20), primary metabolism (14), growth (11), stress response (8), and water transport (2). In the biosynthesis group, enzymes from the auxin pathways, the phenylpropanoid pathways, and the tetrapyrrole pathways were dominant. With the exception of the primary metabolism group, all groups included candidates for the analysis of adaptation at the gene level that had been investigated in previous studies of conifers (e.g., González-Martínez et al., 2006).
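The counts reported above follow from simple inclusion-exclusion: items annotated by both approaches are counted once in the combined set. The short Python sketch below reproduces this bookkeeping from the reported numbers. The helper name `union_size` and the dictionary layout are ours and serve illustration only.

```python
def union_size(first, second, shared):
    """Size of the combined set when `shared` items occur in both inputs."""
    return first + second - shared


# Contigs selected for primer design: specific, standard, and both approaches.
contigs_total = union_size(185, 126, 28)

# Successfully resequenced genes: specific, standard, and both approaches.
genes_total = union_size(57, 42, 11)

# Breakdown of the resequenced genes by best database hit and by function.
organelle = {"chloroplast": 10, "mitochondrial": 1, "ribosomal": 1}
functional = {
    "biosynthesis": 21,
    "regulation": 20,
    "primary metabolism": 14,
    "growth": 11,
    "stress response": 8,
    "water transport": 2,
}

print(f"contigs used for primer design: {contigs_total}")
print(f"resequenced genes: {genes_total}")
print(f"organelle + functional: {sum(organelle.values()) + sum(functional.values())}")
print(f"share of contigs yielding a gene sequence: {genes_total / contigs_total:.1%}")
```

The same arithmetic shows how many genes each approach contributed on its own: 57 - 11 = 46 for the specific approach and 42 - 11 = 31 for the standard approach.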
The two approaches of the workflow are complementary, each contributing approximately half of the annotations in the final set of sequences. The standard approach can be run rapidly, but targets only well-known genes. The specific approach, which is based on a review of the relevant literature, is novel and provided a substantial number of nonredundant annotations. A further advantage is that it can easily be adjusted and extended to suit the researcher's interests. The quality-tested primers can be used for assessing the degree of gene polymorphism in ecological genetics studies.