The generation and analysis of mitochondrial DNA (mtDNA) sequence data has become routine in mammalogy. Unfortunately, these analyses can be confounded because fragments of the mitochondrial genome are contained in the nucleus of most eukaryotes. Furthermore, these nuclear fragments of mitochondrial genes, or numt pseudogenes, are often represented hundreds of times in mammalian nuclear genomes. Most modern analyses of mtDNA rely on the polymerase chain reaction to generate a population of molecules that can be sequenced. Templates for DNA sequencing reactions should be homogenous, and in the case of mtDNA, cytoplasmic in origin. The unwanted (and often unwitting) amplification of numts results in a heterogenous mixture of nuclear and cytoplasmic amplicons or, if a numt is preferentially amplified, a near-homogenous mixture of the wrong (nuclear) template. These nuclear sequences can cause major—although often cryptic—problems in the analyses of systematic or phylogeographic data. Here, we review the occurrence, detection, and avoidance of numts in mammals. Furthermore, we isolate a cytochrome-b numt and its corresponding mitochondrial sequence in the North American prairie vole (Microtus ochrogaster) to illustrate various methods to detect numts. Finally, we present approaches by which numts, once identified, can be utilized in molecular studies.
The mammalian mitochondrial genome possesses a number of characteristics that have made it a valuable molecular marker in population genetics, conservation biology, phylogeography, and systematics (Avise 1991). Since the publication of the genetic species concept of Bradley and Baker (2001), mammalogists have increasingly relied upon mitochondrial DNA (mtDNA) sequences in evolutionary studies. Unfortunately, many of these mtDNA appraisals have failed to consider the potentially confounding influence of nuclear sequences on their analyses. Despite the physical separation of the mitochondrion and the nucleus within the cell, fragments of the mtDNA genome have been found within the nuclear genome. Because these nuclear sequences can be highly similar to their mtDNA counterparts, they may be isolated along with mtDNA sequences during molecular studies.
Nuclear mitochondrial translocations (numt pseudogenes or numts—Lopez et al. 1994) are mtDNA fragments that have been incorporated into the nuclear genome (see Bensasson et al. 2001; Leister 2005; Zhang and Hewitt 1996 for reviews). First documented in the 1960s (du Buy and Riley 1967), they have subsequently been identified in a wide variety of plants and animals, including many mammalian lineages (Table 1). More than 600 numt pseudogenes have been documented in the human genome alone (Woischnik and Moraes 2002). Numts can originate from different portions of the mitochondrial genome, can vary in their degree of similarity to their corresponding mtDNA fragments, and can encompass multiple genes or mere fragments of genes. Richly and Leister (2004) analyzed numts within sequenced mtDNA and nuclear eukaryotic genomes and found that there was no apparent correlation between numt size or abundance with genome size or gene density. Fragments of the mtDNA control region are underrepresented in the human genome sequence (Mourier et al. 2001; Woischnik and Moraes 2002), but it is still unclear whether or not this is true in other organisms.
Although the original numts that colonized mammalian genomes probably originated via independent translocation events, once integrated into a chromosome these DNA sequences are subject to the same gene duplication processes that give rise to gene families (Triant and DeWoody, in press). The exact proportion of numts that are derived from independent translocations versus nuclear duplications is unclear (Bensasson et al. 2003; Hazkani-Covo et al. 2003; Mirol et al. 2000), but whatever the absolute contribution of insertions versus duplications, the fact remains that mammalian nuclear genomes are replete with fragments of mtDNA.
How and why mtDNA fragments incorporate themselves into the nuclear genome has also been the subject of speculation. Some suggest that numt integration may be associated with chromosomal repair mechanisms (Willett-Brozick et al. 2001), whereas others postulate that mobile elements may somehow facilitate numt transfer (Mishmar et al. 2004). Numts are generally thought to be nonfunctional once incorporated into the nucleus because of the different genetic codes utilized by the mitochondrial and nuclear genomes. Although their accumulation may have helped shape the evolution of mammalian genomes, numts may not be completely benign. For example, at least 27 numts have colonized the human genome in the last 4–6 million years (Ricchetti et al. 2004). Of these, 23 (85%) are inserted directly into genes (usually introns), including critical tumor suppressor genes. Turner et al. (2003) described a numt insertion in humans that caused a stop codon in a functional gene, leading to a truncated polypeptide that could potentially impede normal development.
Analyses of mtDNA should always consider numts as potential sources of contamination. Unrecognized numts can appear to be unique mtDNA haplotypes, which can then confound downstream molecular analyses (Collura and Stewart 1995; Hirano et al. 1997; Parr et al. 2006). For example, Jensen-Seaman et al. (2004) re-examined mtDNA control region sequences of gorillas (Gorilla) and found that multiple haplotypes deposited in the GenBank database were, in fact, numts. Another study reported that divergent primate mtDNA haplotypes thought to have contaminated polio vaccines were later discovered to be macaque (genus Macaca) numts (Vartanian and Wain-Hobson 2002).
Systematic and phylogeographic assessments can be compromised if sequences with different evolutionary histories are utilized without knowledge of their true ancestry. Because the mtDNA genome is effectively haploid, is under different selection pressures than the nuclear genome, and exhibits a faster rate of nucleotide substitution, the inclusion of nuclear DNA in mtDNA data sets can lead to inaccurate species identifications, divergence estimates, phylogenetic groupings, and population breaks (Zhang and Hewitt 1996). Possible signs of numt contamination include multiple bands during electrophoresis or in restriction profiles and ambiguities in chromatograms, but numts may appear as unambiguous sequences if they are amplified preferentially. Fortunately, there are many ways by which numts can be identified and thus excluded from analyses as confounding factors. Several approaches to avoid and detect numts are described below.
Avoiding numts
Isolation of entire mtDNA genomes
Traditionally (i.e., before the advent of polymerase chain reaction [PCR]), mtDNA genomes were isolated in their entirety by ultracentrifugation in a salt gradient such as cesium chloride (Lansman et al. 1981). Unfortunately, this method is labor-intensive and expensive, but entire intact mtDNA molecules can now be isolated using spin chromatography. In brief, genomic DNA, isolated using conventional procedures, is passed through a column containing a filter matrix that preferentially binds DNA of a given size. Many such columns will accommodate mammalian mtDNA (∼16–17 kilobases [kb]) and yield virtually pure product. This procedure is less labor-intensive and less expensive than salt-gradient extractions and many commercial companies make spin-chromatography extraction kits (e.g., Bio-Rad, Hercules, California; Promega, Madison, Wisconsin; and Qiagen, Valencia, California). However, if primers preferentially amplify nuclear contaminants rather than mtDNA target sequences, purified mtDNA can still yield numt sequences (Collura and Stewart 1995).
Tissue sources
Tissue sources that are rich in mtDNA relative to nuclear DNA (e.g., muscle and liver) can also help to reduce numt occurrence. Numts have been found in equal proportion to mtDNA sequences in avian blood samples, which have nucleated red blood cells (Sorenson and Quinn 1998). Conversely, Greenwood and Pääbo (1999) found that mammalian blood samples, which contain anucleated red blood cells, were a good source of mtDNA. The authors used primers that amplified a portion of the control region in elephant blood but found that those same primers amplified nuclear insertions in hair samples taken from the same individual. This study illustrates that primers may amplify mtDNA in one tissue type but numts in another. If multiple tissue types are available, mtDNA sequences can be verified from multiple sources.
Long-distance PCR
Numts can vary in size, but most are small (<1 kb—Richly and Leister 2004). Thus, numts often can be avoided by using PCR primers that amplify substantial portions of the mtDNA molecule (i.e., more than 1 mtDNA gene). This approach supplies a long amplicon from which smaller amplicons of interest can be isolated using internal primers (e.g., Thalmann et al. 2004; Triant and DeWoody 2006). Long amplicons, when sequenced, should reveal only open reading frames if they are mitochondrial in origin. If the sequence is nuclear in origin, longer fragments will likely reveal numt features such as stop codons or frame-shift mutations (see below). Sequencing a long amplicon beyond the mtDNA gene(s) of interest can also reveal numt insertion sites as indicated by the sequence abruptly falling out of alignment with its corresponding mtDNA sequence. A related approach takes advantage of the circularity of the mtDNA genome; mtDNA fragments can be isolated with outward-extending primers in a manner similar to inverse PCR (Ochman et al. 1988). Using this approach, numts can be avoided regardless of their length because such inverse primers should not amplify linear fragments.
Detecting numts
Restriction digests
Signs of numt contamination often include unexplained banding patterns during electrophoresis and restriction assays. If bands appear to be pure mtDNA during electrophoresis, the PCR product then can be digested with restriction enzymes that cut within the mtDNA fragment to further identify potential numts. Sequencing should distinguish any suspect numts that exhibit spurious bands. With large data sets, digesting PCR products with restriction enzymes before sequencing can alert researchers to possible numt contamination and save valuable time and resources.
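The expected banding pattern for such a digest can be predicted in silico before any bench work. The following is a minimal sketch, not part of the original protocol: the function name `digest_fragments` and both example amplicons are hypothetical, and EcoRI (recognition site GAATTC, cut after the first G) is used only as a familiar example enzyme. A numt that has lost the recognition site through substitution would remain uncut, so a mixed template would show both sets of bands.

```python
def digest_fragments(sequence, site, cut_offset):
    """Return fragment lengths after cutting a linear sequence at every site.

    cut_offset is the number of bases of the recognition site that stay on
    the left-hand fragment (1 for EcoRI, which cuts G^AATTC).
    """
    sequence = sequence.upper()
    site = site.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Hypothetical amplicons (not real vole sequences): the second carries a
# single substitution that destroys the EcoRI site.
mtdna_like = "ACGT" * 5 + "GAATTC" + "TTGCA" * 4
numt_like = "ACGT" * 5 + "GAGTTC" + "TTGCA" * 4

print(digest_fragments(mtdna_like, "GAATTC", 1))  # two fragments
print(digest_fragments(numt_like, "GAATTC", 1))   # one uncut fragment
```

Comparing predicted fragment sizes with the observed gel is only a screen; sequencing is still needed to confirm the identity of unexpected bands.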
Cloning
Theoretically, PCR amplification and subsequent cloning of haploid mtDNA should yield recombinants identical in sequence. In practice, this is seldom the case because of polymerase (e.g., Taq) errors, heteroplasmy, and cloning artifacts (Baker et al. 1999). Despite this background level of nucleotide variation, numts have been isolated via T/A cloning of PCR products. T/A cloning can identify numts that appear as ambiguities in chromatograms or at such low levels that they are not otherwise detected through conventional sequencing (DeWoody et al. 1999). This procedure takes advantage of the tendency of some DNA polymerases, such as Taq, to add a 3′-A overhang to the end of PCR products. These PCR products can then be ligated into a vector with a complementary 3′-T overhang, cloned, and sequenced (Sambrook and Russell 2001).
Gene expression
Most interspecific systematic assays in mammalogy rely on protein-coding genes (e.g., cytochrome b), but numts are usually thought to be unexpressed. Thus, one can potentially confirm the mitochondrial origin of sequences via reverse-transcriptase PCR. This approach relies upon the isolation of polyadenylated rRNAs and mRNAs (Fernández-Silva et al. 2003; Ojala et al. 1981) using a standard poly-T or random priming protocol (Sambrook and Russell 2001) and the subsequent conversion of mRNA into cDNA using reverse transcriptase; the resulting cDNA can then be used for PCR. This approach avoids unexpressed templates (i.e., most numts) but has the disadvantage of requiring fresh tissue for RNA extraction. Although some numts are occasionally transcribed (Blanchard and Schmidt 1996), this procedure can enrich for mtDNA template.
Fluorescent in situ hybridization
Another time-consuming but powerful method to identify potential numts and their chromosomal location within the nuclear genome is fluorescent in situ hybridization (Rudkin and Stollar 1977). Fluorescent in situ hybridization employs fluorescent microscopy to detect labeled DNA probes that have been hybridized to metaphase chromosomes and is useful in assessing numt position and copy number (Kim et al. 2006; Lopez et al. 1994). The suspect numt sequence can be used as a probe and hybridized to chromosome spreads resulting in a fluorescent signal visible at the sites of probe hybridization (i.e., putative numt integration—Trask 1991). Alternatively, the entire mtDNA genome can be used as a probe to assess whether the nuclear genome contains multiple numt copies. Probe-labeling kits are commercially available (e.g., Roche, Pleasanton, California; and Sigma, St. Louis, Missouri) but probes often need to be at least ∼5 kb in length to capture any fluorescent signal (Trask 1991) and the preparation of chromosomal slides requires fresh cells from live-captured animals (Baker et al. 2003).
Comparative sequence analysis, translations, and secondary structures
Numts are no longer under the strong selective constraints found in the mtDNA genome; thus, they should not exhibit codon position bias (e.g., selection against changes at the 2nd codon position). Numts also should lack the pronounced transitional bias found in animal mtDNA, depending upon their date of translocation to the nucleus (Bensasson et al. 2001). Substitution patterns inferred from pairwise comparisons of putative numts with known mtDNA sequences from related taxa can reveal whether a sequence is indeed nuclear, but this approach is not foolproof as some numts and their mtDNA complements have highly similar nucleotide compositions (Kim et al. 2006; Lopez et al. 1996; Triant and DeWoody 2007).
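The two substitution patterns described here (codon-position bias and transition bias) can be tallied directly from a pairwise alignment. The sketch below is illustrative only and assumes two aligned, in-frame sequences of equal length; the function name `substitution_summary` and the 12-base example sequences are hypothetical and are not the data analyzed in this study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_summary(seq_a, seq_b):
    """Tally transitions, transversions, and differences per codon position.

    Both sequences must be aligned, equal in length, and start at a 1st
    codon position. Columns with gaps or ambiguity codes are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    by_codon_position = [0, 0, 0]  # counts for 1st, 2nd, 3rd positions
    for index, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a not in "ACGT" or b not in "ACGT" or a == b:
            continue
        by_codon_position[index % 3] += 1
        # A transition stays within purines or within pyrimidines.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return {
        "transitions": transitions,
        "transversions": transversions,
        "by_codon_position": by_codon_position,
    }


# Hypothetical aligned fragments, 4 codons long.
reference = "ATGACCAACATC"
candidate = "ATGACTAATGAC"
summary = substitution_summary(reference, candidate)
print(summary)
```

A candidate whose differences are spread evenly across codon positions, or whose transition : transversion ratio is low relative to known mtDNA comparisons, merits closer inspection, subject to the caveats noted above.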
Functional protein-coding genes require open reading frames and most DNA analysis software can easily search for the presence of open reading frames in a sequence. Stop codons, insertions–deletions (indels), or frame-shift mutations within a coding mtDNA sequence are likely indicative of a numt, although recent translocations that have not yet accumulated such degenerative mutations could still possess an open reading frame. Unlike protein-coding genes, the control region, rRNA, and tRNA genes are not constrained by open reading frames and thus can be particularly problematic. However, numts derived from rRNAs and tRNAs can be identified through the evaluation of secondary structures (Sorenson and Quinn 1998).
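A reading-frame check of this kind is simple to script. The sketch below is a minimal illustration, not the software used in this study: the function name `internal_stops` and the two short test sequences are hypothetical. It relies on the vertebrate mitochondrial genetic code, in which TAA, TAG, AGA, and AGG are stop codons and TGA encodes tryptophan, whereas TGA is a stop codon in the standard nuclear code.

```python
MITO_STOPS = {"TAA", "TAG", "AGA", "AGG"}  # vertebrate mitochondrial code
NUCLEAR_STOPS = {"TAA", "TAG", "TGA"}      # standard (nuclear) code


def internal_stops(sequence, stop_codons, frame=0):
    """Return 0-based positions of in-frame stop codons.

    The final complete codon is excluded so that a legitimate terminal
    stop codon is not reported as a premature one.
    """
    sequence = sequence.upper()
    codons = [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]
    return [
        frame + 3 * n
        for n, codon in enumerate(codons[:-1])
        if codon in stop_codons
    ]


# Hypothetical sequences: the first has an open reading frame under the
# mitochondrial code; the second carries a premature TAA.
intact = "ATGACCAACATCCGAAAATGATAA"
pseudogene = "ATGACCTAAATCCGAAAATGATAA"

print(internal_stops(intact, MITO_STOPS))      # no premature stops
print(internal_stops(intact, NUCLEAR_STOPS))   # TGA flagged under the wrong code
print(internal_stops(pseudogene, MITO_STOPS))  # premature TAA flagged
```

The second call shows why the choice of translation table matters: an intact mitochondrial gene can appear to contain stop codons if it is translated with the nuclear code.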
The alignment of suspected numt sequences with a model based upon secondary structure can determine whether nucleotide substitutions cause structural abnormalities (e.g., disruptions to conserved stem-and-loop structures in tRNA—Hickson et al. 1996). Such comparisons are facilitated by the availability of public databases that curate secondary structure information (e.g., Van de Peer et al. 2000). For example, Pereira and Baker (2004) used the secondary structures of tRNA genes to detect numt pseudogenes in the chicken genome and found that most tRNA numt sequences were not predicted to fold into the proper secondary structure. On the other hand, Olson and Yoder (2002) attempted to use secondary structures to identify 12S rRNA numts but were unable to do so; they advocate using other methods (such as those we discussed herein) in addition to secondary structure analyses.
Phylogenetic analysis
With a priori knowledge of phylogenetic relationships, numt paralogs can often be detected by atypical branch lengths or incorrect placement within a clade (Arctander 1995). Once integrated into the nuclear genome, the substitution rate of the translocated fragment should decelerate because the substitution rate within the mitochondrial genome is higher than that found in nuclear DNA. The mtDNA genome lacks the proofreading and repair mechanisms found in the nuclear genome, suffers from cumulative oxidative damage, and replicates more frequently than the nuclear genome; thus, mammalian mtDNA can accumulate ∼10 times as many mutations as nuclear DNA (Brown et al. 1979). Most of these changes occur at 3rd codon positions or within intergenic regions (Avise 1991). Upon transfer of mtDNA to the nuclear genome, the decreased substitution rate can affect phylogenetic results and appear as shorter numt branch lengths. In contrast, Lopez et al. (1997) reported that some numts might be evolving at the same rate or faster than their mtDNA paralogs.
Because numt sequences and mtDNA sequences may not have the same evolutionary history, the position of a numt within a lineage can reveal when the transfer to the nuclear genome took place. If a numt insertion predated a speciation event, the nuclear sequence might appear basal to older lineages. Conversely, if the insertion event was recent, the numt might not have substantially diverged from its mtDNA counterpart. In any event, diagnosing numts within a phylogeny requires some knowledge of evolutionary relationships, but because many phylogenetic studies are conducted to establish relationships without previous knowledge, detecting numts via phylogenetic approaches can be challenging.
Of course, the most robust phylogenetic inferences rely upon analyses of multiple gene sequences, as single-gene trees are not equivalent to species trees (Avise 2000; Pamilo and Nei 1988; Tajima 1983). Thus, it is generally good practice to use multiple genes in attempts to recover phylogenetic relationships (Maddison and Knowles 2006). In so doing, however, one may encounter incongruities (deQueiroz et al. 1995) that are due to different modes of evolution among genes (or genomes) sampled. For example, if multiple mtDNA genes are utilized in an analysis, a discordance involving a single gene (amplicon) could indicate the presence of a numt in the data set. Alternatively, if both mtDNA and nuclear genes are utilized, putative mtDNA genes that exhibit phylogenetic signatures similar to nuclear sequences (e.g., cluster within nuclear clades) should be closely inspected, as this may be a clear indication that a given sequence is nuclear in origin.
Although the preventative measures described above can be effective in identifying numts, none are guaranteed. Herein, we illustrate the use of some methods outlined above in isolating a cytochrome-b numt and its corresponding mtDNA sequence in the North American prairie vole (Microtus ochrogaster). Nuclear copies of the mitochondrial cytochrome-b gene have previously been described in voles and other arvicoline rodents (DeWoody et al. 1999; Jaarola and Searle 2004; Jaarola et al. 2004; Triant and DeWoody 2007). We use comparative sequence analyses, mRNA expression assays, and phylogenetic analysis to highlight numt detection methods and offer cautionary suggestions.
Additionally, we further investigate numt representation within mammals using mammalian nuclear genome sequences available within the GenBank database. We use mtDNA protein-coding genes, ribosomal RNA genes, and control region sequences to examine whether or not certain regions of the mtDNA are more or less prone to translocation as numts.
Materials and Methods
Isolation of mitochondrial and nuclear sequences
Genomic DNA was extracted from cardiac and skeletal muscle tissue of a local specimen of M. ochrogaster with a standard proteinase K/phenol–chloroform protocol (Sambrook and Russell 2001). To mimic a typical mammalian evolutionary study, we amplified the mitochondrial cytochrome-b gene using the universal primers L14724/H15915 (Irwin et al. 1991). In parallel, we isolated a nuclear copy of the cytochrome-b gene using the numt-specific primers PcytbF2, PcytbR, and PcytbR2 (Triant and DeWoody 2007) in 2 separate reactions: PcytbF2/PcytbR and PcytbF2/PcytbR2. These 2 sets of primers were originally designed to isolate a cytochrome-b numt sequence in Microtus. PCRs for both mitochondrial (cytochrome-b) and nuclear (numt) amplifications were performed in a final volume of 25 μl and included 1X ThermoPol Buffer (New England Biolabs, Ipswich, Massachusetts), 2 mM MgSO4, 0.2 mM deoxynucleoside triphosphates, 0.25 μM each primer, 1.5 U Taq DNA polymerase (New England Biolabs), and 0.015 U Pfu DNA polymerase (Stratagene, La Jolla, California) to reduce polymerase infidelity (Cline et al. 1996). The thermal profile consisted of an initial denaturation at 94°C for 2 min; 32 cycles of 94°C for 1 min, 50°C for 30 s, and 72°C for 1 min; and a final elongation step for 4 min at 72°C. PCR products were cleaned with sodium acetate–ethanol precipitation and sequenced in both directions with the amplification primers and 2 internal sequencing primers (mitochondrial cytochrome-b gene: M.och_Cytb_Int1: 5′-TCACACGATTCTTCGCCT-3′, M.och_Cytb_Int2: 5′-GGAATAGTAGATGGACTA-3′; numt: PcytbSeq1 and PcytbSeq2—Triant and DeWoody 2007) using BigDye v.3.1 (Applied Biosystems, Foster City, California) following the manufacturer's protocol modified to one-eighth reactions. This study was conducted according to the guidelines of the American Society of Mammalogists (Animal Care and Use Committee 1998).
Comparative sequence alignment
Putative mitochondrial and nuclear amplicons were aligned using Sequencher 4.1 (GeneCodes, Ann Arbor, Michigan). The sequence was considered mitochondrial in origin if, using the mammalian mtDNA genetic code, it possessed an open reading frame terminated by a stop codon. Alternatively, a sequence was considered nuclear in origin if it possessed premature stop codons or frame-shift mutations that disrupted the reading frame. We then used the mitochondrial cytochrome-b sequence of M. rossiaemeridionalis, the sibling vole (GenBank accession DQ015676), and aligned it with each of the sequences from M. ochrogaster to compare mitochondrial and nuclear substitution rates.
Phylogenetic analyses
We performed phylogenetic analyses of Microtus mitochondrial cytochrome-b data sets that included both mtDNA and numt sequences from M. ochrogaster. Included in the analysis were 12 North American Microtus species and 3 Asian species. Maximum-likelihood trees were generated with PAUP* 4.0b10 (Swofford 2003) under the GTR+I+G model as determined by Modeltest 3.7 (Posada and Crandall 1998) under the hLRT and Akaike information criteria. We used heuristic searches with 100 bootstrap replicates. Myodes (formerly Clethrionomys) rutilus and M. glareolus were used as outgroups because Myodes is the putative sister taxon of Microtus (Conroy and Cook 2000; Jaarola et al. 2004).
Gene expression
Total RNA was isolated from fresh skeletal and cardiac muscle tissue of the same individual of M. ochrogaster described above. We used TRIzol (Invitrogen, Carlsbad, California) for RNA isolation and, to avoid numt contamination, removed trace amounts of genomic DNA from RNA products using DNase (Deoxyribonuclease I). SuperScript III First-Strand Synthesis System (Invitrogen) was used during reverse-transcriptase PCR to synthesize cDNA using the oligo(dT)20 primer. Protocols were followed as per manufacturers' suggestions. The mitochondrial cytochrome-b primers of Irwin et al. (1991) are located within mitochondrial tRNA sequences. Thus, we designed novel primers within the mitochondrial cytochrome-b gene to amplify its transcripts from cDNA: M.och_RNA_F (forward primer) 5′-ATGACAATCATCCGAAAA-3′ and M.och_RNA_R (reverse primer) 5′-GGATGTTGTTTTCGATTATA-3′. PCR and thermal profile were the same as those used for genomic DNA. To test for possible (but unexpected) numt expression, we also attempted to isolate the numt sequence from cDNA using the same numt primer pair combinations and conditions described above. PCR products were bidirectionally sequenced with amplification primers and internal sequencing primers (M.och_Cytb_Int1/M.och_Cytb_Int2) and comparative sequence analyses were performed with sequences generated from genomic DNA. Sequences generated in this study have been deposited into the GenBank database under the accession numbers DQ432006–DQ432008.
Numt surveys
We conducted NCBI-BLASTN searches (Altschul et al. 1997) on the sequenced mammalian genomes available for BLAST searching within the GenBank database as of November 2006 (n = 11), which included Bos taurus, Canis familiaris, Felis catus, Homo sapiens, Macaca mulatta, Mus musculus, Oryctolagus cuniculus, Ovis aries, Pan troglodytes, Rattus norvegicus, and Sus scrofa. We separately BLASTed each of their mtDNA protein-coding gene, ribosomal RNA gene, and control region (D-loop) sequences against their nuclear genomes, discounting matches with E values greater than 10 × 10⁻⁴. The genomes for Felis, Oryctolagus, Ovis, and Sus were draft genomes and our searches revealed few numt sequences. In light of the extensive numt integration that has been described for felids (Cracraft et al. 1998; Kim et al. 2006; Lopez et al. 1994), we attributed this paucity of numts to incomplete nuclear genome sequences. Venkatesh et al. (2006) demonstrated that misleading results and spurious matches can be generated when searching for numts in draft genomes; therefore, we removed these 4 taxa from further analyses. For each remaining taxon, we estimated the number and total length of numts per mtDNA genome region and the proportion of all numt sequences represented by each region.
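The per-region tallies described above amount to filtering hits by E value and then summarizing hit lengths. The following sketch shows one way this bookkeeping could be done; it is not the script used in this study. The function name `summarize_hits`, the region labels, and every hit in the example list are hypothetical, and the cutoff is simply the threshold as written in the text.

```python
from collections import defaultdict
from statistics import mean, median

E_VALUE_CUTOFF = 10e-4  # the "10 x 10^-4" threshold as written in the text


def summarize_hits(hits, max_evalue):
    """Summarize BLAST hits per mtDNA region.

    hits is an iterable of (region, alignment_length_bp, evalue) tuples.
    Hits with an E value greater than max_evalue are discarded.
    """
    kept = defaultdict(list)
    for region, length, evalue in hits:
        if evalue <= max_evalue:
            kept[region].append(length)
    total_bp = sum(sum(lengths) for lengths in kept.values())
    summary = {}
    for region, lengths in kept.items():
        summary[region] = {
            "count": len(lengths),
            "total_bp": sum(lengths),
            "mean_bp": mean(lengths),
            "median_bp": median(lengths),
            "share": sum(lengths) / total_bp,  # proportion of all numt sequence
        }
    return summary


# Hypothetical hits for one taxon (region, length in bp, E value).
example_hits = [
    ("cytb", 120, 1e-20),
    ("cytb", 80, 1e-8),
    ("cytb", 60, 0.5),      # discarded: E value above the cutoff
    ("12S", 200, 1e-30),
    ("control", 50, 1e-5),
]
result = summarize_hits(example_hits, E_VALUE_CUTOFF)
for region, stats in result.items():
    print(region, stats)
```

Overlapping hits to the same nuclear locus would need to be merged before counting; that step is omitted here for brevity.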
Results
We used the universal cytochrome-b primers L14724/H15915 (Irwin et al. 1991) to amplify and sequence the cytochrome-b gene from genomic DNA of M. ochrogaster and subsequently translated it with the mammalian mitochondrial genetic code to confirm that it was mitochondrial in origin. The 1,143-base pair (bp) sequence possessed an open reading frame, an initiation codon, and a terminal stop codon. The chromatograms were clean with no noticeable secondary peaks.
We also used numt-specific primers to amplify and sequence a 941-bp putative cytochrome-b pseudogene that differed from the mitochondrial sequence at 17.5% of its sites (20% at 1st codon positions, 13% at 2nd codon positions, and 67% at 3rd codon positions) and had a transition : transversion ratio of 1.8:1.0. Both pseudogene primer pairs isolated the same numt sequence. The numt pseudogene contained frame-shift mutations and at least 10 stop codons in each of 3 possible reading frames as translated with both the universal and mammalian mitochondrial genetic codes. Indels included a 15-base deletion and a single base insertion (Fig. 1). Substitution rates calculated from the pairwise comparisons of M. ochrogaster and M. rossiaemeridionalis are listed in Table 2. Chi-square analysis revealed that they were significantly different (χ² = 10.56, d.f. = 2, P = 0.005).
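The reported probability can be checked from the test statistic alone. With 2 degrees of freedom, the upper-tail probability of a chi-square statistic x has the closed form exp(−x/2), so no statistical library is needed. The sketch below applies that identity to the value reported above; the function name is hypothetical, and the shortcut holds only for d.f. = 2.

```python
import math


def chi_square_p_value_df2(statistic):
    """Upper-tail probability of a chi-square statistic with 2 d.f.

    Uses the closed form exp(-x / 2), which is valid only for 2 degrees
    of freedom.
    """
    return math.exp(-statistic / 2.0)


p_value = chi_square_p_value_df2(10.56)
print(round(p_value, 3))  # 0.005
```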
Phylogenetic analyses
The placement of the sequences from M. ochrogaster within the maximum-likelihood trees depended upon whether the sequence was mitochondrial or nuclear in origin (Fig. 2). The mitochondrial sequence was embedded within the Microtus clade among the other North American species but the placement of the nuclear sequence was at the base of the clade.
Gene expression
We extracted mRNA from both cardiac and skeletal muscle tissue and then used reverse transcriptase to produce cDNA. We isolated 1,013-bp mitochondrial cytochrome-b transcripts from the cDNA obtained from both cardiac and skeletal muscle tissue. The transcripts possessed an open reading frame when translated with the mitochondrial genetic code and matched the cytochrome-b sequence isolated from genomic DNA. Surprisingly, we isolated putative numt transcripts from cDNA obtained from cardiac tissue, but were unable to isolate numt transcripts from skeletal muscle tissue using numt primers (Fig. 3).
Numt surveys
The number of numts for the 7 taxa included in our search ranged from 50 to 1,030 and the summed length of all numts present ranged from 4,876 to 255,682 bp (Table 3). The mean length of individual numts ranged from 62 to 232 bp across taxa and the median values ranged from 49 to 160 bp (Table 3). The percentage of total numt sequences represented by each mtDNA region is shown in Fig. 4.
Discussion
We have empirically demonstrated various ways in which numt and mtDNA sequences from the same individual can be reconciled. Pairwise comparisons between the mitochondrial and nuclear sequences from M. ochrogaster revealed typical numt features such as stop codons, indels, and frame-shift mutations in the nuclear pseudogene. In comparisons with an mtDNA sequence from a related species, the numt from M. ochrogaster had less pronounced codon-position bias at the most selectively constrained site (mtDNA 2nd codon position) and a lower transition : transversion ratio (Table 2). Transition saturation may influence transition : transversion ratios and, therefore, should be used in combination with other numt detection measures, especially in species that show evidence of considerable intraspecific divergence. Differences in overall substitution rates revealed that the numt sequence may be evolving more rapidly than the mtDNA sequence (Table 2). Without knowing the date of the nuclear insertion, it is difficult to gauge these discrepancies and compare absolute or relative rates of mitochondrial and nuclear evolution. Standardized substitution rates can be used to estimate the divergence between a mtDNA sequence and its numt pseudogene (e.g., Lopez et al. 1994), but the rapid rate of evolution within microtine mtDNA genomes (Triant and DeWoody 2006) renders such an estimate derived from a single sequence pair questionable. However, divergence estimates may be possible with larger data sets that have been calibrated with a local clock.
Mitochondrial phylogenies (cytochrome b) for the genus Microtus have been established and within those phylogenies, North American species form a monophyletic clade (Conroy and Cook 2000; Jaarola et al. 2004). However, relationships within the North American clade were poorly resolved, and our data mirror those earlier works. Within our tree, the mtDNA sequence of M. ochrogaster clustered with other North American species but the nuclear (numt) sequence of M. ochrogaster did not and instead was basal to the Asian vole lineages (Fig. 2). The genus Microtus is thought to have originated approximately 2 million years ago (mya—Chaline 1999; Repenning 1990), which suggests that the translocation to the nucleus of this numt sequence occurred at least 2 mya and is likely present in all Microtus species. In this instance, the unexpected placement of the nuclear sequence of M. ochrogaster at the base of the tree was conspicuous because it conflicts with the studies cited above. However, numt identification would be more difficult in the absence of a consensus phylogeny and its position within a tree would be dependent upon its age. Short branch lengths leading to numts could be signatures of the slower evolutionary rate found in nuclear genomes relative to mtDNA genomes resulting from the release of the selective constraints found within the mtDNA genome. However, long numt branch lengths have been reported in primate lineages (Schmitz et al. 2005; Zischler et al. 1998) and may be the result of DNA damage incurred during numt integration (Collura and Stewart 1995). Because mtDNA and nuclear sequences have different rates and modes of evolution, phylogenetic analyses can become complicated by the inclusion of both sequence types in a single analysis (Bensasson et al. 2001).
The ultimate cause of long numt branch lengths is unclear, but our results coupled with the apparent higher rate of nucleotide substitution in our numt–mtDNA comparisons (Table 2) suggest that the Microtus numt described herein may be evolving more rapidly than its mtDNA counterpart.
Although numts are generally considered to be nonfunctional and therefore unexpressed, we provide evidence suggesting that the arvicoline cytochrome-b numt is expressed at low levels in cardiac tissue. The reverse-transcriptase PCR products from cardiac tissue provided insufficient template for direct sequencing, but the results were repeatable and can clearly be seen in Fig. 3. Despite this unexpected and potentially misleading result, we still endorse gene expression assays as a means for identifying numt sequences. PCR amplifications of mtDNA transcripts were consistently more robust than the numt transcripts, were found in more than 1 tissue type, and were easily sequenced. We cannot overlook the possibility of genomic DNA contamination in our mRNA extract despite our efforts to eliminate DNA from our samples using DNase. The elimination of genomic DNA is an optional step in cDNA library construction, but for numt avoidance we consider it a critical component. Further investigation of numt pseudogene transcription seems to be warranted (Blanchard and Schmidt 1996).
Using the mitochondrial and nuclear genome sequences for 7 mammalian taxa, we assessed the occurrence of mtDNA protein-coding genes, mtDNA ribosomal RNA genes, and the mtDNA control region within their nuclear genomes. Because we used individual mtDNA genes and not complete mtDNA genome sequences for our searches, the totals we present do not include various insertions that are associated with numts (e.g., mobile and repetitive elements—Mishmar et al. 2004). The primates in our sample had the most numts (total length in Homo: 222,644 bp; Macaca: 255,682 bp; Pan: 172,423 bp), consistent with other studies that have found extensive mtDNA translocations in primates (Ricchetti et al. 2004; Schmitz et al. 2005). On the other end of the spectrum were rodents (Mus: 37,260 bp; Rattus: 4,876 bp; Table 3), and there was nearly an 8-fold difference between the 2 species sampled. Note that these numbers are not absolute, because the assembly status of genomes is constantly in flux and bioinformatic searches for numts in sequenced genomes can be hampered by misalignments (Venkatesh et al. 2006).
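The per-genome summaries described for Table 3 (total, mean, median, and range of numt lengths) and the fold difference between the 2 rodents can be sketched as follows. This is only an illustration: the function names are hypothetical, the list of individual numt lengths is made up, and only the Mus and Rattus totals are taken from the text above.

```python
from statistics import mean, median


def summarize_numts(lengths_bp):
    """Return total, mean, median, and range for a list of numt lengths (bp)."""
    if not lengths_bp:
        raise ValueError("at least one numt length is required")
    return {
        "count": len(lengths_bp),
        "total": sum(lengths_bp),
        "mean": mean(lengths_bp),
        "median": median(lengths_bp),
        "range": (min(lengths_bp), max(lengths_bp)),
    }


def fold_difference(total_a, total_b):
    """Ratio of the larger total to the smaller total."""
    larger, smaller = max(total_a, total_b), min(total_a, total_b)
    if smaller <= 0:
        raise ValueError("totals must be positive")
    return larger / smaller


if __name__ == "__main__":
    # Hypothetical individual numt lengths, for illustration only
    print(summarize_numts([120, 300, 450, 80]))
    # Totals reported in the text for Mus and Rattus
    print(round(fold_difference(37260, 4876), 2))  # 7.64, i.e., nearly 8-fold
```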
Our data do not reveal a propensity of 1 or more mtDNA genes or regions to translocate to the nuclear genome (Fig. 4). Consistent with other findings (Mourier et al. 2001; Woischnik and Moraes 2002), we found that the control region was relatively rare in the human nuclear genome but was common within rodent genomes (Fig. 4). The presence of the control region within the nuclear genome provides evidence for DNA-mediated numt insertions as opposed to RNA-mediated insertions as the control region has no RNA intermediate (Attardi and Schatz 1989).
Numts as molecular markers
Although cryptic numts can confound molecular analyses, once identified they do have some practical utility. There is a growing interest in numts and what they can reveal about the evolutionary history of their host. Because numts generally have a slower rate of evolution than mtDNA, they may represent an ancestral form of functional mtDNA sequences (Perna and Kocher 1996). In this respect, they can be used to compare rates of mitochondrial and nuclear evolution, as phylogenetic markers or outgroups, and to estimate divergence dates (Bensasson et al. 2001; Lopez et al. 1997; Zischler et al. 1995). Schmitz et al. (2005) used numts and mtDNA to retrace 40 million years of primate history. Numt insertion sites also can be examined to assess whether numts preferentially integrate into certain regions of the genome (Mishmar et al. 2004).
Conclusions
Some of the recommendations herein are more tractable than others. In particular, the isolation of cDNA template may be burdensome for investigators using tissues that are not amenable to RNA extraction (e.g., hair or fecal samples). Researchers using noninvasive sampling or ancient DNA extraction techniques also might find these procedures troublesome because their templates are often limiting. On the other hand, these types of preventative measures are necessary only when initially characterizing primers and amplification profiles in a given species, usually at the beginning of a study. Once the target mtDNA sequence has been identified using some or all of the above procedures, this sequence can be aligned to any anomalous sequences to identify potential numts. If the numt and mtDNA sequences have diverged sufficiently from one another to allow for unique primer-binding sites, the numt and mtDNA sequences can be amplified independently to generate both mtDNA and nuclear data sets. Once in hand, the numt sequences can be used as neutral markers to generate comparative phylogenies, be compared to mtDNA results, or both, so long as investigators are aware that multiple (nonorthologous) numts may be amplified with the same primers.
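As one illustration of how an anomalous protein-coding sequence might be screened before or after alignment to the verified mtDNA sequence, the sketch below flags in-frame stop codons under the vertebrate mitochondrial genetic code (stop codons TAA, TAG, AGA, and AGG; TGA encodes tryptophan). It assumes that the reading frame is known and that a stop in the final complete codon is a legitimate terminator; the function name and example sequences are hypothetical. An intact reading frame does not rule out a numt, so this is a coarse screen rather than a definitive test.

```python
# Stop codons in the vertebrate mitochondrial genetic code
VERT_MT_STOPS = {"TAA", "TAG", "AGA", "AGG"}


def premature_stops(seq, frame=0):
    """Return (position, codon) pairs for in-frame stops before the last codon.

    Assumptions: `seq` is a protein-coding fragment, `frame` is the 0-based
    offset of the first complete codon, and a stop in the final complete
    codon is treated as the normal terminator rather than as premature.
    """
    seq = seq.upper().replace("-", "")  # drop alignment gaps, if any
    starts = range(frame, len(seq) - 2, 3)
    codons = [(i, seq[i:i + 3]) for i in starts]
    # Exclude the last complete codon from the screen
    return [(i, codon) for i, codon in codons[:-1] if codon in VERT_MT_STOPS]


if __name__ == "__main__":
    # Made-up sequences for illustration only
    intact = "ATGACAAACATTCGAAAATAA"
    suspect = "ATGAGAAACTAGCGAAAATAA"
    print(premature_stops(intact))   # no premature stops expected
    print(premature_stops(suspect))  # flags AGA and TAG
```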
Although we have presented various methods for detecting numts, we are not implying that PCR-based analyses of mtDNA are always suspect; indeed, hundreds of studies have been performed successfully without considering mitochondrial-derived nuclear pseudogenes. This study used universal primers and genomic DNA that was not enriched for mtDNA, but we were able to generate clean mtDNA sequences without any apparent numt contamination. However, there have been cases where numts have been isolated with both universal and conserved primers (Lü et al. 2002; Mirol et al. 2000; Smith et al. 1992). Thus, mammalogists should at least be aware that numts have the potential to confound analyses and lead to erroneous conclusions. To avoid numts, we suggest a number of preventative approaches that can be employed both before and after data collection. The precautions necessary to ensure that a data set is truly mitochondrial in origin might seem onerous, but they are well worth the effort if they later save researchers from having to reanalyze or recollect data that are numt contaminated.
Acknowledgments
We thank D. Bos, J. Detwiler, D. Glista, D. Gopurenko, E. Lach, J. Rudnick, L. Theile, S. Turner, R. Williams, M. Zavodna, and 2 anonymous reviewers for reviewing earlier versions of this manuscript. This research was supported in part by Purdue University, the National Science Foundation, and the United States Department of Agriculture's National Research Initiative. This is publication ARP2006-17891 from the School of Agriculture at Purdue University.
Literature Cited
Table 1.—A sample of nuclear mitochondrial translocations (numt pseudogenes or numts) that have been isolated from a wide variety of mammalian taxa. Intervening tRNA sequences, if present, are not listed. In most cases, the genes involved represent a minimum accounting because insertion sites and flanking sequences of numts are uncommon in published literature. In many cases, numerous numts are known to have occurred independently in the same lineage. Cytb, cytochrome b; 12S, 12S rRNA; 16S, 16S rRNA; C.R., control region; COI, cytochrome oxidase 1; COII, cytochrome oxidase 2; mtDNA, mitochondrial DNA; ND1, reduced nicotinamide adenine dinucleotide (NADH) dehydrogenase subunit 1; ND2, NADH dehydrogenase subunit 2.
Table 2.—Percent differences, number of inserted–deleted base pairs (indels), and transition : transversion ratio (Ts:Tv) between mitochondrial cytochrome-b genes from Microtus ochrogaster and M. rossiaemeridionalis (Cytb/Cytb), and cytochrome-b nuclear mitochondrial translocation (numt pseudogene or numt) from M. ochrogaster and mitochondrial cytochrome-b gene from M. rossiaemeridionalis (numt/Cytb).
Table 3.—Nuclear mitochondrial translocations (numt pseudogenes or numts) in 7 completely sequenced mammalian genomes. “Total” refers to the summed length (in base pairs [bp]) of all individual numts in the genome. The mean (X̄) is the average size, median is the value in the middle of the numt distribution, and range lists the smallest and largest numt revealed in our searches. Cytb, cytochrome b; ATP, adenosine triphosphatase; NADH, reduced nicotinamide adenine dinucleotide; ATP6, ATP synthase subunit 6; ATP8, ATP synthase subunit 8; COI, cytochrome oxidase 1; COII, cytochrome oxidase 2; COIII, cytochrome oxidase 3; ND1, NADH dehydrogenase subunit 1; ND2, NADH dehydrogenase subunit 2; ND3, NADH dehydrogenase subunit 3; ND4, NADH dehydrogenase subunit 4; ND4L, NADH dehydrogenase subunit 4L; ND5, NADH dehydrogenase subunit 5; ND6, NADH dehydrogenase subunit 6.