Concerted conservation efforts over the past 50 yr have collected more than seven million plant germplasm accessions of more than 16,500 plant species that are currently conserved in 1750 genebanks worldwide (Engels, 2004; FAO, 2010). Management and utilization of these germplasm collections is a challenging mission (Engels and Visser, 2003), as large efforts are required to characterize these germplasm collections and only two million accessions are estimated to be unique (FAO, 2010). More core subsets need to be developed from large collections for germplasm screening of specific traits of interest (Frankel, 1984; Brown, 1989; van Hintum et al., 2000). Thus, assessing genetic distinctness or redundancy has become an important part of germplasm characterization (Waycott and Fort, 1994; Virk et al., 1995; Phippen et al., 1997; Chavarriaga-Aguirre et al., 1999; Dean et al., 1999; Karp, 2002; Fu, 2006; Kisha and Cramer, 2011; Motilal et al., 2013). Identification of genetically distinct germplasm is useful for the establishment of core subsets in a germplasm collection, and assessment of genetically redundant germplasm can help to validate accession duplication. Recent technical advances in the development of molecular markers will make molecular characterizations of plant germplasm more feasible than before (Fu and Peterson, 2012; Peterson et al., 2014; Song et al., 2015).
To facilitate germplasm characterization, we previously introduced a new marker-based approach using the average dissimilarity (AD) of an accession to assess genetic distinctness or genetic redundancy in a plant germplasm collection (Fu, 2006). The AD approach is based on the acquired molecular characterization data, generates the average dissimilarity of an accession against the remaining assayed accessions, and provides a means to identify genetically distinct or redundant germplasm. The approach has been well cited in the scientific literature, but it has not been applied as widely as hoped for. We reasoned that the lack of integrated computer software to streamline separate analyses involving dissimilarity calculation, analysis of molecular variance (AMOVA; Excoffier et al., 1992), and principal coordinates analysis (PCoA) is the major factor for its limited application.
Here we present an R function (R Core Team, 2016; https://www.r-project.org), called AveDissR, which integrates three separate analyses of the AD approach into one package for identifying genetically distinct or redundant germplasm. This R function considers five different data types of dominant or codominant molecular markers such as amplified fragment length polymorphisms (AFLPs), simple sequence repeats (SSRs), and single-nucleotide polymorphisms (SNPs); generates a useful set of output files for germplasm assessment; and can be run in R environments under different computer platforms. This newly developed R function will make the assessment of genetic distinctness and/or genetic redundancy in plant germplasm characterizations more feasible and useful.
METHODS AND RESULTS
The average dissimilarity approach—The original approach introduced by Fu (2006) is conceptually simple and has three major components. The first component is to calculate AD for each accession. In a typical marker-based characterization of self-fertile plant germplasm, an accession is usually represented by a single plant; n accessions are selected from a collection to represent p countries (or regions) of origin; and these accessions are assayed with molecular markers of m loci. Thus, a given accession can form n – 1 pairwise pairs with the remaining assayed accessions. For dominant markers (e.g., AFLP), the pairwise similarity Sij between accessions i and j can be calculated following the simple matching coefficient of Sokal and Michener (1958) as:
where a is the number of matched AFLP bands (or genotypes) between accessions i and j across nonmissing loci mnm (mnm ≤ m), and b is the number of mismatched bands (or genotypes) between i and j across mnm loci. Note that any loci with one or two accessions having no genotype values are not counted, and consequently mnm ≤ m in the pair. The dissimilarity for such pair Dij is defined as 1 – Sij [= b/(a + b)]. The AD value for accession i is obtained by averaging all n – 1 pairwise accession dissimilarities Dij as:
The higher the AD value, the more genetically distinct the accession is in the collection. The lower the AD value, the more genetically redundant the accession is in the collection. Thus, ranking the AD values of all n accessions provides a means of identifying the genetically most distinct and the genetically most redundant accessions. However, such ranking does not provide a threshold that can be used to define a group of genetically distinct or redundant accessions. Thus, the second component of the original approach was configured with the hope to identify a group of genetically redundant or distinct accessions. Fu (2006) explored a means of iteration through the selection of accessions with the lowest AD values and AMOVA to generate phi-statistics for the assessment of the impact of removing tentatively redundant accessions on the change of genetic differentiation present in the remaining assayed accessions. The redundant group of accessions selected with the lowest AD values will be defined when the genetic differentiation measured by phi-statistics in the remaining set of germplasm is largely maintained (or is within a nominal tolerance of 5% change in genetic differentiation from the originally assayed accessions). Similarly, to identify a group of the genetically most distinct accessions, the same means of iteration is applied through selection of accessions with the highest AD values and AMOVA to assess the change of genetic differentiation in the selected accessions. The distinct group of accessions selected with the highest AD values will be defined when the genetic differentiation inferred in the selected accessions is compatible with that present in the original assayed accessions (that is, within a nominal tolerance of 5% change in genetic differentiation). However, the iterative AMOVA may not always be informative when the representative groups are small and the inferred genetic differentiation may greatly deviate or fluctuate from the original one. Consequently, we introduced the third component to assess the representativeness of the accessions from AD-based identification relative to the whole set of samples through a PCoA plot of the genetic associations between the identified group and the whole set. This is done based on the dissimilarity matrix of all pairwise accessions, and the PCoA plot is generated with separate labels for the accessions of the identified group and the whole set.
To assist the application of the AD approach to analyze the flax collection of 2727 accessions genotyped with 149 polymorphic RAPD markers, we developed a SAS routine in 2006 (SAS Institute, 2004) to integrate the first two components of the approach, but not the third component of visual assessment over a PCoA plot. The latter was performed separately using NTSYS-pc software (Rohlf, 1997; Fu, 2006). Clearly, a set of computer software to streamline separate analyses is desirable for the wider application of the AD approach.
The R function: AveDissR—To facilitate the identification of genetically distinct or redundant germplasm, we developed an R function, called AveDissR, to integrate dissimilarity calculation, AMOVA, and PCoA in the original AD approach into one package. To widen its application with different markers, we also considered not only dominant or haploid marker data in the original AD approach, but also expanded it to deal with different codominant marker data. For a diploid codominant genotype marker data with m loci assayed for n accessions, the pairwise accession dissimilarity Dij is calculated as the proportion of marker loci having different marker genotypes, which is analogous to the Hamming distance (Hamming, 1950; Wang et al., 2015) as below:
where bij is the number of dismatched genotypes over nonmissing loci mnm (mnm ≤ m) between the paired accessions i and j (j = 1.. n – 1). For example, diploid SSR genotypes 242 : 266 and 242 : 254 (or 242 : 242) for the paired accessions are counted as dismatched at the SSR locus with three alleles of size: 242, 254, and 266, while the two genotypes 254 : 266 and 254 : 266 are matched. The AD value for accession i is calculated as the same as for dominant marker data by averaging all n – 1 pairwise accession dissimilarities Dij:
Accordingly, its AMOVA component also accommodates both dominant and codominant marker data for analysis of molecular variance following the established methods (Excoffier et al., 1992; Peakall and Smouse, 2006, 2012). The PCoA component is dependent only on a pairwise accession dissimilarity matrix, and the matrix is calculated as mentioned above for both dominant and codominant marker data.
Specifically, AveDissR was configured and developed to analyze five different types of marker data: (1) AFLP or dominant marker; (2 and 3) haploid SSR and SNP data, respectively; (4) diploid genotype data of SSR or SNP with a separator of “:” between two alleles; and (5) diploid genotype data of SNP without a separator format for two alleles. The R function was written with five major steps: (1) data input and dissimilarity matrix calculation, (2) generating AD values for all accessions, (3) performing PCoA, (4) conducting iterative AMOVA, and (5) PCoA display and result outputs. To achieve these tasks, AveDissR utiltizes five R packages (“reshape2”, “data.table”, “vegan”, “foreach”, and “doParallel”) and is equipped with several custom R functions to transform input data using “data.table” and “reshape2” utilities, calculate pairwise dissimilarity matrix and AD values, conduct two separate AMOVAs for dominant or codominant markers, perform PCoA vectors using “vegan” cmdscale, and generate PCoA plots and outputs. The packages “foreach” and “doParallel” were applied to use about half of existing processors or cores in a computing platform to speed up pairwise calculations. Depending on the analysis objective (i.e., to identify genetically distinct or genetically redundant germplasm), several output files are generated to obtain AD values and PCoA vectors, iterative AMOVA results, and PCoA plots.
This R function has several advantages over the original AD analysis with the developed SAS routine (Fu, 2006). First, it is a streamlined package that allows for one integrated analysis to generate AD value for each accession, as well as to have the iterative AMOVA and PCoA outputs to facilitate germplasm grouping. Second, it can analyze germplasm assayed with different marker types, making it easier to apply to a wide variety of problems in germplasm characterization. Third, the R function can take advantage of parallel computing to handle large data sets, particularly in operating systems Mac OS X or Linux. For example, running 4500 soybean (Glycine spp.) accessions genotyped at 6000 SNPs (including missing values) with 40-core parallel computing in a Linux server lasted 2 h 48 min, while the serial execution with the same data set consumed 43.92 h. Fourth, it is more accessible than the SAS routine (Fu, 2006), as R is not only a widely used programming language and software environment for statistical computing and graphics, but also freely available under the GNU General Public License. AveDissR is packaged as AveDissR.rar and is freely available for use with different computing systems under Windows, Mac OS, or Linux. The R script, user instruction file, and example data and output files are available on Figshare ( http://dx.doi.org/10.6084/m9.figshare.5082451; Yang and Fu, 2017). Its use was successfully tested in R version 3.2.3 Platform: x86_64-w64-mingw32/x64 (64-bit) (Windows) and Platform: x86_64-redhat-linuxgnu (64-bit) (Linux).
Using AveDissR—The package AveDissR.rar (Yang and Fu, 2017: Appendix S1) has one R script file AveDissR.r, the user instruction file Using AveDissR.pdf (Yang and Fu, 2017: Appendix S2), and three subfolders named examples, inputdata, and results (Appendix S2: Fig. S1). The subfolder examples has five example data files and one parameters_setting.example.csv file (Appendix S2: Fig. S2a). The subfolder inputdata is set up to place marker data and parameters_setting.csv files for an analysis. The subfolder results is where the output files are generated from the analysis (Appendix S2: Fig. S3). Using AveDissR.pdf provides detailed instructions for using AveDissR.r in R environments under different computer settings (Appendix S2). There are several major steps to follow. They include (1) install R, if not available in computer or Unixlike server; (2) install five R packages required for using AveDissR; (3) download AveDissR.rar from the Figshare website (Yang and Fu, 2017); (4) prepare marker data and parameters_setting files for analysis; (5) place the marker data and parameters_setting files into the inputdata subfolder; and (6) run AveDissR.r either without opening R console or within R console, as shown in Table 1. Attention should be paid to prepare marker data and parameters_setting. The R function is capable of analyzing any of the five marker data types, but each marker data type has its unique format (Appendix S2: Fig. S2c–g). The marker data file should be prepared following one of the five example data file formats. The R function uses fread function to input data file with a .csv (comma delimited) or .txt (tab delimited) format. The parameters_setting file is used to instruct the AveDissR analysis (Appendix S2: Fig. S2b). Seven sets of parameters need to be defined before an analysis can start. The parameters_setting.example.csv file in the examples subfolder (Appendix S2: Fig. S2b) shows the description of, and the usage for, each parameter. More attention should be paid to the parameters stepwise selection and specific selection, as both settings determine the iterative AMOVA and PCoA outputs to assess the changes in genetic differentiation of selected or removed germplasm to facilitate the grouping of germplasm with higher or lower AD values. Stepwise selection allows users to select the starting proportion (or the size of the step) for iteration up to 1 (100% germplasm). If one decides to select only one specific proportion of germplasm (e.g., 26%), stepwise selection can be silenced with NA and specific selection can be defined as 0.26.
Illustration of its usage—AveDissR can be applied to assess genetic distinctness or genetic redundancy separately or jointly, depending on the objectives of a plant germplasm characterization. Here we illustrate its usage to assess genetic distinctness and genetic redundancy separately, each with a published data set. Two illustrative data sets are available upon request from the corresponding author. The first data set selected is the AFLP data of 670 cultivated hexaploid oat (Avena sativa) accessions (Fu et al., 2005). The assayed accessions were selected from a world collection of 11,622 cultivated hexaploid oat accessions to represent 79 countries and one group of uncertain origin, and they were genotyped with 170 AFLP markers by five AFLP primer pairs. Previous AFLP analysis showed this set of oat germplasm was genetically diverse and distinct (Fu et al., 2005). Here we reanalyzed the data set using AveDissR to assess their genetic distinctness. We reformatted the data following 1_dominant_ AFLP_example.csv (Appendix S2: Fig. S2c) and defined the parameters_setting file as: Analysis_Type = 1, Marker_Type = 1, SampleN = 670, PopulationN = 80, LociN = 170, Missing_Label = NA, stepwise selection = 0.1, and specific selection = NA. Running the data in the mentioned Linux server lasted 3 min. Six original output files were generated: one bar plot for AD distribution, three PCoA plots for selected 50, 113, or 185 accessions, one .csv file for AD output and PCoA vectors, and one .csv file for iterative AMOVA output. To illustrate, three sets of the outputs are shown in Fig. 1: the distribution of average dissimilarity values in 670 accessions, the PCoA plot for the associations of 461 selected and 670 original accessions, and the AMOVA outputs showing the changes in genetic differentiation with the increased selection of accessions with the highest AD values. The AD values for these 670 oat accessions ranged widely from 0.214 to 0.575 with a mean of 0.289 (Fig. 1A). Selecting 469 out of 670 oat accessions with the highest AD values as the genetically distinct group could largely maintain the extent of genetic differentiation in the assayed 670 oat accessions, as shown with percentage of variation among populations (PVa) values of 0.09156 in the first step and 0.097355 in the eighth step with the selection of 469 samples, respectively (Fig. 1C). The PCoA plot (Fig. 1B) showed the representativeness of the 461 (out of 469 selected) samples (in the eighth step) used for AMOVA relative to the original 670 accessions, based on the revealed genetic associations. One could rerun AveDissR.r with different parameters_ setting files by adjusting stepwise selection values to select a tentative group of genetically distinct accessions, as suggested by Fu (2006) with a nominal tolerance of 5% change in genetic differentiation from the originally assayed accessions. When the tentative distinct group is selected, AveDissR.r can be rerun with specific selection value (equal to the size of the selected group over the assayed accessions) to generate the final PCoA plot for assessment. Note that Fig. 1B was obtained after rerunning AveDissR.r with specific selection = 0.7 (= 469/670) and that the number of samples used for AMOVA may be smaller than the selected samples, as AMOVA removed those populations (or countries in this case) that had only a single sample (Fig. 1C).
Table 1.
Commands used to run AveDissR.r in the computer operating systems Windows or Linux and its working environment.
The second data set was selected from the soybean data generated by Song et al. (2015) to illustrate its use in assessment of genetic redundancy. The original soybean data consists of 18,480 domesticated soybean and 1168 wild soybean accessions that were genotyped with 42,509 polymorphic SNP markers through the SoySNP50K Illumina Infinium II BeadChip (Illumina, San Diego, California, USA). The assayed soybean accessions represented germplasm introduced from 84 countries or developed in the United States. Here we reanalyzed part of the original soybean data set using AveDissR to illustrate the outputs for germplasm redundancy assessment. We extracted 1828 accessions from 18,480 Glycine max accessions with respect to country of origin and randomly selected 5000 (from 42,509) SNP markers to form a new soybean data set representing 21 countries. We reformatted the data following 3_haploid_SNP_example.csv (Appendix S2: Fig. S2e) and defined the parameters_setting file as: Analysis_Type = 2, Marker_Type = 3, SampleN = 1828, PopulationN = 21, LociN = 5000, Missing_Label = NA, stepwise selection = 0.1, and specific selection = NA. Running the data in the Linux server lasted roughly 22 min. Six original output files were generated: one bar plot for AD distribution; three PCoA plots for selected 183, 366, or 549 accessions; one .csv file for AD output and PCoA vectors; and one .csv file for iterative AMOVA output. To illustrate, three sets of the outputs are shown in Fig. 2: the distribution of average dissimilarity values in 1828 accessions, the PCoA plot for the associations of 549 selected and 1828 original accessions, and the AMOVA outputs showing the changes in genetic differentiation with the increased removal of accessions with the lowest AD values. The AD values for these soybean accessions ranged widely from 0.28 to 0.487 with a mean of 0.326 (Fig. 2A). Removing 549 accessions with the lowest AD values from the 1828 accessions did not show any large deviation from the original genetic differentiation presented in the assayed 1828 accessions, as shown with PVa values of 0.120434 in the first step and 0.112162 in the fourth step, respectively (Fig. 2C). The PCoA plot (Fig. 2B) showed the redundancy of the 549 samples with the lowest AD values in the original 1828 accessions, based on the revealed genetic associations. It is possible that one may need to rerun AveDissR.r with different parameters_setting files by adjusting stepwise selection values to identify a tentative group of genetically redundant accessions, as suggested by Fu (2006) with a nominal tolerance of 5% change in genetic differentiation from the originally assayed accessions. The tentative group of genetically redundant accessions could be used to verify those duplicated accessions previously identified (Song et al., 2015).
Application and limitation—AveDissR is an integrated statistical tool specifically developed for the AD approach to assist the assessment of genetic distinctness and/or redundancy in plant germplasm characterization. However, the AD approach per se is of limited resolution in defining genetically redundant or distinct groups, as clearly stated in Fu (2006) and shown in the analyses of oat and soybean data sets (see Figs. 1A, 2A). There is no definite criterion that one could develop and apply with AD values to identify genetically redundant or distinct groups of accessions. With additional information generated from iterative AMOVAs and PCoA plots, one could identify a tentative group of genetically distinct accessions for further evaluation and assessment in combination with passport, evaluation, and characterization data to develop a core subset from a large germplasm collection, as illustrated in the development of the flax core collection (Diederichsen et al., 2013). The tentative group of genetically distinct accessions per se may not necessarily be deemed as a core collection as defined by Brown (1989) for germplasm management and utilization. Similarly, the tentative redundant group can be used in combination with passport, evaluation, and characterization data to assist the identification of truly duplicated accessions in a germplasm collection for effective germplasm management and conservation. It is worth noting that the genetic redundancy acquired from an AveDissR application may not necessarily mean that the identified genetically redundant accessions are truly duplicated accessions.
Depending on research objectives, AveDissR can also be used to assess genetic distinctness or redundancy in experiments with selfing or outcrossing organisms in natural populations. For example, multiple individuals from one population (or accession in the sense of germplasm conservation) can be genotyped and analyzed using AveDissR, and genetic outliers (or the most genetically distinct individuals) from different populations can be identified. Similarly, the most genetically distinct populations over the geographic distribution of a species can be inferred for conservation and/or population genetic analyses. AveDissR can be applied to identify the genetically redundant (or similar) group of individuals from nearby natural populations and such group can allow for useful assessment of genetic relatedness in nearby populations. Thus, AveDissR has the potential to be applied in some genetic studies of natural populations, particularly for conservation biology. However, its application has not reached to genetic studies of natural populations yet and its usefulness in these studies needs further exploration.
CONCLUSIONS
The developed AveDissR function integrates three components of average dissimilarity calculation, iterative AMOVA, and PCoA as a package to assist the assessment of genetic distinctness or genetic redundancy in molecular characterization of plant germplasm. It can be applied to analyze five different types of dominant or codominant marker data, generate a useful set of output files for germplasm assessment, and run in R environments under different computer platforms. However, the R function, even with parallel computing capability, could still take hours or even days to analyze large marker data sets, particularly with missing values, as pairwise dissimilarity calculation for a large number of germplasm accessions (e.g., 10,000 accessions × 20,000 SNPs) is computationally intensive. Also, the average dissimilarity approach per se has its weaknesses and may not allow for a full identification of genetically redundant accessions (Fu, 2006). Nevertheless, the outputs from AveDissR should be useful for the assessment of genetic distinctness and/or genetic redundancy in plant germplasm characterization.
ACKNOWLEDGMENTS
The authors thank Dr. Qijin Song (U.S. Department of Agriculture–Agricultural Research Service) for his assistance in acquisition of soybean single-nucleotide polymorphism data for the analysis and two anonymous reviewers for their helpful comments on the manuscript. This research was financially supported by an Agriculture and Agri-Food Canada A-Base research project to Y.B.F., the National Natural Science Foundation of China (no. 31670678), and a China Scholarship Council Postdoctoral Abroad Grant to M-H.Y.