1 December 2009 Comparison of Similarity Coefficients used for Cluster Analysis with Amplified Fragment Length Polymorphism Markers in the Silkworm, Bombyx mori

Establishing accurate genetic similarity and dissimilarity between individuals is an essential and decisive step in clustering and in analyzing inter- and intra-population diversity, because different similarity and dissimilarity indices may yield contradictory outcomes. We assessed the variation caused by three commonly used similarity coefficients, Jaccard, Sorensen-Dice and Simple matching, in the clustering and ordination of seven Iranian native silkworm, Bombyx mori L. (Lepidoptera: Bombycidae), strains analyzed by amplified fragment length polymorphism markers. The similarity coefficients were compared using Spearman correlation analysis, dendrogram evaluation (visual inspection and the consensus fork index, CIc), projection efficiency in a two-dimensional space, and the groups formed by the Tocher optimization procedure. For almost all methodologies, the Jaccard and Sorensen-Dice coefficients gave extremely close results, because both exclude negative co-occurrences. Because there is no guarantee that DNA regions with negative co-occurrences between two strains are indeed identical, the use of coefficients such as Jaccard and Sorensen-Dice, which do not include negative co-occurrences, is imperative for closely related organisms.


One approach commonly used in studies of genetic diversity within and among populations or groups of individuals, and applicable to all types of markers and organisms, is based on comparisons of individual genotypes within and between populations. In such cases a genetic similarity (or dissimilarity) matrix constructed from all possible pairwise combinations of individuals is used to characterize population structure based on the relative affinity of each individual to all other individuals tested. This approach requires suitable methods for evaluating similarity between individuals, and it is particularly useful in the case of possible linkages between different loci. The choice of an appropriate similarity coefficient is a decisive step in clustering, in estimating true genetic similarity between individuals, in analyzing diversity within populations, and in studying relationships between populations, because different similarity coefficients may yield conflicting results (Kosman and Leonard 2005).

Silkworms, Bombyx mori L. (Lepidoptera: Bombycidae), domesticated for silk production, include a large number of geographical races and inbred lines that show considerable differences in their qualitative and quantitative traits. Traits such as cocoon shape, cocoon color, silk fiber length, and ethological traits are used to differentiate silkworm varieties and to select parental strains. However, silkworm varieties, particularly those that have been bred from crosses involving many varieties, cannot be distinguished unambiguously by conventional characteristics. Molecular markers could solve this problem by providing unique DNA profiles. Such varietal DNA profiles would be useful in producing reliable estimates of genetic diversity, in selecting parents for the development of elite hybrids, and in protecting silkworm breeders' rights (Mirhoseini 1998, 2002; Reddy et al. 1999; Nagaraju and Goldsmith 2002).

Molecular markers are commonly used to characterize genetic diversity within or between populations or groups of individuals because they typically detect high levels of polymorphism. Furthermore, RAPDs and AFLPs are efficient in allowing multiple loci to be analyzed for each individual in a single gel run. In analyzing banding patterns of molecular markers, the data typically are coded as (0,1)-vectors, 1 indicating the presence and 0 indicating the absence of a band at a specific position in the gel. With diploid organisms and codominant markers, such as allozymes, RFLPs or SSRs, the banding patterns may be translated to homozygous or heterozygous genotypes at each locus, and the allelic structure derived is utilized for comparison between individuals (Peakall et al. 1995; Smouse and Peakall 1999; Maguire et al. 2002). More often, however, the binary patterns obtained are used directly in comparisons of similarity of individuals (Kosman and Leonard 2005).

A number of similarity coefficients have been proposed (Sokal and Sneath 1963; Sneath and Sokal 1973; Johnson and Wichern 1988). Similarity coefficients specific to dichotomous (binary) variables, especially co-occurrence measures, are suggested for divergence studies based on dominant molecular markers such as RAPD (Duarte et al. 1999). These coefficients express different definitions of similarity or dissimilarity over the pairwise comparisons, and their values range from 0 to 1 (Skroch et al. 1992). Although various coefficients are available, published studies often do not justify their preference for any one in particular. Since clustering and ordination results can be influenced by this choice (Gower and Legendre 1986; Jackson et al. 1989), these coefficients need to be better understood so that the most efficient ones can be used.

In the present study, we evaluated the changes caused by three commonly used similarity coefficients in the subsequent clustering and ordination analyses of seven Iranian native B. mori strains analyzed by AFLP markers.

Materials and Methods

The three most commonly used similarity coefficients, Simple matching, Jaccard and Sorensen-Dice (Table 1), were compared among seven Iranian native silkworm strains: Guilan Orange (Gu Or), Baghdadi (Ba), Harati White (Ha Wh), Harati Yellow (Ha Ye), Khorasan Lemon (Kh Le), Khorasan Orange (Kh Or) and Khorasan Pink (Kh Pi). The strains were sampled from the Iran Sericulture Research Center (ISRC), located in Rasht, Guilan province.

AFLP markers were analyzed as described by Vos et al. (1995) with ten enzyme-primer combinations. Only polymorphic bands were used to construct the binary matrix, in which 0 and 1 represent the absence and presence of a band, respectively. Each band was considered a locus.

Genetic similarity estimates (gsij) between each pair of individuals (i,j) were calculated for the three similarity coefficients (Table 1). Similarity analyses were done with the NTSYS-pc ver. 2.02 software (Rohlf 1998). Similarity coefficients were compared using Spearman's correlation coefficient (Hollander 1973). Dendrograms were produced by the unweighted pair-group method with arithmetic mean (UPGMA) using NTSYS-pc software. The different dendrograms were then compared using visual inspection and the consensus fork index CIc (Rohlf 1982), in a form analogous to that used by Duarte et al. (1999). The CIc index provides a relative estimate of the similarity of dendrograms and was calculated using NTSYS-pc software.
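The rank-based comparison of coefficients can be illustrated with a minimal pure-Python sketch. It assumes that the comparison is made on the pairwise similarity values of the same strain pairs; the values below are hypothetical, the function names are illustrative, and the Sorensen-Dice values are derived from the Jaccard values through the identity SD = 2J/(1 + J). It is not the NTSYS-pc procedure used in the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical similarity values for the same six strain pairs.
jaccard_vals = [0.60, 0.45, 0.72, 0.30, 0.55, 0.38]
simple_matching_vals = [0.78, 0.70, 0.80, 0.66, 0.69, 0.71]
# Sorensen-Dice follows from Jaccard by the identity SD = 2J / (1 + J).
dice_vals = [2 * j / (1 + j) for j in jaccard_vals]

rho_j_sd = spearman(jaccard_vals, dice_vals)
rho_j_sm = spearman(jaccard_vals, simple_matching_vals)
```

Because the Sorensen-Dice values are a monotone transformation of the Jaccard values, their ranks coincide, while the Simple matching values can order the pairs differently.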

Cluster formation was also studied with the Tocher optimization procedure (Rao 1952), using the Genes program (Cruz 2001). The inter-group distance limit was taken as the greatest value among the set of smallest distances involving each individual.
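A simplified sketch of a Tocher-style grouping may help to make the distance limit concrete. It assumes a symmetric dissimilarity matrix (for example 1 minus similarity), seeds each group with the closest remaining pair, and admits a candidate while its mean distance to the current group does not exceed the limit. These details, the function name and the distance values are assumptions for illustration; the Genes program may differ.

```python
def tocher_groups(dist):
    """Return (limit, groups) for a symmetric dissimilarity matrix.

    limit is the largest of the per-individual smallest distances.
    """
    n = len(dist)
    limit = max(min(dist[i][j] for j in range(n) if j != i) for i in range(n))
    remaining = set(range(n))
    groups = []
    while remaining:
        if len(remaining) == 1:
            groups.append(sorted(remaining))
            break
        # Seed the next group with the closest pair still unassigned.
        candidates = sorted(remaining)
        pair = min(
            ((i, j) for i in candidates for j in candidates if i < j),
            key=lambda p: dist[p[0]][p[1]],
        )
        if dist[pair[0]][pair[1]] > limit:
            # No pair is close enough: the rest become single-member groups.
            groups.extend([i] for i in candidates)
            break
        group = list(pair)
        remaining -= set(pair)
        while remaining:
            # Candidate with the smallest mean distance to the group.
            best = min(
                sorted(remaining),
                key=lambda k: sum(dist[k][g] for g in group) / len(group),
            )
            mean_d = sum(dist[best][g] for g in group) / len(group)
            if mean_d > limit:
                break
            group.append(best)
            remaining.discard(best)
        groups.append(sorted(group))
    return limit, groups


# Hypothetical dissimilarities among five individuals.
dist = [
    [0.00, 0.10, 0.15, 0.60, 0.65],
    [0.10, 0.00, 0.12, 0.62, 0.70],
    [0.15, 0.12, 0.00, 0.58, 0.66],
    [0.60, 0.62, 0.58, 0.00, 0.20],
    [0.65, 0.70, 0.66, 0.20, 0.00],
]
limit, groups = tocher_groups(dist)
```

In this toy matrix the limit is 0.20, the first three individuals form one group and the last two form another.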

Finally, the clustering methodology proposed by Cruz and Viana (1994) was used, which consists of projecting the dissimilarity matrix into a two-dimensional space. The similarity coefficients were compared with regard to the efficiency of the projection obtained.

To do this, the following was considered:

  • Correlation between the original distances and the distances obtained by two-dimensional dispersion

  • Degree of distortion (1 - α), given by:

    where gdij is the graphical genetic distance between inbred lines i and j in the two-dimensional space and odij is the original distance between lines i and j in an n-dimensional space.

  • Stress value (S), given by:


Table 1.

Similarity coefficients studied.


This statistical representation of stress (standardized residual sum of squares), proposed by Kruskal (1964), is a parameter that determines the goodness-of-fit of the graphic projection. The stress was classified according to the criteria presented in Table 2 (Kruskal 1964).
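The three projection criteria can be sketched in Python. The sketch assumes the forms commonly given for these quantities, α = Σ gdij² / Σ odij² (so that the degree of distortion is 1 - α) and S = sqrt(Σ (odij - gdij)² / Σ odij²), with sums over all pairs i < j. These forms, the function names and the toy data are assumptions and should be checked against Cruz and Viana (1994) and Kruskal (1964).

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


def projection_efficiency(original, coords):
    """Return (r, distortion, stress) for a two-dimensional projection.

    original: square matrix of original distances od_ij
    coords:   list of (x, y) points; gd_ij is their Euclidean distance
    """
    pairs = list(combinations(range(len(coords)), 2))
    od = [original[i][j] for i, j in pairs]
    gd = [
        math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
        for i, j in pairs
    ]
    sum_od2 = sum(v * v for v in od)
    alpha = sum(v * v for v in gd) / sum_od2  # assumed form of alpha
    # Assumed stress form; multiply by 100 to express it as a percentage.
    stress = math.sqrt(sum((o - g) ** 2 for o, g in zip(od, gd)) / sum_od2)
    return pearson(od, gd), 1 - alpha, stress


# Hypothetical example: three individuals and their 2-D coordinates.
original = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 6.0],
    [4.0, 6.0, 0.0],
]
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
r, distortion, stress = projection_efficiency(original, coords)
```

Here two of the three distances are reproduced exactly and one is shortened from 6 to 5, so the distortion and stress are small but not zero.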

Table 2.

Stress (S) classification for the goodness-of-fit of the graphic projection.


Results and Discussion

The Spearman correlation coefficients between the three similarity coefficients were equal to or close to 1 (Table 3), showing that they are highly related. The Jaccard and Sorensen-Dice coefficients had a correlation of 1.00, demonstrating that the ranks do not change between these two coefficients, i.e. they order the similarities among strains in exactly the same way. However, the correlations between these two coefficients and the Simple matching coefficient were lower (0.87). These results are similar to those presented by Duarte et al. (1999) for RAPD markers in common bean and by Meyer et al. (2004) for AFLP and RAPD markers in maize.

Table 3.

The Spearman correlation coefficient between the similarity coefficients


A visual inspection of the dendrograms obtained with the UPGMA method (Figure 1) shows that, although the overall structure of the dendrograms is highly comparable, there are minor differences in the levels at which strains are clustered. The dendrogram constructed with the Simple matching coefficient shows some distinct differences, corroborating the differences observed in the similarity matrices (Table 3). All three coefficients grouped four strains (Ba, Ha Wh, Gu Or and Kh Or) in a main cluster, but the Simple matching-based dendrogram placed the other strains (Ha Ye, Kh Le and Kh Pi) differently (Figure 1). It is important to note that, in the dendrogram constructed with the Simple matching coefficient, the Kh Pi strain clustered far from the two other Khorasan strains (Kh Or and Kh Le). Because these three strains were collected from Khorasan province, a region in the northeast of Iran, they would be expected to cluster closely in the dendrogram. This is what is observed in the dendrograms constructed with the Jaccard and Sorensen-Dice similarity coefficients, which supports their validity over the Simple matching coefficient.

All the dendrograms separated the individuals of the seven strains without any overlap, which could be due to the high efficiency of the AFLP marker system used. It may also result from the selection carried out to conserve these strains at the Iran Sericulture Research Center. Over the past twenty years these native strains, collected from different geographical regions of Iran, have been inbred by the Iran Sericulture Research Center to conserve the gene bank. The high similarity observed between individuals within each strain, and the consequent separation of strains, may be due to this selection and inbreeding pressure. This result agrees with that of a recent study using AFLP markers on B. mori (Mirhoseini et al. 2007). In contrast, an earlier analysis using RAPD markers on individuals from the same strains (Mirhoseini 1998) did not lead to such consistent separation of the strains. The difference might be due to the different techniques used; the AFLP markers resulted in a more consistent pattern.

Comparison of the dendrograms by the consensus fork index, CIc, refines what is observed through visual inspection (Table 4). By this index, which ranges from 0 to 1, two dendrograms are considered identical when the calculated value equals 1. As shown in Table 4, based on the CIc index, only the dendrograms obtained with the Jaccard and Sorensen-Dice similarity coefficients are identical. These results are highly similar to those obtained by Duarte et al. (1999) and Meyer et al. (2004). Consistent results were also obtained by Jackson et al. (1989), who studied relationships between fish species using different similarity coefficients and found that cluster analysis gave strongly similar dendrograms for the Jaccard and Sorensen-Dice coefficients, and for the Simple matching and Rogers and Tanimoto coefficients.

Table 4.

Consensus fork index (CIc) among the UPGMA dendrograms produced with the Jaccard (J), Sorensen-Dice (SD) and Simple matching (SM) similarity coefficients for seven Iranian native silkworm populations.


The similarity of the Jaccard- and Sorensen-Dice-based dendrograms can be explained by the properties of these coefficients. The coefficients differ in the way the original data matrix (1 = presence of the AFLP marker, 0 = absence) is used in the similarity estimate. When two genotypes are compared at a band position, four situations can occur: a = (1,1); b = (1,0); c = (0,1); d = (0,0). The Jaccard and Sorensen-Dice coefficients are equivalent except that the Sorensen-Dice coefficient gives double weight to positive co-occurrences (a), whereas the Simple matching coefficient also includes negative co-occurrences (d) (Duarte et al. 1999).
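The four co-occurrence counts and the three coefficients can be sketched in a few lines of Python. The sketch uses the standard forms J = a/(a+b+c), SD = 2a/(2a+b+c) and SM = (a+d)/(a+b+c+d); the band profiles and function names are hypothetical and only illustrate the arithmetic.

```python
def cooccurrence_counts(x, y):
    """Count a = (1,1), b = (1,0), c = (0,1), d = (0,0) positions."""
    if len(x) != len(y):
        raise ValueError("profiles must score the same band positions")
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def jaccard(a, b, c, d):
    # Ignores negative co-occurrences (d); undefined if a + b + c == 0.
    return a / (a + b + c)


def sorensen_dice(a, b, c, d):
    # Like Jaccard, but positive co-occurrences (a) count twice.
    return 2 * a / (2 * a + b + c)


def simple_matching(a, b, c, d):
    # Counts shared absences (d) as matches.
    return (a + d) / (a + b + c + d)


# Hypothetical band profiles of two individuals at nine band positions.
profile_1 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
profile_2 = [1, 0, 0, 1, 0, 1, 1, 0, 0]

counts = cooccurrence_counts(profile_1, profile_2)
j = jaccard(*counts)
sd = sorensen_dice(*counts)
sm = simple_matching(*counts)
```

For these profiles a = 3, b = 1, c = 1 and d = 4, so the four shared absences raise the Simple matching value without affecting the other two coefficients.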

The Tocher optimization procedure (Rao 1952) is a clustering method for individuals that has been employed with dominant marker data, e.g. RAPD and AFLP. In this method, individuals are separated into non-empty and mutually exclusive subgroups based on the similarity or dissimilarity matrix (Cruz and Regazzi 1994), which can be obtained with several coefficients. However, it does not necessarily form the same groups as the dendrograms. Moreover, it gives no information about the similarity of the strains inside each group or about the similarity among groups, which can be considered a disadvantage of the method. In the present study, the three coefficients produced little change in the number of groups formed (see Supplementary Table) but altered the classification of some strains within these groups. The results of this method agree with those of the dendrograms and the consensus fork index, confirming that the Jaccard and Sorensen-Dice coefficients are separated from the Simple matching coefficient. These results are highly consistent with those obtained earlier with AFLP and RAPD markers (Duarte et al. 1999; Meyer et al. 2004).

The two-dimensional projection efficiency, judged by the Kruskal (1964) classification (Table 2), showed unsatisfactory stress values for all coefficients (Table 5). A similar result was obtained by Meyer et al. (2004), suggesting that this two-dimensional projection method is not adequate for this data set, i.e., the projections did not efficiently represent the similarity matrices. Comparisons of the coefficients under such conditions must therefore be made carefully. Furthermore, the degree of distortion was high and the correlations were low under all conditions, which confirms the poor fit. These results differ from those of Duarte et al. (1999), in which the stress values varied from 11.4 to 32.0 (excluding the Russell and Rao coefficient), so that in that study it was possible to compare the efficiency of the coefficients.

Figure 1.

Dendrograms (UPGMA) constructed for the seven Iranian native silkworm populations from genetic similarities based on the Jaccard, Sorensen-Dice and Simple matching similarity coefficients for the AFLP molecular markers. The strains were Guilan Orange (Gu Or), Baghdadi (Ba), Harati White (Ha Wh), Harati Yellow (Ha Ye), Khorasan Lemon (Kh Le), Khorasan Orange (Kh Or) and Khorasan Pink (Kh Pi).


Table 5.

Degree of distortion, correlation between the original and estimated distances (r), and stress value obtained by projecting the distances into the two-dimensional space.


The analyses described above, which have distinct theoretical bases, revealed some general tendencies. The Jaccard and Sorensen-Dice coefficients can be separated from the Simple matching coefficient, which always shows results different from the other two. Inspection of their formulae shows that the first two coefficients share principles that differ from those of the third. The Jaccard and Sorensen-Dice coefficients do not consider negative co-occurrences, while the Simple matching coefficient includes them in its expression. This could explain the different classification of the coefficients.
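The shared principle can be made explicit with a short derivation. Writing a, b, c and d for the numbers of (1,1), (1,0), (0,1) and (0,0) positions, and assuming the standard forms of the three coefficients, the Sorensen-Dice value is a strictly increasing function of the Jaccard value, whereas the Simple matching value also depends on d:

```latex
\begin{align}
J &= \frac{a}{a+b+c}, \qquad
SD = \frac{2a}{2a+b+c}, \qquad
SM = \frac{a+d}{a+b+c+d}, \\
SD &= \frac{2a/(a+b+c)}{(2a+b+c)/(a+b+c)} = \frac{2J}{1+J}, \\
\frac{\mathrm{d}\,SD}{\mathrm{d}J} &= \frac{2}{(1+J)^{2}} > 0 .
\end{align}
```

Because SD increases monotonically with J, the two coefficients always rank pairs of individuals in the same order, which is consistent with a Spearman correlation of 1 between them. No such one-to-one relation holds between J and SM, since SM changes with d while J does not.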

The Sorensen-Dice coefficient of similarity is frequently referred to as the measure of genetic similarity of Nei and Li (1979). For a given data set, the values of Jaccard's similarity are never larger than the corresponding values of the Sorensen-Dice similarity and the Simple matching coefficient. In contrast, values of the Sorensen-Dice similarity may be greater or smaller than the corresponding values of the Simple matching coefficient, depending on whether the number of positions with shared bands a is greater or less than the number of positions with shared absence of bands d, respectively.

The basis for choosing the most appropriate coefficient of similarity depends on the type of marker and the ploidy of the organism under consideration (Kosman and Leonard 2005). Landry and Lapointe (1996) suggested that the Sorensen-Dice or Jaccard coefficients might be preferable to the Simple matching coefficient when using RAPD analysis to compare groups of distantly related taxa. Hallden et al. (1994) considered the Simple matching coefficient to be the more appropriate measure of similarity when closely related taxa are considered, but Kosman and Leonard (2005) believe that this choice should be supported with estimates of DNA sequence identity between the taxa. In the absence of supporting sequence identity estimates, similarity values based on dominant marker data should be regarded as tentative.

In their investigation, Kosman and Leonard (2005) could not recommend any preferred similarity measure for dominant markers in diploid (polyploid) organisms, because they believed that no suitable method could be proposed for measuring genetic similarity between diploid organisms on the basis of dominant banding profiles. In other words, banding patterns of diploids with dominant markers and of polyploids with codominant markers represent the phenotypes of individuals rather than their genotypes. For RAPD markers applied to common bean cultivars, Duarte et al. (1999) found greater efficiency in the two-dimensional projections for the Sorensen-Dice coefficient, which they suggested for practical applications. However, neither we nor Meyer et al. (2004) found greater efficiency for this coefficient. As stated by Meyer et al. (2004), based on the biochemical properties of dominant markers, there is no guarantee that DNA regions with negative co-occurrences between two inbred lines are indeed identical. Consequently, it seems reasonable to conclude that coefficients that exclude negative co-occurrences are better justified when closely related organisms are being compared. Thus, it should be possible to obtain satisfactory results with the Jaccard or Sorensen-Dice coefficient when the organisms are closely related.


The authors are grateful to the Iran Sericulture Research Centre for providing silkworm samples and to the University of Guilan for financial support.



AFLP: amplified fragment length polymorphism,

CIc: consensus fork index,

RAPD: random amplification of polymorphic DNA,

SSR: simple sequence repeat,

RFLP: restriction fragment length polymorphism,

Gu Or: Guilan Orange,

Ha Wh: Harati White,

Ha Ye: Harati Yellow,

Kh Le: Khorasan Lemon,

Kh Or: Khorasan Orange,

Kh Pi: Khorasan Pink,

UPGMA: unweighted pair-group method with arithmetic mean,

NTSYS: numerical taxonomy system


  1. CD Cruz. 2001. Programa Genes: versão Windows; aplicativo computacional em genética e estatística. Universidade Federal de Viçosa, Viçosa. 648 pp.

  2. CD Cruz, AJ Regazzi. 1997. Modelos biométricos aplicados ao melhoramento genético. Universidade Federal de Viçosa, Viçosa. 390 pp.

  3. CD Cruz, JMS Viana. 1994. A methodology of genetic divergence analysis based on sample unit projection on two-dimensional space. Revista Brasileira de Genética 17: 69–73.

  4. LR Dice. 1945. Measures of the amount of ecologic association between species. Ecology 26: 297–302.

  5. MC Duarte, JB Santos, LC Melo. 1999. Comparison of similarity coefficients based on RAPD markers in the common bean. Genetics and Molecular Biology 22: 427–432.

  6. JC Gower, P Legendre. 1986. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification 3: 5–48.

  7. C Hallden, NO Nilsson, IM Rading, T Sall. 1994. Evaluation of RFLP and RAPD markers in a comparison of Brassica napus breeding lines. Theoretical and Applied Genetics 88: 123–128.

  8. M Hollander. 1973. Nonparametric statistical methods. Wiley.

  9. P Jaccard. 1901. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37: 547–579.

  10. AA Jackson, KM Somers, HH Harvey. 1989. Similarity coefficients: measures for co-occurrence and association or simply measures of occurrence. American Naturalist 133: 436–453.

  11. RA Johnson, DW Wichern. 1988. Applied multivariate statistical analysis. Prentice-Hall.

  12. E Kosman, KJ Leonard. 2005. Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Molecular Ecology 14: 415–424.

  13. JB Kruskal. 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29: 1–27.

  14. PA Landry, FJ Lapointe. 1996. RAPD problems in phylogenetics. Zoologica Scripta 25: 283–290.

  15. TL Maguire, R Peakall, P Saenger. 2002. Comparative analysis of genetic diversity in the mangrove species Avicennia marina (Forsk.) Vierh. (Avicenniaceae) detected by AFLPs and SSRs. Theoretical and Applied Genetics 104: 388–398.

  16. A Meyer, AAF Garcia, AP Souza, CL Souza. 2004. Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L). Genetics and Molecular Biology 27: 83–91.

  17. SZ Mirhoseini. 1998. Analysis of genetic diversity in Iranian silkworm using protein and DNA markers. Ph.D. dissertation, University of Tarbiat Modarres, Tehran, Iran.

  18. SZ Mirhoseini. 2002. Analysis of genetic diversity in some Iranian silkworm varieties using laser scanner densitometer of proteins. Proceedings, 15th Iranian Congress of Plant Protection-Silkworm Symposium, 7–11 September 2002, Razi University of Kermanshah, Iran. 164.

  19. SZ Mirhoseini, SB Dalirsefat, M Pourkheirandish. 2007. Genetic characterization of Iranian native silkworm (Bombyx mori L.) strains by using AFLP markers. Journal of Economic Entomology 100(3): 939–945.

  20. J Nagaraju, MR Goldsmith. 2002. Silkworm genomics-progress and prospects. Current Science 83: 415–425.

  21. M Nei, WH Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences of the USA 76: 5269–5273.

  22. R Peakall, PE Smouse, DR Huff. 1995. Evolutionary implications of allozyme and RAPD variation in diploid populations of dioecious buffalograss Buchloë dactyloides. Molecular Ecology 4: 135–147.

  23. KD Reddy, J Nagaraju, EG Abraham. 1999. Genetic characterization of the silkworm Bombyx mori by simple sequence repeat (SSR)-anchored PCR. Heredity 83: 681–687.

  24. CR Rao. 1952. Advanced statistical methods in biometric research. J. Wiley.

  25. FJ Rohlf. 1982. Consensus indices for comparing classifications. Mathematical Biosciences 59: 131–144.

  26. FJ Rohlf. 1998. NTSYS-pc: Numerical Taxonomy and Multivariate Analysis System, version 2.02. Exeter Software, Setauket, NY.

  27. P Skroch, J Tivang, J Nienhuis. 1992. Analysis of genetic relationships using RAPD marker data. Applications of RAPD technology to plant breeding. Symposia series, Madison, CCSA, ASHS, and AGMA, Minneapolis. 26–30.

  28. PE Smouse, R Peakall. 1999. Spatial autocorrelation analysis of individual multiallele and multilocus genetic structure. Heredity 82: 561–573.

  29. PHA Sneath, RR Sokal. 1973. Numerical taxonomy: the principles and practice of numerical classification. W.H. Freeman.

  30. RR Sokal, CD Michener. 1958. A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin 38: 1409–1438.

  31. RR Sokal, PHA Sneath. 1963. Principles of numerical taxonomy. W.H. Freeman.

  32. T Sorensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Videnskabernes Selskab Biologiske Skrifter 5: 1–34.

  33. P Vos, R Hogers, M Bleeker, M Reijans, T van de Lee, M Hornes, A Frijters, J Pot, J Peleman, M Kuiper, M Zabeau. 1995. AFLP: a new technique for DNA fingerprinting. Nucleic Acids Research 23: 4407–4414.

This is an open access paper. We use the Creative Commons Attribution 3.0 license that permits unrestricted use, provided that the paper is properly attributed.
Seyed Benyamin Dalirsefat, Andréia da Silva Meyer, and Seyed Ziyaeddin Mirhoseini "Comparison of Similarity Coefficients used for Cluster Analysis with Amplified Fragment Length Polymorphism Markers in the Silkworm, Bombyx mori," Journal of Insect Science 9(71), (1 December 2009).
Received: 12 September 2007; Accepted: 1 October 2008; Published: 1 December 2009
