Invertebrates are often used in biological monitoring of soil and water ecosystems. Because of the huge number of invertebrate species, sample processing (sorting and identification) is a labor-intensive and often difficult task that is prone to error. These errors can bias assessment results, which often are used by environmental managers to guide funding decisions for costly restoration measures. However, quality control of assessment results is not implemented in many freshwater monitoring programs. We conducted the first audit of an official European freshwater monitoring program based on 414 macroinvertebrate samples from streams and rivers in Germany. The samples were collected by personnel at 7 different commercial environmental laboratories using the European Union (EU) Water Framework Directive protocol. We audited 12% of all samples at 3 different levels: 1) a sorting audit, 2) an identification audit, and 3) a total audit based on both sorting and identification. The sorting audit revealed that 29% of the specimens and every 5th taxon (20.6%) had been overlooked by the primary analyst. Differences in sorting were correlated with taxon body size (r = 0.61, p < 0.001). The identification audit showed that >30% of taxa differed between the results of the primary analysts and auditors. Taxa considered difficult to identify were not more prone to error than were taxa considered easier to identify. Primary analysts and auditors assigned 34% of audited samples to different quality classes in ≥1 of 3 assessment modules (organic pollution, acidification, and general degradation). For 16% of the samples, these changes resulted in a different final ecological assessment. Such a high rate of differences between primary analysts and auditors could lead to ineffective allocation of several million Euros. Our results clearly illustrate the need for adequate quality control and auditing in freshwater monitoring.
Use of benthic macroinvertebrates to monitor water quality and ecological status of freshwaters has a long history in Europe, North America, and Australia (Wright et al. 1984, Smith et al. 1999, Karr and Chu 2000). In Europe, the importance of monitoring increased significantly with the implementation of the European Union Water Framework Directive (EUWFD) in 2000 (European Union 2000). This directive requires that all surface water bodies achieve good or high ecological status based on structural, chemical, and biotic parameters. Where the current ecological status of a given body of water is not good or high, mitigation measures are demanded by the EUWFD. A first estimate of the ecological status of streams and rivers in Europe revealed that ∼70% of surface water bodies currently fail to meet good or high ecological status (EU Commission 2007). Consequently, many restoration measures will be conducted throughout Europe in the next few years at a cost of billions of Euros. Therefore, assessment results from monitoring programs must be reliable so that water managers can make sound decisions concerning restoration measures.
Much research effort has been devoted to developing new or adapting established monitoring programs and assessment systems throughout Europe (Haase et al. 2004a, Sandin et al. 2004, Vlek et al. 2004). All methods for monitoring benthic invertebrates are subject to variability derived from natural sources like patchiness (Pringle et al. 1988, Townsend 1989, Downes et al. 1993, Li et al. 2001, Olsen et al. 2007) or inherent to the method itself (Cao et al. 2003, Haase et al. 2004b, 2008, Clarke and Murphy 2006, Clarke et al. 2006, Friberg et al. 2006, Nichols et al. 2006, Sundermann et al. 2008). However, human error is a source of variability that has been largely neglected (Ostermiller and Hawkins 2004). Human error potentially affects all stages of freshwater biomonitoring, including site selection, field sampling, sorting, identification, data entry, analyses, and interpretation (Clarke and Hering 2006). However, we currently have very little understanding of the importance of this factor because few suitable auditing schemes exist. In Europe, regular quality control of macroinvertebrate sample processing for stream assessment is implemented only in Great Britain. Presumably, quality control is not widely implemented in macroinvertebrate-based stream assessment because it is costly. Moreover, identifying errors is often not as straightforward as it is for other analyses, e.g., in chemical water analysis, where analytical technologies are used to measure objective values. In biological quality control, differences in results are easy to detect, but true errors are difficult to evaluate because identifying species is often subjective. However, despite these difficulties, it is valuable to know how operator-dependent differences in sample processing and identification can affect outcomes of stream assessments. The few published studies of this issue show a considerable amount of operator-related sorting and identification error (Haase et al. 2006a, Stribling et al. 2008).
Our goal was to evaluate the effect of human error in sample processing on biomonitoring results. We audited macroinvertebrate samples collected from official EUWFD monitoring sites in Germany and analyzed the effect of sorting and identification errors on EUWFD assessment results. We present the first published quality-control audit of sample processing for official freshwater monitoring sites.
In 2006, the authorities of the German federal state Hesse contracted 7 commercial environmental laboratories (hereafter referred to as primary analysts [PAs] 1–7) to collect 414 macroinvertebrate samples from EUWFD monitoring sites using the EUWFD standard sampling protocol in Germany (Haase et al. 2004a). The protocol calls for sampling microhabitats in proportion to their coverage at the survey site (multihabitat sampling). All microhabitats are recorded in 5% coverage intervals and noted on a field protocol. Each 5%-microhabitat sampling unit is sampled by kick sampling with a hand-held net (mesh size = 0.5 mm). A complete sample consists of 20 sampling units (total sampling area = 1.25 m²), which are pooled for further treatment. The entire sample is sorted by the PA with the standardized protocol of Haase et al. (2004a). This protocol is analogous to the widely used Standardized River Classifications/Assessment System for the Ecological Quality of Streams and Rivers throughout Europe using Benthic Macroinvertebrates (STAR/AQEM) protocol (Furse et al. 2006). Sample material consists of the organisms and some coarse particulate organic matter (e.g., leaves, twigs, filamentous algae, or moss).
The PAs were directed to sort all organisms from a sample and to store them in vials for later identification. The sorting protocol did not allow either the PA or the auditor to use magnification or stains. The PAs were aware that all 414 samples were potentially subject to audit. The audit consisted of a sorting audit, an identification audit, and a total audit of 50 of the 414 samples (∼12%) randomly selected by the auditors. The number of processed samples was different for every PA, so the number of randomly selected samples varied between 4 and 18 and represented the relative contribution of each PA to the state-wide survey of 414 sites. Thus, the performance of individual PAs should be interpreted with caution because of the limited number of samples audited for many of the PAs.
The aim of the sorting audit was to detect specimens remaining in the sample residue. PAs were instructed to remove all individuals from the sample material and to retain the residue and sorted specimens separately. The auditors resorted the whole sample residue, removed any animals found, and placed them in a new, labeled vial. The auditors counted these specimens, identified them to the taxonomic level defined in the operational taxon list (Haase et al. 2006b), and added them to the corresponding taxon list generated by the PA. We compared the number of individuals, number of taxa, and assessment results for the original taxon list generated by the PA and the taxon list after the sorting audit. Percentages given in the text or tables refer to the taxon list after the sorting audit, which we assumed represented 100% of the organisms in the sample.
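The bookkeeping behind these percentages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout, and the example counts are hypothetical and are not taken from the study. The one element taken from the text is that the taxon list after the sorting audit (PA list plus residue) is treated as 100%.

```python
from collections import Counter

def sorting_audit_summary(pa_counts, residue_counts):
    """Compare a PA taxon list with the taxon list after the sorting audit."""
    pa = Counter(pa_counts)
    residue = Counter(residue_counts)
    # Taxon list after the sorting audit; treated as 100% of the sample.
    combined = pa + residue

    total_specimens = sum(combined.values())
    overlooked_specimens = sum(residue.values())
    # "New" taxa are those found only in the residue by the auditors.
    new_taxa = sorted(t for t in residue if residue[t] > 0 and pa[t] == 0)

    return {
        "pct_specimens_overlooked": 100.0 * overlooked_specimens / total_specimens,
        "new_taxa": new_taxa,
        "pct_new_taxa": 100.0 * len(new_taxa) / len(combined),
    }

# Hypothetical counts for one sample (not data from the study).
pa_list = {"Gammarus": 120, "Baetis": 60, "Hydropsyche": 20}
residue_list = {"Gammarus": 30, "Hydropsyche": 10, "Oulimnius": 10}
summary = sorting_audit_summary(pa_list, residue_list)
print(summary)
```

In this made-up sample, 50 of 250 specimens were left in the residue and 1 of 4 taxa was found only by the auditors.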
We used Spearman rank correlation analysis to test whether the probability of overlooking a taxon was related to body size. Body size was determined based on trait descriptions in Usseglio-Polatera (1991) and Chevenet et al. (1994). We also used correlation analysis to test whether the number of overlooked specimens was related to the percentage of organic substrates (fine particulate organic matter, coarse organic matter, or submerged aquatic vegetation; calculated from microhabitat records). We calculated the total number of differences in the sorting audit for all taxa. For taxon-specific calculations, we considered only taxa present in ≥10 of the 50 samples to avoid biasing results by taxa that occurred in very few samples.
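For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation computed on rank-transformed data, with tied values receiving their mean rank. The following is a minimal sketch in plain Python; the body-size classes and overlook percentages are hypothetical, and the sketch returns only the coefficient, not a p-value. In practice a statistics library (e.g., `scipy.stats.spearmanr`) would normally be used.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: body-size class per taxon and % of cases it was overlooked.
body_size_class = [1, 1, 2, 3, 4, 5]
pct_overlooked = [90.0, 70.0, 65.0, 40.0, 30.0, 10.0]
rho = spearman_rho(body_size_class, pct_overlooked)
print(round(rho, 3))
```

With these invented values the coefficient is strongly negative, i.e., smaller taxa are overlooked more often.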
PAs were instructed to identify all specimens from the sorted sample and to retain up to 5 voucher specimens for each identified taxon, which is common practice in Germany. The voucher specimens for each sample were stored together in 1 vial. Voucher specimens were reidentified by the auditors (who were unaware of what the PA had determined) to the taxonomic level defined in the operational taxon list (Haase et al. 2006b), which is genus or species level for most taxa, and subfamily or family level for selected groups, such as Chironomidae or Oligochaeta. The auditors were taxonomic specialists for particular taxonomic groups, and they re-identified voucher specimens from their area of expertise in all samples. However, we do not consider the results of the auditors to be necessarily the correct identifications. Thus, we do not refer to errors in our results, but rather to differences between the taxon lists generated by PAs and auditors. Samples used in the identification audit were identical to those used in the sorting audit.
We examined whether differences in the identification audit were related to the generally perceived difficulty of identifying particular taxonomic groups. We based our analysis on the following identification categories: 1) taxon can be identified in the field with basic taxonomic knowledge, 2) taxon can be identified in the field with advanced taxonomic knowledge or in the laboratory with basic taxonomic knowledge, 3) taxon can be identified in the laboratory with advanced taxonomic knowledge, and 4) taxon cannot be identified or can be identified only by a taxonomic specialist (Bayerisches Landesamt für Wasserwirtschaft 2004). As in the sorting audit, we calculated the total number of differences in the identification audit for all taxa, but taxon-specific calculations were done only for those taxa that occurred in ≥10 of the 50 samples.
Data analyses and assessment results
The EUWFD macroinvertebrate assessment system for streams and rivers in Germany is based on a reference-condition approach, and class boundaries are defined specifically for each stream type. The system consists of 3 modules, and each addresses a different stressor: organic pollution, acidification, and general degradation (including morphology). The assessment score for each module is classified in 1 of 5 quality classes (high, good, moderate, poor, bad) in accordance with the EUWFD. The overall assessment result is the Ecological Quality Class (EQC) and is based on the worst quality class to which a sample is assigned among the 3 modules (worst-case principle).
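The worst-case principle can be illustrated with a short Python sketch. The function name and data layout are hypothetical. Representing a module that is not assessed for a given sample as `None` is an assumption of this sketch, not part of the assessment system as described.

```python
# Quality classes ordered from best to worst, as defined by the EUWFD.
QUALITY_CLASSES = ["high", "good", "moderate", "poor", "bad"]

def ecological_quality_class(module_classes):
    """Return the worst class among the assessed modules (worst-case principle)."""
    # Assumption: modules that were not assessed are given as None and skipped.
    assessed = [c for c in module_classes.values() if c is not None]
    if not assessed:
        raise ValueError("no module was assessed")
    return max(assessed, key=QUALITY_CLASSES.index)

# Hypothetical sample: the worst assessed module determines the EQC.
sample = {
    "organic_pollution": "good",
    "acidification": None,
    "general_degradation": "moderate",
}
eqc = ecological_quality_class(sample)
print(eqc)
```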
The organic pollution module is based on a saprobic index that ranges from 1 to 4 and has stream-type-specific quality-class boundaries (Meier et al. 2006). The acidification module follows a similar principle as the saprobic index. In Germany, 278 taxa were listed according to their sensitivity to acidification and were classified into 5 classes (Braukmann and Biss 2004). The acidification index is calculated only for the 2 stream types (of 24 total types in Germany) potentially affected by acidification. However, these 2 stream types are relatively common in Germany. The general degradation module is based on a stream-type-specific multimetric index (MMI). Each stream-type-specific MMI is composed of 3 to 5 different metrics and is scaled to values between 0 (bad) and 1 (high). Class boundaries occur at 0.2-unit scoring intervals (Böhmer et al. 2004).
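Because the MMI class boundaries lie at fixed 0.2-unit intervals, a small shift in the score can move a sample into a different quality class. The sketch below illustrates this. How a score that falls exactly on a boundary is classified is not stated in the text, so assigning it to the lower class is an assumption here; the function name and example scores are hypothetical.

```python
def mmi_quality_class(mmi):
    """Map a multimetric index score in [0, 1] to one of 5 quality classes."""
    if not 0.0 <= mmi <= 1.0:
        raise ValueError("MMI must lie between 0 and 1")
    # Assumption: a score exactly on a boundary falls into the lower class.
    boundaries = [(0.8, "high"), (0.6, "good"), (0.4, "moderate"), (0.2, "poor")]
    for lower_bound, name in boundaries:
        if mmi > lower_bound:
            return name
    return "bad"

# Hypothetical scores that differ by only 0.04 but straddle a class boundary.
before = mmi_quality_class(0.58)
after = mmi_quality_class(0.62)
print(before, after)
```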
For each sample, assessment scores for all 3 modules were calculated from taxon lists that were adjusted to the standardized taxonomic level outlined in the operational taxon list (Haase et al. 2006b; see Identification audit above) to ensure that the taxonomic resolution of taxon lists was comparable. The taxonomic levels used in each module and metric generally are species or genus level except in the metrics % Ephemeroptera, Plecoptera, Trichoptera (EPT) taxa and number of Ephemeroptera, Plecoptera, Trichoptera, Coleoptera, Bivalvia, Odonata (EPTCBO) taxa of the general degradation module, which partially use lower resolution. We used the audit data to calculate all assessment results with the software ASTERICS (version 3.01; www.fliessgewaesserbewertung.de), which is used by water managers in Germany. The influence of a single taxon on the EQC was derived empirically by considering the importance of the taxon in the assessment system.
On average, 340 specimens/sample (29%) were overlooked or not removed by the PA and were found and removed by the auditor (Table 1). The number of overlooked specimens varied between 1.9% and 48.4% (11 to 850 specimens). On average, 18 different taxa/sample (51%) were overlooked or not removed by the PA (Table 1), and 20.6% of the taxa found in the residue by the auditors were new to the sample, i.e., were found only by the auditors and were added to the taxon list for the sample. The number of new taxa varied between 0.5 (1.3%) and 19.0 (39.8%) per sample. The number of specimens of these new taxa ranged from 4.3 (10.6%) to 31.2 (65.7%) per sample.
Table 1. Results of the sorting audit. Mean (±1 SD) and range of relative and absolute numbers of specimens and taxa found in the sample residue. New taxa = those taxa found only by the auditors. Relative numbers are given as percentages based on the taxon list after the sorting audit.
The largest numbers of overlooked specimens in the sample residue belonged to the taxa Oligochaeta (Enchytraeidae 95.9%), Trichoptera (Limnephilidae 86.2%, Hydropsyche sp. 65.5%), Coleoptera (Hydraena sp. 69.9%, Oulimnius sp. 68.0%), and Diptera (Psychodidae 66.0%). The number of overlooked specimens was not correlated with the percentage of organic substrates at the sampling sites (Spearman rank correlation, p > 0.05). Thus, specimens were not more likely to be overlooked in samples from sites with higher proportions of depositional (e.g., fine sediments) or organic substrates (e.g., coarse organic matter or submerged aquatic vegetation) than in samples from clean substrates (e.g., sites dominated by gravel/cobble). No correlation was found between differences in the sorting audit and occurrence (p = 0.21) or abundance (p = 0.21) of taxa. However, the probability of being overlooked increased significantly with decreasing body size (r = 0.61, p < 0.001; Fig. 1).
Differences between taxon lists occurred when PAs and auditors came to different identification results. Differences also occurred when the auditors could not confirm identification of certain taxa, e.g., because the voucher specimens were not deposited by the PA or the voucher specimens lacked necessary structures for identification. On average, PA and auditor taxon lists differed by 9.3 taxa (33.8%). The number of differences between taxon lists varied by PA. The largest average number of taxon differences for a single PA was 13.8 taxa (44.5%), whereas the smallest difference was 5.2 taxa (15.8%) (Table 2).
Table 2. Results of the identification audit. Means (±1 SD) and ranges of relative and absolute differences in identification results of primary analysts and auditors. New taxa = number of those taxa that were found only by the auditors. Differences between taxon lists occurred when the primary analyst and auditors came to different results. Relative and absolute numbers are based on the taxon list after the sorting audit.
The largest number of differences in the identification audit involved Hydropsyche sp. (58.3%), followed by Rhyacophila sp. (47.4%), and Baetis sp. (40.4%) (Table 3). Among different taxonomic levels, 20.4% of the differences were at species level, 40.3% at genus level, and 39.4% at family level. The largest number of identification differences in higher taxonomic groups involved Trichoptera (17.6%), followed by Turbellaria (17.1%), Mollusca (15.2%), and Hirudinea (15.2%).
Table 3. Mean (±1 SD) percentages of all cases where individual taxa were missed (sorting audit) or determined differently (identification audit). Taxa are listed if they were present in ≥20% of all samples. Identification category represents the generally perceived difficulty in identifying a taxon based on Bayerisches Landesamt für Wasserwirtschaft (2004), where category 1 is easiest and category 4 is the most difficult (see text for definitions). Occurrence in sorting audit gives the number of samples in which the taxon was present. Abundance of a taxon is calculated as a mean of all samples. Importance is the relevance of a taxon in assessment and increases as the number of metrics in which it is used increases. Importance ranges from no (−) to high (+++) relevance. gr. = group, ad. = adult, lv. = larvae.
Of the taxa that were identified differently by the PA and the auditor, 14.7% belonged to identification category 1, 26.5% to category 2, 58.8% to category 3, and 0% to category 4. Taxa in category 4 generally were rare in the operational taxon list, and none of these taxa were found in ≥20% of all samples. However, mean differences in identification of all taxa did not differ significantly among categories (Kruskal–Wallis, p = 0.97; 19.1% in category 1, 13.1% in category 2, and 14.8% in category 3).
Effects on assessment results
Sorting and identification audits revealed substantial differences between PA and auditor results. We evaluated how these differences were reflected in scores for each of the 3 assessment modules (organic pollution, acidification, general degradation) and the resulting EQC by calculating and comparing assessment scores for the PA taxon list and the corresponding taxon lists obtained after the sorting and identification audits (Table 4).
Table 4. Absolute (Abs) and relative (Rel) differences in assessment scores based on primary analyst (PA) taxon lists and auditor taxon lists after a sorting audit, an identification audit, and a total audit. SD = standard deviation, deviation = number of positive/negative changes in assessment results, quality-class changes = number of improvements/deteriorations in quality class resulting from the differences in assessment results, MMI = multimetric index.
Organic pollution module
The sorting and identification audits yielded similar results for the organic pollution module. Average absolute differences in scores were 0.11 (in both the sorting and identification audits) and 0.12 (total audit). The metric is scaled from 1 to 4, so these differences reflect a relative change of 3.6% and 3.9%, respectively. On average, sites scored worse based on auditor taxon lists than when based on PA results (Table 4). Samples were classified in different quality classes in 4 (8%), 3 (6%), and 5 (10%) cases in sorting, identification, and total audits, respectively. No clear tendency toward better or worse quality-class assignments was found for classification based on post-audit results.
Acidification module
Only 2 of the 24 stream types in Germany are affected by acidification, but these types are quite common. Twenty-five samples of our audit required assessment for this module. Of those 25 samples, 5 (20%), 3 (12%), and 5 (20%) samples were classified in a different quality class after sorting, identification, and total audits, respectively (Table 4).
General degradation module
The MMI values based on PA and auditor taxon lists were quite similar for several samples, but differed greatly for others (cf. samples 15 and 18 of PA No. 4; marked in Fig. 2). The mean difference in MMI scores based on PA and on auditor results was 0.03 (2.9%) in both the sorting and identification audits. This mean difference was 0.04 (3.8%) for the total audit (Table 4). The absolute difference in MMI values was positively correlated with the number of new taxa found by the auditors in the sorting audit across the entire data set (Spearman rank correlation, r = 0.56, p < 0.01) mainly because additional taxa increased the values of most of the metrics used in the MMI (for example number of EPTCBO taxa).
The differences in MMI values observed in each step of the audit caused some samples to score into a different degradation-module quality class. After the sorting, identification, and total audit, 8 (16%), 9 (18%), and 10 (20%) of 50 samples, respectively, scored into different quality classes. Most (6) of these 10 samples had better assessment results after the total audit (Table 4).
The final EQC resulting from all 3 modules was calculated using a worst-case principle (Meier et al. 2006). EQC for 6 of 50 samples (12%) differed after the sorting audit. EQC of 8 samples (16%) differed after the identification audit, and the same number differed after the total audit (Table 4). Lower and higher EQCs were calculated equally often after the identification and total audits.
Changes in the EQC arise from differences that affect all 3 modules. However, the EQC does not summarize the total number of differences in the 3 modules. A change in quality class in a particular module does not necessarily lead to a change in EQC because the EQC is derived from a worst-case principle. We counted and summarized all changes in the total audit of assessment results from each of the 3 modules to get a more detailed overview of the total number of changes in quality-class assignments for all assessment modules. Audits resulted in a change in quality class in ≥1 module for 17 of 50 samples (34%). This number is >2× the number of changes in EQC.
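The distinction between module-level changes and EQC changes can be made concrete with a small Python sketch. All names and the three example samples are hypothetical. The sketch shows only why the count of samples with a change in ≥1 module can exceed the count of samples with a changed EQC under a worst-case rule.

```python
# Quality classes mapped to a rank; a larger rank means a worse class.
ORDER = {"high": 0, "good": 1, "moderate": 2, "poor": 3, "bad": 4}

def worst(classes):
    """Worst quality class among the modules of one sample."""
    return max(classes, key=ORDER.get)

def count_changes(samples):
    """Count samples with a changed class in >=1 module and with a changed EQC.

    samples: list of (pa_classes, audit_classes) pairs, each a tuple of
    module quality classes given in the same module order.
    """
    module_changed = 0
    eqc_changed = 0
    for pa, audit in samples:
        if any(a != b for a, b in zip(pa, audit)):
            module_changed += 1
        if worst(pa) != worst(audit):
            eqc_changed += 1
    return module_changed, eqc_changed

# Hypothetical samples (not data from the study).
samples = [
    # One module improves, but the worst module is unchanged: EQC stays moderate.
    (("good", "good", "moderate"), ("high", "good", "moderate")),
    # The worst module improves: EQC changes from moderate to good.
    (("good", "good", "moderate"), ("good", "good", "good")),
    # No change in any module.
    (("good", "high", "poor"), ("good", "high", "poor")),
]
result = count_changes(samples)
print(result)
```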
Two Baetis species, Ephemera, Rhithrogena (Ephemeroptera), 2 Gammarus species and Asellus (Crustacea), as well as taxa of other taxonomic groups (e.g., Leuctra [Plecoptera], Dugesia [Tricladida], and Sericostoma [Trichoptera]) had the greatest influence on changing the EQC (Table 3). This is mainly because these taxa were used in more metrics than other species and thus had a greater impact.
The sorting protocol required PAs to remove and sort all specimens from a sample. We assumed that sorting an entire sample under laboratory conditions would be a relatively simple task that should generally lead to nearly complete removal of specimens from a sample. However, the audit revealed that close to ⅓ of all specimens and ⅕ of all taxa were either overlooked or ignored and remained in the sample residue. Small and slender taxa, such as Elmidae, were particularly affected (Table 3, Fig. 1). Other sorting techniques, such as live-sorting in the field and all sorting methods based on the estimation of the number of specimens, should have an even higher potential for error (Haase et al. 2004b). In an audit of macroinvertebrate samples done within the EU-funded STAR project (Furse et al. 2006), sorting error for STAR/AQEM samples was less than half that for samples from other large assessment programs, such as River Invertebrate Prediction and Classification System (RIVPACS; Wright et al. 2000) (Haase et al. 2006a). STAR/AQEM samples are sorted in the laboratory following a similar procedure to the one applied in our study. RIVPACS also uses a laboratory sorting technique, but allows estimation of number of individuals in abundant taxa. Thus, sorting error in more complex or less standardized sorting procedures or protocols allowing for an estimation of number of individuals, such as live sorting, is greater than in simple laboratory sorting procedures. If the putatively most robust sorting technique (laboratory sorting) leads to error rates as high as those observed in our audit, then error rates for more complex sorting techniques are likely to be even higher. Sorting error seems to be much more common in sample processing, and more likely to affect assessment outcomes, than expected. Therefore, we recommend using only laboratory sorting techniques and carefully trained personnel in important surveys.
The mean number of differences between PA and auditor taxon lists was 33.8% of the recorded taxa in our audit. Thus, identification results differed for ∼⅓ of our records. Stribling et al. (2008) found identification differences in 22.1% of the taxa investigated. Results of both studies demonstrate the risk of inaccurate identification in sample processing. Moreover, PAs might have chosen the least ambiguous individuals of a species as voucher specimens. Thus, the results of our identification audit might have been biased toward a smaller error rate. It is reasonable to assume that a fully quantitative identification audit based on all individuals of a sample would lead to more frequent and even greater differences in assessment results.
In both studies, differences in identification (and sorting) were related to the taxonomic unit (order, family) to which specimens belonged. However, comparing differences in identification (and sorting) among higher taxonomic units is difficult because these units are composed of species that differ greatly in attributes like body size, difficulty of identification, or ecological preference. We expected greater differences in identification for taxa known to be difficult to identify. However, differences were independent of identification difficulty category. This result could indicate that operators were unconsciously aware of the different identification difficulty categories and paid less attention to taxa that are easier to identify. Thus, identification error might depend less on how difficult a taxon is to identify than on how much attention an operator pays to identification of different taxa. These results correspond to the results of the sorting audit where operator attention to detail affected the error rate. Differences in identification could be caused by limited attention to detail, deficits in taxonomic expertise, or both. Identification courses and workshops are critically needed to help overcome these problems. Those courses should, thus, not focus on “difficult” taxa, but rather on common and higher-level taxa and should promote operator awareness of attention to detail when working with easier-to-identify taxa.
Our identification audit assessed only qualitative differences because the PAs were asked to deposit ≤5 individuals/taxon as voucher specimens. Restricting the number of voucher specimens is a common practice mandated by German water authorities to save time and money. In light of our results, we think it reasonable to consider changing the procedure to at least retain all voucher specimens combined in 1 vial and the remainder of the individuals from a sample in a separate vial. This change would enable better, quantitative quality assurance in the future with a minimal increase in overall costs for routine sampling.
Effects on assessment results
The EQC of 6 monitoring sites (12%) changed in response to the sorting audit, and the EQC of 8 monitoring sites (16%) changed in response to the identification audit. However, the EQC of only 8 monitoring sites (16%) changed after the total audit. The effects of sorting and identification differences were not cumulative because the EQC of only 2 of the 50 samples used in the audit changed in both the sorting and the identification audit.
About every 3rd monitoring site (34%) scored into a different quality class in 1 to 3 of the assessment modules (organic pollution, acidification, and general degradation) after the total audit. This relatively high number of reclassifications clearly demonstrates the effect of human sorting and identification differences on assessment results. However, the effects of sorting, identification, and total audit on scoring did not differ among assessment modules (Wilcoxon, all p > 0.25), indicating no difference in sensitivity of the modules to error.
The general degradation module is the most important module in the German assessment system because scores for ∼90% of German rivers are lowest in this module. Thus, based on the worst-case principle, scores in the general degradation module are responsible for 90% of the final EQC assessments in Germany. Difference rates in the sorting audit were high for several taxa that have high relevance in this module (indicated by ++ and +++ in Table 3), but difference rates in the identification audit were high for only a few of them (Table 3). However, the relative and total number of changes in EQC was slightly higher after the identification audit than after the sorting audit. Thus, taxa that are generally not expected to influence assessments (e.g., Gammarus sp. or Baetis sp.) must have contributed to post-audit changes in the EQC. This effect could arise if noninfluential taxa were very abundant and differences in identification led to changes in a relatively large proportion of samples.
Across the 3 modules and the final EQC, the total number of post-audit quality-class changes did not differ significantly among sorting (18 better/5 worse), identification (10/13), and total (16/12) audits (Table 4). However, additional specimens and taxa found in the sorting audit led to more positive than negative changes, whereas positive and negative changes were equally common in the identification and total audits. Thus, overlooking or ignoring specimens during sorting led to underestimation of the EQC.
The EUWFD requires use of the worst-case principle when calculating the final EQC. We compared the effect of sorting and identification audits on this worst-case approach to effects on an approach based on a mean score calculated across all 3 assessment modules. The mean-score approach leads to better assessment results than the worst-case approach, but the effect of the 3 audits did not differ significantly between the 2 approaches (Wilcoxon, all p > 0.46).
The value of audits and quality control
To our knowledge, the UK is the only country with a routine auditing scheme for macroinvertebrate biomonitoring samples. In the 1st year after implementation of the UK auditing scheme, differences between PAs and auditors were large (Dines and Murray-Bligh 2000, Murray-Bligh et al. 2005), probably because most investigators were convinced that they worked accurately and, therefore, failed in self-evaluation. However, in the 2nd year after implementing the UK auditing scheme, the differences between PAs and auditors decreased rapidly. This quick response is probably attributable to increased awareness of operators that sample processing was more prone to error than previously assumed.
In our study, the differences in sorting and identification between PAs and auditors equally affected assessment results. Sample sorting is conceptually very simple, and one might reasonably expect the task to be left to the most junior and inexperienced biologists. However, our audit results and the results from Haase et al. (2006a) demonstrate that sorting is, in fact, a task that requires more skill than has been recognized in the past. Our results demonstrate the need for formal training of sorting personnel and for improvement of internal laboratory quality-control procedures.
Macroinvertebrate specimen identification is generally regarded to be more difficult than sample sorting. However, carelessness and lack of formally trained and experienced staff are presumably the predominant sources of error in both sorting and identification. Causes of human errors include incorrect interpretation of technical literature; transcription or recording errors; coarse definitions of terminology, nomenclature, and standard procedures; differences in optical equipment; and sample handling and preparation techniques (Stribling et al. 2003, 2008, Dalcin 2004, Chapman 2005). Unfortunately, biologists often receive little formal training in sample sorting, specimen identification, or other steps in sample processing. Thus, the large differences between PAs and auditors in the identification audit are not surprising. Our audit results clearly indicate that formal training is of utmost importance in macroinvertebrate sample processing and could presumably reduce assessment errors. Basic taxonomic skills are being taught less and less in tertiary education programs around the globe (Holzenthal et al. 2010). This trend has been acknowledged in the benthological community. Efforts like the North American Benthological Society's Taxonomist Certification Program (www.nabstcp.com/) provide professional taxonomic certification to ensure high-quality taxonomy and, thus, credible ecological and reliable bioassessment studies, and support graduate training of next generation taxonomic experts. Such efforts should be promoted even more strongly, perhaps even mandated, for operators of official bioassessments.
Our data also show that human error in sample processing has a great effect on assessment results. Implementation of a quality-management system is vital to overcome these shortcomings. At minimum, such a system must incorporate standardized methods and protocols, formal training procedures and evaluations, and an auditing scheme for randomly selected samples. Experience in the UK clearly shows that implementing a routine audit scheme has a strong positive effect on data quality. Therefore, quality control will increase the accuracy and precision of biological data and strengthen confidence of water managers in assessment results. This confidence is of utmost importance because assessment results are used to guide and direct cost-intensive mitigation measures in river rehabilitation. According to BMU (2005), 60% of German streams and rivers will fail to achieve the good or high ecological status required by the EUWFD. An additional 26% of streams and rivers are at risk of failing to meet this status. If we estimate a cost of 400,000 Euro/km of river length for morphological improvements (Interwies et al. 2004), restoration will cost billions of Euros in Germany alone. An error rate of 16% (as measured in our study) would lead to inefficient allocation of several million Euros. The costs of misplaced remedial actions will outweigh the costs of implementing quality control by several orders of magnitude. Thus, a quality-control system should be implemented in freshwater monitoring programs to help avoid the high costs of unnecessary or incorrectly guided restoration measures based on inaccurate assessment results.
Our results are already reaching water managers in Germany and are causing them to rethink current and past practices. They realize that human error is more prominent than previously thought and have started to develop a quality-control system. This new course of action is encouraged by the positive effect of quality-control application in marine monitoring programs (Ranasinghe et al. 2003). The UK, German, and marine examples should encourage water managers in other countries to establish quality-management systems where they are presently not in place. Furthermore, we assume that high error rates in sorting and identification of invertebrate samples are a general pattern that is not restricted to riverine systems but also occurs during processing of samples from lakes and terrestrial ecosystems (e.g., soil fauna). The large number of individuals or taxa in invertebrate samples makes invertebrate-based monitoring more prone to human error than monitoring based on other groups of organisms (e.g., vertebrates or higher plants). We still know little about the effect of human error on monitoring outcomes, but we do know that human error is a serious problem that affects applied science and any kind of ecological research dealing with invertebrate taxon lists. Therefore, implementation of quality-management systems is vital for both applied and basic ecology.
We thank Manfred Colling, Thomas O. Eggers, Christine Engelhardt, Wolfram Graf, Elisabeth Heigl, Arne Haybach, Monika Hess, Uwe Jueg, Irene Rademacher, Rüdiger Wagner, Doreen Werner, and Michael Zettler for their support of our audit. We thank Robin Thomson (St Paul, Minnesota), Mark Vinson (Logan, Utah), and 1 anonymous referee for helpful comments on earlier versions of our manuscript. This study was financed by Hessisches Landesamt für Umwelt und Geologie and the research funding programme LOEWE — Landes-Offensive zur Entwicklung Wissenschaftlich-oekonomischer Exzellenz of Hesse's Ministry of Higher Education, Research, and the Arts.