In this paper we aim to investigate the problems and potentialities of species distribution modeling (SDM) as a tool for conservation planning and policy development and implementation in tropical regions. We reviewed 123 studies published between 1995 and 2007 in five of the leading journals in ecology and conservation, and examined two tropical case studies in which distribution modeling is currently being applied to support conservation planning. We also analyzed the characteristics of data typically used for fitting models within the specific context of modeling tree species distribution in Central America. The results showed that methodological papers outnumbered reports of SDMs being used in an applied context for setting conservation priorities, particularly in the tropics. Most applications of SDMs were in temperate regions and biased towards certain organisms such as mammals and birds. Studies from tropical regions were less likely to be validated than those from temperate regions. Unpublished data from two major tropical case studies showed that those species that are most in need of conservation actions, namely those that are the rarest or most threatened, are those for which SDM is least likely to be useful. We found that only 15% of the tree species of conservation concern in Central America could be reliably modelled using data from a substantial source (Missouri Botanical Garden VAST database). Lack of data limits model validation in tropical areas, further restricting the value of SDMs. We concluded that SDMs have a great potential to support biodiversity conservation in the tropics, by supporting the development of conservation strategies and plans, identifying knowledge gaps, and providing a tool to examine the potential impacts of environmental change. However, for this potential to be fully realized, problems of data quality and availability need to be overcome. 
Weaknesses in current biological datasets need to be systematically addressed, by increasing collection of field survey data, improving data sharing and increasing structural integration of data sources. This should include use of distributed databases with common standards, referential integrity, and rigorous quality control. Integration of data management with SDMs could significantly add value to existing data resources by improving data quality control and enabling knowledge gaps to be identified.
Introduction
Predictive species distribution models are empirical models relating field observations to environmental variables, based on statistically or theoretically derived response surfaces [1, 2]. The most common strategy for estimating the potential geographic distribution of a species is to characterize the environmental conditions that are suitable for that species. The spatial distribution of environments that are suitable for a species can then be estimated across a given study region. A wide variety of modeling techniques have been developed for this purpose (see Appendix 1), including generalized linear models, generalized additive models, bioclimatic envelopes, habitat suitability indices, and the genetic algorithm for rule-set prediction (GARP).
Species distribution modeling (SDM) has become increasingly popular in recent years among researchers. It has been used to address a variety of different problems at various scales, with a range of different species occurring in different geographic areas. Applications of SDM methods include quantifying the environmental niche of species [3, 4], testing biogeographical, ecological, and evolutionary hypotheses [5, 6, 7], assessing species' invasions [8, 9], assessing the impact of climate, land use, and other environmental changes on species distribution [10, 11, 12], suggesting unsurveyed sites of high potential of occurrence for rare species [13, 14, 15], and supporting conservation planning and reserve selection [16, 17].
There are several particular advantages to using SDM to support conservation planning: (1) Maps of documented occurrences of species convey no information on the likelihood of occurrence in areas that have not been surveyed. Range maps from field guides and similar data are often too coarse to be useful for on-the-ground conservation action or research. (2) Accurate predictive distribution maps make field inventories more efficient and effective. They show where to commit the limited available resources for inventories by highlighting the areas where a targeted species or habitat type is most likely to be found. (3) Predictive distribution maps for multiple species or habitat types, produced with consistent and reliable methods, are well suited for identifying spatial patterns in biological diversity, which can be of value for assessing conservation priorities. (4) Predictive distribution maps are very useful for conservation planning efforts at a range of different scales. As a result of these advantages, in the last decade, a number of international organizations have employed species modeling in order to address key policy objectives at a global scale (e.g., UNEP, the Convention on Biological Diversity, Organization for Economic Co-operation and Development, European Union, Conservation International, IUCN, and WWF).
Several statistical issues, however, stand as obstacles for species distribution analysis. The first and foremost is data availability [18]. Much biodiversity has yet to be formally described and catalogued. In general, this problem—the so-called “Linnean shortfall” [19]—appears to be of increasing relevance as the organisms decrease in size [20]. In addition, knowledge of the global, regional, and even local distributions of many taxa is currently inadequate, a problem that Lomolino [21] named the “Wallacean shortfall.” Many areas of the world remain seriously under-collected for most taxa, with the result that even for higher plants, reliable systematic species range maps are available only for a fraction of the earth's surface [20]. Many of these problems are particularly intense in tropical areas. Whereas it is widely appreciated that most megadiverse areas occur in the tropics [22], rates of habitat loss and environmental degradation also tend to be higher in tropical regions [23, 24, 25, 26, 27]. Therefore, the need for tools to assist conservation planning, policy development, and implementation is particularly urgent in tropical regions.
In this paper, we first examine the extent to which SDM techniques are being developed and applied in tropical regions, based on the results of a literature review. We then use this review to evaluate the scope and objectives of SDM initiatives in tropical areas, in comparison to those undertaken in temperate regions. We then explore the potential limitations to the application of SDM in the tropics, by examining two case studies in detail. Finally, we discuss how the potential value of SDM approaches might best be realized in future, given the current limitations that exist.
Literature review
We conducted a literature search on the Web of Science using the keywords “species distribution” AND “model.” We selected the five journals containing the largest number of studies using SDM that could be considered to provide information suitable for guiding conservation policies. The selected journals were Biological Conservation, Conservation Biology, Diversity and Distributions, Global Ecology and Biogeography, and Journal of Applied Ecology. We reviewed all papers published between January 1995 and May 2007, selecting those that had used SDMs to predict species distributions, but excluding review papers and studies using raw geographical data without the inclusion of statistical models. Our search yielded 123 papers (Appendix 2, references are provided in Appendix 3) that satisfied the inclusion criteria. From each study we extracted the following information: (1) study region; (2) aim of the study (classified as “methodological” if it focused on the development of modeling methods or compared the accuracy of different modeling techniques; or “applied” if it involved the application of SDM methods to practical conservation problems, such as biological invasions, climate change, conservation prioritization, and biodiversity mapping); (3) model validation (classified according to the approaches used as “non reported,” “validated with the same data,” “k-fold partitioning,” “prospective sampling,” and “informal validation”); (4) focal taxa; and (5) data type (classified as “presence only,” “presence-absence,” and “abundance” data).
Most studies in our literature review were conducted using data from relatively well sampled countries (Appendix 2), such as the USA and Canada (28), Australia (10), or European countries (48). Relatively few used data from tropical regions such as Central and South America (8), Africa (10), or Asia (6).
We found that 39% of papers (48) focused on the development or evaluation of methods rather than on their application (Appendix 2). This is an unexpectedly high proportion, taking into account the exclusion of methodological journals such as Ecological Modelling. These papers documented studies presenting new methodologies for predicting species distributions [e.g., 28, 29, 30], and those evaluating the performance of different models [e.g., 31, 32, 33]. Some also explored a variety of issues related to spatial scale and extent [e.g., 34, 35], model accuracy [e.g., 36, 37], or variable selection [e.g., 38]. Overall, the high proportion of methodological papers may indicate that the application of these techniques is not free of controversy. Among those studies that used SDM in a more applied context, there was a broad mixture of goals, including species conservation (29 papers), biological invasions (14 papers), climate change (10 papers), autoecology (7 papers), and biogeography (6 papers). Conservation prioritization was mentioned in only six papers and biodiversity mapping in three. When stratified by biomes, it can be observed that most methodological studies were carried out in temperate regions, while applied studies tended to be conducted in tropical regions (Fig. 1a).
It is generally accepted that a robust test for the prediction success of a model should include independent data, i.e., data not used to develop the predictive model. However, only 13% of the reviewed studies (17) validated SDMs using a new sample of cases obtained from a different region or time after the model had been developed (referred to as prospective sampling in Appendix 2). Just 8% of studies (10) used some sort of informal validation, e.g., by comparing the SDMs with existing distribution atlases [40, 41] or through literature review [42]. A large proportion of studies (48; 38%) partitioned the data into subsets and used one of these sets for training and the remaining sets for testing purposes. Though this is a common practice, data partitioning is not the same as collecting new independent data for model testing [43], particularly if predictions are to be tested for their general use. Only if predictions are to be restricted to a homogeneous region can data partitioning be expected to yield results similar to those of prospective sampling [44]. Finally, 17% of studies (22) used the same data for training and testing, and 24% (31) did not report any validation at all. When stratified by biomes, the number of studies using prospective sampling or some sort of informal validation was proportionally higher in temperate (22%) than in tropical regions (15%) (Fig. 1b). Conversely, the number of studies not reporting any form of validation or using data partitioning was proportionally higher in tropical (26% and 41%, respectively) than in temperate regions (22% and 38%, respectively). The number of studies using the same dataset for validation was roughly the same (18%) for both temperate and tropical regions. Overall, this indicates that a large proportion of the reviewed studies, especially those undertaken in tropical regions, reported model results without rigorous testing of model properties.
This was particularly the case in applied studies, where lack of reporting of validation and testing SDM with the same data was commonplace (Fig. 1c). In contrast, methodological studies accounted for the largest proportion of cases validated through prospective sampling.
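The difference between testing a model on its own training data and testing it on held-out records can be made concrete with a short sketch of k-fold partitioning. The Python code below is illustrative only: the function name and the toy coordinates are hypothetical and are not taken from any of the reviewed studies, and a real analysis would fit and score a model inside the loop.

```python
import random


def k_fold_partitions(records, k, seed=0):
    """Yield (train, test) pairs from k disjoint folds of the records."""
    if not 2 <= k <= len(records):
        raise ValueError("k must be between 2 and the number of records")
    shuffled = list(records)
    # A fixed seed keeps the (hypothetical) partition reproducible.
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Every record outside the current fold is used for training.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


# Hypothetical occurrence records as (longitude, latitude) pairs.
records = [(-90.0 + i * 0.1, 15.0 + i * 0.05) for i in range(10)]
splits = list(k_fold_partitions(records, k=5))
```

Each record appears in exactly one test fold, so no record is used to evaluate a model it helped to fit. Even so, all folds come from the same sampling campaign, which is why partitioning is not equivalent to prospective sampling.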
Most papers were published on birds (46), plants (34), or mammals (29, Appendix 2). Groups receiving less attention were reptiles (11), amphibians (7), fishes (2), lower plants (2), and invertebrates (24). Invertebrates encompassed many different taxa such as insects, arachnids, snails, crustaceans, and rotifers. The shortage of studies focusing on this group contrasts markedly with the high diversity of organisms that are represented within it.
Finally, 60 out of 123 papers used presence-only data for modeling species distributions, 60 used presence-absence data, and only seven used abundance data (of which four also used presence-absence data, Appendix 2). A high proportion of the studies conducted in tropical regions (73%) used presence-only data, whereas studies conducted in temperate regions most often used presence-absence or abundance data (59%).
Overall, these results show that studies using SDM are scarcer in the tropics than in temperate regions, which stands in contrast with the high biodiversity held by tropical ecosystems. Fairly complete datasets from well-sampled regions make possible the development or evaluation of methods. However, these methods are not always as effective when applied to conservation case studies in tropical regions. Results from such case studies reveal data-driven constraints that limit the applicability of SDM, such as lack of independent information to validate the predicted distribution of species or lack of reliable absence data.
Application of SDM in the tropics: two case studies
To explore current approaches to applying SDM techniques in the tropics and highlight the problems encountered in greater depth, two case studies are described here in detail. Although neither of these case studies has been published previously in a scientific journal, they were selected because they were commissioned by conservation organizations, so they can potentially illustrate how well SDMs meet the expectations of governmental and non-governmental conservation organizations in tropical countries. The first study reported in this section was published on-line [45] and was part of a larger Andes-Amazon project commissioned by NatureServe. The second study was provided by one of the authors, who works for the Mexican Commission for the Knowledge and Use of Biodiversity (CONABIO).
The first study was conducted on the eastern slope of the Andes in Peru and Bolivia to model the real distribution of endemic species [45]. The study aimed to fill knowledge gaps in support of conservation planning in the Tropical Andes. The list of focal species included 115 birds, 55 mammals, 177 amphibians, and 435 plants. The Maxent algorithm was selected for modeling species distribution because previous comparative studies had shown that it performs well even with small sample sizes [46, 47, 48]. Four distributions were predicted with Maxent for each species, using all the available locality data but varying the input environmental layers (see [45] for further details). Because of the scarcity and low spatial precision of available locality data, it was not possible to partition the data into records used for training the model and those set aside for a statistical model evaluation. Instead, specialists in each group reviewed the four models and selected which, if any, reflected a realistic depiction of the distribution. This decision was based partly on validation with the same dataset used for modeling and partly on expert judgment. In the cases in which a Maxent model was considered to be reasonable, the reviewers then selected a cutoff threshold to convert the continuous Maxent predictions to presence-absence maps. Despite the alleged suitability of this method for this purpose [46, 47, 48], there were many cases where the Maxent models did not produce a realistic distribution map for the species. In such cases, deductive and hybrid models were relied on. Deductive models were created by defining the maximum and minimum elevations at which the species was expected to occur. Hybrid models used part of the Maxent prediction in one portion of the species' range and a deductive model for the remaining area. Table 1 shows that a large proportion of the target species could not be effectively modelled with Maxent.
Endemic amphibians and plants were particularly challenging; 52.0% and 39.3% of the Maxent models for these two groups, respectively, did not produce realistic distribution maps. Similarly, the mean number of records per species was lower, and the proportion of species with a single record higher, for endemic amphibians (6.0 and 36.7%, respectively) and plants (7.0 and 28.3%, respectively) than for birds (21.2 and 2.6%, respectively) and mammals (11.2 and 7.3%, respectively) (Table 1). This highlights the problems of using objective modeling approaches for analyzing range delimitation when quantitative data are lacking.
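The three kinds of map described above (a thresholded continuous prediction, a deductive elevation envelope, and a hybrid of the two) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in [45]: the grids, the elevation limits, the cutoff, and all function names are hypothetical.

```python
def apply_threshold(suitability_grid, cutoff):
    """Convert continuous suitability scores to presence (1) / absence (0)."""
    return [[1 if s >= cutoff else 0 for s in row] for row in suitability_grid]


def deductive_model(elevation_grid, min_elev, max_elev):
    """Mark cells whose elevation lies within the expected range as presence."""
    return [[1 if min_elev <= e <= max_elev else 0 for e in row]
            for row in elevation_grid]


def hybrid_model(statistical_map, deductive_map, use_statistical):
    """Take the statistical map where the mask is True, the deductive map elsewhere."""
    return [
        [s if keep else d for s, d, keep in zip(s_row, d_row, m_row)]
        for s_row, d_row, m_row in zip(statistical_map, deductive_map, use_statistical)
    ]


# Hypothetical 2 x 3 grids: suitability scores and elevations in metres.
suitability = [[0.1, 0.7, 0.4], [0.9, 0.2, 0.6]]
elevation = [[500, 1200, 2500], [800, 1800, 3200]]

statistical = apply_threshold(suitability, cutoff=0.5)
deductive = deductive_model(elevation, min_elev=1000, max_elev=2600)
# Assumed expert mask: trust the statistical map in the first row only.
mask = [[True, True, True], [False, False, False]]
hybrid = hybrid_model(statistical, deductive, mask)
```

In this sketch the cutoff and the mask stand in for the expert decisions described in the text, which is where the subjectivity of the deductive and hybrid approaches enters.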
Table 1.
Summary of biological data and modeling methods used in the prediction of endemic species distributions on the east slope of the Andes in Peru and Bolivia [45]. The Maxent method was used for modeling when possible. Where this method did not output a realistic distribution map for the species, deductive and hybrid models were used. Deductive models were created by defining the maximum and minimum elevations at which the species was expected to occur. Hybrid models used part of the Maxent prediction in one portion of the species' range and a deductive model for the remaining part. In brackets, the proportion of the species in each taxon modelled with each of the three methods is given.
Our second example refers to the Gap Analysis for terrestrial biodiversity in Mexico (http://www.conanp.gob.mx/pdf_vacios/terrestre.pdf), an initiative coordinated by CONABIO and CONANP (National Commission for Protected Areas). This initiative aims to assess the efficiency of the current network of protected areas to conserve a representative part of the country's biodiversity and generate a strategy for adapting the protected area system [48]. As part of this task, species distribution maps were generated by experts on several taxonomic groups using the GARP algorithm [50] (Table 2). Species geographical distributions were constructed from raw occurrence data obtained from the National Biodiversity Information System (SNIB, CONABIO) and the World Information Network on Biodiversity (REMIB), in combination with datasets of environmental variables believed to affect species distributions in Mexico. Data layers included climatic variables from Worldclim (http://www.worldclim.org/), topographic and hydrologic parameters from Hydro1k (http://edc.usgs.gov/products/elevation/gtopo30/hydro/index.html), and thematic national datasets from INEGI and CONABIO. The species selected for this purpose were catalogued under the Mexican Red List of Threatened Species NOM-059-SEMARNAT-2001, were range restricted or rare, or belonged to taxonomic groups of particular conservation concern (e.g., Agavaceae spp., Opuntia spp.). Because the GARP algorithm had proved to be ineffective in earlier studies when few records were available [e.g., 51], the decision was made to model only species with at least eight records. This excluded 56% of the species catalogued under the Mexican Red List, and 35% of the non-threatened species (Table 2). Amphibian and reptile distribution maps were validated with the best subset procedure, using half of the data to build the model and the other half to test the predictive ability of the model (i.e., data partitioning).
SDMs for mammals and plants were validated with data from a literature search and expert knowledge (i.e., informal validation). In addition, knowledge on the ecology and biogeography of the species, coarse-scale maps (ecoregions, biogeographic realms), and auxiliary datasets held by experts (e.g., sightings and field specimens not included in the datasets used for modeling) were used to trim SDMs in order to eliminate, at least to some extent, possible model over-predictions and improve estimates of the actual species distributions. This highlights the limitations of SDM for conservation of endangered species, as sufficient data for effective application of SDM are only available for a small minority of species. Consequently, those species that are most likely to require conservation action, namely those that are the rarest or most threatened, are those for which SDM is least likely to be useful.
Table 2.
Number of species targeted for distribution modeling by CONABIO, as part of a national conservation assessment initiative between 2004 and 2006 [49]. A total of 1,843 species were initially selected for modeling, including 166 amphibians and 435 reptiles [102] (Flores-Villela, unpublished data), 336 trees, 294 agave species and 612 plant species not included in any of the previous categories (shrubs, grasses, etc., CONABIO, unpublished data). The table shows the number of species whose distributions were and were not successfully modelled using GARP [49]. Species distribution models (SDM) were not successful when there were fewer than eight records available. 71.1% of the target species (1,311) were catalogued under the Mexican Red List of Threatened Species NOM-059-SEMARNAT-2001 and were, therefore, of particular conservation concern. In brackets, percentage of the species successfully and unsuccessfully modelled under each of the three categories is given.
The main lesson that can be drawn from these case studies is that statistical modeling is not effective when few data points are available. SDMs fail to produce reliable predictions when distribution data are very limited and, in such cases, predictions must rely mostly on subjective judgment. We believe that the problems encountered in these two case studies are common to many tropical taxa and regions, as explored in the following section.
Problems of applying SDM approaches in tropical regions
Data shortage
The most important problem that species distribution modellers in tropical regions often have to face is the small number of available data points. Lack of information about the distribution of organisms, what has been referred to as the “Wallacean shortfall” [21], is widely recognised to be a major constraint to conservation planning in the tropics [22, 52, 53, 54]. In addition, it represents a problem to SDM approaches. Previous studies have shown that a sample size lower than around 70 observations decreases the performance of SDMs [55, 56, 57]. Drake et al. [29] studied how model performance depends on the sample size of the training dataset, and concluded that at least 40 observations were necessary to obtain consistent models using support vector machine-based methods (see Appendix 1). The GARP algorithm was reported to consistently under-predict the distribution of mammal species in poorly surveyed regions of west-central Guyana [51].
To analyze the characteristics of the data typically available for fitting SDMs in the tropics, we downloaded presence-only data (hereafter referred to as “records”) of all known tropical tree species in Central America from the VAST database of the Missouri Botanical Garden (MOBOT, http://mobot.mobot.org/W3T/Search/vast.html). We excluded records without geographical coordinates, or with coordinates derived from political districts. This provided a list of 3,359 species with a total of 135,241 records. We found that 8% of the species were represented by a single record, 21% had fewer than five records, and 50% had fewer than 17 records. If a limit of 40 known occurrence points is considered to be the minimum for rigorous modeling (following Drake et al. [29]), the distribution of only 30% of the species could be effectively modelled using this dataset. In addition, 313 of these species have been categorized as at risk of extinction (CR, EN, VU) in the IUCN Red List, and are thus potentially of some conservation concern. Only 15% of these were found to have more than 40 records, while 40% had fewer than five. The most-collected species in the list were those considered to be of least conservation concern. This pattern seems to be the case in many other regions of the world [58, 59].
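The kind of summary reported above can be reproduced for any occurrence dataset with a few lines of Python. The sketch below is illustrative only: the species names and counts are invented, the function name is hypothetical, and the 40-record cutoff is treated as "at least 40", which is an assumption about how the limit is applied.

```python
def summarize_record_counts(counts, min_records=40):
    """Return the share of species in each record-count class.

    counts maps a species name to its number of georeferenced records.
    """
    n_species = len(counts)
    values = list(counts.values())
    return {
        "single_record": sum(1 for c in values if c == 1) / n_species,
        "fewer_than_five": sum(1 for c in values if c < 5) / n_species,
        # Assumption: a species is "modellable" with min_records or more.
        "modellable": sum(1 for c in values if c >= min_records) / n_species,
    }


# Invented counts for five hypothetical species.
toy_counts = {"sp_a": 1, "sp_b": 3, "sp_c": 12, "sp_d": 40, "sp_e": 150}
summary = summarize_record_counts(toy_counts)
```

Running the same summary separately on the subset of threatened species makes the mismatch between conservation need and data availability explicit.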
Data paucity has a range of causes. It can occur even when collection effort is relatively intense. Small-bodied or nocturnal species can be difficult to detect. Many species are genuinely rare, but rarity takes many different forms [60, 61]. A species can have a broad range but have low population abundances, or a narrow range, but be locally abundant. Additionally, data shortage can result from sampling bias towards certain taxa [62]. As noted in the literature review, some groups of organisms (such as birds and mammals) tend to attract greater interest from researchers than others. This bias has been well documented; Keddy [63] refers to it as “moose-goose” syndrome. The distribution of many groups of taxa, such as invertebrates, reptiles, amphibians, bryophytes, and fungi, tends to be relatively poorly documented.
Table 3.
Density of tree herbarium specimens for Central America, Great Britain, and the Netherlands. Collection data for Central America were obtained from the Missouri Botanical Garden VAST database (http://mobot.mobot.org/W3T/Search/vast.html) for a total of 3,359 tropical tree species. Collection data for Great Britain were obtained from the New Atlas of British and Irish Flora and from the National Biodiversity Network (http://www.searchnbn.net) for a total of 137 tree species. Collection data for the Netherlands were obtained from the Florbase (www.florbase.nl) database and the ‘landelijke vegetatie datbase’ (www.synbiosys.alterra.nl/lvd) for a total of 206 tree species.
In addition, collecting effort is also unevenly distributed across countries (Table 3). If we compare the spatial density of tree collection specimens for different countries of Central America we find large differences in proportional collecting effort. Based on these data, the biodiversity of large countries appears to receive proportionally less research attention than that of smaller countries. In the Neotropics, large countries also tend to attract proportionally fewer visiting researchers per unit area. Tropical Mexico, Guatemala, and Honduras have low data density (4.2, 6.0, and 7.6 records per 100 km², respectively), while Costa Rica and Belize both have relatively high data densities of up to 67.3 records per 100 km² (Table 3). Yet all these data densities are low when compared with those of well-sampled countries such as Great Britain and Ireland or the Netherlands, where even with fewer tree species (137 and 206 species, respectively), much higher data densities have been recorded (187 and 3,317 records per 100 km², respectively) (Table 3).
There is also some evidence that the collecting effort for tropical taxa is declining, at least for some groups. The Mexican butterfly database described by Llorente et al. [64] contained 36,685 records collected between 1900 and 1990. When analyzing the utility of this database for conservation, Soberón et al. [65] found an abrupt increase in collecting effort in the 1970s and 1980s. Collecting effort in the 1990s, however, decreased to levels similar to the average between 1910 and 1950. We found similar trends for (1) tropical tree species in Central American countries (Fig. 2), with collecting efforts mostly peaking between 1980 and 1990 and decreasing progressively from 1990 onwards; and (2) plants (all phyla) in Brazil, Thailand, and Madagascar (Fig. 3), with collecting effort dropping after peaking in the 1980s, 1990s, and 1950s, respectively.
Data quality issues
Guisan et al. [66] clearly demonstrated the importance of data quality for model performance. Based on a high-quality database they compared several modeling techniques for predictive accuracy and sensitivity to, among others, location error, changes in map resolution, and sample size. They found that sample size and location error affected model performance in particular. Ideally, in order to model species distributions, sampling effort should be uniform across the species' range, so that all recorded variations in distribution patterns are real and not the result of variation in sampling effort [67, 68]. However, systematic surveys of large areas are rare, and therefore models focusing on large-scale patterns of species distribution often rely on incomplete and geographically biased information [68, 69, 70, 71, 72, 73]. This is particularly true for models based on specimen collections in herbaria and natural history museums [74]. Collection data are inherently biased in many respects [73, 75, 76]; therefore, models based on such data may lead to inaccurate predictions.
Geographic bias can come in many different forms, though bias owing to accessibility and a focus on priority areas are probably the most important [68]. The existence of roadside bias (the so-called “highway effect”) in survey and collection data has often been emphasized [3, 62, 77, 78, 79, 80], but less frequently quantified [65, 68, 73, 81, 82]. A similar sampling bias has been observed along rivers [68] and near cities [68, 81]. The effect of nature reserves or priority areas on biological record collection intensities is potentially complex. A common pattern is for such areas to receive attention by collectors prior to their declaration as reserves, followed by a decline in collecting due to restrictions imposed when reserve status is granted. We could not detect, for example, any difference in a historically pooled sample of collections within and outside nature reserves in Mexico for species such as the deer mouse (Peromyscus sp.), or various endangered agave species (Agave spp., M. Kolb, unpublished data). Likewise, Freitag et al. [70] detected no bias of small mammal survey records towards nature reserves in South Africa. However, they found that large mammal data had been mostly collected within existing conservation areas. Similarly, Reddy and Dávalos [68] reported a bias of passerine bird samples towards areas now designated as conservation priorities in sub-Saharan Africa.
The degree to which geographical bias affects the performance of distribution models has rarely been explored, but may be case specific. According to Kadmon et al. [73], a negative effect of roadside bias on predictive accuracy of bioclimatic models must follow from two necessary conditions: (1) climatic bias should affect the accuracy of model predictions; and (2) the road network should be biased climatically. Both conditions are met in tropical mountain regions where roads can be at low elevations and vegetation patterns are linked to altitudinally determined climatic gradients (D. Golicher and L. Cayuela, unpublished data). But even in apparently well surveyed regions and groups, such biases still produce inaccurate geographical model representations [83, 84], because the process of discovery of species distribution has occurred in a climatically or spatially structured fashion [85]. Stockwell and Peterson [86] suggest methods to correct this bias. However, these are difficult to implement when there are limited available data.
There is a well-known general effect of geographical sampling bias in the context of SDM. When presence-only data are used, pseudo-absences or background absences (hereafter both referred to as pseudo-absences) are often used in order to fit models. Procedures for this are frequently integrated within the SDM software programs used by researchers, as is the case with GARP [50] or Maxent [48]. However, model users often do not explicitly investigate the properties of pseudo-absences or the impact of using pseudo-absences on overall model results. This is of paramount importance since, if no reliable absence data are available, the method of pseudo-absence selection strongly conditions the obtained model, generating different model predictions in the gradient between potential and realized distributions [87]. In addition, if a large area is being modelled, pseudo-absences may be taken from well beyond the species' actual distribution limit. This can provide over-optimistic evaluation of a model's predictive ability from inspection of ROC curves [88]. This has been referred to as the “naughty noughts” effect [89]. It is not easily avoided if the data available to suggest credible bounds of a species distribution are the same data that are later used in the SDM.
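A toy calculation illustrates how pseudo-absences drawn from far outside a species' range can inflate the area under the ROC curve (AUC). The Python sketch below computes AUC as the probability that a randomly chosen presence scores higher than a randomly chosen absence (the Mann-Whitney formulation). All scores are invented for illustration and the function name is hypothetical.

```python
def auc(presence_scores, absence_scores):
    """AUC as the share of presence/absence pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence_scores) * len(absence_scores))


# Invented model scores at known presences.
presences = [0.6, 0.7, 0.5, 0.8]
# Invented scores at pseudo-absences drawn near the species' range.
nearby_absences = [0.55, 0.65, 0.4, 0.75]
# Invented scores at pseudo-absences drawn from distant, clearly unsuitable areas.
distant_absences = [0.05, 0.1, 0.02, 0.08]

auc_local = auc(presences, nearby_absences)                        # 0.625
auc_inflated = auc(presences, nearby_absences + distant_absences)  # 0.8125
```

The model's ability to separate presences from nearby absences is unchanged between the two calculations; only the trivially easy distant points raise the AUC.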
In addition to geographic biases, there are frequently errors in the geographical coordinates of specimens and data collections. Before accurate GPS technology became available, specimen collectors used a variety of ad hoc descriptive protocols to record the localities where collections were made. These textual descriptions were then converted into geographic coordinates using available cartography. Records made before the mid-1990s are therefore inherently imprecise. Where place names are ambiguous, geographical errors may be considerable. Species distribution databases rarely include an explicit measurement of geographic precision. However, the degree of precision can be inferred by examining the last digits of the coordinates. We found that 90% of data points from MOBOT had apparently been rounded to the nearest arcminute and 8% to the nearest degree. Small errors have relatively minor consequences when data are used in the context of a traditional distribution atlas. However, the effect of even small positional errors on modern statistical distribution models is potentially more severe. Environmental variables are supplied to modeling algorithms by overlaying the points on interpolated raster coverages. In mountainous tropical terrain, temperature and precipitation vary strongly along steep elevational gradients. Small positional errors can thus result in markedly different climate parameters being associated with a collection point. This is likely to produce poor or misleading models [66]. The severity of this effect is a function of the size of the error and the specific topography of the region.
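Inferring precision from the last digits of the coordinates can be sketched as follows. This is an illustrative reconstruction, not the procedure actually applied to the MOBOT data. It assumes coordinates stored as decimal degrees, the function names and tolerance are hypothetical, and the example records are invented. A record is classed as rounded to the degree or the arcminute only if both latitude and longitude fall on that grid.

```python
from collections import Counter


def is_whole(value, tol=1e-6):
    """True if value equals an integer to within tol."""
    return abs(value - round(value)) < tol


def apparent_precision(lat, lon):
    """Classify a decimal-degree record by the coarsest grid it sits on.

    The result is only an apparent precision: an accurately recorded point
    can fall on a whole arcminute by chance.
    """
    if is_whole(lat) and is_whole(lon):
        return "degree"
    if is_whole(lat * 60) and is_whole(lon * 60):
        return "arcminute"
    return "finer"


# Invented example records (latitude, longitude) in decimal degrees.
records = [
    (17.0, -92.0),                     # whole degrees
    (16 + 44 / 60, -(92 + 38 / 60)),   # 16 deg 44 min N, 92 deg 38 min W
    (15.5, -90.25),                    # 15 deg 30 min N, 90 deg 15 min W
    (16.73456, -92.63871),             # finer than one arcminute
]

counts = Counter(apparent_precision(lat, lon) for lat, lon in records)
shares = {label: n / len(records) for label, n in counts.items()}
print(shares)
```

A coordinate rounded to the nearest arcminute can be displaced by up to half an arcminute, which is a little under one kilometre of latitude, so on steep terrain the overlaid climate values may belong to a quite different elevation.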
A final problem related to data quality is the risk of species misidentification. This is particularly problematic when collating information from different sources, as different datasets may have been generated with different taxonomic concepts [18]. Merging several data sources into one homogeneous dataset is an enormous task that usually takes far longer than the analysis of the data themselves [7].
Implications for conservation
Strengthening the applied role of SDM in tropical conservation
Being empirical, SDMs are explicitly data driven. The accuracy of model predictions depends critically on the quality and quantity of data. Biological databases are, by nature, incomplete and have heterogeneous spatial coverage [68, 69, 70, 71, 72, 73, 75, 76]. This has led to the development of techniques aimed at overcoming the analytical challenges posed by incomplete spatial coverage. However, an inevitable tension arises when data-driven models are applied to conservation problems. On the one hand, the fundamental motivation behind the modeling exercise may be to fill data gaps by suggesting present or future species distributions that have not actually been observed. On the other hand, the need for scientifically rigorous tests of model predictions places demands that frequently cannot be met by the limited data available for tropical species of conservation concern.
Logistical difficulties and lack of resources remain a major barrier to data collection in the tropics [54]. At the same time, much invaluable data that have already been collected remain unavailable for modeling owing to unstructured data management. The successful collation of systematic records in relatively well-studied temperate regions can provide a positive role model for strengthening tropical data resources [90]. For example, botanical records in Great Britain and Ireland have been well organized since 1954, when the Botanical Society of the British Isles (BSBI) systematically divided Great Britain and Ireland into 3,500 10-km squares to aid surveying. The BSBI encouraged volunteer recorders to join its network, which was fundamental to the success of the project. By the following year, the BSBI network comprised 1,500 recorders who contributed to the field survey [91]. As a result of such approaches, the UK National Biodiversity Network (NBN) now provides excellent on-line access to detailed wildlife information at the national scale (available at http://www.searchnbn.net). Datasets have been contributed by over 50 distinct organizations and hold over 18 million records. A similar monitoring network is present in the Netherlands (Florbase), with more than 20 million records collected from the early 20th century onwards.
The British and Dutch models rely on strengthening available formal collection data by drawing on qualified volunteer recorders. This bottom-up approach is also being promoted in some parts of the tropics, for example through the BERDS database for Belize (http://www.biodiversity.bz/). The key feature of the Belizean initiative is the use of a spatially explicit relational database as the main tool for data storage, display, and analysis. Other examples that show potential ways forward for effective collaboration and sharing of plot data are the RAINFOR initiative [92] and the Amazon Plot Network [93].
Top-down initiatives can also strengthen the data available for SDMs, although they provide no new data by themselves. Because volunteer work is likely to be hindered by the high diversity of organisms present in most tropical ecosystems, top-down approaches may work better in the tropics than bottom-up ones. One such top-down initiative is the Global Biodiversity Information Facility (GBIF) [94]. GBIF has collated millions of data entries from natural history collections, library materials, and databases, though this information shows many of the taxonomic and geographic gaps and biases mentioned above [94]. Most of the information currently available still refers to developed countries. For example, Spain has 97,295 registered entries for plants, whereas Madagascar, India, and the Philippines have only 6,497, 1,544, and 2,639 available data entries, respectively (data referring to March 2008, accessed through www.gbif.org). Another approach is the Conservation Commons (http://www.biodiversity.org/), which has identified a set of principles to promote sharing of biodiversity data, information, and knowledge to facilitate the conservation and sustainable use of biodiversity. These principles encourage organizations and individuals alike to ensure open access to data, information, expertise, and knowledge related to the conservation of biodiversity. However, application of these principles has been limited to date.
Such ongoing initiatives to improve data quality and quantity should support the future application of SDMs. Errors in geographic coordinates or taxonomic determination often become evident when SDMs are fitted to data. Although data cleaning is an unrewarding task for research scientists, the development of efficient methods for flagging and correcting dubious records could provide a clear applied role for SDMs [95]. Those involved in the design of data portals and the structure of databases should take into account the need for SDM software to connect directly to database servers as a client in order to automate computationally intensive iterated modeling [73]. This can now be easily achieved through protocols such as Open Database Connectivity (ODBC).
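The client-server pattern described above can be sketched in a few lines. The example is a hedged illustration only: the table and column names (occurrences, species, latitude, longitude) are hypothetical, the records are invented, and Python's built-in sqlite3 module stands in for a real database server so that the sketch runs without any setup. With an ODBC driver and the third-party pyodbc package installed, the same function should work on a connection opened with pyodbc.connect, since both follow the Python DB-API 2.0 interface and use `?` placeholders.

```python
import sqlite3


def fetch_occurrences(conn, species):
    """Return (latitude, longitude) pairs for one species.

    conn may be any DB-API 2.0 connection whose driver uses '?' placeholders.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT latitude, longitude FROM occurrences "
        "WHERE species = ? ORDER BY id",
        (species,),
    )
    rows = cur.fetchall()
    cur.close()
    return [(float(lat), float(lon)) for lat, lon in rows]


# Stand-in for a remote server. Assumption: with ODBC this line would be
# replaced by something like pyodbc.connect("DSN=my_dsn"), where my_dsn is a
# hypothetical, site-specific data source name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrences ("
    "id INTEGER PRIMARY KEY, species TEXT, latitude REAL, longitude REAL)"
)
conn.executemany(
    "INSERT INTO occurrences (species, latitude, longitude) VALUES (?, ?, ?)",
    [
        ("species_a", 17.0, -92.0),   # invented records
        ("species_a", 16.5, -92.5),
        ("species_b", 15.25, -90.75),
    ],
)
conn.commit()

# Iterated modeling: pull the points for each species in turn and hand them
# to whatever model-fitting routine is in use (omitted here).
for species in ("species_a", "species_b"):
    points = fetch_occurrences(conn, species)
    print(species, len(points))
```

Keeping the query inside one function that accepts a connection means the modeling loop does not depend on which database backend is used, so a local test database can be swapped for a shared server without changing the loop.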
The resources being devoted to tropical field research are generally considered to be inadequate [54]. Many important aspects of the distribution and abundance of tropical organisms are likely to remain unknown. However, this could also reveal a positive application for SDMs. If an SDM provides poor predictions, this can be taken as a clear indication that more distribution data are required for the taxon in question. We also suggest that researchers using SDMs should become more open regarding the limitations of SDMs. This is particularly important when poor model results are clearly attributable to weak data rather than poorly constructed algorithms. Reviewers and editors should also be prepared to accept studies that rigorously document model failures as well as successes, in order to prevent the repetition of mistakes through over-optimistic expectations.
Overall, the field of SDM needs serious reflection on the conceptual basis that underlies species distribution models, as well as on the true meaning of their predictions (potential versus realized distribution) [97]. The design of future studies evaluating, comparing, and applying species distribution modeling techniques should thus be rooted in a good understanding of their conceptual background. If species distribution models are to be a common-use tool for biodiversity research and conservation assessment, the foundations of their application must be much more solid than they are now [97].
Finally, since data quality problems and data shortage appear to be very common, a pressing question is: what can be done with such biased and incomplete information? Statistical modeling is not effective when few data points are available. In such cases, input from some sort of expert judgment is inevitably required in order to evaluate which of the outputs from species distribution models are most credible. Expert knowledge is already recognized as an essential source of information for assessing the conservation status of species, given the lack of reliable quantitative information [59]. The development of tools to support the effective integration of expert knowledge with SDM approaches represents a key challenge for the future. Another approach might be to shift the focus of modeling from individual entities to collective properties of biodiversity [16, 98], such as species assemblages or communities [12, 99, 100]. Although this does not fully circumvent the problem, it might be used to indicate where rare or threatened species are likely to be found, and in association with which other species [12].
Final remarks
We consider the following steps vital to the further development of SDM within an applied conservation context in the tropics. First, to reinvigorate SDM applications, more emphasis should be given to mechanisms for improving data sharing and better structural integration of diverse data sources, using distributed databases with common standards, referential integrity, and rigorous quality control. Second, SDMs can be used for strengthening available data, as they provide useful tools for quality control. Finally, SDMs can play a role in prioritizing areas for field survey, by identifying knowledge gaps. While ongoing initiatives (e.g., GBIF) have already implemented mechanisms for, and stressed the need for, open and unrestricted access to data, information, and knowledge related to the conservation of biodiversity, none of them has yet considered how current modeling techniques could improve the applied value of such datasets. We strongly believe that SDMs have great potential to support the conservation of tropical biodiversity in the future, if their value in this context is recognized.
Acknowledgements
We are indebted to Bradford A. Hawkins, Joaquín Hortal, Jorge M. Lobo and three anonymous reviewers for their insightful comments on the manuscript. This study was financed by the Netherlands Environmental Assessment Agency (PBL) in the context of the Biodiversity International project (E/555050). LC was supported by the European project REFORLAN (INCO-CT-2006-032132) and the Andalusian Regional Government project GESBOME (P06-RNM-1890). MK was supported by CONABIO. EJMMA was supported by the Dutch Ministry of Agriculture, Nature and Food Quality (Project BO-10-003-01). This research was initiated in a working group at Managua, Nicaragua.
References
Appendix 3. References listed in Appendix 2
Appendices
Appendix 1.
Description of modeling methods and examples found in the literature review (references are provided in Appendix 3).
Appendix 2.
Methodologies and approaches used in modeling species distribution in 123 articles published from January, 1995 to May, 2007, in Biological Conservation, Conservation Biology, Diversity and Distribution, Global Ecology and Biogeography and the Journal of Applied Ecology. Full references are provided in Appendix 3.