Modeling the distribution of rare and endangered species is challenging, and there is substantial debate regarding what species distribution models (SDMs) actually represent. Here I investigated whether locations of different lowland tapir signs (feces, trails and tracks) generated different distributions of suitable habitat using a presence-only species distribution modeling technique. Comparison of the equivalence and overlap of the predicted distributions showed no significant differences between the different signs. The contribution of the 11 variables used to build the distribution models was also similar between signs. Although predictions from different signs were similar, the use of different threshold selection methods generated substantially different suitable areas and omission errors. These results highlight the importance of a fundamental understanding of species natural history to determine not only appropriate model parameters, but also the biological relevance of SDMs. My findings also support the need for healthy skepticism regarding what is represented by presence-only species distributions. To help address this skepticism I conclude by providing guidelines for generating reliable local-scale distribution models.
Introduction
Predicting the geographic distributions of species is a growing field in conservation science [1–4]. Species distribution models (SDMs) permit the analysis of a wide variety of biodiversity phenomena, including future potential distributions under scenarios of climate change, species' invasions, and priorities for biodiversity conservation [3–4]. Yet there is uncertainty about the inferences possible from novel prediction methods [3].
Modeling the distribution of rare and endangered species is challenging not only because acquiring robust empirical field data is often prohibitively expensive (in terms of time and money), but also because many techniques available for modeling species distributions are not appropriate for data that are typically sparse and clustered [3–4]. Novel models can generate accurate and informative predictions from presence-only locations for a variety of faunal and floral species [1–2,5–6], but studies also highlight that model predictions are sensitive to a number of analytic and sampling biases [3,7].
Tapirs (Tapirus spp.) characterize the challenges of modeling species distributions in the tropics. Due to tapirs' relatively low densities and secretive nature, indirect signs (such as feces, tracks and trails) have been frequently used to estimate the distribution and abundance of tapir species in numerous tropical biomes [8–13]. Often a combination of different sign types is used for generating tapir distribution models (e.g. [8,14]). Yet it is unclear what can be inferred from the use of these different indirect signs when modeling the species distribution. For example, when predicting distributions is it appropriate to model different signs together (to increase sample size and analytic power)? Do different signs generate different niches and therefore different distribution maps? Answering such questions is vital for understanding whether the inferences made from the predicted distributions are robust and reliable for the study species [3].
The objective of this study was to evaluate whether different lowland tapir (Tapirus terrestris) signs (Feces, Tracks and Trails) generated different distributions of suitable habitat from ecological niches modeled using presence-only data. Specifically, as fecal samples occur where lowland tapir have walked, I predicted that the distribution from Feces locations should represent a subset of that obtained from Tracks & Trails. To test this prediction, I used a maximum-entropy algorithm (MaxEnt [2]) to compare the distribution of suitable habitat derived from locations of Tapirus terrestris in a protected area of the Brazilian Atlantic Forest.
Methods
Study area
Surveys took place in Núcleo Caraguatatuba (hereafter Caragua). Caragua is a ≈49,953 ha administrative unit of the Serra-do-Mar State Park (Fig. 1, [15]), which protects ≈315,390 ha of Atlantic Forest in the Brazilian State of São Paulo. The Serra-do-Mar State Park is located along the pre-Cambrian Serra do Mar mountain chain [16]. Caragua is located in the center of the coastal tourist region of São Paulo, and receives approximately 5,000 visitors annually [15]. Caragua is bisected by the Tamoios road, a state highway that leads to the town of Caraguatatuba (45° 25′ 57″ W and 23° 35′ 52″ S). The western portion of Caragua is also traversed by one of the main pipelines of the Brazilian petroleum company “Petrobras.” The poorly monitored access provided by the Tamoios highway and the pipeline are the two principal vectors of anthropogenic pressure (including illegal hunting and palm-heart harvesting) in Caragua ([15], p.119–143).
The regional climate is subtropical, with a mean annual temperature of 23.2 °C (daily means ranging from 4.6 to 36.1 °C, data from 2010 downloaded from the Brazilian weather center http://www.cptec.inpe.br/, station ID: 83671, Lat −21.98, Long: −47.35, masl = 598), and annual rainfall from 1,400 to 4,000 mm [16]. Forests range from coastal (≈20 m) to elevations >900 m, with stark floristic gradients from shrubs to well-developed montane forest [15,17].
Tapir locations
Between April and November 2011 (for a total of 61 days), line-transect surveys were used to sample the distribution of T. terrestris across Caragua. Diurnal surveys were conducted by two observers, who recorded all indirect signs (feces, tracks and trails) encountered. The locations of these signs were recorded using a GPS (Garmin 60x, horizontal error <9m). Further details of the study area and survey methodology are presented in [18].
To provide a representative sample, pre-existing trails were walked by two observers (n=14, total trail length = 68.8 km, length range = 2.1 – 15.7 km). Trails had been established for at least five years prior to surveys and were distributed throughout the area (Fig. 1), encompassing both the full altitudinal range (20 – 874 masl) and the variety of secondary and primary forest habitats found within the survey area. As rainfall occurs throughout the year in the study region I assume that climatic conditions did not influence sign detectability. As the survey effort was distributed evenly between the austral winter (May to July) and summer months (September to November), I also assume that surveys accounted for any possible seasonal variation in T. terrestris distribution.
Environmental data
I considered 11 environmental and anthropogenic predictors (Appendix 1) that (i) based on previous studies [10,19–22] could be important determinants of T. terrestris distribution within the Atlantic Forest study area, and (ii) for which the pairwise correlations were less than 0.85 [1] (Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables [23]). All variable layers were resampled using a common origin to a 1 km² cell size and projected to the same coordinate system (SAD69, UTM zone 23S). Following [24], this cell size was chosen based on a combination of: (i) the question being asked (i.e., conservation / management requirements of the relatively large study area); (ii) my knowledge of the spatial response (i.e., a species that ranges widely across a variety of habitats [10,20–21,25]); and (iii) the spatial properties of the available occurrence data (Appendix 2). All GIS processing was carried out using SAGA GIS (http://www.saga-gis.org/en/index.html) and QGIS (http://www.qgis.org/en/site/).
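The correlation screening described above can be illustrated with a minimal sketch. It covers only the Pearson case for numeric predictors (the polyserial and polychoric correlations used for ordinal variables are not shown), and the function name `correlated_pairs`, the predictor names and the simulated cell values are hypothetical, not the layers listed in Appendix 1.

```python
import numpy as np

def correlated_pairs(X, names, limit=0.85):
    """Return (name_a, name_b, r) for predictor pairs with |Pearson r| >= limit."""
    r = np.corrcoef(X, rowvar=False)  # columns of X are predictors
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= limit:
                flagged.append((names[i], names[j], float(r[i, j])))
    return flagged

# Simulated (hypothetical) values for 500 grid cells:
# 'slope' is built to be nearly collinear with 'altitude',
# 'dist_road' is independent of both.
rng = np.random.default_rng(0)
altitude = rng.uniform(20, 900, 500)
slope = 0.05 * altitude + rng.normal(0, 1, 500)
dist_road = rng.uniform(0, 10000, 500)

X = np.column_stack([altitude, slope, dist_road])
flagged = correlated_pairs(X, ["altitude", "slope", "dist_road"])
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.2f} (one of the pair would be dropped)")
```

In a workflow of this kind, one variable from each flagged pair would be removed before model fitting; which one to keep is a biological judgment rather than something the code decides.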
Comparison of predicted distributions
The distribution of different signs (Feces, Tracks & Trails and All) was modeled using a maximum-entropy approach (MaxEnt version 3.3.3k, download URL: http://www.cs.princeton.edu/~schapire/maxent/; [2,26]). Although a number of different modeling approaches are available for presence-only data [1], I selected MaxEnt as it has been shown to perform relatively well compared to alternative approaches for modeling species such as T. terrestris that are widely distributed and represented by a low (5–21) to moderate (38–94) number of presence locations [1,6]. Full details of the MaxEnt modeling are provided in Appendix 3. To facilitate reproduction and validation, the complete MaxEnt output files can be obtained from the corresponding author or downloaded from: http://sdrv.ms/1dNPVhf. To ensure that modeled differences between signs were not biased by pre-existing differences in the spatial distribution of locations, I compared the spatial scale, intensity and autocorrelation of the different sign locations using diagnostic functions available in the R [27] package “spatstat” [28] prior to MaxEnt modeling (Appendix 2).
The logistic output of MaxEnt generates a map with values ranging from 0 to 1. I interpreted this map as representing the distribution of suitable habitat (i.e. a habitat suitability index (HSI), Appendix 3). To compare the predicted distribution from the different types of sign, I examined equivalence and overlap [29] of suitable habitat using functions available in ENMTools [30] and/or the R packages SDMTools [31] and dismo [32]. To examine the relative ranking of variables used to generate model predictions and their importance in the models, I compared the ranked order of variable contributions (standard MaxEnt output) using the Chi-squared test.
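For readers unfamiliar with the D and I similarity metrics, the sketch below shows one commonly used formulation: Schoener's D and a Hellinger-distance-based I, both computed on suitability maps rescaled to sum to 1. It is an illustration under those assumptions, not the ENMTools, SDMTools or dismo code used in this study; the function name `niche_overlap` and the toy arrays are hypothetical.

```python
import numpy as np

def niche_overlap(map_a, map_b):
    """Return (D, I) overlap between two suitability maps on the same grid.

    D = 1 - 0.5 * sum(|pa - pb|)
    I = 1 - 0.5 * sum((sqrt(pa) - sqrt(pb))**2)
    where pa and pb are the maps rescaled to sum to 1.
    Both metrics range from 0 (no overlap) to 1 (identical maps).
    """
    a = np.asarray(map_a, dtype=float).ravel()
    b = np.asarray(map_b, dtype=float).ravel()
    keep = ~(np.isnan(a) | np.isnan(b))  # ignore cells missing in either map
    pa = a[keep] / a[keep].sum()
    pb = b[keep] / b[keep].sum()
    d = 1.0 - 0.5 * np.abs(pa - pb).sum()
    i = 1.0 - 0.5 * ((np.sqrt(pa) - np.sqrt(pb)) ** 2).sum()
    return float(d), float(i)

# Toy maps (hypothetical values, not study output)
feces = np.array([[0.1, 0.4], [0.8, np.nan]])
tracks = np.array([[0.2, 0.4], [0.7, 0.9]])
d, i = niche_overlap(feces, tracks)
print(f"D = {d:.3f}, I = {i:.3f}")
```

The equivalence test of [29] then compares the observed values with a null distribution obtained by repeatedly pooling and randomly reassigning the presence locations; that randomization step is not shown here.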
Finally, I compared the area of suitable habitat obtained for each sign using seven of the threshold selection methods available in MaxEnt: Minimum training presence, Fixed cumulative value 1, Fixed cumulative value 5, Fixed cumulative value 10, 10 percentile training presence, Equal training sensitivity and specificity, and Maximum training sensitivity plus specificity. For each selection method, I used the mean logistic threshold value from the 50 runs and calculated the omission error (proportion of all locations with values below the threshold) and the proportion of the prediction area classified as “absent” (i.e. unsuitable areas below the threshold) for each sign. As the thresholds were used to compare the final mean prediction maps considering all locations, the results should not be compared with those reported by MaxEnt.
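The two quantities calculated for each threshold can be sketched as follows. This is a minimal illustration of the definitions given above (omission error as the proportion of presence locations below the threshold, unsuitable area as the proportion of prediction cells below the threshold); the function name `threshold_summary` and all values are hypothetical rather than taken from the MaxEnt output.

```python
import numpy as np

def threshold_summary(suitability_map, presence_values, threshold):
    """Return (omission error, proportion of unsuitable area) for one threshold."""
    cells = np.asarray(suitability_map, dtype=float).ravel()
    cells = cells[~np.isnan(cells)]  # drop cells outside the prediction area
    presence = np.asarray(presence_values, dtype=float)
    omission = float(np.mean(presence < threshold))  # presences classed as absent
    unsuitable = float(np.mean(cells < threshold))   # cells classed as unsuitable
    return omission, unsuitable

# Hypothetical logistic values for 10 map cells and 4 presence locations
cells = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
presence = [0.30, 0.50, 0.70, 0.90]

# A "minimum training presence" style threshold: the lowest presence value
low = threshold_summary(cells, presence, min(presence))
# An arbitrary stricter threshold, for contrast
high = threshold_summary(cells, presence, 0.60)
print("lenient threshold:", low)
print("strict threshold:", high)
```

With these toy values the lenient threshold omits no presence locations, whereas the stricter one omits half of them and classifies a larger share of the map as unsuitable, which is the trade-off examined in the Results.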
Results
From 354.5 survey km a total of 141 T. terrestris presence locations were obtained (Table 1) from Tracks & Trails (63) and Feces (78). Excluding duplicates within the same 1 km² pixel reduced the number of locations to 80, 50, and 39 (All, Tracks & Trails, and Feces respectively). Although spatial diagnostics showed that the spatial scale, intensity and autocorrelation of Tracks & Trails and Feces were similar (Appendix 2), examination of nearest neighbor distances (within a radius of 3 km) showed that locations of feces tended to be more clustered (mean distances = 81.2, 115.4 m, Feces and Tracks & Trails respectively, Mann-Whitney test, P<0.0001). When duplicates within the same 1 km² cell were excluded, there was no significant difference between mean nearest neighbor distances (mean distances within a 3 km radius = 1,233.4, 1,158.7 m, Feces and Tracks & Trails respectively, Mann-Whitney test, P=0.466).
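The exclusion of duplicate locations within the same grid cell can be sketched as below. This is an illustration only: it assumes projected coordinates in metres (such as UTM) and a grid whose origin is at (0, 0), whereas in practice the cells would be those of the environmental rasters; the function name `one_per_cell` and the coordinates are hypothetical.

```python
import numpy as np

def one_per_cell(x, y, cell_size=1000.0):
    """Keep the first location recorded in each square grid cell.

    x, y: projected coordinates in metres.
    cell_size: side length of a cell in metres (1000 m gives 1 km2 cells).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cols = np.floor(x / cell_size).astype(int)
    rows = np.floor(y / cell_size).astype(int)
    cells = np.column_stack([cols, rows])
    # Index of the first occurrence of each distinct (col, row) pair
    _, first = np.unique(cells, axis=0, return_index=True)
    keep = np.sort(first)  # preserve the original recording order
    return x[keep], y[keep]

# Five hypothetical sign locations; the first two fall in the same cell
x = [100.0, 900.0, 1500.0, 1600.0, 3200.0]
y = [100.0, 200.0, 100.0, 2500.0, 100.0]
kx, ky = one_per_cell(x, y)
print(len(x), "locations reduced to", len(kx))
```

Thinning to one record per cell removes the fine-scale clustering that otherwise distinguishes sign types, which is consistent with the nearest neighbor comparison reported above.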
Table 1.
MaxEnt model summary. Summary of MaxEnt models used to predict the distribution of Tapirus terrestris within an Atlantic forest protected area using presence locations from different sign types.
Table 2.
Overlap and equivalence of predicted distributions. Comparison of observed overlap (a) and equivalence (b) between the distributions from different sign, using both I (lower diagonal) and D (upper diagonal) similarity metrics. Significance of equivalence tested by the randomization test of [29], where a significant value denotes a pair of sign that are ecologically distinct (ns = not significant i.e. ecologically similar).
MaxEnt model predictions appeared to be accurate, with mean test data AUC > 0.90 for the three types of locations (Table 2), which is a good score for the model validation [26]. Not only were AUC means similar but the standard deviation of AUC values was also similar among the three types of sign (Table 2), which suggests that differences in sample size did not influence the appropriateness of the MaxEnt models.
The predicted distribution of suitable habitat for T. terrestris was not homogeneous in the study area (Fig. 2). Visual comparison suggested that the distribution of suitable habitat was similar in the maps generated from the three sign types (Fig. 2). This visual assessment was confirmed by the similarity in the frequency distributions (Fig. 2, Kolmogorov-Smirnov, P>0.537 for all three pairwise comparisons) and strong correlations between the mapped habitat suitability values derived from the three sign types (Pearson's correlation r>0.85, P<0.0001 for all three pairwise comparisons). This descriptive analysis of MaxEnt predictions was also supported by hypothesis tests that showed the predicted distributions overlapped and were ecologically equivalent (Table 2).
Pearson's chi-squared test showed that the type of sign did not significantly influence the observed importance (ranked contribution) of the different variables (X-squared=13.557, P=0.921). Although the mean contribution of some variables differed slightly among sign types, variable importance was generally consistent among the three sign types (Fig. 3). Generally there was a clear separation with only five variables showing important contributions: distance to park border, distance to road, distance to river, altitude, and proportion of forest within a 5km radius (summed variable contribution % = 94.3, 85.9, 92.5, for All, Feces, and Tracks & Trails respectively). The other six variables contributed little on average (<5% each, Fig. 3).
Using different threshold methods resulted in substantial differences in omission errors and unsuitable areas (Fig. 4), yet results from the three different sign types showed only small, non-significant differences between the average threshold values (Kruskal-Wallis Rank Sum Test, P>0.1, for the comparison of threshold, omission and unsuitable area values, Fig. 4). Only Minimum training presence and Fixed cumulative value 1 had zero omission error, with all presence locations correctly retained at values above the threshold. On the other hand, Maximum training sensitivity plus specificity and Equal training sensitivity and specificity generated the most omission errors (between 15 and 21%). The different threshold selection methods also resulted in substantial differences in the predicted distributions, with the unsuitable area ranging from 18 to 85% of the prediction area depending on the method adopted (Fig. 4). Again, the greatest contrast was between values from Minimum training presence and Fixed cumulative value 1 (smallest unsuitable areas) and those obtained from Maximum training sensitivity plus specificity and Equal training sensitivity and specificity (largest unsuitable areas) (Fig. 4).
Discussion
There is substantial debate about what species distribution and ecological niche models actually represent, and perhaps more importantly, how these two concepts are related [3]. My findings support the necessity of “a healthy skepticism about which components of the niche are represented by predictions from an SDM” [3].
Theoretically, the T. terrestris presence locations are “natural distribution data” representing a “realized niche” [33–34]. The habitat suitability from different signs therefore reflects the use of the available habitats by T. terrestris [33–34]. If this were true, the maps of habitat suitability for T. terrestris would be extremely informative for researchers and park managers in the study area, particularly as the Serra-do-Mar continues to be intensely threatened by anthropogenic perturbations. However, there are substantial challenges to modeling highly mobile species, which also typically (through sample bias and/or natural history) exhibit spatial autocorrelation (but see [35] for an alternative approach for generating a “realized” distribution).
What the presence locations from different signs actually represent (i.e. their ecological meaning) is undeniably contentious and perhaps irrelevant from the perspective of modeling species distributions [36]. However, what the predicted distributions actually represent does have important implications from the perspective of species management and conservation. For example, feces are an important source for genetic studies, and improving fecal sampling efficiency is an active research area in the tropics, e.g. for Neotropical carnivores [37] and deer [38]. Predicting the likely location of fecal samples would improve sampling efficiency and reduce costs. However, the similarity found between the predicted distributions of Feces and Tracks & Trails suggests that the approach adopted, which evaluates correlative rather than mechanistic relationships [2–3], is not adequate for such purposes. In other words, the maps derived from these signs may show a substantial portion of the confirmed distribution of the species, but are not necessarily suitable for establishing ecological associations.
From another perspective, the similarity between the distributions of the different sign types can be useful for conservation and management activities. The similarity of the predicted distributions means that it should be possible to combine different signs to increase the statistical robustness of T. terrestris distribution models. For example, increasing the sample size enables the adoption of additional modeling parameters/features and improved predictions [6–7,39–40]. Additionally, this similarity suggests that results from studies using different combinations of feces and/or tracks and trails should also be comparable. However, the findings also highlight that the ability to compare results will strongly depend on the model parameters and threshold selection methods applied.
Threshold selection is one of the many possible biases in species distribution modeling [2,7,40–41], yet few studies have evaluated the influence of threshold selection for presence-only data. In a recent study, Liu et al. [42] evaluated the suitability of 13 threshold selection methods for presence-only data using simulated species. These authors found Maximum training sensitivity plus specificity to be a promising selection method for presence-only data. This result contrasts with my findings (using a “real” species to represent the challenges typical of tropical species) that Maximum training sensitivity plus specificity resulted in both the greatest omission error and increased loss of suitable areas among all the predicted distributions of T. terrestris.
The determination of thresholds should not be arbitrary and should consider the relative importance of omission and commission errors [6,41–42]. For the case of T. terrestris in Caragua, reducing omission error is the most important determinant of threshold selection method, because this wide-ranging and long-lived species is likely to find suitable conditions throughout the prediction area. In the present study, both sample sizes and spatial arrangement were similar and MaxEnt model parameters standardized such that the differences in suitable areas and omission rates can be directly attributable to the threshold selection method adopted. But which threshold should be chosen? Although T. terrestris are rare, the natural history of this charismatic and widespread species is well studied [20–21]. T. terrestris occur in dry forests (e.g. Bolivian tropical and subtropical dry Chaco forests [20,43]) to tropical regions, and lowland to upland habitats [20]. Therefore, based on such knowledge, it is possible to conclude that the threshold selection methods which resulted in lower threshold values, i.e. with a wider distribution of suitable habitat and close to zero omission error (Minimum training presence or Fixed cumulative value 1) should be the most appropriate to identify suitable and unsuitable areas for T. terrestris in the Serra do Mar protected area.
Considering the threshold values from the Minimum training presence and Fixed cumulative value 1, an average of 23% of the prediction area (equivalent to 18,900 ha) is unsuitable for T. terrestris. For all predicted distributions, the most important variables were related to anthropogenic access (Distance to park border and Distance to road). These results are unsurprising, given the location of Caragua and the impact of intense anthropogenic pressures (both current and historic). Previous studies have shown the importance of anthropogenic (distance to park border and distance to road) and environmental variables (altitude and forest cover) as determinants of the distributions of other endangered mammals in the region, such as white-lipped peccary (Tayassu pecari, [35]) and buffy-tufted-ear marmosets (Callithrix aurita, [44]). Whilst anthropogenic factors are obviously important determinants of mammal distributions in remaining Atlantic Forest areas, the results from T. terrestris highlight limitations of correlative distribution models.
Correlative distribution models provide a simple output (distribution map) that indirectly represents many different processes [45–46]. Should the distribution change over time, for example, in response to management actions within the protected area, we are left with an uncomfortable, nagging uncertainty about what has actually changed (individual behavior, population demographics, etc.). Recent studies have started to explore how to develop mechanistic distribution models that incorporate aspects such as physiology [46] and population demographics [47]. However, such approaches are still developmental [45] and the use of such models requires additional data that are not readily available for the majority of tropical species.
Implications for conservation
Modeling presence locations to represent broad scale species distributions undoubtedly provides robust and reliable inferences, but my findings highlight some of the challenges for local scale predictions [3]. With appropriate sampling, model building, and threshold selection, it is possible to gain an understanding of local scale distributions from presence-only locations, yet what these distributions actually represent remains unclear [1–2,7,36].
Theoretical [3,36] and statistical [48] advances in species distribution modeling must be accompanied by the collection of more detailed field data. For example, if we knew that individual T. terrestris mark their range limits with fecal samples, the inferences possible from the models would be entirely different from the case that fecal samples represent areas that are more commonly used by individuals. While previous reviews highlight the importance of integrating theory [3], improving modeling methods [36] and improving data quality/availability [4], my findings suggest that integrating complementary data regarding species natural history (understanding the how and why of species distributions) is vital to generate meaningful conservation insight from local scale presence-only species distribution models.
Finally, findings from the analysis conducted and from previous studies [3,24,35,49] enable me to present some practical guidelines for generating reliable local scale distribution models that should be useful for conservation practitioners:
Spatial resolution must be established prior to distribution modeling, based on (i) the question being addressed, (ii) knowledge of the spatial response and (iii) spatial properties of the available occurrence data.
Spatial autocorrelation should be quantified and, if necessary, explicitly addressed within the modeling workflow.
Ensure surveys are conducted in a representative sample of the environmental gradients in the survey area. Prediction outside of the sampled gradients is not recommended.
Potential biases caused by the integration of different sources and types of occurrence data should be evaluated as part of the modeling workflow.
Environmental data must be selected based on (i) previous studies, (ii) availability at the required spatial resolution, and (iii) accessibility (freely available data sources that allow reproduction should be preferred).
The selection of modeling algorithm (or algorithms in the case of ensemble models) should be based on (i) study objectives, (ii) species natural history and (iii) spatial properties of the available occurrence data.
Because modeling algorithms are increasingly complex, model inputs and outputs must be available for independent peer review and validation by experts.
Acknowledgements
I am deeply indebted to José F. Moreira-Ramírez for his tireless help and dedication during field work. I thank Milton Ribeiro who assisted with the GIS processing and together with participants of the “Ecologia de Paisagem” course inspired much of the analysis. I also thank Mauro Galetti, Alexandra Sanches, Rafael Loyola, Jose Alexandre F. Diniz-Filho and two anonymous reviewers for comments that improved a previous version. Financial support came from a Rufford Small Grant for Nature Conservation and postgraduate scholarships from CNPq (159806/2012-7 and 164999/2013-2). I also thank UNESP (Rio Claro) for logistical support and the Instituto Florestal de São Paulo for permission to conduct research (COTEC-SMA: 260108014.661/010).