We used conventional and finite mixture removal models with and without time-varying covariates to evaluate availability given presence for 152 bird species using data from point counts in boreal North America. We found that the choice of model had an impact on the estimability of unknown model parameters and affected the bias and variance of corrected counts. Finite mixture models provided better fit than conventional removal models and better adjusted for count duration. However, reliably estimating parameters and minimizing variance using mixture models required at least 200–1,000 detections. Mixture models with time-varying proportions of infrequent singers were best supported across species, indicating that accounting for date- and time-related heterogeneity is important when combining data across studies over large spatial scales, multiple sampling time frames, or variable survey protocols. Our flexible and continuous time-removal modeling framework can be used to account for such heterogeneity through the incorporation of easily obtainable covariates, such as methods, date, time, and location. Accounting for availability bias in bird surveys allows for better integration of disparate studies at large spatial scales and better adjustment of local, regional, and continental population size estimates.

## INTRODUCTION

It has long been recognized that nearly all avian field surveys underestimate abundances (Leopold 1933, Kendeigh 1944), unless the estimates are adjusted for the proportion of birds present but undetected at the times and locations surveyed (Hayne 1949, Anderson and Pospahala 1970). Such adjustments require an estimate of the probability of detecting birds present during surveys (detectability). Detectability is the product of the probability that birds make themselves available for detection by emitting detectable cues (availability) and the probability that the available birds will be perceived by a bird surveyor (perceptibility; Diefenbach et al. 2007).

All abundance estimators developed to account for imperfect detectability impose conditions on survey design, requiring either some form of replication, for example, of visits or observers, or ancillary data collection, for example, records of distance or time for each observation (Nichols et al. 2009, Matsuoka et al. 2014, but see Sólymos and Lele 2016). The added costs and expertise necessary to collect these ancillary data have prevented most of the standardized continental surveys, such as the North American Breeding Bird Survey (BBS), from adopting these methods (O'Connor et al. 2000). Some researchers (e.g., Thogmartin et al. 2006, 2010, Matsuoka et al. 2012, Twedt 2015) contend that this limits the rigor of spatial models of avian abundance and population size at large geographic scales, reducing the utility of these models for setting and evaluating progress toward meeting population objectives (e.g., Partners in Flight; Rosenberg and Blancher 2005). Unstandardized counts may also lead to improper inferences about the influence of covariates, thus potentially biasing the identification of suitable habitats for sensitive species (Sólymos et al. 2013).

Detection probabilities are strongly influenced by environmental conditions, bird behaviors, and survey protocols (Johnson 2008, Matsuoka et al. 2014), and are expected to show greater variation between studies collating datasets than within individual studies. For point count surveys, duration varies greatly among studies (Barker et al. 2015), but rarely changes within studies. As more and more large-scale studies rely on compiled point count data from disparate local studies (e.g., Bart 2005, Kelling et al. 2009, Dickinson et al. 2010, Hampton et al. 2013), standardization becomes critical because count duration greatly affects observations. The number of individual birds counted over a 10-min period is roughly 60% higher than during a 3-min period (Farnsworth et al. 2002, Etterson et al. 2009, Matsuoka et al. 2014), and both durations are routinely used for avian surveys.

The time-removal model (Moran 1951, Zippin 1956, 1958) was originally developed for estimating wildlife and fish abundances from capture–mark–recapture studies (Seber 1982). It was later reformulated for avian surveys with the goal of improving estimates of bird abundance, population size, and population trend by accounting for the availability bias inherent in point count data (Farnsworth et al. 2002, Sólymos et al. 2013). The model has now been widely applied to point count studies because the required ancillary data, that is, tallies of the first detections of individual birds in sequential time intervals, are easy to collect (Johnson 2008, Matsuoka et al. 2014). The removal model applied to point count surveys estimates the probability that a bird is available for detection as a function of the average number of detectable cues that an individual bird gives per min (singing rate), and the known count duration (Seber 1982, Barker and Sauer 1995, Alldredge et al. 2007a, Sólymos et al. 2013). However, the removal model has not been rigorously assessed to determine whether availability bias varies with duration, and whether fitting more complex parameterizations of the removal model can minimize such bias.

Singing rates of birds vary with time of day, time of year, breeding status, and stage of the nesting cycle (Wilson and Bart 1985, Rosenberg and Blancher 2005, Stacier et al. 2006). Thus, removal model estimates of availability may be improved by accounting for variation in singing rates using covariates for time of day and day of year (Reidy et al. 2011, Sólymos et al. 2013, Amundson et al. 2014). The removal model can also accommodate behavioral heterogeneity in singing by subdividing the sampled population of a species at a given point into a finite mixture of birds with low and high singing rates, which requires the additional estimation of the proportion of birds in the sampled population with low singing rates (Farnsworth et al. 2002, Alldredge et al. 2007a, 2007b, Reidy et al. 2011).

A key concern when fitting removal models of increasing complexity is the tradeoff between bias and variance, and the moderating role of sample size in this tradeoff. Several studies have recommended finite mixture models over conventional removal models to reduce bias in abundance estimates (Farnsworth et al. 2002, Etterson et al. 2009, Reidy et al. 2011), but these studies analyzed large sample sizes to support the more complex models. In contrast, efforts to fit finite mixture models to smaller datasets have resulted in models with unidentified parameters or point estimates of availability that were similar, but far less precise, than those from the conventional removal model (Amundson et al. 2014). The ability to fit more complex removal models that both reduce bias and minimize variance in estimates of availability may be, in part, a function of the number of detections, but the sample size requirements and bias–variance tradeoffs have not been systematically assessed for removal models.

We formulated and evaluated a continuous time-removal model for avian point counts by estimating the availability probabilities of 152 terrestrial bird species using a large database of point count surveys conducted across boreal and hemiboreal regions of North America (Figure 1, Appendix Tables 2 and 3; Cumming et al. 2010, Barker et al. 2015). We compared different parameterizations of removal models to determine the relationships among model complexity, bias, variance, sample size, and duration. To address general estimation of availability, we fit all models to 152 species that met minimum sampling requirements, but varied widely in singing rates and sample sizes of detections. To address protocol variability, we compared the reliability of the different model classes for adjusting the raw survey counts for variation in count duration. We also examined what factors influenced bias and variance in the adjustments. Finally, we compared our removal model adjustments of availability to those applied to BBS data across North America (Blancher et al. 2013) to evaluate the extent to which removal models may improve continental estimates of landbird population sizes. We finish by summarizing our recommendations for selecting among the various removal model types given study design and data characteristics.

## TABLE 1

Relative support for different model classes and candidate models of temporal covariate effects to explain the probability of availability during point count surveys of terrestrial landbird species in northern North America. We fit each of the models to 147 species and summarized relative support for each model as the number of species for which the model had the lowest Akaike's information criterion corrected for small sample sizes (AIC_{c}). Model classes included: *M*_{0} = time-invariant conventional; *M _{f}* = time-invariant finite mixture;

*M*

_{0}

^{ϕ}= conventional with time-varying rate parameter;

*M*

_{f}^{ϕ}= finite mixture with time-varying rate parameter; and

*M*= finite mixture with time-varying proportion of infrequent singers. The total is the sum of model types

_{f}^{c}*M*

_{0}

^{ϕ},

*M*

_{f}^{ϕ}, and

*M*by covariate model ID; frequencies for winter resident species are also shown (frequency for migrant species = Total − Residents).

_{f}^{c}## METHODS

### Point Count Data

The Boreal Avian Modelling Project (BAM) was initiated to facilitate the conservation of boreal birds. As part of these efforts, BAM has collated data from georeferenced point count surveys conducted in boreal, hemiboreal, and sub-Arctic biomes by independent inventory, monitoring, and research studies (Cumming et al. 2010, Barker et al. 2015). The BAM database (version 5) contains 250,822 off-road point count surveys conducted at 145,289 point count locations by 131 separate studies between 1991 and 2014. Of these surveys, 3% were 3 min, 60% were 5 min, and 37% were 10 min in total point count duration. We applied removal models to a subset of 89,304 off-road point counts at 25,754 locations conducted by 71 projects in boreal and hemiboreal regions (hereafter, ‘boreal') in which observers recorded when they first detected each bird relative to ≥2 time intervals (hereafter referred to as the ‘full dataset'; Figure 1). We excluded roadside surveys because of possible differences in singing behavior between birds breeding along roadsides vs. in off-road areas (Kociolek et al. 2011). The number of time intervals varied from 2 to 10, and there were 11 distinct combinations of count duration and the number and length of the time intervals (Figure 1). We restricted our analyses to detections of vocalizing gallinaceous gamebirds, shorebirds, woodpeckers, and songbirds, which included species that are diurnally active and are first detected primarily by their auditory displays during breeding surveys. We analyzed species detected aurally during a minimum of 75 point count surveys in the full dataset (median = 1,240 point count surveys, maximum = 31,334). We present results for the 152 species for which we could successfully estimate the constant singing rate parameter of the conventional removal model (Appendix Tables 2 and 3).

Count duration varied from 3 min to 20 min, but we omitted time intervals beyond 10 min to minimize the inclusion of movements by birds during counts, which could violate the closed population assumption of the removal model. For example, individuals that move toward the observer are more likely to be detected than ones that move away, and would inflate counts (Glennie et al. 2015). Point count radii in our dataset were at least 50 m, with 100-m radius (24%) and unlimited distance (72%) counts making up the majority. The removal model captures availability, while perceptibility is often estimated via distance sampling (Diefenbach et al. 2007). The conditional model formulation uses the cumulative counts over the time intervals relative to the total count, which makes the estimation of availability independent of the observer and distance-related perceptibility component and of the true underlying population abundance. By conditioning on the total counts, the abundance and other nuisance variables (such as perceptibility) are canceled out of the model likelihood (Sólymos et al. 2013).

As a result of this conditional modeling approach, we treated within- and between-year repeat visits to the same locations as independent observations. We checked that our assumption of conditional independence was valid, and we found neither systematic bias in our parameter estimates nor shrinking of standard errors as a result of treating data from revisits as independent (Appendix Figure 6). Using data from within- and between-year revisits to the same locations allowed us to better estimate date- and time-related patterns in our dataset, because revisits increased our sampling coverage of different dates and times, and therefore strengthened our inference regarding longitudinal variation. The diurnal and seasonal timing of the point counts varied widely, so we standardized the timing of each point count relative to local sunrise and spring, respectively, using the National Oceanic and Atmospheric Administration's Sunrise/Sunset Calculator ( https://www.esrl.noaa.gov/gmd/grad/solcalc/sunrise.html, Meeus 1999), implemented in the R package maptools (Bivand and Lewin-Koh 2016). We quantified the ordinal day of each point count, and also a standardized version of the ordinal day of each point count relative to the 30-yr average of the local start of the growing season (McKenney et al. 2006). Temporal covariates were standardized by dividing the observed values by the maximum possible values: time since local sunrise (SR; mean = 2.2 hr, range = −5.7 to 12.5 hr, maximum possible = 24 hr); ordinal day of year (DY; mean = 166, range = 128 to 200, maximum possible = 365); and days since local spring (LS; mean = 48.5, range = −8 to 94, maximum possible = 365).

### Continuous Time-removal Models

Time-removal models are based on a removal experiment wherein animals are trapped and thereby removed from the closed population of animals being sampled (Zippin 1958). When applying removal models to avian point count surveys, the counts of singing birds (*Y _{ij}*, …,

*Y*) within a given point count

_{iJ}*i*(

*i*= 1, …,

*n*) are tallied relative to when each bird is first detected in multiple and consecutive time intervals, with the survey start time

*t*

_{i}_{0}= 0, the end times of the time intervals

*t*(

_{ij}*j*= 1, 2, …,

*J*), and the total count duration of the survey

*t*. Each individual bird is counted once, so that individuals are ‘mentally removed' from a closed population of undetected birds by the surveyor. In the continuous time-removal model (hereafter, ‘removal model'), singing events by individual birds are assumed to follow a Poisson process. We can use the rate parameter of the Poisson process (ϕ) to estimate the singing rates of birds during a point count (Alldredge et al. 2007a).

_{iJ}In the conventional removal model (*M*_{0}), all individuals of a single species at a given location and time are assumed to be homogeneous in their singing rates. The time to first detection follows the exponential distribution *f*(*t _{ij}*) = ϕ exp(−

*t*ϕ), and the cumulative density function of times to first detection in time interval (0,

_{ij}*t*) gives us the probability that a bird sings at least once during the point count as

_{iJ}*p*(

*t*) = 1 − exp(−

_{iJ}*t*ϕ) (Barker and Sauer 1995, Alldredge et al. 2007a, 2007b, Sólymos et al. 2013). Although the term ‘detection' implies both the perceptibility and detectability processes, the perceptibility component is not part of our models due to the conditional likelihood approach. The perceptibility component should be addressed in a distance sampling framework independent of the removal models (Handel et al. 2009, Sólymos et al. 2013).

_{iJ}In the continuous time formulation of the finite mixture (or 2-point mixture) removal model, the cumulative density function during a point count is given by *p*(*t _{iJ}*) = (1 −

*c*) 1 +

*c*[1 − exp(−

*t*ϕ)] = 1 −

_{iJ}*c*exp(−

*t*ϕ), where ϕ is the singing rate for the group of infrequently singing birds, and

_{iJ}*c*is the proportion of birds during the point count that are infrequent singers (see Jewell 1982 for a discussion of mixtures of exponential distributions). The remaining proportions (1 −

*c*; the intercept of the cumulative density function) of the frequent singers are assumed to be detected instantaneously at the start of the first time interval (Alldredge et al. 2007b). In the simplest form of the finite mixture model, the proportion and singing rate of birds that sing infrequently is homogeneous across all times and locations (model

*M*).

_{f}The singing rate in the conventional removal model (ϕ* _{i}*) may also vary with respect to survey-specific covariates. We hereafter refer to these models collectively as time-varying conventional removal models (

*M*

_{0}

^{ϕ}). Previously, researchers have applied covariate effects to the parameter ϕ

*of the finite mixture model, assuming that the parameter*

_{i}*c*is constant irrespective of time and location (i.e. only the infrequent singer group changes its singing behavior; Farnsworth et al. 2002, Aldredge et al. 2007a, Etterson et al. 2009, Reidy et al. 2011). An alternative parameterization is for

*c*rather than ϕ to be the time-varying parameter, allowing individuals to switch between the frequent and infrequent singer group depending on covariates. We defined 2 time-varying finite mixture model classes: (1)

_{i}*M*

_{f}^{ϕ}, with covariate dependent rate parameter

*p*(

*t*) = 1 −

_{iJ}*c*exp(−

*t*ϕ

_{iJ}*); and (2)*

_{i}*M*, with the proportion of infrequent singers dependent on covariates:

_{f}^{c}*p*(

*t*) = 1 −

_{iJ}*c*exp(−

_{i}*t*ϕ).

_{iJ}In all 3 classes of time-varying models (*M*_{0}^{ϕ}, *M _{f}*

^{ϕ}, and

*M*), we modeled covariate effects as linear or quadratic effects of time since local sunrise (SR), ordinal day of year (DY), and days since local spring (LS) in 14 different candidate models (Table 1). We estimated model parameters using the conditional maximum-likelihood procedure described by Sólymos et al. (2013) and implemented in the R package detect (R Core Team 2017, Sólymos et al. 2018). This included using the cmulti function, with type argument ‘rem' for conventional (

_{f}^{c}*M*

_{0},

*M*

_{0}

^{ϕ}) models and type argument ‘fmix' and ‘mix' for the 2 mixture model types (

*M*

_{f}^{ϕ}and

*M*, respectively; both options give identical results for

_{f}^{c}*M*). Covariate effects for ϕ

_{f}*were modeled using the logarithmic link function to ensure that ϕ*

_{i}*> 0, whereas covariate effects for*

_{i}*c*were modeled using the logistic link function to ensure that

_{i}*c*was in the interval (0, 1).

_{i}### Sample Size Requirements for Removal Models

We first examined the minimum sample size required to reliably estimate the parameters of the time-invariant *M*_{0} and *M _{f}* models. We stratified the full point count dataset by project, and then estimated the model parameters for 3,002 combinations of the 65 projects and 152 species that had a minimum of 2 nonzero total counts (hereafter referred to as the ‘project-level dataset'). We classified an estimation as a success if the Nelder-Mead optimization algorithm for the conditional maximum-likelihood estimator (MLE) was able to converge on a point estimate of ϕ for

*M*

_{0}or on a joint estimate of (ϕ,

*c*) for

*M*, and if the algorithm could estimate the corresponding variances, Var[log(ϕ)] for

_{f}*M*

_{0}, and jointly Var[log(ϕ)] and Var[logit(

*c*)] for

*M*. We fit logistic regression models separately to the outcomes of estimates and variances for

_{f}*M*

_{0}and

*M*models to predict the probability of successfully estimating the model parameters as a function of log number of point counts with nonzero totals for the species in a given project, and calibrated the fitted models to predict the minimum number of nonzero counts required to have a 90% probability of successfully estimating removal model parameters and variances. The number of nonzero counts was highly correlated with the mean count (Spearman's

_{f}*r*= 0.82,

*P*< 0.001) in the full dataset; therefore, we did not separately assess relationships with mean counts.

### Adjusting Point Counts for Variation in Count Duration

We used a subset of the full dataset for 133 species from 38,540 point count surveys with 0–3, 3–5, and 5–10 min intervals (hereafter referred to as the ‘validation dataset') to assess the bias and variance based on the time-invariant models (*M*_{0} and *M _{f}*; estimates based on the validation data) when predicting the number of birds counted in the full 10-min point count, using the data from the 0–3 min and 0–5 min portions of the same point counts as inputs. The expected total number of individuals of a species counted across point counts of duration

*t*can be written as

*E*[

*Y*] =

_{t}*N p*, where

_{t}*p*is the estimated availability probability for duration

_{t}*t*from the removal model (

*M*

_{0}or

*M*), and

_{f}*N*is the unobserved total number of individuals of that species present across the point counts. It follows that

*N*=

*E*[

*Y*

_{3}]/

*p*

_{3}=

*E*[

*Y*

_{5}]/

*p*

_{5}=

*E*[

*Y*

_{10}]/

*p*

_{10}for intervals of the same point count, in which case

*N*is assumed to be constant due to the closed population. We computed the bias of the removal model when adjusting the 3-min and 5-min point counts to a 10-min standard, respectively, as (Σ

*Y*

_{3}/

*p*

_{3}) / (Σ

*Y*

_{10}/

*p*

_{10}) and (Σ

*Y*

_{5}/

*p*

_{5}) / (Σ

*Y*

_{10}/

*p*

_{10}), with a value of 1 resulting when the adjustment was unbiased (hereafter referred to as the ‘corrected relative count'). For each species, we fit

*M*

_{0}or

*M*models and drew

_{f}*B*= 1,000 replications from the asymptotic distribution of the parameter estimates to represent uncertainty, and we calculated bias (mean signed difference from the expected value of 1) and variance (mean squared deviation of the corrected relative counts from their sample mean across the

*B*replications) based on corrected relative counts from the

*B*replicates. We graphically displayed the distribution of bias and variance across species as a function of sample size, count duration (3 or 5 min), and model type (

*M*

_{0}or

*M*). Bias and variance are presented as proportions of the 10-min corrected count. We used the median and 5% and 95% quantiles to represent the distribution of bias and the 5–95% interval to represent the distribution of variance of the species whose sample sizes were equal to or greater than the sample size selected for comparison.

_{f}### Relative Support for Models and Plotting Time-varying Effects

For each of the 152 species, we fit 44 candidate models that included the time-invariant (*M*_{0} and *M _{f}*) and time-varying models with temporal covariate effects (

*M*

_{0}

^{ϕ},

*M*

_{f}^{ϕ}, and

*M*; 14 models for each) based on the full dataset (Table 1, Appendix Table 3). Although our dataset was composed of many point counts, the sample size for conditional likelihood estimation is defined by the number of point counts with >0 total counts. As a result, sample sizes varied considerably across species. Therefore, we compared the relative support for each model within a species using Akaike's information criterion corrected for small sample sizes (AIC

_{f}^{c}_{c}). We assessed model support across species by counting the number of species for which each model was best-supported (smallest AIC

_{c}value). We only compared models for the 147 species for which all model parameters were estimated. We compiled information on each species' migratory strategy (winter resident vs. migrant) from Poole (2012), and used a chi-square test to see whether migratory strategy affected how often ordinal day (DY) and days since local spring (LS) were selected in the best-fitting models (Appendix Table 3).

We examined temporal patterns in availability probabilities across the species for which the best-fitting removal model included temporal covariates (SR, DY, or LS; Table 1, Appendix Table 3). We did so by predicting availability probabilities based on a 3-min point count duration for each species as a function of the observed covariate values from the full dataset (response curves). We conditioned the predictions on the mean of the other covariates. We overlaid the response curves of all species that shared the covariate effect, and described the general patterns in availability probabilities across species using a 2-dimensional kernel density estimate with a bivariate normal kernel (Wand and Jones 1995) using the R package KernSmooth (Wand 2015). We identified the contour for the 75% bivariate quantile to identify the highest density region, that is, where most of the species' response curves overlapped. We also identified 5%, 25%, 50%, 75%, and 95% contours based on the cumulative density conditional on the covariate values, that is, percentage of the species with the same or higher availability given the date and time. We show average responses for all 3 time-varying model classes (*M*_{0}^{ϕ}, *M _{f}*

^{ϕ}, and

*M*).

_{f}^{c}### Removal Models vs. PIF Approach to Adjusting for Incomplete Availability

We compared adjustments for incomplete availability of individual species based on removal models with those used in the Partners in Flight (PIF) Population Estimates Database (Blancher et al. 2013) to determine how PIF estimates of continental population sizes might be improved by using removal model–based estimates of availability. PIF uses a time adjustment (here referred to as *T _{adj}*; Blancher et al. 2013) to account for diurnal variation in availability, which is calculated as the ratio of the maximum count estimated from a polynomial regression to the mean count based on 50 stops in the BBS data (Rosenberg and Blancher 2005, Blancher et al. 2013), and assumes that the probability of availability = 1 at the time of the maximum count.

We defined our species adjustment factor as the reciprocal of average availability (*U _{inv}* = 1 /

*p*). We calculated the correlation between

_{avg}*T*and

_{adj}*U*and compared the distributions across 129 species for which PIF adjustments were available (Blancher et al. 2013; Appendix Table 3). We also assessed how maximum availabilities (

_{inv}*p*) compared with the PIF assumption of the maximum being 1. We calculated average (

_{max}*p*) and maximum (

_{avg}*p*) availability probabilities by using the best-supported model for each species (Table 1, Appendix Table 3) for predicting the probability of availability for each point count in the entire BAM dataset (Figure 1) given the covariate values, and then calculating the average and maximum of the predicted values for each individual species. We set count duration to be 3 min, to be consistent with the BBS methodology that

_{max}*T*was based on.

_{adj}## RESULTS

### Sample Size Requirements for Time-invariant Removal Models

We were able to estimate the singing rate parameter (ϕ) for all 152 species using the conventional removal model (*M*_{0}) fit to the full dataset from the boreal study area (Appendix Table 2). The mean probability of availability increased from 0.51 ± 0.15 SD for 3-min point count duration to 0.68 ± 0.15 SD for 5-min and 0.88 ± 0.12 SD for 10-min count duration (Figure 2). A logistic regression model (β_{0} = −1.16 ± 0.10 SE, β_{log(# nonzero counts)} = 0.92 ± 0.04 SE) predicted that a species needed to be detected during 45 ± 6 SE point counts for the *M*_{0} model to have a 90% probability of successfully converging on a project-specific estimate of ϕ. Sample size requirements for Var[log(ϕ)] were identical to that of the point estimates.

We jointly estimated the parameters for the *M _{f}* models (ϕ,

*c*) for 147 species using the full dataset. Mean probabilities of availability based on the

*M*models across the 147 species were 9–16% lower than for the

_{f}*M*

_{0}models, and increased from 0.46 ± 0.17 SD for 3-min, to 0.56 ± 0.17 SD for 5-min, to 0.74 ± 0.17 SD for 10-min count duration (Figure 2). A logistic regression model (β

_{0}= −1.16 ± 0.09 SE, β

_{log(# nonzero counts)}= 0.66 ± 0.03 SE) predicted that a species needed to be detected during 158 ± 24 SE point counts for the

*M*model to have a 90% probability of successfully converging on project-level estimates of ϕ and

_{f}*c*. The sample size requirement for a 90% probability of successfully estimating the associated variance was substantially larger, 963 ± 177 SE (β

_{0}= −1.93 ± 0.09 SE, β

_{log(# nonzero counts)}= 0.60 ± 0.03). The number of nonzero counts for the 5 species for which we were unable to estimate

*M*model parameters varied between 88 and 419 (Appendix Table 2).

_{f}### Adjusting Point Counts for Variation in Count Duration

The bias when correcting 3- and 5-min duration counts to 10-min counts was affected by count duration. A 5-min count duration decreased median bias and tightened the 90% quantile range of bias closer to 0 (no bias) across species for both the *M*_{0} and *M _{f}* models, when compared with the distribution of 3-min based biases. The median bias with respect to 10-min counts for the

*M*

_{0}model was 0.05 for 3-min and −0.03 for 5-min counts, and did not improve further with increasing sample size (Figure 3). Although the 90% quantile range for

*M*

_{0}models shortened with increasing sample size, it stayed centered around the biased median. Median bias with respect to 10-min counts for the

*M*model was small (absolute value <0.01) regardless of sample size; the 90% quantile range shortened with sample size. The high end of the 90% quantile range was close to 0 (no bias), whereas the lower limit spread wider into the negative direction, changing from −0.27 at a sample size of 20 to −0.12 at a sample size of 200 (Figure 3).

_{f}Variance was also greatly affected by count duration. Increasing the count duration from 3 to 5 min resulted in reducing the 90% limit for the *M _{f}* models from 0.06 to 0.03. Variance of the corrected relative counts under the

*M*

_{0}model was <0.01 irrespective of duration and sample size, but nevertheless showing similar relative patterns. The 90% limit for variance of corrected relative counts under the

*M*models decreased to 0.03 and 0.01 with 200 nonzero counts for the 3- and 5-min counts, respectively (Figure 3).

_{f}Probability of availability and corrected relative counts based on the time-invariant removal models (*M*_{0} and *M _{f}*) varied among species with different sample sizes (Figure 4). The Rusty Blackbird (

*Euphagus carolinus*), with 254 nonzero counts, showed high uncertainty in both probabilities and corrected relative counts, but bias with respect to 10-min counts was small for both models because the finite mixture model intercept (probability at time 0, 1 −

*c*) was close to 0 (

*c*= 0.96 ± 0.18 SE), and thus the singing rates were similar (

*M*

_{0}: ϕ = 0.16 ± 0.00 SE;

*M*: ϕ = 0.14 ± 0.05 SE). The high uncertainty for the

_{f}*M*model was a result of imprecise

_{f}*c*estimates and high (0.6) correlation between

*c*and ϕ (high rate with low intercept (1 −

*c*), and vice versa). The Western Wood-Pewee (

*Contopus sordidulus*) had 493 nonzero counts in the validation set, and, as a result, its

*M*estimates, especially

_{f}*c*, were more precise, with low correlation (ϕ = 0.11 ± 0.05 SE,

*c*= 0.73 ± 0.03 SE, correlation = 0.15). As the intercept (1 −

*c*) was higher, the singing rates differed more between models (

*M*

_{0}: ϕ = 0.27 ± 0.00 SE). This resulted in larger differences between the 2 models in corrected relative counts, with

*M*being unbiased, although uncertainties (ranges of confidence intervals) were comparable. The Connecticut Warbler (

_{f}*Oporornis agilis*) had 1,162 nonzero counts, and had the most precise parameter estimates of the 3 example species and the highest intercept (

*M*

_{0}: ϕ = 0.40 ± 0.00 SE;

*M*: ϕ = 0.17 ± 0.04 SE,

_{f}*c*= 0.54 ± 0.03 SE, correlation = 0.56), resulting in low bias and variance compared with the other 2 species.

### Relative Support for Models and Plotting Time-varying Effects

We fit 44 conventional and finite mixture removal models to 147 species detected during at least 75 point counts in the full dataset (fitting finite mixture models was unsuccessful for 5 species; Appendix Tables 2 and 3). Finite mixture models were best-supported for 92% of the 147 species (*M _{f}*

^{ϕ}best-supported for 26%,

*M*best-supported for 66%; Table 1). The only species for which the time-invariant (

_{f}^{c}*M*

_{0}) model was best-supported based on AIC

_{c}was the Yellow-billed Cuckoo (

*Coccyzus americanus*;

*n =*196). The best-supported models included time since local sunrise (SR) for 75%, ordinal day (DY) for 39%, and days since local spring (LS) for 37% of the species. Best models included both time since sunrise and either ordinal day or days since spring for 52% of the species (Table 1).

Availability probabilities generally declined across the observed ranges of times since sunrise, and were consistently high between 0 and 4 hr after sunrise. Responses were more variable before and after this 4-hr period. Species' responses to time since sunrise were predominantly nonlinear (77 species with a quadratic and 33 with a linear relationship). Availability declined slowly with respect to date, and was highest across species between ordinal days 155 and 175 and 30–60 days after local spring. Species' responses to dates were predominantly linear for both ordinal day (40 species with a linear relationship and 18 with a quadratic relationship) and days since spring (35 species with a linear and 19 with a quadratic relationship). The temporal patterns of predicted availabilities across all species were very similar between the different classes of time-varying removal models (*M*_{0}^{ϕ}, *M _{f}*

^{ϕ}, and

*M*; Figure 5).

_{f}^{c}The selection frequency for ordinal day (DY) was higher for migratory species (41%, 51 out of 124) than for winter residents (30%, 7 out of 23), but the difference was not significant (χ^{2} = 0.54, *P* = 0.47). The selection frequency for days since local spring (LS) was somewhat lower for migrants (35%, 44 out of 124) than for resident species (43%, 10 out of 23), also with a nonsignificant (χ^{2} = 0.93, *P* = 0.34) difference (Table 1).

### Removal Models vs. PIF Approach to Adjusting for Incomplete Availability

Across 129 species of landbird, the PIF adjustment factor (*T _{adj}*) for BBS counts had median value of 1.27 and ranged from 1.03 to 25.88. The mean of the average availability (

*p*) across these species was 0.43 ± 0.17 SD; the mean of the maximum availability (

_{avg}*p*) was 0.66 ± 0.23 SD. Our inverse availability based adjustment factor (

_{max}*U*) derived from fitted removal models was 1.7 times higher than

_{inv}*T*(

_{adj}*U*median = 2.57, range = 1.25–89.69);

_{inv}*U*and

_{inv}*T*were not correlated (Spearman's

_{adj}*r*= 0.005,

*P*= 0.95). Average availability was very low, and, as a result,

*U*was very high for the Eastern Bluebird (

_{inv}*n*= 99,

*U*= 6,788,866.26) and Rusty Blackbird (

_{inv}*n*= 420,

*U*= 10,022.05) based on their best-supported finite mixture models (Appendix Tables 2 and 3). We show these results to demonstrate the problems that arise when numbers of detections are too small for the mixture model. These 2 species were excluded from the comparisons.

_{inv}## DISCUSSION

### Conventional vs. Finite Mixture Models

We evaluated alternate formulations of the time-removal model (a conventional removal model and a finite mixture removal model), with and without covariates, for 152 bird species, using a compilation of point count data spanning northern North America. We found that predicted availability probabilities under conventional and finite mixture models were very similar in their ranges of probability values and shapes of their response curves to predictor variables. However, finite mixture models were supported for 92% of species with sufficient nonzero counts, in accordance with previous studies that found that mixture models provided better fit (Farnsworth et al. 2002, Etterson et al. 2009, Reidy et al. 2011). By incorporating variation in singing behavior within a survey event, the finite mixture model relaxes the homogeneous singing rate assumption of the conventional removal models, thus reducing bias when using availability probabilities to adjust survey counts for detectability.

Count duration had a profound effect on average counts. In our data, the first 3-min and first 5-min counts contained, on average, 59% and 73% of the full 10-min counts from the same surveys. Both conventional and finite mixture models reduced this bias considerably. Finite mixture models adjusted the shorter-duration counts with low bias (<1%), whereas conventional models tended to underpredict counts by 5% for 3-min counts and overpredict counts by 3% for 5-min counts when compared with the 10-min counts from the same surveys. The corrected counts using the conventional models were characterized by much smaller variances (<1%) than those from finite mixture models (<6%). However, the variance of finite mixture model–based corrected counts reached acceptable levels for species with at least 200 nonzero counts, the sample size for which bias reduction was greatest.

We evaluated the number of nonzero counts needed to reliably estimate the parameters of the removal models, and found that conventional models required a minimum of 50 nonzero counts, whereas finite mixture models required at least 160 nonzero counts. This corroborates the substantial reduction in variance of the adjusted counts both within and among species at 200 nonzero counts. The low precision of the finite mixture model parameters can considerably inflate the variance of corrected relative counts at low sample sizes (Figures 3 and 4), and at least 1,000 nonzero counts were required to reliably estimate the variance–covariance matrix for finite mixture models. Because the number of nonzero counts was correlated with mean counts (relative abundances) of the study species, it follows that less abundant species will require larger samples for reasonable removal model estimates.

The 1,000 nonzero counts per species required for mixture models (Amundson et al. 2014) is somewhat greater than what is required by other models of detection probability, such as distance sampling (Buckland et al. 2001:240). However, when sample sizes were sufficient, the mixture models were nearly unbiased for most species when adjusting point counts for protocol variation. Because associated variances were often high, even with much larger sample sizes, we recommend that researchers compare results from conventional and finite mixture models, preferring finite mixture models when the estimates and their variances are not outlying values. Otherwise, the benefits of potential bias reduction might be obscured, especially at low detection rates.

### Implications for Survey Design and Modeling

Counts of longer duration increased the numbers of detections and were therefore effective for reducing bias and variance in the estimates of availability and corrected relative counts. We found that increasing the count duration from 3 to 5 min reduced the range of variation in bias (with respect to 10-min counts) across species, and reduced variance by at least 50% for most species, using both conventional and finite mixture models. Thus, when designing field surveys, lengthening the count duration to 5 or 10 min is an important consideration to increase the accuracy and precision of population estimates by sampling design (Barker et al. 1993, Ralph et al. 1995, Matsuoka et al. 2014). We recognize that shorter count durations (3 min) are favored for logistical reasons, but the choice of count duration must be based on a careful consideration of the tradeoff between survey costs and study objectives.

Longer count durations increase the chances that individuals may move in or out of the surveyed area or will be double counted, which violates assumptions of the removal model and most other models of detectability applied to point counts. Double counting was found to comprise 0.9–3.4% of detections during experimental point counts (Simons et al. 2009), and may also have been low in our study. We excluded count intervals beyond 10 min to minimize the effect of movement-related biases on our estimates. Although longer-duration counts might have been inflated compared with counts in nested subintervals, this would not have changed our conclusions in terms of relative bias among different models.

We found that, among the 135 species for which finite mixture models were found to fit better than conventional models, for 72% there was more support for the models with a time-varying proportion of infrequent singers, when compared with the models with time-varying singing rates. This indicates that our proposed mechanism of individuals switching membership between the groups of frequent and infrequent singers is more realistic than the alternative of fixed proportions as used by previous studies (Farnsworth et al. 2002, Aldredge et al. 2007a, Etterson et al. 2009, Reidy et al. 2011). The overall patterns regarding responses of singing rates to dates and times were very similar, regardless of support across the model classes. The periods when we found detectability to be high (0–4 hr after sunrise and June 4–24 [ordinal day 155–175]; see Results) agreed well with long-standing protocol recommendations (Ralph et al. 1995). Conducting point counts during these times of day and year will make surveys much more efficient for accumulating detections and will thereby improve the resulting accuracy and precision of abundance estimates.

Besides these general guidelines, local and regional context and study goals should also shape survey timing. Although sunrise time already reflects geographic differences, weather and resulting local spring date can vary considerably across years, thus influencing the timing of migration for nonresidents, and the timing of high vocal activity for most species. Regional differences can also introduce heterogeneity into bird availability through geographic advancement of breeding through the growing season. We captured this breeding progression with the days since spring variable (vs. ordinal days). The fact that the 2 predictors related to time of year were supported by different species suggests differences among species' plasticity of response to interannual variations in weather conditions. We found no significant differences between winter resident and migratory species with respect to selection frequencies of time-of-year covariates. This indicates that, although ordinal day and multidecadal averages (our days since spring variable) are important predictors of availability, time-of-year covariates are inadequate for establishing linkages between breeding phenology and weather events for migratory species (Both and Visser 2001, Both et al. 2010). Estimating year-specific effects with respect to days since spring vs. ordinal day might be indicative of changes in breeding phenology with a changing climate, which will be important to account for in programs monitoring long-term changes in bird population sizes. Accounting for year-to-year variation in covariates is important given the large interannual variation in spring green-up (15–20 days in the boreal region; Delbart et al. 2006) and the strong indications of advancing phenology in boreal and hemiboreal regions with climate change (Linkosalo et al. 2009, Beaubien and Hamann 2011).

### Conclusions

Our study demonstrates the importance of accounting for temporal heterogeneity in bird availability when analyzing avian point count data. This is especially true when combining data across studies with different locations, sampling timeframes, or protocols. Location influences the expected time of the breeding season, and the timing of surveys relative to the local breeding season affects availability, for example, through time-varying behaviors. It is difficult to retrospectively control for these sources of heterogeneity introduced by sampling design, and assuming constant availability in space and time is unrealistic, especially when analyzing heterogeneous and/or spatially extensive datasets. Thus, a model-based approach to analysis is required. We argue that the continuous time-removal modeling framework outlined in this manuscript is sufficient. Easily obtainable covariates, such as the ones used in this study, can help to better correct availability bias of counts.

Comparing our average availability-based corrections with the time adjustment factors developed by Partners in Flight (PIF; Rosenberg and Blancher 2005, Blancher et al. 2013), we found our removal model–based adjustment factors to be 1.6 times higher. This was largely due to the PIF assumption of 100% availability during the times of day with the maximum counts, whereas our maximum availability estimates for a 3-min point count averaged 66%. As a result, PIF population size estimates are most likely to be underestimates for most species. This availability-related underestimation is relatively small compared with the 5-fold difference in population sizes when adjusting point counts for perceptibility probability using distance sampling versus PIF's expert-derived adjustment for maximum detection distances (Matsuoka et al. 2012). Similarly, both field tests (Confer et al. 2008) and sensitivity analyses (Thogmartin 2010) have indicated that PIF population size estimates are more strongly influenced by the choice of adjustment for perceptibility (maximum detection distance) than availability (time adjustment). We feel that availability probabilities from removal models could replace PIF's time adjustment (Rosenberg and Blancher 2005, Blancher et al. 2013) to modestly improve population size estimates, particularly if covariates were included in the removal models to account for date- and time-related heterogeneity in singing rates (Thogmartin et al. 2006). We did not find a correlation between correction factors based on removal models and PIF's time adjustments, even after adjusting for maximum availability. This is most likely due to geographic differences (mostly nonboreal vs. boreal regions, respectively) and the on- vs. off-road nature of the BBS and BAM datasets, which are complexities to be tackled by future studies.

The probability of availability, given the presence of an individual, can be efficiently estimated by conventional or finite mixture type removal models, with or without covariates for bird behavior. In this paper, we have outlined a flexible and continuous-time parameterization of these model types implemented in R package detect (Sólymos et al. 2018). We have extended removal model methodology to flexibly model time-varying finite mixture models using covariates for either the singing rate (ϕ) or the proportion of infrequent singers (*c*) parameter. We found that the choice of model had an impact on the estimability of unknown model parameters, and that this choice also affected the bias and variance of the corrected relative counts. In general, a conventional removal model is recommended for smaller sample sizes. Model selection approaches can be used to compare support for different parameterizations, but only at much higher sample sizes (>200 nonzero counts, but preferably >1,000). Lengthening the count duration from 3 min to 5 or 10 min increased sample sizes of detections and improved the accuracy and precision of removal model estimates. Well-informed survey design combined with various forms of removal sampling are useful for accounting for availability bias in point counts, thereby improving population size estimates and allowing for better integration of disparate studies at large spatial scales.

## ACKNOWLEDGMENTS

This publication is a contribution of the Boreal Avian Modelling (BAM) Project, an international research collaboration on the ecology, management, and conservation of boreal birds. We acknowledge BAM's members, avian and biophysical data partners, and funding agencies (including Environment and Climate Change Canada and the U.S. Fish and Wildlife Service), listed in full at www.borealbirds.ca/index.php/acknowledgements

**Funding statement:** This project was funded by Environment and Climate Change Canada, the Natural Sciences and Engineering Research Council of Canada, U.S. Fish and Wildlife Service's Neotropical Migratory Bird Conservation Act, and the Alberta Biodiversity Monitoring Institute. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government. None of the funders had any influence on the content of the submitted or published manuscript. This paper has been peer reviewed and approved for publication consistent with USGS Fundamental Science Practices ( http://pubs.usgs.gov/circ/1367/).

**Ethics statement:** This research was conducted in compliance with the *Ethical Guidelines for Statistical Practice*.

**Author contributions:** P.S., S.M.M., and E.M.B. conceived the idea and designed the study; P.S. analyzed the data; P.F. organized data for analyses; and all authors substantially contributed to writing the paper.

## LITERATURE CITED

*Setophaga ruticilla*). The Wilson Journal of Ornithology 118: 439– 451. Google Scholar

## Appendices

## APPENDIX TABLE 2

Common and scientific names of the species used in this study evaluating the probability of availability during point count surveys of terrestrial landbird species in northern North America, with number of nonzero counts (*n*) from the full dataset of point count surveys, along with parameter estimates of availability (ϕ: singing rate; *c*: proportion of infrequent singers) from time-invariant models (conventional: *M*_{0}; finite mixture: *M _{f}*).

## APPENDIX TABLE 3

Migratory status and best-supported models for probability of availability during point count surveys of terrestrial landbird species in northern North America, based on comparing the 3 classes of time-varying removal models (see Table 1 for model IDs). The column ‘Best' considers all 3 classes (model ID is <class> <model ID>) and time adjustment values (PIF's [Partners in Flight] time adjustment: *T _{adj}*; and inverse availability based measures:

*U*). Migratory status: WR = winter resident; SD = short-distance migrant; LD = long-distance migrant. Model classes:

_{inv}*M*

_{0}

^{ϕ}= conventional with time-varying rate parameter;

*M*

_{f}^{ϕ}= finite mixture with time-varying rate parameter; and

*M*= finite mixture with time-varying proportion of infrequent singers.

_{f}^{c}