Technological advancements have allowed biologists and ecologists to amass enormous amounts of data. Moreover, many contemporary issues require various sources of information—differing in quality, completeness, and scale—to answer pressing research questions or to evaluate important hypotheses. These advancements and the demands of modern science call for sophisticated methods for dealing with large and diverse sources of data. Modern computational and statistical methodologies are bringing data analysis and modeling into balance with data collection and storage capabilities, allowing us to address scientific problems in ways that were unimaginable 20—or even 10—years ago.
Some statistics-oriented books introduce, often in great detail, modern computational methods for data analysis that are meeting the challenges of a data-rich world. In my opinion, however, biologists and ecologists largely lack the training in these state-of-the-art statistical methods. The ideal textbook would provide an overview of those modern computational statistics that are potentially of greatest relevance to biologists, highlight the important theoretical underpinnings, discuss the motivation behind using particular approaches, give sufficient detail on implementation, and include examples based on the types of data and problems that biologists “typically” encounter. When I learned of Derek Roff's recent book, Introduction to Computer-intensive Methods of Data Analysis in Biology, I initially thought this could be that book.
Roff is a professor in the Department of Biology at the University of California, Riverside. As he is an evolutionary population ecologist, many of his examples are drawn from evolutionary ecology, population biology, and ecological genetics. He has published extensively in these fields, including four books related to evolutionary biology, life-history evolution, and quantitative genetics. Roff's experiences working with a variety of data sources are presumably what stimulated him to compile this book and to write, “Much of the development of statistical tools has been premised on a set of assumptions, designed to make the analytical approaches tractable.”
In my view, such assumptions (normality, linearity, and independence, among others) are often very restrictive and can force one to pigeonhole a data set (and a research question) into an existing approach that is inappropriate for the problem at hand. I therefore wholeheartedly agree with Roff's observation that “we have now entered an era where we can, in many instances, dispense with such assumptions and use statistical approaches that are rigorous but largely freed from the straight-jacket imposed by the relative simplicity of analytical solution.” And indeed, Roff does give many biologically inspired examples that do not conform to classical, canned analysis methods.
The title and preface led me to think that the book would cover modern approaches to computer-intensive data analysis, but Roff focuses primarily on relatively old-school, established methods: the jackknife (chapter 3), the bootstrap (chapter 4), randomization (chapter 5), and regression trees (chapter 6). Chapter 2 (maximum likelihood) and chapter 7 (Bayesian methods) have the greatest potential to contribute toward a discussion of cutting-edge computational statistical methods. However, Roff pays little attention to many computer-intensive methods that have come to the fore in the past decade, even though these are primarily responsible for the resurgence and escalating popularity of Bayesian, maximum likelihood, and some other approaches. For example, the Bayesian chapter is based on analytical results and issues from some years ago: there is no reference to the recent flood of numerical and computational methodologies that are responsible for the rapid rise of Bayesian approaches in many applied fields. Among the key advances missing from the book are Markov chain Monte Carlo (MCMC) algorithms (e.g., Gibbs sampling, the Metropolis–Hastings algorithm). Likewise, much of Roff's discussion of maximum likelihood approaches is based on rather simple examples and analytical derivations. Yet these approaches may require sophisticated and computationally intensive algorithms such as the expectation-maximization algorithm, simulated annealing, and MCMC-type hill-climbing routines. I was disappointed that Computer-intensive Methods overlooks many of the truly state-of-the-art computational statistical methods that are applicable to data analysis in biology and ecology.
Nevertheless, some elements make the book a potentially useful resource for researchers and graduate students. One redeeming quality is that throughout the book, Roff makes the point that one should routinely conduct simulation analyses to evaluate the usefulness of a particular data analysis method. He gives several examples in which he simulates pseudodata from a known process (i.e., one whose parameter values and distributional forms are known) and then subjects the pseudodata to different analysis methods to reconstruct (estimate) the known parameters. A good method would yield parameter estimates that agree with the known values.
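To make the idea concrete, here is a minimal sketch of such a simulation analysis. It is my own illustration in Python, not Roff's S-PLUS code, and every name and setting in it (the function simulate_study, the normal process with mean 10 and standard deviation 2, the sample size, and the number of replicates) is an assumption chosen for the example. Pseudodata are drawn repeatedly from the known process, two candidate estimators of its center are applied to each pseudodata set, and each estimator is scored by its bias and root-mean-square error against the known value.

```python
import random
import statistics


def simulate_study(estimators, true_mean=10.0, true_sd=2.0,
                   n_obs=30, n_reps=2000, seed=1):
    """Score estimators on pseudodata drawn from a known normal process.

    All defaults are illustrative assumptions, not values from the book.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    estimates = {name: [] for name in estimators}

    for _ in range(n_reps):
        # Step 1: simulate one pseudodata set from the known process.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n_obs)]
        # Step 2: apply every candidate method to the same pseudodata.
        for name, estimator in estimators.items():
            estimates[name].append(estimator(sample))

    # Step 3: compare the estimates with the known parameter value.
    summary = {}
    for name, values in estimates.items():
        bias = statistics.fmean(values) - true_mean
        squared_errors = [(v - true_mean) ** 2 for v in values]
        rmse = statistics.fmean(squared_errors) ** 0.5
        summary[name] = {"bias": bias, "rmse": rmse}
    return summary


results = simulate_study({
    "mean": statistics.fmean,
    "median": statistics.median,
})
for name, scores in results.items():
    print(f"{name}: bias={scores['bias']:+.3f}, rmse={scores['rmse']:.3f}")
```

Under these assumptions both estimators should show bias close to zero, with the sample mean recovering the known value somewhat more precisely than the median. The same scaffold could be pointed at a skewed or contaminated process to see whether that ranking holds.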
Another positive feature is that about one-third of the text is an appendix with annotated S-PLUS code; any biologist who wished to learn or use S-PLUS for conducting simulations and data analysis would very likely find this book valuable. Moreover, Roff has made the code publicly available on his UC Riverside Web page (www.biology.ucr.edu/people/faculty/Roff.html). S-PLUS is a powerful and flexible language for conducting both classical and more modern, computationally demanding analyses. Although code for R (a free software package similar to S-PLUS) might have been useful to a wider audience, one should be able to use the S-PLUS syntax as a starting point for programming Roff's examples in R. Last, Roff does provide a fair amount of detail and references regarding various jackknife and bootstrap methods, which someone interested in using these methods would find valuable.
In summary, Roff gives a brief overview of some data-analysis methods, but many areas lack sufficient explanation. Readers would therefore benefit from already having a basic understanding of the theoretical foundations underlying the different approaches. I would also have liked to see more discussion of the motivation for choosing particular methods. Additionally, with the exception of the jackknife and bootstrap chapters, I felt that the chapters were not well integrated. I would hesitate to recommend Computer-intensive Methods as a primary text for a graduate course; however, the S-PLUS appendix and simulation analysis examples make this book a potentially valuable resource for those who are interested in these aspects of data analysis.