1
Tekwa EW, Whalen MA, Martone PT, O'Connor MI. Theory and application of an improved species richness estimator. Philos Trans R Soc Lond B Biol Sci 2023; 378:20220187. [PMID: 37246376 DOI: 10.1098/rstb.2022.0187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2022] [Accepted: 04/10/2023] [Indexed: 05/30/2023] Open
Abstract
Species richness is an essential biodiversity variable indicative of ecosystem states and rates of invasion, speciation and extinction, both contemporarily and in fossil records. However, limited sampling effort and spatial aggregation of organisms mean that biodiversity surveys rarely observe every species in the survey area. Here we present a non-parametric, asymptotic and bias-minimized richness estimator, Ω, by modelling how spatial abundance characteristics affect observation of species richness. Improved asymptotic estimators are critical when both absolute richness and difference detection are important. We conducted simulation tests and applied Ω to a tree census and a seaweed survey. Ω consistently outperforms other estimators in balancing bias, precision and difference detection accuracy. However, small difference detection is poor with any asymptotic estimator. An R package, Richness, performs the proposed richness estimations along with other asymptotic estimators and bootstrapped precisions. Our results explain how natural and observer-induced variations affect species observation, how these factors can be used to correct observed richness using the estimator Ω on a variety of data, and why further improvements are critical for biodiversity assessments. This article is part of the theme issue 'Detecting and attributing the causes of biodiversity change: needs, gaps and solutions'.
Affiliation(s)
- Eden W Tekwa
  - Department of Zoology, University of British Columbia, Vancouver, V6T 1Z4 British Columbia, Canada
  - Hakai Institute, Heriot Bay, V0P 1H0 British Columbia, Canada
  - Department of Biology, McGill University, H3A 1B1 Montreal, Quebec, Canada
- Matthew A Whalen
  - Department of Botany, University of British Columbia, Vancouver, V6T 1Z4 British Columbia, Canada
  - Hakai Institute, Heriot Bay, V0P 1H0 British Columbia, Canada
  - Department of Biology, Virginia State University, Petersburg, 23806 VA, USA
- Patrick T Martone
  - Department of Botany, University of British Columbia, Vancouver, V6T 1Z4 British Columbia, Canada
- Mary I O'Connor
  - Department of Zoology, University of British Columbia, Vancouver, V6T 1Z4 British Columbia, Canada
2
A computationally efficient approach to estimating species richness and rarefaction curve. Comput Stat 2022. [DOI: 10.1007/s00180-021-01185-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
3
Li CT, Li KH. Species abundance distribution and species accumulation curve: a general framework and results. Electron J Stat 2022. [DOI: 10.1214/22-ejs2072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Cheuk Ting Li
  - Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Kim-Hung Li
  - Asian Cities Research Centre Ltd., New Treasure Centre, San Po Kong, Hong Kong SAR, China
4
Deng C, Daley T, De Sena Brandine G, Smith AD. Molecular Heterogeneity in Large-Scale Biological Data: Techniques and Applications. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021339] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/09/2022]
Abstract
High-throughput sequencing technologies have evolved at a stellar pace for almost a decade and have greatly advanced our understanding of genome biology. In these sampling-based technologies, there is an important detail that is often overlooked in the analysis of the data and the design of the experiments, specifically that the sampled observations often do not give a representative picture of the underlying population. This has long been recognized as a problem in statistical ecology and in the broader statistics literature. In this review, we discuss the connections between these fields, methodological advances that parallel both the needs and opportunities of large-scale data analysis, and specific applications in modern biology. In the process we describe unique aspects of applying these approaches to sequencing technologies, including sequencing error, population and individual heterogeneity, and the design of experiments.
Affiliation(s)
- Chao Deng
  - Department of Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
- Timothy Daley
  - Department of Statistics and Department of Bioengineering, Stanford University, Stanford, California 94305, USA
- Guilherme De Sena Brandine
  - Department of Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
- Andrew D. Smith
  - Department of Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
5
François Koladjo B, Ohannessian MI, Gassiat E. A Truncation Model for Estimating Species Richness. Int J Biostat 2018; 15:ijb-2017-0035. [PMID: 30048236 DOI: 10.1515/ijb-2017-0035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2017] [Accepted: 06/19/2018] [Indexed: 11/15/2022]
Abstract
We propose a truncation model for the abundance distribution in species richness estimation. This model is inherently semiparametric and incorporates an unknown truncation threshold between rare and abundant observations. Using the conditional likelihood, we derive a class of estimators for the parameters in this model by stepwise maximization. The species richness estimator is given by the integer maximizing the binomial likelihood, given all other parameters in the model. Under regularity conditions, we show that our estimators of the model parameters are asymptotically efficient. We recover Chao's lower bound estimator of species richness when the parametric part of the model is single-component Poisson. Thus our class of estimators strictly generalizes the latter. We illustrate the performance of the proposed method in a simulation study, and compare it favorably to other widely used estimators. We also give an application to estimating the number of distinct vocabulary words in French playwright Molière's Tartuffe.
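The abstract above names Chao's lower bound as the special case recovered under a single-component Poisson model. As a point of reference, here is a minimal sketch of the classical Chao1 lower bound only, not the authors' truncation estimator. The function name and input format (a list of per-species counts) are assumptions made for illustration.

```python
from collections import Counter


def chao1_lower_bound(abundances):
    """Classical Chao1 lower bound on species richness.

    `abundances` is a hypothetical input: one observed count per species.
    Species with a count of zero are ignored.
    """
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)            # number of species actually observed
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]      # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2


print(chao1_lower_bound([5, 3, 1, 1, 1, 2, 2, 0]))
```

With three singletons and two doubletons among seven observed species, the estimate adds 9/4 unseen species to the observed count.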
Affiliation(s)
- Elisabeth Gassiat
  - Laboratoire de Mathématiques d'Orsay, Université Paris-Saclay, Univ. Paris-Sud, CNRS, 91405 Orsay, France
6
Shestopaloff K, Escobar MD, Xu W. Analyzing differences between microbiome communities using mixture distributions. Stat Med 2018; 37:4036-4053. [PMID: 30039541 DOI: 10.1002/sim.7896] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 06/02/2017] [Revised: 04/09/2018] [Accepted: 06/13/2018] [Indexed: 01/14/2023]
Abstract
In this paper, we present a method to assess differences between microbiome communities that effectively models sparse count data and accounts for presence-absence bias frequently encountered when zeros are present. We assume that the observed data for each operational taxonomic unit is Poisson generated with the rate for each sample originating from an underlying rate distribution. We propose to model this distribution using a mixture model, specifying the components based on the posterior rate distribution of a count and estimating the optimal weights using a least squares objective function. The distribution incorporates varying resolutions of samples, a point mass for differentiating structural and nonstructural zeros, and a truncation point mass to account for high values that are too sparse to model. As mixture component specification is not always straightforward, a method to estimate a joint model from several mixture distributions using minimum distances of bootstrap iterates is proposed. Once the population rate distribution is approximated, we obtain sample-specific distributions by conditioning on the observed operational taxonomic unit count, resolution, and estimated mixture distribution and then use these to estimate pairwise distances for a permutation test. The method gives an accurate estimate of the true proportion of zeros for presence-absence, effectively models the distribution of low counts using the mixture distribution, and achieves good power for detecting differences in a variety of scenarios. The method is tested using a simulation study and applied to two microbiome datasets. In each case, the results are compared with a number of existing methods.
Affiliation(s)
- Konstantin Shestopaloff
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
  - Krembil Research Institute, University Health Network, Toronto, Ontario, Canada
- Michael D Escobar
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Wei Xu
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
7
Abstract
Rainfall modeling is significant for prediction and forecasting purposes in agriculture, weather derivatives, hydrology, and risk and disaster preparedness. Normally, two models are used to represent rainfall as a chain-dependent process: one for the occurrence and one for the intensity of rainfall. These two models help in understanding the physical features and dynamics of the rainfall process. However, rainfall data are zero-inflated and exhibit overdispersion, which such models always underestimate. In this study we have modeled the two processes simultaneously as a compound Poisson process. The rainfall events are modeled as a Poisson process while the intensity of each rainfall event is Gamma distributed. We minimize overdispersion by introducing the dispersion parameter in the model implemented through Tweedie distributions. Simulated rainfall data from the model resemble the actual rainfall data in terms of seasonal variation, means, variance, and magnitude. The model also provides mechanisms for small but important properties of the rainfall process. The model developed can be used in forecasting and predicting rainfall amounts and occurrences, which is important in weather derivatives, agriculture, hydrology, and prediction of drought and flood occurrences.
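The compound Poisson structure described above (a Poisson number of rainfall events per period, each with a Gamma-distributed intensity) can be illustrated with a short simulation. This is a sketch under stated assumptions, not the authors' fitted model: the function names and all parameter values are hypothetical, and no Tweedie fitting or seasonal variation is included.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson variate with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_rainfall(days, event_rate, shape, scale, seed=0):
    """Daily rainfall totals from a compound Poisson-Gamma process.

    All arguments are illustrative assumptions: `event_rate` is the mean number
    of rainfall events per day; `shape` and `scale` describe the Gamma intensity
    of each event.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(days):
        n_events = sample_poisson(event_rate, rng)
        # A day with zero events contributes an exact zero (a dry day).
        totals.append(sum(rng.gammavariate(shape, scale) for _ in range(n_events)))
    return totals


totals = simulate_rainfall(days=1000, event_rate=0.5, shape=2.0, scale=3.0, seed=1)
dry_fraction = sum(1 for x in totals if x == 0) / len(totals)
print(f"mean daily rainfall: {sum(totals) / len(totals):.2f}, dry days: {dry_fraction:.0%}")
```

The exact zeros on dry days are what make the simulated series zero-inflated; the theoretical mean is `event_rate * shape * scale` and the dry-day probability is `exp(-event_rate)`.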
8
Chee CS, Wang Y. Nonparametric estimation of species richness using discrete k-monotone distributions. Comput Stat Data Anal 2016. [DOI: 10.1016/j.csda.2014.10.021] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
9
Deng C, Daley T, Smith AD. Applications of species accumulation curves in large-scale biological data analysis. Quantitative Biology 2015; 3:135-144. [PMID: 27252899 DOI: 10.1007/s40484-015-0049-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Indexed: 10/22/2022]
Abstract
The species accumulation curve, or collector's curve, of a population gives the expected number of observed species or distinct classes as a function of sampling effort. Species accumulation curves allow researchers to assess and compare diversity across populations or to evaluate the benefits of additional sampling. Traditional applications have focused on ecological populations but emerging large-scale applications, for example in DNA sequencing, are orders of magnitude larger and present new challenges. We developed a method to estimate accumulation curves for predicting the complexity of DNA sequencing libraries. This method uses rational function approximations to a classical non-parametric empirical Bayes estimator due to Good and Toulmin [Biometrika, 1956, 43, 45-63]. Here we demonstrate how the same approach can be highly effective in other large-scale applications involving biological data sets. These include estimating microbial species richness, immune repertoire size, and k-mer diversity for genome assembly applications. We show how the method can be modified to address populations containing an effectively infinite number of species where saturation cannot practically be attained. We also introduce a flexible suite of tools implemented as an R package that make these methods broadly accessible.
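The abstract above builds on the non-parametric empirical Bayes estimator of Good and Toulmin. A minimal sketch of that classical power series follows; it deliberately omits the rational function approximations the authors describe, so it should only be trusted for extrapolation factors `t` of at most 1. The function name and input format are assumptions for illustration.

```python
from collections import Counter


def good_toulmin_new_species(abundances, t):
    """Expected number of new species in a further sample t times the original size.

    Classical Good-Toulmin series: sum over j of (-1)^(j+1) * t^j * f_j, where
    f_j is the number of species observed exactly j times. `abundances` is a
    hypothetical input holding one count per observed species.
    """
    freq = Counter(n for n in abundances if n > 0)
    return sum((-1) ** (j + 1) * t ** j * f_j for j, f_j in freq.items())


# Three singletons, two doubletons, one species seen five times.
print(good_toulmin_new_species([1, 1, 1, 2, 2, 5], t=0.5))
```

For `t` greater than 1 the alternating terms grow geometrically and the raw series becomes unstable, which is the practical motivation for smoothing or approximating it.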
Affiliation(s)
- Chao Deng
  - Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Timothy Daley
  - Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Andrew D Smith
  - Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
10
Rösner S, Brandl R, Segelbacher G, Lorenc T, Müller J. Noninvasive genetic sampling allows estimation of capercaillie numbers and population structure in the Bohemian Forest. Eur J Wildlife Res 2014. [DOI: 10.1007/s10344-014-0848-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
11
Chiu CH, Wang YT, Walther BA, Chao A. An improved nonparametric lower bound of species richness via a modified Good-Turing frequency formula. Biometrics 2014; 70:671-82. [DOI: 10.1111/biom.12200] [Citation(s) in RCA: 154] [Impact Index Per Article: 15.4] [Received: 05/01/2013] [Revised: 04/01/2014] [Accepted: 04/01/2014] [Indexed: 11/26/2022]
Affiliation(s)
- Chun-Huo Chiu
  - Institute of Statistics; National Tsing Hua University; Hsin-Chu 30043 Taiwan
- Yi-Ting Wang
  - Institute of Statistics; National Tsing Hua University; Hsin-Chu 30043 Taiwan
- Bruno A. Walther
  - Master Program in Global Health and Development, College of Public Health and Nutrition; Taipei Medical University; 250 Wu-Hsing St., Taipei 110 Taiwan
- Anne Chao
  - Institute of Statistics; National Tsing Hua University; Hsin-Chu 30043 Taiwan
12
Wilson JR, Tzeng WP, Spesock A, Music N, Guo Z, Barrington R, Stevens J, Donis RO, Katz JM, York IA. Diversity of the murine antibody response targeting influenza A(H1N1pdm09) hemagglutinin. Virology 2014; 458-459:114-24. [PMID: 24928044 PMCID: PMC4904151 DOI: 10.1016/j.virol.2014.04.011] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 02/05/2014] [Revised: 03/21/2014] [Accepted: 04/09/2014] [Indexed: 11/29/2022]
Abstract
We infected mice with the 2009 influenza A pandemic virus (H1N1pdm09), boosted with an inactivated vaccine, and cloned immunoglobulins (Igs) from HA-specific B cells. Based on the redundancy in germline gene utilization, we inferred that between 72 and 130 unique IgH VDJ and 35 different IgL VJ combinations comprised the anti-HA recall response. The IgH VH1 and IgL VK14 variable gene families were employed most frequently. A representative panel of antibodies was cloned and expressed to confirm reactivity with H1N1pdm09 HA. The majority of the recombinant antibodies were of high avidity and capable of inhibiting H1N1pdm09 hemagglutination. Three of these antibodies were subtype-specific cross-reactive, binding to the HA of A/South Carolina/1/1918(H1N1), and one further reacted with A/swine/Iowa/15/1930(H1N1). These results help to define the genetic diversity of the influenza anti-HA antibody repertoire profile induced following infection and vaccination, which may facilitate the development of influenza vaccines that are more protective and broadly neutralizing. Importance: Protection against influenza viruses is mediated mainly by antibodies, and in most cases this antibody response is narrow, only providing protection against closely related viruses. In spite of this limited range of protection, recent findings indicate that individuals immune to one influenza virus may contain antibodies (generally a minority of the overall response) that are more broadly reactive. These findings have raised the possibility that influenza vaccines could induce a more broadly protective response, reducing the need for frequent vaccine strain changes. However, interpretation of these observations is hampered by the lack of quantitative characterization of the antibody repertoire. In this study, we used single-cell cloning of influenza HA-specific B cells to assess the diversity and nature of the antibody response to influenza hemagglutinin in mice. Our findings help to put bounds on the diversity of the anti-hemagglutinin antibody response, as well as characterizing the cross-reactivity, affinity, and molecular nature of the antibody response.
Affiliation(s)
- Jason R Wilson
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Wen-Pin Tzeng
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- April Spesock
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Nedzad Music
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Zhu Guo
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- James Stevens
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Ruben O Donis
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Jacqueline M Katz
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Ian A York
  - Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
13
Livingston G, Jha S, Vega A, Gilbert L. Conservation value and permeability of neotropical oil palm landscapes for orchid bees. PLoS One 2013; 8:e78523. [PMID: 24147137 PMCID: PMC3798381 DOI: 10.1371/journal.pone.0078523] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 07/12/2013] [Accepted: 09/20/2013] [Indexed: 11/22/2022] Open
Abstract
The proliferation of oil palm plantations has led to dramatic changes in tropical landscapes across the globe. However, relatively little is known about the effects of oil palm expansion on biodiversity, especially in key ecosystem-service providing organisms like pollinators. Rapid land use change is exacerbated by limited knowledge of the mechanisms causing biodiversity decline in the tropics, particularly those involving landscape features. We examined these mechanisms by undertaking a survey of orchid bees, a well-known group of Neotropical pollinators, across forest and oil palm plantations in Costa Rica. We used chemical baits to survey the community in four regions: continuous forest sites, oil palm sites immediately adjacent to forest, oil palm sites 2 km from forest, and oil palm sites greater than 5 km from forest. We found that although orchid bees are present in all environments, orchid bee communities diverged across the gradient, and community richness, abundance, and similarity to forest declined as distance from forest increased. In addition, mean phylogenetic distance of the orchid bee community declined and was more clustered in oil palm. Community traits also differed with individuals in oil palm having shorter average tongue length and larger average geographic range size than those in the forest. Our results indicate two key features about Neotropical landscapes that contain oil palm: 1) oil palm is selectively permeable to orchid bees and 2) orchid bee communities in oil palm have distinct phylogenetic and trait structure compared to communities in forest. These results suggest that conservation and management efforts in oil palm-cultivating regions should focus on landscape features.
Affiliation(s)
- George Livingston
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, United States of America
- Shalene Jha
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, United States of America
- Lawrence Gilbert
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, United States of America
14
Bissiri PG, Ongaro A, Walker SG. Species sampling models: consistency for the number of species. Biometrika 2013. [DOI: 10.1093/biomet/ast006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
15
Hernando L, Mendiburu A, Lozano JA. An evaluation of methods for estimating the number of local optima in combinatorial optimization problems. Evolutionary Computation 2013; 21:625-658. [PMID: 23270389 DOI: 10.1162/evco_a_00100] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/01/2023]
Abstract
The solution of many combinatorial optimization problems is carried out by metaheuristics, which generally make use of local search algorithms. These algorithms use some kind of neighborhood structure over the search space. The performance of the algorithms strongly depends on the properties that the neighborhood imposes on the search space. One of these properties is the number of local optima. Given an instance of a combinatorial optimization problem and a neighborhood, the estimation of the number of local optima can help not only to measure the complexity of the instance, but also to choose the most convenient neighborhood to solve it. In this paper we review and evaluate several methods to estimate the number of local optima in combinatorial optimization problems. The methods reviewed not only come from the combinatorial optimization literature, but also from the statistical literature. A thorough evaluation in synthetic as well as real problems is given. We conclude by providing recommendations of methods for several scenarios.
Affiliation(s)
- Leticia Hernando
  - Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, 20018 San Sebastián, Spain
16
Zhang H, Ghosh K, Ghosh P. Sampling designs via a multivariate hypergeometric-Dirichlet process model for a multi-species assemblage with unknown heterogeneity. Comput Stat Data Anal 2012. [DOI: 10.1016/j.csda.2012.02.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022]
17
Predicting nucleosome positioning using a duration Hidden Markov Model. BMC Bioinformatics 2010; 11:346. [PMID: 20576140 PMCID: PMC2900280 DOI: 10.1186/1471-2105-11-346] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.4] [Received: 04/12/2010] [Accepted: 06/24/2010] [Indexed: 11/30/2022] Open
Abstract
Background: The nucleosome is the fundamental packing unit of DNA in eukaryotic cells. Its detailed positioning on the genome is closely related to chromosome functions. Increasing evidence has shown that genomic DNA sequence itself is highly predictive of nucleosome positioning genome-wide. Therefore a fast software tool for predicting nucleosome positioning can help in understanding how a genome's nucleosome organization may facilitate genome function. Results: We present a duration Hidden Markov model for nucleosome positioning prediction by explicitly modeling the linker DNA length. The nucleosome and linker models trained from yeast data are re-scaled when making predictions for other species to adjust for differences in base composition. A software tool named NuPoP is developed in three formats for free download. Conclusions: Simulation studies show that modeling the linker length distribution and utilizing a base composition re-scaling method both improve the prediction of nucleosome positioning regarding sensitivity and false discovery rate. NuPoP provides a user-friendly software tool for predicting the nucleosome occupancy and the most probable nucleosome positioning map for genomic sequences of any size. When compared with two existing methods, NuPoP shows improved performance in sensitivity.