1
Sankaran K, Jeganathan P. mbtransfer: Microbiome intervention analysis using transfer functions and mirror statistics. PLoS Comput Biol 2024; 20:e1012196. [PMID: 38875277 PMCID: PMC11210883 DOI: 10.1371/journal.pcbi.1012196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Revised: 06/27/2024] [Accepted: 05/27/2024] [Indexed: 06/16/2024] Open
Abstract
Time series studies of microbiome interventions provide valuable data about microbial ecosystem structure. Unfortunately, existing models of microbial community dynamics have limited temporal memory and expressivity, relying on Markov or linearity assumptions. To address this, we introduce a new class of models based on transfer functions. These models learn impulse responses, capturing the potentially delayed effects of environmental changes on the microbial community. This allows us to simulate trajectories under hypothetical interventions and select significantly perturbed taxa with False Discovery Rate guarantees. Through simulations, we show that our approach effectively reduces forecasting errors compared to strong baselines and accurately pinpoints taxa of interest. Our case studies highlight the interpretability of the resulting differential response trajectories. An R package, mbtransfer, and notebooks to replicate the simulation and case studies are provided.
Affiliation(s)
- Kris Sankaran
- Department of Statistics, University of Wisconsin - Madison, Madison, Wisconsin, United States of America
- Pratheepa Jeganathan
- Department of Mathematics & Statistics, McMaster University, Hamilton, Ontario, Canada
2
Servius L, Pigoli D, Ng J, Fraternali F. Predicting class switch recombination in B-cells from antibody repertoire data. Biom J 2024; 66:e2300171. [PMID: 38785212 DOI: 10.1002/bimj.202300171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 03/01/2024] [Accepted: 03/07/2024] [Indexed: 05/25/2024]
Abstract
Statistical and machine learning methods have proved useful in many areas of immunology. In this paper, we address for the first time the problem of predicting the occurrence of class switch recombination (CSR) in B-cells, a problem of interest in understanding antibody response under immunological challenges. We propose a framework to analyze antibody repertoire data, based on clonal group (CG) representation, in a way that allows us to predict CSR events using CG-level features as input. We assess and compare the performance of several predictive models (logistic regression, LASSO logistic regression, random forest, and support vector machine) in carrying out this task. The proposed approach can obtain an unweighted average recall of 71% with models based on variable region descriptors and measures of CG diversity during an immune challenge and, most notably, before an immune challenge.
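For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an unweighted average recall (the mean of per-class recalls, so that each class counts equally regardless of its size) could be computed. The function name and the toy labels are hypothetical and are not taken from the paper; this is an illustration of the metric, not the authors' evaluation code.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class has equal weight."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 1 = CSR event, 0 = no CSR event.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Class 1 recall is 3/4 and class 0 recall is 1/2, so the result is 0.625,
# whereas plain accuracy would be 4/6.
uar = unweighted_average_recall(y_true, y_pred)
print(uar)
```

Because each class contributes equally, this metric is less flattering than accuracy when the majority class is easy to predict, which is presumably why it suits imbalanced problems of this kind.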
Affiliation(s)
- Lutecia Servius
- Department of Mathematics, King's College London, London, UK
- Davide Pigoli
- Department of Mathematics, King's College London, London, UK
- Joseph Ng
- Institute of Structural and Molecular Biology, University College London, London, UK
- Franca Fraternali
- Institute of Structural and Molecular Biology, University College London, London, UK
3
Zhu R, Luo W, Grieneisen ML, Zuoqiu S, Zhan Y, Yang F. A novel approach to deriving the fine-scale daily NO2 dataset during 2005-2020 in China: Improving spatial resolution and temporal coverage to advance exposure assessment. ENVIRONMENTAL RESEARCH 2024; 249:118381. [PMID: 38331142 DOI: 10.1016/j.envres.2024.118381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2023] [Revised: 01/22/2024] [Accepted: 01/30/2024] [Indexed: 02/10/2024]
Abstract
Surface NO2 pollution can result in serious health consequences such as cardiovascular disease, asthma, and premature mortality. Due to the extensive spatial variation in surface NO2, the spatial resolution of a NO2 dataset has a significant impact on exposure and health impact assessment. There is currently no long-term, high-resolution, and publicly available NO2 dataset for China. To fill this gap, this study generated a NO2 dataset named RBE-DS-NO2 for China during 2005-2020 at 1 km and daily resolution. We employed the robust back-extrapolation via a data augmentation approach (RBE-DA) to ensure the predictive accuracy in back-extrapolation before 2013, and utilized an improved spatial downscaling technique (DS) to refine the spatial resolution from 10 km to 1 km. Back-extrapolation validation based on 2005-2012 observations from sites in Taiwan province yielded an R2 of 0.72 and RMSE of 10.7 μg/m3, while cross-validation across China during 2013-2020 showed an R2 of 0.73 and RMSE of 9.6 μg/m3. RBE-DS-NO2 better captured spatiotemporal variation of surface NO2 in China compared to the existing publicly available datasets. Exposure assessment using RBE-DS-NO2 shows that the population living in non-attainment areas (NO2 ≥ 30 μg/m3) grew from 376 million in 2005 to 612 million in 2012, then declined to 404 million by 2020. Unlike this national trend, exposure levels in several major cities (e.g., Shanghai and Chengdu) continued to increase during 2012-2020, driven by population growth and urban migration. Furthermore, this study revealed that the low-resolution dataset (i.e., the 10 km intermediate dataset before the downscaling) overestimated NO2 levels, due to the limited specificity of the low-resolution model in simulating the relationship between NO2 and the predictor variables. Such limited specificity likely biased previous long-term NO2 exposure and health impact studies employing low-resolution datasets. The RBE-DS-NO2 dataset enables robust long-term assessments of NO2 exposure and health impacts in China.
Affiliation(s)
- Rongxin Zhu
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan, 610065, China; College of Carbon Neutrality Future Technology, Sichuan University, Chengdu, Sichuan, 610065, China
- Wenfeng Luo
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan, 610065, China
- Michael L Grieneisen
- Department of Land, Air, and Water Resources, University of California, Davis, CA, 95616, United States
- Sophia Zuoqiu
- Pittsburgh Institute, Sichuan University, Chengdu, Sichuan, 610207, China
- Yu Zhan
- College of Carbon Neutrality Future Technology, Sichuan University, Chengdu, Sichuan, 610065, China
- Fumo Yang
- College of Carbon Neutrality Future Technology, Sichuan University, Chengdu, Sichuan, 610065, China
4
Shen B, Coruzzi GM, Shasha D. Bipartite networks represent causality better than simple networks: evidence, algorithms, and applications. Front Genet 2024; 15:1371607. [PMID: 38798697 PMCID: PMC11120958 DOI: 10.3389/fgene.2024.1371607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Accepted: 04/17/2024] [Indexed: 05/29/2024] Open
Abstract
A network, whose nodes are genes and whose directed edges represent positive or negative influences of a regulatory gene on its targets, is often used as a representation of causality. To infer a network, researchers often develop a machine learning model and then evaluate the model based on its match with experimentally verified "gold standard" edges. The desired result of such a model is a network that may extend the gold standard edges. Since networks are a form of visual representation, one can compare their utility with architectural or machine blueprints. Blueprints are clearly useful because they provide precise guidance to builders in construction. If the primary role of gene regulatory networks is to characterize causality, then such networks should be good tools of prediction because prediction is the actionable benefit of knowing causality. But are they? In this paper, we compare prediction quality based on "gold standard" regulatory edges from previous experimental work with non-linear models inferred from time series data across four different species. We show that the same non-linear machine learning models have better predictive performance, with improvements from 5.3% to 25.3% in terms of the reduction in the root mean square error (RMSE) compared with the same models based on the gold standard edges. Having established that networks fail to characterize causality properly, we suggest that causality research should focus on four goals: (i) predictive accuracy; (ii) a parsimonious enumeration of predictive regulatory genes for each target gene g; (iii) the identification of disjoint sets of predictive regulatory genes for each target g of roughly equal accuracy; and (iv) the construction of a bipartite network (whose node types are genes and models) representation of causality. We provide algorithms for all goals.
Affiliation(s)
- Bingran Shen
- Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, United States
- Gloria M. Coruzzi
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, United States
- Dennis Shasha
- Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, United States
5
Rafiee M, Jahangiri-Rad M, Mohseni-Bandpei A, Razmi E. Impacts of socioeconomic and environmental factors on neoplasms incidence rates using machine learning and GIS: a cross-sectional study in Iran. Sci Rep 2024; 14:10604. [PMID: 38719879 PMCID: PMC11078954 DOI: 10.1038/s41598-024-61397-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 05/06/2024] [Indexed: 05/12/2024] Open
Abstract
Neoplasm is an umbrella term used to describe either benign or malignant conditions. The correlations between socioeconomic and environmental factors and the occurrence of new-onset neoplasms have already been demonstrated in a body of research. Nevertheless, few studies have specifically dealt with the nature of the relationship, the significance of risk factors, and their geographic variation, particularly in low- and middle-income communities. This study, thus, set out to (1) analyze spatiotemporal variations of the age-adjusted incidence rate (AAIR) of neoplasms in Iran throughout five time periods, (2) investigate relationships between a collection of environmental and socioeconomic indicators and the AAIR of neoplasms all over the country, and (3) evaluate geographical alterations in their relative importance. Our cross-sectional study design was based on county-level data from 2010 to 2020. AAIR of neoplasms data was acquired from the Institute for Health Metrics and Evaluation (IHME). HotSpot analyses and Anselin Local Moran's I indices were deployed to precisely identify high- and low-risk clusters of the AAIR of neoplasms. Multi-scale geographically weighted regression (MGWR) analysis was performed to evaluate the association between each explanatory variable and the AAIR of neoplasms. Utilizing random forests (RF), we also examined the relationships between environmental (e.g., UV index and PM2.5 concentration) and socioeconomic (e.g., Gini coefficient and literacy rate) factors and the AAIR of neoplasms. The AAIR of neoplasms displayed a significant increasing trend over the study period. According to the MGWR, the only factor that significantly varied spatially and was associated with the AAIR of neoplasms in Iran was the UV index. The RF model showed good accuracy on both training and testing data, with R2 values greater than 0.91 and 0.92, respectively. UV index and Gini coefficient ranked as the most important variables in the prediction of the AAIR of neoplasms, based on the relative influence of each variable. More research using machine learning approaches that take advantage of considering all possible determinants is required to assess the outcomes of health strategies and properly formulate policy planning.
Affiliation(s)
- Mohammad Rafiee
- Air Quality and Climate Change Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Department of Environmental Health Engineering, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mahsa Jahangiri-Rad
- Department of Environmental Health Engineering, School of Health, Tehran Medical Sciences, Islamic Azad University, Tehran, Iran
- Water Purification Research Center, Islamic Azad University, Tehran, Iran
- Anoushiravan Mohseni-Bandpei
- Air Quality and Climate Change Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Department of Environmental Health Engineering, School of Public Health and Safety, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Elham Razmi
- Department of Environmental Health Engineering, School of Public Health, Iran University of Medical Sciences, Tehran, Iran
6
Chakraborty S, Guan Z, Begg CB, Shen R. Topical hidden genome: discovering latent cancer mutational topics using a Bayesian multilevel context-learning approach. Biometrics 2024; 80:ujae030. [PMID: 38682463 PMCID: PMC11056772 DOI: 10.1093/biomtc/ujae030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 03/18/2024] [Accepted: 04/04/2024] [Indexed: 05/01/2024]
Abstract
Inferring the cancer-type specificities of ultra-rare, genome-wide somatic mutations is an open problem. Traditional statistical methods cannot handle such data due to their ultra-high dimensionality and extreme data sparsity. To harness information in rare mutations, we have recently proposed a formal multilevel multilogistic "hidden genome" model. Through its hierarchical layers, the model condenses information in ultra-rare mutations through meta-features embodying mutation contexts to characterize cancer types. Consistent, scalable point estimation of the model can incorporate tens of millions of variants across thousands of tumors and permit impressive prediction and attribution. However, principled statistical inference is infeasible due to the volume, correlation, and noninterpretability of mutation contexts. In this paper, we propose a novel framework that leverages topic models from computational linguistics to effectuate dimension reduction of mutation contexts, producing interpretable, decorrelated meta-feature topics. We propose an efficient MCMC algorithm for implementation that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of existing out-of-the-box inferential high-dimensional multi-class regression methods and software. Applying our model to the Pan Cancer Analysis of Whole Genomes dataset reveals interesting biological insights including somatic mutational topics associated with UV exposure in skin cancer, aging in colorectal cancer, and strong influence of epigenome organization in liver cancer. Under cross-validation, our model demonstrates highly competitive predictive performance against blackbox methods of random forest and deep learning.
Affiliation(s)
- Saptarshi Chakraborty
- Department of Biostatistics, State University of New York at Buffalo, Buffalo, NY 14214, USA
- Zoe Guan
- Biostatistics Center, Mass General Research Institute, Boston, MA 02114, USA
- Colin B Begg
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA
- Ronglai Shen
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA
7
Huang B, Kong L, Wang C, Ju F, Zhang Q, Zhu J, Gong T, Zhang H, Yu C, Zheng WM, Bu D. Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023; 21:913-925. [PMID: 37001856 PMCID: PMC10928435 DOI: 10.1016/j.gpb.2022.11.014] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/15/2022] [Revised: 11/23/2022] [Accepted: 11/30/2022] [Indexed: 03/31/2023]
Abstract
Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start from assuming a probability distribution of protein structures given a target sequence and then find the most likely structure, while computer scientists formulate protein structure prediction as an optimization problem - finding the structural conformation with the lowest energy or minimizing the difference between predicted structure and native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts for protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, the algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories interpreting the neural networks and knowledge on protein folding are still highly desired.
Affiliation(s)
- Bin Huang
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Lupeng Kong
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Changping Laboratory, Beijing 102206, China
- Chao Wang
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Fusong Ju
- Microsoft Research AI4Science, Beijing 100080, China
- Qi Zhang
- Huawei Noah's Ark Lab, Wuhan 430206, China
- Jianwei Zhu
- Microsoft Research AI4Science, Beijing 100080, China
- Tiansu Gong
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Haicang Zhang
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
- Chungong Yu
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
- Wei-Mou Zheng
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
- Dongbo Bu
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
8
Ringwaldt EM, Brook BW, Buettel JC, Cunningham CX, Fuller C, Gardiner R, Hamer R, Jones M, Martin AM, Carver S. Host, environment, and anthropogenic factors drive landscape dynamics of an environmentally transmitted pathogen: Sarcoptic mange in the bare-nosed wombat. J Anim Ecol 2023; 92:1786-1801. [PMID: 37221666 DOI: 10.1111/1365-2656.13960] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 12/21/2022] [Accepted: 05/09/2023] [Indexed: 05/25/2023]
Abstract
Understanding the spatial dynamics and drivers of wildlife pathogens is constrained by sampling logistics, with implications for advancing the field of landscape epidemiology and targeted allocation of management resources. However, visually apparent wildlife diseases, when combined with remote-surveillance and distribution modelling technologies, present an opportunity to overcome this landscape-scale problem. Here, we investigated dynamics and drivers of landscape-scale wildlife disease, using clinical signs of sarcoptic mange (caused by Sarcoptes scabiei) in its bare-nosed wombat (BNW; Vombatus ursinus) host. We used 53,089 camera-trap observations from over 3261 locations across the 68,401 km2 area of Tasmania, Australia, combined with landscape data and ensemble species distribution modelling (SDM). We investigated: (1) landscape variables predicted to drive habitat suitability of the host; (2) host and landscape variables associated with clinical signs of disease in the host; and (3) predicted locations and environmental conditions at greatest risk of disease occurrence, including some Bass Strait islands where BNW translocations are proposed. We showed that the Tasmanian landscape, and ecosystems therein, are nearly ubiquitously suited to BNWs. Only high mean annual precipitation reduced habitat suitability for the host. In contrast, clinical signs of sarcoptic mange disease in BNWs were widespread, but heterogeneously distributed across the landscape. Mange (which is environmentally transmitted in BNWs) was most likely to be observed in areas of increased host habitat suitability, lower annual precipitation, near sources of freshwater and where topographic roughness was minimal (e.g. human modified landscapes, such as farmland and intensive land-use areas, shrub and grass lands). Thus, a confluence of host, environmental and anthropogenic variables appears to influence the risk of environmental transmission of S. scabiei. We identified that the Bass Strait Islands are highly suitable for BNWs and predicted a mix of high and low suitability for the pathogen. This study is the largest spatial assessment of sarcoptic mange in any host species, and advances understanding of the landscape epidemiology of environmentally transmitted S. scabiei. This research illustrates how host-pathogen co-suitability can be useful for allocating management resources in the landscape.
Affiliation(s)
- E M Ringwaldt
- School of Natural Sciences, Biological Science, University of Tasmania, Hobart, Tasmania, Australia
- B W Brook
- School of Natural Sciences, Biological Science, University of Tasmania, Hobart, Tasmania, Australia
- J C Buettel
- School of Natural Sciences, Biological Science, University of Tasmania, Hobart, Tasmania, Australia
- C X Cunningham
- School of Natural Sciences, Biological Science, University of Tasmania, Hobart, Tasmania, Australia
- School of Environmental and Forest Sciences, University of Washington, Seattle, Washington, USA
- C Fuller
- School of Natural Sciences, Biological Science, University of Tasmania, Hobart, Tasmania, Australia
- R Gardiner
- School of Science, Engineering and Technology, University of Sunshine Coast, Sippy Downs, Queensland, Australia
- R Hamer
- School of Natural Sciences, Biological Science, University of Tasmania, Hobart, Tasmania, Australia
- M Jones
- School of Natural Sciences, Biological Science, University of Tasmania, Hobart, Tasmania, Australia
- A M Martin
- Caesar Kleberg Wildlife Research Institute, Texas A&M University-Kingsville, Kingsville, Texas, USA
- S Carver
- School of Natural Sciences, Biological Science, University of Tasmania, Hobart, Tasmania, Australia
9
Data driven contagion risk management in low-income countries using machine learning applications with COVID-19 in South Asia. Sci Rep 2023; 13:3732. [PMID: 36878910 PMCID: PMC9987367 DOI: 10.1038/s41598-023-30348-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Accepted: 02/21/2023] [Indexed: 03/08/2023] Open
Abstract
In the absence of real-time surveillance data, it is difficult to derive an early warning system and potential outbreak locations with the existing epidemiological models, especially in resource-constrained countries. We proposed a contagion risk index (CR-Index), based on publicly available national statistics and founded on communicable disease spreadability vectors. Utilizing the daily COVID-19 data (positive cases and deaths) from 2020 to 2022, we developed country-specific and sub-national CR-Index for South Asia (India, Pakistan, and Bangladesh) and identified potential infection hotspots, aiding policymakers with efficient mitigation planning. Across the study period, the week-by-week and fixed-effects regression estimates demonstrate a strong correlation between the proposed CR-Index and sub-national (district-level) COVID-19 statistics. We validated the CR-Index using machine learning methods by evaluating the out-of-sample predictive performance. Machine learning driven validation showed that the CR-Index can correctly predict districts with high incidence of COVID-19 cases and deaths more than 85% of the time. This proposed CR-Index is a simple, replicable, and easily interpretable tool that can help low-income countries prioritize resource mobilization to contain the disease spread and associated crisis management with global relevance and applicability. This index can also help to contain future pandemics (and epidemics) and manage their far-reaching adverse consequences.
10
Candès E, Lei L, Ren Z. Conformalized survival analysis. J R Stat Soc Series B Stat Methodol 2023. [DOI: 10.1093/jrsssb/qkac004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2023]
Abstract
In this paper, we develop an inferential method based on conformal prediction, which can wrap around any survival prediction algorithm to produce calibrated, covariate-dependent lower predictive bounds on survival times. In the Type I right-censoring setting, when the censoring times are completely exogenous, the lower predictive bounds have guaranteed coverage in finite samples without any assumptions other than that of operating on independent and identically distributed data points. Under a more general conditionally independent censoring assumption, the bounds satisfy a doubly robust property which states the following: marginal coverage is approximately guaranteed if either the censoring mechanism or the conditional survival function is estimated well. The validity and efficiency of our procedure are demonstrated on synthetic data and real COVID-19 data from the UK Biobank.
Affiliation(s)
- Emmanuel Candès
- Department of Mathematics, Stanford University, Stanford, CA, USA
- Department of Statistics, Stanford University, Stanford, CA, USA
- Lihua Lei
- Department of Statistics, Stanford University, Stanford, CA, USA
- Graduate School of Business, Stanford University, Stanford, CA, USA
- Zhimei Ren
- Department of Statistics, University of Chicago, Chicago, IL, USA
11
Garrett KA, Bebber DP, Etherton BA, Gold KM, Plex Sulá AI, Selvaraj MG. Climate Change Effects on Pathogen Emergence: Artificial Intelligence to Translate Big Data for Mitigation. ANNUAL REVIEW OF PHYTOPATHOLOGY 2022; 60:357-378. [PMID: 35650670 DOI: 10.1146/annurev-phyto-021021-042636] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 05/20/2023]
Abstract
Plant pathology has developed a wide range of concepts and tools for improving plant disease management, including models for understanding and responding to new risks from climate change. Most of these tools can be improved using new advances in artificial intelligence (AI), such as machine learning to integrate massive data sets in predictive models. There is the potential to develop automated analyses of risk that alert decision-makers, from farm managers to national plant protection organizations, to the likely need for action and provide decision support for targeting responses. We review machine-learning applications in plant pathology and synthesize ideas for the next steps to make the most of these tools in digital agriculture. Global projects, such as the proposed global surveillance system for plant disease, will be strengthened by the integration of the wide range of new data, including data from tools like remote sensors, that are used to evaluate the risk of plant disease. There is exciting potential for the use of AI to strengthen global capacity building as well, from image analysis for disease diagnostics and associated management recommendations on farmers' phones to future training methodologies for plant pathologists that are customized in real-time for management needs in response to the current risks. International cooperation in integrating data and models will help develop the most effective responses to new challenges from climate change.
Affiliation(s)
- K A Garrett
- Plant Pathology Department, University of Florida, Gainesville, Florida, USA
- Food Systems Institute, University of Florida, Gainesville, Florida, USA
- Emerging Pathogens Institute, University of Florida, Gainesville, Florida, USA
- D P Bebber
- Department of Biosciences, University of Exeter, Exeter, United Kingdom
- B A Etherton
- Plant Pathology Department, University of Florida, Gainesville, Florida, USA
- Food Systems Institute, University of Florida, Gainesville, Florida, USA
- Emerging Pathogens Institute, University of Florida, Gainesville, Florida, USA
- K M Gold
- Plant Pathology and Plant Microbe Biology Section, School of Integrative Plant Sciences, Cornell AgriTech, Cornell University, Geneva, New York, USA
- A I Plex Sulá
- Plant Pathology Department, University of Florida, Gainesville, Florida, USA
- Food Systems Institute, University of Florida, Gainesville, Florida, USA
- Emerging Pathogens Institute, University of Florida, Gainesville, Florida, USA
- M G Selvaraj
- The Alliance of Bioversity International and the International Center for Tropical Agriculture (CIAT), Cali, Colombia
12
Fokkema M, Iliescu D, Greiff S, Ziegler M. Machine Learning and Prediction in Psychological Assessment. EUROPEAN JOURNAL OF PSYCHOLOGICAL ASSESSMENT 2022. [DOI: 10.1027/1015-5759/a000714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
Modern prediction methods from machine learning (ML) and artificial intelligence (AI) are becoming increasingly popular, including in the field of psychological assessment. These methods provide unprecedented flexibility for modeling large numbers of predictor variables and non-linear associations between predictors and responses. In this paper, we examine what these methods may contribute to the assessment of criterion validity, as well as their possible drawbacks. We apply a range of modern statistical prediction methods to a dataset for predicting the university major completed, based on the subscales and items of a scale for vocational preferences. The results indicate that logistic regression combined with regularization already performs strikingly well in terms of predictive accuracy. More sophisticated techniques for incorporating non-linearities can further contribute to predictive accuracy and validity, but often only marginally.
Affiliation(s)
- Marjolein Fokkema
- Methodology and Statistics Department, Institute of Psychology, Leiden University, The Netherlands
| | - Dragos Iliescu
- Faculty of Psychology and Educational Sciences, University of Bucharest, Romania
| | - Samuel Greiff
- Department of Behavioural and Cognitive Sciences, University of Luxembourg, Luxembourg
| | | |
|
13
|
Tan ZC, Murphy MC, Alpay HS, Taylor SD, Meyer AS. Tensor-structured decomposition improves systems serology analysis. Mol Syst Biol 2021; 17:e10243. [PMID: 34487431 PMCID: PMC8420856 DOI: 10.15252/msb.202110243] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/22/2021] [Revised: 08/12/2021] [Accepted: 08/16/2021] [Indexed: 01/04/2023] Open
Abstract
Systems serology provides a broad view of humoral immunity by profiling both the antigen-binding and Fc properties of antibodies. These studies contain structured biophysical profiling across disease-relevant antigen targets, alongside additional measurements made for single antigens or in an antigen-generic manner. Identifying patterns in these measurements helps guide vaccine and therapeutic antibody development, improve our understanding of diseases, and discover conserved regulatory mechanisms. Here, we report that coupled matrix-tensor factorization (CMTF) can reduce these data into consistent patterns by recognizing the intrinsic structure of these data. We use measurements from two previous studies of HIV- and SARS-CoV-2-infected subjects as examples. CMTF outperforms standard methods like principal components analysis in the extent of data reduction while maintaining equivalent prediction of immune functional responses and disease status. Under CMTF, model interpretation improves through effective data reduction, separation of the Fc and antigen-binding effects, and recognition of consistent patterns across individual measurements. Data reduction also helps make prediction models more replicable. Therefore, we propose that CMTF is an effective general strategy for data exploration in systems serology.
Affiliation(s)
- Zhixin Cyrillus Tan
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA
| | - Madeleine C Murphy
- Computational and Systems Biology, University of California, Los Angeles, Los Angeles, CA, USA
| | - Hakan S Alpay
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
| | - Scott D Taylor
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA, USA
| | - Aaron S Meyer
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA, USA
- Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA, USA
- Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, University of California, Los Angeles, Los Angeles, CA, USA
| |
|
14
|
Guerriero S, Pascual M, Ajossa S, Neri M, Musa E, Graupera B, Rodriguez I, Alcazar JL. Artificial intelligence (AI) in the detection of rectosigmoid deep endometriosis. Eur J Obstet Gynecol Reprod Biol 2021; 261:29-33. [PMID: 33873085 DOI: 10.1016/j.ejogrb.2021.04.012] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 03/06/2021] [Revised: 04/06/2021] [Accepted: 04/11/2021] [Indexed: 12/12/2022]
Abstract
OBJECTIVES The aim of this study was to compare the accuracy of seven classical machine learning (ML) models trained with ultrasound (US) soft markers to raise suspicion of endometriotic bowel involvement.
MATERIALS AND METHODS Input data for the models were retrieved from the database of a previously published study on bowel endometriosis performed on 333 patients. The following models were tested: k-nearest neighbors (k-NN), Naive Bayes, Neural Networks (NNET-neuralnet), Support Vector Machine (SVM), Decision Tree, Random Forest, and Logistic Regression. The complete dataset was split randomly into a training dataset and a test dataset containing 67% and 33% of the original cases, respectively. All models were trained on the training dataset, and their predictions were evaluated on the test dataset. The best model was chosen based on its accuracy on the test dataset. The information used in all models was: age; presence of US signs of uterine adenomyosis; presence of an endometrioma; adhesions of the ovary to the uterus; presence of "kissing ovaries"; and absence of the sliding sign. All models were trained using the caret package in R with 10-fold cross-validation repeated ten times. Accuracy, sensitivity, specificity, and positive (PPV) and negative (NPV) predictive values were calculated using a 50% threshold: intestinal involvement was predicted for every case in the test dataset with an estimated probability greater than 0.5.
RESULTS In the previous study from which the inputs were retrieved, 106 women had a final expert US diagnosis of rectosigmoid endometriosis. In terms of diagnostic accuracy, the best model was the Neural Net (accuracy, 0.73; sensitivity, 0.72; specificity, 0.73; PPV, 0.52; NPV, 0.86), but without a significant difference from the others.
CONCLUSIONS Artificial intelligence (AI) models using ultrasound soft markers to raise suspicion of rectosigmoid endometriosis achieved accuracy similar to that of the logistic model.
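The evaluation step described in this abstract (predicting involvement when the estimated probability exceeds 0.5, then reporting accuracy, sensitivity, specificity, PPV, and NPV) can be illustrated with a minimal Python sketch. This is not the authors' code, which used the caret package in R; the function name `binary_metrics` and the toy labels and probabilities are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Compute diagnostic metrics from true labels and estimated probabilities."""
    # A case is predicted positive only when its probability is strictly
    # greater than the threshold, matching "greater than 0.5" in the abstract.
    y_pred = [1 if p > threshold else 0 for p in y_prob]

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


# Hypothetical toy data, not taken from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.5]
metrics = binary_metrics(y_true, y_prob)
print(metrics)
```

PPV and NPV depend on how common the condition is in the test set, so a model with balanced sensitivity and specificity can still show a much lower PPV than NPV when positives are the minority, as in the results reported above.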
Affiliation(s)
- Stefano Guerriero
- Centro Integrato di Procreazione Medicalmente Assistita (PMA) e Diagnostica Ostetrico-Ginecologica, Policlinico Universitario Duilio Casula, Monserrato, Cagliari, Italy; University of Cagliari, Cagliari, Italy.
| | - MariaAngela Pascual
- Department of Obstetrics, Gynecology, and Reproduction, Hospital Universitari Dexeus, Spain
| | - Silvia Ajossa
- Department of Obstetrics and Gynecology, University of Cagliari, Policlinico Universitario Duilio Casula, Monserrato, Cagliari, Italy
| | - Manuela Neri
- Department of Obstetrics and Gynecology, University of Cagliari, Policlinico Universitario Duilio Casula, Monserrato, Cagliari, Italy
| | - Eleonora Musa
- Department of Obstetrics and Gynecology, University of Cagliari, Policlinico Universitario Duilio Casula, Monserrato, Cagliari, Italy
| | - Betlem Graupera
- Department of Obstetrics, Gynecology, and Reproduction, Hospital Universitari Dexeus, Spain
| | - Ignacio Rodriguez
- Unidad Epidemiología y Estadística, Departamento de Obstetricia, Ginecología y Reproducción, Hospital Universitario Quirón Dexeus, Barcelona, Spain
| | - Juan Luis Alcazar
- Department of Obstetrics and Gynecology, Clínica Universidad de Navarra, School of Medicine, University of Navarra, Pamplona, Spain
| |
|