1
Li J, Ionides EL, King AA, Pascual M, Ning N. Inference on spatiotemporal dynamics for coupled biological populations. J R Soc Interface 2024; 21:20240217. [PMID: 38981516 DOI: 10.1098/rsif.2024.0217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Accepted: 06/07/2024] [Indexed: 07/11/2024] Open
Abstract
Mathematical models in ecology and epidemiology must be consistent with observed data in order to generate reliable knowledge and evidence-based policy. Metapopulation systems, which consist of a network of connected sub-populations, pose technical challenges in statistical inference owing to nonlinear, stochastic interactions. Numerical difficulties encountered in conducting inference can obstruct the core scientific questions concerning the link between the mathematical models and the data. Recently, an algorithm has been proposed that enables computationally tractable likelihood-based inference for high-dimensional partially observed stochastic dynamic models of metapopulation systems. We use this algorithm to build a statistically principled data analysis workflow for metapopulation systems. Via a case study of COVID-19, we show how this workflow addresses the limitations of previous approaches. The COVID-19 pandemic provides a situation where mathematical models and their policy implications are widely visible, and we revisit an influential metapopulation model used to inform basic epidemiological understanding early in the pandemic. Our methods support self-critical data analysis, enabling us to identify and address model weaknesses, leading to a new model with substantially improved statistical fit and parameter identifiability. Our results suggest that the lockdown initiated on 23 January 2020 in China was more effective than previously thought.
Affiliation(s)
- Jifan Li
  - Department of Statistics, Texas A&M University, College Station, TX 77843, USA
- Edward L Ionides
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
- Aaron A King
  - Department of Ecology & Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
  - Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109, USA
  - Santa Fe Institute, Santa Fe, NM 87501, USA
- Mercedes Pascual
  - Santa Fe Institute, Santa Fe, NM 87501, USA
  - Departments of Biology and Environmental Studies, New York University, NY 10012, USA
- Ning Ning
  - Department of Statistics, Texas A&M University, College Station, TX 77843, USA
2
Mandros P, Gallagher I, Fanfani V, Chen C, Fischer J, Ismail A, Hsu L, Saha E, DeConti DK, Quackenbush J. node2vec2rank: Large Scale and Stable Graph Differential Analysis via Multi-Layer Node Embeddings and Ranking. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.16.599201. [PMID: 38948759 PMCID: PMC11212899 DOI: 10.1101/2024.06.16.599201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/02/2024]
Abstract
Computational methods in biology can infer large molecular interaction networks from multiple data sources and at different resolutions, creating unprecedented opportunities to explore the mechanisms driving complex biological phenomena. Networks can be built to represent distinct conditions and compared to uncover graph-level differences, such as when comparing patterns of gene-gene interactions that change between biological states. Given the importance of the graph comparison problem, there is a clear and growing need for robust and scalable methods that can identify meaningful differences. We introduce node2vec2rank (n2v2r), a method for graph differential analysis that ranks nodes according to the disparities of their representations in joint latent embedding spaces. Improving upon previous bag-of-features approaches, we take advantage of recent advances in machine learning and statistics to compare graphs in higher-order structures and in a data-driven manner. Formulated as a multi-layer spectral embedding algorithm, n2v2r is computationally efficient, incorporates stability as a key feature, and can provably identify the correct ranking of differences between graphs in an overall procedure that adheres to veridical data science principles. By better adapting to the data, node2vec2rank clearly outperformed the commonly used node degree in finding complex differences in simulated data. In the real-world applications of breast cancer subtype characterization, analysis of cell cycle in single-cell data, and searching for sex differences in lung adenocarcinoma, node2vec2rank found meaningful biological differences, enabling hypothesis generation for therapeutic candidates. Software and analysis pipelines implementing n2v2r and used for the analyses presented here are publicly available.
Affiliation(s)
- Panagiotis Mandros
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Ian Gallagher
  - School of Mathematics, University of Bristol, UK, and the Heilbronn Institute for Mathematical Research, Bristol, UK
- Viola Fanfani
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Chen Chen
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Jonas Fischer
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Anis Ismail
  - Faculty of Bioscience Engineering, KU Leuven, Belgium
- Lauren Hsu
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Cancer Immunology and Virology, Dana-Farber Cancer Institute, Boston, MA, USA
- Enakshi Saha
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Derrick K DeConti
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- John Quackenbush
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
3
Haab B, Qian L, Staal B, Jain M, Fahrmann J, Worthington C, Prosser D, Velokokhatnaya L, Lopez C, Tang R, Hurd MW, Natarajan G, Kumar S, Smith L, Hanash S, Batra SK, Maitra A, Lokshin A, Huang Y, Brand RE. A Rigorous Multi-Laboratory Study of Known PDAC Biomarkers Identifies Increased Sensitivity and Specificity Over CA19-9 Alone. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.22.595399. [PMID: 38826212 PMCID: PMC11142185 DOI: 10.1101/2024.05.22.595399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
A blood test that enables surveillance for early-stage pancreatic ductal adenocarcinoma (PDAC) is an urgent need. Independent laboratories have reported PDAC biomarkers that could improve biomarker performance over CA19-9 alone, but the performance of the previously reported biomarkers in combination is not known. Therefore, we conducted a coordinated case/control study across multiple laboratories using common sets of blinded training and validation samples (132 and 295 plasma samples, respectively) from PDAC patients and non-PDAC control subjects representing conditions under which surveillance occurs. We analyzed the training set to identify candidate biomarker combination panels using biomarkers across laboratories, and we applied the fixed panels to the validation set. The panels identified in the training set, CA19-9 with CA199.STRA, LRG1, TIMP-1, TGM2, THSP2, ANG, and MUC16.STRA, achieved consistent performance in the validation set. The panel of CA19-9 with the glycan biomarker CA199.STRA improved sensitivity from 0.44 with 0.98 specificity for CA19-9 alone to 0.71 with 0.98 specificity (p < 0.001, 1000-fold bootstrap). Similarly, CA19-9 combined with the protein biomarker LRG1 and CA199.STRA improved specificity from 0.16 with 0.94 sensitivity for CA19-9 to 0.65 with 0.89 sensitivity (p < 0.001, 1000-fold bootstrap). We further validated significantly improved performance using biomarker panels that did not include CA19-9. This study establishes the effectiveness of a coordinated study of previously discovered biomarkers and identified panels of those biomarkers that significantly increased the sensitivity and specificity of early-stage PDAC detection in a rigorous validation trial.
4
Wang Q, Tang TM, Youlton N, Weldy CS, Kenney AM, Ronen O, Weston Hughes J, Chin ET, Sutton SC, Agarwal A, Li X, Behr M, Kumbier K, Moravec CS, Wilson Tang WH, Margulies KB, Cappola TP, Butte AJ, Arnaout R, Brown JB, Priest JR, Parikh VN, Yu B, Ashley EA. Epistasis regulates genetic control of cardiac hypertrophy. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2023.11.06.23297858. [PMID: 37987017 PMCID: PMC10659487 DOI: 10.1101/2023.11.06.23297858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
The combinatorial effect of genetic variants is often assumed to be additive. Although genetic variation can clearly interact non-additively, methods to uncover epistatic relationships remain in their infancy. We develop low-signal signed iterative random forests to elucidate the complex genetic architecture of cardiac hypertrophy. We derive deep learning-based estimates of left ventricular mass from the cardiac MRI scans of 29,661 individuals enrolled in the UK Biobank. We report epistatic genetic variation including variants close to CCDC141, IGF1R, TTN, and TNKS. Several loci where variants were deemed insignificant in univariate genome-wide association analyses are identified. Functional genomic and integrative enrichment analyses reveal a complex gene regulatory network in which genes mapped from these loci share biological processes and myogenic regulatory factors. Through a network analysis of transcriptomic data from 313 explanted human hearts, we found strong gene co-expression correlations between these statistical epistasis contributors in healthy hearts and a significant connectivity decrease in failing hearts. We assess causality of epistatic effects via RNA silencing of gene-gene interactions in human induced pluripotent stem cell-derived cardiomyocytes. Finally, single-cell morphology analysis using a novel high-throughput microfluidic system shows that cardiomyocyte hypertrophy is non-additively modifiable by specific pairwise interactions between CCDC141 and both TTN and IGF1R. Our results expand the scope of genetic regulation of cardiac structure to epistasis.
5
Behr M, Kumbier K, Cordova-Palomera A, Aguirre M, Ronen O, Ye C, Ashley E, Butte AJ, Arnaout R, Brown B, Priest J, Yu B. Learning epistatic polygenic phenotypes with Boolean interactions. PLoS One 2024; 19:e0298906. [PMID: 38625909 PMCID: PMC11020961 DOI: 10.1371/journal.pone.0298906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 01/31/2024] [Indexed: 04/18/2024] Open
Abstract
Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilistically quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow-up experiments.
Affiliation(s)
- Merle Behr
  - Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany
- Karl Kumbier
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, United States of America
- Matthew Aguirre
  - Department of Pediatrics, Stanford Medicine, Stanford, CA, United States of America
  - Department of Biomedical Data Science, Stanford Medicine, Stanford, CA, United States of America
- Omer Ronen
  - Department of Statistics, University of California at Berkeley, Berkeley, CA, United States of America
- Chengzhong Ye
  - Department of Statistics, University of California at Berkeley, Berkeley, CA, United States of America
- Euan Ashley
  - Division of Cardiovascular Medicine, Stanford Medicine, Stanford, CA, United States of America
- Atul J. Butte
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, United States of America
- Rima Arnaout
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, United States of America
  - Division of Cardiology, Department of Medicine, University of California, San Francisco, San Francisco, CA, United States of America
- Ben Brown
  - Department of Statistics, University of California at Berkeley, Berkeley, CA, United States of America
  - Biosciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
- James Priest
  - Department of Pediatrics, Stanford Medicine, Stanford, CA, United States of America
- Bin Yu
  - Department of Statistics, University of California at Berkeley, Berkeley, CA, United States of America
  - Department of Electrical Engineering and Computer Sciences and Center for Computational Biology, University of California at Berkeley, Berkeley, CA, United States of America
6
Lasko TA, Strobl EV, Stead WW. Why do probabilistic clinical models fail to transport between sites. NPJ Digit Med 2024; 7:53. [PMID: 38429353 PMCID: PMC10907678 DOI: 10.1038/s41746-024-01037-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Accepted: 02/14/2024] [Indexed: 03/03/2024] Open
Abstract
The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we argue that we should typically expect this failure to transport, and we present common sources for it, divided into those under the control of the experimenter and those inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution, and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of probabilistic clinical models.
Affiliation(s)
- Thomas A Lasko
  - Vanderbilt University Medical Center, Nashville, TN, USA
- Eric V Strobl
  - Vanderbilt University Medical Center, Nashville, TN, USA
7
Cheng F, Wang F, Tang J, Zhou Y, Fu Z, Zhang P, Haines JL, Leverenz JB, Gan L, Hu J, Rosen-Zvi M, Pieper AA, Cummings J. Artificial intelligence and open science in discovery of disease-modifying medicines for Alzheimer's disease. Cell Rep Med 2024; 5:101379. [PMID: 38382465 PMCID: PMC10897520 DOI: 10.1016/j.xcrm.2023.101379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 08/15/2023] [Accepted: 12/19/2023] [Indexed: 02/23/2024]
Abstract
The high failure rate of clinical trials in Alzheimer's disease (AD) and AD-related dementia (ADRD) is due to a lack of understanding of the pathophysiology of disease, and this deficit may be addressed by applying artificial intelligence (AI) to "big data" to rapidly and effectively expand therapeutic development efforts. Recent accelerations in computing power and availability of big data, including electronic health records and multi-omics profiles, have converged to provide opportunities for scientific discovery and treatment development. Here, we review the potential utility of applying AI approaches to big data for discovery of disease-modifying medicines for AD/ADRD. We illustrate how AI tools can be applied to the AD/ADRD drug development pipeline through collaborative efforts among neurologists, gerontologists, geneticists, pharmacologists, medicinal chemists, and computational scientists. AI and open data science expedite drug discovery and development of disease-modifying therapeutics for AD/ADRD and other neurodegenerative diseases.
Affiliation(s)
- Feixiong Cheng
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA; Cleveland Clinic Genome Center, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA; Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medical College, Cornell University, New York, NY 10065, USA
- Jian Tang
  - Mila-Quebec Institute for Learning Algorithms and CIFAR AI Research Chair, HEC Montreal, Montréal, QC H3T 2A7, Canada
- Yadi Zhou
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Zhimin Fu
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA; College of Pharmacy, Northeast Ohio Medical University, Rootstown, OH 44272, USA
- Pengyue Zhang
  - Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, IN 46037, USA
- Jonathan L Haines
  - Cleveland Institute for Computational Biology, and Department of Population & Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
- James B Leverenz
  - Lou Ruvo Center for Brain Health, Neurological Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Li Gan
  - Helen and Robert Appel Alzheimer's Disease Research Institute, Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10021, USA
- Jianying Hu
  - IBM Research, Yorktown Heights, New York, NY 10598, USA
- Michal Rosen-Zvi
  - AI for Accelerated Healthcare and Life Sciences Discovery, IBM Research Labs, Haifa 3498825, Israel; Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9190500, Israel
- Andrew A Pieper
  - Brain Health Medicines Center, Harrington Discovery Institute, University Hospitals Cleveland Medical Center, Cleveland, OH, 44106, USA; Department of Psychiatry, Case Western Reserve University, Cleveland, OH 44106, USA; Geriatric Psychiatry, GRECC, Louis Stokes Cleveland VA Medical Center, Cleveland, OH 44106, USA; Institute for Transformative Molecular Medicine, School of Medicine, Case Western Reserve University, Cleveland OH 44106, USA; Department of Pathology, Case Western Reserve University, School of Medicine, Cleveland, OH, 44106, USA; Department of Neurosciences, Case Western Reserve University, School of Medicine, Cleveland, OH 44106, USA
- Jeffrey Cummings
  - Chambers-Grundy Center for Transformative Neuroscience, Department of Brain Health, School of Integrated Health Sciences, UNLV, Las Vegas, NV 89154, USA
8
Tognolini M, Lodola A, Giorgio C. Drug discovery: In silico dry data can bypass biological wet data? Br J Pharmacol 2024; 181:340-344. [PMID: 37872106 DOI: 10.1111/bph.16266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 09/27/2023] [Accepted: 10/10/2023] [Indexed: 10/25/2023] Open
Abstract
The recent and extraordinary increase in computer power, along with the availability of efficient algorithms based on artificial intelligence, has prompted a large number of inexperienced scientists to challenge the complex and yet competitive world of drug discovery, by pretending to identify new hits through the sole use of computer aided drug design (CADD). Does the golden era of dry data run the risk of overshadowing the importance of wet data and, in doing so, forget that in silico and biological data need each other in successful preclinical drug discovery programmes?
Affiliation(s)
- Alessio Lodola
  - Department of Food and Drug, University of Parma, Parma, Italy
- Carmine Giorgio
  - Department of Food and Drug, University of Parma, Parma, Italy
9
Zhang H, Liu S, Wang Y, Huang H, Sun L, Yuan Y, Cheng L, Liu X, Ning K. Deep learning enhanced the diagnostic merit of serum glycome for multiple cancers. iScience 2024; 27:108715. [PMID: 38226168 PMCID: PMC10788220 DOI: 10.1016/j.isci.2023.108715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 10/24/2023] [Accepted: 12/11/2023] [Indexed: 01/17/2024] Open
Abstract
Protein glycosylation is associated with the pathogenesis of various cancers. The utilization of certain glycans in cancer diagnosis models holds promise, yet their accuracy is not always guaranteed. Here, we investigated the utility of deep learning techniques, specifically random forests combined with transfer learning, in enhancing the serum glycome's discriminative power for cancer diagnosis (including ovarian cancer, non-small cell lung cancer, gastric cancer, and esophageal cancer). We started with ovarian cancer and demonstrated that transfer learning can achieve superior performance in data-disadvantaged cohorts (AUROC >0.9), outperforming PLS-DA. We identified a serum glycan-biomarker panel including 18 serum N-glycans and 4 glycan-derived traits, most of which featured sialylation. Furthermore, we validated the advantage of the transfer learning scheme across other cancer groups. These findings highlight the superiority of transfer learning in improving the performance of glycan-based cancer diagnosis models and identifying cancer biomarkers, providing a new high-fidelity avenue for cancer diagnosis.
Affiliation(s)
- Haobo Zhang
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Si Liu
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
  - Department of Epidemiology and Health Statistics, School of Public Health, Fujian Medical University, Fuzhou, Fujian, China
- Yi Wang
  - Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Hanhui Huang
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Lukang Sun
  - Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Youyuan Yuan
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Liming Cheng
  - Department of Laboratory Medicine, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Xin Liu
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Kang Ning
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
10
Hu ZT, Yu Y, Chen R, Yeh SJ, Chen B, Huang H. Large-Scale Information Retrieval and Correction of Noisy Pharmacogenomic Datasets through Residual Thresholded Deep Matrix Factorization. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.07.570723. [PMID: 38106027 PMCID: PMC10723412 DOI: 10.1101/2023.12.07.570723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
Pharmacogenomics studies are attracting an increasing amount of interest from researchers in precision medicine. The advances in high-throughput experiments and multiplexed approaches allow the large-scale quantification of drug sensitivities in molecularly characterized cancer cell lines (CCLs), resulting in a number of open drug sensitivity datasets for drug biomarker discovery. However, a significant inconsistency in drug sensitivity values among these datasets has been noted. Such inconsistency indicates the presence of substantial noise, subsequently hindering downstream analyses. To address the noise in drug sensitivity data, we introduce a robust and scalable deep learning framework, Residual Thresholded Deep Matrix Factorization (RT-DMF). This method takes a single drug sensitivity data matrix as its sole input and outputs a corrected and imputed matrix. Deep Matrix Factorization (DMF) excels at uncovering subtle patterns, due to its minimal reliance on data structure assumptions. This attribute significantly boosts DMF's ability to identify complex hidden patterns among nuisance effects in the data, thereby facilitating the detection of signals that are therapeutically relevant. Furthermore, RT-DMF incorporates an iterative residual thresholding (RT) procedure, which plays a crucial role in retaining signals more likely to hold therapeutic importance. Validation using simulated datasets and real pharmacogenomics datasets demonstrates the effectiveness of our approach in correcting noise and imputing missing data in drug sensitivity datasets (open source package available at https://github.com/tomwhoooo/rtdmf).
Affiliation(s)
- Zhiyue Tom Hu
  - Division of Biostatistics, University of California Berkeley, Berkeley, 94720, U.S.A
- Yaodong Yu
  - Department of Electrical Engineer and Computer Science, University of California Berkeley, Berkeley, 94720, U.S.A
- Ruoqiao Chen
  - Department of Pharmacology and Toxicology, Michigan State University, 48824, U.S.A
- Shan-Ju Yeh
  - School of Medicine, National Tsing Hua University, Hsinchu, 300044, Taiwan R.O.C
- Bin Chen
  - Department of Pharmacology and Toxicology, Michigan State University, 48824, U.S.A
  - Department of Pediatrics and Human Development, Michigan State University, 48824, U.S.A
- Haiyan Huang
  - Department of Statistics, University of California Berkeley, Berkeley, 94720, U.S.A
11
Frostig T, Benjamini Y, Kehat O, Weiss-Meilik A, Mandel D, Peleg B, Strauss Z, Mitelpunkt A. Developing a length of stay prediction model for newborns, achieving better accuracy with greater usability. Int J Med Inform 2023; 180:105267. [PMID: 37918217 DOI: 10.1016/j.ijmedinf.2023.105267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 10/13/2023] [Accepted: 10/20/2023] [Indexed: 11/04/2023]
Abstract
BACKGROUND: One in ten newborn children is born prematurely. The prolonged length of stay (LOS) of these children in the Neonatal Intensive Care Unit (NICU) has important implications for hospital occupancy figures, healthcare and management costs, as well as the psychology of parents. In order to allow accurate planning and resource allocation, this study aims to create a generalizable and robust model to predict the NICU LOS of preterm newborns. METHODS: Data were collected from a large tertiary center NICU between 2011 and 2018 and relate to 5,362 newborns. The selected model was externally validated using a data set of 8,768 newborns from another tertiary center NICU. This report compares several models, such as Random Forest (RF), quantile RF, and other feature selection methods, including LASSO and AIC step-forward selection. In addition, a novel step-forward selection based on False Discovery Rate (FDR) for quantile regression is presented and evaluated. RESULTS: A high-order quantile regression model for predicting preterm newborns' LOS that uses only four features available at birth had more attractive properties than other richer ones. The model achieved a Mean Absolute Error (MAE) of 6.26 days on the internal validation set (average LOS 27.04) and an MAE of 6.04 days on the external validation set (average LOS 29.32). The suggested model surpassed the accuracy obtained by models in the literature. It is shown empirically that the FDR-based selection has better properties than the AIC-based step-forward selection approach. CONCLUSION: This paper demonstrates a process to create a predictive model for NICU LOS in preterm newborns, where each step is reasoned. We obtain a simple and robust model for NICU LOS prediction, which achieves far better results than the current model used for financing NICUs. Utilizing this model, we have created an easy-to-use online web application to ease parents' worries and to assist NICU management: https://tzviel.shinyapps.io/calcuLOS.
Affiliation(s)
- Tzviel Frostig
- Department of Statistics and Operation Research, Tel Aviv University, Ramat Aviv, 69978, Tel Aviv, Israel.
| | - Yoav Benjamini
- Department of Statistics and Operation Research, Tel Aviv University, Ramat Aviv, 69978, Tel Aviv, Israel; Sagol School of Neuroscience and the Edmond Safra Bioinformatics Center, Tel Aviv University Ramat Aviv, 69978, Tel Aviv, Israel
| | - Orli Kehat
- I-Medata AI Center, Tel Aviv Medical Center, 6 Weizmann St., 64239, Tel Aviv, Israel
| | - Ahuva Weiss-Meilik
- I-Medata AI Center, Tel Aviv Medical Center, 6 Weizmann St., 64239, Tel Aviv, Israel
| | - Dror Mandel
- Departments of Neonatology and Pediatrics, Dana Dwek Children's Hospital, Tel Aviv Medical Center, 6 Weizmann St., 64239, Tel Aviv, Israel
| | - Ben Peleg
- Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv, 69978, Tel Aviv, Israel; Department of Neonatology, Edmond and Lily Safra Children's Hospital, Sheba Medical Center, Tel-HaShomer, Israel
| | - Zipora Strauss
- Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv, 69978, Tel Aviv, Israel; Department of Neonatology, Edmond and Lily Safra Children's Hospital, Sheba Medical Center, Tel-HaShomer, Israel
| | - Alexis Mitelpunkt
- Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv, 69978, Tel Aviv, Israel; Pediatric Rehabilitation, Department of Rehabilitation, Dana Dwek Children's Hospital, Tel Aviv Medical Center, 6 Weizmann St., 64239, Tel Aviv, Israel
|
12
|
Affiliation(s)
- David J Hunter
- From the Nuffield Department of Population Health (D.J.H.) and the Department of Statistics and Nuffield Department of Medicine (C.H.), University of Oxford, Oxford, and the Alan Turing Institute, London (C.H.) - both in the United Kingdom
| | - Christopher Holmes
- From the Nuffield Department of Population Health (D.J.H.) and the Department of Statistics and Nuffield Department of Medicine (C.H.), University of Oxford, Oxford, and the Alan Turing Institute, London (C.H.) - both in the United Kingdom
|
13
|
Irajizad E, Kenney A, Tang T, Vykoukal J, Wu R, Murage E, Dennison JB, Sans M, Long JP, Loftus M, Chabot JA, Kluger MD, Kastrinos F, Brais L, Babic A, Jajoo K, Lee LS, Clancy TE, Ng K, Bullock A, Genkinger JM, Maitra A, Do KA, Yu B, Wolpin BM, Hanash S, Fahrmann JF. A blood-based metabolomic signature predictive of risk for pancreatic cancer. Cell Rep Med 2023; 4:101194. [PMID: 37729870 PMCID: PMC10518621 DOI: 10.1016/j.xcrm.2023.101194] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Revised: 12/20/2022] [Accepted: 08/21/2023] [Indexed: 09/22/2023]
Abstract
Emerging evidence implicates microbiome involvement in the development of pancreatic cancer (PaCa). Here, we investigate whether increases in circulating microbial-related metabolites associate with PaCa risk by applying metabolomics profiling to 172 sera collected within 5 years prior to PaCa diagnosis and 863 matched non-subject sera from participants in the Prostate, Lung, Colorectal, and Ovarian (PLCO) cohort. We develop a three-marker microbial-related metabolite panel to assess 5-year risk of PaCa. The addition of five non-microbial metabolites further improves 5-year risk prediction of PaCa. The combined metabolite panel complements CA19-9, and individuals with a combined metabolite panel + CA19-9 score in the top 2.5th percentile have absolute 5-year risk estimates of >13%. The risk prediction model based on circulating microbial and non-microbial metabolites provides a potential tool to identify individuals at high risk of PaCa that would benefit from surveillance and/or from potential cancer interception strategies.
Affiliation(s)
- Ehsan Irajizad
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Ana Kenney
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
| | - Tiffany Tang
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
| | - Jody Vykoukal
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Ranran Wu
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Eunice Murage
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Jennifer B Dennison
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Marta Sans
- Division of Gastroenterology, Hepatology and Endoscopy, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - James P Long
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Maureen Loftus
- Dana-Farber Brigham and Women's Cancer Center, Division of Gastrointestinal Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - John A Chabot
- Division of Digestive and Liver Diseases, Columbia University Irving Medical Center and the Vagelos College of Physicians and Surgeons, New York, NY, USA
| | - Michael D Kluger
- Division of Digestive and Liver Diseases, Columbia University Irving Medical Center and the Vagelos College of Physicians and Surgeons, New York, NY, USA
| | - Fay Kastrinos
- Division of Digestive and Liver Diseases, Columbia University Irving Medical Center and the Vagelos College of Physicians and Surgeons, New York, NY, USA; Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA
| | - Lauren Brais
- Dana-Farber Brigham and Women's Cancer Center, Division of Gastrointestinal Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - Ana Babic
- Dana-Farber Brigham and Women's Cancer Center, Division of Gastrointestinal Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - Kunal Jajoo
- Division of Gastroenterology, Hepatology and Endoscopy, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Linda S Lee
- Division of Gastroenterology, Hepatology and Endoscopy, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Thomas E Clancy
- Dana-Farber Brigham and Women's Cancer Center, Division of Surgical Oncology, Department of Surgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Kimmie Ng
- Dana-Farber Brigham and Women's Cancer Center, Division of Gastrointestinal Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - Andrea Bullock
- Division of Hematology/Oncology, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
| | - Jeanine M Genkinger
- Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA; Department of Epidemiology, Columbia Mailman School of Public Health, New York, NY, USA
| | - Anirban Maitra
- Department of Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Kim-Anh Do
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Bin Yu
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
| | - Brian M Wolpin
- Dana-Farber Brigham and Women's Cancer Center, Division of Gastrointestinal Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
| | - Sam Hanash
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
| | - Johannes F Fahrmann
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
|
14
|
Landeros A, Xu J, Lange K. MM optimization: Proximal distance algorithms, path following, and trust regions. Proc Natl Acad Sci U S A 2023; 120:e2303168120. [PMID: 37339185 PMCID: PMC10319036 DOI: 10.1073/pnas.2303168120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2023] [Accepted: 05/09/2023] [Indexed: 06/22/2023] Open
Abstract
We briefly review the majorization-minimization (MM) principle and elaborate on the closely related notion of proximal distance algorithms, a generic approach for solving constrained optimization problems via quadratic penalties. We illustrate how the MM and proximal distance principles apply to a variety of problems from statistics, finance, and nonlinear optimization. Drawing from our selected examples, we also sketch a few ideas pertinent to the acceleration of MM algorithms: a) structuring updates around efficient matrix decompositions, b) path following in proximal distance iteration, and c) cubic majorization and its connections to trust region methods. These ideas are put to the test on several numerical examples, but for the sake of brevity, we omit detailed comparisons to competing methods. The current article, which is a mix of review and current contributions, celebrates the MM principle as a powerful framework for designing optimization algorithms and reinterpreting existing ones.
Affiliation(s)
- Alfonso Landeros
- Department of Computational Medicine, University of California, Los Angeles, CA 90095
| | - Jason Xu
- Department of Statistical Science, Duke University, Durham, NC 27708
| | - Kenneth Lange
- Department of Computational Medicine, University of California, Los Angeles, CA 90095
- Department of Human Genetics, University of California, Los Angeles, CA 90095
- Department of Statistics, University of California, Los Angeles, CA 90095
|
15
|
Aw A, Jin LC, Ioannidis N, Song YS. The Impact of Stability Considerations on Genetic Fine-Mapping. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.11.536456. [PMID: 37090514 PMCID: PMC10120703 DOI: 10.1101/2023.04.11.536456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/25/2023]
Abstract
Fine-mapping methods, which aim to identify genetic variants responsible for complex traits following genetic association studies, typically assume that sufficient adjustments for confounding within the association study cohort have been made, e.g., through regressing out the top principal components (i.e., residualization). Despite its widespread use, however, residualization may not completely remove all sources of confounding. Here, we propose a complementary stability-guided approach that does not rely on residualization, which identifies consistently fine-mapped variants across different genetic backgrounds or environments. We demonstrate the utility of this approach by applying it to fine-map eQTLs in the GEUVADIS data. Using 378 different functional annotations of the human genome, including recent deep learning-based annotations (e.g., Enformer), we compare enrichments of these annotations among variants for which the stability and traditional residualization-based fine-mapping approaches agree against those for which they disagree, and find that the stability approach enhances the power of traditional fine-mapping methods in identifying variants with functional impact. Finally, in cases where the two approaches report distinct variants, our approach identifies variants comparably enriched for functional annotations. Our findings suggest that the stability principle, as a conceptually simple device, complements existing approaches to fine-mapping, reinforcing recent advocacy of evaluating cross-population and cross-environment portability of biological findings. To support visualization and interpretation of our results, we provide a Shiny app, available at: https://alan-aw.shinyapps.io/stability_v0/.
Affiliation(s)
- Alan Aw
- Department of Statistics, University of California, Berkeley
- Center for Computational Biology, University of California, Berkeley
| | | | - Nilah Ioannidis
- Center for Computational Biology, University of California, Berkeley
- Computer Science Division, University of California, Berkeley
| | - Yun S. Song
- Department of Statistics, University of California, Berkeley
- Center for Computational Biology, University of California, Berkeley
- Computer Science Division, University of California, Berkeley
|
16
|
Sadybekov AV, Katritch V. Computational approaches streamlining drug discovery. Nature 2023; 616:673-685. [PMID: 37100941 DOI: 10.1038/s41586-023-05905-z] [Citation(s) in RCA: 135] [Impact Index Per Article: 135.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2022] [Accepted: 03/01/2023] [Indexed: 04/28/2023]
Abstract
Computer-aided drug discovery has been around for decades, although the past few years have seen a tectonic shift towards embracing computational technologies in both academia and pharma. This shift is largely defined by the flood of data on ligand properties and binding to therapeutic targets and their 3D structures, abundant computing capacities and the advent of on-demand virtual libraries of drug-like small molecules in their billions. Taking full advantage of these resources requires fast computational methods for effective ligand screening. This includes structure-based virtual screening of gigascale chemical spaces, further facilitated by fast iterative screening approaches. Highly synergistic are developments in deep learning predictions of ligand properties and target activities in lieu of receptor structure. Here we review recent advances in ligand discovery technologies, their potential for reshaping the whole process of drug discovery and development, as well as the challenges they encounter. We also discuss how the rapid identification of highly diverse, potent, target-selective and drug-like ligands to protein targets can democratize the drug discovery process, presenting new opportunities for the cost-effective development of safer and more effective small-molecule treatments.
Affiliation(s)
- Anastasiia V Sadybekov
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Center for New Technologies in Drug Discovery and Development, Bridge Institute, Michelson Center for Convergent Biosciences, University of Southern California, Los Angeles, CA, USA
| | - Vsevolod Katritch
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
- Center for New Technologies in Drug Discovery and Development, Bridge Institute, Michelson Center for Convergent Biosciences, University of Southern California, Los Angeles, CA, USA.
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA.
|
17
|
Broderick T, Gelman A, Meager R, Smith AL, Zheng T. Toward a taxonomy of trust for probabilistic machine learning. SCIENCE ADVANCES 2023; 9:eabn3999. [PMID: 36791188 PMCID: PMC9931201 DOI: 10.1126/sciadv.abn3999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/23/2021] [Accepted: 01/13/2023] [Indexed: 06/18/2023]
Abstract
Probabilistic machine learning increasingly informs critical decisions in medicine, economics, politics, and beyond. To aid the development of trust in these decisions, we develop a taxonomy delineating where trust in an analysis can break down: (i) in the translation of real-world goals to goals on a particular set of training data, (ii) in the translation of abstract goals on the training data to a concrete mathematical problem, (iii) in the use of an algorithm to solve the stated mathematical problem, and (iv) in the use of a particular code implementation of the chosen algorithm. We detail how trust can fail at each step and illustrate our taxonomy with two case studies. Finally, we describe a wide variety of methods that can be used to increase trust at each step of our taxonomy. The use of our taxonomy highlights not only steps where existing research work on trust tends to concentrate but also steps where building trust is particularly challenging.
Affiliation(s)
- Tamara Broderick
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Andrew Gelman
- Department of Statistics, Columbia University, New York, NY, USA
- Department of Political Science, Columbia University, New York, NY, USA
| | - Rachael Meager
- Department of Economics, London School of Economics and Political Science, London, UK
| | - Anna L. Smith
- Department of Statistics, University of Kentucky, Lexington, KY, USA
| | - Tian Zheng
- Department of Statistics, Columbia University, New York, NY, USA
|
18
|
Marmolejo‐Ramos F, Tejo M, Brabec M, Kuzilek J, Joksimovic S, Kovanovic V, González J, Kneib T, Bühlmann P, Kook L, Briseño‐Sánchez G, Ospina R. Distributional regression modeling via generalized additive models for location, scale, and shape: An overview through a data set from learning analytics. WILEY INTERDISCIPLINARY REVIEWS. DATA MINING AND KNOWLEDGE DISCOVERY 2023; 13:e1479. [PMID: 37502671 PMCID: PMC10369920 DOI: 10.1002/widm.1479] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/15/2021] [Revised: 06/11/2022] [Accepted: 10/05/2022] [Indexed: 07/29/2023]
Abstract
The advent of technological developments is making it possible to gather large amounts of data in several research fields. Learning analytics (LA)/educational data mining has access to big observational unstructured data captured from educational settings and relies mostly on unsupervised machine learning (ML) algorithms to make sense of such type of data. Generalized additive models for location, scale, and shape (GAMLSS) are a supervised statistical learning framework that allows modeling all the parameters of the distribution of the response variable with respect to the explanatory variables. This article overviews the power and flexibility of GAMLSS in relation to some ML techniques. Also, GAMLSS' capability to be tailored toward causality via causal regularization is briefly commented on. This overview is illustrated via a data set from the field of LA. This article is categorized under: Application Areas > Education and Learning; Algorithmic Development > Statistics; Technologies > Machine Learning.
Affiliation(s)
| | - Mauricio Tejo
- Instituto de Estadística, Universidad de Valparaíso, Valparaíso, Chile
| | - Marek Brabec
- Department of Statistical Modelling, Institute of Computer Science of the Czech Academy of Sciences, Prague, Czech Republic
| | - Jakub Kuzilek
- Czech Institute of Informatics, Robotics and Cybernetics, CTU, Prague, Czech Republic
- Computer Science Education/Computer Science and Society Research Group, Humboldt University of Berlin, Berlin, Germany
| | - Srecko Joksimovic
- Centre for Change and Complexity in Learning, University of South Australia, Adelaide, Australia
| | - Vitomir Kovanovic
- Centre for Change and Complexity in Learning, University of South Australia, Adelaide, Australia
| | - Jorge González
- Departamento de Estadística, Pontificia Universidad Católica de Chile, Santiago de Chile, Chile
| | - Thomas Kneib
- Campus Institute Data Science (CIDAS) and Chair of Statistics, Georg‐August‐Universität Göttingen, Göttingen, Germany
| | | | - Lucas Kook
- Epidemiology, Biostatistics, and Prevention Institute, University of Zurich, Zurich, Switzerland
- Institute of Data Analysis and Process Design, Zurich University of Applied Sciences, Winterthur, Switzerland
| | | | - Raydonal Ospina
- Department of Statistics, CASTLab, Federal University of Pernambuco, Recife, Brazil
|
19
|
De Paolis Kaluza MC, Jain S, Radivojac P. An Approach to Identifying and Quantifying Bias in Biomedical Data. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2023; 28:311-322. [PMID: 36540987 PMCID: PMC9782737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Abstract
Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit a typically representative unlabeled data must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data is biased and to what extent. We assume that the class-conditional distributions were generated by a family of component distributions represented at different proportions in labeled and unlabeled data. We also assume that the training data can be transformed to and subsequently modeled by a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data and estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective.
|
20
|
Marmolejo-Ramos F, Ospina R, García-Ceja E, Correa JC. Ingredients for Responsible Machine Learning: A Commented Review of The Hitchhiker’s Guide to Responsible Machine Learning. JOURNAL OF STATISTICAL THEORY AND APPLICATIONS 2022; 21:175-185. [PMID: 36160758 PMCID: PMC9483296 DOI: 10.1007/s44199-022-00048-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2022] [Accepted: 09/02/2022] [Indexed: 11/25/2022] Open
Abstract
In The Hitchhiker’s Guide to Responsible Machine Learning, Biecek, Kozak, and Zawada (here BKZ) provide an illustrated and engaging step-by-step guide on how to perform a machine learning (ML) analysis such that the algorithms, the software, and the entire process are interpretable and transparent for both the data scientist and the end user. This review summarises BKZ’s book and elaborates on three elements key to ML analyses: inductive inference, causality, and interpretability.
Affiliation(s)
- Fernando Marmolejo-Ramos
- Centre for Change and Complexity in Learning, University of South Australia, Adelaide, SA 5001 Australia
| | - Raydonal Ospina
- CASTLab, Department of Statistics, Universidade Federal de Pernambuco, Recife, Pernambuco 51280-000 Brazil
| | - Enrique García-Ceja
- Escuela de Ingeniería y Ciencias, Tecnológico de Monterrey, 64849 Monterrey, Nuevo León Mexico
| | - Juan C. Correa
- CESA Business School, Bogotá, Bogotá, DC, 110231 Colombia
|
21
|
Kornblith AE, Singh C, Devlin G, Addo N, Streck CJ, Holmes JF, Kuppermann N, Grupp-Phelan J, Fineman J, Butte AJ, Yu B. Predictability and stability testing to assess clinical decision instrument performance for children after blunt torso trauma. PLOS DIGITAL HEALTH 2022; 1:e0000076. [PMID: 36812570 PMCID: PMC9931266 DOI: 10.1371/journal.pdig.0000076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/22/2022] [Accepted: 06/14/2022] [Indexed: 11/18/2022]
Abstract
OBJECTIVE The Pediatric Emergency Care Applied Research Network (PECARN) has developed a clinical-decision instrument (CDI) to identify children at very low risk of intra-abdominal injury. However, the CDI has not been externally validated. We sought to vet the PECARN CDI with the Predictability Computability Stability (PCS) data science framework, potentially increasing its chance of a successful external validation. MATERIALS & METHODS We performed a secondary analysis of two prospectively collected datasets: PECARN (12,044 children from 20 emergency departments) and an independent external validation dataset from the Pediatric Surgical Research Collaborative (PedSRC; 2,188 children from 14 emergency departments). We used PCS to reanalyze the original PECARN CDI along with new interpretable PCS CDIs developed using the PECARN dataset. External validation was then measured on the PedSRC dataset. RESULTS Three predictor variables (abdominal wall trauma, Glasgow Coma Scale Score <14, and abdominal tenderness) were found to be stable. A CDI using only these three variables would achieve lower sensitivity than the original PECARN CDI with seven variables on internal PECARN validation but achieve the same performance on external PedSRC validation (sensitivity 96.8% and specificity 44%). Using only these variables, we developed a PCS CDI which had a lower sensitivity than the original PECARN CDI on internal PECARN validation but performed the same on external PedSRC validation (sensitivity 96.8% and specificity 44%). CONCLUSION The PCS data science framework vetted the PECARN CDI and its constituent predictor variables prior to external validation. We found that the 3 stable predictor variables represented all of the PECARN CDI's predictive performance on independent external validation. The PCS framework offers a less resource-intensive method than prospective validation to vet CDIs before external validation. 
We also found that the PECARN CDI will generalize well to new populations and should be prospectively externally validated. The PCS framework offers a potential strategy to increase the chance of a successful (costly) prospective validation.
Affiliation(s)
- Aaron E. Kornblith
- Department of Emergency Medicine, University of California, San Francisco, San Francisco, United States of America
- Department of Pediatrics, University of California, San Francisco, San Francisco, United States of America
| | - Chandan Singh
- Department of Electrical Engineering & Computer Science, University of California, Berkeley, Berkeley, United States of America
| | - Gabriel Devlin
- Department of Pediatrics, University of California, San Francisco, San Francisco, United States of America
| | - Newton Addo
- Department of Emergency Medicine, University of California, San Francisco, San Francisco, United States of America
| | - Christian J. Streck
- Department of Surgery, Medical University of South Carolina, Children’s Hospital, Charleston, United States of America
| | - James F. Holmes
- Department of Emergency Medicine, University of California, Davis, Davis, United States of America
| | - Nathan Kuppermann
- Department of Emergency Medicine, University of California, Davis, Davis, United States of America
- Department of Pediatrics, University of California, Davis, Davis, United States of America
| | - Jacqueline Grupp-Phelan
- Department of Emergency Medicine, University of California, San Francisco, San Francisco, United States of America
- Department of Pediatrics, University of California, San Francisco, San Francisco, United States of America
| | - Jeffrey Fineman
- Department of Pediatrics, University of California, San Francisco, San Francisco, United States of America
| | - Atul J. Butte
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, United States of America
| | - Bin Yu
- Department of Electrical Engineering & Computer Science, University of California, Berkeley, Berkeley, United States of America
- Department of Statistics, University of California, Berkeley, Berkeley, United States of America
|
22
|
Trella AL, Zhang KW, Nahum-Shani I, Shetty V, Doshi-Velez F, Murphy SA. Designing Reinforcement Learning Algorithms for Digital Interventions: Pre-Implementation Guidelines. ALGORITHMS 2022; 15:255. [PMID: 36713810 PMCID: PMC9881427 DOI: 10.3390/a15080255] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/02/2023]
Abstract
Online reinforcement learning (RL) algorithms are increasingly used to personalize digital interventions in the fields of mobile health and online education. Common challenges in designing and testing an RL algorithm in these settings include ensuring the RL algorithm can learn and run stably under real-time constraints, and accounting for the complexity of the environment, e.g., a lack of accurate mechanistic models for the user dynamics. To guide how one can tackle these challenges, we extend the PCS (predictability, computability, stability) framework, a data science framework that incorporates best practices from machine learning and statistics in supervised learning to the design of RL algorithms for the digital interventions setting. Furthermore, we provide guidelines on how to design simulation environments, a crucial tool for evaluating RL candidate algorithms using the PCS framework. We show how we used the PCS framework to design an RL algorithm for Oralytics, a mobile health study aiming to improve users' tooth-brushing behaviors through the personalized delivery of intervention messages. Oralytics will go into the field in late 2022.
Affiliation(s)
- Anna L. Trella
- School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02420, USA
| | - Kelly W. Zhang
- School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02420, USA
| | - Inbal Nahum-Shani
- Institute for Social Research, University of Michigan, Ann Arbor, MI 48109, USA
| | - Vivek Shetty
- Schools of Dentistry & Engineering, University of California, Los Angeles, CA 90095, USA
| | - Finale Doshi-Velez
- School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02420, USA
| | - Susan A. Murphy
- School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02420, USA
|
23
|
Lu JH, Callahan A, Patel BS, Morse KE, Dash D, Pfeffer MA, Shah NH. Assessment of Adherence to Reporting Guidelines by Commonly Used Clinical Prediction Models From a Single Vendor: A Systematic Review. JAMA Netw Open 2022; 5:e2227779. [PMID: 35984654 PMCID: PMC9391954 DOI: 10.1001/jamanetworkopen.2022.27779] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open
Abstract
IMPORTANCE Various model reporting guidelines have been proposed to ensure clinical prediction models are reliable and fair. However, no consensus exists about which model details are essential to report, and commonalities and differences among reporting guidelines have not been characterized. Furthermore, how well documentation of deployed models adheres to these guidelines has not been studied. OBJECTIVES To assess information requested by model reporting guidelines and whether the documentation for commonly used machine learning models developed by a single vendor provides the information requested. EVIDENCE REVIEW MEDLINE was queried using machine learning model card and reporting machine learning from November 4 to December 6, 2020. References were reviewed to find additional publications, and publications without specific reporting recommendations were excluded. Similar elements requested for reporting were merged into representative items. Four independent reviewers and 1 adjudicator assessed how often documentation for the most commonly used models developed by a single vendor reported the items. FINDINGS From 15 model reporting guidelines, 220 unique items were identified that represented the collective reporting requirements. Although 12 items were commonly requested (requested by 10 or more guidelines), 77 items were requested by just 1 guideline. Documentation for 12 commonly used models from a single vendor reported a median of 39% (IQR, 37%-43%; range, 31%-47%) of items from the collective reporting requirements. Many of the commonly requested items had 100% reporting rates, including items concerning outcome definition, area under the receiver operating characteristics curve, internal validation, and intended clinical use. Several items reported half the time or less related to reliability, such as external validation, uncertainty measures, and strategy for handling missing data. Other frequently unreported items related to fairness (summary statistics and subgroup analyses, including for race and ethnicity or sex). CONCLUSIONS AND RELEVANCE These findings suggest that consistent reporting recommendations for clinical predictive models are needed for model developers to share necessary information for model deployment. The many published guidelines would, collectively, require reporting more than 200 items. Model documentation from 1 vendor reported the most commonly requested items from model reporting guidelines. However, areas for improvement were identified in reporting items related to model reliability and fairness. This analysis led to feedback to the vendor, which motivated updates to the documentation for future users.
Affiliation(s)
- Jonathan H. Lu
- Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California
- Alison Callahan
- Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California
- Birju S. Patel
- Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California
- Keith E. Morse
- Department of Pediatrics, Stanford University School of Medicine, Stanford, California
- Department of Clinical Informatics, Lucile Packard Children’s Hospital, Palo Alto, California
- Dev Dash
- Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California
- Michael A. Pfeffer
- Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California
- Technology and Digital Solutions, Stanford Medicine, Stanford, California
- Nigam H. Shah
- Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California
- Technology and Digital Solutions, Stanford Medicine, Stanford, California
- Clinical Excellence Research Center, Stanford Medicine, Stanford, California
|
24
|
Provable Boolean interaction recovery from tree ensemble obtained via random forests. Proc Natl Acad Sci U S A 2022; 119:e2118636119. [PMID: 35609192 PMCID: PMC9295780 DOI: 10.1073/pnas.2118636119] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022] Open
Abstract
Significance: Random Forests (RFs) are among the most successful machine-learning algorithms in terms of prediction accuracy. In many domain problems, however, the primary goal is not prediction, but to understand the data-generation process, in particular finding important features and feature interactions. There exists strong empirical evidence that RF-based methods, in particular iterative RF (iRF), are very successful in terms of detecting feature interactions. In this work, we propose a biologically motivated, Boolean interaction model. Using this model, we complement the existing empirical evidence with theoretical evidence for the ability of iRF-type methods to select desirable interactions. Our theoretical analysis also yields deeper insights into the general interaction selection mechanism of decision-tree algorithms and the importance of feature subsampling.
|
25
|
Nicholson G, Blangiardo M, Briers M, Diggle PJ, Fjelde TE, Ge H, Goudie RJB, Jersakova R, King RE, Lehmann BCL, Mallon AM, Padellini T, Teh YW, Holmes C, Richardson S. Interoperability of statistical models in pandemic preparedness: principles and reality. Stat Sci 2022; 37:183-206. [PMID: 35664221 PMCID: PMC7612804 DOI: 10.1214/22-sts854] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
Abstract
We present interoperability as a guiding framework for statistical modelling to assist policy makers asking multiple questions using diverse datasets in the face of an evolving pandemic response. Interoperability provides an important set of principles for future pandemic preparedness, through the joint design and deployment of adaptable systems of statistical models for disease surveillance using probabilistic reasoning. We illustrate this through case studies for inferring and characterising spatial-temporal prevalence and reproduction numbers of SARS-CoV-2 infections in England.
Affiliation(s)
- Marta Blangiardo
- MRC Centre for Environment and Health, Dept of Epidemiology and Biostatistics, Imperial College London
- Peter J Diggle
- CHICAS, Lancaster Medical School, Lancaster University, UK
- Hong Ge
- Department of Engineering, University of Cambridge, UK
- Tullia Padellini
- MRC Centre for Environment and Health, Dept of Epidemiology and Biostatistics, Imperial College London
- Chris Holmes
- University of Oxford, UK
- The Alan Turing Institute, London, UK
- MRC Harwell Institute, Harwell, UK
- Sylvia Richardson
- The Alan Turing Institute, London, UK
- MRC Biostatistics Unit, University of Cambridge, UK
|
26
|
Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 2022; 23:169-181. [PMID: 34837041 DOI: 10.1038/s41576-021-00434-9] [Citation(s) in RCA: 66] [Impact Index Per Article: 33.0] [Accepted: 10/28/2021] [Indexed: 11/08/2022]
Abstract
The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.
|
27
|
Abstract
Interpretability is becoming increasingly important for predictive model analysis. Unfortunately, as remarked by many authors, there is still no consensus regarding this notion. The goal of this paper is to propose the definition of a score that allows for quickly comparing interpretable algorithms. This definition consists of three terms, each one being quantitatively measured with a simple formula: predictivity, stability and simplicity. While predictivity has been extensively studied to measure the accuracy of predictive algorithms, stability is based on the Dice-Sorensen index for comparing two rule sets generated by an algorithm using two independent samples. The simplicity is based on the sum of the lengths of the rules derived from the predictive model. The proposed score is a weighted sum of the three terms mentioned above. We use this score to compare the interpretability of a set of rule-based algorithms and tree-based algorithms for the regression case and for the classification case.
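The score described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Dice-Sorensen index is the standard set-overlap formula, while the function names, the equal default weights, the convention for two empty rule sets, and the assumption that all three terms are already scaled to [0, 1] are hypothetical choices made here for the example.

```python
from typing import Iterable, Tuple


def dice_sorensen(rules_a: Iterable[str], rules_b: Iterable[str]) -> float:
    """Dice-Sorensen index 2|A ∩ B| / (|A| + |B|) between two rule sets."""
    a, b = set(rules_a), set(rules_b)
    if not a and not b:
        # Assumed convention: two empty rule sets count as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def interpretability_score(
    predictivity: float,
    stability: float,
    simplicity: float,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),  # hypothetical default
) -> float:
    """Weighted sum of the three terms, each assumed to lie in [0, 1]."""
    w_pred, w_stab, w_simp = weights
    return w_pred * predictivity + w_stab * stability + w_simp * simplicity


# Rule sets produced by the same algorithm on two independent samples.
rules_sample_1 = {"x1 > 3", "x2 <= 1"}
rules_sample_2 = {"x1 > 3", "x3 > 0"}
stability = dice_sorensen(rules_sample_1, rules_sample_2)  # one shared rule out of four
score = interpretability_score(0.8, stability, 0.6, weights=(0.5, 0.25, 0.25))
```

Representing each rule as a string keeps the overlap computation to a plain set intersection; a real comparison would need a canonical form for rules so that equivalent conditions compare equal.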
|
28
|
Pfister N, Williams EG, Peters J, Aebersold R, Bühlmann P. Stabilizing variable selection and regression. Ann Appl Stat 2021. [DOI: 10.1214/21-aoas1487] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Affiliation(s)
- Niklas Pfister
- Department of Mathematical Sciences, University of Copenhagen
- Evan G. Williams
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg
- Jonas Peters
- Department of Mathematical Sciences, University of Copenhagen
|
29
|
Wu Y, Di B, Luo Y, Grieneisen ML, Zeng W, Zhang S, Deng X, Tang Y, Shi G, Yang F, Zhan Y. A robust approach to deriving long-term daily surface NO2 levels across China: Correction to substantial estimation bias in back-extrapolation. ENVIRONMENT INTERNATIONAL 2021; 154:106576. [PMID: 33901976 DOI: 10.1016/j.envint.2021.106576] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 12/05/2020] [Revised: 04/09/2021] [Accepted: 04/09/2021] [Indexed: 06/12/2023]
Abstract
BACKGROUND Long-term surface NO2 data are essential for retrospective policy evaluation and chronic human exposure assessment. In the absence of NO2 observations for Mainland China before 2013, training a model with 2013-2018 data to make predictions for 2005-2012 (back-extrapolation) could cause substantial estimation bias due to concept drift. OBJECTIVE This study aims to correct the estimation bias in order to reconstruct the spatiotemporal distribution of daily surface NO2 levels across China during 2005-2018. METHODS On the basis of ground- and satellite-based data, we proposed the robust back-extrapolation with a random forest (RBE-RF) to simulate the surface NO2 through intermediate modeling of the scaling factors. For comparison purposes, we also employed a random forest (Base-RF), as a representative of the commonly used approach, to directly model the surface NO2 levels. RESULTS The validation against Taiwan's NO2 observations during 2005-2012 showed that RBE-RF adequately corrected the substantial underestimation by Base-RF. The RMSE decreased from 10.1 to 8.2 µg/m³, 7.1 to 4.3 µg/m³, and 6.1 to 2.9 µg/m³ in predicting daily, monthly, and annual levels, respectively. For North China with the most severe pollution, the population-weighted NO2 ([NO2]pw) during 2005-2012 was estimated as 40.2 and 50.9 µg/m³ by Base-RF and RBE-RF, respectively, i.e., a 21.0% difference. While both models predicted that the national annual [NO2]pw increased during 2005-2011 and then decreased, the interannual trends were underestimated by >50.2% by Base-RF relative to RBE-RF. During 2005-2018, the nationwide population that lived in the areas with NO2 > 40 µg/m³ was estimated as 259 and 460 million by Base-RF and RBE-RF, respectively. CONCLUSION With RBE-RF, we corrected the estimation bias in back-extrapolation and obtained a full-coverage dataset of daily surface NO2 across China during 2005-2018, which is valuable for environmental management and epidemiological research.
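The 21.0% figure for North China is consistent with taking the gap between the two estimates relative to the RBE-RF value. The abstract does not state the denominator, so this reading is an assumption, shown here only as a quick arithmetic check:

```latex
\[
\frac{[\mathrm{NO_2}]_{pw}^{\text{RBE-RF}} - [\mathrm{NO_2}]_{pw}^{\text{Base-RF}}}{[\mathrm{NO_2}]_{pw}^{\text{RBE-RF}}}
= \frac{50.9 - 40.2}{50.9}
= \frac{10.7}{50.9}
\approx 0.210
\]
```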
Affiliation(s)
- Yangyang Wu
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China
- Baofeng Di
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China; Institute for Disaster Management and Reconstruction, Sichuan University, Chengdu, Sichuan 610200, China
- Yuzhou Luo
- Department of Land, Air, and Water Resources, University of California, Davis, CA 95616, United States
- Michael L Grieneisen
- Department of Land, Air, and Water Resources, University of California, Davis, CA 95616, United States
- Wen Zeng
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China
- Shifu Zhang
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China
- Xunfei Deng
- Institute of Digital Agriculture, Zhejiang Academy of Agricultural Sciences, Hangzhou, Zhejiang 310021, China
- Yulei Tang
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China; Natural Resources Comprehensive Survey Command Center, China Geological Survey, Beijing 100055, China
- Guangming Shi
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China; National Engineering Research Center for Flue Gas Desulfurization, Chengdu, Sichuan 610065, China
- Fumo Yang
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China; National Engineering Research Center for Flue Gas Desulfurization, Chengdu, Sichuan 610065, China
- Yu Zhan
- Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China; National Engineering Research Center for Flue Gas Desulfurization, Chengdu, Sichuan 610065, China; Yibin Institute of Industrial Technology, Sichuan University Yibin Park, Yibin 644000, China
|
30
|
Abstract
Like a hydra, fraudsters adapt and circumvent increasingly sophisticated barriers erected by public or private institutions. Among these institutions, banks must quickly take measures to avoid losses while guaranteeing the satisfaction of law-abiding customers. Facing an expanding flow of operations, effective banking relies on data analytics to support established risk control processes, but also on a better understanding of the underlying fraud mechanism. In addition, fraud being a criminal offence, the evidential aspect of the process must also be considered. These legal, operational, and strategic constraints lead to compromises on the means to be implemented for fraud management. This paper first focuses on the translation of practical questions raised in the banking industry at each step of the fraud management process into performance evaluation required to design a fraud detection model. Secondly, it considers a range of machine learning approaches that address these specificities: the imbalance between fraudulent and nonfraudulent operations, the lack of fully trusted labels, the concept-drift phenomenon, and the unavoidable trade-off between accuracy and interpretability of detection. This state-of-the-art review sheds some light on a technology race between black box machine learning models improved by post-hoc interpretation and intrinsic interpretable models boosted to gain accuracy. Finally, it discusses how concrete and promising hybrid approaches can provide pragmatic, short-term answers to banks and policy makers without swallowing up stakeholders with economical and ethical stakes in this technological race.
|
31
|
Affiliation(s)
- Zhigen Zhao
- Department of Statistical Science, Temple University, Philadelphia, PA
- Jun S. Liu
- Department of Statistics, Harvard University, Cambridge, MA
|
32
|
Mo W, Qi Z, Liu Y. Rejoinder: Learning Optimal Distributionally Robust Individualized Treatment Rules. J Am Stat Assoc 2021; 116:699-707. [PMID: 34177008 PMCID: PMC8221610 DOI: 10.1080/01621459.2020.1866581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2020] [Accepted: 12/12/2020] [Indexed: 10/21/2022]
Abstract
We thank the editors for the opportunity for this discussion and the discussants for their insightful comments and thoughtful contributions. We also want to congratulate Kallus (2020) for his inspiring work in improving the efficiency of policy learning by retargeting. Motivated by the discussion in Dukes and Vansteelandt (2020), we first point out interesting connections and distinctions between our work and Kallus (2020) in Section 1. In particular, the assumptions and sources of variation for consideration in these two papers lead to different research problems with different scopes and focuses. In Section 2, following the discussions in Li et al. (2020) and Liang and Zhao (2020), we also consider the efficient policy evaluation problem when we have some data from the testing distribution available at the training stage. We show that under the assumption that the sample sizes from training and testing are growing in the same order, efficient value function estimates can deliver competitive performance. We further show some connections of these estimates with existing literature. However, when the testing sample size available for training grows at a slower order, efficient value function estimates may no longer perform well. In contrast, the requirement of the testing sample size for DRITR is not as strong as that of efficient policy evaluation using the combined data. Finally, we highlight the general applicability and usefulness of DRITR in Section 3.
Affiliation(s)
- Weibin Mo
- Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Zhengling Qi
- Department of Decision Sciences, George Washington University, Washington, D.C. 20052, USA
- Yufeng Liu
- Department of Statistics and Operations Research, Department of Genetics, Department of Biostatistics, Carolina Center for Genome Science, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, NC 27599, USA
|
33
|
Knowledge Management for Sustainable Development in the Era of Continuously Accelerating Technological Revolutions: A Framework and Models. SUSTAINABILITY 2021. [DOI: 10.3390/su13063353] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/16/2022]
Abstract
This conceptual, interdisciplinary paper will start by introducing the commencement of a new era in which human society faces continuously accelerating technological revolutions, named the Post Accelerating Data and Knowledge Online Society, or ‘Padkos’ (Afrikaans for “food for the journey; prog; provisions for journey”) for short. In this context, a conceptual model of sustainable development with a focus on knowledge management and sharing will be proposed. The construct of knowledge management will be unpacked into a new three-layer model with a focus on the knowledge-human and data-machine spheres. Then, each sphere will be discussed with concentration on the learning and decision-making processes, the digital supporting systems and the human actors’ aspects. Moreover, the recombination of new knowledge development and contemporary knowledge management into one amalgamated construct will be proposed. The holistic conceptual model of knowledge management for sustainable development comprises time, cybersecurity and two alternative humanistic paradigms (Homo Technologicus and Homo Sustainabiliticus). Two additional particular models are discussed in depth. First, a recently proposed model of quantum organizational decision-making is elaborated. Next, a boundary management and learning process is deliberated. The paper ends with a number of propositions, several implications for the future based on the deliberations and models discussed, and conclusions.
|
34
|
Abstract
A systematic and reproducible “workflow”—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between design principles and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggestions for practices and tools to advance reproducible, sound data-intensive analysis may furnish support for both students new to research and current researchers who are new to data-intensive work.
|
35
|
Dimitriadis T, Gneiting T, Jordan AI. Stable reliability diagrams for probabilistic classifiers. Proc Natl Acad Sci U S A 2021; 118:e2016191118. [PMID: 33597296 PMCID: PMC7923594 DOI: 10.1073/pnas.2016191118] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/18/2022] Open
Abstract
A probability forecast or probabilistic classifier is reliable or calibrated if the predicted probabilities are matched by ex post observed frequencies, as examined visually in reliability diagrams. The classical binning and counting approach to plotting reliability diagrams has been hampered by a lack of stability under unavoidable, ad hoc implementation decisions. Here, we introduce the CORP approach, which generates provably statistically consistent, optimally binned, and reproducible reliability diagrams in an automated way. CORP is based on nonparametric isotonic regression and implemented via the pool-adjacent-violators (PAV) algorithm; essentially, the CORP reliability diagram shows the graph of the PAV-(re)calibrated forecast probabilities. The CORP approach allows for uncertainty quantification via either resampling techniques or asymptotic theory, furnishes a numerical measure of miscalibration, and provides a CORP-based Brier-score decomposition that generalizes to any proper scoring rule. We anticipate that judicious uses of the PAV algorithm yield improved tools for diagnostics and inference for a very wide range of statistical and machine learning methods.
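The core step named in this abstract, recalibrating forecast probabilities with the pool-adjacent-violators algorithm, can be sketched in a few lines. This is a generic textbook PAV for binary outcomes, not the authors' CORP software; the function name `pav_recalibrate` and the toy data are hypothetical, and the uncertainty bands and score decomposition mentioned in the abstract are not attempted here.

```python
from typing import List, Sequence, Tuple


def pav_recalibrate(
    forecasts: Sequence[float], outcomes: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Isotonic (non-decreasing) fit of binary outcomes on forecast probabilities.

    Returns the forecasts in ascending order together with the PAV-recalibrated
    probability for each of them; plotting one against the other gives the
    curve of a reliability diagram.
    """
    pairs = sorted(zip(forecasts, outcomes))

    # Pool observations that share the same forecast value first,
    # so tied forecasts always receive the same recalibrated value.
    groups: List[List[float]] = []  # each entry: [sum of outcomes, count]
    last_x = None
    for x, y in pairs:
        if groups and x == last_x:
            groups[-1][0] += y
            groups[-1][1] += 1
        else:
            groups.append([float(y), 1])
            last_x = x

    # Pool adjacent violators: merge neighbouring blocks while the mean of the
    # left block exceeds the mean of the right block.
    blocks: List[List[float]] = []
    for total, count in groups:
        blocks.append([total, count])
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            merged_total, merged_count = blocks.pop()
            blocks[-1][0] += merged_total
            blocks[-1][1] += merged_count

    xs = [x for x, _ in pairs]
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return xs, fitted


# Toy data: the forecast 0.3 was followed by an event and 0.4 was not,
# so PAV pools those two observations into one bin with frequency 0.5.
xs, fitted = pav_recalibrate([0.1, 0.4, 0.3, 0.8, 0.9], [0, 0, 1, 1, 1])
```

Block means are compared by cross-multiplication rather than division, which keeps the comparison free of rounding error when the outcomes are 0/1 counts.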
Affiliation(s)
- Timo Dimitriadis
- Alfred Weber Institute of Economics, Heidelberg University, 69115 Heidelberg, Germany
- Computational Statistics Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
- Tilmann Gneiting
- Computational Statistics Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
- Institute for Stochastics, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
- Alexander I Jordan
- Computational Statistics Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany
|
36
|
Ward OG, Huang Z, Davison A, Zheng T. Next waves in veridical network embedding*. Stat Anal Data Min 2021. [DOI: 10.1002/sam.11486] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Affiliation(s)
- Owen G. Ward
- Department of Statistics, Columbia University, New York, New York, USA
- Zhen Huang
- Department of Statistics, Columbia University, New York, New York, USA
- Andrew Davison
- Department of Statistics, Columbia University, New York, New York, USA
- Tian Zheng
- Department of Statistics, Columbia University, New York, New York, USA
- Data Science Institute, Columbia University, New York, New York, USA
|
37
|
Rothenhäusler D, Meinshausen N, Bühlmann P, Peters J. Anchor regression: Heterogeneous data meet causality. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12398] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 11/30/2022]
Affiliation(s)
- Jonas Peters
- Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
|
38
|
Yu B. Independence and Diversity as Taught by My Mentors. LEADERSHIP IN STATISTICS AND DATA SCIENCE 2021:341-348. [DOI: 10.1007/978-3-030-60060-0_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
|
39
|
Candès E, Sabatti C. Discussion of the Paper “Prediction, Estimation, and Attribution” by B. Efron. Int Stat Rev 2020. [DOI: 10.1111/insr.12412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Emmanuel Candès
- Department of Mathematics, Stanford University, Stanford, CA
- Department of Statistics, Stanford University, Stanford, CA
- Chiara Sabatti
- Department of Statistics, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford University, Stanford, CA
|
40
|
Affiliation(s)
- Bin Yu
- Statistics Department, University of California Berkeley, Berkeley, CA
- EECS Department, University of California Berkeley, Berkeley, CA
- Chan Zuckerberg Biohub, San Francisco, CA
- Rebecca Barter
- Statistics Department, University of California Berkeley, Berkeley, CA
|
41
|
Dwivedi R, Tan YS, Park B, Wei M, Horgan K, Madigan D, Yu B. Stable Discovery of Interpretable Subgroups via Calibration in Causal Studies. Int Stat Rev 2020. [DOI: 10.1111/insr.12427] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/20/2022]
Affiliation(s)
- Raaz Dwivedi
- Department of EECS, University of California, Berkeley, Berkeley, CA, USA
- Yan Shuo Tan
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
- Briton Park
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
- Mian Wei
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
- Kevin Horgan
- Protypia Inc, 111 10th Avenue South, Suite 102, Nashville, TN 37023, USA
- David Madigan
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
- Bin Yu
- Department of EECS, University of California, Berkeley, Berkeley, CA, USA
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
- Division of Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
|
42
|
Hur C, Wi J, Kim Y. Facilitating the Development of Deep Learning Models with Visual Analytics for Electronic Health Records. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2020; 17:E8303. [PMID: 33182703 PMCID: PMC7697823 DOI: 10.3390/ijerph17228303] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/28/2020] [Revised: 10/27/2020] [Accepted: 11/04/2020] [Indexed: 11/24/2022]
Abstract
Electronic health record (EHR) data are widely used to perform early diagnoses and create treatment plans, which are key areas of research. We aimed to increase the efficiency of iteratively applying data-intensive technology and verifying the results for complex and big EHR data. We used a system entailing sequence mining, interpretable deep learning models, and visualization on data extracted from the MIMIC-III database for a group of patients diagnosed with heart disease. The results of sequence mining corresponded to specific pathways of interest to medical staff and were used to select patient groups that underwent these pathways. An interactive Sankey diagram representing these pathways and a heat map visually representing the weight of each variable were developed for temporal and quantitative illustration. We applied the proposed system to predict unplanned cardiac surgery using clinical pathways determined by sequence pattern mining to select cardiac surgery from complex EHRs to label subject groups and deep learning models. The proposed system aids in the selection of pathway-based patient groups, the simplification of labeling, and the exploratory interpretation of the modeling results. The proposed system can help medical staff explore various pathways that patients have undergone and further facilitate the testing of various clinical hypotheses using big data in the medical domain.
Affiliation(s)
- Cinyoung Hur
- Linewalks, 8F, 5, Teheran-ro 14-gil, Gangnam-gu, Seoul 06235, Korea
- JeongA Wi
- Graduate School of Advanced Imaging Science, Multimedia & Film, Chung-Ang University 84, Heukseok ro, Dongjak-gu, Seoul 06974, Korea
- YoungBin Kim
- Graduate School of Advanced Imaging Science, Multimedia & Film, Chung-Ang University 84, Heukseok ro, Dongjak-gu, Seoul 06974, Korea
|
43
|
Bühlmann P, Ćevid D. Deconfounding and Causal Regularisation for Stability and External Validity. Int Stat Rev 2020. [DOI: 10.1111/insr.12426] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
Affiliation(s)
- Domagoj Ćevid
- Seminar for Statistics, ETH Zürich, Zürich, Switzerland
|
44
|
Veridical Causal Inference using Propensity Score Methods for Comparative Effectiveness Research with Medical Claims. HEALTH SERVICES AND OUTCOMES RESEARCH METHODOLOGY 2020; 21:206-228. [PMID: 34040495 DOI: 10.1007/s10742-020-00222-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/18/2022]
Abstract
Medical insurance claims are becoming increasingly common data sources to answer a variety of questions in biomedical research. Although comprehensive in terms of longitudinal characterization of disease development and progression for a potentially large number of patients, population-based inference using these datasets requires thoughtful modifications to sample selection and analytic strategies relative to other types of studies. Along with complex selection bias and missing data issues, claims-based studies are purely observational, which limits effective understanding and characterization of the treatment differences between groups being compared. All these issues contribute to a crisis in reproducibility and replication of comparative findings using medical claims. This paper offers practical guidance to the analytical process, demonstrates methods for estimating causal treatment effects with propensity score methods for several types of outcomes common to such studies, such as binary, count, time to event and longitudinally-varying measures, and also aims to increase transparency and reproducibility of reporting of results from these investigations. We provide an online version of the paper with readily implementable code for the entire analysis pipeline to serve as a guided tutorial for practitioners. The online version can be accessed at https://rydaro.github.io/. The analytic pipeline is illustrated using a sub-cohort of patients with advanced prostate cancer from the large Clinformatics™ Data Mart Database (OptumInsight, Eden Prairie, Minnesota), consisting of 73 million distinct private payer insurees from 2001-2016.
|
45
|
|
46
|
Affiliation(s)
- Bin Yu
- Statistics Department, University of California Berkeley, Berkeley, CA
- EECS Department, University of California Berkeley, Berkeley, CA
- Chan Zuckerberg Biohub, San Francisco, CA
- Rebecca Barter
- Statistics Department, University of California Berkeley, Berkeley, CA
|
47
|
Candès E, Sabatti C. Discussion of the Paper “Prediction, Estimation, and Attribution” by B. Efron. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1762618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Emmanuel Candès
- Department of Mathematics, Stanford University, Stanford, CA
- Department of Statistics, Stanford University, Stanford, CA
- Chiara Sabatti
- Department of Statistics, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford University, Stanford, CA
|
48
|
QnAs with Bin Yu. Proc Natl Acad Sci U S A 2020; 117:3893-3894. [DOI: 10.1073/pnas.2001302117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
|