1
Masoero L, Thomas E, Parmigiani G, Tyekucheva S, Trippa L. Cross-Study Replicability in Cluster Analysis. Stat Sci 2023; 38:303-316. [PMID: 37885824 PMCID: PMC10600961 DOI: 10.1214/22-sts871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2023]
Affiliation(s)
- Emma Thomas
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
2
Ventz S, Mazumder R, Trippa L. Integration of survival data from multiple studies. Biometrics 2022; 78:1365-1376. [PMID: 34190337 DOI: 10.1111/biom.13517] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/03/2020] [Revised: 05/24/2021] [Accepted: 06/17/2021] [Indexed: 12/30/2022]
Abstract
We introduce a statistical procedure that integrates datasets from multiple biomedical studies to predict patients' survival, based on individual clinical and genomic profiles. The proposed procedure accounts for potential differences in the relation between predictors and outcomes across studies, due to distinct patient populations, treatments and technologies to measure outcomes and biomarkers. These differences are modeled explicitly with study-specific parameters. We use hierarchical regularization to shrink the study-specific parameters towards each other and to borrow information across studies. The estimation of the study-specific parameters utilizes a similarity matrix, which summarizes differences and similarities of the relations between covariates and outcomes across studies. We illustrate the method in a simulation study and using a collection of gene expression datasets in ovarian cancer. We show that the proposed model increases the accuracy of survival predictions compared to alternative meta-analytic methods.
Affiliation(s)
- Steffen Ventz
  - Department of Data Science, Dana-Farber Cancer Institute and Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Rahul Mazumder
  - Sloan School of Management, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Lorenzo Trippa
  - Department of Data Science, Dana-Farber Cancer Institute and Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
3
Guan BZ, Parmigiani G, Braun D, Trippa L. Prediction of Hereditary Cancers Using Neural Networks. Ann Appl Stat 2022; 16:495-520. [PMID: 37873507 PMCID: PMC10593124 DOI: 10.1214/21-aoas1510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Family history is a major risk factor for many types of cancer. Mendelian risk prediction models translate family histories into cancer risk predictions, based on knowledge of cancer susceptibility genes. These models are widely used in clinical practice to help identify high-risk individuals. Mendelian models leverage the entire family history, but they rely on many assumptions about cancer susceptibility genes that are either unrealistic or challenging to validate, due to low mutation prevalence. Training more flexible models, such as neural networks, on large databases of pedigrees can potentially lead to accuracy gains. In this paper we develop a framework to apply neural networks to family history data and investigate their ability to learn inherited susceptibility to cancer. While there is an extensive literature on neural networks and their state-of-the-art performance in many tasks, there is little work applying them to family history data. We propose adaptations of fully-connected neural networks and convolutional neural networks to pedigrees. In data simulated under Mendelian inheritance, we demonstrate that our proposed neural network models are able to achieve nearly optimal prediction performance. Moreover, when the observed family history includes misreported cancer diagnoses, neural networks are able to outperform the Mendelian BRCAPRO model embedding the correct inheritance laws. Using a large dataset of over 200,000 family histories, the Risk Service cohort, we train prediction models for future risk of breast cancer. We validate the models using data from the Cancer Genetics Network.
Affiliation(s)
- Zoe Guan
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center
- Danielle Braun
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
- Lorenzo Trippa
  - Department of Data Sciences, Dana-Farber Cancer Institute
4
Quantitative Cluster Headache Analysis for Neurological Diagnosis Support Using Statistical Classification. Information 2020. [DOI: 10.3390/info11080393] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022] [Open access]
Abstract
Cluster headache (CH) belongs to group III of The International Classification of Headaches. It is characterized by attacks of severe pain in the ocular/periocular area accompanied by cranial autonomic signs, including parasympathetic activation and sympathetic hypofunction on the symptomatic side. Iris pigmentation occurs in the neonatal period and depends on the sympathetic tone in each eye. We hypothesized that the presence of visible or subtle iris color changes in both eyes could be used as a quantitative biomarker for screening and early detection of CH. This work examines the scope of an automatic diagnosis-support system for early detection of CH that uses as its indicator the error rate of a statistical classifier designed to identify the eye (left vs. right) from iris pixels in color images. Systematic tests were performed on a database with images of 11 subjects (four with CH, four with other ophthalmic diseases affecting iris pigmentation, and three control subjects). Several aspects of classifier design were addressed, including: (a) the most suitable color space for the statistical classifier; (b) whether combining features from several color spaces is beneficial; (c) the robustness of the classifier to iris spatial subregions; (d) the contribution of the pixel neighborhood. Our results showed that a reduced error rate (lower than 0.25) can be used as a CH marker, although structural regions of the iris image need to be taken into account. Iris color feature analysis using statistical classification is a potentially useful technique to investigate disorders affecting the autonomic nervous system in CH.
5
Large-scale predictive modeling and analytics through regression queries in data management systems. International Journal of Data Science and Analytics 2020. [DOI: 10.1007/s41060-018-0163-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
6
Meruelo AD, Jacobus J, Idy E, Nguyen-Louie T, Brown G, Tapert SF. Early adolescent brain markers of late adolescent academic functioning. Brain Imaging Behav 2020; 13:945-952. [PMID: 29911279 DOI: 10.1007/s11682-018-9912-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022]
Abstract
Academic performance in adolescence strongly influences adult prospects. Intelligence quotient (IQ) has historically been considered a strong predictor of academic performance. Less objectively explored have been morphometric features. We analyzed brain MRI morphometry metrics in early adolescence (age 12-14 years) as quantitative predictors of academic performance over high school using a naïve Bayesian classifier approach with n = 170 subjects. Based on the mean GPA, subjects were divided into high (GPA ≥3.54; n = 87) and low (GPA <3.54; n = 83) academic performers. Covariance analysis was performed to examine the influence of subject demographics. We examined predictive features from the 343 available regions (surface areas, cortical thickness, and subcortical volumes) and applied 4 algorithms for selection and reduction of attributes using Weka. Cortical thickness measures performed better than surface areas or subcortical volumes as predictors of academic performance. We identified 15 cortical thickness regions most predictive of academic performance, three of which have not been described in the literature as predictive of academic performance. These were in the left hemisphere fusiform, bilateral insula, and left hemisphere paracentral regions. Prediction had a sensitivity of 0.65 and specificity of 0.73 with independent validation. Follow-up independent t-test analyses between high and low academic achievers on 10 of 15 regions showed between-group significance at the p < 0.05 level. High achievers demonstrated thicker cortices than low achievers. These newly identified regions may help pinpoint new targets for further study in understanding the developing adolescent brain in the classroom setting.
Affiliation(s)
- Alejandro Daniel Meruelo
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Drive #0603V, La Jolla, CA, 92093, USA
- Joanna Jacobus
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Drive #0603V, La Jolla, CA, 92093, USA
- Erick Idy
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Drive #0603V, La Jolla, CA, 92093, USA
- Tam Nguyen-Louie
  - San Diego State University/University of California San Diego Joint Doctoral Program in Clinical Psychology, San Diego, CA, USA
- Gregory Brown
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Drive #0603V, La Jolla, CA, 92093, USA
  - VA San Diego Healthcare System, La Jolla, CA, USA
- Susan Frances Tapert
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Drive #0603V, La Jolla, CA, 92093, USA
7
Gerber S, Pospisil L, Navandar M, Horenko I. Low-cost scalable discretization, prediction, and feature selection for complex systems. Science Advances 2020; 6:eaaw0961. [PMID: 32064328 PMCID: PMC6989146 DOI: 10.1126/sciadv.aaw0961] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 11/16/2018] [Accepted: 11/22/2019] [Indexed: 06/10/2023]
Abstract
Finding reliable discrete approximations of complex systems is a key prerequisite when applying many of the most popular modeling tools. Common discretization approaches (e.g., the very popular K-means clustering) are crucially limited in terms of quality, parallelizability, and cost. We introduce a low-cost improved quality scalable probabilistic approximation (SPA) algorithm, allowing for simultaneous data-driven optimal discretization, feature selection, and prediction. We prove its optimality, parallel efficiency, and a linear scalability of iteration cost. Cross-validated applications of SPA to a range of large realistic data classification and prediction problems reveal marked cost and performance improvements. For example, SPA allows the data-driven next-day predictions of resimulated surface temperatures for Europe with the mean prediction error of 0.75°C on a common PC (being around 40% better in terms of errors and five to six orders of magnitude cheaper than with common computational instruments used by the weather services).
Affiliation(s)
- S. Gerber
  - Center of Computational Sciences, Johannes-Gutenberg-University of Mainz, PhysMat/Staudingerweg 9, 55128 Mainz, Germany
- L. Pospisil
  - Faculty of Informatics, Universita della Svizzera Italiana, Via G. Buffi 13, 6900 Lugano, Switzerland
- M. Navandar
  - Center of Computational Sciences, Johannes-Gutenberg-University of Mainz, PhysMat/Staudingerweg 9, 55128 Mainz, Germany
- I. Horenko
  - Faculty of Informatics, Universita della Svizzera Italiana, Via G. Buffi 13, 6900 Lugano, Switzerland
8
Gendoo DMA, Zon M, Sandhu V, Manem VSK, Ratanasirigulchai N, Chen GM, Waldron L, Haibe-Kains B. MetaGxData: Clinically Annotated Breast, Ovarian and Pancreatic Cancer Datasets and their Use in Generating a Multi-Cancer Gene Signature. Sci Rep 2019; 9:8770. [PMID: 31217513 PMCID: PMC6584731 DOI: 10.1038/s41598-019-45165-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 11/19/2018] [Accepted: 05/31/2019] [Indexed: 12/13/2022] [Open access]
Abstract
A wealth of transcriptomic and clinical data on solid tumours are under-utilized due to unharmonized data storage and format. We have developed the MetaGxData package compendium, which includes manually-curated and standardized clinical, pathological, survival, and treatment metadata across breast, ovarian, and pancreatic cancer data. MetaGxData is the largest compendium of curated transcriptomic data for these cancer types to date, spanning 86 datasets and encompassing 15,249 samples. Open access to standardized metadata across cancer types promotes use of their transcriptomic and clinical data in a variety of cross-tumour analyses, including identification of common biomarkers, and assessing the validity of prognostic signatures. Here, we demonstrate that MetaGxData is a flexible framework that facilitates meta-analyses by using it to identify common prognostic genes in ovarian and breast cancer. Furthermore, we use the data compendium to create the first gene signature that is prognostic in a meta-analysis across 3 cancer types. These findings demonstrate the potential of MetaGxData to serve as an important resource in oncology research, and provide a foundation for future development of cancer-specific compendia.
Affiliation(s)
- Deena M A Gendoo
  - Centre for Computational Biology, Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
- Michael Zon
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
  - Department of Biomedical Engineering, McMaster University, Toronto, L8S 4L8, Canada
- Vandana Sandhu
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
- Venkata S K Manem
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, M5S 3H7, Canada
  - Institut Universitaire de Cardiologie et de Pneumologie de Québec, Université Laval, Québec City, G1V 4G5, Canada
- Gregory M Chen
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
- Levi Waldron
  - Graduate School of Public Health and Health Policy, Institute of Implementation Science in Population Health, City University of New York School, New York, 11101, USA
- Benjamin Haibe-Kains
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, M5S 3H7, Canada
  - Department of Computer Science, University of Toronto, Toronto, M5T 3A1, Canada
  - Ontario Institute of Cancer Research, Toronto, M5G 0A3, Canada
  - Vector Institute, Toronto, M5G 1M1, Canada
9
Yuan HY, Wen TH, Kung YH, Tsou HH, Chen CH, Chen LW, Lin PS. Prediction of annual dengue incidence by hydro-climatic extremes for southern Taiwan. International Journal of Biometeorology 2019; 63:259-268. [PMID: 30680621 DOI: 10.1007/s00484-018-01659-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/12/2018] [Revised: 10/30/2018] [Accepted: 12/03/2018] [Indexed: 05/16/2023]
Abstract
Dengue is one of the most rapidly spreading mosquito-borne viral diseases in the world. An increase in the incidence of dengue is commonly thought to be a consequence of variability in weather conditions. Taiwan, which straddles the Tropic of Cancer, is an excellent place to study the relationship between weather conditions and dengue fever cases since the island forms an isolated geographic environment. Clarifying the association between extreme weather conditions and annual dengue incidence is therefore an important issue for epidemic early warning. In this paper, we develop a Poisson regression model with extreme weather parameters for prediction of annual dengue incidence. A leave-one-out method is used to evaluate predictive performance. Our results indicate that dengue transmission has a positive relationship with the minimum temperature predictors during the early summer and a negative relationship with all the maximum 24-h rainfall predictors during the early epidemic phase of dengue outbreaks. Our findings provide a better understanding of the relationships between extreme weather and annual trends in dengue cases in Taiwan and could have important implications for dengue forecasts in surrounding areas with similar meteorological conditions.
Affiliation(s)
- Hsiang-Yu Yuan
  - Department of Biomedical Sciences, City University of Hong Kong, Kowloon Tong, Hong Kong
- Tzai-Hung Wen
  - Department of Geography, National Taiwan University, Taipei City, Taiwan
- Yi-Hung Kung
  - National Mosquito-Borne Disease Control Research Center, National Health Research Institutes, 35 Keyan Road, Zhuna, Miaoli, 350, Taiwan
- Hsiao-Hui Tsou
  - National Mosquito-Borne Disease Control Research Center, National Health Research Institutes, 35 Keyan Road, Zhuna, Miaoli, 350, Taiwan
  - Institute of Population Health Sciences, National Health Research Institutes, 35 Keyan Road, Zhuna, Miaoli, 350, Taiwan
- Chun-Hong Chen
  - National Mosquito-Borne Disease Control Research Center, National Health Research Institutes, 35 Keyan Road, Zhuna, Miaoli, 350, Taiwan
  - Institute of Infectious Diseases and Vaccinology, National Health Research Institutes, Zhuna, Taiwan
- Li-Wei Chen
  - Institute of Population Health Sciences, National Health Research Institutes, 35 Keyan Road, Zhuna, Miaoli, 350, Taiwan
- Pei-Sheng Lin
  - National Mosquito-Borne Disease Control Research Center, National Health Research Institutes, 35 Keyan Road, Zhuna, Miaoli, 350, Taiwan
  - Institute of Population Health Sciences, National Health Research Institutes, 35 Keyan Road, Zhuna, Miaoli, 350, Taiwan
10
Development of a clinical prediction model for diagnosing adenomyosis. Fertil Steril 2018; 110:957-964.e3. [DOI: 10.1016/j.fertnstert.2018.06.009] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 03/14/2018] [Revised: 06/05/2018] [Accepted: 06/06/2018] [Indexed: 11/18/2022]
11
Abstract
This article considers replicability of the performance of predictors across studies. We suggest a general approach to investigating this issue, based on ensembles of prediction models trained on different studies. We quantify how the common practice of training on a single study accounts in part for the observed challenges in replicability of prediction performance. We also investigate whether ensembles of predictors trained on multiple studies can be combined, using unique criteria, to design robust ensemble learners trained upfront to incorporate replicability into different contexts and populations.
Affiliation(s)
- Prasad Patil
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02215
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115
- Giovanni Parmigiani
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02215
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115
12
Abu-Alqumsan M, Kapeller C, Hintermüller C, Guger C, Peer A. Invariance and variability in interaction error-related potentials and their consequences for classification. J Neural Eng 2017; 14:066015. [DOI: 10.1088/1741-2552/aa8416] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/12/2022]