1
|
Shen J, Wang S, Sun H, Huang J, Bai L, Wang X, Dong Y, Tang Z. A novel non-negative Bayesian stacking modeling method for Cancer survival prediction using high-dimensional omics data. BMC Med Res Methodol 2024; 24:105. [PMID: 38702624 PMCID: PMC11067084 DOI: 10.1186/s12874-024-02232-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Accepted: 04/23/2024] [Indexed: 05/06/2024] Open
Abstract
BACKGROUND Survival prediction using high-dimensional molecular data is a hot topic in genomics and precision medicine, especially in cancer studies. Because carcinogenesis has a pathway-based pathogenesis, models that use such group structures more closely mimic disease progression and prognosis. Many approaches can integrate group information; however, most are single-model methods, which may lead to unstable prediction.
METHODS We introduced a novel survival stacking method that uses group structure information to improve the robustness of cancer survival prediction in the context of high-dimensional omics data. With a super learner, survival stacking combines the predictions from multiple sub-models that are independently trained using the features in pre-grouped biological pathways. In addition to a non-negative linear combination of sub-models, we extended the super learner to a non-negative Bayesian hierarchical generalized linear model and an artificial neural network (ANN). We compared the proposed modeling strategy with the widely used penalized survival method Lasso Cox and several group penalized methods, e.g., group Lasso Cox, via a simulation study and a real-world data application.
RESULTS The proposed survival stacking method showed superior and robust discrimination compared with single-model methods on high-noise simulated data and real-world data. The non-negative Bayesian stacking method can identify important biological signaling pathways and genes that are associated with cancer prognosis.
CONCLUSIONS This study proposed a novel survival stacking strategy that incorporates biological group information into cancer prognosis models. Additionally, this study extended the super learner to a non-negative Bayesian model and an ANN, enriching the combination of sub-models. The proposed Bayesian stacking strategy exhibited favorable properties in the prediction and interpretation of complex survival data, which may aid in discovering cancer targets.
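The abstract above reports performance "in terms of discrimination" without naming a metric here. A common choice for survival models is Harrell's concordance index, so the following is a minimal sketch under that assumption, not the authors' evaluation code. The function name `concordance_index` and the handling of tied times (skipped) are illustrative choices.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs whose predicted risks are ordered correctly.

    times  -- observed follow-up times
    events -- 1 if the event was observed, 0 if censored
    risks  -- predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: pairs with tied times, or whose earlier subject is
        # censored, are treated as not comparable.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of comparable pairs.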
Affiliation(s)
- Junjie Shen
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
| | - Shuo Wang
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, 79085, Freiburg, Germany
| | - Hao Sun
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
| | - Jie Huang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
| | - Lu Bai
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
| | - Xichao Wang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
| | - Yongfei Dong
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
| | - Zaixiang Tang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China.
| |
|
2
|
Shen J, Wang S, Dong Y, Sun H, Wang X, Tang Z. A non-negative spike-and-slab lasso generalized linear stacking prediction modeling method for high-dimensional omics data. BMC Bioinformatics 2024; 25:119. [PMID: 38509499 PMCID: PMC10953151 DOI: 10.1186/s12859-024-05741-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Accepted: 03/11/2024] [Indexed: 03/22/2024] Open
Abstract
BACKGROUND High-dimensional omics data are increasingly utilized in clinical and public health research for disease risk prediction. Many sparse methods have been proposed that use prior knowledge, e.g., biological group structure information, to guide the model-building process. However, these methods are still based on a single model, often leading to overconfident inferences and inferior generalization.
RESULTS We proposed a novel stacking strategy based on a non-negative spike-and-slab Lasso (nsslasso) generalized linear model (GLM) for disease risk prediction in the context of high-dimensional omics data. Briefly, we used prior biological knowledge to segment omics data into a set of sub-data. Each sub-model was trained separately on the features from one group via a suitable base learner. Then, the predictions of the sub-models were ensembled by a super learner using the nsslasso GLM. The proposed method was compared to several competitors, such as the Lasso, grlasso, and gsslasso, using simulated data and two open-access breast cancer datasets. The proposed method showed robustly superior prediction performance to the optimal single-model method on high-noise simulated data and real-world data. Furthermore, compared to the traditional stacking method, the proposed nsslasso stacking method can efficiently handle redundant sub-models and identify important sub-models.
CONCLUSIONS The proposed nsslasso method demonstrated favorable predictive accuracy, stability, and biological interpretability. Additionally, the proposed method can be used to detect new biomarkers and key group structures.
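The stacking step described above (combining sub-model predictions with non-negative weights) can be sketched as follows. This is a simplified illustration, not the nsslasso GLM: it assumes a squared-error loss and fits the weights by projected gradient descent, and the names `fit_nonneg_weights`, `lr`, and `n_iter` are hypothetical.

```python
def fit_nonneg_weights(preds, y, lr=0.1, n_iter=5000):
    """Fit non-negative stacking weights by projected gradient descent.

    preds -- list of rows; preds[i][k] is sub-model k's prediction for sample i
    y     -- list of observed outcomes
    Assumption: squared-error loss stands in for the GLM likelihood, and no
    spike-and-slab prior is applied.
    """
    n, k = len(preds), len(preds[0])
    w = [1.0 / k] * k  # start from equal weights
    for _ in range(n_iter):
        # Residuals of the current weighted combination.
        resid = [sum(w[j] * row[j] for j in range(k)) - yi
                 for row, yi in zip(preds, y)]
        # Gradient of the mean squared error with respect to each weight.
        grad = [2.0 / n * sum(r * row[j] for r, row in zip(resid, preds))
                for j in range(k)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, wj - lr * gj) for wj, gj in zip(w, grad)]
    return w
```

In this toy setting, a sub-model that carries no extra signal receives a weight near zero, which mirrors the abstract's point about handling redundant sub-models; the sparsity-inducing prior of the actual method is not modeled here.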
Affiliation(s)
- Junjie Shen
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
| | - Shuo Wang
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, 79085, Freiburg, Germany
| | - Yongfei Dong
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
| | - Hao Sun
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
| | - Xichao Wang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
| | - Zaixiang Tang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China.
| |
|
3
|
A machine learning method for improving liver cancer staging. J Biomed Inform 2023; 137:104266. [PMID: 36494059 DOI: 10.1016/j.jbi.2022.104266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 11/13/2022] [Accepted: 12/01/2022] [Indexed: 12/12/2022]
Abstract
Liver cancer is a common malignant tumor, and its clinical stage is closely related to the clinical treatment and prognosis of patients. Currently, the BCLC staging system revised by the BCLC group of University of Barcelona is the globally recognized staging system for liver cancer. However, with the deepening of related research, the current staging system can no longer fully meet the clinical needs. In this work, we propose a novel machine learning method for constructing an automatic hepatocellular carcinoma staging model that incorporates far more clinical variables than any existing staging system. Our model is based on random survival forests, which generates a unique hazard function for each patient. B-splines are used to embed hazard functions into vectors in low-dimensional space and hierarchical clustering method groups similar patients to form staging cohorts. The resulting staging system significantly outperforms the BCLC system in terms of distinctiveness between patients in different stages.
|
4
|
Schomberg J. Identification of Targetable Pathways in Oral Cancer Patients via Random Forest and Chemical Informatics. Cancer Inform 2019; 18:1176935119889911. [PMID: 31819345 PMCID: PMC6883365 DOI: 10.1177/1176935119889911] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/10/2019] [Accepted: 10/31/2019] [Indexed: 02/06/2023] Open
Abstract
Treatment of head and neck cancer has been slow to change with epidermal growth factor receptor (EGFR) inhibitors, PD1 inhibitors, and taxane-/plant-alkaloid-derived chemotherapies being the only therapies approved by the U.S. Food and Drug Administration (FDA) in the last 10 years for the treatment of head and neck cancers. Head and neck cancer is a relatively rare cancer compared to breast or lung cancers. However, it is possible that existing therapies for more common solid tumors or for the treatment of other diseases could also prove effective against oral cancers. Many therapies have molecular targets that could be appropriate in oral cancer as well as the cancer in which the drug gained initial FDA approval. Also, there may be targets in oral cancer for which existing FDA-approved drugs could be applied. This study describes informatics methods that use machine learning to identify influential gene targets in patients receiving platinum-based chemotherapy, non-platinum-based chemotherapy, and genes influential in both groups of patients. This analysis yielded 6 small molecules that had a high Tanimoto similarity (>50%) to ligands binding genes shown to be highly influential in determining treatment response in oral cancer patients. In addition to influencing treatment response, these genes were also found to act as gene hubs connected to more than 100 other genes in pathways enriched with genes determined to be influential in treatment response by a random forest classifier with 20 000 trees trying 320 variables at each tree node. This analysis validates the use of multiple informatics methods to identify small molecules that have a greater likelihood of efficacy in a given cancer of interest.
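The Tanimoto similarity threshold mentioned above (>50%) can be illustrated with a short sketch. It assumes each molecule is represented by the set of "on" bit positions of a binary fingerprint; the abstract does not state which fingerprint was used, and the function name `tanimoto` is illustrative.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Hypothetical fingerprints: three shared bits out of four distinct bits.
candidate = {1, 2, 3}
known_ligand = {1, 2, 3, 4}
is_similar = tanimoto(candidate, known_ligand) > 0.5
```

The coefficient is the number of shared on-bits divided by the number of bits set in either fingerprint, so it ranges from 0 (no overlap) to 1 (identical bit sets).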
Affiliation(s)
- John Schomberg
- CHOC Children's, Orange, CA, USA; School of Population Health Science, University of California Irvine, Irvine, CA, USA; Afecta Pharmaceuticals, Irvine, CA, USA
| |
|
5
|
Jardillier R, Chatelain F, Guyon L. Bioinformatics Methods to Select Prognostic Biomarker Genes from Large Scale Datasets: A Review. Biotechnol J 2018; 13:e1800103. [DOI: 10.1002/biot.201800103] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 05/18/2018] [Revised: 10/15/2018] [Indexed: 12/28/2022]
Affiliation(s)
- Rémy Jardillier
- University Grenoble Alpes, CEA, INSERM, Biology of Cancer Infection UMR_S 1036, 38000 Grenoble, France
- University Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Institute of Engineering University Grenoble Alpes, 38000 Grenoble, France
| | - Florent Chatelain
- University Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Institute of Engineering University Grenoble Alpes, 38000 Grenoble, France
| | - Laurent Guyon
- University Grenoble Alpes, CEA, INSERM, Biology of Cancer Infection UMR_S 1036, 38000 Grenoble, France
| |
|
6
|
Korkmaz S, Goksuluk D, Zararsiz G, Karahan S. geneSurv: An interactive web-based tool for survival analysis in genomics research. Comput Biol Med 2017; 89:487-496. [DOI: 10.1016/j.compbiomed.2017.08.031] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/12/2017] [Revised: 08/23/2017] [Accepted: 08/24/2017] [Indexed: 12/26/2022]
|
7
|
Liu W, Wang W, Tian G, Xie W, Lei L, Liu J, Huang W, Xu L, Li E. Topologically inferring pathway activity for precise survival outcome prediction: breast cancer as a case. Mol Biosyst 2017; 13:537-548. [PMID: 28098303 DOI: 10.1039/c6mb00757k] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 02/05/2023]
Abstract
Accurately predicting the survival outcome of patients is of great importance in clinical cancer research. In the past decade, building survival prediction models based on gene expression data has received increasing interest. However, the existing methods are mainly based on individual gene signatures, which are known to have limited prediction accuracy on independent datasets and unclear biological relevance. Here, we propose a novel pathway-based survival prediction method called DRWPSurv in order to accurately predict survival outcome. DRWPSurv integrates gene expression profiles and prior gene interaction information to topologically infer survival associated pathway activities, and uses the pathway activities as features to construct Lasso-Cox model. It uses topological importance of genes evaluated by directed random walk to enhance the robustness of pathway activities and thereby improve the predictive performance. We applied DRWPSurv on three independent breast cancer datasets and compared the predictive performance with a traditional gene-based method and four pathway-based methods. Results showed that pathway-based methods obtained comparable or better predictive performance than the gene-based method, whereas DRWPSurv could predict survival outcome with better accuracy and robustness among the pathway-based methods. In addition, the risk pathways identified by DRWPSurv provide biologically informative models for breast cancer prognosis and treatment.
Affiliation(s)
- Wei Liu
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China. and Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
| | - Wei Wang
- Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
| | - Guohua Tian
- Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
| | - Wenming Xie
- Network Information Center, Shantou University Medical College, Shantou, 515041, China
| | - Li Lei
- Network Information Center, Shantou University Medical College, Shantou, 515041, China
| | - Jiujin Liu
- Network Information Center, Shantou University Medical College, Shantou, 515041, China
| | - Wanxun Huang
- Network Information Center, Shantou University Medical College, Shantou, 515041, China
| | - Liyan Xu
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China. and Institute of Oncologic Pathology, Shantou University Medical College, Shantou, 515041, China
| | - Enmin Li
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China. and Department of Biochemistry and Molecular Biology, Shantou University Medical College, Shantou 515041, China
| |
|
8
|
Gai L, Sun C, Yu W, Liu H. Screening of intracerebral hemorrhage associated allele combinations at different loci using a novel association analysis. Gene 2016; 579:1-7. [PMID: 26723510 DOI: 10.1016/j.gene.2015.12.031] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 05/07/2015] [Revised: 10/28/2015] [Accepted: 12/15/2015] [Indexed: 11/30/2022]
Abstract
BACKGROUND Genetic research has progressed along with scientific and technological developments. However, it is difficult to identify frequency differences in the allele combination at cross-loci.
OBJECTIVE The purpose of this study was to examine the relationship between the presence of specific allele combinations of short tandem repeat (STR) loci and the onset of intracerebral hemorrhage (ICH) using a novel methodology.
METHODS DNA samples were collected from adult patients with ICH. There were a total of 51 Chinese patients (102 chromosomes), comprising 30 males and 21 females. Alleles from STR loci were determined using the STR Profiler Plus PCR amplification kit (15 STR loci). Statistically significant differences between observed and expected frequencies of allele combinations were identified. To further determine allele combinations related to the disease, patient age at disease onset was analyzed for those carrying a specific allele combination. Finally, the two sets of analytical results were cross-validated.
RESULTS A total of 1550 pairwise combinations were obtained by computer counting, of which eight pairs of alleles showed significant differences between the observed and expected frequencies (p<0.05, range 0.006 to 0.042). The p value for the cross-validation analysis was less than 0.05 for two pairs of alleles (D13S317-11 and vWA-17, p=0.021; D7S820-13 and D2S1338-18, p=0.023).
CONCLUSIONS The study found that each population had a unique gene distribution and that the distribution followed certain rules. ICH onset may be associated with these allele combinations (D13S317-11 and vWA-17; D7S820-13 and D2S1338-18). The new methodology used in this study could enable additional discoveries pertaining to the relationship between specific allele combinations at different loci and the onset of complex diseases.
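The comparison of observed and expected allele-combination frequencies described above can be sketched as a counting exercise. The abstract does not specify how expected frequencies were derived, so this sketch assumes independence between loci (expected count = carriers of allele A × carriers of allele B / number of patients). The function name `pair_counts` and the data layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(patients):
    """Count observed cross-locus allele pairs and their expected counts.

    patients -- list of sets of (locus, allele) tuples carried by each patient
    Assumption: expected counts follow from independence between loci.
    """
    n = len(patients)
    # Number of patients carrying each single (locus, allele).
    single = Counter(a for p in patients for a in p)
    observed = Counter()
    for p in patients:
        for a, b in combinations(sorted(p), 2):
            if a[0] != b[0]:  # keep only combinations across different loci
                observed[(a, b)] += 1
    expected = {pair: single[pair[0]] * single[pair[1]] / n for pair in observed}
    return observed, expected
```

A significance test on the observed versus expected counts (for example a chi-square or exact test) would follow this step; which test the authors used is not stated in the abstract.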
Affiliation(s)
- Liping Gai
- College of Medical Laboratory, Dalian Medical University, China
| | - Cui Sun
- College of Medical Laboratory, Dalian Medical University, China
| | - Weijian Yu
- College of Medical Laboratory, Dalian Medical University, China
| | - Hui Liu
- College of Medical Laboratory, Dalian Medical University, China.
| |
|
9
|
Geman D, Ochs M, Price ND, Tomasetti C, Younes L. An argument for mechanism-based statistical inference in cancer. Hum Genet 2015; 134:479-95. [PMID: 25381197 PMCID: PMC4612627 DOI: 10.1007/s00439-014-1501-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 04/21/2014] [Accepted: 10/14/2014] [Indexed: 01/07/2023]
Abstract
Cancer is perhaps the prototypical systems disease, and as such has been the focus of extensive study in quantitative systems biology. However, translating these programs into personalized clinical care remains elusive and incomplete. In this perspective, we argue that realizing this agenda—in particular, predicting disease phenotypes, progression and treatment response for individuals—requires going well beyond standard computational and bioinformatics tools and algorithms. It entails designing global mathematical models over network-scale configurations of genomic states and molecular concentrations, and learning the model parameters from limited available samples of high-dimensional and integrative omics data. As such, any plausible design should accommodate: biological mechanism, necessary for both feasible learning and interpretable decision making; stochasticity, to deal with uncertainty and observed variation at many scales; and a capacity for statistical inference at the patient level. This program, which requires a close, sustained collaboration between mathematicians and biologists, is illustrated in several contexts, including learning biomarkers, metabolism, cell signaling, network inference and tumorigenesis.
Affiliation(s)
- Donald Geman
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21210, USA
| |
|
10
|
Ye S, Dawson JA, Kendziorski C. Extending information retrieval methods to personalized genomic-based studies of disease. Cancer Inform 2015; 13:85-95. [PMID: 25733795 PMCID: PMC4332045 DOI: 10.4137/cin.s16354] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/07/2014] [Revised: 10/22/2014] [Accepted: 10/23/2014] [Indexed: 01/30/2023] Open
Abstract
Genomic-based studies of disease now involve diverse types of data collected on large groups of patients. A major challenge facing statistical scientists is how best to combine the data, extract important features, and comprehensively characterize the ways in which they affect an individual’s disease course and likelihood of response to treatment. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these challenges. Latent Dirichlet allocation (LDA) models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a “document” with “text” detailing his/her clinical events and genomic state. We then further extend the framework to allow for supervision by a time-to-event response. The model enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas ovarian project identifies informative patient subgroups showing differential response to treatment, and validation in an independent cohort demonstrates the potential for patient-specific inference.
Affiliation(s)
- Shuyun Ye
- Department of Statistics, University of Wisconsin, Madison, WI, USA
| | - John A Dawson
- Department of Statistics, University of Wisconsin, Madison, WI, USA
| | - Christina Kendziorski
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
| |
|
11
|
Gai LP, Liu H, Cui JH, Ji N, Ding XD, Sun C, Yu LS. Distributions of allele combination in single and cross loci among patients with several kinds of chronic diseases and the normal population. Genomics 2015; 105:168-74. [PMID: 25561352 DOI: 10.1016/j.ygeno.2014.12.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 06/18/2014] [Revised: 12/05/2014] [Accepted: 12/27/2014] [Indexed: 11/27/2022]
Abstract
Genetic research has progressed along with scientific and technological developments. However, it is difficult to identify frequency differences in a particular allele distribution at a single locus. Such differences can be identified by examining the allele combination distribution. We explored different mathematical methods for statistical analyses to assess the association between the genotype and phenotype. We investigated the frequency distributions of alleles, combinations of single-locus genes, and combinations of cross-loci genes at 15 loci using 447 blood samples (200 normal subjects, 72 patients with chronic obstructive pulmonary resistance, 50 with liver cancer, 75 with stomach cancer, and 50 with hematencephalon). Each population had a unique gene distribution, and the distribution followed certain rules. The probability of illness followed different rules and had apparent specificity. Differences obtained using statistics of combinations of cross-loci genes are superior to single-locus gene statistics, and combinations of single-locus gene statistics are better than allelic statistics.
Affiliation(s)
- Li-ping Gai
- Department of Physics, Dalian Medical University, Dalian 116044, China
| | - Hui Liu
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China.
| | - Jing-hui Cui
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
| | - Na Ji
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
| | - Xiao-dong Ding
- Department of Physics, Dalian Medical University, Dalian 116044, China
| | - Cui Sun
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
| | - Lai-shui Yu
- College of Medical Laboratory, Dalian Medical University, Dalian 116044, China
| |
|
12
|
Liu W, Wang Q, Zhao J, Zhang C, Liu Y, Zhang J, Bai X, Li X, Feng H, Liao M, Wang W, Li C. Integration of pathway structure information into a reweighted partial Cox regression approach for survival analysis on high-dimensional gene expression data. Mol Biosyst 2015; 11:1876-86. [DOI: 10.1039/c5mb00044k] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/21/2022]
Abstract
Accurately predicting the risk of cancer relapse or death is important for clinical utility.
|
13
|
Bühnemann C, Li S, Yu H, Branford White H, Schäfer KL, Llombart-Bosch A, Machado I, Picci P, Hogendoorn PCW, Athanasou NA, Noble JA, Hassan AB. Quantification of the heterogeneity of prognostic cellular biomarkers in Ewing sarcoma using automated image and random survival forest analysis. PLoS One 2014; 9:e107105. [PMID: 25243408 PMCID: PMC4171480 DOI: 10.1371/journal.pone.0107105] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 02/06/2014] [Accepted: 08/12/2014] [Indexed: 02/05/2023] Open
Abstract
Driven by genomic somatic variation, tumour tissues are typically heterogeneous, yet unbiased quantitative methods are rarely used to analyse heterogeneity at the protein level. Motivated by this problem, we developed automated image segmentation of images of multiple biomarkers in Ewing sarcoma to generate distributions of biomarkers between and within tumour cells. We further integrate high dimensional data with patient clinical outcomes utilising random survival forest (RSF) machine learning. Using material from cohorts of genetically diagnosed Ewing sarcoma with EWSR1 chromosomal translocations, confocal images of tissue microarrays were segmented with level sets and watershed algorithms. Each cell nucleus and cytoplasm were identified in relation to DAPI and CD99, respectively, and protein biomarkers (e.g. Ki67, pS6, Foxo3a, EGR1, MAPK) localised relative to nuclear and cytoplasmic regions of each cell in order to generate image feature distributions. The image distribution features were analysed with RSF in relation to known overall patient survival from three separate cohorts (185 informative cases). Variation in pre-analytical processing resulted in elimination of a high number of non-informative images that had poor DAPI localisation or biomarker preservation (67 cases, 36%). The distribution of image features for biomarkers in the remaining high quality material (118 cases, 104 features per case) were analysed by RSF with feature selection, and performance assessed using internal cross-validation, rather than a separate validation cohort. A prognostic classifier for Ewing sarcoma with low cross-validation error rates (0.36) was comprised of multiple features, including the Ki67 proliferative marker and a sub-population of cells with low cytoplasmic/nuclear ratio of CD99. 
Through elimination of bias, the evaluation of high-dimensionality biomarker distribution within cell populations of a tumour using random forest analysis in quality controlled tumour material could be achieved. Such an automated and integrated methodology has potential application in the identification of prognostic classifiers based on tumour cell heterogeneity.
Affiliation(s)
- Claudia Bühnemann
- CR-UK, Tumour Growth Group, Oxford Molecular Pathology Institute, Sir William Dunn School of Pathology, University of Oxford, Oxford, United Kingdom
- Simon Li
- Institute of Biomedical Engineering, Department of Engineering Science, Old Road Campus Research Building, University of Oxford, Headington, Oxford, United Kingdom
- Haiyue Yu
- CR-UK, Tumour Growth Group, Oxford Molecular Pathology Institute, Sir William Dunn School of Pathology, University of Oxford, Oxford, United Kingdom; Institute of Biomedical Engineering, Department of Engineering Science, Old Road Campus Research Building, University of Oxford, Headington, Oxford, United Kingdom
- Harriet Branford White
- CR-UK, Tumour Growth Group, Oxford Molecular Pathology Institute, Sir William Dunn School of Pathology, University of Oxford, Oxford, United Kingdom
- Karl L Schäfer
- Institute of Pathology, Heinrich-Heine University, Medical Faculty, Düsseldorf, Germany
- Isidro Machado
- Pathology Department, University of Valencia, Valencia, Spain
- Piero Picci
- Research, The Rizzoli Institute, Bologna, Italy
- Nicholas A Athanasou
- Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Nuffield Orthopaedic Centre, University of Oxford, Oxford, United Kingdom
- J Alison Noble
- Institute of Biomedical Engineering, Department of Engineering Science, Old Road Campus Research Building, University of Oxford, Headington, Oxford, United Kingdom
- A Bassim Hassan
- CR-UK, Tumour Growth Group, Oxford Molecular Pathology Institute, Sir William Dunn School of Pathology, University of Oxford, Oxford, United Kingdom
|
14
|
Gardeux V, Achour I, Li J, Maienschein-Cline M, Li H, Pesce L, Parinandi G, Bahroos N, Winn R, Foster I, Garcia JGN, Lussier YA. 'N-of-1-pathways' unveils personal deregulated mechanisms from a single pair of RNA-Seq samples: towards precision medicine. J Am Med Inform Assoc 2014; 21:1015-25. [PMID: 25301808 PMCID: PMC4215042 DOI: 10.1136/amiajnl-2013-002519] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 01/25/2023] Open
Abstract
Background The emergence of precision medicine allowed the incorporation of individual molecular data into patient care. Indeed, DNA sequencing predicts somatic mutations in individual patients. However, these genetic features overlook dynamic epigenetic and phenotypic response to therapy. Meanwhile, accurate personal transcriptome interpretation remains an unmet challenge. Further, N-of-1 (single-subject) efficacy trials are increasingly pursued, but are underpowered for molecular marker discovery. Method ‘N-of-1-pathways’ is a global framework relying on three principles: (i) the statistical universe is a single patient; (ii) significance is derived from geneset/biomodules powered by paired samples from the same patient; and (iii) similarity between genesets/biomodules assesses commonality and differences, within-study and cross-studies. Thus, patient gene-level profiles are transformed into deregulated pathways. From RNA-Seq of 55 lung adenocarcinoma patients, N-of-1-pathways predicts the deregulated pathways of each patient. Results Cross-patient N-of-1-pathways obtains comparable results with conventional genesets enrichment analysis (GSEA) and differentially expressed gene (DEG) enrichment, validated in three external evaluations. Moreover, heatmap and star plots highlight both individual and shared mechanisms ranging from molecular to organ-systems levels (eg, DNA repair, signaling, immune response). Patients were ranked based on the similarity of their deregulated mechanisms to those of an independent gold standard, generating unsupervised clusters of diametric extreme survival phenotypes (p=0.03). Conclusions The N-of-1-pathways framework provides a robust statistical and relevant biological interpretation of individual disease-free survival that is often overlooked in conventional cross-patient studies. It enables mechanism-level classifiers with smaller cohorts as well as N-of-1 studies. Software http://lussierlab.org/publications/N-of-1-pathways
Affiliation(s)
- Vincent Gardeux
- Department of Medicine, Bio5 Institute, UA Cancer Center, University of Arizona, Tucson, Arizona, USA; Department of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA; Department of Informatics, School of Engineering, EISTI (École Internationale des Sciences du Traitement de l'Information), Cergy-Pontoise, France; Institute for Translational Health Informatics, University of Illinois at Chicago, Chicago, Illinois, USA
- Ikbel Achour
- Department of Medicine, Bio5 Institute, UA Cancer Center, University of Arizona, Tucson, Arizona, USA; Department of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA; Institute for Translational Health Informatics, University of Illinois at Chicago, Chicago, Illinois, USA
- Jianrong Li
- Department of Medicine, Bio5 Institute, UA Cancer Center, University of Arizona, Tucson, Arizona, USA; Department of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA; Institute for Translational Health Informatics, University of Illinois at Chicago, Chicago, Illinois, USA
- Mark Maienschein-Cline
- Institute for Translational Health Informatics, University of Illinois at Chicago, Chicago, Illinois, USA
- Haiquan Li
- Department of Medicine, Bio5 Institute, UA Cancer Center, University of Arizona, Tucson, Arizona, USA; Department of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA; Institute for Translational Health Informatics, University of Illinois at Chicago, Chicago, Illinois, USA
- Lorenzo Pesce
- Computation Institute, Argonne National Laboratory & University of Chicago, Chicago, Illinois, USA
- Gurunadh Parinandi
- Department of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA; Institute for Translational Health Informatics, University of Illinois at Chicago, Chicago, Illinois, USA
- Neil Bahroos
- Institute for Translational Health Informatics, University of Illinois at Chicago, Chicago, Illinois, USA
- Robert Winn
- Department of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA; University of Illinois Cancer Center, Chicago, Illinois, USA
- Ian Foster
- Computation Institute, Argonne National Laboratory & University of Chicago, Chicago, Illinois, USA; Department of Computer Science, University of Chicago, Chicago, Illinois, USA; Mathematics and Computer Science Division, Argonne National Laboratory, Chicago, Illinois, USA
- Joe G N Garcia
- Department of Medicine, Bio5 Institute, UA Cancer Center, University of Arizona, Tucson, Arizona, USA
- Yves A Lussier
- Department of Medicine, Bio5 Institute, UA Cancer Center, University of Arizona, Tucson, Arizona, USA; Department of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA; Institute for Translational Health Informatics, University of Illinois at Chicago, Chicago, Illinois, USA; Computation Institute, Argonne National Laboratory & University of Chicago, Chicago, Illinois, USA; University of Illinois Cancer Center, Chicago, Illinois, USA; Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois, USA; Department of Biopharmaceutical Science, College of Pharmacy, University of Illinois at Chicago, Illinois, USA; Institute for Genomics and Systems Biology, University of Chicago, Chicago, Illinois, USA; Department of Pharmacology, University of Illinois at Chicago, Chicago, Illinois, USA
|
15
|
Gardeux V, Arslan AD, Achour I, Ho TT, Beck WT, Lussier YA. Concordance of deregulated mechanisms unveiled in underpowered experiments: PTBP1 knockdown case study. BMC Med Genomics 2014; 7 Suppl 1:S1. [PMID: 25079003 PMCID: PMC4101571 DOI: 10.1186/1755-8794-7-s1-s1] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 02/01/2023] Open
Abstract
Background Genome-wide transcriptome profiling generated by microarray and RNA-Seq often provides deregulated genes or pathways applicable only to larger cohorts. On the other hand, individualized interpretation of transcriptomes is increasingly pursued to improve diagnosis, prognosis, and patient treatment processes. Yet, robust and accurate methods based on a single paired sample remain an unmet challenge. Methods "N-of-1-pathways" translates gene expression data profiles into mechanism-level profiles on single pairs of samples (one p-value per geneset). It relies on three principles: i) the statistical universe is a single paired sample, which serves as its own control; ii) statistics can be derived from multiple gene expression measures that share common biological mechanisms assimilated to genesets; iii) a semantic similarity metric takes into account inter-mechanism relationships to better assess commonality and differences, within and across study samples (e.g. patients, cell lines, tissues, etc.), which helps the interpretation of the underpinning biology. Results In the context of underpowered experiments, N-of-1-pathways predictions perform better than or comparably to those of GSEA and Differentially Expressed Genes enrichment (DEG enrichment), within and across datasets. N-of-1-pathways uncovered concordant PTBP1-dependent mechanisms across datasets (odds ratios ≥ 13, p-values ≤ 1 × 10⁻⁵), such as RNA splicing and cell cycle. In addition, it unveils tissue-specific mechanisms of alternatively transcribed PTBP1-dependent genesets. Furthermore, we demonstrate that GSEA and DEG enrichment preclude accurate analysis on single paired samples. Conclusions N-of-1-pathways enables robust and biologically relevant mechanism-level classifiers with small cohorts and single paired samples that surpass conventional methods. Further, it identifies unique sample/patient mechanisms, a requirement for precision medicine.
|
16
|
Cassese A, Guindani M, Tadesse MG, Falciani F, Vannucci M. A hierarchical Bayesian model for inference of copy number variants and their association to gene expression. Ann Appl Stat 2014; 8:148-175. [PMID: 24834139 DOI: 10.1214/13-aoas705] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]
Abstract
A number of statistical models have been successfully developed for the analysis of high-throughput data from a single source, but few methods are available for integrating data from different sources. Here we focus on integrating gene expression levels with comparative genomic hybridization (CGH) array measurements collected on the same subjects. We specify a measurement error model that relates the gene expression levels to latent copy number states which, in turn, are related to the observed surrogate CGH measurements via a hidden Markov model. We employ selection priors that exploit the dependencies across adjacent copy number states and investigate MCMC stochastic search techniques for posterior inference. Our approach results in a unified modeling framework for simultaneously inferring copy number variants (CNV) and identifying their significant associations with mRNA transcripts abundance. We show performance on simulated data and illustrate an application to data from a genomic study on human cancer cell lines.
|
17
|
Chen X, Ishwaran H. Pathway hunting by random survival forests. Bioinformatics 2013; 29:99-105. [PMID: 23129299 PMCID: PMC3530909 DOI: 10.1093/bioinformatics/bts643] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 03/20/2012] [Revised: 07/18/2012] [Accepted: 10/17/2012] [Indexed: 01/22/2023] Open
Abstract
MOTIVATION Pathway or gene set analysis has been widely applied to genomic data. Many current pathway testing methods use univariate test statistics calculated from individual genomic markers, which ignores the correlations and interactions between candidate markers. Random forests-based pathway analysis is a promising approach for incorporating complex correlation and interaction patterns, but one limitation of previous approaches is that pathways have been considered separately, thus pathway cross-talk information was not considered. RESULTS In this article, we develop a new pathway hunting algorithm for survival outcomes using random survival forests, which prioritize important pathways by accounting for gene correlation and genomic interactions. We show that the proposed method performs favourably compared with five popular pathway testing methods using both synthetic and real data. We find that the proposed methodology provides an efficient and powerful pathway modelling framework for high-dimensional genomic data. AVAILABILITY The R code for the analysis used in this article is available upon request.
Affiliation(s)
- Xi Chen
- Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
|
18
|
Ramos-Rodriguez RR, Cuevas-Diaz-Duran R, Falciani F, Tamez-Peña JG, Trevino V. COMPADRE: an R and web resource for pathway activity analysis by component decompositions. Bioinformatics 2012; 28:2701-2. [PMID: 22923303 DOI: 10.1093/bioinformatics/bts513] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]
Abstract
The analysis of biological networks has become essential to study functional genomic data. Compadre is a tool to estimate pathway/gene set activity indexes using sub-matrix decompositions for biological network analyses. The Compadre pipeline also includes one of the direct uses of activity indexes to detect altered gene sets. For this, the gene expression sub-matrix of a gene set is decomposed into components, which are used to test differences between groups of samples. This procedure is performed with and without differentially expressed genes to decrease false calls. During this process, Compadre also performs an over-representation test. Compadre already implements four decomposition methods [principal component analysis (PCA), Isomaps, independent component analysis (ICA) and non-negative matrix factorization (NMF)], six statistical tests (t- and f-test, SAM, Kruskal-Wallis, Welch and Brown-Forsythe), several gene sets (KEGG, BioCarta, Reactome, GO and MsigDB) and can be easily expanded. Our simulation results shown in Supplementary Information suggest that Compadre detects more pathways than over-representation tools like David, Babelomics and Webgestalt and fewer false positives than PLAGE. The output is composed of results from decomposition and over-representation analyses providing a more complete biological picture. Examples provided in Supplementary Information show the utility, versatility and simplicity of Compadre for analyses of biological networks. AVAILABILITY AND IMPLEMENTATION Compadre is freely available at http://bioinformatica.mty.itesm.mx:8080/compadre. The R package is also available at https://sourceforge.net/p/compadre.
Affiliation(s)
- Roberto-Rafael Ramos-Rodriguez
- Cátedra of Bioinformática and Department of Computer Sciences, Tecnológico de Monterrey, Campus Monterrey, Monterrey, Nuevo León, México
|
19
|
Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics 2012; 99:323-9. [PMID: 22546560 PMCID: PMC3387489 DOI: 10.1016/j.ygeno.2012.04.003] [Citation(s) in RCA: 380] [Impact Index Per Article: 31.7] [Received: 02/09/2012] [Revised: 04/11/2012] [Accepted: 04/14/2012] [Indexed: 11/25/2022]
Abstract
Random forests (RF) is a popular tree-based ensemble machine learning tool that is highly data adaptive, applies to "large p, small n" problems, and is able to account for correlation as well as interactions among features. This makes RF particularly appealing for high-dimensional genomic data analysis. In this article, we systematically review the applications and recent progresses of RF for genomic data, including prediction and classification, variable selection, pathway analysis, genetic association and epistasis detection, and unsupervised learning.
Affiliation(s)
- Xi Chen
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA.
|
20
|
Saintigny P, Zhang L, Fan YH, El-Naggar AK, Papadimitrakopoulou VA, Feng L, Lee JJ, Kim ES, Ki Hong W, Mao L. Gene expression profiling predicts the development of oral cancer. Cancer Prev Res (Phila) 2011; 4:218-29. [PMID: 21292635 DOI: 10.1158/1940-6207.capr-10-0155] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.2] [Indexed: 11/16/2022]
Abstract
Patients with oral premalignant lesion (OPL) have a high risk of developing oral cancer. Although certain risk factors, such as smoking status and histology, are known, our ability to predict oral cancer risk remains poor. The study objective was to determine the value of gene expression profiling in predicting oral cancer development. Gene expression profile was measured in 86 of 162 OPL patients who were enrolled in a clinical chemoprevention trial that used the incidence of oral cancer development as a prespecified endpoint. The median follow-up time was 6.08 years and 35 of the 86 patients developed oral cancer over the course. Gene expression profiles were associated with oral cancer-free survival and used to develop multivariate predictive models for oral cancer prediction. We developed a 29-transcript predictive model which showed marked improvement in terms of prediction accuracy (with 8% predicting error rate) over the models using previously known clinicopathologic risk factors. On the basis of the gene expression profile data, we also identified 2,182 transcripts significantly associated with oral cancer risk-associated genes (P value < 0.01; univariate Cox proportional hazards model). Functional pathway analysis revealed proteasome machinery, MYC, and ribosomal components as the top gene sets associated with oral cancer risk. In multiple independent data sets, the expression profiles of the genes can differentiate head and neck cancer from normal mucosa. Our results show that gene expression profiles may improve the prediction of oral cancer risk in OPL patients and the significant genes identified may serve as potential targets for oral cancer chemoprevention.
Affiliation(s)
- Pierre Saintigny
- Department of Thoracic/Head and Neck Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
|
21
|
Porzelius C, Johannes M, Binder H, Beißbarth T. Leveraging external knowledge on molecular interactions in classification methods for risk prediction of patients. Biom J 2011; 53:190-201. [DOI: 10.1002/bimj.201000155] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 07/30/2010] [Revised: 10/22/2010] [Accepted: 10/29/2010] [Indexed: 12/17/2022]
|