1
Shen J, Wang S, Sun H, Huang J, Bai L, Wang X, Dong Y, Tang Z. A novel non-negative Bayesian stacking modeling method for Cancer survival prediction using high-dimensional omics data. BMC Med Res Methodol 2024; 24:105. [PMID: 38702624 PMCID: PMC11067084 DOI: 10.1186/s12874-024-02232-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Accepted: 04/23/2024] [Indexed: 05/06/2024]
Abstract
BACKGROUND Survival prediction using high-dimensional molecular data is a hot topic in genomics and precision medicine, especially for cancer studies. Because carcinogenesis has a pathway-based pathogenesis, models that use such group structures mimic disease progression and prognosis more closely. Many approaches can be used to integrate group information; however, most of them are single-model methods, which may lead to unstable predictions. METHODS We introduced a novel survival stacking method that uses group structure information to improve the robustness of cancer survival prediction in the context of high-dimensional omics data. With a super learner, survival stacking combines the predictions from multiple sub-models that are independently trained using the features in pre-grouped biological pathways. In addition to a non-negative linear combination of sub-models, we extended the super learner to a non-negative Bayesian hierarchical generalized linear model and an artificial neural network (ANN). We compared the proposed modeling strategy with the widely used penalized survival method Lasso Cox and several group penalized methods, e.g., group Lasso Cox, via a simulation study and a real-world data application. RESULTS The proposed survival stacking method showed superior and robust discrimination compared with single-model methods on high-noise simulated data and real-world data. The non-negative Bayesian stacking method can identify important biological signal pathways and genes that are associated with cancer prognosis. CONCLUSIONS This study proposed a novel survival stacking strategy that incorporates biological group information into cancer prognosis models. Additionally, this study extended the super learner to a non-negative Bayesian model and an ANN, enriching the ways in which sub-models can be combined. The proposed Bayesian stacking strategy exhibited favorable properties in the prediction and interpretation of complex survival data, which may aid in discovering cancer targets.
Affiliation(s)
- Junjie Shen: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Shuo Wang: Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, 79085, Freiburg, Germany
- Hao Sun: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Jie Huang: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Lu Bai: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Xichao Wang: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Yongfei Dong: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Zaixiang Tang: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
2
Shen J, Wang S, Dong Y, Sun H, Wang X, Tang Z. A non-negative spike-and-slab lasso generalized linear stacking prediction modeling method for high-dimensional omics data. BMC Bioinformatics 2024; 25:119. [PMID: 38509499 PMCID: PMC10953151 DOI: 10.1186/s12859-024-05741-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Accepted: 03/11/2024] [Indexed: 03/22/2024]
Abstract
BACKGROUND High-dimensional omics data are increasingly utilized in clinical and public health research for disease risk prediction. Many sparse methods have been proposed that use prior knowledge, e.g., biological group structure information, to guide the model-building process. However, these methods are still based on a single model, often leading to overconfident inferences and inferior generalization. RESULTS We proposed a novel stacking strategy based on a non-negative spike-and-slab Lasso (nsslasso) generalized linear model (GLM) for disease risk prediction in the context of high-dimensional omics data. Briefly, we used prior biological knowledge to segment omics data into a set of sub-datasets, one per group. Each sub-model was trained separately using the features from its group via a proper base learner. Then, the predictions of the sub-models were ensembled by a super learner using the nsslasso GLM. The proposed method was compared to several competitors, such as the Lasso, grlasso, and gsslasso, using simulated data and two open-access breast cancer datasets. The proposed method showed robustly superior prediction performance to the optimal single-model method on high-noise simulated data and real-world data. Furthermore, compared to the traditional stacking method, the proposed nsslasso stacking method can efficiently handle redundant sub-models and identify important sub-models. CONCLUSIONS The proposed nsslasso method demonstrated favorable predictive accuracy, stability, and biological interpretability. Additionally, the proposed method can be used to detect new biomarkers and key group structures.
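The super-learner step described in this abstract (combining sub-model predictions with non-negative weights) can be illustrated with a minimal sketch. It is not the authors' nsslasso GLM: plain non-negative least squares, solved by projected gradient descent, is swapped in as a simpler stand-in. The function name, the toy predictions, and the step size (chosen for predictions scaled to roughly [0, 1]) are all assumptions made for illustration.

```python
def nonneg_stack_weights(sub_preds, y, steps=5000, lr=0.1):
    """Projected gradient descent for non-negative least-squares stacking.

    sub_preds: list of K lists, each holding one sub-model's predictions.
    y: list of observed outcomes (same length as each prediction list).
    Returns K weights, all >= 0.
    """
    k_models, n = len(sub_preds), len(y)
    w = [0.0] * k_models
    for _ in range(steps):
        # Residual of the current weighted ensemble on every sample.
        resid = [sum(w[k] * sub_preds[k][i] for k in range(k_models)) - y[i]
                 for i in range(n)]
        # Gradient of the mean squared error with respect to each weight.
        grads = [2.0 / n * sum(resid[i] * sub_preds[k][i] for i in range(n))
                 for k in range(k_models)]
        # Gradient step, then projection onto the non-negative orthant.
        w = [max(0.0, w[k] - lr * grads[k]) for k in range(k_models)]
    return w


# Hypothetical toy data: three sub-models, the third one is redundant
# because the outcome depends only on the first two.
p1 = [0.1, 0.4, 0.8, 0.9, 0.3]
p2 = [0.2, 0.1, 0.7, 0.6, 0.9]
p3 = [0.9, 0.5, 0.2, 0.4, 0.6]
y = [0.7 * a + 0.3 * b for a, b in zip(p1, p2)]

weights = nonneg_stack_weights([p1, p2, p3], y)
print([round(v, 3) for v in weights])
```

On this toy example the redundant third sub-model should end up with a weight close to zero, which is the behaviour the abstract attributes to non-negative stacking; a spike-and-slab prior would shrink such weights more aggressively than this unpenalized sketch.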
Affiliation(s)
- Junjie Shen: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
- Shuo Wang: Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, 79085, Freiburg, Germany
- Yongfei Dong: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
- Hao Sun: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
- Xichao Wang: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
- Zaixiang Tang: Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
3
Chen Y, Liu S, Papageorgiou LG, Theofilatos K, Tsoka S. Optimisation Models for Pathway Activity Inference in Cancer. Cancers (Basel) 2023; 15:1787. [PMID: 36980673 PMCID: PMC10046797 DOI: 10.3390/cancers15061787] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Revised: 02/24/2023] [Accepted: 03/08/2023] [Indexed: 03/18/2023]
Abstract
BACKGROUND With advances in high-throughput technologies, there has been an enormous increase in data related to profiling the activity of molecules in disease. While such data provide more comprehensive information on cellular actions, their large volume and complexity pose difficulty in accurate classification of disease phenotypes. Therefore, novel modelling methods that can improve accuracy while offering interpretable means of analysis are required. Biological pathways can be used to incorporate a priori knowledge of biological interactions to decrease data dimensionality and increase the biological interpretability of machine learning models. METHODOLOGY A mathematical optimisation model is proposed for pathway activity inference towards precise disease phenotype prediction and is applied to RNA-Seq datasets. The model is based on mixed-integer linear programming (MILP) mathematical optimisation principles and infers pathway activity as the linear combination of pathway member gene expression, multiplying expression values with model-determined gene weights that are optimised to maximise discrimination of phenotype classes and minimise incorrect sample allocation. RESULTS The model is evaluated on the transcriptome of breast and colorectal cancer, and exhibits solution results of good optimality as well as good prediction performance on related cancer subtypes. Two baseline pathway activity inference methods and three advanced methods are used for comparison. Sample prediction accuracy, robustness against noise expression data, and survival analysis suggest competitive prediction performance of our model while providing interpretability and insight on key pathways and genes. Overall, our work demonstrates that the flexible nature of mathematical programming lends itself well to developing efficient computational strategies for pathway activity inference and disease subtype prediction.
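The abstract above defines pathway activity as a linear combination of member-gene expression values multiplied by gene weights. A minimal sketch of that scoring step follows. The weights here are supplied by hand and are purely hypothetical, whereas the paper determines them by solving an MILP; treating each gene as having a single weight across pathways is also an assumption of this sketch.

```python
def pathway_activity(expression, pathways, gene_weights):
    """Score each pathway per sample as a weighted sum of member-gene expression.

    expression: {gene: [expression value per sample]}
    pathways: {pathway name: [member genes]}
    gene_weights: {gene: weight}; genes without a weight contribute nothing.
    Returns {pathway name: [activity per sample]}.
    """
    n_samples = len(next(iter(expression.values())))
    activity = {}
    for name, genes in pathways.items():
        scores = [0.0] * n_samples
        for gene in genes:
            if gene not in expression:
                continue  # pathway member that was not measured
            weight = gene_weights.get(gene, 0.0)
            for i, value in enumerate(expression[gene]):
                scores[i] += weight * value
        activity[name] = scores
    return activity


# Hypothetical toy input: three genes, two samples, two pathways.
expression = {"A": [1.0, 2.0], "B": [0.5, 0.0], "C": [3.0, 1.0]}
pathways = {"P1": ["A", "B"], "P2": ["C"]}
gene_weights = {"A": 0.5, "B": -1.0, "C": 2.0}

activity = pathway_activity(expression, pathways, gene_weights)
print(activity)
```

The resulting per-pathway scores replace thousands of gene-level features with one feature per pathway, which is the dimensionality reduction the abstract refers to.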
Affiliation(s)
- Yongnan Chen: Department of Informatics, Faculty of Natural, Mathematical and Engineering Sciences, King's College London, Bush House, London WC2B 4BG, UK
- Songsong Liu: School of Management, Harbin Institute of Technology, Harbin 150001, China
- Lazaros G Papageorgiou: The Sargent Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK
- Konstantinos Theofilatos: King's College London British Heart Foundation Centre, School of Cardiovascular and Metabolic Medicine and Sciences, London SE1 7EH, UK
- Sophia Tsoka: Department of Informatics, Faculty of Natural, Mathematical and Engineering Sciences, King's College London, Bush House, London WC2B 4BG, UK
4
Wang W, Liu W. PCLasso: a protein complex-based, group lasso-Cox model for accurate prognosis and risk protein complex discovery. Brief Bioinform 2021; 22:6291946. [PMID: 34086850 DOI: 10.1093/bib/bbab212] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/12/2021] [Revised: 05/08/2021] [Accepted: 05/15/2021] [Indexed: 12/12/2022]
Abstract
For high-dimensional expression data, most prognostic models perform feature selection based on individual genes, which usually leads to unstable prognosis, and the identified risk genes are inherently insufficient for revealing complex molecular mechanisms. Since most genes carry out cellular functions by forming protein complexes, the basic representatives of functional modules, identifying risk protein complexes may greatly improve our understanding of disease biology. Protein complexes have also been shown to have innate resistance to batch effects and to be effective predictors of disease phenotypes, so constructing prognostic models and selecting features with protein complexes as the basic unit should improve the robustness and biological interpretability of the model. Here, we propose a protein complex-based, group lasso-Cox model (PCLasso) to predict patient prognosis and identify risk protein complexes. Experiments on three cancer types showed that PCLasso has better prognostic performance than prognostic models based on individual genes. The resulting risk protein complexes not only contain individual risk genes but also incorporate close partners that synergize with them, which may help reveal molecular mechanisms related to cancer progression from a comprehensive perspective. Furthermore, a pan-cancer prognostic analysis was performed to identify risk protein complexes of 19 cancer types, which may provide novel potential targets for cancer research.
Affiliation(s)
- Wei Wang: Heilongjiang Institute of Technology, Harbin 150050, China
- Wei Liu: School of Science at Heilongjiang Institute of Technology, Harbin 150050, China
5
Raghu VK, Ge X, Balajiee A, Shirer DJ, Das I, Benos PV, Chrysanthis PK. A Pipeline for Integrated Theory and Data-Driven Modeling of Biomedical Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:811-822. [PMID: 32841121 PMCID: PMC8237279 DOI: 10.1109/tcbb.2020.3019237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Genome sequencing technologies have the potential to transform clinical decision making and biomedical research by enabling high-throughput measurements of the genome at a granular level. However, to truly understand mechanisms of disease and predict the effects of medical interventions, high-throughput data must be integrated with demographic, phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods must infer relationships between these data types. We recently proposed a pipeline (CausalMGM) to achieve this. CausalMGM uses probabilistic graphical models to infer the relationships between variables in the data; however, CausalMGM's graphical structure learning algorithm can only handle small datasets efficiently. We propose a new methodology (piPref-Div) that selects the most informative variables for CausalMGM, enabling it to scale. We validate the efficacy of piPref-Div against other feature selection methods and demonstrate how the use of the full pipeline improves breast cancer outcome prediction and provides biologically interpretable views of gene expression data.
6
Wang W, Liu W. Integration of gene interaction information into a reweighted Lasso-Cox model for accurate survival prediction. Bioinformatics 2020; 36:5405-5414. [PMID: 33325490 DOI: 10.1093/bioinformatics/btaa1046] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/20/2020] [Revised: 11/13/2020] [Accepted: 12/07/2020] [Indexed: 11/14/2022]
Abstract
Motivation
Accurately predicting the risk of cancer patients is a central challenge for clinical cancer research. For high-dimensional gene expression data, Cox proportional hazard model with the least absolute shrinkage and selection operator for variable selection (Lasso-Cox) is one of the most popular feature selection and risk prediction algorithms. However, the Lasso-Cox model treats all genes equally, ignoring the biological characteristics of the genes themselves. This often encounters the problem of poor prognostic performance on independent datasets.
Results
Here, we propose a Reweighted Lasso-Cox (RLasso-Cox) model to ameliorate this problem by integrating gene interaction information. It is based on the hypothesis that topologically important genes in the gene interaction network tend to have stable expression changes. We used random walk to evaluate the topological weight of genes, and then highlighted topologically important genes to improve the generalization ability of the RLasso-Cox model. Experiments on datasets of three cancer types showed that the RLasso-Cox model improves the prognostic accuracy and robustness compared with the Lasso-Cox model and several existing network-based methods. More importantly, the RLasso-Cox model has the advantage of identifying small gene sets with high prognostic performance on independent datasets, which may play an important role in identifying robust survival biomarkers for various cancer types.
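The abstract states that a random walk is used to evaluate the topological weight of genes in the interaction network. The sketch below shows one common variant, a random walk with restart on an undirected graph, as an illustration only. The restart probability, the uniform restart vector, the fixed iteration count, and the toy network are assumptions, and the step that feeds these weights into the Lasso-Cox penalty is not shown.

```python
def random_walk_weights(edges, restart=0.3, iters=100):
    """Topological gene weights from a random walk with restart.

    edges: iterable of (gene_u, gene_v) pairs, treated as undirected.
    restart: probability of jumping back to the uniform start distribution.
    Returns {gene: visiting probability}; the values sum to 1.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    genes = sorted(neighbors)
    start = {g: 1.0 / len(genes) for g in genes}
    prob = dict(start)
    for _ in range(iters):
        # Restart mass goes back to the start distribution ...
        nxt = {g: restart * start[g] for g in genes}
        # ... and the rest is spread evenly over each gene's neighbours.
        for g in genes:
            share = (1.0 - restart) * prob[g] / len(neighbors[g])
            for nb in neighbors[g]:
                nxt[nb] += share
        prob = nxt
    return prob


# Hypothetical toy network: a hub gene H connected to three leaf genes.
toy_edges = [("H", "A"), ("H", "B"), ("H", "C")]
gene_weights = random_walk_weights(toy_edges)
print({g: round(p, 4) for g, p in gene_weights.items()})
```

In this toy star network the hub receives the largest weight, matching the intuition that topologically important genes are the ones a reweighted model would favour.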
Availability and implementation
http://bioconductor.org/packages/devel/bioc/html/RLassoCox.html
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wei Wang: Department of Mathematics, College of Science, Heilongjiang Institute of Technology, Harbin 150050, China
- Wei Liu: Department of Mathematics, College of Science, Heilongjiang Institute of Technology, Harbin 150050, China
7
Perscheid C. Integrative biomarker detection on high-dimensional gene expression data sets: a survey on prior knowledge approaches. Brief Bioinform 2020; 22:5881664. [PMID: 32761115 DOI: 10.1093/bib/bbaa151] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/17/2020] [Revised: 06/15/2020] [Accepted: 06/16/2020] [Indexed: 02/06/2023]
Abstract
Gene expression data provide the expression levels of tens of thousands of genes from several hundred samples. These data are analyzed to detect biomarkers that can be of prognostic or diagnostic use. Traditionally, biomarker detection for gene expression data is the task of gene selection. The vast number of genes is reduced to a few relevant ones that achieve the best performance for the respective use case. Traditional approaches select genes based on their statistical significance in the data set. This results in issues of robustness, redundancy and true biological relevance of the selected genes. Integrative analyses typically address these shortcomings by integrating multiple data artifacts from the same objects, e.g. gene expression and methylation data. When only gene expression data are available, integrative analyses instead use curated information on biological processes from public knowledge bases. With knowledge bases providing an ever-increasing amount of curated biological knowledge, such prior knowledge approaches become more powerful. This paper provides a thorough overview on the status quo of biomarker detection on gene expression data with prior biological knowledge. We discuss current shortcomings of traditional approaches, review recent external knowledge bases, provide a classification and qualitative comparison of existing prior knowledge approaches and discuss open challenges for this kind of gene selection.
Affiliation(s)
- Cindy Perscheid: Hasso Plattner Institute, University of Potsdam, Potsdam, 14482, Germany
8
Guan X, Runger G, Liu L. Dynamic incorporation of prior knowledge from multiple domains in biomarker discovery. BMC Bioinformatics 2020; 21:77. [PMID: 32164534 PMCID: PMC7068914 DOI: 10.1186/s12859-020-3344-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 12/22/2022]
Abstract
Background In biomarker discovery, applying domain knowledge is an effective approach to eliminating false positive features, prioritizing functionally impactful markers and facilitating the interpretation of predictive signatures. Several computational methods have been developed that formulate knowledge-based biomarker discovery as a feature selection problem guided by prior information. These methods often require that prior information be encoded as a single score, and the algorithms are optimized for biological knowledge of a specific type. In practice, however, domain knowledge from diverse resources can provide complementary information, and no current method can integrate heterogeneous prior information for biomarker discovery. To address this problem, we developed the Know-GRRF (know-guided regularized random forest) method, which enables dynamic incorporation of domain knowledge from multiple disciplines to guide feature selection. Results Know-GRRF embeds domain knowledge in a regularized random forest framework. It combines prior information from multiple domains in a linear model to derive a composite score, which, together with other tuning parameters, controls the regularization of the random forest model. Know-GRRF concurrently optimizes the weight given to each type of domain knowledge and the other tuning parameters to minimize the AIC of out-of-bag predictions. The objective is to select a compact feature subset that has high discriminative power and strong functional relevance to the biological phenotype. Via rigorous simulations, we show that Know-GRRF guided by multiple-domain prior information outperforms feature selection methods guided by single-domain prior information or no prior information. We then applied Know-GRRF to a real-world study to identify prognostic biomarkers of prostate cancers. We evaluated the combination of cancer-related gene annotations, evolutionary conservation and pre-computed statistical scores as the prior knowledge to assemble a panel of biomarkers. We discovered a compact set of biomarkers with significant improvements in prediction accuracy. Conclusions Know-GRRF is a powerful novel method to incorporate knowledge from multiple domains for feature selection. It has a broad range of applications in biomarker discovery. We implemented this method and released a KnowGRRF package in the R/CRAN archive.
Affiliation(s)
- Xin Guan: College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA; Intel Corporation, Chandler, AZ, 85226, USA
- George Runger: College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA
- Li Liu: College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA; Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA; Department of Neurology, Mayo Clinic, Scottsdale, AZ, 85259, USA
9
Liu W, Wang W, Tian G, Xie W, Lei L, Liu J, Huang W, Xu L, Li E. Topologically inferring pathway activity for precise survival outcome prediction: breast cancer as a case. Molecular BioSystems 2017; 13:537-548. [PMID: 28098303 DOI: 10.1039/c6mb00757k] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 02/05/2023]
Abstract
Accurately predicting the survival outcome of patients is of great importance in clinical cancer research. In the past decade, building survival prediction models based on gene expression data has received increasing interest. However, the existing methods are mainly based on individual gene signatures, which are known to have limited prediction accuracy on independent datasets and unclear biological relevance. Here, we propose a novel pathway-based survival prediction method called DRWPSurv to accurately predict survival outcome. DRWPSurv integrates gene expression profiles and prior gene interaction information to topologically infer survival-associated pathway activities, and uses the pathway activities as features to construct a Lasso-Cox model. It uses the topological importance of genes, evaluated by directed random walk, to enhance the robustness of pathway activities and thereby improve predictive performance. We applied DRWPSurv to three independent breast cancer datasets and compared its predictive performance with a traditional gene-based method and four pathway-based methods. Results showed that pathway-based methods obtained comparable or better predictive performance than the gene-based method, and that DRWPSurv predicted survival outcome with the best accuracy and robustness among the pathway-based methods. In addition, the risk pathways identified by DRWPSurv provide biologically informative models for breast cancer prognosis and treatment.
Affiliation(s)
- Wei Liu: The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China; Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
- Wei Wang: Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
- Guohua Tian: Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
- Wenming Xie: Network Information Center, Shantou University Medical College, Shantou, 515041, China
- Li Lei: Network Information Center, Shantou University Medical College, Shantou, 515041, China
- Jiujin Liu: Network Information Center, Shantou University Medical College, Shantou, 515041, China
- Wanxun Huang: Network Information Center, Shantou University Medical College, Shantou, 515041, China
- Liyan Xu: The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China; Institute of Oncologic Pathology, Shantou University Medical College, Shantou, 515041, China
- Enmin Li: The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China; Department of Biochemistry and Molecular Biology, Shantou University Medical College, Shantou 515041, China
10
Matsui S, Crowley J. An Evaluation of Gene Set Analysis for Biomarker Discovery with Applications to Myeloma Research. Frontiers of Biostatistical Methods and Applications in Clinical Oncology 2017. [PMCID: PMC7120714 DOI: 10.1007/978-981-10-0126-0_25] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
In this paper, we evaluate 15 methods for gene set analysis in microarray classification problems. We employ four datasets from myeloma research and three types of biological gene sets, encompassing a total of 12 scenarios. Taking a two-step approach, we first identify important genes within gene sets to create summary gene set scores and then construct predictive models using the gene set scores as predictors. We propose two powerful linear methods in addition to the well-known SuperPC method for calculating scores. By comparing the 15 gene set methods with methods used in individual-gene analysis, we conclude that, overall, the gene set analysis approach provided more accurate predictions than the individual-gene analysis.
Affiliation(s)
- Shigeyuki Matsui: Graduate School of Medicine, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- John Crowley: Cancer Research and Biostatistics, Seattle, Washington, USA
11
Hira ZM, Gillies DF. Identifying Significant Features in Cancer Methylation Data Using Gene Pathway Segmentation. Cancer Inform 2016; 15:189-98. [PMID: 27688706 PMCID: PMC5030825 DOI: 10.4137/cin.s39859] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 04/04/2016] [Revised: 06/19/2016] [Accepted: 07/03/2016] [Indexed: 12/19/2022]
Abstract
In order to provide the most effective therapy for cancer, it is important to be able to determine whether a patient's cancer will respond to a proposed treatment. Methylation profiling could contain information from which such predictions could be made. Currently, hypothesis testing is used to determine whether possible biomarkers for cancer progression produce statistically significant results. However, this approach requires the identification of individual genes, or sets of genes, as candidate hypotheses, and with the increasing size of modern microarrays, this task is becoming progressively harder. Exhaustive testing of small sets of genes is computationally infeasible, and so hypothesis generation depends either on the use of established biological knowledge or on heuristic methods. As an alternative, machine learning methods can be used to identify groups of genes that are acting together within sets of cancer data and associate their behaviors with cancer progression. These methods have the advantage of being multivariate and unbiased, but unfortunately they also rapidly become computationally infeasible as the number of gene probes and datasets increases. To address this problem, we have investigated a way of utilizing prior knowledge to segment microarray datasets so that machine learning can be used to identify candidate sets of genes for hypothesis testing. A methylation dataset is divided into subsets, where each subset contains only the probes that relate to a known gene pathway. Each of these pathway subsets is used independently for classification. The classification method is AdaBoost with decision trees as weak classifiers. Since each pathway subset contains a relatively small number of gene probes, it is possible to train and test its classification accuracy quickly and determine whether it has valuable diagnostic information. Finally, genes from successful pathway subsets can be combined to create a classifier of high accuracy.
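The segmentation step described in this abstract (one data subset per known gene pathway, each holding only the probes that map to that pathway) can be sketched as below. The probe identifiers, the probe-to-gene map, and the pathway definitions are hypothetical, and the AdaBoost classification that the paper applies to each subset is deliberately left out.

```python
def segment_by_pathway(data, probe_to_gene, pathways):
    """Split a probe-level dataset into one subset per pathway.

    data: {probe id: [value per sample]}
    probe_to_gene: {probe id: gene symbol}
    pathways: {pathway name: [member gene symbols]}
    Returns {pathway name: {probe id: values}}; pathways with no measured
    probes are dropped.
    """
    subsets = {}
    for name, genes in pathways.items():
        members = set(genes)
        subset = {probe: values for probe, values in data.items()
                  if probe_to_gene.get(probe) in members}
        if subset:
            subsets[name] = subset
    return subsets


# Hypothetical toy input: three probes measured on two samples.
data = {"cg01": [0.1, 0.9], "cg02": [0.4, 0.5], "cg03": [0.8, 0.2]}
probe_to_gene = {"cg01": "TP53", "cg02": "BRCA1", "cg03": "TP53"}
pathways = {
    "p53 signalling": ["TP53", "MDM2"],
    "DNA repair": ["BRCA1", "TP53"],
    "unmeasured": ["XYZ"],
}

subsets = segment_by_pathway(data, probe_to_gene, pathways)
for name, subset in subsets.items():
    print(name, sorted(subset))
```

Because a gene can belong to several pathways, the same probe may appear in more than one subset; each subset would then be passed to its own classifier.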
Affiliation(s)
- Zena M. Hira: Department of Computing, Imperial College London, London, UK
12
Plant miRNA function prediction based on functional similarity network and transductive multi-label classification algorithm. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.12.011] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 01/12/2023]
13
|
Kamkar I, Gupta SK, Phung D, Venkatesh S. Stabilizing l1-norm prediction models by supervised feature grouping. J Biomed Inform 2015; 59:149-68. [PMID: 26689771 DOI: 10.1016/j.jbi.2015.11.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2015] [Revised: 11/18/2015] [Accepted: 11/23/2015] [Indexed: 01/05/2023]
Abstract
Emerging Electronic Medical Records (EMRs) have reformed modern healthcare. These records have great potential to be used for building clinical prediction models. However, a problem in using them is their high dimensionality. Since a lot of information may not be relevant for prediction, the underlying complexity of the prediction models may not be high. A popular way to deal with this problem is to employ feature selection. Lasso and l1-norm based feature selection methods have shown promising results. But, in the presence of correlated features, these methods select features that change considerably with small changes in data. This prevents clinicians from obtaining a stable feature set, which is crucial for clinical decision making. Grouping correlated variables together can improve the stability of feature selection; however, such grouping is usually not known and needs to be estimated for optimal performance. Addressing this problem, we propose a new model that can simultaneously learn the grouping of correlated features and perform stable feature selection. We formulate the model as a constrained optimization problem and provide an efficient solution with guaranteed convergence. Our experiments with both synthetic and real-world datasets show that the proposed model is significantly more stable than Lasso and many existing state-of-the-art shrinkage and classification methods. We further show that in terms of prediction performance, the proposed method consistently outperforms Lasso and other baselines. Our model can be used for selecting stable risk factors for a variety of healthcare problems, so it can assist clinicians toward accurate decision making.
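One common way to quantify the instability described in this abstract is the mean pairwise Jaccard similarity between the feature sets selected on different resamples of the data. The sketch below assumes that metric (the abstract does not say which stability measure the authors used), and the feature names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity across resampled selections."""
    pairs = list(combinations(selected_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three bootstrap resamples: "age" and "bmi"
# are always picked, but correlated lab markers swap in and out.
unstable = [
    {"age", "bmi", "hba1c"},
    {"age", "bmi", "glucose"},
    {"age", "bmi", "insulin"},
]
# A stable selector returns the same set on every resample.
stable = [{"age", "bmi", "hba1c"}] * 3

unstable_score = selection_stability(unstable)
stable_score = selection_stability(stable)
```

A score near 1 means the selector returns nearly the same features on every resample; interchangeable correlated features pull the score down even when prediction accuracy is unchanged.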
Affiliation(s)
- Iman Kamkar
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
- Sunil Kumar Gupta
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
- Dinh Phung
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
- Svetha Venkatesh
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
|
14
|
Meng J, Li R, Luan Y. Classification by integrating plant stress response gene expression data with biological knowledge. Math Biosci 2015; 266:65-72. [PMID: 26092610 DOI: 10.1016/j.mbs.2015.06.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2015] [Revised: 05/03/2015] [Accepted: 06/05/2015] [Indexed: 12/01/2022]
Abstract
Classification of microarray data has always been a challenging task because of the enormous number of genes. In this study, a clustering method by integrating plant stress response gene expression data with biological knowledge is presented. Clustering is one of the promising tools for attribute reduction, but gene clusters are biologically uninformative. So we integrated biological knowledge into genomic analysis to help to improve the interpretation of the results. Biological similarity based on gene ontology (GO) semantic similarity was combined with gene expression data to find out biologically meaningful clusters. Affinity propagation clustering algorithm was chosen to analyze the impact of the biological similarity on the results. Based on clustering result, neighborhood rough set was used to select representative genes for each cluster. The prediction accuracy of classifiers built on reduced gene subsets indicated that our approach outperformed other classical methods. The information fusion was proven to be effective through quantitative analysis, as it could select gene subsets with high biological significance and select significant genes.
Affiliation(s)
- Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
- Rui Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
- Yushi Luan
- School of Life Science and Biotechnology, Dalian University of Technology, Dalian, Liaoning 116023, China.
|
15
|
Hira ZM, Gillies DF. A Review of Feature Selection and Feature Extraction Methods Applied on Microarray Data. Adv Bioinformatics 2015; 2015:198363. [PMID: 26170834 PMCID: PMC4480804 DOI: 10.1155/2015/198363] [Citation(s) in RCA: 282] [Impact Index Per Article: 31.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2015] [Accepted: 05/18/2015] [Indexed: 02/07/2023] Open
Abstract
We summarise various ways of performing dimensionality reduction on high-dimensional microarray data. Many different feature selection and feature extraction methods exist and they are being widely used. All these methods aim to remove redundant and irrelevant features so that classification of new instances will be more accurate. A popular source of data is microarrays, a biological platform for gathering gene expressions. Analysing microarrays can be difficult due to the size of the data they provide. In addition the complicated relations among the different genes make analysis more difficult and removing excess features can improve the quality of the results. We present some of the most popular methods for selecting significant features and provide a comparison between them. Their advantages and disadvantages are outlined in order to provide a clearer idea of when to use each one of them for saving computational time and resources.
Affiliation(s)
- Zena M. Hira
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Duncan F. Gillies
- Department of Computing, Imperial College London, London SW7 2AZ, UK
|
16
|
Yang Y, Li D, Yang Y, Jiang G. An integrated analysis of the effects of microRNA and mRNA on esophageal squamous cell carcinoma. Mol Med Rep 2015; 12:945-52. [PMID: 25823933 PMCID: PMC4438920 DOI: 10.3892/mmr.2015.3557] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2014] [Accepted: 03/16/2015] [Indexed: 01/29/2023] Open
Abstract
Esophageal squamous cell cancer (ESCC) is an aggressive type of cancer with a poor prognosis that leads to decreased quality of life. The identification of patients at increased risk of esophageal squamous cell cancer may improve current understanding of the role of micro (mi)RNA in tumorigenesis, since the miRNA pattern of these patients may be associated with tumorigenesis. In the present study, the miRNA and mRNA expression profiles of ESCC tissue samples and adjacent normal control tissue samples were obtained from two dependent GEO series. Bioinformatics analyses, including the use of the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes databases, were used to identify genes and pathways that were specifically associated with miRNA-associated ESCC oncology. A total of 17 miRNAs and 1,670 probes were differentially expressed in the two groups, and the differentially expressed miRNA and target interactions were analyzed. The mRNAs of miRNA target genes were found to be involved in 49 GO terms and 14 pathways. Of the genes differentially expressed between the two groups, miRNA-181a, miRNA-202, miRNA-155, FNDC3B, BNC2 and MBD2 were the most significantly altered and may be important in the regulatory network. In the present study, a novel pattern of differential miRNA-target expression was constructed, which, with further investigation, may provide novel targets for diagnosing and understanding the mechanism of ESCC.
Affiliation(s)
- Yong Yang
- Department of Thoracic Surgery, Shanghai Pulmonary Hospital Affiliated Tongji University, Shanghai 200433, P.R. China
- Dianbo Li
- Department of Thoracic Surgery, Linyi Tumor Hospital, Linyi, Shandong 276001, P.R. China
- Yang Yang
- Department of Thoracic Surgery, Shanghai Pulmonary Hospital Affiliated Tongji University, Shanghai 200433, P.R. China
- Gening Jiang
- Department of Thoracic Surgery, Shanghai Pulmonary Hospital Affiliated Tongji University, Shanghai 200433, P.R. China
|
17
|
Zhou XL, Wu JH, Wang XJ, Guo FJ. Integrated microRNA-mRNA analysis revealing the potential roles of microRNAs in tongue squamous cell cancer. Mol Med Rep 2015; 12:885-94. [PMID: 25760063 PMCID: PMC4438953 DOI: 10.3892/mmr.2015.3467] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2014] [Accepted: 02/02/2015] [Indexed: 01/21/2023] Open
Abstract
Tongue squamous cell carcinoma (TSCC) is a rare and aggressive type of cancer, which is associated with a poor prognosis. Identification of patients at high risk of TSCC tumorigenesis may provide information for the early detection of metastases, and for potential treatment strategies. MicroRNA (miRNA; miR) and mRNA expression profiling of TSCC tissue samples and normal control tissue samples were obtained from three Gene Expression Omnibus (GEO) data series. Bioinformatics analyses, including the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes were used to identify genes and pathways specifically associated with miRNA-associated TSCC oncology. A total of 25 miRNAs and 769 mRNAs were differentially expressed in the two groups assessed, and all the differentially expressed miRNA and mRNA target interactions were analyzed. The miRNA target genes were predominantly associated with 38 GO terms and 13 pathways. Of the genes differentially expressed between the two groups, and confirmed in another GEO series, miRNA-494, miRNA-96, miRNA-183, runt-related transcription factor 1, programmed cell death protein 4 and membrane-associated guanylate kinase were the most significantly altered, and may be central in the regulation of TSCC. Bioinformatics may be used to analyze large quantities of data in microarrays through rigorous experimental planning, statistical analysis and the collection of complete data on TSCC. In the present study, a novel differential miRNA-mRNA expression network was constructed, and further investigation may provide novel targets for the diagnosis of TSCC.
Affiliation(s)
- Xiao-Li Zhou
- Department of Stomatology, The First Affiliated Hospital of Henan University of Science and Technology, Luoyang, Henan 471003, P.R. China
- Jun-Hua Wu
- Department of Prosthodontics, School of Stomatology, Tongji University, Shanghai 200072, P.R. China
- Xin-Juan Wang
- Department of Stomatology, The First Affiliated Hospital of Henan University of Science and Technology, Luoyang, Henan 471003, P.R. China
- Fu-Jun Guo
- Department of Stomatology, The First Affiliated Hospital of Henan University of Science and Technology, Luoyang, Henan 471003, P.R. China
|
18
|
Meng J, Zhang J, Luan Y. Gene Selection Integrated with Biological Knowledge for Plant Stress Response Using Neighborhood System and Rough Set Theory. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:433-444. [PMID: 26357229 DOI: 10.1109/tcbb.2014.2361329] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Mining knowledge from gene expression data is a hot research topic and direction of bioinformatics. Gene selection and sample classification are significant research trends, due to the large number of genes and small number of samples in gene expression data. Rough set theory has been successfully applied to gene selection, as it can select attributes without redundancy. To improve the interpretability of the selected genes, some researchers introduced biological knowledge. In this paper, we first employ a neighborhood system to deal directly with the new information table formed by integrating gene expression data with biological knowledge, which can simultaneously present the information from multiple perspectives and does not weaken the information of individual genes for selection and classification. Then, we give a novel framework for gene selection and propose a significant gene selection method based on this framework by employing a reduction algorithm from rough set theory. The proposed method is applied to the analysis of plant stress response. Experimental results on three data sets show that the proposed method is effective, as it can select significant gene subsets without redundancy and achieve high classification accuracy. Biological analysis of the results shows that the interpretability is good.
|
19
|
Ye S, Dawson JA, Kendziorski C. Extending information retrieval methods to personalized genomic-based studies of disease. Cancer Inform 2015; 13:85-95. [PMID: 25733795 PMCID: PMC4332045 DOI: 10.4137/cin.s16354] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2014] [Revised: 10/22/2014] [Accepted: 10/23/2014] [Indexed: 01/30/2023] Open
Abstract
Genomic-based studies of disease now involve diverse types of data collected on large groups of patients. A major challenge facing statistical scientists is how best to combine the data, extract important features, and comprehensively characterize the ways in which they affect an individual’s disease course and likelihood of response to treatment. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these challenges. Latent Dirichlet allocation (LDA) models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a “document” with “text” detailing his/her clinical events and genomic state. We then further extend the framework to allow for supervision by a time-to-event response. The model enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas ovarian project identifies informative patient subgroups showing differential response to treatment, and validation in an independent cohort demonstrates the potential for patient-specific inference.
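The "patient as document" encoding in this abstract can be illustrated with a bag-of-words sketch. The token names and the helper below are hypothetical and are not taken from survLDA; the sketch only shows how clinical events and genomic states might be turned into word counts that a topic model could consume.

```python
from collections import Counter


def patient_to_document(clinical_events, genomic_state):
    """Encode one patient as a bag of words (token -> count)."""
    # Repeated clinical events become repeated words in the "text".
    tokens = [f"clin:{event}" for event in clinical_events]
    # Each gene contributes one word describing its state.
    tokens += [f"gene:{gene}:{state}" for gene, state in genomic_state.items()]
    return Counter(tokens)


# Hypothetical patient record.
doc = patient_to_document(
    clinical_events=["platinum_chemo", "platinum_chemo", "surgery"],
    genomic_state={"BRCA1": "mutated", "TP53": "mutated"},
)
vocabulary = sorted(doc)
```

Supervision by a time-to-event response, which is the extension the abstract describes, is not shown here.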
Affiliation(s)
- Shuyun Ye
- Department of Statistics, University of Wisconsin, Madison, WI, USA
- John A Dawson
- Department of Statistics, University of Wisconsin, Madison, WI, USA
- Christina Kendziorski
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
|
20
|
Liu W, Wang Q, Zhao J, Zhang C, Liu Y, Zhang J, Bai X, Li X, Feng H, Liao M, Wang W, Li C. Integration of pathway structure information into a reweighted partial Cox regression approach for survival analysis on high-dimensional gene expression data. Mol Biosyst 2015; 11:1876-86. [DOI: 10.1039/c5mb00044k] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Accurately predicting the risk of cancer relapse or death is important for clinical utility.
|
21
|
Hassanzadeh HR, Phan JH, Wang MD. A semi-supervised method for predicting cancer survival using incomplete clinical data. Annu Int Conf IEEE Eng Med Biol Soc 2015; 2015:210-213. [PMID: 26736237 DOI: 10.1109/embc.2015.7318337] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Prediction of survival for cancer patients is an open area of research. However, many existing studies focus on datasets with a large number of patients. We present a novel method that is specifically designed to address the challenge of data scarcity, which is often the case for cancer datasets. Our method is able to use unlabeled data to improve classification by adopting a semi-supervised training approach to learn an ensemble classifier. The results of applying our method to three cancer datasets show the promise of semi-supervised learning for prediction of cancer survival.
|
22
|
Yang L, Ainali C, Tsoka S, Papageorgiou LG. Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework. BMC Bioinformatics 2014; 15:390. [PMID: 25475756 PMCID: PMC4269079 DOI: 10.1186/s12859-014-0390-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2014] [Accepted: 11/19/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Applying machine learning methods on microarray gene expression profiles for disease classification problems is a popular method to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches where the expression of genes was treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in the literature are either unsupervised or applied on two-class datasets, there is good scope to address such limitations by proposing novel methodologies. RESULTS A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of expression of its constituent genes. Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regard to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. CONCLUSIONS The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user.
Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
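The pathway activity described in the RESULTS paragraph, a weighted linear summation of the expression of a pathway's constituent genes, can be sketched as below. The gene names and weights are hypothetical; in the paper the weights are chosen by a mathematical programming model, which this sketch does not reproduce.

```python
def pathway_activity(expression, weights):
    """Weighted linear summation of the pathway's constituent genes.

    Genes outside the pathway (absent from `weights`) are ignored.
    """
    return sum(w * expression[gene] for gene, w in weights.items())


# Hypothetical expression profile for one sample and fixed gene weights.
sample = {"G1": 2.0, "G2": 1.0, "G3": 4.0}
weights = {"G1": 0.5, "G2": -1.0}  # G3 is not a member of this pathway

activity = pathway_activity(sample, weights)
```

Computing one such score per pathway and per sample yields the low-dimensional pathway activity profile on which classification is then performed.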
Affiliation(s)
- Lingjian Yang
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK.
- Chrysanthi Ainali
- Department of Informatics, School of Natural and Mathematical Sciences, King's College London, London, WC2R 2LS, UK.
- Sophia Tsoka
- Department of Informatics, School of Natural and Mathematical Sciences, King's College London, London, WC2R 2LS, UK.
- Lazaros G Papageorgiou
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK.
|
23
|
Hira ZM, Trigeorgis G, Gillies DF. An algorithm for finding biologically significant features in microarray data based on a priori manifold learning. PLoS One 2014; 9:e90562. [PMID: 24595155 PMCID: PMC3940899 DOI: 10.1371/journal.pone.0090562] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2013] [Accepted: 02/02/2014] [Indexed: 11/19/2022] Open
Abstract
Microarray databases are a large source of genetic data, which, upon proper analysis, could enhance our understanding of biology and medicine. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer, and analytical approaches have been applied in order to classify different types of cancer or distinguish between cancerous and non-cancerous tissue. However, microarrays are high-dimensional datasets with high levels of noise and this causes problems when using machine learning methods. A popular approach to this problem is to search for a set of features that will simplify the structure and to some degree remove the noise from the data. The most widely used approach to feature extraction is principal component analysis (PCA) which assumes a multivariate Gaussian model of the data. More recently, non-linear methods have been investigated. Among these, manifold learning algorithms, for example Isomap, aim to project the data from a higher dimensional space onto a lower dimension one. We have proposed a priori manifold learning for finding a manifold in which a representative set of microarray data is fused with relevant data taken from the KEGG pathway database. Once the manifold has been constructed the raw microarray data is projected onto it and clustering and classification can take place. In contrast to earlier fusion based methods, the prior knowledge from the KEGG databases is not used in, and does not bias the classification process--it merely acts as an aid to find the best space in which to search the data. In our experiments we have found that using our new manifold method gives better classification results than using either PCA or conventional Isomap.
Affiliation(s)
- Zena M. Hira
- Department of Computing, Imperial College London, London, United Kingdom
- George Trigeorgis
- Department of Computing, Imperial College London, London, United Kingdom
- Duncan F. Gillies
- Department of Computing, Imperial College London, London, United Kingdom
|
24
|
Gu S, Su P, Yan J, Zhang X, An X, Gao J, Xin R, Liu Y. Comparison of gene expression profiles and related pathways in chronic thromboembolic pulmonary hypertension. Int J Mol Med 2013; 33:277-300. [PMID: 24337368 PMCID: PMC3896458 DOI: 10.3892/ijmm.2013.1582] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2013] [Accepted: 12/03/2013] [Indexed: 01/08/2023] Open
Abstract
Chronic thromboembolic pulmonary hypertension (CTEPH) is one of the main causes of severe pulmonary hypertension. However, despite treatment (pulmonary endarterectomy), in approximately 15–20% of patients, pulmonary vascular resistance and pulmonary arterial pressure continue to increase. To date, little is known about the changes that occur in gene expression in CTEPH. The identification of genes associated with CTEPH may provide insight into the pathogenesis of CTEPH and may aid in diagnosis and treatment. In this study, we analyzed the gene expression profiles of pulmonary artery endothelial cells from 5 patients with CTEPH and 5 healthy controls using oligonucleotide microarrays. Bioinformatics analyses using the Gene Ontology (GO) and KEGG databases were carried out to identify the genes and pathways specifically associated with CTEPH. Signal transduction networks were established to identify the core genes regulating the progression of CTEPH. A number of genes were found to be differentially expressed in the pulmonary artery endothelial cells from patients with CTEPH. In total, 412 GO terms and 113 pathways were found to be associated with our list of genes. All differential gene interactions in the Signal-Net network were analyzed. JAK3, GNA15, MAPK13, ARRB2 and F2R were the most significantly altered. Bioinformatics analysis may help gather and analyze large amounts of data in microarrays by means of rigorous experimental planning, scientific statistical analysis and the collection of complete data. In this study, a novel differential gene expression pattern was constructed. However, further studies are required to identify novel targets for the diagnosis and treatment of CTEPH.
Affiliation(s)
- Song Gu
- Department of Cardiac Surgery, Beijing Chao-Yang Hospital, Capital Medical University, Beijing 100020, P.R. China
- Pixiong Su
- Department of Cardiac Surgery, Beijing Chao-Yang Hospital, Capital Medical University, Beijing 100020, P.R. China
- Jun Yan
- Department of Cardiac Surgery, Beijing Chao-Yang Hospital, Capital Medical University, Beijing 100020, P.R. China
- Xitao Zhang
- Department of Cardiac Surgery, Beijing Chao-Yang Hospital, Capital Medical University, Beijing 100020, P.R. China
- Xiangguang An
- Department of Cardiac Surgery, Beijing Chao-Yang Hospital, Capital Medical University, Beijing 100020, P.R. China
- Jie Gao
- Department of Cardiac Surgery, Beijing Chao-Yang Hospital, Capital Medical University, Beijing 100020, P.R. China
- Rui Xin
- Department of Cardiac Surgery, Beijing Chao-Yang Hospital, Capital Medical University, Beijing 100020, P.R. China
- Yan Liu
- Department of Cardiac Surgery, Beijing Chao-Yang Hospital, Capital Medical University, Beijing 100020, P.R. China
|
25
|
Yang Y, Li H, Hou S, Hu B, Liu J, Wang J. Differences in gene expression profiles and carcinogenesis pathways involved in cisplatin resistance of four types of cancer. Oncol Rep 2013; 30:596-614. [PMID: 23733047 DOI: 10.3892/or.2013.2514] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2012] [Accepted: 03/04/2013] [Indexed: 11/06/2022] Open
Abstract
Cisplatin-based chemotherapy is the standard therapy used for the treatment of several types of cancer. However, its efficacy is largely limited by the acquired drug resistance. To date, little is known about the RNA expression changes in cisplatin-resistant cancers. Identification of the RNAs related to cisplatin resistance may provide specific insight into cancer therapy. In the present study, expression profiling of 7 cancer cell lines was performed using oligonucleotide microarray analysis data obtained from the GEO database. Bioinformatic analyses such as the Gene Ontology (GO) and KEGG pathway were used to identify genes and pathways specifically associated with cisplatin resistance. A signal transduction network was established to identify the core genes in regulating cancer cell cisplatin resistance. A number of genes were differentially expressed in 7 groups of cancer cell lines. They mainly participated in 85 GO terms and 11 pathways in common. All differential gene interactions in the Signal-Net were analyzed. CTNNB1, PLCG2 and SRC were the most significantly altered. With the use of bioinformatics, large amounts of data in microarrays were retrieved and analyzed by means of thorough experimental planning, scientific statistical analysis and collection of complete data on cancer cell cisplatin resistance. In the present study, a novel differential gene expression pattern was constructed and further study will provide new targets for the diagnosis and mechanisms of cancer cisplatin resistance.
Affiliation(s)
- Yong Yang
- Beijing Key Laboratory of Respiratory and Pulmonary Circulation, Capital Medical University, Beijing 100069, PR China
|
26
|
Zycinski G, Barla A, Squillario M, Sanavia T, Camillo BD, Verri A. Knowledge Driven Variable Selection (KDVS) - a new approach to enrichment analysis of gene signatures obtained from high-throughput data. Source Code Biol Med 2013; 8:2. [PMID: 23302187 PMCID: PMC3605163 DOI: 10.1186/1751-0473-8-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/28/2012] [Accepted: 12/13/2012] [Indexed: 11/10/2022]
Abstract
Background High-throughput (HT) technologies provide a huge amount of gene expression data that can be used to identify biomarkers useful in clinical practice. The most frequently used approaches first select a set of genes (i.e. gene signature) able to characterize differences between two or more phenotypical conditions, and then provide a functional assessment of the selected genes with an a posteriori enrichment analysis, based on biological knowledge. However, this approach comes with some drawbacks. First, the gene selection procedure often requires tunable parameters that affect the outcome, typically producing many false hits. Second, a posteriori enrichment analysis is based on mapping between biological concepts and gene expression measurements, which is hard to compute because of constant changes in biological knowledge and genome analysis. Third, such mapping is typically used in the assessment of the coverage of the gene signature by biological concepts, which is either score-based or requires tunable parameters as well, limiting its power. Results We present Knowledge Driven Variable Selection (KDVS), a framework that uses a priori biological knowledge in HT data analysis. The expression data matrix is transformed, according to prior knowledge, into smaller matrices, easier to analyze and to interpret from both computational and biological viewpoints. Therefore KDVS, unlike most approaches, does not exclude a priori any function or process potentially relevant for the biological question under investigation. Unlike the standard approach, where gene selection and functional assessment are applied independently, KDVS embeds these two steps into a unified statistical framework, decreasing the variability derived from the threshold-dependent selection, the mapping to the biological concepts, and the signature coverage. We present three case studies to assess the usefulness of the method.
Conclusions We showed that KDVS not only enables the selection of known biological functionalities with accuracy, but also the identification of new ones. An efficient implementation of KDVS was devised to obtain results in a fast and robust way. Computing time is drastically reduced by the effective use of distributed resources. Finally, integrated visualization techniques immediately increase the interpretability of results. Overall, the KDVS approach can be considered a viable alternative to enrichment-based approaches.
Affiliation(s)
- Grzegorz Zycinski
- DIBRIS, University of Genoa, via Dodecaneso 35, I-16146 Genova, Italy.
|
27
|
Perez-Rathke A, Li H, Lussier YA. Interpreting personal transcriptomes: personalized mechanism-scale profiling of RNA-seq data. Pacific Symposium on Biocomputing 2013:159-170. [PMID: 23424121 PMCID: PMC3595401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Despite thousands of reported studies unveiling gene-level signatures for complex diseases, few of these techniques work at the single-sample level with explicit underpinning of biological mechanisms. This presents both a critical dilemma in the field of personalized medicine and a plethora of opportunities for analysis of RNA-seq data. In this study, we hypothesize that the "Functional Analysis of Individual Microarray Expression" (FAIME) method we developed could be smoothly extended to RNA-seq data and unveil intrinsic underlying mechanism signatures across different scales of biological data for the same complex disease. Using publicly available RNA-seq data for gastric cancer, we confirmed the effectiveness of this method (i) to translate each sample transcriptome to pathway-scale scores, (ii) to predict deregulated pathways in gastric cancer against gold standards (FDR < 5%, precision = 75%, recall = 92%), and (iii) to predict phenotypes in an independent dataset and expression platform (RNA-seq vs microarrays, Fisher exact test p < 10^-6). Measuring at a single-sample level, FAIME could differentiate cancer samples from normal ones; furthermore, it achieved performance comparable to state-of-the-art cross-sample methods in identifying differentially expressed pathways. These results motivate future work on mechanism-level biomarker discovery predictive of diagnoses, treatment, and therapy.
Affiliation(s)
- Alan Perez-Rathke
- Department of Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA.
|
28
|
|
29
|
Sanavia T, Aiolli F, Da San Martino G, Bisognin A, Di Camillo B. Improving biomarker list stability by integration of biological knowledge in the learning process. BMC Bioinformatics 2012; 13 Suppl 4:S22. [PMID: 22536969 PMCID: PMC3314566 DOI: 10.1186/1471-2105-13-s4-s22] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for biomarker discovery using microarray data often provide results with limited overlap. It has been suggested that one reason for these inconsistencies may be that in complex diseases, such as cancer, multiple genes belonging to one or more physiological pathways are associated with the outcomes. Thus, a possible approach to improve list stability is to integrate biological information from genomic databases in the learning process; however, a comprehensive assessment based on different types of biological information is still lacking in the literature. In this work we have compared the effect of using different types of biological information in the learning process, such as functional annotations, protein-protein interactions and expression correlation among genes. RESULTS Biological knowledge has been codified by means of gene similarity matrices, and expression data have been linearly transformed in such a way that the more similar two features are, the more closely they are mapped. Two semantic similarity matrices, based on Biological Process and Molecular Function Gene Ontology annotation, and geodesic distance applied on protein-protein interaction networks, are the best performers in improving list stability while maintaining almost equal prediction accuracy. CONCLUSIONS The performed analysis supports the idea that when some features are strongly correlated to each other, for example because they are close in the protein-protein interaction network, then they might have similar importance and be equally relevant for the task at hand. The results obtained can be a starting point for additional experiments on combining similarity matrices in order to obtain even more stable lists of biomarkers. The implementation of the classification algorithm is available at the link: http://www.math.unipd.it/~dasan/biomarkers.html.
Affiliation(s)
- Tiziana Sanavia
- Department of Information Engineering, University of Padova, via G. Gradenigo 6/B, 35131 Padova, Italy
- Fabio Aiolli
- Department of Pure and Applied Mathematics, University of Padova, Via Trieste 63, 35121, Padova, Italy
- Giovanni Da San Martino
- Department of Pure and Applied Mathematics, University of Padova, Via Trieste 63, 35121, Padova, Italy
- Andrea Bisognin
- Department of Biology, University of Padova, Via G. Colombo 3, 35121, Padova, Italy
- Barbara Di Camillo
- Department of Information Engineering, University of Padova, via G. Gradenigo 6/B, 35131 Padova, Italy
|
30
|
Yang X, Regan K, Huang Y, Zhang Q, Li J, Seiwert TY, Cohen EEW, Xing HR, Lussier YA. Single sample expression-anchored mechanisms predict survival in head and neck cancer. PLoS Comput Biol 2012; 8:e1002350. [PMID: 22291585 PMCID: PMC3266878 DOI: 10.1371/journal.pcbi.1002350] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Received: 07/14/2011] [Accepted: 11/28/2011] [Indexed: 12/11/2022] Open
Abstract
Gene expression signatures that are predictive of therapeutic response or prognosis are increasingly useful in clinical care; however, mechanistic (and intuitive) interpretation of expression arrays remains an unmet challenge. Additionally, there is surprisingly little gene overlap among distinct clinically validated expression signatures. These “causality challenges” hinder the adoption of signatures as compared to functionally well-characterized single gene biomarkers. To increase the utility of multi-gene signatures in survival studies, we developed a novel approach to generate “personal mechanism signatures” of molecular pathways and functions from gene expression arrays. FAIME, the Functional Analysis of Individual Microarray Expression, computes mechanism scores using rank-weighted gene expression of an individual sample. By comparing head and neck squamous cell carcinoma (HNSCC) samples with non-tumor control tissues, the precision and recall of deregulated FAIME-derived mechanisms of pathways and molecular functions are comparable to those produced by conventional cohort-wide methods (e.g. GSEA). The overlap of “Oncogenic FAIME Features of HNSCC” (statistically significant and differentially regulated FAIME-derived genesets representing GO functions or KEGG pathways derived from HNSCC tissue) among three distinct HNSCC datasets (pathways:46%, p<0.001) is more significant than the gene overlap (genes:4%). These Oncogenic FAIME Features of HNSCC can accurately discriminate tumors from control tissues in two additional HNSCC datasets (n = 35 and 91, F-accuracy = 100% and 97%, empirical p<0.001, area under the receiver operating characteristic curves = 99% and 92%), and stratify recurrence-free survival in patients from two independent studies (p = 0.0018 and p = 0.032, log-rank). Previous approaches depending on group assignment of individual samples before selecting features or learning a classifier are limited by design to discrete-class prediction. 
In contrast, FAIME calculates mechanism profiles for individual patients without requiring group assignment in validation sets. FAIME is more amenable for clinical deployment since it translates the gene-level measurements of each given sample into pathways and molecular function profiles that can be applied to analyze continuous phenotypes in clinical outcome studies (e.g. survival time, tumor volume).
Clinical utilization of multi-gene expression signatures that are predictive of therapeutic response has been steadily increasing; however, interpretation of such results remains challenging because multi-gene signatures, generated from analyzing different patient cohorts, tend to be equally predictive but contain minimal overlap. Whereas pathway-level analyses of expression arrays show promise for generating clinically meaningful mechanistic signatures, current approaches do not permit single-patient-based analyses that are independent of cross-group calculations. To bridge the gap between deterministic biological mechanisms of single-gene biomarkers and the statistical predictive power of multi-gene signatures that are disconnected from mechanisms, we developed FAIME, a novel method that transforms microarray gene expression data into individualized patient profiles of molecular mechanisms. We have validated its capability for predicting clinical outcomes, including cancer patient samples derived from six different clinical trial cohorts of head and neck cancers. This method provides opportunities to harness an untapped resource for personal genomics: clinical evaluation and testing of individually interpretable mechanistic profiles derived from gene expression arrays.
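A single-sample, rank-weighted mechanism score of the kind described above can be illustrated with a simplified sketch. The exponential weight and the "mean inside minus mean outside the gene set" contrast are assumptions chosen for illustration and should not be read as the published FAIME formula; the function names and the toy sample are hypothetical.

```python
import math
from typing import Dict, Set


def rank_weights(sample: Dict[str, float]) -> Dict[str, float]:
    """Weight each gene by its expression rank within ONE sample.

    The most highly expressed gene gets weight 1.0 and weights decay
    exponentially with rank (an assumed weighting, for illustration only).
    """
    ordered = sorted(sample, key=sample.get, reverse=True)
    n = len(ordered)
    return {gene: math.exp(-position / n) for position, gene in enumerate(ordered)}


def mechanism_score(sample: Dict[str, float], gene_set: Set[str]) -> float:
    """Mean weight of genes in the set minus mean weight of all other genes."""
    weights = rank_weights(sample)
    inside = [w for g, w in weights.items() if g in gene_set]
    outside = [w for g, w in weights.items() if g not in gene_set]
    if not inside or not outside:
        raise ValueError("gene set must cover some, but not all, measured genes")
    return sum(inside) / len(inside) - sum(outside) / len(outside)


# Hypothetical sample: no other samples or group labels are needed.
sample = {"g1": 10.0, "g2": 8.0, "g3": 1.0, "g4": 0.5}
high = mechanism_score(sample, {"g1", "g2"})
low = mechanism_score(sample, {"g3", "g4"})
```

Because the score depends only on ranks within one sample, it can be computed for a new patient without reference to a cohort, which is the property the abstract emphasizes.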
Affiliation(s)
- Xinan Yang
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Kelly Regan
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Yong Huang
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Qingbei Zhang
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Jianrong Li
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Tanguy Y. Seiwert
- Section of Hematology/Oncology of the Department of Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Comprehensive Cancer Center, The University of Chicago, Chicago, Illinois, United States of America
- Ezra E. W. Cohen
- Section of Hematology/Oncology of the Department of Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Comprehensive Cancer Center, The University of Chicago, Chicago, Illinois, United States of America
- H. Rosie Xing
- Comprehensive Cancer Center, The University of Chicago, Chicago, Illinois, United States of America
- Departments of Pathology and of Cellular and Radiation Oncology, The University of Chicago, Chicago, Illinois, United States of America
- Ludwig Center for Metastasis Research, The University of Chicago, Chicago, Illinois, United States of America
- Yves A. Lussier
- Center for Biomedical Informatics, The University of Chicago, Chicago, Illinois, United States of America
- Section of Genetic Medicine, The University of Chicago, Chicago, Illinois, United States of America
- Comprehensive Cancer Center, The University of Chicago, Chicago, Illinois, United States of America
- Departments of Pathology and of Cellular and Radiation Oncology, The University of Chicago, Chicago, Illinois, United States of America
- Ludwig Center for Metastasis Research, The University of Chicago, Chicago, Illinois, United States of America
- Computation Institute, Institute for Translational Medicine, and Institute for Genomics and Systems Biology, The University of Chicago, Chicago, Illinois, United States of America
|
31
|
Lee S, Kim J, Lee S. A comparative study on gene-set analysis methods for assessing differential expression associated with the survival phenotype. BMC Bioinformatics 2011; 12:377. [PMID: 21943316 PMCID: PMC3196970 DOI: 10.1186/1471-2105-12-377] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 03/29/2011] [Accepted: 09/26/2011] [Indexed: 11/10/2022] Open
Abstract
Background Many gene-set analysis methods have been previously proposed and compared through simulation studies and analysis of real datasets for binary phenotypes. We focused on the survival phenotype and compared the performances of Gene Set Enrichment Analysis (GSEA), Global Test (GT), Wald-type Test (WT) and Global Boost Test (GBST) methods in a simulation study and on two ovarian cancer data sets. We considered two versions of GSEA by allowing different weights: GSEA1 uses equal weights, yielding results similar to the Kolmogorov-Smirnov test; while GSEA2's weights are based on the correlation between genes and the phenotype. Results We compared GSEA1, GSEA2, GT, WT and GBST in a simulation study with various settings for the correlation structure of the genes and the association parameter between the survival outcome and the genes. Simulation results indicated that GT, WT and GBST consistently have higher power than GSEA1 and GSEA2 across all scenarios. However, the power of the five tests depends on the combination of correlation structure and association parameter. For the ovarian cancer data set, using the FDR threshold of q < 0.1, the GT, WT and GBST detected 12, 6 and 8 significant pathways, respectively, whereas neither GSEA1 nor GSEA2 detected any significant pathways. In addition, among the pathways found significant by GT, WT, and GBST, three pathways - Purine metabolism, Leukocyte transendothelial migration and Jak-STAT signaling pathway - overlapped with those reported in previous ovarian cancer microarray studies. Conclusion Simulation studies and a real data example indicate that GT, WT and GBST tend to have high power, whereas GSEA1 and GSEA2 have lower power. We also found that the power of the five tests is much higher when genes are correlated than when genes are independent, when survival is positively associated with genes. 
It seems that there is a synergistic effect in detecting significant gene sets when significant genes have within-class correlation and the association between survival and genes is positive or negative (i.e., one-direction correlation).
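The difference between the two GSEA weightings compared above can be made concrete with a minimal sketch of the running-sum enrichment score. Setting the exponent `p` to 0 gives equal weights (a Kolmogorov-Smirnov-like statistic, as in GSEA1), while `p = 1` weights each hit by the magnitude of its gene-level statistic (as in GSEA2). The gene names and scores are hypothetical, and in a survival setting the scores would come from some gene-level association statistic that this sketch takes as given.

```python
from typing import List, Set, Tuple


def enrichment_score(
    ranked: List[Tuple[str, float]], gene_set: Set[str], p: float = 1.0
) -> float:
    """Signed maximum deviation of the GSEA running sum.

    `ranked` holds (gene, score) pairs sorted by score in descending order.
    """
    hit_weights = [abs(score) ** p for gene, score in ranked if gene in gene_set]
    n_miss = len(ranked) - len(hit_weights)
    total_hit = sum(hit_weights)
    if not hit_weights or n_miss == 0 or total_hit == 0:
        raise ValueError("need hits with non-zero total weight and at least one miss")
    running = 0.0
    best = 0.0
    for gene, score in ranked:
        if gene in gene_set:
            running += abs(score) ** p / total_hit  # step up on a hit
        else:
            running -= 1.0 / n_miss  # step down on a miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list: one strong hit at the top, one weak hit at the bottom.
ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0)]
es_equal = enrichment_score(ranked, {"a", "d"}, p=0.0)
es_weighted = enrichment_score(ranked, {"a", "d"}, p=1.0)
```

In this toy case the weighted variant gives a larger score because the top-ranked hit dominates the running sum. Significance testing by permutation, which both GSEA variants require, is outside the scope of the sketch.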
Affiliation(s)
- Seungyeoun Lee
- Department of Mathematics and Statistics, Sejong University, Seoul, 143-747, Korea.
|
32
|
Zhao Z, Liu Y, He H, Chen X, Chen J, Lu YC. Candidate genes influencing sensitivity and resistance of human glioblastoma to Semustine. Brain Res Bull 2011; 86:189-94. [PMID: 21807073 DOI: 10.1016/j.brainresbull.2011.07.010] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Received: 01/19/2011] [Revised: 07/07/2011] [Accepted: 07/13/2011] [Indexed: 12/28/2022]
Abstract
OBJECTIVE The prognosis of glioblastoma (GBM) is poor. The therapeutic outcome of conventional surgical and adjuvant treatments remains unsatisfactory, and individualized adjuvant chemotherapy has therefore attracted increasing attention. Microarrays have been applied to study the mechanisms of GBM development and progression, but it is difficult to identify the responsible genes among the plethora of genes on a microarray that are unrelated to outcome. The present study used bioinformatics methods to investigate candidate genes that may influence the chemosensitivity of GBM to Semustine (Me-CCNU). METHODS Clinical data of 4 GBM patients in Affymetrix microarray were completed through long-term follow-up. Differentially expressed genes between the long- and short-survival groups were identified, and GO analysis and pathway analysis of these genes were performed. Me-CCNU-related signal transduction networks were constructed. The three preceding steps were then combined to screen core genes that influence Me-CCNU chemosensitivity in GBM. RESULTS In the Affymetrix microarray data, a total of 2018 differentially expressed genes influenced survival duration in GBM. Of these, 934 genes were up-regulated and 1084 down-regulated. They mainly participated in 94 pathways. Me-CCNU-related signal transduction networks were constructed. The networks contained 466 genes, of which 66 were also among the survival duration-related differentially expressed genes. By studying key genes through GO analysis, pathway analysis and the Me-CCNU-related signal transduction networks, 25 core genes that influence the chemosensitivity of GBM to Me-CCNU were obtained, including TP53, MAP2K2, EP300, PRKCA, TNF, CCND1, AKT2, RBL1, CDC2, ID2, RAF1, CDKN2C, FGFR1, SP1, CDK6, IGFBP3, MDM4, PDGFD, SOCS2, CCNG2, CDK2, SDC2, STMN1, TCF7L1, TUBB.
CONCLUSION Bioinformatics may help mine and analyze large amounts of microarray data by means of rigorous experimental planning, sound statistical analysis and the collection of complete survival data for GBM patients. In the present study, a novel differential gene expression pattern was constructed, and further study will provide new targets for chemosensitivity of GBM.
Affiliation(s)
- Zhenyu Zhao
- Department of Neurosurgery, ChangZheng Hospital, Second Military Medical University, Shanghai, China
|
33
|
Chen X, Wang L, Ishwaran H. An Integrative Pathway-based Clinical-genomic Model for Cancer Survival Prediction. Stat Probab Lett 2010; 80:1313-1319. [PMID: 21731150 DOI: 10.1016/j.spl.2010.04.011] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 01/19/2023]
Abstract
Prediction models that use gene expression levels are now being proposed for personalized treatment of cancer, but building accurate models that are easy to interpret remains a challenge. In this paper, we describe an integrative clinical-genomic approach that combines both genomic pathway and clinical information. First, we summarize information from genes in each pathway using Supervised Principal Components (SPCA) to obtain pathway-based genomic predictors. Next, we build a prediction model based on clinical variables and pathway-based genomic predictors using Random Survival Forests (RSF). Our rationale for this two-stage procedure is that the underlying disease process may be influenced by environmental exposure (measured by clinical variables) and perturbations in different pathways (measured by pathway-based genomic variables), as well as their interactions. Using two cancer microarray datasets, we show that the pathway-based clinical-genomic model outperforms gene-based clinical-genomic models, with improved prediction accuracy and interpretability.
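The first stage of this two-stage procedure, summarizing the genes of one pathway into a single predictor, can be sketched as follows. This is a simplified stand-in under stated assumptions: genes are screened by their Pearson correlation with a continuous outcome rather than by a survival-specific score, the first principal component is found by plain power iteration, and the second stage (Random Survival Forests) is not shown. All names, the threshold, and the data are hypothetical.

```python
import math
from typing import List


def _center(values: List[float]) -> List[float]:
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _corr(x: List[float], y: List[float]) -> float:
    xc, yc = _center(x), _center(y)
    den = math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
    return sum(a * b for a, b in zip(xc, yc)) / den if den else 0.0


def first_pc_scores(columns: List[List[float]], n_iter: int = 200) -> List[float]:
    """Per-sample scores on the first principal component of centered columns."""
    k = len(columns)
    cov = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(n_iter):  # power iteration on the cross-product matrix
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    n_samples = len(columns[0])
    return [sum(v[j] * columns[j][s] for j in range(k)) for s in range(n_samples)]


def pathway_predictor(
    genes: List[List[float]], outcome: List[float], threshold: float = 0.5
) -> List[float]:
    """Screen genes by |correlation| with the outcome, then take the first PC."""
    selected = [_center(g) for g in genes if abs(_corr(g, outcome)) >= threshold]
    if not selected:
        return [0.0] * len(outcome)
    return first_pc_scores(selected)


# Hypothetical pathway: two outcome-related genes and one unrelated gene.
genes = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5], [1.0, -1.0, 1.0, -1.0]]
outcome = [1.0, 2.0, 3.0, 4.0]
scores = pathway_predictor(genes, outcome)
```

One such score vector per pathway, together with the clinical variables, would form the input table for the second-stage model.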
Affiliation(s)
- Xi Chen
- Division of Cancer Biostatistics, Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA
|
34
|
Lee KH, Lee SH. Predicting Survival of DLBCL Patients in Pathway-Based Microarray Analysis. Korean Journal of Applied Statistics 2010. [DOI: 10.5351/kjas.2010.23.4.705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
35
|
He Z, Yu W. Stable feature selection for biomarker discovery. Comput Biol Chem 2010; 34:215-25. [PMID: 20702140 DOI: 10.1016/j.compbiolchem.2010.07.002] [Citation(s) in RCA: 131] [Impact Index Per Article: 9.4] [Received: 05/09/2010] [Revised: 06/27/2010] [Accepted: 07/10/2010] [Indexed: 12/27/2022]
|
36
|
Banerjee D. Reinventing diagnostics for personalized therapy in oncology. Cancers (Basel) 2010; 2:1066-91. [PMID: 24281107 PMCID: PMC3835119 DOI: 10.3390/cancers2021066] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/06/2010] [Revised: 05/15/2010] [Accepted: 05/28/2010] [Indexed: 11/16/2022] Open
Abstract
Human cancers are still diagnosed and classified using the light microscope. The criteria are based upon morphologic observations by pathologists and tend to be subject to interobserver variation. In preoperative biopsies of non-small cell lung cancers, the diagnostic concordance, even amongst experienced pulmonary pathologists, is no better than a coin toss. Only 25% of cancer patients, on average, benefit from therapy, as most therapies do not account for individual factors that influence response or outcome. Unsuccessful first line therapy costs Canada CAN$1.2 billion for the top 14 cancer types, and this extrapolates to $90 billion globally. The availability of accurate drug selection for personalized therapy could better allocate these precious resources to the right therapies. This wasteful situation is beginning to change with the completion of the human genome sequencing project and with the increasing availability of targeted therapies. Both factors are giving rise to attempts to correlate tumor characteristics and response to specific adjuvant and neoadjuvant therapies. Static cancer classification and grading systems need to be replaced by functional classification systems that not only account for intra- and inter-tumor heterogeneity, but also allow for the selection of the correct chemotherapeutic compounds for the individual patient. In this review, the examples of lung and breast cancer are used to illustrate the issues to be addressed in the coming years, as well as the emerging technologies that have great promise in enabling personalized therapy.
Affiliation(s)
- Diponkar Banerjee
- Centre for Translational and Applied Genomics (CTAG), Provincial Health Services Authority (PHSA) Laboratories, Vancouver, British Columbia, Canada.
|
37
|
Podo F, Buydens LMC, Degani H, Hilhorst R, Klipp E, Gribbestad IS, Van Huffel S, van Laarhoven HWM, Luts J, Monleon D, Postma GJ, Schneiderhan-Marra N, Santoro F, Wouters H, Russnes HG, Sørlie T, Tagliabue E, Børresen-Dale AL. Triple-negative breast cancer: present challenges and new perspectives. Mol Oncol 2010; 4:209-29. [PMID: 20537966 PMCID: PMC5527939 DOI: 10.1016/j.molonc.2010.04.006] [Citation(s) in RCA: 225] [Impact Index Per Article: 16.1] [Received: 01/19/2010] [Accepted: 04/16/2010] [Indexed: 12/28/2022] Open
Abstract
Triple-negative breast cancers (TNBC), characterized by absence of estrogen receptor (ER), progesterone receptor (PR) and lack of overexpression of human epidermal growth factor receptor 2 (HER2), are typically associated with poor prognosis, due to aggressive tumor phenotype(s), only partial response to chemotherapy and present lack of clinically established targeted therapies. Advances in the design of individualized strategies for treatment of TNBC patients require further elucidation, by combined 'omics' approaches, of the molecular mechanisms underlying TNBC phenotypic heterogeneity, and the still poorly understood association of TNBC with BRCA1 mutations. An overview is here presented on TNBC profiling in terms of expression signatures, within the functional genomic breast tumor classification, and ongoing efforts toward identification of new therapy targets and bioimaging markers. Due to the complexity of aberrant molecular patterns involved in expression, pathological progression and biological/clinical heterogeneity, the search for novel TNBC biomarkers and therapy targets requires collection of multi-dimensional data sets, use of robust multivariate data analysis techniques and development of innovative systems biology approaches.
Affiliation(s)
- Franca Podo
- Department of Cell Biology and Neurosciences, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161 Rome, Italy.
|
38
|
De Haan JR, Piek E, van Schaik RC, de Vlieg J, Bauerschmidt S, Buydens LMC, Wehrens R. Integrating gene expression and GO classification for PCA by preclustering. BMC Bioinformatics 2010; 11:158. [PMID: 20346140 PMCID: PMC2860362 DOI: 10.1186/1471-2105-11-158] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 05/20/2009] [Accepted: 03/26/2010] [Indexed: 12/03/2022] Open
Abstract
Background Gene expression data can be analyzed by summarizing groups of individual gene expression profiles based on GO annotation information. The mean expression profile per group can then be used to identify interesting GO categories in relation to the experimental settings. However, the expression profiles present in GO classes are often heterogeneous, i.e., there are several different expression profiles within one class. As a result, important experimental findings can be obscured because the summarizing profile does not seem to be of interest. We propose to tackle this problem by finding homogeneous subclasses within GO categories: preclustering. Results Two microarray datasets are analyzed. First, a selection of genes from a well-known Saccharomyces cerevisiae dataset is used. The GO class "cell wall organization and biogenesis" is shown as a specific example. After preclustering, this term can be associated with different phases in the cell cycle, where it could not be associated with a specific phase previously. Second, a dataset of differentiation of human Mesenchymal Stem Cells (MSC) into osteoblasts is used. For this dataset results are shown in which the GO term "skeletal development" is a specific example of a heterogeneous GO class for which better associations can be made after preclustering. The Intra Cluster Correlation (ICC), a measure of cluster tightness, is applied to identify relevant clusters. Conclusions We show that this method leads to an improved interpretability of results in Principal Component Analysis.
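The motivation for preclustering, that a mean profile over a heterogeneous class can look uninteresting even though its subclasses carry clear signals, can be shown with a small numeric sketch. The profiles are invented, and the two subclasses are assigned by hand here rather than found by a clustering algorithm; the helper names are hypothetical.

```python
from typing import List


def mean_profile(profiles: List[List[float]]) -> List[float]:
    """Average expression per condition over all genes in a group."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def spread(profile: List[float]) -> float:
    """Variance of a profile across conditions (0 means a flat profile)."""
    mean = sum(profile) / len(profile)
    return sum((v - mean) ** 2 for v in profile) / len(profile)


# Hypothetical GO class containing two opposite expression patterns.
rising = [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1]]
falling = [[2.0, 1.0, 0.0], [2.1, 1.1, 0.1]]
whole_class = rising + falling

flat = spread(mean_profile(whole_class))    # patterns cancel out
informative = spread(mean_profile(rising))  # subclass keeps its trend
```

Summarizing each homogeneous subclass separately preserves the trend that the class-wide mean averages away.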
Affiliation(s)
- Jorn R De Haan
- Institute for Molecules and Materials, Analytical Chemistry, Radboud University Nijmegen, Heyendaalseweg 135, Nijmegen, The Netherlands
|
39
|
Bright LA, Burgess SC, Chowdhary B, Swiderski CE, McCarthy FM. Structural and functional-annotation of an equine whole genome oligoarray. BMC Bioinformatics 2009; 10 Suppl 11:S8. [PMID: 19811692 PMCID: PMC3226197 DOI: 10.1186/1471-2105-10-s11-s8] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/23/2022] Open
Abstract
Background The horse genome has been sequenced, allowing equine researchers to use high-throughput functional genomics platforms such as microarrays, next-generation sequencing for gene expression, and proteomics. However, for researchers to derive value from these functional genomics datasets, they must be able to model the data in biologically relevant ways; doing so requires that the equine genome be more fully annotated. There are two interrelated types of genomic annotation: structural and functional. Structural annotation is delineating and demarcating the genomic elements (such as genes, promoters, and regulatory elements). Functional annotation is assigning function to structural elements. The Gene Ontology (GO) is the de facto standard for functional annotation, and is routinely used as a basis for modelling and hypothesis testing with large functional genomics datasets. Results An Equine Whole Genome Oligonucleotide (EWGO) array with 21,351 elements was developed at Texas A&M University. This 70-mer oligoarray was designed using the approximately 7× assembled and annotated sequence of the equine genome to be one of the most comprehensive arrays available for expressed equine sequences. To assist researchers in determining the biological meaning of data derived from this array, we have structurally annotated it by mapping the elements to multiple database accessions, including UniProtKB, Entrez Gene, NRPD (Non-Redundant Protein Database) and UniGene. We next provided GO functional annotations for the gene transcripts represented on this array. Overall, we GO annotated 14,531 gene products (68.1% of the gene products represented on the EWGO array) with 57,912 annotations. GAQ (GO Annotation Quality) scores were calculated for this array both before and after we added GO annotation. The additional annotations improved the mean GAQ score 16-fold. These data are publicly available at AgBase (http://www.agbase.msstate.edu/).
Conclusion Providing additional information about the public databases which link to the gene products represented on the array allows users more flexibility when using gene expression modelling and hypothesis-testing computational tools. Moreover, since different databases provide different types of information, users have access to multiple data sources. In addition, our GO annotation underpins functional modelling for most gene expression analysis tools and enables equine researchers to model large lists of differentially expressed transcripts in biologically relevant ways.
Affiliation(s)
- Lauren A Bright
- Department of Clinical Sciences, College of Veterinary Medicine, Mississippi State University, PO Box 6100, Mississippi State, MS, 39762, USA.
|
40
|
Guan P, Huang D, He M, Zhou B. Lung cancer gene expression database analysis incorporating prior knowledge with support vector machine-based classification method. Journal of Experimental & Clinical Cancer Research 2009; 28:103. [PMID: 19615083 PMCID: PMC2719616 DOI: 10.1186/1756-9966-28-103] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 06/03/2009] [Accepted: 07/18/2009] [Indexed: 01/13/2023]
Abstract
Background A reliable and precise classification is essential for successful diagnosis and treatment of cancer. Gene expression microarrays have provided a high-throughput platform to discover genomic biomarkers for cancer diagnosis and prognosis. Rational use of the available bioinformation can not only effectively remove or suppress noise in gene chips, but also avoid one-sided results of separate experiments. However, only a few studies have recognized the importance of prior information in cancer classification. Methods Using a support vector machine as the discriminant approach, we proposed a modified method that incorporates prior knowledge into cancer classification based on gene expression data to improve accuracy. A well-known public dataset, the malignant pleural mesothelioma and lung adenocarcinoma gene expression database, was used in this study. Prior knowledge is viewed here as a means of directing the classifier using known lung adenocarcinoma related genes. The procedures were performed with the software R 2.80. Results The modified method performed better after incorporating prior knowledge. Accuracy of the modified method improved from 98.86% to 100% in the training set and from 98.51% to 99.06% in the test set. The standard deviations of the modified method decreased from 0.26% to 0 in the training set and from 3.04% to 2.10% in the test set. Conclusion The method that incorporates prior knowledge into discriminant analysis could effectively improve classification capacity and reduce the impact of noise. This idea may have a promising future not only in practice but also in methodology.
Affiliation(s)
- Peng Guan
- Department of Epidemiology, School of Public Health, China Medical University, Shenyang 110001, PR China.
|