151. Blocker AW, Meng XL. The potential and perils of preprocessing: Building new foundations. Bernoulli 2013. [DOI: 10.3150/13-bejsp16] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
152. Mao Z, Cai W, Shao X. Selecting significant genes by randomization test for cancer classification using gene expression data. J Biomed Inform 2013; 46:594-601. [DOI: 10.1016/j.jbi.2013.03.009] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 07/03/2012] [Revised: 01/30/2013] [Accepted: 03/28/2013] [Indexed: 12/30/2022]
153. Sung J, Kim PJ, Ma S, Funk CC, Magis AT, Wang Y, Hood L, Geman D, Price ND. Multi-study integration of brain cancer transcriptomes reveals organ-level molecular signatures. PLoS Comput Biol 2013; 9:e1003148. [PMID: 23935471 PMCID: PMC3723500 DOI: 10.1371/journal.pcbi.1003148] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 11/29/2012] [Accepted: 06/05/2013] [Indexed: 12/23/2022]
Abstract
We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. These signatures were based on a new method reported herein – Identification of Structured Signatures and Classifiers (ISSAC) – that resulted in a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, with others having known roles in cancer biology. Analyses on large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between different published studies, for it was observed that the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences, as is typical. For this reason, we restricted ourselves to studying only cases where we had at least two independent studies performed for each phenotype, and also reprocessed all the raw data from the studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, 90% phenotype prediction accuracy, or the accuracy of identifying a particular brain cancer from the background of all phenotypes, was found. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood. 
From a multi-study, integrated transcriptomic dataset, we identified a marker panel for differentiating major human brain cancers at the gene-expression level. The ISSAC molecular signatures for brain cancers, composed of 44 unique genes, are based on comparing expression levels of pairs of genes, and phenotype prediction follows a diagnostic hierarchy. We found that sufficient dataset integration across multiple studies greatly enhanced diagnostic performance on truly independent validation sets, whereas signatures learned from only one dataset typically led to high error rate. Molecular signatures of brain cancers, when obtained using all currently available gene-expression data, achieved 90% phenotype prediction accuracy. Thus, our integrative approach holds significant promise for developing organ-level, comprehensive, molecular signatures of disease.
Affiliation(s)
- Jaeyun Sung
  - Institute for Systems Biology, Seattle, Washington, United States of America
  - Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America
- Pan-Jun Kim
  - Asia Pacific Center for Theoretical Physics, Pohang, Gyeongbuk, Republic of Korea
  - Department of Physics, POSTECH, Pohang, Gyeongbuk, Republic of Korea
- Shuyi Ma
  - Institute for Systems Biology, Seattle, Washington, United States of America
  - Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America
- Cory C. Funk
  - Institute for Systems Biology, Seattle, Washington, United States of America
- Andrew T. Magis
  - Institute for Systems Biology, Seattle, Washington, United States of America
  - Center for Biophysics and Computational Biology, University of Illinois, Urbana, Illinois, United States of America
- Yuliang Wang
  - Institute for Systems Biology, Seattle, Washington, United States of America
  - Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America
- Leroy Hood
  - Institute for Systems Biology, Seattle, Washington, United States of America
- Donald Geman
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
- Nathan D. Price
  - Institute for Systems Biology, Seattle, Washington, United States of America
154. An ensemble of SVM classifiers based on gene pairs. Comput Biol Med 2013; 43:729-37. [DOI: 10.1016/j.compbiomed.2013.03.010] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Received: 03/21/2012] [Revised: 03/21/2013] [Accepted: 03/22/2013] [Indexed: 11/19/2022]
155. Ren X, Wang Y, Zhang XS, Jin Q. iPcc: a novel feature extraction method for accurate disease class discovery and prediction. Nucleic Acids Res 2013; 41:e143. [PMID: 23761440 PMCID: PMC3737526 DOI: 10.1093/nar/gkt343] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 01/13/2023]
Abstract
Gene expression profiling has gradually become a routine procedure for disease diagnosis and classification. In the past decade, many computational methods have been proposed, resulting in great improvements on various levels, including feature selection and algorithms for classification and clustering. In this study, we present iPcc, a novel method from the feature extraction perspective to further propel gene expression profiling technologies from bench to bedside. We define ‘correlation feature space’ for samples based on the gene expression profiles by iterative employment of Pearson’s correlation coefficient. Numerical experiments on both simulated and real gene expression data sets demonstrate that iPcc can greatly highlight the latent patterns underlying noisy gene expression data and thus greatly improve the robustness and accuracy of the algorithms currently available for disease diagnosis and classification based on gene expression profiles.
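The abstract describes the correlation feature space only as the iterative use of Pearson's correlation coefficient between samples. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the number of iterations `n_iter`, and the toy profiles are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_features(profiles):
    """Describe each sample by its correlation with every sample."""
    return [[pearson(p, q) for q in profiles] for p in profiles]

def iterated_correlation_features(profiles, n_iter=2):
    """Re-apply the sample-by-sample correlation step n_iter times.

    n_iter is a hypothetical parameter; the abstract does not say how
    many iterations are used or how they are stopped.
    """
    features = profiles
    for _ in range(n_iter):
        features = correlation_features(features)
    return features

# Toy data (made up): four samples measured on four genes.
# Samples 0 and 1 rise across the genes, samples 2 and 3 fall.
profiles = [
    [1, 2, 3, 4],
    [2, 4, 6, 9],
    [4, 3, 2, 1],
    [9, 6, 4, 2],
]
features = iterated_correlation_features(profiles, n_iter=2)
```

In this toy case the two rising samples end up with a positive feature-space correlation and the rising and falling samples with a negative one, which is the kind of pattern sharpening the abstract alludes to.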
Affiliation(s)
- Xianwen Ren
  - MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China
156. Zhang L, Hao C, Shen X, Hong G, Li H, Zhou X, Liu C, Guo Z. Rank-based predictors for response and prognosis of neoadjuvant taxane-anthracycline-based chemotherapy in breast cancer. Breast Cancer Res Treat 2013; 139:361-9. [DOI: 10.1007/s10549-013-2566-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 01/20/2013] [Accepted: 05/10/2013] [Indexed: 12/22/2022]
157. Marchionni L, Afsari B, Geman D, Leek JT. A simple and reproducible breast cancer prognostic test. BMC Genomics 2013; 14:336. [PMID: 23682826 PMCID: PMC3662649 DOI: 10.1186/1471-2164-14-336] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 12/18/2012] [Accepted: 05/04/2013] [Indexed: 11/10/2022]
Abstract
Background A small number of prognostic and predictive tests based on gene expression are currently offered as reference laboratory tests. In contrast to such success stories, a number of flaws and errors have recently been identified in other genomic-based predictors and the success rate for developing clinically useful genomic signatures is low. These errors have led to widespread concerns about the protocols for conducting and reporting of computational research. As a result, a need has emerged for a template for reproducible development of genomic signatures that incorporates full transparency, data sharing and statistical robustness. Results Here we present the first fully reproducible analysis of the data used to train and test MammaPrint, an FDA-cleared prognostic test for breast cancer based on a 70-gene expression signature. We provide all the software and documentation necessary for researchers to build and evaluate genomic classifiers based on these data. As an example of the utility of this reproducible research resource, we develop a simple prognostic classifier that uses only 16 genes from the MammaPrint signature and is equally accurate in predicting 5-year disease free survival. Conclusions Our study provides a prototypic example for reproducible development of computational algorithms for learning prognostic biomarkers in the era of personalized medicine.
Affiliation(s)
- Luigi Marchionni
  - The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, School of Medicine, Baltimore, MD 21231, USA
158. Liu HC, Peng PC, Hsieh TC, Yeh TC, Lin CJ, Chen CY, Hou JY, Shih LY, Liang DC. Comparison of feature selection methods for cross-laboratory microarray analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:593-604. [PMID: 24091394 DOI: 10.1109/tcbb.2013.70] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 05/25/2023]
Abstract
The amount of gene expression data of microarray has grown exponentially. To apply them for extensive studies, integrated analysis of cross-laboratory (cross-lab) data becomes a trend, and thus, choosing an appropriate feature selection method is an essential issue. This paper focuses on feature selection for Affymetrix (Affy) microarray studies across different labs. We investigate four feature selection methods: t-test, significance analysis of microarrays (SAM), rank products (RP), and random forest (RF). The four methods are applied to acute lymphoblastic leukemia, acute myeloid leukemia, breast cancer, and lung cancer Affy data which consist of three cross-lab data sets each. We utilize a rank-based normalization method to reduce the bias from cross-lab data sets. Training on one data set or two combined data sets to test the remaining data set(s) are both considered. Balanced accuracy is used for prediction evaluation. This study provides comprehensive comparisons of the four feature selection methods in cross-lab microarray analysis. Results show that SAM has the best classification performance. RF also gets high classification accuracy, but it is not as stable as SAM. The most naive method is t-test, but its performance is the worst among the four methods. In this study, we further discuss the influence from the number of training samples, the number of selected genes, and the issue of unbalanced data sets.
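The abstract names balanced accuracy as its evaluation measure without defining it. As a minimal sketch, assuming the common definition (the mean of the per-class recalls), the example below shows why it matters for unbalanced data sets; the labels and predictions are made up, and the paper's exact definition may differ.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the samples that truly belong to this class.
        members = [i for i, label in enumerate(y_true) if label == cls]
        hits = sum(1 for i in members if y_pred[i] == cls)
        recalls.append(hits / len(members))
    return sum(recalls) / len(recalls)

# Made-up unbalanced test set: 8 ALL samples and 2 AML samples.
y_true = ["ALL"] * 8 + ["AML"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["ALL"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

Here plain accuracy is 0.8 even though every AML sample is misclassified, while balanced accuracy is 0.5, the value expected from a classifier that ignores its input in a two-class problem.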
Affiliation(s)
- Hsi-Che Liu
  - Mackay Medical College and Division of Pediatric Hematology-Oncology, Mackay Memorial Hospital, New Taipei
159. Gene-pair expression signatures reveal lineage control. Nat Methods 2013; 10:577-83. [PMID: 23603899 PMCID: PMC4131748 DOI: 10.1038/nmeth.2445] [Citation(s) in RCA: 114] [Impact Index Per Article: 10.4] [Received: 11/05/2012] [Accepted: 03/11/2013] [Indexed: 11/17/2022]
Abstract
The distinct cell types of multicellular organisms arise due to constraints imposed by gene regulatory networks on the collective change of gene expression across the genome, creating self-stabilizing expression states, or attractors. We compiled a resource of curated human expression data comprising 166 cell types and 2,602 transcription regulating genes and developed a data driven method built around the concept of expression reversal defined at the level of gene pairs, such as those participating in toggle switch circuits. This approach allows us to organize the cell types into their ontogenetic lineage-relationships and to reflect regulatory relationships among genes that explain their ability to function as determinants of cell fate. We show that this method identifies genes belonging to regulatory circuits that control neuronal fate, pluripotency and blood cell differentiation, thus offering a novel large-scale perspective on lineage specification.
160. Earls JC, Eddy JA, Funk CC, Ko Y, Magis AT, Price ND. AUREA: an open-source software system for accurate and user-friendly identification of relative expression molecular signatures. BMC Bioinformatics 2013; 14:78. [PMID: 23496976 PMCID: PMC3599560 DOI: 10.1186/1471-2105-14-78] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/22/2012] [Accepted: 02/08/2013] [Indexed: 11/27/2022]
Abstract
Background Public databases such as the NCBI Gene Expression Omnibus contain extensive and exponentially increasing amounts of high-throughput data that can be applied to molecular phenotype characterization. Collectively, these data can be analyzed for such purposes as disease diagnosis or phenotype classification. One family of algorithms that has proven useful for disease classification is based on relative expression analysis and includes the Top-Scoring Pair (TSP), k-Top-Scoring Pairs (k-TSP), Top-Scoring Triplet (TST) and Differential Rank Conservation (DIRAC) algorithms. These relative expression analysis algorithms hold significant advantages for identifying interpretable molecular signatures for disease classification, and have been implemented previously on a variety of computational platforms with varying degrees of usability. To increase the user-base and maximize the utility of these methods, we developed the program AUREA (Adaptive Unified Relative Expression Analyzer)—a cross-platform tool that has a consistent application programming interface (API), an easy-to-use graphical user interface (GUI), fast running times and automated parameter discovery. Results Herein, we describe AUREA, an efficient, cohesive, and user-friendly open-source software system that comprises a suite of methods for relative expression analysis. AUREA incorporates existing methods, while extending their capabilities and bringing uniformity to their interfaces. We demonstrate that combining these algorithms and adaptively tuning parameters on the training sets makes these algorithms more consistent in their performance and demonstrate the effectiveness of our adaptive parameter tuner by comparing accuracy across diverse datasets. Conclusions We have integrated several relative expression analysis algorithms and provided a unified interface for their implementation while making data acquisition, parameter fixing, data merging, and results analysis ‘point-and-click’ simple. 
The unified interface and the adaptive parameter tuning of AUREA provide an effective framework in which to investigate the massive amounts of publicly available data by both ‘in silico’ and ‘bench’ scientists. AUREA can be found at http://price.systemsbiology.net/AUREA/.
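For readers unfamiliar with relative expression analysis, the sketch below illustrates the basic Top-Scoring Pair idea for two classes: choose the gene pair whose within-sample ordering differs most in frequency between the classes, then classify a new sample by that ordering alone. This is an illustrative sketch and not AUREA's API; the function names and toy data are assumptions, and ties are broken simply by taking the first best pair.

```python
from itertools import combinations

def fit_top_scoring_pair(samples, labels):
    """Return (score, i, j, class_if_less) for the best gene pair.

    samples: list of expression profiles (one list of numbers per sample)
    labels:  list of class labels, each 0 or 1
    """
    best = None
    n_genes = len(samples[0])
    for i, j in combinations(range(n_genes), 2):
        freq = {}
        for cls in (0, 1):
            rows = [s for s, y in zip(samples, labels) if y == cls]
            # Fraction of this class in which gene i is below gene j.
            freq[cls] = sum(1 for s in rows if s[i] < s[j]) / len(rows)
        score = abs(freq[0] - freq[1])
        if best is None or score > best[0]:
            # Record which class the ordering "gene i < gene j" favours.
            best = (score, i, j, 0 if freq[0] > freq[1] else 1)
    return best

def predict(pair, sample):
    """Classify one sample from the ordering of the chosen gene pair."""
    _, i, j, class_if_less = pair
    return class_if_less if sample[i] < sample[j] else 1 - class_if_less

# Made-up training data: three genes, three samples per class.
samples = [
    [1, 5, 3], [2, 6, 9], [0, 4, 1],   # class 0
    [5, 1, 2], [7, 2, 8], [6, 3, 1],   # class 1
]
labels = [0, 0, 0, 1, 1, 1]

pair = fit_top_scoring_pair(samples, labels)
first = predict(pair, [3, 10, 0])
second = predict(pair, [9, 2, 5])
```

Because the rule depends only on which of two values is larger within a sample, it is unchanged by any monotone rescaling of that sample, which is one reason these methods are described as robust to normalization.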
161. Ganeshkumar P, Rani C, Deepa SN. Formation of fuzzy if-then rules and membership function using enhanced particle swarm optimization. Int J Uncertain Fuzz 2013. [DOI: 10.1142/s0218488513500062] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
This paper proposes an Enhanced Particle Swarm Optimization (EPSO) for extracting optimal rule set and tuning membership function for fuzzy logic based classifier model. The standard PSO is more sensitive to premature convergence due to lack of diversity in the swarm and can easily get trapped into local minima when it is used for data classification. To overcome this issue, BLX-α crossover and Non-uniform mutation from Genetic Algorithm (GA) are incorporated in addition to standard velocity and position updating of PSO. The performance of the proposed approach is evaluated using ten publicly available bench mark data sets. From the simulation study, it is found that the proposed approach enhances the convergence and generates a comprehensible fuzzy classifier system with high classification accuracy for all the data sets. Statistical analysis of the test result shows the suitability of the proposed method over other approaches reported in the literature.
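The abstract mentions BLX-α crossover without describing it. The sketch below shows the standard blend crossover operator for real-valued vectors: each child component is drawn uniformly from the parents' interval widened by α times its width on both sides. How the paper combines this with the PSO velocity and position updates is not stated in the abstract, so only the operator itself is shown; the function name, the default `alpha=0.5`, and the example vectors are assumptions.

```python
import random

def blx_alpha_crossover(parent_a, parent_b, alpha=0.5, rng=random):
    """Blend crossover (BLX-alpha) for two real-valued vectors.

    For each position the child value is sampled uniformly from
    [low - alpha * d, high + alpha * d], where low and high are the
    smaller and larger parent values and d = high - low.
    """
    child = []
    for a, b in zip(parent_a, parent_b):
        low, high = min(a, b), max(a, b)
        spread = high - low
        child.append(rng.uniform(low - alpha * spread, high + alpha * spread))
    return child

# Made-up parents; a seeded generator keeps the example repeatable.
rng = random.Random(42)
parent_a = [0.0, 1.0, 5.0]
parent_b = [1.0, 1.0, 3.0]
child = blx_alpha_crossover(parent_a, parent_b, alpha=0.5, rng=rng)
```

Widening the sampling interval beyond the parents lets offspring explore outside the region currently spanned by the swarm, which is the diversity-preserving effect the abstract appeals to.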
Affiliation(s)
- P. Ganeshkumar
  - Department of Information Technology, Anna University of Technology Coimbatore, Coimbatore-641047, Tamil Nadu, India
- C. Rani
  - Department of Information Technology, Anna University of Technology Coimbatore, Coimbatore-641047, Tamil Nadu, India
  - Department of Electrical and Electronics Engineering, Anna University of Technology Coimbatore, Coimbatore-641047, Tamil Nadu, India
- S. N. Deepa
  - Department of Electrical and Electronics Engineering, Anna University of Technology Coimbatore, Coimbatore-641047, Tamil Nadu, India
162. Ulfenborg B, Klinga-Levan K, Olsson B. Classification of tumor samples from expression data using decision trunks. Cancer Inform 2013; 12:53-66. [PMID: 23467331 PMCID: PMC3579425 DOI: 10.4137/cin.s10356] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
We present a novel machine learning approach for the classification of cancer samples using expression data. We refer to the method as “decision trunks,” since it is loosely based on decision trees, but contains several modifications designed to achieve an algorithm that: (1) produces smaller and more easily interpretable classifiers than decision trees; (2) is more robust in varying application scenarios; and (3) achieves higher classification accuracy. The decision trunk algorithm has been implemented and tested on 26 classification tasks, covering a wide range of cancer forms, experimental methods, and classification scenarios. This comprehensive evaluation indicates that the proposed algorithm performs at least as well as the current state of the art algorithms in terms of accuracy, while producing classifiers that include on average only 2–3 markers. We suggest that the resulting decision trunks have clear advantages over other classifiers due to their transparency, interpretability, and their correspondence with human decision-making and clinical testing practices.
Affiliation(s)
- Benjamin Ulfenborg
  - Systems Biology Research Centre, School of Life Sciences, University of Skövde, Skövde, Sweden
163. Wang H, Zhang H, Dai Z, Chen MS, Yuan Z. TSG: a new algorithm for binary and multi-class cancer classification and informative genes selection. BMC Med Genomics 2013; 6 Suppl 1:S3. [PMID: 23445528 PMCID: PMC3552704 DOI: 10.1186/1755-8794-6-s1-s3] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Indexed: 12/29/2022]
Abstract
BACKGROUND One of the challenges in classification of cancer tissue samples based on gene expression data is to establish an effective method that can select a parsimonious set of informative genes. The Top Scoring Pair (TSP), k-Top Scoring Pairs (k-TSP), Support Vector Machines (SVM), and prediction analysis of microarrays (PAM) are four popular classifiers that have comparable performance on multiple cancer datasets. SVM and PAM tend to use a large number of genes and TSP, k-TSP always use even number of genes. In addition, the selection of distinct gene pairs in k-TSP simply combined the pairs of top ranking genes without considering the fact that the gene set with best discrimination power may not be the combined pairs. The k-TSP algorithm also needs the user to specify an upper bound for the number of gene pairs. Here we introduce a computational algorithm to address the problems. The algorithm is named Chisquare-statistic-based Top Scoring Genes (Chi-TSG) classifier simplified as TSG. RESULTS The TSG classifier starts with the top two genes and sequentially adds additional gene into the candidate gene set to perform informative gene selection. The algorithm automatically reports the total number of informative genes selected with cross validation. We provide the algorithm for both binary and multi-class cancer classification. The algorithm was applied to 9 binary and 10 multi-class gene expression datasets involving human cancers. The TSG classifier outperforms TSP family classifiers by a big margin in most of the 19 datasets. In addition to improved accuracy, our classifier shares all the advantages of the TSP family classifiers including easy interpretation, invariant to monotone transformation, often selects a small number of informative genes allowing follow-up studies, resistant to sampling variations due to within sample operations. 
CONCLUSIONS Redefining the scores for gene set and the classification rules in TSP family classifiers by incorporating the sample size information can lead to better selection of informative genes and classification accuracy. The resulting TSG classifier offers a useful tool for cancer classification based on numerical molecular data.
Affiliation(s)
- Haiyan Wang
  - Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
164. Clustering in Conjunction with Quantum Genetic Algorithm for Relevant Genes Selection for Cancer Microarray Data. 2013. [DOI: 10.1007/978-3-642-40319-4_37] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3]
166. Wu G, Stein L. A network module-based method for identifying cancer prognostic signatures. Genome Biol 2012; 13:R112. [PMID: 23228031 PMCID: PMC3580410 DOI: 10.1186/gb-2012-13-12-r112] [Citation(s) in RCA: 111] [Impact Index Per Article: 9.3] [Received: 06/28/2012] [Revised: 11/21/2012] [Accepted: 12/10/2012] [Indexed: 12/12/2022]
Abstract
Discovering robust prognostic gene signatures as biomarkers using genomics data can be challenging. We have developed a simple but efficient method for discovering prognostic biomarkers in cancer gene expression data sets using modules derived from a highly reliable gene functional interaction network. When applied to breast cancer, we discover a novel 31-gene signature associated with patient survival. The signature replicates across 5 independent gene expression studies, and outperforms 48 published gene signatures. When applied to ovarian cancer, the algorithm identifies a 75-gene signature associated with patient survival. A Cytoscape plugin implementation of the signature discovery method is available at http://wiki.reactome.org/index.php/Reactome_FI_Cytoscape_Plugin.
Affiliation(s)
- Guanming Wu
  - Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, ON M5G 0A3, Canada
- Lincoln Stein
  - Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, ON M5G 0A3, Canada
  - Department of Molecular Genetics, University of Toronto, 1 King's College Circle, #4386, Medical Sciences Building, Toronto ON M5S 1A8, Canada
167. Paul R, Groza T, Hunter J, Zankl A. Decision support methods for finding phenotype-disorder associations in the bone dysplasia domain. PLoS One 2012; 7:e50614. [PMID: 23226331 PMCID: PMC3511538 DOI: 10.1371/journal.pone.0050614] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/06/2012] [Accepted: 10/26/2012] [Indexed: 11/18/2022]
Abstract
A lack of mature domain knowledge and well established guidelines makes the medical diagnosis of skeletal dysplasias (a group of rare genetic disorders) a very complex process. Machine learning techniques can facilitate objective interpretation of medical observations for the purposes of decision support. However, building decision support models using such techniques is highly problematic in the context of rare genetic disorders, because it depends on access to mature domain knowledge. This paper describes an approach for developing a decision support model in medical domains that are underpinned by relatively sparse knowledge bases. We propose a solution that combines association rule mining with the Dempster-Shafer theory (DST) to compute probabilistic associations between sets of clinical features and disorders, which can then serve as support for medical decision making (e.g., diagnosis). We show, via experimental results, that our approach is able to provide meaningful outcomes even on small datasets with sparse distributions, in addition to outperforming other Machine Learning techniques and behaving slightly better than an initial diagnosis by a clinician.
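The abstract refers to Dempster-Shafer theory without showing how evidence is combined. The sketch below implements Dempster's rule of combination for two mass functions over sets of hypotheses. It is not the paper's pipeline: how the authors derive masses from association rules is not stated in the abstract, and the disorder names and mass values here are made up for illustration.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to its mass,
    and the masses in each function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            common = set_a & set_b
            if common:
                combined[common] = combined.get(common, 0.0) + mass_a * mass_b
            else:
                # Mass assigned to incompatible hypotheses is conflict.
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise so the remaining masses sum to 1.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

# Hypothetical evidence from two clinical features about two disorders.
A = "achondroplasia"
B = "hypochondroplasia"
feature_1 = {frozenset({A}): 0.6, frozenset({A, B}): 0.4}
feature_2 = {frozenset({B}): 0.5, frozenset({A, B}): 0.5}

belief = combine(feature_1, feature_2)
```

With these values the conflicting mass is 0.3, so after renormalisation the combined masses are 3/7 for the first disorder alone, 2/7 for the second alone, and 2/7 left undecided between the two.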
Affiliation(s)
- Razan Paul
  - School of ITEE, The University of Queensland, St. Lucia, Queensland, Australia
- Tudor Groza
  - School of ITEE, The University of Queensland, St. Lucia, Queensland, Australia
- Jane Hunter
  - School of ITEE, The University of Queensland, St. Lucia, Queensland, Australia
- Andreas Zankl
  - Bone Dysplasia Research Group, UQ Centre for Clinical Research (UQCCR), The University of Queensland, Herston, Queensland, Australia
  - Genetic Health Queensland, Royal Brisbane and Women’s Hospital, Herston, Queensland, Australia
168. Zhang H, Wang H, Dai Z, Chen MS, Yuan Z. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics 2012; 13:298. [PMID: 23148517 PMCID: PMC3562261 DOI: 10.1186/1471-2105-13-298] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Received: 04/09/2012] [Accepted: 09/24/2012] [Indexed: 12/21/2022]
Abstract
Background Even though the classification of cancer tissue samples based on gene expression data has advanced considerably in recent years, it faces great challenges to improve accuracy. One of the challenges is to establish an effective method that can select a parsimonious set of relevant genes. So far, most methods for gene selection in literature focus on screening individual or pairs of genes without considering the possible interactions among genes. Here we introduce a new computational method named the Binary Matrix Shuffling Filter (BMSF). It not only overcomes the difficulty associated with the search schemes of traditional wrapper methods and overfitting problem in large dimensional search space but also takes potential gene interactions into account during gene selection. This method, coupled with Support Vector Machine (SVM) for implementation, often selects very small number of genes for easy model interpretability. Results We applied our method to 9 two-class gene expression datasets involving human cancers. During the gene selection process, the set of genes to be kept in the model was recursively refined and repeatedly updated according to the effect of a given gene on the contributions of other genes in reference to their usefulness in cancer classification. The small number of informative genes selected from each dataset leads to significantly improved leave-one-out (LOOCV) classification accuracy across all 9 datasets for multiple classifiers. Our method also exhibits broad generalization in the genes selected since multiple commonly used classifiers achieved either equivalent or much higher LOOCV accuracy than those reported in literature. Conclusions Evaluation of a gene’s contribution to binary cancer classification is better to be considered after adjusting for the joint effect of a large number of other genes. 
A computationally efficient search scheme was provided to perform effective search in the extensive feature space that includes possible interactions of many genes. Performance of the algorithm applied to 9 datasets suggests that it is possible to improve the accuracy of cancer classification by a big margin when joint effects of many genes are considered.
Affiliation(s)
- Hongyan Zhang
  - Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
169. Hochrein J, Klein MS, Zacharias HU, Li J, Wijffels G, Schirra HJ, Spang R, Oefner PJ, Gronwald W. Performance Evaluation of Algorithms for the Classification of Metabolic 1H NMR Fingerprints. J Proteome Res 2012; 11:6242-51. [DOI: 10.1021/pr3009034] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Indexed: 01/16/2023]
Affiliation(s)
- Jochen Hochrein
  - Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Matthias S. Klein
  - Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Helena U. Zacharias
  - Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Juan Li
  - CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Rd., St. Lucia, QLD 4067, Australia
- Gene Wijffels
  - CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Rd., St. Lucia, QLD 4067, Australia
- Horst Joachim Schirra
  - Centre for Advanced Imaging, The University of Queensland, Brisbane, QLD 4072, Australia
- Rainer Spang
  - Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Peter J. Oefner
  - Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
- Wolfram Gronwald
  - Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany
|
170
|
Magis AT, Price ND. The top-scoring 'N' algorithm: a generalized relative expression classification method from small numbers of biomolecules. BMC Bioinformatics 2012; 13:227. [PMID: 22966958 PMCID: PMC3663421 DOI: 10.1186/1471-2105-13-227] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 02/19/2012] [Accepted: 09/03/2012] [Indexed: 01/17/2023] Open
Abstract
Background: Relative expression algorithms such as the top-scoring pair (TSP) and the top-scoring triplet (TST) have several strengths that distinguish them from other classification methods, including resistance to overfitting, invariance to most data normalization methods, and biological interpretability. The top-scoring ‘N’ (TSN) algorithm is a generalized form of other relative expression algorithms which uses generic permutations and a dynamic classifier size to control both the permutation and combination space available for classification.
Results: TSN was tested on nine cancer datasets, showing statistically significant differences in classification accuracy between different classifier sizes (choices of N). TSN also performed competitively against a wide variety of different classification methods, including artificial neural networks, classification trees, discriminant analysis, k-Nearest neighbor, naïve Bayes, and support vector machines, when tested on the Microarray Quality Control II datasets. Furthermore, TSN exhibits low levels of overfitting on training data compared to other methods, giving confidence that results obtained during cross validation will be more generally applicable to external validation sets.
Conclusions: TSN preserves the strengths of other relative expression algorithms while allowing a much larger permutation and combination space to be explored, potentially improving classification accuracies when fewer numbers of measured features are available.
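For orientation, the pairwise case that TSN generalizes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`tsp_score`, `fit_tsp`, `predict_tsp`) and the toy data are hypothetical, only N = 2 is covered, and ties between equally scoring pairs are simply resolved in favour of the first pair found.

```python
from itertools import combinations

def tsp_score(expr, labels, i, j):
    """Return (|P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|, per-class frequencies)."""
    freq = []
    for cls in (0, 1):
        rows = [x for x, y in zip(expr, labels) if y == cls]
        freq.append(sum(x[i] < x[j] for x in rows) / len(rows))
    return abs(freq[0] - freq[1]), freq

def fit_tsp(expr, labels):
    """Exhaustively score every gene pair and keep the best one (first pair wins ties)."""
    best = None
    for i, j in combinations(range(len(expr[0])), 2):
        score, freq = tsp_score(expr, labels, i, j)
        if best is None or score > best[0]:
            best = (score, i, j, freq)
    score, i, j, freq = best
    # If the ordering x_i < x_j is more frequent in class 0, it votes for class 0.
    vote_if_less = 0 if freq[0] > freq[1] else 1
    return {"pair": (i, j), "score": score, "vote_if_less": vote_if_less}

def predict_tsp(model, sample):
    """Classify one sample from the relative order of the two selected genes."""
    i, j = model["pair"]
    if sample[i] < sample[j]:
        return model["vote_if_less"]
    return 1 - model["vote_if_less"]

# Hypothetical toy data: rows are samples, columns are genes.
expr = [
    [1.0, 5.0, 3.0],
    [2.0, 6.0, 7.0],
    [0.5, 4.0, 0.2],
    [5.0, 1.0, 3.0],
    [6.0, 2.0, 1.0],
    [4.5, 0.5, 5.0],
]
labels = [0, 0, 0, 1, 1, 1]

model = fit_tsp(expr, labels)
prediction_a = predict_tsp(model, [1.5, 3.0, 9.9])
prediction_b = predict_tsp(model, [3.0, 1.5, 0.0])
```

Because the decision depends only on which of two values is larger within a sample, any monotone per-sample normalization leaves the prediction unchanged, which is the invariance property the abstract refers to.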
|
171
|
Pronk TE, van Someren EP, Stierum RH, Ezendam J, Pennings JL. Unraveling toxicological mechanisms and predicting toxicity classes with gene dysregulation networks. J Appl Toxicol 2012; 33:1407-15. [DOI: 10.1002/jat.2800] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 05/31/2012] [Revised: 06/25/2012] [Accepted: 06/25/2012] [Indexed: 11/05/2022]
Affiliation(s)
- Tessa E. Pronk
- Laboratory for Health Protection Research, National Institute for Public Health and the Environment; PO Box 1 NL-3720 BA Bilthoven the Netherlands
- Department of Toxicogenomics; Maastricht University, PO Box 616; NL-6200 MD Maastricht the Netherlands
- Eugene P. van Someren
- Research Group Microbiology and Systems Biology; TNO, PO Box 360 NL-3700 AJ Zeist the Netherlands
- Rob H. Stierum
- Research Group Microbiology and Systems Biology; TNO, PO Box 360 NL-3700 AJ Zeist the Netherlands
- Janine Ezendam
- Laboratory for Health Protection Research, National Institute for Public Health and the Environment; PO Box 1 NL-3720 BA Bilthoven the Netherlands
- Jeroen L.A. Pennings
- Laboratory for Health Protection Research, National Institute for Public Health and the Environment; PO Box 1 NL-3720 BA Bilthoven the Netherlands
|
172
|
Glaab E, Bacardit J, Garibaldi JM, Krasnogor N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS One 2012; 7:e39932. [PMID: 22808075 PMCID: PMC3394775 DOI: 10.1371/journal.pone.0039932] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.8] [Received: 01/29/2012] [Accepted: 05/29/2012] [Indexed: 12/19/2022] Open
Abstract
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
Affiliation(s)
- Enrico Glaab
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
- Jaume Bacardit
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
- Jonathan M. Garibaldi
- Intelligent Modeling and Analysis (IMA) Research Group, University of Nottingham, Nottingham, United Kingdom
- Natalio Krasnogor
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
|
173
|
Ren X, Wang Y, Wang J, Zhang XS. A unified computational model for revealing and predicting subtle subtypes of cancers. BMC Bioinformatics 2012; 13:70. [PMID: 22548981 PMCID: PMC3464623 DOI: 10.1186/1471-2105-13-70] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 12/10/2011] [Accepted: 05/01/2012] [Indexed: 11/10/2022] Open
Abstract
Background: Gene expression profiling technologies have gradually become a community standard tool for clinical applications. For example, gene expression data has been analyzed to reveal novel disease subtypes (class discovery) and assign particular samples to well-defined classes (class prediction). In the past decade, many effective methods have been proposed for individual applications. However, there is still a pressing need for a unified framework that can reveal the complicated relationships between samples.
Results: We propose a novel convex optimization model to perform class discovery and class prediction in a unified framework. An efficient algorithm is designed and software named OTCC (Optimization Tool for Clustering and Classification) is developed. Comparison in a simulated dataset shows that our method outperforms the existing methods. We then applied OTCC to acute leukemia and breast cancer datasets. The results demonstrate that our method not only can reveal the subtle structures underlying those cancer gene expression data but also can accurately predict the class labels of unknown cancer samples. Therefore, our method holds the promise to identify novel cancer subtypes and improve diagnosis.
Conclusions: We propose a unified computational framework for class discovery and class prediction to facilitate the discovery and prediction of subtle subtypes of cancers. Our method can be generally applied to multiple types of measurements, e.g., gene expression profiling, proteomic measuring, and recent next-generation sequencing, since it only requires the similarities among samples as input.
Affiliation(s)
- Xianwen Ren
- MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China
|
174
|
Popovici V, Budinska E, Tejpar S, Weinrich S, Estrella H, Hodgson G, Van Cutsem E, Xie T, Bosman FT, Roth AD, Delorenzi M. Identification of a Poor-Prognosis BRAF-Mutant–Like Population of Patients With Colon Cancer. J Clin Oncol 2012; 30:1288-95. [DOI: 10.1200/jco.2011.39.5814] [Citation(s) in RCA: 180] [Impact Index Per Article: 15.0] [Indexed: 12/27/2022] Open
Abstract
Purpose: Our purpose was development and assessment of a BRAF-mutant gene expression signature for colon cancer (CC) and the study of its prognostic implications.
Materials and Methods: A set of 668 stage II and III CC samples from the PETACC-3 (Pan-European Trials in Alimentary Tract Cancers) clinical trial were used to assess differential gene expression between c.1799T>A (p.V600E) BRAF mutant and non-BRAF, non-KRAS mutant cancers (double wild type) and to construct a gene expression–based classifier for detecting BRAF mutant samples with high sensitivity. The classifier was validated in independent data sets, and survival rates were compared between classifier positive and negative tumors.
Results: A 64 gene-based classifier was developed with 96% sensitivity and 86% specificity for detecting BRAF mutant tumors in PETACC-3 and independent samples. A subpopulation of BRAF wild-type patients (30% of KRAS mutants, 13% of double wild type) showed a gene expression pattern and had poor overall survival and survival after relapse, similar to those observed in BRAF-mutant patients. Thus they form a distinct prognostic subgroup within their mutation class.
Conclusion: A characteristic pattern of gene expression is associated with and accurately predicts BRAF mutation status and, in addition, identifies a population of BRAF mutated-like KRAS mutants and double wild-type patients with similarly poor prognosis. This suggests a common biology between these tumors and provides a novel classification tool for cancers, adding prognostic and biologic information that is not captured by the mutation status alone. These results may guide therapeutic strategies for this patient segment and may help in population stratification for clinical trials.
Affiliation(s)
- Vlad Popovici, Eva Budinska, and Mauro Delorenzi, Swiss Institute of Bioinformatics; Fred T. Bosman and Mauro Delorenzi, Lausanne University Medical Center, Lausanne; Arnaud D. Roth, Geneva University Hospital, Geneva; Arnaud D. Roth, The Swiss Group for Clinical Cancer Research, Bern, Switzerland; Sabine Tejpar and Eric Van Cutsem, University Hospital Gasthuisberg, Katholieke Universiteit Leuven, Leuven, Belgium; and Scott Weinrich, Heather Estrella, Graeme Hodgson, and Tao Xie, Pfizer, La Jolla, CA
|
175
|
Abstract
Progress in oncology drug development has been hampered by a lack of preclinical models that reliably predict clinical activity of novel compounds in cancer patients. In an effort to address these shortcomings, there has been a recent increase in the use of patient-derived tumour xenografts (PDTX) engrafted into immune-compromised rodents such as athymic nude or NOD/SCID mice for preclinical modelling. Numerous tumour-specific PDTX models have been established and, importantly, they are biologically stable when passaged in mice in terms of global gene-expression patterns, mutational status, metastatic potential, drug responsiveness and tumour architecture. These characteristics might provide significant improvements over standard cell-line xenograft models. This Review will discuss specific PDTX disease examples illustrating an overview of the opportunities and limitations of these models in cancer drug development, and describe concepts regarding predictive biomarker development and future applications.
|
176
|
Tan AC. Employing gene set top scoring pairs to identify deregulated pathway-signatures in dilated cardiomyopathy from integrated microarray gene expression data. Methods Mol Biol 2012; 802:345-361. [PMID: 22130892 DOI: 10.1007/978-1-61779-400-1_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
It is well accepted that a set of genes must act in concert to drive various cellular processes. However, under different biological phenotypes, not all the members of a gene set will participate in a biological process. Hence, it is useful to construct a discriminative classifier by focusing on the core members (subset) of a highly informative gene set. Such analyses can reveal which of those subsets from the same gene set correspond to different biological phenotypes. In this study, we propose Gene Set Top Scoring Pairs (GSTSP) approach that exploits the simple yet powerful relative expression reversal concept at the gene set levels to achieve these goals. To illustrate the usefulness of GSTSP, we applied this method to five different human heart failure gene expression data sets. We take advantage of the direct data integration feature in the GSTSP approach to combine two data sets, identify a discriminative gene set from >190 predefined gene sets, and evaluate the predictive power of the GSTSP classifier derived from this informative gene set on three independent test sets (79.31% in test accuracy). The discriminative gene pairs identified in this study may provide new biological understanding on the disturbed pathways that are involved in the development of heart failure. GSTSP methodology is general in purpose and is applicable to a variety of phenotypic classification problems using gene expression data.
Affiliation(s)
- Aik Choon Tan
- Division of Medical Oncology, Department of Medicine, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
|
177
|
Shin E, Yoon Y, Ahn J, Park S. TC-VGC: a tumor classification system using variations in genes' correlation. Computer Methods and Programs in Biomedicine 2011; 104:e87-e101. [PMID: 21531474 DOI: 10.1016/j.cmpb.2011.03.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/24/2010] [Revised: 01/11/2011] [Accepted: 03/07/2011] [Indexed: 05/30/2023]
Abstract
Classification analysis of microarray data is widely used to reveal biological features and to diagnose various diseases, including cancers. Most existing approaches improve the performance of learning models by removing most irrelevant and redundant genes from the data. They select the marker genes which are expressed differently in normal and tumor tissues. These techniques ignore the importance of the complex functional-dependencies between genes. In this paper, we propose a new method for cancer classification which uses distinguished variations of gene-gene correlation in two sample groups. The cancer specific genetic network composed of these gene pairs contains many literature-curated prostate cancer genes, and we were successful in identifying new candidate prostate cancer genes inferred by them. Furthermore, this method achieved a high accuracy with a small number of genes in cancer classification.
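The idea of ranking gene pairs by how much their correlation differs between two sample groups can be sketched as follows. This is only an illustration of that general idea, assuming Pearson correlation and an absolute-difference score; the abstract does not specify the statistic TC-VGC actually uses, and the names `pearson`, `column`, and `correlation_shift` as well as the toy matrices are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant gene is treated as uncorrelated
    return cov / sqrt(vx * vy)

def column(data, g):
    """Expression values of gene g across all samples (rows)."""
    return [row[g] for row in data]

def correlation_shift(group_a, group_b):
    """Rank gene pairs by |r_b - r_a|, largest change in correlation first."""
    shifts = []
    for i, j in combinations(range(len(group_a[0])), 2):
        r_a = pearson(column(group_a, i), column(group_a, j))
        r_b = pearson(column(group_b, i), column(group_b, j))
        shifts.append((abs(r_b - r_a), i, j))
    return sorted(shifts, reverse=True)

# Hypothetical toy data: genes 0 and 1 move together in "normal"
# samples and in opposite directions in "tumor" samples.
normal = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2]]
tumor = [[1, 8, 5], [2, 6, 1], [3, 4, 4], [4, 2, 2]]

ranked = correlation_shift(normal, tumor)
top_shift, top_i, top_j = ranked[0]
```

Pairs at the top of `ranked` are candidates for a group-specific network; neither gene needs to be differentially expressed on its own for the pair to score highly.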
Affiliation(s)
- Eunji Shin
- Department of Computer Science, Yonsei University, 134 Sinchon-dong, Seodaemun-gu, Seoul 120-749, South Korea
|
178
|
Robust two-gene classifiers for cancer prediction. Genomics 2011; 99:90-5. [PMID: 22138042 DOI: 10.1016/j.ygeno.2011.11.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 09/06/2011] [Revised: 11/04/2011] [Accepted: 11/09/2011] [Indexed: 11/23/2022]
Abstract
Two-gene classifiers have attracted a broad interest for their simplicity and practicality. Most existing two-gene classification algorithms were involved in exhaustive search that led to their low time-efficiencies. In this study, we proposed two new two-gene classification algorithms which used simple univariate gene selection strategy and constructed simple classification rules based on optimal cut-points for two genes selected. We detected the optimal cut-point with the information entropy principle. We applied the two-gene classification models to eleven cancer gene expression datasets and compared their classification performance to that of some established two-gene classification models like the top-scoring pairs model and the greedy pairs model, as well as standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. These comparisons indicated that the performance of our two-gene classifiers was comparable to or better than that of compared models.
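An entropy-based cut-point search for a single gene might look like the sketch below. It assumes the common formulation in which the threshold minimizes the size-weighted class entropy of the two resulting partitions; the abstract does not state the exact criterion or how the two per-gene rules are combined, so `entropy`, `best_cut_point`, and the toy values are hypothetical.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * log2(p)
    return h

def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) for the split with the lowest class entropy.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    Returns None when all values are identical.
    """
    pairs = sorted(zip(values, labels))
    best = None
    for k in range(1, len(pairs)):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or h < best[1]:
            best = ((lo + hi) / 2, h)
    return best

# Hypothetical expression values for one gene across six samples.
values = [0.2, 0.5, 0.9, 1.4, 2.1, 2.8]
labels = [0, 0, 0, 1, 1, 1]

cut, weighted_entropy = best_cut_point(values, labels)
```

Each gene is scanned once after sorting, which is the kind of univariate step that avoids an exhaustive search over all gene pairs.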
|
179
|
Choudhary A, Hua J, Bittner ML, Dougherty ER. The effect of population contexts on classifier performance. J Biol Syst 2011. [DOI: 10.1142/s0218339008002587] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
Classifying a patient based on disease type, treatment prognosis, survivability, or other such criteria has become a major focus of genomics and proteomics. From the perspective of the general population of a particular kind of cell, one would like a classifier that applies to the whole population; however, it is often the case that the population is sufficiently structurally diverse that a satisfactory classifier cannot be designed from available sample data. In such a circumstance, it can be useful to identify cellular contexts within which a disease can be reliably diagnosed, which in effect means that one would like to find classifiers that apply to different sub-populations within the overall population. Using a model-based approach, this paper quantifies the effect of contexts on classification performance as a function of the classifier used and the sample size. The advantage of a model-based approach is that we can vary the contextual confusion as a function of the model parameters, thereby allowing us to compare the classification performance in terms of the degree of discriminatory confusion caused by the contexts. We consider five popular classifiers: linear discriminant analysis, three nearest neighbor, linear support vector machine, polynomial support vector machine, and Boosting. We contrast the case where classification is done with a single classifier without discriminating between the contexts to the case where there are context markers that facilitate context separation before classifier design. We observe that little can be done if there is high contextual confusion, but when the contextual confusion is low, context separation can be beneficial, the benefit depending on the classifier.
Affiliation(s)
- Ashish Choudhary
- Pharmaceutical Genomics Division, Translational Genomics Research Institute, 13208 E Shea Boulevard, Suite 100, Scottsdale, AZ 85259, USA
- Jianping Hua
- Computational Biology Division, Translational Genomics Research Institute, 445 North Fifth Street, Suite 600, Phoenix, AZ 85004, USA
- Michael L. Bittner
- Computational Biology Division, Translational Genomics Research Institute, 445 North Fifth Street, Suite 600, Phoenix, AZ 85004, USA
- Edward R. Dougherty
- Department of Electrical Engineering, Texas A&M University, College Station, TX, 77843, USA
- Computational Biology Division, Translational Genomics Research Institute, 445 North Fifth Street, Suite 600, Phoenix, AZ 85004, USA
|
180
|
Obulkasim A, Meijer GA, van de Wiel MA. Stepwise classification of cancer samples using clinical and molecular data. BMC Bioinformatics 2011; 12:422. [PMID: 22034839 PMCID: PMC3221726 DOI: 10.1186/1471-2105-12-422] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 05/27/2011] [Accepted: 10/28/2011] [Indexed: 11/10/2022] Open
Abstract
Background: Combining clinical and molecular data types may potentially improve prediction accuracy of a classifier. However, currently there is a shortage of effective and efficient statistical and bioinformatic tools for true integrative data analysis. Existing integrative classifiers have two main disadvantages: First, coarse combination may lead to subtle contributions of one data type to be overshadowed by more obvious contributions of the other. Second, the need to measure both data types for all patients may be both unpractical and (cost) inefficient.
Results: We introduce a novel classification method, a stepwise classifier, which takes advantage of the distinct classification power of clinical data and high-dimensional molecular data. We apply classification algorithms to two data types independently, starting with the traditional clinical risk factors. We only turn to relatively expensive molecular data when the uncertainty of prediction result from clinical data exceeds a predefined limit. Experimental results show that our approach is adaptive: the proportion of samples that needs to be re-classified using molecular data depends on how much we expect the predictive accuracy to increase when re-classifying those samples.
Conclusions: Our method renders a more cost-efficient classifier that is at least as good, and sometimes better, than one based on clinical or molecular data alone. Hence our approach is not just a classifier that minimizes a particular loss function. Instead, it aims to be cost-efficient by avoiding molecular tests for a potentially large subgroup of individuals; moreover, for these individuals a test result would be quickly available, which may lead to reduced waiting times (for diagnosis) and hence lower the patients' distress. Stepwise classification is implemented in R-package stepwiseCM and available at the Bioconductor website.
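The control flow of such a stepwise scheme can be sketched as below. This is a Python illustration of the decision logic only and does not reproduce the stepwiseCM R package: the uncertainty measure (distance of the predicted probability from 0.5), the default `limit`, both toy models, and all names are assumptions made for the example.

```python
from math import exp

def stepwise_classify(samples, clinical_model, molecular_model, limit=0.3):
    """Classify with clinical data first; fall back to molecular data when unsure.

    Each model is a callable returning P(class 1). Uncertainty is defined here
    (as an assumption) as 1 - |2p - 1|: 0 for a certain call, 1 when p = 0.5.
    """
    results = []
    for sample in samples:
        p = clinical_model(sample["clinical"])
        source = "clinical"
        if 1.0 - abs(2.0 * p - 1.0) > limit:
            # Clinical prediction is too uncertain: use the molecular model instead.
            p = molecular_model(sample["molecular"])
            source = "molecular"
        results.append((int(p >= 0.5), source))
    return results

def toy_clinical_model(clinical):
    """Hypothetical logistic risk model driven by age alone."""
    return 1.0 / (1.0 + exp(-(clinical["age"] - 60) / 5))

def toy_molecular_model(molecular):
    """Hypothetical molecular model: a signature score already on a 0-1 scale."""
    return molecular["signature_score"]

samples = [
    {"clinical": {"age": 80}, "molecular": {"signature_score": 0.9}},
    {"clinical": {"age": 61}, "molecular": {"signature_score": 0.1}},
    {"clinical": {"age": 40}, "molecular": {"signature_score": 0.2}},
]

results = stepwise_classify(samples, toy_clinical_model, toy_molecular_model)
n_molecular = sum(source == "molecular" for _, source in results)
```

Raising `limit` sends fewer samples to the molecular model, which is the cost trade-off the abstract describes; in practice the molecular measurement would only be ordered for the samples that reach the second step.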
Affiliation(s)
- Askar Obulkasim
- Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands.
|
181
|
Shi P, Ray S, Zhu Q, Kon MA. Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction. BMC Bioinformatics 2011; 12:375. [PMID: 21939564 PMCID: PMC3223741 DOI: 10.1186/1471-2105-12-375] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Received: 12/31/2010] [Accepted: 09/23/2011] [Indexed: 11/10/2022] Open
Abstract
Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.
Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.
Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.
Affiliation(s)
- Ping Shi
- Harvard Medical School and Harvard Pilgrim Healthcare Institute, Boston, MA 02215, USA.
|
182
|
Weisman D, Liu H, Redfern J, Zhu L, Colón-Carmona A. Novel computational identification of highly selective biomarkers of pollutant exposure. Environmental Science & Technology 2011; 45:5132-5138. [PMID: 21542576 DOI: 10.1021/es200065f] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/30/2023]
Abstract
The use of in vivo biosensors to acquire environmental pollution data is an emerging and promising paradigm. One major challenge is the identification of highly specific biomarkers that selectively report exposure to a target pollutant, while remaining quiescent under a diverse set of other, often unknown, environmental conditions. This study hypothesized that a microarray data mining approach can identify highly specific biomarkers, and, that the robustness property can generalize to unforeseen environmental conditions. Starting with Arabidopsis thaliana microarray data measuring responses to a variety of treatments, the study used the top scoring pair (TSP) algorithm to identify mRNA transcripts that respond uniquely to phenanthrene, a model polycyclic aromatic hydrocarbon. Subsequent in silico analysis with a larger set of microarray data indicated that the biomarkers remained robust under new conditions. Finally, in vivo experiments were performed with unforeseen conditions that mimic phenanthrene stress, and the biomarkers were assayed using qRT-PCR. In these experiments, the biomarkers always responded positively to phenanthrene, and never responded to the unforeseen conditions, thereby supporting the hypotheses. This data mining approach requires only microarray or next-generation RNA-seq data, and, in principle, can be applied to arbitrary biomonitoring organisms and chemical exposures.
Affiliation(s)
- David Weisman
- Department of Biology, University of Massachusetts Boston, Boston, Massachusetts 02125, USA
|
183
|
Cancer classification based on microarray gene expression data using a principal component accumulation method. Sci China Chem 2011. [DOI: 10.1007/s11426-011-4263-5] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]
|
184
|
Youssef YM, White NM, Grigull J, Krizova A, Samy C, Mejia-Guerrero S, Evans A, Yousef GM. Accurate Molecular Classification of Kidney Cancer Subtypes Using MicroRNA Signature. Eur Urol 2011; 59:721-30. [DOI: 10.1016/j.eururo.2011.01.004] [Citation(s) in RCA: 181] [Impact Index Per Article: 13.9] [Received: 11/16/2010] [Accepted: 01/03/2011] [Indexed: 12/01/2022]
|
185
|
Popovici V, Budinska E, Delorenzi M. Rgtsp: a generalized top scoring pairs package for class prediction. Bioinformatics 2011; 27:1729-30. [DOI: 10.1093/bioinformatics/btr233] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022]
|
186
|
Gelfond J, Zarzabal LA, Burton T, Burns S, Sogayar M, Penalva LOF. Latent rank change detection for analysis of splice-junction microarrays with nonlinear effects. Ann Appl Stat 2011. [DOI: 10.1214/10-aoas389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
187
|
Zhang X, Yan Z, Zhang J, Gong L, Li W, Cui J, Liu Y, Gao Z, Li J, Shen L, Lu Y. Combination of hsa-miR-375 and hsa-miR-142-5p as a predictor for recurrence risk in gastric cancer patients following surgical resection. Ann Oncol 2011; 22:2257-66. [PMID: 21343377 DOI: 10.1093/annonc/mdq758] [Citation(s) in RCA: 129] [Impact Index Per Article: 9.9] [Indexed: 12/19/2022]
Abstract
BACKGROUND Recurrence is a major factor leading to treatment failure and death in gastric cancer (GC) patients following surgical resection. Importantly, the prediction of recurrence is critical in improving clinical outcomes. We isolated a group of microRNAs (miRNAs) and evaluated their usefulness as prognostic markers for the recurrence of GC. PATIENTS AND METHODS A total of 65 GC patients were selected for systematic analysis, 29 patients with recurrence and 36 patients without recurrence. Firstly, miRNAs microarray and bioinformatics methods were used to characterize classifiers from primary tumor samples (n = 8). Following, we validated these predictors both in frozen fresh and paraffin-embedded tissue samples (n = 57) using quantitative PCR. RESULTS We have identified 17 differential miRNAs including 10 up-regulated and 7 down-regulated miRNAs in recurrence group. Using k-top scoring pairs (k-TSP) method, we further ascertained hsa-miR-375 and hsa-miR-142-5p as a classifier to recognize recurrence and nonrecurrence cases both in the training and test samples. Moreover, we validated this classifier in 34 frozen fresh tissues and 38 paraffin-embedded tissues with consistent sensitivity and specificity with training set; among them, 15 cases were matched. A high frequency recurrence and poor survival were observed in GC cases with high level of hsa-miR-375 and low level of hsa-miR-142-5p (P < 0.001). In addition, we evaluated that hsa-miR-375 and hsa-miR-142-5p were involved in regulating target genes in several oncogenic signal pathways, such as TP53, MAPK, Wnt and vascular endothelial growth factor. CONCLUSION Our results indicate that the combination of hsa-miR-375 and hsa-miR-142-5p as a predictor of disease progression has the potential to predict recurrence risk for GC patients.
Affiliation(s)
- X Zhang
- Department of Gastrointestinal Oncology, Peking University School of Oncology, Beijing Cancer Hospital and Institute, Beijing
|
188
|
Jones LK, Zou F, Kheifets A, Rybnikov K, Berry D, Tan AC. Confident predictability: identifying reliable gene expression patterns for individualized tumor classification using a local minimax kernel algorithm. BMC Med Genomics 2011; 4:10. [PMID: 21261972 PMCID: PMC3038886 DOI: 10.1186/1755-8794-4-10] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/17/2010] [Accepted: 01/24/2011] [Indexed: 12/02/2022]
Abstract
Background Molecular classification of tumors can be achieved by global gene expression profiling. Most machine learning classification algorithms furnish global error rates for the entire population. A few algorithms provide an estimate of probability of malignancy for each queried patient but the degree of accuracy of these estimates is unknown. On the other hand local minimax learning provides such probability estimates with best finite sample bounds on expected mean squared error on an individual basis for each queried patient. This allows a significant percentage of the patients to be identified as confidently predictable, a condition that ensures that the machine learning algorithm possesses an error rate below the tolerable level when applied to the confidently predictable patients. Results We devise a new learning method that implements: (i) feature selection using the k-TSP algorithm and (ii) classifier construction by local minimax kernel learning. We test our method on three publicly available gene expression datasets and achieve significantly lower error rate for a substantial identifiable subset of patients. Our final classifiers are simple to interpret and they can make prediction on an individual basis with an individualized confidence level. Conclusions Patients that were predicted confidently by the classifiers as cancer can receive immediate and appropriate treatment whilst patients that were predicted confidently as healthy will be spared from unnecessary treatment. We believe that our method can be a useful tool to translate the gene expression signatures into clinical practice for personalized medicine.
Affiliation(s)
- Lee K Jones
- Department of Mathematical Sciences, University of Massachusetts, Lowell, MA, USA.
|
189
|
Magis AT, Earls JC, Ko YH, Eddy JA, Price ND. Graphics processing unit implementations of relative expression analysis algorithms enable dramatic computational speedup. Bioinformatics 2011; 27:872-3. [PMID: 21257608 DOI: 10.1093/bioinformatics/btr033] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
SUMMARY The top-scoring pair (TSP) and top-scoring triplet (TST) algorithms are powerful methods for classification from expression data, but analysis of all combinations across thousands of human transcriptome samples is computationally intensive, and has not yet been achieved for TST. Implementation of these algorithms for the graphics processing unit results in dramatic speedup of two orders of magnitude, greatly increasing the searchable combinations and accelerating the pace of discovery. AVAILABILITY http://www.igb.illinois.edu/labs/price/downloads/.
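To see why the exhaustive triplet search is so much heavier than the pair search, it helps to count the candidate gene sets. The sketch below is only back-of-the-envelope arithmetic; the gene count of 20,000 is a hypothetical figure chosen for illustration and does not come from the abstract.

```python
from math import comb

def n_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs a TSP search must score."""
    return comb(n_genes, 2)

def n_triplets(n_genes: int) -> int:
    """Number of unordered gene triplets a TST search must score."""
    return comb(n_genes, 3)

n_genes = 20_000  # hypothetical transcriptome size
pairs = n_pairs(n_genes)        # 199,990,000
triplets = n_triplets(n_genes)  # 1,333,133,340,000
ratio = triplets // pairs       # (n_genes - 2) / 3 = 6666
```

Under this assumption the triplet search has several thousand times more candidates than the pair search, which is consistent with the abstract's remark that the full analysis had not yet been achieved for TST.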
Affiliation(s)
- Andrew T Magis
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
|
190
|
Cho JH, Gelinas R, Wang K, Etheridge A, Piper MG, Batte K, Dakhallah D, Price J, Bornman D, Zhang S, Marsh C, Galas D. Systems biology of interstitial lung diseases: integration of mRNA and microRNA expression changes. BMC Med Genomics 2011; 4:8. [PMID: 21241464 PMCID: PMC3035594 DOI: 10.1186/1755-8794-4-8] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.7] [Received: 08/19/2010] [Accepted: 01/17/2011] [Indexed: 11/17/2022]
Abstract
Background The molecular pathways involved in the interstitial lung diseases (ILDs) are poorly understood. Systems biology approaches, with global expression data sets, were used to identify perturbed gene networks, to gain some understanding of the underlying mechanisms, and to develop specific hypotheses relevant to these chronic lung diseases. Methods Lung tissue samples from patients with different types of ILD were obtained from the Lung Tissue Research Consortium and total cell RNA was isolated. Global mRNA and microRNA were profiled by hybridization and amplification-based methods. Differentially expressed genes were compiled and used to identify critical signaling pathways and potential biomarkers. Modules of genes were identified that formed a regulatory network, and studies were performed on cultured cells in vitro for comparison with the in vivo results. Results By profiling mRNA and microRNA (miRNA) expression levels, we found subsets of differentially expressed genes that distinguished patients with ILDs from controls and that correlated with different disease stages and subtypes of ILDs. Network analysis, based on pathway databases, revealed several disease-associated gene modules, involving genes from the TGF-β, Wnt, focal adhesion, and smooth muscle actin pathways that are implicated in advancing fibrosis, a critical pathological process in ILDs. A more comprehensive approach was also adapted to construct a putative global gene regulatory network based on the perturbation of key regulatory elements, transcription factors and microRNAs. Our data underscores the importance of TGF-β signaling and the persistence of smooth muscle actin-containing fibroblasts in these diseases. We present evidence that, downstream of TGF-β signaling, microRNAs of the miR-23a cluster and the transcription factor Zeb1 could have roles in mediating an epithelial to mesenchymal transition (EMT) and the resultant persistence of mesenchymal cells in these diseases. 
Conclusions We present a comprehensive overview of the molecular networks perturbed in ILDs, discuss several potential key molecular regulatory circuits, and identify microRNA species that may play central roles in facilitating the progression of ILDs. These findings advance our understanding of these diseases at the molecular level, provide new molecular signatures in defining the specific characteristics of the diseases, suggest new hypotheses, and reveal new potential targets for therapeutic intervention.
Affiliation(s)
- Ji-Hoon Cho
- Institute for Systems Biology, Seattle, WA, USA
|
191
|
|
192
|
Top Scoring Pair Decision Tree for Gene Expression Data Analysis. Adv Exp Med Biol 2011; 696:27-35. [DOI: 10.1007/978-1-4419-7046-6_3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 01/12/2023]
|
193
|
Gelfond J, Zarzabal LA, Burton T, Burns S, Sogayar M, Penalva LOF. Latent rank change detection for analysis of splice-junction microarrays with nonlinear effects. Ann Appl Stat 2011; 5:364-380. [PMID: 23335951 DOI: 10.1214/10-aoas389supp] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
Alternative splicing of gene transcripts greatly expands the functional capacity of the genome, and certain splice isoforms may indicate specific disease states such as cancer. Splice junction microarrays interrogate thousands of splice junctions, but data analysis is difficult and error prone because of the increased complexity compared to differential gene expression analysis. We present Rank Change Detection (RCD) as a method to identify differential splicing events based upon a straightforward probabilistic model comparing the over- or underrepresentation of two or more competing isoforms. RCD has advantages over commonly used methods because it is robust to false positive errors due to nonlinear trends in microarray measurements. Further, RCD does not depend on prior knowledge of splice isoforms, yet it takes advantage of the inherent structure of mutually exclusive junctions, and it is conceptually generalizable to other types of splicing arrays or RNA-Seq. RCD specifically identifies the biologically important cases when a splice junction becomes more or less prevalent compared to other mutually exclusive junctions. The example data is from different cell lines of glioblastoma tumors assayed with Agilent microarrays.
Affiliation(s)
- Jonathan Gelfond
- UT Health Science Center San Antonio and Universidad de São Paulo
|
194
|
Abstract
Recent studies suggest that the deregulation of pathways, rather than individual genes, may be critical in triggering carcinogenesis. The pathway deregulation is often caused by the simultaneous deregulation of more than one gene in the pathway. This suggests that robust gene pair combinations may exploit the underlying bio-molecular reactions that are relevant to the pathway deregulation and thus they could provide better biomarkers for cancer, as compared to individual genes. In order to validate this hypothesis, in this paper, we used gene pair combinations, called doublets, as input to the cancer classification algorithms, instead of the original expression values, and we showed that the classification accuracy was consistently improved across different datasets and classification algorithms. We validated the proposed approach using nine cancer datasets and five classification algorithms including Prediction Analysis for Microarrays (PAM), C4.5 Decision Trees (DT), Naive Bayesian (NB), Support Vector Machine (SVM), and k-Nearest Neighbor (k-NN).
|
195
|
Wang L, Chu F. Extracting very simple diagnostic rules from microarray data. Annu Int Conf IEEE Eng Med Biol Soc 2010; 2010:807-10. [PMID: 21096115 DOI: 10.1109/iembs.2010.5626565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
We present an approach to deriving very simple classification rules from microarray data by first selecting very small gene subsets that can ensure highly accurate classification of cancers. Finding such minimum gene subsets can greatly reduce the computational load and "noise" arising from irrelevant genes. The derived simple classification rules allow for accurate diagnosis without the need for any classifiers. This work can simplify gene expression tests by including only a very small number of genes rather than thousands or tens of thousands of genes, which can significantly bring down the cost for cancer testing. These studies also call for further investigations into possible biological relationship between these small number of genes and cancer development and treatment. For example, we report the following simple, and yet 100% accurate, diagnostic rules involving only 2 genes to separate the 3 types of lymphoma patients: the patient has diffuse large B-cell lymphoma (DLBCL), if and only if the expression level of gene GENE1622X is greater than -0.75; the patient has chronic lymphocytic leukaemia (CLL), if and only if the expression level of gene GENE540X is less than -1; and the patient has follicular lymphoma (FL) otherwise, i.e., if and only if the expression level of gene GENE1622X is less than -0.75 and the expression level of gene GENE540X is greater than -1.
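The two-gene lymphoma rule quoted in this abstract is simple enough to write out directly. The sketch below is an illustration only: the function and argument names are hypothetical, the thresholds are taken from the abstract, and checking the DLBCL condition first is an assumption made here for the case (not discussed in the abstract) where both conditions would hold. Values exactly on a threshold fall through to FL under the "otherwise" reading.

```python
def classify_lymphoma(gene1622x: float, gene540x: float) -> str:
    """Apply the two-gene rule quoted in the abstract.

    gene1622x, gene540x: expression levels of GENE1622X and GENE540X
    on the scale used in the study (not specified here).
    """
    if gene1622x > -0.75:   # DLBCL condition from the abstract
        return "DLBCL"
    if gene540x < -1:       # CLL condition from the abstract
        return "CLL"
    return "FL"             # "otherwise" case

# Hypothetical expression values, for illustration only.
examples = [(0.0, 0.0), (-1.0, -2.0), (-1.0, 0.0)]
labels = [classify_lymphoma(a, b) for a, b in examples]
```

This shows what the abstract means by diagnosis "without the need for any classifiers": once the genes and thresholds are fixed, prediction is two comparisons.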
Affiliation(s)
- Lipo Wang
- School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798.
|
196
|
Tentler JJ, Nallapareddy S, Tan AC, Spreafico A, Pitts TM, Morelli MP, Selby HM, Kachaeva MI, Flanigan SA, Kulikowski GN, Leong S, Arcaroli JJ, Messersmith WA, Eckhardt SG. Identification of predictive markers of response to the MEK1/2 inhibitor selumetinib (AZD6244) in K-ras-mutated colorectal cancer. Mol Cancer Ther 2010; 9:3351-62. [PMID: 20923857 DOI: 10.1158/1535-7163.mct-10-0376] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Indexed: 11/16/2022]
Abstract
Mutant K-ras activity leads to the activation of the RAS/RAF/MEK/ERK pathway in approximately 44% of colorectal cancer (CRC) tumors. Accordingly, several inhibitors of the MEK pathway are under clinical evaluation in several malignancies including CRC. The aim of this study was to develop and characterize predictive biomarkers of response to the MEK1/2 inhibitor AZD6244 in CRC in order to maximize the clinical utility of this agent. Twenty-seven human CRC cell lines were exposed to AZD6244 and classified according to the IC(50) value as sensitive (≤ 0.1 μmol/L) or resistant (>1 μmol/L). All cell lines were subjected to immunoblotting for effector proteins, K-ras/BRAF mutation status, and baseline gene array analysis. Further testing was done in cell line xenografts and K-ras mutant CRC human explants models to develop a predictive genomic classifier for AZD6244. The most sensitive and resistant cell lines were subjected to differential gene array and pathway analyses. Members of the Wnt signaling pathway were highly overexpressed in cell lines resistant to AZD6244 and seem to be functionally involved in mediating resistance by shRNA knockdown studies. Baseline gene array data from CRC cell lines and xenografts were used to develop a k-top scoring pair (k-TSP) classifier, which predicted with 71% accuracy which of a test set of patient-derived K-ras mutant CRC explants would respond to AZD6244, providing the basis for a patient-selective clinical trial. These results also indicate that resistance to AZD6244 may be mediated, in part, by the upregulation of the Wnt pathway, suggesting potential rational combination partners for AZD6244 in CRC.
Affiliation(s)
- John J Tentler
- Division of Medical Oncology, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado 80045, USA.
|
197
|
Arcaroli JJ, Touban BM, Tan AC, Varella-Garcia M, Powell RW, Eckhardt SG, Elvin P, Gao D, Messersmith WA. Gene array and fluorescence in situ hybridization biomarkers of activity of saracatinib (AZD0530), a Src inhibitor, in a preclinical model of colorectal cancer. Clin Cancer Res 2010; 16:4165-77. [PMID: 20682712 DOI: 10.1158/1078-0432.ccr-10-0066] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Indexed: 01/22/2023]
Abstract
PURPOSE To evaluate the efficacy of saracatinib (AZD0530), an oral Src inhibitor, in colorectal cancer (CRC) and to identify biomarkers that predict antitumor activity. EXPERIMENTAL DESIGN Twenty-three CRC cell lines were exposed to saracatinib, and baseline gene expression profiles of three sensitive and eight resistant cell lines in vitro and in vivo were used to predict saracatinib sensitivity in an independent group of 10 human CRC explant tumors using the gene array K-Top Scoring Pairs (K-TSP) method. In addition, fluorescence in situ hybridization (FISH) and immunoblotting determined both Src gene copy number and activation of Src, respectively. RESULTS Two of 10 explant tumors were determined to be sensitive to saracatinib. The K-TSP classifier (TOX>GLIS2, TSPAN7>BCAS4, and PARD6G>NXN) achieved 70% (7 of 10) accuracy on the test set. Evaluation of Src gene copy number by FISH showed a trend toward significance (P = 0.066) with respect to an increase in Src gene copy and resistance to saracatinib. Tumors sensitive to saracatinib showed an increase in the activation of Src and FAK when compared with resistant tumors. CONCLUSIONS Saracatinib significantly decreased tumor growth in a subset of CRC cell lines and explants. A K-TSP classifier (TOX>GLIS2, TSPAN7>BCAS4, and PARD6G>NXN) was predictive for sensitivity to saracatinib. In addition, increased activation of the Src pathway was associated with sensitivity to saracatinib. These results suggest that FISH, a K-TSP classifier, and activation of the Src pathway have potential in identifying CRC patients that would potentially benefit from treatment with saracatinib.
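A K-TSP classifier of the kind named in this abstract is usually applied by checking each gene-pair ordering in a sample and taking an unweighted majority vote. The sketch below assumes that standard voting scheme and uses the three pairs listed in the abstract. The abstract does not say which phenotype (sensitive or resistant) each ordering votes for, so the function only reports whether the majority of orderings hold; the expression values and function names are hypothetical.

```python
# Gene pairs as listed in the abstract: each entry (A, B) stands for "A > B".
PAIRS = [("TOX", "GLIS2"), ("TSPAN7", "BCAS4"), ("PARD6G", "NXN")]

def ktsp_votes(expr: dict, pairs=PAIRS) -> int:
    """Count how many of the 'A > B' orderings hold in one sample."""
    return sum(expr[a] > expr[b] for a, b in pairs)

def majority_holds(expr: dict, pairs=PAIRS) -> bool:
    """True when more than half of the orderings hold (unweighted vote)."""
    return ktsp_votes(expr, pairs) * 2 > len(pairs)

# Hypothetical expression values for a single sample, for illustration only.
sample = {
    "TOX": 8.1, "GLIS2": 6.0,    # TOX > GLIS2 holds
    "TSPAN7": 5.2, "BCAS4": 7.4, # TSPAN7 > BCAS4 does not hold
    "PARD6G": 9.0, "NXN": 4.3,   # PARD6G > NXN holds
}
votes = ktsp_votes(sample)
decision = majority_holds(sample)
```

Because only within-sample orderings are compared, a rule like this does not depend on how expression values are scaled across samples, which is one reason pair-based classifiers are attractive for small preclinical test sets.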
Affiliation(s)
- John J Arcaroli
- Division of Medical Oncology, University of Colorado, Denver, Colorado, USA
|
198
|
|
199
|
Edelman LB, Eddy JA, Price ND. In silico models of cancer. Wiley Interdiscip Rev Syst Biol Med 2010; 2:438-459. [PMID: 20836040 PMCID: PMC3157287 DOI: 10.1002/wsbm.75] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.1] [Indexed: 02/06/2023]
Abstract
Cancer is a complex disease that involves multiple types of biological interactions across diverse physical, temporal, and biological scales. This complexity presents substantial challenges for the characterization of cancer biology, and motivates the study of cancer in the context of molecular, cellular, and physiological systems. Computational models of cancer are being developed to aid both biological discovery and clinical medicine. The development of these in silico models is facilitated by rapidly advancing experimental and analytical tools that generate information-rich, high-throughput biological data. Statistical models of cancer at the genomic, transcriptomic, and pathway levels have proven effective in developing diagnostic and prognostic molecular signatures, as well as in identifying perturbed pathways. Statistically inferred network models can prove useful in settings where data overfitting can be avoided, and provide an important means for biological discovery. Mechanistically based signaling and metabolic models that apply a priori knowledge of biochemical processes derived from experiments can also be reconstructed where data are available, and can provide insight and predictive ability regarding the behavior of these systems. At longer length scales, continuum and agent-based models of the tumor microenvironment and other tissue-level interactions enable modeling of cancer cell populations and tumor progression. Even though cancer has been among the most-studied human diseases using systems approaches, significant challenges remain before the enormous potential of in silico cancer biology can be fully realized.
Affiliation(s)
- Lucas B. Edelman
- Institute for Genomic Biology, Department of Bioengineering, University of Illinois, Urbana-Champaign
- James A. Eddy
- Institute for Genomic Biology, Department of Bioengineering, University of Illinois, Urbana-Champaign
- Nathan D. Price
- Department of Chemical and Biomolecular Engineering, Institute for Genomic Biology, Center for Biophysics and Computational Biology, University of Illinois, Urbana-Champaign
|
200
|
Pitts TM, Tan AC, Kulikowski GN, Tentler JJ, Brown AM, Flanigan SA, Leong S, Coldren CD, Hirsch FR, Varella-Garcia M, Korch C, Eckhardt SG. Development of an integrated genomic classifier for a novel agent in colorectal cancer: approach to individualized therapy in early development. Clin Cancer Res 2010; 16:3193-204. [PMID: 20530704 PMCID: PMC2889230 DOI: 10.1158/1078-0432.ccr-09-3191] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Indexed: 12/30/2022]
Abstract
BACKGROUND A plethora of agents is in early stages of development for colorectal cancer (CRC), including those that target the insulin-like growth factor I receptor (IGFIR) pathway. In the current environment of numerous cancer targets, it is imperative that patient selection strategies be developed with the intent of preliminary testing in the latter stages of phase I trials. The goal of this study was to develop and characterize predictive biomarkers for an IGFIR tyrosine kinase inhibitor, OSI-906, that could be applied in CRC-specific studies of this agent. METHODS Twenty-seven CRC cell lines were exposed to OSI-906 and classified according to IC(50) value as sensitive (5 micromol/L). Cell lines were subjected to immunoblotting and immunohistochemistry for effector proteins, IGFIR copy number by fluorescence in situ hybridization, KRAS/BRAF/phosphoinositide 3-kinase mutation status, and baseline gene array analysis. The most sensitive and resistant cell lines were used for gene array and pathway analyses, along with shRNA knockdown of highly ranked genes. The resulting integrated genomic classifier was then tested against eight human CRC explants in vivo. RESULTS Baseline gene array data from cell lines and xenografts were used to develop a k-top scoring pair (k-TSP) classifier, which, in combination with IGFIR fluorescence in situ hybridization and KRAS mutational status, was able to predict with 100% accuracy a test set of patient-derived CRC xenografts. CONCLUSIONS These results indicate that an integrated approach to the development of individualized therapy is feasible and should be applied early in the development of novel agents, ideally in conjunction with late-stage phase I trials.
Affiliation(s)
- Todd M. Pitts
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Aik Choon Tan
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Gillian N. Kulikowski
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- John J. Tentler
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Amy M. Brown
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Sara A. Flanigan
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Stephen Leong
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Christopher D. Coldren
- Division of Pulmonary Sciences and Critical Care Medicine, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Fred R. Hirsch
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Marileila Varella-Garcia
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- Christopher Korch
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
- S. Gail Eckhardt
- Division of Medical Oncology, University of Colorado Denver, Anschutz Medical Campus, Aurora, Colorado
|