1
Kwan B, Fuhrer T, Montemayor D, Fink JC, He J, Hsu CY, Messer K, Nelson RG, Pu M, Ricardo AC, Rincon-Choles H, Shah VO, Ye H, Zhang J, Sharma K, Natarajan L. A generalized covariate-adjusted top-scoring pair algorithm with applications to diabetic kidney disease stage classification in the Chronic Renal Insufficiency Cohort (CRIC) Study. BMC Bioinformatics 2023; 24:57. [PMID: 36803209 PMCID: PMC9942303 DOI: 10.1186/s12859-023-05171-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Accepted: 02/02/2023] [Indexed: 02/22/2023]
Abstract
BACKGROUND The growing amount of high dimensional biomolecular data has spawned new statistical and computational models for risk prediction and disease classification. Yet, many of these methods do not yield biologically interpretable models, despite offering high classification accuracy. An exception, the top-scoring pair (TSP) algorithm, derives parameter-free, biologically interpretable single-pair decision rules that are accurate and robust in disease classification. However, standard TSP methods do not accommodate covariates that could heavily influence feature selection for the top-scoring pair. Herein, we propose a covariate-adjusted TSP method, which uses residuals from a regression of features on the covariates for identifying top-scoring pairs. We conduct simulations and a data application to investigate our method, and compare it to existing classifiers, LASSO and random forests. RESULTS Our simulations found that features that were highly correlated with clinical variables had a high likelihood of being selected as top-scoring pairs in the standard TSP setting. However, through residualization, our covariate-adjusted TSP was able to identify new top-scoring pairs that were largely uncorrelated with clinical variables. In the data application, using patients with diabetes (n = 977) selected for metabolomic profiling in the Chronic Renal Insufficiency Cohort (CRIC) study, the standard TSP algorithm identified (valine-betaine, dimethyl-arg) as the top-scoring metabolite pair for classifying diabetic kidney disease (DKD) severity, whereas the covariate-adjusted TSP method identified the pair (pipazethate, octaethylene glycol) as top-scoring. Valine-betaine and dimethyl-arg had, respectively, ≥ 0.4 absolute correlation with urine albumin and serum creatinine, known prognosticators of DKD. Thus, without covariate adjustment, the top-scoring pair largely reflected known markers of disease severity, whereas covariate-adjusted TSP uncovered features liberated from confounding and identified independent prognostic markers of DKD severity. Furthermore, TSP-based methods achieved classification accuracy in DKD competitive with LASSO and random forests, while providing more parsimonious models. CONCLUSIONS We extended TSP-based methods to account for covariates via a simple, easy-to-implement residualizing process. Our covariate-adjusted TSP method identified metabolite features, uncorrelated with clinical covariates, that discriminate DKD severity stage based on the relative ordering between two features, and thus provide insights for future studies on the order reversals in early vs advanced disease states.
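The residualize-then-pair idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, ordinary least squares with an intercept is assumed for the regression of features on covariates, and the score is the usual TSP difference in class-conditional ordering probabilities. The paper's actual regression model and tie handling may differ.

```python
import numpy as np

def residualize(X, Z):
    """Residuals of each feature (column of X) after an OLS fit on covariates Z.

    Assumption: a linear model with intercept; the paper may use another model.
    """
    Z1 = np.column_stack([np.ones(len(Z)), Z])      # add intercept column
    beta, *_ = np.linalg.lstsq(Z1, X, rcond=None)   # one fit per feature
    return X - Z1 @ beta

def top_scoring_pair(X, y):
    """Return (i, j, score) maximizing |P(X_i < X_j | y=0) - P(X_i < X_j | y=1)|."""
    X0, X1 = X[y == 0], X[y == 1]
    # p[i, j] = fraction of samples in the class with feature i < feature j.
    # This builds an n x p x p boolean array, so it only suits small p.
    p0 = (X0[:, :, None] < X0[:, None, :]).mean(axis=0)
    p1 = (X1[:, :, None] < X1[:, None, :]).mean(axis=0)
    score = np.abs(p0 - p1)
    i, j = np.unravel_index(np.argmax(score), score.shape)
    return int(i), int(j), float(score[i, j])

def covariate_adjusted_tsp(X, Z, y):
    """Run the pair search on covariate residuals instead of raw features."""
    return top_scoring_pair(residualize(X, Z), y)
```

Because the pair search only sees residuals, a feature whose class separation is explained entirely by the covariates loses its advantage, which matches the behaviour the abstract reports for features correlated with clinical variables.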
Grants
- U01 DK061028 NIDDK NIH HHS
- U01 DK060963 NIDDK NIH HHS
- R01DK118736, 1R01DK110541-01A1, U01DK060990, U01DK060984, U01DK061022, U01DK061021, U01DK061028, U01DK060980, U01DK060963, U01DK060902, U24DK060990 NIDDK NIH HHS
- R01 DK110541 NIDDK NIH HHS
- U01 DK060902 NIDDK NIH HHS
- U01 DK060990 NIDDK NIH HHS
- U01 DK060984 NIDDK NIH HHS
- U01 DK061021 NIDDK NIH HHS
- U24 DK060990 NIDDK NIH HHS
- U01 DK060980 NIDDK NIH HHS
- R01 DK118736 NIDDK NIH HHS
- U01 DK061022 NIDDK NIH HHS
- National Science Foundation Graduate Research Fellowship Program
- Intramural Research Program of the National Institute of Diabetes and Digestive and Kidney Diseases
- National Institute of Diabetes and Digestive and Kidney Diseases
Affiliation(s)
- Brian Kwan
  - Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Tobias Fuhrer
  - Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Daniel Montemayor
  - Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
  - Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Jeffery C Fink
  - Department of Medicine, University of Maryland, Baltimore School of Medicine, Baltimore, MD, USA
- Jiang He
  - Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine and Tulane University Translational Science Institute, New Orleans, LA, USA
- Chi-Yuan Hsu
  - Division of Nephrology, University of California, San Francisco School of Medicine, San Francisco, CA, USA
- Karen Messer
  - Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Robert G Nelson
  - Chronic Kidney Disease Section, National Institute of Diabetes and Digestive and Kidney Diseases, Phoenix, AZ, USA
- Minya Pu
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Ana C Ricardo
  - Department of Medicine, University of Illinois, Chicago, IL, USA
- Hernan Rincon-Choles
  - Department of Nephrology, Glickman Urological and Kidney Institute, Cleveland Clinic Foundation, Cleveland, OH, USA
- Vallabh O Shah
  - University of New Mexico Health Sciences Center, Albuquerque, NM, USA
- Hongping Ye
  - Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
  - Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Jing Zhang
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Kumar Sharma
  - Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
  - Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Loki Natarajan
  - Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
2
The Association between Immune Subgroups and Gene Modules for the Clinical, Cellular, and Molecular Characteristic of Hepatocellular Carcinoma. JOURNAL OF ONCOLOGY 2022; 2022:7253876. [PMID: 36090895 PMCID: PMC9452932 DOI: 10.1155/2022/7253876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2022] [Revised: 07/21/2022] [Accepted: 08/01/2022] [Indexed: 11/18/2022]
Abstract
The heterogeneity of hepatocellular carcinoma (HCC) is related to immune cell infiltration and genetic aberrations in the tumor microenvironment. This study aimed to identify the novel molecular typing of HCC according to the genetic and immune characteristics, to obtain accurate clinical management of this disease. We performed consensus clustering to divide 424 patients into different immune subgroups and assessed the reproducibility and efficiency in two independent cohorts with 921 patients. The associations between molecular typing and molecular, cellular, and clinical characteristics were investigated by a multidimensional bioinformatics approach. Furthermore, we conducted graph structure learning-based dimensionality reduction to depict the immune landscape to reveal the interrelation between the immune and gene systems in molecular typing. We revealed and validated that HCC patients could be segregated into 5 immune subgroups (IS1-5) and 7 gene modules with significantly different molecular, cellular, and clinical characteristics. IS5 had the worst prognosis and lowest enrichment of immune characteristics and was considered the immune cold type. IS4 had the longest overall survival, high immune activity, and antitumorigenesis, which were defined as the immune hot and antitumorigenesis types. In addition, immune landscape analysis further revealed significant intraclass heterogeneity within each IS, and each IS represented distinct clinical, cellular, and molecular characteristics. Our study provided 5 immune subgroups with distinct clinical, cellular, and molecular characteristics of HCC and may have clinical implications for precise therapeutic strategies and facilitate the investigation of immune mechanisms in HCC.
3
Shen JP. Artificial intelligence, molecular subtyping, biomarkers, and precision oncology. Emerg Top Life Sci 2021; 5:747-756. [PMID: 34881776 PMCID: PMC8786277 DOI: 10.1042/etls20210212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2021] [Revised: 11/23/2021] [Accepted: 11/24/2021] [Indexed: 11/17/2022]
Abstract
A targeted cancer therapy is only useful if there is a way to accurately identify the tumors that are susceptible to that therapy. Thus rapid expansion in the number of available targeted cancer treatments has been accompanied by a robust effort to subdivide the traditional histological and anatomical tumor classifications into molecularly defined subtypes. This review highlights the history of the paired evolution of targeted therapies and biomarkers, reviews currently used methods for subtype identification, and discusses challenges to the implementation of precision oncology as well as possible solutions.
Affiliation(s)
- John Paul Shen
  - Department of Gastrointestinal Medical Oncology, University of Texas MD Anderson Cancer Center, Houston, U.S.A
4
Using Domain Knowledge for Interpretable and Competitive Multi-Class Human Activity Recognition. SENSORS 2020; 20:s20041208. [PMID: 32098362 PMCID: PMC7070332 DOI: 10.3390/s20041208] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/31/2019] [Revised: 02/17/2020] [Accepted: 02/19/2020] [Indexed: 11/17/2022]
Abstract
Human activity recognition (HAR) has become an increasingly popular application of machine learning across a range of domains. Typically the HAR task that a machine learning algorithm is trained for requires separating multiple activities such as walking, running, sitting, and falling from each other. Despite a large body of work on multi-class HAR, and the well-known fact that the performance on a multi-class problem can be significantly affected by how it is decomposed into a set of binary problems, there has been little research into how the choice of multi-class decomposition method affects the performance of HAR systems. This paper presents the first empirical comparison of multi-class decomposition methods in a HAR context by estimating the performance of five machine learning algorithms when used in their multi-class formulation, with four popular multi-class decomposition methods, five expert hierarchies—nested dichotomies constructed from domain knowledge—or an ensemble of expert hierarchies on a 17-class HAR data-set which consists of features extracted from tri-axial accelerometer and gyroscope signals. We further compare performance on two binary classification problems, each based on the topmost dichotomy of an expert hierarchy. The results show that expert hierarchies can indeed compete with one-vs-all, both on the original multi-class problem and on a more general binary classification problem, such as that induced by an expert hierarchy’s topmost dichotomy. Finally, we show that an ensemble of expert hierarchies performs better than one-vs-all and comparably to one-vs-one, despite being of lower time and space complexity, on the multi-class problem, and outperforms all other multi-class decomposition methods on the two dichotomous problems.
5
Best MG, In 't Veld SGJG, Sol N, Wurdinger T. RNA sequencing and swarm intelligence-enhanced classification algorithm development for blood-based disease diagnostics using spliced blood platelet RNA. Nat Protoc 2019; 14:1206-1234. [PMID: 30894694 DOI: 10.1038/s41596-019-0139-5] [Citation(s) in RCA: 86] [Impact Index Per Article: 14.3] [Received: 01/24/2018] [Accepted: 01/17/2019] [Indexed: 12/12/2022]
Abstract
Blood-based diagnostics tests, using individual or panels of biomarkers, may revolutionize disease diagnostics and enable minimally invasive therapy monitoring. However, selection of the most relevant biomarkers from liquid biosources remains an immense challenge. We recently presented the thromboSeq pipeline, which enables RNA sequencing and cancer classification via self-learning and swarm intelligence-enhanced bioinformatics algorithms using blood platelet RNA. Here, we provide the wet-lab protocol for the generation of platelet RNA-sequencing libraries and the dry-lab protocol for the development of swarm intelligence-enhanced machine-learning-based classification algorithms. The wet-lab protocol includes platelet RNA isolation, mRNA amplification, and preparation for next-generation sequencing. The dry-lab protocol describes the automated FASTQ file pre-processing to quantified gene counts, quality controls, data normalization and correction, and swarm intelligence-enhanced support vector machine (SVM) algorithm development. This protocol enables platelet RNA profiling from 500 pg of platelet RNA and allows automated and optimized biomarker panel selection. The wet-lab protocol can be performed in 5 d before sequencing, and the algorithm development can be completed in 2 d, depending on computational resources. The protocol requires basic molecular biology skills and a basic understanding of Linux and R. In all, with this protocol, we aim to enable the scientific community to test platelet RNA for diagnostic algorithm development.
Affiliation(s)
- Myron G Best
  - Department of Neurosurgery, Cancer Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
  - Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
  - Brain Tumor Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
- Sjors G J G In 't Veld
  - Department of Neurosurgery, Cancer Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
  - Brain Tumor Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
- Nik Sol
  - Brain Tumor Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
  - Department of Neurology, Cancer Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
- Thomas Wurdinger
  - Department of Neurosurgery, Cancer Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
  - Brain Tumor Center Amsterdam, Amsterdam UMC, VU University Medical Center, Amsterdam, the Netherlands
6
Xiao H, Xu D, Chen P, Zeng G, Wang X, Zhang X. Identification of Five Genes as a Potential Biomarker for Predicting Progress and Prognosis in Adrenocortical Carcinoma. J Cancer 2018; 9:4484-4495. [PMID: 30519354 PMCID: PMC6277665 DOI: 10.7150/jca.26698] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 04/16/2018] [Accepted: 09/20/2018] [Indexed: 12/19/2022]
Abstract
Background: Adrenocortical carcinoma (ACC) is a rare endocrine malignancy with a poor prognosis and few treatment options. The significance of aberrant messenger RNA (mRNA) expression for progression and prognosis in ACC has been studied extensively in recent years. However, differences in measurement platforms and lab protocols as well as small sample sizes can render gene expression levels incomparable. Methods: An extensive study of GEO datasets was conducted to define potential mRNA biomarkers for ACC. The study compared the mRNA expression profiles of ACC tissues and paired adjacent noncancerous adrenal tissues. The study covered a total of 165 tumors and 36 benign control samples. Hub genes were identified through a protein-protein interaction (PPI) network and the Robust Rank Aggregation method. The Cancer Genome Atlas (TCGA) and Oncomine databases were then used to validate the hub genes. Four ACC tissues and four normal tissues were collected, and polymerase chain reaction (PCR), western blotting, and immunofluorescence were performed to validate the expression of the five hub genes. Results: We identified five statistically significant genes (TOP2A, NDC80, CEP55, CDKN3, CDK1) correlated with clinical features. All five hub genes were significantly overexpressed in ACC compared with normal tissue in the TCGA and Oncomine databases. Among all the TCGA ACC cases, strong expression of TOP2A (logrank p=1.4e-04, HR=4.7), NDC80 (logrank p=8.8e-05, HR=4.9), CEP55 (logrank p=5.2e-07, HR=8.6), CDKN3 (logrank p=2.3e-06, HR=7.6), and CDK1 (logrank p=7e-08, HR=11) was correlated with poor overall survival, disease-free survival (logrank p < 0.001), pathology stage, and pathology T stage (FDR < 0.001). PCR results showed that the transcriptional levels of these five genes were significantly higher in ACC tissues than in normal tissues. The western blotting results also showed that the translational level of TOP2A was significantly higher in tumor tissues than in normal tissues. The results of immunofluorescence showed that TOP2A was abundantly observed in the adrenal cortical cell membrane and nucleus and its expression in ACC tissues was significantly higher than that in normal tissues. Conclusions: The five identified genes may be used to form a panel of biomarkers of progression and prognosis for ACC for clinical purposes.
Affiliation(s)
- He Xiao
  - Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan 430071, P.R. China
- Deqiang Xu
  - Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan 430071, P.R. China
- Ping Chen
  - Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan 430071, P.R. China
- Guang Zeng
  - Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan 430071, P.R. China
  - Biomedical Engineering, Stony Brook University, New York 11790
- Xinghuan Wang
  - Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan 430071, P.R. China
- Xinhua Zhang
  - Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan 430071, P.R. China
7
Abstract
Summary. Background: Multi-class molecular cancer classification has great potential clinical implications. Such applications require statistical methods to accurately classify cancer types with a small subset of genes from thousands of genes in the data. Objectives: This paper presents a new functional gradient descent boosting algorithm that directly extends the HingeBoost algorithm from the binary case to the multi-class case without reducing the original problem to multiple binary problems. Methods: Minimizing a multi-class hinge loss with boosting technique, the proposed HingeBoost has good theoretical properties by implementing the Bayes decision rule and providing a unifying framework with either equal or unequal misclassification costs. Furthermore, we propose Twin HingeBoost which has better feature selection behavior than HingeBoost by reducing the number of ineffective covariates. Simulated data, benchmark data and two cancer gene expression data sets are utilized to evaluate the performance of the proposed approach. Results: Simulations and the benchmark data showed that the multi-class HingeBoost generated accurate predictions when compared with the alternative methods, especially with high-dimensional covariates. The multi-class HingeBoost also produced more accurate prediction or comparable prediction in two cancer classification problems using gene expression data. Conclusions: This work has shown that the HingeBoost provides a powerful tool for multi-classification problems. In many applications, the classification accuracy and feature selection behavior can be further improved when using Twin HingeBoost.
8
Abstract
BACKGROUND The liver is the most frequent site of metastatic disease, and metastatic disease to the liver is far more common than primary liver carcinoma in the United States. Pathologic evaluation of biopsy samples is key to establishing a correct diagnosis for patient management. Morphologic and immunoperoxidase studies, which are the standard for pathologic practice, accurately classify most tumors. Subclassification of carcinoma of unknown primary remains problematic. METHODS The author reviewed the literature for articles pertaining to liver biopsy, diagnosis of specific tumor types, utility of immunohistochemical markers, and microarray and proteomic analysis. RESULTS Sampling of liver lesions is best accomplished by combining fine-needle aspiration and needle core biopsy. Many malignancies have distinct morphologic and immunohistochemical patterns and can be correctly subclassified. Adenocarcinoma of unknown primary remains enigmatic since current immunohistochemical markers for this differential diagnosis lack specificity. Microarray analysis and proteomic analysis of tumors can provide distinct gene or protein expression profiles, respectively, for tumor classification. These technologies can be used with fine-needle aspiration and needle core biopsy samples. CONCLUSIONS Most metastatic malignancies in the liver may be correctly diagnosed using standard morphology and immunohistochemical techniques. However, subtyping of some carcinomas and identification of site of unknown primary remains problematic. New technologies may help to further refine our diagnostic capabilities.
Affiliation(s)
- Barbara A Centeno
  - Pathology Services, H. Lee Moffitt Cancer Center & Research Institute, Tampa FL 33612, USA
9
Li J, Wang F. Towards Unsupervised Gene Selection: A Matrix Factorization Framework. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:514-521. [PMID: 28113598 DOI: 10.1109/tcbb.2016.2591545] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 06/06/2023]
Abstract
The recent development of microarray gene expression techniques has made it possible to offer phenotype classification of many diseases. However, in gene expression data analysis, each sample is represented by a large number of genes, many of which are redundant or insignificant for characterizing the disease. Therefore, how to efficiently select the most useful genes has become one of the hottest research topics in gene expression data analysis. In this paper, a novel unsupervised two-stage coarse-fine gene selection method is proposed. In the first stage, we apply the k-means algorithm to over-cluster the genes and discard some redundant genes. In the second stage, we select the most representative genes from the remaining ones based on matrix factorization. Finally, experimental results on several data sets are presented to show the effectiveness of our method.
10
Abstract
Background The receiver operating characteristic (ROC) curve is widely used to evaluate classification performance in the biomedical field. Owing to its superiority in dealing with imbalanced and cost-sensitive data, the ROC curve has been exploited as a popular metric to evaluate and identify disease-related genes (features). The existing ROC-based feature selection approaches are simple and effective in evaluating individual features. However, these approaches may fail to find the real target feature subset because they lack an effective means of reducing the redundancy between features, which is essential in machine learning. Results In this paper, we propose to assess feature complementarity by measuring the distances between the misclassified instances and their nearest misses on the dimensions of pairwise features. If a misclassified instance and its nearest miss on one feature dimension are far apart on another feature dimension, the two features are regarded as complementary to each other. Subsequently, we propose a novel filter feature selection approach on the basis of the ROC analysis. The new approach employs an efficient heuristic search strategy to select optimal features with the highest complementarities. The experimental results on a broad range of microarray data sets validate that the classifiers built on the feature subset selected by our approach can achieve the lowest balanced error rate with a small number of significant features. Conclusions Compared with other ROC-based feature selection approaches, our new approach can select fewer features and effectively improve the classification performance.
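As a point of reference for the baseline this abstract criticizes, the following minimal sketch ranks features one at a time by their area under the ROC curve. It is an illustration only, not the paper's complementarity method; the function names are hypothetical, and the AUC is computed from the Mann-Whitney statistic with ties counted as one half.

```python
import numpy as np

def feature_auc(x, y):
    """AUC of one feature for binary labels y in {0, 1} (Mann-Whitney form)."""
    pos, neg = x[y == 1], x[y == 0]
    diff = pos[:, None] - neg[None, :]            # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

def rank_features_by_auc(X, y):
    """Order features by |AUC - 0.5|, scoring each feature in isolation."""
    aucs = np.array([feature_auc(X[:, k], y) for k in range(X.shape[1])])
    # An AUC near 0 is as discriminative as one near 1, with reversed direction.
    strength = np.abs(aucs - 0.5)
    order = np.argsort(-strength, kind="stable")
    return order, aucs
```

Because every feature is scored in isolation, two near-duplicate features receive near-identical ranks, which is the redundancy problem the abstract sets out to address.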
11
Ganesh Kumar P, Kavitha MS, Ahn BC. Automated Detection of Cancer Associated Genes Using a Combined Fuzzy-Rough-Set-Based F-Information and Water Swirl Algorithm of Human Gene Expression Data. PLoS One 2016; 11:e0167504. [PMID: 27936033 PMCID: PMC5148587 DOI: 10.1371/journal.pone.0167504] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/22/2016] [Accepted: 11/15/2016] [Indexed: 11/22/2022]
Abstract
This study describes a novel approach to reducing the challenges of highly nonlinear multiclass gene expression values for cancer diagnosis. To build a fruitful system for cancer diagnosis, in this study, we introduced two levels of gene selection such as filtering and embedding for selection of potential genes and the most relevant genes associated with cancer, respectively. The filter procedure was implemented by developing a fuzzy rough set (FR)-based method for redefining the criterion function of f-information (FI) to identify the potential genes without discretizing the continuous gene expression values. The embedded procedure is implemented by means of a water swirl algorithm (WSA), which attempts to optimize the rule set and membership function required to classify samples using a fuzzy-rule-based multiclassification system (FRBMS). Two novel update equations are proposed in WSA, which have better exploration and exploitation abilities while designing a self-learning FRBMS. The efficiency of our new approach was evaluated on 13 multicategory and 9 binary datasets of cancer gene expression. Additionally, the performance of the proposed FRFI-WSA method in designing an FRBMS was compared with existing methods for gene selection and optimization such as genetic algorithm (GA), particle swarm optimization (PSO), and artificial bee colony algorithm (ABC) on all the datasets. In the global cancer map with repeated measurements (GCM_RM) dataset, the FRFI-WSA showed the smallest number of 16 most relevant genes associated with cancer using a minimal number of 26 compact rules with the highest classification accuracy (96.45%). In addition, the statistical validation used in this study revealed that the biological relevance of the most relevant genes associated with cancer and their linguistics detected by the proposed FRFI-WSA approach are better than those in the other methods. 
The simple interpretable rules with most relevant genes and effectively classified samples suggest that the proposed FRFI-WSA approach is reliable for classification of an individual’s cancer gene expression data with high precision and therefore it could be helpful for clinicians as a clinical decision support system.
Affiliation(s)
- Muthu Subash Kavitha
  - Department of Computer Vision and Image Processing, School of Electronics Engineering, Kyungpook National University, Daegu, South Korea
- Byeong-Cheol Ahn
  - Department of Nuclear Medicine, Kyungpook National University School of Medicine and Hospital, Daegu, South Korea
12
He H, Lin D, Zhang J, Wang Y, Deng HW. Biostatistics, Data Mining and Computational Modeling. TRANSLATIONAL BIOINFORMATICS 2016. [DOI: 10.1007/978-94-017-7543-4_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/21/2022]
13
Best MG, Sol N, Kooi I, Tannous J, Westerman BA, Rustenburg F, Schellen P, Verschueren H, Post E, Koster J, Ylstra B, Ameziane N, Dorsman J, Smit EF, Verheul HM, Noske DP, Reijneveld JC, Nilsson RJA, Tannous BA, Wesseling P, Wurdinger T. RNA-Seq of Tumor-Educated Platelets Enables Blood-Based Pan-Cancer, Multiclass, and Molecular Pathway Cancer Diagnostics. Cancer Cell 2015; 28:666-676. [PMID: 26525104 PMCID: PMC4644263 DOI: 10.1016/j.ccell.2015.09.018] [Citation(s) in RCA: 607] [Impact Index Per Article: 60.7] [Received: 03/23/2015] [Revised: 07/02/2015] [Accepted: 09/25/2015] [Indexed: 12/12/2022]
Abstract
Tumor-educated blood platelets (TEPs) are implicated as central players in the systemic and local responses to tumor growth, thereby altering their RNA profile. We determined the diagnostic potential of TEPs by mRNA sequencing of 283 platelet samples. We distinguished 228 patients with localized and metastasized tumors from 55 healthy individuals with 96% accuracy. Across six different tumor types, the location of the primary tumor was correctly identified with 71% accuracy. Also, MET or HER2-positive, and mutant KRAS, EGFR, or PIK3CA tumors were accurately distinguished using surrogate TEP mRNA profiles. Our results indicate that blood platelets provide a valuable platform for pan-cancer, multiclass cancer, and companion diagnostics, possibly enabling clinical advances in blood-based "liquid biopsies".
Affiliation(s)
- Myron G Best
- Department of Pathology, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands; Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Nik Sol
- Department of Neurology, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Irsan Kooi
- Department of Clinical Genetics, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Jihane Tannous
- Department of Neurology, Massachusetts General Hospital and Neuroscience Program, Harvard Medical School, 149 13th Street, Charlestown, MA 02129, USA
- Bart A Westerman
- Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- François Rustenburg
- Department of Pathology, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands; Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Pepijn Schellen
- Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands; thromboDx B.V., 1098 EA Amsterdam, the Netherlands
- Heleen Verschueren
- Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands; thromboDx B.V., 1098 EA Amsterdam, the Netherlands
- Edward Post
- Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands; thromboDx B.V., 1098 EA Amsterdam, the Netherlands
- Jan Koster
- Department of Oncogenomics, Academic Medical Center, Meibergdreef 9, 1105 AZ Amsterdam, the Netherlands
- Bauke Ylstra
- Department of Pathology, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Najim Ameziane
- Department of Clinical Genetics, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Josephine Dorsman
- Department of Clinical Genetics, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Egbert F Smit
- Department of Pulmonary Diseases, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Henk M Verheul
- Department of Medical Oncology, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- David P Noske
- Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- Jaap C Reijneveld
- Department of Neurology, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands
- R Jonas A Nilsson
- Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands; thromboDx B.V., 1098 EA Amsterdam, the Netherlands; Department of Radiation Sciences, Oncology, Umeå University, 90185 Umeå, Sweden
- Bakhos A Tannous
- Department of Neurology, Massachusetts General Hospital and Neuroscience Program, Harvard Medical School, 149 13th Street, Charlestown, MA 02129, USA
- Pieter Wesseling
- Department of Pathology, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands; Department of Pathology, Radboud University Medical Center, 6500 HB Nijmegen, the Netherlands
- Thomas Wurdinger
- Department of Neurosurgery, VU University Medical Center, Cancer Center Amsterdam, De Boelelaan 1117, 1081 HV Amsterdam, the Netherlands; Department of Neurology, Massachusetts General Hospital and Neuroscience Program, Harvard Medical School, 149 13th Street, Charlestown, MA 02129, USA; thromboDx B.V., 1098 EA Amsterdam, the Netherlands.
14
Geman D, Ochs M, Price ND, Tomasetti C, Younes L. An argument for mechanism-based statistical inference in cancer. Hum Genet 2015; 134:479-95. [PMID: 25381197 PMCID: PMC4612627 DOI: 10.1007/s00439-014-1501-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 04/21/2014] [Accepted: 10/14/2014] [Indexed: 01/07/2023]
Abstract
Cancer is perhaps the prototypical systems disease, and as such has been the focus of extensive study in quantitative systems biology. However, translating these programs into personalized clinical care remains elusive and incomplete. In this perspective, we argue that realizing this agenda—in particular, predicting disease phenotypes, progression and treatment response for individuals—requires going well beyond standard computational and bioinformatics tools and algorithms. It entails designing global mathematical models over network-scale configurations of genomic states and molecular concentrations, and learning the model parameters from limited available samples of high-dimensional and integrative omics data. As such, any plausible design should accommodate: biological mechanism, necessary for both feasible learning and interpretable decision making; stochasticity, to deal with uncertainty and observed variation at many scales; and a capacity for statistical inference at the patient level. This program, which requires a close, sustained collaboration between mathematicians and biologists, is illustrated in several contexts, including learning biomarkers, metabolism, cell signaling, network inference and tumorigenesis.
Affiliation(s)
- Donald Geman
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21210, USA
15
Afsari B, Braga-Neto UM, Geman D. Rank discriminants for predicting phenotypes from RNA expression. Ann Appl Stat 2014. [DOI: 10.1214/14-aoas738] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 11/19/2022]
16
Recognition of multiple imbalanced cancer types based on DNA microarray data using ensemble classifiers. BioMed Research International 2013; 2013:239628. [PMID: 24078908 PMCID: PMC3770038 DOI: 10.1155/2013/239628] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Received: 04/07/2013] [Revised: 07/08/2013] [Accepted: 07/17/2013] [Indexed: 11/24/2022]
Abstract
DNA microarray technology can measure the activities of tens of thousands of genes simultaneously, which provides an efficient way to diagnose cancer at the molecular level. Although this strategy has attracted significant research attention, most studies neglect an important problem, namely, that most DNA microarray datasets are skewed, which causes traditional learning algorithms to produce inaccurate results. Some studies have considered this problem, yet they focus only on the binary-class problem. In this paper, we address the multiclass imbalanced classification problem, as encountered in cancer DNA microarray data, by using ensemble learning. We used a one-against-all coding strategy to transform the multiclass problem into multiple binary problems, and applied feature subspace, an evolved version of random subspace, to each of them to generate multiple diverse training subsets. Next, we introduced one of two correction techniques, namely decision threshold adjustment or random undersampling, into each training subset to alleviate the damage of class imbalance. Specifically, a support vector machine was used as the base classifier, and a novel voting rule called counter voting was presented for making the final decision. Experimental results on eight skewed multiclass cancer microarray datasets indicate that, unlike many traditional classification approaches, our methods are insensitive to class imbalance.
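As a rough illustration of two steps named in this abstract, one-against-all recoding and random undersampling, the following Python sketch may help. It is not the authors' implementation: the function names, the toy labels, and the choice to cut the majority class down to exactly the minority size are assumptions, and the feature subspace, SVM, and counter-voting steps are left out.

```python
import random
from collections import Counter


def one_vs_all_labels(labels, positive_class):
    """Recode multiclass labels as 1 (positive_class) versus 0 (every other class)."""
    return [1 if y == positive_class else 0 for y in labels]


def random_undersample(binary_labels, rng):
    """Return sorted sample indices with the larger class cut to the smaller class's size."""
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))  # draw without replacement
    return sorted(minority + kept)


# Hypothetical skewed label vector with three classes.
labels = ["A"] * 2 + ["B"] * 3 + ["C"] * 10
rng = random.Random(0)
for cls in sorted(set(labels)):
    y = one_vs_all_labels(labels, cls)
    idx = random_undersample(y, rng)
    print(cls, dict(Counter(y[i] for i in idx)))  # class counts after balancing
```

Each binary problem would then be handed to a base classifier trained on the balanced index set.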
17
Winslow RL, Trayanova N, Geman D, Miller MI. Computational medicine: translating models to clinical care. Sci Transl Med 2013; 4:158rv11. [PMID: 23115356 DOI: 10.1126/scitranslmed.3003528] [Citation(s) in RCA: 119] [Impact Index Per Article: 9.9] [Indexed: 12/18/2022]
Abstract
Because of the inherent complexity of coupled nonlinear biological systems, the development of computational models is necessary for achieving a quantitative understanding of their structure and function in health and disease. Statistical learning is applied to high-dimensional biomolecular data to create models that describe relationships between molecules and networks. Multiscale modeling links networks to cells, organs, and organ systems. Computational approaches are used to characterize anatomic shape and its variations in health and disease. In each case, the purposes of modeling are to capture all that we know about disease and to develop improved therapies tailored to the needs of individuals. We discuss advances in computational medicine, with specific examples in the fields of cancer, diabetes, cardiology, and neurology. Advances in translating these computational methods to the clinic are described, as well as challenges in applying models for improving patient health.
Affiliation(s)
- Raimond L Winslow
- The Institute for Computational Medicine, Center for Cardiovascular Bioinformatics and Modeling, and Department of Biomedical Engineering, The Johns Hopkins University School of Medicine, Baltimore, MD 21218, USA.
18
A review on particle swarm optimization algorithm and its variants to clustering high-dimensional data. Artif Intell Rev 2013. [DOI: 10.1007/s10462-013-9400-4] [Citation(s) in RCA: 119] [Impact Index Per Article: 9.9] [Indexed: 10/27/2022]
19
Hughes C, Iqbal-Wahid J, Brown M, Shanks JH, Eustace A, Denley H, Hoskin PJ, West C, Clarke NW, Gardner P. FTIR microspectroscopy of selected rare diverse sub-variants of carcinoma of the urinary bladder. Journal of Biophotonics 2013; 6:73-87. [PMID: 23125109 DOI: 10.1002/jbio.201200126] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 07/03/2012] [Revised: 10/01/2012] [Accepted: 10/01/2012] [Indexed: 06/01/2023]
Abstract
Urothelial carcinomas of the bladder are a heterogeneous group of tumours, although some histological sub-variants are rare and sparsely reported in the literature. Diagnosis of sub-variants from conventional urothelial carcinoma can be challenging, as they may mimic the morphology of other malignancies or benign tumours and therefore their distinction is important. For the first time, the spectral pathology of some of these sub-variants has been documented by infrared microspectroscopy and an attempt made to profile their biochemistry. It is important not only to identify and separate the cancer-associated epithelial tissue spectra from common tissue features such as stroma or blood, but also to detect the signatures of tumour sub-variants. As shown, their spectroscopic signals can change dramatically as a consequence of differentiation. Example cases are discussed and compared with histological evaluations.
Affiliation(s)
- Caryn Hughes
- Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
20
Wang RL, Bencic D, Biales A, Flick R, Lazorchak J, Villeneuve D, Ankley GT. Discovery and validation of gene classifiers for endocrine-disrupting chemicals in zebrafish (Danio rerio). BMC Genomics 2012; 13:358. [PMID: 22849515 PMCID: PMC3469349 DOI: 10.1186/1471-2164-13-358] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/19/2012] [Accepted: 07/18/2012] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Development and application of transcriptomics-based gene classifiers for ecotoxicological applications lag far behind those of biomedical sciences. Many such classifiers discovered thus far lack vigorous statistical and experimental validations. A combination of genetic algorithm/support vector machines and genetic algorithm/K nearest neighbors was used in this study to search for classifiers of endocrine-disrupting chemicals (EDCs) in zebrafish. Searches were conducted on both tissue-specific and tissue-combined datasets, either across the entire transcriptome or within individual transcription factor (TF) networks previously linked to EDC effects. Candidate classifiers were evaluated by gene set enrichment analysis (GSEA) on both the original training data and a dedicated validation dataset. RESULTS Multi-tissue dataset yielded no classifiers. Among the 19 chemical-tissue conditions evaluated, the transcriptome-wide searches yielded classifiers for six of them, each having approximately 20 to 30 gene features unique to a condition. Searches within individual TF networks produced classifiers for 15 chemical-tissue conditions, each containing 100 or fewer top-ranked gene features pooled from those of multiple TF networks and also unique to each condition. For the training dataset, 10 out of 11 classifiers successfully identified the gene expression profiles (GEPs) of their targeted chemical-tissue conditions by GSEA. For the validation dataset, classifiers for prochloraz-ovary and flutamide-ovary also correctly identified the GEPs of corresponding conditions while no classifier could predict the GEP from prochloraz-brain. CONCLUSIONS The discrepancies in the performance of these classifiers were attributed in part to varying data complexity among the conditions, as measured to some degree by Fisher's discriminant ratio statistic. 
This variation in data complexity could likely be compensated by adjusting sample size for individual chemical-tissue conditions, thus suggesting a need for a preliminary survey of transcriptomic responses before launching a full scale classifier discovery effort. Classifier discovery based on individual TF networks could yield more mechanistically-oriented biomarkers. GSEA proved to be a flexible and effective tool for application of gene classifiers but a similar and more refined algorithm, connectivity mapping, should also be explored. The distribution characteristics of classifiers across tissues, chemicals, and TF networks suggested a differential biological impact among the EDCs on zebrafish transcriptome involving some basic cellular functions.
Affiliation(s)
- Rong-Lin Wang
- USEPA, Ecological Exposure Research Division, National Exposure Research Laboratory, 26 W Martin Luther King Dr, Cincinnati, OH, 45268, USA
- David Bencic
- USEPA, Ecological Exposure Research Division, National Exposure Research Laboratory, 26 W Martin Luther King Dr, Cincinnati, OH, 45268, USA
- Adam Biales
- USEPA, Ecological Exposure Research Division, National Exposure Research Laboratory, 26 W Martin Luther King Dr, Cincinnati, OH, 45268, USA
- Robert Flick
- USEPA, Ecological Exposure Research Division, National Exposure Research Laboratory, 26 W Martin Luther King Dr, Cincinnati, OH, 45268, USA
- Jim Lazorchak
- USEPA, Ecological Exposure Research Division, National Exposure Research Laboratory, 26 W Martin Luther King Dr, Cincinnati, OH, 45268, USA
- Daniel Villeneuve
- USEPA, Mid-Continent Ecology Division, National Health and Environmental Effects Research Laboratory, 6201 Congdon Boulevard, Duluth, MN, 55804, USA
- Gerald T Ankley
- USEPA, Mid-Continent Ecology Division, National Health and Environmental Effects Research Laboratory, 6201 Congdon Boulevard, Duluth, MN, 55804, USA
21
Wang SL, Li XL, Fang J. Finding minimum gene subsets with heuristic breadth-first search algorithm for robust tumor classification. BMC Bioinformatics 2012; 13:178. [PMID: 22830977 PMCID: PMC3465202 DOI: 10.1186/1471-2105-13-178] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 12/22/2011] [Accepted: 05/18/2012] [Indexed: 01/03/2023] Open
Abstract
Background Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving the classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which is of great benefit to not only tumor molecular diagnosis but also drug development. Results This paper proposes a novel gene selection method with rich biomedical meaning based on Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method could suffer from over-fitting and selection bias problems. To address these potential problems, a HBSA-based ensemble classifier is constructed using majority voting strategy from individual classifiers constructed by the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes using their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets including three pairs of cross-platform datasets indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtype and even hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. Moreover, they are further justified by analyzing the top-ranked genes in the context of individual gene function, biological pathway, and protein-protein interaction network.
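The occurrence-frequency ranking described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the toy gene subsets are hypothetical, ties are broken alphabetically for reproducibility, and the HBSA search that produces the subsets is not shown.

```python
from collections import Counter


def rank_genes_by_frequency(gene_subsets):
    """Rank genes by how many selected subsets contain them (most frequent first)."""
    counts = Counter(gene for subset in gene_subsets for gene in set(subset))
    # Sort by descending count, then by gene name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical gene subsets, standing in for the output of a subset search.
subsets = [{"TP53", "MYC"}, {"TP53", "EGFR"}, {"TP53", "MYC", "KRAS"}]
for gene, count in rank_genes_by_frequency(subsets):
    print(gene, count)
```

A gene that appears in most selected subsets rises to the top of the list, which is the sense in which a few top-ranked genes can act as candidate biomarkers.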
Affiliation(s)
- Shu-Lin Wang
- Applied Bioinformatics Laboratory, University of Kansas, 2034 Becker Drive, Lawrence, KS 66047, USA
22
Re M, Valentini G. Ensemble Methods. Advances in Machine Learning and Data Mining for Astronomy 2012. [DOI: 10.1201/b11822-34] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 12/24/2022]
23
Flaounas IN, Iakovidis DK, Maroulis DE. Cascading SVMs as a tool for medical diagnosis using multi-class gene expression data. Int J Artif Intell T 2011. [DOI: 10.1142/s0218213006002709] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
In this paper we propose a novel Support Vector Machines-based architecture for medical diagnosis using multi-class gene expression data. It consists of a pre-processing unit and N-1 sequentially ordered blocks capable of classifying N classes in a cascading manner. Each block embodies both a gene selection and a classification module. It offers the flexibility of constructing block-specific gene expression spaces and hypersurfaces for the discrimination of the different classes. The proposed architecture was applied for medical diagnostic tasks including prostate and lung cancer diagnosis. Its performance was evaluated by using a leave-one-out cross validation approach which avoids the bias introduced by the gene selection process. The results show that it provides high accuracy which in most cases exceeds the accuracy achieved by the popular one-vs-one and one-vs-all SVM combination schemes and Nearest-Neighbor classifiers. The cascading SVMs can be successfully applied as a medical diagnostic tool.
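The cascading control flow described in this abstract, N-1 ordered blocks that classify N classes, can be sketched as follows. This is an illustrative assumption rather than the published architecture: each block is reduced to a hypothetical boolean decision function (simple thresholds here, standing in for a gene-selection step plus a trained SVM), and the class names and feature keys are invented.

```python
def cascade_predict(sample, blocks, final_class):
    """Pass a sample through ordered binary blocks; the first block that accepts wins.

    blocks: list of (class_label, decide) pairs, where decide(sample) returns a bool.
    final_class: label assigned when every block rejects the sample.
    """
    for class_label, decide in blocks:
        if decide(sample):
            return class_label
    return final_class


# Two blocks are enough for three classes (N - 1 = 2).
# Each lambda reads only "its own" feature, mimicking block-specific gene selection.
blocks = [
    ("class_1", lambda s: s["gene_a"] > 5.0),
    ("class_2", lambda s: s["gene_b"] > 2.0),
]

print(cascade_predict({"gene_a": 6.0, "gene_b": 0.0}, blocks, "class_3"))
print(cascade_predict({"gene_a": 1.0, "gene_b": 3.0}, blocks, "class_3"))
print(cascade_predict({"gene_a": 1.0, "gene_b": 0.5}, blocks, "class_3"))
```

Because a sample leaves the cascade at the first accepting block, the order of the blocks matters.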
Affiliation(s)
- Ilias N. Flaounas
- Department of Informatics and Telecommunications, National and Kapodistrian Univ. of Athens, Panepistimiopolis, Ilissia, Athens, 15784, Greece
- Dimitris K. Iakovidis
- Department of Informatics and Telecommunications, National and Kapodistrian Univ. of Athens, Panepistimiopolis, Ilissia, Athens, 15784, Greece
- Dimitris E. Maroulis
- Department of Informatics and Telecommunications, National and Kapodistrian Univ. of Athens, Panepistimiopolis, Ilissia, Athens, 15784, Greece
24
Abstract
Global gene expression measurements are increasingly obtained as a function of cell type, spatial position within a tissue and other biologically meaningful coordinates. Such data should enable quantitative analysis of the cell-type specificity of gene expression, but such analyses can often be confounded by the presence of noise. We introduce a specificity measure Spec that quantifies the information in a gene's complete expression profile regarding any given cell type, and an uncertainty measure dSpec, which measures the effect of noise on specificity. Using global gene expression data from the mouse brain, plant root and human white blood cells, we show that Spec identifies genes with variable expression levels that are nonetheless highly specific of particular cell types. When samples from different individuals are used, dSpec measures genes’ transcriptional plasticity in each cell type. Our approach is broadly applicable to mapped gene expression measurements in stem cell biology, developmental biology, cancer biology and biomarker identification. As an example of such applications, we show that Spec identifies a new class of biomarkers, which exhibit variable expression without compromising specificity. The approach provides a unifying theoretical framework for quantifying specificity in the presence of noise, which is widely applicable across diverse biological systems.
Affiliation(s)
- Kenneth D Birnbaum
- Center for Genomics and Systems Biology, Department of Biology, New York University, NY 10003, USA
25
Posorski N, Kaemmerer D, Ernst G, Grabowski P, Hoersch D, Hommann M, von Eggeling F. Localization of sporadic neuroendocrine tumors by gene expression analysis of their metastases. Clin Exp Metastasis 2011; 28:637-47. [PMID: 21681495 DOI: 10.1007/s10585-011-9397-5] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 12/13/2010] [Accepted: 05/31/2011] [Indexed: 01/20/2023]
Abstract
A characteristic of human gastroenteropancreatic neuroendocrine tumors (GEP-NET) is a minute, unobtrusive primary tumor which often cannot be detected by common physical examinations. It therefore remains unidentified until the tumor has spread and space-occupying metastases cause clinical symptoms leading to diagnosis. Cases in which the primary cannot be located are referred to as NET with CUP-syndrome (cancer of unknown primary syndrome). With the help of array-CGH (comparative genomic hybridization, Agilent 105K) and gene expression analysis (Agilent 44K), microdissected primaries and their metastases were compared to identify up- and down-regulated genes which can be used as markers for tumor progression. In a subsequent analysis step, a hierarchical clustering of 41,078 genes revealed three genes [C-type lectin domain family 13 member A (CD302), peptidylprolyl isomerase containing WD40 repeat (PPWD1) and abhydrolase domain containing 14B (ABHD14B)] whose expression levels can categorize the metastases into three groups depending on the localization of their primary. Because cancer therapy is dependent on the localization of the primary, the expression levels of these three genes are promising markers to unravel the CUP syndrome in NET.
Affiliation(s)
- Nicole Posorski
- Core Unit Chip Application, Institute of Human Genetics, UKJ, University Hospital Jena, Germany
26
Kilpinen SK, Ojala KA, Kallioniemi OP. Alignment of gene expression profiles from test samples against a reference database: New method for context-specific interpretation of microarray data. BioData Min 2011; 4:5. [PMID: 21453538 PMCID: PMC3080808 DOI: 10.1186/1756-0381-4-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/15/2010] [Accepted: 03/31/2011] [Indexed: 02/07/2023] Open
Abstract
Background Gene expression microarray data have been organized and made available as public databases, but the utilization of such highly heterogeneous reference datasets in the interpretation of data from individual test samples is not as developed as e.g. in the field of nucleotide sequence comparisons. We have created a rapid and powerful approach for the alignment of microarray gene expression profiles (AGEP) from test samples with those contained in a large annotated public reference database and demonstrate here how this can facilitate interpretation of microarray data from individual samples. Methods AGEP is based on the calculation of kernel density distributions for the levels of expression of each gene in each reference tissue type and provides a quantitation of the similarity between the test sample and the reference tissue types as well as the identity of the typical and atypical genes in each comparison. As a reference database, we used 1654 samples from 44 normal tissues (extracted from the Genesapiens database). Results Using leave-one-out validation, AGEP correctly defined the tissue of origin for 1521 (93.6%) of all the 1654 samples in the original database. Independent validation of 195 external normal tissue samples resulted in 87% accuracy for the exact tissue type and 97% accuracy with related tissue types. AGEP analysis of 10 Duchenne muscular dystrophy (DMD) samples provided quantitative description of the key pathogenetic events, such as the extent of inflammation, in individual samples and pinpointed tissue-specific genes whose expression changed (SAMD4A) in DMD. AGEP analysis of microarray data from adipocytic differentiation of mesenchymal stem cells and from normal myeloid cell types and leukemias provided quantitative characterization of the transcriptomic changes during normal and abnormal cell differentiation. 
Conclusions The AGEP method is a widely applicable method for the rapid comprehensive interpretation of microarray data, as proven here by the definition of tissue- and disease-specific changes in gene expression as well as during cellular differentiation. The capability to quantitatively compare data from individual samples against a large-scale annotated reference database represents a widely applicable paradigm for the analysis of all types of high-throughput data. AGEP enables systematic and quantitative comparison of gene expression data from test samples against a comprehensive collection of different cell/tissue types previously studied by the entire research community.
Affiliation(s)
- Sami K Kilpinen
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Tukholmankatu 8, Helsinki, Finland.
27
Tapia E, Ornella L, Bulacio P, Angelone L. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics 2011; 12:59. [PMID: 21342522 PMCID: PMC3056725 DOI: 10.1186/1471-2105-12-59] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 07/21/2010] [Accepted: 02/22/2011] [Indexed: 01/05/2023] Open
Abstract
Background Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.
Affiliation(s)
- Elizabeth Tapia
- CIFASIS-Conicet Institute, Bv, 27 de Febrero 210 Bis, Rosario, Argentina.
28
Pillai R, Deeter R, Rigl CT, Nystrom JS, Miller MH, Buturovic L, Henner WD. Validation and reproducibility of a microarray-based gene expression test for tumor identification in formalin-fixed, paraffin-embedded specimens. J Mol Diagn 2010; 13:48-56. [PMID: 21227394 DOI: 10.1016/j.jmoldx.2010.11.001] [Citation(s) in RCA: 103] [Impact Index Per Article: 6.9] [Received: 05/07/2010] [Revised: 06/28/2010] [Accepted: 07/30/2010] [Indexed: 12/31/2022] Open
Abstract
Tumors whose primary site is challenging to diagnose represent a considerable proportion of new cancer cases. We present validation study results for a gene expression-based diagnostic test (the Pathwork Tissue of Origin Test) that aids in determining the tissue of origin using formalin-fixed, paraffin-embedded (FFPE) specimens. Microarray data files were generated for 462 metastatic, poorly differentiated, or undifferentiated FFPE tumor specimens, all of which had a reference diagnosis. The reference diagnoses were masked, and the microarray data files were analyzed using a 2000-gene classification model. The algorithm quantifies the similarity between RNA expression patterns of the study specimens and the 15 tissues on the test panel. Among the 462 specimens, overall agreement with the reference diagnosis was 89% (95% CI, 85% to 91%). In addition to the positive test results (ie, rule-ins), an average of 12 tissues for each specimen could be ruled out with >99% probability. The large size of this study increases confidence in the test results. A multisite reproducibility study showed 89.3% concordance between laboratories. The Tissue of Origin Test makes the benefits of microarray-based gene expression tests for tumor diagnosis available for use with the most common type of histology specimen (ie, FFPE).
Affiliation(s)
- Raji Pillai
- Pathwork Diagnostics, Inc., Redwood City, California 94063-4737, USA.
29
A robust ensemble classification method analysis. Advances in Experimental Medicine and Biology 2010. [PMID: 20865496 DOI: 10.1007/978-1-4419-5913-3_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Apart from the dimensionality problem, the uncertainty of Microarray data quality is another major challenge of Microarray classification. Microarray data contain various levels of noise and quite often high levels of noise, and these data lead to unreliable and low accuracy analysis as well as high dimensionality problem. In this paper, we propose a new Microarray data classification method, based on diversified multiple trees. The new method contains features that (1) make most use of the information from the abundant genes in the Microarray data and (2) use a unique diversity measurement in the ensemble decision committee. The experimental results show that the proposed classification method (DMDT) and the well-known method (CS4), which diversifies trees by using distinct tree roots, are more accurate on average than other well-known ensemble methods, including Bagging, Boosting, and Random Forests. The experiments also indicate that using diversity measurement of DMDT improves the classification accuracy of ensemble classification on Microarray data.
|
30
|
Stability of ranked gene lists in large microarray analysis studies. J Biomed Biotechnol 2010; 2010:616358. [PMID: 20625502 PMCID: PMC2896709 DOI: 10.1155/2010/616358] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2010] [Accepted: 05/17/2010] [Indexed: 11/29/2022] Open
Abstract
This paper presents an empirical study that aims to explain the relationship between the number of samples and the stability of different gene selection techniques for microarray datasets. Unlike other similar studies, where the number of genes in a ranked gene list is variable, this study uses an alternative approach in which stability is observed at different numbers of samples used for gene selection. Three different metrics of stability, including a novel metric in bioinformatics, were used to estimate the stability of the ranked gene lists. Results of this study demonstrate that the univariate selection methods produce significantly more stable ranked gene lists than the multivariate selection methods used in this study. More specifically, thousands of samples are needed for these multivariate selection methods to achieve the same level of stability that any given univariate selection method can achieve with only hundreds.
|
31
|
Eddy JA, Sung J, Geman D, Price ND. Relative expression analysis for molecular cancer diagnosis and prognosis. Technol Cancer Res Treat 2010; 9:149-59. [PMID: 20218737 DOI: 10.1177/153303461000900204] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
The enormous amount of biomolecule measurement data generated from high-throughput technologies has brought an increased need for computational tools in biological analyses. Such tools can enhance our understanding of human health and genetic diseases, such as cancer, by accurately classifying phenotypes, detecting the presence of disease, discriminating among cancer sub-types, predicting clinical outcomes, and characterizing disease progression. In the case of gene expression microarray data, standard statistical learning methods have been used to identify classifiers that can accurately distinguish disease phenotypes. However, these mathematical prediction rules are often highly complex, and they lack the convenience and simplicity desired for extracting underlying biological meaning or transitioning into the clinic. In this review, we survey a powerful collection of computational methods for analyzing transcriptomic microarray data that address these limitations. Relative Expression Analysis (RXA) is based only on the relative orderings among the expressions of a small number of genes. Specifically, we provide a description of the first and simplest example of RXA, the K-TSP classifier, which is based on K pairs of genes; the case K = 1 is the TSP classifier. Given their simplicity and ease of biological interpretation, as well as their invariance to data normalization and parameter-fitting, these classifiers have been widely applied in aiding molecular diagnostics in a broad range of human cancers. We review several studies which demonstrate accurate classification of disease phenotypes (e.g., cancer vs. normal), cancer subclasses (e.g., AML vs. ALL, GIST vs. LMS), disease outcomes (e.g., metastasis, survival), and diverse human pathologies assayed through blood-borne leukocytes. The studies presented demonstrate that RXA, specifically the TSP and K-TSP classifiers, is a promising new class of computational methods for analyzing high-throughput data, and has the potential to significantly contribute to molecular cancer diagnosis and prognosis.
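The single-pair (K = 1) rule described in this abstract can be illustrated with a minimal sketch. It assumes the commonly stated TSP scoring rule, the absolute difference between classes in the frequency with which feature i is below feature j, and binary labels 0/1 with at least one sample per class. The function names `fit_tsp` and `predict_tsp` and the toy data are hypothetical and are not taken from the cited review.

```python
from itertools import combinations

def fit_tsp(samples, labels):
    """Return (i, j, class_if_less) for the top-scoring pair.

    samples: list of equal-length lists of expression values.
    labels:  list of 0/1 class labels (both classes must be present).
    """
    n_features = len(samples[0])
    groups = {
        0: [s for s, y in zip(samples, labels) if y == 0],
        1: [s for s, y in zip(samples, labels) if y == 1],
    }
    best = None
    for i, j in combinations(range(n_features), 2):
        # Fraction of samples in each class where feature i < feature j.
        p = {c: sum(s[i] < s[j] for s in g) / len(g) for c, g in groups.items()}
        score = abs(p[1] - p[0])
        # Ties keep the first pair found (an arbitrary choice for this sketch).
        if best is None or score > best[0]:
            best = (score, i, j, 1 if p[1] > p[0] else 0)
    _, i, j, class_if_less = best
    return i, j, class_if_less

def predict_tsp(rule, sample):
    """Classify one sample using only the ordering of the chosen pair."""
    i, j, class_if_less = rule
    return class_if_less if sample[i] < sample[j] else 1 - class_if_less

# Toy data (invented for illustration): three features, four samples.
samples = [[1, 5, 3], [2, 6, 1], [7, 3, 4], [9, 2, 8]]
labels = [0, 0, 1, 1]
rule = fit_tsp(samples, labels)
```

Because the rule compares only the ordering of two values within a sample, rescaling a sample by any positive factor leaves the prediction unchanged, which is the normalization invariance the abstract refers to.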
Affiliation(s)
- James A Eddy
- Institute for Genomic Biology, University of Illinois, Urbana, IL 61801, USA
|
32
|
Staub E, Buhr HJ, Gröne J. Predicting the site of origin of tumors by a gene expression signature derived from normal tissues. Oncogene 2010; 29:4485-92. [PMID: 20514016 DOI: 10.1038/onc.2010.196] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
Abstract
Multiple expression signatures for the prediction of the site of origin of metastatic cancer of unknown primary origin (CUP) have been developed. Owing to their limited coverage of tumor types and suboptimal prediction accuracy on distinct tumors, there is still room for alternative CUP gene expression signatures. Whereas in past studies, CUP classifiers were trained solely on data from tumor samples, we now use expression patterns from normal tissues for classifier training. This approach potentially avoids pitfalls related to the representation of genetically heterogeneous tumor subtypes during classifier training. Two expression data sets of normal human tissues have been reanalyzed to derive an expression signature for liver, prostate, kidney, ovarian and lung tissues. In reciprocal validation, classifiers trained on either data set achieved overall accuracies greater than 97%. Classifiers trained on combined expression data from both normal tissue data sets were able to predict the site of origin in a cohort of 652 primary tumors with approximately 90% accuracy. Prediction accuracies of primary cancer-based classifiers were in the same range, as determined by cross-validation on this cohort. For individual tumor types, normal tissue-based classifiers achieved sensitivities in the range of 64-99% and specificities in the range of 92-100%. Primary origins for 12 of 20 metastases were predicted correctly, with false predictions highlighting the need for accurate sample preparation to avoid contaminations by metastases-surrounding tissue. We conclude that gene expression patterns of normal tissues harbor phenotypic information that is retained in tumors and can be sufficient to recover the type of primary tumor from expression patterns alone.
Affiliation(s)
- E Staub
- Merck KGaA, Merck Serono, Drug Discovery Informatics, Darmstadt, Germany.
|
33
|
Forest classification trees and forest support vector machines algorithms: Demonstration using microarray data. Comput Biol Med 2010; 40:519-24. [DOI: 10.1016/j.compbiomed.2010.03.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2009] [Revised: 01/09/2010] [Accepted: 03/22/2010] [Indexed: 11/22/2022]
|
34
|
Zhang W, Robbins K, Wang Y, Bertrand K, Rekaya R. A jackknife-like method for classification and uncertainty assessment of multi-category tumor samples using gene expression information. BMC Genomics 2010; 11:273. [PMID: 20429942 PMCID: PMC2876124 DOI: 10.1186/1471-2164-11-273] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2009] [Accepted: 04/29/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The use of gene expression profiling for the classification of human cancer tumors has been widely investigated. Previous studies were successful in distinguishing several tumor types in binary problems. As there are over a hundred types of cancers, and potentially even more subtypes, it is essential to develop multi-category methodologies for molecular classification for any meaningful practical application. RESULTS A jackknife-based supervised learning method called paired-samples test algorithm (PST), coupled with a binary classification model based on linear regression, was proposed and applied to two well-known and challenging datasets consisting of 14 (GCM dataset) and 9 (NCI60 dataset) tumor types. The results showed that the proposed method improved the prediction accuracy of the test samples for the GCM dataset, especially when the t-statistic was used in the primary feature selection. For the NCI60 dataset, the application of PST improved prediction accuracy when the number of genes used was relatively small (100 or 200). These improvements made the binary classification method more robust to the gene selection mechanism and the number of genes used. The overall prediction accuracies were competitive in comparison to the most accurate results obtained by several previous studies on the same datasets and with other methods. Furthermore, the relative confidence R(T) provided a unique insight into the sources of the uncertainty shown in the statistical classification and the potential variants within the same tumor type. CONCLUSION We proposed a novel bagging method for the classification and uncertainty assessment of multi-category tumor samples using gene expression information. The strengths were demonstrated in the application to two benchmark datasets.
Affiliation(s)
- Wensheng Zhang
- Department of Animal and Dairy Science, University of Georgia, Athens, GA 30602, USA
|
35
|
Kim J, Eberwine J. RNA: state memory and mediator of cellular phenotype. Trends Cell Biol 2010; 20:311-8. [PMID: 20382532 DOI: 10.1016/j.tcb.2010.03.003] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2009] [Revised: 03/15/2010] [Accepted: 03/18/2010] [Indexed: 12/11/2022]
Abstract
It has become increasingly clear that the genome is dynamic and exquisitely sensitive, changing expression patterns in response to age, environmental stimuli and pharmacological and physiological manipulations. Similarly, cellular phenotype, traditionally viewed as a stable end-state, should be viewed as versatile and changeable. The phenotype of a cell is better defined as a 'homeostatic phenotype' implying plasticity resulting from a dynamically changing yet characteristic pattern of gene/protein expression. A stable change in phenotype is the result of the movement of a cell between different multidimensional identity spaces. Here, we describe a key driver of this transition and the stabilizer of phenotype: the relative abundances of the cellular RNAs. We argue that the quantitative state of RNA can be likened to a state memory, that when transferred between cells, alters the phenotype in a predictable manner.
Affiliation(s)
- Junhyong Kim
- Penn Genome Frontiers Institute, Department of Biology, University of Pennsylvania Medical School, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
36
|
Pérez NF, Ferré J, Boqué R. Multi-class classification with probabilistic discriminant partial least squares (p-DPLS). Anal Chim Acta 2010; 664:27-33. [DOI: 10.1016/j.aca.2010.01.059] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2009] [Revised: 01/22/2010] [Accepted: 01/29/2010] [Indexed: 10/19/2022]
|
37
|
Joseph SJ, Robbins KR, Zhang W, Rekaya R. Comparison of two output-coding strategies for multi-class tumor classification using gene expression data and Latent Variable Model as binary classifier. Cancer Inform 2010; 9:39-48. [PMID: 20458360 PMCID: PMC2865770 DOI: 10.4137/cin.s3827] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Multi-class cancer classification based on microarray data is described. A generalized output-coding scheme based on One Versus One (OVO) combined with a Latent Variable Model (LVM) is used. Results from the proposed OVO output-coding strategy are compared with those obtained from the generalized One Versus All (OVA) method, and the efficiency of each for multi-class tumor classification is studied. This comparative study was done using two microarray gene expression datasets: the Global Cancer Map (GCM) dataset and a brain cancer (BC) dataset. Primary feature selection was based on fold change and penalized t-statistics. Evaluation was conducted with varying feature numbers. The OVO coding strategy worked quite well with the BC data, while OVO and OVA results seemed to be similar for the GCM data. The selection of output coding methods for combining binary classifiers for multi-class tumor classification depends on the number of tumor types considered, the discrepancies between the tumor samples used for training, and the heterogeneity of expression within the cancer subtypes used as training data.
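The difference between the two output-coding strategies named in this abstract can be sketched as follows. This is an illustration only: a simple nearest-centroid scorer stands in for the Latent Variable Model binary classifier, which the abstract does not specify, and all function names and toy data are hypothetical.

```python
from itertools import combinations

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_binary(pos_rows, neg_rows):
    """Stand-in binary classifier (NOT the LVM): score > 0 means 'positive'."""
    cp, cn = centroid(pos_rows), centroid(neg_rows)
    return lambda x: sq_dist(x, cn) - sq_dist(x, cp)

def fit_ovo(X, y):
    """One Versus One: one binary model per unordered pair of classes."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return {(a, b): train_binary(by_class[a], by_class[b])
            for a, b in combinations(sorted(by_class), 2)}

def predict_ovo(models, x):
    """Each pairwise model casts one vote; the most-voted class wins."""
    votes = {}
    for (a, b), score in models.items():
        winner = a if score(x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(sorted(votes), key=votes.get)

def fit_ova(X, y):
    """One Versus All: one binary model per class against the rest."""
    return {c: train_binary([r for r, l in zip(X, y) if l == c],
                            [r for r, l in zip(X, y) if l != c])
            for c in sorted(set(y))}

def predict_ova(models, x):
    """The class whose 'versus all' model is most confident wins."""
    return max(sorted(models), key=lambda c: models[c](x))

# Toy three-class data (invented for illustration).
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = ["a", "a", "b", "b", "c", "c"]
ovo, ova = fit_ovo(X, y), fit_ova(X, y)
```

With k classes, OVO trains k(k-1)/2 models on two classes each, while OVA trains k models on all samples, which is one reason the better choice can depend on the number of tumor types and on how heterogeneous the training samples are.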
Affiliation(s)
- Sandeep J Joseph
- Rhodes Centre for Animal and Dairy Science, University of Georgia, Athens, GA 30605, USA
|
38
|
Staub E, Buhr HJ, Gröne J. WITHDRAWN: Predicting the site of origin of tumors by a gene expression signature derived from normal tissues. Oncogene 2009:onc2009398. [PMID: 19915613 DOI: 10.1038/onc.2009.398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2009] [Revised: 10/09/2009] [Accepted: 10/12/2009] [Indexed: 12/30/2022]
Abstract
Multiple expression signatures for the prediction of the site of origin of metastatic cancers of unknown primary origin (CUP) have been developed. Owing to their limited coverage of tumor types and suboptimal prediction accuracy on distinct tumors there is still room for alternative CUP gene expression signatures. Whereas in past studies CUP classifiers were solely trained on data from tumor samples, we now use expression patterns from normal tissues for classifier training. This approach potentially avoids pitfalls related to the representation of genetically heterogeneous tumor subtypes during classifier training. Two expression data sets of normal human tissues have been reanalysed to derive an expression signature for liver, prostate, kidney, ovarian and lung tissues. In reciprocal validation classifiers trained on either data set achieved overall accuracies greater than 97%. Classifiers trained on combined expression data from both normal tissue data sets were able to predict the site of origin in a cohort of 652 primary tumors with approximately 90% accuracy. Prediction accuracies of primary cancer-based classifiers were in the same range as determined by cross-validation on this cohort. For individual tumor types, normal tissue-based best-centroid classifiers achieved sensitivities ranging from 71 to 99% and specificities ranging from 91 to 99%. Primary origins for 12 of 20 metastases were predicted correctly with false predictions highlighting the need for accurate sample preparation to avoid contaminations by metastases-surrounding tissue. We conclude that gene expression patterns of normal tissues harbor phenotypic information that is retained in tumors and can be sufficient to recover the type of a primary tumor from expression patterns alone. Oncogene advance online publication, 16 November 2009; doi:10.1038/onc.2009.398.
Affiliation(s)
- E Staub
- Drug Discovery Informatics, Merck Serono, Merck KGaA, Darmstadt, Germany
|
39
|
Keerthikumar S, Bhadra S, Kandasamy K, Raju R, Ramachandra YL, Bhattacharyya C, Imai K, Ohara O, Mohan S, Pandey A. Prediction of candidate primary immunodeficiency disease genes using a support vector machine learning approach. DNA Res 2009; 16:345-51. [PMID: 19801557 PMCID: PMC2780952 DOI: 10.1093/dnares/dsp019] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
Screening and early identification of primary immunodeficiency disease (PID) genes is a major challenge for physicians. Many resources have catalogued molecular alterations in known PID genes along with their associated clinical and immunological phenotypes. However, these resources do not assist in identifying candidate PID genes. We have recently developed a platform designated Resource of Asian PDIs, which hosts information pertaining to molecular alterations, protein-protein interaction networks, mouse studies and microarray gene expression profiling of all known PID genes. Using this resource as a discovery tool, we describe the development of an algorithm for prediction of candidate PID genes. Using a support vector machine learning approach, we have predicted 1442 candidate PID genes using 69 binary features of 148 known PID genes and 3162 non-PID genes as a training data set. The power of this approach is illustrated by the fact that six of the predicted genes have recently been experimentally confirmed to be PID genes. The remaining genes in this predicted data set represent attractive candidates for testing in patients where the etiology cannot be ascribed to any of the known PID genes.
|
40
|
Yang TY. Simple Bayesian binary framework for discovering significant genes and classifying cancer diagnosis. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.04.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
41
|
Abstract
What if there was a rapid, inexpensive, and accurate blood diagnostic that could determine which patients were infected, identify the organism(s) responsible, and identify patients who were not responding to therapy? We hypothesized that systems analysis of the transcriptional activity of circulating immune effector cells could be used to identify conserved elements in the host response to systemic inflammation, and furthermore, to discriminate between sterile and infectious etiologies. We review herein a validated, systems biology approach demonstrating that 1) abdominal and pulmonary sepsis diagnoses can be made in mouse models using microarray (RNA) data from circulating blood, 2) blood microarray data can be used to differentiate between the host response to Gram-negative and Gram-positive pneumonia, 3) the endotoxin response of normal human volunteers can be mapped at the level of gene expression, and 4) a similar strategy can be used in the critically ill to follow septic patients and quantitatively determine immune recovery. These findings provide the foundation of immune cartography and demonstrate the potential of this approach for rapidly diagnosing sepsis and identifying pathogens. Further, our data suggest a new approach to determine how specific pathogens perturb the physiology of circulating leukocytes in a cell-specific manner. Large, prospective clinical trials are needed to validate the clinical utility of leukocyte RNA diagnostics (e.g., the riboleukogram).
|
42
|
Andronesi OC, Blekas KD, Mintzopoulos D, Astrakas L, Black PM, Tzika AA. Molecular classification of brain tumor biopsies using solid-state magic angle spinning proton magnetic resonance spectroscopy and robust classifiers. Int J Oncol 2008; 33:1017-25. [PMID: 18949365 DOI: 10.3892/ijo_00000000] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022] Open
Abstract
Brain tumors are one of the leading causes of death in adults with cancer; however, molecular classification of these tumors with in vivo magnetic resonance spectroscopy (MRS) is limited because of the small number of metabolites detected. In vitro MRS provides highly informative biomarker profiles at higher fields, but also consumes the sample so that it is unavailable for subsequent analysis. In contrast, ex vivo high-resolution magic angle spinning (HRMAS) MRS conserves the sample but requires large samples and can pose technical challenges for producing accurate data, depending on the sample testing temperature. We developed a novel approach that combines a two-dimensional (2D), solid-state, HRMAS proton (1H) NMR method, TOBSY (total through-bond spectroscopy), which maximizes the advantages of HRMAS and a robust classification strategy. We used approximately 2 mg of tissue at -8 degrees C from each of 55 brain biopsies, and reliably detected 16 different biologically relevant molecular species. We compared two classification strategies, the support vector machine (SVM) classifier and a feed-forward neural network using the Levenberg-Marquardt back-propagation algorithm. We used the minimum redundancy/maximum relevance (MRMR) method as a powerful feature-selection scheme along with the SVM classifier. We suggest that molecular characterization of brain tumors based on highly informative 2D MRS should enable us to type and prognose even inoperable patients with high accuracy in vivo.
Affiliation(s)
- Ovidiu C Andronesi
- NMR Surgical Laboratory, Department of Surgery, Harvard Medical School and Massachusetts General Hospital, Boston, MA 02114, USA
|
43
|
Hong JH, Cho SB. A probabilistic multi-class strategy of one-vs.-rest support vector machines for cancer classification. Neurocomputing 2008. [DOI: 10.1016/j.neucom.2008.04.033] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
44
|
Kianmehr K, Alhajj R. CARSVM: A class association rule-based classification framework and its application to gene expression data. Artif Intell Med 2008; 44:7-25. [DOI: 10.1016/j.artmed.2008.05.002] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2007] [Revised: 05/10/2008] [Accepted: 05/13/2008] [Indexed: 12/01/2022]
|
45
|
Trajkovski I, Lavrač N, Tolar J. SEGS: Search for enriched gene sets in microarray data. J Biomed Inform 2008; 41:588-601. [DOI: 10.1016/j.jbi.2007.12.001] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2007] [Revised: 10/08/2007] [Accepted: 12/04/2007] [Indexed: 01/21/2023]
|
46
|
Yang CS, Chuang LY, Ke CH, Yang CH. A Combination of Shuffled Frog-Leaping Algorithm and Genetic Algorithm for Gene Selection. JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS 2008. [DOI: 10.20965/jaciii.2008.p0218] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Microarray data on gene expression profiles provide valuable answers to a variety of problems and contribute to advances in clinical medicine. The application of microarray data to the classification of cancer types has recently assumed increasing importance. The classification of microarray data samples involves feature selection, whose goal is to identify subsets of differentially expressed genes potentially relevant for distinguishing sample classes, and classifier design. We propose an efficient evolutionary approach for selecting gene subsets from gene expression data that effectively achieves higher accuracy for classification problems. Our proposal combines a shuffled frog-leaping algorithm (SFLA) and a genetic algorithm (GA), and chooses genes (features) related to classification. The K-nearest neighbor (KNN) method with leave-one-out cross-validation (LOOCV) is used to evaluate classification accuracy. We apply a novel hybrid approach based on SFLA-GA and KNN classification and compare 11 classification problems from the literature. Experimental results show that classification accuracy obtained using selected features was higher than the accuracy of datasets without feature selection.
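The evaluation step named in this abstract, KNN accuracy under leave-one-out cross-validation for a candidate gene subset, can be sketched as below. The SFLA-GA search itself is not shown. The function name `loocv_knn_accuracy`, the use of squared Euclidean distance, the tie-breaking rule, and the toy data are all assumptions made for illustration.

```python
def loocv_knn_accuracy(X, y, features, k=1):
    """Leave-one-out accuracy of a k-nearest-neighbour classifier
    restricted to the given feature (gene) indices."""
    correct = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Squared Euclidean distance to every other sample, on the subset only.
        neighbours = sorted(
            (sum((xi[f] - xj[f]) ** 2 for f in features), yj)
            for j, (xj, yj) in enumerate(zip(X, y)) if j != i
        )[:k]
        votes = {}
        for _, label in neighbours:
            votes[label] = votes.get(label, 0) + 1
        # Majority vote; ties go to the smallest label (arbitrary choice).
        predicted = max(sorted(votes), key=votes.get)
        correct += predicted == yi
    return correct / len(X)

# Toy data (invented): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 9.0], [1.2, 2.0], [5.0, 8.5], [5.1, 1.5]]
y = [0, 0, 1, 1]
acc_informative = loocv_knn_accuracy(X, y, features=[0])
acc_noise = loocv_knn_accuracy(X, y, features=[1])
```

A search procedure such as the one described would call a function like this repeatedly, treating the returned accuracy as the fitness of each candidate gene subset.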
|
47
|
Gupta A, Bar-Joseph Z. Extracting dynamics from static cancer expression data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:172-182. [PMID: 18451427 DOI: 10.1109/tcbb.2007.70233] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/26/2023]
Abstract
Static expression experiments analyze samples from many individuals. These samples are often snapshots of the progression of a certain disease such as cancer. This raises an intriguing question: Can we determine a temporal order for these samples? Such an ordering can lead to better understanding of the dynamics of the disease and to the identification of genes associated with its progression. In this paper we formally prove, for the first time, that under a model for the dynamics of the expression levels of a single gene, it is indeed possible to recover the correct ordering of the static expression datasets by solving an instance of the traveling salesman problem (TSP). In addition, we devise an algorithm that combines a TSP heuristic and probabilistic modeling for inferring the underlying temporal order of the microarray experiments. This algorithm constructs probabilistic continuous curves to represent expression profiles leading to accurate temporal reconstruction for human data. Applying our method to cancer expression data we show that the ordering derived agrees well with survival duration. A classifier that utilizes this ordering improves upon other classifiers suggested for this task. The set of genes displaying consistent behavior for the determined ordering are enriched for genes associated with cancer progression.
Affiliation(s)
- Anupam Gupta
- Department of Computer Science, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
|
48
|
Medeiros F, Rigl CT, Anderson GG, Becker SH, Halling KC. Tissue handling for genome-wide expression analysis: a review of the issues, evidence, and opportunities. Arch Pathol Lab Med 2008; 131:1805-16. [PMID: 18081440 DOI: 10.5858/2007-131-1805-thfgea] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/08/2007] [Indexed: 11/06/2022]
Abstract
CONTEXT Molecular diagnostic applications that use microarrays to analyze large numbers of genes simultaneously require high-quality mRNA. As these genome-wide expression assays become more commonly used in medical practice, pathologists and oncologists will benefit from understanding the importance of obtaining high-quality RNA in order to generate reliable diagnostic and prognostic information, especially as these relate to cancer. OBJECTIVE To review the effects that different tissue preservation techniques have on RNA quality and to provide practical advice on changes in tissue acquisition and handling that may soon be needed for certain clinical situations. DATA SOURCES A review of recent literature on RNA quality, tissue fixation, cancer diagnosis, and gene expression analysis. CONCLUSIONS Studies have consistently shown that frozen tissue yields more intact RNA than formalin-fixed, paraffin-embedded tissue. The chemical modification, cross-linking, and fragmentation caused by formalin fixation often render RNA unsuitable for microarray analysis. Thus, when expression analysis involving hundreds or more than 1000 gene markers is contemplated, pathologists should consider freezing a specimen within half an hour (preferably within minutes) of surgical resection and storing it at -80 degrees C or below. In coming years, pathologists will need to work closely with oncologists and other clinicians to determine when saving frozen tissue for microarray expression analysis is both practical and necessary. In select cases, the benefit of implementing a few extra tissue-handling steps may improve diagnostic and prognostic capability.
Affiliation(s)
- Fabiola Medeiros
- Department of Laboratory Medicine and Pathology, Division of Laboratory Genetics, Mayo Clinic, Rochester, Minn, USA
|
49
|
Jen CH, Yang TP, Tung CY, Su SH, Lin CH, Hsu MT, Wang HW. Signature Evaluation Tool (SET): a Java-based tool to evaluate and visualize the sample discrimination abilities of gene expression signatures. BMC Bioinformatics 2008; 9:58. [PMID: 18221568 PMCID: PMC2248562 DOI: 10.1186/1471-2105-9-58] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2007] [Accepted: 01/28/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The identification of specific gene expression signatures for distinguishing sample groups is a dominant field in cancer research. Although a number of tools have been developed to identify optimal gene expression signatures, the number of signature genes obtained is often too large to be applied clinically. Furthermore, experimental verification is sometimes limited by the availability of wet-lab materials such as antibodies and reagents. A tool to evaluate the discrimination power of candidate genes is therefore in high demand by clinical researchers. RESULTS Signature Evaluation Tool (SET) is a Java-based tool adopting Golub's weighted voting algorithm as well as incorporating the visual presentation of prediction strength for each array sample. SET provides a flexible and easy-to-follow platform to evaluate the discrimination power of a gene signature. Here, we demonstrated the application of SET for several purposes: (1) for signatures consisting of a large number of genes, SET offers the ability to rapidly narrow down the number of genes; (2) for a given signature (from third party analyses or user-defined), SET can re-evaluate and re-adjust its discrimination power by selecting/de-selecting genes repeatedly; (3) for multiple microarray datasets, SET can evaluate the classification capability of a signature among datasets; and (4) by providing a module to visualize the prediction strength for each sample, SET allows users to re-evaluate the discrimination power on mis-grouped or less-certain samples. Information obtained from the above applications could be useful in prognostic analyses or clinical management decisions. CONCLUSION Here we present SET to evaluate and visualize the sample-discrimination ability of a given gene expression signature. This tool provides a filtration function for signature identification and lies between clinical analyses and class prediction (or feature selection) tools. The simplicity, flexibility and brevity of SET could make it an invaluable tool for marker identification in clinical research.
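The weighted voting scheme and prediction strength mentioned in this abstract can be sketched roughly as follows. This is written in Python rather than Java and is not SET's code. It assumes the usual formulation: a signal-to-noise weight per gene, a decision boundary midway between the class means, and prediction strength as the normalized margin between the two vote totals. The function names, the use of the sample standard deviation, and the toy data are assumptions.

```python
from statistics import mean, stdev

def fit_weighted_voting(X, y, genes):
    """Per-gene (weight, boundary) from two-class training data.

    Assumes labels are 0/1, each class has at least two samples, and
    the within-class standard deviations are not both zero.
    """
    params = {}
    for g in genes:
        a = [row[g] for row, label in zip(X, y) if label == 1]
        b = [row[g] for row, label in zip(X, y) if label == 0]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))  # signal-to-noise
        boundary = (mean(a) + mean(b)) / 2                    # midpoint of class means
        params[g] = (weight, boundary)
    return params

def predict_weighted_voting(params, sample):
    """Return (predicted_class, prediction_strength in [0, 1])."""
    votes = [w * (sample[g] - b) for g, (w, b) in params.items()]
    v1 = sum(v for v in votes if v > 0)    # total vote for class 1
    v0 = -sum(v for v in votes if v < 0)   # total vote for class 0
    total = v1 + v0
    strength = abs(v1 - v0) / total if total else 0.0
    return (1 if v1 >= v0 else 0), strength

# Toy data (invented): gene 0 is high in class 1, gene 1 is high in class 0.
X = [[10, 1], [12, 2], [2, 8], [4, 9]]
y = [1, 1, 0, 0]
params = fit_weighted_voting(X, y, genes=[0, 1])
```

A sample whose genes disagree, such as `[11, 9]` here, still receives a class label but with a prediction strength well below 1, which is the kind of less-certain sample the abstract suggests re-evaluating.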
Affiliation(s)
- Chih-Hung Jen
- Microarray & Gene Expression Analysis Core Facility, VGH National Yang-Ming University Genome Research Center, Taipei, Taiwan.
|
50
|
Fujita A, Sato JR, Ferreira CE, Sogayar MC. GEDI: a user-friendly toolbox for analysis of large-scale gene expression data. BMC Bioinformatics 2007; 8:457. [PMID: 18021455 PMCID: PMC2194737 DOI: 10.1186/1471-2105-8-457] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2007] [Accepted: 11/19/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Several mathematical and statistical methods have been proposed in the last few years to analyze microarray data. Most of those methods involve complicated formulas and software implementations that require advanced computer programming skills. Researchers from other areas may experience difficulties when attempting to use those methods in their research. Here we present a user-friendly toolbox which allows large-scale gene expression analysis to be carried out by biomedical researchers with limited programming skills. RESULTS Here, we introduce a user-friendly toolbox called GEDI (Gene Expression Data Interpreter), an extensible, open-source, and freely available tool that we believe will be useful to a wide range of laboratories, and to researchers with no background in Mathematics and Computer Science, allowing them to analyze their own data by applying both classical and advanced approaches developed and recently published by Fujita et al. CONCLUSION GEDI is an integrated user-friendly viewer that combines the state-of-the-art SVR, DVAR and SVAR algorithms, previously developed by us. It facilitates the application of SVR, DVAR and SVAR beyond the mathematical formulas presented in the corresponding publications, and allows one to better understand the results by means of the available visualizations. Both running the statistical methods and visualizing the results are carried out within the graphical user interface, rendering these algorithms accessible to the broad community of researchers in Molecular Biology.
Affiliation(s)
- André Fujita
- Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 - São Paulo, 05508-900, SP, Brazil.
|