1
Gentry AE, Jackson-Cook CK, Lyon DE, Archer KJ. Penalized Ordinal Regression Methods for Predicting Stage of Cancer in High-Dimensional Covariate Spaces. Cancer Inform 2015; 14:201-8. [PMID: 26052223 PMCID: PMC4447150 DOI: 10.4137/cin.s17277] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/02/2014] [Revised: 02/16/2015] [Accepted: 02/17/2015] [Indexed: 12/20/2022] Open
Abstract
The pathological description of the stage of a tumor is an important clinical designation and is considered, like many other forms of biomedical data, an ordinal outcome. Currently, statistical methods for predicting an ordinal outcome using clinical, demographic, and high-dimensional correlated features are lacking. In this paper, we propose a method that fits an ordinal response model to predict an ordinal outcome for high-dimensional covariate spaces. Our method penalizes some covariates (high-throughput genomic features) without penalizing others (such as demographic and/or clinical covariates). We demonstrate the application of our method to predict the stage of breast cancer. In our model, breast cancer subtype is a nonpenalized predictor, and CpG site methylation values from the Illumina Human Methylation 450K assay are penalized predictors. The method has been made available in the ordinalgmifs package in the R programming environment.
Affiliation(s)
- Debra E Lyon
- College of Nursing, University of Florida, Gainesville, FL, USA
- Kellie J Archer
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA
2
Yoo S, Huang T, Campbell JD, Lee E, Tu Z, Geraci MW, Powell CA, Schadt EE, Spira A, Zhu J. MODMatcher: multi-omics data matcher for integrative genomic analysis. PLoS Comput Biol 2014; 10:e1003790. [PMID: 25122495 PMCID: PMC4133046 DOI: 10.1371/journal.pcbi.1003790] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 02/21/2014] [Accepted: 06/26/2014] [Indexed: 12/30/2022] Open
Abstract
Errors in sample annotation or labeling often occur in large-scale genetic or genomic studies and are difficult to avoid completely during data generation and management. For integrative genomic studies, it is critical to identify and correct these errors. Different types of genetic and genomic data are inter-connected by cis-regulations. On that basis, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, which can be used in further integrative analysis. Our results indicate that inspection of sample annotation and labeling error is an indispensable data quality assurance step. Applied to a large lung genomic study, MODMatcher increased statistically significant genetic associations and genomic correlations by more than two-fold. In a simulation study, MODMatcher provided more robust results by using three types of omics data than two types of omics data. We further demonstrate that MODMatcher can be broadly applied to large genomic data sets containing multiple types of omics data, such as The Cancer Genome Atlas (TCGA) data sets.
Many human diseases are complex with multiple genetic and environmental causal factors interacting together to give rise to disease phenotypes. Such factors affect biological systems through many layers of regulations, including transcriptional and epigenetic regulation, and protein changes. To fully understand their molecular mechanisms, complex diseases are often studied in diverse dimensions including genetics (genotype variations by single nucleotide polymorphism (SNP) arrays or whole exome sequencing), transcriptomics, epigenetics, and proteomics. However, errors in sample annotation or labeling often occur in large-scale genetic and genomic studies and are difficult to avoid completely during data generation and management. Identifying and correcting these errors are critical for integrative genomic studies.
In this study, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors based on multiple types of molecular data before further integrative analysis. Our results indicate that signals increased more than 100% after correction of sample labeling errors in a large lung genomic study. Our method can be broadly applied to large genomic data sets with multiple types of omics data, such as TCGA (The Cancer Genome Atlas) data sets.
Affiliation(s)
- Seungyeul Yoo
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Tao Huang
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Joshua D. Campbell
- Division of Computational Biomedicine, Department of Medicine, Boston University School of Medicine, Boston, Massachusetts, United States of America
- Eunjee Lee
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Zhidong Tu
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Mark W. Geraci
- Division of Pulmonary Sciences and Critical Care Medicine, University of Colorado Denver, Aurora, Colorado, United States of America
- Charles A. Powell
- Division of Pulmonary, Critical Care and Sleep Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Eric E. Schadt
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Avrum Spira
- Division of Computational Biomedicine, Department of Medicine, Boston University School of Medicine, Boston, Massachusetts, United States of America
- Jun Zhu
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
3
Staedtler F, Hartmann N, Letzkus M, Bongiovanni S, Scherer A, Marc P, Johnson KJ, Schumacher MM. Robust and tissue-independent gender-specific transcript biomarkers. Biomarkers 2013; 18:436-45. [PMID: 23829492 DOI: 10.3109/1354750x.2013.811538] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Indexed: 11/13/2022]
Abstract
CONTEXT: Correct gender assignment in humans at the molecular level is crucial in many scientific disciplines and applied areas.
MATERIALS AND METHODS: Candidate gender markers were identified through supervised statistical analysis of genome wide microarray expression data from human blood samples (N = 123, 58 female, 65 male) as a training set. The potential of the markers to predict undisclosed tissue donor gender was tested on microarray data from 13 healthy and 11 cancerous human tissue collections (internal) and external datasets from samples of varying tissue origin. The abundance of some genes in the marker panel was quantified by RT-PCR as alternative analytical technology.
RESULTS: We identified and qualified predictive, gender-specific transcript markers based on a set of five genes (RPS4Y1, EIF1AY, DDX3Y, KDM5D and XIST).
CONCLUSION: Gene expression marker panels can be used as a robust tissue- and platform-independent predictive approach for gender determination.
Affiliation(s)
- Frank Staedtler
- Novartis Institutes for BioMedical Research (NIBR), Biomarker Development, Basel, Switzerland.