1
Cao X, Liang X, Zhang S, Sha Q. Gene selection by incorporating genetic networks into case-control association studies. Eur J Hum Genet 2024; 32:270-277. [PMID: 36529820 PMCID: PMC10923938 DOI: 10.1038/s41431-022-01264-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2022] [Revised: 11/27/2022] [Accepted: 11/30/2022] [Indexed: 12/23/2022]
Abstract
Large-scale genome-wide association studies (GWAS) have successfully identified a wide range of genetic variants underlying complex diseases. The network-based regression approach has been developed to incorporate a biological genetic network and to overcome the computational challenges of analyzing high-dimensional genomic data. In this paper, we propose a gene selection approach by incorporating genetic networks into case-control association studies for DNA sequence data or DNA methylation data. Instead of using traditional dimension reduction techniques such as principal component analysis and supervised principal component analysis, we use a linear combination of genotypes at SNPs or methylation values at CpG sites in a gene to capture gene-level signals. We employ three linear combination approaches: optimally weighted sum (OWS), beta-based weighted sum (BWS), and LD-adjusted polygenic risk score (LD-PRS). OWS and LD-PRS are supervised approaches that depend on the effect of each SNP or CpG site on the case-control status, while BWS can be extracted without using the case-control status. After using one of the linear combinations of genotypes or methylation values in each gene to capture gene-level signals, we regularize them to perform gene selection based on the biological network. Simulation studies show that the proposed approaches have higher true positive rates than traditional dimension reduction techniques. We also apply our approaches to DNA methylation data and UK Biobank DNA sequence data for analyzing rheumatoid arthritis. The results show that the proposed methods can select potentially rheumatoid arthritis-related genes that are missed by existing methods.
Affiliation(s)
- Xuewei Cao
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
- Xiaoyu Liang
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, USA
- Shuanglin Zhang
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
- Qiuying Sha
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
2
Kim K, Jun TH, Ha BK, Wang S, Sun H. New statistical selection method for pleiotropic variants associated with both quantitative and qualitative traits. BMC Bioinformatics 2023; 24:381. [PMID: 37817069 PMCID: PMC10563219 DOI: 10.1186/s12859-023-05505-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Accepted: 09/28/2023] [Indexed: 10/12/2023]
Abstract
BACKGROUND Identification of pleiotropic variants associated with multiple phenotypic traits has received increasing attention in genetic association studies. Overlapping genetic associations from multiple traits help to detect weak genetic associations missed by single-trait analyses. Many statistical methods have been developed to identify pleiotropic variants, but most of them are limited to quantitative traits, even though pleiotropic effects on both quantitative and qualitative traits have been observed. This is a statistically challenging problem because there is no appropriate multivariate distribution to model quantitative and qualitative data together. Alternatively, meta-analysis methods can be applied, which integrate summary statistics of individual variants associated with either a quantitative or a qualitative trait without accounting for correlations among genetic variants. RESULTS We propose a new statistical selection method based on a unified selection score quantifying how a genetic variant, i.e., a pleiotropic variant, associates with both quantitative and qualitative traits. In our extensive simulation studies, where various types of pleiotropic effects on both quantitative and qualitative traits were considered, we demonstrated that the proposed method outperforms the existing meta-analysis methods in terms of true positive selection. We also applied the proposed method to a peanut dataset with 6 quantitative and 2 qualitative traits, and a cowpea dataset with 2 quantitative and 6 qualitative traits. We were able to detect some potentially pleiotropic variants missed by the existing methods in both analyses. CONCLUSIONS The proposed method is able to locate pleiotropic variants associated with both quantitative and qualitative traits. It has been implemented as an R package 'UNISS', which can be downloaded from http://github.com/statpng/uniss.
Affiliation(s)
- Kipoong Kim
  - Department of Statistics, Pusan National University, 46241, Busan, Korea
- Tae-Hwan Jun
  - Department of Plant Bioscience, Pusan National University, 50463, Miryang, Korea
- Bo-Keun Ha
  - Department of Applied Plant Science, Chonnam National University, 61186, Gwangju, Korea
- Shuang Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, 10032, USA
- Hokeun Sun
  - Department of Statistics, Pusan National University, 46241, Busan, Korea
3
Tang X, Mo Z, Chang C, Qian X. Group-shrinkage feature selection with a spatial network for mining DNA methylation data. Comput Biol Med 2023; 154:106573. [PMID: 36706568 DOI: 10.1016/j.compbiomed.2023.106573] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/09/2022] [Revised: 01/05/2023] [Accepted: 01/22/2023] [Indexed: 01/25/2023]
Abstract
Identifying disease-related biomarkers from high-dimensional DNA methylation data helps in reducing early screening costs and inferring pathogenesis mechanisms. Good discovery results have been achieved through spatial correlation methods of methylation sites, group-based regularization, and network constraints. However, these methods still have some key limitations as they cannot exclude isolated differential sites and only consider adjacent site ordering. Therefore, we propose a group-shrinkage feature selection algorithm to encourage the selection of clustered sites and discourage the selection of isolated differential sites. Specifically, a network-guided group-shrinkage strategy is developed to penalize weakly-correlated isolated methylation sites through a network structure constraint. The spatial network is constructed based on spatial correlation information of DNA methylation sites, where this information accounts for the uneven site distribution. The experimental simulations and applications demonstrated that the proposed method outperforms the advanced regularization methods, especially in rejecting isolated methylation sites; hence this study provides an efficient and clinical-valuable method for biomarker candidate discovery in DNA methylation data. Additionally, the proposed method exhibits enhanced reliability due to introducing biological prior knowledge into a regularization-based feature selection framework and could promote more research in the integration between biological prior knowledge and classical feature selection methods, thus facilitating their clinical application. Our source codes will be released at https://github.com/SJTUBME-QianLab/Group-shrinkage-Spatial-Network once this manuscript is accepted for publication.
Affiliation(s)
- Xinlu Tang
  - Medical Image and Health Informatics Lab, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
- Zhanfeng Mo
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore
- Cheng Chang
  - Department of Nuclear Medicine, Shanghai Chest Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200030, China
- Xiaohua Qian
  - Medical Image and Health Informatics Lab, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
4
Rohan TE, Ginsberg M, Wang Y, Couch FJ, Feigelson HS, Greenlee RT, Honda S, Stark A, Chitale D, Wang T, Xue X, Oktay MH, Sparano JA, Loudig O. Molecular markers of risk of subsequent invasive breast cancer in women with ductal carcinoma in situ: protocol for a population-based cohort study. BMJ Open 2021; 11:e053397. [PMID: 34702732 PMCID: PMC8549665 DOI: 10.1136/bmjopen-2021-053397] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022]
Abstract
INTRODUCTION Ductal carcinoma in situ (DCIS) of the breast is a non-obligate precursor of invasive breast cancer (IBC). Many DCIS patients are either undertreated or overtreated. The overarching goal of the study described here is to facilitate detection of patients with DCIS at risk of IBC development. Here, we propose to use risk factor data and formalin-fixed paraffin-embedded (FFPE) DCIS tissue from a large, ethnically diverse, population-based cohort of 8175 women with a first diagnosis of DCIS and followed for subsequent IBC to: identify/validate miRNA expression changes in DCIS tissue associated with risk of subsequent IBC; evaluate ipsilateral IBC risk in association with two previously identified marker sets (triple immunopositivity for p16, COX-2, Ki67; Oncotype DX Breast DCIS score); examine the association of risk factor data with IBC risk. METHODS AND ANALYSIS We are conducting a series of case-control studies nested within the cohort. Cases are women with DCIS who developed subsequent IBC; controls (2/case) are matched to cases on calendar year of and age at DCIS diagnosis. We project 485 cases/970 controls in the aim focused on risk factors. We estimate obtaining FFPE tissue for 320 cases/640 controls for the aim focused on miRNAs; of these, 173 cases/346 controls will be included in the aim focused on p16, COX-2 and Ki67 immunopositivity, and of the latter, 156 case-control pairs will be included in the aim focused on the Oncotype DX Breast DCIS score®. Multivariate conditional logistic regression will be used for statistical analyses. ETHICS AND DISSEMINATION Ethics approval was obtained from the Institutional Review Boards of Albert Einstein College of Medicine (IRB 2014-3611), Kaiser Permanente Colorado, Kaiser Permanente Hawaii, Henry Ford Health System, Mayo Clinic, Marshfield Clinic Research Institute and Hackensack Meridian Health, and from Lifespan Research Protection Office. 
The study results will be presented at meetings and published in peer-reviewed journals.
Affiliation(s)
- Thomas E Rohan
  - Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York, USA
- Mindy Ginsberg
  - Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York, USA
- Yihong Wang
  - Department of Pathology and Laboratory Medicine, Rhode Island Hospital and Lifespan Medical Center, Providence, Rhode Island, USA
  - Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
- Fergus J Couch
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota, USA
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
- Robert T Greenlee
  - Center for Clinical Epidemiology and Population Health, Marshfield Clinic Research Institute, Marshfield, Wisconsin, USA
- Stacey Honda
  - Center for Integrated Healthcare, Kaiser Permanente, Hawaii Permanente Medical Group, Honolulu, Hawaii, USA
- Azadeh Stark
  - Department of Pathology and Laboratory Medicine, Henry Ford Health System, Detroit, Michigan, USA
  - Breast Oncology Program and Department of Pathology, Henry Ford Health System, Detroit, Michigan, USA
- Dhananjay Chitale
  - Department of Pathology and Laboratory Medicine, Henry Ford Health System, Detroit, Michigan, USA
  - Breast Oncology Program and Department of Pathology, Henry Ford Health System, Detroit, Michigan, USA
- Tao Wang
  - Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York, USA
- Xiaonan Xue
  - Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York, USA
- Maja H Oktay
  - Department of Pathology, Albert Einstein College of Medicine/Montefiore Medical Center, Bronx, New York, USA
  - Department of Anatomy and Structural Biology, Albert Einstein College of Medicine/Montefiore Medical Center, Bronx, New York, USA
- Joseph A Sparano
  - Department of Oncology, Albert Einstein College of Medicine/Montefiore Medical Center, Bronx, New York, USA
- Olivier Loudig
  - Center for Discovery and Innovation, Hackensack Meridian Health, Nutley, New Jersey, USA
5
Oh M, Kim K, Sun H. Covariance thresholding to detect differentially co-expressed genes from microarray gene expression data. J Bioinform Comput Biol 2021; 18:2050002. [PMID: 32336254 DOI: 10.1142/s021972002050002x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Gene set analysis aims to identify differentially expressed or co-expressed genes within a biological pathway between two experimental conditions, so that it can eventually reveal biological processes and pathways involved in disease development. In the last few decades, various statistical and computational methods have been proposed to improve the statistical power of gene set analysis. In recent years, much attention has been paid to differentially co-expressed genes, since they can be disease-related genes without a significant difference in average expression levels between two conditions. In this paper, we propose a new statistical method to identify differentially co-expressed genes from microarray gene expression data. The proposed method first estimates co-expression levels of paired genes using covariance regularization by thresholding, and then evaluates the significance of the difference in covariance estimates between two conditions. Through extensive simulation studies, we demonstrated that the proposed method is more powerful than the existing mainstream methods for detecting co-expressed genes. We also applied it to various microarray gene expression datasets related to mutant p53 transcriptional activity and to epithelium and stroma breast cancer.
Affiliation(s)
- Mingyu Oh
  - Department of Statistics, Pusan National University, Busan, 46241, Korea
- Kipoong Kim
  - Department of Statistics, Pusan National University, Busan, 46241, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan, 46241, Korea
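As a rough illustration of the covariance-thresholding idea summarized in the abstract of entry 5, the sketch below hard-thresholds each condition's sample covariance and compares the two matrices with a permutation test. The function names, the threshold `tau`, the squared Frobenius-norm statistic, and the permutation scheme are all assumptions made for illustration; the abstract does not specify the authors' estimator or test.

```python
import numpy as np


def hard_threshold_cov(x, tau):
    """Sample covariance of x (samples x genes) with small off-diagonals zeroed.

    Off-diagonal entries with absolute value below tau are set to zero;
    the diagonal (variances) is always kept.
    """
    s = np.cov(x, rowvar=False)
    keep = np.abs(s) >= tau
    np.fill_diagonal(keep, True)
    return np.where(keep, s, 0.0)


def diff_coexpression_pvalue(x1, x2, tau, n_perm=500, seed=0):
    """Permutation p-value for a difference in thresholded covariance.

    Hypothetical test statistic: squared Frobenius norm of the difference
    between the two thresholded covariance matrices.
    """
    rng = np.random.default_rng(seed)
    # Center each condition so that mean shifts do not leak into the
    # covariance estimates after labels are permuted.
    a = x1 - x1.mean(axis=0)
    b = x2 - x2.mean(axis=0)

    def stat(u, v):
        d = hard_threshold_cov(u, tau) - hard_threshold_cov(v, tau)
        return float(np.sum(d ** 2))

    observed = stat(a, b)
    pooled = np.vstack([a, b])
    n1 = a.shape[0]
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        if stat(pooled[idx[:n1]], pooled[idx[n1:]]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

A call such as `diff_coexpression_pvalue(x_case, x_control, tau=0.3)` returns the observed statistic and its permutation p-value; in practice `tau` would be tuned, for example by cross-validation.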
6
Zhou F, Ren J, Lu X, Ma S, Wu C. Gene-Environment Interaction: A Variable Selection Perspective. Methods Mol Biol 2021; 2212:191-223. [PMID: 33733358 DOI: 10.1007/978-1-0716-0947-7_13] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 01/31/2023]
Abstract
Gene-environment interactions have important implications for elucidating the genetic basis of complex diseases beyond the joint function of multiple genetic factors and their interactions (or epistasis). In the past, G × E interactions have been mainly conducted within the framework of genetic association studies. The high dimensionality of G × E interactions, due to the complicated form of environmental effects and the presence of a large number of genetic factors including gene expressions and SNPs, has motivated the recent development of penalized variable selection methods for dissecting G × E interactions, which has been ignored in the majority of published reviews on genetic interaction studies. In this article, we first survey existing studies on both gene-environment and gene-gene interactions. Then, after a brief introduction to the variable selection methods, we review penalization and relevant variable selection methods in marginal and joint paradigms, respectively, under a variety of conceptual models. Discussions on strengths and limitations, as well as computational aspects of the variable selection methods tailored for G × E studies, have also been provided.
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Jie Ren
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
7
Zou K, Kim KS, Kim K, Kang D, Park YH, Sun H, Ha BK, Ha J, Jun TH. Genetic Diversity and Genome-Wide Association Study of Seed Aspect Ratio Using a High-Density SNP Array in Peanut (Arachis hypogaea L.). Genes (Basel) 2020; 12:E2. [PMID: 33375051 PMCID: PMC7822046 DOI: 10.3390/genes12010002] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/22/2020] [Revised: 12/09/2020] [Accepted: 12/17/2020] [Indexed: 12/12/2022]
Abstract
Peanut (Arachis hypogaea L.) is one of the important oil crops of the world. In this study, we aimed to evaluate the genetic diversity of 384 peanut germplasms including 100 Korean germplasms and 284 core collections from the United States Department of Agriculture (USDA) using an Axiom_Arachis array with 58K single-nucleotide polymorphisms (SNPs). We evaluated the evolutionary relationships among 384 peanut germplasms using a genome-wide association study (GWAS) of seed aspect ratio data processed by ImageJ software. In total, 14,030 filtered polymorphic SNPs were identified from the peanut 58K SNP array. We identified five SNPs with significant associations to seed aspect ratio on chromosomes Aradu.A09, Aradu.A10, Araip.B08, and Araip.B09. AX-177640219 on chromosome Araip.B08 was the most significantly associated marker in GAPIT and Regularization method. Phosphoenolpyruvate carboxylase (PEPC) was found among the eleven genes within a linkage disequilibrium (LD) of the significant SNPs on Araip.B08 and could have a strong causal effect in determining seed aspect ratio. The results of the present study provide information and methods that are useful for further genetic and genomic studies as well as molecular breeding programs in peanuts.
Affiliation(s)
- Kunyan Zou
  - Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea
- Kipoong Kim
  - Department of Statistics, Pusan National University, Busan 46241, Korea
- Dongwoo Kang
  - Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea
- Yu-Hyeon Park
  - Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan 46241, Korea
- Bo-Keun Ha
  - Department of Applied Plant Science, Chonnam National University, Gwangju 61186, Korea
- Jungmin Ha
  - Department of Plant Science, Gangneung-Wonju National University, Gangneung 25457, Korea
- Tae-Hwan Jun
  - Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea
  - Life and Industry Convergence Research Institute, Pusan National University, Miryang 50463, Korea
8
Selection probability of multivariate regularization to identify pleiotropic variants in genetic association studies. Communications for Statistical Applications and Methods 2020. [DOI: 10.29220/csam.2020.27.5.535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
9
Kim K, Koo J, Sun H. An empirical threshold of selection probability for analysis of high-dimensional correlated data. J Stat Comput Sim 2020. [DOI: 10.1080/00949655.2020.1739286] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/31/2023]
Affiliation(s)
- Kipoong Kim
  - Department of Statistics, Pusan National University, Busan, South Korea
- Jajoon Koo
  - Department of Statistics, Pusan National University, Busan, South Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan, South Korea
10
Kim K, Sun H. Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data. BMC Bioinformatics 2019; 20:510. [PMID: 31640538 PMCID: PMC6805595 DOI: 10.1186/s12859-019-3040-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/10/2019] [Accepted: 08/21/2019] [Indexed: 12/23/2022]
Abstract
Background In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods utilizing prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform other methods that ignore genetic network structures in terms of true positive selection. In recent epigenetic research on case-control association studies, many statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most existing methods are not designed to utilize genetic network information, although methylation levels between linked genes in the genetic networks tend to be highly correlated with each other. Results We propose a new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes for analysis of high-dimensional DNA methylation data. In simulation studies, we demonstrated that the proposed approach outperforms other statistical methods that do not utilize genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. Conclusions The proposed variable selection approach can utilize prior biological network information for analysis of high-dimensional DNA methylation array data. It first captures gene-level signals from multiple CpG sites using a data dimension reduction technique and then performs network-based regularization based on biological network graph information. It can select potentially cancer-related genes and genetic pathways that were missed by the existing methods. Electronic supplementary material The online version of this article (10.1186/s12859-019-3040-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Kipoong Kim
  - Department of Statistics, Pusan National University, Busan, 46241, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan, 46241, Korea
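The two-step procedure described in the abstract of entry 10 (summarize the CpG sites of each gene, then regularize over the gene network) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the first principal component stands in for the unspecified dimension reduction step, the graph-Laplacian quadratic form stands in for the network-based penalty, and the names `gene_level_scores`, `normalized_laplacian`, and `network_penalty` are hypothetical.

```python
import numpy as np


def gene_level_scores(x, groups):
    """First principal component score of each gene's CpG columns.

    x: samples x CpG-sites matrix; groups: one list of column indices per gene.
    The sign of each score vector is arbitrary, as with any PCA.
    """
    scores = []
    for cols in groups:
        block = x[:, cols]
        centered = block - block.mean(axis=0)
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        scores.append(u[:, 0] * s[0])
    return np.column_stack(scores)


def normalized_laplacian(adj):
    """Normalized graph Laplacian; rows of isolated genes are all zero."""
    adj = np.asarray(adj, dtype=float)
    d = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(d)
    inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    return np.diag((d > 0).astype(float)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]


def network_penalty(beta, adj):
    """Quadratic form beta' L beta, small when linked genes have similar
    degree-scaled coefficients."""
    return float(beta @ normalized_laplacian(adj) @ beta)
```

In a full analysis this penalty would be added, together with a sparsity term, to a logistic-regression loss on the gene-level scores; that optimization step is omitted here.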
11
Choi J, Kim K, Sun H. New variable selection strategy for analysis of high-dimensional DNA methylation data. J Bioinform Comput Biol 2018; 16:1850010. [PMID: 29954287 DOI: 10.1142/s0219720018500105] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023]
Abstract
In genetic association studies, regularization methods are often used due to their computational efficiency for analysis of high-dimensional genomic data. DNA methylation data generated from the Infinium HumanMethylation450 BeadChip Kit have a group structure where an individual gene consists of multiple Cytosine-phosphate-Guanine (CpG) sites. Consequently, group-based regularization can precisely detect outcome-related CpG sites. Representative examples are sparse group lasso (SGL) and network-based regularization. The former is powerful when most of the CpG sites within the same gene are associated with a phenotype outcome. In contrast, the latter is preferred when only a few of the CpG sites within the same gene are related to the outcome. In this paper, we propose a new variable selection strategy based on a selection probability that measures the selection frequency of individual variables selected by both SGL and network-based regularization. In an extensive simulation study, we demonstrated that the proposed strategy shows strong selection performance across the situations considered, compared with both SGL and network-based regularization. We also applied the proposed strategy to identify differentially methylated CpG sites and their corresponding genes from ovarian cancer data.
Affiliation(s)
- Jiyun Choi
  - Department of Statistics, Pusan National University, Busan 46241, Korea
- Kipoong Kim
  - Department of Statistics, Pusan National University, Busan 46241, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan 46241, Korea
12
Sun H, Wang Y, Chen Y, Li Y, Wang S. pETM: a penalized Exponential Tilt Model for analysis of correlated high-dimensional DNA methylation data. Bioinformatics 2018; 33:1765-1772. [PMID: 28165116 DOI: 10.1093/bioinformatics/btx064] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 09/29/2016] [Accepted: 01/31/2017] [Indexed: 12/31/2022]
Abstract
Motivation DNA methylation plays an important role in many biological processes and cancer progression. Recent studies have found that there are also differences in methylation variations in different groups other than differences in methylation means. Several methods have been developed that consider both mean and variance signals in order to improve statistical power of detecting differentially methylated loci. Moreover, as methylation levels of neighboring CpG sites are known to be strongly correlated, methods that incorporate correlations have also been developed. We previously developed a network-based penalized logistic regression for correlated methylation data, but only focusing on mean signals. We have also developed a generalized exponential tilt model that captures both mean and variance signals but only examining one CpG site at a time. Results In this article, we proposed a penalized Exponential Tilt Model (pETM) using network-based regularization that captures both mean and variance signals in DNA methylation data and takes into account the correlations among nearby CpG sites. By combining the strength of the two models we previously developed, we demonstrated the superior power and better performance of the pETM method through simulations and the applications to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. The developed pETM method identifies many cancer-related methylation loci that were missed by our previously developed method that considers correlations among nearby methylation loci but not variance signals. Availability and Implementation The R package 'pETM' is publicly available through CRAN: http://cran.r-project.org . Contact sw2206@columbia.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan, Korea
- Ya Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
- Yong Chen
  - Division of Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Yun Li
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
  - Department of Genetics, University of North Carolina, Chapel Hill, NC, USA
  - Department of Computer Science, University of North Carolina, Chapel Hill, NC, USA
- Shuang Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
13
Liang S, Ma A, Yang S, Wang Y, Ma Q. A Review of Matched-pairs Feature Selection Methods for Gene Expression Data Analysis. Comput Struct Biotechnol J 2018; 16:88-97. [PMID: 30275937 PMCID: PMC6158772 DOI: 10.1016/j.csbj.2018.02.005] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 09/18/2017] [Revised: 02/14/2018] [Accepted: 02/19/2018] [Indexed: 12/31/2022]
Abstract
With the rapid accumulation of gene expression data from various technologies, e.g., microarray, RNA-sequencing (RNA-seq), and single-cell RNA-seq, it is necessary to carry out dimension reduction and feature (signature gene) selection to make sense of such high-dimensional data. These computational methods significantly facilitate further data analysis and interpretation, such as gene function enrichment analysis, cancer biomarker detection, and drug target identification in precision medicine. Although numerous methods have been developed for feature selection in bioinformatics, it is still a challenge to choose the appropriate methods for a specific problem and to seek the most reasonable ranking of features. Meanwhile, paired gene expression data under a matched case-control design (MCCD) are becoming increasingly popular; they have often been used in multi-omics integration studies and may increase feature selection efficiency by offsetting similar distributions of confounding features. However, feature selection methods specifically designed for paired data, termed matched-pairs feature selection (MPFS), have not matured in parallel. In this review, we compare the performance of 10 feature selection methods (eight MPFS methods and two traditional unpaired methods) on two real datasets by applying three classification methods, and analyze the algorithmic complexity of these methods by running their programs. This review aims to introduce and comprehensively present MPFS in such a way that readers can easily understand its characteristics and get a clue in selecting the appropriate methods for their analyses.
Affiliation(s)
- Sen Liang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Anjun Ma
  - Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD 57007, USA
  - BioSNTR, Brookings, SD, USA
- Sen Yang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Qin Ma
  - Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD 57007, USA
  - BioSNTR, Brookings, SD, USA
14
|
Wang Y, Teschendorff AE, Widschwendter M, Wang S. Accounting for differential variability in detecting differentially methylated regions. Brief Bioinform 2017; 20:47-57. [DOI: 10.1093/bib/bbx097] [Received: 04/13/2017]
Affiliation(s)
- Ya Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
- Andrew E Teschendorff
  - Department of Women's Cancer, University College London, London, UK
  - CAS Key Lab of Computational Biology, Shanghai Institute for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
  - Statistical Cancer Genomics, UCL Cancer Institute, University College London, London, UK
- Shuang Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
|
15
|
Abstract
In human genome research, genetic association studies of rare variants have been widely conducted since the advent of high-throughput DNA sequencing platforms. However, detecting outcome-related rare variants remains a statistically challenging problem because observed genetic mutations are extremely rare. Recently, a power set-based statistical selection procedure has been proposed to locate both risk and protective rare variants within outcome-related genes or genetic regions. Although it can select individual rare variants, the procedure is limited in that it cannot measure the certainty of the selected rare variants. In this article, we propose a selection probability for individual rare variants, where the selection frequencies of rare variants are computed by bootstrap resampling. It can therefore quantify the certainty of both selected and unselected rare variants. In addition, a new selection approach that applies a threshold to the selection probability is introduced and compared with existing selection procedures in extensive simulation studies and a real sequencing data analysis. We demonstrate that the proposed approach outperforms the existing methods in terms of selection power.
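The bootstrap selection frequency described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the stratified resampling of cases and controls, and the stand-in selector (top-k variants by absolute difference in group means) are all assumptions made for illustration, and the power set-based selection from the abstract is not implemented here.

```python
import numpy as np

def selection_probability(X, y, select, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each variant is selected.

    X      : (n_samples, n_variants) genotype matrix
    y      : (n_samples,) case-control labels coded 0/1
    select : callable (X, y) -> boolean mask of selected variants
             (hypothetical stand-in for any selection procedure)
    """
    rng = np.random.default_rng(seed)
    cases = np.flatnonzero(y == 1)
    controls = np.flatnonzero(y == 0)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        # Assumption: resample cases and controls separately so that
        # every bootstrap sample contains both groups.
        idx = np.concatenate([
            rng.choice(cases, size=cases.size, replace=True),
            rng.choice(controls, size=controls.size, replace=True),
        ])
        counts += select(X[idx], y[idx])
    return counts / n_boot

def top_k_by_mean_difference(X, y, k=2):
    """Illustrative selector: the k variants with the largest absolute
    difference in mean genotype between cases and controls."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    chosen = np.zeros(X.shape[1], dtype=bool)
    chosen[np.argsort(diff)[-k:]] = True
    return chosen
```

Given the returned vector `probs`, a thresholded selection in the spirit of the abstract would be `probs >= t` for some cutoff `t`; the choice of `t` is left open here because the abstract does not state one.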
Affiliation(s)
- Gira Lee
  - Department of Statistics, Pusan National University, Busan, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan, Korea
|
16
|
Ko H, Kim K, Sun H. Multiple Group Testing Procedures for Analysis of High-Dimensional Genomic Data. Genomics Inform 2016; 14:187-195. [PMID: 28154510 PMCID: PMC5287123 DOI: 10.5808/gi.2016.14.4.187] [Received: 07/27/2016] [Revised: 10/20/2016] [Accepted: 10/26/2016]
Abstract
In genetic association studies with high-dimensional genomic data, multiple group testing procedures are often required to identify disease- or trait-related genes or genetic regions, where multiple genetic sites or variants are located within the same gene or genetic region. However, statistical testing procedures based on individual tests suffer from multiple testing issues such as control of the family-wise error rate and dependence among tests. Moreover, the main interest in genetic association studies is detecting the few genes associated with a phenotype outcome among tens of thousands of genes. For this reason, regularization procedures, in which a phenotype outcome is regressed on all genomic markers and the regression coefficients are estimated from a penalized likelihood, have been considered a good alternative for the analysis of high-dimensional genomic data. However, the selection performance of regularization procedures has rarely been compared with that of statistical group testing procedures. In this article, we performed extensive simulation studies in which commonly used group testing procedures such as principal component analysis, Hotelling's T² test, and the permutation test are compared with the group lasso (least absolute shrinkage and selection operator) in terms of true positive selection. We also applied all methods considered in the simulation studies to identify genes associated with ovarian cancer from over 20,000 genetic sites generated by the Illumina Infinium HumanMethylation27K BeadChip. We found a large discrepancy in the selected genes between the multiple group testing procedures and the group lasso.
Affiliation(s)
- Hyoseok Ko
  - Department of Statistics, Pusan National University, Busan 46241, Korea
- Kipoong Kim
  - Department of Statistics, Pusan National University, Busan 46241, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan 46241, Korea
|
17
|
Kim K, Choi J, Sun H. Network-based regularization for analysis of high-dimensional genomic data with group structure. Korean Journal of Applied Statistics 2016. [DOI: 10.5351/kjas.2016.29.6.1117]
|
18
|
Doerken S, Mockenhaupt M, Naldi L, Schumacher M, Sekula P. The case-crossover design via penalized regression. BMC Med Res Methodol 2016; 16:103. [PMID: 27549803 PMCID: PMC4994302 DOI: 10.1186/s12874-016-0197-0] [Received: 04/14/2016] [Accepted: 07/28/2016]
Abstract
BACKGROUND: The case-crossover design is an attractive alternative to the classical case-control design that can be used to study the onset of acute events if the risk factors of interest vary over time. By comparing exposures within cases at different time periods, the case-crossover design does not rely on control subjects, who can be difficult to recruit. However, with the standard method of maximum likelihood, the resulting risk estimates can be heavily biased when the prevalence of risk factors is very low (or very high). METHODS: To overcome the problem of low risk factor prevalence, penalized conditional logistic regression via the lasso (least absolute shrinkage and selection operator) has been proposed in the literature, as have related methods such as the Firth correction. We apply and compare several penalized regression approaches in the context of a case-crossover analysis of the European Study of Severe Cutaneous Adverse Reactions (EuroSCAR; 1997-2001). RESULTS: Out of 30 drugs, standard methods correctly classified only 17 drugs (including some highly implausible risk estimates), while penalized methods correctly classified 22 drugs. CONCLUSION: Penalized methods generally yield better risk classifications and much more plausible risk estimates for the EuroSCAR study than standard methods. As these novel techniques can be easily implemented using available R packages, we encourage routine use of penalized conditional logistic regression for case-crossover data.
Affiliation(s)
- Sam Doerken
  - Institute for Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Maja Mockenhaupt
  - Dokumentationszentrum schwerer Hautreaktionen (dZh), Medical Center, University of Freiburg, Freiburg, Germany
- Luigi Naldi
  - USC di Dermatologia, Azienda Ospedaliero Papa Giovanni XXIII, Bergamo, Italy
- Martin Schumacher
  - Institute for Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Peggy Sekula
  - Institute for Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
|
19
|
Ruan P, Shen J, Santella RM, Zhou S, Wang S. NEpiC: a network-assisted algorithm for epigenetic studies using mean and variance combined signals. Nucleic Acids Res 2016; 44:e134. [PMID: 27302130 PMCID: PMC5027497 DOI: 10.1093/nar/gkw546] [Received: 12/22/2015] [Accepted: 06/04/2016]
Abstract
DNA methylation plays an important role in many biological processes. Existing epigenome-wide association studies (EWAS) have successfully identified aberrantly methylated genes in many diseases and disorders, with most studies analysing methylation sites one at a time. Incorporating prior biological information such as biological networks has proven powerful for identifying disease-associated genes in both gene expression studies and genome-wide association studies (GWAS), but has been understudied in EWAS. Although recent studies have noted that methylation variation differs between groups, only a few existing methods consider variance signals in DNA methylation studies. Here, we present a network-assisted algorithm, NEpiC, that combines mean and variance signals to search for differentially methylated sub-networks using the protein–protein interaction (PPI) network. In simulation studies, we demonstrate the power gained from using both the prior biological information and the variance signals compared with using only one of the two or neither. Applications to several DNA methylation datasets from The Cancer Genome Atlas (TCGA) project and to DNA methylation data on hepatocellular carcinoma (HCC) from the Columbia University Medical Center (CUMC) suggest that the proposed NEpiC algorithm identifies more cancer-related genes and generates better replication results.
Affiliation(s)
- Peifeng Ruan
  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China
- Jing Shen
  - Department of Environmental Health Science, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
- Regina M Santella
  - Department of Environmental Health Science, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
- Shuigeng Zhou
  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China
- Shuang Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
|
20
|
Zhang Y, Zhang J, Liu Z, Liu Y, Tuo S. A network-based approach to identify disease-associated gene modules through integrating DNA methylation and gene expression. Biochem Biophys Res Commun 2015; 465:437-42. [DOI: 10.1016/j.bbrc.2015.08.033] [Received: 08/02/2015] [Accepted: 08/09/2015]
|
21
|
Avalos M, Pouyes H, Grandvalet Y, Orriols L, Lagarde E. Sparse conditional logistic regression for analyzing large-scale matched data from epidemiological studies: a simple algorithm. BMC Bioinformatics 2015; 16 Suppl 6:S1. [PMID: 25916593 PMCID: PMC4416185 DOI: 10.1186/1471-2105-16-s6-s1]
Abstract
This paper considers the problem of estimation and variable selection for large high-dimensional data (high number of predictors p and large sample size N, without excluding the possibility that N < p) resulting from an individually matched case-control study. We develop a simple algorithm for the adaptation of the Lasso and related methods to the conditional logistic regression model. Our proposal relies on the simplification of the calculations involved in the likelihood function. Then, the proposed algorithm iteratively solves reweighted Lasso problems using cyclical coordinate descent, computed along a regularization path. This method can handle large problems and deal with sparse features efficiently. We discuss benefits and drawbacks with respect to the existing available implementations. We also illustrate the interest and use of these techniques on a pharmacoepidemiological study of medication use and traffic safety.
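To make the idea of a Lasso penalty on the conditional logistic likelihood concrete, the following Python sketch covers only the simplest special case, 1:1 matched pairs, where the conditional likelihood reduces to an intercept-free logistic loss on case-minus-control differences. It deliberately swaps in proximal gradient descent (ISTA) for the reweighted Lasso and cyclical coordinate descent described in the abstract, and the function names and parameters are assumptions for illustration only.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_conditional_logistic_pairs(x_case, x_control, lam, n_iter=500):
    """L1-penalized conditional logistic regression for 1:1 matched pairs.

    Minimizes mean_i log(1 + exp(-d_i @ beta)) + lam * ||beta||_1,
    where d_i = x_case[i] - x_control[i], using ISTA (an assumption;
    not the coordinate-descent algorithm of the paper).
    """
    D = x_case - x_control
    n, p = D.shape
    # Step size 1/L, with L = sigma_max(D)^2 / (4 n) bounding the
    # Lipschitz constant of the smooth part's gradient.
    step = 4.0 * n / (np.linalg.norm(D, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        margin = D @ beta
        # 1 / (1 + exp(margin)), written with tanh for numerical stability
        weight = 0.5 * (1.0 - np.tanh(margin / 2.0))
        grad = -(D * weight[:, None]).mean(axis=0)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

Larger values of `lam` shrink more coefficients to exactly zero; in practice the penalty would be chosen along a regularization path, as the abstract indicates, rather than fixed in advance.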
|
22
|
Cheng CP, Kuo IY, Alakus H, Frazer KA, Harismendy O, Wang YC, Tseng VS. Network-based analysis identifies epigenetic biomarkers of esophageal squamous cell carcinoma progression. Bioinformatics 2014; 30:3054-61. [PMID: 25015989 DOI: 10.1093/bioinformatics/btu433]
Abstract
MOTIVATION: The rapid progression of esophageal squamous cell carcinoma (ESCC) causes a high mortality rate because of the propensity for metastasis driven by genetic and epigenetic alterations. The identification of prognostic biomarkers would help prevent or control metastatic progression. Expression analyses have been used to find such markers, but the markers do not always validate in separate cohorts. Epigenetic marks, such as DNA methylation, are a potential source of more reliable and stable biomarkers. Importantly, integrating both expression and epigenetic alterations is more likely to identify relevant biomarkers. RESULTS: We present a new analysis framework that uses an ESCC progression-associated gene regulatory network (GRN_ESCC) to identify differentially methylated CpG sites prognostic of ESCC progression. From the CpG loci differentially methylated in 50 tumor-normal pairs, we selected the 44 CpG loci most highly associated with survival and located in the promoters of genes more likely to belong to GRN_ESCC. Using an independent ESCC cohort, we confirmed that 8/10 CpG loci in the promoters of GRN_ESCC genes correlated significantly with patient survival. In contrast, 0/10 CpG loci in the promoters of genes outside GRN_ESCC correlated with patient survival. We further characterized the GRN_ESCC network topology and observed that the genes with methylated CpG loci associated with survival deviated from the center of mass and were less likely to be hubs in GRN_ESCC. We postulate that our analysis framework improves the identification of bona fide prognostic biomarkers from DNA methylation studies, especially with partial genome coverage.
Affiliation(s)
- Chun-Pei Cheng, I-Ying Kuo, Hakan Alakus, Kelly A Frazer, Olivier Harismendy, Yi-Ching Wang, Vincent S Tseng
  - Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan
  - Moores Cancer Center, University of California San Diego, La Jolla, CA 92093, USA
  - Institute of Basic Medical Sciences, National Cheng Kung University, Tainan 701, Taiwan
  - Department of Pediatrics and Rady Children's Hospital, University of California San Diego, La Jolla, CA 92093, USA
  - Department of General, Visceral and Cancer Surgery, University of Cologne, Köln, Germany
  - Institute for Genomic Medicine, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Pharmacology and Institute of Medical Informatics, National Cheng Kung University, Tainan 701, Taiwan
|
23
|
Lee HS, Burkhardt BR, McLeod W, Smith S, Eberhard C, Lynch K, Hadley D, Rewers M, Simell O, She JX, Hagopian B, Lernmark A, Akolkar B, Ziegler AG, Krischer JP. Biomarker discovery study design for type 1 diabetes in The Environmental Determinants of Diabetes in the Young (TEDDY) study. Diabetes Metab Res Rev 2014; 30:424-34. [PMID: 24339168 PMCID: PMC4058423 DOI: 10.1002/dmrr.2510] [Received: 08/12/2013] [Revised: 10/29/2013] [Accepted: 12/04/2013]
Abstract
AIMS: The Environmental Determinants of Diabetes in the Young (TEDDY) study planned biomarker discovery studies on longitudinal samples for persistent confirmed islet cell autoantibodies and type 1 diabetes using dietary biomarkers, metabolomics, microbiome/viral metagenomics and gene expression. METHODS: This article describes the planning of the TEDDY biomarker discovery studies using a nested case-control design, which was chosen as an alternative to the full cohort analysis. Within the framework of a nested case-control design, it guides the choice of matching factors, the selection of controls, the preparation of external quality control samples and the reduction of batch effects, along with proper sample allocation. RESULTS AND CONCLUSION: Our design aims to reduce potential bias and retain study power while reducing costs by limiting the number of samples requiring laboratory analyses. It also covers two primary end points (the occurrence of diabetes-related autoantibodies and the diagnosis of type 1 diabetes). The resulting list of case-control matched samples for each laboratory was augmented with external quality control samples.
|