1
Wang F, Jia K, Li Y. Integrative deep learning with prior assisted feature selection. Stat Med 2024; 43:3792-3814. [PMID: 38923006 DOI: 10.1002/sim.10148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 04/23/2024] [Accepted: 06/07/2024] [Indexed: 06/28/2024]
Abstract
Integrative analysis has emerged as a prominent tool in biomedical research, offering a solution to the "small $n$ and large $p$" challenge. Leveraging the powerful capabilities of deep learning in extracting complex relationships between genes and diseases, our objective in this study is to incorporate deep learning into the framework of integrative analysis. Recognizing the redundancy within candidate features, we introduce a dedicated feature selection layer in the proposed integrative deep learning method. To further improve the performance of feature selection, an ensemble learning method draws on the rich body of previous research to identify "prior information". This leads to the proposed prior assisted integrative deep learning (PANDA) method. We demonstrate the superiority of the PANDA method through a series of simulation studies, showing its clear advantages over competing approaches in both feature selection and outcome prediction. Finally, a skin cutaneous melanoma (SKCM) dataset is extensively analyzed by the PANDA method to show its practical application.
Affiliation(s)
- Feifei Wang
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
- Ke Jia
  - School of Statistics, Renmin University of China, Beijing, China
- Yang Li
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
2
Li R, Xu S, Li Y, Tang Z, Feng D, Cai J, Ma S. Incorporating prior information in gene expression network-based cancer heterogeneity analysis. Biostatistics 2024:kxae028. [PMID: 39074174 DOI: 10.1093/biostatistics/kxae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 07/04/2024] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open
Abstract
Cancer is molecularly heterogeneous, with seemingly similar patients having different molecular landscapes and accordingly different clinical behaviors. In recent studies, gene expression networks have been shown to be more effective and informative for cancer heterogeneity analysis than some simpler measures. Gene interconnections can be classified as "direct" and "indirect," where the latter can be caused by shared genomic regulators (such as transcription factors, microRNAs, and other regulatory molecules) and other mechanisms. It has been suggested that incorporating the regulators of gene expressions in network analysis and focusing on the direct interconnections can lead to a deeper understanding of the more essential gene interconnections. Such analysis can be seriously challenged by the large number of parameters (jointly caused by network analysis, incorporation of regulators, and heterogeneity) and often weak signals. To effectively tackle this problem, we propose incorporating prior information contained in the published literature. A key challenge is that such prior information can be partial or even wrong. We develop a two-step procedure that can flexibly accommodate different levels of prior information quality. Simulation demonstrates the effectiveness of the proposed approach and its superiority over relevant competitors. In the analysis of a breast cancer dataset, the findings differ from those of the alternatives, and the identified sample subgroups have important clinical differences.
Affiliation(s)
- Rong Li
  - Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, 06511, CT, United States
- Shaodong Xu
  - Center for Applied Statistics and School of Statistics, Renmin University of China, 59 Zhongguancun Street, 100872, Beijing, China
- Yang Li
  - Center for Applied Statistics and School of Statistics, Renmin University of China, 59 Zhongguancun Street, 100872, Beijing, China
- Zuojian Tang
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, 06877, CT, United States
- Di Feng
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, 06877, CT, United States
- James Cai
  - Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, 06877, CT, United States
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, 06511, CT, United States
3
Han W, Zhang S, Ma S, Ren M. Information-incorporated sparse hierarchical cancer heterogeneity analysis. Stat Med 2024; 43:2280-2297. [PMID: 38553996 DOI: 10.1002/sim.10071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2023] [Revised: 01/11/2024] [Accepted: 03/19/2024] [Indexed: 05/18/2024]
Abstract
Cancer heterogeneity analysis is essential for precision medicine. Most of the existing heterogeneity analyses only consider a single type of data and ignore the possible sparsity of important features. In cancer clinical practice, it has been suggested that two types of data, pathological imaging and omics data, are commonly collected and can produce hierarchical heterogeneous structures, in which the refined sub-subgroup structure determined by omics features can be nested in the rough subgroup structure determined by the imaging features. Moreover, sparsity pursuit has extraordinary significance and is more challenging for heterogeneity analysis, because the important features may not be the same in different subgroups, which is ignored by the existing heterogeneity analyses. Fortunately, rich information from previous literature (for example, those deposited in PubMed) can be used to assist feature selection in the present study. Advancing from the existing analyses, in this study, we propose a novel sparse hierarchical heterogeneity analysis framework, which can integrate two types of features and incorporate prior knowledge to improve feature selection. The proposed approach has satisfactory statistical properties and competitive numerical performance. A TCGA real data analysis demonstrates the practical value of our approach in analyzing data heterogeneity and sparsity.
Affiliation(s)
- Wei Han
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
- Sanguo Zhang
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
- Mingyang Ren
  - School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China
4
Ma Y, Li Y, Zhang Z, Du G, Huang T, Zhao ZZ, Liu S, Dang Z. Establishment of a Risk Prediction Model for Metabolic Syndrome in High Altitude Areas in Qinghai Province, China: A Cross-Sectional Study. Diabetes Metab Syndr Obes 2024; 17:2041-2052. [PMID: 38774573 PMCID: PMC11107940 DOI: 10.2147/dmso.s445650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 04/05/2024] [Indexed: 05/24/2024] Open
Abstract
Purpose The prevalence of metabolic syndrome (MetS) is increasing worldwide, and early prediction of MetS risk is highly beneficial for health outcomes. This study aimed to develop and validate a nomogram to predict MetS risk in Qinghai Province, China, and to provide a methodological reference for MetS prevention and control in the province. Patients and Methods A total of 3073 participants living between 1900 and 3710 meters above sea level in Qinghai Province participated in this study between March 2014 and March 2016. We omitted 12 subjects who were missing diagnostic component data for MetS, leaving 3061 research subjects; 70% of the subjects were assigned randomly to the training set, and the remaining subjects were assigned to the validation set. The least absolute shrinkage and selection operator (LASSO) regression analysis method was used for variable selection, fitted by cyclic coordinate descent with 10-fold cross-validation. Multivariable logistic regression was then performed to develop a predictive model and nomogram. Receiver operating characteristic (ROC) curves were used for model evaluation, and calibration plots and decision curve analysis (DCA) were used for model validation. Results Of 24 variables studied, 6 risk predictors were identified by LASSO regression analysis: hyperlipidaemia, hyperglycemia, abdominal obesity, systolic blood pressure (SBP), diastolic blood pressure (DBP), and body mass index (BMI). A prediction model including these 6 risk factors was constructed and displayed good predictability with an area under the ROC curve of 0.914 for the training set and 0.930 for the validation set. DCA revealed that if the threshold probability of MetS is less than 82%, the application of this nomogram is more beneficial than either the treat-all or the treat-none strategy. Conclusion The nomogram developed in our study demonstrated strong discriminative power and clinical applicability, making it a valuable reference for MetS prevention and control in the plateau areas of Qinghai Province.
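The decision curve analysis mentioned in this abstract compares the net benefit of a model against treat-all and treat-none strategies at a chosen threshold probability. The sketch below uses the standard net-benefit formula, TP/n - (FP/n) * pt/(1 - pt); the function names and the toy labels and risks are hypothetical and are not taken from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are penalised by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every subject regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = event) and predicted risks, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.05, 0.4, 0.15]

for pt in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    # Treat-none always has a net benefit of 0.
    print(f"threshold={pt:.2f} model={model_nb:.3f} treat_all={all_nb:.3f} treat_none=0.000")
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all value and zero.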
Affiliation(s)
- Yanting Ma
  - Department of Public Health, Medical College, Qinghai University, Xining, Qinghai, People’s Republic of China
- Yongyuan Li
  - Disease Control Department, Huangzhong District Health Bureau, Xining, Qinghai, People’s Republic of China
- Zhanfeng Zhang
  - Huangzhong District, Duoba County Health Services Center, Xining, Qinghai, People’s Republic of China
- Guomei Du
  - Clinical Laboratory, Qinghai Red Cross Hospital, Xining, Qinghai, People’s Republic of China
- Ting Huang
  - Department of Public Health, Medical College, Qinghai University, Xining, Qinghai, People’s Republic of China
- Zhi Zhong Zhao
  - Disease Control Department, Qinghai Provincial Center for Endemic Disease Control and Prevention, Xining, Qinghai, People’s Republic of China
- Shou Liu
  - Department of Public Health, Medical College, Qinghai University, Xining, Qinghai, People’s Republic of China
- Zhancui Dang
  - Department of Public Health, Medical College, Qinghai University, Xining, Qinghai, People’s Republic of China
5
Bhattarai SP, Dzikowicz DJ, Xue Y, Block R, Tucker RG, Bhandari S, Boulware VE, Stone B, Carey MG. Estimating Ejection Fraction from the 12 Lead ECG among Patients with Acute Heart Failure. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.03.25.24304875. [PMID: 38585894 PMCID: PMC10996705 DOI: 10.1101/2024.03.25.24304875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Background Identifying patients with low left ventricular ejection fraction (LVEF) in the emergency department using an electrocardiogram (ECG) may optimize acute heart failure (AHF) management. We aimed to assess the efficacy of 527 automated 12-lead ECG features for estimating LVEF among patients with AHF. Method Medical records of patients >18 years old with AHF-related ICD codes were analyzed, including demographics, LVEF %, comorbidities, and medication. Least Absolute Shrinkage and Selection Operator (LASSO) regression was used to identify important ECG features, and its performance was evaluated. Results Among 851 patients, the mean age was 74 years (IQR: 11), 56% were male (n=478), and the median body mass index was 29 kg/m² (IQR: 1.8). A total of 914 echocardiograms and ECGs were matched; the time between ECG and echocardiogram was 9 hours (IQR: 9 hours); LVEF was ≤30% in 16.45% (n=140). LASSO identified 42 ECG features as important for estimating LVEF ≤30%. The predictive model of LVEF ≤30% demonstrated an area under the curve (AUC) of 0.86 (95% confidence interval [CI]: 0.83 to 0.89), a specificity of 54% (50% to 57%), a sensitivity of 91% (95% CI: 88% to 96%), an accuracy of 60% (95% CI: 60% to 63%), and a negative predictive value of 95%. Conclusions An explainable machine learning model with physiologically feasible predictors may be useful in screening patients with low LVEF in AHF.
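The screening metrics reported in this abstract (sensitivity, specificity, accuracy, negative predictive value) all derive from one confusion matrix. The sketch below shows the standard definitions; the function name and the counts are hypothetical and are not the study's data. They were chosen only to show how high sensitivity and modest specificity can coexist with a high negative predictive value when the condition is uncommon.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases that are cleared
        "npv": tn / (tn + fn),          # share of negative calls that are correct
        "accuracy": (tp + tn) / total,  # share of all calls that are correct
    }


# Hypothetical counts: 20 true cases among 200 patients (10% prevalence).
metrics = screening_metrics(tp=18, fp=90, tn=90, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy weights cases and non-cases by their frequency, a low-prevalence screening model with modest specificity will show modest accuracy even when it misses few true cases.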
Affiliation(s)
- Dillon J Dzikowicz
  - University of Rochester School of Nursing, NY
  - University of Rochester Medical Center, NY
  - Clinical Cardiovascular Research Center, University of Rochester Medical Center, NY
- Ying Xue
  - University of Rochester School of Nursing, NY
- Robert Block
  - Department of Public Health Sciences, University of Rochester Medical Center, NY
  - Cardiology Division, Department of Medicine, University of Rochester Medical Center
- Mary G Carey
  - University of Rochester School of Nursing, NY
  - University of Rochester Medical Center, NY
6
Zhang YH, Xie LH, Li J, Qi YW, Shi JJ. Classification and clinical significance of immunogenic cell death-related genes in Plasmodium falciparum infection determined by integrated bioinformatics analysis and machine learning. Malar J 2024; 23:48. [PMID: 38360586 PMCID: PMC10868002 DOI: 10.1186/s12936-024-04877-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Accepted: 02/10/2024] [Indexed: 02/17/2024] Open
Abstract
BACKGROUND Immunogenic cell death (ICD) is a type of regulated cell death that plays a crucial role in activating the immune system in response to various stressors, including cancer cells and pathogens. However, the involvement of ICD in the human immune response against malaria remains to be defined. METHODS In this study, data from Plasmodium falciparum infection cohorts, derived from cross-sectional studies, were analysed to identify ICD subtypes and their correlation with parasitaemia and immune responses. Using consensus clustering, ICD subtypes were identified, and their association with the immune landscape was assessed by employing ssGSEA. Differentially expressed genes (DEGs) analysis, functional enrichment, protein-protein interaction networks, and machine learning (least absolute shrinkage and selection operator (LASSO) regression and random forest) were used to identify ICD-associated hub genes linked with high parasitaemia. A nomogram visualizing these genes' correlation with parasitaemia levels was developed, and its performance was evaluated using receiver operating characteristic (ROC) curves. RESULTS In the P. falciparum infection cohort, two ICD-associated subtypes were identified, with subtype 1 showing better adaptive immune responses and lower parasitaemia compared to subtype 2. DEGs analysis revealed upregulation of proliferative signalling pathways, T-cell receptor signalling pathways and T-cell activation and differentiation in subtype 1, while subtype 2 exhibited elevated cytokine signalling and inflammatory responses. PPI network construction and machine learning identified CD3E and FCGR1A as candidate hub genes. A constructed nomogram integrating these genes demonstrated significant classification performance of high parasitaemia, which was evidenced by AUC values ranging from 0.695 to 0.737 in the training set and 0.911 to 0.933 and 0.759 to 0.849 in two validation sets, respectively. Additionally, significant correlations between the expressions of these genes and the clinical manifestation of P. falciparum infection were observed. CONCLUSION This study reveals the existence of two ICD subtypes in the human immune response against P. falciparum infection. Two ICD-associated candidate hub genes were identified, and a nomogram was constructed for the classification of high parasitaemia. This study can deepen the understanding of the human immune response to P. falciparum infection and provide new targets for the prevention and control of malaria.
Affiliation(s)
- Yan-Hui Zhang
  - Key Laboratory of Gastrointestinal Cancer (Fujian Medical University), Ministry of Education, Fuzhou, China
- Li-Hua Xie
  - Key Laboratory of Gastrointestinal Cancer (Fujian Medical University), Ministry of Education, Fuzhou, China
- Jian Li
  - State Key Laboratory of Cellular Stress Biology, Innovation Center for Cell Signaling Network, School of Life Sciences, Xiamen University, Xiamen, Fujian, China
- Yan-Wei Qi
  - Department of Pathogenic Biology and Immunology, School of Basic Medical Sciences, Guangzhou Medical University, Guangzhou, China
- Jia-Jian Shi
  - Key Laboratory of Gastrointestinal Cancer (Fujian Medical University), Ministry of Education, Fuzhou, China
7
Wang Y, Liu S, Zhang W, Zheng L, Li E, Zhu M, Yan D, Shi J, Bao J, Yu J. Development and Evaluation of a Nomogram for Predicting the Outcome of Immune Reconstitution Among HIV/AIDS Patients Receiving Antiretroviral Therapy in China. Adv Biol (Weinh) 2024; 8:e2300378. [PMID: 37937390 DOI: 10.1002/adbi.202300378] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 10/12/2023] [Indexed: 11/09/2023]
Abstract
This study aims to develop and evaluate a model to predict the immune reconstitution among HIV/AIDS patients after antiretroviral therapy (ART). A total of 502 HIV/AIDS patients are randomized to the training cohort and evaluation cohort. Least absolute shrinkage and selection operator (LASSO) regression and multivariate logistic regression analysis are performed to identify the indicators and establish the nomogram for predicting the immune reconstitution. Decision curve analysis (DCA) and clinical impact curve (CIC) are used to evaluate the clinical effectiveness of the nomogram. Predictive factors included white blood cells (WBC), baseline CD4+ T-cell counts (baseline CD4), ratio of effector regulatory T cells to resting regulatory T cells (eTreg/rTreg) and low-density lipoprotein cholesterol (LDL-C) and are incorporated into the nomogram. The area under the curve (AUC) is 0.812 (95% CI, 0.767∼0.851) and 0.794 (95%CI, 0.719∼0.857) in the training cohort and evaluation cohort, respectively. The calibration curve shows a high consistency between the predicted and actual observations. Moreover, DCA and CIC indicate that the nomogram has a superior net benefit in predicting poor immune reconstitution. A simple-to-use nomogram containing four routinely collected variables is developed and internally evaluated and can be used to predict the poor immune reconstitution in HIV/AIDS patients after ART.
Affiliation(s)
- Yi Wang
  - Institute of Hepatology and Epidemiology, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
- Shourong Liu
  - Department of Infection, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
- Wenhui Zhang
  - Department of Infection, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
  - Department of Nursing, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
- Liping Zheng
  - Department of Nursing, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
- Er Li
  - Department of Nursing, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
- Mingli Zhu
  - Medical Laboratory, Affiliated Hangzhou Xixi Hospital, Zhejiang University School of Medicine, Hangzhou, 310023, China
- Dingyan Yan
  - Department of Infection, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
  - Department of Nursing, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
- Jinchuan Shi
  - Department of Infection, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
- Jianfeng Bao
  - Institute of Hepatology and Epidemiology, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
- Jianhua Yu
  - Department of Infection, Affiliated Xixi Hospital in Hangzhou, Zhejiang University of Traditional Chinese Medicine, Hangzhou, 310023, China
8
Rauschenberger A, Landoulsi Z, van de Wiel MA, Glaab E. Penalized regression with multiple sources of prior effects. Bioinformatics 2023; 39:btad680. [PMID: 37951587 PMCID: PMC10699841 DOI: 10.1093/bioinformatics/btad680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 10/19/2023] [Accepted: 11/08/2023] [Indexed: 11/14/2023] Open
Abstract
MOTIVATION In many high-dimensional prediction or classification tasks, complementary data on the features are available, e.g. prior biological knowledge on (epi)genetic markers. Here we consider tasks with numerical prior information that provide an insight into the importance (weight) and the direction (sign) of the feature effects, e.g. regression coefficients from previous studies. RESULTS We propose an approach for integrating multiple sources of such prior information into penalized regression. If suitable co-data are available, this improves the predictive performance, as shown by simulation and application. AVAILABILITY AND IMPLEMENTATION The proposed method is implemented in the R package transreg (https://github.com/lcsb-bds/transreg, https://cran.r-project.org/package=transreg).
Affiliation(s)
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg
- Zied Landoulsi
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg
- Mark A van de Wiel
  - Department of Epidemiology and Data Science (EDS), Amsterdam University Medical Centers (Amsterdam UMC), 1081 HV Amsterdam, The Netherlands
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg
9
Liu Y, Yin P, Cui J, Sun C, Chen L, Hong N, Li Z. Radiomics analysis based on CT for the prediction of pulmonary metastases in ewing sarcoma. BMC Med Imaging 2023; 23:147. [PMID: 37784073 PMCID: PMC10544364 DOI: 10.1186/s12880-023-01077-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2023] [Accepted: 08/14/2023] [Indexed: 10/04/2023] Open
Abstract
OBJECTIVES This study aimed to develop and validate radiomics models on the basis of computed tomography (CT) and clinical features for the prediction of pulmonary metastases (MT) in patients with Ewing sarcoma (ES) within 2 years after diagnosis. MATERIALS AND METHODS A total of 143 patients with a histopathological diagnosis of ES were enrolled in this study (114 in the training cohort and 29 in the validation cohort). The regions of interest (ROIs) were handcrafted along the boundary of each tumor on the CT and CT-enhanced (CTE) images, and radiomic features were extracted. Six different models were built, including three radiomics models (CT, CTE and ComB models) and three clinical-radiomics models (CT_clinical, CTE_clinical and ComB_clinical models). The area under the receiver operating characteristic curve (AUC) and accuracy were calculated to evaluate the different models, and the DeLong test was used to compare the AUCs of the models. RESULTS Among the clinical risk factors, the therapeutic method differed significantly between the MT and non-MT groups (P<0.01). The six models performed well in predicting pulmonary metastases in patients with ES, and the ComB model (AUC: 0.866/0.852 in training/validation cohort) achieved the highest AUC among the six models. However, no statistically significant difference was observed among the AUCs of the models. CONCLUSIONS In patients with ES, a clinical-radiomics model created using a radiomics signature and clinical features provided favorable ability and accuracy for pulmonary metastases prediction.
Affiliation(s)
- Ying Liu
  - Department of Radiology, Peking University People's Hospital, 11 Xizhimen Nandajie, Xicheng District, Beijing, 100044, People's Republic of China
- Ping Yin
  - Department of Radiology, Peking University People's Hospital, 11 Xizhimen Nandajie, Xicheng District, Beijing, 100044, People's Republic of China
- Jingjing Cui
  - United Imaging Intelligence (Beijing) Co., Ltd, Yongteng North Road, Haidian District, Beijing, 100094, People's Republic of China
- Chao Sun
  - Department of Radiology, Peking University People's Hospital, 11 Xizhimen Nandajie, Xicheng District, Beijing, 100044, People's Republic of China
- Lei Chen
  - Department of Radiology, Peking University People's Hospital, 11 Xizhimen Nandajie, Xicheng District, Beijing, 100044, People's Republic of China
- Nan Hong
  - Department of Radiology, Peking University People's Hospital, 11 Xizhimen Nandajie, Xicheng District, Beijing, 100044, People's Republic of China
- Zhentao Li
  - Department of Radiology, Peking University People's Hospital, 11 Xizhimen Nandajie, Xicheng District, Beijing, 100044, People's Republic of China
10
Wang F, Liang D, Li Y, Ma S. Prior information-assisted integrative analysis of multiple datasets. Bioinformatics 2023; 39:btad452. [PMID: 37490475 PMCID: PMC10400378 DOI: 10.1093/bioinformatics/btad452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 05/13/2023] [Accepted: 07/24/2023] [Indexed: 07/27/2023] Open
Abstract
MOTIVATION Analyzing genetic data to identify markers and construct predictive models is of great interest in biomedical research. However, limited by cost and sample availability, genetic studies often suffer from the "small sample size, high dimensionality" problem. To tackle this problem, an integrative analysis that collectively analyzes multiple datasets with compatible designs is often conducted. For regularizing estimation and selecting relevant variables, penalization and other regularization techniques are routinely adopted. "Blindly" searching over a vast number of variables may not be efficient. RESULTS We propose incorporating prior information to assist integrative analysis of multiple genetic datasets. To obtain accurate prior information, we adopt a convolutional neural network with an active learning strategy to label textual information from previous studies. Then the extracted prior information is incorporated using a group LASSO-based technique. We conducted a series of simulation studies that demonstrated the satisfactory performance of the proposed method. Finally, data on skin cutaneous melanoma are analyzed to establish practical utility. AVAILABILITY AND IMPLEMENTATION Code is available at https://github.com/ldz7/PAIA. The data that support the findings in this article are openly available in TCGA (The Cancer Genome Atlas) at https://portal.gdc.cancer.gov/.
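One generic way prior information can enter a penalized regression is through feature-specific penalty weights, so that variables supported by earlier studies are shrunk less. The sketch below is a plain weighted LASSO solved by cyclic coordinate descent. It is not the group LASSO-based technique of the article and not the code in the PAIA repository; the function names, the toy design matrix, and the weight values are all assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used in LASSO coordinate updates."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - Xb, kept in sync with beta
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new_bj = soft_threshold(rho, lam * weights[j]) / scale
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for r, v in zip(resid, col)]
                beta[j] = new_bj
    return beta


# Hypothetical orthogonal design: feature 0 has a strong effect, feature 1 a weak one.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [1.3, -0.7, 0.7, -1.3]

no_prior = weighted_lasso(X, y, lam=0.4, weights=[1.0, 1.0])
with_prior = weighted_lasso(X, y, lam=0.4, weights=[1.0, 0.25])  # prior favours feature 1
print("equal weights:", no_prior)
print("prior-informed weights:", with_prior)
```

With equal weights the weak feature is removed from the model, while a reduced penalty weight lets it stay in with a shrunken coefficient. Whether that helps depends on the prior being right, which is why the quality of the extracted prior information matters.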
Affiliation(s)
- Feifei Wang
  - Center for Applied Statistics, Renmin University of China, Beijing 100872, China
  - School of Statistics, Renmin University of China, Beijing 100872, China
  - Institute for Data Science in Health, Renmin University of China, Beijing 100872, China
- Dongzuo Liang
  - School of Statistics, Renmin University of China, Beijing 100872, China
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing 100872, China
- Yang Li
  - Center for Applied Statistics, Renmin University of China, Beijing 100872, China
  - School of Statistics, Renmin University of China, Beijing 100872, China
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing 100872, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT 06520, United States
11
Zhang X, Liu CT. Information-incorporated sparse convex clustering for disease subtyping. Bioinformatics 2023; 39:btad417. [PMID: 37382570 PMCID: PMC10329496 DOI: 10.1093/bioinformatics/btad417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 05/13/2023] [Accepted: 06/28/2023] [Indexed: 06/30/2023] Open
Abstract
MOTIVATION Heterogeneity in human diseases presents clinical challenges in accurate disease characterization and treatment. Recently available high throughput multi-omics data may offer a great opportunity to explore the underlying mechanisms of diseases and improve disease heterogeneity assessment throughout the treatment course. In addition, increasingly accumulated data from existing literature may be informative about disease subtyping. However, the existing clustering procedures, such as Sparse Convex Clustering (SCC), cannot directly utilize the prior information even though SCC produces stable clusters. RESULTS We develop a clustering procedure, information-incorporated Sparse Convex Clustering, to respond to the need for disease subtyping in precision medicine. Utilizing the text mining approach, the proposed method leverages the existing information from previously published studies through a group lasso penalty to improve disease subtyping and biomarker identification. The proposed method allows taking heterogeneous information, such as multi-omics data. We conduct simulation studies under several scenarios with various accuracy of the prior information to evaluate the performance of our method. The proposed method outperforms other clustering methods, such as SCC, K-means, Sparse K-means, iCluster+, and Bayesian Consensus Clustering. In addition, the proposed method generates more accurate disease subtypes and identifies important biomarkers for future studies in real data analysis of breast and lung cancer-related omics data. In conclusion, we present an information-incorporated clustering procedure that allows coherent pattern discovery and feature selection. AVAILABILITY AND IMPLEMENTATION The code is available upon request.
Affiliation(s)
- Xiaoyu Zhang
- Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, United States
| | - Ching-Ti Liu
- Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, United States
| |
|
12
|
Tran L, He K, Wang D, Jiang H. A cross-validation statistical framework for asymmetric data integration. Biometrics 2023; 79:1280-1292. [PMID: 35524490 PMCID: PMC9637892 DOI: 10.1111/biom.13685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2021] [Accepted: 04/19/2022] [Indexed: 11/26/2022]
Abstract
The proliferation of biobanks and large public clinical data sets enables their integration with a smaller amount of locally gathered data for the purposes of parameter estimation and model prediction. However, public data sets may be subject to context-dependent confounders and the protocols behind their generation are often opaque; naively integrating all external data sets equally can bias estimates and lead to spurious conclusions. Weighted data integration is a potential solution, but current methods still require subjective specifications of weights and can become computationally intractable. Under the assumption that local data are generated from the set of unknown true parameters, we propose a novel weighted integration method based upon using the external data to minimize the local data leave-one-out cross validation (LOOCV) error. We demonstrate how the optimization of LOOCV errors for linear and Cox proportional hazards models can be rewritten as functions of external data set integration weights. Significant reductions in estimation error and prediction error are shown using simulation studies mimicking the heterogeneity of clinical data as well as a real-world example using kidney transplant patients from the Scientific Registry of Transplant Recipients.
Affiliation(s)
- Lam Tran
- Department of Biostatistics, University of Michigan, Ann Arbor MI, USA
| | - Kevin He
- Department of Biostatistics, University of Michigan, Ann Arbor MI, USA
| | - Di Wang
- Department of Biostatistics, University of Michigan, Ann Arbor MI, USA
| | - Hui Jiang
- Department of Biostatistics, University of Michigan, Ann Arbor MI, USA
| |
|
13
|
Li S, Zhang L, Cai TT, Li H. Estimation and Inference for High-Dimensional Generalized Linear Models with Knowledge Transfer. J Am Stat Assoc 2023; 119:1274-1285. [PMID: 38948492 PMCID: PMC11213555 DOI: 10.1080/01621459.2023.2184373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2021] [Accepted: 02/15/2023] [Indexed: 03/06/2023]
Abstract
Transfer learning provides a powerful tool for incorporating data from related studies into a target study of interest. In epidemiology and medical studies, the classification of a target disease could borrow information across other related diseases and populations. In this work, we consider transfer learning for high-dimensional generalized linear models (GLMs). A novel algorithm, TransHDGLM, that integrates data from the target study and the source studies is proposed. Minimax rate of convergence for estimation is established and the proposed estimator is shown to be rate-optimal. Statistical inference for the target regression coefficients is also studied. Asymptotic normality for a debiased estimator is established, which can be used for constructing coordinate-wise confidence intervals of the regression coefficients. Numerical studies show significant improvement in estimation and inference accuracy over GLMs that only use the target data. The proposed methods are applied to a real data study concerning the classification of colorectal cancer using gut microbiomes, and are shown to enhance the classification accuracy in comparison to methods that only use the target data.
Affiliation(s)
- Sai Li
- Institute of Statistics and Big Data, Renmin University of China, China
| | - Linjun Zhang
- Department of Statistics, Rutgers University, New Brunswick, NJ 08854
| | - T Tony Cai
- Department of Statistics, the Wharton School, University of Pennsylvania, Philadelphia, PA 19104
| | - Hongzhe Li
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104
| |
|
14
|
Han W, Li H, Gong M. Multi-regularization sparse reconstruction based on multifactorial multiobjective optimization. Appl Soft Comput 2023. [DOI: 10.1016/j.asoc.2023.110122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2023]
|
15
|
Shen J, Li H, Yu X, Bai L, Dong Y, Cao J, Lu K, Tang Z. Efficient feature extraction from highly sparse binary genotype data for cancer prognosis prediction using an auto-encoder. Front Oncol 2023; 12:1091767. [PMID: 36703783 PMCID: PMC9872139 DOI: 10.3389/fonc.2022.1091767] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/07/2022] [Accepted: 12/19/2022] [Indexed: 01/11/2023] Open
Abstract
Genomics involving tens of thousands of genes is a complex system determining phenotype. An important question is how to integrate highly sparse genomic data, with many minor effects, into a prediction model to improve predictive power. We find that the deep learning method can work well to extract features by transforming highly sparse dichotomous data to lower-dimensional continuous data in a non-linear way. This may benefit risk prediction based on genotype data. We developed a multi-stage strategy to extract information from highly sparse binary genotype data and applied it for cancer prognosis. Specifically, we first reduced the size of binary biomarkers via a univariable regression model to a moderate size. Then, a trainable auto-encoder was used to learn compact features from the reduced data. Next, we applied LASSO to select the optimal combination of extracted features. Lastly, we applied such feature combination to real cancer prognostic models and evaluated the raw predictive effect of the models. The results indicated that these compressed transformation features could improve the model's original predictive performance and might avoid an overfitting problem. This idea may be enlightening for everyone involved in cancer research, risk reduction, treatment, and patient care via integrating genomics data.
Affiliation(s)
- Junjie Shen
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
| | - Huijun Li
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
| | - Xinghao Yu
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China; Center for Genetic Epidemiology and Genomics, School of Public Health, Medical College of Soochow University, Suzhou, China
| | - Lu Bai
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
| | - Yongfei Dong
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
| | - Jianping Cao
- School of Radiation Medicine and Protection and Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, Soochow University, Suzhou, China
| | - Ke Lu
- Department of Orthopedics, Affiliated Kunshan Hospital of Jiangsu University, Suzhou, China; Correspondence: Zaixiang Tang; Ke Lu
| | - Zaixiang Tang
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China; Correspondence: Zaixiang Tang; Ke Lu
| |
|
16
|
Wen C, Wang Q, Jiang Y. Stability Approach to Regularization Selection for Reduced-Rank Regression. J Comput Graph Stat 2022; 32:974-984. [PMID: 37810194 PMCID: PMC10554232 DOI: 10.1080/10618600.2022.2119986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2021] [Accepted: 08/22/2022] [Indexed: 10/17/2022]
Abstract
The reduced-rank regression model is a popular model to deal with multivariate response and multiple predictors, and is widely used in biology, chemometrics, econometrics, engineering, and other fields. In the reduced-rank regression modelling, a central objective is to estimate the rank of the coefficient matrix that represents the number of effective latent factors in predicting the multivariate response. Although theoretical results such as rank estimation consistency have been established for various methods, in practice rank determination still relies on information-criterion-based methods such as AIC and BIC or subsampling-based methods such as cross validation. Unfortunately, the theoretical properties of these practical methods are largely unknown. In this paper, we present a novel method called StARS-RRR that selects the tuning parameter and then estimates the rank of the coefficient matrix for reduced-rank regression based on the stability approach. We prove that StARS-RRR achieves rank estimation consistency, i.e., the rank estimated with the tuning parameter selected by StARS-RRR is consistent for the true rank. Through a simulation study, we show that StARS-RRR outperforms other tuning parameter selection methods including AIC, BIC, and cross validation as it provides the most accurate estimated rank. In addition, when applied to a breast cancer dataset, StARS-RRR discovers a reasonable number of genetic pathways that affect the DNA copy number variations and results in a smaller prediction error than the other methods with a random-splitting process.
Affiliation(s)
- Canhong Wen
- International Institute of Finance, School of Management, University of Science and Technology of China
| | - Qin Wang
- International Institute of Finance, School of Management, University of Science and Technology of China
| | - Yuan Jiang
- Department of Statistics, Oregon State University
| |
|
17
|
Sen Puliparambil B, Tomal JH, Yan Y. A Novel Algorithm for Feature Selection Using Penalized Regression with Applications to Single-Cell RNA Sequencing Data. Biology 2022; 11:biology11101495. [PMID: 36290397 PMCID: PMC9598401 DOI: 10.3390/biology11101495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Revised: 09/21/2022] [Accepted: 09/30/2022] [Indexed: 11/05/2022]
Abstract
With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists are able to examine gene expression at single-cell resolution. Analysis of scRNA-seq data has its own challenges, which stem from its high dimensionality. Machine learning offers the potential for gene (feature) selection from high-dimensional scRNA-seq data. Even though there exist multiple machine learning methods that appear to be suitable for feature selection, such as penalized regression, there is no rigorous comparison of their performances across data sets, where each poses its own challenges. Therefore, in this paper, we analyzed and compared multiple penalized regression methods for scRNA-seq data. Given the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) using the metrics area under the receiver operating characteristic curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods. The proposed algorithm works by selecting a small subset of genes and applying SGL to select the differentially expressed genes in scRNA-seq data. By using hierarchical clustering to group genes, the proposed method bypasses the need for domain-specific knowledge for gene grouping information. In addition, the proposed algorithm provided consistently better AUC for the data sets used.
Affiliation(s)
- Bhavithry Sen Puliparambil
- Master of Science in Data Science Program, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
| | - Jabed H. Tomal
- Department of Mathematics and Statistics, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
| | - Yan Yan
- Department of Computing Science, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
| |
|
18
|
Zu T, Lian H, Green B, Yu Y. Ultra-high Dimensional Quantile Regression for Longitudinal Data: an Application to Blood Pressure Analysis. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2128806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
Affiliation(s)
- Tianhai Zu
- Department of Operations, Business Analytics, & Information Systems, University of Cincinnati, Cincinnati, Ohio, USA
| | - Heng Lian
- Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong Hong Kong, China
| | - Brittany Green
- Department of Information Systems, Analytics, and Operations, University of Louisville, Louisville, Kentucky, USA
| | - Yan Yu
- Department of Operations, Business Analytics, & Information Systems, University of Cincinnati, Cincinnati, Ohio, USA
| |
|
19
|
Chen J, Bie R, Qin Y, Li Y, Ma S. Lq-based robust analytics on ultrahigh and high dimensional data. Stat Med 2022; 41:5220-5241. [PMID: 36098057 DOI: 10.1002/sim.9563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2021] [Revised: 06/02/2022] [Accepted: 08/02/2022] [Indexed: 11/10/2022]
Abstract
Ultrahigh and high dimensional data are common in regression analysis for various fields, such as omics data, finance, and biological engineering. In addition to the problem of dimension, the data might also be contaminated. There are two main types of contamination: outliers and model misspecification. We develop a unique method that takes into account the ultrahigh or high dimensional issues and both types of contamination. In this article, we propose a framework for feature screening and selection based on the minimum Lq-likelihood estimation (MLqE), which accounts for the model misspecification contamination issue and has also been shown to be robust to outliers. In numerical analysis, we explore the robustness of this framework under different outliers and model misspecification scenarios. To examine the performance of this framework, we conduct real data analysis using the skin cutaneous melanoma data. Compared with traditional screening and feature selection methods, the proposed method shows superiority in both variable identification effectiveness and parameter estimation accuracy.
Affiliation(s)
- Jiachen Chen
- Department of Biostatistics, Boston University, Boston, MA, USA
| | - Ruofan Bie
- Department of Biostatistics, Brown University, Providence, RI, USA
| | - Yichen Qin
- Department of Operations, Business Analytics and Information Systems, University of Cincinnati, Cincinnati, OH, USA
| | - Yang Li
- Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China; RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China
| | - Shuangge Ma
- Department of Biostatistics, Boston University, Boston, MA, USA; RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China
| |
|
20
|
Wang X, Liu H, Ma S. GEInfo: an R package for gene-environment interaction analysis incorporating prior information. Bioinformatics 2022; 38:3139-3140. [PMID: 35485739 PMCID: PMC9154264 DOI: 10.1093/bioinformatics/btac301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2022] [Revised: 03/30/2022] [Accepted: 04/25/2022] [Indexed: 11/12/2022] Open
Abstract
SUMMARY Gene-environment (G-E) interactions have important implications for many complex diseases. With higher dimensionality and weaker signals, G-E interaction analysis is more challenging than the analysis of main G (and E) effects. The accumulation of published literature makes it possible to borrow strength from prior information and improve analysis. In a recent study, a 'quasi-likelihood + penalization' approach was developed to effectively incorporate prior information. Here, we first extend it to linear, logistic and Poisson regressions. Such models are much more popular in practice. More importantly, we develop the R package GEInfo, which realizes this approach in a user-friendly manner. To facilitate direct comparison and routine data analysis, the package also includes functions for alternative methods and visualization. AVAILABILITY AND IMPLEMENTATION The package is available at https://CRAN.R-project.org/package=GEInfo. SUPPLEMENTARY INFORMATION Supplementary materials are available at Bioinformatics online.
Affiliation(s)
- Xiaoyan Wang
- College of Finance and Statistics, Hunan University, Changsha 410079, Hunan, China
| | - Hongduo Liu
- College of Finance and Statistics, Hunan University, Changsha 410079, Hunan, China
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT 06520, USA
| |
|
21
|
Li Y, Xu S, Ma S, Wu M. Network-based cancer heterogeneity analysis incorporating multi-view of prior information. Bioinformatics 2022; 38:2855-2862. [PMID: 35561185 PMCID: PMC9113254 DOI: 10.1093/bioinformatics/btac183] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/10/2021] [Revised: 02/22/2022] [Accepted: 03/22/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Cancer genetic heterogeneity analysis has critical implications for tumour classification, response to therapy and choice of biomarkers to guide personalized cancer medicine. However, existing heterogeneity analysis based solely on molecular profiling data usually suffers from a lack of information and has limited effectiveness. Many biomedical and life sciences databases have accumulated a substantial volume of meaningful biological information. They can provide additional information beyond molecular profiling data, yet pose challenges arising from potential noise and uncertainty. RESULTS In this study, we aim to develop a more effective heterogeneity analysis method with the help of prior information. A network-based penalization technique is proposed to incorporate a multi-view of prior information from multiple databases, which accommodates heterogeneity attributed to both differential genes and gene relationships. To account for the fact that the prior information might not be fully credible, we propose a weighted strategy, where the weight is determined from the data and ensures that the present model is not excessively disturbed by incorrect information. Simulation and analysis of The Cancer Genome Atlas glioblastoma multiforme data demonstrate the practical applicability of the proposed method. AVAILABILITY AND IMPLEMENTATION R code implementing the proposed method is available at https://github.com/mengyunwu2020/PECM. The data that support the findings in this paper are openly available in TCGA (The Cancer Genome Atlas) at https://portal.gdc.cancer.gov/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Li
- Center for Applied Statistics, School of Statistics, Statistical Consulting Center, and RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing 100872, China
| | - Shaodong Xu
- Center for Applied Statistics, School of Statistics, Statistical Consulting Center, and RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing 100872, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, CT 06520, USA
| | - Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
| |
|
22
|
Zhang Y, Liu M, Neykov M, Cai T. Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping. J Mach Learn Res 2022; 23:83. [PMID: 37974910 PMCID: PMC10653017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2023]
Abstract
Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p , is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.
Affiliation(s)
- Yichi Zhang
- Department of Computer Science and Statistics, University of Rhode Island
| | - Molei Liu
- Department of Biostatistics, Harvard T.H. Chan School of Public Health
| | - Matey Neykov
- Department of Statistics and Data Science, Carnegie Mellon University
| | - Tianxi Cai
- Department of Biostatistics, Harvard T.H. Chan School of Public Health
| |
|
23
|
Development of a 15-Gene Signature Model as a Prognostic Tool in Sex Hormone-Dependent Cancers. Biomed Res Int 2021; 2021:3676107. [PMID: 34869761 PMCID: PMC8635877 DOI: 10.1155/2021/3676107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2020] [Revised: 05/09/2021] [Accepted: 10/12/2021] [Indexed: 11/30/2022]
Abstract
Sex hormone dependence is associated with tumor progression and prognosis. Here, we explored the molecular basis of luminal A-like phenotype in sex hormone-dependent cancers. RNA-sequencing data from 8 cancer types were obtained from The Cancer Genome Atlas (TCGA). We investigated the enrichment function of differentially expressed genes (DEGs) in luminal A breast cancer (BRCA). Weighted coexpression network analysis (WGCNA) was used to identify gene modules associated with the luminal A-like phenotype, and we calculated the module's preservation in 8 cancer types. Module hub genes screened using least absolute shrinkage and selection operator (LASSO) were used to construct a gene signature model for the luminal A-like phenotype, and we assessed the model's relationship with prognosis, enriched pathways, and immune infiltration using bioinformatics approaches. Compared to other BRCA subtypes, the enrichment functions of upregulated genes in luminal A BRCA were related to hormone biological processes and receptor activity, and the downregulated genes were associated with the cell cycle and nuclear division. A gene module significantly associated with luminal A BRCA was shared by uterine corpus endometrial carcinoma (UCEC), leading to a similar phenotype. Fifteen hub genes were used to construct a gene signature model for the assessment of the luminal A-like phenotype, and the corrected C-statistics and Brier scores were 0.986 and 0.023, respectively. Calibration plots showed good performance, and decision curve analysis indicated a high net benefit of the model. The 15-gene signature model was associated with better overall survival in BRCA and UCEC and was characterized by downregulation of DNA replication, cell cycle and activated CD4 T cells. In conclusion, our study elucidated that BRCA and UCEC share a similar sex hormone-dependent phenotype and constructed a 15-gene signature model for use as a prognostic tool to quantify the probability of the phenotype.
|
24
|
Cao K, Ma T, Ling X, Liu M, Jiang X, Ma K, Zhu J, Ma J. Development of immune gene pair-based signature predictive of prognosis and immunotherapy in esophageal cancer. Ann Transl Med 2021; 9:1591. [PMID: 34790797 PMCID: PMC8576717 DOI: 10.21037/atm-21-5217] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/16/2021] [Accepted: 10/20/2021] [Indexed: 12/13/2022]
Abstract
Background Esophageal cancer (EC) is one of the deadliest solid malignancies, mainly consisting of esophageal squamous cell carcinoma (ESCC) and adenocarcinoma (EAC). Robust biomarkers that can improve patient risk stratification are needed to optimize cancer management. We sought to establish potent prognostic signatures with immune-related gene (IRG) pairs for ESCC and EAC. Methods We obtained differentially expressed IRGs by intersecting the Immunology Database and Analysis Portal (ImmPort) with the transcriptome data set of The Cancer Genome Atlas (TCGA)-ESCC and EAC cohorts. A novel rank-based pairwise comparison algorithm was applied to select effective IRG pairs (IRGPs), followed by constructing a prognostic IRGP signature via the least absolute shrinkage and selection operator (LASSO) regression model. We assessed the predictive power of the IRGP signatures on prognosis, tumor-infiltrating immune cells, and immune checkpoint inhibitor (ICI) efficacy in EC. Kaplan-Meier survival analysis and receiver operating characteristic curves (ROC) were used to evaluate the clinical significance of IRGPs. Univariate and multivariate Cox regression analyses were performed to investigate the association of overall survival (OS) with IRGPs and clinical characteristics. Results We built a 19-IRGP signature for ESCC (n=75) and a 17-IRGP signature for EAC (n=78), with an area under the ROC curve (AUC) of 0.931 and 0.803, respectively. IRGP signature-derived risk scores stratified patients into low- and high-risk groups with significantly different OS in ESCC and EAC (P<0.001). Nomogram and decision curve analysis were used to evaluate the clinical relevance of the prognostic signatures, achieving a C-index of 0.973 in ESCC and 0.880 in EAC. The risk scores were associated with immune and ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data) scores and the composition of immune cells in the tumor microenvironment. 
The association between risk score and human leukocyte antigens (HLAs), mismatch repair (MMR) genes, and immune checkpoint molecules demonstrated its predictive value for ICI response. Differential immune characteristics and predictive value of the risk score were observed in EAC. Conclusions The established immune signatures showed great promise in predicting prognosis, tumor immunogenicity, and immunotherapy response in ESCC and EAC.
Affiliation(s)
- Kui Cao
- Department of Clinical Laboratory, Biobank, Harbin Medical University Cancer Hospital, Harbin, China; Department of Clinical Oncology, Harbin Medical University Cancer Hospital, Harbin, China
| | - Tianjiao Ma
- Department of Cardiovascular Surgery, Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, China
| | - Xiaodong Ling
- Department of Thoracic Surgery, Harbin Medical University Cancer Hospital, Harbin, China
| | - Mingdong Liu
- Department of Clinical Oncology, Harbin Medical University Cancer Hospital, Harbin, China
| | - Xiangyu Jiang
- Department of Thoracic Surgery, Harbin Medical University Cancer Hospital, Harbin, China
| | - Keru Ma
- Department of Thoracic Surgery, Harbin Medical University Cancer Hospital, Harbin, China
| | - Jinhong Zhu
- Department of Clinical Laboratory, Biobank, Harbin Medical University Cancer Hospital, Harbin, China
| | - Jianqun Ma
- Department of Thoracic Surgery, Harbin Medical University Cancer Hospital, Harbin, China
| |
|
25
|
Varsou DD, Koutroumpa NM, Sarimveis H. Automated Grouping of Nanomaterials and Read-Across Prediction of Their Adverse Effects Based on Mathematical Optimization. J Chem Inf Model 2021; 61:2766-2779. [PMID: 34029462 DOI: 10.1021/acs.jcim.1c00199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
In this study, a computational workflow is presented for grouping engineered nanomaterials (ENMs) and for predicting their toxicity-related end points. A mixed-integer linear programming (MILP) problem is formulated, which automatically filters out the noisy variables, defines the grouping boundaries, and develops group-specific predictive models. The method is extended to the multidimensional space by considering the ENM characterization categories (e.g., biological, physicochemical, biokinetics, image, etc.) as different dimensions. The performance of the proposed method is illustrated through application to benchmark data sets and comparison with alternative predictive modeling approaches. The models trained on these data sets were made publicly available through a user-friendly web service.
Collapse
Affiliation(s)
- Dimitra-Danai Varsou
- School of Chemical Engineering, National Technical University of Athens, Athens, 157 80, Greece
| | | | - Haralambos Sarimveis
- School of Chemical Engineering, National Technical University of Athens, Athens, 157 80, Greece
|
26
|
Yi H, Zhang Q, Lin C, Ma S. Information-incorporated Gaussian graphical model for gene expression data. Biometrics 2021; 78:512-523. [PMID: 33527365 DOI: 10.1111/biom.13428] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/22/2020] [Revised: 09/19/2020] [Accepted: 01/13/2021] [Indexed: 11/29/2022]
Abstract
In the analysis of gene expression data, network approaches take a system perspective and have played an irreplaceably important role. Gaussian graphical models (GGMs) have been popular in the network analysis of gene expression data. They investigate the conditional dependence between genes and "transform" the problem of estimating network structures into a sparse estimation of precision matrices. When there is a moderate to large number of genes, the number of parameters to be estimated may overwhelm the limited sample size, leading to unreliable estimation and selection. In this article, we propose incorporating information from previous studies (for example, those deposited at PubMed) to assist estimating the network structure in the present data. It is recognized that such information can be partial, biased, or even wrong. A penalization-based estimation approach is developed, shown to have consistency properties, and realized using an effective computational algorithm. Simulation demonstrates its competitive performance under various information accuracy scenarios. The analysis of TCGA lung cancer prognostic genes leads to network structures different from the alternatives.
Affiliation(s)
- Huangdi Yi
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
| | - Qingzhao Zhang
- Department of Statistics, School of Economics; Key Laboratory of Econometrics, Ministry of Education; The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
| | - Cunjie Lin
- Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
|
27
|
Pijyan A, Zheng Q, Hong HG, Li Y. Consistent Estimation of Generalized Linear Models with High Dimensional Predictors via Stepwise Regression. Entropy (Basel, Switzerland) 2020; 22:e22090965. [PMID: 33286734 PMCID: PMC7597260 DOI: 10.3390/e22090965] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/01/2020] [Revised: 08/26/2020] [Accepted: 08/28/2020] [Indexed: 05/16/2023]
Abstract
Predictive models play a central role in decision making. Penalized regression approaches, such as least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after applying variable screening methods to downsize variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh dimensional predictors. Our procedure can provide a final model; control both false negatives and false positives; and yield consistent estimates, which are useful to gauge the actual effect size of risk factors. Simulations and applications to two clinical studies verify the utility of the method.
Affiliation(s)
- Alex Pijyan
- Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
| | - Qi Zheng
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA
| | - Hyokyoung G. Hong
- Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
| | - Yi Li
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
|
28
|
Xu G, Zhu H, Lee JJ. Borrowing strength and borrowing index for Bayesian hierarchical models. Comput Stat Data Anal 2020; 144. [DOI: 10.1016/j.csda.2019.106901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
29
|
Development and external-validation of a nomogram for predicting the survival of hospitalised HIV/AIDS patients based on a large study cohort in western China. Epidemiol Infect 2020; 148:e84. [PMID: 32234104 PMCID: PMC7189350 DOI: 10.1017/s0950268820000758] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 02/07/2023] Open
Abstract
The aim of this study was to develop and externally validate a simple-to-use nomogram for predicting the survival of hospitalised human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS) patients (hospitalised persons living with HIV/AIDS, PLWHAs). Hospitalised PLWHAs (n = 3724) between January 2012 and December 2014 were enrolled in the training cohort. HIV-infected inpatients (n = 1987) admitted in 2015 were included as the external-validation cohort. The least absolute shrinkage and selection operator method was used to perform data dimension reduction and select the optimal predictors. The nomogram incorporated 11 independent predictors, including occupation, antiretroviral therapy, pneumonia, tuberculosis, Talaromyces marneffei, hypertension, septicemia, anaemia, respiratory failure, hypoproteinemia and electrolyte disturbances. The Likelihood χ2 statistic of the model was 516.30 (P = 0.000). The Integrated Brier Score was 0.076, and the Brier scores of the nomogram at the 10-day and 20-day time points were 0.046 and 0.071, respectively. The areas under the receiver operating characteristic curves were 0.819 and 0.828, and the areas under the precision-recall curves were 0.242 and 0.378, at the two time points. Calibration plots and decision curve analysis in the two sets showed good performance and a high net benefit of the nomogram. In conclusion, the nomogram developed in the current study has relatively high calibration and is clinically useful. It provides a convenient and useful tool for timely clinical decision-making and the risk management of hospitalised PLWHAs.
|
30
|
Zheng Q, Hong HG, Li Y. Building generalized linear models with ultrahigh dimensional features: A sequentially conditional approach. Biometrics 2020; 76:47-60. [PMID: 31350909 PMCID: PMC7136011 DOI: 10.1111/biom.13122] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/09/2019] [Accepted: 07/19/2019] [Indexed: 11/29/2022]
Abstract
Conditional screening approaches have emerged as a powerful alternative to the commonly used marginal screening, as they can identify marginally weak but conditionally important variables. However, most existing conditional screening methods need to fix the initial conditioning set, which may determine the ultimately selected variables. If the conditioning set is not properly chosen, the methods may produce false negatives and positives. Moreover, screening approaches typically need to involve tuning parameters and extra modeling steps in order to reach a final model. We propose a sequential conditioning approach by dynamically updating the conditioning set with an iterative selection process. We provide its theoretical properties under the framework of generalized linear models. Powered by an extended Bayesian information criterion as the stopping rule, the method will lead to a final model without the need to choose tuning parameters or threshold parameters. The practical utility of the proposed method is examined via extensive simulations and analysis of a real clinical study on predicting multiple myeloma patients' response to treatment based on their genomic profiles.
Affiliation(s)
- Qi Zheng
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, Kentucky
| | - Hyokyoung G Hong
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan
| | - Yi Li
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
|
31
|
Zheng X, Lin L, Liu B, Xiao Y, Xiong X. A multi-task transfer learning method with dictionary learning. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105233] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 10/25/2022]
|
32
|
Velten B, Huber W. Adaptive penalization in high-dimensional regression and classification with external covariates using variational Bayes. Biostatistics 2019; 22:348-364. [PMID: 31596468 PMCID: PMC8036004 DOI: 10.1093/biostatistics/kxz034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/08/2018] [Revised: 06/27/2019] [Accepted: 08/14/2019] [Indexed: 12/18/2022] Open
Abstract
Penalization schemes like Lasso or ridge regression are routinely used to regress a response of interest on a high-dimensional set of potential predictors. Despite being decisive, the question of the relative strength of penalization is often glossed over and only implicitly determined by the scale of individual predictors. At the same time, additional information on the predictors is available in many applications but left unused. Here, we propose to make use of such external covariates to adapt the penalization in a data-driven manner. We present a method that differentially penalizes feature groups defined by the covariates and adapts the relative strength of penalization to the information content of each group. Using techniques from the Bayesian tool-set our procedure combines shrinkage with feature selection and provides a scalable optimization scheme. We demonstrate in simulations that the method accurately recovers the true effect sizes and sparsity patterns per feature group. Furthermore, it leads to an improved prediction performance in situations where the groups have strong differences in dynamic range. In applications to data from high-throughput biology, the method enables re-weighting the importance of feature groups from different assays. Overall, using available covariates extends the range of applications of penalized regression, improves model interpretability and can improve prediction performance.
Affiliation(s)
- Britta Velten
- Genome Biology Unit, European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 Heidelberg, Germany
| | - Wolfgang Huber
- Genome Biology Unit, European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 Heidelberg, Germany
|
33
|
Xie W, Liu J, Huang X, Wu G, Jeen F, Chen S, Zhang C, Yang W, Li C, Li Z, Ge L, Tang W. A nomogram to predict vascular invasion before resection of colorectal cancer. Oncol Lett 2019; 18:5785-5792. [PMID: 31788051 PMCID: PMC6865036 DOI: 10.3892/ol.2019.10937] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/18/2019] [Accepted: 07/26/2019] [Indexed: 02/06/2023] Open
Abstract
Vascular invasion (VI) is an important feature for systemic recurrence and an indicator for the application of adjuvant therapy in colorectal cancer (CRC). Preoperative knowledge of VI is important in determining whether adjuvant therapy is necessary, as well as the adequacy of surgical resection. In the present study, a predictive nomogram for VI in patients with CRC was constructed. The study cohort consisted of 664 eligible patients with CRC, who were divided into a training set (n=468) and a validation set (n=196). Data were collected between August 2013 and April 2018. The feature selection model was established using the least absolute shrinkage and selection operator regression model. Multivariable logistic regression analysis was used to construct the predictive nomogram. The performance of the nomogram was evaluated by calibration, discrimination and clinical usefulness. Differentiation, computed tomography (CT)-based N stage (CT N stage), hemameba and tumor distance from the anus (cm) were integrated into the nomogram. The nomogram exhibited good discrimination, with an area under the curve (AUC) of 0.731, and good calibration. Application of the nomogram in the validation cohort showed acceptable discrimination, with an AUC of 0.710, and good calibration. Decision curve analysis revealed that the nomogram was clinically useful. These findings suggest, to the best of our knowledge, that this may be the first nomogram for individual preoperative prediction of VI in patients with CRC, which may promote preoperative optimization strategies for this selected group of patients.
Affiliation(s)
- Weishun Xie
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Jungang Liu
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Xiaoliang Huang
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Guo Wu
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Franco Jeen
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Shaomei Chen
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Chuqiao Zhang
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Wenkang Yang
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Chan Li
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Zhengtian Li
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Lianying Ge
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Department of Gynecologic Oncology, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
| | - Weizhong Tang
- Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Guangxi Clinical Research Center for Colorectal Cancer, Affiliated Tumor Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
|
34
|
Mehta B, Pedro S, Ozen G, Kalil A, Wolfe F, Mikuls T, Michaud K. Serious infection risk in rheumatoid arthritis compared with non-inflammatory rheumatic and musculoskeletal diseases: a US national cohort study. RMD Open 2019; 5:e000935. [PMID: 31245055 PMCID: PMC6560658 DOI: 10.1136/rmdopen-2019-000935] [Citation(s) in RCA: 90] [Impact Index Per Article: 18.0] [Received: 02/22/2019] [Revised: 04/23/2019] [Accepted: 05/15/2019] [Indexed: 11/04/2022] Open
Abstract
Objectives: To identify serious infection (SI) risk by aetiology and site in patients with rheumatoid arthritis (RA) compared with those with non-inflammatory rheumatic and musculoskeletal diseases (NIRMD). Methods: Patients participating in FORWARD from 2001 to 2016 were assessed for SIs, defined as infections requiring hospitalisation or intravenous antibiotics, or followed by death. SIs were categorised by aetiology and site. SI risk was assessed through Cox proportional hazards models. Best models were selected using the machine learning Least Absolute Shrinkage and Selection Operator (LASSO) methodology. Results: Among 20 361 patients with RA and 6176 patients with NIRMD, 1600 and 276 first SIs were identified, respectively. Incidence of SIs was higher in RA compared with NIRMD (IRR = 1.5; 95% CI 1.2 to 1.5). The risk persisted after adjusting using the LASSO model (HR 1.7; 95% CI 1.5 to 1.8), but attenuated when additionally adjusted for glucocorticoid use (HR 1.3; 95% CI 1.2 to 1.5). SI risk was significantly higher in RA versus NIRMD for bacterial infections as well as for respiratory, skin, bone, joint, bloodstream infections and sepsis, irrespective of glucocorticoid use. Compared with NIRMD, SI risk was significantly increased in patients with RA who were in moderate and high disease activity but was similar to those in low disease activity/remission (p trend < 0.001). Conclusions: The risk of all SIs, particularly bacterial, respiratory, bloodstream, sepsis, skin, bone and joint infections, is significantly increased in patients with RA compared with patients with NIRMD. This infection risk appears to be greatest in those with higher RA disease activity.
Affiliation(s)
- Bella Mehta
- Department of Medicine, Hospital for Special Surgery, New York, New York, USA
| | - Sofia Pedro
- Forward, The National Databank for Rheumatic Diseases, Wichita, Kansas, USA
| | - Gulsen Ozen
- Department of Medicine, University of Nebraska Medical Center, Omaha, Nebraska, USA
| | - Andre Kalil
- Department of Medicine, University of Nebraska Medical Center, Omaha, Nebraska, USA
| | - Frederick Wolfe
- Forward, The National Databank for Rheumatic Diseases, Wichita, Kansas, USA
| | - Ted Mikuls
- Department of Medicine, University of Nebraska Medical Center, Omaha, Nebraska, USA
| | - Kaleb Michaud
- Forward, The National Databank for Rheumatic Diseases, Wichita, Kansas, USA
- Department of Medicine, University of Nebraska Medical Center, Omaha, Nebraska, USA
|
35
|
Zhao Y, Zhu H, Lu Z, Knickmeyer RC, Zou F. Structured Genome-Wide Association Studies with Bayesian Hierarchical Variable Selection. Genetics 2019; 212:397-415. [PMID: 31010934 PMCID: PMC6553832 DOI: 10.1534/genetics.119.301906] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/07/2019] [Accepted: 04/08/2019] [Indexed: 02/04/2023] Open
Abstract
Genome-wide association studies (GWAS) are increasingly important for selecting genetic information associated with qualitative or quantitative traits. Currently, the discovery of biological associations among SNPs motivates various strategies to construct SNP-sets along the genome and to incorporate such set information into the selection procedure for higher selection power, while facilitating more biologically meaningful results. The aim of this paper is to propose a novel Bayesian framework for hierarchical variable selection at both the SNP-set (group) level and the SNP (within-group) level. We overcome a key limitation of the posterior updating schemes in most existing Bayesian variable selection methods by proposing a novel sampling scheme that explicitly accommodates the ultrahigh dimensionality of genetic data. Specifically, by constructing an auxiliary variable selection model at the SNP-set level, the new procedure uses the posterior samples of the auxiliary model to guide the posterior inference for the targeted hierarchical selection model. We apply the proposed method to a variety of simulation studies and show that our method is computationally efficient and achieves substantially better performance than competing approaches in both SNP-set and SNP selection. Applying the method to the Alzheimer's Disease Neuroimaging Initiative (ADNI) data, we identify biologically meaningful genetic factors under several neuroimaging volumetric phenotypes. Our method is general and can readily be applied to a wide range of biomedical studies.
Affiliation(s)
- Yize Zhao
- Department of Healthcare Policy and Research, Cornell University Weill Cornell, New York, New York 10065
| | - Hongtu Zhu
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
| | - Zhaohua Lu
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
| | - Rebecca C Knickmeyer
- Department of Pediatrics and Human Development, Michigan State University, East Lansing, Michigan 48824
| | - Fei Zou
- Department of Biostatistics, University of Florida, Gainesville, Florida 32611
|
36
|
Fan X, Fang K, Ma S, Wang S, Zhang Q. Assisted graphical model for gene expression data analysis. Stat Med 2019; 38:2364-2380. [PMID: 30854706 DOI: 10.1002/sim.8112] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/21/2018] [Revised: 12/16/2018] [Accepted: 01/09/2019] [Indexed: 11/12/2022]
Abstract
The analysis of gene expression data has been playing a pivotal role in recent biomedical research. For gene expression data, network analysis has been shown to be more informative and powerful than individual-gene and geneset-based analysis. Despite promising successes, with the high dimensionality of gene expression data and often low sample sizes, network construction with gene expression data is still often challenged. In recent studies, a prominent trend is to conduct multidimensional profiling, under which data are collected on gene expressions as well as their regulators (copy number variations, methylation, microRNAs, SNPs, etc). With the regulation relationship, regulators contain information on gene expressions and can potentially assist in estimating their characteristics. In this study, we develop an assisted graphical model (AGM) approach, which can effectively use information in regulators to improve the estimation of gene expression graphical structure. The proposed approach has an intuitive formulation and can adaptively accommodate different regulator scenarios. Its consistency properties are rigorously established. Extensive simulations and the analysis of a breast cancer gene expression data set demonstrate the practical effectiveness of the AGM.
Affiliation(s)
- Xinyan Fan
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
| | - Kuangnan Fang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Fujian Key Laboratory of Statistical Sciences, Xiamen University, Xiamen, China
| | - Shuangge Ma
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
| | - Shuaichao Wang
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Fujian Key Laboratory of Statistical Sciences, Xiamen University, Xiamen, China
- The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
|
37
|
van de Wiel MA, Te Beest DE, Münch MM. Learning from a lot: Empirical Bayes for high-dimensional model-based prediction. Scand Stat Theory Appl 2019; 46:2-25. [PMID: 31007342 PMCID: PMC6472625 DOI: 10.1111/sjos.12335] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 06/09/2017] [Revised: 01/24/2018] [Accepted: 03/22/2018] [Indexed: 12/21/2022]
Abstract
Empirical Bayes is a versatile approach to "learn from a lot" in two ways: first, from a large number of variables and, second, from a potentially large amount of prior information, for example, stored in public repositories. We review applications of a variety of empirical Bayes methods to several well-known model-based prediction methods, including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss "formal" empirical Bayes methods that maximize the marginal likelihood but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross-validation and full Bayes and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and p, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting. We argue that empirical Bayes is particularly useful when the prior contains multiple parameters, which model a priori information on variables termed "co-data". In particular, we present two novel examples that allow for co-data: first, a Bayesian spike-and-slab setting that facilitates inclusion of multiple co-data sources and types and, second, a hybrid empirical Bayes-full Bayes ridge regression approach for estimation of the posterior predictive interval.
Affiliation(s)
- Mark A. van de Wiel
- Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
- Department of Mathematics, VU University, Amsterdam, The Netherlands
| | - Dennis E. Te Beest
- Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
| | - Magnus M. Münch
- Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
- Mathematical Institute, Faculty of Science, Leiden University, Leiden, The Netherlands
|
38
|
Wang X, Xu Y, Ma S. Identifying gene-environment interactions incorporating prior information. Stat Med 2019; 38:1620-1633. [PMID: 30637789 DOI: 10.1002/sim.8064] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 07/16/2018] [Revised: 10/20/2018] [Accepted: 11/26/2018] [Indexed: 12/28/2022]
Abstract
For many complex diseases, gene-environment (G-E) interactions have independent contributions beyond the main G and E effects. Despite extensive effort, it still remains challenging to identify G-E interactions. With the long accumulation of experiments and data, for many biomedical problems of common interest, there are existing studies that can be relevant and informative for the identification of G-E interactions and/or main effects. In this study, our goal is to identify G-E interactions (as well as their corresponding main G effects) under a joint statistical modeling framework. Significantly advancing from the existing studies, a quasi-likelihood-based approach is developed to incorporate information mined from the existing literature. A penalization approach is adopted for identification and selection and respects the "main effects, interactions" hierarchical structure. Simulation shows that, when the existing information is of high quality, significant improvement can be observed. On the other hand, when the existing information is less informative, the proposed method still performs reasonably (and hence demonstrates a certain degree of "robustness"). The analysis of The Cancer Genome Atlas (TCGA) data on cutaneous melanoma and glioblastoma multiforme demonstrates the practical applicability of the proposed approach and also leads to sensible findings.
Affiliation(s)
- Xiaoyan Wang
- College of Finance and Statistics, Hunan University, Changsha, China
- Department of Biostatistics, Yale University, New Haven, Connecticut
| | - Yonghong Xu
- School of Economics, Xiamen University, Xiamen, China
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut
|
39
|
A semi-supervised approach for predicting cell-type specific functional consequences of non-coding variation using MPRAs. Nat Commun 2018; 9:5199. [PMID: 30518757 PMCID: PMC6281617 DOI: 10.1038/s41467-018-07349-w] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 05/31/2018] [Accepted: 10/18/2018] [Indexed: 01/21/2023] Open
Abstract
Predicting the functional consequences of genetic variants in non-coding regions is a challenging problem. We propose here a semi-supervised approach, GenoNet, to jointly utilize experimentally confirmed regulatory variants (labeled variants), millions of unlabeled variants genome-wide, and more than a thousand cell/tissue type specific epigenetic annotations to predict functional consequences of non-coding variants. Through the application to several experimental datasets, we demonstrate that the proposed method significantly improves prediction accuracy compared to existing functional prediction methods at the tissue/cell type level, but especially so at the organism level. Importantly, we illustrate how the GenoNet scores can help in fine-mapping at GWAS loci, and in the discovery of disease associated genes in sequencing studies. As more comprehensive lists of experimentally validated variants become available over the next few years, semi-supervised methods like GenoNet can be used to provide increasingly accurate functional predictions for variants genome-wide and across a variety of cell/tissue types. Predicting the functional consequences of non-coding genetic variants is a challenge. Here, He et al. present GenoNet, a semi-supervised method that combines information from experimentally confirmed regulatory variants with cell type- and tissue specific annotation for function prediction.
|
40
|
Arabnejad M, Dawkins BA, Bush WS, White BC, Harkness AR, McKinney BA. Transition-transversion encoding and genetic relationship metric in ReliefF feature selection improves pathway enrichment in GWAS. BioData Min 2018; 11:23. [PMID: 30410580 PMCID: PMC6215626 DOI: 10.1186/s13040-018-0186-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 07/08/2018] [Accepted: 10/22/2018] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND: ReliefF is a nearest-neighbor-based feature selection algorithm that efficiently detects variants that are important due to statistical interactions or epistasis. For categorical predictors, such as genotypes, the standard metric used in ReliefF has been a simple (binary) mismatch difference. In this study, we develop new metrics of varying complexity that incorporate allele sharing, adjustment for allele frequency heterogeneity via the genetic relationship matrix (GRM), and physicochemical differences of variants via a new transition/transversion encoding. METHODS: We introduce a new two-dimensional transition/transversion genotype encoding for ReliefF, and we implement three ReliefF attribute metrics: (1) genotype mismatch (GM), which is the ReliefF standard; (2) allele mismatch (AM), which accounts for heterozygous differences and has not been used previously in ReliefF; and (3) the new transition/transversion metric. We incorporate these attribute metrics into the ReliefF nearest-neighbor calculation with a Manhattan metric, and we introduce GRM as a new ReliefF nearest-neighbor metric to adjust for allele frequency heterogeneity. RESULTS: We apply ReliefF with each metric to a GWAS of major depressive disorder and compare the detection of genes in pathways implicated in depression, including Axon Guidance, Neuronal System, and G Protein-Coupled Receptor Signaling. We also compare with detection by Random Forest and Lasso, as well as random/null selection, to assess pathway size bias. CONCLUSIONS: Our results suggest that using more genetically motivated encodings, such as transition/transversion, and metrics that adjust for allele frequency heterogeneity, such as GRM, leads to ReliefF attribute scores with improved pathway enrichment.
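The difference between the genotype mismatch (GM) and allele mismatch (AM) attribute metrics named in this abstract can be illustrated with a minimal sketch. It assumes genotypes are coded as minor-allele counts (0, 1, 2) and that AM is the fraction of differing alleles, |g1 - g2| / 2. The abstract does not state either detail, so both are assumptions, and the function names are hypothetical rather than taken from the authors' software.

```python
import numpy as np

# Assumption: genotypes are coded as minor-allele counts 0, 1, or 2.

def genotype_mismatch(g1, g2):
    """GM: 1 where the two genotypes differ at a SNP, 0 where they match."""
    return (np.asarray(g1) != np.asarray(g2)).astype(float)

def allele_mismatch(g1, g2):
    """AM (assumed form): fraction of alleles that differ, |g1 - g2| / 2."""
    return np.abs(np.asarray(g1) - np.asarray(g2)) / 2.0

def manhattan_distance(g1, g2, diff):
    """Sum the per-SNP differences, as in a Manhattan nearest-neighbor metric."""
    return float(diff(g1, g2).sum())

# Two hypothetical individuals genotyped at four SNPs.
a = np.array([0, 1, 2, 2])
b = np.array([0, 2, 0, 1])

gm = manhattan_distance(a, b, genotype_mismatch)  # counts any difference as 1
am = manhattan_distance(a, b, allele_mismatch)    # heterozygous differences count as 0.5
print(gm, am)
```

Under GM, a heterozygote versus a homozygote contributes as much to the distance as two opposite homozygotes; under the assumed AM form it contributes half as much, which is the sense in which AM "accounts for heterozygous differences".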
Affiliation(s)
- M. Arabnejad
- Tandy School of Computer Science, The University of Tulsa, 800 S. Tucker Dr, Tulsa, OK 74104 USA
| | - B. A. Dawkins
- Department of Mathematics, The University of Tulsa, Tulsa, OK 74104 USA
| | - W. S. Bush
- Institute for Computational Biology, Case Western Reserve University, 2103 Cornell Road, Cleveland, OH 44106 USA
| | - B. C. White
- Tandy School of Computer Science, The University of Tulsa, 800 S. Tucker Dr, Tulsa, OK 74104 USA
| | - A. R. Harkness
- Department of Psychology, The University of Tulsa, Tulsa, OK 74104 USA
| | - B. A. McKinney
- Tandy School of Computer Science, The University of Tulsa, 800 S. Tucker Dr, Tulsa, OK 74104 USA
- Department of Mathematics, The University of Tulsa, Tulsa, OK 74104 USA
|
41
|
Zhang L, Wang Y, Han J, Shen H, Zhao M, Cai S. Neutrophil-lymphocyte ratio, gamma-glutamyl transpeptidase, lipase, high-density lipoprotein as a panel of factors to predict acute pancreatitis in pregnancy. Medicine (Baltimore) 2018; 97:e11189. [PMID: 29952970 PMCID: PMC6242302 DOI: 10.1097/md.0000000000011189] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 12/11/2022] Open
Abstract
Acute pancreatitis in pregnancy (APIP) is a rare but dangerous complication. APIP shares symptoms with acute abdomen. Assessment of an acute abdomen is more complicated during pregnancy because the gravid uterus can mask most symptomatic signs. It has been a challenge to diagnose APIP by physical examination or diagnostic imaging. Case studies on APIP are also too limited for analysis of the risk factors associated with the disease. This retrospective study evaluated a series of risk factors from a relatively substantial number of APIP cases to determine early predictors or prognosis markers for APIP. Fifty-nine APIP patients together with 179 randomly selected normal pregnant women in Shengjing Affiliated Hospital of China Medical University were included in this retrospective study. Biochemistry and hematology parameters from blood tests were compared between the 2 groups using the t test. Multivariate logistic regression analysis was performed to investigate the relationship between various factors and APIP using Statistical Applied Software (SAS student version). Compared with normal pregnant women, APIP patients have elevated values of alanine aminotransferase (ALT), aspartate aminotransferase (AST), blood urea nitrogen, creatinine, C-reactive protein, direct bilirubin, fibrin degradation products, gamma-glutamyl transpeptidase (GGT), glucose, lipase, and pH, and decreased values of albumin, fibrinogen, high-density lipoprotein (HDL), hemoglobin, low-density lipoprotein cholesterol (LDL-C), and total proteins in their blood tests. In addition, APIP patients have decreased red blood cell counts but increased white blood cell counts and an increased neutrophil/lymphocyte ratio (N/LR). Among these factors, N/LR, GGT, lipase, and HDL are significantly associated with APIP. This study suggests that the combination of those factors serves as a panel of indicators for early-onset prognosis of APIP. GGT, lipase, HDL, and N/LR can serve as a panel of factors to predict APIP. More case studies are needed to further evaluate the predictive power of this panel of factors in APIP.
Affiliation(s)
- Lichun Zhang
- Department of Emergency, Shengjing Affiliated Hospital of China Medical University, Shenyang, Liaoning Province
| | - Yu Wang
- Department of Emergency, Shengjing Affiliated Hospital of China Medical University, Shenyang, Liaoning Province
| | - Jun Han
- Department of Emergency, Shengjing Affiliated Hospital of China Medical University, Shenyang, Liaoning Province
| | - Haitao Shen
- Department of Emergency, Shengjing Affiliated Hospital of China Medical University, Shenyang, Liaoning Province
| | - Min Zhao
- Department of Emergency, Shengjing Affiliated Hospital of China Medical University, Shenyang, Liaoning Province
|
42
|
Hong HG, Kang J, Li Y. Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Anal 2018; 24:45-71. [PMID: 27933468 PMCID: PMC5494024 DOI: 10.1007/s10985-016-9387-7] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 03/24/2016] [Accepted: 11/25/2016] [Indexed: 05/12/2023]
Abstract
Identifying important biomarkers that are predictive of cancer patients' prognosis is key to gaining better insights into the biological influences on the disease and has become a critical component of precision medicine. The emergence of large-scale biomedical survival studies, which typically involve an excessive number of biomarkers, has created high demand for efficient screening tools for selecting predictive biomarkers. The vast number of biomarkers defies any existing variable selection methods via regularization. The recently developed variable screening methods, though powerful in many practical settings, fail to incorporate prior information on the importance of each biomarker and are less powerful in detecting signals that are marginally weak but jointly important. We propose a new conditional screening method for survival outcome data that computes the marginal contribution of each biomarker given previously known biological information. This is based on the premise that some biomarkers are known to be associated with disease outcomes a priori. Our method possesses sure screening properties and a vanishing false selection rate. The utility of the proposal is further confirmed with extensive simulation studies and analysis of a diffuse large B-cell lymphoma dataset. We are pleased to dedicate this work to Jack Kalbfleisch, who has made instrumental contributions to the development of modern methods of analyzing survival data.
Affiliation(s)
| | - Jian Kang
- University of Michigan, Ann Arbor, MI, USA.
| | - Yi Li
- University of Michigan, Ann Arbor, MI, USA
|
43
|
Zhang H, Zheng Y, Yoon G, Zhang Z, Gao T, Joyce B, Zhang W, Schwartz J, Vokonas P, Colicino E, Baccarelli A, Hou L, Liu L. Regularized estimation in sparse high-dimensional multivariate regression, with application to a DNA methylation study. Stat Appl Genet Mol Biol 2017; 16:159-171. [PMID: 28734115 DOI: 10.1515/sagmb-2016-0073] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
In this article, we consider variable selection for correlated high dimensional DNA methylation markers as multivariate outcomes. A novel weighted square-root LASSO procedure is proposed to estimate the regression coefficient matrix. A key feature of this method is tuning-insensitivity, which greatly simplifies the computation by obviating cross validation for penalty parameter selection. A precision matrix obtained via the constrained ℓ1 minimization method is used to account for the within-subject correlation among multivariate outcomes. Oracle inequalities of the regularized estimators are derived. The performance of our proposed method is illustrated via extensive simulation studies. We apply our method to study the relation between smoking and high dimensional DNA methylation markers in the Normative Aging Study (NAS).
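The tuning-insensitivity mentioned in this abstract can be seen from the standard single-response square-root LASSO objective, shown here next to the ordinary LASSO for comparison. This is the textbook formulation only; the weighted, multivariate version proposed in the article is not given in the abstract and is not reproduced here. The symbols $y$, $X$, $\beta$, $n$, $p$, $\sigma$, and $\lambda$ are notation assumed for this sketch.

```latex
% Ordinary LASSO: a good \lambda scales with the unknown noise level \sigma,
% so it is usually chosen by cross-validation.
\hat{\beta}_{\mathrm{LASSO}}
  = \arg\min_{\beta \in \mathbb{R}^{p}}
    \frac{1}{2n}\,\lVert y - X\beta \rVert_2^{2} + \lambda \lVert \beta \rVert_1,
  \qquad \lambda \asymp \sigma \sqrt{\frac{\log p}{n}}

% Square-root LASSO: taking the square root of the loss makes the
% theoretically justified \lambda free of \sigma.
\hat{\beta}_{\sqrt{\mathrm{LASSO}}}
  = \arg\min_{\beta \in \mathbb{R}^{p}}
    \frac{1}{\sqrt{n}}\,\lVert y - X\beta \rVert_2 + \lambda \lVert \beta \rVert_1,
  \qquad \lambda \asymp \sqrt{\frac{\log p}{n}}
```

Because the second choice of $\lambda$ does not involve $\sigma$, it can be fixed in advance from $n$ and $p$, which is why cross-validation for the penalty parameter can be avoided.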
|