1
|
Danaeifar M, Najafi A. Artificial Intelligence and Computational Biology in Gene Therapy: A Review. Biochem Genet 2024:10.1007/s10528-024-10799-1. [PMID: 38635012 DOI: 10.1007/s10528-024-10799-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Accepted: 04/02/2024] [Indexed: 04/19/2024]
Abstract
Artificial intelligence is a trending field in almost all areas of science and technology. Computational biology and artificial intelligence can support gene therapy at many steps, including gene identification, gene editing, vector design, development of new macromolecules, and modeling of gene delivery. The tools they bring to this field include genomic, transcriptomic, and proteomic data analysis, machine learning algorithms, and molecular interaction studies. These tools can introduce new gene targets and novel vectors, optimize experimental conditions, predict outcomes, and suggest the best solutions for avoiding undesired immune responses following gene therapy treatment.
Affiliation(s)
- Mohsen Danaeifar
- Molecular Biology Research Center, Systems Biology and Poisonings Institute, Baqiyatallah University of Medical Science, P.O. Box 19395-5487, Tehran, Iran
| | - Ali Najafi
- Molecular Biology Research Center, Systems Biology and Poisonings Institute, Baqiyatallah University of Medical Science, P.O. Box 19395-5487, Tehran, Iran.
| |
|
2
|
Malakhov MM, Dai B, Shen XT, Pan W. A bootstrap model comparison test for identifying genes with context-specific patterns of genetic regulation. bioRxiv 2023:2023.03.06.531446. [PMID: 36945657 PMCID: PMC10028853 DOI: 10.1101/2023.03.06.531446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2023]
Abstract
Understanding how genetic variation affects gene expression is essential for a complete picture of the functional pathways that give rise to complex traits. Although numerous studies have established that many genes are differentially expressed in distinct human tissues and cell types, no tools exist for identifying the genes whose expression is differentially regulated. Here we introduce DRAB (Differential Regulation Analysis by Bootstrapping), a gene-based method for testing whether patterns of genetic regulation are significantly different between tissues or other biological contexts. DRAB first leverages the elastic net to learn context-specific models of local genetic regulation and then applies a novel bootstrap-based model comparison test to check their equivalency. Unlike previous model comparison tests, our proposed approach can determine whether population-level models have equal predictive performance by accounting for the variability of feature selection and model training. We validated DRAB on mRNA expression data from a variety of human tissues in the Genotype-Tissue Expression (GTEx) Project. DRAB yielded biologically reasonable results and had sufficient power to detect genes with tissue-specific regulatory profiles while effectively controlling false positives. By providing a framework that facilitates the prioritization of differentially regulated genes, our study enables future discoveries on the genetic architecture of molecular phenotypes.
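The idea of comparing context-specific models while letting the bootstrap capture training variability can be illustrated with a toy sketch. This is not DRAB itself: it swaps the elastic net for a single-predictor least-squares fit, uses simulated data, and reports a percentile interval for the difference in prediction error rather than the authors' test statistic. All function names, sample sizes, and effect sizes below are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x (stand-in for an elastic net)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

def mse(model, x, y):
    """Mean squared prediction error of a fitted (intercept, slope) model."""
    a, b = model
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / len(x)

def simulate(n, slope, rng):
    """Hypothetical data: one SNP (0/1/2 copies, allele frequency 0.3) and expression."""
    x = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]
    y = [slope * xi + rng.gauss(0, 0.5) for xi in x]
    return x, y

def resample(x, y, rng):
    """Draw one bootstrap resample of paired observations."""
    idx = [rng.randrange(len(x)) for _ in x]
    return [x[i] for i in idx], [y[i] for i in idx]

def bootstrap_mse_gap(train_a, train_b, test_a, n_boot, rng):
    """Refit both models on each resample and compare their error on context-A test data."""
    gaps = []
    for _ in range(n_boot):
        model_a = fit_line(*resample(*train_a, rng))
        model_b = fit_line(*resample(*train_b, rng))
        gaps.append(mse(model_b, *test_a) - mse(model_a, *test_a))
    gaps.sort()
    # Approximate 95% percentile interval for the error gap.
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot) - 1]

rng = random.Random(1)
train_a = simulate(200, 1.0, rng)   # context A: SNP regulates expression
train_b = simulate(200, 0.0, rng)   # context B: no regulation
test_a = simulate(200, 1.0, rng)    # held-out data from context A
lo, hi = bootstrap_mse_gap(train_a, train_b, test_a, 200, rng)
print(f"95% bootstrap interval for MSE gap: ({lo:.3f}, {hi:.3f})")
```

In this simulated setting the interval should sit clearly above zero, which is the kind of evidence one would read as the two contexts being regulated differently.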
Affiliation(s)
| | - Ben Dai
- Department of Statistics, The Chinese University of Hong Kong
| | | | - Wei Pan
- Division of Biostatistics, University of Minnesota
| |
|
3
|
Araujo DS, Nguyen C, Hu X, Mikhaylova AV, Gignoux C, Ardlie K, Taylor KD, Durda P, Liu Y, Papanicolaou G, Cho MH, Rich SS, Rotter JI, Im HK, Manichaikul A, Wheeler HE. Multivariate adaptive shrinkage improves cross-population transcriptome prediction and association studies in underrepresented populations. HGG Advances 2023; 4:100216. [PMID: 37869564 PMCID: PMC10589725 DOI: 10.1016/j.xhgg.2023.100216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2023] [Accepted: 06/27/2023] [Indexed: 10/24/2023] Open
Abstract
Transcriptome prediction models built with data from European-descent individuals are less accurate when applied to different populations because of differences in linkage disequilibrium patterns and allele frequencies. We hypothesized that methods that leverage shared regulatory effects across different conditions, in this case, across different populations, may improve cross-population transcriptome prediction. To test this hypothesis, we made transcriptome prediction models for use in transcriptome-wide association studies (TWASs) using different methods (elastic net, joint-tissue imputation [JTI], matrix expression quantitative trait loci [Matrix eQTL], multivariate adaptive shrinkage in R [MASHR], and transcriptome-integrated genetic association resource [TIGAR]) and tested their out-of-sample transcriptome prediction accuracy in population-matched and cross-population scenarios. Additionally, to evaluate model applicability in TWASs, we integrated publicly available multiethnic genome-wide association study (GWAS) summary statistics from the Population Architecture using Genomics and Epidemiology (PAGE) study and Pan-ancestry genetic analysis of the UK Biobank (PanUKBB) with our developed transcriptome prediction models. In regard to transcriptome prediction accuracy, MASHR models performed better or the same as other methods in both population-matched and cross-population transcriptome predictions. Furthermore, in multiethnic TWASs, MASHR models yielded more discoveries that replicate in both PAGE and PanUKBB across all methods analyzed, including loci previously mapped in GWASs and loci previously not found in GWASs. Overall, our study demonstrates the importance of using methods that benefit from different populations' effect size estimates in order to improve TWASs for multiethnic or underrepresented populations.
Affiliation(s)
- Daniel S. Araujo
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL 60660, USA
| | - Chris Nguyen
- Department of Biology, Loyola University Chicago, Chicago, IL 60660, USA
| | - Xiaowei Hu
- Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
| | - Anna V. Mikhaylova
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
| | - Chris Gignoux
- Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, UC Denver Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Kristin Ardlie
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Kent D. Taylor
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, the Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA 90502, USA
| | - Peter Durda
- Laboratory for Clinical Biochemistry Research, University of Vermont, Colchester, VT 05446, USA
| | - Yongmei Liu
- Department of Medicine, Duke University School of Medicine, Durham, NC 27710, USA
| | - George Papanicolaou
- Epidemiology Branch, Division of Cardiovascular Sciences, National Heart, Lung and Blood Institute, Bethesda, MD 20892, USA
| | - Michael H. Cho
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
| | - Stephen S. Rich
- Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
| | - Jerome I. Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, the Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA 90502, USA
| | - NHLBI TOPMed Consortium
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL 60660, USA
- Department of Biology, Loyola University Chicago, Chicago, IL 60660, USA
- Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, UC Denver Anschutz Medical Campus, Aurora, CO 80045, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, the Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Laboratory for Clinical Biochemistry Research, University of Vermont, Colchester, VT 05446, USA
- Department of Medicine, Duke University School of Medicine, Durham, NC 27710, USA
- Epidemiology Branch, Division of Cardiovascular Sciences, National Heart, Lung and Blood Institute, Bethesda, MD 20892, USA
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL 60637, USA
| | - Hae Kyung Im
- Section of Genetic Medicine, University of Chicago, Chicago, IL 60637, USA
| | - Ani Manichaikul
- Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
| | - Heather E. Wheeler
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL 60660, USA
- Department of Biology, Loyola University Chicago, Chicago, IL 60660, USA
| |
|
4
|
Mai J, Lu M, Gao Q, Zeng J, Xiao J. Transcriptome-wide association studies: recent advances in methods, applications and available databases. Commun Biol 2023; 6:899. [PMID: 37658226 PMCID: PMC10474133 DOI: 10.1038/s42003-023-05279-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/21/2023] [Accepted: 08/24/2023] [Indexed: 09/03/2023] Open
Abstract
Genome-wide association studies have identified many variants that affect heritable traits. Nevertheless, identifying the critical genes underlying those significant variants remains a major challenge. Transcriptome-wide association study (TWAS) is an instrumental post-analysis that detects significant gene-trait associations by modeling transcription-level regulation, and it has made considerable progress in recent years. By leveraging expression quantitative trait loci (eQTL) regulation information, TWAS has advantages in detecting functioning genes regulated by disease-associated variants, thus providing insight into the mechanisms of diseases and other phenotypes. Considering its vast potential, this review article comprehensively summarizes TWAS, including its methodology, applications and available resources.
Affiliation(s)
- Jialin Mai
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Mingming Lu
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Qianwen Gao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Jingyao Zeng
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China.
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China.
| | - Jingfa Xiao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China.
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing, 100101, China.
- University of Chinese Academy of Sciences, Beijing, 100049, China.
| |
|
5
|
Ren J, Lin Z, He R, Shen X, Pan W. Using GWAS summary data to impute traits for genotyped individuals. HGG Advances 2023; 4:100197. [PMID: 37181332 PMCID: PMC10173780 DOI: 10.1016/j.xhgg.2023.100197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Accepted: 04/07/2023] [Indexed: 05/16/2023] Open
Abstract
Genome-wide association study (GWAS) summary data have become extremely useful in daily routine data analysis, largely facilitating new methods development and new applications. However, a severe limitation with the current use of GWAS summary data is its exclusive restriction to only linear single nucleotide polymorphism (SNP)-trait association analyses. To further expand the use of GWAS summary data, along with a large sample of individual-level genotypes, we propose a nonparametric method for large-scale imputation of the genetic component of the trait for the given genotypes. The imputed individual-level trait values, along with the individual-level genotypes, make it possible to conduct any analysis as with individual-level GWAS data, including nonlinear SNP-trait associations and predictions. We use the UK Biobank data to highlight the usefulness and effectiveness of the proposed method in three applications that currently cannot be done with only GWAS summary data (for SNP-trait associations): marginal SNP-trait association analysis under non-additive genetic models, detection of SNP-SNP interactions, and genetic prediction of a trait using a nonlinear model of SNPs.
Affiliation(s)
- Jingchen Ren
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Zhaotong Lin
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Ruoyu He
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Corresponding author
| |
|
6
|
Araujo DS, Nguyen C, Hu X, Mikhaylova AV, Gignoux C, Ardlie K, Taylor KD, Durda P, Liu Y, Papanicolaou G, Cho MH, Rich SS, Rotter JI, Im HK, Manichaikul A, Wheeler HE. Multivariate adaptive shrinkage improves cross-population transcriptome prediction for transcriptome-wide association studies in underrepresented populations. bioRxiv 2023:2023.02.09.527747. [PMID: 36798214 PMCID: PMC9934635 DOI: 10.1101/2023.02.09.527747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023]
Abstract
Transcriptome prediction models built with data from European-descent individuals are less accurate when applied to different populations because of differences in linkage disequilibrium patterns and allele frequencies. We hypothesized that methods that leverage shared regulatory effects across different conditions, in this case across different populations, may improve cross-population transcriptome prediction. To test this hypothesis, we made transcriptome prediction models for use in transcriptome-wide association studies (TWAS) using different methods (Elastic Net, Joint-Tissue Imputation (JTI), Matrix eQTL, Multivariate Adaptive Shrinkage in R (MASHR), and Transcriptome-Integrated Genetic Association Resource (TIGAR)) and tested their out-of-sample transcriptome prediction accuracy in population-matched and cross-population scenarios. Additionally, to evaluate model applicability in TWAS, we integrated publicly available multi-ethnic genome-wide association study (GWAS) summary statistics from the Population Architecture using Genomics and Epidemiology Study (PAGE) and Pan-UK Biobank with our developed transcriptome prediction models. In regard to transcriptome prediction accuracy, MASHR models performed better or the same as other methods in both population-matched and cross-population transcriptome predictions. Furthermore, in multi-ethnic TWAS, MASHR models yielded more discoveries that replicate in both PAGE and PanUKBB across all methods analyzed, including loci previously mapped in GWAS and new loci previously not found in GWAS. Overall, our study demonstrates the importance of using methods that benefit from different populations' effect size estimates in order to improve TWAS for multi-ethnic or underrepresented populations.
Affiliation(s)
- Daniel S. Araujo
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL, 60660, USA
| | - Chris Nguyen
- Department of Biology, Loyola University Chicago, Chicago, IL, 60660, USA
| | - Xiaowei Hu
- Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA, 22908, USA
| | - Anna V. Mikhaylova
- Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA
| | - Chris Gignoux
- Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, UC Denver Anschutz Medical Campus, Aurora, CO, 80045, USA
| | - Kristin Ardlie
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Kent D. Taylor
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
| | - Peter Durda
- Laboratory for Clinical Biochemistry Research, University of Vermont, Colchester, VT, 05446, USA
| | - Yongmei Liu
- Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
| | - George Papanicolaou
- Epidemiology Branch, Division of Cardiovascular Sciences, National Heart, Lung and Blood Institute, Bethesda, MD, 20892, USA
| | - Michael H. Cho
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, MA, 02115, USA
| | - Stephen S. Rich
- Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA, 22908, USA
| | - Jerome I. Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
| | | | - Hae Kyung Im
- Section of Genetic Medicine, The University of Chicago, Chicago, IL, 60637, USA
| | - Ani Manichaikul
- Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA, 22908, USA
| | - Heather E. Wheeler
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL, 60660, USA
- Department of Biology, Loyola University Chicago, Chicago, IL, 60660, USA
| |
|
7
|
Kim J, Kim H, Lee MS, Lee H, Kim YJ, Lee WY, Yun SH, Kim HC, Hong HK, Hannenhalli S, Cho YB, Park D, Choi SS. Transcriptomes of the tumor-adjacent normal tissues are more informative than tumors in predicting recurrence in colorectal cancer patients. J Transl Med 2023; 21:209. [PMID: 36941605 PMCID: PMC10029176 DOI: 10.1186/s12967-023-04053-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Accepted: 03/10/2023] [Indexed: 03/23/2023] Open
Abstract
BACKGROUND Previous investigations of transcriptomic signatures of cancer patient survival and post-therapy relapse have focused on tumor tissue. In contrast, here we show that in colorectal cancer (CRC), transcriptomes derived from normal tissues adjacent to tumors (NATs) are better predictors of relapse. RESULTS Using the transcriptomes of paired tumor and NAT specimens from 80 Korean CRC patients retrospectively determined to be in recurrence or nonrecurrence states, we found that, when comparing recurrent with nonrecurrent samples, NATs exhibit a greater number of differentially expressed genes (DEGs) than tumors. Training two prognostic elastic net-based machine learning models, one NAT-based and one tumor-based, in our Samsung Medical Center (SMC) cohort, we found that the NAT-based model performed better in predicting survival when the model was applied to the tumor-derived transcriptomes of an independent cohort of 450 COAD patients in TCGA. Furthermore, compositions of tumor-infiltrating immune cells in NATs were found to have better prognostic capability than in tumors. We also confirmed through Cox regression analysis that in both the SMC-CRC and TCGA-COAD cohorts, a greater proportion of genes exhibited a significant hazard ratio when the NAT-derived transcriptome was used than when the tumor-derived transcriptome was used. CONCLUSIONS Taken together, our results strongly suggest that NAT-derived transcriptomes and immune cell composition of CRC are better predictors of patient survival and tumor recurrence than the primary tumor.
Affiliation(s)
- Jinho Kim
- Division of Biomedical Convergence, College of Biomedical Science, Institute of Bioscience & Biotechnology, Kangwon National University, Chuncheon, 24341, Korea
| | - Hyunjung Kim
- Precision Medicine Center, Future Innovation Research Division, Seoul National University Bundang Hospital, Seongnam, 13620, Korea
| | - Min-Seok Lee
- Division of Biomedical Convergence, College of Biomedical Science, Institute of Bioscience & Biotechnology, Kangwon National University, Chuncheon, 24341, Korea
| | - Heetak Lee
- Precision Medicine Center, Future Innovation Research Division, Seoul National University Bundang Hospital, Seongnam, 13620, Korea
- Center for Genome Engineering, Institute for Basic Science, 55, Expo-ro, Yuseng-gu, Daejeon, 34126, Korea
| | - Yeon Jeong Kim
- Samsung Genome Institute, Samsung Medical Center, Seoul, 06351, Korea
| | - Woo Yong Lee
- Department of Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, 06351, Korea
| | - Seong Hyeon Yun
- Department of Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, 06351, Korea
| | - Hee Cheol Kim
- Department of Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, 06351, Korea
| | - Hye Kyung Hong
- Institute for Future Medicine, Samsung Medical Center, Seoul, 06351, Korea
| | - Sridhar Hannenhalli
- Cancer Data Science Lab, Center for Cancer Research, National Cancer Institute, Bethesda, 20814, MD, USA
| | - Yong Beom Cho
- Department of Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, 06351, Korea.
- Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul, 06351, Korea.
| | | | - Sun Shim Choi
- Division of Biomedical Convergence, College of Biomedical Science, Institute of Bioscience & Biotechnology, Kangwon National University, Chuncheon, 24341, Korea.
| |
|
8
|
He R, Xue H, Pan W. Statistical power of transcriptome-wide association studies. Genet Epidemiol 2022; 46:572-588. [PMID: 35766062 PMCID: PMC9669108 DOI: 10.1002/gepi.22491] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/09/2022] [Revised: 05/27/2022] [Accepted: 05/31/2022] [Indexed: 01/02/2023]
Abstract
Transcriptome-Wide Association Studies (TWASs) have become increasingly popular in identifying genes (or other endophenotypes or exposures) associated with complex traits. In TWAS, one first builds a predictive model for gene expression using an expression quantitative trait loci (eQTL) data set in stage 1, then tests the association between the predicted gene expression and a trait based on a large, independent genome-wide association study (GWAS) data set in stage 2. However, since the sample size of the eQTL data set is usually small and the coefficient of multiple determination (i.e., $R^2$) of the model for many genes is also small, a question of interest is to what extent these factors affect the statistical power of TWAS. In addition, in contrast to a standard (univariate) TWAS (UV-TWAS) considering only a single gene at a time, multivariate TWAS (MV-TWAS) methods have recently emerged to account for the effects of multiple genes, or a gene's nonlinear effects, simultaneously. In the absence of a power analysis for these MV-TWAS methods, it would be of interest to investigate whether one can gain or lose power by using the newly proposed MV-TWAS instead of UV-TWAS. In this paper, we first outline a general method for sample size/power calculations for two-sample TWAS, then use real data (the Alzheimer's Disease Neuroimaging Initiative (ADNI) eQTL data and the Genotype-Tissue Expression (GTEx) eQTL data for stage 1; the International Genomics of Alzheimer's Project Alzheimer's disease (AD) GWAS summary data and UK Biobank (UKB) individual-level data for stage 2) to empirically address these questions. Our most important conclusions are the following. First, a sample size of a few thousand (~8000) would suffice in stage 1, where the power of TWAS would be more determined by cis-heritability of gene expression. Second, as in the general case of simple regression versus multiple regression, the power of MV-TWAS may be higher or lower than that of UV-TWAS, depending on the specific relationships among the GWAS trait and multiple genes (or linear and nonlinear terms of the same gene's expression levels), such as their correlations and effect sizes. Interestingly, several top genes with large power gains in MV-TWAS (over that in UV-TWAS) were known to be (and in our data more significantly) associated with AD. We also reached similar conclusions in an application to the GTEx whole blood gene expression data and UKB GWAS data of high-density lipoprotein cholesterol. The proposed method and the conclusions are expected to be useful in planning and designing future TWAS and other related studies (e.g., Proteome- or Metabolome-Wide Association Studies) when determining the sample sizes for the two stages.
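The two-stage, two-sample design described in this abstract can be sketched on simulated data. The sketch below is a deliberate simplification and not the authors' method: it uses a single SNP and ordinary least squares where real TWAS pipelines fit multi-SNP penalized models, and every sample size, allele frequency, and effect size is a hypothetical value chosen for illustration.

```python
import math
import random

def simple_ols(x, y):
    """Return (intercept, slope, slope standard error) for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, se

def draw_genotypes(n, maf, rng):
    """Simulate 0/1/2 minor-allele counts for n individuals."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]

def two_stage_twas(n_eqtl=500, n_gwas=20000, seed=0):
    rng = random.Random(seed)

    # Stage 1: learn an expression prediction weight in a small eQTL sample.
    g1 = draw_genotypes(n_eqtl, 0.3, rng)
    expr = [0.5 * g + rng.gauss(0, 1) for g in g1]  # assumed true weight 0.5
    intercept, weight, _ = simple_ols(g1, expr)

    # Stage 2: in an independent GWAS sample, expression is unobserved;
    # the trait depends on the true expression (assumed effect 0.3).
    g2 = draw_genotypes(n_gwas, 0.3, rng)
    trait = [0.3 * (0.5 * g + rng.gauss(0, 1)) + rng.gauss(0, 1) for g in g2]

    # Impute expression from genotype and test its association with the trait.
    predicted = [intercept + weight * g for g in g2]
    _, beta, se = simple_ols(predicted, trait)
    return {"weight": weight, "beta": beta, "z": beta / se}

result = two_stage_twas()
print(result)
```

Shrinking `n_eqtl` or the stage 1 effect size in this toy setup is one way to see how a noisy expression model affects the scale of the stage 2 estimate; with a single predictor the stage 2 z-statistic itself is unchanged by rescaling, which is one reason multi-SNP models behave differently from this sketch.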
Affiliation(s)
- Ruoyu He
- School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, USA
| | - Haoran Xue
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, USA
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, USA
| | | |
|
9
|
Elgart M, Lyons G, Romero-Brufau S, Kurniansyah N, Brody JA, Guo X, Lin HJ, Raffield L, Gao Y, Chen H, de Vries P, Lloyd-Jones DM, Lange LA, Peloso GM, Fornage M, Rotter JI, Rich SS, Morrison AC, Psaty BM, Levy D, Redline S, Sofer T. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun Biol 2022; 5:856. [PMID: 35995843 PMCID: PMC9395509 DOI: 10.1038/s42003-022-03812-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 07/19/2021] [Accepted: 08/05/2022] [Indexed: 01/03/2023] Open
Abstract
Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this via a machine learning approach, validated in nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard, linear PRS model developed using PRSice, LDpred2, and lassosum2. Combining a PRS as a feature in an XGBoost model results in a relative increase in the percentage variance explained compared to the standard linear PRS model by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to specific racial/ethnic group trained models and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
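The percentages in this abstract are relative increases in the percentage of variance explained, so a 100% figure means the variance explained doubled rather than that all variance was explained. A minimal sketch of that arithmetic follows; the function name and the input values are hypothetical and are not taken from the paper.

```python
import math

def relative_increase_pct(baseline_pve, new_pve):
    """Relative increase (in percent) of variance explained over a baseline model."""
    if baseline_pve <= 0:
        raise ValueError("baseline variance explained must be positive")
    return 100.0 * (new_pve - baseline_pve) / baseline_pve

# Hypothetical values: a linear PRS explaining 10% of variance versus a
# combined model explaining 12.2% corresponds to a 22% relative increase.
example = relative_increase_pct(0.10, 0.122)

# Hypothetical values: going from 2% to 4% is a 100% relative increase.
doubling = relative_increase_pct(0.02, 0.04)

print(round(example, 1), round(doubling, 1))
```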
Affiliation(s)
- Michael Elgart
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
- Department of Medicine, Harvard Medical School, Boston, MA, USA.
| | - Genevieve Lyons
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Santiago Romero-Brufau
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Medicine, Mayo Clinic, Rochester, MN, USA
| | - Nuzulul Kurniansyah
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
| | - Jennifer A Brody
- Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, WA, USA
| | - Xiuqing Guo
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Henry J Lin
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Laura Raffield
- Department of Genetics, University of North Carolina, Chapel Hill, NC, USA
| | - Yan Gao
- The Jackson Heart Study, University of Mississippi Medical Center, Jackson, MS, USA
| | - Han Chen
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Paul de Vries
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Leslie A Lange
- Department of Medicine, University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, USA
| | - Gina M Peloso
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
| | - Myriam Fornage
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Jerome I Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Stephen S Rich
- Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA
| | - Alanna C Morrison
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Bruce M Psaty
- Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, WA, USA
| | - Daniel Levy
- The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA
- The Framingham Heart Study, Framingham, MA, USA
| | - Susan Redline
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
| | - Tamar Sofer
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
- Department of Medicine, Harvard Medical School, Boston, MA, USA.
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
| |
|
10
|
Network Approaches for Charting the Transcriptomic and Epigenetic Landscape of the Developmental Origins of Health and Disease. Genes (Basel) 2022; 13:genes13050764. [PMID: 35627149 PMCID: PMC9141211 DOI: 10.3390/genes13050764] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/04/2022] [Revised: 04/04/2022] [Accepted: 04/13/2022] [Indexed: 02/04/2023] Open
Abstract
The early developmental phase is of critical importance for human health and disease later in life. To decipher the molecular mechanisms at play, current biomedical research is increasingly relying on large quantities of diverse omics data. The integration and interpretation of the different datasets pose a critical challenge towards the holistic understanding of the complex biological processes that are involved in early development. In this review, we outline the major transcriptomic and epigenetic processes and the respective datasets that are most relevant for studying the periconceptional period. We cover both basic data processing and analysis steps, as well as more advanced data integration methods. A particular focus is given to network-based methods. Finally, we review the medical applications of such integrative analyses.
|
11
|
Schubert R, Geoffroy E, Gregga I, Mulford AJ, Aguet F, Ardlie K, Gerszten R, Clish C, Van Den Berg D, Taylor KD, Durda P, Johnson WC, Cornell E, Guo X, Liu Y, Tracy R, Conomos M, Blackwell T, Papanicolaou G, Lappalainen T, Mikhaylova AV, Thornton TA, Cho MH, Gignoux CR, Lange L, Lange E, Rich SS, Rotter JI, Manichaikul A, Im HK, Wheeler HE. Protein prediction for trait mapping in diverse populations. PLoS One 2022; 17:e0264341. [PMID: 35202437 PMCID: PMC8870552 DOI: 10.1371/journal.pone.0264341] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/17/2021] [Accepted: 02/08/2022] [Indexed: 11/18/2022] Open
Abstract
Genetically regulated gene expression has helped elucidate the biological mechanisms underlying complex traits. Improved high-throughput technology allows similar interrogation of the genetically regulated proteome for understanding complex trait mechanisms. Here, we used the Trans-omics for Precision Medicine (TOPMed) Multi-omics pilot study, which comprises data from Multi-Ethnic Study of Atherosclerosis (MESA), to optimize genetic predictors of the plasma proteome for genetically regulated proteome-wide association studies (PWAS) in diverse populations. We built predictive models for protein abundances using data collected in TOPMed MESA, for which we have measured 1,305 proteins by a SOMAscan assay. We compared predictive models built via elastic net regression to models integrating posterior inclusion probabilities estimated by fine-mapping SNPs prior to elastic net. In order to investigate the transferability of predictive models across ancestries, we built protein prediction models in all four of the TOPMed MESA populations, African American (n = 183), Chinese (n = 71), European (n = 416), and Hispanic/Latino (n = 301), as well as in all populations combined. As expected, fine-mapping produced more significant protein prediction models, especially in African ancestries populations, potentially increasing opportunity for discovery. When we tested our TOPMed MESA models in the independent European INTERVAL study, fine-mapping improved cross-ancestries prediction for some proteins. Using GWAS summary statistics from the Population Architecture using Genomics and Epidemiology (PAGE) study, which comprises ∼50,000 Hispanic/Latinos, African Americans, Asians, Native Hawaiians, and Native Americans, we applied S-PrediXcan to perform PWAS for 28 complex traits. The most protein-trait associations were discovered, colocalized, and replicated in large independent GWAS using proteome prediction model training populations with similar ancestries to PAGE. 
At current training population sample sizes, performance between baseline and fine-mapped protein prediction models in PWAS was similar, highlighting the utility of elastic net. Our predictive models in diverse populations are publicly available for use in proteome mapping methods at https://doi.org/10.5281/zenodo.4837327.
Affiliation(s)
- Ryan Schubert
- Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL, United States of America
- Department of Biology, Loyola University Chicago, Chicago, IL, United States of America
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States of America
| | - Elyse Geoffroy
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States of America
| | - Isabelle Gregga
- Department of Biology, Loyola University Chicago, Chicago, IL, United States of America
| | - Ashley J. Mulford
- Department of Biology, Loyola University Chicago, Chicago, IL, United States of America
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States of America
| | - Francois Aguet
- Broad Institute, Cambridge, MA, United States of America
| | - Kristin Ardlie
- Broad Institute, Cambridge, MA, United States of America
| | - Robert Gerszten
- Beth Israel Deaconess Medical Center, Boston, MA, United States of America
| | - Clary Clish
- Broad Institute, Cambridge, MA, United States of America
| | - David Van Den Berg
- University of Southern California, Los Angeles, CA, United States of America
| | - Kent D. Taylor
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, United States of America
| | - Peter Durda
- Laboratory for Clinical Biochemistry Research, University of Vermont, Burlington, VT, United States of America
| | - W. Craig Johnson
- Collaborative Health Studies Coordinating Center, University of Washington, Seattle, WA, United States of America
| | - Elaine Cornell
- Laboratory for Clinical Biochemistry Research, University of Vermont, Burlington, VT, United States of America
| | - Xiuqing Guo
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, United States of America
| | - Yongmei Liu
- Department of Medicine, Duke University School of Medicine, Durham, NC, United States of America
| | - Russell Tracy
- Laboratory for Clinical Biochemistry Research, University of Vermont, Burlington, VT, United States of America
| | - Matthew Conomos
- Department of Biostatistics, University of Washington, Seattle, WA, United States of America
| | - Tom Blackwell
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, United States of America
| | - George Papanicolaou
- Epidemiology Branch, National Heart, Lung and Blood Institute, Bethesda, MD, United States of America
| | - Tuuli Lappalainen
- New York Genome Center and Department of Systems Biology, Columbia University, New York, NY United States of America
| | - Anna V. Mikhaylova
- Department of Biostatistics, University of Washington, Seattle, WA, United States of America
| | - Timothy A. Thornton
- Department of Biostatistics, University of Washington, Seattle, WA, United States of America
| | - Michael H. Cho
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, United States of America
| | - Christopher R. Gignoux
- Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
| | - Leslie Lange
- Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
| | - Ethan Lange
- Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
| | - Stephen S. Rich
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States of America
| | - Jerome I. Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, United States of America
| | - Ani Manichaikul
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States of America
| | - Hae Kyung Im
- Section of Genetic Medicine, The University of Chicago, Chicago, IL, United States of America
| | - Heather E. Wheeler
- Department of Biology, Loyola University Chicago, Chicago, IL, United States of America
- Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States of America
| |
|
12
|
Meta-imputation of transcriptome from genotypes across multiple datasets by leveraging publicly available summary-level data. PLoS Genet 2022; 18:e1009571. [PMID: 35100255 PMCID: PMC8830793 DOI: 10.1371/journal.pgen.1009571] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/26/2021] [Revised: 02/10/2022] [Accepted: 01/07/2022] [Indexed: 11/22/2022] Open
Abstract
Transcriptome-wide association studies (TWAS) are a powerful method for identifying and interpreting the biological mechanisms behind GWAS findings by associating gene expression levels with phenotypes. In TWAS, gene expression is often imputed from individual-level genotypes of regulatory variants identified from external resources, such as the Genotype-Tissue Expression (GTEx) Project. In this setting, a straightforward approach to imputing expression levels in a specific tissue is to use a model trained on the same tissue type. When multiple tissues are available for the same subjects, it has been demonstrated that training imputation models on multiple tissue types improves accuracy because of shared eQTLs between tissues and an increase in effective sample size. However, existing joint-tissue methods require access to genotype and expression data across all tissues. Moreover, they cannot leverage the abundance of expression datasets across various tissues for non-overlapping individuals. Here, we explore the optimal way to combine imputed levels across training models from multiple tissues and datasets in a flexible manner using summary-level data. Our proposed method (SWAM) combines an arbitrary number of transcriptome imputation models to linearly optimize the imputation accuracy for a given target tissue. By integrating models across tissues and/or individuals, SWAM can improve the accuracy of transcriptome imputation or the power of TWAS while requiring individual-level data from only a single reference cohort. To evaluate the accuracy of SWAM, we combined 49 tissue-specific gene expression imputation models from the GTEx Project as well as from a large eQTL study of the Depression Susceptibility Genes and Networks (DGN) Project and tested imputation accuracy in GEUVADIS lymphoblastoid cell line samples. We also extend our meta-imputation method to meta-TWAS to leverage multiple tissues in TWAS analysis with summary-level statistics.
Our results highlight the importance of integrating multiple tissues to unravel the regulatory impacts of genetic variants on complex traits.
Gene expression levels within a cell are affected by various factors, including DNA variation, cell type, cellular microenvironment, disease status, and other environmental factors surrounding the individual. The genetic component of gene expression is known to explain a substantial fraction of transcriptional variation among individuals and can be imputed from genotypes in a tissue-specific manner by training on population-scale transcriptomic profiles designed to identify expression quantitative trait loci (eQTLs). Imputing gene expression levels has been shown to help in understanding the genetic basis of human disease through transcriptome-wide association analysis (TWAS) and Mendelian randomization (MR). However, it has been unclear how to integrate multiple imputation models trained on individual datasets to maximize their accuracy without having to access individual genotypes and expression levels, which are often protected because of privacy concerns. We developed SWAM (Smartly Weighted Averaging across Multiple datasets), a meta-imputation framework that can accurately impute gene expression levels from genotypes by integrating multiple imputation models without requiring individual-level data. Our method examines the similarities and differences between resources and borrows the information most relevant to the tissue of interest. We demonstrate that SWAM outperforms existing single-tissue and multi-tissue imputation models and continues to increase in accuracy as additional imputation models are integrated.
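The idea of combining several imputation models so as to "linearly optimize" accuracy for a target tissue can be illustrated with ordinary least squares on a reference cohort. The sketch below is an assumption-laden toy, not SWAM's actual algorithm: the function names (`solve_linear`, `fit_weights`, `combine`) are hypothetical, no regularization is applied, and the inputs are per-individual predictions rather than the summary-level data the abstract describes.

```python
from typing import List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; model predictions may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weights(preds: List[List[float]], observed: List[float]) -> List[float]:
    """Least-squares weights for K models.

    preds[k][i] is model k's imputed expression for reference individual i;
    observed[i] is the measured expression in the target tissue.
    """
    k = len(preds)
    gram = [[sum(p * q for p, q in zip(preds[a], preds[b])) for b in range(k)]
            for a in range(k)]
    rhs = [sum(p * y for p, y in zip(preds[a], observed)) for a in range(k)]
    return solve_linear(gram, rhs)


def combine(preds: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of the K models' predictions for each individual."""
    n = len(preds[0])
    return [sum(w * preds[k][i] for k, w in enumerate(weights)) for i in range(n)]


# Toy data (hypothetical): two models, four reference individuals.
toy_preds = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 2.0]]
toy_observed = [0.3, 0.7, 1.0, 1.4]
toy_weights = fit_weights(toy_preds, toy_observed)
```

In this toy setup the fitted weights indicate how much each model contributes to the target tissue; a real application would need to guard against overfitting when many correlated models are combined.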
|
13
|
Lin Z, Xue H, Malakhov MM, Knutson KA, Pan W. Accounting for nonlinear effects of gene expression identifies additional associated genes in transcriptome-wide association studies. Hum Mol Genet 2022; 31:2462-2470. [PMID: 35043938 PMCID: PMC9307319 DOI: 10.1093/hmg/ddac015] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/16/2021] [Revised: 01/08/2022] [Accepted: 01/10/2022] [Indexed: 01/21/2023] Open
Abstract
Transcriptome-wide association studies (TWAS) integrate genome-wide association study (GWAS) data with gene expression (GE) data to identify (putative) causal genes for complex traits. There are two stages in TWAS: in Stage 1, a model is built to impute gene expression from genotypes, and in Stage 2, gene-trait association is tested using imputed gene expression. Despite many successes with TWAS, in the current practice, one only assumes a linear relationship between GE and the trait, which however may not hold, leading to loss of power. In this study, we extend the standard TWAS by considering a quadratic effect of GE, in addition to the usual linear effect. We train imputation models for both linear and quadratic gene expression levels in Stage 1, then include both the imputed linear and quadratic expression levels in Stage 2. We applied both the standard TWAS and our approach first to the ADNI gene expression data and the IGAP Alzheimer's disease GWAS summary data, then to the GTEx (V8) gene expression data and the UK Biobank individual-level GWAS data for lipids, followed by validation with different GWAS data, suitable model checking and more robust TWAS methods. In all these applications, the new TWAS approach was able to identify additional genes associated with Alzheimer's disease, LDL and HDL cholesterol levels, suggesting its likely power gains and thus the need to account for potentially nonlinear effects of gene expression on complex traits.
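The two-stage structure described above can be written out as a small set of equations. This is a sketch under stated assumptions: the abstract does not give the form of the Stage 1 imputation models, so linear models in the genotypes $G_{ij}$ with weights $\hat{w}_j$ and $\hat{v}_j$ are assumed here, and all symbols are introduced for illustration only.

```latex
\begin{align*}
% Stage 1 (assumed linear in genotypes): impute expression and squared expression
\widehat{GE}_i &= \sum_{j} \hat{w}_j \, G_{ij}, &
\widehat{GE^2}_i &= \sum_{j} \hat{v}_j \, G_{ij} \\
% Stage 2, standard TWAS: linear effect only
Y_i &= \beta_0 + \beta_1 \widehat{GE}_i + \varepsilon_i \\
% Stage 2, extended TWAS: linear plus quadratic effect
Y_i &= \beta_0 + \beta_1 \widehat{GE}_i + \beta_2 \widehat{GE^2}_i + \varepsilon_i
\end{align*}
```

Under this notation the standard TWAS is the special case $\beta_2 = 0$, so a gene whose effect on the trait is mainly quadratic could be missed by the linear model yet detected by the extended one.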
Affiliation(s)
- Zhaotong Lin
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Haoran Xue
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Mykhaylo M Malakhov
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Katherine A Knutson
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Wei Pan
- To whom correspondence should be addressed at: A460 Mayo Building, 420 Delaware St SE, Minneapolis, MN 55455, USA. Tel: (612)626-2705; Fax: (612)626-0660;
| |
|