1. Interpretable Bayesian network abstraction for dimension reduction. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07810-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]

2. Ghazi AR, Sucipto K, Rahnavard A, Franzosa EA, McIver LJ, Lloyd-Price J, Schwager E, Weingart G, Moon YS, Morgan XC, Waldron L, Huttenhower C. High-sensitivity pattern discovery in large, paired multiomic datasets. Bioinformatics 2022; 38:i378-i385. [PMID: 35758795 PMCID: PMC9235493 DOI: 10.1093/bioinformatics/btac232] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Indexed: 11/13/2022]
Abstract
MOTIVATION Modern biological screens yield enormous numbers of measurements, and identifying and interpreting statistically significant associations among features are essential. In experiments featuring multiple high-dimensional datasets collected from the same set of samples, it is useful to identify groups of associated features between the datasets in a way that provides high statistical power and false discovery rate (FDR) control. RESULTS Here, we present a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for structured association discovery between paired high-dimensional datasets. HAllA efficiently integrates hierarchical hypothesis testing with FDR correction to reveal significant linear and non-linear block-wise relationships among continuous and/or categorical data. We optimized and evaluated HAllA using heterogeneous synthetic datasets of known association structure, where HAllA outperformed all-against-all and other block-testing approaches across a range of common similarity measures. We then applied HAllA to a series of real-world multiomics datasets, revealing new associations between gene expression and host immune activity, the microbiome and host transcriptome, metabolomic profiling and human health phenotypes. AVAILABILITY AND IMPLEMENTATION An open-source implementation of HAllA is freely available at http://huttenhower.sph.harvard.edu/halla along with documentation, demo datasets and a user group. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
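The abstract above contrasts HAllA with plain all-against-all testing followed by FDR correction. As a rough illustration of that baseline only, the following sketch applies the Benjamini-Hochberg step-up procedure to a flat list of p-values. The choice of Benjamini-Hochberg is an assumption (the abstract does not name the FDR procedure), the function name and p-values are hypothetical, and this is not HAllA's hierarchical method.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Illustrative baseline only (assumed procedure, not HAllA itself).
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per feature pair tested across two datasets.
pairwise_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(pairwise_p, alpha=0.05)
```

With m features in one dataset and n in the other, this baseline tests all m × n pairs, which is the multiple-testing burden that block-wise approaches aim to reduce.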
Affiliation(s)
- Andrew R Ghazi
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Kathleen Sucipto
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Ali Rahnavard
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Eric A Franzosa
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Lauren J McIver
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Jason Lloyd-Price
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Emma Schwager
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- George Weingart
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Yo Sup Moon
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Xochitl C Morgan
  - Department of Microbiology and Immunology, University of Otago, Dunedin 9016, New Zealand
- Levi Waldron
  - Department of Epidemiology and Biostatistics, City University of New York Graduate School of Public Health and Health Policy, New York City, NY 10035, USA
- Curtis Huttenhower
  - Biostatistics Department, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Department of Immunology and Infectious Diseases, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA

3. Jin J, Jia B, Yuan YJ. Combining nucleotide variations and structure variations for improving astaxanthin biosynthesis. Microb Cell Fact 2022; 21:79. [PMID: 35527251 PMCID: PMC9082887 DOI: 10.1186/s12934-022-01793-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2021] [Accepted: 04/10/2022] [Indexed: 11/13/2022]
Abstract
Background Mutational technology has been used to achieve genome-wide variations in laboratory and industrial microorganisms. Genetic polymorphisms of natural genome evolution include nucleotide variations and structural variations, which inspired us to suggest that both types of genotypic variations are potentially useful in improving the performance of chassis cells for industrial applications. However, highly efficient approaches that simultaneously generate structural and nucleotide variations are still lacking. Results The aim of this study was to develop a method of increasing biosynthesis of astaxanthin in yeast by Combining Nucleotide variations And Structure variations (CNAS), which were generated by combinations of Atmospheric and room temperature plasma (ARTP) and Synthetic Chromosome Recombination and Modification by LoxP-Mediated Evolution (SCRaMbLE) system. CNAS was applied to increase the biosynthesis of astaxanthin in yeast and resulted in improvements of 2.2- and 7.0-fold in the yield of astaxanthin. Furthermore, this method was shown to be able to generate structures (deletion, duplication, and inversion) as well as nucleotide variations (SNPs and InDels) simultaneously. Additionally, genetic analysis of the genotypic variations of an astaxanthin improved strain revealed that the deletion of YJR116W and the C2481G mutation of YOL084W enhanced yield of astaxanthin, suggesting a genotype-to-phenotype relationship. Conclusions This study demonstrated that the CNAS strategy could generate both structure variations and nucleotide variations, allowing the enhancement of astaxanthin yield by different genotypes in yeast. Overall, this study provided a valuable tool for generating genomic variation diversity that has desirable phenotypes as well as for knowing the relationship between genotypes and phenotypes in evolutionary processes. Supplementary Information The online version contains supplementary material available at 10.1186/s12934-022-01793-6.

4. Zhao X, Sun XF, Zhao LL, Huang LJ, Wang PC. Morphological, transcriptomic and metabolomic analyses of Sophora davidii mutants for plant height. BMC Plant Biology 2022; 22:144. [PMID: 35337273 PMCID: PMC8951708 DOI: 10.1186/s12870-022-03503-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/22/2021] [Accepted: 03/02/2022] [Indexed: 05/28/2023]
Abstract
Sophora davidii is an important plant resource in the karst region of Southwest China, but S. davidii plant-height mutants are rarely reported. Therefore, we performed phenotypic, anatomic structural, transcriptomic and metabolomic analyses to study the mechanisms responsible for S. davidii plant-height mutants. Phenotypic and anatomical observations showed that compared to the wild type, the dwarf mutant displayed a significant decrease in plant height, while the tall mutant displayed a significant increase in plant height. The dwarf mutant cells were smaller and more densely arranged, while those of the wild type and the tall mutant were larger and loosely arranged. Transcriptomic analysis revealed that differentially expressed genes (DEGs) involved in cell wall biosynthesis, expansion, phytohormone biosynthesis, signal transduction pathways, flavonoid biosynthesis and phenylpropanoid biosynthesis were significantly enriched in the S. davidii plant-height mutants. Metabolomic analysis revealed 57 significantly differential metabolites screened from both the dwarf and tall mutants. A total of 8 significantly different flavonoid compounds were annotated to LIPID MAPS, and three metabolites (chlorogenic acid, kaempferol and scopoletin) were involved in phenylpropanoid biosynthesis and flavonoid biosynthesis. These results shed light on the molecular mechanisms of plant height in S. davidii mutants and provide insight for further molecular breeding programs.
Affiliation(s)
- Xin Zhao
  - College of Animal Science, Guizhou University, Guiyang, 550025, China
- Xiao-Fu Sun
  - Weining Plateau Grassland Test Station, Weining, 553100, China
- Li-Li Zhao
  - College of Animal Science, Guizhou University, Guiyang, 550025, China
- Li-Juan Huang
  - College of Animal Science, Guizhou University, Guiyang, 550025, China
- Pu-Chang Wang
  - Guizhou Institute of Prataculture, Guiyang, 550006, China

5. Natukunda MI, Mantilla-Perez MB, Graham MA, Liu P, Salas-Fernandez MG. Dissection of canopy layer-specific genetic control of leaf angle in Sorghum bicolor by RNA sequencing. BMC Genomics 2022; 23:95. [PMID: 35114939 PMCID: PMC8812014 DOI: 10.1186/s12864-021-08251-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2021] [Accepted: 12/10/2021] [Indexed: 11/11/2022]
Abstract
BACKGROUND Leaf angle is an important plant architecture trait, affecting plant density, light interception efficiency, photosynthetic rate, and yield. The "smart canopy" model proposes more vertical leaves in the top plant layers and more horizontal leaves in the lower canopy, maximizing conversion efficiency and photosynthesis. Sorghum leaf arrangement is opposite to that proposed in the "smart canopy" model, indicating the need for improvement. Although leaf angle quantitative trait loci (QTL) have been previously reported, only the Dwarf3 (Dw3) auxin transporter gene, colocalizing with a major-effect QTL on chromosome 7, has been validated. Additionally, the genetic architecture of leaf angle across canopy layers remains to be elucidated. RESULTS This study characterized the canopy-layer specific transcriptome of five sorghum genotypes using RNA sequencing. A set of 284 differentially expressed genes for at least one layer comparison (FDR < 0.05) co-localized with 69 leaf angle QTL and were consistently identified across genotypes. These genes are involved in transmembrane transport, hormone regulation, oxidation-reduction process, response to stimuli, lipid metabolism, and photosynthesis. The most relevant eleven candidate genes for layer-specific angle modification include those homologous to genes controlling leaf angle in rice and maize or genes associated with cell size/expansion, shape, and cell number. CONCLUSIONS Considering the predicted functions of candidate genes, their potential undesirable pleiotropic effects should be further investigated across tissues and developmental stages. Future validation of proposed candidates and exploitation through genetic engineering or gene editing strategies targeted to collar cells will bring researchers closer to the realization of a "smart canopy" sorghum.
Affiliation(s)
- Maria B Mantilla-Perez
  - Department of Agronomy, Iowa State University, Ames, IA, 50011, USA
  - Present address: Bayer Crop Science, Chesterfield, MO, USA
- Michelle A Graham
  - Department of Agronomy, Iowa State University, Ames, IA, 50011, USA
  - Corn Insects and Crop Genetics Research, USDA-ARS, Ames, IA, 50011, USA
- Peng Liu
  - Department of Statistics, Iowa State University, Ames, IA, 50011, USA

6. Becker AK, Dörr M, Felix SB, Frost F, Grabe HJ, Lerch MM, Nauck M, Völker U, Völzke H, Kaderali L. From heterogeneous healthcare data to disease-specific biomarker networks: A hierarchical Bayesian network approach. PLoS Comput Biol 2021; 17:e1008735. [PMID: 33577591 PMCID: PMC7906470 DOI: 10.1371/journal.pcbi.1008735] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/19/2020] [Revised: 02/25/2021] [Accepted: 01/22/2021] [Indexed: 01/26/2023]
Abstract
In this work, we introduce an entirely data-driven and automated approach to reveal disease-associated biomarker and risk factor networks from heterogeneous and high-dimensional healthcare data. Our workflow is based on Bayesian networks, which are a popular tool for analyzing the interplay of biomarkers. Usually, data require extensive manual preprocessing and dimension reduction to allow for effective learning of Bayesian networks. For heterogeneous data, this preprocessing is hard to automate and typically requires domain-specific prior knowledge. We here combine Bayesian network learning with hierarchical variable clustering in order to detect groups of similar features and learn interactions between them in an entirely automated way. We present an optimization algorithm for the adaptive refinement of such group Bayesian networks to account for a specific target variable, like a disease. The combination of Bayesian networks, clustering, and refinement yields low-dimensional but disease-specific interaction networks. These networks provide easily interpretable, yet accurate models of biomarker interdependencies. We test our method extensively on simulated data, as well as on data from the Study of Health in Pomerania (SHIP-TREND), and demonstrate its effectiveness using non-alcoholic fatty liver disease and hypertension as examples. We show that the group network models outperform available biomarker scores, while at the same time, they provide an easily interpretable interaction network.

High-dimensional and heterogeneous healthcare data, such as electronic health records or epidemiological study data, contain much information on yet unknown risk factors that are associated with disease development. The identification of these risk factors may help to improve prevention, diagnosis, and therapy. Bayesian networks are powerful statistical models that can decipher these complex relationships. However, high dimensionality and heterogeneity of data, together with missing values and high feature correlation, make it difficult to automatically learn a good model from data. To facilitate the use of network models, we present a novel, fully automated workflow that combines network learning with hierarchical clustering. The algorithm reveals groups of strongly related features and models the interactions among those groups. It results in simpler network models that are easier to analyze. We introduce a method of adaptive refinement of such models to ensure that disease-relevant parts of the network are modeled in great detail. Our approach makes it easy to learn compact, accurate, and easily interpretable biomarker interaction networks. We test our method extensively on simulated data as well as data from the Study of Health in Pomerania (SHIP-TREND) by learning models of hypertension and non-alcoholic fatty liver disease.
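The grouping step described above (detecting groups of strongly related features before learning a network over the groups) can be sketched as a small agglomerative clustering. This is a minimal illustration under stated assumptions, not the authors' algorithm: the distance 1 - |Pearson r|, average linkage, the `max_distance` threshold, the function names, and the toy measurements are all hypothetical choices, since the abstract does not specify them.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster_features(data, max_distance=0.5):
    """Group feature names by average-linkage clustering on 1 - |r|.

    `data` maps a feature name to its list of measurements. Merging stops
    once the closest pair of groups is farther apart than `max_distance`
    (an assumed, illustrative stopping rule).
    """
    names = list(data)
    # Pairwise distances between individual features.
    dist = {
        frozenset((a, b)): 1 - abs(pearson(data[a], data[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

    def linkage(c1, c2):
        # Average distance over all cross-group feature pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > 1:
        # Closest pair of groups under average linkage.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy measurements for four biomarkers in five subjects.
toy = {
    "sbp": [120, 130, 140, 150, 160],
    "dbp": [80, 85, 88, 95, 100],
    "alt": [30, 22, 45, 28, 35],
    "ast": [28, 20, 40, 27, 33],
}
groups = cluster_features(toy, max_distance=0.5)
```

In a group-network workflow, each resulting group would then be treated as a single node, so the network is learned over far fewer variables than the raw feature set.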
Affiliation(s)
- Ann-Kristin Becker
  - Institute of Bioinformatics, University Medicine Greifswald, Greifswald, Germany
- Marcus Dörr
  - Department of Internal Medicine B, University Medicine Greifswald, Greifswald, Germany
  - German Centre for Cardiovascular Research (DZHK), partner site Greifswald, Greifswald, Germany
- Stephan B. Felix
  - Department of Internal Medicine B, University Medicine Greifswald, Greifswald, Germany
  - German Centre for Cardiovascular Research (DZHK), partner site Greifswald, Greifswald, Germany
- Fabian Frost
  - Department of Internal Medicine A, University Medicine Greifswald, Greifswald, Germany
- Hans J. Grabe
  - Department of Psychiatry, University Medicine Greifswald, Greifswald, Germany
- Markus M. Lerch
  - Department of Internal Medicine A, University Medicine Greifswald, Greifswald, Germany
- Matthias Nauck
  - Institute of Clinical Chemistry and Laboratory Medicine, University Medicine Greifswald, Greifswald, Germany
- Uwe Völker
  - Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Henry Völzke
  - Institute of Community Medicine, SHIP/KEF, University Medicine Greifswald, Greifswald, Germany
- Lars Kaderali
  - Institute of Bioinformatics, University Medicine Greifswald, Greifswald, Germany

7. Zhang X, Du W, Zhang J, Zou Z, Ruan C. High-throughput profiling of diapause regulated genes from Trichogramma dendrolimi (Hymenoptera: Trichogrammatidae). BMC Genomics 2020; 21:864. [PMID: 33276726 PMCID: PMC7718664 DOI: 10.1186/s12864-020-07285-4] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 02/12/2020] [Accepted: 11/26/2020] [Indexed: 11/10/2022]
Abstract
Background The parasitoid wasp, Trichogramma dendrolimi, can enter diapause at the prepupal stage. Thus, diapause is an efficient preservation method during the mass production of T. dendrolimi. Previous studies on diapause have mainly focused on ecological characteristics, so the molecular basis of diapause in T. dendrolimi is unknown. We compared transcriptomes of diapause and non-diapause T. dendrolimi to identify key genes and pathways involved in diapause development. Results Transcriptome sequencing was performed on diapause prepupae, pupae after diapause, non-diapause prepupae, and pupae. Analysis yielded a total of 87,022 transcripts with an average length of 1604 bp. By removing redundant sequences and those without significant BLAST hits, a non-redundant dataset was generated, containing 7593 sequences with an average length of 3351 bp. Among them, 5702 genes were differentially expressed. The result of Gene Ontology (GO) enrichment analysis revealed that regulation of transcription, DNA-templated, oxidation-reduction process, and signal transduction were significantly affected. Ten genes were selected for validation using quantitative real-time PCR (qPCR). The changes showed the same trend as between the qPCR and RNA-Seq results. Several genes were identified as involved in diapause, including ribosomal proteins, zinc finger proteins, homeobox proteins, forkhead box proteins, UDP-glucuronosyltransferase, Glutathione-S-transferase, p53, and DNA damage-regulated gene 1 (pdrg1). Genes related to lipid metabolism were also included. Conclusions We generated a large amount of transcriptome data from T. dendrolimi, providing a resource for future gene function research. The diapause-related genes identified help reveal the molecular mechanisms of diapause, in T. dendrolimi, and other insect species. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-020-07285-4.
Affiliation(s)
- Xue Zhang
  - Engineering Research Center of Natural Enemies, Institute of Biological Control, Jilin Agricultural University, Changchun, 130118, China
- Wenmei Du
  - Engineering Research Center of Natural Enemies, Institute of Biological Control, Jilin Agricultural University, Changchun, 130118, China
- Junjie Zhang
  - Engineering Research Center of Natural Enemies, Institute of Biological Control, Jilin Agricultural University, Changchun, 130118, China
- Zhen Zou
  - State Key Laboratory of Integrated Management of Pest Insect and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, 100101, China
- Changchun Ruan
  - Engineering Research Center of Natural Enemies, Institute of Biological Control, Jilin Agricultural University, Changchun, 130118, China

8. Liu X, Xia Y, Zhang Y, Yang C, Xiong Z, Song X, Ai L. Comprehensive transcriptomic and proteomic analyses of antroquinonol biosynthetic genes and enzymes in Antrodia camphorata. AMB Express 2020; 10:136. [PMID: 32748086 PMCID: PMC7399014 DOI: 10.1186/s13568-020-01076-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/06/2020] [Accepted: 07/28/2020] [Indexed: 01/06/2023]
Abstract
Antroquinonol (AQ) has several remarkable bioactivities in acute myeloid leukaemia and pancreatic cancer, but difficulties in the mass production of AQ hamper its applications. Currently, molecular biotechnology methods, such as gene overexpression, have been widely used to increase the production of metabolites. However, AQ biosynthetic genes and enzymes are poorly understood. In this study, an integrated study coupling RNA-Seq and isobaric tags for relative and absolute quantitation (iTRAQ) were used to identify AQ synthesis-related genes and enzymes in Antrodia camphorata during coenzyme Q0-induced fermentation (FM). The upregulated genes related to acetyl-CoA synthesis indicated that acetyl-CoA enters the mevalonate pathway to form the farnesyl tail precursor of AQ. The metE gene for an enzyme with methyl transfer activity provided sufficient methyl groups for AQ structure formation. The CoQ2 and ubiA genes encode p-hydroxybenzoate polyprenyl transferase, linking coenzyme Q0 and the polyisoprene side chain to form coenzyme Q3. NADH is transformed into NAD+ and releases two electrons, which may be beneficial for the conversion of coenzyme Q3 to AQ. Understanding the biosynthetic genes and enzymes of AQ is important for improving its production by genetic means in the future.
Affiliation(s)
- Xiaofeng Liu
  - Shanghai Engineering Research Center of Food Microbiology, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai, 200093, People's Republic of China
- Yongjun Xia
  - Shanghai Engineering Research Center of Food Microbiology, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai, 200093, People's Republic of China
- Yao Zhang
  - Shanghai Engineering Research Center of Food Microbiology, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai, 200093, People's Republic of China
- Caiyun Yang
  - Shanghai Engineering Research Center of Food Microbiology, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai, 200093, People's Republic of China
- Zhiqiang Xiong
  - Shanghai Engineering Research Center of Food Microbiology, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai, 200093, People's Republic of China
- Xin Song
  - Shanghai Engineering Research Center of Food Microbiology, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai, 200093, People's Republic of China
- Lianzhong Ai
  - Shanghai Engineering Research Center of Food Microbiology, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai, 200093, People's Republic of China

9. Kimmerling RJ, Prakadan SM, Gupta AJ, Calistri NL, Stevens MM, Olcum S, Cermak N, Drake RS, Pelton K, De Smet F, Ligon KL, Shalek AK, Manalis SR. Linking single-cell measurements of mass, growth rate, and gene expression. Genome Biol 2018; 19:207. [PMID: 30482222 PMCID: PMC6260722 DOI: 10.1186/s13059-018-1576-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/03/2018] [Accepted: 10/31/2018] [Indexed: 11/26/2022]
Abstract
Mass and growth rate are highly integrative measures of cell physiology not discernable via genomic measurements. Here, we introduce a microfluidic platform enabling direct measurement of single-cell mass and growth rate upstream of highly multiplexed single-cell profiling such as single-cell RNA sequencing. We resolve transcriptional signatures associated with single-cell mass and growth rate in L1210 and FL5.12 cell lines and activated CD8+ T cells. Further, we demonstrate a framework using these linked measurements to characterize biophysical heterogeneity in a patient-derived glioblastoma cell line with and without drug treatment. Our results highlight the value of coupled phenotypic metrics in guiding single-cell genomics.
Affiliation(s)
- Robert J. Kimmerling
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
- Sanjay M. Prakadan
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology, and Harvard, Cambridge, MA 02139 USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
- Alejandro J. Gupta
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology, and Harvard, Cambridge, MA 02139 USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
- Nicholas L. Calistri
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
- Mark M. Stevens
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02215 USA
- Selim Olcum
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
- Nathan Cermak
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
- Riley S. Drake
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology, and Harvard, Cambridge, MA 02139 USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
- Kristine Pelton
  - Department of Oncologic Pathology, Dana-Farber Cancer Institute, Boston, MA 02215 USA
- Keith L. Ligon
  - Department of Oncologic Pathology, Dana-Farber Cancer Institute, Boston, MA 02215 USA
- Alex K. Shalek
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology, and Harvard, Cambridge, MA 02139 USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
  - Harvard-MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Massachusetts General Hospital, Boston, MA 02114 USA
- Scott R. Manalis
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
  - Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

10. Kang J, Rancati T, Lee S, Oh JH, Kerns SL, Scott JG, Schwartz R, Kim S, Rosenstein BS. Machine Learning and Radiogenomics: Lessons Learned and Future Directions. Front Oncol 2018; 8:228. [PMID: 29977864 PMCID: PMC6021505 DOI: 10.3389/fonc.2018.00228] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 02/16/2018] [Accepted: 06/04/2018] [Indexed: 12/25/2022]
Abstract
Due to the rapid increase in the availability of patient data, there is significant interest in precision medicine that could facilitate the development of a personalized treatment plan for each patient on an individual basis. Radiation oncology is particularly suited for predictive machine learning (ML) models due to the enormous amount of diagnostic data used as input and therapeutic data generated as output. An emerging field in precision radiation oncology that can take advantage of ML approaches is radiogenomics, which is the study of the impact of genomic variations on the sensitivity of normal and tumor tissue to radiation. Currently, patients undergoing radiotherapy are treated using uniform dose constraints specific to the tumor and surrounding normal tissues. This is suboptimal in many ways. First, the dose that can be delivered to the target volume may be insufficient for control but is constrained by the surrounding normal tissue, as dose escalation can lead to significant morbidity and rare. Second, two patients with nearly identical dose distributions can have substantially different acute and late toxicities, resulting in lengthy treatment breaks and suboptimal control, or chronic morbidities leading to poor quality of life. Despite significant advances in radiogenomics, the magnitude of the genetic contribution to radiation response far exceeds our current understanding of individual risk variants. In the field of genomics, ML methods are being used to extract harder-to-detect knowledge, but these methods have yet to fully penetrate radiogenomics. Hence, the goal of this publication is to provide an overview of ML as it applies to radiogenomics. We begin with a brief history of radiogenomics and its relationship to precision medicine. We then introduce ML and compare it to statistical hypothesis testing to reflect on shared lessons and to avoid common pitfalls. Current ML approaches to genome-wide association studies are examined. The application of ML specifically to radiogenomics is next presented. We end with important lessons for the proper integration of ML into radiogenomics.
Affiliation(s)
- John Kang
  - Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY, United States
- Tiziana Rancati
  - Prostate Cancer Program, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Sangkyu Lee
  - Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Jung Hun Oh
  - Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Sarah L. Kerns
  - Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY, United States
- Jacob G. Scott
  - Department of Translational Hematology and Oncology Research, Cleveland Clinic, Cleveland, OH, United States
  - Department of Radiation Oncology, Cleveland Clinic, Cleveland, OH, United States
- Russell Schwartz
  - Computational Biology Department, Carnegie Mellon School of Computer Science, Pittsburgh, PA, United States
  - Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, United States
- Seyoung Kim
  - Computational Biology Department, Carnegie Mellon School of Computer Science, Pittsburgh, PA, United States
- Barry S. Rosenstein
  - Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY, United States
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
|
11
|
Zhao C, Jiang J, Guan Y, Guo X, He B. EMR-based medical knowledge representation and inference via Markov random fields and distributed representation learning. Artif Intell Med 2018; 87:49-59. [PMID: 29691122 DOI: 10.1016/j.artmed.2018.03.005] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 05/22/2017] [Revised: 02/28/2018] [Accepted: 03/29/2018] [Indexed: 01/09/2023]
Abstract
OBJECTIVE Electronic medical records (EMRs) contain medical knowledge that can be used for clinical decision support (CDS). Our objective is to develop a general system that can extract and represent knowledge contained in EMRs to support three CDS tasks (test recommendation, initial diagnosis, and treatment plan recommendation) given the condition of a patient. METHODS We extracted four kinds of medical entities from records and constructed an EMR-based medical knowledge network (EMKN), in which nodes are entities and edges reflect their co-occurrence in a record. Three bipartite subgraphs (bigraphs) were extracted from the EMKN, one to support each task. One part of the bigraph was the given condition (e.g., symptoms), and the other was the condition to be inferred (e.g., diseases). Each bigraph was regarded as a Markov random field (MRF) to support the inference. We proposed three graph-based energy functions and three likelihood-based energy functions. Two of these functions are based on knowledge representation learning and can provide distributed representations of medical entities. Two EMR datasets and three metrics were utilized to evaluate the performance. RESULTS As a whole, the evaluation results indicate that the proposed system outperformed the baseline methods. The distributed representation of medical entities does reflect similarity relationships with respect to knowledge level. CONCLUSION Combining EMKN and MRF is an effective approach for general medical knowledge representation and inference. Different tasks, however, require individually designed energy functions.
Affiliation(s)
- Chao Zhao
  - School of Computer Science and Technology, Harbin, Heilongjiang 150001, China.
- Jingchi Jiang
  - School of Computer Science and Technology, Harbin, Heilongjiang 150001, China.
- Yi Guan
  - School of Computer Science and Technology, Harbin, Heilongjiang 150001, China.
- Xitong Guo
  - School of Management, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China.
- Bin He
  - School of Computer Science and Technology, Harbin, Heilongjiang 150001, China.
|
12
|
Sinoquet C. A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies. BMC Bioinformatics 2018; 19:106. [PMID: 29587628 PMCID: PMC5870262 DOI: 10.1186/s12859-018-2054-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 12/10/2016] [Accepted: 02/09/2018] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND Genome-wide association studies (GWASs) have been widely used to discover the genetic basis of complex phenotypes. However, standard single-SNP GWASs suffer from lack of power. In particular, they do not directly account for linkage disequilibrium, that is, the dependences between SNPs (Single Nucleotide Polymorphisms). RESULTS We present a comparative study of two multilocus GWAS strategies in the random forest-based framework. The first method, T-Trees, was designed by Botta and collaborators (Botta et al., PLoS ONE 9(4):e93379, 2014). We designed the other method, an innovative hybrid method combining T-Trees with the modeling of linkage disequilibrium. Linkage disequilibrium is modeled through a collection of tree-shaped Bayesian networks with latent variables, following our earlier work (Mourad et al., BMC Bioinformatics 12(1):16, 2011). We compared the two methods on both simulated and real data. For dominant and additive genetic models, in either of the simulated conditions, the hybrid approach always performs slightly better than T-Trees. We assessed predictive power through the standard ROC technique on 14 real datasets. For 10 of the 14 datasets analyzed, the already high predictive power observed for T-Trees (0.910-0.946) can still be increased by up to 0.030. We also assessed whether the distributions of SNPs' scores obtained from T-Trees and the hybrid approach differed. Finally, we thoroughly analyzed the intersections of the top 100 SNPs output by any two or all three of the methods (T-Trees, the hybrid approach, and the single-SNP method). CONCLUSIONS Refining T-Trees through finer linkage disequilibrium modeling is shown to be beneficial. The distributions of SNPs' scores generated by T-Trees and the hybrid approach are shown to be statistically different, which suggests that the methods are complementary. In particular, for 12 of the 14 real datasets, the distribution tail of the highest SNPs' scores shows larger values for the hybrid approach. The hybrid approach thus pinpoints more interesting SNPs than T-Trees, which can be provided as a short list of prioritized SNPs for further analysis by biologists. Finally, among the 211 top-100 SNPs jointly detected by the single-SNP method, T-Trees and the hybrid approach over the 14 datasets, we identified 72 and 38 SNPs present in the top 25 and top 10, respectively, of each method.
Affiliation(s)
- Christine Sinoquet
  - LS2N, UMR CNRS 6004, Université de Nantes, 2 rue de la Houssinière, BP 92208, Nantes Cedex, 44322, France.
|
13
|
Survival, gene and metabolite responses of Litoria verreauxii alpina frogs to fungal disease chytridiomycosis. Sci Data 2018; 5:180033. [PMID: 29509187 PMCID: PMC5839156 DOI: 10.1038/sdata.2018.33] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 08/31/2017] [Accepted: 01/08/2018] [Indexed: 12/20/2022] Open
Abstract
The fungal skin disease chytridiomycosis has caused the devastating decline and extinction of hundreds of amphibian species globally, yet the potential for evolving resistance, and the underlying pathophysiological mechanisms remain poorly understood. We exposed 406 naïve, captive-raised alpine tree frogs (Litoria verreauxii alpina) from multiple populations (one evolutionarily naïve to chytridiomycosis) to the aetiological agent Batrachochytrium dendrobatidis in two concurrent and controlled infection experiments. We investigated (A) survival outcomes and clinical pathogen burdens between populations and clutches, and (B) individual host tissue responses to chytridiomycosis. Here we present multiple interrelated datasets associated with these exposure experiments, including animal signalment, survival and pathogen burden of 355 animals from Experiment A, and the following datasets related to 61 animals from Experiment B: animal signalment and pathogen burden; raw RNA-Seq reads from skin, liver and spleen tissues; de novo assembled transcriptomes for each tissue type; raw gene expression data; annotation data for each gene; and raw metabolite expression data from skin and liver tissues. These data provide an extensive baseline for future analyses.
|
14
|
Sex and tissue specific gene expression patterns identified following de novo transcriptomic analysis of the Norway lobster, Nephrops norvegicus. BMC Genomics 2017; 18:622. [PMID: 28814267 PMCID: PMC5559819 DOI: 10.1186/s12864-017-3981-2] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 12/13/2016] [Accepted: 08/01/2017] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND The Norway lobster, Nephrops norvegicus, is economically important in European fisheries and is a key organism in local marine ecosystems. Despite multi-faceted scientific interest in this species, our current knowledge of genetic resources in this species remains very limited. Here, we generated a reference de novo transcriptome for N. norvegicus from multiple tissues in both sexes. Bioinformatic analyses were conducted to detect transcripts that were expressed exclusively in either males or females. Patterns were validated via RT-PCR. RESULTS Sixteen N. norvegicus libraries were sequenced from immature and mature ovary, testis and vas deferens (including the masculinizing androgenic gland). In addition, eyestalk, brain, thoracic ganglia and hepatopancreas tissues were screened in males and both immature and mature females. RNA-Sequencing resulted in >600 million reads. De novo assembly that combined the current dataset with two previously published libraries from eyestalk tissue, yielded a reference transcriptome of 333,225 transcripts with an average size of 708 base pairs (bp), with an N50 of 1272 bp. Sex-specific transcripts were detected primarily in gonads followed by hepatopancreas, brain, thoracic ganglia, and eyestalk, respectively. Candidate transcripts that were expressed exclusively either in males or females were highlighted and the 10 most abundant ones were validated via RT-PCR. Among the most highly expressed genes were Serine threonine protein kinase in testis and Vitellogenin in female hepatopancreas. These results align closely with gene annotation results. Moreover, a differential expression heatmap showed that the majority of differentially expressed transcripts were identified in gonad and eyestalk tissues. Results indicate that sex-specific gene expression patterns in Norway lobster are controlled by differences in gene regulation pattern between males and females in somatic tissues. 
CONCLUSIONS The current study presents the first multi-tissue reference transcriptome for the Norway lobster that can be applied to future biological, wild restocking and fisheries studies. Sex-specific markers were mainly expressed in males implying that males may experience stronger selection than females. It is apparent that differential expression is due to sex-specific gene regulatory pathways that are present in somatic tissues and not from effects of genes located on heterogametic sex chromosomes. The N. norvegicus data provide a foundation for future gene-based reproductive studies.
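The abstract above reports both an average transcript size (708 bp) and an N50 (1272 bp). As a minimal sketch of how the N50 statistic is commonly computed, the Python function below returns the length at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total. The function name `n50` and the toy lengths are illustrative assumptions, not data or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical transcript lengths in base pairs (not from the study).
toy_lengths = [2000, 1500, 1272, 900, 600, 400, 300]

mean_length = sum(toy_lengths) / len(toy_lengths)
print("mean length:", round(mean_length, 1))
print("N50:", n50(toy_lengths))
```

Because N50 is weighted by sequence length, a few long transcripts pull it above the arithmetic mean, which is consistent with the gap between the two figures quoted in the abstract.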
|
15
|
Taliun D, Gamper J, Leser U, Pattaro C. Fast Sampling-Based Whole-Genome Haplotype Block Recognition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:315-325. [PMID: 27045830 DOI: 10.1109/tcbb.2015.2456897] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 06/05/2023]
Abstract
Scaling linkage disequilibrium (LD) based haplotype block recognition to the entire human genome has always been a challenge. The best-known algorithm has quadratic runtime complexity and, even when sophisticated search space pruning is applied, still requires several days of computations. Here, we propose a novel sampling-based algorithm, called S-MIG++, where the main idea is to estimate the area that most likely contains all haplotype blocks by sampling a very small number of SNP pairs. A subsequent refinement step computes the exact blocks by considering only the SNP pairs within the estimated area. This approach significantly reduces the number of computed LD statistics, making the recognition of haplotype blocks very fast. We theoretically and empirically prove that the area containing all haplotype blocks can be estimated with a very high degree of certainty. Through experiments on the 243,080 SNPs on chromosome 20 from the 1000 Genomes Project, we compared our previous algorithm MIG++ with the new S-MIG++ and observed a runtime reduction from 2.8 weeks to 34.8 hours. In a parallelized version of the S-MIG++ algorithm using 32 parallel processes, the runtime was further reduced to 5.1 hours.
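The runtimes quoted in the abstract above can be turned into speedup factors with a few lines of arithmetic. The Python sketch below uses only the three reported figures (2.8 weeks, 34.8 hours, 5.1 hours on 32 processes); the variable names are illustrative, and the parallel-efficiency figure rests on the assumption, not stated in the abstract, that the 34.8-hour run used a single process.

```python
HOURS_PER_WEEK = 7 * 24

# Runtimes reported in the abstract, converted to hours.
previous_hours = 2.8 * HOURS_PER_WEEK   # previous algorithm
sampling_hours = 34.8                    # sampling-based algorithm
parallel_hours = 5.1                     # sampling-based, 32 processes
n_processes = 32

# Speedup from sampling alone, and from adding parallelism on top.
sampling_speedup = previous_hours / sampling_hours
parallel_speedup = sampling_hours / parallel_hours
overall_speedup = previous_hours / parallel_hours

# Assumption: the 34.8 h baseline was a single-process run.
parallel_efficiency = parallel_speedup / n_processes

print(f"sampling speedup: {sampling_speedup:.1f}x")
print(f"parallel speedup: {parallel_speedup:.1f}x")
print(f"overall speedup: {overall_speedup:.1f}x")
print(f"parallel efficiency: {parallel_efficiency:.0%}")
```

Under that assumption, sampling accounts for roughly a 13.5-fold reduction and parallelization for a further factor of about 6.8, so most of the gain comes from computing fewer LD statistics rather than from adding processes.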
|
16
|
Abstract
Models for genome-wide prediction and association studies usually target a single phenotypic trait. However, in animal and plant genetics it is common to record information on multiple phenotypes for each individual that will be genotyped. Modeling traits individually disregards the fact that they are most likely associated due to pleiotropy and shared biological basis, thus providing only a partial, confounded view of genetic effects and phenotypic interactions. In this article we use data from a Multiparent Advanced Generation Inter-Cross (MAGIC) winter wheat population to explore Bayesian networks as a convenient and interpretable framework for the simultaneous modeling of multiple quantitative traits. We show that they are equivalent to multivariate genetic best linear unbiased prediction (GBLUP) and that they are competitive with single-trait elastic net and single-trait GBLUP in predictive performance. Finally, we discuss their relationship with other additive-effects models and their advantages in inference and interpretation. MAGIC populations provide an ideal setting for this kind of investigation because the very low population structure and large sample size result in predictive models with good power and limited confounding due to relatedness.
|
17
|
Bayesian systems-based genetic association analysis with effect strength estimation and omic wide interpretation: a case study in rheumatoid arthritis. Methods in Molecular Biology (Clifton, N.J.) 2014; 1142:143-76. [PMID: 24706282 DOI: 10.1007/978-1-4939-0404-4_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Abstract
Rich dependency structures are often formed in genetic association studies between the phenotypic, clinical, and environmental descriptors. These descriptors may not be standardized, and may encompass various disease definitions and clinical endpoints which are only weakly influenced by various (e.g., genetic) factors. Such loosely defined complex intermediate clinical phenotypes are typically used in follow-up candidate gene association studies, e.g., after genome-wide analysis, to deepen the understanding of the associations and to estimate effect strength. This chapter discusses a solid methodology, which is useful in such a scenario, by using probabilistic graphical models, namely, Bayesian networks in the Bayesian statistical framework. This method offers systematically scalable, comprehensive hierarchical hypotheses about multivariate relevance. We discuss its workflow: from data engineering to semantic publication of the results. We overview the construction, visualization, and interpretation of complex hypotheses related to the structural analysis of relevance. Furthermore, we illustrate the use of a dependency model-based relevance measure, which takes into account the structural properties of the model, for quantifying the effect strength. Finally, we discuss the "interpretational" or translational challenge of a genetic association study, with a focus on the fusion of heterogeneous omic knowledge to reintegrate the results into a genome-wide context.
|
18
|
Yücebaş SC, Aydın Son Y. A prostate cancer model build by a novel SVM-ID3 hybrid feature selection method using both genotyping and phenotype data from dbGaP. PLoS One 2014; 9:e91404. [PMID: 24651484 PMCID: PMC3961262 DOI: 10.1371/journal.pone.0091404] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/16/2013] [Accepted: 02/12/2014] [Indexed: 01/11/2023] Open
Abstract
Genome-wide association studies (GWAS) make it possible to investigate many relations between single nucleotide polymorphisms (SNPs) and complex diseases. GWAS output can be both large in volume and high dimensional, and the relations between SNPs, phenotypes and diseases are most likely nonlinear. To handle high-volume, high-dimensional data and to find these nonlinear relations, we used data mining approaches and designed a hybrid feature selection model combining a support vector machine and a decision tree. The model was tested on prostate cancer data, and for the first time combined genotype and phenotype information was used to increase diagnostic performance. We were able to select phenotypic features such as ethnicity and body mass index, as well as SNPs that map to specific genes such as CRR9 and TERT. The performance of the proposed hybrid model on the prostate cancer dataset, with a sensitivity of 90.92% and an area under the ROC curve of 0.91, shows the potential of the approach for prediction and early detection of prostate cancer.
Affiliation(s)
- Sait Can Yücebaş
  - Medical Informatics Department, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Yeşim Aydın Son
  - Medical Informatics Department, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
  - Bioinformatics Graduate Program, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
|
19
|
|
20
|
Mourad R, Sinoquet C, Dina C, Leray P. Visualization of pairwise and multilocus linkage disequilibrium structure using latent forests. PLoS One 2011; 6:e27320. [PMID: 22174739 PMCID: PMC3236755 DOI: 10.1371/journal.pone.0027320] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 06/24/2011] [Accepted: 10/14/2011] [Indexed: 11/19/2022] Open
Abstract
The study of linkage disequilibrium is a major issue in statistical genetics, as it plays a fundamental role in gene mapping and helps us to learn more about human history. The complex structure of linkage disequilibrium makes exploratory data analysis essential yet challenging. Visualization methods, such as the triangular heat map implemented in Haploview, provide simple and useful tools to help understand complex genetic patterns, but remain insufficient to fully describe them. Probabilistic graphical models have been widely recognized as a powerful formalism allowing a concise and accurate modeling of dependences between variables. In this paper, we propose a method for short-range, long-range and chromosome-wide linkage disequilibrium visualization using forests of hierarchical latent class models. Thanks to its hierarchical nature, our method is shown to provide a compact view of both pairwise and multilocus linkage disequilibrium spatial structures for the geneticist. In addition, a multilocus linkage disequilibrium measure has been designed to evaluate linkage disequilibrium in hierarchy clusters. To learn the proposed model, a new scalable algorithm is presented. It constrains the dependence scope, relying on physical positions, and is able to deal with more than one hundred thousand single nucleotide polymorphisms. The proposed algorithm is fast and does not require phased genotypic data.
Affiliation(s)
- Raphaël Mourad
  - LINA, UMR CNRS 6241, Ecole Polytechnique de l'Université de Nantes, BP 50609 Nantes, France.
|
21
|
Mourad R, Sinoquet C, Leray P. Probabilistic graphical models for genetic association studies. Brief Bioinform 2011; 13:20-33. [PMID: 21450805 DOI: 10.1093/bib/bbr015] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022] Open
Abstract
Probabilistic graphical models have been widely recognized as a powerful formalism in the bioinformatics field, especially in gene expression studies and linkage analysis. Although less well known in association genetics, many successful methods have recently emerged to dissect the genetic architecture of complex diseases. In this review article, we cover the applications of these models to the population association studies' context, such as linkage disequilibrium modeling, fine mapping and candidate gene studies, and genome-scale association studies. Significant breakthroughs of the corresponding methods are highlighted, but emphasis is also given to their current limitations, in particular, to the issue of scalability. Finally, we give promising directions for future research in this field.
Affiliation(s)
- Raphaël Mourad
  - Ecole Polytechnique de l'Université de Nantes, rue Christian Pauc, BP 50609, 44306 Nantes Cedex 3, France.
|