1
Wang Q, Tang TM, Youlton N, Weldy CS, Kenney AM, Ronen O, Weston Hughes J, Chin ET, Sutton SC, Agarwal A, Li X, Behr M, Kumbier K, Moravec CS, Wilson Tang WH, Margulies KB, Cappola TP, Butte AJ, Arnaout R, Brown JB, Priest JR, Parikh VN, Yu B, Ashley EA. Epistasis regulates genetic control of cardiac hypertrophy. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2023.11.06.23297858. [PMID: 37987017 PMCID: PMC10659487 DOI: 10.1101/2023.11.06.23297858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
The combinatorial effect of genetic variants is often assumed to be additive. Although genetic variation can clearly interact non-additively, methods to uncover epistatic relationships remain in their infancy. We develop low-signal signed iterative random forests to elucidate the complex genetic architecture of cardiac hypertrophy. We derive deep learning-based estimates of left ventricular mass from the cardiac MRI scans of 29,661 individuals enrolled in the UK Biobank. We report epistatic genetic variation including variants close to CCDC141, IGF1R, TTN, and TNKS. Several loci where variants were deemed insignificant in univariate genome-wide association analyses are identified. Functional genomic and integrative enrichment analyses reveal a complex gene regulatory network in which genes mapped from these loci share biological processes and myogenic regulatory factors. Through a network analysis of transcriptomic data from 313 explanted human hearts, we found strong gene co-expression correlations between these statistical epistasis contributors in healthy hearts and a significant connectivity decrease in failing hearts. We assess causality of epistatic effects via RNA silencing of gene-gene interactions in human induced pluripotent stem cell-derived cardiomyocytes. Finally, single-cell morphology analysis using a novel high-throughput microfluidic system shows that cardiomyocyte hypertrophy is non-additively modifiable by specific pairwise interactions between CCDC141 and both TTN and IGF1R. Our results expand the scope of genetic regulation of cardiac structure to epistasis.
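As a rough illustration of what "non-additive" means in a two-gene perturbation design like the one described above, the sketch below computes the usual interaction contrast on an additive scale: the double-perturbation measurement minus what the two single perturbations would predict if their effects simply summed. This is not the authors' analysis; the function name `interaction_contrast` and all numbers are hypothetical.

```python
def interaction_contrast(control, knock_a, knock_b, knock_ab):
    """Deviation of the double perturbation from the additive expectation.

    Under pure additivity the double perturbation equals
    control + (knock_a - control) + (knock_b - control).
    A non-zero return value indicates a non-additive (epistatic) effect
    on the scale the phenotype is measured in.
    """
    additive_expectation = knock_a + knock_b - control
    return knock_ab - additive_expectation


# Hypothetical mean cell sizes (arbitrary units), made up for illustration.
control = 100.0    # no perturbation
knock_a = 90.0     # gene A silenced
knock_b = 95.0     # gene B silenced
knock_ab = 70.0    # both silenced

contrast = interaction_contrast(control, knock_a, knock_b, knock_ab)
print(contrast)  # -15.0: the combined effect is larger than the sum of the parts
```

A contrast of zero would mean the two perturbations combine additively. In practice the contrast would be estimated with replicates and tested statistically rather than read off single values.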
2
Mouronte-López ML, Gómez Sánchez-Seco J, Benito RM. Patterns of human and bots behaviour on Twitter conversations about sustainability. Sci Rep 2024; 14:3223. [PMID: 38331929 PMCID: PMC10853507 DOI: 10.1038/s41598-024-52471-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 01/18/2024] [Indexed: 02/10/2024]
Abstract
Sustainability is an issue of worldwide concern. Twitter is one of the most popular social networks, which makes it particularly interesting for exploring opinions and characteristics related to issues of social concern. This paper aims to gain a better understanding of the sustainability-related activity that takes place on Twitter. In addition to building a mathematical model to identify account typologies (bot and human users), different behavioural patterns were detected using clustering analysis, mainly in the mechanisms of posting tweets and retweets. The model took as explanatory variables certain characteristics of the user's profile and activity. A lexicon-based sentiment analysis covering the period from 2006 to 2022 was also carried out, in conjunction with a keyword study based on centrality metrics. We found that, for both bot and human users, messages showed mostly positive sentiment. Bots had a higher percentage of neutral messages than human users. With respect to the keywords used, certain commonalities but also slight differences between humans and bots were identified.
Affiliation(s)
- Mary Luz Mouronte-López
  - Higher Polytechnic School, Universidad Francisco de Vitoria, Carretera Pozuelo a, Av de Majadahonda, Km 1.800, 28223, Madrid, Spain
- Javier Gómez Sánchez-Seco
  - Higher Polytechnic School, Universidad Francisco de Vitoria, Carretera Pozuelo a, Av de Majadahonda, Km 1.800, 28223, Madrid, Spain
  - Grupo de Sistemas Complejos, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid, Avda. Puerta de Hierro 2-4, 28040, Madrid, Spain
- Rosa M Benito
  - Grupo de Sistemas Complejos, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid, Avda. Puerta de Hierro 2-4, 28040, Madrid, Spain
3
Rhodes JS, Aumon A, Morin S, Girard M, Larochelle C, Brunet-Ratnasingham E, Pagliuzza A, Marchitto L, Zhang W, Cutler A, Grand'Maison F, Zhou A, Finzi A, Chomont N, Kaufmann DE, Zandee S, Prat A, Wolf G, Moon KR. Gaining Biological Insights through Supervised Data Visualization. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.11.22.568384. [PMID: 38293135 PMCID: PMC10827133 DOI: 10.1101/2023.11.22.568384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024]
Abstract
Dimensionality reduction-based data visualization is pivotal in comprehending complex biological data. The most common methods, such as PHATE, t-SNE, and UMAP, are unsupervised and therefore reflect the dominant structure in the data, which may be independent of expert-provided labels. Here we introduce a supervised data visualization method called RF-PHATE, which integrates expert knowledge for further exploration of the data. RF-PHATE leverages random forests to capture intricate feature-label relationships. Extracting information from the forest, RF-PHATE generates low-dimensional visualizations that highlight relevant data relationships while disregarding extraneous features. This approach scales to large datasets and applies to classification and regression. We illustrate RF-PHATE's prowess through three case studies. In a multiple sclerosis study using longitudinal clinical and imaging data, RF-PHATE unveils a sub-group of patients with non-benign relapsing-remitting multiple sclerosis, demonstrating its aptitude for time-series data. In the context of Raman spectral data, RF-PHATE effectively showcases the impact of antioxidants on diesel exhaust-exposed lung cells, highlighting its proficiency in noisy environments. Furthermore, RF-PHATE aligns established geometric structures with COVID-19 patient outcomes, enriching interpretability in a hierarchical manner. RF-PHATE bridges expert insights and visualizations, promising knowledge generation. Its adaptability, scalability, and noise tolerance underscore its potential for widespread adoption.
4
Bonham KS, Fahur Bottino G, McCann SH, Beauchemin J, Weisse E, Barry F, Cano Lorente R, Huttenhower C, Bruchhage M, D’Sa V, Deoni S, Klepac-Ceraj V. Gut-resident microorganisms and their genes are associated with cognition and neuroanatomy in children. SCIENCE ADVANCES 2023; 9:eadi0497. [PMID: 38134274 PMCID: PMC10745691 DOI: 10.1126/sciadv.adi0497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 11/21/2023] [Indexed: 12/24/2023]
Abstract
Emerging evidence implicates gut microbial metabolism in neurodevelopmental disorders, but its influence on typical neurodevelopment has not been explored in detail. We investigated the relationship between the microbiome and the neuroanatomy and cognition of 381 healthy children, demonstrating that differences in microbial taxa and genes are associated with overall cognitive function and the size of brain regions. Using a combination of statistical and machine learning models, we showed that species including Alistipes obesi, Blautia wexlerae, and Ruminococcus gnavus were enriched or depleted in children with higher cognitive function scores. Microbial metabolism of short-chain fatty acids was also associated with cognitive function. In addition, machine learning models were able to predict the volume of brain regions from microbial profiles, and taxa that were important in predicting cognitive function were also important for predicting individual brain regions and specific subscales of cognitive function. These findings provide potential biomarkers of neurocognitive development and may enable development of targets for early detection and intervention.
Affiliation(s)
- Kevin S. Bonham
  - Department of Biological Sciences, Wellesley College, Wellesley, MA, USA
- Elizabeth Weisse
  - Department of Psychology, University of Stavanger, Stavanger, Norway
- Curtis Huttenhower
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Associate Member, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Muriel Bruchhage
  - Department of Psychology, University of Stavanger, Stavanger, Norway
- Viren D’Sa
  - Rhode Island Hospital, Providence, RI, USA
- Sean Deoni
  - Rhode Island Hospital, Providence, RI, USA
- Vanja Klepac-Ceraj
  - Department of Biological Sciences, Wellesley College, Wellesley, MA, USA
5
|
Li X, Hu H, Ren Q, Wang M, Du Y, He Y, Wang Q. Comparative analysis of endophyte diversity of Dendrobium officinale lived on rock and tree. PLANT BIOTECHNOLOGY (TOKYO, JAPAN) 2023; 40:145-155. [PMID: 38264473 PMCID: PMC10804140 DOI: 10.5511/plantbiotechnology.23.0208a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 02/08/2023] [Indexed: 01/25/2024]
Abstract
Dendrobium officinale usually grows on rocks or trees, but its endophyte diversity has not yet been fully characterized. In this study, high-throughput sequencing was used to investigate the endophyte diversity of the roots of D. officinale growing on trees (Groups 1-3, arboreal type) and rocks (Group 4, lithophytic type). The results showed that the composition of endophytic fungi and bacteria was similar at the phylum level, while their relative abundances differed. The taxa composition and abundance of endophytes differed significantly among groups at the genus level. Alpha diversity of endophytic fungi was higher in the lithophytic type than in the arboreal type, whereas endophytic bacteria showed no such advantage. Beta diversity revealed that the endophytic fungi tended to cluster within each group, but the endophytic bacteria were dispersed among the groups. LEfSe analysis found more predicted endophyte biomarkers at the genus level in the lithophytic type than in the arboreal types, and the biomarkers varied among groups. Microbial network analysis revealed similarities and differences in the taxa composition and abundance of shared and group-specific endophytes. These results suggest that the root endophytes of lithophytic and arboreal D. officinale differ in diversity.
Affiliation(s)
- Xiaolan Li
  - Microbial Resources and Drug Development Key Laboratory of Guizhou Tertiary Institution, Life Sciences Institute, School of Stomatology, Zunyi Medical University, Zunyi 563000, China
- Huan Hu
  - Microbial Resources and Drug Development Key Laboratory of Guizhou Tertiary Institution, Life Sciences Institute, School of Stomatology, Zunyi Medical University, Zunyi 563000, China
- Qunli Ren
  - Microbial Resources and Drug Development Key Laboratory of Guizhou Tertiary Institution, Life Sciences Institute, School of Stomatology, Zunyi Medical University, Zunyi 563000, China
- Miao Wang
  - Microbial Resources and Drug Development Key Laboratory of Guizhou Tertiary Institution, Life Sciences Institute, School of Stomatology, Zunyi Medical University, Zunyi 563000, China
- Yimei Du
  - School of Pharmacy, Zunyi Medical University, Zunyi 563000, China
- Yuqi He
  - School of Pharmacy, Zunyi Medical University, Zunyi 563000, China
  - Key Laboratory of Basic Pharmacology of Ministry of Education and Joint International Research Laboratory of Ethnomedicine of Ministry of Education, Zunyi Medical University, Zunyi 563000, China
- Qian Wang
  - Microbial Resources and Drug Development Key Laboratory of Guizhou Tertiary Institution, Life Sciences Institute, School of Stomatology, Zunyi Medical University, Zunyi 563000, China
6
|
Mo X, Wang N, He Z, Kang W, Wang L, Han X, Yang L. The sub-molecular characterization identification for cervical cancer. Heliyon 2023; 9:e16873. [PMID: 37484385 PMCID: PMC10360967 DOI: 10.1016/j.heliyon.2023.e16873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2023] [Revised: 05/28/2023] [Accepted: 05/31/2023] [Indexed: 07/25/2023]
Abstract
Background: The efficacy of therapy in cervical cancer (CESC) is limited by high molecular heterogeneity. Thus, the sub-molecular characterization remains to be explored for personalizing the treatment of CESC patients. Methods: Datasets with 741 CESC patients were obtained from the TCGA and GEO databases. The NMF algorithm, random forest algorithm, and multivariate Cox analysis were used to construct a classifier for defining the sub-molecular characterization. Then, the biological characteristics, genomic variations, prognosis, and immune landscape of the molecular subtypes were explored. The significance of the classifier genes was validated by quantitative real-time PCR, cell transfection, cell colony formation assay, wound healing assay, cell proliferation assay, and Western blot. Results: The CESC patients were classified into two subtypes. Patients with a high classifier score, who differed significantly in ECM-receptor interaction, the PI3K-Akt signaling pathway, and the MAPK signaling pathway, showed a poorer prognosis in OS (p < 0.001), DFI (p = 0.016), PFI (p < 0.001), and DSS (p < 0.001), with high M0 macrophage and resting mast cell infiltration and low HLA family gene expression. Moreover, the constructed classifier shows high identification accuracy in the tumor/normal groups (AUC: 0.993), the tumor/CIN1-CIN3 groups (AUC: 0.963), and the normal/CIN1-CIN3 groups (AUC: 0.962), and its overall prediction performance is better than that of currently published signatures in CESC (C-index: 0.763). The combined prediction performance further indicated that the nomogram (AUC = 0.837) is superior to the classifier (AUC = 0.835) and stage (AUC = 0.568), and the C-index of the calibration curves is 0.784. Analysis of the potential biological function of the classifier genes indicated that silencing GALNT2 inhibited cancer cell proliferation, migration, and colony formation; conversely, cancer cell proliferation, migration, and colony formation increased after upregulation of GALNT2. The epithelial-mesenchymal transition experiment showed that GALNT2 knockdown might reduce the levels of Snail and Vimentin proteins and increase E-cadherin; conversely, GALNT2 upregulation increased the levels of Snail and Vimentin proteins and reduced E-cadherin. Conclusion: The classifier we constructed may help improve our understanding of subtype characteristics and provide a new strategy for developing CESC therapeutics. Remarkably, GALNT2 may be an option to directly target drivers in CESC cancer therapy.
Affiliation(s)
- XinKai Mo
  - Department of Clinical Laboratory, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan 250117, Shandong, PR China
- Na Wang
  - Department of Medical Laboratory Science, Xinjiang Bayingoleng Mongolian Autonomous Prefecture People's Hospital, Xinjiang, China
- Zanjing He
  - Department of Medical Laboratory Science, Xinjiang Bayingoleng Mongolian Autonomous Prefecture People's Hospital, Xinjiang, China
- Wenjun Kang
  - Department of Medical Laboratory Science, Xinjiang Bayingoleng Mongolian Autonomous Prefecture People's Hospital, Xinjiang, China
- Lu Wang
  - Department of Medical Laboratory Science, Xinjiang Bayingoleng Mongolian Autonomous Prefecture People's Hospital, Xinjiang, China
- Xia Han
  - Department of Medical Laboratory Science, Xinjiang Bayingoleng Mongolian Autonomous Prefecture People's Hospital, Xinjiang, China
- Liu Yang
  - Department of Clinical Laboratory, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan 250117, Shandong, PR China
7
|
Takefuji Y. Why the power of diversity does not always produce better groups and societies. Biosystems 2023; 229:104918. [PMID: 37196894 DOI: 10.1016/j.biosystems.2023.104918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Revised: 05/06/2023] [Accepted: 05/07/2023] [Indexed: 05/19/2023]
Abstract
Diversity is supposed to create better groups and societies but sometimes fails. It is explained why the power of diversity may not create better groups in the current diversity prediction theory. Diversity may hurt civic life and introduce distrust. This is because the current diversity prediction theory is based on real numbers that ignore individual abilities. Its diversity prediction theory maximizes performance with infinite population size. Contrary to this, collective intelligence or swarm intelligence is not maximized by infinite population size, but by population size. The extended diversity prediction theory using the complex number allows us to express individual abilities or qualities. The diversity of complex numbers always produces better groups and societies. The wisdom of crowds, collective intelligence, swarm intelligence or nature-inspired intelligence is implemented in the current machine learning or artificial intelligence, called Random Forest. The problem of the current diversity prediction theory is detailed in this paper.
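For orientation, the real-valued diversity prediction theorem that the abstract refers to is commonly stated as: collective squared error equals average individual squared error minus prediction diversity. The sketch below checks that identity numerically. It shows only the standard real-number form, not the complex-number extension proposed in the paper; the function name `diversity_prediction` and the sample guesses are assumptions made for illustration.

```python
from statistics import fmean


def diversity_prediction(predictions, truth):
    """Return (collective_error, average_error, diversity) for real-valued guesses.

    collective_error: squared error of the crowd's mean prediction
    average_error:    mean squared error of the individual predictions
    diversity:        mean squared deviation of individuals from the crowd mean
    """
    crowd = fmean(predictions)
    collective_error = (crowd - truth) ** 2
    average_error = fmean((p - truth) ** 2 for p in predictions)
    diversity = fmean((p - crowd) ** 2 for p in predictions)
    return collective_error, average_error, diversity


# Hypothetical guesses for a quantity whose true value is 50.
predictions = [40.0, 55.0, 70.0]
truth = 50.0

collective, average, diversity = diversity_prediction(predictions, truth)
print(collective, average, diversity)  # 25.0 175.0 150.0
# The identity: collective error = average error - diversity
assert abs(collective - (average - diversity)) < 1e-9
```

Note that every guess enters with equal weight in this form, which is consistent with the abstract's remark that the real-valued theory does not express individual abilities.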
Affiliation(s)
- Yoshiyasu Takefuji
  - Faculty of Data Science, Musashino University, 3-3-3 Ariake Koto-ku, Tokyo, 135-8181, Japan
8
Baiocchi GC, Vojdani A, Rosenberg AZ, Vojdani E, Halpert G, Ostrinski Y, Zyskind I, Filgueiras IS, Schimke LF, Marques AHC, Giil LM, Lavi YB, Silverberg JI, Zimmerman J, Hill DA, Thornton A, Kim M, De Vito R, Fonseca DLM, Plaça DR, Freire PP, Camara NOS, Calich VLG, Scheibenbogen C, Heidecke H, Lattin MT, Ochs HD, Riemekasten G, Amital H, Shoenfeld Y, Cabral-Marques O. Cross-sectional analysis reveals autoantibody signatures associated with COVID-19 severity. J Med Virol 2023; 95:e28538. [PMID: 36722456 DOI: 10.1002/jmv.28538] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 10/28/2022] [Revised: 01/20/2023] [Accepted: 01/24/2023] [Indexed: 02/02/2023]
Abstract
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection is associated with increased levels of autoantibodies targeting immunological proteins such as cytokines and chemokines. Reports further indicate that COVID-19 patients may develop a broad spectrum of autoimmune diseases due to reasons not fully understood. Even so, the landscape of autoantibodies induced by SARS-CoV-2 infection remains uncharted territory. To gain more insight, we carried out a comprehensive assessment of autoantibodies known to be linked to diverse autoimmune diseases observed in COVID-19 patients in a cohort of 231 individuals, of which 161 were COVID-19 patients (72 with mild, 61 moderate, and 28 with severe disease) and 70 were healthy controls. Dysregulated IgG and IgA autoantibody signatures, characterized mainly by elevated concentrations, occurred predominantly in patients with moderate or severe COVID-19 infection. Autoantibody levels often accompanied anti-SARS-CoV-2 antibody concentrations while stratifying COVID-19 severity as indicated by random forest and principal component analyses. Furthermore, while young versus elderly COVID-19 patients showed only slight differences in autoantibody levels, elderly patients with severe disease presented higher IgG autoantibody concentrations than young individuals with severe COVID-19. This work maps the intersection of COVID-19 and autoimmunity by demonstrating the dysregulation of multiple autoantibodies triggered during SARS-CoV-2 infection. Thus, this cross-sectional study suggests that SARS-CoV-2 infection induces autoantibody signatures associated with COVID-19 severity and several autoantibodies that can be used as biomarkers of COVID-19 severity, indicating autoantibodies as potential therapeutical targets for these patients.
Affiliation(s)
- Gabriela C Baiocchi
  - Department of Immunology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Aristo Vojdani
  - Immunosciences Laboratory, Inc., Department of Immunology, Los Angeles, California, USA
  - Cyrex Laboratories, Phoenix, Arizona, USA
- Avi Z Rosenberg
  - Department of Pathology, Johns Hopkins University, Baltimore, Maryland, USA
- Gilad Halpert
  - Ariel University, Ariel, Israel
  - Zabludowicz Center for Autoimmune Diseases, Sheba Medical Center, Tel-Hashomer, Israel
  - Saint Petersburg State University Russia, St Petersburg, Russia
- Yuri Ostrinski
  - Ariel University, Ariel, Israel
  - Zabludowicz Center for Autoimmune Diseases, Sheba Medical Center, Tel-Hashomer, Israel
  - Saint Petersburg State University Russia, St Petersburg, Russia
- Israel Zyskind
  - Department of Pediatrics, NYU Langone Medical Center, New York, New York, USA
  - Maimonides Medical Center, Brooklyn, New York, USA
- Igor S Filgueiras
  - Department of Immunology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Lena F Schimke
  - Department of Immunology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Alexandre H C Marques
  - Department of Immunology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Lasse M Giil
  - Department of Internal Medicine, Haraldsplass Deaconess Hospital, Bergen, Norway
- Yael B Lavi
  - Department of Chemistry Ben Gurion University Beer-Sheva, Beer-Sheva, Israel
- Jonathan I Silverberg
  - Department of Dermatology, George Washington University School of Medicine and Health Sciences, Washington, USA
- Myungjin Kim
  - Data Science Initiative at Brown University, Providence, Rhode Island, USA
- Roberta De Vito
  - Department of Biostatistics and the Data Science Initiative at Brown University, Providence, Rhode Island, USA
- Dennyson L M Fonseca
  - Interunit Postgraduate Program on Bioinformatics, Institute of Mathematics and Statistics (IME), University of Sao Paulo (USP), Sao Paulo, Brazil
- Desireé R Plaça
  - Department of Clinical and Toxicological Analyses, School of Pharmaceutical Sciences, São Paulo, Brazil
- Paula P Freire
  - Department of Immunology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Niels O S Camara
  - Department of Immunology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Vera L G Calich
  - Department of Immunology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Carmen Scheibenbogen
  - Institute for Medical Immunology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Harald Heidecke
  - CellTrend Gesellschaft mit beschränkter Haftung (GmbH), Luckenwalde, Germany
- Miriam T Lattin
  - Department of Biology, Yeshiva University, Manhattan, New York, USA
- Hans D Ochs
  - Department of Pediatrics, University of Washington School of Medicine, and Seattle Children's Research Institute, Seattle, Washington, USA
- Gabriela Riemekasten
  - Department of Rheumatology, University Medical Center Schleswig-Holstein Campus Lübeck, Lübeck, Germany
- Howard Amital
  - Ariel University, Ariel, Israel
  - Zabludowicz Center for Autoimmune Diseases, Sheba Medical Center, Tel-Hashomer, Israel
  - Department of Medicine B, Sheba Medical Center, Tel Hashomer, Israel
  - Sackler Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
- Yehuda Shoenfeld
  - Zabludowicz Center for Autoimmune Diseases, Sheba Medical Center, Tel-Hashomer, Israel
  - Saint Petersburg State University Russia, St Petersburg, Russia
- Otavio Cabral-Marques
  - Department of Immunology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
  - Interunit Postgraduate Program on Bioinformatics, Institute of Mathematics and Statistics (IME), University of Sao Paulo (USP), Sao Paulo, Brazil
  - Department of Clinical and Toxicological Analyses, School of Pharmaceutical Sciences, São Paulo, Brazil
  - Department of Pharmacy and Postgraduate Program of Health and Science, Federal University of Rio Grande do Norte, Natal, Brazil
  - Department of Medicine, Division of Molecular Medicine, University of São Paulo School of Medicine, Baltimore, USA
  - Laboratory of Medical Investigation 29, University of São Paulo School of Medicine, São Paulo, Brazil
9
Rolczynski BS, Díaz SA, Kim YC, Mathur D, Klein WP, Medintz IL, Melinger JS. Determining interchromophore effects for energy transport in molecular networks using machine-learning algorithms. Phys Chem Chem Phys 2023; 25:3651-3665. [PMID: 36648290 DOI: 10.1039/d2cp04960k] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 01/05/2023]
Abstract
Nature uses chromophore networks, with highly optimized structural and energetic characteristics, to perform important chemical functions. Due to its modularity, predictable aggregation characteristics, and established synthetic protocols, structural DNA nanotechnology is a promising medium for arranging chromophore networks with analogous structural and energetic controls. However, this high level of control creates a greater need to know how to optimize the systems precisely. This study uses the system's modularity to produce variations of a coupled 14-Site chromophore network. It uses machine-learning algorithms and spectroscopy measurements to reveal the energy-transport roles of these Sites, paying particular attention to the cooperative and inhibitive effects they impose on each other for transport across the network. The physical significance of these patterns is contextualized, using molecular dynamics simulations and energy-transport modeling. This analysis yields insights about how energy transfers across the Donor-Relay and Relay-Acceptor interfaces, as well as the energy-transport pathways through the homogeneous Relay segment. Overall, this report establishes an approach that uses machine-learning methods to understand, in fine detail, the role that each Site plays in an optoelectronic molecular network.
Affiliation(s)
- Brian S Rolczynski
  - Electronics Science and Technology Division, Code 6800, U.S. Naval Research Laboratory, Washington, DC 20375, USA
- Sebastián A Díaz
  - Center for Bio/Molecular Science and Engineering, Code 6900, U.S. Naval Research Laboratory, Washington, DC 20375, USA
- Young C Kim
  - Materials Science and Technology Division, Code 6300, U.S. Naval Research Laboratory, Washington, DC 20375, USA
- Divita Mathur
  - Department of Chemistry, Case Western Reserve University, Cleveland, OH 44106, USA
- William P Klein
  - Center for Bio/Molecular Science and Engineering, Code 6900, U.S. Naval Research Laboratory, Washington, DC 20375, USA
- Igor L Medintz
  - Center for Bio/Molecular Science and Engineering, Code 6900, U.S. Naval Research Laboratory, Washington, DC 20375, USA
- Joseph S Melinger
  - Electronics Science and Technology Division, Code 6800, U.S. Naval Research Laboratory, Washington, DC 20375, USA
10
Bavykina M, Kostina N, Lee CR, Schafleitner R, Bishop-von Wettberg E, Nuzhdin SV, Samsonova M, Gursky V, Kozlov K. Modeling of Flowering Time in Vigna radiata with Artificial Image Objects, Convolutional Neural Network and Random Forest. PLANTS (BASEL, SWITZERLAND) 2022; 11:3327. [PMID: 36501364 PMCID: PMC9738219 DOI: 10.3390/plants11233327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 11/22/2022] [Accepted: 11/28/2022] [Indexed: 06/17/2023]
Abstract
Flowering time is an important target for breeders in developing new varieties adapted to changing conditions. In this work, a new approach is proposed in which the SNP markers influencing time to flowering in mung bean are selected as important features in a random forest model. The genotypic and weather data are encoded in artificial image objects, and a model for flowering time prediction is constructed as a convolutional neural network. The model uses weather data for only a limited time period of 5 days before and 20 days after planting and is capable of predicting the time to flowering with high accuracy. The most important factors for model solution were identified using saliency maps and a Score-CAM method. Our approach can help breeding programs harness genotypic and phenotypic diversity to more effectively produce varieties with a desired flowering time.
Affiliation(s)
- Maria Bavykina
  - Mathematical Biology and Bioinformatics Lab, Peter the Great St. Petersburg Polytechnic University, 195251 Saint Petersburg, Russia
- Nadezhda Kostina
  - Mathematical Biology and Bioinformatics Lab, Peter the Great St. Petersburg Polytechnic University, 195251 Saint Petersburg, Russia
- Cheng-Ruei Lee
  - Institute of Ecology and Evolutionary Biology, National Taiwan University, Taipei 106319, Taiwan
- Eric Bishop-von Wettberg
  - Department of Plant and Soil Science, Gund Institute for the Environment, University of Vermont, Burlington, VT 05405, USA
- Sergey V. Nuzhdin
  - Mathematical Biology and Bioinformatics Lab, Peter the Great St. Petersburg Polytechnic University, 195251 Saint Petersburg, Russia
  - Program Molecular and Computation Biology, University of California, Los Angeles, CA 90095, USA
- Maria Samsonova
  - Mathematical Biology and Bioinformatics Lab, Peter the Great St. Petersburg Polytechnic University, 195251 Saint Petersburg, Russia
- Vitaly Gursky
  - Theoretical Department, Ioffe Institute, 194021 Saint Petersburg, Russia
- Konstantin Kozlov
  - Mathematical Biology and Bioinformatics Lab, Peter the Great St. Petersburg Polytechnic University, 195251 Saint Petersburg, Russia
11
Accelerating imputation of missing genotypes using parallel computing. J Genet 2022. [DOI: 10.1007/s12041-022-01396-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
12
Li L, Wu X, Chen J, Wang S, Wan Y, Ji H, Wen Y, Zhang J. Genetic Dissection of Epistatic Interactions Contributing Yield-Related Agronomic Traits in Rice Using the Compressed Mixed Model. PLANTS 2022; 11:plants11192504. [PMID: 36235370 PMCID: PMC9571936 DOI: 10.3390/plants11192504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2022] [Revised: 09/09/2022] [Accepted: 09/19/2022] [Indexed: 11/26/2022]
Abstract
Rice (Oryza sativa) is one of the most important cereal crops in the world, and yield-related agronomic traits, including plant height (PH), panicle length (PL), and protein content (PC), are prerequisites for attaining the desired yield and quality in breeding programs. Meanwhile, the main effects and epistatic effects of quantitative trait nucleotides (QTNs) are all important genetic components for yield-related quantitative traits. In this study, we conducted genome-wide association studies (GWAS) for 413 rice germplasm resources, with 36,901 single nucleotide polymorphisms (SNPs), to identify QTNs, QTN-by-QTN interaction (QQI), and their candidate genes, using a multi-locus compressed variance component mixed model, 3VmrMLM. As a result, two significant QTNs and 56 paired QQIs were detected, amongst 5219 genes of these QTNs, and 26 genes were identified as the yield-related confirmed genes, such as LCRN1, OsSPL3, and OsVOZ1 for PH, and LOG and QsBZR1 for PL. To reveal the substantial contributions related to the variation of yield-related agronomic traits in rice, we further implemented an enrichment analysis and expression analysis. As the results showed, 114 genes, nearly all significant QQIs, were involved in 37 GO terms; for example, the macromolecule metabolic process (GO:0043170), intracellular part (GO:0044424), and binding (GO:0005488). It was revealed that most of the QQIs and the candidate genes were significantly involved in the biological process, molecular function, and cellular component of the target traits. The demonstrated genetic interactions play a critical role in yield-related agronomic traits of rice, and such epistatic interactions contributed to large portions of the missing heritability in GWAS. These results help us to understand the genetic basis underlying the inheritance of the three yield-related agronomic traits and provide implications for rice improvement.
Affiliation(s)
- Ling Li
- College of Science, Nanjing Agricultural University, Nanjing 210095, China
| | - Xinyi Wu
- College of Science, Nanjing Agricultural University, Nanjing 210095, China
| | - Juncong Chen
- College of Finance, Nanjing Agricultural University, Nanjing 210095, China
| | - Shengmeng Wang
- College of Science, Nanjing Agricultural University, Nanjing 210095, China
| | - Yuxuan Wan
- School of Business Administration, Jiangxi University of Finance and Economics, Nanchang 330013, China
| | - Hanbing Ji
- College of Science, Nanjing Agricultural University, Nanjing 210095, China
| | - Yangjun Wen
- College of Science, Nanjing Agricultural University, Nanjing 210095, China
- Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China
- Correspondence: (Y.W.); (J.Z.)
| | - Jin Zhang
- College of Science, Nanjing Agricultural University, Nanjing 210095, China
- Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China
- Correspondence: (Y.W.); (J.Z.)
|
13
|
Saha S, Perrin L, Röder L, Brun C, Spinelli L. Epi-MEIF: detecting higher order epistatic interactions for complex traits using mixed effect conditional inference forests. Nucleic Acids Res 2022; 50:e114. [PMID: 36107776 PMCID: PMC9639209 DOI: 10.1093/nar/gkac715] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/06/2022] [Revised: 07/29/2022] [Accepted: 09/12/2022] [Indexed: 12/04/2022] Open
Abstract
Understanding the relationship between genetic variations and variations in complex and quantitative phenotypes remains an ongoing challenge. While Genome-wide association studies (GWAS) have become a vital tool for identifying single-locus associations, we lack methods for identifying epistatic interactions. In this article, we propose a novel method for higher-order epistasis detection using mixed effect conditional inference forest (epiMEIF). The proposed method is fitted on a group of single nucleotide polymorphisms (SNPs) potentially associated with the phenotype and the tree structure in the forest facilitates the identification of n-way interactions between the SNPs. Additional testing strategies further improve the robustness of the method. We demonstrate its ability to detect true n-way interactions via extensive simulations in both cross-sectional and longitudinal synthetic datasets. This is further illustrated in an application to reveal epistatic interactions from natural variations of cardiac traits in flies (Drosophila). Overall, the method provides a generalized way to identify higher-order interactions from any GWAS data, thereby greatly improving the detection of the genetic architecture underlying complex phenotypes.
Affiliation(s)
- Saswati Saha
- Aix Marseille Univ, INSERM, TAGC (UMR1090), Turing Centre for Living systems , Marseille , France
| | - Laurent Perrin
- Aix Marseille Univ, INSERM, TAGC (UMR1090), Turing Centre for Living systems , Marseille , France
- CNRS , Marseille , France
| | - Laurence Röder
- Aix Marseille Univ, INSERM, TAGC (UMR1090), Turing Centre for Living systems , Marseille , France
| | - Christine Brun
- Aix Marseille Univ, INSERM, TAGC (UMR1090), Turing Centre for Living systems , Marseille , France
- CNRS , Marseille , France
| | - Lionel Spinelli
- Aix Marseille Univ, INSERM, TAGC (UMR1090), Turing Centre for Living systems , Marseille , France
|
14
|
Wang H, Yang W, Qin Q, Yang X, Yang Y, Liu H, Lu W, Gu S, Cao X, Feng D, Zhang Z, He J. E3 ubiquitin ligase MAGI3 degrades c-Myc and acts as a predictor for chemotherapy response in colorectal cancer. Mol Cancer 2022; 21:151. [PMID: 35864508 PMCID: PMC9306183 DOI: 10.1186/s12943-022-01622-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/21/2021] [Accepted: 05/27/2021] [Indexed: 12/24/2022] Open
Abstract
Background: Recurrence and chemoresistance constitute the leading cause of death in colorectal cancer (CRC). Thus, it is of great significance to clarify the underlying mechanisms and identify predictors for tailoring adjuvant chemotherapy to improve the outcome of CRC. Methods: By screening differentially expressed genes (DEGs), constructing random forest classification and ranking the importance of DEGs, we identified membrane associated guanylate kinase, WW and PDZ domain containing 3 (MAGI3) as an important gene in CRC recurrence. Immunohistochemical and western blot assays were employed to further detect MAGI3 expression in CRC tissues and cell lines. Cell counting kit-8, plate colony formation, flow cytometry, sub-cutaneous injection and azoxymethane plus dextran sulfate sodium induced mice CRC assays were employed to explore the effects of MAGI3 on proliferation, growth, cell cycle, apoptosis, xenograft formation and chemotherapy resistance of CRC. The underlying molecular mechanisms were further investigated through gene set enrichment analysis, quantitative real-time PCR, western blot, co-immunoprecipitation, ubiquitination, GST fusion protein pull-down and immunohistochemical staining assays. Results: Our results showed that dysregulated low level of MAGI3 was correlated with recurrence and poor prognosis of CRC. MAGI3 was identified as a novel substrate-binding subunit of SKP1-Cullin E3 ligase to recognize c-Myc, and process c-Myc ubiquitination and degradation. Expression of MAGI3 in CRC cells inhibited cell growth, promoted apoptosis and chemosensitivity to fluoropyrimidine-based chemotherapy by suppressing activation of c-Myc in vitro and in vivo. In clinic, the stage II/III CRC patients with MAGI3-high had a significantly good recurrence-free survival (~80%, 5-year), and were not necessary for further adjuvant chemotherapy. The patients with MAGI3-medium had a robustly good response rate or recurrence-free survival with fluoropyrimidine-based chemotherapy, and were recommended to undergo fluoropyrimidine-based adjuvant chemotherapy. Conclusions: MAGI3 is a novel E3 ubiquitin ligase by degradation of c-Myc to regulate CRC development and may act as a potential predictor of adjuvant chemotherapy for CRC patients.
Supplementary Information The online version contains supplementary material available at 10.1186/s12943-022-01622-9.
Affiliation(s)
- Haibo Wang
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, Capital Medical University, No.10 Xitoutiao, You An Men, Beijing, 100069, People's Republic of China
| | - Wenjing Yang
- Department of Oncology, Beijing Hospital of Traditional Chinese Medicine, Capital Medical University, Beijing, People's Republic of China
| | - Qiong Qin
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, Capital Medical University, No.10 Xitoutiao, You An Men, Beijing, 100069, People's Republic of China
| | - Xiaomei Yang
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, Capital Medical University, No.10 Xitoutiao, You An Men, Beijing, 100069, People's Republic of China
| | - Ying Yang
- Core Facilities Center, Capital Medical University, Beijing, People's Republic of China
| | - Hua Liu
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, Capital Medical University, No.10 Xitoutiao, You An Men, Beijing, 100069, People's Republic of China
| | - Wenxiu Lu
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, Capital Medical University, No.10 Xitoutiao, You An Men, Beijing, 100069, People's Republic of China
| | - Siyu Gu
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, Capital Medical University, No.10 Xitoutiao, You An Men, Beijing, 100069, People's Republic of China
| | - Xuedi Cao
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, Capital Medical University, No.10 Xitoutiao, You An Men, Beijing, 100069, People's Republic of China
| | - Duiping Feng
- Department of Interventional Radiology, First Hospital of Shanxi Medical University, Taiyuan, People's Republic of China
| | - Zhongtao Zhang
- Department of General Surgery, Beijing Friendship Hospital, Capital Medical University & National Clinical Research Center for Digestive Diseases, No.95 Yong-an Road, Xi-Cheng District, Beijing, 100050, People's Republic of China
| | - Junqi He
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, Capital Medical University, No.10 Xitoutiao, You An Men, Beijing, 100069, People's Republic of China.
|
15
|
Ray A. Machine learning in postgenomic biology and personalized medicine. WILEY INTERDISCIPLINARY REVIEWS. DATA MINING AND KNOWLEDGE DISCOVERY 2022; 12:e1451. [PMID: 35966173 PMCID: PMC9371441 DOI: 10.1002/widm.1451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Accepted: 12/22/2021] [Indexed: 06/15/2023]
Abstract
In recent years Artificial Intelligence in the form of machine learning has been revolutionizing biology, biomedical sciences, and gene-based agricultural technology capabilities. Massive data generated in biological sciences by rapid and deep gene sequencing and protein or other molecular structure determination, on the one hand, requires data analysis capabilities using machine learning that are distinctly different from classical statistical methods; on the other, these large datasets are enabling the adoption of novel data-intensive machine learning algorithms for the solution of biological problems that until recently had relied on mechanistic model-based approaches that are computationally expensive. This review provides a bird's eye view of the applications of machine learning in post-genomic biology. Attempt is also made to indicate as far as possible the areas of research that are poised to make further impacts in these areas, including the importance of explainable artificial intelligence (XAI) in human health. Further contributions of machine learning are expected to transform medicine, public health, agricultural technology, as well as to provide invaluable gene-based guidance for the management of complex environments in this age of global warming.
Affiliation(s)
- Animesh Ray
- Riggs School of Applied Life Sciences, Keck Graduate Institute, 535 Watson Drive, Claremont, CA 91711, USA
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
|
16
|
Chinese Comma Disambiguation in Math Word Problems Using SMOTE and Random Forests. AI 2021. [DOI: 10.3390/ai2040044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
Abstract
Natural language understanding technologies play an essential role in automatically solving math word problems. In the process of machine understanding Chinese math word problems, comma disambiguation, which is associated with a class imbalance binary learning problem, is addressed as a valuable instrument to transform the problem statement of math word problems into structured representation. Aiming to resolve this problem, we employed the synthetic minority oversampling technique (SMOTE) and random forests to comma classification after their hyperparameters were jointly optimized. We propose a strict measure to evaluate the performance of deployed comma classification models on comma disambiguation in math word problems. To verify the effectiveness of random forest classifiers with SMOTE on comma disambiguation, we conducted two-stage experiments on two datasets with a collection of evaluation measures. Experimental results showed that random forest classifiers were significantly superior to baseline methods in Chinese comma disambiguation. The SMOTE algorithm with optimized hyperparameter settings based on the categorical distribution of different datasets is preferable, instead of with its default values. For practitioners, we suggest that hyperparameters of classification models be optimized again after parameter settings of SMOTE have been changed.
|
17
|
Bektaş AB, Gönen M. PrognosiT: Pathway/gene set-based tumour volume prediction using multiple kernel learning. BMC Bioinformatics 2021; 22:537. [PMID: 34727887 PMCID: PMC8561914 DOI: 10.1186/s12859-021-04460-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2021] [Accepted: 10/26/2021] [Indexed: 11/10/2022] Open
Abstract
Background: Identification of molecular mechanisms that determine tumour progression in cancer patients is a prerequisite for developing new disease treatment guidelines. Even though the predictive performance of current machine learning models is promising, extracting significant and meaningful knowledge from the data simultaneously during the learning process is a difficult task considering the high-dimensional and highly correlated nature of genomic datasets. Thus, there is a need for models that not only predict tumour volume from gene expression data of patients but also use prior information coming from pathway/gene sets during the learning process, to distinguish molecular mechanisms which play crucial role in tumour progression and therefore, disease prognosis. Results: In this study, instead of initially choosing several pathways/gene sets from an available set and training a model on this previously chosen subset of genomic features, we built a novel machine learning algorithm, PrognosiT, that accomplishes both tasks together. We tested our algorithm on thyroid carcinoma patients using gene expression profiles and cancer-specific pathways/gene sets. Predictive performance of our novel multiple kernel learning algorithm (PrognosiT) was comparable or even better than random forest (RF) and support vector regression (SVR). It is also notable that, to predict tumour volume, PrognosiT used gene expression features less than one-tenth of what RF and SVR algorithms used. Conclusions: PrognosiT was able to obtain comparable or even better predictive performance than SVR and RF. Moreover, we demonstrated that during the learning process, our algorithm managed to extract relevant and meaningful pathway/gene sets information related to the studied cancer type, which provides insights about its progression and aggressiveness. We also compared gene expressions of the selected genes by our algorithm in tumour and normal tissues, and we then discussed up- and down-regulated genes selected by our algorithm while learning, which could be beneficial for determining new biomarkers. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04460-6.
Affiliation(s)
- Ayyüce Begüm Bektaş
- Graduate School of Sciences and Engineering, Koç University, Istanbul, 34450, Turkey
| | - Mehmet Gönen
- Department of Industrial Engineering, College of Engineering, Koç University, Istanbul, 34450, Turkey
- School of Medicine, Koç University, Istanbul, 34450, Turkey
|
18
|
Wang D, Li J, Sun Y, Ding X, Zhang X, Liu S, Han B, Wang H, Duan X, Sun T. A Machine Learning Model for Accurate Prediction of Sepsis in ICU Patients. Front Public Health 2021; 9:754348. [PMID: 34722452 PMCID: PMC8553999 DOI: 10.3389/fpubh.2021.754348] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 08/06/2021] [Accepted: 09/20/2021] [Indexed: 12/23/2022] Open
Abstract
Background: Although numerous studies are conducted every year on how to reduce the fatality rate associated with sepsis, it is still a major challenge faced by patients, clinicians, and medical systems worldwide. Early identification and prediction of patients at risk of sepsis and adverse outcomes associated with sepsis are critical. We aimed to develop an artificial intelligence algorithm that can predict sepsis early. Methods: This was a secondary analysis of an observational cohort study from the Intensive Care Unit of the First Affiliated Hospital of Zhengzhou University. A total of 4,449 infected patients were randomly assigned to the development and validation data set at a ratio of 4:1. After extracting electronic medical record data, a set of 55 features (variables) was calculated and passed to the random forest algorithm to predict the onset of sepsis. Results: The pre-procedure clinical variables were used to build a prediction model from the training data set using the random forest machine learning method; a 5-fold cross-validation was used to evaluate the prediction accuracy of the model. Finally, we tested the model using the validation data set. The area obtained by the model under the receiver operating characteristic (ROC) curve (AUC) was 0.91, the sensitivity was 87%, and the specificity was 89%. Conclusions: This newly established machine learning-based model has shown good predictive ability in Chinese sepsis patients. External validation studies are necessary to confirm the universality of our method in the population and treatment practice.
Affiliation(s)
- Dong Wang
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
| | - Jinbo Li
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
| | - Yali Sun
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
| | - Xianfei Ding
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
| | - Xiaojuan Zhang
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
| | - Shaohua Liu
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
| | - Bing Han
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
| | - Haixu Wang
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
| | - Xiaoguang Duan
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
| | - Tongwen Sun
- General Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Key Laboratory for Critical Care Medicine of Henan Province, Zhengzhou, China
- Key Laboratory for Sepsis of Zhengzhou, Zhengzhou, China
|
19
|
Tanaka H, Kreisberg JF, Ideker T. Genetic dissection of complex traits using hierarchical biological knowledge. PLoS Comput Biol 2021; 17:e1009373. [PMID: 34534210 PMCID: PMC8480841 DOI: 10.1371/journal.pcbi.1009373] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/29/2020] [Revised: 09/29/2021] [Accepted: 08/23/2021] [Indexed: 11/18/2022] Open
Abstract
Despite the growing constellation of genetic loci linked to common traits, these loci have yet to account for most heritable variation, and most act through poorly understood mechanisms. Recent machine learning (ML) systems have used hierarchical biological knowledge to associate genetic mutations with phenotypic outcomes, yielding substantial predictive power and mechanistic insight. Here, we use an ontology-guided ML system to map single nucleotide variants (SNVs) focusing on 6 classic phenotypic traits in natural yeast populations. The 29 identified loci are largely novel and account for ~17% of the phenotypic variance, versus <3% for standard genetic analysis. Representative results show that sensitivity to hydroxyurea is linked to SNVs in two alternative purine biosynthesis pathways, and that sensitivity to copper arises through failure to detoxify reactive oxygen species in fatty acid metabolism. This work demonstrates a knowledge-based approach to amplifying and interpreting signals in population genetic studies. Genome-wide association studies (GWAS) have identified many important loci for common diseases and other traits. However, the loci identified by these studies are almost always many steps away from an understanding of underlying biological mechanisms. Here we develop an approach using hierarchical biological knowledge to identify genes and pathways responsible for phenotypic traits. Variants identified by the new method could explain a substantially greater fraction of heritability than previously reported. Moreover, we identified mechanistic pathways by which each causal variant affects cellular function. For example, we find that sensitivity to hydroxyurea is tied to genetic variants in two alternative purine biosynthesis pathways, and that sensitivity to copper arises through failure to detoxify reactive oxygen species in fatty acid metabolism. 
The new approach is a potentially transformative concept for understanding the genetic drivers of phenotypic variance, with potential applications in understanding traits in biomedicine and agriculture.
Affiliation(s)
- Hidenori Tanaka
- Department of Medicine, University of California San Diego, La Jolla, California, United States of America
| | - Jason F. Kreisberg
- Department of Medicine, University of California San Diego, La Jolla, California, United States of America
- * E-mail: (JFK); (TI)
| | - Trey Ideker
- Department of Medicine, University of California San Diego, La Jolla, California, United States of America
- * E-mail: (JFK); (TI)
|
20
|
Montesinos-López OA, Montesinos-López A, Mosqueda-Gonzalez BA, Montesinos-López JC, Crossa J, Ramirez NL, Singh P, Valladares-Anguiano FA. A zero altered Poisson random forest model for genomic-enabled prediction. G3-GENES GENOMES GENETICS 2021; 11:6042695. [PMID: 33693599 PMCID: PMC8022945 DOI: 10.1093/g3journal/jkaa057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/03/2020] [Accepted: 12/10/2020] [Indexed: 12/23/2022]
Abstract
In genomic selection choosing the statistical machine learning model is of paramount importance. In this paper, we present an application of a zero altered random forest model with two versions (ZAP_RF and ZAPC_RF) to deal with excess zeros in count response variables. The proposed model was compared with the conventional random forest (RF) model and with the conventional Generalized Poisson Ridge regression (GPR) using two real datasets, and we found that, in terms of prediction performance, the proposed zero inflated random forest model outperformed the conventional RF and GPR models.
Affiliation(s)
| | - Abelardo Montesinos-López
- Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, 44430 Guadalajara, Jalisco, México
| | | | | | - José Crossa
- Colegio de Postgraduados, Montecillos, Edo. de México CP 56230, México
- International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera Mexico-Veracruz, CP 52640, Edo. de México, México
| | - Nerida Lozano Ramirez
- International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera Mexico-Veracruz, CP 52640, Edo. de México, México
| | - Pawan Singh
- International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera Mexico-Veracruz, CP 52640, Edo. de México, México
|
21
|
Towards fine-scale population stratification modeling based on kernel principal component analysis and random forest. Genes Genomics 2021; 43:1143-1155. [PMID: 34097252 DOI: 10.1007/s13258-021-01057-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2020] [Accepted: 01/26/2021] [Indexed: 10/21/2022]
Abstract
BACKGROUND: Population stratification modeling is essential in Genome-Wide Association Studies. OBJECTIVE: In this paper, we aim to build a fine-scale population stratification model to efficiently infer individual genetic ancestry. METHODS: Kernel Principal Component Analysis (PCA) and random forest are adopted to build the population stratification model, together with parameter optimization. We explore different PCA methods, including standard PCA and kernel PCA to extract relevant features from the genotype data that is transformed by vcf2geno, a pipeline from LASER software. These extracted features are fed into a random forest for ensemble learning. Parameter tuning is performed to jointly find the optimal number of principal components, kernel function for PCA and parameters of the random forest. RESULTS: Experiments based on HGDP dataset show that kernel PCA with Sigmoid function and Gaussian function can achieve higher prediction accuracy than the standard PCA. Compared to standard PCA with the two principal components, the accuracy by using KPCA-Sigmoid with the optimal number of principal components can achieve around 100% and 200% improvement for East Asian and European populations, respectively. CONCLUSION: With the optimal parameter configuration on both PCA and random forest, our proposed method can infer the individual genetic ancestry more accurately, given their variants.
|
22
|
High-Resolution Genomic Comparisons within Salmonella enterica Serotypes Derived from Beef Feedlot Cattle: Parsing the Roles of Cattle Source, Pen, Animal, Sample Type, and Production Period. Appl Environ Microbiol 2021; 87:e0048521. [PMID: 33863705 DOI: 10.1128/aem.00485-21] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022] Open
Abstract
Salmonella enterica is a major foodborne pathogen, and contaminated beef products have been identified as one of the primary sources of Salmonella-related outbreaks. Pathogenicity and antibiotic resistance of Salmonella are highly serotype and subpopulation specific, which makes it essential to understand high-resolution Salmonella population dynamics in cattle. Time of year, source of cattle, pen, and sample type (i.e., feces, hide, or lymph nodes) have previously been identified as important factors influencing the serotype distribution of Salmonella (e.g., Anatum, Lubbock, Cerro, Montevideo, Kentucky, Newport, and Norwich) that were isolated from a longitudinal sampling design in a research feedlot. In this study, we performed high-resolution genomic comparisons of Salmonella isolates within each serotype using both single-nucleotide polymorphism-based maximum-likelihood phylogeny and hierarchical clustering of core-genome multilocus sequence typing. The importance of the aforementioned features in clonal Salmonella expansion was further explored using a supervised machine learning algorithm. In addition, we identified and compared the resistance genes, plasmids, and pathogenicity island profiles of the isolates within each subpopulation. Our findings indicate that clonal expansion of Salmonella strains in cattle was mainly influenced by the randomization of block and pen, as well as the origin/source of the cattle, i.e., regardless of sampling time and sample type (i.e., feces, lymph node, or hide). Further research is needed concerning the role of the feedlot pen environment prior to cattle placement to better understand carryover contributions of existing strains of Salmonella and their bacteriophages. IMPORTANCE Salmonella serotypes isolated from outbreaks in humans can also be found in beef cattle and feedlots. Virulence factors and antibiotic resistance are among the primary defense mechanisms of Salmonella, and are often associated with clonal expansion. 
This makes understanding the subpopulation dynamics of Salmonella in cattle critical for effective mitigation. There remains a gap in the literature concerning subpopulation dynamics within Salmonella serotypes in feedlot cattle from the beginning of feeding up until slaughter. Here, we explore Salmonella population dynamics within each serotype using core-genome phylogeny and hierarchical classifications. We used machine learning to quantitatively parse the relative importance of both hierarchical and longitudinal clustering among cattle host samples. Our results reveal that Salmonella populations in cattle are highly clonal over a 6-month study period and that clonal dissemination of Salmonella in cattle is mainly influenced spatially by experimental block and pen, as well as by the geographical origin of the cattle.
|
23
|
Chen K, Xu H, Lei Y, Lio P, Li Y, Guo H, Ali Moni M. Integration and interplay of machine learning and bioinformatics approach to identify genetic interaction related to ovarian cancer chemoresistance. Brief Bioinform 2021; 22:6272796. [PMID: 33971668 DOI: 10.1093/bib/bbab100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2021] [Revised: 03/04/2021] [Accepted: 03/06/2021] [Indexed: 11/15/2022] Open
Abstract
Although chemotherapy is the first-line treatment for ovarian cancer (OCa) patients, chemoresistance (CR) decreases their progression-free survival. This paper investigates the genetic interaction (GI) related to OCa-CR. To decrease the complexity of establishing gene networks, individual signature genes related to OCa-CR are identified using a gradient boosting decision tree algorithm. Additionally, the genetic interaction coefficient (GIC) is proposed to measure the correlation of two signature genes quantitatively and explain their joint influence on OCa-CR. A gene pair with a high GIC is identified as a signature pair. A total of 24 signature gene pairs, comprising 10 individual signature genes, are selected, and the influence of signature gene pairs on OCa-CR is explored. Finally, a signature gene pair-based prediction of OCa-CR is identified. The area under the curve (AUC) is a widely used performance measure for machine learning prediction. The AUC of the signature gene pair-based prediction reaches 0.9658, whereas the AUC of the individual signature gene-based prediction is only 0.6823. The identified signature gene pairs not only build an efficient GI network of OCa-CR but also provide an interesting way for OCa-CR prediction. This improvement shows that our proposed method is a useful tool to investigate GI related to OCa-CR.
Affiliation(s)
- Kexin Chen
  - School of Electronics Engineering and Computer Science, Peking University, 100871, Beijing, China
- Haoming Xu
  - Department of Biomedical Engineering, Duke University, 27708, Durham, United States
- Yiming Lei
  - School of Electronics Engineering and Computer Science, Peking University, 100871, Beijing, China
- Pietro Lio
  - Computer Laboratory, University of Cambridge, CB3-0FD, Cambridge, United Kingdom
- Yuan Li
  - Department of Obstetrics and Gynecology, Peking University Third Hospital, 100083, Beijing, China
- Hongyan Guo
  - Department of Obstetrics and Gynecology, Peking University Third Hospital, 100083, Beijing, China
- Mohammad Ali Moni
  - School of Public Health and Community Medicine, University of New South Wales, 2052, Sydney, Australia
|
24
|
Wang B, Liua F, Deveaux L, Ash A, Gosh S, Li X, Rundensteiner E, Cottrell L, Adderley R, Stanton B. Adolescent HIV-related behavioural prediction using machine learning: a foundation for precision HIV prevention. AIDS 2021; 35:S75-S84. [PMID: 33867490 PMCID: PMC8133351 DOI: 10.1097/qad.0000000000002867] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/20/2022]
Abstract
BACKGROUND Precision prevention is increasingly important in HIV prevention research to move beyond universal interventions to those tailored for high-risk individuals. The current study was designed to develop machine learning algorithms for predicting adolescent HIV risk behaviours. METHODS Comprehensive longitudinal data on adolescent risk behaviours, perceptions, peer and family influence, and neighbourhood risk factors were collected from 2564 grade-10 students at baseline followed for 24 months over 2008-2012. Machine learning techniques [support vector machine (SVM) and random forests] were applied to innovatively leverage longitudinal data for robust HIV risk behaviour prediction. In this study, we focused on two adolescent risk behaviours: had ever had sex and had multiple sex partners. Twenty percent of the data were withheld for model testing. RESULTS The SVM model with cost-sensitive learning achieved the highest sensitivity (79.1%), with a specificity of 75.4% and an AUC of 0.86, in predicting multiple sex partners on the training data (10-fold cross-validation), and a sensitivity of 79.7%, specificity of 76.5% and AUC of 0.86 on the testing data. The random forest model obtained the best performance in predicting had ever had sex, yielding a sensitivity of 78.5%, specificity of 73.1% and AUC of 0.84 on the training data, and a sensitivity of 82.7%, specificity of 75.3% and AUC of 0.87 on the testing data. CONCLUSION Machine learning methods can be used to build effective prediction model(s) to identify adolescents who are likely to engage in HIV risk behaviours. This study builds a foundation for targeted intervention strategies and informs precision prevention efforts in school settings.
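The three metrics reported in this abstract (sensitivity, specificity, AUC) can be illustrated with a minimal Python sketch. The labels, scores, and thresholds here are hypothetical toy values, not data from the study, and the helper names `sensitivity_specificity` and `auc` are introduced only for illustration. Lowering the decision threshold is used as a simple stand-in for the effect of cost-sensitive learning, which trades specificity for sensitivity.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (assumption): 4 positives followed by 4 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.35, 0.5, 0.3, 0.2, 0.1]

# Default threshold versus a lower threshold that favours sensitivity.
sens, spec = sensitivity_specificity(y_true, [1 if s >= 0.45 else 0 for s in scores])
sens_low, spec_low = sensitivity_specificity(y_true, [1 if s >= 0.32 else 0 for s in scores])
auc_value = auc(y_true, scores)
```

AUC is computed from the ranking of scores alone, so it does not change when the threshold moves; only sensitivity and specificity do.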
Affiliation(s)
- Bo Wang
  - Department of Population and Quantitative Health Sciences, University of Massachusetts Medical School, 368 Plantation Street, Worcester, Massachusetts, USA
- Feifan Liua
  - Department of Population and Quantitative Health Sciences, University of Massachusetts Medical School, 368 Plantation Street, Worcester, Massachusetts, USA
- Lynette Deveaux
  - Office of HIV/AIDS, Ministry of Health, Shirley Street, Nassau, The Bahamas
- Arlene Ash
  - Department of Population and Quantitative Health Sciences, University of Massachusetts Medical School, 368 Plantation Street, Worcester, Massachusetts, USA
- Samiran Gosh
  - Department of Family Medicine and Public Health Sciences, Wayne State University School of Medicine, Detroit, Michigan
- Xiaoming Li
  - Department of Health Promotion, Education, and Behavior, University of South Carolina Arnold School of Public Health, Columbia, South Carolina
- Lesley Cottrell
  - Center for Excellence in Disabilities, West Virginia University, Morgantown, West Virginia
- Richard Adderley
  - Office of HIV/AIDS, Ministry of Health, Shirley Street, Nassau, The Bahamas
- Bonita Stanton
  - Hackensack Meridian School of Medicine, Nutley, New Jersey, USA
|
25
|
Ashbrook DG, Arends D, Prins P, Mulligan MK, Roy S, Williams EG, Lutz CM, Valenzuela A, Bohl CJ, Ingels JF, McCarty MS, Centeno AG, Hager R, Auwerx J, Lu L, Williams RW. A platform for experimental precision medicine: The extended BXD mouse family. Cell Syst 2021; 12:235-247.e9. [PMID: 33472028 PMCID: PMC7979527 DOI: 10.1016/j.cels.2020.12.002] [Citation(s) in RCA: 89] [Impact Index Per Article: 29.7] [Received: 06/29/2020] [Revised: 08/29/2020] [Accepted: 12/21/2020] [Indexed: 12/17/2022]
Abstract
The challenge of precision medicine is to model complex interactions among DNA variants, phenotypes, development, environments, and treatments. We address this challenge by expanding the BXD family of mice to 140 fully isogenic strains, creating a uniquely powerful model for precision medicine. This family segregates for 6 million common DNA variants, a level that exceeds many human populations. Because each member can be replicated, heritable traits can be mapped with high power and precision. Current BXD phenomes are unsurpassed in coverage and include much omics data and thousands of quantitative traits. BXDs can be extended by a single-generation cross to as many as 19,460 isogenic F1 progeny, and this extended BXD family is an effective platform for testing causal modeling and for predictive validation. BXDs are a unique core resource for the field of experimental precision medicine.
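The figure of 19,460 F1 progeny is consistent with counting every ordered pair of distinct parental strains among the 140 BXD strains, that is, treating the two reciprocal crosses of a pair as different F1 types. This reading is an assumption; the abstract does not spell out the calculation.

```latex
\[
140 \times (140 - 1) = 140 \times 139 = 19{,}460
\]
```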
Affiliation(s)
- David G Ashbrook
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Danny Arends
  - Lebenswissenschaftliche Fakultät, Albrecht Daniel Thaer-Institut, Humboldt-Universität zu Berlin, Invalidenstraße 42, 10115 Berlin, Germany
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Megan K Mulligan
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Suheeta Roy
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Evan G Williams
  - Luxembourg Centre for Systems Biomedicine, Université du Luxembourg, L-4365 Esch-sur-Alzette, Luxembourg
- Cathleen M Lutz
  - Mouse Repository and the Rare and Orphan Disease Center, the Jackson Laboratory, Bar Harbor, ME 04609, USA
- Alicia Valenzuela
  - Mouse Repository and the Rare and Orphan Disease Center, the Jackson Laboratory, Bar Harbor, ME 04609, USA
- Casey J Bohl
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Jesse F Ingels
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Melinda S McCarty
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Arthur G Centeno
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Reinmar Hager
  - Division of Evolution & Genomic Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Oxford Road, Manchester M13 9PL, UK
- Johan Auwerx
  - Laboratory of Integrative Systems Physiology, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Lu Lu
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Robert W Williams
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
|
26
|
Peng Q, Shen Y, Fu K, Dai Z, Jin L, Yang D, Zhu J. Artificial intelligence prediction model for overall survival of clear cell renal cell carcinoma based on a 21-gene molecular prognostic score system. Aging (Albany NY) 2021; 13:7361-7381. [PMID: 33686949 PMCID: PMC7993746 DOI: 10.18632/aging.202594] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/02/2020] [Accepted: 01/14/2021] [Indexed: 01/03/2023]
Abstract
We developed and validated a new prognostic model for predicting the overall survival in clear cell renal cell carcinoma (ccRCC) patients. In this study, artificial intelligence (AI) algorithms including random forest and neural network were trained to build a molecular prognostic score (mPS) system. Afterwards, we investigated the potential mechanisms underlying mPS by assessing gene set enrichment analysis, mutations, copy number variations (CNVs) and immune cell infiltration. A total of 275 prognosis-related genes were identified, which were also differentially expressed between ccRCC patients and healthy controls. We then constructed a universal mPS system that depends on the expression status of only 21 of these genes by applying AI-based algorithms. Then, the mPS was validated in another independent cohort and demonstrated to be applicable to ccRCC subsets. Furthermore, a nomogram comprising the mPS and several independent variables was established and proved to effectively predict ccRCC patient prognosis. Finally, significant differences were identified regarding the pathways, mutated genes, CNVs and tumor-infiltrating immune cells among the subgroups of ccRCC stratified by the mPS system. The AI-based mPS system can provide critical prognostic prediction for ccRCC patients and may be useful to inform treatment and surveillance decisions before initial intervention.
Affiliation(s)
- Qiliang Peng
  - Department of Radiotherapy and Oncology, The Second Affiliated Hospital of Soochow University, Suzhou, China; Institute of Radiotherapy and Oncology, Soochow University, Suzhou, China
- Yi Shen
  - Department of Radiation Oncology, The Affiliated Suzhou Science and Technology Town Hospital of Nanjing Medical University, Suzhou, China
- Kai Fu
  - Department of Urology, The Second Affiliated Hospital of Soochow University, Suzhou, China
- Zheng Dai
  - Department of Urology, The Second Affiliated Hospital of Soochow University, Suzhou, China
- Lu Jin
  - Department of Urology, The Second Affiliated Hospital of Soochow University, Suzhou, China
- Dongrong Yang
  - Department of Urology, The Second Affiliated Hospital of Soochow University, Suzhou, China
- Jin Zhu
  - Department of Urology, The Second Affiliated Hospital of Soochow University, Suzhou, China
|
27
|
Orlenko A, Moore JH. A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions. BioData Min 2021; 14:9. [PMID: 33514397 PMCID: PMC7847145 DOI: 10.1186/s13040-021-00243-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/01/2020] [Accepted: 01/13/2021] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer's, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or model feature interactions that exhibit non-additivity. The Random Forest method is often employed in these efforts due to its ability to detect and model non-additive interactions. In addition, Random Forest has the built-in ability to estimate feature importance scores, a characteristic that allows the model to be interpreted with the order and effect size of the feature association with the outcome. This characteristic is very important for epidemiological and clinical studies where results of predictive modeling could be used to define the future direction of the research efforts. Alternative ways to interpret the model are the permutation feature importance metric, which employs a permutation approach to calculate a feature contribution coefficient in units of the decrease in the model's performance, and Shapley additive explanations, which employ a cooperative game theory approach. Currently, it is unclear which Random Forest feature importance metric provides a superior estimation of the true informative contribution of features in genetic association analysis. RESULTS To address this issue, and to improve interpretability of Random Forest predictions, we compared different methods for feature importance estimation in real and simulated datasets with non-additive interactions. As a result, we detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.
CONCLUSIONS By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
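The permutation feature importance idea described in this abstract can be sketched in plain Python on a simulated, purely non-additive (XOR) interaction. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the data are simulated here, the model is a simple lookup table used as a stand-in for a Random Forest, and all function names (`fit_lookup`, `permutation_importance`, `marginal_accuracy`) are hypothetical.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Simulated binary "genotypes" (assumption): the outcome is the XOR of the
# first two features, a purely non-additive interaction; the third is noise.
n = 2000
X = [[random.randint(0, 1) for _ in range(3)] for _ in range(n)]
y = [row[0] ^ row[1] for row in X]


def fit_lookup(X, y):
    """Stand-in model: majority outcome for each observed feature combination."""
    votes = defaultdict(Counter)
    for row, label in zip(X, y):
        votes[tuple(row)][label] += 1
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lambda row: table.get(tuple(row), 0)


def accuracy(model, X, y):
    return sum(model(r) == t for r, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, col, rng):
    """Drop in accuracy after shuffling one column, which breaks its link to y."""
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return accuracy(model, X, y) - accuracy(model, X_perm, y)


def marginal_accuracy(X, y, col):
    """Best accuracy achievable from one feature alone (a univariate view)."""
    agree = sum(row[col] == t for row, t in zip(X, y)) / len(y)
    return max(agree, 1 - agree)


model = fit_lookup(X, y)
rng = random.Random(1)
importances = [permutation_importance(model, X, y, c, rng) for c in range(3)]
marginals = [marginal_accuracy(X, y, c) for c in range(3)]
```

In this toy setting each interacting feature looks uninformative on its own (marginal accuracy near chance), yet shuffling either one costs the model roughly half of its accuracy, while shuffling the noise feature costs nothing.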
Affiliation(s)
- Alena Orlenko
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
|
28
|
Bracher-Smith M, Crawford K, Escott-Price V. Machine learning for genetic prediction of psychiatric disorders: a systematic review. Mol Psychiatry 2021; 26:70-79. [PMID: 32591634 PMCID: PMC7610853 DOI: 10.1038/s41380-020-0825-2] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 05/15/2020] [Revised: 06/09/2020] [Accepted: 06/16/2020] [Indexed: 12/25/2022]
Abstract
Machine learning methods have been employed to make predictions in psychiatry from genotypes, with the potential to bring improved prediction of outcomes in psychiatric genetics; however, their current performance is unclear. We aim to systematically review machine learning methods for predicting psychiatric disorders from genetics alone and evaluate their discrimination, bias and implementation. Medline, PsycInfo, Web of Science and Scopus were searched for terms relating to genetics, psychiatric disorders and machine learning, including neural networks, random forests, support vector machines and boosting, on 10 September 2019. Following PRISMA guidelines, articles were screened for inclusion independently by two authors, extracted, and assessed for risk of bias. Overall, 63 full texts were assessed from a pool of 652 abstracts. Data were extracted for 77 models of schizophrenia, bipolar, autism or anorexia across 13 studies. Performance of machine learning methods was highly varied (0.48-0.95 AUC) and differed between schizophrenia (0.54-0.95 AUC), bipolar (0.48-0.65 AUC), autism (0.52-0.81 AUC) and anorexia (0.62-0.69 AUC). This is likely due to the high risk of bias identified in the study designs and analysis for reported results. Choices for predictor selection, hyperparameter search and validation methodology, and viewing of the test set during training were common causes of high risk of bias in analysis. Key steps in model development and validation were frequently not performed or unreported. Comparison of discrimination across studies was constrained by heterogeneity of predictors, outcome and measurement, in addition to sample overlap within and across studies. Given widespread high risk of bias and the small number of studies identified, it is important to ensure established analysis methods are adopted. We emphasise best practices in methodology and reporting for improving future studies.
Affiliation(s)
- Matthew Bracher-Smith
  - MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK
- Karen Crawford
  - MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK
  - Dementia Research Institute, School of Medicine, Cardiff University, Cardiff, UK
- Valentina Escott-Price
  - MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK
  - Dementia Research Institute, School of Medicine, Cardiff University, Cardiff, UK
|
29
|
Yan KK, Zhao H, Wu JT, Pang H. An enhanced machine learning tool for cis-eQTL mapping with regularization and confounder adjustments. Genet Epidemiol 2020; 44:798-810. [PMID: 32700329 PMCID: PMC7875251 DOI: 10.1002/gepi.22341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2020] [Revised: 07/07/2020] [Accepted: 07/07/2020] [Indexed: 11/07/2022]
Abstract
Many expression quantitative trait loci (eQTL) studies have been conducted to investigate the biological effects of variants in gene regulation. However, these eQTL studies may suffer from low or moderate statistical power and an overly conservative false-discovery rate. In practice, most algorithms for eQTL identification do not model the joint effects of multiple genetic variants with weak or moderate influence. Here we present a novel machine-learning algorithm, lasso least-squares kernel machine (LSKM-LASSO), that models the association between multiple genetic variants and phenotypic traits simultaneously in the presence of nongenetic and genetic confounding. With a more general and flexible framework for the estimation of genetic confounding, LSKM-LASSO is able to provide a more accurate evaluation of the joint effects of multiple genetic variants. Our simulations demonstrate that our approach outperforms three state-of-the-art alternatives in terms of eQTL identification and phenotype prediction. We then apply our method to genotype and gene expression data of 11 tissues obtained from the Genotype-Tissue Expression project. Our algorithm was able to identify more genes with eQTL than other algorithms. By incorporating a regularization term and combining it with least-squares kernel machine, LSKM-LASSO provides a powerful tool for eQTL mapping and phenotype prediction.
Affiliation(s)
- Kang K. Yan
  - School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Hongyu Zhao
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Joseph T. Wu
  - School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Herbert Pang
  - School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
|
30
|
Abstract
In this chapter we discuss the past, present and future of clinical biomarker development. We explore the advent of new technologies, paving the way in which health, medicine and disease are understood. This review includes the identification of physicochemical assays, current regulations, the development and reproducibility of clinical trials, as well as the revolution of omics technologies and state-of-the-art integration and analysis approaches.
|
31
|
Predicting the geographic origin of Spanish Cedar (Cedrela odorata L.) based on DNA variation. CONSERV GENET 2020. [DOI: 10.1007/s10592-020-01282-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/26/2022]
|
32
|
Gim JA, Kwon Y, Lee HA, Lee KR, Kim S, Choi Y, Kim YK, Lee H. A Machine Learning-Based Identification of Genes Affecting the Pharmacokinetics of Tacrolimus Using the DMET™ Plus Platform. Int J Mol Sci 2020; 21:E2517. [PMID: 32260456 PMCID: PMC7178269 DOI: 10.3390/ijms21072517] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/02/2020] [Revised: 03/29/2020] [Accepted: 04/02/2020] [Indexed: 12/15/2022] Open
Abstract
Tacrolimus is an immunosuppressive drug with a narrow therapeutic index and large interindividual variability. We identified genetic variants to predict tacrolimus exposure in healthy Korean males using machine learning algorithms such as decision tree, random forest, and least absolute shrinkage and selection operator (LASSO) regression. rs776746 (CYP3A5) and rs1137115 (CYP2A6) are single nucleotide polymorphisms (SNPs) that can affect exposure to tacrolimus. A decision tree, when coupled with random forest analysis, is an efficient tool for predicting the exposure to tacrolimus based on genotype. These tools are helpful to determine an individualized dose of tacrolimus.
Affiliation(s)
- Jeong-An Gim
  - Department of Transdisciplinary Studies, Graduate School of Convergence Science and Technology, Seoul National University, Seoul 16229, Korea; (J.-A.G.); (Y.K.); (H.A.L.); (K.-R.L.); (S.K.)
  - Medical Science Research Center, College of Medicine, Korea University, Seoul 02841, Korea
- Yonghan Kwon
  - Department of Transdisciplinary Studies, Graduate School of Convergence Science and Technology, Seoul National University, Seoul 16229, Korea; (J.-A.G.); (Y.K.); (H.A.L.); (K.-R.L.); (S.K.)
  - Department of Biostatistics and Computing, Yonsei University Graduate School, Seoul 03722, Korea
- Hyun A Lee
  - Department of Transdisciplinary Studies, Graduate School of Convergence Science and Technology, Seoul National University, Seoul 16229, Korea; (J.-A.G.); (Y.K.); (H.A.L.); (K.-R.L.); (S.K.)
  - Department of Clinical Pharmacology and Therapeutics, Seoul National University College of Medicine and Hospital, Seoul 03080, Korea
- Kyeong-Ryoon Lee
  - Department of Transdisciplinary Studies, Graduate School of Convergence Science and Technology, Seoul National University, Seoul 16229, Korea; (J.-A.G.); (Y.K.); (H.A.L.); (K.-R.L.); (S.K.)
  - Laboratory Animal Resource Center, Korea Research Institute of Bioscience and Biotechnology, Ochang, Chungbuk 28116, Korea
- Soohyun Kim
  - Department of Transdisciplinary Studies, Graduate School of Convergence Science and Technology, Seoul National University, Seoul 16229, Korea; (J.-A.G.); (Y.K.); (H.A.L.); (K.-R.L.); (S.K.)
- Yu Kyong Kim
  - Daewoong Pharmaceutical Co., Ltd., Seoul 06170, Korea
- Howard Lee
  - Department of Transdisciplinary Studies, Graduate School of Convergence Science and Technology, Seoul National University, Seoul 16229, Korea; (J.-A.G.); (Y.K.); (H.A.L.); (K.-R.L.); (S.K.)
  - Department of Clinical Pharmacology and Therapeutics, Seoul National University College of Medicine and Hospital, Seoul 03080, Korea
  - Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul 03080, Korea
|
33
|
Kess T, Bentzen P, Lehnert SJ, Sylvester EVA, Lien S, Kent MP, Sinclair‐Waters M, Morris C, Wringe B, Fairweather R, Bradbury IR. Modular chromosome rearrangements reveal parallel and nonparallel adaptation in a marine fish. Ecol Evol 2020; 10:638-653. [PMID: 32015832 PMCID: PMC6988541 DOI: 10.1002/ece3.5828] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 07/31/2019] [Revised: 10/05/2019] [Accepted: 10/10/2019] [Indexed: 01/01/2023] Open
Abstract
Genomic architecture and standing variation can play a key role in ecological adaptation and contribute to the predictability of evolution. In Atlantic cod (Gadus morhua), four large chromosomal rearrangements have been associated with ecological gradients and migratory behavior in regional analyses. However, the degree of parallelism, the extent of independent inheritance, and functional distinctiveness of these rearrangements remain poorly understood. Here, we use a 12K single nucleotide polymorphism (SNP) array to demonstrate extensive individual variation in rearrangement genotype within populations across the species range, suggesting that local adaptation to fine-scale ecological variation is enabled by rearrangements with independent inheritance. Our results demonstrate significant association of rearrangements with migration phenotype and environmental gradients across the species range. Individual rearrangements exhibit functional modularity, but also contain loci showing multiple environmental associations. Clustering in genetic distance trees and reduced differentiation within rearrangements across the species range are consistent with shared variation as a source of contemporary adaptive diversity in Atlantic cod. Conversely, we also find that haplotypes in the LG12 and LG1 rearranged regions have diverged across the Atlantic, despite consistent environmental associations. Exchange of these structurally variable genomic regions, as well as local selective pressures, has likely facilitated individual diversity within Atlantic cod stocks. Our results highlight the importance of genomic architecture and standing variation in enabling fine-scale adaptation in marine species.
Affiliation(s)
- Tony Kess
  - Fisheries and Oceans Canada, Northwest Atlantic Fisheries Centre, St. John's, NL, Canada
- Paul Bentzen
  - Biology Department, Dalhousie University, Halifax, NS, Canada
- Sarah J. Lehnert
  - Fisheries and Oceans Canada, Northwest Atlantic Fisheries Centre, St. John's, NL, Canada
- Emma V. A. Sylvester
  - Fisheries and Oceans Canada, Northwest Atlantic Fisheries Centre, St. John's, NL, Canada
- Sigbjørn Lien
  - Department of Animal and Aquacultural Sciences, Faculty of Biosciences, Centre for Integrative Genetics, Norwegian University of Life Sciences, Ås, Norway
- Matthew P. Kent
  - Department of Animal and Aquacultural Sciences, Faculty of Biosciences, Centre for Integrative Genetics, Norwegian University of Life Sciences, Ås, Norway
- Marion Sinclair‐Waters
  - Organismal and Evolutionary Biology Research Programme, University of Helsinki, Helsinki, Finland
- Corey Morris
  - Fisheries and Oceans Canada, Northwest Atlantic Fisheries Centre, St. John's, NL, Canada
- Brendan Wringe
  - Fisheries and Oceans Canada, Bedford Institute of Oceanography, Dartmouth, NS, Canada
- Ian R. Bradbury
  - Fisheries and Oceans Canada, Northwest Atlantic Fisheries Centre, St. John's, NL, Canada
  - Biology Department, Dalhousie University, Halifax, NS, Canada
|
34
|
Breitbach ME, Greenspan S, Resnick NM, Perera S, Gurkar AU, Absher D, Levine AS. Exonic Variants in Aging-Related Genes Are Predictive of Phenotypic Aging Status. Front Genet 2019; 10:1277. [PMID: 31921313 PMCID: PMC6931058 DOI: 10.3389/fgene.2019.01277] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/21/2019] [Accepted: 11/19/2019] [Indexed: 01/31/2023] Open
Abstract
Background: Recent studies investigating longevity have revealed very few convincing genetic associations with increased lifespan. This is, in part, due to the complexity of biological aging, as well as the limited power of genome-wide association studies, which assay common single nucleotide polymorphisms (SNPs) and require several thousand subjects to achieve statistical significance. To overcome such barriers, we performed comprehensive DNA sequencing of a panel of 20 genes previously associated with phenotypic aging in a cohort of 200 individuals, half of whom were clinically defined by an "early aging" phenotype, and half of whom were clinically defined by a "late aging" phenotype based on age (65-75 years) and the ability to walk up a flight of stairs or walk for 15 min without resting. A validation cohort of 511 late agers was used to verify our results. Results: We found early agers were not enriched for more total variants in these 20 aging-related genes than late agers. Using machine learning methods, we identified the most predictive model of aging status, both in our discovery and validation cohorts, to be a random forest model incorporating damaging exon variants [Combined Annotation-Dependent Depletion (CADD) > 15]. The most heavily weighted variants in the model were within poly(ADP-ribose) polymerase 1 (PARP1) and excision repair cross complementation group 5 (ERCC5), both of which are involved in a canonical aging pathway, DNA damage repair. Conclusion: Overall, this study implemented a framework to apply machine learning to identify sequencing variants associated with complex phenotypes such as aging. While the small sample size making up our cohort inhibits our ability to make definitive conclusions about the ability of these genes to accurately predict aging, this study offers a unique method for exploring polygenic associations with complex phenotypes.
Affiliation(s)
- Megan E Breitbach
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, United States.,Department of Biotechnology Science and Engineering, University of Alabama in Huntsville, Huntsville, AL, United States
| | - Susan Greenspan
- Division of Geriatric Medicine, Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
| | - Neil M Resnick
- Division of Geriatric Medicine, Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States.,Institute on Aging of UPMC, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
| | - Subashan Perera
- Division of Geriatric Medicine, Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States.,Department of Biostatistics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, PA, United States
| | - Aditi U Gurkar
- Division of Geriatric Medicine, Department of Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States.,Institute on Aging of UPMC, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States
| | - Devin Absher
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, United States
| | - Arthur S Levine
- Department of Microbiology and Molecular Genetics, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States.,UPMC Hillman Cancer Center, Pittsburgh, PA, United States
|
35
|
TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies. Sci Rep 2019; 9:18034. [PMID: 31792302 PMCID: PMC6889171 DOI: 10.1038/s41598-019-54519-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/08/2019] [Accepted: 11/15/2019] [Indexed: 11/24/2022] Open
Abstract
One of the most important tasks in genome-wide association analysis (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) that are related to target traits. With the development of sequencing technology, traditional statistical methods struggle to analyze the resulting high-dimensional, massive SNP data. Recently, machine learning methods have become more popular in high-dimensional genetic data analysis because of their fast computation speed. However, most machine learning methods have several drawbacks, such as poor generalization ability, over-fitting, unsatisfactory classification and low detection accuracy. This study proposed a two-stage algorithm based on least angle regression and random forest (TSLRF), which first controlled for population structure and polygenic effects, then selected the SNPs potentially related to target traits using least angle regression (LARS), and finally analyzed this variable subset using random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with target traits. The new method showed greater detection power in both simulation experiments and real data analyses. The results of simulation experiments showed that, compared with existing approaches, the new method effectively improved the detection of QTNs and the degree of model fit, and required less computation time. In addition, the new method significantly distinguished QTNs from other SNPs. Subsequently, the new method was applied to analyze five flowering-related traits in Arabidopsis. The results showed that the distinction between QTNs and unrelated SNPs was more significant than with the other methods. The new method detected 60 genes confirmed to be related to the target trait, significantly more than the other methods, and simultaneously detected multiple gene clusters associated with the target trait.
|
36
|
Meijsen JJ, Rammos A, Campbell A, Hayward C, Porteous DJ, Deary IJ, Marioni RE, Nicodemus KK. Using tree-based methods for detection of gene-gene interactions in the presence of a polygenic signal: simulation study with application to educational attainment in the Generation Scotland Cohort Study. Bioinformatics 2019; 35:181-188. [PMID: 29931044 PMCID: PMC6330004 DOI: 10.1093/bioinformatics/bty462] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/05/2018] [Accepted: 06/14/2018] [Indexed: 11/13/2022] Open
Abstract
Motivation: The genomic architecture of human complex diseases is thought to be attributable to single markers, polygenic components and epistatic components. No study has examined the ability of tree-based methods to detect epistasis in the presence of a polygenic signal. We sought to apply decision tree-based methods, C5.0 and logic regression, to detect epistasis under several simulated conditions, varying strength of interaction and linkage disequilibrium (LD) structure. We then applied the same methods to the phenotype of educational attainment in a large population cohort. Results: LD pruning improved the power and reduced the type I error. C5.0 had a conservative type I error rate whereas logic regression had a type I error rate that exceeded 5%. Despite the more conservative type I error, C5.0 was observed to have higher power than logic regression across several conditions. In the presence of a polygenic signal, power was generally reduced. Applying both methods to educational attainment in a large population cohort yielded numerous interacting SNPs, notably a SNP in RCAN3, which is associated with reading and spelling, and a SNP in NPAS3, a neurodevelopmental gene. Availability and implementation: All methods used are implemented and freely available in R. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Joeri J Meijsen
- Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.,Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Alexandros Rammos
- Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.,Department of Genetics, Smurfit Institute of Genetics and Institute of Neuroscience, Trinity College Dublin, Dublin, Ireland
| | - Archie Campbell
- Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
| | - Caroline Hayward
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
| | - David J Porteous
- Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.,Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Ian J Deary
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK.,Department of Psychology, University of Edinburgh, Edinburgh, UK
| | - Riccardo E Marioni
- Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.,Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Kristin K Nicodemus
- Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK.,Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
|
37
|
Fine-Resolution Population Mapping from International Space Station Nighttime Photography and Multisource Social Sensing Data Based on Similarity Matching. REMOTE SENSING 2019. [DOI: 10.3390/rs11161900] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/16/2022]
Abstract
Previous studies have attempted to disaggregate census data into fine resolution with multisource remote sensing data considering the importance of fine-resolution population distribution in urban planning, environmental protection, resource allocation, and social economy. However, the lack of direct human activity information invariably restricts the accuracy of population mapping and reduces the credibility of the mapping process even when external facility distribution information is adopted. To address these problems, the present study proposed a novel population mapping method by combining International Space Station (ISS) photography nighttime light data, point of interest (POI) data, and location-based social media data. A similarity matching model, consisting of semantic and distance matching models, was established to integrate POI and social media data. Effective information was extracted from the integrated data through principal component analysis and then used along with road density information to train the random forest (RF) model. A comparison with WorldPop data showed that our method can generate fine-resolution population distribution with higher accuracy (R² = 0.91) than those of previous studies (R² = 0.55). To illustrate the advantages of our method, we highlighted the limitations of previous methods that ignore social media data in handling residential regions with similar light intensity. We also discussed the performance of our method in adopting social media data, considering their characteristics, with different volumes and acquisition times. Results showed that social media data acquired between 19:00 and 8:00 with a volume of approximately 300,000 will help our method realize high accuracy with low computation burden. This study showed the great potential of combining social sensing data for disaggregating fine-resolution population.
|
38
|
Machine learning technology in the application of genome analysis: A systematic review. Gene 2019; 705:149-156. [PMID: 31026571 DOI: 10.1016/j.gene.2019.04.062] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/24/2019] [Revised: 04/17/2019] [Accepted: 04/22/2019] [Indexed: 01/17/2023]
Abstract
Machine learning (ML) is a powerful technique to tackle many problems in data mining and predictive analytics. We believe that ML has considerable potential in the field of bioinformatics, since high-throughput technology is producing ever-increasing amounts of biological data. In this review, we summarize major ML algorithms, detail the conditions that require attention when applying these algorithms to genomic problems, and provide a list of examples from different perspectives and current data analysis challenges.
|
39
|
Wu M, Ma S. Robust genetic interaction analysis. Brief Bioinform 2019; 20:624-637. [PMID: 29897421 PMCID: PMC6556899 DOI: 10.1093/bib/bby033] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/06/2018] [Revised: 03/22/2018] [Indexed: 01/17/2023] Open
Abstract
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contaminations in response variables and covariates are not uncommon, and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are gaining attention fast. In this article, we provide a comprehensive review of robust genetic interaction analysis methods, on their methodologies and applications, for both marginal and joint analysis, and for addressing model mis-specification as well as outliers/contaminations in response variables and covariates.
Affiliation(s)
- Mengyun Wu
- Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
| | - Shuangge Ma
- Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
|
40
|
Arabnejad M, Dawkins BA, Bush WS, White BC, Harkness AR, McKinney BA. Transition-transversion encoding and genetic relationship metric in ReliefF feature selection improves pathway enrichment in GWAS. BioData Min 2018; 11:23. [PMID: 30410580 PMCID: PMC6215626 DOI: 10.1186/s13040-018-0186-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 07/08/2018] [Accepted: 10/22/2018] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND: ReliefF is a nearest-neighbor based feature selection algorithm that efficiently detects variants that are important due to statistical interactions or epistasis. For categorical predictors, like genotypes, the standard metric used in ReliefF has been a simple (binary) mismatch difference. In this study, we develop new metrics of varying complexity that incorporate allele sharing, adjustment for allele frequency heterogeneity via the genetic relationship matrix (GRM), and physicochemical differences of variants via a new transition/transversion encoding. METHODS: We introduce a new two-dimensional transition/transversion genotype encoding for ReliefF, and we implement three ReliefF attribute metrics: (1) genotype mismatch (GM), which is the ReliefF standard, (2) allele mismatch (AM), which accounts for heterozygous differences and has not been used previously in ReliefF, and (3) the new transition/transversion metric. We incorporate these attribute metrics into the ReliefF nearest neighbor calculation with a Manhattan metric, and we introduce GRM as a new ReliefF nearest-neighbor metric to adjust for allele frequency heterogeneity. RESULTS: We apply ReliefF with each metric to a GWAS of major depressive disorder and compare the detection of genes in pathways implicated in depression, including Axon Guidance, Neuronal System, and G Protein-Coupled Receptor Signaling. We also compare with detection by Random Forest and Lasso as well as random/null selection to assess pathway size bias. CONCLUSIONS: Our results suggest that using more genetically motivated encodings, such as transition/transversion, and metrics that adjust for allele frequency heterogeneity, such as GRM, leads to ReliefF attribute scores with improved pathway enrichment.
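The difference between the genotype mismatch (GM) and allele mismatch (AM) attribute metrics named in this abstract can be illustrated with a minimal Python sketch. It assumes genotypes are coded as minor-allele counts (0, 1, 2) and that AM is the absolute difference in allele count divided by 2; both the coding and the function names are illustrative assumptions, not taken from the paper. The transition/transversion metric is omitted because the abstract does not specify it.

```python
def genotype_mismatch(g1, g2):
    """GM (assumed form): 0 if the two genotypes are identical, 1 otherwise."""
    return 0.0 if g1 == g2 else 1.0


def allele_mismatch(g1, g2):
    """AM (assumed form): fraction of differing alleles for 0/1/2-coded genotypes."""
    return abs(g1 - g2) / 2.0


def manhattan_distance(a, b, diff):
    """Sum the per-attribute differences between two samples."""
    return sum(diff(x, y) for x, y in zip(a, b))


def nearest_neighbors(data, index, k, diff):
    """Return the indices of the k samples closest to data[index]."""
    target = data[index]
    ranked = sorted(
        (manhattan_distance(target, row, diff), i)
        for i, row in enumerate(data)
        if i != index
    )
    return [i for _, i in ranked[:k]]


# Hypothetical toy data: three samples, three SNPs each.
samples = [
    [0, 0, 0],  # sample 0
    [2, 2, 0],  # sample 1: two homozygous differences from sample 0
    [1, 1, 1],  # sample 2: three heterozygous differences from sample 0
]
```

On this toy data GM treats sample 1 as the nearest neighbor of sample 0 (two mismatches versus three), whereas AM prefers sample 2 (distance 1.5 versus 2.0), which shows how accounting for heterozygous differences can change the neighborhood that ReliefF scores against.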
Affiliation(s)
- M. Arabnejad
- Tandy School of Computer Science, The University of Tulsa, 800 S. Tucker Dr, Tulsa, OK 74104 USA
| | - B. A. Dawkins
- Department of Mathematics, The University of Tulsa, Tulsa, OK 74104 USA
| | - W. S. Bush
- Institute for Computational Biology, Case Western Reserve University, 2103 Cornell Road, Cleveland, OH 44106 USA
| | - B. C. White
- Tandy School of Computer Science, The University of Tulsa, 800 S. Tucker Dr, Tulsa, OK 74104 USA
| | - A. R. Harkness
- Department of Psychology, The University of Tulsa, Tulsa, OK 74104 USA
| | - B. A. McKinney
- Tandy School of Computer Science, The University of Tulsa, 800 S. Tucker Dr, Tulsa, OK 74104 USA
- Department of Mathematics, The University of Tulsa, Tulsa, OK 74104 USA
|
41
|
Li B, Zhang N, Wang YG, George AW, Reverter A, Li Y. Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods. Front Genet 2018; 9:237. [PMID: 30023001 PMCID: PMC6039760 DOI: 10.3389/fgene.2018.00237] [Citation(s) in RCA: 79] [Impact Index Per Article: 13.2] [Received: 03/25/2018] [Accepted: 06/14/2018] [Indexed: 12/22/2022] Open
Abstract
The analysis of large genomic data is hampered by issues such as a small number of observations and a large number of predictive variables (commonly known as “large P small N”), high dimensionality or highly correlated data structures. Machine learning methods are renowned for dealing with these problems. To date machine learning methods have been applied in Genome-Wide Association Studies for identification of candidate genes, epistasis detection, gene network pathway analyses and genomic prediction of phenotypic values. However, the utility of two machine learning methods, Gradient Boosting Machine (GBM) and Extreme Gradient Boosting Method (XgBoost), in identifying a subset of SNP markers for genomic prediction of breeding values has never been explored before. In this study, using 38,082 SNP markers and body weight phenotypes from 2,093 Brahman cattle (1,097 bulls as a discovery population and 996 cows as a validation population), we examined the efficiency of three machine learning methods, namely Random Forests (RF), GBM and XgBoost, in (a) the identification of top 400, 1,000, and 3,000 ranked SNPs; (b) using the subsets of SNPs to construct genomic relationship matrices (GRMs) for the estimation of genomic breeding values (GEBVs). For comparison purposes, we also calculated the GEBVs from (1) 400, 1,000, and 3,000 SNPs that were randomly selected and evenly spaced across the genome, and (2) from all the SNPs. We found that RF and especially GBM are efficient methods in identifying a subset of SNPs with direct links to candidate genes affecting the growth trait. In comparison to the estimate of prediction accuracy of GEBVs from using all SNPs (0.43), the 3,000 top SNPs identified by RF (0.42) and GBM (0.46) had similar values to those of the whole SNP panel. The performance of the subsets of SNPs from RF and GBM was substantially better than that of evenly spaced subsets across the genome (0.18–0.29). Of the three methods, RF and GBM consistently outperformed the XgBoost in genomic prediction accuracy.
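The step of building a genomic relationship matrix from a top-ranked SNP subset can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes genotypes coded 0/1/2 and uses the widely used VanRaden-style formula G = ZZᵀ / (2 Σ p(1 − p)), which the abstract does not name. The function names and the importance scores are hypothetical.

```python
def top_ranked(importances, n):
    """Return the column indices of the n SNPs with the highest importance."""
    order = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
    return sorted(order[:n])


def grm_from_subset(genotypes, selected):
    """Build a GRM from the selected SNP columns (VanRaden-style, assumed)."""
    n = len(genotypes)
    # Allele frequency of each selected SNP, estimated from the sample.
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in selected]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all selected SNPs are monomorphic")
    # Center each genotype by twice the allele frequency.
    centered = [
        [row[j] - 2.0 * p for j, p in zip(selected, freqs)]
        for row in genotypes
    ]
    return [
        [sum(a * b for a, b in zip(ri, rk)) / scale for rk in centered]
        for ri in centered
    ]


# Hypothetical toy data: 3 animals x 4 SNPs, with made-up importance scores.
genotypes = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
]
importances = [0.40, 0.05, 0.30, 0.10]
selected = top_ranked(importances, 2)      # columns 0 and 2
grm = grm_from_subset(genotypes, selected)
```

In practice the importance scores would come from the fitted RF, GBM or XgBoost model, and the resulting matrix would be passed to a mixed-model solver to estimate GEBVs; neither step is shown here.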
Affiliation(s)
- Bo Li
- CSIRO Agriculture and Food, St Lucia, QLD, Australia.,Shandong Technology and Business University, School of Computer Science and Technology, YanTai, China.,Shandong Co-Innovation Centre of Future Intelligent Computing, YanTai, China
| | - Nanxi Zhang
- Centre for Applications in Natural Resource Mathematics, University of Queensland, St Lucia, QLD, Australia
| | - You-Gan Wang
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
| | | | | | - Yutao Li
- CSIRO Agriculture and Food, St Lucia, QLD, Australia
|
42
|
Waters CD, Hard JJ, Brieuc MSO, Fast DE, Warheit KI, Knudsen CM, Bosch WJ, Naish KA. Genomewide association analyses of fitness traits in captive-reared Chinook salmon: Applications in evaluating conservation strategies. Evol Appl 2018; 11:853-868. [PMID: 29928295 PMCID: PMC5999212 DOI: 10.1111/eva.12599] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 08/30/2017] [Accepted: 01/09/2018] [Indexed: 12/20/2022] Open
Abstract
A novel application of genomewide association analyses is to use trait-associated loci to monitor the effects of conservation strategies on potentially adaptive genetic variation. Comparisons of fitness between captive- and wild-origin individuals, for example, do not reveal how captive rearing affects genetic variation underlying fitness traits or which traits are most susceptible to domestication selection. Here, we used data collected across four generations to identify loci associated with six traits in adult Chinook salmon (Oncorhynchus tshawytscha) and then determined how two alternative management approaches for captive rearing affected variation at these loci. Loci associated with date of return to freshwater spawning grounds (return timing), length and weight at return, age at maturity, spawn timing, and daily growth coefficient were identified using 9108 restriction site-associated markers and random forest, an approach suitable for polygenic traits. Mapping of trait-associated loci, gene annotations, and integration of results across multiple studies revealed candidate regions involved in several fitness-related traits. Genotypes at trait-associated loci were then compared between two hatchery populations that were derived from the same source but are now managed as separate lines, one integrated with and one segregated from the wild population. While no broad-scale change was detected across four generations, there were numerous regions where trait-associated loci overlapped with signatures of adaptive divergence previously identified in the two lines. Many regions, primarily with loci linked to return and spawn timing, were either unique to or more divergent in the segregated line, suggesting that these traits may be responding to domestication selection. This study is one of the first to utilize genomic approaches to demonstrate the effectiveness of a conservation strategy, managed gene flow, on trait-associated (and potentially adaptive) loci. The results will promote the development of trait-specific tools to better monitor genetic change in captive and wild populations.
Affiliation(s)
- Charles D. Waters
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
| | - Jeffrey J. Hard
- Conservation Biology Division, Northwest Fisheries Science Center, National Oceanic and Atmospheric Administration, Seattle, WA, USA
| | - Marine S. O. Brieuc
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
- Department of Biosciences, Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, Oslo, Norway
| | | | | | | | | | - Kerry A. Naish
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
|
43
|
Kang J, Rancati T, Lee S, Oh JH, Kerns SL, Scott JG, Schwartz R, Kim S, Rosenstein BS. Machine Learning and Radiogenomics: Lessons Learned and Future Directions. Front Oncol 2018; 8:228. [PMID: 29977864 PMCID: PMC6021505 DOI: 10.3389/fonc.2018.00228] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 02/16/2018] [Accepted: 06/04/2018] [Indexed: 12/25/2022] Open
Abstract
Due to the rapid increase in the availability of patient data, there is significant interest in precision medicine that could facilitate the development of a personalized treatment plan for each patient on an individual basis. Radiation oncology is particularly suited for predictive machine learning (ML) models due to the enormous amount of diagnostic data used as input and therapeutic data generated as output. An emerging field in precision radiation oncology that can take advantage of ML approaches is radiogenomics, which is the study of the impact of genomic variations on the sensitivity of normal and tumor tissue to radiation. Currently, patients undergoing radiotherapy are treated using uniform dose constraints specific to the tumor and surrounding normal tissues. This is suboptimal in many ways. First, the dose that can be delivered to the target volume may be insufficient for control but is constrained by the surrounding normal tissue, as dose escalation can lead to significant morbidity and rare. Second, two patients with nearly identical dose distributions can have substantially different acute and late toxicities, resulting in lengthy treatment breaks and suboptimal control, or chronic morbidities leading to poor quality of life. Despite significant advances in radiogenomics, the magnitude of the genetic contribution to radiation response far exceeds our current understanding of individual risk variants. In the field of genomics, ML methods are being used to extract harder-to-detect knowledge, but these methods have yet to fully penetrate radiogenomics. Hence, the goal of this publication is to provide an overview of ML as it applies to radiogenomics. We begin with a brief history of radiogenomics and its relationship to precision medicine. We then introduce ML and compare it to statistical hypothesis testing to reflect on shared lessons and to avoid common pitfalls. Current ML approaches to genome-wide association studies are examined. 
The application of ML specifically to radiogenomics is next presented. We end with important lessons for the proper integration of ML into radiogenomics.
Affiliation(s)
- John Kang
- Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY, United States
| | - Tiziana Rancati
- Prostate Cancer Program, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
| | - Sangkyu Lee
- Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
| | - Jung Hun Oh
- Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
| | - Sarah L. Kerns
- Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY, United States
| | - Jacob G. Scott
- Department of Translational Hematology and Oncology Research, Cleveland Clinic, Cleveland, OH, United States
- Department of Radiation Oncology, Cleveland Clinic, Cleveland, OH, United States
| | - Russell Schwartz
- Computational Biology Department, Carnegie Mellon School of Computer Science, Pittsburgh, PA, United States
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, United States
| | - Seyoung Kim
- Computational Biology Department, Carnegie Mellon School of Computer Science, Pittsburgh, PA, United States
| | - Barry S. Rosenstein
- Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
|
44
|
Wheeler NE, Gardner PP, Barquist L. Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella enterica. PLoS Genet 2018; 14:e1007333. [PMID: 29738521 PMCID: PMC5940178 DOI: 10.1371/journal.pgen.1007333] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 12/01/2017] [Accepted: 03/24/2018] [Indexed: 11/18/2022] Open
Abstract
Emerging pathogens are a major threat to public health, however understanding how pathogens adapt to new niches remains a challenge. New methods are urgently required to provide functional insights into pathogens from the massive genomic data sets now being generated from routine pathogen surveillance for epidemiological purposes. Here, we measure the burden of atypical mutations in protein coding genes across independently evolved Salmonella enterica lineages, and use these as input to train a random forest classifier to identify strains associated with extraintestinal disease. Members of the species fall along a continuum, from pathovars which cause gastrointestinal infection and low mortality, associated with a broad host-range, to those that cause invasive infection and high mortality, associated with a narrowed host range. Our random forest classifier learned to perfectly discriminate long-established gastrointestinal and invasive serovars of Salmonella. Additionally, it was able to discriminate recently emerged Salmonella Enteritidis and Typhimurium lineages associated with invasive disease in immunocompromised populations in sub-Saharan Africa, and within-host adaptation to invasive infection. We dissect the architecture of the model to identify the genes that were most informative of phenotype, revealing a common theme of degradation of metabolic pathways in extraintestinal lineages. This approach accurately identifies patterns of gene degradation and diversifying selection specific to invasive serovars that have been captured by more labour-intensive investigations, but can be readily scaled to larger analyses.
Affiliation(s)
- Nicole E. Wheeler
- Wellcome Sanger Institute, Hinxton, United Kingdom
- Biomolecular Interaction Centre, School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
| | - Paul P. Gardner
- Biomolecular Interaction Centre, School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Department of Biochemistry, University of Otago, Dunedin, New Zealand
| | - Lars Barquist
- Institute for Molecular Infection Biology, University of Wuerzburg, Wuerzburg, Germany
- Helmholtz Institute for RNA-based Infection Research, Wuerzburg, Germany
|
45
|
Brieuc MSO, Waters CD, Drinan DP, Naish KA. A practical introduction to Random Forest for genetic association studies in ecology and evolution. Mol Ecol Resour 2018; 18:755-766. [PMID: 29504715 DOI: 10.1111/1755-0998.12773] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 10/30/2017] [Revised: 02/08/2018] [Accepted: 02/17/2018] [Indexed: 12/25/2022]
Abstract
Large genomic studies are becoming increasingly common with advances in sequencing technology, and our ability to understand how genomic variation influences phenotypic variation between individuals has never been greater. The exploration of such relationships first requires the identification of associations between molecular markers and phenotypes. Here, we explore the use of Random Forest (RF), a powerful machine-learning algorithm, in genomic studies to discern loci underlying both discrete and quantitative traits, particularly when studying wild or nonmodel organisms. RF is becoming increasingly used in ecological and population genetics because, unlike traditional methods, it can efficiently analyse thousands of loci simultaneously and account for nonadditive interactions. However, understanding both the power and limitations of Random Forest is important for its proper implementation and the interpretation of results. We therefore provide a practical introduction to the algorithm and its use for identifying associations between molecular markers and phenotypes, discussing such topics as data limitations, algorithm initiation and optimization, as well as interpretation. We also provide short R tutorials as examples, with the aim of providing a guide to the implementation of the algorithm. Topics discussed here are intended to serve as an entry point for molecular ecologists interested in employing Random Forest to identify trait associations in genomic data sets.
Affiliation(s)
- Marine S O Brieuc
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA.,Center for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
| | - Charles D Waters
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
| | - Daniel P Drinan
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
| | - Kerry A Naish
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
| |
Collapse
|
46
|
Lee S, Kerns S, Ostrer H, Rosenstein B, Deasy JO, Oh JH. Machine Learning on a Genome-wide Association Study to Predict Late Genitourinary Toxicity After Prostate Radiation Therapy. Int J Radiat Oncol Biol Phys 2018; 101:128-135. [PMID: 29502932 DOI: 10.1016/j.ijrobp.2018.01.054] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 10/27/2017] [Revised: 01/02/2018] [Accepted: 01/16/2018] [Indexed: 01/23/2023]
Abstract
PURPOSE: Late genitourinary (GU) toxicity after radiation therapy limits the quality of life of prostate cancer survivors; however, efforts to explain GU toxicity using patient and dose information have remained unsuccessful. We identified patients with a greater congenital GU toxicity risk by identifying and integrating patterns in genome-wide single nucleotide polymorphisms (SNPs). METHODS AND MATERIALS: We applied a preconditioned random forest regression method for predicting risk from the genome-wide data to combine the effects of multiple SNPs and overcome the statistical power limitations of single-SNP analysis. We studied a cohort of 324 prostate cancer patients who were self-assessed for 4 urinary symptoms at 2 years after radiation therapy using the International Prostate Symptom Score. RESULTS: The predictive accuracy of the method varied across the symptoms. Only for the weak stream endpoint did it achieve a significant area under the curve of 0.70 (95% confidence interval 0.54-0.86; P = .01) on hold-out validation data that outperformed competing methods. Gene ontology analysis highlighted key biological processes, such as neurogenesis and ion transport, from the genes known to be important for urinary tract functions. CONCLUSIONS: We applied machine learning methods and bioinformatics tools to genome-wide data to predict and explain GU toxicity. Our approach enabled the design of a more powerful predictive model and the determination of plausible biomarkers and biological processes associated with GU toxicity.
Affiliation(s)
- Sangkyu Lee: Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, New York
- Sarah Kerns: Department of Radiation Oncology, University of Rochester Medical Center, New York, New York
- Harry Ostrer: Department of Pathology, Albert Einstein College of Medicine, New York, New York; Department of Pediatrics, Albert Einstein College of Medicine, New York, New York
- Barry Rosenstein: Department of Radiation Oncology and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
- Joseph O Deasy: Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, New York
- Jung Hun Oh: Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, New York
|
47
|
Burghardt LT, Young ND, Tiffin P. A Guide to Genome-Wide Association Mapping in Plants. Curr Protoc Plant Biol 2017; 2:22-38. [PMID: 31725973 DOI: 10.1002/cppb.20041] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Indexed: 12/15/2022]
Abstract
Genome-wide association studies (GWAS) have developed into a valuable approach for identifying the genetic basis of phenotypic variation. In this article, we provide an overview of the design, analysis, and interpretation of GWAS. First, we present results from simulations that explore key elements of experimental design as well as considerations for collecting the relevant genomic and phenotypic data. Next, we outline current statistical methods and tools used for GWA analyses and discuss the inclusion of covariates to account for population structure and the interpretation of results. Given that many false positive associations will occur in any GWA analysis, we highlight strategies for prioritizing GWA candidates for further statistical and empirical validation. While focused on plants, the material we cover is also applicable to other systems. © 2017 by John Wiley & Sons, Inc.
Affiliation(s)
- Liana T Burghardt: Department of Plant and Microbial Biology, University of Minnesota, St. Paul, Minnesota
- Nevin D Young: Department of Plant Pathology, University of Minnesota, St. Paul, Minnesota
- Peter Tiffin: Department of Plant and Microbial Biology, University of Minnesota, St. Paul, Minnesota
|
48
|
Uncovering the genetic signature of quantitative trait evolution with replicated time series data. Heredity (Edinb) 2016; 118:42-51. [PMID: 27848948 DOI: 10.1038/hdy.2016.98] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 03/11/2016] [Revised: 08/18/2016] [Accepted: 08/24/2016] [Indexed: 01/04/2023] Open
Abstract
The genetic architecture of adaptation in natural populations has not yet been resolved: it is not clear to what extent the spread of beneficial mutations (selective sweeps) or the response of many quantitative trait loci drives adaptation to environmental changes. Although much attention has been given to the genomic footprint of selective sweeps, the importance of selection on quantitative traits is still not well studied, as the associated genomic signature is extremely difficult to detect. We propose 'Evolve and Resequence' as a promising tool to study polygenic adaptation of quantitative traits in evolving populations. Simulating replicated time series data, we show that adaptation to a new intermediate trait optimum has three characteristic phases that are reflected on the genomic level: (1) directional frequency changes towards the new trait optimum, (2) plateauing of allele frequencies when the new trait optimum has been reached and (3) subsequent divergence between replicated trajectories ultimately leading to the loss or fixation of alleles while the trait value does not change. We explore these three phase characteristics for relevant population genetic parameters to provide expectations for various experimental evolution designs. Remarkably, over a broad range of parameters the trajectories of selected alleles display a pattern across replicates, which differs both from neutrality and directional selection. We conclude that replicated time series data from experimental evolution studies provide a promising framework to study polygenic adaptation from whole-genome population genetics data.
|
49
|
Exploiting Single-Cell Quantitative Data to Map Genetic Variants Having Probabilistic Effects. PLoS Genet 2016; 12:e1006213. [PMID: 27479122 PMCID: PMC4968810 DOI: 10.1371/journal.pgen.1006213] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 02/12/2016] [Accepted: 07/02/2016] [Indexed: 01/11/2023] Open
Abstract
Despite the recent progress in sequencing technologies, genome-wide association studies (GWAS) remain limited by a statistical-power issue: many polymorphisms contribute little to common trait variation and therefore escape detection. The small contribution sometimes corresponds to incomplete penetrance, which may result from probabilistic effects on molecular regulation. In such cases, genetic mapping may benefit from the wealth of data produced by single-cell technologies. We present here the development of a novel genetic mapping method that scans genomes for single-cell Probabilistic Trait Loci that modify the statistical properties of cellular-level quantitative traits. Phenotypic values are acquired on thousands of individual cells, and genetic association is obtained from a multivariate analysis of a matrix of Kantorovich distances. No prior assumption is required on the mode of action of the genetic loci involved and, by exploiting all single-cell values, the method can reveal non-deterministic effects. Using both simulations and yeast experimental datasets, we show that it can detect linkages that are missed by classical genetic mapping. A probabilistic effect of a single SNP on cell shape was detected and validated. The method also detected a novel locus associated with elevated gene expression noise of the yeast galactose regulon. Our results illustrate how single-cell technologies can be exploited to improve the genetic dissection of certain common traits. The method is available as an open source R package called ptlmapper.
|
50
|
Märtens K, Hallin J, Warringer J, Liti G, Parts L. Predicting quantitative traits from genome and phenome with near perfect accuracy. Nat Commun 2016; 7:11512. [PMID: 27160605 PMCID: PMC4866306 DOI: 10.1038/ncomms11512] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 10/30/2015] [Accepted: 04/01/2016] [Indexed: 12/20/2022] Open
Abstract
In spite of decades of linkage and association studies and its potential impact on human health, reliable prediction of an individual's risk for heritable disease remains difficult. Large numbers of mapped loci do not explain substantial fractions of heritable variation, leaving an open question of whether accurate complex trait predictions can be achieved in practice. Here, we use a genome-sequenced population of ∼7,000 yeast strains of high but varying relatedness, and predict growth traits from family information, effects of segregating genetic variants and growth in other environments with an average coefficient of determination R² of 0.91. This accuracy exceeds narrow-sense heritability, approaches limits imposed by measurement repeatability and is higher than achieved with a single assay in the laboratory. Our results prove that very accurate prediction of complex traits is possible, and suggest that additional data from families rather than reference cohorts may be more useful for this purpose.
Affiliation(s)
- Kaspar Märtens: Institute of Computer Science, University of Tartu, Tartu 50409, Estonia
- Johan Hallin: Institute for Research on Cancer and Aging, University of Sophia Antipolis, Nice 02 06107, France
- Jonas Warringer: Department of Chemistry and Molecular Biology, Gothenburg University, Gothenburg 40530, Sweden; Centre for Integrative Genetics (CIGENE), Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, Ås N-1432, Norway
- Gianni Liti: Institute for Research on Cancer and Aging, University of Sophia Antipolis, Nice 02 06107, France
- Leopold Parts: Institute of Computer Science, University of Tartu, Tartu 50409, Estonia; Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB101SA, UK
|