1. Jia X, Luo W, Li J, Xing J, Sun H, Wu S, Su X. A deep learning framework for predicting disease-gene associations with functional modules and graph augmentation. BMC Bioinformatics 2024; 25:214. [PMID: 38877401 DOI: 10.1186/s12859-024-05841-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2024] [Accepted: 06/12/2024] [Indexed: 06/16/2024]
Abstract
BACKGROUND: The exploration of gene-disease associations is crucial for understanding the mechanisms underlying disease onset and progression, with significant implications for prevention and treatment strategies. Advances in high-throughput biotechnology have generated a wealth of data linking diseases to specific genes. While graph representation learning has recently introduced groundbreaking approaches for predicting novel associations, existing studies have consistently overlooked the cumulative impact of functional modules such as protein complexes and the incompleteness of important data such as protein interactions, which limits detection performance.
RESULTS: To address these limitations, we introduce a deep learning framework called ModulePred for predicting disease-gene associations. ModulePred performs graph augmentation on the protein interaction network using L3 link prediction algorithms. It builds a heterogeneous module network by integrating disease-gene associations, protein complexes and augmented protein interactions, and develops a novel graph embedding for the heterogeneous module network. Subsequently, a graph neural network is constructed to learn node representations by collectively aggregating information from the topological structure, and gene prioritization is carried out using the disease and gene embeddings obtained from the graph neural network. Experimental results underscore the superiority of ModulePred, showcasing the effectiveness of incorporating functional modules and graph augmentation in predicting disease-gene associations. This research introduces innovative ideas and directions, enhancing the understanding and prediction of gene-disease relationships.
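As a rough illustration of the graph-augmentation step, the sketch below scores missing protein interactions by counting degree-normalized paths of length three, which is the usual form of the L3 score. The abstract does not say which L3 variant ModulePred uses, so the normalization, the function name `l3_scores` and the toy edge list are assumptions for illustration only.

```python
from itertools import combinations
from math import sqrt


def l3_scores(edges):
    """Score each non-adjacent node pair by degree-normalized paths of length three.

    Assumed form of the score: sum over paths x-u-v-y of 1 / sqrt(deg(u) * deg(v)).
    """
    # Build an undirected adjacency map from the edge list.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    scores = {}
    for x, y in combinations(sorted(adj), 2):
        if y in adj[x]:
            continue  # only candidate (currently missing) links are scored
        total = 0.0
        for u in adj[x]:
            for v in adj[y]:
                if v in adj[u]:  # x-u-v-y is a path of length three
                    total += 1.0 / sqrt(len(adj[u]) * len(adj[v]))
        if total > 0:
            scores[(x, y)] = total
    return scores


# Hypothetical toy network: a simple chain A-B-C-D.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
candidates = l3_scores(toy_edges)
```

In an augmentation setting, the highest-scoring candidate pairs would be added to the interaction network before embedding; how many to add is a tuning choice the abstract does not specify.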
Affiliation(s)
- Xianghu Jia
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Weiwen Luo
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Jiaqi Li
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Jieqi Xing
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Hongjie Sun
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Shunyao Wu
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
- Xiaoquan Su
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, Shandong, China
2. Xiang J, Meng X, Zhao Y, Wu FX, Li M. HyMM: hybrid method for disease-gene prediction by integrating multiscale module structure. Brief Bioinform 2022; 23:6547263. [PMID: 35275996 DOI: 10.1093/bib/bbac072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2021] [Revised: 01/18/2022] [Accepted: 02/13/2022] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Identifying disease-related genes is an important issue in computational biology. Module structure widely exists in biomolecule networks, and complex diseases are usually thought to be caused by perturbations of local neighborhoods in the networks, which can provide useful insights for the study of disease-related genes. However, mining and effectively utilizing the module structure is still challenging in problems such as disease-gene prediction.
RESULTS: We propose a hybrid disease-gene prediction method integrating multiscale module structure (HyMM), which can utilize multiscale information from local to global structure to more effectively predict disease-related genes. HyMM extracts module partitions from local to global scales by multiscale modularity optimization with exponential sampling, and estimates the disease relatedness of genes in partitions by the abundance of disease-related genes within modules. Then, a probabilistic model for integrating gene rankings is designed to combine multiple predictions derived from multiscale module partitions and network propagation, and a parameter estimation strategy based on functional information is proposed to further enhance HyMM's predictive power. Through a series of experiments, we reveal the importance of module partitions at different scales, and verify the stable and good performance of HyMM compared with eight other state-of-the-art methods, as well as the further performance improvement derived from the parameter estimation.
CONCLUSIONS: The results confirm that HyMM is an effective framework for integrating multiscale module structure to enhance the ability to predict disease-related genes, which may provide useful insights for the study of multiscale module structure and its application in problems such as disease-gene prediction.
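The idea of estimating disease relatedness from the abundance of known disease genes within a module can be sketched for a single partition as follows. The abstract gives no formula, so the plain fraction used here, the function names and the toy data are assumptions; the multiscale sampling and the probabilistic rank integration of HyMM are not reproduced.

```python
def module_relatedness(partition, disease_genes):
    """Map each gene to the fraction of known disease genes in its module.

    partition: dict mapping gene -> module id (one partition at one scale).
    disease_genes: set of genes already associated with the disease.
    """
    sizes, hits = {}, {}
    for gene, module in partition.items():
        sizes[module] = sizes.get(module, 0) + 1
        if gene in disease_genes:
            hits[module] = hits.get(module, 0) + 1
    return {gene: hits.get(module, 0) / sizes[module]
            for gene, module in partition.items()}


def rank_candidates(partition, disease_genes):
    """Rank genes not yet linked to the disease, highest module score first."""
    scores = module_relatedness(partition, disease_genes)
    candidates = [g for g in partition if g not in disease_genes]
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(candidates, key=lambda g: (-scores[g], g))


# Hypothetical toy partition with two modules.
toy_partition = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
toy_disease_genes = {"g1", "g2"}
ranking = rank_candidates(toy_partition, toy_disease_genes)
```

Repeating this over partitions at several scales would yield one ranking per scale, which is the kind of input a rank-integration step would then combine.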
Affiliation(s)
- Ju Xiang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China; Department of Basic Medical Sciences & Academician Workstation, Changsha Medical University, Changsha, Hunan 410219, China
- Xiangmao Meng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Yichao Zhao
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, S7N 5A9, Canada
- Min Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
3. Mining semantic information of co-word network to improve link prediction performance. Scientometrics 2022. [DOI: 10.1007/s11192-021-04247-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
4. Yi HC, You ZH, Guo ZH, Huang DS, Chan KCC. Learning Representation of Molecules in Association Network for Predicting Intermolecular Associations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2546-2554. [PMID: 32070992 DOI: 10.1109/tcbb.2020.2973091] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
A key aim of post-genomic biomedical research is to systematically understand molecules and their interactions in human cells. Multiple biomolecules coordinate to sustain life activities, and interactions between various biomolecules are interconnected. However, existing studies usually focus only on associations between two or a very limited number of molecule types. In this study, we propose a network representation learning based computational framework, MAN-SDNE, to predict any intermolecular associations. More specifically, we constructed a large-scale molecular association network of multiple biomolecules in humans by integrating associations among long non-coding RNA, microRNA, protein, drug, and disease, containing 6,528 molecular nodes and 105,546 associations of 9 kinds. Each node is then represented by its network proximity and attribute features. Furthermore, these features are used to train a Random Forest classifier to predict intermolecular associations. MAN-SDNE achieves a remarkable performance with an AUC of 0.9552 and an AUPR of 0.9338 under five-fold cross-validation. To indicate the ability to predict specific types of interactions, a case study for predicting lncRNA-protein interactions using MAN-SDNE is also executed. Experimental results demonstrate that this work offers a systematic insight for understanding the synergistic associations between molecules and complex diseases and provides a network-based computational tool to systematically explore intermolecular interactions.
5. Nassar H, Benson AR, Gleich DF. Neighborhood and PageRank methods for pairwise link prediction. Social Network Analysis and Mining 2020. [DOI: 10.1007/s13278-020-00671-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/24/2022]
6. Bean DM, Al-Chalabi A, Dobson RJB, Iacoangeli A. A Knowledge-Based Machine Learning Approach to Gene Prioritisation in Amyotrophic Lateral Sclerosis. Genes (Basel) 2020; 11:E668. [PMID: 32575372 PMCID: PMC7349022 DOI: 10.3390/genes11060668] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 05/13/2020] [Revised: 06/13/2020] [Accepted: 06/16/2020] [Indexed: 02/07/2023]
Abstract
Amyotrophic lateral sclerosis is a neurodegenerative disease of the upper and lower motor neurons resulting in death from neuromuscular respiratory failure, typically within two to five years of first symptoms. Several rare disruptive gene variants have been associated with ALS and are responsible for about 15% of all cases. Although our knowledge of the genetic landscape of this disease is improving, it remains limited. Machine learning models trained on the available protein-protein interaction and phenotype-genotype association data can use our current knowledge of the disease genetics for the prediction of novel candidate genes. Here, we describe a knowledge-based machine learning method for this purpose. We trained our model on protein-protein interaction data from IntAct, gene function annotation from Gene Ontology, and known disease-gene associations from DisGeNet. Using several sets of known ALS genes from public databases and a manual review as input, we generated a list of new candidate genes for each input set. We investigated the relevance of the predicted genes in ALS by using the available summary statistics from the largest ALS genome-wide association study and by performing functional and phenotype enrichment analysis. The predicted sets were enriched for genes associated with other neurodegenerative diseases known to overlap with ALS genetically and phenotypically, as well as for biological processes associated with the disease. Moreover, using ALS genes from ClinVar and our manual review as input, the predicted sets were enriched for ALS-associated genes (ClinVar p = 0.038 and manual review p = 0.060) when used for gene prioritisation in a genome-wide association study.
Affiliation(s)
- Daniel M. Bean
  - Department of Biostatistics & Health Informatics, King's College London, 16 De Crespigny Park, London SE5 8AF, UK
  - Health Data Research UK London, University College London, 16 De Crespigny Park, London SE5 8AF, UK
- Ammar Al-Chalabi
  - King's College Hospital, Bessemer Road, Denmark Hill, Brixton, London SE5 9RS, UK
  - Maurice Wohl Clinical Neuroscience Institute, Department of Basic and Clinical Neuroscience, King's College London, London, 5 Cutcombe Rd, Brixton, London SE5 9RT, UK
- Richard J. B. Dobson
  - Department of Biostatistics & Health Informatics, King's College London, 16 De Crespigny Park, London SE5 8AF, UK
  - Health Data Research UK London, University College London, 16 De Crespigny Park, London SE5 8AF, UK
  - Institute of Health Informatics, University College London, 222 Euston Rd, London NW1 2DA, UK
- Alfredo Iacoangeli
  - Department of Biostatistics & Health Informatics, King's College London, 16 De Crespigny Park, London SE5 8AF, UK
  - Maurice Wohl Clinical Neuroscience Institute, Department of Basic and Clinical Neuroscience, King's College London, London, 5 Cutcombe Rd, Brixton, London SE5 9RT, UK
7. Bugnon LA, Yones C, Raad J, Gerard M, Rubiolo M, Merino G, Pividori M, Di Persia L, Milone DH, Stegmayer G. DL4papers: a deep learning approach for the automatic interpretation of scientific articles. Bioinformatics 2020; 36:3499-3506. [DOI: 10.1093/bioinformatics/btaa111] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/09/2019] [Revised: 12/27/2019] [Accepted: 02/14/2020] [Indexed: 01/26/2023]
Abstract
Motivation
In precision medicine, next-generation sequencing and novel preclinical reports have led to an increasingly large amount of results, published in the scientific literature. However, identifying novel treatments or predicting a drug response in, for example, cancer patients, from the huge amount of papers available remains a laborious and challenging work. This task can be considered a text mining problem that requires reading a lot of academic documents for identifying a small set of papers describing specific relations between key terms. Due to the infeasibility of the manual curation of these relations, computational methods that can automatically identify them from the available literature are urgently needed.
Results
We present DL4papers, a new method based on deep learning that is capable of analyzing and interpreting papers in order to automatically extract relevant relations between specific keywords. DL4papers receives as input a query with the desired keywords, and it returns a ranked list of papers that contain meaningful associations between the keywords. The comparison against related methods showed that our proposal outperformed them on a cancer corpus. The reliability of the DL4papers output list was also measured, revealing that, on average, 100% of the first two documents retrieved for a particular search have relevant relations. This shows that our model can guarantee that in the top-2 papers of the ranked list, the relation can be effectively found. Furthermore, the model is capable of highlighting, within each document, the specific fragments that contain the associations between the input keywords. This makes it possible to pay attention only to the highlighted text, instead of reading the full paper. We believe that our proposal could be used as an accurate tool for rapidly identifying relationships between genes and their mutations, drug responses and treatments in the context of a certain disease. This new approach can certainly be a very useful and valuable resource for the advancement of the precision medicine field.
Availability and implementation
A web-demo is available at: http://sinc.unl.edu.ar/web-demo/dl4papers/. Full source code and data are available at: https://sourceforge.net/projects/sourcesinc/files/dl4papers/.
Contact
lbugnon@sinc.unl.edu.ar
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- L A Bugnon
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- C Yones
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- J Raad
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- M Gerard
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- M Rubiolo
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- G Merino
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
  - Bioengineering and Bioinformatics Research and Development Institute, IBB, FIUNER-CONICET, Ruta Prov 11, Km 10.5, Oro Verde 3100, Argentina
- M Pividori
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA
- L Di Persia
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- D H Milone
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
- G Stegmayer
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe 3000, Argentina
8. Lee B, Zhang S, Poleksic A, Xie L. Heterogeneous Multi-Layered Network Model for Omics Data Integration and Analysis. Front Genet 2020; 10:1381. [PMID: 32063919 PMCID: PMC6997577 DOI: 10.3389/fgene.2019.01381] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 09/28/2019] [Accepted: 12/18/2019] [Indexed: 01/08/2023]
Abstract
Advances in next-generation sequencing and high-throughput techniques have enabled the generation of vast amounts of diverse omics data. These big data provide an unprecedented opportunity in biology, but impose great challenges in data integration, data mining, and knowledge discovery due to the complexity, heterogeneity, dynamics, uncertainty, and high dimensionality inherent in omics data. Networks have been widely used to represent relations between entities in biological systems, such as protein-protein interaction, gene regulation, and brain connectivity (i.e. network construction), as well as to infer novel relations given a reconstructed network (aka link prediction). In particular, the heterogeneous multi-layered network (HMLN) has proven successful in integrating diverse biological data to represent the hierarchy of biological systems. The HMLN provides unparalleled opportunities but imposes new computational challenges in establishing causal genotype-phenotype associations and understanding environmental impact on organisms. In this review, we focus on recent advances in developing novel computational methods for the inference of novel biological relations from the HMLN. First, we discuss the properties of biological HMLNs. Second, we survey four categories of state-of-the-art methods (matrix factorization, random walk, knowledge graph, and deep learning). Third, we demonstrate their applications to omics data integration and analysis. Finally, we outline strategies for future directions in the development of new HMLN models.
Affiliation(s)
- Bohyun Lee
  - Ph.D. Program in Computer Science, The City University of New York, New York, NY, United States
- Shuo Zhang
  - Ph.D. Program in Computer Science, The City University of New York, New York, NY, United States
- Aleksandar Poleksic
  - Department of Computer Science, The University of Northern Iowa, Cedar Falls, IA, United States
- Lei Xie
  - Ph.D. Program in Computer Science, The City University of New York, New York, NY, United States
  - Ph.D. Program in Biochemistry and Biology, The City University of New York, New York, NY, United States
  - Department of Computer Science, Hunter College, The City University of New York, New York, NY, United States
  - Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, Ithaca, NY, United States
9. Kasa SR, Bhattacharya S, Rajan V. Gaussian mixture copulas for high-dimensional clustering and dependency-based subtyping. Bioinformatics 2020; 36:621-628. [PMID: 31368480 DOI: 10.1093/bioinformatics/btz599] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/11/2018] [Revised: 05/27/2019] [Accepted: 07/26/2019] [Indexed: 11/14/2022]
Abstract
MOTIVATION: The identification of sub-populations of patients with similar characteristics, called patient subtyping, is important for realizing the goals of precision medicine. Accurate subtyping is crucial for tailoring therapeutic strategies that can potentially lead to reduced mortality and morbidity. Model-based clustering methods, such as Gaussian mixture models, provide a principled and interpretable methodology that is widely used to identify subtypes. However, they impose identical marginal distributions on each variable; such assumptions restrict their modeling flexibility and degrade clustering performance.
RESULTS: In this paper, we use the statistical framework of copulas to decouple the modeling of marginals from the dependencies between them. Current copula-based methods cannot scale to high dimensions due to challenges in parameter inference. We develop HD-GMCM, which addresses these challenges and, to our knowledge, is the first copula-based clustering method that can fit high-dimensional data. Our experiments on real high-dimensional gene-expression and clinical datasets show that HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas. We present a case study on lung cancer data from TCGA. Clusters obtained from HD-GMCM can be interpreted based on the dependencies they model, which offers a new way of characterizing subtypes. Empirically, such modeling uncovers not only latent structure that leads to better clustering but also meaningful clinical subtypes in terms of patient survival rates.
AVAILABILITY AND IMPLEMENTATION: An implementation of HD-GMCM in R is available at: https://bitbucket.org/cdal/hdgmcm/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
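To make the phrase "decouple the modeling of marginals from the dependencies" concrete, the sketch below applies a rank-based transform that maps one variable to standard-normal scores, a common preprocessing step when working with Gaussian copulas. It is not HD-GMCM itself (the authors' implementation is in R); the function name `normal_scores`, the `rank / (n + 1)` scaling and the assumption of no tied values are illustrative choices.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map one variable to standard-normal scores through its ranks.

    Only the ordering of the values is used, so the result is unchanged by
    any strictly increasing transformation of the marginal. Ties are assumed
    absent in this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    standard_normal = NormalDist()
    # rank / (n + 1) keeps the pseudo-observations strictly inside (0, 1).
    return [standard_normal.inv_cdf(rank / (n + 1)) for rank in ranks]


# Two hypothetical measurements of the same variable on very different scales.
raw = [1.0, 3.0, 2.0]
skewed = [10.0, 1000.0, 11.0]
same_scores = normal_scores(raw) == normal_scores(skewed)
```

Because both inputs share the same ordering, they yield identical scores: the marginal scale has been removed and only the rank structure, which carries the dependence information, is left for the clustering model.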
Affiliation(s)
- Siva Rajesh Kasa
  - Department of Information Systems and Analytics, School of Computing, National University of Singapore, 117418 Singapore
- Vaibhav Rajan
  - Department of Information Systems and Analytics, School of Computing, National University of Singapore, 117418 Singapore
10. Guo ZH, You ZH, Yi HC. Integrative Construction and Analysis of Molecular Association Network in Human Cells by Fusing Node Attribute and Behavior Information. Molecular Therapy - Nucleic Acids 2019; 19:498-506. [PMID: 31923739 PMCID: PMC6951835 DOI: 10.1016/j.omtn.2019.10.046] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 08/28/2019] [Revised: 10/07/2019] [Accepted: 10/21/2019] [Indexed: 11/27/2022]
Abstract
Detecting whether a pair of biomolecules associate is of great significance in the study of molecular biology. Hence, computational methods are urgently needed as guidance for practice. However, most previous prediction models, influenced by reductionism, focused on isolated research objects and therefore have their own inherent defects. Inspired by holism, a machine-learning-based framework called MAN-node2vec is proposed to predict multi-type relationships in the molecular associations network (MAN). Specifically, we constructed a large-scale MAN composed of 1,023 miRNAs, 1,649 proteins, 769 long non-coding RNAs (lncRNAs), 1,025 drugs, and 2,062 diseases. Then, each biomolecule in MAN can be represented as a vector by its attribute learned by k-mer, etc. and its behavior learned by node2vec. Finally, the random forest classifier is applied to carry out the relationship prediction task. The proposed model achieved a reliable performance, with an area under the curve (AUC) of 0.9677 and an area under the precision-recall curve (AUPR) of 0.9562 under 5-fold cross-validation. Also, additional experiments showed that the proposed global model is more competitive than the traditional local method. All of these provide a systematic insight for understanding the synergistic interactions between various molecules and diseases. It is anticipated that this work can bring beneficial inspiration and advances to related systems biology and biomedical research.
Affiliation(s)
- Zhen-Hao Guo
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Zhu-Hong You
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Hai-Cheng Yi
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China
11. Carraro M, Monzon AM, Chiricosta L, Reggiani F, Aspromonte MC, Bellini M, Pagel K, Jiang Y, Radivojac P, Kundu K, Pal LR, Yin Y, Limongelli I, Andreoletti G, Moult J, Wilson SJ, Katsonis P, Lichtarge O, Chen J, Wang Y, Hu Z, Brenner SE, Ferrari C, Murgia A, Tosatto SC, Leonardi E. Assessment of patient clinical descriptions and pathogenic variants from gene panel sequences in the CAGI-5 intellectual disability challenge. Hum Mutat 2019; 40:1330-1345. [PMID: 31144778 PMCID: PMC7341177 DOI: 10.1002/humu.23823] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 02/18/2019] [Revised: 05/07/2019] [Accepted: 05/27/2019] [Indexed: 12/15/2022]
Abstract
The Critical Assessment of Genome Interpretation-5 intellectual disability challenge asked participants to use computational methods to predict patient clinical phenotypes and the causal variant(s) based on an analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental manifestations (i.e. ID, autism, epilepsy, microcephaly, macrocephaly, hypotonia, ataxia) were made available for this challenge. For each patient, predictors had to report the causative variants and which of the seven phenotypes were present. Since neurodevelopmental disorders are characterized by strong comorbidity, tested individuals often present more than one pathological condition. Considering the overall clinical manifestation of each patient, the correct phenotype was predicted by at least one group for 93 individuals (62%). ID and ASD were the best predicted among the seven phenotypic traits. Also, causative or potentially pathogenic variants were predicted correctly by at least one group. However, the prediction of the correct causative variant seems to be insufficient to predict the correct phenotype. In some cases, the correct prediction was supported by rare or common variants in genes different from the causative one.
Affiliation(s)
- Marco Carraro
  - Department of Biomedical Sciences, University of Padua, Padua, Italy
- Luigi Chiricosta
  - Department of Biomedical Sciences, University of Padua, Padua, Italy
- Francesco Reggiani
  - Department of Biomedical Sciences, University of Padua, Padua, Italy
  - Department of Information Engineering, University of Padua, Padua, Italy
- Mariagrazia Bellini
  - Department of Woman and Child Health, University of Padua, Padua, Italy
  - Fondazione Istituto di Ricerca Pediatrica (IRP), Città della Speranza, Padova, Italy
- Kymberleigh Pagel
  - Khoury College of Computer and Information Sciences, Northeastern University, 440, Huntington Avenue, Boston, MA 02115, USA
- Yuxiang Jiang
  - Khoury College of Computer and Information Sciences, Northeastern University, 440, Huntington Avenue, Boston, MA 02115, USA
- Predrag Radivojac
  - Khoury College of Computer and Information Sciences, Northeastern University, 440, Huntington Avenue, Boston, MA 02115, USA
- Kunal Kundu
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
  - Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
- Lipika R. Pal
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Yizhou Yin
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
  - Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
- Gaia Andreoletti
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
  - Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA
- John Moult
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA
  - Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA
- Stephen J. Wilson
  - Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX 77030, USA
- Panagiotis Katsonis
  - Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX 77030, USA
- Olivier Lichtarge
  - Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX 77030, USA
- Jingqi Chen
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Yaqiong Wang
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Zhiqiang Hu
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Steven E. Brenner
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Carlo Ferrari
  - Department of Information Engineering, University of Padua, Padua, Italy
- Alessandra Murgia
  - Department of Woman and Child Health, University of Padua, Padua, Italy
  - Fondazione Istituto di Ricerca Pediatrica (IRP), Città della Speranza, Padova, Italy
- Silvio C.E. Tosatto
  - Department of Biomedical Sciences, University of Padua, Padua, Italy
  - CNR Institute of Neuroscience, Padua, Italy
- Emanuela Leonardi
  - Department of Woman and Child Health, University of Padua, Padua, Italy
  - Fondazione Istituto di Ricerca Pediatrica (IRP), Città della Speranza, Padova, Italy
12. Katsonis P, Lichtarge O. CAGI5: Objective performance assessments of predictions based on the Evolutionary Action equation. Hum Mutat 2019; 40:1436-1454. [PMID: 31317604 PMCID: PMC6900054 DOI: 10.1002/humu.23873] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 02/21/2019] [Revised: 07/02/2019] [Accepted: 07/11/2019] [Indexed: 12/14/2022]
Abstract
Many computational approaches estimate the effect of coding variants, but their predictions often disagree with each other. These contradictions confound users and raise questions regarding reliability. Performance assessments can indicate the expected accuracy for each method and highlight advantages and limitations. The Critical Assessment of Genome Interpretation (CAGI) community aims to organize objective and systematic assessments: they challenge predictors on unpublished experimental and clinical data and assign independent assessors to evaluate the submissions. We participated in CAGI experiments as predictors, using the Evolutionary Action (EA) method to estimate the fitness effect of coding mutations. EA is untrained, uses homology information, and relies on a formal equation: the fitness effect equals the functional sensitivity to residue changes multiplied by the magnitude of the substitution. In previous CAGI experiments (between 2011 and 2016), our submissions aimed to predict the protein activity of single mutants. In 2018 (CAGI5), we also submitted predictions regarding clinical associations, folding stability, and matching genomic data with phenotype. For all these diverse challenges, we used EA to predict the fitness effect of variants, adjusted to specifically address each question. Our submissions had consistently good performance, suggesting that EA reliably predicts the effects of genetic variants.
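The verbal statement of the EA equation in the abstract can be written compactly as a product. The symbols below are chosen here for illustration and are not taken from the abstract: $\Delta\varphi$ denotes the fitness effect, $\partial f / \partial r_i$ the functional sensitivity of the protein to changes at residue position $i$, and $\Delta r_{i,X \to Y}$ the magnitude of substituting amino acid $X$ by $Y$ at that position.

```latex
\Delta\varphi \;=\; \frac{\partial f}{\partial r_i} \,\cdot\, \Delta r_{i,\,X \to Y}
```

Read this way, a substitution has a large predicted effect only when both factors are large: a drastic amino acid change at an insensitive position, or a mild change at a sensitive one, yields a small product.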
Affiliation(s)
- Panagiotis Katsonis
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Olivier Lichtarge
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
  - Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, Texas
  - Department of Pharmacology, Baylor College of Medicine, Houston, Texas
  - Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas