1
|
Molotkov I, Artomov M. Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study. BIOINFORMATICS ADVANCES 2023; 3:vbad128. [PMID: 37745001 PMCID: PMC10517638 DOI: 10.1093/bioadv/vbad128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/14/2023] [Revised: 08/13/2023] [Accepted: 09/12/2023] [Indexed: 09/26/2023]
Abstract
Motivation Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justifications, simply for the sake of convenience. Results We provide an algorithm that under the weak assumptions of a lower bound on the number of positive examples can test for the violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing the inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can significantly overestimate the performance of models. This finding elucidates the discrepancy between the reported good performance of models and their limited practical applications. Availability and implementation Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.
Collapse
Affiliation(s)
- Ivan Molotkov
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, United States
- Department of Pediatrics, The Ohio State University, Columbus, OH, United States
- ITMO University, Saint Petersburg, Russia
| | - Mykyta Artomov
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, United States
- Department of Pediatrics, The Ohio State University, Columbus, OH, United States
| |
Collapse
|
2
|
Boguslav MR, Salem NM, White EK, Sullivan KJ, Bada M, Hernandez TL, Leach SM, Hunter LE. Creating an ignorance-base: Exploring known unknowns in the scientific literature. J Biomed Inform 2023; 143:104405. [PMID: 37270143 PMCID: PMC10528083 DOI: 10.1016/j.jbi.2023.104405] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Revised: 05/18/2023] [Accepted: 05/21/2023] [Indexed: 06/05/2023]
Abstract
BACKGROUND Scientific discovery progresses by exploring new and uncharted territory. More specifically, it advances by a process of transforming unknown unknowns first into known unknowns, and then into knowns. Over the last few decades, researchers have developed many knowledge bases to capture and connect the knowns, which has enabled topic exploration and contextualization of experimental results. But recognizing the unknowns is also critical for finding the most pertinent questions and their answers. Prior work on known unknowns has sought to understand them, annotate them, and automate their identification. However, no knowledge-bases yet exist to capture these unknowns, and little work has focused on how scientists might use them to trace a given topic or experimental result in search of open questions and new avenues for exploration. We show here that a knowledge base of unknowns can be connected to ontologically grounded biomedical knowledge to accelerate research in the field of prenatal nutrition. RESULTS We present the first ignorance-base, a knowledge-base created by combining classifiers to recognize ignorance statements (statements of missing or incomplete knowledge that imply a goal for knowledge) and biomedical concepts over the prenatal nutrition literature. This knowledge-base places biomedical concepts mentioned in the literature in context with the ignorance statements authors have made about them. Using our system, researchers interested in the topic of vitamin D and prenatal health were able to uncover three new avenues for exploration (immune system, respiratory system, and brain development) by searching for concepts enriched in ignorance statements. These were buried among the many standard enriched concepts. Additionally, we used the ignorance-base to enrich concepts connected to a gene list associated with vitamin D and spontaneous preterm birth and found an emerging topic of study (brain development) in an implied field (neuroscience). The researchers could look to the field of neuroscience for potential answers to the ignorance statements. CONCLUSION Our goal is to help students, researchers, funders, and publishers better understand the state of our collective scientific ignorance (known unknowns) in order to help accelerate research through the continued illumination of and focus on the known unknowns and their respective goals for scientific knowledge.
Collapse
Affiliation(s)
- Mayla R Boguslav
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA.
| | - Nourah M Salem
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| | - Elizabeth K White
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
| | - Katherine J Sullivan
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| | - Michael Bada
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| | - Teri L Hernandez
- College of Nursing, Department of Medicine/Division of Endocrinology, Metabolism, & Diabetes, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| | - Sonia M Leach
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
| | - Lawrence E Hunter
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| |
Collapse
|
3
|
Rahaie Z, Rabiee HR, Alinejad-Rokny H. DeepGenePrior: A deep learning model for prioritizing genes affected by copy number variants. PLoS Comput Biol 2023; 19:e1011249. [PMID: 37486921 PMCID: PMC10399873 DOI: 10.1371/journal.pcbi.1011249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2022] [Revised: 08/03/2023] [Accepted: 06/06/2023] [Indexed: 07/26/2023] Open
Abstract
The genetic etiology of brain disorders is highly heterogeneous, characterized by abnormalities in the development of the central nervous system that lead to diminished physical or intellectual capabilities. The process of determining which gene drives disease, known as "gene prioritization," is not entirely understood. Genome-wide searches for gene-disease associations are still underdeveloped due to reliance on previous discoveries and evidence sources with false positive or negative relations. This paper introduces DeepGenePrior, a model based on deep neural networks that prioritizes candidate genes in genetic diseases. Using the well-studied Variational AutoEncoder (VAE), we developed a score to measure the impact of genes on target diseases. Unlike other methods that use prior data to select candidate genes, based on the "guilt by association" principle and auxiliary data sources like protein networks, our study exclusively employs copy number variants (CNVs) for gene prioritization. By analyzing CNVs from 74,811 individuals with autism, schizophrenia, and developmental delay, we identified genes that best distinguish cases from controls. Our findings indicate a 12% increase in fold enrichment in brain-expressed genes compared to previous studies and a 15% increase in genes associated with mouse nervous system phenotypes. Furthermore, we identified common deletions in ZDHHC8, DGCR5, and CATG00000022283 among the top genes related to all three disorders, suggesting a common etiology among these clinically distinct conditions. DeepGenePrior is publicly available online at http://git.dml.ir/z_rahaie/DGP to address obstacles in existing gene prioritization studies identifying candidate genes.
Collapse
Affiliation(s)
- Zahra Rahaie
- BCB Group, DML, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Hamid R. Rabiee
- BCB Group, DML, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Hamid Alinejad-Rokny
- UNSW Biomedical Machine Learning Lab (BML), the Graduate School of Biomedical Engineering, UNSW Sydney, Sydney, Australia
| |
Collapse
|
4
|
Henry OJ, Stödberg T, Båtelson S, Rasi C, Stranneheim H, Wedell A. Individualised human phenotype ontology gene panels improve clinical whole exome and genome sequencing analytical efficacy in a cohort of developmental and epileptic encephalopathies. Mol Genet Genomic Med 2023; 11:e2167. [PMID: 36967109 PMCID: PMC10337286 DOI: 10.1002/mgg3.2167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2022] [Revised: 02/21/2023] [Accepted: 03/01/2023] [Indexed: 07/20/2023] Open
Abstract
BACKGROUND The majority of genetic epilepsies remain unsolved in terms of specific genotype. Phenotype-based genomic analyses have shown potential to strengthen genomic analysis in various ways, including improving analytical efficacy. METHODS We have tested a standardised phenotyping method termed 'Phenomodels' for integrating deep-phenotyping information with our in-house developed clinical whole exome/genome sequencing analytical pipeline. Phenomodels includes a user-friendly epilepsy phenotyping template and an objective measure for selecting which template terms to include in individualised Human Phenotype Ontology (HPO) gene panels. In a pilot study of 38 previously solved cases of developmental and epileptic encephalopathies, we compared the sensitivity and specificity of the individualised HPO gene panels with the clinical epilepsy gene panel. RESULTS The Phenomodels template showed high sensitivity for capturing relevant phenotypic information, where 37/38 individuals' HPO gene panels included the causative gene. The HPO gene panels also had far fewer variants to assess than the epilepsy gene panel. CONCLUSION We have demonstrated a viable approach for incorporating standardised phenotype information into clinical genomic analyses, which may enable more efficient analysis.
Collapse
Affiliation(s)
- Olivia J. Henry
- Department of Molecular Medicine and SurgeryKarolinska InstitutetStockholmSweden
| | - Tommy Stödberg
- Department of Women's and Children's HealthKarolinska InstitutetStockholmSweden
- Department of Pediatric NeurologyKarolinska University HospitalStockholmSweden
| | - Sofia Båtelson
- Department of Pediatric NeurologyKarolinska University HospitalStockholmSweden
| | - Chiara Rasi
- Science for Life Laboratory, Department of Microbiology, Tumour and Cell BiologyKarolinska InstitutetStockholmSweden
| | - Henrik Stranneheim
- Department of Molecular Medicine and SurgeryKarolinska InstitutetStockholmSweden
- Science for Life Laboratory, Department of Microbiology, Tumour and Cell BiologyKarolinska InstitutetStockholmSweden
- Centre for Inherited Metabolic DiseasesKarolinska University HospitalStockholmSweden
| | - Anna Wedell
- Department of Molecular Medicine and SurgeryKarolinska InstitutetStockholmSweden
- Centre for Inherited Metabolic DiseasesKarolinska University HospitalStockholmSweden
| |
Collapse
|
5
|
Zhang Y, Chen L, Li S. CIPHER-SC: Disease-Gene Association Inference Using Graph Convolution on a Context-Aware Network With Single-Cell Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:819-829. [PMID: 32809944 DOI: 10.1109/tcbb.2020.3017547] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Inference of disease-gene associations helps unravel the pathogenesis of diseases and contributes to the treatment. Although many machine learning-based methods have been developed to predict causative genes, accurate association inference remains challenging. One major reason is the inaccurate feature selection and accumulation of error brought by commonly used multi-stage training architecture. In addition, the existing methods do not incorporate cell-type-specific information, thus fail to study gene functions at a higher resolution. Therefore, we introduce single-cell transcriptome data and construct a context-aware network to unbiasedly integrate all data sources. Then we develop a graph convolution-based approach named CIPHER-SC to realize a complete end-to-end learning architecture. Our approach outperforms four state-of-the-art approaches in five-fold cross-validations on three distinct test sets with the best AUC of 0.9501, demonstrating its stable ability either to predict the novel genes or to predict with genetic basis. The ablation study shows that our complete end-to-end design and unbiased data integration boost the performance from 0.8727 to 0.9443 in AUC. The addition of single-cell data further improves the prediction accuracy and makes our results be enriched for cell-type-specific genes. These results confirm the ability of CIPHER-SC to discover reliable disease genes. Our implementation is available at http://github.com/YidingZhang117/CIPHER-SC.
Collapse
|
6
|
Lastra-Díaz JJ, Lara-Clares A, Garcia-Serrano A. HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey. BMC Bioinformatics 2022; 23:23. [PMID: 34991460 PMCID: PMC8734250 DOI: 10.1186/s12859-021-04539-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Accepted: 12/15/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are being extensively used in many applications in biomedical text mining and genomics respectively, which has encouraged the development of semantic measures libraries based on the aforementioned ontologies. However, current state-of-the-art semantic measures libraries have some performance and scalability drawbacks derived from their ontology representations based on relational databases, or naive in-memory graph representations. Likewise, a recent reproducible survey on word similarity shows that one hybrid IC-based measure which integrates a shortest-path computation sets the state of the art in the family of ontology-based semantic measures. However, the lack of an efficient shortest-path algorithm for their real-time computation prevents both their practical use in any application and the use of any other path-based semantic similarity measure. RESULTS To bridge the two aforementioned gaps, this work introduces for the first time an updated version of the HESML Java software library especially designed for the biomedical domain, which implements the most efficient and scalable ontology representation reported in the literature, together with a new method for the approximation of the Dijkstra's algorithm for taxonomies, called Ancestors-based Shortest-Path Length (AncSPL), which allows the real-time computation of any path-based semantic similarity measure. CONCLUSIONS We introduce a set of reproducible benchmarks showing that HESML outperforms by several orders of magnitude the current state-of-the-art libraries in the three aforementioned biomedical ontologies, as well as the real-time performance and approximation quality of the new AncSPL shortest-path algorithm. Likewise, we show that AncSPL linearly scales regarding the dimension of the common ancestor subgraph regardless of the ontology size. Path-based measures based on the new AncSPL algorithm are up to six orders of magnitude faster than their exact implementation in large ontologies like SNOMED-CT and GO. Finally, we provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
Collapse
Affiliation(s)
- Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
| | - Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
| | - Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
| |
Collapse
|
7
|
Yue Z, Yan D, Guo G, Chen JY. Biological Network Mining. Methods Mol Biol 2021; 2328:139-151. [PMID: 34251623 DOI: 10.1007/978-1-0716-1534-8_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
In this book chapter, we introduce a pipeline to mine significant biomedical entities (or bioentities) in biological networks. Our focus is on prioritizing both bioentities themselves and the associations between bioentities in order to reveal their biological functions. We will introduce three tools BEERE, WIPER, and PAGER 2.0 that can be used together for network analysis and function interpretation: (1) BEERE is a network analysis tool for "Biomedical Entity Expansion, Ranking and Explorations," (2) WIPER is an entity-to-entity association ranking tool, and (3) PAGER 2.0 is a service for gene enrichment analysis.
Collapse
Affiliation(s)
- Zongliang Yue
- The University of Alabama at Birmingham, Birmingham, AL, USA
| | - Da Yan
- The University of Alabama at Birmingham, Birmingham, AL, USA.
| | - Guimu Guo
- The University of Alabama at Birmingham, Birmingham, AL, USA
| | - Jake Y Chen
- The University of Alabama at Birmingham, Birmingham, AL, USA
| |
Collapse
|
8
|
Guerra C, Joshi S, Lu Y, Palini F, Ferraro Petrillo U, Rossignac J. Rank-Similarity Measures for Comparing Gene Prioritizations: A Case Study in Autism. J Comput Biol 2020; 28:283-295. [PMID: 33103913 DOI: 10.1089/cmb.2020.0244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
We discuss the challenge of comparing three gene prioritization methods: network propagation, integer linear programming rank aggregation (RA), and statistical RA. These methods are based on different biological categories and estimate disease-gene association. Previously proposed comparison schemes are based on three measures of performance: receiver operating curve, area under the curve, and median rank ratio. Although they may capture important aspects of gene prioritization performance, they may fail to capture important differences in the rankings of individual genes. We suggest that comparison schemes could be improved by also considering recently proposed measures of similarity between gene rankings. We tested this suggestion on comparison schemes for prioritizations of genes associated with autism that were obtained using brain- and tissue-specific data. Our results show the effectiveness of our measures of similarity in clustering brain regions based on their relevance to autism.
Collapse
Affiliation(s)
- Concettina Guerra
- Georgia Institute of Technology College of Computing, School of Interactive Computing, Atlanta, Georgia, USA
| | - Sarang Joshi
- Georgia Institute of Technology College of Computing, School of Interactive Computing, Atlanta, Georgia, USA
| | - Yinquan Lu
- Georgia Institute of Technology College of Computing, School of Interactive Computing, Atlanta, Georgia, USA
| | - Francesco Palini
- Dipartimento di Scienze Statistiche, Università di Roma-La Sapienza, Rome, Italy
| | | | - Jarek Rossignac
- Georgia Institute of Technology College of Computing, School of Interactive Computing, Atlanta, Georgia, USA
| |
Collapse
|
9
|
Paliwal S, de Giorgio A, Neil D, Michel JB, Lacoste AM. Preclinical validation of therapeutic targets predicted by tensor factorization on heterogeneous graphs. Sci Rep 2020; 10:18250. [PMID: 33106501 PMCID: PMC7589557 DOI: 10.1038/s41598-020-74922-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2020] [Accepted: 09/30/2020] [Indexed: 12/04/2022] Open
Abstract
Incorrect drug target identification is a major obstacle in drug discovery. Only 15% of drugs advance from Phase II to approval, with ineffective targets accounting for over 50% of these failures1-3. Advances in data fusion and computational modeling have independently progressed towards addressing this issue. Here, we capitalize on both these approaches with Rosalind, a comprehensive gene prioritization method that combines heterogeneous knowledge graph construction with relational inference via tensor factorization to accurately predict disease-gene links. Rosalind demonstrates an increase in performance of 18%-50% over five comparable state-of-the-art algorithms. On historical data, Rosalind prospectively identifies 1 in 4 therapeutic relationships eventually proven true. Beyond efficacy, Rosalind is able to accurately predict clinical trial successes (75% recall at rank 200) and distinguish likely failures (74% recall at rank 200). Lastly, Rosalind predictions were experimentally tested in a patient-derived in-vitro assay for Rheumatoid arthritis (RA), which yielded 5 promising genes, one of which is unexplored in RA.
Collapse
Affiliation(s)
- Saee Paliwal
- BenevolentAI, 1 Dock72 Way, 7th Floor, Brooklyn, NY, 11205, USA.
| | - Alex de Giorgio
- BenevolentAI, 4-6 Maple Street, Bloomsbury, London, W1T5HD, UK
| | - Daniel Neil
- BenevolentAI, 1 Dock72 Way, 7th Floor, Brooklyn, NY, 11205, USA
| | | | - Alix Mb Lacoste
- BenevolentAI, 1 Dock72 Way, 7th Floor, Brooklyn, NY, 11205, USA
| |
Collapse
|
10
|
Jiang S, Zhang CY, Tang L, Zhao LX, Chen HZ, Qiu Y. Integrated Genomic Analysis Revealed Associated Genes for Alzheimer's Disease in APOE4 Non-Carriers. Curr Alzheimer Res 2020; 16:753-763. [PMID: 31441725 DOI: 10.2174/1567205016666190823124724] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2019] [Revised: 07/14/2019] [Accepted: 08/08/2019] [Indexed: 12/31/2022]
Abstract
BACKGROUND APOE4 is the strongest genetic risk factor for late-onset Alzheimer's disease (LOAD). LOAD patients carrying or not carrying APOE4 manifest distinct clinico-pathological characteristics. APOE4 has been shown to play a critical role in the pathogenesis of AD by affecting various aspects of pathological processes. However, the pathogenesis involved in LOAD not-carrying APOE4 remains elusive. OBJECTIVE We aimed to identify the associated genes involved in LOAD not-carrying APOE4. METHODS An integrated genomic analysis of datasets of genome-wide association study, genome-wide expression profiling and genome-wide linkage scan and protein-protein interaction network construction were applied to identify associated gene clusters in APOE4 non-carriers. The role of one of hub gene of an APOE4 non-carrier-associated gene cluster in tau phosphorylation was studied by knockdown and western blot. RESULTS We identified 12 gene clusters associated with AD APOE4 non-carriers. The hub genes associated with AD in these clusters were MAPK8, POU2F1, XRCC1, PRKCG, EXOC6, VAMP4, SIRT1, MME, NOS1, ABCA1 and LDLR. The associated genes for APOE4 non-carriers were enriched in hereditary disorder, neurological disease and psychological disorders. Moreover, knockdown of PRKCG to reduce the expression of protein kinase Cγ isoform enhanced tau phosphorylation at Thr181 and Thr231 and the expression of glycogen synthase kinase 3β and cyclin-dependent kinase 5 in the presence of APOE3 but not APOE4. CONCLUSION The study provides new insight into the mechanism of distinct pathogenesis of LOAD not carrying APOE4 and prompts the functional exploration of identified genes based on APOE genotypes.
Collapse
Affiliation(s)
- Shan Jiang
- Department of Pharmacology and Chemical Biology, Institute of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.,Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Chun-Yun Zhang
- Department of Pharmacology and Chemical Biology, Institute of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
| | - Ling Tang
- Department of Pharmacology and Chemical Biology, Institute of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
| | - Lan-Xue Zhao
- Department of Pharmacology and Chemical Biology, Institute of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
| | - Hong-Zhuan Chen
- Institute of Interdisciplinary Integrative Biomedical Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201210, China
| | - Yu Qiu
- Department of Pharmacology and Chemical Biology, Institute of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
| |
Collapse
|
11
|
Hwang S, Kim CY, Yang S, Kim E, Hart T, Marcotte EM, Lee I. HumanNet v2: human gene networks for disease research. Nucleic Acids Res 2020; 47:D573-D580. [PMID: 30418591 PMCID: PMC6323914 DOI: 10.1093/nar/gky1126] [Citation(s) in RCA: 114] [Impact Index Per Article: 28.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2018] [Accepted: 10/25/2018] [Indexed: 12/15/2022] Open
Abstract
Human gene networks have proven useful in many aspects of disease research, with numerous network-based strategies developed for generating hypotheses about gene-disease-drug associations. The ability to predict and organize genes most relevant to a specific disease has proven especially important. We previously developed a human functional gene network, HumanNet, by integrating diverse types of omics data using Bayesian statistics framework and demonstrated its ability to retrieve disease genes. Here, we present HumanNet v2 (http://www.inetbio.org/humannet), a database of human gene networks, which was updated by incorporating new data types, extending data sources and improving network inference algorithms. HumanNet now comprises a hierarchy of human gene networks, allowing for more flexible incorporation of network information into studies. HumanNet performs well in ranking disease-linked gene sets with minimal literature-dependent biases. We observe that incorporating model organisms’ protein–protein interactions does not markedly improve disease gene predictions, suggesting that many of the disease gene associations are now captured directly in human-derived datasets. With an improved interactive user interface for disease network analysis, we expect HumanNet will be a useful resource for network medicine.
Collapse
Affiliation(s)
- Sohyun Hwang
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Korea.,Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas, Austin, TX 78712, USA.,Department of Biomedical Science, College of Life Science, CHA University, Seongnam-si 13496, Korea
| | - Chan Yeong Kim
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Korea
| | - Sunmo Yang
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Korea
| | - Eiru Kim
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX USA
| | - Traver Hart
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX USA
| | - Edward M Marcotte
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas, Austin, TX 78712, USA.,Department of Molecular Biosciences, University of Texas at Austin, TX 78712, USA
| | - Insuk Lee
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul 03722, Korea
| |
Collapse
|
12
|
Cabrera-Andrade A, López-Cortés A, Jaramillo-Koupermann G, Paz-y-Miño C, Pérez-Castillo Y, Munteanu CR, González-Díaz H, Pazos A, Tejera E. Gene Prioritization through Consensus Strategy, Enrichment Methodologies Analysis, and Networking for Osteosarcoma Pathogenesis. Int J Mol Sci 2020; 21:E1053. [PMID: 32033398 PMCID: PMC7038221 DOI: 10.3390/ijms21031053] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2019] [Revised: 01/30/2020] [Accepted: 01/30/2020] [Indexed: 12/12/2022] Open
Abstract
Osteosarcoma is the most common subtype of primary bone cancer, affecting mostly adolescents. In recent years, several studies have focused on elucidating the molecular mechanisms of this sarcoma; however, its molecular etiology has still not been determined with precision. Therefore, we applied a consensus strategy with the use of several bioinformatics tools to prioritize genes involved in its pathogenesis. Subsequently, we assessed the physical interactions of the previously selected genes and applied a communality analysis to this protein-protein interaction network. The consensus strategy prioritized a total list of 553 genes. Our enrichment analysis validates several studies that describe the signaling pathways PI3K/AKT and MAPK/ERK as pathogenic. The gene ontology described TP53 as a principal signal transducer that chiefly mediates processes associated with cell cycle and DNA damage response It is interesting to note that the communality analysis clusters several members involved in metastasis events, such as MMP2 and MMP9, and genes associated with DNA repair complexes, like ATM, ATR, CHEK1, and RAD51. In this study, we have identified well-known pathogenic genes for osteosarcoma and prioritized genes that need to be further explored.
Collapse
Affiliation(s)
- Alejandro Cabrera-Andrade
- Grupo de Bio-Quimioinformática, Universidad de Las Américas, Quito 170125, Ecuador;
- Carrera de Enfermería, Facultad de Ciencias de la Salud, Universidad de Las Américas, Quito 170125, Ecuador
- RNASA-IMEDIR, Computer Sciences Faculty, University of A Coruna, 15071 A Coruña, Spain; (A.L.-C.); (C.R.M.); (A.P.)
| | - Andrés López-Cortés
- RNASA-IMEDIR, Computer Sciences Faculty, University of A Coruna, 15071 A Coruña, Spain; (A.L.-C.); (C.R.M.); (A.P.)
- Centro de Investigación Genética y Genómica, Facultad de Ciencias de la Salud Eugenio Espejo, Universidad UTE, Quito 170129, Ecuador;
| | - Gabriela Jaramillo-Koupermann
- Laboratorio de Biología Molecular, Subproceso de Anatomía Patológica, Hospital de Especialidades Eugenio Espejo, Quito 170403, Ecuador;
| | - César Paz-y-Miño
- Centro de Investigación Genética y Genómica, Facultad de Ciencias de la Salud Eugenio Espejo, Universidad UTE, Quito 170129, Ecuador;
| | - Yunierkis Pérez-Castillo
- Grupo de Bio-Quimioinformática, Universidad de Las Américas, Quito 170125, Ecuador;
- Escuela de Ciencias Físicas y Matemáticas, Universidad de Las Américas, Quito 170125, Ecuador
| | - Cristian R. Munteanu
- RNASA-IMEDIR, Computer Sciences Faculty, University of A Coruna, 15071 A Coruña, Spain; (A.L.-C.); (C.R.M.); (A.P.)
- Biomedical Research Institute of A Coruña (INIBIC), University Hospital Complex of A Coruña (CHUAC), 15006 A Coruña, Spain
- Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), Campus de Elviña s/n, 15071 A Coruña, Spain
| | - Humbert González-Díaz
- Department of Organic Chemistry II, University of the Basque Country UPV/EHU, 48940 Leioa, Spain
- IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain;
| | - Alejandro Pazos
- RNASA-IMEDIR, Computer Sciences Faculty, University of A Coruna, 15071 A Coruña, Spain; (A.L.-C.); (C.R.M.); (A.P.)
- Biomedical Research Institute of A Coruña (INIBIC), University Hospital Complex of A Coruña (CHUAC), 15006 A Coruña, Spain
- Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), Campus de Elviña s/n, 15071 A Coruña, Spain
| | - Eduardo Tejera
- Grupo de Bio-Quimioinformática, Universidad de Las Américas, Quito 170125, Ecuador;
- Facultad de Ingeniería y Ciencias Agropecuarias, Universidad de Las Américas, Quito 170125, Ecuador
| |
Collapse
|
13
|
Tran VD, Sperduti A, Backofen R, Costa F. Heterogeneous networks integration for disease–gene prioritization with node kernels. Bioinformatics 2020; 36:2649-2656. [DOI: 10.1093/bioinformatics/btaa008] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2019] [Revised: 12/19/2019] [Accepted: 01/23/2020] [Indexed: 12/21/2022] Open
Abstract
Abstract
Motivation
The identification of disease–gene associations is a task of fundamental importance in human health research. A typical approach consists in first encoding large gene/protein relational datasets as networks due to the natural and intuitive property of graphs for representing objects’ relationships and then utilizing graph-based techniques to prioritize genes for successive low-throughput validation assays. Since different types of interactions between genes yield distinct gene networks, there is the need to integrate different heterogeneous sources to improve the reliability of prioritization systems.
Results
We propose an approach based on three phases: first, we merge all sources in a single network, then we partition the integrated network according to edge density introducing a notion of edge type to distinguish the parts and finally, we employ a novel node kernel suitable for graphs with typed edges. We show how the node kernel can generate a large number of discriminative features that can be efficiently processed by linear regularized machine learning classifiers. We report state-of-the-art results on 12 disease–gene associations and on a time-stamped benchmark containing 42 newly discovered associations.
Availability and implementation
Source code: https://github.com/dinhinfotech/DiGI.git.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Van Dinh Tran
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
| | | | - Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
- Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Germany
| | - Fabrizio Costa
- Department of Computer Science, University of Exeter, Exeter, UK
| |
Collapse
|
14
|
Zolotareva O, Kleine M. A Survey of Gene Prioritization Tools for Mendelian and Complex Human Diseases. J Integr Bioinform 2019; 16:/j/jib.ahead-of-print/jib-2018-0069/jib-2018-0069.xml. [PMID: 31494632 PMCID: PMC7074139 DOI: 10.1515/jib-2018-0069] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2018] [Accepted: 07/12/2019] [Indexed: 12/16/2022] Open
Abstract
Modern high-throughput experiments provide us with numerous potential associations between genes and diseases. Experimental validation of all the discovered associations, let alone all the possible interactions between them, is time-consuming and expensive. To facilitate the discovery of causative genes, various approaches for prioritization of genes according to their relevance for a given disease have been developed. In this article, we explain the gene prioritization problem and provide an overview of computational tools for gene prioritization. Among about a hundred of published gene prioritization tools, we select and briefly describe 14 most up-to-date and user-friendly. Also, we discuss the advantages and disadvantages of existing tools, challenges of their validation, and the directions for future research.
Collapse
Affiliation(s)
- Olga Zolotareva
- Bielefeld University, Faculty of Technology and Center for Biotechnology, International Research Training Group "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes" and Genome Informatics, Universitätsstraße 25, Bielefeld, Germany
| | - Maren Kleine
- Bielefeld University, Faculty of Technology, Bioinformatics/Medical Informatics Department, Universitätsstraße 25, Bielefeld, Germany
| |
Collapse
|
15
|
Yue Z, Willey CD, Hjelmeland AB, Chen JY. BEERE: a web server for biomedical entity expansion, ranking and explorations. Nucleic Acids Res 2019; 47:W578-W586. [PMID: 31114876 PMCID: PMC6602520 DOI: 10.1093/nar/gkz428] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2019] [Revised: 05/04/2019] [Accepted: 05/20/2019] [Indexed: 12/02/2022] Open
Abstract
BEERE (Biomedical Entity Expansion, Ranking and Explorations) is a new web-based data analysis tool to help biomedical researchers characterize any input list of genes/proteins, biomedical terms or their combinations, i.e. 'biomedical entities', in the context of existing literature. Specifically, BEERE first aims to help users examine the credibility of known entity-to-entity associative or semantic relationships supported by database or literature references from the user input of a gene/term list. Then, it will help users uncover the relative importance of each entity-a gene or a term-within the user input by computing the ranking scores of all entities. At last, it will help users hypothesize new gene functions or genotype-phenotype associations by an interactive visual interface of constructed global entity relationship network. The output from BEERE includes: a list of the original entities matched with known relationships in databases; any expanded entities that may be generated from the analysis; the ranks and ranking scores reported with statistical significance for each entity; and an interactive graphical display of the gene or term network within data provenance annotations that link to external data sources. The web server is free and open to all users with no login requirement and can be accessed at http://discovery.informatics.uab.edu/beere/.
Collapse
Affiliation(s)
- Zongliang Yue
- Informatics Institute, School of Medicine, the University of Alabama at Birmingham, AL 35233, USA
| | - Christopher D Willey
- Department of Radiation Oncology, School of Medicine, the University of Alabama at Birmingham, AL 35233, USA
| | - Anita B Hjelmeland
- Department of Cell, Developmental and Integrative Biology, School of Medicine, the University of Alabama at Birmingham, AL 35233, USA
| | - Jake Y Chen
- Informatics Institute, School of Medicine, the University of Alabama at Birmingham, AL 35233, USA
| |
Collapse
|
16
|
GPS: Identification of disease genes by rank aggregation of multi-genomic scoring schemes. Genomics 2019; 111:612-618. [DOI: 10.1016/j.ygeno.2018.03.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2018] [Revised: 03/16/2018] [Accepted: 03/21/2018] [Indexed: 12/19/2022]
|
17
|
Kiblawi S, Chasman D, Henning A, Park E, Poon H, Gould M, Ahlquist P, Craven M. Augmenting subnetwork inference with information extracted from the scientific literature. PLoS Comput Biol 2019; 15:e1006758. [PMID: 31246951 PMCID: PMC6619809 DOI: 10.1371/journal.pcbi.1006758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2018] [Revised: 07/10/2019] [Accepted: 01/04/2019] [Indexed: 11/20/2022] Open
Abstract
Many biological studies involve either (i) manipulating some aspect of a cell or its environment and then simultaneously measuring the effect on thousands of genes, or (ii) systematically manipulating each gene and then measuring the effect on some response of interest. A common challenge that arises in these studies is to explain how genes identified as relevant in the given experiment are organized into a subnetwork that accounts for the response of interest. The task of inferring a subnetwork is typically dependent on the information available in publicly available, structured databases, which suffer from incompleteness. However, a wealth of potentially relevant information resides in the scientific literature, such as information about genes associated with certain concepts of interest, as well as interactions that occur among various biological entities. We contend that by exploiting this information, we can improve the explanatory power and accuracy of subnetwork inference in multiple applications. Here we propose and investigate several ways in which information extracted from the scientific literature can be used to augment subnetwork inference. We show that we can use literature-extracted information to (i) augment the set of entities identified as being relevant in a subnetwork inference task, (ii) augment the set of interactions used in the process, and (iii) support targeted browsing of a large inferred subnetwork by identifying entities and interactions that are closely related to concepts of interest. We use this approach to uncover the pathways involved in interactions between a virus and a host cell, and the pathways that are regulated by a transcription factor associated with breast cancer. Our experimental results demonstrate that these approaches can provide more accurate and more interpretable subnetworks. Integer program code, background network data, and pathfinding code are available at https://github.com/Craven-Biostat-Lab/subnetwork_inference There is a multitude of publicly available databases that contain information about biological entities (i.e., genes, proteins, and other small molecules) as well as information about how these entities interact together. However, these databases are often incomplete. There is a wealth of information present in the text of the scientific literature that is not yet available in these databases. Using tools that mine the scientific literature we are able to extract some of this potentially relevant information. In this work we show how we can use publicly available databases in conjunction with the information extracted from the scientific literature to infer the networks that are involved in specific biological processes, such as viral replication and cancer tumor growth.
Collapse
Affiliation(s)
- Sid Kiblawi
- Department of Computer Sciences, University of Wisconsin, Madison, WI, USA
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
| | - Deborah Chasman
- Wisconsin Institute for Discovery, University of Wisconsin, Madison, WI, USA
| | - Amanda Henning
- Department of Oncology, University of Wisconsin, Madison, WI, USA
| | - Eunju Park
- Institute for Molecular Virology, University of Wisconsin, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
| | | | - Michael Gould
- Department of Oncology, University of Wisconsin, Madison, WI, USA
| | - Paul Ahlquist
- Institute for Molecular Virology, University of Wisconsin, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
- Howard Hughes Medical Institute, University of Wisconsin, Madison, WI, USA
| | - Mark Craven
- Department of Computer Sciences, University of Wisconsin, Madison, WI, USA
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
- * E-mail:
| |
Collapse
|
18
|
Fine RS, Pers TH, Amariuta T, Raychaudhuri S, Hirschhorn JN. Benchmarker: An Unbiased, Association-Data-Driven Strategy to Evaluate Gene Prioritization Algorithms. Am J Hum Genet 2019; 104:1025-1039. [PMID: 31056107 PMCID: PMC6556976 DOI: 10.1016/j.ajhg.2019.03.027] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2018] [Accepted: 03/28/2019] [Indexed: 01/17/2023] Open
Abstract
Genome-wide association studies (GWASs) are valuable for understanding human biology, but associated loci typically contain multiple associated variants and genes. Thus, algorithms that prioritize likely causal genes and variants for a given phenotype can provide biological interpretations of association data. However, a critical, currently missing capability is to objectively compare performance of such algorithms. Typical comparisons rely on "gold standard" genes harboring causal coding variants, but such gold standards may be biased and incomplete. To address this issue, we developed Benchmarker, an unbiased, data-driven benchmarking method that compares performance of similarity-based prioritization strategies to each other (and to random chance) by leave-one-chromosome-out cross-validation with stratified linkage disequilibrium (LD) score regression. We first applied Benchmarker to 20 well-powered GWASs and compared gene prioritization based on strategies employing three different data sources, including annotated gene sets and gene expression; genes prioritized based on gene sets had higher per-SNP heritability than those prioritized based on gene expression. Additionally, in a direct comparison of three methods, DEPICT and MAGMA outperformed NetWAS. We also evaluated combinations of methods; our results indicated that combining data sources and algorithms can help prioritize higher-quality genes for follow-up. Benchmarker provides an unbiased approach to evaluate any similarity-based method that provides genome-wide prioritization of genes, variants, or gene sets and can determine the best such method for any particular GWAS. Our method addresses an important unmet need for rigorous tool assessment and can assist in mapping genetic associations to causal function.
Collapse
Affiliation(s)
- Rebecca S Fine
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Division of Endocrinology and Center for Basic and Translational Obesity Research, Boston Children's Hospital, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Ph.D. Program in Biological and Biomedical Sciences, Graduate School of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA
| | - Tune H Pers
- The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark; Department of Epidemiology Research, Statens Serum Institut, 2300 Copenhagen, Denmark
| | - Tiffany Amariuta
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Center for Data Sciences, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA; Division of Genetics, Brigham and Women's Hospital, Boston, MA 02115, USA; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Ph.D. Program in Bioinformatics and Integrative Genomics, Graduate School of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA
| | - Soumya Raychaudhuri
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Center for Data Sciences, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA; Division of Genetics, Brigham and Women's Hospital, Boston, MA 02115, USA; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; Arthritis Research UK Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Science Centre, The University of Manchester, Manchester M13 9PL, UK
| | - Joel N Hirschhorn
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Division of Endocrinology and Center for Basic and Translational Obesity Research, Boston Children's Hospital, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA.
| |
Collapse
|
19
|
Tomar S, Sethi R, Lai PS. Specific phenotype semantics facilitate gene prioritization in clinical exome sequencing. Eur J Hum Genet 2019; 27:1389-1397. [PMID: 31053788 DOI: 10.1038/s41431-019-0412-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2018] [Revised: 02/21/2019] [Accepted: 04/15/2019] [Indexed: 12/13/2022] Open
Abstract
Selection and prioritization of phenotype-centric variants remains a challenging part of variant analysis and interpretation in clinical exome sequencing. Phenotype-driven shortlisting of patient-specific gene lists can avoid missed diagnosis. Here, we analyzed the relevance of using primary Human Phenotype Ontology identifiers (HPO IDs) in prioritizing Mendelian disease genes across 30 in-house, 10 previously reported, and 10 recently published cases using three popular web-based gene prioritization tools (OMIMExplorer, VarElect & Phenolyzer). We assessed partial HPO-based gene prioritization using randomly chosen and top 10%, 30%, and 50% HPO IDs based on information content and found high variance within rank ratios across the former vs the latter. This signified that randomly selected less-specific HPO IDs for a given disease phenotype performed poorly by ranking probe gene farther away from the top rank. In contrast, the use of top 10%, 30%, and 50% HPO IDs individually could rank the probe gene among the top 1% in the ranked list of genes that was equivalent to the results when the full list of HPO IDs were used. Hence, we conclude that use of just the top 10% of HPO IDs chosen based on information content is sufficient for ranking the probe gene at top position. Our findings provide practical guidance for utilizing structured phenotype semantics and web-based gene-ranking tools to aid in identifying known as well unknown candidate gene associations in Mendelian disorders.
Collapse
Affiliation(s)
- Swati Tomar
- Department of Paediatrics, Yong Loo Lin School of Medicine, National University of Singapore, National University Health System (NUHS), 1E Kent Ridge Road, 119228, Singapore
| | - Raman Sethi
- Department of Paediatrics, Yong Loo Lin School of Medicine, National University of Singapore, National University Health System (NUHS), 1E Kent Ridge Road, 119228, Singapore
| | - Poh San Lai
- Department of Paediatrics, Yong Loo Lin School of Medicine, National University of Singapore, National University Health System (NUHS), 1E Kent Ridge Road, 119228, Singapore.
| |
Collapse
|
20
|
Stacey D, Fauman EB, Ziemek D, Sun BB, Harshfield EL, Wood AM, Butterworth AS, Suhre K, Paul DS. ProGeM: a framework for the prioritization of candidate causal genes at molecular quantitative trait loci. Nucleic Acids Res 2019; 47:e3. [PMID: 30239796 PMCID: PMC6326795 DOI: 10.1093/nar/gky837] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2017] [Revised: 08/31/2018] [Accepted: 09/11/2018] [Indexed: 12/27/2022] Open
Abstract
Quantitative trait locus (QTL) mapping of molecular phenotypes such as metabolites, lipids and proteins through genome-wide association studies represents a powerful means of highlighting molecular mechanisms relevant to human diseases. However, a major challenge of this approach is to identify the causal gene(s) at the observed QTLs. Here, we present a framework for the 'Prioritization of candidate causal Genes at Molecular QTLs' (ProGeM), which incorporates biological domain-specific annotation data alongside genome annotation data from multiple repositories. We assessed the performance of ProGeM using a reference set of 227 previously reported and extensively curated metabolite QTLs. For 98% of these loci, the expert-curated gene was one of the candidate causal genes prioritized by ProGeM. Benchmarking analyses revealed that 69% of the causal candidates were nearest to the sentinel variant at the investigated molecular QTLs, indicating that genomic proximity is the most reliable indicator of 'true positive' causal genes. In contrast, cis-gene expression QTL data led to three false positive candidate causal gene assignments for every one true positive assignment. We provide evidence that these conclusions also apply to other molecular phenotypes, suggesting that ProGeM is a powerful and versatile tool for annotating molecular QTLs. ProGeM is freely available via GitHub.
Collapse
Affiliation(s)
- David Stacey
- MRC/BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
| | - Eric B Fauman
- Pfizer Worldwide Research & Development, Genome Sciences & Technologies, Cambridge, MA 02142, USA
| | - Daniel Ziemek
- Pfizer Worldwide Research & Development, Inflammation & Immunology, 14167 Berlin, Germany
| | - Benjamin B Sun
- MRC/BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
| | - Eric L Harshfield
- MRC/BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Department of Clinical Neurosciences, University of Cambridge, Cambridge CB2 0QQ, UK
| | - Angela M Wood
- MRC/BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
| | - Adam S Butterworth
- MRC/BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
| | - Karsten Suhre
- Department of Physiology and Biophysics, Weill Cornell Medicine-Qatar, PO 24144, Doha, Qatar
| | - Dirk S Paul
- MRC/BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
| |
Collapse
|
21
|
Luo P, Tian LP, Ruan J, Wu FX. Disease Gene Prediction by Integrating PPI Networks, Clinical RNA-Seq Data and OMIM Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:222-232. [PMID: 29990218 DOI: 10.1109/tcbb.2017.2770120] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Disease gene prediction is a challenging task that has a variety of applications such as early diagnosis and drug development. The existing machine learning methods suffer from the imbalanced sample issue because the number of known disease genes (positive samples) is much less than that of unknown genes which are typically considered to be negative samples. In addition, most methods have not utilized clinical data from patients with a specific disease to predict disease genes. In this study, we propose a disease gene prediction algorithm (called dgSeq) by combining protein-protein interaction (PPI) network, clinical RNA-Seq data, and Online Mendelian Inheritance in Man (OMIN) data. Our dgSeq constructs differential networks based on rewiring information calculated from clinical RNA-Seq data. To select balanced sets of non-disease genes (negative samples), a disease-gene network is also constructed from OMIM data. After features are extracted from the PPI networks and differential networks, the logistic regression classifiers are trained. Our dgSeq obtains AUC values of 0.88, 0.83, and 0.80 for identifying breast cancer genes, thyroid cancer genes, and Alzheimer's disease genes, respectively, which indicates its superiority to other three competing methods. Both gene set enrichment analysis and predicted results demonstrate that dgSeq can effectively predict new disease genes.
Collapse
|
22
|
López-Cortés A, Paz-Y-Miño C, Cabrera-Andrade A, Barigye SJ, Munteanu CR, González-Díaz H, Pazos A, Pérez-Castillo Y, Tejera E. Gene prioritization, communality analysis, networking and metabolic integrated pathway to better understand breast cancer pathogenesis. Sci Rep 2018; 8:16679. [PMID: 30420728 PMCID: PMC6232116 DOI: 10.1038/s41598-018-35149-1] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2018] [Accepted: 10/16/2018] [Indexed: 12/30/2022] Open
Abstract
Consensus strategy was proved to be highly efficient in the recognition of gene-disease association. Therefore, the main objective of this study was to apply theoretical approaches to explore genes and communities directly involved in breast cancer (BC) pathogenesis. We evaluated the consensus between 8 prioritization strategies for the early recognition of pathogenic genes. A communality analysis in the protein-protein interaction (PPi) network of previously selected genes was enriched with gene ontology, metabolic pathways, as well as oncogenomics validation with the OncoPPi and DRIVE projects. The consensus genes were rationally filtered to 1842 genes. The communality analysis showed an enrichment of 14 communities specially connected with ERBB, PI3K-AKT, mTOR, FOXO, p53, HIF-1, VEGF, MAPK and prolactin signaling pathways. Genes with highest ranking were TP53, ESR1, BRCA2, BRCA1 and ERBB2. Genes with highest connectivity degree were TP53, AKT1, SRC, CREBBP and EP300. The connectivity degree allowed to establish a significant correlation between the OncoPPi network and our BC integrated network conformed by 51 genes and 62 PPi. In addition, CCND1, RAD51, CDC42, YAP1 and RPA1 were functional genes with significant sensitivity score in BC cell lines. In conclusion, the consensus strategy identifies both well-known pathogenic genes and prioritized genes that need to be further explored.
Collapse
Affiliation(s)
- Andrés López-Cortés
- Centro de Investigación Genética y Genómica, Facultad de Ciencias de la Salud Eugenio Espejo, Universidad UTE, Mariscal Sucre Avenue, 170129, Quito, Ecuador.
- RNASA-IMEDIR, Computer Sciences Faculty, University of Coruna, 15071, Coruna, Spain.
| | - César Paz-Y-Miño
- Centro de Investigación Genética y Genómica, Facultad de Ciencias de la Salud Eugenio Espejo, Universidad UTE, Mariscal Sucre Avenue, 170129, Quito, Ecuador
| | - Alejandro Cabrera-Andrade
- Carrera de Enfermería, Facultad de Ciencias de la Salud, Universidad de las Américas, Avenue de los Granados, 170125, Quito, Ecuador
- Grupo de Bio-Quimioinformática, Universidad de las Américas, Avenue de los Granados, 170125, Quito, Ecuador
| | - Stephen J Barigye
- Department of Chemistry, McGill University, 801 Sherbrooke Street West, Montreal, QC, H3A 0B8, Canada
| | - Cristian R Munteanu
- RNASA-IMEDIR, Computer Sciences Faculty, University of Coruna, 15071, Coruna, Spain
- INIBIC, Institute of Biomedical Research, CHUAC, UDC, 15006, Coruna, Spain
| | - Humberto González-Díaz
- Department of Organic Chemistry II, University of the Basque Country UPV/EHU, 48940, Leioa, Biscay, Spain
- IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Biscay, Spain
| | - Alejandro Pazos
- RNASA-IMEDIR, Computer Sciences Faculty, University of Coruna, 15071, Coruna, Spain
- INIBIC, Institute of Biomedical Research, CHUAC, UDC, 15006, Coruna, Spain
| | - Yunierkis Pérez-Castillo
- Grupo de Bio-Quimioinformática, Universidad de las Américas, Avenue de los Granados, 170125, Quito, Ecuador
- Escuela de Ciencias Físicas y Matemáticas, Universidad de las Américas, Avenue de los Granados, 170125, Quito, Ecuador
| | - Eduardo Tejera
- Grupo de Bio-Quimioinformática, Universidad de las Américas, Avenue de los Granados, 170125, Quito, Ecuador.
- Facultad de Ingeniería y Ciencias Agropecuarias, Universidad de las Américas, Avenue de los Granados, 170125, Quito, Ecuador.
| |
Collapse
|
23
|
Rosenthal SB, Len J, Webster M, Gary A, Birmingham A, Fisch KM. Interactive network visualization in Jupyter notebooks: visJS2jupyter. Bioinformatics 2018; 34:126-128. [PMID: 28968701 DOI: 10.1093/bioinformatics/btx581] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2017] [Accepted: 09/13/2017] [Indexed: 01/20/2023] Open
Abstract
Motivation Network biology is widely used to elucidate mechanisms of disease and biological processes. The ability to interact with biological networks is important for hypothesis generation and to give researchers an intuitive understanding of the data. We present visJS2jupyter, a tool designed to embed interactive networks in Jupyter notebooks to streamline network analysis and to promote reproducible research. Results The tool provides functions for performing and visualizing useful network operations in biology, including network overlap, network propagation around a focal set of genes, and co-localization of two sets of seed genes. visJS2jupyter uses the JavaScript library vis.js to create interactive networks displayed within Jupyter notebook cells with features including drag, click, hover, and zoom. We demonstrate the functionality of visJS2jupyter applied to a biological question, by creating a network propagation visualization to prioritize risk-related genes in autism. Availability and implementation The visJS2jupyter package is distributed under the MIT License. The source code, documentation and installation instructions are freely available on GitHub at https://github.com/ucsd-ccbb/visJS2jupyter. The package can be downloaded at https://pypi.python.org/pypi/visJS2jupyter. Contact sbrosenthal@ucsd.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sara Brin Rosenthal
- Department of Medicine, Center for Computational Biology and Bioinformatics, University of California San Diego, La Jolla, CA 92093, USA
| | - Julia Len
- Department of Medicine, Center for Computational Biology and Bioinformatics, University of California San Diego, La Jolla, CA 92093, USA
| | - Mikayla Webster
- Department of Medicine, Center for Computational Biology and Bioinformatics, University of California San Diego, La Jolla, CA 92093, USA
| | - Aaron Gary
- Department of Medicine, Center for Computational Biology and Bioinformatics, University of California San Diego, La Jolla, CA 92093, USA
| | - Amanda Birmingham
- Department of Medicine, Center for Computational Biology and Bioinformatics, University of California San Diego, La Jolla, CA 92093, USA
| | - Kathleen M Fisch
- Department of Medicine, Center for Computational Biology and Bioinformatics, University of California San Diego, La Jolla, CA 92093, USA
| |
Collapse
|
24
|
Zampieri G, Tran DV, Donini M, Navarin N, Aiolli F, Sperduti A, Valle G. Scuba: scalable kernel-based gene prioritization. BMC Bioinformatics 2018; 19:23. [PMID: 29370760 PMCID: PMC5785908 DOI: 10.1186/s12859-018-2025-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2016] [Accepted: 01/15/2018] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge, however, their practical implementation is often precluded by their limited scalability. RESULTS We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large scale predictions are required. Importantly, it is able to efficiently deal both with a large amount of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba integrates also a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods. CONCLUSIONS Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data is highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba .
Collapse
Affiliation(s)
- Guido Zampieri
- CRIBI Biotechnology Center, University of Padova, viale G. Colombo, 3, Padova, Italy.,Department of Women's and Children's Health, University of Padova, via Giustiniani, 3, Padova, Italy
| | - Dinh Van Tran
- Department of Mathematics, University of Padova, via Trieste, 63, Padova, Italy
| | - Michele Donini
- Istituto Italiano di Tecnologia, Via Morego, 30, Genoa, Italy
| | - Nicolò Navarin
- Department of Mathematics, University of Padova, via Trieste, 63, Padova, Italy
| | - Fabio Aiolli
- Department of Mathematics, University of Padova, via Trieste, 63, Padova, Italy
| | - Alessandro Sperduti
- Department of Mathematics, University of Padova, via Trieste, 63, Padova, Italy
| | - Giorgio Valle
- CRIBI Biotechnology Center, University of Padova, viale G. Colombo, 3, Padova, Italy. .,Department of Biology, University of Padova, viale G. Colombo, 3, Padova, Italy.
| |
Collapse
|
25
|
Bolgár B, Antal P. VB-MK-LMF: fusion of drugs, targets and interactions using variational Bayesian multiple kernel logistic matrix factorization. BMC Bioinformatics 2017; 18:440. [PMID: 28978313 PMCID: PMC5628496 DOI: 10.1186/s12859-017-1845-z] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2017] [Accepted: 09/21/2017] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Computational fusion approaches to drug-target interaction (DTI) prediction, capable of utilizing multiple sources of background knowledge, were reported to achieve superior predictive performance in multiple studies. Other studies showed that specificities of the DTI task, such as weighting the observations and focusing the side information are also vital for reaching top performance. METHOD We present Variational Bayesian Multiple Kernel Logistic Matrix Factorization (VB-MK-LMF), which unifies the advantages of (1) multiple kernel learning, (2) weighted observations, (3) graph Laplacian regularization, and (4) explicit modeling of probabilities of binary drug-target interactions. RESULTS VB-MK-LMF achieves significantly better predictive performance in standard benchmarks compared to state-of-the-art methods, which can be traced back to multiple factors. The systematic evaluation of the effect of multiple kernels confirm their benefits, but also highlights the limitations of linear kernel combinations, already recognized in other fields. The analysis of the effect of prior kernels using varying sample sizes sheds light on the balance of data and knowledge in DTI tasks and on the rate at which the effect of priors vanishes. This also shows the existence of "small sample size" regions where using side information offers significant gains. Alongside favorable predictive performance, a notable property of MF methods is that they provide a unified space for drugs and targets using latent representations. Compared to earlier studies, the dimensionality of this space proved to be surprisingly low, which makes the latent representations constructed by VB-ML-LMF especially well-suited for visual analytics. The probabilistic nature of the predictions allows the calculation of the expected values of hits in functionally relevant sets, which we demonstrate by predicting drug promiscuity. The variational Bayesian approximation is also implemented for general purpose graphics processing units yielding significantly improved computational time. CONCLUSION In standard benchmarks, VB-MK-LMF shows significantly improved predictive performance in a wide range of settings. Beyond these benchmarks, another contribution of our work is highlighting and providing estimates for further pharmaceutically relevant quantities, such as promiscuity, druggability and total number of interactions.
Collapse
Affiliation(s)
- Bence Bolgár
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2., Budapest, 1117 Hungary
| | - Péter Antal
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2., Budapest, 1117 Hungary
| |
Collapse
|
26
|
Frasca M. Gene2DisCo: Gene to disease using disease commonalities. Artif Intell Med 2017; 82:34-46. [DOI: 10.1016/j.artmed.2017.08.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2017] [Revised: 07/24/2017] [Accepted: 08/13/2017] [Indexed: 01/10/2023]
|
27
|
Tejera E, Cruz-Monteagudo M, Burgos G, Sánchez ME, Sánchez-Rodríguez A, Pérez-Castillo Y, Borges F, Cordeiro MNDS, Paz-Y-Miño C, Rebelo I. Consensus strategy in genes prioritization and combined bioinformatics analysis for preeclampsia pathogenesis. BMC Med Genomics 2017; 10:50. [PMID: 28789679 PMCID: PMC5549357 DOI: 10.1186/s12920-017-0286-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2017] [Accepted: 07/28/2017] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Preeclampsia is a multifactorial disease with unknown pathogenesis. Even when recent studies explored this disease using several bioinformatics tools, the main objective was not directed to pathogenesis. Additionally, consensus prioritization was proved to be highly efficient in the recognition of genes-disease association. However, not information is available about the consensus ability to early recognize genes directly involved in pathogenesis. Therefore our aim in this study is to apply several theoretical approaches to explore preeclampsia; specifically those genes directly involved in the pathogenesis. METHODS We firstly evaluated the consensus between 12 prioritization strategies to early recognize pathogenic genes related to preeclampsia. A communality analysis in the protein-protein interaction network of previously selected genes was done including further enrichment analysis. The enrichment analysis includes metabolic pathways as well as gene ontology. Microarray data was also collected and used in order to confirm our results or as a strategy to weight the previously enriched pathways. RESULTS The consensus prioritized gene list was rationally filtered to 476 genes using several criteria. The communality analysis showed an enrichment of communities connected with VEGF-signaling pathway. This pathway is also enriched considering the microarray data. Our result point to VEGF, FLT1 and KDR as relevant pathogenic genes, as well as those connected with NO metabolism. CONCLUSION Our results revealed that consensus strategy improve the detection and initial enrichment of pathogenic genes, at least in preeclampsia condition. Moreover the combination of the first percent of the prioritized genes with protein-protein interaction network followed by communality analysis reduces the gene space. This approach actually identifies well known genes related with pathogenesis. However, genes like HSP90, PAK2, CD247 and others included in the first 1% of the prioritized list need to be further explored in preeclampsia pathogenesis through experimental approaches.
Collapse
Affiliation(s)
- Eduardo Tejera
- Facultad de Medicina, Universidad de Las Américas, Av. de los Granados E12-41y Colimes esq, EC170125, Quito, Ecuador.
| | - Maykel Cruz-Monteagudo
- Department of Molecular and Cellular Pharmacology, Miller School of Medicine and Center for Computational Science, University of Miami, FL 33136, Miami, USA.,Department of General Education, West Coast University-Miami Campus, Doral, FL 33178, USA.,CIQUP/Departamento de Quimica e Bioquimica, Faculdade de Ciências, Universidade do Porto, 4169-007, Porto, Portugal.,REQUIMTE, Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007, Porto, Portugal
| | - Germán Burgos
- Facultad de Medicina, Universidad de Las Américas, Av. de los Granados E12-41y Colimes esq, EC170125, Quito, Ecuador
| | - María-Eugenia Sánchez
- Facultad de Medicina, Universidad de Las Américas, Av. de los Granados E12-41y Colimes esq, EC170125, Quito, Ecuador
| | - Aminael Sánchez-Rodríguez
- Departamento de Ciencias Naturales, Universidad Técnica Particular de Loja, Calle París S/N, EC1101608, Loja, Ecuador
| | | | - Fernanda Borges
- CIQUP/Departamento de Quimica e Bioquimica, Faculdade de Ciências, Universidade do Porto, 4169-007, Porto, Portugal
| | | | - César Paz-Y-Miño
- Centro de Investigaciones genética y genómica, Facultad de Ciencias de la Salud, Universidad Tecnológica Equinoccial, Quito, Ecuador
| | - Irene Rebelo
- Faculty of Pharmacy, University of Porto, Porto, Portugal.,UCIBIO@REQUIMTE, Caparica, Portugal
| |
Collapse
|
28
|
Hassani-Pak K, Rawlings C. Knowledge Discovery in Biological Databases for Revealing Candidate Genes Linked to Complex Phenotypes. J Integr Bioinform 2017; 14:/j/jib.ahead-of-print/jib-2016-0002/jib-2016-0002.xml. [PMID: 28609292 PMCID: PMC6042805 DOI: 10.1515/jib-2016-0002] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2017] [Accepted: 02/16/2017] [Indexed: 02/06/2023] Open
Abstract
Genetics and “omics” studies designed to uncover genotype to phenotype relationships often identify large numbers of potential candidate genes, among which the causal genes are hidden. Scientists generally lack the time and technical expertise to review all relevant information available from the literature, from key model species and from a potentially wide range of related biological databases in a variety of data formats with variable quality and coverage. Computational tools are needed for the integration and evaluation of heterogeneous information in order to prioritise candidate genes and components of interaction networks that, if perturbed through potential interventions, have a positive impact on the biological outcome in the whole organism without producing negative side effects. Here we review several bioinformatics tools and databases that play an important role in biological knowledge discovery and candidate gene prioritization. We conclude with several key challenges that need to be addressed in order to facilitate biological knowledge discovery in the future.
Collapse
|
29
|
Guala D, Sonnhammer ELL. A large-scale benchmark of gene prioritization methods. Sci Rep 2017; 7:46598. [PMID: 28429739 PMCID: PMC5399445 DOI: 10.1038/srep46598] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2016] [Accepted: 03/22/2017] [Indexed: 11/16/2022] Open
Abstract
In order to maximize the use of results from high-throughput experimental studies, e.g. GWAS, for identification and diagnostics of new disease-associated genes, it is important to have properly analyzed and benchmarked gene prioritization tools. While prospective benchmarks are underpowered to provide statistically significant results in their attempt to differentiate the performance of gene prioritization tools, a strategy for retrospective benchmarking has been missing, and new tools usually only provide internal validations. The Gene Ontology(GO) contains genes clustered around annotation terms. This intrinsic property of GO can be utilized in construction of robust benchmarks, objective to the problem domain. We demonstrate how this can be achieved for network-based gene prioritization tools, utilizing the FunCoup network. We use cross-validation and a set of appropriate performance measures to compare state-of-the-art gene prioritization algorithms: three based on network diffusion, NetRank and two implementations of Random Walk with Restart, and MaxLink that utilizes network neighborhood. Our benchmark suite provides a systematic and objective way to compare the multitude of available and future gene prioritization tools, enabling researchers to select the best gene prioritization tool for the task at hand, and helping to guide the development of more accurate methods.
Collapse
Affiliation(s)
- Dimitri Guala
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| | - Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
| |
Collapse
|
30
|
GenePANDA-a novel network-based gene prioritizing tool for complex diseases. Sci Rep 2017; 7:43258. [PMID: 28252032 PMCID: PMC5333103 DOI: 10.1038/srep43258] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2016] [Accepted: 01/23/2017] [Indexed: 02/08/2023] Open
Abstract
Here we describe GenePANDA, a novel network-based tool for prioritizing candidate disease genes. GenePANDA assesses whether a gene is likely a candidate disease gene based on its relative distance to known disease genes in a functional association network. A unique feature of GenePANDA is the introduction of adjusted network distance derived by normalizing the raw network distance between two genes with their respective mean raw network distance to all other genes in the network. The use of adjusted network distance significantly improves GenePANDA’s performance on prioritizing complex disease genes. GenePANDA achieves superior performance over five previously published algorithms for prioritizing disease genes. Finally, GenePANDA can assist in prioritizing functionally important SNPs identified by GWAS.
Collapse
|
31
|
Liu JL, Zhao M. A PubMed-wide study of endometriosis. Genomics 2016; 108:151-157. [DOI: 10.1016/j.ygeno.2016.10.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2016] [Revised: 09/30/2016] [Accepted: 10/12/2016] [Indexed: 12/18/2022]
|
32
|
Li J, Lin X, Teng Y, Qi S, Xiao D, Zhang J, Kang Y. A Comprehensive Evaluation of Disease Phenotype Networks for Gene Prioritization. PLoS One 2016; 11:e0159457. [PMID: 27415759 PMCID: PMC4944959 DOI: 10.1371/journal.pone.0159457] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2016] [Accepted: 07/01/2016] [Indexed: 12/31/2022] Open
Abstract
Identification of disease-causing genes is a fundamental challenge for human health studies. The phenotypic similarity among diseases may reflect the interactions at the molecular level, and phenotype comparison can be used to predict disease candidate genes. Online Mendelian Inheritance in Man (OMIM) is a database of human genetic diseases and related genes that has become an authoritative source of disease phenotypes. However, disease phenotypes have been described by free text; thus, standardization of phenotypic descriptions is needed before diseases can be compared. Several disease phenotype networks have been established in OMIM using different standardization methods. Two of these networks are important for phenotypic similarity analysis: the first and most commonly used network (mimMiner) is standardized by medical subject heading, and the other network (resnikHPO) is the first to be standardized by human phenotype ontology. This paper comprehensively evaluates for the first time the accuracy of these two networks in gene prioritization based on protein–protein interactions using large-scale, leave-one-out cross-validation experiments. The results show that both networks can effectively prioritize disease-causing genes, and the approach that relates two diseases using a logistic function improves prioritization performance. Tanimoto, one of four methods for normalizing resnikHPO, generates a symmetric network and it performs similarly to mimMiner. Furthermore, an integration of these two networks outperforms either network alone in gene prioritization, indicating that these two disease networks are complementary.
Collapse
Affiliation(s)
- Jianhua Li
- Department of Biomedical Informatics, Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, Liaoning, China
- Key Laboratory of Medical Image Computing of Northeastern University, Ministry of Education, Shenyang, Liaoning, China
| | - Xiaoyan Lin
- Department of Biomedical Informatics, Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, Liaoning, China
| | - Yueyang Teng
- Department of Biomedical Imaging, Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, Liaoning, China
| | - Shouliang Qi
- Key Laboratory of Medical Image Computing of Northeastern University, Ministry of Education, Shenyang, Liaoning, China
- Department of Biomedical Imaging, Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, Liaoning, China
| | - Dayu Xiao
- Department of Biomedical Imaging, Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, Liaoning, China
| | - Jianying Zhang
- Department of Biomedical Informatics, Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, Liaoning, China
- Border Biomedical Research Center, Department of Biological Sciences, The University of Texas at El Paso, El Paso, Texas, United States of America
| | - Yan Kang
- Key Laboratory of Medical Image Computing of Northeastern University, Ministry of Education, Shenyang, Liaoning, China
- Department of Biomedical Imaging, Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang, Liaoning, China
- * E-mail:
| |
Collapse
|
33
|
Börnigen D, Tyekucheva S, Wang X, Rider JR, Lee GS, Mucci LA, Sweeney C, Huttenhower C. Computational Reconstruction of NFκB Pathway Interaction Mechanisms during Prostate Cancer. PLoS Comput Biol 2016; 12:e1004820. [PMID: 27078000 PMCID: PMC4831844 DOI: 10.1371/journal.pcbi.1004820] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2015] [Accepted: 02/19/2016] [Indexed: 12/21/2022] Open
Abstract
Molecular research in cancer is one of the largest areas of bioinformatic investigation, but it remains a challenge to understand biomolecular mechanisms in cancer-related pathways from high-throughput genomic data. This includes the Nuclear-factor-kappa-B (NFκB) pathway, which is central to the inflammatory response and cell proliferation in prostate cancer development and progression. Despite close scrutiny and a deep understanding of many of its members’ biomolecular activities, the current list of pathway members and a systems-level understanding of their interactions remains incomplete. Here, we provide the first steps toward computational reconstruction of interaction mechanisms of the NFκB pathway in prostate cancer. We identified novel roles for ATF3, CXCL2, DUSP5, JUNB, NEDD9, SELE, TRIB1, and ZFP36 in this pathway, in addition to new mechanistic interactions between these genes and 10 known NFκB pathway members. A newly predicted interaction between NEDD9 and ZFP36 in particular was validated by co-immunoprecipitation, as was NEDD9's potential biological role in prostate cancer cell growth regulation. We combined 651 gene expression datasets with 1.4M gene product interactions to predict the inclusion of 40 additional genes in the pathway. Molecular mechanisms of interaction among pathway members were inferred using recent advances in Bayesian data integration to simultaneously provide information specific to biological contexts and individual biomolecular activities, resulting in a total of 112 interactions in the fully reconstructed NFκB pathway: 13 (11%) previously known, 29 (26%) supported by existing literature, and 70 (63%) novel. This method is generalizable to other tissue types, cancers, and organisms, and this new information about the NFκB pathway will allow us to further understand prostate cancer and to develop more effective prevention and treatment strategies. In molecular research in cancer it remains challenging to uncover biomolecular mechanisms in cancer-related pathways from high-throughput genomic data, including the Nuclear-factor-kappa-B (NFκB) pathway. Despite close scrutiny and a deep understanding of many of the NFκB pathway members’ biomolecular activities, the current list of pathway members and a systems-level understanding of their interactions remains incomplete. In this study, we provide the first steps toward computational reconstruction of interaction mechanisms of the NFκB pathway in prostate cancer. We identified novel roles for 8 genes in this pathway and new mechanistic interactions between these genes and 10 known pathway members. We combined 651 gene expression datasets with 1.4M interactions to predict the inclusion of 40 additional genes in the pathway. Molecular mechanisms of interaction were inferred using recent advances in Bayesian data integration to simultaneously provide information specific to biological contexts and individual biomolecular activities, resulting in 112 interactions in the fully reconstructed NFκB pathway. This method is generalizable, and this new information about the NFκB pathway will allow us to further understand prostate cancer.
Collapse
Affiliation(s)
- Daniela Börnigen
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, United States of America.,The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Svitlana Tyekucheva
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
| | - Xiaodong Wang
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Jennifer R Rider
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, United States of America
| | - Gwo-Shu Lee
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Lorelei A Mucci
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, United States of America
| | - Christopher Sweeney
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Curtis Huttenhower
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, United States of America.,The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| |
Collapse
|
34
|
Sung YJ, Pérusse L, Sarzynski MA, Fornage M, Sidney S, Sternfeld B, Rice T, Terry G, Jacobs DR, Katzmarzyk P, Curran JE, Carr JJ, Blangero J, Ghosh S, Després JP, Rankinen T, Rao D, Bouchard C. Genome-wide association studies suggest sex-specific loci associated with abdominal and visceral fat. Int J Obes (Lond) 2016; 40:662-74. [PMID: 26480920 PMCID: PMC4821694 DOI: 10.1038/ijo.2015.217] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/11/2015] [Revised: 10/05/2015] [Accepted: 10/06/2015] [Indexed: 12/21/2022]
Abstract
BACKGROUND To identify loci associated with abdominal fat and replicate prior findings, we performed genome-wide association (GWA) studies of abdominal fat traits: subcutaneous adipose tissue (SAT); visceral adipose tissue (VAT); total adipose tissue (TAT) and visceral to subcutaneous adipose tissue ratio (VSR). SUBJECTS AND METHODS Sex-combined and sex-stratified analyses were performed on each trait with (TRAIT-BMI) or without (TRAIT) adjustment for body mass index (BMI), and cohort-specific results were combined via a fixed effects meta-analysis. A total of 2513 subjects of European descent were available for the discovery phase. For replication, 2171 European Americans and 772 African Americans were available. RESULTS A total of 52 single-nucleotide polymorphisms (SNPs) encompassing 7 loci showed suggestive evidence of association (P<1.0 × 10(-6)) with abdominal fat in the sex-combined analyses. The strongest evidence was found on chromosome 7p14.3 between a SNP near BBS9 gene and VAT (rs12374818; P=1.10 × 10(-7)), an association that was replicated (P=0.02). For the BMI-adjusted trait, the strongest evidence of association was found between a SNP near CYCSP30 and VAT-BMI (rs10506943; P=2.42 × 10(-7)). Our sex-specific analyses identified one genome-wide significant (P<5.0 × 10(-8)) locus for SAT in women with 11 SNPs encompassing the MLLT10, DNAJC1 and EBLN1 genes on chromosome 10p12.31 (P=3.97 × 10(-8) to 1.13 × 10(-8)). The THNSL2 gene previously associated with VAT in women was also replicated (P=0.006). The six gene/loci showing the strongest evidence of association with VAT or VAT-BMI were interrogated for their functional links with obesity and inflammation using the Biograph knowledge-mining software. Genes showing the closest functional links with obesity and inflammation were ADCY8 and KCNK9, respectively. CONCLUSIONS Our results provide evidence for new loci influencing abdominal visceral (BBS9, ADCY8, KCNK9) and subcutaneous (MLLT10/DNAJC1/EBLN1) fat, and confirmed a locus (THNSL2) previously reported to be associated with abdominal fat in women.
Collapse
Affiliation(s)
- Yun Ju Sung
- Division of Biostatistics, Washington University School of Medicine, St-Louis, MO
| | - Louis Pérusse
- Department of Kinesiology, School of Medicine and Institute of Nutrition and Functional Foods, Laval University, Québec, QC
| | - Mark A. Sarzynski
- Human Genomics Laboratory, Pennington Biomedical Research Center, Baton Rouge, LA
| | - Myriam Fornage
- Center for Human Genetics, University of Texas Health Science Center, Houston, TX
| | - Steve Sidney
- Division of Research, Kaiser Permanente Northern California, Oakland, CA
| | - Barbara Sternfeld
- Division of Research, Kaiser Permanente Northern California, Oakland, CA
| | - Treva Rice
- Division of Biostatistics, Washington University School of Medicine, St-Louis, MO
| | - Gregg Terry
- Department of Radiology, School of Medicine, Vanderbilt University, Nahsville, TN
| | - David R. Jacobs
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN
| | - Peter Katzmarzyk
- Human Genomics Laboratory, Pennington Biomedical Research Center, Baton Rouge, LA
| | - Joanne E Curran
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, TX
| | - John Jeffrey Carr
- Department of Radiology, School of Medicine, Vanderbilt University, Nahsville, TN
| | - John Blangero
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, TX
| | - Sujoy Ghosh
- Cardiovascular and Metabolic Disorders Program and Center for Computational Biology, Duke-NUS Graduate Medical School, Singapore
| | - Jean-Pierre Després
- Department of Kinesiology, School of Medicine and Institute of Nutrition and Functional Foods, Laval University, Québec, QC
- Centre de recherché de l’Institut universitaire de cardiologie et de pneumologie de Québec, Québec, QC
| | - Tuomo Rankinen
- Human Genomics Laboratory, Pennington Biomedical Research Center, Baton Rouge, LA
| | - D.C. Rao
- Division of Biostatistics, Washington University School of Medicine, St-Louis, MO
| | - Claude Bouchard
- Human Genomics Laboratory, Pennington Biomedical Research Center, Baton Rouge, LA
| |
Collapse
|
35
|
ElShal S, Tranchevent LC, Sifrim A, Ardeshirdavani A, Davis J, Moreau Y. Beegle: from literature mining to disease-gene discovery. Nucleic Acids Res 2016; 44:e18. [PMID: 26384564 PMCID: PMC4737179 DOI: 10.1093/nar/gkv905] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 08/25/2015] [Accepted: 08/29/2015] [Indexed: 01/06/2023] Open
Abstract
Disease-gene identification is a challenging process that has multiple applications within functional genomics and personalized medicine. Typically, this process involves both finding genes known to be associated with the disease (through literature search) and carrying out preliminary experiments or screens (e.g. linkage or association studies, copy number analyses, expression profiling) to determine a set of promising candidates for experimental validation. This requires extensive time and monetary resources. We describe Beegle, an online search and discovery engine that attempts to simplify this process by automating the typical approaches. It starts by mining the literature to quickly extract a set of genes known to be linked with a given query, then it integrates the learning methodology of Endeavour (a gene prioritization tool) to train a genomic model and rank a set of candidate genes to generate novel hypotheses. In a realistic evaluation setup, Beegle has an average recall of 84% in the top 100 returned genes as a search engine, which improves the discovery engine by 12.6% in the top 5% prioritized genes. Beegle is publicly available at http://beegle.esat.kuleuven.be/.
Collapse
Affiliation(s)
- Sarah ElShal
- Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium
| | - Léon-Charles Tranchevent
- Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium Inserm UMR-S1052, CNRS UMR5286, Cancer Research Centre of Lyon, Lyon, France Université de Lyon 1, Villeurbanne, France Centre Léon Bérard, Lyon, France
| | - Alejandro Sifrim
- Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium Wellcome Trust Genome Campus, Hinxton, Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK
| | - Amin Ardeshirdavani
- Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium
| | - Jesse Davis
- Department of Computer Science (DTAI), KU Leuven, Leuven 3001, Belgium
| | - Yves Moreau
- Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium
| |
Collapse
|
36
|
Weichenberger CX, Blankenburg H, Palermo A, D'Elia Y, König E, Bernstein E, Domingues FS. Dintor: functional annotation of genomic and proteomic data. BMC Genomics 2015; 16:1081. [PMID: 26691694 PMCID: PMC4687148 DOI: 10.1186/s12864-015-2279-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2015] [Accepted: 12/08/2015] [Indexed: 11/16/2022] Open
Abstract
Background During the last decade, a great number of extremely valuable large-scale genomics and proteomics datasets have become available to the research community. In addition, dropping costs for conducting high-throughput sequencing experiments and the option to outsource them considerably contribute to an increasing number of researchers becoming active in this field. Even though various computational approaches have been developed to analyze these data, it is still a laborious task involving prudent integration of many heterogeneous and frequently updated data sources, creating a barrier for interested scientists to accomplish their own analysis. Results We have implemented Dintor, a data integration framework that provides a set of over 30 tools to assist researchers in the exploration of genomics and proteomics datasets. Each of the tools solves a particular task and several tools can be combined into data processing pipelines. Dintor covers a wide range of frequently required functionalities, from gene identifier conversions and orthology mappings to functional annotation of proteins and genetic variants up to candidate gene prioritization and Gene Ontology-based gene set enrichment analysis. Since the tools operate on constantly changing datasets, we provide a mechanism to unambiguously link tools with different versions of archived datasets, which guarantees reproducible results for future tool invocations. We demonstrate a selection of Dintor’s capabilities by analyzing datasets from four representative publications. The open source software can be downloaded and installed on a local Unix machine. For reasons of data privacy it can be configured to retrieve local data only. In addition, the Dintor tools are available on our public Galaxy web service at http://dintor.eurac.edu. Conclusions Dintor is a computational annotation framework for the analysis of genomic and proteomic datasets, providing a rich set of tools that cover the most frequently encountered tasks. A major advantage is its capability to consistently handle multiple versions of tool-associated datasets, supporting the researcher in delivering reproducible results. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2279-5) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Christian X Weichenberger
- Center for Biomedicine, European Academy of Bolzano/Bozen (EURAC), (Affiliated to the University of Lübeck, Lübeck, Germany), Viale Druso 1, 39100, Bolzano, Italy.
| | - Hagen Blankenburg
- Center for Biomedicine, European Academy of Bolzano/Bozen (EURAC), (Affiliated to the University of Lübeck, Lübeck, Germany), Viale Druso 1, 39100, Bolzano, Italy.
| | - Antonia Palermo
- Center for Biomedicine, European Academy of Bolzano/Bozen (EURAC), (Affiliated to the University of Lübeck, Lübeck, Germany), Viale Druso 1, 39100, Bolzano, Italy.
| | - Yuri D'Elia
- Center for Biomedicine, European Academy of Bolzano/Bozen (EURAC), (Affiliated to the University of Lübeck, Lübeck, Germany), Viale Druso 1, 39100, Bolzano, Italy.
| | - Eva König
- Center for Biomedicine, European Academy of Bolzano/Bozen (EURAC), (Affiliated to the University of Lübeck, Lübeck, Germany), Viale Druso 1, 39100, Bolzano, Italy.
| | - Erik Bernstein
- Deutsches Krebsforschungszentrum (DKFZ), Im Neuenheimer Feld 280, 69120, Heidelberg, Germany.
| | - Francisco S Domingues
- Center for Biomedicine, European Academy of Bolzano/Bozen (EURAC), (Affiliated to the University of Lübeck, Lübeck, Germany), Viale Druso 1, 39100, Bolzano, Italy.
| |
Collapse
|
37
|
Verleyen W, Ballouz S, Gillis J. Positive and negative forms of replicability in gene network analysis. Bioinformatics 2015; 32:1065-73. [PMID: 26668004 DOI: 10.1093/bioinformatics/btv734] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2015] [Accepted: 12/09/2015] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION Gene networks have become a central tool in the analysis of genomic data but are widely regarded as hard to interpret. This has motivated a great deal of comparative evaluation and research into best practices. We explore the possibility that this may lead to overfitting in the field as a whole. RESULTS We construct a model of 'research communities' sampling from real gene network data and machine learning methods to characterize performance trends. Our analysis reveals an important principle limiting the value of replication, namely that targeting it directly causes 'easy' or uninformative replication to dominate analyses. We find that when sampling across network data and algorithms with similar variability, the relationship between replicability and accuracy is positive (Spearman's correlation, rs ∼0.33) but where no such constraint is imposed, the relationship becomes negative for a given gene function (rs ∼ -0.13). We predict factors driving replicability in some prior analyses of gene networks and show that they are unconnected with the correctness of the original result, instead reflecting replicable biases. Without these biases, the original results also vanish replicably. We show these effects can occur quite far upstream in network data and that there is a strong tendency within protein-protein interaction data for highly replicable interactions to be associated with poor quality control. AVAILABILITY AND IMPLEMENTATION Algorithms, network data and a guide to the code available at: https://github.com/wimverleyen/AggregateGeneFunctionPrediction CONTACT jgillis@cshl.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- W Verleyen
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
| | - S Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
| | - J Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
| |
Collapse
|
38
|
Freytag S, Gagnon-Bartsch J, Speed TP, Bahlo M. Systematic noise degrades gene co-expression signals but can be corrected. BMC Bioinformatics 2015; 16:309. [PMID: 26403471 PMCID: PMC4583191 DOI: 10.1186/s12859-015-0745-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2015] [Accepted: 09/16/2015] [Indexed: 12/31/2022] Open
Abstract
Background In the past decade, the identification of gene co-expression has become a routine part of the analysis of high-dimensional microarray data. Gene co-expression, which is mostly detected via the Pearson correlation coefficient, has played an important role in the discovery of molecular pathways and networks. Unfortunately, the presence of systematic noise in high-dimensional microarray datasets corrupts estimates of gene co-expression. Removing systematic noise from microarray data is therefore crucial. Many cleaning approaches for microarray data exist, however these methods are aimed towards improving differential expression analysis and their performances have been primarily tested for this application. To our knowledge, the performances of these approaches have never been systematically compared in the context of gene co-expression estimation. Results Using simulations we demonstrate that standard cleaning procedures, such as background correction and quantile normalization, fail to adequately remove systematic noise that affects gene co-expression and at times further degrade true gene co-expression. Instead we show that a global version of removal of unwanted variation (RUV), a data-driven approach, removes systematic noise but also allows the estimation of the true underlying gene-gene correlations. We compare the performance of all noise removal methods when applied to five large published datasets on gene expression in the human brain. RUV retrieves the highest gene co-expression values for sets of genes known to interact, but also provides the greatest consistency across all five datasets. We apply the method to prioritize epileptic encephalopathy candidate genes. Conclusions Our work raises serious concerns about the quality of many published gene co-expression analyses. RUV provides an efficient and flexible way to remove systematic noise from high-dimensional microarray datasets when the objective is gene co-expression analysis. The RUV method as applicable in the context of gene-gene correlation estimation is available as a BioconductoR-package: RUVcorr. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0745-3) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Saskia Freytag
- Bioinformatics Division, Walter + Eliza Hall Institute, 1G Royal Parade, Melbourne, 3050, Australia. .,Department of Mathematics and Statistics, University of Melbourne, Melbourne, 3010, Australia.
| | - Johann Gagnon-Bartsch
- Department of Statistics, University of California, 367 Evans Hall, Berkeley, 94720, USA.
| | - Terence P Speed
- Bioinformatics Division, Walter + Eliza Hall Institute, 1G Royal Parade, Melbourne, 3050, Australia. .,Department of Mathematics and Statistics, University of Melbourne, Melbourne, 3010, Australia. .,Department of Statistics, University of California, 367 Evans Hall, Berkeley, 94720, USA.
| | - Melanie Bahlo
- Bioinformatics Division, Walter + Eliza Hall Institute, 1G Royal Parade, Melbourne, 3050, Australia. .,Department of Mathematics and Statistics, University of Melbourne, Melbourne, 3010, Australia. .,Department of Medical Biology, University of Melbourne, Melbourne, 3010, Australia.
| |
Collapse
|
39
|
NetRanker: A network-based gene ranking tool using protein-protein interaction and gene expression data. BIOCHIP JOURNAL 2015. [DOI: 10.1007/s13206-015-9407-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
|
40
|
Antanaviciute A, Watson CM, Harrison SM, Lascelles C, Crinnion L, Markham AF, Bonthron DT, Carr IM. OVA: integrating molecular and physical phenotype data from multiple biomedical domain ontologies with variant filtering for enhanced variant prioritization. Bioinformatics 2015; 31:3822-9. [PMID: 26272982 PMCID: PMC4653395 DOI: 10.1093/bioinformatics/btv473] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2015] [Accepted: 08/09/2015] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION Exome sequencing has become a de facto standard method for Mendelian disease gene discovery in recent years, yet identifying disease-causing mutations among thousands of candidate variants remains a non-trivial task. RESULTS Here we describe a new variant prioritization tool, OVA (ontology variant analysis), in which user-provided phenotypic information is exploited to infer deeper biological context. OVA combines a knowledge-based approach with a variant-filtering framework. It reduces the number of candidate variants by considering genotype and predicted effect on protein sequence, and scores the remainder on biological relevance to the query phenotype.We take advantage of several ontologies in order to bridge knowledge across multiple biomedical domains and facilitate computational analysis of annotations pertaining to genes, diseases, phenotypes, tissues and pathways. In this way, OVA combines information regarding molecular and physical phenotypes and integrates both human and model organism data to effectively prioritize variants. By assessing performance on both known and novel disease mutations, we show that OVA performs biologically meaningful candidate variant prioritization and can be more accurate than another recently published candidate variant prioritization tool. AVAILABILITY AND IMPLEMENTATION OVA is freely accessible at http://dna2.leeds.ac.uk:8080/OVA/index.jsp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online. CONTACT umaan@leeds.ac.uk.
Collapse
Affiliation(s)
- Agne Antanaviciute
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and
| | - Christopher M Watson
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and Yorkshire Regional Genetics Service, St James's University Hospital, Leeds, UK
| | - Sally M Harrison
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and
| | - Carolina Lascelles
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and
| | - Laura Crinnion
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and Yorkshire Regional Genetics Service, St James's University Hospital, Leeds, UK
| | - Alexander F Markham
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and
| | - David T Bonthron
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and
| | - Ian M Carr
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and
| |
Collapse
|
41
|
Antanaviciute A, Daly C, Crinnion LA, Markham AF, Watson CM, Bonthron DT, Carr IM. GeneTIER: prioritization of candidate disease genes using tissue-specific gene expression profiles. Bioinformatics 2015; 31:2728-35. [PMID: 25861967 PMCID: PMC4528628 DOI: 10.1093/bioinformatics/btv196] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2014] [Accepted: 04/01/2015] [Indexed: 12/12/2022] Open
Abstract
Motivation: In attempts to determine the genetic causes of human disease, researchers are often faced with a large number of candidate genes. Linkage studies can point to a genomic region containing hundreds of genes, while the high-throughput sequencing approach will often identify a great number of non-synonymous genetic variants. Since systematic experimental verification of each such candidate gene is not feasible, a method is needed to decide which genes are worth investigating further. Computational gene prioritization presents itself as a solution to this problem, systematically analyzing and sorting each gene from the most to least likely to be the disease-causing gene, in a fraction of the time it would take a researcher to perform such queries manually. Results: Here, we present Gene TIssue Expression Ranker (GeneTIER), a new web-based application for candidate gene prioritization. GeneTIER replaces knowledge-based inference traditionally used in candidate disease gene prioritization applications with experimental data from tissue-specific gene expression datasets and thus largely overcomes the bias toward the better characterized genes/diseases that commonly afflict other methods. We show that our approach is capable of accurate candidate gene prioritization and illustrate its strengths and weaknesses using case study examples. Availability and Implementation: Freely available on the web at http://dna.leeds.ac.uk/GeneTIER/. Contact:umaan@leeds.ac.uk Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Agne Antanaviciute
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital and
| | - Catherine Daly
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital and
| | - Laura A Crinnion
- Yorkshire Regional Genetics Service, St James's University Hospital, Leeds, UK
| | - Alexander F Markham
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital and
| | | | - David T Bonthron
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital and
| | - Ian M Carr
- Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital and
| |
Collapse
|
42
|
Priedigkeit N, Wolfe N, Clark NL. Evolutionary signatures amongst disease genes permit novel methods for gene prioritization and construction of informative gene-based networks. PLoS Genet 2015; 11:e1004967. [PMID: 25679399 PMCID: PMC4334549 DOI: 10.1371/journal.pgen.1004967] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2014] [Accepted: 12/19/2014] [Indexed: 12/27/2022] Open
Abstract
Genes involved in the same function tend to have similar evolutionary histories, in that their rates of evolution covary over time. This coevolutionary signature, termed Evolutionary Rate Covariation (ERC), is calculated using only gene sequences from a set of closely related species and has demonstrated potential as a computational tool for inferring functional relationships between genes. To further define applications of ERC, we first established that roughly 55% of genetic diseases posses an ERC signature between their contributing genes. At a false discovery rate of 5% we report 40 such diseases including cancers, developmental disorders and mitochondrial diseases. Given these coevolutionary signatures between disease genes, we then assessed ERC's ability to prioritize known disease genes out of a list of unrelated candidates. We found that in the presence of an ERC signature, the true disease gene is effectively prioritized to the top 6% of candidates on average. We then apply this strategy to a melanoma-associated region on chromosome 1 and identify MCL1 as a potential causative gene. Furthermore, to gain global insight into disease mechanisms, we used ERC to predict molecular connections between 310 nominally distinct diseases. The resulting “disease map” network associates several diseases with related pathogenic mechanisms and unveils many novel relationships between clinically distinct diseases, such as between Hirschsprung's disease and melanoma. Taken together, these results demonstrate the utility of molecular evolution as a gene discovery platform and show that evolutionary signatures can be used to build informative gene-based networks. Molecular evolution has informed our understanding of gene function; however, classical methods have largely been static in their implementation, focusing on single genes. Here, we present and prove the utility of a dynamic, network-based understanding of molecular evolution to infer relationships between genes associated with human diseases. We have shown previously that groups of genes within functional niches tend to share similar evolutionary histories. Exploiting the availability of whole genomes from multiple species, these histories can be numerically scored and dynamically compared to one another using a sequence-based signature termed Evolutionary Rate Covariation (ERC). To explore potential applications, we characterized ERC amongst disease genes and found that many diseases contain significant ERC signatures between their contributing genes. We show that ERC can also prioritize “true” disease genes amongst unrelated gene candidates. Lastly, these signatures can serve as a foundation for creating instructive gene-based networks, unveiling novel relationships between diseases thought to be clinically distinct. Our hope is that this study will add to the increasing evidence that advancing our understanding of molecular evolution can be a crucial asset in large-scale gene discovery pursuits (Link to our webserver that provides intuitive ERC analysis tools: http://csb.pitt.edu/erc_analysis/).
Collapse
Affiliation(s)
- Nolan Priedigkeit
- Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Department of Pharmacology and Chemical Biology, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Nicholas Wolfe
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Nathan L. Clark
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- * E-mail:
| |
Collapse
|
43
|
Soul J, Hardingham TE, Boot-Handford RP, Schwartz JM. PhenomeExpress: a refined network analysis of expression datasets by inclusion of known disease phenotypes. Sci Rep 2015; 5:8117. [PMID: 25631385 PMCID: PMC4822650 DOI: 10.1038/srep08117] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2014] [Accepted: 12/19/2014] [Indexed: 12/19/2022] Open
Abstract
We describe a new method, PhenomeExpress, for the analysis of transcriptomic datasets to identify pathogenic disease mechanisms. Our analysis method includes input from both protein-protein interaction and phenotype similarity networks. This introduces valuable information from disease relevant phenotypes, which aids the identification of sub-networks that are significantly enriched in differentially expressed genes and are related to the disease relevant phenotypes. This contrasts with many active sub-network detection methods, which rely solely on protein-protein interaction networks derived from compounded data of many unrelated biological conditions and which are therefore not specific to the context of the experiment. PhenomeExpress thus exploits readily available animal model and human disease phenotype information. It combines this prior evidence of disease phenotypes with the experimentally derived disease data sets to provide a more targeted analysis. Two case studies, in subchondral bone in osteoarthritis and in Pax5 in acute lymphoblastic leukaemia, demonstrate that PhenomeExpress identifies core disease pathways in both mouse and human disease expression datasets derived from different technologies. We also validate the approach by comparison to state-of-the-art active sub-network detection methods, which reveals how it may enhance the detection of molecular phenotypes and provide a more detailed context to those previously identified as possible candidates.
Collapse
Affiliation(s)
- Jamie Soul
- Wellcome Trust Centre for Cell-Matrix Research, Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, UK
| | - Timothy E Hardingham
- Wellcome Trust Centre for Cell-Matrix Research, Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, UK
| | - Raymond P Boot-Handford
- Wellcome Trust Centre for Cell-Matrix Research, Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, UK
| | - Jean-Marc Schwartz
- Wellcome Trust Centre for Cell-Matrix Research, Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, UK
| |
Collapse
|
44
|
Nandal UK, Vlietstra WJ, Byrman C, Jeeninga RE, Ringrose JH, van Kampen AHC, Speijer D, Moerland PD. Candidate prioritization for low-abundant differentially expressed proteins in 2D-DIGE datasets. BMC Bioinformatics 2015; 16:25. [PMID: 25627479 PMCID: PMC4384356 DOI: 10.1186/s12859-015-0455-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2014] [Accepted: 01/09/2015] [Indexed: 01/17/2023] Open
Abstract
Background Two-dimensional differential gel electrophoresis (2D-DIGE) provides a powerful technique to separate proteins on their isoelectric point and apparent molecular mass and quantify changes in protein expression. Abundantly available proteins in spots can be identified using mass spectrometry-based approaches. However, identification is often not possible for low-abundant proteins. Results We present a novel computational approach to prioritize candidate proteins for unidentified spots. Our approach exploits noisy information on the isoelectric point and apparent molecular mass of a protein spot in combination with functional similarities of candidate proteins to already identified proteins to select and rank candidates. We evaluated our method on a 2D-DIGE dataset comparing protein expression in uninfected and HIV-1 infected T-cells. Using leave-one-out cross-validation, we show that the true-positive rate for the top-5 ranked proteins is 43.8%. Conclusions Our approach shows good performance on a 2D-DIGE dataset comparing protein expression in uninfected and HIV-1 infected T-cells. We expect our method to be highly useful in (re-)mining other 2D-DIGE experiments in which especially the low-abundant protein spots remain to be identified. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0455-x) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Umesh K Nandal
- Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, DE Amsterdam, 1100, The Netherlands.
| | - Wytze J Vlietstra
- Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, DE Amsterdam, 1100, The Netherlands.
| | - Carsten Byrman
- Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, DE Amsterdam, 1100, The Netherlands.
| | - Rienk E Jeeninga
- Laboratory of Experimental Virology, Department of Medical Microbiology, Center for Infection and Immunity Amsterdam (CINIMA), Academic Medical Center, University of Amsterdam, PO Box 22700, DE Amsterdam, 1100, The Netherlands.
| | - Jeffrey H Ringrose
- Laboratory of Experimental Virology, Department of Medical Microbiology, Center for Infection and Immunity Amsterdam (CINIMA), Academic Medical Center, University of Amsterdam, PO Box 22700, DE Amsterdam, 1100, The Netherlands.
| | - Antoine H C van Kampen
- Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, DE Amsterdam, 1100, The Netherlands. .,Biosystems Data Analysis Group, University of Amsterdam, Science Park 9041098, XH Amsterdam, The Netherlands.
| | - Dave Speijer
- Department of Medical Biochemistry, Academic Medical Center, University of Amsterdam, PO Box 22700, DE Amsterdam, 1100, The Netherlands.
| | - Perry D Moerland
- Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, DE Amsterdam, 1100, The Netherlands.
| |
Collapse
|
45
|
Bargsten JW, Nap JP, Sanchez-Perez GF, van Dijk ADJ. Prioritization of candidate genes in QTL regions based on associations between traits and biological processes. BMC PLANT BIOLOGY 2014; 14:330. [PMID: 25492368 PMCID: PMC4274756 DOI: 10.1186/s12870-014-0330-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/15/2014] [Accepted: 11/10/2014] [Indexed: 05/18/2023]
Abstract
BACKGROUND Elucidation of genotype-to-phenotype relationships is a major challenge in biology. In plants, it is the basis for molecular breeding. Quantitative Trait Locus (QTL) mapping enables to link variation at the trait level to variation at the genomic level. However, QTL regions typically contain tens to hundreds of genes. In order to prioritize such candidate genes, we show that we can identify potentially causal genes for a trait based on overrepresentation of biological processes (gene functions) for the candidate genes in the QTL regions of that trait. RESULTS The prioritization method was applied to rice QTL data, using gene functions predicted on the basis of sequence- and expression-information. The average reduction of the number of genes was over ten-fold. Comparison with various types of experimental datasets (including QTL fine-mapping and Genome Wide Association Study results) indicated both statistical significance and biological relevance of the obtained connections between genes and traits. A detailed analysis of flowering time QTLs illustrates that genes with completely unknown function are likely to play a role in this important trait. CONCLUSIONS Our approach can guide further experimentation and validation of causal genes for quantitative traits. This way it capitalizes on QTL data to uncover how individual genes influence trait variation.
Collapse
Affiliation(s)
- Joachim W Bargsten
- />Applied Bioinformatics, Bioscience, Plant Sciences Group, Wageningen University and Research Centre, Wageningen, The Netherlands
- />Netherlands Bioinformatics Centre (NBIC), Nijmegen, The Netherlands
- />Laboratory for Plant Breeding, Plant Sciences Group, Wageningen University and Research Centre, Wageningen, The Netherlands
| | - Jan-Peter Nap
- />Applied Bioinformatics, Bioscience, Plant Sciences Group, Wageningen University and Research Centre, Wageningen, The Netherlands
- />Netherlands Bioinformatics Centre (NBIC), Nijmegen, The Netherlands
| | - Gabino F Sanchez-Perez
- />Applied Bioinformatics, Bioscience, Plant Sciences Group, Wageningen University and Research Centre, Wageningen, The Netherlands
- />Laboratory of Bioinformatics, Plant Sciences Group, Wageningen University and Research Centre, Wageningen, The Netherlands
| | - Aalt DJ van Dijk
- />Applied Bioinformatics, Bioscience, Plant Sciences Group, Wageningen University and Research Centre, Wageningen, The Netherlands
- />Biometris, Wageningen University and Research Centre, Wageningen, The Netherlands
| |
Collapse
|
46
|
Joice R, Yasuda K, Shafquat A, Morgan XC, Huttenhower C. Determining microbial products and identifying molecular targets in the human microbiome. Cell Metab 2014; 20:731-741. [PMID: 25440055 PMCID: PMC4254638 DOI: 10.1016/j.cmet.2014.10.003] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
Abstract
Human-associated microbes are the source of many bioactive microbial products (proteins and metabolites) that play key functions both in human host pathways and in microbe-microbe interactions. Culture-independent studies now provide an accelerated means of exploring novel bioactives in the human microbiome; however, intriguingly, a substantial fraction of the microbial metagenome cannot be mapped to annotated genes or isolate genomes and is thus of unknown function. Meta'omic approaches, including metagenomic sequencing, metatranscriptomics, metabolomics, and integration of multiple assay types, represent an opportunity to efficiently explore this large pool of potential therapeutics. In combination with appropriate follow-up validation, high-throughput culture-independent assays can be combined with computational approaches to identify and characterize novel and biologically interesting microbial products. Here we briefly review the state of microbial product identification and characterization and discuss possible next steps to catalog and leverage the large uncharted fraction of the microbial metagenome.
Collapse
Affiliation(s)
- Regina Joice
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Koji Yasuda
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Afrah Shafquat
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Xochitl C Morgan
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
| | - Curtis Huttenhower
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
| |
Collapse
|
47
|
Abstract
MOTIVATION Most existing methods for predicting causal disease genes rely on specific type of evidence, and are therefore limited in terms of applicability. More often than not, the type of evidence available for diseases varies-for example, we may know linked genes, keywords associated with the disease obtained by mining text, or co-occurrence of disease symptoms in patients. Similarly, the type of evidence available for genes varies-for example, specific microarray probes convey information only for certain sets of genes. In this article, we apply a novel matrix-completion method called Inductive Matrix Completion to the problem of predicting gene-disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene-disease associations. We construct features from different biological sources such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive; it can be applied to diseases not seen at training time, unlike traditional matrix-completion approaches and network-based inference methods that are transductive. RESULTS Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better-it has close to one-in-four chance of recovering a true association in the top 100 predictions, compared to the recently proposed Catapult method (second best) that has <15% chance. We demonstrate that the inductive method is particularly effective for a query disease with no previously known gene associations, and for predicting novel genes, i.e. genes that are previously not linked to diseases. Thus the method is capable of predicting novel genes even for well-characterized diseases. We also validate the novelty of predictions by evaluating the method on recently reported OMIM associations and on associations recently reported in the literature. AVAILABILITY Source code and datasets can be downloaded from http://bigdata.ices.utexas.edu/project/gene-disease.
Collapse
Affiliation(s)
- Nagarajan Natarajan
- Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
| | - Inderjit S Dhillon
- Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
| |
Collapse
|
48
|
Jiang L, Edwards SM, Thomsen B, Workman CT, Guldbrandtsen B, Sørensen P. A random set scoring model for prioritization of disease candidate genes using protein complexes and data-mining of GeneRIF, OMIM and PubMed records. BMC Bioinformatics 2014; 15:315. [PMID: 25253562 PMCID: PMC4181406 DOI: 10.1186/1471-2105-15-315] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2013] [Accepted: 09/17/2014] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear. A global data-mining exercise on the biomedical literature can establish the phenotypic profile of genes with respect to their connection to disease phenotypes. The importance of protein-protein interaction networks in the genetic heterogeneity of common diseases or complex traits is becoming increasingly recognized. Thus, the development of a network-based approach combined with phenotypic profiling would be useful for disease gene prioritization. RESULTS We developed a random-set scoring model and implemented it to quantify phenotype relevance in a network-based disease gene-prioritization approach. We validated our approach based on different gene phenotypic profiles, which were generated from PubMed abstracts, OMIM, and GeneRIF records. We also investigated the validity of several vocabulary filters and different likelihood thresholds for predicted protein-protein interactions in terms of their effect on the network-based gene-prioritization approach, which relies on text-mining of the phenotype data. Our method demonstrated good precision and sensitivity compared with those of two alternative complex-based prioritization approaches. We then conducted a global ranking of all human genes according to their relevance to a range of human diseases. The resulting accurate ranking of known causal genes supported the reliability of our approach. Moreover, these data suggest many promising novel candidate genes for human disorders that have a complex mode of inheritance. CONCLUSION We have implemented and validated a network-based approach to prioritize genes for human diseases based on their phenotypic profile. We have devised a powerful and transparent tool to identify and rank candidate genes. Our global gene prioritization provides a unique resource for the biological interpretation of data from genome-wide association studies, and will help in the understanding of how the associated genetic variants influence disease or quantitative phenotypes.
Collapse
Affiliation(s)
- Li Jiang
- Department of Molecular Biology and Genetics, Aarhus University, DK-8830 Tjele, Denmark.
| | | | | | | | | | | |
Collapse
|
49
|
Oellrich A, Koehler S, Washington N, Mungall C, Lewis S, Haendel M, Robinson PN, Smedley D. The influence of disease categories on gene candidate predictions from model organism phenotypes. J Biomed Semantics 2014; 5:S4. [PMID: 25093073 PMCID: PMC4108905 DOI: 10.1186/2041-1480-5-s1-s4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND The molecular etiology is still to be identified for about half of the currently described Mendelian diseases in humans, thereby hindering efforts to find treatments or preventive measures. Advances, such as new sequencing technologies, have led to increasing amounts of data becoming available with which to address the problem of identifying disease genes. Therefore, automated methods are needed that reliably predict disease gene candidates based on available data. We have recently developed Exomiser as a tool for identifying causative variants from exome analysis results by filtering and prioritising using a number of criteria including the phenotype similarity between the disease and mouse mutants involving the gene candidates. Initial investigations revealed a variation in performance for different medical categories of disease, due in part to a varying contribution of the phenotype scoring component. RESULTS In this study, we further analyse the performance of our cross-species phenotype matching algorithm, and examine in more detail the reasons why disease gene filtering based on phenotype data works better for certain disease categories than others. We found that in addition to misleading phenotype alignments between species, some disease categories are still more amenable to automated predictions than others, and that this often ties in with community perceptions on how well the organism works as model. CONCLUSIONS In conclusion, our automated disease gene candidate predictions are highly dependent on the organism used for the predictions and the disease category being studied. Future work on computational disease gene prediction using phenotype data would benefit from methods that take into account the disease category and the source of model organism data.
Collapse
Affiliation(s)
- Anika Oellrich
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA Hinxton, UK
| | - Sebastian Koehler
- Institute for Medical Genetics and Human Genetics, Universitaetsklinikum Charite, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Nicole Washington
- Berkeley Bioinformatics Open-Source Projects, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, CA 94720 Berkeley, USA
| | - Sanger Mouse Genetic Project
- Berkeley Bioinformatics Open-Source Projects, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, CA 94720 Berkeley, USA
| | - Chris Mungall
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA Hinxton, UK
| | - Suzanna Lewis
- Berkeley Bioinformatics Open-Source Projects, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, CA 94720 Berkeley, USA
| | - Melissa Haendel
- Ontology Development Group, OHSU Library, Oregon Health & Science University, 3181 S.W. Sam Jackson Park Rd, OR 97239 Portland, USA
| | - Peter N Robinson
- Institute for Medical Genetics and Human Genetics, Universitaetsklinikum Charite, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Damian Smedley
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1SA Hinxton, UK
| |
Collapse
|
50
|
Guala D, Sjölund E, Sonnhammer ELL. MaxLink: network-based prioritization of genes tightly linked to a disease seed set. Bioinformatics 2014; 30:2689-90. [DOI: 10.1093/bioinformatics/btu344] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023] Open
|