1
Hongyao HE, Chun JI, Xiaoyan G, Fangfang L, Jing Z, Lin Z, Pengxiang Z, Zengchun L. Associative gene networks reveal novel candidates important for ADHD and dyslexia comorbidity. BMC Med Genomics 2023; 16:208. [PMID: 37667328 PMCID: PMC10478365 DOI: 10.1186/s12920-023-01502-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2022] [Accepted: 03/26/2023] [Indexed: 09/06/2023] Open
Abstract
BACKGROUND Attention deficit hyperactivity disorder (ADHD) is commonly associated with developmental dyslexia (DD). Both are prevalent and complex pediatric neurodevelopmental disorders that significantly affect children's learning and development. Clinically, the comorbidity incidence of DD and ADHD is between 25 and 48%. Children with DD and ADHD may have more severe cognitive deficiencies, a poorer level of schooling, and a higher risk of social and emotional management disorders. Furthermore, patients with this comorbidity are frequently treated for a single condition in clinical settings, and the therapeutic outcome is poor. The development of effective treatment approaches against these diseases is complicated by their comorbidity features, which is often a major problem in diagnosis and treatment. In this study, we developed a bioinformatic methodology for the analysis of the comorbidity of these two diseases. The search for candidate genes related to the comorbid conditions of ADHD and DD can help in elucidating the molecular mechanisms underlying the comorbid condition, and can also be useful for genotyping and identifying new drug targets.
RESULTS Using the ANDSystem tool, gene networks associated with ADHD and dyslexia were reconstructed and analyzed. The gene network of ADHD included 599 genes/proteins and 148,978 interactions, while that of dyslexia included 167 genes/proteins and 27,083 interactions. When the ANDSystem and GeneCards data were combined, a total of 213 genes/proteins for ADHD and dyslexia were found. An approach for ranking genes implicated in the comorbid condition of the two diseases was proposed. The approach is based on ten criteria for ranking genes by their importance, including relevance scores of association between disease and genes, standard methods of gene prioritization, as well as original criteria that take into account the characteristics of an associative gene network and the presence of known polymorphisms in the analyzed genes. Among the top 20 genes with the highest priority, DRD2, DRD4, CNTNAP2 and GRIN2B are mentioned in the literature as directly linked with the comorbidity of ADHD and dyslexia. According to the proposed approach, the genes OPRM1, CHRNA4 and SNCA had the highest priority in the development of comorbidity of these two diseases. Additionally, it was revealed that the most relevant genes are involved in biological processes related to signal transduction, positive regulation of transcription from RNA polymerase II promoters, chemical synaptic transmission, response to drugs, ion transmembrane transport, nervous system development, cell adhesion, and neuron migration.
CONCLUSIONS The application of methods of reconstruction and analysis of gene networks is a powerful tool for studying the molecular mechanisms of comorbid conditions. The method put forth to rank genes by their importance for the comorbid condition of ADHD and dyslexia was employed to predict genes that play key roles in the development of the comorbid condition. The results can be utilized to plan experiments for the identification of novel candidate genes and to search for novel pharmacological targets.
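The abstract describes ranking genes by several criteria and combining them into one priority list, without stating how the criteria are combined. As a minimal sketch only, the following assumes a simple mean-rank aggregation; the function names, criterion names, gene labels, and scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def rank_scores(scores, higher_is_better=True):
    """Return {gene: rank}, rank 1 = best; tied scores share their average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of genes tied with position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = avg_rank
        i = j + 1
    return ranks


def aggregate_ranks(criteria):
    """criteria: {criterion: {gene: score}} -> [(gene, mean_rank)], best first."""
    per_criterion = [rank_scores(s) for s in criteria.values()]
    # Only genes scored under every criterion are compared (an assumption).
    genes = set.intersection(*(set(r) for r in per_criterion))
    mean_rank = {g: mean(r[g] for r in per_criterion) for g in genes}
    return sorted(mean_rank.items(), key=lambda kv: (kv[1], kv[0]))


# Toy scores invented for illustration; higher means "more important" here.
criteria = {
    "relevance": {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.7},
    "network_degree": {"GENE_A": 12, "GENE_B": 30, "GENE_C": 20},
    "polymorphisms": {"GENE_A": 5, "GENE_B": 1, "GENE_C": 3},
}
ranking = aggregate_ranks(criteria)
```

Mean rank treats every criterion equally; a weighted scheme or a different tie rule would be an equally plausible reading of the abstract.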
Affiliation(s)
- He Hongyao
  - Medical College of Shihezi University, Shihezi, China
- Ji Chun
  - Medical College of Shihezi University, Shihezi, China
- Gao Xiaoyan
  - Medical College of Shihezi University, Shihezi, China
- Liu Fangfang
  - Medical College of Shihezi University, Shihezi, China
- Zhang Jing
  - Medical College of Shihezi University, Shihezi, China
- Zhong Lin
  - Medical College of Shihezi University, Shihezi, China
- Zuo Pengxiang
  - Medical College of Shihezi University, Shihezi, China
- Li Zengchun
  - Medical College of Shihezi University, Shihezi, China
2
Tziastoudi M, Tsezou A, Stefanidis I. Cadherin and Wnt signaling pathways as key regulators in diabetic nephropathy. PLoS One 2021; 16:e0255728. [PMID: 34411124 PMCID: PMC8375992 DOI: 10.1371/journal.pone.0255728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/11/2020] [Accepted: 07/22/2021] [Indexed: 12/14/2022] Open
Abstract
AIM A recent meta-analysis of genome-wide linkage studies (GWLS) has identified multiple genetic regions suggestive of linkage with diabetic nephropathy (DN), harboring hundreds of genes. Moving this number of genetic loci forward into biological insight is the next step. Here, we approach this challenge with a gene ontology (GO) analysis to assign biological and functional roles to the genes, an over-representation test to find which GO terms are enriched in the gene list, pathway analysis, and protein network analysis.
METHOD GO analysis was performed using protein analysis through evolutionary relationships (PANTHER) version 14.0 software, and P-values less than 0.05 were considered statistically significant. GO analysis was followed by an over-representation test for the identification of enriched terms. Statistical significance was calculated by Fisher's exact test and adjusted using the false discovery rate (FDR) for correction of multiple tests. Cytoscape with the relevant plugins was used for the construction of the protein network and clustering analysis.
RESULTS The GO analysis assigns multiple GO terms to the genes regarding the molecular function, the biological process and the cellular component, protein class and pathway analysis. The findings of the over-representation test highlight the contribution of cell adhesion regarding the biological process, integral components of plasma membrane regarding the cellular component, and chemokines and cytokines with regard to protein class, while the pathway analysis emphasizes the contribution of Wnt and cadherin signaling pathways.
CONCLUSIONS Our results suggest that a core feature of the pathogenesis of DN may be a disturbance in Wnt and cadherin signaling pathways, whereas the contribution of chemokines and cytokines needs to be examined in additional studies.
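The over-representation test with FDR correction named in the METHOD section can be illustrated with a small standard-library sketch. It assumes a one-sided Fisher's exact test (the hypergeometric upper tail) and the Benjamini-Hochberg procedure; whether PANTHER uses exactly these variants is not stated in the abstract, and the function names and counts below are hypothetical.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact p-value for over-representation.

    k: genes in the list annotated with the term
    n: size of the gene list
    K: annotated genes in the background
    N: size of the background
    """
    upper = min(n, K)
    total = comb(N, n)
    # Probability of seeing k or more annotated genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts: (hits in list, list size, hits in background, background size).
terms = {
    "term_1": (8, 50, 200, 20000),
    "term_2": (2, 50, 900, 20000),
    "term_3": (1, 50, 40, 20000),
}
raw = [enrichment_p(*counts) for counts in terms.values()]
fdr = dict(zip(terms, benjamini_hochberg(raw)))
```

Terms whose adjusted value falls below the chosen threshold (0.05 in the abstract) would be reported as enriched.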
Affiliation(s)
- Maria Tziastoudi
  - Department of Nephrology, School of Medicine, University of Thessaly, Larissa, Greece
- Aspasia Tsezou
  - Laboratory of Biology, Faculty of Medicine, School of Health Sciences, University of Thessaly, Larissa, Greece
  - Laboratory of Cytogenetics and Molecular Genetics, Faculty of Medicine, School of Health Sciences, University of Thessaly, Larissa, Greece
- Ioannis Stefanidis
  - Department of Nephrology, School of Medicine, University of Thessaly, Larissa, Greece
3
Luo P, Tian LP, Chen B, Xiao Q, Wu FX. Ensemble disease gene prediction by clinical sample-based networks. BMC Bioinformatics 2020; 21:79. [PMID: 32164526 PMCID: PMC7068856 DOI: 10.1186/s12859-020-3346-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Disease gene prediction is a critical and challenging task. Many computational methods have been developed to predict disease genes, which can reduce the money and time spent on experimental validation. Since proteins (products of genes) usually work together to achieve a specific function, biomolecular networks, such as the protein-protein interaction (PPI) network and gene co-expression networks, are widely used to predict disease genes by analyzing the relationships between known disease genes and other genes in the networks. However, existing methods commonly use a universal static PPI network, which ignores the fact that PPIs are dynamic and that PPIs in different patients should also differ.
RESULTS To address these issues, we develop an ensemble algorithm to predict disease genes from clinical sample-based networks (EdgCSN). The algorithm first constructs single sample-based networks for each case sample of the disease under study. Then, these single sample-based networks are merged into several fused networks based on the clustering results of the samples. After that, logistic models are trained with centrality features extracted from the fused networks, and an ensemble strategy is used to predict the final probability of each gene being disease-associated. EdgCSN is evaluated on breast cancer (BC), thyroid cancer (TC) and Alzheimer's disease (AD) and obtains AUC values of 0.970, 0.971 and 0.966, respectively, which are much better than those of the competing algorithms. Subsequent de novo validations also demonstrate the ability of EdgCSN to predict new disease genes.
CONCLUSIONS In this study, we propose EdgCSN, which is an ensemble learning algorithm for predicting disease genes with models trained by centrality features extracted from clinical sample-based networks. Results of the leave-one-out cross validation show that EdgCSN performs much better than the competing algorithms in predicting BC-associated, TC-associated and AD-associated genes. De novo validations also show that EdgCSN is valuable for identifying new disease genes.
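The AUC values quoted above summarize how well predicted probabilities separate known disease genes from other genes. As a hedged illustration of the metric only (not of EdgCSN itself), the sketch below computes AUC as the fraction of positive/negative pairs ranked correctly; the function name, scores, and labels are invented for the example.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative; tied pairs count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions: 1 = known disease gene, 0 = other gene.
example_scores = [0.9, 0.2, 0.8, 0.1]
example_labels = [1, 1, 0, 0]
example_auc = auc(example_scores, example_labels)
```

The pairwise form is quadratic in the number of genes; a rank-based formulation gives the same value more efficiently on genome-scale lists.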
Affiliation(s)
- Ping Luo
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada
- Li-Ping Tian
  - School of Information, Beijing Wuzi University, Beijing, 101149, China
- Bolin Chen
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Qianghua Xiao
  - School of Mathematics and Physics, University of South China, HengYang, 421001, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada
  - Department of Computer Science, University of Saskatchewan, Saskatoon, S7N 5C9, Canada
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, 571158, China
  - Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada
4
Oerton E, Roberts I, Lewis PSH, Guilliams T, Bender A. Understanding and predicting disease relationships through similarity fusion. Bioinformatics 2020; 35:1213-1220. [PMID: 30169824 PMCID: PMC6449746 DOI: 10.1093/bioinformatics/bty754] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/23/2018] [Revised: 08/09/2018] [Accepted: 08/29/2018] [Indexed: 12/15/2022] Open
Abstract
Motivation Combining disease relationships across multiple biological levels could aid our understanding of common processes taking place in disease, potentially indicating opportunities for drug sharing. Here, we propose a similarity fusion approach which accounts for differences in information content between different data types, allowing combination of each data type in a balanced manner. Results We apply this method to six different types of biological data (ontological, phenotypic, literature co-occurrence, genetic association, gene expression and drug indication data) for 84 diseases to create a ‘disease map’: a network of diseases connected at one or more biological levels. As well as reconstructing known disease relationships, 15% of links in the disease map are novel links spanning traditional ontological classes, such as between psoriasis and inflammatory bowel disease. 62% of links in the disease map represent drug-sharing relationships, illustrating the relevance of the similarity fusion approach to the identification of potential therapeutic relationships. Availability and implementation Freely available under the MIT license at https://github.com/e-oerton/disease-similarity-fusion Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Erin Oerton
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
  - Healx Ltd, Park House, Castle Park, Cambridge, UK
- Ian Roberts
  - Healx Ltd, Park House, Castle Park, Cambridge, UK
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
  - Healx Ltd, Park House, Castle Park, Cambridge, UK
5
Tran VD, Sperduti A, Backofen R, Costa F. Heterogeneous networks integration for disease–gene prioritization with node kernels. Bioinformatics 2020; 36:2649-2656. [DOI: 10.1093/bioinformatics/btaa008] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/22/2019] [Revised: 12/19/2019] [Accepted: 01/23/2020] [Indexed: 12/21/2022] Open
Abstract
Motivation
The identification of disease–gene associations is a task of fundamental importance in human health research. A typical approach consists in first encoding large gene/protein relational datasets as networks due to the natural and intuitive property of graphs for representing objects’ relationships and then utilizing graph-based techniques to prioritize genes for successive low-throughput validation assays. Since different types of interactions between genes yield distinct gene networks, there is the need to integrate different heterogeneous sources to improve the reliability of prioritization systems.
Results
We propose an approach based on three phases: first, we merge all sources in a single network, then we partition the integrated network according to edge density introducing a notion of edge type to distinguish the parts and finally, we employ a novel node kernel suitable for graphs with typed edges. We show how the node kernel can generate a large number of discriminative features that can be efficiently processed by linear regularized machine learning classifiers. We report state-of-the-art results on 12 disease–gene associations and on a time-stamped benchmark containing 42 newly discovered associations.
Availability and implementation
Source code: https://github.com/dinhinfotech/DiGI.git.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Van Dinh Tran
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
- Rolf Backofen
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
  - Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Germany
- Fabrizio Costa
  - Department of Computer Science, University of Exeter, Exeter, UK
6
Arabfard M, Ohadi M, Rezaei Tabar V, Delbari A, Kavousi K. Genome-wide prediction and prioritization of human aging genes by data fusion: a machine learning approach. BMC Genomics 2019; 20:832. [PMID: 31706268 PMCID: PMC6842548 DOI: 10.1186/s12864-019-6140-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/22/2019] [Accepted: 09/25/2019] [Indexed: 12/11/2022] Open
Abstract
Background Machine learning can effectively nominate novel genes for various research purposes in the laboratory. On a genome-wide scale, we implemented multiple databases and algorithms to predict and prioritize the human aging genes (PPHAGE).
Results We fused data from 11 databases, and used a Naïve Bayes classifier and positive unlabeled learning (PUL) methods, NB, Spy, and Rocchio-SVM, to rank human genes with respect to their implication in aging. The PUL methods enabled us to identify a list of negative (non-aging) genes to use alongside the seed (known age-related) genes in the ranking process. Comparison of the PUL algorithms revealed that none of the methods for identifying a negative sample was advantageous over the others, and that their simultaneous use in a form of fusion was critical for obtaining optimal results (PPHAGE is publicly available at https://cbb.ut.ac.ir/pphage).
Conclusion We predict and prioritize over 3,000 candidate age-related genes in humans, based on significant ranking scores. The identified candidate genes are associated with pathways, ontologies, and diseases that are linked to aging, such as cancer and diabetes. Our data offer a platform for future experimental research on the genetic and biological aspects of aging. Additionally, we demonstrate that fusion of PUL methods and data sources can be successfully used for aging and disease candidate gene prioritization.
Affiliation(s)
- Masoud Arabfard
  - Department of Bioinformatics, Kish International Campus University of Tehran, Kish, Iran
  - Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Mina Ohadi
  - Iranian Research Center on Aging, University of Social Welfare and Rehabilitation Sciences, Tehran, Iran
- Vahid Rezaei Tabar
  - Department of Statistics, Faculty of Mathematical Sciences and Computer, Allameh Tabataba'i University, Tehran, Iran
- Ahmad Delbari
  - Iranian Research Center on Aging, University of Social Welfare and Rehabilitation Sciences, Tehran, Iran
- Kaveh Kavousi
  - Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
7
Kumar AA, Van Laer L, Alaerts M, Ardeshirdavani A, Moreau Y, Laukens K, Loeys B, Vandeweyer G. pBRIT: gene prioritization by correlating functional and phenotypic annotations through integrative data fusion. Bioinformatics 2018; 34:2254-2262. [PMID: 29452392 PMCID: PMC6022555 DOI: 10.1093/bioinformatics/bty079] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 09/16/2017] [Revised: 01/25/2018] [Accepted: 02/12/2018] [Indexed: 12/31/2022] Open
Abstract
Motivation Computational gene prioritization can aid in disease gene identification. Here, we propose pBRIT (prioritization using Bayesian Ridge regression and Information Theoretic model), a novel adaptive and scalable prioritization tool, integrating Pubmed abstracts, Gene Ontology, Sequence similarities, Mammalian and Human Phenotype Ontology, Pathway, Interactions, Disease Ontology, Gene Association database and Human Genome Epidemiology database, into the prediction model. We explore and address effects of sparsity and inter-feature dependencies within annotation sources, and the impact of bias towards specific annotations. Results pBRIT models feature dependencies and sparsity by an Information-Theoretic (data driven) approach and applies intermediate integration based data fusion. Following the hypothesis that genes underlying similar diseases will share functional and phenotype characteristics, it incorporates Bayesian Ridge regression to learn a linear mapping between functional and phenotype annotations. Genes are prioritized on phenotypic concordance to the training genes. We evaluated pBRIT against nine existing methods, and on over 2000 HPO-gene associations retrieved after construction of pBRIT data sources. We achieve maximum AUC scores ranging from 0.92 to 0.96 against benchmark datasets and of 0.80 against the time-stamped HPO entries, indicating good performance with high sensitivity and specificity. Our model shows stable performance with regard to changes in the underlying annotation data, is fast and scalable for implementation in routine pipelines. Availability and implementation http://biomina.be/apps/pbrit/; https://bitbucket.org/medgenua/pbrit. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ajay Anand Kumar
  - Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
  - Biomedical Informatics Research Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium
- Lut Van Laer
  - Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
- Maaike Alaerts
  - Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
- Amin Ardeshirdavani
  - Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  - imec, Leuven, Belgium
- Yves Moreau
  - Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  - imec, Leuven, Belgium
- Kris Laukens
  - Biomedical Informatics Research Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium
  - ADReM Data Laboratory, University of Antwerp, Antwerp, Belgium
- Bart Loeys
  - Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
- Geert Vandeweyer
  - Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
  - Biomedical Informatics Research Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium
8
Tran Van D, Sperduti A, Costa F. The conjunctive disjunctive graph node kernel for disease gene prioritization. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.01.089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/17/2022]
9
Saik OV, Demenkov PS, Ivanisenko TV, Bragina EY, Freidin MB, Goncharova IA, Dosenko VE, Zolotareva OI, Hofestaedt R, Lavrik IN, Rogaev EI, Ivanisenko VA. Novel candidate genes important for asthma and hypertension comorbidity revealed from associative gene networks. BMC Med Genomics 2018; 11:15. [PMID: 29504915 PMCID: PMC6389037 DOI: 10.1186/s12920-018-0331-4] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Hypertension and bronchial asthma are major issues for people's health. As of 2014, approximately one billion adults, or ~22% of the world population, have had hypertension. As of 2011, 235-330 million people globally have been affected by asthma and approximately 250,000-345,000 people have died each year from the disease. The development of effective treatment therapies against these diseases is complicated by their comorbidity features, which is often a major problem in their diagnosis and treatment. Hence, in this study a bioinformatic methodology for the analysis of the comorbidity of these two diseases has been developed. The search for candidate genes related to the comorbid conditions of asthma and hypertension can help in elucidating the molecular mechanisms underlying the comorbid condition of these two diseases, and can also be useful for genotyping and identifying new drug targets.
RESULTS Using ANDSystem, gene networks associated with asthma and hypertension were reconstructed and analysed. The gene network of asthma included 755 genes/proteins and 62,603 interactions, while the gene network of hypertension included 713 genes/proteins and 45,479 interactions. Two hundred and five genes/proteins and 9638 interactions were shared between asthma and hypertension. An approach for ranking genes implicated in the comorbid condition of two diseases was proposed. The approach is based on nine criteria for ranking genes by their importance, including standard methods of gene prioritization (Endeavor, ToppGene) as well as original criteria that take into account the characteristics of an associative gene network and the presence of known polymorphisms in the analysed genes. According to the proposed approach, the genes IL10, TLR4, and CAT had the highest priority in the development of comorbidity of these two diseases. Additionally, it was revealed that the list of top genes is enriched with apoptotic genes and genes involved in biological processes related to the functioning of the central nervous system.
CONCLUSIONS The application of methods of reconstruction and analysis of gene networks is a productive tool for studying the molecular mechanisms of comorbid conditions. The method put forth to rank genes by their importance to the comorbid condition of asthma and hypertension resulted in the prediction of 10 genes playing a key role in the development of the comorbid condition. The results can be utilised to plan experiments for the identification of novel candidate genes and to search for novel pharmacological targets.
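The counts of genes/proteins and interactions shared between the two disease networks amount to intersecting node sets and edge sets. The sketch below shows that operation under the assumption that interactions are undirected and untyped, which may not match how ANDSystem represents them; the function names and the placeholder gene labels are hypothetical.

```python
def nodes_of(edges):
    """Collect every gene/protein that appears in at least one interaction."""
    return {node for edge in edges for node in edge}


def shared_subnetwork(edges_a, edges_b):
    """Return (shared_nodes, shared_edges) for two undirected edge lists."""
    # frozenset makes ("G1", "G2") and ("G2", "G1") the same interaction.
    set_a = {frozenset(e) for e in edges_a}
    set_b = {frozenset(e) for e in edges_b}
    return nodes_of(set_a) & nodes_of(set_b), set_a & set_b


# Toy networks with placeholder names (not data from the paper).
asthma = [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]
hypertension = [("G2", "G1"), ("G3", "G5"), ("G4", "G5")]
shared_nodes, shared_edges = shared_subnetwork(asthma, hypertension)
```

If interaction type or direction matters, each edge would need to carry that information (for example as a tuple of source, target, and type) before intersecting.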
Affiliation(s)
- Olga V. Saik
  - Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia
- Pavel S. Demenkov
  - Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia
- Timofey V. Ivanisenko
  - Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia
- Elena Yu Bragina
  - Research Institute of Medical Genetics, Tomsk NRMC, Tomsk, Russia
- Maxim B. Freidin
  - Research Institute of Medical Genetics, Tomsk NRMC, Tomsk, Russia
- Olga I. Zolotareva
  - Bielefeld University, International Research Training Group “Computational Methods for the Analysis of the Diversity and Dynamics of Genomes”, Bielefeld, Germany
- Ralf Hofestaedt
  - Bielefeld University, Technical Faculty, AG Bioinformatics and Medical Informatics, Bielefeld, Germany
- Inna N. Lavrik
  - Department of Translational Inflammation, Institute of Experimental Internal Medicine, Otto von Guericke University, Magdeburg, Germany
- Evgeny I. Rogaev
  - Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia
  - University of Massachusetts Medical School, Worcester, MA USA
  - Department of Genomics and Human Genetics, Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - Center for Genetics and Genetic Technologies, Faculty of Biology, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
- Vladimir A. Ivanisenko
  - Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia
10
Zampieri G, Tran DV, Donini M, Navarin N, Aiolli F, Sperduti A, Valle G. Scuba: scalable kernel-based gene prioritization. BMC Bioinformatics 2018; 19:23. [PMID: 29370760 PMCID: PMC5785908 DOI: 10.1186/s12859-018-2025-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 12/30/2016] [Accepted: 01/15/2018] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge; however, their practical implementation is often precluded by their limited scalability.
RESULTS We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large scale predictions are required. Importantly, it is able to deal efficiently both with a large number of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba also integrates a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods.
CONCLUSIONS Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data is highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba.
Affiliation(s)
- Guido Zampieri
  - CRIBI Biotechnology Center, University of Padova, viale G. Colombo, 3, Padova, Italy
  - Department of Women's and Children's Health, University of Padova, via Giustiniani, 3, Padova, Italy
- Dinh Van Tran
  - Department of Mathematics, University of Padova, via Trieste, 63, Padova, Italy
- Michele Donini
  - Istituto Italiano di Tecnologia, Via Morego, 30, Genoa, Italy
- Nicolò Navarin
  - Department of Mathematics, University of Padova, via Trieste, 63, Padova, Italy
- Fabio Aiolli
  - Department of Mathematics, University of Padova, via Trieste, 63, Padova, Italy
- Alessandro Sperduti
  - Department of Mathematics, University of Padova, via Trieste, 63, Padova, Italy
- Giorgio Valle
  - CRIBI Biotechnology Center, University of Padova, viale G. Colombo, 3, Padova, Italy
  - Department of Biology, University of Padova, viale G. Colombo, 3, Padova, Italy
11
12
Sreeja A, Vinayan KP. Multidimensional knowledge-based framework is an essential step in the categorization of gene sets in complex disorders. J Bioinform Comput Biol 2017; 15:1750022. [DOI: 10.1142/s0219720017500226] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
In complex disorders, the collaborative role of several genes accounts for the multitude of symptoms, and the discovery of molecular mechanisms requires a proper understanding of the pertinent genes. The majority of recent techniques either utilize a single information source or consolidate independent views from multiple knowledge sources to assist the discovery of candidate genes. Given that various sorts of heterogeneous sources are possibly significant for quality gene prioritization, each source bearing data not conveyed by another, we assert that an ideal strategy ought to provide ways to discriminate among them in a genuinely integrative style that captures the contribution of each, instead of utilizing a straightforward mix of sources. We propose a flexible approach that enables multi-source information integration for quality gene prioritization and that exploits the complementary nature of various knowledge sources so as to utilize the maximum information of aggregated data. To illustrate the proposed approach, we took Autism Spectrum Disorder (ASD) as a case study and validated the framework on benchmark studies. We observed that the combined ranking based on integrated knowledge reduces the false positive observations and boosts the performance when compared with individual rankings. The clinical phenotype validation for ASD shows that there is a significant linkage between top positioned genes and endophenotypes of ASD. Categorization of genes based on endophenotype associations by this method will be useful for further hypothesis generation leading to clinical and translational analysis. This approach may also be useful in other complex neurological and psychiatric disorders with a strong genetic component.
Affiliation(s)
- A. Sreeja
  - Department of Computer Science & IT, School of Arts and Sciences, Amrita University, Kochi, Kerala, India
- K. P. Vinayan
  - Division of Paediatric Neurology, Department of Neurology, Amrita Institute of Medical Sciences, Amrita University, Kochi, Kerala, India
|
13
|
Disease genes prioritizing mechanisms: a comprehensive and systematic literature review. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/s13721-017-0154-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 10/19/2022]
|
14
|
HybridRanker: Integrating network topology and biomedical knowledge to prioritize cancer candidate genes. J Biomed Inform 2016; 64:139-146. [DOI: 10.1016/j.jbi.2016.10.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 03/10/2016] [Revised: 08/13/2016] [Accepted: 10/06/2016] [Indexed: 11/20/2022]
|
15
|
Cardozo T, Gupta P, Ni E, Young LM, Tivon D, Felsovalyi K. Data sources for in vivo molecular profiling of human phenotypes. Wiley Interdiscip Rev Syst Biol Med 2016; 8:472-484. [PMID: 27599755 DOI: 10.1002/wsbm.1354] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/01/2016] [Revised: 06/26/2016] [Accepted: 06/27/2016] [Indexed: 11/08/2022]
Abstract
Molecular profiling of human diseases has been approached at the genetic (DNA), expression (RNA), and proteomic (protein) levels. An important goal of these efforts is to map observed molecular patterns to specific, mechanistic organic entities, such as loci in the genome, individual RNA molecules or defined proteins or protein assemblies. Importantly, such maps have been historically approached in the more intuitive context of a theoretical individual cell, but diseases are better described in reality using an in vivo framework, namely a library of several tissue-specific maps. In this article, we review the existing data atlases that can be used for this purpose and identify critical gaps that could move the field forward from cellular to in vivo dimensions.
Affiliation(s)
- Timothy Cardozo
  - Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
- Priyanka Gupta
  - Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
  - GeneCentrix Inc., New York, NY, USA
- Eric Ni
  - Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
  - GeneCentrix Inc., New York, NY, USA
- Lauren M Young
  - Department of Pathology, NYU School of Medicine, New York, NY, USA
|
16
|
Chen B, Shang X, Li M, Wang J, Wu FX. Identifying Individual-Cancer-Related Genes by Rebalancing the Training Samples. IEEE Trans Nanobioscience 2016; 15:309-315. [PMID: 27093705 DOI: 10.1109/tnb.2016.2553119] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 11/07/2022]
Abstract
The identification of individual-cancer-related genes is typically an imbalanced classification problem. The number of known cancer-related genes is far smaller than the number of all unknown genes, which makes it very hard to detect novel predictions from such imbalanced training samples. A regular machine learning method can either only detect genes related to all cancers or add clinical knowledge to circumvent this issue. In this study, we introduce a training sample rebalancing strategy to overcome this issue by using a two-step logistic regression and a random resampling method. The two-step logistic regression selects a set of genes that are related to all cancers, while the random resampling method is then used to further classify those genes associated with individual cancers. The issue of imbalanced classification is circumvented by randomly adding positive instances related to other cancers at first, and then excluding those unrelated predictions according to the overall performance at the following step. Numerical experiments show that the proposed resampling method is able to identify cancer-related genes even when the number of known genes related to a given cancer is small. The final predictions for all individual cancers achieve AUC values around 0.93 by using the leave-one-out cross validation method, which is very promising compared with existing methods.
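To make the class-imbalance problem concrete, the sketch below rebalances a training set by randomly undersampling the negative class. This is a common generic technique, swapped in for illustration; it is not the two-step logistic regression and resampling procedure described in the abstract. The function name `rebalance`, the `ratio` parameter, and the gene labels are hypothetical.

```python
import random


def rebalance(positives, negatives, ratio=1.0, seed=0):
    """Randomly undersample negatives to about ratio * len(positives).

    Returns a shuffled list of (gene, label) pairs with label 1 for
    positives and 0 for the sampled negatives.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_keep = min(len(negatives), max(1, round(ratio * len(positives))))
    sampled_negatives = rng.sample(list(negatives), n_keep)
    samples = [(g, 1) for g in positives] + [(g, 0) for g in sampled_negatives]
    rng.shuffle(samples)
    return samples


# Hypothetical data: 3 known disease genes against 100 unlabeled genes.
known = ["GENE_A", "GENE_B", "GENE_C"]
unlabeled = [f"GENE_N{i}" for i in range(100)]
training_set = rebalance(known, unlabeled)
```

In practice the resampling would typically be repeated with different seeds and the resulting classifiers averaged, since any single draw discards most of the negative class.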
|
17
|
Chen B, Li M, Wang J, Shang X, Wu FX. A fast and high performance multiple data integration algorithm for identifying human disease genes. BMC Med Genomics 2015; 8 Suppl 3:S2. [PMID: 26399620 PMCID: PMC4582601 DOI: 10.1186/1755-8794-8-s3-s2] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Integrating multiple data sources is indispensable in improving disease gene identification. This is not only because disease genes associated with similar genetic diseases tend to lie close to each other in various biological networks, but also because gene-disease associations are complex. Although various algorithms have been proposed to identify disease genes, their prediction performance and computational time still need further improvement. RESULTS In this study, we propose a fast and high performance multiple data integration algorithm for identifying human disease genes. A posterior probability of each candidate gene associated with individual diseases is calculated by using a Bayesian analysis method and a binary logistic regression model. Two prior probability estimation strategies and two feature vector construction methods are developed to test the performance of the proposed algorithm. CONCLUSIONS The proposed algorithm not only generates predictions with high AUC scores but also runs very fast. When only a single PPI network is employed, the AUC score is 0.769 by using F2 as feature vectors. The average running time for each leave-one-out experiment is only around 1.5 seconds. When three biological networks are integrated, the AUC score using F3 as feature vectors increases to 0.830, and the average running time for each leave-one-out experiment is only about 12.54 seconds. This is better than many existing algorithms.
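The idea of a posterior probability for a candidate gene can be illustrated with a plain application of Bayes' rule. The sketch below is a minimal worked example under stated assumptions, not the algorithm from the abstract: the function name `posterior`, the prior, and the two likelihood values are hypothetical, and the logistic regression step is not modeled.

```python
def posterior(prior, likelihood_if_disease, likelihood_if_not):
    """Bayes' rule for a binary hypothesis (gene is / is not a disease gene).

    prior:                  P(disease gene) before seeing the evidence
    likelihood_if_disease:  P(evidence | disease gene)
    likelihood_if_not:      P(evidence | not a disease gene)
    """
    numerator = prior * likelihood_if_disease
    denominator = numerator + (1.0 - prior) * likelihood_if_not
    if denominator == 0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


# Hypothetical numbers: a 1% prior, and evidence that is nine times more
# likely for a disease gene than for a non-disease gene.
p = posterior(prior=0.01, likelihood_if_disease=0.9, likelihood_if_not=0.1)
```

With these illustrative inputs the posterior is 0.009 / (0.009 + 0.099), roughly 0.083, which shows how a low prior keeps the posterior modest even when the evidence is fairly strong.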
Affiliation(s)
- Bolin Chen
  - School of Computer Science, Northwestern Polytechnical University, 127 Youyi West Road, 710072, Xi'an, P.R. China
- Min Li
  - School of Information Science and Engineering, Central South University, 410083, Changsha, P.R. China
- Jianxin Wang
  - School of Information Science and Engineering, Central South University, 410083, Changsha, P.R. China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, 127 Youyi West Road, 710072, Xi'an, P.R. China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Dr., S7N 5A9, Saskatoon, Canada
  - Department of Mechanical Engineering, University of Saskatchewan, 57 Campus Dr., S7N 5A9, Saskatoon, Canada
|
18
|
Jayaraman A, Jamil K, Khan HA. Identifying new targets in leukemogenesis using computational approaches. Saudi J Biol Sci 2015; 22:610-22. [PMID: 26288567 PMCID: PMC4537869 DOI: 10.1016/j.sjbs.2015.01.012] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/26/2014] [Revised: 01/04/2015] [Accepted: 01/12/2015] [Indexed: 02/08/2023] Open
Abstract
There is a need to identify novel targets in Acute Lymphoblastic Leukemia (ALL), a hematopoietic cancer affecting children, to improve our understanding of disease biology and that can be used for developing new therapeutics. Hence, the aim of our study was to find new genes as targets using in silico studies; for this we retrieved the top 10% overexpressed genes from Oncomine public domain microarray expression database; 530 overexpressed genes were short-listed from Oncomine database. Then, using prioritization tools such as ENDEAVOUR, DIR and TOPPGene online tools, we found fifty-four genes common to the three prioritization tools which formed our candidate leukemogenic genes for this study. As per the protocol we selected thirty training genes from PubMed. The prioritized and training genes were then used to construct STRING functional association network, which was further analyzed using cytoHubba hub analysis tool to investigate new genes which could form drug targets in leukemia. Analysis of the STRING protein network built from these prioritized and training genes led to identification of two hub genes, SMAD2 and CDK9, which were not implicated in leukemogenesis earlier. Filtering out from several hundred genes in the network we also found MEN1, HDAC1 and LCK genes, which re-emphasized the important role of these genes in leukemogenesis. This is the first report on these five additional signature genes in leukemogenesis. We propose these as new targets for developing novel therapeutics and also as biomarkers in leukemogenesis, which could be important for prognosis and diagnosis.
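The step of keeping only the genes common to several prioritization tools amounts to a set intersection. The sketch below illustrates that step alone; the function name `common_candidates` and the placeholder gene labels are hypothetical and do not reproduce any output of ENDEAVOUR, DIR or TOPPGene.

```python
def common_candidates(*tool_outputs):
    """Return the genes reported by every tool, sorted by name."""
    if not tool_outputs:
        return []
    shared = set(tool_outputs[0])
    for genes in tool_outputs[1:]:
        shared &= set(genes)  # keep only genes seen in every list so far
    return sorted(shared)


# Hypothetical top-ranked lists from three tools (placeholder labels).
tool_1 = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
tool_2 = ["GENE_B", "GENE_C", "GENE_E"]
tool_3 = ["GENE_C", "GENE_B", "GENE_F"]
consensus = common_candidates(tool_1, tool_2, tool_3)
```

Requiring agreement across all tools is a strict filter; it trades recall for confidence, which suits a short list intended for downstream network analysis.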
Affiliation(s)
- Archana Jayaraman
  - Centre for Biotechnology and Bioinformatics, School of Life Sciences, Jawaharlal Nehru Institute of Advanced Studies (JNIAS), Secunderabad, Telangana, India
  - Center for Biotechnology, Jawaharlal Nehru Technological University (JNTUH), Kukatpally, Hyderabad, Telangana, India
- Kaiser Jamil
  - Centre for Biotechnology and Bioinformatics, School of Life Sciences, Jawaharlal Nehru Institute of Advanced Studies (JNIAS), Secunderabad, Telangana, India
  - Corresponding author at: Centre for Biotechnology and Bioinformatics, School of Life Sciences, Jawaharlal Nehru Institute of Advanced Studies (JNIAS), Buddha Bhawan, 6th Floor, M.G. Road, Secunderabad 500003, Telangana, India. Tel.: +91 9676872626; fax: +91 40 27541551.
- Haseeb A. Khan
  - Department of Biochemistry, College of Sciences, Bldg. 5, King Saud University, P.O. Box 2455, Riyadh, Saudi Arabia
|
19
|
Antanaviciute A, Watson CM, Harrison SM, Lascelles C, Crinnion L, Markham AF, Bonthron DT, Carr IM. OVA: integrating molecular and physical phenotype data from multiple biomedical domain ontologies with variant filtering for enhanced variant prioritization. Bioinformatics 2015; 31:3822-9. [PMID: 26272982 PMCID: PMC4653395 DOI: 10.1093/bioinformatics/btv473] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 05/19/2015] [Accepted: 08/09/2015] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION Exome sequencing has become a de facto standard method for Mendelian disease gene discovery in recent years, yet identifying disease-causing mutations among thousands of candidate variants remains a non-trivial task. RESULTS Here we describe a new variant prioritization tool, OVA (ontology variant analysis), in which user-provided phenotypic information is exploited to infer deeper biological context. OVA combines a knowledge-based approach with a variant-filtering framework. It reduces the number of candidate variants by considering genotype and predicted effect on protein sequence, and scores the remainder on biological relevance to the query phenotype. We take advantage of several ontologies in order to bridge knowledge across multiple biomedical domains and facilitate computational analysis of annotations pertaining to genes, diseases, phenotypes, tissues and pathways. In this way, OVA combines information regarding molecular and physical phenotypes and integrates both human and model organism data to effectively prioritize variants. By assessing performance on both known and novel disease mutations, we show that OVA performs biologically meaningful candidate variant prioritization and can be more accurate than another recently published candidate variant prioritization tool. AVAILABILITY AND IMPLEMENTATION OVA is freely accessible at http://dna2.leeds.ac.uk:8080/OVA/index.jsp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online. CONTACT umaan@leeds.ac.uk.
Affiliation(s)
- Agne Antanaviciute
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds
- Christopher M Watson
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and Yorkshire Regional Genetics Service, St James's University Hospital, Leeds, UK
- Sally M Harrison
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds
- Carolina Lascelles
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds
- Laura Crinnion
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds and Yorkshire Regional Genetics Service, St James's University Hospital, Leeds, UK
- Alexander F Markham
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds
- David T Bonthron
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds
- Ian M Carr
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds
|
20
|
Antanaviciute A, Daly C, Crinnion LA, Markham AF, Watson CM, Bonthron DT, Carr IM. GeneTIER: prioritization of candidate disease genes using tissue-specific gene expression profiles. Bioinformatics 2015; 31:2728-35. [PMID: 25861967 PMCID: PMC4528628 DOI: 10.1093/bioinformatics/btv196] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 08/18/2014] [Accepted: 04/01/2015] [Indexed: 12/12/2022] Open
Abstract
Motivation: In attempts to determine the genetic causes of human disease, researchers are often faced with a large number of candidate genes. Linkage studies can point to a genomic region containing hundreds of genes, while the high-throughput sequencing approach will often identify a great number of non-synonymous genetic variants. Since systematic experimental verification of each such candidate gene is not feasible, a method is needed to decide which genes are worth investigating further. Computational gene prioritization presents itself as a solution to this problem, systematically analyzing and sorting each gene from the most to least likely to be the disease-causing gene, in a fraction of the time it would take a researcher to perform such queries manually. Results: Here, we present Gene TIssue Expression Ranker (GeneTIER), a new web-based application for candidate gene prioritization. GeneTIER replaces knowledge-based inference traditionally used in candidate disease gene prioritization applications with experimental data from tissue-specific gene expression datasets and thus largely overcomes the bias toward the better characterized genes/diseases that commonly afflicts other methods. We show that our approach is capable of accurate candidate gene prioritization and illustrate its strengths and weaknesses using case study examples. Availability and Implementation: Freely available on the web at http://dna.leeds.ac.uk/GeneTIER/. Contact: umaan@leeds.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Agne Antanaviciute
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital
- Catherine Daly
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital
- Laura A Crinnion
  - Yorkshire Regional Genetics Service, St James's University Hospital, Leeds, UK
- Alexander F Markham
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital
- David T Bonthron
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital
- Ian M Carr
  - Section of Genetics, Institute of Biomedical and Clinical Sciences, School of Medicine, University of Leeds, St James's University Hospital
|
21
|
Lhota J, Hauptman R, Hart T, Ng C, Xie L. A new method to improve network topological similarity search: applied to fold recognition. Bioinformatics 2015; 31:2106-14. [PMID: 25717198 DOI: 10.1093/bioinformatics/btv125] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 10/08/2014] [Accepted: 02/21/2015] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Similarity search is the foundation of bioinformatics. It plays a key role in establishing structural, functional and evolutionary relationships between biological sequences. Although the power of the similarity search has increased steadily in recent years, a high percentage of sequences remain uncharacterized in the protein universe. Thus, new similarity search strategies are needed to efficiently and reliably infer the structure and function of new sequences. The existing paradigm for studying protein sequence, structure, function and evolution has been established based on the assumption that the protein universe is discrete and hierarchical. Cumulative evidence suggests that the protein universe is continuous. As a result, conventional sequence homology search methods may not be able to detect novel structural, functional and evolutionary relationships between proteins from weak and noisy sequence signals. To overcome the limitations in existing similarity search methods, we propose a new algorithmic framework, Enrichment of Network Topological Similarity (ENTS), to improve the performance of large scale similarity searches in bioinformatics. RESULTS We apply ENTS to a challenging unsolved problem: protein fold recognition. Our rigorous benchmark studies demonstrate that ENTS considerably outperforms state-of-the-art methods. As the concept of ENTS can be applied to any similarity metric, it may provide a general framework for similarity search on any set of biological entities, given their representation as a network. AVAILABILITY AND IMPLEMENTATION Source code is freely available upon request. CONTACT lxie@iscb.org.
Affiliation(s)
- John Lhota
  - Hunter College High School, New York, NY 10128, U.S.A., Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A., Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A. and The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
- Ruth Hauptman
  - Hunter College High School, New York, NY 10128, U.S.A., Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A., Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A. and The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
- Thomas Hart
  - Hunter College High School, New York, NY 10128, U.S.A., Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A., Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A. and The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
- Clara Ng
  - Hunter College High School, New York, NY 10128, U.S.A., Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A., Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A. and The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
- Lei Xie
  - Hunter College High School, New York, NY 10128, U.S.A., Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A., Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A. and The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
|
22
|
Iourov IY, Vorsanova SG, Yurov YB. In silico molecular cytogenetics: a bioinformatic approach to prioritization of candidate genes and copy number variations for basic and clinical genome research. Mol Cytogenet 2014; 7:98. [PMID: 25525469 PMCID: PMC4269961 DOI: 10.1186/s13039-014-0098-z] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 11/27/2014] [Accepted: 12/02/2014] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND The availability of multiple in silico tools for prioritizing genetic variants widens the possibilities for converting genomic data into biological knowledge. However, in molecular cytogenetics, bioinformatic analyses are generally limited to result visualization or database mining for finding similar cytogenetic data. Obviously, the potential of bioinformatics might go beyond these applications. On the other hand, the requirements for performing successful in silico analyses (i.e. deep knowledge of computer science, statistics etc.) can hinder the implementation of bioinformatics in clinical and basic molecular cytogenetic research. Here, we propose a bioinformatic approach to prioritization of genomic variations that is able to solve these problems. RESULTS Selecting gene expression as an initial criterion, we have proposed a bioinformatic approach combining filtering and ranking prioritization strategies, which includes analyzing metabolome and interactome data on proteins encoded by candidate genes. To finalize the prioritization of genetic variants, genomic, epigenomic, interactomic and metabolomic data fusion has been made. Structural abnormalities and aneuploidy revealed by array CGH and FISH have been evaluated to test the approach through determining genotype-phenotype correlations, which have been found similar to those of previous studies. Additionally, we have been able to prioritize copy number variations (CNV) (i.e. differentiate between benign CNV and CNV with phenotypic outcome). Finally, the approach has been applied to prioritize genetic variants in cases of somatic mosaicism (including tissue-specific mosaicism). CONCLUSIONS In order to provide for an in silico evaluation of molecular cytogenetic data, we have proposed a bioinformatic approach to prioritization of candidate genes and CNV. While having the disadvantage of possible unavailability of gene expression data or lack of expression variability between genes of interest, the approach provides several advantages. These are (i) the versatility due to independence from specific databases/tools or software, (ii) relative algorithm simplicity (possibility to avoid sophisticated computational/statistical methodology) and (iii) applicability to molecular cytogenetic data because of the chromosome-centric nature. In conclusion, the approach is able to become useful for increasing the yield of molecular cytogenetic techniques.
Affiliation(s)
- Ivan Y Iourov
  - Mental Health Research Center, Russian Academy of Medical Sciences, 117152 Moscow, Russia; Russian National Research Medical University named after N.I. Pirogov, Separated Structural Unit "Clinical Research Institute of Pediatrics", Ministry of Health of Russian Federation, 125412 Moscow, Russia; Department of Medical Genetics, Russian Medical Academy of Postgraduate Education, Moscow, 123995 Russia
- Svetlana G Vorsanova
  - Mental Health Research Center, Russian Academy of Medical Sciences, 117152 Moscow, Russia; Russian National Research Medical University named after N.I. Pirogov, Separated Structural Unit "Clinical Research Institute of Pediatrics", Ministry of Health of Russian Federation, 125412 Moscow, Russia
- Yuri B Yurov
  - Mental Health Research Center, Russian Academy of Medical Sciences, 117152 Moscow, Russia; Russian National Research Medical University named after N.I. Pirogov, Separated Structural Unit "Clinical Research Institute of Pediatrics", Ministry of Health of Russian Federation, 125412 Moscow, Russia
|
23
|
Chen B, Wang J, Li M, Wu FX. Identifying disease genes by integrating multiple data sources. BMC Med Genomics 2014; 7 Suppl 2:S2. [PMID: 25350511 PMCID: PMC4243092 DOI: 10.1186/1755-8794-7-s2-s2] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Multiple types of data are now available for identifying disease genes. Those data include gene-disease associations, disease phenotype similarities, protein-protein interactions, pathways, gene expression profiles, etc. It is believed that integrating different kinds of biological data is an effective method to identify disease genes. RESULTS In this paper, we propose a multiple data integration method based on the theory of Markov random field (MRF) and the method of Bayesian analysis for identifying human disease genes. The proposed method is not only flexible in easily incorporating different kinds of data, but also reliable in predicting candidate disease genes. CONCLUSIONS Numerical experiments are carried out by integrating known gene-disease associations, protein complexes, protein-protein interactions, pathways and gene expression profiles. Predictions are evaluated by the leave-one-out method. The proposed method achieves an AUC score of 0.743 when integrating all those biological data in our experiments.
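For readers unfamiliar with the AUC scores quoted in these abstracts, the sketch below computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name `auc` and the example scores are hypothetical; the evaluation pipelines of the cited papers are not reproduced here.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores for known disease genes and for non-disease genes.
value = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

Here five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. The pairwise form is quadratic in the number of genes; rank-based formulas give the same value more efficiently on large sets.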
|
24
|
Wang Q, Zhang S, Pang S, Zhang M, Wang B, Liu Q, Li J. GroupRank: rank candidate genes in PPI network by differentially expressed gene groups. PLoS One 2014; 9:e110406. [PMID: 25330105 PMCID: PMC4199715 DOI: 10.1371/journal.pone.0110406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2014] [Accepted: 09/19/2014] [Indexed: 11/25/2022] Open
Abstract
Many cell activities are organized as a network, and genes are clustered into co-expressed groups if they have the same or closely related biological function or they are co-regulated. In this study, based on an assumption that a strong candidate disease gene is more likely to be close to gene groups in which all members are coordinately differentially expressed than to individual genes with differential expression, we developed a novel disease gene prioritization method, GroupRank, by integrating gene co-expression and differential expression information generated from microarray data as well as the PPI network. A candidate gene is ranked high by GroupRank if it is differentially expressed between disease and control or is close to differentially co-expressed groups in the PPI network. We tested our method on data sets of lung, kidney, leukemia and breast cancer. The results revealed that GroupRank could efficiently prioritize disease genes with a significantly improved AUC value in comparison to the previous method, which does not consider co-expressed gene groups in the PPI network. Moreover, the functional analyses of the major contributing gene group in gene prioritization of kidney cancer verified that our algorithm GroupRank not only ranks disease genes efficiently but also could help us identify and understand possible mechanisms in important physiological and pathological processes of disease.
Affiliation(s)
- Qing Wang
  - Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Siyi Zhang
  - Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Shichao Pang
  - Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Menghuan Zhang
  - Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Bo Wang
  - Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Qi Liu
  - Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
  - Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America
- Jing Li
  - Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Center for Bioinformation Technology, Shanghai, China
|
25
|
Disease gene identification by using graph kernels and Markov random fields. Sci China Life Sci 2014; 57:1054-63. [DOI: 10.1007/s11427-014-4745-8] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 05/21/2014] [Accepted: 07/14/2014] [Indexed: 01/05/2023]
|
26
|
Li Z, Chang SH, Zhang LY, Gao L, Wang J. Molecular genetic studies of ADHD and its candidate genes: a review. Psychiatry Res 2014; 219:10-24. [PMID: 24863865 DOI: 10.1016/j.psychres.2014.05.005] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Received: 11/19/2013] [Revised: 03/31/2014] [Accepted: 05/04/2014] [Indexed: 11/26/2022]
Abstract
Attention-deficit/hyperactivity disorder (ADHD) is a common childhood-onset psychiatric disorder with high heritability. In recent years, numerous molecular genetic studies have been published to investigate susceptibility loci for ADHD. These results brought valuable candidates for further research, but because they are scattered and heterogeneous, they also presented a great challenge for a profound understanding of the genetic data and the general patterns of current molecular genetic studies of ADHD. In this review, we present a retrospective review of more than 300 molecular genetic studies for ADHD from two aspects: (1) the main achievements of various studies are summarized, including linkage studies, candidate-gene association studies, genome-wide association studies and genome-wide copy number variation studies, with a special focus on general patterns of study design and common sample features; (2) candidate genes for ADHD have been systematically evaluated in three ways for better utilization. The thorough summary of the achievements from various studies will provide an overview of the research status of molecular genetics studies for ADHD. Meanwhile, the analysis of general patterns and sample characteristics on the basis of these studies, as well as the integrative review of candidate ADHD genes, will suggest new clues and directions for future experiment design.
Affiliation(s)
- Zhao Li
  - Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, 16 Lincui Road, Chaoyang District, Beijing 100101, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Su-Hua Chang
  - Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, 16 Lincui Road, Chaoyang District, Beijing 100101, China
- Liu-Yan Zhang
  - Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, 16 Lincui Road, Chaoyang District, Beijing 100101, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Lei Gao
  - Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, 16 Lincui Road, Chaoyang District, Beijing 100101, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Jing Wang
  - Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, 16 Lincui Road, Chaoyang District, Beijing 100101, China
|
27
|
Oliver KL, Lukic V, Thorne NP, Berkovic SF, Scheffer IE, Bahlo M. Harnessing gene expression networks to prioritize candidate epileptic encephalopathy genes. PLoS One 2014; 9:e102079. [PMID: 25014031 PMCID: PMC4090166 DOI: 10.1371/journal.pone.0102079] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 12/19/2013] [Accepted: 06/14/2014] [Indexed: 01/11/2023] Open
Abstract
We apply a novel gene expression network analysis to a cohort of 182 recently reported candidate Epileptic Encephalopathy genes to identify those most likely to be true Epileptic Encephalopathy genes. These candidate genes were identified as having single variants of likely pathogenic significance discovered in a large-scale massively parallel sequencing study. Candidate Epileptic Encephalopathy genes were prioritized according to their co-expression with 29 known Epileptic Encephalopathy genes. We utilized developing brain and adult brain gene expression data from the Allen Human Brain Atlas (AHBA) and compared this to data from Celsius: a large, heterogeneous gene expression data warehouse. We show replicable prioritization results using these three independent gene expression resources, two of which are brain-specific, with small sample size, and the third derived from a heterogeneous collection of tissues with large sample size. Of the nineteen genes that we predicted with the highest likelihood to be true Epileptic Encephalopathy genes, two (GNAO1 and GRIN2B) have recently been independently reported and confirmed. We compare our results to those produced by an established in silico prioritization approach called Endeavour, and finally present gene expression networks for the known and candidate Epileptic Encephalopathy genes. This highlights sub-networks of gene expression, particularly in the network derived from the adult AHBA gene expression dataset. These networks give clues to the likely biological interactions between Epileptic Encephalopathy genes, potentially highlighting underlying mechanisms and avenues for therapeutic targets.
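The co-expression ranking described in this abstract can be illustrated with a minimal sketch. It assumes that each candidate is scored by its mean absolute Pearson correlation with the known genes; the abstract does not state the exact scoring rule, so that choice, the function names, and the toy expression profiles below are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as 0.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def rank_by_coexpression(expression, known, candidates):
    """Order candidates by mean |correlation| with the known genes (assumed rule)."""
    scores = {}
    for gene in candidates:
        corrs = [abs(pearson(expression[gene], expression[k])) for k in known]
        scores[gene] = sum(corrs) / len(corrs)
    return sorted(scores, key=scores.get, reverse=True)


# Toy profiles (hypothetical values, one number per sample).
expression = {
    "KNOWN_A": [1.0, 2.0, 3.0, 4.0],
    "KNOWN_B": [2.0, 4.1, 5.9, 8.0],
    "CAND_1": [1.1, 2.1, 2.9, 4.2],
    "CAND_2": [3.0, 1.0, 3.0, 1.0],
}
ranking = rank_by_coexpression(
    expression, ["KNOWN_A", "KNOWN_B"], ["CAND_1", "CAND_2"]
)
print(ranking)
```

In this toy data `CAND_1` tracks both known genes closely and is ranked first. Running the same ranking on several independent expression resources and comparing the orderings would mirror the replication step the abstract mentions.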
Affiliation(s)
- Karen L. Oliver
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
  - Epilepsy Research Center, Department of Medicine, University of Melbourne, Austin Health, Heidelberg, Victoria, Australia
- Vesna Lukic
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
- Natalie P. Thorne
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
- Samuel F. Berkovic
  - Epilepsy Research Center, Department of Medicine, University of Melbourne, Austin Health, Heidelberg, Victoria, Australia
- Ingrid E. Scheffer
  - Epilepsy Research Center, Department of Medicine, University of Melbourne, Austin Health, Heidelberg, Victoria, Australia
  - Florey Institute, Melbourne, Victoria, Australia
  - Department of Paediatrics, University of Melbourne, Royal Children's Hospital, Melbourne, Victoria, Australia
- Melanie Bahlo
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
  - Department of Mathematics and Statistics, University of Melbourne, Melbourne, Victoria, Australia
|
28
|
Wang W, Yang S, Zhang X, Li J. Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics 2014; 30:2923-30. [PMID: 24974205 DOI: 10.1093/bioinformatics/btu403] [Citation(s) in RCA: 196] [Impact Index Per Article: 19.6] [Indexed: 12/20/2022]
Abstract
MOTIVATION The emergence of network medicine not only offers more opportunities for better and more complete understanding of the molecular complexities of diseases, but also serves as a promising tool for identifying new drug targets and establishing new relationships among diseases that enable drug repositioning. Computational approaches for drug repositioning by integrating information from multiple sources and multiple levels have the potential to provide great insights to the complex relationships among drugs, targets, disease genes and diseases at a system level. RESULTS In this article, we have proposed a computational framework based on a heterogeneous network model and applied the approach on drug repositioning by using existing omics data about diseases, drugs and drug targets. The novelty of the framework lies in the fact that the strength between a disease-drug pair is calculated through an iterative algorithm on the heterogeneous graph that also incorporates drug-target information. Comprehensive experimental results show that the proposed approach significantly outperforms several recent approaches. Case studies further illustrate its practical usefulness. AVAILABILITY AND IMPLEMENTATION http://cbc.case.edu CONTACT jingli@cwru.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenhui Wang
  - Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA and Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Sen Yang
  - Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA and Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Xiang Zhang
  - Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA and Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Jing Li
  - Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA and Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
|
29
|
Valentini G, Paccanaro A, Caniza H, Romero AE, Re M. An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods. Artif Intell Med 2014; 61:63-78. [PMID: 24726035 PMCID: PMC4070077 DOI: 10.1016/j.artmed.2014.03.003] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 09/11/2013] [Revised: 03/05/2014] [Accepted: 03/10/2014] [Indexed: 02/07/2023]
Abstract
OBJECTIVE In the context of "network medicine", gene prioritization methods represent one of the main tools to discover candidate disease genes by exploiting the large amount of data covering different types of functional relationships between genes. Several works proposed to integrate multiple sources of data to improve disease gene prioritization, but to our knowledge no systematic studies focused on the quantitative evaluation of the impact of network integration on gene prioritization. In this paper, we aim at providing an extensive analysis of gene-disease associations not limited to genetic disorders, and a systematic comparison of different network integration methods for gene prioritization. MATERIALS AND METHODS We collected nine different functional networks representing different functional relationships between genes, and we combined them through both unweighted and weighted network integration methods. We then prioritized genes with respect to each of the considered 708 medical subject headings (MeSH) diseases by applying classical guilt-by-association, random walk and random walk with restart algorithms, and the recently proposed kernelized score functions. RESULTS The results obtained with classical random walk algorithms and the best single network achieved an average area under the curve (AUC) across the 708 MeSH diseases of about 0.82, while kernelized score functions and network integration boosted the average AUC to about 0.89. Weighted integration, by exploiting the different "informativeness" embedded in different functional networks, outperforms unweighted integration at 0.01 significance level, according to the Wilcoxon signed rank sum test. For each MeSH disease we provide the top-ranked unannotated candidate genes, available for further bio-medical investigation. CONCLUSIONS Network integration is necessary to boost the performances of gene prioritization methods. 
Moreover the methods based on kernelized score functions can further enhance disease gene ranking results, by adopting both local and global learning strategies, able to exploit the overall topology of the network.
Affiliation(s)
- Giorgio Valentini
  - AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, via Comelico 39/41, 20135 Milano, Italy
- Alberto Paccanaro
  - Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK
- Horacio Caniza
  - Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK
- Alfonso E Romero
  - Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK
- Matteo Re
  - AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, via Comelico 39/41, 20135 Milano, Italy
|
30
|
Zhang SW, Shao DD, Zhang SY, Wang YB. Prioritization of candidate disease genes by enlarging the seed set and fusing information of the network topology and gene expression. Mol Biosyst 2014; 10:1400-8. [PMID: 24695957 DOI: 10.1039/c3mb70588a] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 01/13/2023]
Abstract
The identification of disease genes is very important not only to provide greater understanding of gene function and cellular mechanisms which drive human disease, but also to enhance human disease diagnosis and treatment. Recently, high-throughput techniques have been applied to detect dozens or even hundreds of candidate genes. However, experimental approaches to validate the many candidates are usually time-consuming, tedious and expensive, and sometimes lack reproducibility. Therefore, numerous theoretical and computational methods (e.g. network-based approaches) have been developed to prioritize candidate disease genes. Many network-based approaches implicitly utilize the observation that genes causing the same or similar diseases tend to correlate with each other in gene-protein relationship networks. Of these network approaches, the random walk with restart algorithm (RWR) is considered to be a state-of-the-art approach. To further improve the performance of RWR, we propose a novel method named ESFSC to identify disease-related genes, by enlarging the seed set according to the centrality of disease genes in a network and fusing information of the protein-protein interaction (PPI) network topological similarity and the gene expression correlation. The ESFSC algorithm restarts at all of the nodes in the seed set consisting of the known disease genes and their k-nearest neighbor nodes, then walks in the global network separately guided by the similarity transition matrix constructed with PPI network topological similarity properties and the correlational transition matrix constructed with the gene expression profiles. As a result, all the genes in the network are ranked by weighted fusing the above results of the RWR guided by two types of transition matrices. 
Comprehensive simulation results of the 10 diseases with 97 known disease genes collected from the Online Mendelian Inheritance in Man (OMIM) database show that ESFSC outperforms existing methods for prioritizing candidate disease genes. The top prediction results of Alzheimer's disease are consistent with previous literature reports.
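The random walk with restart (RWR) that this abstract builds on can be sketched as follows. This is plain RWR on an unweighted, undirected toy graph, not the ESFSC method itself: the seed-set enlargement, the two transition matrices, and the weighted fusion are not reproduced here. The restart probability of 0.3, the function name, and the graph are assumptions chosen for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-8, max_iter=1000):
    """Return steady-state visiting probabilities for a walker that restarts at the seeds.

    adjacency: dict mapping each node to the set of its neighbours (undirected graph).
    seeds: set of known disease genes; the walker restarts uniformly over them.
    """
    nodes = sorted(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for i in nodes:
            # Each neighbour j passes on an equal share 1/deg(j) of its probability.
            inflow = sum(p[j] / degree[j] for j in adjacency[i])
            nxt[i] = (1 - restart) * inflow + restart * p0[i]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path graph A - B - C - D with a single seed gene A (hypothetical data).
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = random_walk_with_restart(graph, seeds={"A"})
candidates = sorted((n for n in graph if n != "A"), key=scores.get, reverse=True)
print(candidates)
```

Non-seed genes are ranked by their steady-state probability, so genes closer to the seed in the network score higher. A method such as the one described above would run this kind of walk once per transition matrix and combine the resulting score vectors.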
Affiliation(s)
- Shao-Wu Zhang
  - College of Automation, Northwestern Polytechnical University, 710072, Xi'an, China
|
31
|
Zhan Y, Zhang R, Lv H, Song X, Xu X, Chai L, Lv W, Shang Z, Jiang Y, Zhang R. Prioritization of candidate genes for periodontitis using multiple computational tools. J Periodontol 2014; 85:1059-69. [PMID: 24476546 DOI: 10.1902/jop.2014.130523] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Indexed: 12/31/2022]
Abstract
BACKGROUND Both genetic and environmental factors contribute to the development of periodontitis. Genetic studies identified a variety of candidate genes for periodontitis. The aim of the present study is to identify the most promising candidate genes for periodontitis using an integrative gene ranking method. METHODS Seed genes that were confirmed to be associated with periodontitis were identified using text mining. Three types of candidate genes were then extracted from different resources (expression profiles, genome-wide association studies). Combining the seed genes, four freely available bioinformatics tools (ToppGene, DIR, Endeavour, and GPEC) were integrated for prioritization of candidate genes. Candidate genes that were identified by at least three programs and ranked in the top 20 by each program were considered the most promising. RESULTS Prioritization analysis resulted in 21 promising genes involved or potentially involved in periodontitis. Among them, IL18 (interleukin 18), CD44 (CD44 molecule), CXCL1 (chemokine [CXC motif] ligand 1), IL6ST (interleukin 6 signal transducer), MMP3 (matrix metallopeptidase 3), MMP7, CCR1 (chemokine [C-C motif] receptor 1), MMP13, and TLR9 (Toll-like receptor 9) had been associated with periodontitis. However, the roles of other genes, such as CSF3 (colony stimulating factor 3 receptor), CD40, TNFSF14 (tumor necrosis factor receptor superfamily, member 14), IFNB1 (interferon-β1), TIRAP (toll-interleukin 1 receptor domain containing adaptor protein), IL2RA (interleukin 2 receptor α), ETS1 (v-ets avian erythroblastosis virus E26 oncogene homolog 1), GADD45B (growth arrest and DNA-damage-inducible 45 β), BIRC3 (baculoviral IAP repeat containing 3), VAV1 (vav 1 guanine nucleotide exchange factor), COL5A1 (collagen, type V, α1), and C3 (complement component 3), have not been investigated thoroughly in the process of periodontitis. 
These genes are mainly involved in bacterial infection, immune response, and inflammatory reaction, suggesting that further characterizing their roles in periodontitis will be important. CONCLUSIONS A combination of computational tools will be useful in mining candidate genes for periodontitis. These theoretical results provide new clues for experimental biologists to plan targeted experiments.
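The selection rule in the METHODS section above can be sketched as a simple vote across tools. The sketch reads the rule as "a gene is kept if it appears in the top N of at least three tools", which is one interpretation of the abstract's wording; the function name, the placeholder tool and gene labels, and the reduced cut-off of 2 used in the toy example are all hypothetical.

```python
def consensus_candidates(rankings, top_n=20, min_tools=3):
    """Return genes found in the top_n of at least min_tools ranked lists.

    rankings: dict mapping a tool name to its gene list, best-ranked first.
    """
    support = {}
    for tool, ranked_genes in rankings.items():
        for gene in ranked_genes[:top_n]:
            # A set avoids double-counting a gene listed twice by one tool.
            support.setdefault(gene, set()).add(tool)
    return sorted(g for g, tools in support.items() if len(tools) >= min_tools)


# Placeholder rankings from four unnamed tools (toy data, not real output).
rankings = {
    "tool_a": ["G1", "G2", "G3"],
    "tool_b": ["G2", "G1", "G4"],
    "tool_c": ["G1", "G4", "G2"],
    "tool_d": ["G5", "G2", "G1"],
}
promising = consensus_candidates(rankings, top_n=2, min_tools=3)
print(promising)
```

With the toy cut-off of 2, only `G1` and `G2` reach the top of three tools. Setting `top_n=20` and supplying the four tools' real output would correspond to the thresholds quoted in the abstract.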
Affiliation(s)
- Yuanbo Zhan
  - Department of Periodontology and Oral Mucosa, Second Affiliated Hospital of Harbin Medical University, Harbin, China
|
32
|
Kimmel C, Visweswaran S. An algorithm for network-based gene prioritization that encodes knowledge both in nodes and in links. PLoS One 2013; 8:e79564. [PMID: 24260251 PMCID: PMC3834271 DOI: 10.1371/journal.pone.0079564] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/04/2012] [Accepted: 09/25/2013] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Candidate gene prioritization aims to identify promising new genes associated with a disease or a biological process from a larger set of candidate genes. In recent years, network-based methods - which utilize a knowledge network derived from biological knowledge - have been utilized for gene prioritization. Biological knowledge can be encoded either through the network's links or nodes. Current network-based methods can only encode knowledge through links. This paper describes a new network-based method that can encode knowledge in links as well as in nodes. RESULTS We developed a new network inference algorithm called the Knowledge Network Gene Prioritization (KNGP) algorithm which can incorporate both link and node knowledge. The performance of the KNGP algorithm was evaluated on both synthetic networks and on networks incorporating biological knowledge. The results showed that the combination of link knowledge and node knowledge provided a significant benefit across 19 experimental diseases over using link knowledge alone or node knowledge alone. CONCLUSIONS The KNGP algorithm provides an advance over current network-based algorithms, because the algorithm can encode both link and node knowledge. We hope the algorithm will aid researchers with gene prioritization.
Affiliation(s)
- Chad Kimmel
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Shyam Visweswaran
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
|
33
|
Chang SH, Gao L, Li Z, Zhang WN, Du Y, Wang J. BDgene: a genetic database for bipolar disorder and its overlap with schizophrenia and major depressive disorder. Biol Psychiatry 2013; 74:727-33. [PMID: 23764453 DOI: 10.1016/j.biopsych.2013.04.016] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Received: 12/17/2012] [Revised: 03/27/2013] [Accepted: 04/12/2013] [Indexed: 12/14/2022]
Abstract
BACKGROUND Bipolar disorder (BD) is a common psychiatric disorder with complex genetic architecture. It shares overlapping genetic influences with schizophrenia (SZ) and major depressive disorder (MDD). Large numbers of genetic studies of BD and cross-disorder studies between BD and SZ/MDD have accumulated numerous genetic data. There is a growing need to integrate the data to provide a comprehensive data set to facilitate the genetic study of BD and its highly relevant diseases. METHODS BDgene database was developed to integrate BD-related genetic factors and shared ones with SZ/MDD from profound literature reading. On the basis of data from the literature, in-depth analyses were performed for further understanding of the data, including gene prioritization, pathway-based analysis, intersection analysis of multidisease candidate genes, and pathway enrichment analysis. RESULTS BDgene includes multiple types of literature-reported genetic factors of BD with both positive and negative results, including 797 genes, 3119 single nucleotide polymorphisms, and 789 regions. Shared genetic factors such as single nucleotide polymorphisms, genes, and regions from published cross-disorder studies among BD and SZ/MDD were also presented. In-depth data analyses identified 43 BD core genes; 70 BD candidate pathways; and 127, 79, and 107 new potential cross-disorder genes for BD-SZ, BD-MDD, and BD-SZ-MDD, respectively. CONCLUSIONS As a central genetic database for BD and the first cross-disorder database for BD and SZ/MDD, BDgene provides not only a comprehensive review of current genetic research but also high-confidence candidate genes and pathways for understanding of BD mechanism and shared etiology among its relevant diseases. BDgene is freely available at http://bdgene.psych.ac.cn.
Affiliation(s)
- Su-Hua Chang
  - Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, Beijing, China
|
34
|
Minelli C, De Grandi A, Weichenberger CX, Gögele M, Modenese M, Attia J, Barrett JH, Boehnke M, Borsani G, Casari G, Fox CS, Freina T, Hicks AA, Marroni F, Parmigiani G, Pastore A, Pattaro C, Pfeufer A, Ruggeri F, Schwienbacher C, Taliun D, Pramstaller PP, Domingues FS, Thompson JR. Importance of different types of prior knowledge in selecting genome-wide findings for follow-up. Genet Epidemiol 2013; 37:205-13. [PMID: 23307621 DOI: 10.1002/gepi.21705] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 08/10/2012] [Revised: 10/28/2012] [Accepted: 11/22/2012] [Indexed: 12/14/2022]
Abstract
Biological plausibility and other prior information could help select genome-wide association (GWA) findings for further follow-up, but there is no consensus on which types of knowledge should be considered or how to weight them. We used experts' opinions and empirical evidence to estimate the relative importance of 15 types of information at the single-nucleotide polymorphism (SNP) and gene levels. Opinions were elicited from 10 experts using a two-round Delphi survey. Empirical evidence was obtained by comparing the frequency of each type of characteristic in SNPs established as being associated with seven disease traits through GWA meta-analysis and independent replication, with the corresponding frequency in a randomly selected set of SNPs. SNP and gene characteristics were retrieved using a specially developed bioinformatics tool. Both the expert and the empirical evidence rated previous association in a meta-analysis or more than one study as conferring the highest relative probability of true association, whereas previous association in a single study ranked much lower. High relative probabilities were also observed for location in a functional protein domain, although location in a region evolutionarily conserved in vertebrates was ranked high by the data but not by the experts. Our empirical evidence did not support the importance attributed by the experts to whether the gene encodes a protein in a pathway or shows interactions relevant to the trait. Our findings provide insight into the selection and weighting of different types of knowledge in SNP or gene prioritization, and point to areas requiring further research.
Affiliation(s)
- Cosetta Minelli
  - Center for Biomedicine, European Academy Bozen/Bolzano (EURAC), Bolzano, Italy
|
35
|
Nie Y, Yu J. Mining breast cancer genes with a network based noise-tolerant approach. BMC Syst Biol 2013; 7:49. [PMID: 23799982 PMCID: PMC3702465 DOI: 10.1186/1752-0509-7-49] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 11/28/2012] [Accepted: 06/21/2013] [Indexed: 12/22/2022]
Abstract
BACKGROUND Mining novel breast cancer genes is an important task in breast cancer research. Many approaches prioritize candidate genes based on their similarity to known cancer genes, usually by integrating multiple data sources. However, different types of data often contain varying degrees of noise. For effective data integration, it's important to design methods that work robustly with respect to noise. RESULTS Gene Ontology (GO) annotations were often utilized in cancer gene mining works. However, the vast majority of GO annotations were computationally derived, thus not completely accurate. A set of genes annotated with breast cancer enriched GO terms was adopted here as a set of source data with realistic noise. A novel noise tolerant approach was proposed to rank candidate breast cancer genes using noisy source data within the framework of a comprehensive human Protein-Protein Interaction (PPI) network. Performance of the proposed method was quantitatively evaluated by comparing it with the more established random walk approach. Results showed that the proposed method exhibited better performance in ranking known breast cancer genes and higher robustness against data noise than the random walk approach. When noise started to increase, the proposed method was able to maintain relatively stable performance, while the random walk approach showed drastic performance decline; when noise increased to a large extent, the proposed method was still able to achieve better performance than random walk did. CONCLUSIONS A novel noise tolerant method was proposed to mine breast cancer genes. Compared to the well established random walk approach, it showed better performance in correctly ranking cancer genes and worked robustly with respect to noise within source data. To the best of our knowledge, it's the first such effort to quantitatively analyze noise tolerance between different breast cancer gene mining methods. 
The sorted gene list can be valuable for breast cancer research. The proposed quantitative noise analysis method may also prove useful for other data integration efforts. It is hoped that the current work can lead to more discussions about influence of data noise on different computational methods for mining disease genes.
Affiliation(s)
- Yaling Nie
  - National Key Laboratory of Biochemical Engineering, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China
|
36
|
A multi-platform draft de novo genome assembly and comparative analysis for the Scarlet Macaw (Ara macao). PLoS One 2013; 8:e62415. [PMID: 23667475 PMCID: PMC3648530 DOI: 10.1371/journal.pone.0062415] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Received: 01/07/2013] [Accepted: 03/21/2013] [Indexed: 12/31/2022] Open
Abstract
Data deposition to NCBI Genomes: This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession AMXX00000000 (SMACv1.0, unscaffolded genome assembly). The version described in this paper is the first version (AMXX01000000). The scaffolded assembly (SMACv1.1) has been deposited at DDBJ/EMBL/GenBank under the accession AOUJ00000000, and is also the first version (AOUJ01000000). Strong biological interest in traits such as the acquisition and utilization of speech, cognitive abilities, and longevity catalyzed the utilization of two next-generation sequencing platforms to provide the first-draft de novo genome assembly for the large, new world parrot Ara macao (Scarlet Macaw). Despite the challenges associated with genome assembly for an outbred avian species, including 951,507 high-quality putative single nucleotide polymorphisms, the final genome assembly (>1.035 Gb) includes more than 997 Mb of unambiguous sequence data (excluding N's). Cytogenetic analyses including ZooFISH revealed complex rearrangements associated with two scarlet macaw macrochromosomes (AMA6, AMA7), which supports the hypothesis that translocations, fusions, and intragenomic rearrangements are key factors associated with karyotype evolution among parrots. In silico annotation of the scarlet macaw genome provided robust evidence for 14,405 nuclear gene annotation models, their predicted transcripts and proteins, and a complete mitochondrial genome. Comparative analyses involving the scarlet macaw, chicken, and zebra finch genomes revealed high levels of nucleotide-based conservation as well as evidence for overall genome stability among the three highly divergent species. Application of a new whole-genome analysis of divergence involving all three species yielded prioritized candidate genes and noncoding regions for parrot traits of interest (i.e., speech, intelligence, longevity) which were independently supported by the results of previous human GWAS studies. 
We also observed evidence for genes and noncoding loci that displayed extreme conservation across the three avian lineages, thereby reflecting their likely biological and developmental importance among birds.
|
37
|
Mandillo S, Golini E, Marazziti D, Di Pietro C, Matteoni R, Tocchini-Valentini GP. Mice lacking the Parkinson's related GPR37/PAEL receptor show non-motor behavioral phenotypes: age and gender effect. Genes Brain Behav 2013; 12:465-77. [DOI: 10.1111/gbb.12041] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 09/07/2012] [Revised: 02/15/2013] [Accepted: 04/05/2013] [Indexed: 12/14/2022]
Affiliation(s)
- S. Mandillo
  - CNR-National Research Council, IBCN-Institute of Cell Biology and Neurobiology; EMMA-Infrafrontier-IMPC; Monterotondo Scalo; Rome; Italy
- E. Golini
  - CNR-National Research Council, IBCN-Institute of Cell Biology and Neurobiology; EMMA-Infrafrontier-IMPC; Monterotondo Scalo; Rome; Italy
- D. Marazziti
  - CNR-National Research Council, IBCN-Institute of Cell Biology and Neurobiology; EMMA-Infrafrontier-IMPC; Monterotondo Scalo; Rome; Italy
- C. Di Pietro
  - CNR-National Research Council, IBCN-Institute of Cell Biology and Neurobiology; EMMA-Infrafrontier-IMPC; Monterotondo Scalo; Rome; Italy
- R. Matteoni
  - CNR-National Research Council, IBCN-Institute of Cell Biology and Neurobiology; EMMA-Infrafrontier-IMPC; Monterotondo Scalo; Rome; Italy
- G. P. Tocchini-Valentini
  - CNR-National Research Council, IBCN-Institute of Cell Biology and Neurobiology; EMMA-Infrafrontier-IMPC; Monterotondo Scalo; Rome; Italy
|
38
|
Wang W, Yang S, Li J. Drug target predictions based on heterogeneous graph inference. Pac Symp Biocomput 2013:53-64. [PMID: 23424111 PMCID: PMC3605000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
A key issue in drug development is to understand the hidden relationships among drugs and targets. Computational methods for novel drug target predictions can greatly reduce time and costs compared with experimental methods. In this paper, we propose a network based computational approach for novel drug and target association predictions. More specifically, a heterogeneous drug-target graph, which incorporates known drug-target interactions as well as drug-drug and target-target similarities, is first constructed. Based on this graph, a novel graph-based inference method is introduced. Compared with two state-of-the-art methods, large-scale cross-validation results indicate that the proposed method can greatly improve novel target predictions.
Affiliation(s)
- Jing Li
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, Ohio, 44106, USA
|
39
|
Masoudi-Nejad A, Meshkin A, Haji-Eghrari B, Bidkhori G. RETRACTED ARTICLE: Candidate gene prioritization. Mol Genet Genomics 2012; 287:679-98. [DOI: 10.1007/s00438-012-0710-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 03/12/2012] [Accepted: 07/12/2012] [Indexed: 01/16/2023]
|
40
|
|
41
|
Chang S, Zhang W, Gao L, Wang J. Prioritization of candidate genes for attention deficit hyperactivity disorder by computational analysis of multiple data sources. Protein Cell 2012; 3:526-34. [PMID: 22773342 DOI: 10.1007/s13238-012-2931-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 04/05/2012] [Accepted: 05/15/2012] [Indexed: 01/24/2023] Open
Abstract
Attention deficit hyperactivity disorder (ADHD) is a common, highly heritable psychiatric disorder characterized by hyperactivity, inattention and increased impulsivity. In recent years, a large number of genetic studies of ADHD have been published and the related genetic data have accumulated rapidly. To provide researchers with a comprehensive ADHD genetic resource, we previously developed the first genetic database for ADHD (ADHDgene). The abundant genetic data provide novel candidates for further study, but they also make it harder to select promising candidate genes for replication and verification research. In this study, we surveyed the computational tools for candidate gene prioritization and selected five tools, each of which integrates multiple data sources, to prioritize the ADHD candidate genes in ADHDgene. The prioritization analysis yielded 16 prioritized candidate genes, which are mainly involved in several major neurotransmitter systems or in nervous system development pathways. Among these genes, those related to nervous system development, especially SNAP25 and STX1A, and the gene-gene interactions involving each of them deserve further investigation. Our results may provide new insight for further verification studies and facilitate exploration of the pathogenic mechanisms of ADHD.
Affiliation(s)
- Suhua Chang
- Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, Beijing 100101, China
|
42
|
Moreau Y, Tranchevent LC. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet 2012; 13:523-36. [DOI: 10.1038/nrg3253] [Citation(s) in RCA: 332] [Impact Index Per Article: 27.7] [Indexed: 12/14/2022]
|
43
|
Doncheva NT, Kacprowski T, Albrecht M. Recent approaches to the prioritization of candidate disease genes. Wiley Interdiscip Rev Syst Biol Med 2012; 4:429-42. [PMID: 22689539 DOI: 10.1002/wsbm.1177] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Indexed: 11/10/2022]
Abstract
Many efforts are still devoted to the discovery of genes involved with specific phenotypes, in particular, diseases. High-throughput techniques are thus applied frequently to detect dozens or even hundreds of candidate genes. However, the experimental validation of many candidates is often an expensive and time-consuming task. Therefore, a great variety of computational approaches has been developed to support the identification of the most promising candidates for follow-up studies. The biomedical knowledge already available about the disease of interest and related genes is commonly exploited to find new gene-disease associations and to prioritize candidates. In this review, we highlight recent methodological advances in this research field of candidate gene prioritization. We focus on approaches that use network information and integrate heterogeneous data sources. Furthermore, we discuss current benchmarking procedures for evaluating and comparing different prioritization methods.
|