1
Yang X, Yang C, Lei J, Liu J. An Approach of Epistasis Detection Using Integer Linear Programming Optimizing Bayesian Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2654-2671. [PMID: 34181547 DOI: 10.1109/tcbb.2021.3092719] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Developing a more effective and accurate method for detecting epistatic loci in large-scale genomic data is important for improving crop quality, disease treatment, and related applications. Because of its high accuracy and ability to model non-linear relationships, the Bayesian network (BN) has been widely used to construct networks of SNPs and phenotype traits and thereby mine epistatic loci. However, BN learning easily falls into local optima and cannot handle large numbers of SNPs. In this work, we transform the problem of learning a Bayesian network into an integer linear programming (ILP) optimization problem. We use branch-and-bound and cutting-plane algorithms to obtain the globally optimal Bayesian network (ILPBN) and thus the epistatic loci influencing specific phenotype traits. To handle large numbers of SNP loci and further improve efficiency, we optimize the Markov blanket to reduce the number of candidate parent nodes for each node. In addition, we use α-BIC, which is suited to epistasis mining, to calculate the BN score. We use four properties of decomposable BN scoring functions to further reduce the number of candidate parent sets for each node. Experimental results show that ILPBN can perform not only 2-locus and 3-locus epistasis mining but also multi-locus epistasis detection. Finally, we compare ILPBN with several popular epistasis mining algorithms using simulated data and a real age-related macular degeneration (AMD) dataset. Experimental results show that ILPBN achieves better epistasis detection accuracy, F1-score, and false positive rate than the other methods while maintaining efficiency. Availability: Code and datasets are available at: http://122.205.95.139/ILPBN/.
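To illustrate what a decomposable BN score looks like, the following is a minimal sketch of the standard BIC score for one node and one candidate parent set. It is not the α-BIC variant used by ILPBN, whose exact form the abstract does not give. The function name `bic_family_score`, the dict-per-sample data layout, and the `arity` mapping are assumptions made for this example.

```python
from collections import Counter
from math import log


def bic_family_score(data, child, parents, arity):
    """Standard BIC score of one node given a candidate parent set.

    data    : list of dicts mapping variable name -> discrete state (hypothetical layout)
    child   : name of the node being scored (e.g. the phenotype)
    parents : list of candidate parent names (e.g. SNPs)
    arity   : dict mapping variable name -> number of states
    """
    n = len(data)
    # Counts of each (parent configuration, child state) pair and of each parent configuration.
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximised log-likelihood of the child given its parents.
    loglik = sum(c * log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    # Number of free parameters: (child states - 1) per parent configuration.
    q = 1
    for p in parents:
        q *= arity[p]
    n_params = (arity[child] - 1) * q
    return loglik - 0.5 * log(n) * n_params


# Toy XOR-style data: the phenotype depends on both SNPs jointly, on neither alone.
samples = [
    {"snp1": 0, "snp2": 0, "pheno": 0},
    {"snp1": 0, "snp2": 1, "pheno": 1},
    {"snp1": 1, "snp2": 0, "pheno": 1},
    {"snp1": 1, "snp2": 1, "pheno": 0},
]
arity = {"snp1": 2, "snp2": 2, "pheno": 2}
with_parents = bic_family_score(samples, "pheno", ["snp1", "snp2"], arity)
without_parents = bic_family_score(samples, "pheno", [], arity)
```

Because the score of a whole network is the sum of such per-node terms, each candidate parent set can be scored independently, which is what makes pruning parent sets before the ILP step possible.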
2
Tuo S, Li C, Liu F, Li A, He L, Geem ZW, Shang J, Liu H, Zhu Y, Feng Z, Chen T. MTHSA-DHEI: multitasking harmony search algorithm for detecting high-order SNP epistatic interactions. Complex Intell Syst 2022. [DOI: 10.1007/s40747-022-00813-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Genome-wide association studies have succeeded in identifying genetic variants associated with complex diseases, but the findings have not been well interpreted biologically. Although it is widely accepted that epistatic interactions of high-order single nucleotide polymorphisms (SNPs) [(1) Single nucleotide polymorphisms (SNP) are mainly deoxyribonucleic acid (DNA) sequence polymorphisms caused by variants at a single nucleotide at the genome level. They are the most common type of heritable variation in humans.] are important causes of complex diseases, the combinatorial explosion of millions of SNPs and multiple tests impose a large computational burden. Moreover, it is extremely challenging to correctly distinguish high-order SNP epistatic interactions from other high-order SNP combinations due to small sample sizes. In this study, a multitasking harmony search algorithm (MTHSA-DHEI) is proposed for detecting high-order epistatic interactions [(2) In classical genetics, if genes X1 and X2 are mutated and each mutation by itself produces a unique disease status (phenotype) but the mutations together cause the same disease status as the gene X1 mutation, gene X1 is epistatic and gene X2 is hypostatic, and gene X1 has an epistatic effect (main effect) on disease status. In this work, a high-order epistatic interaction occurs when two or more SNP loci have a joint influence on disease status.], with the goal of simultaneously detecting multiple types of high-order (k1-order, k2-order, …, kn-order) SNP epistatic interactions. Unified coding is adopted for multiple tasks, and four complementary association evaluation functions are employed to improve the capability of discriminating the high-order SNP epistatic interactions.
We compare the proposed MTHSA-DHEI method with four excellent methods for detecting high-order SNP interactions for 8 high-order epistatic interaction models with no marginal effect (EINMEs) and 12 epistatic interaction models with marginal effects (EIMEs) and implement the MTHSA-DHEI algorithm with a real dataset: age-related macular degeneration (AMD). The experimental results indicate that MTHSA-DHEI has power and an F1-score exceeding 90% for all EIMEs and five EINMEs and reduces the computational time by more than 90%. It can efficiently perform multiple high-order detection tasks for high-order epistatic interactions and improve the discrimination ability for diverse epistasis models.
3
Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Frontiers in Bioinformatics 2022; 2:927312. [PMID: 36304293 PMCID: PMC9580915 DOI: 10.3389/fbinf.2022.927312] [Citation(s) in RCA: 75] [Impact Index Per Article: 37.5] [Received: 04/24/2022] [Accepted: 06/03/2022] [Indexed: 01/14/2023]
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., extensively larger number of features compared to the number of samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
Affiliation(s)
- Tayaza Fadason
  - Liggins Institute, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- Andreas W. Kempa-Liehr
  - Department of Engineering Science, The University of Auckland, Auckland, New Zealand
  - *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
- Justin M. O'Sullivan
  - Liggins Institute, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
  - MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
  - Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia
  - *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
4
Multi-Objective Artificial Bee Colony Algorithm Based on Scale-Free Network for Epistasis Detection. Genes (Basel) 2022; 13:genes13050871. [PMID: 35627256 PMCID: PMC9140669 DOI: 10.3390/genes13050871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2022] [Revised: 04/30/2022] [Accepted: 05/10/2022] [Indexed: 12/04/2022]
Abstract
In genome-wide association studies, epistasis detection is of great significance for the occurrence and diagnosis of complex human diseases, but it also faces challenges such as high dimensionality and a small data sample size. In order to cope with these challenges, several swarm intelligence methods have been introduced to identify epistasis in recent years. However, the existing methods still have some limitations, such as high-consumption and premature convergence. In this study, we proposed a multi-objective artificial bee colony (ABC) algorithm based on the scale-free network (SFMOABC). The SFMOABC incorporates the scale-free network into the ABC algorithm to guide the update and selection of solutions. In addition, the SFMOABC uses mutual information and the K2-Score of the Bayesian network as objective functions, and the opposition-based learning strategy is used to improve the search ability. Experiments were performed on both simulation datasets and a real dataset of age-related macular degeneration (AMD). The results of the simulation experiments showed that the SFMOABC has better detection power and efficiency than seven other epistasis detection methods. In the real AMD data experiment, most of the single nucleotide polymorphism combinations detected by the SFMOABC have been shown to be associated with AMD disease. Therefore, SFMOABC is a promising method for epistasis detection.
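As an illustration of one of the two objective functions named in the abstract, the following is a minimal sketch of mutual information between a SNP combination and a phenotype, computed from empirical counts. The function name and the toy data are assumptions for this example; the K2-Score objective and the bee colony search itself are not shown.

```python
from collections import Counter
from math import log2


def mutual_information(genotypes, phenotypes):
    """Empirical mutual information (in bits) between genotype combinations and phenotype.

    genotypes  : list of tuples, one tuple of SNP states per sample
    phenotypes : list of phenotype labels, one per sample
    """
    n = len(phenotypes)
    joint = Counter(zip(genotypes, phenotypes))
    g_counts = Counter(genotypes)
    p_counts = Counter(phenotypes)
    # Sum over observed cells of p(g, y) * log2( p(g, y) / (p(g) * p(y)) ).
    return sum(
        (c / n) * log2(c * n / (g_counts[g] * p_counts[y]))
        for (g, y), c in joint.items()
    )


# Toy XOR-style example: no single SNP is informative, but the pair is.
pheno = [0, 1, 1, 0]
pair = [(0, 0), (0, 1), (1, 0), (1, 1)]
first_snp_only = [(g[0],) for g in pair]
mi_pair = mutual_information(pair, pheno)
mi_single = mutual_information(first_snp_only, pheno)
```

In this toy case the pair carries one full bit about the phenotype while the single SNP carries none, which is the pattern an epistasis model without marginal effects produces.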
5
Blumenthal DB, Baumbach J, Hoffmann M, Kacprowski T, List M. A framework for modeling epistatic interaction. Bioinformatics 2021; 37:1708-1716. [PMID: 33252645 DOI: 10.1093/bioinformatics/btaa990] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/07/2020] [Revised: 10/21/2020] [Accepted: 11/16/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Recently, various tools for detecting single nucleotide polymorphisms (SNPs) involved in epistasis have been developed. However, no studies evaluate the employed statistical epistasis models such as the χ²-test or quadratic regression independently of the tools that use them. Such an independent evaluation is crucial for developing improved epistasis detection tools, because it makes it possible to decide whether a tool's performance should be attributed to the epistasis model or to the optimization strategy run on top of it. RESULTS We present a protocol for evaluating epistasis models independently of the tools they are used in and generalize existing models designed for dichotomous phenotypes to the categorical and quantitative case. In addition, we propose a new model which scores candidate SNP sets by computing maximum likelihood distributions for the observed phenotypes in the cells of their penetrance tables. Extensive experiments show that the proposed maximum likelihood model outperforms three widely used epistasis models in most cases. The experiments also provide valuable insights into the properties of existing models, for instance, that quadratic regression performs particularly well on instances with quantitative phenotypes. AVAILABILITY AND IMPLEMENTATION The evaluation protocol and all compared models are implemented in C++ and are supported under Linux and macOS. They are available at https://github.com/baumbachlab/genepiseeker/, along with test datasets and scripts to reproduce the experiments. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
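For readers unfamiliar with the χ²-test as an epistasis model, the following is a minimal sketch of Pearson's chi-squared statistic on a genotype-by-phenotype contingency table. It is a generic textbook computation, not the code from the genepiseeker repository; the function name and the example tables are assumptions for illustration.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table.

    table : list of rows, one row per genotype combination of the candidate
            SNP set, one column per phenotype class (e.g. controls, cases).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of genotype and phenotype.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are genotype combinations, columns are (controls, cases).
independent = chi_squared([[10, 10], [10, 10]])
associated = chi_squared([[20, 0], [0, 20]])
```

A larger statistic indicates a stronger departure from independence between the genotype combination and the phenotype; a search strategy built on this model would rank candidate SNP sets by it.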
Affiliation(s)
- David B Blumenthal
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Jan Baumbach
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Markus Hoffmann
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Tim Kacprowski
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Markus List
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
6
Wang L, Wang Y, Fu Y, Gao Y, Du J, Yang C, Liu J. AFSBN: A Method of Artificial Fish Swarm Optimizing Bayesian Network for Epistasis Detection. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1369-1383. [PMID: 31670676 DOI: 10.1109/tcbb.2019.2949780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Efficiently and accurately mining interactions between SNPs (namely epistasis) is essential when tackling the complexity of the underlying biological mechanisms. To overcome low learning efficiency and convergence to local optima, this work proposes an epistasis mining method that uses an artificial fish swarm to optimize a Bayesian network (AFSBN). The method exploits the global optimization, robustness, and fast convergence of the artificial fish swarm algorithm by incorporating it into the heuristic search strategy of the Bayesian network. The initial network structure evolves through foraging, clustering, tail-chasing, and random behaviors. The algorithm chooses among these behaviors to modify the network state according to changes in the surrounding environment and the states of partners. It realizes the interaction between each artificial fish and its neighboring environment, and finally finds the optimal network in the population. We compared AFSBN with other existing algorithms on both simulated and real datasets. The experimental results demonstrate that our method outperforms the others in epistasis detection accuracy across different datasets, with essentially no loss of efficiency.
7
Mishra R, Li B. The Application of Artificial Intelligence in the Genetic Study of Alzheimer's Disease. Aging Dis 2020; 11:1567-1584. [PMID: 33269107 PMCID: PMC7673858 DOI: 10.14336/ad.2020.0312] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 01/02/2020] [Accepted: 03/12/2020] [Indexed: 12/13/2022]
Abstract
Alzheimer's disease (AD) is a neurodegenerative disease in which genetic factors contribute approximately 70% of etiological effects. Studies have found many significant genetic and environmental factors, but the pathogenesis of AD is still unclear. With the application of microarray and next-generation sequencing technologies, research using genetic data has shown explosive growth. In addition to conventional statistical methods for the processing of these data, artificial intelligence (AI) technology shows obvious advantages in analyzing such complex projects. This article first briefly reviews the application of AI technology in medicine and the current status of genetic research in AD. Then, a comprehensive review is focused on the application of AI in the genetic research of AD, including the diagnosis and prognosis of AD based on genetic data, the analysis of genetic variation, gene expression profile, gene-gene interaction in AD, and genetic analysis of AD based on a knowledge base. Although many studies have yielded some meaningful results, they are still in a preliminary stage. The main shortcomings include the limitations of the databases, failing to take advantage of AI to conduct a systematic biology analysis of multilevel databases, and lack of a theoretical framework for the analysis results. Finally, we outline directions for future development. It is crucial to develop high-quality, comprehensive, large-sample, shared data resources; a multi-level system biology AI analysis strategy is one of the development directions, and computational creativity may play a role in theory model building, verification, and designing new intervention protocols for AD.
Affiliation(s)
- Rohan Mishra
  - Washington Institute for Health Sciences, Arlington, VA 22203, USA
- Bin Li
  - Washington Institute for Health Sciences, Arlington, VA 22203, USA
  - Georgetown University Medical Center, Washington D.C. 20057, USA
8
Romero-Rosales BL, Tamez-Pena JG, Nicolini H, Moreno-Treviño MG, Trevino V. Improving predictive models for Alzheimer's disease using GWAS data by incorporating misclassified samples modeling. PLoS One 2020; 15:e0232103. [PMID: 32324812 PMCID: PMC7179850 DOI: 10.1371/journal.pone.0232103] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 09/19/2019] [Accepted: 04/07/2020] [Indexed: 01/14/2023]
Abstract
Late-onset Alzheimer’s Disease (LOAD) is the most common form of dementia in the elderly. Genome-wide association studies (GWAS) for LOAD have opened new avenues to identify genetic causes and to provide diagnostic tools for early detection. Although several predictive models have been proposed using the few detected GWAS markers, there is still a need for improvement and identification of potential markers. Commonly, polygenic risk scores are being used for prediction. Nevertheless, other methods to generate predictive models have been suggested. In this research, we compared three machine learning methods that have been shown to construct powerful predictive models (genetic algorithms, LASSO, and step-wise) and propose the inclusion of markers from misclassified samples to improve overall prediction accuracy. Our results show that the addition of markers from an initial model plus the markers of the model fitted to misclassified samples improves the area under the receiver operating characteristic curve by around 5%, reaching ~0.84, which is highly competitive using only genetic information. The computational strategy used here can help to devise better methods to improve classification models for AD. Our results could have a positive impact on the early diagnosis of Alzheimer’s disease.
Affiliation(s)
- Jose-Gerardo Tamez-Pena
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, Mexico
- Humberto Nicolini
  - Genomics of Psychiatric and Neurodegenerative Diseases Laboratory, National Institute of Genomic Medicine (INMEGEN), Mexico City, Mexico
- Victor Trevino
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, Mexico
9
Ding H, An N, Au R, Devine S, Auerbach SH, Massaro J, Joshi P, Liu X, Liu Y, Mahon E, Ang TF, Lin H. Exploring the Hierarchical Influence of Cognitive Functions for Alzheimer Disease: The Framingham Heart Study. J Med Internet Res 2020; 22:e15376. [PMID: 32324139 PMCID: PMC7206516 DOI: 10.2196/15376] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/05/2019] [Revised: 01/13/2020] [Accepted: 01/24/2020] [Indexed: 01/16/2023]
Abstract
BACKGROUND Although some neuropsychological (NP) tests are considered more central for the diagnosis of Alzheimer disease (AD), there is a lack of understanding about the interaction between different cognitive tests. OBJECTIVE This study aimed to demonstrate a global view of hierarchical probabilistic dependencies between NP tests and the likelihood of cognitive impairment to assist physicians in recognizing AD precursors. METHODS Our study included 2091 participants from the Framingham Heart Study. These participants had undergone a variety of NP tests, including Wechsler Memory Scale, Wechsler Adult Intelligence Scale, and Boston Naming Test. Heterogeneous cognitive Bayesian networks were developed to understand the relationship between NP tests and the cognitive status. The performance of probabilistic inference was evaluated by the 10-fold cross validation. RESULTS A total of 4512 NP tests were used to build the Bayesian network for the dementia diagnosis. The network demonstrated conditional dependency between different cognitive functions that precede the development of dementia. The prediction model reached an accuracy of 82.24%, with sensitivity of 63.98% and specificity of 92.74%. This probabilistic diagnostic system can also be applied to participants that exhibit more heterogeneous profiles or with missing responses for some NP tests. CONCLUSIONS We developed a probabilistic dependency network for AD diagnosis from 11 NP tests. Our study revealed important psychological functional segregations and precursor evidence of AD development and heterogeneity.
Affiliation(s)
- Huitong Ding
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China
  - Key Laboratory of Knowledge Engineering with Big Data of Ministry of Education, Hefei University of Technology, Hefei, China
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA, United States
- Ning An
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China
  - Key Laboratory of Knowledge Engineering with Big Data of Ministry of Education, Hefei University of Technology, Hefei, China
- Rhoda Au
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA, United States
  - Department of Epidemiology, Boston University School of Public Health, Boston, MA, United States
  - The Framingham Heart Study, Framingham, MA, United States
  - Department of Neurology, Boston University School of Medicine, Boston, MA, United States
- Sherral Devine
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA, United States
  - The Framingham Heart Study, Framingham, MA, United States
- Sanford H Auerbach
  - The Framingham Heart Study, Framingham, MA, United States
  - Department of Neurology, Boston University School of Medicine, Boston, MA, United States
- Joseph Massaro
  - The Framingham Heart Study, Framingham, MA, United States
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, United States
- Prajakta Joshi
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA, United States
  - The Framingham Heart Study, Framingham, MA, United States
- Xue Liu
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA, United States
  - The Framingham Heart Study, Framingham, MA, United States
- Yulin Liu
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA, United States
  - The Framingham Heart Study, Framingham, MA, United States
- Elizabeth Mahon
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA, United States
  - The Framingham Heart Study, Framingham, MA, United States
- Ting Fa Ang
  - Department of Anatomy and Neurobiology, Boston University School of Medicine, Boston, MA, United States
  - Department of Epidemiology, Boston University School of Public Health, Boston, MA, United States
  - The Framingham Heart Study, Framingham, MA, United States
- Honghuang Lin
  - The Framingham Heart Study, Framingham, MA, United States
  - Section of Computational Biomedicine, Department of Medicine, Boston University School of Medicine, Boston, MA, United States
10
Dos Santos JPR, Fernandes SB, McCoy S, Lozano R, Brown PJ, Leakey ADB, Buckler ES, Garcia AAF, Gore MA. Novel Bayesian Networks for Genomic Prediction of Developmental Traits in Biomass Sorghum. G3 (Bethesda, Md.) 2020; 10:769-781. [PMID: 31852730 PMCID: PMC7003104 DOI: 10.1534/g3.119.400759] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 09/22/2019] [Accepted: 12/15/2019] [Indexed: 11/23/2022]
Abstract
The ability to connect genetic information between traits over time allows Bayesian networks to offer a powerful probabilistic framework to construct genomic prediction models. In this study, we phenotyped a diversity panel of 869 biomass sorghum (Sorghum bicolor (L.) Moench) lines, which had been genotyped with 100,435 SNP markers, for plant height (PH) with biweekly measurements from 30 to 120 days after planting (DAP) and for end-of-season dry biomass yield (DBY) in four environments. We evaluated five genomic prediction models: Bayesian network (BN), Pleiotropic Bayesian network (PBN), Dynamic Bayesian network (DBN), multi-trait GBLUP (MTr-GBLUP), and multi-time GBLUP (MTi-GBLUP) models. In fivefold cross-validation, prediction accuracies ranged from 0.46 (PBN) to 0.49 (MTr-GBLUP) for DBY and from 0.47 (DBN, DAP120) to 0.75 (MTi-GBLUP, DAP60) for PH. Forward-chaining cross-validation further improved prediction accuracies of the DBN, MTi-GBLUP and MTr-GBLUP models for PH (training slice: 30-45 DAP) by 36.4-52.4% relative to the BN and PBN models. Coincidence indices (target: biomass, secondary: PH) and a coincidence index based on lines (PH time series) showed that the ranking of lines by PH changed minimally after 45 DAP. These results suggest a two-level indirect selection method for PH at harvest (first-level target trait) and DBY (second-level target trait) could be conducted earlier in the season based on ranking of lines by PH at 45 DAP (secondary trait). With the advance of high-throughput phenotyping technologies, our proposed two-level indirect selection framework could be valuable for enhancing genetic gain per unit of time when selecting on developmental traits.
Affiliation(s)
- Jhonathan P R Dos Santos
  - Plant Breeding and Genetics Section, School of Integrative Plant Science
  - Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, SP, Brazil
- Roberto Lozano
  - Plant Breeding and Genetics Section, School of Integrative Plant Science
- Patrick J Brown
  - Section of Agricultural Plant Biology, Department of Plant Sciences, University of California Davis, 95616
- Andrew D B Leakey
  - Department of Crop Science
  - Institute for Genomic Biology
  - Department of Plant Biology, University of Illinois at Urbana Champaign, 61801
- Edward S Buckler
  - Plant Breeding and Genetics Section, School of Integrative Plant Science
  - United States Department of Agriculture, Agricultural Research Service, R. W. Holley Center, Ithaca, New York 14853
  - Institute for Genomic Diversity, Cornell University, Ithaca, New York 14853
- Antonio A F Garcia
  - Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, SP, Brazil
- Michael A Gore
  - Plant Breeding and Genetics Section, School of Integrative Plant Science
11
Li X, Zhang S, Wong KC. Nature-Inspired Multiobjective Epistasis Elucidation from Genome-Wide Association Studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:226-237. [PMID: 29994485 DOI: 10.1109/tcbb.2018.2849759] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/08/2023]
Abstract
In recent years, the detection of epistatic interactions of multiple genetic variants on the causes of complex diseases brings a significant challenge in genome-wide association studies (GWAS). However, most of the existing methods still suffer from algorithmic limitations such as single-objective optimization, intensive computational requirement, and premature convergence. In this paper, we propose and formulate an epistatic interaction multi-objective artificial bee colony algorithm based on decomposition (EIMOABC/D) to address those problems for genetic interaction detection in genome-wide association studies. First, to direct the genetic interaction detection, two objective functions are formulated to characterize various epistatic models; rank probability model is proposed to sort each population into different nondomination levels based on the fast nondominated sorting approach. After that, the mutual information based local search algorithm is proposed to guide the population search for disease model evaluations in an unbiased manner. To validate the effectiveness of EIMOABC/D, we compare EIMOABC/D against seven state-of-the-art methods on 77 epistatic models including eight small-scale epistatic models with marginal effects, eight large-scale epistatic models with marginal effects, 60 large-scale epistatic models without any marginal effect, and one case study. The experimental results indicate that our proposed algorithm EIMOABC/D outperforms seven state-of-the-art methods on those epistatic models. Furthermore, time complexity analysis and parameter analysis are conducted to demonstrate various properties of our proposed algorithm.
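The idea of sorting a population into nondomination levels can be illustrated with a short sketch. This is a simple level-by-level peel, not the fast nondominated sorting procedure or the rank probability model of EIMOABC/D; the function names and the example objective values are assumptions, and both objectives are taken to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated_levels(points):
    """Group point indices into successive nondomination levels (level 0 is the Pareto front)."""
    remaining = list(range(len(points)))
    levels = []
    while remaining:
        # A point belongs to the current level if no other remaining point dominates it.
        front = [
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        ]
        levels.append(front)
        remaining = [i for i in remaining if i not in front]
    return levels


# Hypothetical (objective1, objective2) values for five candidate SNP combinations.
scores = [(1, 5), (2, 2), (5, 1), (3, 3), (6, 6)]
levels = nondominated_levels(scores)
```

Candidates in the first level are not outperformed on both objectives by any other candidate, so a multi-objective search would typically retain them in preference to later levels.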
12
Momen M, Campbell MT, Walia H, Morota G. Utilizing trait networks and structural equation models as tools to interpret multi-trait genome-wide association studies. Plant Methods 2019; 15:107. [PMID: 31548847 PMCID: PMC6749677 DOI: 10.1186/s13007-019-0493-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/11/2019] [Accepted: 09/06/2019] [Indexed: 05/13/2023]
Abstract
BACKGROUND Plant breeders seek to develop cultivars with maximal agronomic value, which is often assessed using numerous, often genetically correlated traits. As intervention on one trait will affect the value of another, breeding decisions should consider the relationships among traits in the context of putative causal structures (i.e., trait networks). While multi-trait genome-wide association studies (MTM-GWAS) can infer putative genetic signals at the multivariate scale, standard MTM-GWAS does not accommodate the network structure of phenotypes, and therefore does not address how the traits are interrelated. We extended the scope of MTM-GWAS by incorporating trait network structures into GWAS using structural equation models (SEM-GWAS). Here, we illustrate the utility of SEM-GWAS using a digital metric for shoot biomass, root biomass, water use, and water use efficiency in rice. RESULTS A salient feature of SEM-GWAS is that it can partition the total single nucleotide polymorphism (SNP) effects acting on a trait into direct and indirect effects. Using this novel approach, we show that for most QTL associated with water use, total SNP effects were driven by genetic effects acting directly on water use rather than genetic effects originating from upstream traits. Conversely, total SNP effects for water use efficiency were largely due to indirect effects originating from the upstream trait, projected shoot area. CONCLUSIONS We describe a robust framework that can be applied to multivariate phenotypes to understand the interrelationships between complex traits. This framework provides novel insights into how QTL act within a phenotypic network that would otherwise not be possible with conventional multi-trait GWAS approaches. Collectively, these results suggest that the use of SEM may enhance our understanding of complex relationships among agronomic traits.
Affiliation(s)
- Mehdi Momen
  - Department of Animal and Poultry Sciences, Virginia Polytechnic Institute and State University, 175 West Campus Drive, Blacksburg, VA 24061 USA
- Malachy T. Campbell
  - Department of Animal and Poultry Sciences, Virginia Polytechnic Institute and State University, 175 West Campus Drive, Blacksburg, VA 24061 USA
- Harkamal Walia
  - Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
- Gota Morota
  - Department of Animal and Poultry Sciences, Virginia Polytechnic Institute and State University, 175 West Campus Drive, Blacksburg, VA 24061 USA
|
13
|
Epi-GTBN: an approach of epistasis mining based on genetic Tabu algorithm and Bayesian network. BMC Bioinformatics 2019; 20:444. [PMID: 31455207 PMCID: PMC6712799 DOI: 10.1186/s12859-019-3022-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2018] [Accepted: 08/07/2019] [Indexed: 12/31/2022] Open
Abstract
Background Mining epistatic loci that affect specific phenotypic traits is an important research issue in biology. A Bayesian network (BN) is a graphical model that can express the relationship between genetic loci and phenotype, and it has been widely used for epistasis mining. However, this method has two disadvantages: low learning efficiency and a tendency to fall into local optima. Genetic algorithms offer rapid global search and can avoid falling into local optima; they are also scalable and easy to integrate with other algorithms. This work proposes an epistasis mining approach based on a genetic tabu algorithm and Bayesian network (Epi-GTBN). It uses a genetic algorithm as the heuristic search strategy for the Bayesian network. Individual structures evolve through the genetic operations of selection, crossover and mutation, which helps to find the optimal network structure and, in turn, to mine epistatic loci effectively. To enhance the diversity of the population and obtain a more effective global optimal solution, we apply the tabu search strategy within the crossover and mutation operations of the genetic algorithm, which helps to accelerate convergence. Results We compared Epi-GTBN with other recent algorithms using both simulated and real datasets. The experimental results demonstrate that our method achieves much better epistasis detection accuracy without sacrificing efficiency across different datasets. Conclusions The presented methodology (Epi-GTBN) is an effective method for epistasis detection, and it can be seen as an interesting addition to the arsenal used in complex traits analyses. Electronic supplementary material The online version of this article (10.1186/s12859-019-3022-z) contains supplementary material, which is available to authorized users.
|
14
|
Zhang L, Pan Q, Wang Y, Wu X, Shi X. Bayesian Network Construction and Genotype-Phenotype Inference Using GWAS Statistics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:475-489. [PMID: 29990020 PMCID: PMC6499495 DOI: 10.1109/tcbb.2017.2779498] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/15/2023]
Abstract
Genome-wide association studies (GWASs) have received increasing attention to understand how genetic variation affects different human traits. In this paper, we study whether and to what extent GWAS statistics can be exploited to infer private information about a human individual. We first provide a method to construct a three-layered Bayesian network explicitly revealing the conditional dependency between single-nucleotide polymorphisms (SNPs) and traits from the public GWAS catalog. The key challenge in building a Bayesian network from GWAS statistics is the specification of the conditional probability table of a variable with multiple parent variables. We employ the models of independence of causal influences which assume that the causal mechanism of each parent variable is mutually independent. We then formulate three inference problems based on the dependency relationship captured in the Bayesian network, namely trait inference given SNP genotype, genotype inference given trait, and trait inference given known traits, and develop efficient formulas and algorithms. Unlike previous work, the possible target of the inference problems we study may be any individual, not limited to GWAS participants. Empirical evaluations show the effectiveness of our proposed methods. In summary, our work implies that meaningful information can be inferred from modeling GWAS statistics, and appropriate privacy protection mechanisms need to be developed to protect genetic privacy not only of GWAS participants but also regular individuals.
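To make the idea of independence of causal influences concrete, the sketch below builds a conditional probability table with a noisy-OR model, which is one well-known member of that family. The abstract does not say which specific model the authors use, so noisy-OR, the function name, and the probabilities are all assumptions chosen for illustration.

```python
from itertools import product


def noisy_or_cpt(link_probs, leak=0.0):
    """Return P(child = 1 | parent states) under a noisy-OR model.

    link_probs[i] : probability that parent i alone switches the child on
    leak          : probability the child is on when no parent is active
    """
    cpt = {}
    for state in product((0, 1), repeat=len(link_probs)):
        # The child stays off only if the leak and every active parent fail.
        p_off = 1.0 - leak
        for active, p in zip(state, link_probs):
            if active:
                p_off *= 1.0 - p
        cpt[state] = 1.0 - p_off
    return cpt


# Two hypothetical risk SNPs acting independently on one trait.
cpt = noisy_or_cpt([0.2, 0.5])
for state, prob in sorted(cpt.items()):
    print(state, round(prob, 3))
```

The appeal of such a model is that a node with k binary parents needs only k link probabilities (plus an optional leak) instead of a full table with 2^k independent entries, which is useful when only per-SNP summary statistics are available.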
|
15
|
Martínez CA, Khare K, Rahman S, Elzo MA. Modeling correlated marker effects in genome-wide prediction via Gaussian concentration graph models. J Theor Biol 2017; 437:67-78. [PMID: 29055677 DOI: 10.1016/j.jtbi.2017.10.017] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2016] [Revised: 09/25/2017] [Accepted: 10/15/2017] [Indexed: 10/18/2022]
Abstract
In genome-wide prediction, independence of marker allele substitution effects is typically assumed; however, since early stages in the evolution of this technology it has been known that nature points to correlated effects. In statistics, graphical models have been identified as a useful and powerful tool for covariance estimation in high dimensional problems and it is an area that has recently experienced a great expansion. In particular, Gaussian concentration graph models (GCGM) have been widely studied. These are models in which the distribution of a set of random variables, the marker effects in this case, is assumed to be Markov with respect to an undirected graph G. In this paper, Bayesian (Bayes G and Bayes G-D) and frequentist (GML-BLUP) methods adapting the theory of GCGM to genome-wide prediction were developed. Different approaches to define the graph G based on domain-specific knowledge were proposed, and two propositions and a corollary establishing conditions to find decomposable graphs were proven. These methods were implemented in small simulated and real datasets. In our simulations, scenarios where correlations among allelic substitution effects were expected to arise due to various causes were considered, and graphs were defined on the basis of physical marker positions. Results showed improvements in correlation between phenotypes and predicted additive genetic values and accuracies of predicted additive genetic values when accounting for partially correlated allele substitution effects. Extensions to the multiallelic loci case were described and some possible refinements incorporating more flexible priors in the Bayesian setting were discussed. Our models are promising because they allow incorporation of biological information in the prediction process, and because they are more flexible and general than other models accounting for correlated marker effects that have been proposed previously.
Affiliation(s)
| | - Kshitij Khare
- Department of Statistics, University of Florida, Gainesville, FL, USA
| | - Syed Rahman
- Department of Statistics, University of Florida, Gainesville, FL, USA
| | - Mauricio A Elzo
- Department of Animal Sciences, University of Florida, Gainesville, FL, USA
|
16
|
Verification of Three-Phase Dependency Analysis Bayesian Network Learning Method for Maize Carotenoid Gene Mining. BIOMED RESEARCH INTERNATIONAL 2017; 2017:1813494. [PMID: 28828382 PMCID: PMC5554554 DOI: 10.1155/2017/1813494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/03/2017] [Accepted: 06/27/2017] [Indexed: 11/17/2022]
Abstract
Background and Objective Mining the genes related to maize carotenoid components is important to improve the carotenoid content and the quality of maize. Methods On the basis of the entropy estimation method with a Gaussian kernel probability density estimator, we use the three-phase dependency analysis (TPDA) Bayesian network structure learning method to construct the network of maize genes and carotenoid component traits. Results Using two discretization methods and different discretization values, we compare the learning effect and efficiency of 10 Bayesian network structure learning methods. The method is verified and analyzed on the maize dataset of a global germplasm collection with 527 elite inbred lines. Conclusions The results confirm the effectiveness of the TPDA method, which significantly outperforms the other 9 Bayesian network learning methods. It is an efficient method of mining genes for maize carotenoid component traits. The parameters obtained by experiments will help carry out practical gene mining effectively in the future.
|
17
|
Sun Y, Shang J, Liu JX, Li S, Zheng CH. epiACO - a method for identifying epistasis based on ant Colony optimization algorithm. BioData Min 2017; 10:23. [PMID: 28694848 PMCID: PMC5500974 DOI: 10.1186/s13040-017-0143-7] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2016] [Accepted: 06/29/2017] [Indexed: 11/23/2022] Open
Abstract
Background Identifying epistasis or epistatic interactions, which refer to nonlinear interaction effects of single nucleotide polymorphisms (SNPs), is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Although much work has been done on identifying epistatic interactions, their methodological and computational challenges mean that algorithmic development is still ongoing. Results In this study, a method named epiACO, which is based on the ant colony optimization algorithm, is proposed to identify epistatic interactions. Highlights of epiACO are the introduced fitness function Svalue, path selection strategies, and a memory based strategy. The Svalue leverages the advantages of both mutual information and Bayesian network to effectively and efficiently measure associations between SNP combinations and the phenotype. Two path selection strategies, i.e., probabilistic path selection strategy and stochastic path selection strategy, are provided to adaptively guide ant behaviors of exploration and exploitation. The memory based strategy is designed to retain candidate solutions found in the previous iterations, and compare them to solutions of the current iteration to generate new candidate solutions, yielding a more accurate way for identifying epistasis. Conclusions Experiments of epiACO and its comparison with other recent methods epiMODE, TEAM, BOOST, SNPRuler, AntEpiSeeker, AntMiner, MACOED, and IACO are performed on both simulation data sets and a real data set of age-related macular degeneration. Results show that epiACO is promising in identifying epistasis and might be an alternative to existing methods.
Affiliation(s)
- Yingxia Sun
- School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826 China
| | - Junliang Shang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826 China; Institute of Network Computing, Qufu Normal University, Rizhao, 276826 China
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826 China
| | - Shengjun Li
- School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826 China
| | - Chun-Hou Zheng
- School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826 China; College of Electrical Engineering and Automation, Anhui University, Hefei, 230601 China
|
18
|
Gogoshin G, Boerwinkle E, Rodin AS. New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data. J Comput Biol 2016; 24:340-356. [PMID: 27681505 PMCID: PMC5372779 DOI: 10.1089/cmb.2016.0100] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
Bayesian network (BN) reconstruction is a prototypical systems biology data analysis approach that has been successfully used to reverse engineer and model networks reflecting different layers of biological organization (ranging from genetic to epigenetic to cellular pathway to metabolomic). It is especially relevant in the context of modern (ongoing and prospective) studies that generate heterogeneous high-throughput omics datasets. However, there are both theoretical and practical obstacles to the seamless application of BN modeling to such big data, including computational inefficiency of optimal BN structure search algorithms, ambiguity in data discretization, mixing data types, imputation and validation, and, in general, limited scalability in both reconstruction and visualization of BNs. To overcome these and other obstacles, we present BNOmics, an improved algorithm and software toolkit for inferring and analyzing BNs from omics datasets. BNOmics aims at comprehensive systems biology-type data exploration, including both generating new biological hypotheses and testing and validating the existing ones. Novel aspects of the algorithm center around increasing scalability and applicability to varying data types (with different explicit and implicit distributional assumptions) within the same analysis framework. An output and visualization interface to widely available graph-rendering software is also included. Three diverse applications are detailed. BNOmics was originally developed in the context of genetic epidemiology data and is being continuously optimized to keep pace with the ever-increasing inflow of available large-scale omics datasets. As such, software scalability and usability on less than exotic computer hardware are a priority, as is the applicability of the algorithm and software to heterogeneous datasets containing many data types: single-nucleotide polymorphisms and other genetic/epigenetic/transcriptome variables, metabolite levels, epidemiological variables, endpoints, phenotypes, etc.
Affiliation(s)
- Grigoriy Gogoshin
- 1 Diabetes and Metabolism Research Institute, City of Hope, Duarte, California
| | - Eric Boerwinkle
- 2 Human Genetics Center, School of Public Health, University of Texas Health Science Center, Houston, Texas; 3 Institute of Molecular Medicine, University of Texas Health Science Center, Houston, Texas
| | - Andrei S Rodin
- 1 Diabetes and Metabolism Research Institute, City of Hope, Duarte, California
|
19
|
FHSA-SED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm. PLoS One 2016; 11:e0150669. [PMID: 27014873 PMCID: PMC4807955 DOI: 10.1371/journal.pone.0150669] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2015] [Accepted: 02/16/2016] [Indexed: 12/24/2022] Open
Abstract
Motivation The two-locus model is a typical significant disease model to be identified in genome-wide association studies (GWAS). Due to the intensive computational burden and the diversity of disease models, existing methods suffer from low detection power, high computation cost, and a preference for some types of disease models. Method In this study, two scoring functions (Bayesian network based K2-score and Gini-score) are used for characterizing two SNP loci as a candidate model; the two criteria are adopted simultaneously to improve identification power and to tackle the preference for particular disease models. The harmony search algorithm (HSA) is improved for quickly finding the most likely candidate models among all two-locus models, and a local search algorithm with a two-dimensional tabu table is presented to avoid repeatedly evaluating disease models that have strong marginal effects. Finally, the G-test statistic is used to further test the candidate models. Results We investigate our method, named FHSA-SED, on 82 simulated datasets and a real AMD dataset, and compare it with two typical methods (MACOED and CSE) which have been developed recently based on swarm intelligence search algorithms. The results of simulation experiments indicate that our method outperforms the two compared algorithms in terms of detection power, computation time, evaluation times, sensitivity (TPR), specificity (SPC), positive predictive value (PPV) and accuracy (ACC). Our method has identified two SNPs (rs3775652 and rs10511467) that may also be associated with disease in the AMD dataset.
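Two of the quantities named in this abstract, the K2 score and the G-test statistic, can be computed from a genotype-by-class contingency table. The following is a minimal sketch using the textbook forms of both (log K2 with a uniform Dirichlet prior, and G = 2 Σ O ln(O/E)); it is not the FHSA-SED implementation, and the helper names and toy data are hypothetical.

```python
from collections import Counter
from math import lgamma, log


def contingency(snp_a, snp_b, phenotype):
    """Rows: observed genotype pairs; columns: phenotype classes."""
    counts = Counter(zip(zip(snp_a, snp_b), phenotype))
    combos = sorted({pair for pair, _ in counts})
    classes = sorted(set(phenotype))
    return [[counts[(pair, c)] for c in classes] for pair in combos]


def log_k2(table):
    """Log K2 score of the table; larger means a stronger association."""
    n_classes = len(table[0])
    score = 0.0
    for row in table:
        score += lgamma(n_classes) - lgamma(sum(row) + n_classes)
        score += sum(lgamma(n + 1) for n in row)
    return score


def g_statistic(table):
    """G-test statistic for independence of rows and columns."""
    total = sum(map(sum, table))
    col_sums = [sum(col) for col in zip(*table)]
    g = 0.0
    for row in table:
        row_sum = sum(row)
        for observed, col_sum in zip(row, col_sums):
            if observed:  # empty cells contribute nothing
                g += observed * log(observed * total / (row_sum * col_sum))
    return 2.0 * g


# Toy data: genotypes coded 0/1/2, phenotype 0 = control, 1 = case.
table = contingency([0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 2, 2], [0, 0, 0, 1, 1, 1])
print(log_k2(table), g_statistic(table))
```

A search procedure would typically rank candidate SNP pairs by such a score and then pass only the best candidates to the G-test, comparing the statistic against a chi-square distribution with (rows - 1)(columns - 1) degrees of freedom.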
|
20
|
Niel C, Sinoquet C, Dina C, Rocheleau G. A survey about methods dedicated to epistasis detection. Front Genet 2015; 6:285. [PMID: 26442103 PMCID: PMC4564769 DOI: 10.3389/fgene.2015.00285] [Citation(s) in RCA: 67] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2015] [Accepted: 08/27/2015] [Indexed: 12/25/2022] Open
Abstract
During the past decade, findings of genome-wide association studies (GWAS) improved our knowledge and understanding of disease genetics. To date, thousands of SNPs have been associated with diseases and other complex traits. Statistical analysis typically looks for association between a phenotype and a SNP taken individually via single-locus tests. However, geneticists admit this is an oversimplified approach to tackle the complexity of underlying biological mechanisms. Interaction between SNPs, namely epistasis, must be considered. Unfortunately, epistasis detection gives rise to analytic challenges since analyzing every SNP combination is at present impractical at a genome-wide scale. In this review, we will present the main strategies recently proposed to detect epistatic interactions, along with their operating principle. Some of these methods are exhaustive, such as multifactor dimensionality reduction, likelihood ratio-based tests or receiver operating characteristic curve analysis; some are non-exhaustive, such as machine learning techniques (random forests, Bayesian networks) or combinatorial optimization approaches (ant colony optimization, computational evolution system).
Affiliation(s)
- Clément Niel
- Computer Science Institute of Nantes-Atlantic (Lina), Centre National de la Recherche Scientifique UMR 6241, Ecole Polytechnique de l'Université de Nantes Nantes, France
| | - Christine Sinoquet
- Computer Science Institute of Nantes-Atlantic (Lina), Centre National de la Recherche Scientifique UMR 6241, University of Nantes Nantes, France
| | - Christian Dina
- Institut du Thorax, Institut National de la Santé et de la Recherche Médicale UMR 1087, Centre National de la Recherche Scientifique UMR 6291, University of Nantes Nantes, France
| | - Ghislain Rocheleau
- European Genomic Institute for Diabetes FR3508, Centre National de la Recherche Scientifique UMR 8199, Lille 2 University Lille, France
|
21
|
Abstract
Models for genome-wide prediction and association studies usually target a single phenotypic trait. However, in animal and plant genetics it is common to record information on multiple phenotypes for each individual that will be genotyped. Modeling traits individually disregards the fact that they are most likely associated due to pleiotropy and shared biological basis, thus providing only a partial, confounded view of genetic effects and phenotypic interactions. In this article we use data from a Multiparent Advanced Generation Inter-Cross (MAGIC) winter wheat population to explore Bayesian networks as a convenient and interpretable framework for the simultaneous modeling of multiple quantitative traits. We show that they are equivalent to multivariate genetic best linear unbiased prediction (GBLUP) and that they are competitive with single-trait elastic net and single-trait GBLUP in predictive performance. Finally, we discuss their relationship with other additive-effects models and their advantages in inference and interpretation. MAGIC populations provide an ideal setting for this kind of investigation because the very low population structure and large sample size result in predictive models with good power and limited confounding due to relatedness.
|
22
|
An Improved Opposition-Based Learning Particle Swarm Optimization for the Detection of SNP-SNP Interactions. BIOMED RESEARCH INTERNATIONAL 2015; 2015:524821. [PMID: 26236727 PMCID: PMC4509494 DOI: 10.1155/2015/524821] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/25/2014] [Revised: 12/30/2014] [Accepted: 01/02/2015] [Indexed: 12/22/2022]
Abstract
SNP-SNP interactions have been receiving increasing attention in understanding the mechanism underlying susceptibility to complex diseases. Although much work has been done on the detection of SNP-SNP interactions, algorithmic development is still ongoing. In this study, an improved opposition-based learning particle swarm optimization (IOBLPSO) is proposed for the detection of SNP-SNP interactions. Highlights of IOBLPSO are the introduction of three strategies, namely, opposition-based learning, dynamic inertia weight, and a postprocedure. Opposition-based learning not only enhances the global explorative ability, but also avoids premature convergence. Dynamic inertia weight allows particles to cover a wider search space when the considered SNP is likely to be a random one and to converge on promising regions of the search space when a highly suspected SNP is captured. The postprocedure is used to carry out a deep search in highly suspected SNP sets. Experiments with IOBLPSO are performed on both simulation data sets and a real data set of age-related macular degeneration, and the results demonstrate that IOBLPSO is promising in detecting SNP-SNP interactions. IOBLPSO might be an alternative to existing methods for detecting SNP-SNP interactions.
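The opposition-based learning step mentioned in this abstract can be sketched in a few lines. The sketch uses the generic definition of an opposite point (each coordinate x in [lower, upper] is mirrored to lower + upper - x) and keeps the fitter of each particle and its opposite. How IOBLPSO applies this in detail is not stated in the abstract, so the function names and the toy fitness function are assumptions.

```python
def opposite(position, lower, upper):
    """Mirror every coordinate of a particle inside [lower, upper]."""
    return [lower + upper - x for x in position]


def opposition_step(population, fitness, lower, upper):
    """Keep the fitter of each particle and its opposite (higher is better)."""
    survivors = []
    for particle in population:
        mirrored = opposite(particle, lower, upper)
        # On a tie, max() keeps the original particle.
        survivors.append(max(particle, mirrored, key=fitness))
    return survivors


# Particles are pairs of SNP indices in 1..100; the fitness here is a toy
# stand-in that simply prefers indices close to 90.
def toy_fitness(particle):
    return -sum(abs(x - 90) for x in particle)


population = [[5, 12], [88, 93]]
print(opposition_step(population, toy_fitness, lower=1, upper=100))
```

Evaluating the opposite point gives each particle a second, far-away guess at essentially no extra bookkeeping cost, which is the intuition behind the improved exploration claimed for this family of methods.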
|
23
|
Statistical and Computational Methods for Genetic Diseases: An Overview. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:954598. [PMID: 26106440 PMCID: PMC4464008 DOI: 10.1155/2015/954598] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 09/16/2014] [Accepted: 04/23/2015] [Indexed: 12/19/2022]
Abstract
The identification of causes of genetic diseases has been carried out by several approaches with increasing complexity. Innovation of genetic methodologies leads to the production of large amounts of data that need the support of statistical and computational methods to be correctly processed. The aim of the paper is to provide an overview of statistical and computational methods, paying attention to methods for sequence analysis and complex diseases.
|
24
|
Wang J, Zuo Y, Man YG, Avital I, Stojadinovic A, Liu M, Yang X, Varghese RS, Tadesse MG, Ressom HW. Pathway and network approaches for identification of cancer signature markers from omics data. J Cancer 2015; 6:54-65. [PMID: 25553089 PMCID: PMC4278915 DOI: 10.7150/jca.10631] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2014] [Accepted: 11/14/2014] [Indexed: 12/12/2022] Open
Abstract
The advancement of high throughput omic technologies during the past few years has made it possible to perform many complex assays in a much shorter time than the traditional approaches. The rapid accumulation and wide availability of omic data generated by these technologies offer great opportunities to unravel disease mechanisms, but also present significant challenges to extract knowledge from such massive data and to evaluate the findings. To address these challenges, a number of pathway and network based approaches have been introduced. This review article evaluates these methods and discusses their application in cancer biomarker discovery using hepatocellular carcinoma (HCC) as an example.
Affiliation(s)
- Jinlian Wang
- 1. Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
- 7. Genetics and Genomics Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Yiming Zuo
- 1. Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
- 6. Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
| | - Yan-gao Man
- 2. Bon Secours Cancer Institute, Richmond VA, USA
| | | | - Alexander Stojadinovic
- 2. Bon Secours Cancer Institute, Richmond VA, USA
- 3. Division of Surgical Oncology, Walter Reed National Military Medical Center, Bethesda, MD, USA
| | - Meng Liu
- 4. Department of Public Health School of Hunter College, City University of New York, NYC, USA
| | - Xiaowei Yang
- 4. Department of Public Health School of Hunter College, City University of New York, NYC, USA
| | - Rency S. Varghese
- 1. Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
| | - Mahlet G Tadesse
- 5. Department of Mathematics and Statistics, Georgetown University, Washington DC, USA
| | - Habtom W Ressom
- 1. Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
|
25
|
Jing PJ, Shen HB. MACOED: a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics 2014; 31:634-41. [PMID: 25338719 DOI: 10.1093/bioinformatics/btu702] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023] Open
Abstract
MOTIVATION The existing methods for genetic-interaction detection in genome-wide association studies are designed from different paradigms, and their performances vary considerably for different disease models. One important reason for this variability is that their construction is based on a single-correlation model between SNPs and disease. Due to potential model preference and disease complexity, a single-objective method will therefore not work well in general, resulting in low power and a high false-positive rate. METHOD In this work, we present a multi-objective heuristic optimization methodology named MACOED for detecting genetic interactions. In MACOED, we combine both logistic regression and Bayesian network methods, which are from opposing schools of statistics. The combination of these two evaluation objectives proved to be complementary, resulting in higher power with a lower false-positive rate than observed for optimizing either objective independently. To solve the space and time complexity for high-dimension problems, a memory-based multi-objective ant colony optimization algorithm is designed in MACOED that is able to retain non-dominated solutions found in past iterations. RESULTS We compared MACOED with other recent algorithms using both simulated and real datasets. The experimental results demonstrate that our method outperforms others in both detection power and computational feasibility for large datasets. AVAILABILITY AND IMPLEMENTATION Codes and datasets are available at: www.csbio.sjtu.edu.cn/bioinf/MACOED/.
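The notion of retaining non-dominated solutions across two objectives can be shown with a generic Pareto filter. This is a sketch of the general concept only, assuming both objectives are to be minimised; the candidate SNP pairs and their score tuples are made up, and nothing here reproduces MACOED's actual scoring or memory strategy.

```python
def dominates(a, b):
    """True if score tuple a is no worse than b everywhere and better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(candidates):
    """Return the candidates whose scores no other candidate dominates.

    candidates maps a SNP combination to a tuple of objective values
    (lower is better for every objective in this sketch).
    """
    return {
        snps: scores
        for snps, scores in candidates.items()
        if not any(dominates(other, scores)
                   for key, other in candidates.items() if key != snps)
    }


# Hypothetical (regression-based score, network-based score) per SNP pair.
candidates = {
    ("rs1", "rs2"): (1.0, 5.0),
    ("rs3", "rs4"): (2.0, 2.0),
    ("rs5", "rs6"): (3.0, 3.0),   # dominated by ("rs3", "rs4")
    ("rs7", "rs8"): (5.0, 1.0),
}
print(sorted(non_dominated(candidates)))
```

Keeping the whole non-dominated set, rather than the single best value of one score, is what lets a multi-objective search avoid committing to the disease-model preference of either objective.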
Affiliation(s)
- Peng-Jie Jing
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
|
26
|
Bielza C, Larrañaga P. Bayesian networks in neuroscience: a survey. Front Comput Neurosci 2014; 8:131. [PMID: 25360109 PMCID: PMC4199264 DOI: 10.3389/fncom.2014.00131] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2014] [Accepted: 09/26/2014] [Indexed: 12/29/2022] Open
Abstract
Bayesian networks are a type of probabilistic graphical model that lies at the intersection between statistics and machine learning. They have been shown to be powerful tools to encode dependence relationships among the variables of a domain under uncertainty. Thanks to their generality, Bayesian networks can accommodate continuous and discrete variables, as well as temporal processes. In this paper we review Bayesian networks and how they can be learned automatically from data by means of structure learning algorithms. Also, we examine how a user can take advantage of these networks for reasoning by exact or approximate inference algorithms that propagate the given evidence through the graphical structure. Despite their applicability in many fields, they have been little used in neuroscience, where they have focused on specific problems, like functional connectivity analysis from neuroimaging data. Here we survey key research in neuroscience where Bayesian networks have been used with different aims: discover associations between variables, perform probabilistic reasoning over the model, and classify new observations with and without supervision. The networks are learned from data of any kind (morphological, electrophysiological, -omics and neuroimaging), thereby broadening the scope (molecular, cellular, structural, functional, cognitive and medical) of the brain aspects to be studied.
Affiliation(s)
- Concha Bielza
- *Correspondence: Concha Bielza, Departamento de Inteligencia Artificial, Universidad Politecnica de Madrid, Campus de Montegancedo, Boadilla del Monte, 28660 Madrid, Spain e-mail:
|
27
|
Minozzi G, Pedretti A, Biffani S, Nicolazzi EL, Stella A. Genome wide association analysis of the 16th QTL-MAS Workshop dataset using the Random Forest machine learning approach. BMC Proc 2014; 8:S4. [PMID: 25519518 PMCID: PMC4195406 DOI: 10.1186/1753-6561-8-s5-s4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
Background Genome wide association studies are now widely used in the livestock sector to estimate the association among single nucleotide polymorphisms (SNPs) distributed across the whole genome and one or more trait. As computational power increases, the use of machine learning techniques to analyze large genome wide datasets becomes possible. Methods The objective of this study was to identify SNPs associated with the three traits simulated in the 16th MAS-QTL workshop dataset using the Random Forest (RF) approach. The approach was applied to single and multiple trait estimated breeding values, and on yield deviations and to compare them with the results of the GRAMMAR-CG method. Results The two QTL mapping methods used, GRAMMAR-CG and RF, were successful in identifying the main QTLs for trait 1 on chromosomes 1 and 4, for trait 2 on chromosomes 1, 4 and 5 and for trait 3 on chromosomes 1, 2 and 3. Conclusions The results of the RF approach were confirmed by the GRAMMAR-CG method and validated by the effective QTL position, even if their approach to unravel cryptic genetic structure is different. Furthermore, both methods showed complementary findings. However, when the variance explained by the QTL is low, they both failed to detect significant associations.
Affiliation(s)
- Giulietta Minozzi
- Parco Tecnologico Padano Srl, Via Einstein, 26900 Lodi, Italy; Department of Veterinary Science and Public Health, University of Milan, Via Celoria 10, 20133 Milan, Italy
- Andrea Pedretti
- Parco Tecnologico Padano Srl, Via Einstein, 26900 Lodi, Italy
|
28
|
A Novel Two-Stage Approach for Epistasis Detection in Genome-Wide Case–Control Studies. Biochem Genet 2014; 52:403-14. [DOI: 10.1007/s10528-014-9656-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/26/2013] [Accepted: 04/06/2014] [Indexed: 01/07/2023]
|
29
|
Wang E, Bedognetti D, Tomei S, Marincola FM. Common pathways to tumor rejection. Ann N Y Acad Sci 2013; 1284:75-9. [DOI: 10.1111/nyas.12063] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]
Affiliation(s)
- Ena Wang
- Infectious Disease and Immunogenetics Section (IDIS); Department of Transfusion Medicine; Clinical Center and trans-NIH Center for Human Immunology (CHI); National Institutes of Health; Bethesda; Maryland
|
30
|
Huang Y, Zhao Z, Xu H, Shyr Y, Zhang B. Advances in systems biology: computational algorithms and applications. BMC SYSTEMS BIOLOGY 2012; 6 Suppl 3:S1. [PMID: 23281622 PMCID: PMC3524016 DOI: 10.1186/1752-0509-6-s3-s1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022]
Abstract
The 2012 International Conference on Intelligent Biology and Medicine (ICIBM 2012) was held on April 22-24, 2012 in Nashville, Tennessee, USA. The conference featured six technical sessions, one tutorial session, one workshop, and three keynote presentations that covered state-of-the-art research activities in genomics, systems biology, and intelligent computing. In addition to a major emphasis on next generation sequencing (NGS)-driven informatics, ICIBM 2012 reflected significant interest in systems biology and its applications in medicine. In this editorial, we highlight the selected papers from the meeting that address the development of novel algorithms and applications in systems biology.
Affiliation(s)
- Yufei Huang
- Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX 78249, USA.
|