1
|
Li H, Wang Z, Xu L, Li Q, Gao H, Ma H, Cai W, Chen Y, Gao X, Zhang L, Gao H, Zhu B, Xu L, Li J. Genomic prediction of carcass traits using different haplotype block partitioning methods in beef cattle. Evol Appl 2022; 15:2028-2042. [PMID: 36540636 PMCID: PMC9753827 DOI: 10.1111/eva.13491] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/20/2022] [Accepted: 09/18/2022] [Indexed: 09/22/2023]
Abstract
Genomic prediction (GP) based on haplotype alleles can capture quantitative trait loci (QTL) effects and increase predictive ability because the haplotypes are expected to be in linkage disequilibrium (LD) with QTL. In this study, we constructed haploblocks using LD-based and fixed number of single nucleotide polymorphisms (fixed-SNP) methods with the Illumina BovineHD chip in beef cattle. To evaluate the performance of different haplotype block partitioning methods, we constructed haploblocks based on LD thresholds (from r² > 0.2 to r² > 0.8) and on fixed numbers of SNPs (5, 10, 20). The performance of predictive methods for three carcass traits, liveweight (LW), dressing percentage (DP), and longissimus dorsi muscle weight (LDMW), was evaluated using three approaches: GBLUP and BayesB models based on SNPs; GHBLUP and BayesBH models based on haploblocks; and GHBLUP+GBLUP and BayesBH+BayesB models based on the combined haploblocks and the nonblocked SNPs located between blocks. We found that the accuracies of the LD-based and fixed-SNP haplotype Bayesian methods exceeded those of the SNP-based Bayesian model (by up to 8.54 ± 7.44% and 5.74 ± 2.95%, respectively). GHBLUP showed a high improvement (up to 11.29 ± 9.87%) compared with GBLUP. The Bayesian models had higher accuracies than the BLUP models in most scenarios. The average computing time of the BayesBH+BayesB model can be reduced by 29.3% compared with the BayesB model. The prediction accuracies using the LD-based haplotype method showed higher improvements than the fixed-SNP haplotype method. In addition, to avoid the influence of rare haplotypes generated from haplotype construction, we compared the performance of GP by filtering at four minor haplotype allele frequency (MHAF) thresholds (0.01, 0.025, 0.05, and 0.1) under different conditions (LD levels were set at r² > 0.3, and the fixed number of SNPs was 5). We found the optimal MHAF threshold for LW was 0.01, and the optimal MHAF threshold for DP and LDMW was 0.025.
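The fixed-SNP partitioning and MHAF filtering described in this abstract can be illustrated with a minimal Python sketch. It assumes phased haplotypes coded as 0/1 alleles; the function names, the toy data, and the handling of a trailing short block are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter


def fixed_snp_blocks(n_snps, block_size):
    """Split SNP indices 0..n_snps-1 into consecutive blocks of block_size SNPs.

    Assumption: a trailing remainder shorter than block_size is kept as a block.
    """
    return [range(start, min(start + block_size, n_snps))
            for start in range(0, n_snps, block_size)]


def haplotype_allele_frequencies(haplotypes, block):
    """Frequency of each distinct haplotype allele observed within one block."""
    counts = Counter(tuple(h[i] for i in block) for h in haplotypes)
    total = len(haplotypes)
    return {allele: n / total for allele, n in counts.items()}


def filter_by_mhaf(frequencies, mhaf):
    """Keep haplotype alleles whose frequency is at least the MHAF threshold."""
    return {a: f for a, f in frequencies.items() if f >= mhaf}


# Toy data: 8 phased haplotypes over 6 SNPs (hypothetical, not from the study).
haplotypes = [
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
]
blocks = fixed_snp_blocks(n_snps=6, block_size=3)
freqs = haplotype_allele_frequencies(haplotypes, blocks[0])
kept = filter_by_mhaf(freqs, mhaf=0.25)
print(kept)
```

In this toy block the haplotype allele (1, 0, 1) occurs once in eight haplotypes, so it falls below the illustrative threshold of 0.25 and is dropped. The thresholds in the study (0.01 to 0.1) are far lower because they are applied to a much larger sample.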
Affiliation(s)
- Hongwei Li
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Zezhao Wang
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lei Xu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Qian Li
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Han Gao
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Haoran Ma
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Wentao Cai
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Yan Chen
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Xue Gao
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lupei Zhang
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Huijiang Gao
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Bo Zhu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lingyang Xu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Junya Li
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
|
2
|
Musolf AM, Holzinger ER, Malley JD, Bailey-Wilson JE. What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics. Hum Genet 2021; 141:1515-1528. [PMID: 34862561 PMCID: PMC9360120 DOI: 10.1007/s00439-021-02402-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 05/13/2021] [Accepted: 11/08/2021] [Indexed: 01/26/2023]
Abstract
Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for the deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests are described. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
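As a concrete illustration of a feature importance metric, the following Python sketch implements permutation importance: shuffle one feature at a time and record how much prediction accuracy drops. Permutation importance is one common, model-agnostic choice and is used here as an assumption; the abstract does not say which metrics the review covers for each algorithm. The classifier, genotype coding, and data are hypothetical.

```python
import random


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for k in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[k] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature k replaced by its shuffled values.
            shuffled = [row[:k] + [v] + row[k + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_classifier(row):
    """Hypothetical stand-in for a trained model: uses only the first feature."""
    return int(row[0] >= 1)


# Feature 0: SNP genotype coded 0/1/2; feature 1: an unrelated exposure (toy data).
X = [[0, 5], [0, 3], [0, 8], [1, 1], [2, 4], [2, 9], [1, 2], [0, 7]]
y = [toy_classifier(row) for row in X]
importances = permutation_importance(toy_classifier, X, y)
print(importances)
```

Because the toy classifier ignores the second feature, shuffling it never changes a prediction and its importance is exactly zero, while shuffling the genotype column lowers accuracy.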
Affiliation(s)
- Anthony M Musolf
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
- Emily R Holzinger
- Target Sciences, Informatics and Predictive Sciences, Bristol Myers Squibb, Cambridge, MA, USA
- James D Malley
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
- Joan E Bailey-Wilson
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
|
3
|
Xu L, Gao N, Wang Z, Xu L, Liu Y, Chen Y, Xu L, Gao X, Zhang L, Gao H, Zhu B, Li J. Incorporating Genome Annotation Into Genomic Prediction for Carcass Traits in Chinese Simmental Beef Cattle. Front Genet 2020; 11:481. [PMID: 32499816 PMCID: PMC7243208 DOI: 10.3389/fgene.2020.00481] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 12/12/2019] [Accepted: 04/17/2020] [Indexed: 01/08/2023]
Abstract
Various methods have been proposed for genomic prediction (GP) in livestock. These methods have mainly focused on statistical considerations and did not include genome annotation information. In this study, to improve the predictive performance of carcass traits in Chinese Simmental beef cattle, we incorporated the genome annotation information into GP. Single nucleotide polymorphisms (SNPs) were annotated to five genomic classes: intergenic, gene, exon, protein coding sequences, and 3'/5' untranslated region. Haploblocks were constructed for all markers and these five genomic classes by defining a biologically functional unit, and haplotype effects were modeled in both numerical dosage and categorical coding strategies. The first-order epistatic effects among SNPs and haplotypes were modeled using a categorical epistasis model. For all markers, the extension from the SNP-based model to a haplotype-based model improved the accuracy by 5.4-9.8% for carcass weight (CW), live weight (LW), and striploin (SI). For the five genomic classes using the haplotype-based prediction model, the incorporation of gene class information into the model improved the accuracies by an average of 1.4, 2.1, and 1.3% for CW, LW, and SI, respectively, compared with their corresponding results for all markers. Including the first-order epistatic effects in the prediction models improved the accuracies in some traits and genomic classes. Therefore, for traits with moderate-to-high heritability, incorporating genome annotation information of gene class into haplotype-based prediction models could be considered a promising tool for GP in Chinese Simmental beef cattle, and modeling epistasis in prediction can further increase the accuracy to some degree.
Affiliation(s)
- Ling Xu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Ning Gao
- State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Zezhao Wang
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lei Xu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Ying Liu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Yan Chen
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lingyang Xu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Xue Gao
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lupei Zhang
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Huijiang Gao
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Centre of Beef Cattle Genetic Evaluation, Beijing, China
- Bo Zhu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Centre of Beef Cattle Genetic Evaluation, Beijing, China
- Junya Li
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Centre of Beef Cattle Genetic Evaluation, Beijing, China
|
4
|
Cuyabano B, Su G, Rosa G, Lund M, Gianola D. Bootstrap study of genome-enabled prediction reliabilities using haplotype blocks across Nordic Red cattle breeds. J Dairy Sci 2015; 98:7351-63. [DOI: 10.3168/jds.2015-9360] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 01/20/2015] [Accepted: 06/16/2015] [Indexed: 12/30/2022]
|
5
|
Cuyabano BCD, Su G, Lund MS. Genomic prediction of genetic merit using LD-based haplotypes in the Nordic Holstein population. BMC Genomics 2014; 15:1171. [PMID: 25539631 PMCID: PMC4367958 DOI: 10.1186/1471-2164-15-1171] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Received: 04/11/2013] [Accepted: 12/12/2014] [Indexed: 11/17/2022]
Abstract
Background: A haplotype approach to genomic prediction using high density data in dairy cattle as an alternative to single-marker methods is presented. With the assumption that haplotypes are in stronger linkage disequilibrium (LD) with quantitative trait loci (QTL) than single markers, this study focuses on the use of haplotype blocks (haploblocks) as explanatory variables for genomic prediction. Haploblocks were built based on the LD between markers, which allowed variable reduction. The haploblocks were then used to predict three economically important traits (milk protein, fertility and mastitis) in the Nordic Holstein population.
Results: The haploblock approach improved prediction accuracy compared with the commonly used individual single nucleotide polymorphism (SNP) approach. Furthermore, using an average LD threshold to define the haploblocks (LD≥0.45 between any two markers) increased the prediction accuracies for all three traits, although the improvement was most significant for milk protein (up to 3.1% improvement in prediction accuracy, compared with the individual SNP approach). Hotelling's t-tests were performed, confirming the improvement in prediction accuracy for milk protein. Because the phenotypic values were in the form of de-regressed proofs, the improved accuracy for milk protein may be due to higher reliability of the data for this trait compared with the reliability of the mastitis and fertility data. Comparisons between best linear unbiased prediction (BLUP) and Bayesian mixture models also indicated that the Bayesian model produced the most accurate predictions in every scenario for the milk protein trait, and in some scenarios for fertility.
Conclusions: The haploblock approach to genomic prediction is a promising method for genomic selection in animal breeding. Building haploblocks based on LD reduced the number of variables without the loss of information. This method may play an important role in future genomic prediction involving whole genome sequences.
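The idea of building haploblocks from pairwise LD can be sketched in Python as below. The r² formula for two biallelic loci is standard; the greedy left-to-right block rule (extend a block only while every pair in it meets the threshold) is an assumption made for illustration and may differ from the algorithm used in the study. The toy SNP data are hypothetical.

```python
def r_squared(snp_a, snp_b):
    """LD r^2 between two biallelic SNPs given phased 0/1 alleles per haplotype."""
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(a & b for a, b in zip(snp_a, snp_b)) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


def ld_blocks(snps, threshold):
    """Greedy partition of consecutive SNPs into blocks.

    Assumption: the next SNP joins the current block only if its r^2 with
    every SNP already in the block is at least the threshold.
    """
    if not snps:
        return []
    blocks = []
    current = [0]
    for j in range(1, len(snps)):
        if all(r_squared(snps[i], snps[j]) >= threshold for i in current):
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks


# Toy data: each inner list holds one SNP's alleles across 8 haplotypes.
snps = [
    [0, 0, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
blocks = ld_blocks(snps, threshold=0.45)
print(blocks)
```

With the threshold of 0.45 quoted in the abstract, the first three toy SNPs form one block (their pairwise r² values are 1.0 and 0.6), and the fourth SNP starts a new block because its r² with the first SNP is only 0.25.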
Affiliation(s)
- Guosheng Su
- Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, Denmark.
|
6
|
A review for detecting gene-gene interactions using machine learning methods in genetic epidemiology. BioMed Research International 2013; 2013:432375. [PMID: 24228248 PMCID: PMC3818807 DOI: 10.1155/2013/432375] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 07/17/2013] [Revised: 08/26/2013] [Accepted: 08/27/2013] [Indexed: 01/04/2023]
Abstract
A major statistical and computational challenge in genetic epidemiology is to identify and characterize the genes that interact with other genes and with environmental factors to influence complex multifactorial diseases. These gene-gene interactions are also referred to as epistasis, and they cannot be resolved by traditional statistical methods because of the high dimensionality of the data and the occurrence of multiple polymorphisms. Several machine learning methods, namely neural networks (NNs), support vector machines (SVMs), and random forests (RFs), have therefore been used to identify susceptibility genes in common multifactorial diseases. This paper gives an overview of these machine learning methods, describing the methodology of each and its application in detecting gene-gene and gene-environment interactions. Lastly, it discusses the strengths and weaknesses of each method in detecting gene-gene interactions in complex human disease.
|
7
|
Zhao X, Xu K, Shi H, Cheng J, Ma J, Gao Y, Li Q, Ye X, Lu Y, Yu X, Du J, Du W, Ye Q, Zhou L. Application of the back-error propagation artificial neural network (BPANN) on genetic variants in the PPAR-γ and RXR-α gene and risk of metabolic syndrome in a Chinese Han population. J Biomed Res 2013; 28:114-22. [PMID: 24683409 PMCID: PMC3968282 DOI: 10.7555/jbr.27.20120061] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/03/2012] [Revised: 11/06/2012] [Accepted: 11/16/2012] [Indexed: 12/25/2022]
Abstract
This study aimed to explore the associations between the combined effects of several polymorphisms in the PPAR-γ and RXR-α genes and environmental factors with the risk of metabolic syndrome using a back-error propagation artificial neural network (BPANN). We established the BPANN model based on data gathered from metabolic syndrome patients (n = 1012) and normal controls (n = 1069). The mean impact value (MIV) for each input variable was calculated, and the factors were ranked according to their absolute MIVs. Generalized multifactor dimensionality reduction (GMDR) confirmed a joint effect of PPAR-γ and RXR-α based on the results from BPANN. By BPANN analysis, the metabolic syndrome risk factors ranked in order of importance were body mass index (BMI), serum adiponectin, rs4240711, gender, rs4842194, family history of type 2 diabetes, rs2920502, physical activity, alcohol drinking, rs3856806, family history of hypertension, rs1045570, rs6537944, age, rs17817276, family history of hyperlipidemia, smoking, rs1801282 and rs3132291. However, no polymorphism was statistically significant in multiple logistic regression analysis. After controlling for environmental factors, A1, A2, B1 and B2 (rs4240711, rs4842194, rs2920502 and rs3856806) models were the best models (cross-validation consistency 10/10, P = 0.0107) with the GMDR method. In conclusion, the interaction of the PPAR-γ and RXR-α genes could play a role in susceptibility to metabolic syndrome. A more realistic model is obtained by using BPANN to screen out determinants of diseases of multiple etiologies like metabolic syndrome.
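The MIV ranking step can be sketched in Python as follows. The sketch assumes the commonly used definition of MIV (raise and lower one input by 10%, pass both versions through the trained model, and average the difference in output over all samples); the abstract does not state the perturbation size, and the toy model and its weights are hypothetical stand-ins for a trained BPANN.

```python
import math


def mean_impact_values(predict, samples, delta=0.10):
    """MIV per input: mean of predict(x with x_k raised) - predict(x with x_k lowered)."""
    n_features = len(samples[0])
    mivs = []
    for k in range(n_features):
        total = 0.0
        for x in samples:
            up = list(x)
            up[k] = x[k] * (1 + delta)
            down = list(x)
            down[k] = x[k] * (1 - delta)
            total += predict(up) - predict(down)
        mivs.append(total / len(samples))
    return mivs


def toy_model(x):
    """Hypothetical stand-in for a trained network: logistic of a linear score."""
    score = 0.8 * x[0] - 0.3 * x[1] + 0.0 * x[2]
    return 1 / (1 + math.exp(-score))


# Toy inputs (three variables per subject); values are illustrative only.
samples = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5], [0.5, 0.5, 4.0]]
mivs = mean_impact_values(toy_model, samples)

# Rank variables by absolute MIV, largest first, as described in the abstract.
ranking = sorted(range(len(mivs)), key=lambda k: abs(mivs[k]), reverse=True)
print(mivs, ranking)
```

The sign of an MIV indicates the direction of a variable's influence on the output and its absolute value indicates the relative strength, which is why the ranking uses absolute values.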
Affiliation(s)
- Xu Zhao
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Kang Xu
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Hui Shi
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Jinluo Cheng
- Department of Endocrinology, the Affiliated Changzhou Second Hospital, Nanjing Medical University, Changzhou, Jiangsu 213003, China
- Jianhua Ma
- Department of Endocrinology, the First Affiliated Hospital, Nanjing Medical University, Nanjing, Jiangsu 210029, China
- Yanqin Gao
- Department of Endocrinology, the Third Affiliated Hospital, Nanjing Medical University, Yizheng, Jiangsu 211400, China
- Qian Li
- Department of Endocrinology, the First Affiliated Hospital, Nanjing Medical University, Nanjing, Jiangsu 210029, China
- Xinhua Ye
- Department of Endocrinology, the Affiliated Changzhou Second Hospital, Nanjing Medical University, Changzhou, Jiangsu 213003, China
- Ying Lu
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Xiaofang Yu
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Juan Du
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Wencong Du
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Qing Ye
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Ling Zhou
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
|
8
|
Ahn YH, Shin PM, Oh NR, Park GW, Kim H, Yoo JS. A lectin-coupled, targeted proteomic mass spectrometry (MRM MS) platform for identification of multiple liver cancer biomarkers in human plasma. J Proteomics 2012; 75:5507-15. [PMID: 22789673 DOI: 10.1016/j.jprot.2012.06.027] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Received: 04/16/2012] [Revised: 06/04/2012] [Accepted: 06/30/2012] [Indexed: 10/28/2022]
Abstract
Aberrantly glycosylated proteins related to liver cancer progression were captured with a specific lectin and identified from human plasma by multiple reaction monitoring (MRM) mass spectrometry as multiple biomarkers for hepatocellular carcinoma (HCC). The lectin fractionation for fucosylated protein glycoforms in human plasma was conducted with a fucose-specific Aleuria aurantia lectin (AAL). Following tryptic digestion of the lectin-captured fraction, plasma samples from 30 control cases (including 10 healthy, 10 hepatitis B virus [HBV], and 10 cirrhosis cases) and 10 HCC cases were quantitatively analyzed by MRM to identify which glycoproteins are viable HCC biomarkers. A1AG1, AACT, A1AT, and CERU were found to be potent biomarkers to differentiate HCC plasma from control plasmas. The AUROC generated independently from these four biomarker candidates ranged from 0.73 to 0.92. However, the lectin-coupled MRM assay with multiple combinations of biomarker candidates was statistically superior to the individual candidates, with an AUROC of more than 0.95, and can be an alternative to immunoassays, which inevitably require the tedious development of multiple antibodies against the biomarker candidates to be verified. Overall, the lectin-coupled, targeted proteomic mass spectrometry (MRM MS) platform was found to be efficient at identifying multiple biomarkers from human plasma according to cancer progression.
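The AUROC values quoted in this abstract can be read as the probability that a randomly chosen HCC sample has a higher marker level than a randomly chosen control. A minimal Python sketch of that pairwise definition follows; the marker values are made up for illustration and the abstract does not say how the authors computed their AUROC.

```python
def auroc(case_scores, control_scores):
    """Fraction of (case, control) pairs where the case scores higher; ties count 1/2."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical normalized marker intensities (not data from the study).
hcc = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.4, 0.1]
print(auroc(hcc, controls))  # 10.5 of 12 pairs favor the case, i.e. 0.875
```

An AUROC of 0.5 corresponds to a marker that does not separate the groups at all, and 1.0 to perfect separation.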
Affiliation(s)
- Yeong Hee Ahn
- Division of Mass Spectrometry, Korea Basic Science Institute, Cheongwon-Gun 363-883, Republic of Korea
|
9
|
Bridges M, Heron EA, O'Dushlaine C, Segurado R, Morris D, Corvin A, Gill M, Pinto C. Genetic classification of populations using supervised learning. PLoS One 2011; 6:e14802. [PMID: 21589856 PMCID: PMC3093382 DOI: 10.1371/journal.pone.0014802] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 04/13/2010] [Accepted: 12/01/2010] [Indexed: 11/18/2022]
Abstract
There are many instances in genetics in which we wish to determine whether two
candidate populations are distinguishable on the basis of their genetic
structure. Examples include populations which are geographically separated,
case–control studies and quality control (when participants in a study
have been genotyped at different laboratories). This latter application is of
particular importance in the era of large scale genome wide association studies,
when collections of individuals genotyped at different locations are being
merged to provide increased power. The traditional method for detecting
structure within a population is some form of exploratory technique such as
principal components analysis. Such methods, which do not utilise our prior
knowledge of the membership of the candidate populations, are termed
unsupervised. Supervised methods, on the other hand, are
able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are
a more appropriate tool for detecting genetic differences between populations.
We apply two such methods (neural networks and support vector machines) to the
classification of three populations (two from Scotland and one from Bulgaria).
The sensitivity exhibited by both these methods is considerably higher than that
attained by principal components analysis and in fact comfortably exceeds a
recently conjectured theoretical limit on the sensitivity of unsupervised
methods. In particular, our methods can distinguish between the two Scottish
populations, where principal components analysis cannot. We suggest, on the
basis of our results, that a supervised learning approach should be the method of
choice when classifying individuals into pre-defined populations, particularly
in quality control for large scale genome wide association studies.
Affiliation(s)
- Michael Bridges
- Astrophysics Group, Cavendish Laboratory, Cambridge, United Kingdom
- Elizabeth A. Heron
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Colm O'Dushlaine
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Ricardo Segurado
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Derek Morris
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Aiden Corvin
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Michael Gill
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Carlos Pinto
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
|
10
|
Prediction of body mass index in mice using dense molecular markers and a regularized neural network. Genet Res (Camb) 2011; 93:189-201. [PMID: 21481292 DOI: 10.1017/s0016672310000662] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Indexed: 11/05/2022]
Abstract
Bayesian regularization of artificial neural networks (BRANNs) were used to predict body mass index (BMI) in mice using single nucleotide polymorphism (SNP) markers. Data from 1896 animals with both phenotypic and genotypic (12 320 loci) information were used for the analysis. Missing genotypes were imputed based on estimated allelic frequencies, with no attempt to reconstruct haplotypes based on family information or linkage disequilibrium between markers. A feed-forward multilayer perceptron network consisting of a single output layer and one hidden layer was used. Training of the neural network was done using the Bayesian regularized backpropagation algorithm. When the number of neurons in the hidden layer was increased, the number of effective parameters, γ, increased up to a point and stabilized thereafter. A model with five neurons in the hidden layer produced a value of γ that saturated the data. In terms of predictive ability, a network with five neurons in the hidden layer attained the smallest error and highest correlation in the test data although differences among networks were negligible. Using inherent weight information of BRANN with different number of neurons in the hidden layer, it was observed that 17 SNPs had a larger impact on the network, indicating their possible relevance in prediction of BMI. It is concluded that BRANN may be at least as useful as other methods for high-dimensional genome-enabled prediction, with the advantage of its potential ability of capturing non-linear relationships, which may be useful in the study of quantitative traits under complex gene action.
|
11
|
Neural networks for genetic epidemiology: past, present, and future. BioData Min 2008; 1:3. [PMID: 18822147 PMCID: PMC2553772 DOI: 10.1186/1756-0381-1-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 02/21/2008] [Accepted: 07/17/2008] [Indexed: 01/17/2023]
Abstract
During the past two decades, the field of human genetics has experienced an information explosion. The completion of the human genome project and the development of high throughput SNP technologies have created a wealth of data; however, the analysis and interpretation of these data have created a research bottleneck. While technology facilitates the measurement of hundreds or thousands of genes, statistical and computational methodologies are lacking for the analysis of these data. New statistical methods and variable selection strategies must be explored for identifying disease susceptibility genes for common, complex diseases. Neural networks (NN) are a class of pattern recognition methods that have been successfully implemented for data mining and prediction in a variety of fields. The application of NN for statistical genetics studies is an active area of research. Neural networks have been applied in both linkage and association analysis for the identification of disease susceptibility genes. In the current review, we consider how NN have been used for both linkage and association analyses in genetic epidemiology. We discuss both the successes of these initial NN applications, and the questions that arose during the previous studies. Finally, we introduce evolutionary computing strategies, Genetic Programming Neural Networks (GPNN) and Grammatical Evolution Neural Networks (GENN), for using NN in association studies of complex human diseases that address some of the caveats illuminated by previous work.
|
12
|
Curtis D, Vine AE, Knight J. Investigation into the ability of SNP chipsets and microsatellites to detect association with a disease locus. Ann Hum Genet 2008; 72:547-56. [PMID: 18355389 PMCID: PMC2592259 DOI: 10.1111/j.1469-1809.2008.00434.x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/28/2022]
Abstract
We wished to investigate the ability of different SNP chipsets to detect association with a disease and to investigate the linkage disequilibrium (LD) relationships between microsatellites and nearby SNPs in order to assess their potential usefulness to detect association. SNP genotypes were obtained from HapMap and microsatellite genotypes from CEPH. 5000 SNPs were simulated as disease genes which increased penetrance from 0.01 to 0.02 in a sample of 400 cases and 400 controls. The power of flanking SNPs to detect association was tested using sets of 1, 2, 3 or 4 markers analysed with haplotype analysis or logistic regression and using either all HapMap markers or those from the Affymetrix 500K, Illumina 300K or Illumina 550K chipsets. Additionally, LD relationships between 10 microsatellites and SNPs within 2 Mb of each other were studied. The power for one of the markers to detect association at p = 0.001 was around 0.4. Power was slightly better for logistic regression than for haplotype analysis and for two-marker as opposed to single-marker analysis, but analysing larger numbers of markers had little benefit. The Illumina 550K marker set was better able to detect association than the other two and was almost as powerful as using all HapMap markers. Microsatellites had detectable LD with only a small number of nearby SNPs and the pattern of LD was very variable. Available chipsets have quite good ability to detect association, although results will be critically dependent on the nature of the genetic effect on risk, sample size and the actual LD relationships of the susceptibility polymorphisms involved. Microsatellites seem ill-suited for systematic studies to detect association.
Affiliation(s)
- D Curtis
- Centre for Psychiatry, Queen Mary's School of Medicine and Dentistry, London E1 1BB, UK.
|