1
Li H, Wang Z, Xu L, Li Q, Gao H, Ma H, Cai W, Chen Y, Gao X, Zhang L, Gao H, Zhu B, Xu L, Li J. Genomic prediction of carcass traits using different haplotype block partitioning methods in beef cattle. Evol Appl 2022; 15:2028-2042. [PMID: 36540636 PMCID: PMC9753827 DOI: 10.1111/eva.13491] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/20/2022] [Accepted: 09/18/2022] [Indexed: 09/22/2023]
Abstract
Genomic prediction (GP) based on haplotype alleles can capture quantitative trait loci (QTL) effects and increase predictive ability because the haplotypes are expected to be in linkage disequilibrium (LD) with QTL. In this study, we constructed haploblocks using LD-based and fixed number of single nucleotide polymorphisms (fixed-SNP) methods with the Illumina BovineHD chip in beef cattle. To evaluate the performance of different haplotype block partitioning methods, we constructed haploblocks based on LD thresholds (from r² > 0.2 to r² > 0.8) and fixed numbers of SNPs (5, 10, 20). The performance of predictive methods for three carcass traits, liveweight (LW), dressing percentage (DP), and longissimus dorsi muscle weight (LDMW), was evaluated using three approaches: GBLUP and BayesB models based on SNPs; GHBLUP and BayesBH models based on haploblocks; and GHBLUP+GBLUP and BayesBH+BayesB models based on the combined haploblocks and the nonblocked SNPs located between blocks. We found that the LD-based and fixed-SNP haplotype Bayesian methods were more accurate than the Bayesian models (by up to 8.54 ± 7.44% and 5.74 ± 2.95%, respectively). GHBLUP improved accuracy by up to 11.29 ± 9.87% compared with GBLUP. The Bayesian models had higher accuracies than the BLUP models in most scenarios. The average computing time of the BayesBH+BayesB model was 29.3% lower than that of the BayesB model. The LD-based haplotype method showed larger improvements in prediction accuracy than the fixed-SNP haplotype method. In addition, to avoid the influence of rare haplotypes generated during haplotype construction, we compared the performance of GP after filtering at four minor haplotype allele frequency (MHAF) thresholds (0.01, 0.025, 0.05, and 0.1) under different conditions (LD level set at r² > 0.3, and fixed number of SNPs set at 5). We found the optimal MHAF threshold for LW was 0.01, and the optimal MHAF threshold for DP and LDMW was 0.025.
Affiliation(s)
- Hongwei Li
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Zezhao Wang
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lei Xu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Qian Li
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Han Gao
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Haoran Ma
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Wentao Cai
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Yan Chen
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Xue Gao
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lupei Zhang
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Huijiang Gao
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Bo Zhu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lingyang Xu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Junya Li
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
2
Ye H, Zhang Z, Ren D, Cai X, Zhu Q, Ding X, Zhang H, Zhang Z, Li J. Genomic Prediction Using LD-Based Haplotypes in Combined Pig Populations. Front Genet 2022; 13:843300. [PMID: 35754827 PMCID: PMC9218795 DOI: 10.3389/fgene.2022.843300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2021] [Accepted: 05/02/2022] [Indexed: 11/13/2022]
Abstract
The size of the reference population is an important factor affecting genomic prediction. Thus, combining different populations in genomic prediction is an attractive way to improve prediction ability. However, simply combining multiple reference populations does not increase prediction accuracy as much as expected in pigs. This may be due to differences in linkage disequilibrium (LD) patterns between populations. In this study, we used imputed whole-genome sequencing (WGS) data to construct LD-based haplotypes for genomic prediction in a combined population, to explore the impact of different single-nucleotide polymorphism (SNP) densities, variant representation (SNPs or haplotype alleles), and reference population size on the prediction accuracy for reproduction traits. Our results showed that genomic best linear unbiased prediction (GBLUP) using the WGS data can improve prediction accuracy in multi-population but not within-population. Not only the genomic prediction accuracy of the haplotype method using 80 K chip data in multi-population but also that of GBLUP for the multi-population (3.4–5.9%) was higher than that within-population (1.2–4.3%). More importantly, we found that the haplotype method based on the WGS data in multi-population has better genomic prediction performance, and that building haploblocks in this scenario based on a low LD threshold (r² = 0.2–0.3) produced an optimal set of variables for reproduction traits in the Yorkshire pig population. Our results suggested that either the haplotype method based on the chip data or GBLUP (individual SNP method) based on the WGS data was beneficial for genomic prediction in multi-population, while combining the haplotype method and WGS data was a better strategy for multi-population genomic evaluation.
Affiliation(s)
- Haoqiang Ye
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, China
- Zipeng Zhang
  - Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Duanyang Ren
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, China
- Xiaodian Cai
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, China
- Qianghui Zhu
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, China
- Xiangdong Ding
  - Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Hao Zhang
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, China
- Zhe Zhang
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, China
- Jiaqi Li
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, National Engineering Research Centre for Breeding Swine Industry, College of Animal Science, South China Agricultural University, Guangzhou, China
3
Ahmar S, Ballesta P, Ali M, Mora-Poblete F. Achievements and Challenges of Genomics-Assisted Breeding in Forest Trees: From Marker-Assisted Selection to Genome Editing. Int J Mol Sci 2021; 22:10583. [PMID: 34638922 PMCID: PMC8508745 DOI: 10.3390/ijms221910583] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/03/2021] [Revised: 09/26/2021] [Accepted: 09/27/2021] [Indexed: 12/23/2022]
Abstract
Forest tree breeding efforts have focused mainly on improving traits of economic importance, selecting trees suited to new environments, or generating trees that are more resilient to biotic and abiotic stressors. This review describes various methods of forest tree selection assisted by genomics and the main technological challenges and achievements in research at the genomic level. Due to the long rotation time of a forest plantation and the resulting long generation times necessary to complete a breeding cycle, the use of advanced techniques alongside traditional breeding has been necessary, allowing the use of more precise methods for determining the genetic architecture of traits of interest, such as genome-wide association studies (GWASs) and genomic selection (GS). In this sense, the main factors that determine the accuracy of genomic prediction models are also addressed. In turn, the introduction of genome editing opens the door to new possibilities in forest trees, especially clustered regularly interspaced short palindromic repeats and CRISPR-associated protein 9 (CRISPR/Cas9). It is a highly efficient and effective genome editing technique that has been used to implement targeted changes at specific places in the genome of a forest tree. However, forest trees still lack an efficient transformation method and a sufficient number of genotypes for CRISPR/Cas9. This challenge could be addressed with the use of the newly developing technique GRF-GIF with speed breeding.
Affiliation(s)
- Sunny Ahmar
  - Institute of Biological Sciences, University of Talca, 1 Poniente 1141, Talca 3460000, Chile
- Paulina Ballesta
  - The National Fund for Scientific and Technological Development, Av. del Agua 3895, Talca 3460000, Chile
- Mohsin Ali
  - Department of Forestry and Range Management, University of Agriculture Faisalabad, Faisalabad 38000, Pakistan
- Freddy Mora-Poblete
  - Institute of Biological Sciences, University of Talca, 1 Poniente 1141, Talca 3460000, Chile
4
Li H, Zhu B, Xu L, Wang Z, Xu L, Zhou P, Gao H, Guo P, Chen Y, Gao X, Zhang L, Gao H, Cai W, Xu L, Li J. Genomic Prediction Using LD-Based Haplotypes Inferred From High-Density Chip and Imputed Sequence Variants in Chinese Simmental Beef Cattle. Front Genet 2021; 12:665382. [PMID: 34394182 PMCID: PMC8358323 DOI: 10.3389/fgene.2021.665382] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/08/2021] [Accepted: 06/30/2021] [Indexed: 01/05/2023]
Abstract
A haplotype is defined as a combination of alleles at adjacent loci belonging to the same chromosome that can be transmitted as a unit. In this study, we used both the Illumina BovineHD chip (HD chip) and imputed whole-genome sequence (WGS) data to explore haploblocks and assess haplotype effects, and the haploblocks were defined based on different LD thresholds. The accuracies of genomic prediction (GP) for dressing percentage (DP), meat percentage (MP), and rib eye roll weight (RERW) based on haplotypes were investigated and compared for both data sets in Chinese Simmental beef cattle. The accuracies of GP using the entire imputed WGS data were lower than those using the HD chip data in all cases. For DP and MP, the haploblock approaches were more accurate than the individual single nucleotide polymorphism (SNP) approach (GBLUP_In_Block) at specific LD levels. Hotelling's test confirmed that GP using LD-based haplotypes from WGS data can significantly increase the accuracies of GP for RERW, compared with the individual SNP approach (~1.4% and 1.9% for GHBLUP and GHBLUP+GBLUP, respectively). We found that the accuracies using the haploblock approach varied with different LD thresholds. LD thresholds of r² ≥ 0.5 were optimal for most scenarios. Our results suggested that the LD-based haploblock approach can improve the accuracy of genomic prediction for carcass traits using both HD chip and imputed WGS data under the optimal LD thresholds in Chinese Simmental beef cattle.
Affiliation(s)
- Hongwei Li
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Bo Zhu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
  - National Centre of Beef Cattle Genetic Evaluation, Beijing, China
- Ling Xu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Zezhao Wang
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lei Xu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Peinuo Zhou
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Han Gao
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Peng Guo
  - College of Computer and Information Engineering, Tianjin Agricultural University, Tianjin, China
- Yan Chen
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Xue Gao
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lupei Zhang
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Huijiang Gao
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
  - National Centre of Beef Cattle Genetic Evaluation, Beijing, China
- Wentao Cai
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lingyang Xu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Junya Li
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
  - National Centre of Beef Cattle Genetic Evaluation, Beijing, China
5
Sharifi RS, Noshahr FA, Seifdavati J, Evrigh NH, Cipriano-Salazar M, Mariezcurrena-Berasain MA. Comparison of haplotype method using for genomic prediction versus single SNP genotypes in sheep breeding programs. Small Rumin Res 2021. [DOI: 10.1016/j.smallrumres.2021.106380] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/27/2022]
6
Xu L, Gao N, Wang Z, Xu L, Liu Y, Chen Y, Xu L, Gao X, Zhang L, Gao H, Zhu B, Li J. Incorporating Genome Annotation Into Genomic Prediction for Carcass Traits in Chinese Simmental Beef Cattle. Front Genet 2020; 11:481. [PMID: 32499816 PMCID: PMC7243208 DOI: 10.3389/fgene.2020.00481] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 12/12/2019] [Accepted: 04/17/2020] [Indexed: 01/08/2023]
Abstract
Various methods have been proposed for genomic prediction (GP) in livestock. These methods have mainly focused on statistical considerations and did not include genome annotation information. In this study, to improve the predictive performance of carcass traits in Chinese Simmental beef cattle, we incorporated genome annotation information into GP. Single nucleotide polymorphisms (SNPs) were annotated to five genomic classes: intergenic, gene, exon, protein coding sequences, and 3'/5' untranslated regions. Haploblocks were constructed for all markers and these five genomic classes by defining a biologically functional unit, and haplotype effects were modeled in both numerical dosage and categorical coding strategies. The first-order epistatic effects among SNPs and haplotypes were modeled using a categorical epistasis model. For all markers, the extension from the SNP-based model to a haplotype-based model improved the accuracy by 5.4-9.8% for carcass weight (CW), live weight (LW), and striploin (SI). For the five genomic classes using the haplotype-based prediction model, the incorporation of gene class information into the model improved the accuracies by an average of 1.4, 2.1, and 1.3% for CW, LW, and SI, respectively, compared with their corresponding results for all markers. Including the first-order epistatic effects in the prediction models improved the accuracies in some traits and genomic classes. Therefore, for traits with moderate-to-high heritability, incorporating genome annotation information of gene class into haplotype-based prediction models could be considered a promising tool for GP in Chinese Simmental beef cattle, and modeling epistasis in prediction can further increase the accuracy to some degree.
Affiliation(s)
- Ling Xu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Ning Gao
  - State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Zezhao Wang
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lei Xu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Ying Liu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Yan Chen
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lingyang Xu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Xue Gao
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Lupei Zhang
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- Huijiang Gao
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
  - National Centre of Beef Cattle Genetic Evaluation, Beijing, China
- Bo Zhu
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
  - National Centre of Beef Cattle Genetic Evaluation, Beijing, China
- Junya Li
  - Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
  - National Centre of Beef Cattle Genetic Evaluation, Beijing, China
7
Ballesta P, Maldonado C, Pérez-Rodríguez P, Mora F. SNP and Haplotype-Based Genomic Selection of Quantitative Traits in Eucalyptus globulus. Plants 2019; 8:331. [PMID: 31492041 PMCID: PMC6783840 DOI: 10.3390/plants8090331] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/25/2019] [Revised: 09/02/2019] [Accepted: 09/03/2019] [Indexed: 01/02/2023]
Abstract
Eucalyptus globulus (Labill.) is one of the most important cultivated eucalypts in temperate and subtropical regions and has been successfully subjected to intensive breeding. In this study, Bayesian genomic models that include the effects of haplotype and single nucleotide polymorphisms (SNP) were assessed to predict quantitative traits related to wood quality and tree growth in a 6-year-old breeding population. To this end, the following markers were considered: (a) ~14 K SNP markers (SNP), (b) ~3 K haplotypes (HAP), and (c) haplotypes and SNPs that were not assigned to a haplotype (HAP-SNP). Predictive ability values (PA) were dependent on the genomic prediction models and markers. On average, Bayesian ridge regression (BRR) and Bayes C had the highest PA for the majority of traits. Notably, genomic models that included the haplotype effect (either HAP or HAP-SNP) significantly increased the PA of low-heritability traits. For instance, BRR based on HAP had the highest PA (0.58) for stem straightness. Consistently, the heritability estimates from genomic models were higher than the pedigree-based estimates for these traits. The results provide additional perspectives for the implementation of genomic selection in Eucalyptus breeding programs, which could be especially beneficial for improving traits with low heritability.
Affiliation(s)
- Paulina Ballesta
  - Institute of Biological Sciences, University of Talca, 2 Norte 685, Talca 3460000, Chile.
- Carlos Maldonado
  - Institute of Biological Sciences, University of Talca, 2 Norte 685, Talca 3460000, Chile.
- Paulino Pérez-Rodríguez
  - Colegio de Postgraduados, Statistics and Computer Sciences, Montecillos, Edo. de México 56230, Mexico.
- Freddy Mora
  - Institute of Biological Sciences, University of Talca, 2 Norte 685, Talca 3460000, Chile.
8
Cuyabano B, Su G, Rosa G, Lund M, Gianola D. Bootstrap study of genome-enabled prediction reliabilities using haplotype blocks across Nordic Red cattle breeds. J Dairy Sci 2015; 98:7351-63. [DOI: 10.3168/jds.2015-9360] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 01/20/2015] [Accepted: 06/16/2015] [Indexed: 12/30/2022]
9
Cuyabano BCD, Su G, Lund MS. Genomic prediction of genetic merit using LD-based haplotypes in the Nordic Holstein population. BMC Genomics 2014; 15:1171. [PMID: 25539631 PMCID: PMC4367958 DOI: 10.1186/1471-2164-15-1171] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Received: 04/11/2013] [Accepted: 12/12/2014] [Indexed: 11/17/2022]
Abstract
Background: A haplotype approach to genomic prediction using high density data in dairy cattle as an alternative to single-marker methods is presented. With the assumption that haplotypes are in stronger linkage disequilibrium (LD) with quantitative trait loci (QTL) than single markers, this study focuses on the use of haplotype blocks (haploblocks) as explanatory variables for genomic prediction. Haploblocks were built based on the LD between markers, which allowed variable reduction. The haploblocks were then used to predict three economically important traits (milk protein, fertility and mastitis) in the Nordic Holstein population. Results: The haploblock approach improved prediction accuracy compared with the commonly used individual single nucleotide polymorphism (SNP) approach. Furthermore, using an average LD threshold to define the haploblocks (LD ≥ 0.45 between any two markers) increased the prediction accuracies for all three traits, although the improvement was most significant for milk protein (up to 3.1% improvement in prediction accuracy, compared with the individual SNP approach). Hotelling's t-tests were performed, confirming the improvement in prediction accuracy for milk protein. Because the phenotypic values were in the form of de-regressed proofs, the improved accuracy for milk protein may be due to higher reliability of the data for this trait compared with the reliability of the mastitis and fertility data. Comparisons between best linear unbiased prediction (BLUP) and Bayesian mixture models also indicated that the Bayesian model produced the most accurate predictions in every scenario for the milk protein trait, and in some scenarios for fertility. Conclusions: The haploblock approach to genomic prediction is a promising method for genomic selection in animal breeding. Building haploblocks based on LD reduced the number of variables without the loss of information. This method may play an important role in future genomic prediction involving whole genome sequences.
Affiliation(s)
- Guosheng Su
  - Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, Denmark.
10
Computational Approaches and Resources in Single Amino Acid Substitutions Analysis Toward Clinical Research. Advances in Protein Chemistry and Structural Biology 2014; 94:365-423. [DOI: 10.1016/b978-0-12-800168-4.00010-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 02/08/2023]
11
A review for detecting gene-gene interactions using machine learning methods in genetic epidemiology. BioMed Research International 2013; 2013:432375. [PMID: 24228248 PMCID: PMC3818807 DOI: 10.1155/2013/432375] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 07/17/2013] [Revised: 08/26/2013] [Accepted: 08/27/2013] [Indexed: 01/04/2023]
Abstract
Recently, the greatest statistical and computational challenge in genetic epidemiology has been to identify and characterize the genes that interact with other genes and environmental factors to affect complex multifactorial diseases. These gene-gene interactions are also denoted as epistasis, a phenomenon that cannot be resolved by traditional statistical methods due to the high dimensionality of the data and the occurrence of multiple polymorphisms. Hence, several machine learning methods, namely neural networks (NNs), support vector machines (SVMs), and random forests (RFs), have been used to identify susceptibility genes in such common and multifactorial diseases. This paper gives an overview of machine learning methods, describing the methodology of each method and its application in detecting gene-gene and gene-environment interactions. Lastly, this paper discusses each machine learning method and presents its strengths and weaknesses in detecting gene-gene interactions in complex human disease.
12
Zhou J, Deng Y, Luo F, He Z, Tu Q, Zhi X. Functional molecular ecological networks. mBio 2010; 1:e00169-10. [PMID: 20941329 PMCID: PMC2953006 DOI: 10.1128/mbio.00169-10] [Citation(s) in RCA: 521] [Impact Index Per Article: 37.2] [Received: 06/18/2010] [Accepted: 09/15/2010] [Indexed: 11/20/2022]
Abstract
Biodiversity and its responses to environmental changes are central issues in ecology and for society. Almost all microbial biodiversity research focuses on "species" richness and abundance but not on their interactions. Although a network approach is powerful in describing ecological interactions among species, defining the network structure in a microbial community is a great challenge. Also, although the stimulating effects of elevated CO₂ (eCO₂) on plant growth and primary productivity are well established, its influences on belowground microbial communities, especially microbial interactions, are poorly understood. Here, a random matrix theory (RMT)-based conceptual framework for identifying functional molecular ecological networks was developed with the high-throughput functional gene array hybridization data of soil microbial communities in a long-term grassland FACE (free-air CO₂ enrichment) experiment. Our results indicate that RMT is powerful in identifying functional molecular ecological networks in microbial communities. Both functional molecular ecological networks under eCO₂ and ambient CO₂ (aCO₂) possessed the general characteristics of complex systems such as scale free, small world, modular, and hierarchical. However, the topological structures of the functional molecular ecological networks are distinctly different between eCO₂ and aCO₂, at the levels of the entire communities, individual functional gene categories/groups, and functional genes/sequences, suggesting that eCO₂ dramatically altered the network interactions among different microbial functional genes/populations. Such a shift in network structure is also significantly correlated with soil geochemical variables. In short, elucidating network interactions in microbial communities and their responses to environmental changes is fundamentally important for research in microbial ecology, systems microbiology, and global change.
Affiliation(s)
- Jizhong Zhou
  - Institute for Environmental Genomics, University of Oklahoma, Norman, Oklahoma, USA.
13
Abstract
Over the last two decades, aging research has expanded to include not only age-related disease models, and conversely, longevity and disease-free models, but also focuses on biological mechanisms related to the aging process. By viewing aging on multiple research frontiers, we are rapidly expanding knowledge as a whole and mapping connections between biological processes and particular age-related diseases that emerge. This is perhaps most true in the field of genetics, where variation across individuals has improved our understanding of aging mechanisms, etiology of age-related disease, and prediction of therapeutic responses. A close partnership between gerontologists, epidemiologists, and geneticists is needed to take full advantage of emerging genome information and technology and bring about a new age for biological aging research. Here we review current genetic findings for aging across both disease-specific and aging process domains. We then highlight the limitations of most work to date in terms of study design, genomic information, and trait modeling and focus on emerging technology and future directions that can partner genetic epidemiology and aging research fields to best take advantage of the rapid discoveries in each.
Affiliation(s)
- M Daniele Fallin
- Department of Epidemiology, Bloomberg School of Public Health, Baltimore, MD 21205, USA.
|
14
|
Motsinger-Reif AA, Dudek SM, Hahn LW, Ritchie MD. Comparison of approaches for machine-learning optimization of neural networks for detecting gene-gene interactions in genetic epidemiology. Genet Epidemiol 2008; 32:325-40. [PMID: 18265411 DOI: 10.1002/gepi.20307] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.6] [Indexed: 01/17/2023]
Abstract
The detection of genotypes that predict common, complex disease is a challenge for human geneticists. The phenomenon of epistasis, or gene-gene interactions, is particularly problematic for traditional statistical techniques. Additionally, the explosion of genetic information makes exhaustive searches of multilocus combinations computationally infeasible. To address these challenges, neural networks (NN), a pattern recognition method, have been used. One limitation of the NN approach is that its success is dependent on the architecture of the network. To solve this, machine-learning approaches have been suggested to evolve the best NN architecture for a particular data set. In this study we provide a detailed technical description of the use of grammatical evolution to optimize neural networks (GENN) for use in genetic association studies. We compare the performance of GENN to that of a previous machine-learning NN application--genetic programming neural networks in both simulated and real data. We show that GENN greatly outperforms genetic programming neural networks in data sets with a large number of single nucleotide polymorphisms. Additionally, we demonstrate that GENN has high power to detect disease-risk loci in a range of high-order epistatic models. Finally, we demonstrate the scalability of the GENN method with increasing numbers of variables--as many as 500,000 single nucleotide polymorphisms.
Affiliation(s)
- Alison A Motsinger-Reif
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, USA
|
15
|
Bergholdt R, Størling ZM, Lage K, Karlberg EO, Olason PI, Aalund M, Nerup J, Brunak S, Workman CT, Pociot F. Integrative analysis for finding genes and networks involved in diabetes and other complex diseases. Genome Biol 2008; 8:R253. [PMID: 18045462 PMCID: PMC2258178 DOI: 10.1186/gb-2007-8-11-r253] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.9] [Received: 07/07/2007] [Revised: 10/31/2007] [Accepted: 11/28/2007] [Indexed: 01/17/2023]
Abstract
An integrative analysis combining genetic interactions and protein interactions can be used to identify candidate genes/proteins for type 1 diabetes and other complex diseases. We have developed an integrative analysis method combining genetic interactions, identified using type 1 diabetes genome scan data, and a high-confidence human protein interaction network. Resulting networks were ranked by the significance of the enrichment of proteins from interacting regions. We identified a number of new protein network modules and novel candidate genes/proteins for type 1 diabetes. We propose this type of integrative analysis as a general method for the elucidation of genes and networks involved in diabetes and other complex diseases.
Affiliation(s)
- Regine Bergholdt
- Steno Diabetes Center, Niels Steensensvej 2, DK-2820 Gentofte, Denmark.
|
16
|
Neural networks for genetic epidemiology: past, present, and future. BioData Min 2008; 1:3. [PMID: 18822147 PMCID: PMC2553772 DOI: 10.1186/1756-0381-1-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 02/21/2008] [Accepted: 07/17/2008] [Indexed: 01/17/2023]
Abstract
During the past two decades, the field of human genetics has experienced an information explosion. The completion of the human genome project and the development of high throughput SNP technologies have created a wealth of data; however, the analysis and interpretation of these data have created a research bottleneck. While technology facilitates the measurement of hundreds or thousands of genes, statistical and computational methodologies are lacking for the analysis of these data. New statistical methods and variable selection strategies must be explored for identifying disease susceptibility genes for common, complex diseases. Neural networks (NN) are a class of pattern recognition methods that have been successfully implemented for data mining and prediction in a variety of fields. The application of NN for statistical genetics studies is an active area of research. Neural networks have been applied in both linkage and association analysis for the identification of disease susceptibility genes. In the current review, we consider how NN have been used for both linkage and association analyses in genetic epidemiology. We discuss both the successes of these initial NN applications, and the questions that arose during the previous studies. Finally, we introduce evolutionary computing strategies, Genetic Programming Neural Networks (GPNN) and Grammatical Evolution Neural Networks (GENN), for using NN in association studies of complex human diseases that address some of the caveats illuminated by previous work.
|
17
|
Yang HC, Hsieh HY, Fann CSJ. Kernel-based association test. Genetics 2008; 179:1057-68. [PMID: 18558654 PMCID: PMC2429859 DOI: 10.1534/genetics.107.084616] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Received: 11/15/2007] [Accepted: 03/23/2008] [Indexed: 11/18/2022]
Abstract
Association mapping (i.e., linkage disequilibrium mapping) is a powerful tool for positional cloning of disease genes. We propose a kernel-based association test (KBAT), which is a composite function of "P-values of single-locus association tests" and "kernel weights related to intermarker distances and/or linkage disequilibria." The KBAT is a general form of some current test statistics. This method can be applied to the study of candidate genes and can scan each chromosome using a moving average procedure. We evaluated the performance of the KBAT through simulation studies that considered evolutionary parameters, disease models, sample sizes, kernel functions, test statistics, window attributes, empirical P-value estimations, and genetic/physical maps. The results showed that the KBAT had a well-controlled false positive rate and high power compared to existing methods. In addition, the KBAT was also applied to analyze a genomewide data set from the Collaborative Study on the Genetics of Alcoholism. Important genes associated with alcoholism dependence were identified. In summary, the merits of the KBAT are multifold: the KBAT is robust against the inclusion of nuisance markers, is invariant to the map scale, and accommodates different types of genomic data, study designs, and study purposes. The proposed methods are packaged in the user-friendly software, KBAT, available at http://www.stat.sinica.edu.tw/hsinchou/genetics/association/KBAT.htm.
Affiliation(s)
- Hsin-Chou Yang
- Institute of Statistical Science, Academia Sinica, 128 Academia Rd., Sec. 2, Nankang, Taipei, Taiwan 115.
|
18
|
Artificial neural networks for linkage analysis of quantitative gene expression phenotypes and evaluation of gene x gene interactions. BMC Proc 2007; 1 Suppl 1:S47. [PMID: 18466546 PMCID: PMC2367483 DOI: 10.1186/1753-6561-1-s1-s47] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/17/2023]
Abstract
Background Using single-nucleotide polymorphism (SNP) genotypes and selected gene expression phenotypes from 14 CEPH (Centre d'Etude du Polymorphisme Humain) pedigrees provided for Genetic Analysis Workshop 15 (GAW15), we analyzed quantitative traits with artificial neural networks (ANNs). Our goals were to identify individual linkage signals and examine gene × gene interactions. First, we used classical multipoint methods to identify phenotypes having nominal linkage evidence at two or more loci. ANNs were then applied to sib-pair identity-by-descent (IBD) allele sharing across the genome as input variables and squared trait sums and differences for the sib pairs as output variables. The weights of the trained networks were analyzed to assess the linkage evidence at each locus as well as potential interactions between them. Results Loci identified by classical linkage analysis could also be identified by our ANN analysis. However some ANN results were noisy, and our attempts to use cross-validated training to avoid overtraining and thereby improve results were only partially successful. Potential interactions between loci with high-ranked weight measures were also evaluated, with the resulting patterns suggesting existence of both synergistic and antagonistic effects between loci. Conclusion Our results suggest that ANNs can serve as a useful method to analyze quantitative traits and are a potential tool for detecting gene × gene interactions. However, for the approach implemented here, optimizing the ANNs and obtaining stable results remains challenging.
|
19
|
Curtis D. Comparison of artificial neural network analysis with other multimarker methods for detecting genetic association. BMC Genet 2007; 8:49. [PMID: 17640352 PMCID: PMC1940019 DOI: 10.1186/1471-2156-8-49] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 02/21/2007] [Accepted: 07/18/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND Debate remains as to the optimal method for utilising genotype data obtained from multiple markers in case-control association studies. I and colleagues have previously described a method of association analysis using artificial neural networks (ANNs), whose performance compared favourably to single-marker methods. Here, the performance of ANN analysis is compared with other multi-marker methods, comprising different haplotype-based analyses and locus-based analyses. RESULTS Of several methods studied and applied to simulated SNP datasets, heterogeneity testing of estimated haplotype frequencies using asymptotic p values rather than permutation testing had the lowest power of the methods studied and ANN analysis had the highest power. The difference in power to detect association between these two methods was statistically significant (p = 0.001) but other comparisons between methods were not significant. The raw t statistic obtained from ANN analysis correlated highly with the empirical statistical significance obtained from permutation testing of the ANN results and with the p value obtained from the heterogeneity test. CONCLUSION Although ANN analysis was more powerful than the standard haplotype-based test it is unlikely to be taken up widely. The permutation testing necessary to obtain a valid p value makes it slow to perform and it is not underpinned by a theoretical model relating marker genotypes to disease phenotype. Nevertheless, the superior performance of this method does imply that the widely-used haplotype-based methods for detecting association with multiple markers are not optimal and efforts could be made to improve upon them. The fact that the t statistic obtained from ANN analysis is highly correlated with the statistical significance does suggest a possibility to use ANN analysis in situations where large numbers of markers have been genotyped, since the t value could be used as a proxy for the p value in preliminary analyses.
Affiliation(s)
- David Curtis
- Academic Centre for Psychiatry, St Bartholomew's and Royal London School of Medicine and Dentistry, Royal London Hospital, Whitechapel, London, UK.
|
20
|
Methodological aspects of the assessment of gene-nutrient interactions at the population level. Nutr Metab Cardiovasc Dis 2007; 17:82-8. [PMID: 17306733 DOI: 10.1016/j.numecd.2006.01.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/16/2005] [Revised: 01/04/2006] [Accepted: 01/09/2006] [Indexed: 12/21/2022]
Abstract
Nutritional-related diseases are the result of complex interactions between genes and diet. The understanding of these interactions will provide the rationale for dietary interventions based on the individual's genetic constitution. However, the approach to this kind of study is not easy, the complexity of the interactions increasing exponentially the dimensionality of the problem. The aim of this review is to analyze the major problems that arise in approaching complex interactions at the population level. Furthermore, several statistical tools available for this type of analysis are discussed. In conclusion, although analytic techniques able to reduce the dimensionality of the problem are suggested, sample size requirement seems to remain an inescapable challenge for the researcher. A synergy between traditional and nontraditional statistical approaches could be useful.
|
21
|
Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB. Detection of gene x gene interactions in genome-wide association studies of human population data. Hum Hered 2007; 63:67-84. [PMID: 17283436 DOI: 10.1159/000099179] [Citation(s) in RCA: 138] [Impact Index Per Article: 8.1] [Indexed: 12/29/2022]
Abstract
Empirical evidence supporting the commonality of gene x gene interactions, coupled with frequent failure to replicate results from previous association studies, has prompted statisticians to develop methods to handle this important subject. Nonparametric methods have generated intense interest because of their capacity to handle high-dimensional data. Genome-wide association analysis of large-scale SNP data is challenging mathematically and computationally. In this paper, we describe major issues and questions arising from this challenge, along with methodological implications. Data reduction and pattern recognition methods seem to be the new frontiers in efforts to detect gene x gene interactions comprehensively. Currently, there is no single method that is recognized as the 'best' for detecting, characterizing, and interpreting gene x gene interactions. Instead, a combination of approaches with the aim of balancing their specific strengths may be the optimal approach to investigate gene x gene interactions in human data.
Affiliation(s)
- Solomon K Musani
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
|
22
|
Sabbagh A, Darlu P. SNP selection at the NAT2 locus for an accurate prediction of the acetylation phenotype. Genet Med 2006; 8:76-85. [PMID: 16481889 DOI: 10.1097/01.gim.0000200951.54346.d6] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/26/2022]
Abstract
PURPOSE Genetic polymorphisms in the N-acetyltransferase 2 gene determine the individual acetylator status, which influences both the toxicity and efficacy profile of acetylated drugs. Determination of an individual's acetylation phenotype prior to initiation of therapy, through DNA-based tests, should make it possible to improve therapy response and reduce adverse events. However, due to extensive linkage disequilibrium between markers within NAT2, the genotyping of closely spaced markers yields highly redundant data: testing them all is expensive and often unnecessary. The objective of this study is to establish the optimal strategy to define, in the genetic context of a given ethnic group, the most informative set of single-nucleotide polymorphisms that best enables accurate prediction of acetylation phenotype. METHODS Three classification methods have been investigated (classification trees, artificial neural networks and multifactor dimensionality reduction method) in order to find the optimal set of single-nucleotide polymorphisms enabling the most efficient classification of individuals into rapid and slow acetylators. RESULTS Our results show that, in almost all population samples, only one or two single-nucleotide polymorphisms would be enough to obtain a good predictive capacity with no or only a modest reduction in power relative to direct assays of all common markers. In contrast, in Black African populations, where lower levels of linkage disequilibrium are observed at NAT2, a larger number of single-nucleotide polymorphisms are required to predict acetylation phenotype. CONCLUSION The results of this study will be helpful for the design of time- and cost-effective pharmacogenetic tests (adapted to specific populations) that could be used as routine tools in clinical practice.
Affiliation(s)
- Audrey Sabbagh
- Unité de Recherche en Génétique Epidémiologique et Structure des Populations Humaines, INSERM U535, Villejuif, France
|
23
|
Motsinger AA, Dudek SM, Hahn LW, Ritchie MD. Comparison of Neural Network Optimization Approaches for Studies of Human Genetics. Lecture Notes in Computer Science 2006. [DOI: 10.1007/11732242_10] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 01/07/2023]
|
24
|
Abstract
Multilocus linkage disequilibrium (LD) tests that consider inter-marker LD are more powerful than single-locus tests when disease etiology is contributed simultaneously by several linked and correlated loci. However, inclusion of redundant non-informative markers may result in reduced testing power and/or inflated false-positive rate; therefore, selection of proper marker sets is important in such tests. We introduce a unified LD test based on a convenient marker-selection procedure (sliding window) combined with an adjustment approach (marker weighting) to dilute the impact of nuisance markers on tests. The proposed procedure includes several conventional p-value combination methods as its special cases. Simulation studies were performed to evaluate the impact of inclusion of nuisance markers and performance of the procedure. The results showed that testing power was often inversely proportional to the quantity of nuisance markers. Among a class of p-value combination methods, the product p-value method had the highest testing power. P-value truncation somewhat reduced the testing power but controlled the false-positive rate well. Compared with conventional unweighted approaches, the weighted strategy alleviated the false-positive rate and/or increased testing power when nuisance markers were included. Analyses of two authentic data sets for psoriasis and Alzheimer's disease using our proposed method confirmed previous findings.
Affiliation(s)
- Hsin-Chou Yang
- Institute of Biomedical Sciences, Academia Sinica, Nankang, Taipei, Taiwan
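The sliding-window, weighted p-value combination described in the abstract above can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the window size, the weights, and the p-values are all hypothetical, and the statistic shown is simply a weighted product of p-values expressed on the log scale. Assessing significance would additionally need a permutation or analytic null distribution, which is omitted here.

```python
import math

def weighted_product_stat(pvalues, weights):
    """Weighted sum of -log(p): the log-scale form of a weighted product of p-values."""
    if len(pvalues) != len(weights):
        raise ValueError("pvalues and weights must have the same length")
    return sum(-w * math.log(p) for p, w in zip(pvalues, weights))

def sliding_window_scan(pvalues, weights, window):
    """Return one combined statistic per window of consecutive markers."""
    n = len(pvalues)
    return [
        weighted_product_stat(pvalues[i:i + window], weights[i:i + window])
        for i in range(n - window + 1)
    ]

# Hypothetical single-marker p-values (all > 0); the fourth marker is
# down-weighted to mimic diluting the impact of a nuisance marker.
pvals = [0.50, 0.01, 0.02, 0.90, 0.40]
wts = [1.0, 1.0, 1.0, 0.1, 1.0]

stats = sliding_window_scan(pvals, wts, window=3)
# Index of the window with the largest combined statistic.
best = max(range(len(stats)), key=stats.__getitem__)
```

With all weights equal to 1 the statistic reduces to the plain product p-value method on the log scale, which is one way to read the abstract's remark that conventional combination methods arise as special cases.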
|
25
|
Sabbagh A, Darlu P. Data-Mining Methods as Useful Tools for Predicting Individual Drug Response: Application to CYP2D6 Data. Hum Hered 2006; 62:119-34. [PMID: 17057402 DOI: 10.1159/000096416] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 06/08/2006] [Accepted: 08/22/2006] [Indexed: 11/19/2022]
Abstract
OBJECTIVES Selecting a maximally informative subset of polymorphisms to predict a clinical outcome, such as drug response, requires appropriate search methods due to the increased dimensionality associated with looking at multiple genotypes. In this study, we investigated the ability of several pattern recognition methods to identify the most informative markers in the CYP2D6 gene for the prediction of CYP2D6 metabolizer status. METHODS Four data-mining tools were explored: decision trees, random forests, artificial neural networks, and the multifactor dimensionality reduction (MDR) method. Marker selection was performed separately in eight population samples of different ethnic origin to evaluate to what extent the most informative markers differ across ethnic groups. RESULTS Our results show that the number of polymorphisms required to predict CYP2D6 metabolic phenotype with a high accuracy can be dramatically reduced owing to the strong haplotype block structure observed at CYP2D6. MDR and neural networks provided nearly identical results and performed the best. CONCLUSION Data-mining methods, such as MDR and neural networks, appear as promising tools to improve the efficiency of genotyping tests in pharmacogenetics with the ultimate goal of pre-screening patients for individual therapy selection with minimum genotyping effort.
Affiliation(s)
- Audrey Sabbagh
- Unité de Recherche en Génétique Epidémiologique et Structure des Populations Humaines, INSERM U535, Villejuif, France.
|
26
|
Abstract
Complex interactions among genes and environmental factors are known to play a role in common human disease aetiology. There is a growing body of evidence to suggest that complex interactions are 'the norm' and, rather than amounting to a small perturbation to classical Mendelian genetics, interactions may be the predominant effect. Traditional statistical methods are not well suited for detecting such interactions, especially when the data are high dimensional (many attributes or independent variables) or when interactions occur between more than two polymorphisms. In this review, we discuss machine-learning models and algorithms for identifying and characterising susceptibility genes in common, complex, multifactorial human diseases. We focus on the following machine-learning methods that have been used to detect gene-gene interactions: neural networks, cellular automata, random forests, and multifactor dimensionality reduction. We conclude with some ideas about how these methods and others can be integrated into a comprehensive and flexible framework for data mining and knowledge discovery in human genetics.
Affiliation(s)
- Brett A. McKinney
- Department of Molecular Physiology and Biophysics, Center for Human Genetics Research, Vanderbilt University Medical School, Nashville, Tennessee, USA
- Computational Genetics Laboratory, Department of Genetics, Dartmouth Medical School, Lebanon, New Hampshire, USA
- David M. Reif
- Department of Molecular Physiology and Biophysics, Center for Human Genetics Research, Vanderbilt University Medical School, Nashville, Tennessee, USA
- Computational Genetics Laboratory, Department of Genetics, Dartmouth Medical School, Lebanon, New Hampshire, USA
- Marylyn D. Ritchie
- Department of Molecular Physiology and Biophysics, Center for Human Genetics Research, Vanderbilt University Medical School, Nashville, Tennessee, USA
- Jason H. Moore
- Computational Genetics Laboratory, Department of Genetics, Dartmouth Medical School, Lebanon, New Hampshire, USA
- Department of Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire, USA
- Department of Biological Sciences, Dartmouth College, Hanover, New Hampshire, USA
- Department of Computer Science, University of New Hampshire, Durham, New Hampshire, USA
- Department of Computer Science, University of Vermont, Burlington, Vermont, USA
|
27
|
Penco S, Grossi E, Cheng S, Intraligi M, Maurelli G, Patrosso MC, Marocchi A, Buscema M. Assessment of the role of genetic polymorphism in venous thrombosis through artificial neural networks. Ann Hum Genet 2005; 69:693-706. [PMID: 16266408 DOI: 10.1111/j.1529-8817.2005.00206.x] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Indexed: 02/05/2023]
Abstract
PURPOSE To assess the role of genetic polymorphisms in venous thrombosis events (VTE) using Artificial Neural Networks (ANNs), a model for solving non-linear problems frequently associated with complex biological systems, due to interactions between biological, genetic and environmental factors. METHODS A database was generated from a case-control study of venous thrombosis, using 238 patients and 211 controls. The database of 64 variables included age, gender and a panel of 62 genetic variants. Three different ANNs were compared with logistic regression for accuracy in predicting cases and controls. RESULTS ANNs yielded a better performance than the logistic regression algorithm. Indeed, through ANN models, the 62 variables related to genetic variants were first reduced to a set of 9, and then of 3 (MTHFR 677 C/T, FV arg506gln, ICAM1 gly214arg). CONCLUSIONS The findings of this study illustrate the power of ANN in evaluating multifactorial data, and show that the different sensitivities of the models of elaboration are related to the characteristics of the data. This may contribute to a better understanding of the role played by genetic polymorphisms in VTE, and help to define, if possible, a test panel of genetic variants to estimate an individual's probability of developing the disease.
Affiliation(s)
- S Penco
- Medical Genetics, Clinical Chemistry and Clinical Pathology Laboratory, Niguarda Ca' Granda Hospital, Piazza Ospedale Maggiore 3, 20100 Milan, Italy.
|
28
|
Di Luca M, Grossi E, Borroni B, Zimmermann M, Marcello E, Colciaghi F, Gardoni F, Intraligi M, Padovani A, Buscema M. Artificial neural networks allow the use of simultaneous measurements of Alzheimer disease markers for early detection of the disease. J Transl Med 2005; 3:30. [PMID: 16048651 PMCID: PMC1198261 DOI: 10.1186/1479-5876-3-30] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 05/12/2005] [Accepted: 07/27/2005] [Indexed: 02/08/2023]
Abstract
BACKGROUND Previous studies have shown that in platelets of mild Alzheimer Disease (AD) patients there are alterations of specific APP forms, paralleled by alteration in expression level of both ADAM 10 and BACE when compared to control subjects. Due to the poor linear relation among each key-element of beta-amyloid cascade and the target diagnosis, the use of systems able to afford non linear tasks, like artificial neural networks (ANNs), should allow a better discriminating capacity in comparison with classical statistics. OBJECTIVE To evaluate the accuracy of ANNs in AD diagnosis. METHODS 37 mild-AD patients and 25 control subjects were enrolled, and APP, ADAM 10 and BACE measures were performed. Fifteen different models of feed-forward and complex-recurrent ANNs (provided by Semeion Research Centre), based on different learning laws (back propagation, sine-net, bi-modal) were compared with the linear discriminant analysis (LDA). RESULTS The best ANN model correctly identified mild AD patients in 94% of cases and the control subjects in 92%. The corresponding diagnostic performance obtained with LDA was 90% and 73%. CONCLUSION This preliminary study suggests that the processing of biochemical tests related to beta-amyloid cascade with ANNs allows a very good discrimination of AD in early stages, higher than that obtainable with classical statistics methods.
Affiliation(s)
- Monica Di Luca
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
- Enzo Grossi
- Medical Department, Bracco Spa, Milan, Italy
- Barbara Borroni
- Department of Neurological Sciences, University of Brescia, Italy
- Martina Zimmermann
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
- Elena Marcello
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
- Francesca Colciaghi
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
- Fabrizio Gardoni
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
|
29
|
Serretti A, Smeraldi E. Neural network analysis in pharmacogenetics of mood disorders. BMC Med Genet 2004; 5:27. [PMID: 15588300 PMCID: PMC539307 DOI: 10.1186/1471-2350-5-27] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.0] [Received: 07/23/2004] [Accepted: 12/09/2004] [Indexed: 01/17/2023]
Abstract
Background The increasing number of available genotypes for genetic studies in humans requires more advanced techniques of analysis. We previously reported significant univariate associations between gene polymorphisms and antidepressant response in mood disorders. However, the combined analysis of multiple gene polymorphisms and clinical variables requires the use of non linear methods. Methods In the present study we tested a neural network strategy for a combined analysis of two gene polymorphisms. A Multi Layer Perceptron model showed the best performance and was therefore selected over the other networks. One hundred and twenty one depressed inpatients treated with fluvoxamine in the context of previously reported pharmacogenetic studies were included. The polymorphisms in the transcriptional control region upstream of the 5HTT coding sequence (SERTPR) and in the Tryptophan Hydroxylase (TPH) gene were analysed simultaneously. Results A multi layer perceptron network composed of 1 hidden layer with 7 nodes was chosen. 77.5% of responders and 51.2% of non responders were correctly classified (ROC area = 0.731; empirical p value = 0.0082). Finally, we performed a comparison with traditional techniques. A discriminant function analysis correctly classified 34.1% of responders and 68.1% of non responders (F = 8.16, p = 0.0005). Conclusions Overall, our findings suggest that neural networks may be a valid technique for the analysis of gene polymorphisms in pharmacogenetic studies. The complex interactions modelled through NN may eventually be applied at the clinical level for individualized therapy.
Affiliation(s)
- Alessandro Serretti
- Istituto Scientifico Universitario Ospedale San Raffaele, Department of Neuropsychiatric Sciences, Milano, Italy
- Università Vita-Salute San Raffaele, School of Medicine, Milano, Italy
- Enrico Smeraldi
- Istituto Scientifico Universitario Ospedale San Raffaele, Department of Neuropsychiatric Sciences, Milano, Italy
- Università Vita-Salute San Raffaele, School of Medicine, Milano, Italy
|
30
|
Pullinger CR, Kane JP, Malloy MJ. Primary hypercholesterolemia: genetic causes and treatment of five monogenic disorders. Expert Rev Cardiovasc Ther 2004; 1:107-19. [PMID: 15030301 DOI: 10.1586/14779072.1.1.107] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Indexed: 11/08/2022]
Abstract
Coronary heart disease is a major cause of death in Europe and the USA. Insudation of atherogenic lipoproteins, including low-density lipoprotein (LDL), into the artery wall is integral to atherosclerosis. It is clear that numerous genetic loci contribute to increased plasma levels of LDL. However, five specific monogenic disorders, three of which have been reported recently, are known to increase LDL. These are familial hypercholesterolemia (LDL receptor gene: LDLR); familial ligand-defective apoB-100 (apoB gene: APOB); autosomal recessive hypercholesterolemia (ARH gene); sitosterolemia (ABCG5 or ABCG8 genes); and cholesterol 7α-hydroxylase deficiency (CYP7A1 gene). This review relates the mechanisms underlying these five disorders to specific therapeutic interventions.
Affiliation(s)
- Clive R Pullinger
- Cardiovascular Research Institute, University of California, San Francisco, USA.
31
Pociot F, Karlsen AE, Pedersen CB, Aalund M, Nerup J. Novel analytical methods applied to type 1 diabetes genome-scan data. Am J Hum Genet 2004; 74:647-60. [PMID: 15024687 PMCID: PMC1181942 DOI: 10.1086/383095] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Received: 11/13/2003] [Accepted: 01/16/2004] [Indexed: 01/17/2023] Open
Abstract
Complex traits like type 1 diabetes mellitus (T1DM) are generally taken to be under the influence of multiple genes interacting with each other to confer disease susceptibility and/or protection. Although novel methods are being developed, analyses of whole-genome scans are most often performed with multipoint methods that work under the assumption that multiple trait loci are unrelated to each other; that is, most models specify the effect of only one locus at a time. We have applied a novel approach, which includes decision-tree construction and artificial neural networks, to the analysis of T1DM genome-scan data. We demonstrate that this approach (1) allows identification of all major susceptibility loci identified by nonparametric linkage analysis, (2) identifies a number of novel regions as well as combinations of markers with predictive value for T1DM, and (3) may be useful in characterizing markers in linkage disequilibrium with protective-gene variants. Furthermore, the approach outlined here permits combined analyses of genetic-marker data and information on environmental and clinical covariates.
32
Tahri-Daizadeh N, Tregouet DA, Nicaud V, Manuel N, Cambien F, Tiret L. Automated detection of informative combined effects in genetic association studies of complex traits. Genome Res 2003; 13:1952-60. [PMID: 12902385 PMCID: PMC403788 DOI: 10.1101/gr.1254203] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/25/2022]
Abstract
There is a growing body of evidence suggesting that the relationships between gene variability and common disease are more complex than initially thought and require the exploration of the whole polymorphism of candidate genes as well as several genes belonging to biological pathways. When the number of polymorphisms is relatively large and the structure of the relationships among them complex, the use of data mining tools to extract the relevant information is a necessity. Here, we propose an automated method for the detection of informative combined effects (DICE) among several polymorphisms (and nongenetic covariates) within the framework of association studies. The algorithm combines the advantages of the regressive approaches with those of data exploration tools. Importantly, DICE considers the problem of interaction between polymorphisms as an effect of interest and not as a nuisance effect. We illustrate the method with three applications on the relationship between (1) the P-selectin gene and myocardial infarction, (2) the cholesteryl ester transfer protein gene and plasma high-density-lipoprotein cholesterol concentration, and (3) genes of the renin-angiotensin-aldosterone system and myocardial infarction. The applications demonstrated that the method was able to recover results already found using other approaches and, in addition, detected biologically sensible effects not previously described.
Affiliation(s)
- Nadia Tahri-Daizadeh
- INSERM U525, Faculté de Médecine, Hôpital Pitié-Salpêtrière, 75634 Paris, France
33
Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH. Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics 2003; 4:28. [PMID: 12846935 PMCID: PMC183838 DOI: 10.1186/1471-2105-4-28] [Citation(s) in RCA: 115] [Impact Index Per Article: 5.5] [Received: 03/12/2003] [Accepted: 07/07/2003] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Appropriate definition of neural network architecture prior to data analysis is crucial for successful data mining. This can be challenging when the underlying model of the data is unknown. The goal of this study was to determine whether optimizing neural network architecture using genetic programming as a machine learning strategy would improve the ability of neural networks to model and detect nonlinear interactions among genes in studies of common human diseases. RESULTS Using simulated data, we show that a genetic programming optimized neural network approach is able to model gene-gene interactions as well as a traditional back propagation neural network. Furthermore, the genetic programming optimized neural network is better than the traditional back propagation neural network approach in terms of predictive ability and power to detect gene-gene interactions when non-functional polymorphisms are present. CONCLUSION This study suggests that a machine learning strategy for optimizing neural network architecture may be preferable to traditional trial-and-error approaches for the identification and characterization of gene-gene interactions in common, complex human diseases.
Affiliation(s)
- Marylyn D Ritchie
- Program in Human Genetics and Department of Molecular Physiology and Biophysics, Vanderbilt University Medical School, Nashville, TN, 37232-0700, USA
- Bill C White
- Program in Human Genetics and Department of Molecular Physiology and Biophysics, Vanderbilt University Medical School, Nashville, TN, 37232-0700, USA
- Joel S Parker
- Program in Human Genetics and Department of Molecular Physiology and Biophysics, Vanderbilt University Medical School, Nashville, TN, 37232-0700, USA
- Lance W Hahn
- Program in Human Genetics and Department of Molecular Physiology and Biophysics, Vanderbilt University Medical School, Nashville, TN, 37232-0700, USA
- Jason H Moore
- Program in Human Genetics and Department of Molecular Physiology and Biophysics, Vanderbilt University Medical School, Nashville, TN, 37232-0700, USA
34
North BV, Curtis D, Cassell PG, Hitman GA, Sham PC. Assessing optimal neural network architecture for identifying disease-associated multi-marker genotypes using a permutation test, and application to calpain 10 polymorphisms associated with diabetes. Ann Hum Genet 2003; 67:348-56. [PMID: 12914569 DOI: 10.1046/j.1469-1809.2003.00030.x] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.5] [Indexed: 11/20/2022]
Abstract
Biallelic markers, such as single nucleotide polymorphisms (SNPs), provide greater information for localising disease loci when treated as multilocus haplotypes, but often haplotypes are not immediately available from multilocus genotypes in case-control studies. An artificial neural network allows investigation of association between disease phenotype and tightly linked markers without requiring haplotype phase and without modelling any evolutionary history for the disease-related haplotypes. The network assesses whether marker haplotypes differ between cases and controls to the extent that classification of disease status based on multi-marker genotypes is achievable. The network is "trained" to "recognise" affection status based on supplied marker genotypes, and then for each multi-marker genotype it produces outputs which aim to approximate the associated affection status. Next, the genotypes are permuted relative to affection status to produce many random datasets and the process of training and recording of outputs is repeated. The extent to which the ability to predict affection for the real dataset exceeds that for the random datasets measures the statistical significance of the association between multi-marker genotype and affection. This permutation test performs well with simulated case-control datasets, particularly when major gene effects are present. We have explored the effects of systematically varying different network parameters in order to identify their optimal values. We have applied the permutation test to 4 SNPs of the calpain 10 (CAPN10) gene typed in a case-control sample of subjects with type 2 diabetes, impaired glucose tolerance, and controls. We show that the neural network produces more highly significant evidence for association than do single marker tests corrected for the number of markers genotyped. 
The use of a permutation test could potentially allow conditional analyses which could incorporate known risk factors alongside marker genotypes. Permuting only the marker genotypes relative to affection status and these risk factors would allow the contribution of the markers to disease risk to be independently assessed.
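The permutation procedure described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the trained neural network is replaced by a much simpler stand-in score (`majority_vote_accuracy`, a hypothetical helper that predicts the most common affection status for each distinct multi-marker genotype), and the names `permutation_p_value`, `n_perm`, and `seed` are likewise invented here. Shuffling the affection labels is used as an equivalent way of permuting genotypes relative to affection status.

```python
import random
from typing import Callable, Hashable, List, Sequence


def majority_vote_accuracy(genotypes: Sequence[Hashable], status: Sequence[int]) -> float:
    """Stand-in for the trained network (an assumption, not the paper's model).

    For each distinct multi-marker genotype, predict the most common
    affection status (0 or 1) among its carriers and return the resulting
    in-sample classification accuracy.
    """
    counts = {}
    for genotype, affected in zip(genotypes, status):
        tally = counts.setdefault(genotype, [0, 0])
        tally[affected] += 1
    correct = sum(max(tally) for tally in counts.values())
    return correct / len(status)


def permutation_p_value(
    genotypes: Sequence[Hashable],
    status: Sequence[int],
    score_fn: Callable[[Sequence[Hashable], Sequence[int]], float] = majority_vote_accuracy,
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Empirical p value: how often a permuted dataset scores at least as well as the real one."""
    rng = random.Random(seed)
    observed = score_fn(genotypes, status)
    shuffled: List[int] = list(status)
    at_least_as_good = 0
    for _ in range(n_perm):
        # Break any genotype-status association while keeping case/control counts fixed.
        rng.shuffle(shuffled)
        if score_fn(genotypes, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero (a common convention, assumed here).
    return (at_least_as_good + 1) / (n_perm + 1)
```

Any other scoring function with the same signature, including one that trains and evaluates a neural network, could be passed as `score_fn`; the permutation logic itself would not change.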
Affiliation(s)
- B V North
- Academic Department of Psychiatry, Barts and The London Queen Mary's School of Medicine and Dentistry, London E1 1BB, UK.
35
Yoon Y, Song J, Hong SH, Kim JQ. Analysis of multiple single nucleotide polymorphisms of candidate genes related to coronary heart disease susceptibility by using support vector machines. Clin Chem Lab Med 2003; 41:529-34. [PMID: 12747598 DOI: 10.1515/cclm.2003.080] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.1] [Indexed: 11/15/2022]
Abstract
Coronary heart disease (CHD) is a complex genetic disease involving gene-environment interaction. Many association studies between single nucleotide polymorphisms (SNPs) of candidate genes and CHD have been reported. We have applied a new method to analyze such relationships using support vector machines (SVMs), which the authors describe as one of the methods of artificial neural networks. We assumed that common haplotypes implicit in genotypes will differ between cases and controls, and that this will allow SVM-derived patterns to be classifiable according to subject genotypes. Fourteen SNPs of ten candidate genes in 86 CHD patients and 119 controls were investigated. Genotypes were transformed to a numerical vector by giving scores based on the difference between the genotypes of each subject and the reference genotypes, which represent the healthy normal population. Overall classification accuracy by SVMs was 64.4% with a receiver operating characteristic (ROC) area of 0.639. By conventional analysis using the χ² test, the association between CHD and the SNP of the scavenger receptor B1 gene was most significant in terms of allele frequencies in cases vs. controls (p = 0.0001). In conclusion, we suggest that the application of SVMs for association studies of SNPs in candidate genes shows considerable promise and that further work could usefully be performed on the estimation of CHD susceptibility in individuals at high risk.
Affiliation(s)
- Yeomin Yoon
- Department of Laboratory Medicine, Cheju National University College of Medicine, Jeju, South Korea
36
Kato C, Petronis A, Okazaki Y, Tochigi M, Umekage T, Sasaki T. Molecular genetic studies of schizophrenia: challenges and insights. Neurosci Res 2002; 43:295-304. [PMID: 12135773 DOI: 10.1016/s0168-0102(02)00064-0] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.3] [Indexed: 10/27/2022]
Abstract
Schizophrenia (SCZ) is a mental disease that affects approximately 1% of the population with life-long devastating consequences. Based on evidence for a major contribution of genetic factors, a decade of extensive efforts has been dedicated to the search for DNA sequence variations that increase the risk of SCZ. The search for genes in rare multiplex SCZ families with a large number of affected individuals and a quasi-Mendelian mode of inheritance, using genetic linkage methodology, has been one of the favorite strategies in psychiatric genetics. Although several genomic regions were suggested for linkage to SCZ, not a single gene causing or predisposing to SCZ has been identified thus far. Furthermore, it is not clear whether the genes of familial SCZ are also involved in sporadic cases, which represent the overwhelming majority of SCZ patients. For sporadic cases, genetic association studies comparing the distribution of allelic frequencies of candidate genes in SCZ patients and controls have been performed, but the outcome of such studies has also been quite modest. Several factors, such as the possible involvement of numerous interactive genes of minor effect, as yet unknown environmental effects, and diagnostic ambiguities of the disease, have made genetic studies in SCZ quite unproductive. In terms of future studies, a genome-wide association search is a promising approach; however, this approach requires genotyping of thousands of genetic markers in large samples. In addition, a detailed analysis of the genes whose expression changes under the influence of environmental factors can indicate good candidates for genetic association studies. In this connection, investigations of the epigenetic regulation of genes, and not only of DNA sequence variation, may be necessary for a complete understanding of the etiopathogenic mechanisms of SCZ.
Affiliation(s)
- Chieko Kato
- Department of Neuropsychiatry, University of Tokyo, Tokyo, Japan