1
pour AF, Pietrzak M, Sucheston-Campbell LE, Karaesmen E, Dalton LA, Rempała GA. High dimensional model representation of log likelihood ratio: binary classification with SNP data. BMC Med Genomics 2020; 13:133. [PMID: 32957998 PMCID: PMC7504683 DOI: 10.1186/s12920-020-00774-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/06/2023]
Abstract
BACKGROUND Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting risk of future disease events in complex conditions such as cancer. The small-sample, high-dimensional nature of SNP data, the weak effect of each SNP on the outcome, and highly non-linear SNP interactions are several key factors complicating the analysis. Additionally, SNPs take a finite number of values, which may be best understood as ordinal or categorical variables, but are treated as continuous ones by many algorithms.
METHODS We use the theory of high dimensional model representation (HDMR) to build appropriate low dimensional glass-box models, allowing us to account for the effects of feature interactions. We compute the second order HDMR expansion of the log-likelihood ratio to account for the effects of single SNPs and their pairwise interactions. We propose a regression based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions.
RESULTS We apply LABS-HDMR-CO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples LABS-HDMR-CO enjoys superior accuracy compared with several algorithms used for SNP classification, while also taking pairwise interactions into account. FPT declares very few significant interactions in the small sample GWAS datasets when bounding the false discovery rate (FDR) by 5%, due to the large number of tests performed. On the other hand, LABS-HDMR-CO utilizes a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset FPT declares a larger portion of the SNP pairs used by LABS-HDMR-CO as significant.
CONCLUSION LABS-HDMR-CO and FPT are interesting methods to design prediction rules and detect pairwise feature interactions for SNP data. Reliably detecting pairwise SNP interactions and taking advantage of potential interactions to improve prediction accuracy are two different objectives addressed by these methods. While the large number of potential SNP interactions may result in low power of detection, potentially interacting SNP pairs, of which many might be false alarms, can still be used to improve prediction accuracy.
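To make the idea of a log-likelihood ratio over categorical SNPs concrete, the following is a minimal sketch of the first-order (single-SNP) terms only. It is not LABS-HDMR-CO: the abstract does not give the regression details, so this sketch simply estimates each SNP's contribution from smoothed genotype counts and omits the pairwise terms. The function names, the genotype coding 0/1/2, and the smoothing constant `alpha` are assumptions made for illustration.

```python
import math
from collections import Counter


def fit_first_order_llr(genotypes, labels, n_levels=3, alpha=1.0):
    """Estimate additive per-SNP log-likelihood-ratio tables.

    genotypes: list of samples, each a list of genotype codes in 0..n_levels-1
               (assumed coding, e.g. minor-allele counts 0/1/2).
    labels:    list of class labels, 0 or 1.
    alpha:     additive (Laplace) smoothing constant, an assumption of this sketch.
    Returns (prior_term, tables) with tables[j][g] = log P(g | y=1) - log P(g | y=0).
    """
    n_snps = len(genotypes[0])
    n1 = sum(labels)
    n0 = len(labels) - n1
    prior_term = math.log((n1 + alpha) / (n0 + alpha))
    tables = []
    for j in range(n_snps):
        counts = {0: Counter(), 1: Counter()}
        for x, y in zip(genotypes, labels):
            counts[y][x[j]] += 1
        table = []
        for g in range(n_levels):
            p1 = (counts[1][g] + alpha) / (n1 + alpha * n_levels)
            p0 = (counts[0][g] + alpha) / (n0 + alpha * n_levels)
            table.append(math.log(p1 / p0))
        tables.append(table)
    return prior_term, tables


def llr_score(x, prior_term, tables):
    """Sum the prior term and the single-SNP terms; classify as 1 if positive."""
    return prior_term + sum(tables[j][g] for j, g in enumerate(x))


# Toy data: SNP 0 separates the classes, SNP 1 is uninformative at genotype 1.
X = [[0, 2], [0, 1], [2, 0], [2, 1]]
y = [0, 0, 1, 1]
prior, tabs = fit_first_order_llr(X, y)
score_case = llr_score([2, 1], prior, tabs)
score_control = llr_score([0, 1], prior, tabs)
```

A second-order expansion would add one table per SNP pair on top of these terms, which is where the number of parameters, and the number of interaction tests, grows quickly.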
Affiliation(s)
- Ali Foroughi pour
  - Department of Electrical and Computer Engineering, The Ohio State University, 2015 Neil Ave, Columbus, 43210 OH USA
  - Department of Mathematics, The Ohio State University, 231 West 18th Ave, Columbus, 43210 OH USA
- Maciej Pietrzak
  - Mathematical Biosciences Institute, 1735 Neil Ave, Columbus, 43210 OH USA
  - Department of Biomedical Informatics, The Ohio State University, 1585 Neil Ave, Columbus, 43210 OH USA
- Ezgi Karaesmen
  - College of Pharmacy, The Ohio State University, 500 West 12th Ave, Columbus, 43210 OH USA
- Lori A. Dalton
  - Department of Electrical and Computer Engineering, The Ohio State University, 2015 Neil Ave, Columbus, 43210 OH USA
- Grzegorz A. Rempała
  - Mathematical Biosciences Institute, 1735 Neil Ave, Columbus, 43210 OH USA
  - College of Public Health, The Ohio State University, 1841 Neil Ave, Columbus, 43210 OH USA
2
Chang LY, Toghiani S, Aggrey SE, Rekaya R. Increasing accuracy of genomic selection in presence of high density marker panels through the prioritization of relevant polymorphisms. BMC Genet 2019; 20:21. [PMID: 30795734 PMCID: PMC6387489 DOI: 10.1186/s12863-019-0720-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/27/2018] [Accepted: 02/04/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND It has become clear that increasing the density of marker panels, and even using sequence data, did not result in any meaningful increase in the accuracy of genomic selection (GS) using either regression (RM) or variance component (VC) approaches. This is in part due to the limitations of current methods. Association models are heavily over-parameterized and suffer from severe collinearity and a lack of statistical power. Even when the variant effects are not directly estimated, as in VC-based approaches, the genomic relationships did not improve after the marker density exceeded a certain threshold. SNP prioritization based on fixation index (FST) scores was used to track the majority of significant QTL and to reduce the dimensionality of the association model.
RESULTS Two populations with average LD between adjacent markers of 0.3 (P1) and 0.7 (P2) were simulated. In both populations, the genomic data consisted of 400 K SNP markers distributed on 10 chromosomes. The density of the simulated genomic data mimics roughly 1.2 million SNP markers in the bovine genome. The genomic relationship matrix (G) was calculated for each set of SNPs selected based on their FST scores, and similar numbers of SNPs were selected randomly for comparison. Using all 400 K SNPs, 46% of the off-diagonal elements (OD) were between -0.01 and 0.01. The same proportion was 31, 23 and 16% when 80 K, 40 K and 20 K SNPs, respectively, were selected based on FST scores. For randomly selected 20 K SNP subsets, around 33% of the OD fell within the same range. Genomic similarity computed using SNPs selected based on FST scores was always higher than that computed using the same number of SNPs selected randomly. Maximum accuracies of 0.741 and 0.828 were achieved when 20 K and 10 K SNPs were selected based on FST scores in P1 and P2, respectively.
CONCLUSIONS Genomic similarity could be maximized by the decrease in the number of selected SNPs, but it also leads to a decrease in the percentage of genetic variation explained by the selected markers. Finding the balance between these two parameters could optimize the accuracy of GS in the presence of high density marker panels.
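As a rough illustration of ranking SNPs by a fixation index, the sketch below computes the textbook heterozygosity-based F_ST, (H_T - H_S) / H_T, for a biallelic SNP and keeps the top-scoring markers. The abstract does not state which F_ST estimator was used or how the subpopulations were formed, so both are assumptions here, and the function names are hypothetical.

```python
def fst_biallelic(p_subpops, sizes=None):
    """F_ST for one biallelic SNP from subpopulation allele frequencies.

    p_subpops: allele frequency of the reference allele in each subpopulation.
    sizes:     optional subpopulation sizes used as weights (equal if omitted).
    """
    if sizes is None:
        sizes = [1.0] * len(p_subpops)
    total = float(sum(sizes))
    weights = [s / total for s in sizes]
    p_bar = sum(w * p for w, p in zip(weights, p_subpops))
    h_t = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity in the pooled population
    if h_t == 0.0:
        return 0.0  # monomorphic SNP: no differentiation can be measured
    # weighted mean expected heterozygosity within subpopulations
    h_s = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(weights, p_subpops))
    return (h_t - h_s) / h_t


def top_snps_by_fst(freqs_by_snp, k):
    """Return the indices of the k SNPs with the highest F_ST scores."""
    scores = [fst_biallelic(p) for p in freqs_by_snp]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Three SNPs, two subpopulations each.
freqs = [[0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
selected = top_snps_by_fst(freqs, 2)
```

In this toy input the first SNP has identical frequencies in both groups and is dropped, while the two differentiated SNPs are kept.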
Affiliation(s)
- Ling-Yun Chang
  - Department of Animal and Dairy Science, University of Georgia, Athens, GA, 30602, USA
  - ABS Global, Inc., DeForest, WI, 53532, USA
- Sajjad Toghiani
  - Department of Animal and Dairy Science, University of Georgia, Athens, GA, 30602, USA
  - USDA Agricultural Research Service, Fort Keogh Livestock and Range Research Laboratory, Miles City, MT, 59301, USA
- Samuel E Aggrey
  - Department of Poultry Science, University of Georgia, Athens, GA, 30602, USA
  - Institute of Bioinformatics, University of Georgia, Athens, GA, 30602, USA
- Romdhane Rekaya
  - Department of Animal and Dairy Science, University of Georgia, Athens, GA, 30602, USA
  - Institute of Bioinformatics, University of Georgia, Athens, GA, 30602, USA
3
Computational Biosensors: Molecules, Algorithms, and Detection Platforms. MODELING, METHODOLOGIES AND TOOLS FOR MOLECULAR AND NANO-SCALE COMMUNICATIONS 2017. [PMCID: PMC7123247 DOI: 10.1007/978-3-319-50688-3_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Advanced nucleic acid-based sensor-applications require computationally intelligent biosensors that are able to concurrently perform complex detection and classification of samples within an in vitro platform. Realization of these cutting-edge computational biosensor systems necessitates innovation and integration of three key technologies: molecular probes with computational capabilities, algorithmic methods to enable in vitro computational post processing and classification, and immobilization and detection approaches that enable the realization of deployable computational biosensor platforms. We provide an overview of current technologies, including our contributions towards the development of computational biosensor systems.
4
Shahinfar S, Page D, Guenther J, Cabrera V, Fricke P, Weigel K. Prediction of insemination outcomes in Holstein dairy cattle using alternative machine learning algorithms. J Dairy Sci 2014; 97:731-42. [DOI: 10.3168/jds.2013-6693] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Received: 02/13/2013] [Accepted: 09/11/2013] [Indexed: 11/19/2022]
5
Abstract
This study evaluated different female-selective genotyping strategies to increase the predictive accuracy of genomic breeding values (GBVs) in populations that have a limited number of sires with a large number of progeny. A simulated dairy population was utilized to address the aims of the study. The following selection strategies were used: random selection, two-tailed selection by yield deviations, two-tailed selection by breeding value, top yield deviation selection and top breeding value selection. For comparison, two other strategies, genotyping of sires and pedigree indexes from traditional genetic evaluation, were included in the analysis. Two scenarios were simulated: low heritability (h² = 0.10) and medium heritability (h² = 0.30). GBVs were estimated using the Bayesian Lasso. The accuracy of predicted GBVs using the two-tailed strategies was better than the accuracy obtained using other strategies (0.50 and 0.63 for the two-tailed selection by yield deviations strategy and 0.48 and 0.63 for the two-tailed selection by breeding values strategy in low- and medium-heritability scenarios, respectively, using 1000 genotyped cows). When 996 genotyped bulls were used as the training population, the sires strategy led to accuracies of 0.48 and 0.55 for low- and medium-heritability traits, respectively. The Random strategies required larger training populations to outperform the accuracies of the pedigree index; however, selecting females from the top of the yield deviations or breeding values of the population did not improve accuracy relative to that of the pedigree index. Bias was found for all genotyping strategies considered, although the Top strategies produced the most biased predictions. Strategies that involve genotyping cows can be implemented in breeding programs that have a limited number of sires with a reliable progeny test.
The results of this study showed that females that exhibited upper and lower extreme values within the distribution of yield deviations may be included in the training population to increase reliability in small reference populations. The strategies that selected only the females that had high estimated breeding values or yield deviations produced suboptimal results.
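The difference between the two-tailed and top strategies can be sketched in a few lines. This is only an illustration of the selection step, assuming an even split between the lower and upper tails; the study's actual split, the simulation, and the Bayesian Lasso fit are not reproduced, and the function names are hypothetical.

```python
def two_tailed_selection(values, n_select):
    """Pick animals from both extremes of a trait distribution.

    values:   yield deviations or breeding values, one per animal.
    n_select: number of animals to genotype; half come from each tail
              (the upper tail gets the extra animal when n_select is odd).
    Returns the selected indices in ascending order.
    """
    if n_select > len(values):
        raise ValueError("cannot select more animals than are available")
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_low = n_select // 2
    n_high = n_select - n_low
    low = order[:n_low]
    high = order[len(order) - n_high:] if n_high > 0 else []
    return sorted(low + high)


def top_selection(values, n_select):
    """Pick only the animals with the highest values (the 'top' strategy)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(order[:n_select])


yield_deviations = [0.3, -1.2, 2.5, 0.0, -0.4, 1.1]
two_tailed = two_tailed_selection(yield_deviations, 4)
top_only = top_selection(yield_deviations, 2)
```

The two-tailed set spans the full range of the trait, whereas the top set covers only one end of it, which is consistent with the bias the abstract reports for the Top strategies.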
6
Investigation of Single Nucleotide Polymorphisms Associated to Familial Combined Hyperlipidemia with Random Forests. ACTA ACUST UNITED AC 2013. [DOI: 10.1007/978-3-642-35467-0_18] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/18/2023]
7
Morota G, Valente BD, Rosa GJM, Weigel KA, Gianola D. An assessment of linkage disequilibrium in Holstein cattle using a Bayesian network. J Anim Breed Genet 2012; 129:474-87. [PMID: 23148973 DOI: 10.1111/jbg.12002] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 04/14/2012] [Accepted: 07/31/2012] [Indexed: 11/30/2022]
Abstract
Linkage disequilibrium (LD) is defined as a non-random association of the distributions of alleles at different loci within a population. This association between loci is valuable in prediction of quantitative traits in animals and plants and in genome-wide association studies. A question that arises is whether standard metrics such as D' and r² reflect complex associations in a genetic system properly. It seems reasonable to take the view that loci associate and interact together as a system or network, as opposed to in a simple pairwise manner. We used a Bayesian network (BN) as a representation of choice for an LD network. A BN is a graphical depiction of a probability distribution and can represent sets of conditional independencies. Moreover, it provides a visual display of the joint distribution of the set of random variables in question. The usefulness of BN for linkage disequilibrium was explored and illustrated using genetic marker loci found to have the strongest effects on milk protein in Holstein cattle based on three strategies for ranking marker effect estimates: posterior means, standardized posterior means and additive genetic variance. Two different algorithms, Tabu search (a local score-based algorithm) and incremental association Markov blanket (a constraint-based algorithm), coupled with the chi-square test, were used for learning the structure of the BN and were compared with the reference r² metric represented as an LD heat map. The BN captured several genetic markers associated as clusters, implying that markers are inter-related in a complicated manner. Further, the BN detected conditionally dependent markers. The results confirm that LD relationships are of a multivariate nature and that r² gives an incomplete description and understanding of LD.
Use of an LD Bayesian network enables inferring associations between loci in a systems framework and provides a more accurate picture of LD than that resulting from the use of pairwise metrics.
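For reference, the pairwise metrics the abstract contrasts with the Bayesian network can be computed from haplotype and allele frequencies using their standard definitions, D = p_AB - p_A p_B, r² = D² / (p_A (1 - p_A) p_B (1 - p_B)) and |D'| = |D| / D_max. The sketch below is a minimal illustration; the function name and the example frequencies are made up.

```python
def ld_metrics(p_ab, p_a, p_b):
    """Pairwise LD between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2.
    p_a:  frequency of allele A; p_b: frequency of allele B.
    Returns (D, r2, abs_d_prime).
    """
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom
    # D_max depends on the sign of D
    if d >= 0.0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    abs_d_prime = abs(d) / d_max if d_max > 0.0 else 0.0
    return d, r2, abs_d_prime


# Same pair of loci, two views: |D'| is at its maximum while r2 is well below 1.
d, r2, dp = ld_metrics(p_ab=0.4, p_a=0.5, p_b=0.4)
```

Both numbers summarize one pair of loci at a time, which is the limitation the network representation is meant to address.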
Affiliation(s)
- G Morota
- Department of Animal Sciences, University of Wisconsin, Madison, WI 53706, USA.
8
Parsimonious classification of binary lacunarity data computed from food surface images using kernel principal component analysis and artificial neural networks. Meat Sci 2011; 87:107-14. [DOI: 10.1016/j.meatsci.2010.08.014] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 05/29/2010] [Revised: 08/18/2010] [Accepted: 08/25/2010] [Indexed: 11/21/2022]
9
Okser S, Lehtimäki T, Elo LL, Mononen N, Peltonen N, Kähönen M, Juonala M, Fan YM, Hernesniemi JA, Laitinen T, Lyytikäinen LP, Rontu R, Eklund C, Hutri-Kähönen N, Taittonen L, Hurme M, Viikari JSA, Raitakari OT, Aittokallio T. Genetic variants and their interactions in the prediction of increased pre-clinical carotid atherosclerosis: the cardiovascular risk in young Finns study. PLoS Genet 2010; 6:e1001146. [PMID: 20941391 PMCID: PMC2947986 DOI: 10.1371/journal.pgen.1001146] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 12/18/2009] [Accepted: 09/01/2010] [Indexed: 12/14/2022]
Abstract
The relative contribution of genetic risk factors to the progression of subclinical atherosclerosis is poorly understood. It is likely that multiple variants are implicated in the development of atherosclerosis, but the subtle genotypic and phenotypic differences are beyond the reach of the conventional case-control designs and the statistical significance testing procedures being used in most association studies. Our objective here was to investigate whether an alternative approach, in which common disorders are treated as quantitative phenotypes that are continuously distributed over a population, can reveal predictive insights into early atherosclerosis, as assessed using ultrasound imaging-based quantitative measurement of carotid artery intima-media thickness (IMT). Using our population-based follow-up study of atherosclerosis precursors as a basis for sampling subjects with gradually increasing IMT levels, we searched for the subsets of genetic variants and their interactions that are most predictive of the various risk classes, rather than using exclusively those variants meeting a stringent level of statistical significance. The area under the receiver operating characteristic curve (AUC) was used to evaluate the predictive value of the variants, and cross-validation was used to assess how well the predictive models will generalize to other subsets of subjects. By means of our predictive modeling framework with machine learning-based SNP selection, we could improve the prediction of the extreme classes of atherosclerosis risk and progression over a 6-year period (average AUC 0.844 and 0.761), compared to that of using conventional cardiovascular risk factors alone (average AUC 0.741 and 0.629), or when combined with the statistically significant variants (average AUC 0.762 and 0.651). The predictive accuracy remained relatively high in an independent validation set of subjects (average decrease of 0.043).
These results demonstrate that the modeling framework can utilize the "gray zone" of genetic variation in the classification of subjects with different degrees of risk of developing atherosclerosis.
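The AUC used above as the evaluation metric has a simple rank interpretation: it is the probability that a randomly chosen positive subject receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that definition follows; the scores and labels are invented, and the study's own models and cross-validation setup are not reproduced.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: predicted risk scores, higher meaning more likely positive.
    labels: true classes, 1 for positive and 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


auc_mixed = auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
auc_perfect = auc_from_scores([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

The quadratic loop is fine for small validation folds; a sort-based rank computation gives the same value for larger samples.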
Affiliation(s)
- Sebastian Okser
  - Biomathematics Research Group, Department of Mathematics, University of Turku, Turku, Finland
- Terho Lehtimäki
  - Department of Clinical Chemistry, Tampere University Hospital and University of Tampere, Tampere, Finland
- Laura L. Elo
  - Biomathematics Research Group, Department of Mathematics, University of Turku, Turku, Finland
  - Data Mining and Modeling Group, Turku Centre for Biotechnology, Turku, Finland
- Nina Mononen
  - Department of Clinical Chemistry, Tampere University Hospital and University of Tampere, Tampere, Finland
- Nina Peltonen
  - Department of Clinical Chemistry, Tampere University Hospital and University of Tampere, Tampere, Finland
- Mika Kähönen
  - Department of Clinical Physiology, Tampere University Hospital and University of Tampere, Tampere, Finland
- Markus Juonala
  - Department of Medicine, Turku University Central Hospital, Turku, Finland
  - Research Centre of Applied and Preventive Cardiovascular Medicine, University of Turku, Turku, Finland
- Yue-Mei Fan
  - Department of Clinical Chemistry, Tampere University Hospital and University of Tampere, Tampere, Finland
- Jussi A. Hernesniemi
  - Department of Clinical Chemistry, Tampere University Hospital and University of Tampere, Tampere, Finland
- Tomi Laitinen
  - Department of Clinical Physiology and Nuclear Medicine, Kuopio University Hospital and University of Eastern Finland, Kuopio, Finland
- Leo-Pekka Lyytikäinen
  - Department of Clinical Chemistry, Tampere University Hospital and University of Tampere, Tampere, Finland
- Riikka Rontu
  - Department of Clinical Chemistry, Tampere University Hospital and University of Tampere, Tampere, Finland
- Carita Eklund
  - Department of Microbiology and Immunology, University of Tampere, Tampere, Finland
- Mikko Hurme
  - Department of Microbiology and Immunology, University of Tampere, Tampere, Finland
- Jorma S. A. Viikari
  - Department of Medicine, Turku University Central Hospital, Turku, Finland
  - Department of Medicine, University of Turku, Turku, Finland
- Olli T. Raitakari
  - Research Centre of Applied and Preventive Cardiovascular Medicine, University of Turku, Turku, Finland
  - Department of Clinical Physiology, Turku University Hospital, Turku, Finland
- Tero Aittokallio
  - Biomathematics Research Group, Department of Mathematics, University of Turku, Turku, Finland
  - Data Mining and Modeling Group, Turku Centre for Biotechnology, Turku, Finland
10
Wang G, Yang Y, Ott J. Genome-wide conditional search for epistatic disease-predisposing variants in human association studies. Hum Hered 2010; 70:34-41. [PMID: 20413980 PMCID: PMC2912644 DOI: 10.1159/000293722] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 12/07/2009] [Accepted: 03/01/2010] [Indexed: 11/19/2022]
Abstract
Genome-wide search for new disease variants, based on well-established variants, has a long history in linkage analysis but is less well-known in genetic case-control association studies. We developed a simple yet highly efficient conditional search method that can find new variants, which are associated with a disease only through epistatic interaction with another variant and do not necessarily have a direct association effect. Our approach is analogous to partitioning of χ² in a hierarchical design, which is a well-established statistical technique. Applied to previously published data on age-related macular degeneration, our method found two single-nucleotide polymorphisms with genome-wide significant epistatic interaction that could not be found based only on direct main effects.
Affiliation(s)
- Gao Wang
  - Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Tex., USA
- Yaning Yang
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei, China
- Jurg Ott
  - Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
11
Jannink JL, Lorenz AJ, Iwata H. Genomic selection in plant breeding: from theory to practice. Brief Funct Genomics 2010; 9:166-77. [PMID: 20156985 DOI: 10.1093/bfgp/elq001] [Citation(s) in RCA: 538] [Impact Index Per Article: 38.4] [Indexed: 01/22/2023]
Abstract
We intuitively believe that the dramatic drop in the cost of DNA marker information we have experienced should have immediate benefits in accelerating the delivery of crop varieties with improved yield, quality and biotic and abiotic stress tolerance. But these traits are complex and affected by many genes, each with small effect. Traditional marker-assisted selection has been ineffective for such traits. The introduction of genomic selection (GS), however, has shifted that paradigm. Rather than seeking to identify individual loci significantly associated with a trait, GS uses all marker data as predictors of performance and consequently delivers more accurate predictions. Selection can be based on GS predictions, potentially leading to more rapid and lower cost gains from breeding. The objectives of this article are to review essential aspects of GS and summarize the important take-home messages from recent theoretical, simulation and empirical studies. We then look forward and consider research needs surrounding methodological questions and the implications of GS for long-term selection.
Affiliation(s)
- Jean-Luc Jannink
- USDA-ARS, R.W. Holley Center for Agriculture and Health, Department of Plant Breeding and Genetics, Cornell University, Ithaca, New York 14853, USA.