1
Lazebnik T, Simon-Keren L. Cancer-inspired genomics mapper model for the generation of synthetic DNA sequences with desired genomics signatures. Comput Biol Med 2023; 164:107221. [PMID: 37478715 DOI: 10.1016/j.compbiomed.2023.107221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Revised: 06/16/2023] [Accepted: 06/30/2023] [Indexed: 07/23/2023]
Abstract
Genome data are crucial in modern medicine, offering significant potential for diagnosis and treatment. Thanks to technological advancements, many millions of healthy and diseased genomes have already been sequenced; however, obtaining the most suitable data for a specific study, particularly for validation studies, remains challenging with respect to scale and access. In silico genomics sequence generators have therefore been proposed as a possible solution. However, current generators produce inferior data because they rely mostly on shallow (stochastic) connections detected in the training data with limited computational complexity. As a result, they do not take into consideration the biological relations and constraints that originally caused the observed connections. To address this issue, we propose the cancer-inspired genomics mapper model (CGMM), which combines genetic algorithm (GA) and deep learning (DL) methods. CGMM mimics processes that generate genetic variations and mutations to transform readily available control genomes into genomes with the desired phenotypes. We demonstrate that CGMM can generate synthetic genomes of selected phenotypes, such as ancestry and cancer, that are indistinguishable from real genomes of such phenotypes, based on unsupervised clustering. Our results show that CGMM outperforms four current state-of-the-art genomics generators on two different tasks, suggesting that CGMM will be suitable for a wide range of purposes in genomic medicine, especially for much-needed validation studies.
Affiliation(s)
- Teddy Lazebnik
- Department of Cancer Biology, Cancer Institute, University College London, London, UK.
2
Liu L, Zhou J, Chen CJ, Zhang J, Wen W, Tian J, Zhang Z, Gu Y. GWAS-Based Identification of New Loci for Milk Yield, Fat, and Protein in Holstein Cattle. Animals (Basel) 2020; 10:E2048. [PMID: 33167458 PMCID: PMC7694478 DOI: 10.3390/ani10112048] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 09/23/2020] [Revised: 11/01/2020] [Accepted: 11/03/2020] [Indexed: 12/20/2022] Open
Abstract
High milk yield and high milk quality are the primary goals of dairy production. Understanding the genetic architecture underlying these milk-related traits is beneficial so that genetic variants can be targeted for genetic improvement. In this study, we measured five milk production and quality traits in a Holstein cattle population from China. These traits included milk yield, fat, and protein. We used the estimated breeding values as dependent variables to conduct genome-wide association studies (GWAS). Breeding values were estimated through pedigree relationships by using a linear mixed model. Genotyping was carried out on the individuals with phenotypes by using the Illumina BovineSNP150 BeadChip. The association analyses were conducted by using the fixed and random model Circulating Probability Unification (FarmCPU) method. A total of ten single-nucleotide polymorphisms (SNPs) were detected above the genome-wide significance threshold (p < 4.0 × 10⁻⁷), including six located in previously reported quantitative trait locus (QTL) regions. We found eight candidate genes within 120 kb upstream or downstream of the associated SNPs. The study not only identified the effect of the DGAT1 gene on milk fat and protein, but also discovered novel genetic loci and candidate genes related to milk traits. These novel genetic loci would be an important basis for molecular breeding in dairy cattle.
Affiliation(s)
- Liyuan Liu
- School of Agriculture, Ningxia University, Yinchuan 750021, Ningxia, China
- Department of Crop and Soil Sciences, Washington State University, Pullman, WA 99164, USA
- Jinghang Zhou
- School of Agriculture, Ningxia University, Yinchuan 750021, Ningxia, China
- Department of Crop and Soil Sciences, Washington State University, Pullman, WA 99164, USA
- Chunpeng James Chen
- Department of Crop and Soil Sciences, Washington State University, Pullman, WA 99164, USA
- Juan Zhang
- School of Agriculture, Ningxia University, Yinchuan 750021, Ningxia, China
- Wan Wen
- Animal Husbandry Workstation, Yinchuan 750001, Ningxia, China
- Jia Tian
- Animal Husbandry Workstation, Yinchuan 750001, Ningxia, China
- Zhiwu Zhang
- Department of Crop and Soil Sciences, Washington State University, Pullman, WA 99164, USA
- Yaling Gu
- School of Agriculture, Ningxia University, Yinchuan 750021, Ningxia, China
3
Blumenthal DB, Viola L, List M, Baumbach J, Tieri P, Kacprowski T. EpiGEN: an epistasis simulation pipeline. Bioinformatics 2020; 36:4957-4959. [DOI: 10.1093/bioinformatics/btaa245] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/14/2019] [Revised: 04/03/2020] [Accepted: 04/08/2020] [Indexed: 02/06/2023] Open
Abstract
Summary
Simulated data are crucial for evaluating epistasis detection tools in genome-wide association studies. Existing simulators are limited: they do not account for linkage disequilibrium (LD), support only limited interaction models of single nucleotide polymorphisms (SNPs), support only dichotomous phenotypes, or depend on proprietary software. In contrast, EpiGEN supports SNP interactions of arbitrary order, produces realistic LD patterns, and generates both categorical and quantitative phenotypes.
Availability and implementation
EpiGEN is implemented in Python 3 and is freely available at https://github.com/baumbachlab/epigen.
Supplementary information
Supplementary data are available at Bioinformatics online.
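As a rough illustration of what simulating an epistatic quantitative phenotype involves, the following minimal sketch draws genotypes for two SNPs and adds a multiplicative interaction term. It does not use EpiGEN or reproduce its interface: the function names, the Hardy-Weinberg sampling, the multiplicative model, and every parameter value are assumptions made for illustration, and LD is deliberately not modeled.

```python
import random


def sample_genotype(maf: float, rng: random.Random) -> int:
    """Return a minor-allele count (0, 1 or 2), assuming Hardy-Weinberg equilibrium."""
    return sum(rng.random() < maf for _ in range(2))


def simulate_pairwise_epistasis(n, maf_a, maf_b, interaction_effect, noise_sd, seed=0):
    """Simulate n individuals with two independent SNPs and a quantitative phenotype.

    Hypothetical model: phenotype = interaction_effect * g_a * g_b + Gaussian noise.
    The two SNPs are sampled independently, so no LD is represented here.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g_a = sample_genotype(maf_a, rng)
        g_b = sample_genotype(maf_b, rng)
        phenotype = interaction_effect * g_a * g_b + rng.gauss(0.0, noise_sd)
        rows.append((g_a, g_b, phenotype))
    return rows


# Illustrative values only; none of them come from the EpiGEN paper.
rows = simulate_pairwise_epistasis(1000, maf_a=0.3, maf_b=0.4,
                                   interaction_effect=1.5, noise_sd=0.5, seed=42)
```

A dichotomous phenotype could be derived from the same rows by thresholding the quantitative value, which is one simple way to cover both phenotype types mentioned in the summary.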
Affiliation(s)
- David B Blumenthal
- Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
- Lorenzo Viola
- Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
- Markus List
- Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
- Jan Baumbach
- Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
- Paolo Tieri
- CNR National Research Council, IAC Institute for Applied Computing, 00185 Rome, Italy
- Tim Kacprowski
- Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
4
Juan L, Wang Y, Jiang J, Yang Q, Jiang Q, Wang Y. PGsim: A Comprehensive and Highly Customizable Personal Genome Simulator. Front Bioeng Biotechnol 2020; 8:28. [PMID: 32047747 PMCID: PMC6997238 DOI: 10.3389/fbioe.2020.00028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/16/2019] [Accepted: 01/13/2020] [Indexed: 11/26/2022] Open
Abstract
Although genome sequencing has become increasingly popular, the simulation of individual genomes is still important. This is because sequencing a large number of individual genomes is costly, and genome data with extreme and boundary conditions, such as fatal genetic defects, are difficult to obtain. Privacy and legal barriers also prevent many applications of real data. Large sequencing projects in recent years have provided a deeper understanding of the human genome. However, there is a lack of tools that leverage known data to simulate personal genomes as realistically as possible. Here, we designed and developed PGsim, a comprehensive and highly customizable individual genome simulator that makes full use of existing knowledge, such as variant allele frequencies in the global population or major world populations, mutation probability differences between protein-coding and non-coding regions, transition/transversion (Ti/Tv) ratios, indel incidence, indel length distribution, structural variation sites, and pathogenic mutation sites. Users can flexibly control the proportion and quantity of known variants, common variants, novel variants in both coding and non-coding regions, and special variants through detailed parameter settings. While ensuring that the simulated personal genome has sufficient randomness, PGsim makes the generated variants more realistic and reliable in terms of variant distribution, proportion, and population characteristics. PGsim can use a very large database as background data to simulate personal genomes and does not require SQL database support. Users can easily change the variant databases used as needed. Because PGsim is a Perl script, there is no obstacle to running it on any version of macOS or Linux, and no libraries, packages, interpreters, compilers, or other dependencies need to be installed in advance. The PGsim tool is publicly available at https://github.com/lrjuan/PGsim.
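The transition/transversion (Ti/Tv) ratio mentioned among PGsim's inputs can be illustrated with a short sketch. This is not PGsim code (PGsim is a Perl script); the function names and the example variants below are hypothetical. It only applies the standard definition: a transition swaps a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T), and any other single-nucleotide change is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snv(ref: str, alt: str) -> str:
    """Classify a single-nucleotide variant as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid single-nucleotide change: {ref}>{alt}")
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) means transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def titv_ratio(variants) -> float:
    """Return transitions divided by transversions for (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in variants:
        if classify_snv(ref, alt) == "transition":
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("Ti/Tv is undefined without at least one transversion")
    return ti / tv


# Hypothetical variants: four transitions and two transversions.
example_variants = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
ratio = titv_ratio(example_variants)  # 4 / 2 = 2.0
```

A simulator that targets a given Ti/Tv ratio could use a check like this to verify that its generated single-nucleotide variants match the requested value.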
Affiliation(s)
- Liran Juan
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yongtian Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jingyi Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qi Yang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
5
Zhao M, Liu D, Qu H. Systematic review of next-generation sequencing simulators: computational tools, features and perspectives. Brief Funct Genomics 2018; 16:121-128. [PMID: 27069250 DOI: 10.1093/bfgp/elw012] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 01/03/2023] Open
Abstract
High-throughput next-generation sequencing (NGS) technologies have rapidly generated a large volume of genomic data. To aid the development and evaluation of new statistical models and computational methods, NGS-based simulators have been proposed to construct better experimental workflows. However, the comparative performance of these NGS simulators remains unclear. In this review, we conducted a comprehensive investigation of NGS simulators for various sequencing techniques, including DNA sequencing, metagenomic sequencing, RNA-seq, ChIP-seq and bisulfite sequencing for methylation.
6
Li X, Liao B, Chen H. A new technique for generating pathogenic barcodes in breast cancer susceptibility analysis. J Theor Biol 2015; 366:84-90. [DOI: 10.1016/j.jtbi.2014.11.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/22/2014] [Revised: 10/08/2014] [Accepted: 11/04/2014] [Indexed: 01/09/2023]
7
Liao B, Li X, Cai L, Cao Z, Chen H. A Hierarchical Clustering Method of Selecting Kernel SNP to Unify Informative SNP and Tag SNP. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:113-122. [PMID: 26357082 DOI: 10.1109/tcbb.2014.2351797] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/05/2023]
Abstract
Various strategies can be used to select representative single nucleotide polymorphisms (SNPs) from a large number of SNPs, such as tag SNPs for haplotype coverage and informative SNPs for haplotype reconstruction. Representative SNPs are not only instrumental in reducing the cost of genotyping, but also serve an important function in narrowing the combinatorial space in epistasis analysis. The capacity of kernel SNPs to unify informative SNPs and tag SNPs is explored, and inconsistencies are minimized in further studies. The correlation between multiple SNPs is formalized using multi-information measures. In extending the correlation, a distance formula for measuring the similarity between clusters is first designed to conduct hierarchical clustering. Hierarchical clustering takes both information gain and haplotype diversity into account, so that the proposed approach can achieve unification. The kernel SNPs are then selected from every cluster through the top rank or backward elimination scheme. Using these kernel SNPs, extensive experimental comparisons are conducted with informative SNPs on haplotype reconstruction accuracy and with tag SNPs on haplotype coverage. Results indicate that the kernel SNP can practically unify informative SNP and tag SNP and is therefore adaptable to various applications.
8
Peng B, Chen HS, Mechanic LE, Racine B, Clarke J, Gillanders E, Feuer EJ. Genetic data simulators and their applications: an overview. Genet Epidemiol 2014; 39:2-10. [PMID: 25504286 DOI: 10.1002/gepi.21876] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 06/15/2014] [Revised: 09/14/2014] [Accepted: 10/31/2014] [Indexed: 11/10/2022]
Abstract
Computer simulations have played an indispensable role in the development and applications of statistical models and methods for genetic studies across multiple disciplines. The need to simulate complex evolutionary scenarios and pseudo-datasets for various studies has fueled the development of dozens of computer programs with varying reliability, performance, and application areas. To help researchers compare and choose the most appropriate simulators for their studies, we have created the genetic simulation resources (GSR) website, which allows authors of simulation software to register their applications and describe them with more than 160 defined attributes. This article summarizes the properties of 93 simulators currently registered at GSR and provides an overview of the development and applications of genetic simulators. Unlike other review articles that address technical issues or compare simulators for particular application areas, we focus on software development, maintenance, and features of simulators, often from a historical perspective. Publications that cite these simulators are used to summarize both the applications of genetic simulations and the utilization of simulators.
Affiliation(s)
- Bo Peng
- Department of Bioinformatics and Computational Biology, The University of Texas, MD Anderson Cancer Center, Houston, Texas, United States of America