1
Gao Z, Liu Q, Zeng W, Jiang R, Wong WH. EpiGePT: a pretrained transformer-based language model for context-specific human epigenomics. Genome Biol 2024; 25:310. [PMID: 39696471 PMCID: PMC11657395 DOI: 10.1186/s13059-024-03449-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Accepted: 11/28/2024] [Indexed: 12/20/2024] Open
Abstract
The inherent similarities between natural language and biological sequences have inspired the use of large language models in genomics, but current models struggle to incorporate chromatin interactions or predict in unseen cellular contexts. To address this, we propose EpiGePT, a transformer-based model designed for predicting context-specific human epigenomic signals. By incorporating transcription factor activities and 3D genome interactions, EpiGePT outperforms existing methods in epigenomic signal prediction tasks, especially in cell-type-specific long-range interaction predictions and genetic variant impacts, advancing our understanding of gene regulation. A free online prediction service is available at http://health.tsinghua.edu.cn/epigept .
Affiliation(s)
- Zijing Gao
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China
- Qiao Liu
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA.
- Wanwen Zeng
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, 100084, China.
- Wing Hung Wong
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA.
- Department of Biomedical Data Science, Bio-X Program, Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, 94305, USA.

2
Sengupta P, Dutta S, Liew F, Samrot A, Dasgupta S, Rajput MA, Slama P, Kolesarova A, Roychoudhury S. Reproductomics: Exploring the Applications and Advancements of Computational Tools. Physiol Res 2024; 73:687-702. [PMID: 39530905 PMCID: PMC11629954 DOI: 10.33549/physiolres.935389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Accepted: 06/25/2024] [Indexed: 12/13/2024] Open
Abstract
Over recent decades, advancements in omics technologies, such as proteomics, genomics, epigenomics, metabolomics, transcriptomics, and microbiomics, have significantly enhanced our understanding of the molecular mechanisms underlying various physiological and pathological processes. Nonetheless, the analysis and interpretation of vast omics data concerning reproductive diseases are complicated by the cyclic regulation of hormones and multiple other factors, which, in conjunction with an individual's genetic makeup, lead to diverse biological responses. Reproductomics investigates the interplay between an individual's hormonal regulation, environmental factors, genetic predisposition (DNA composition and epigenome), health effects, and resulting biological outcomes. It is a rapidly emerging field that utilizes computational tools to analyze and interpret reproductive data, with the aim of improving reproductive health outcomes. It is time to explore the applications of reproductomics in understanding the molecular mechanisms underlying infertility, identifying potential biomarkers for diagnosis and treatment, and improving assisted reproductive technologies (ARTs). Reproductomics tools include machine learning algorithms for predicting fertility outcomes, gene editing technologies for correcting genetic abnormalities, and single-cell sequencing techniques for analyzing gene expression patterns at the individual cell level. However, the use of reproductomics involves several challenges, limitations and ethical issues, such as the applications of gene editing technologies and their potential impact on future generations, which are also discussed. The review comprehensively covers the applications and advancements of reproductomics, highlighting its potential to improve reproductive health outcomes and deepen our understanding of reproductive molecular mechanisms.
Affiliation(s)
- P Sengupta
- Department of Biomedical Sciences, College of Medicine, Gulf Medical University, Ajman, UAE; Department of Life Science and Bioinformatics, Assam University, Silchar, India.

3
Tran H, Friendship R, Poljak Z. Classification of group A rotavirus VP7 and VP4 genotypes using random forest. Front Genet 2023; 14:1029185. [PMID: 37323680 PMCID: PMC10267748 DOI: 10.3389/fgene.2023.1029185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 05/15/2023] [Indexed: 06/17/2023] Open
Abstract
Introduction: Group A rotaviruses are major pathogens causing severe diarrhea in young children and neonates of many different species of animals worldwide, and group A rotavirus sequence data are becoming increasingly available over time. Different methods exist that allow for rotavirus genotyping, but machine learning methods have yet to be explored. Usage of machine learning algorithms such as random forest alongside alignment-based methodology may allow for both efficient and accurate classification of circulating rotavirus genotypes through the dual classification system. Methods: Random forest models were trained on positional features obtained from pairwise and multiple sequence alignment and cross-validated using 10-fold cross-validation repeated three times and leave-one-out cross-validation. Models were then validated on unseen data from the testing datasets to observe real-world performance. Results: All models were found to perform strongly in classification of VP7 and VP4 genotypes with high overall accuracy and kappa values during model training (0.975-0.992, 0.970-0.989) and during model testing (0.972-0.996, 0.969-0.996), respectively. Models trained on multiple sequence alignment generally had slightly higher overall accuracy and kappa values than models trained on pairwise sequence alignment. In contrast, pairwise sequence alignment models were found to be generally faster than multiple sequence alignment models in computational speed when models do not need to be retrained. Models that used 10-fold cross-validation repeated three times were also found to be much faster than models that used leave-one-out cross-validation, with no noticeable difference in overall accuracy and kappa values between the cross-validation methods. Discussion: Overall, random forest models showed strong performance in the classification of both group A rotavirus VP7 and VP4 genotypes. Application of these models as classifiers will allow for rapid and accurate classification of the increasing amounts of rotavirus sequence data that are becoming available.
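The abstract above reports overall accuracy and kappa side by side. As a reading aid, here is a minimal sketch of how those two metrics are commonly computed, assuming "kappa" means Cohen's kappa (the abstract does not say which variant was used). The function name and the genotype labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (overall accuracy, Cohen's kappa) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of samples whose prediction matches the truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement expected from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Only one class on both sides: kappa is undefined.
        return p_o, float("nan")
    return p_o, (p_o - p_e) / (1.0 - p_e)


# Hypothetical genotype calls, for illustration only.
truth = ["G1", "G1", "G2", "G2", "G3", "G3"]
pred = ["G1", "G1", "G2", "G3", "G3", "G3"]
acc, kappa = accuracy_and_kappa(truth, pred)
```

Kappa discounts the agreement that would be expected by chance, which is why it is reported alongside raw accuracy when genotype classes are unevenly represented.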

4
Liu Q, Hua K, Zhang X, Wong WH, Jiang R. DeepCAGE: Incorporating Transcription Factors in Genome-wide Prediction of Chromatin Accessibility. Genomics, Proteomics & Bioinformatics 2022; 20:496-507. [PMID: 35293310 PMCID: PMC9801045 DOI: 10.1016/j.gpb.2021.08.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/31/2021] [Revised: 05/31/2021] [Accepted: 09/27/2021] [Indexed: 01/26/2023]
Abstract
Although computational approaches have been complementing high-throughput biological experiments for the identification of functional regions in the human genome, it remains a great challenge to systematically decipher interactions between transcription factors (TFs) and regulatory elements to achieve interpretable annotations of chromatin accessibility across diverse cellular contexts. To solve this problem, we propose DeepCAGE, a deep learning framework that integrates sequence information and binding statuses of TFs, for the accurate prediction of chromatin accessible regions at a genome-wide scale in a variety of cell types. DeepCAGE takes advantage of a densely connected deep convolutional neural network architecture to automatically learn sequence signatures of known chromatin accessible regions and then incorporates such features with expression levels and binding activities of human core TFs to predict novel chromatin accessible regions. In a series of systematic comparisons with existing methods, DeepCAGE exhibits superior performance in not only the classification but also the regression of chromatin accessibility signals. In a detailed analysis of TF activities, DeepCAGE successfully extracts novel binding motifs and measures the contribution of a TF to the regulation with respect to a specific locus in a certain cell type. When applied to whole-genome sequencing data analysis, our method successfully prioritizes putative deleterious variants underlying a human complex trait and thus provides insights into the understanding of disease-associated genetic variants. DeepCAGE can be downloaded from https://github.com/kimmo1019/DeepCAGE.
Affiliation(s)
- Qiao Liu
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China; Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Kui Hua
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Xuegong Zhang
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Wing Hung Wong
- Department of Statistics, Stanford University, Stanford, CA 94305, USA (corresponding author)
- Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China (corresponding author)

5
Yang XF, Zhou YK, Zhang L, Gao Y, Du PF. Predicting LncRNA Subcellular Localization Using Unbalanced Pseudo-k Nucleotide Compositions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190902151038] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/17/2022]
Abstract
Background: Long non-coding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that function in the regulation of gene expression. Growing evidence has shown that the biological functions of lncRNAs are intimately related to their subcellular localizations. Therefore, it is very important to determine lncRNA subcellular localization.
Methods: In this paper, we proposed a novel method to predict the subcellular localization of lncRNAs. To utilize lncRNA sequence information more comprehensively, we exploited both k-mer nucleotide composition and sequence-order-correlated factors of lncRNA to formulate lncRNA sequences. Meanwhile, a feature selection technique based on the Analysis Of Variance (ANOVA) was applied to obtain the optimal feature subset. Finally, we used the support vector machine (SVM) to perform the prediction.
Results: The AUC value of the proposed method can reach 0.9695, which indicates that the proposed predictor is an efficient and reliable tool for determining lncRNA subcellular localization. Furthermore, the predictor can reach a maximum overall accuracy of 90.37% in leave-one-out cross-validation, which clearly outperforms the existing state-of-the-art method.
Conclusion: It is demonstrated that the proposed predictor is feasible and powerful for the prediction of lncRNA subcellular localization. To facilitate subsequent genetic sequence research, we shared the source code at https://github.com/NicoleYXF/lncRNA.
Affiliation(s)
- Xiao-Fei Yang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Yuan-Ke Zhou
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Lin Zhang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Yang Gao
- School of Medicine, Nankai University, Tianjin 300071, China
- Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China

6
Liu H, Duncan K, Helverson A, Kumari P, Mumm C, Xiao Y, Carlson JC, Darbellay F, Visel A, Leslie E, Breheny P, Erives AJ, Cornell RA. Analysis of zebrafish periderm enhancers facilitates identification of a regulatory variant near human KRT8/18. eLife 2020; 9:e51325. [PMID: 32031521 PMCID: PMC7039683 DOI: 10.7554/elife.51325] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 08/25/2019] [Accepted: 02/06/2020] [Indexed: 12/18/2022] Open
Abstract
Genome-wide association studies for non-syndromic orofacial clefting (OFC) have identified single nucleotide polymorphisms (SNPs) at loci where the presumed risk-relevant gene is expressed in oral periderm. The functional subsets of such SNPs are difficult to predict because the sequence underpinnings of periderm enhancers are unknown. We applied ATAC-seq to models of human palate periderm, including zebrafish periderm, mouse embryonic palate epithelia, and a human oral epithelium cell line, and to complementary mesenchymal cell types. We identified sets of enhancers specific to the epithelial cells and trained gapped-kmer support-vector-machine classifiers on these sets. We used the classifiers to predict the effects of 14 OFC-associated SNPs at 12q13 near KRT18. All the classifiers picked the same SNP as having the strongest effect, but the significance was highest with the classifier trained on zebrafish periderm. Reporter and deletion analyses support this SNP as lying within a periderm enhancer regulating KRT18/KRT8 expression.
Affiliation(s)
- Huan Liu
- State Key Laboratory Breeding Base of Basic Science of Stomatology (Hubei-MOST) and Key Laboratory for Oral Biomedicine of Ministry of Education (KLOBM), School and Hospital of Stomatology, Wuhan University, Wuhan, China
- Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
- Department of Periodontology, School of Stomatology, Wuhan University, Wuhan, China
- Kaylia Duncan
- Interdisciplinary Program in Molecular Medicine, University of Iowa, Iowa City, United States
- Annika Helverson
- Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
- Priyanka Kumari
- Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
- Camille Mumm
- Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
- Yao Xiao
- State Key Laboratory Breeding Base of Basic Science of Stomatology (Hubei-MOST) and Key Laboratory for Oral Biomedicine of Ministry of Education (KLOBM), School and Hospital of Stomatology, Wuhan University, Wuhan, China
- Fabrice Darbellay
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley Laboratories, Berkeley, United States
- Axel Visel
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley Laboratories, Berkeley, United States
- U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley Laboratories, Berkeley, United States
- University of California, Merced, Merced, United States
- Elizabeth Leslie
- Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia
- Patrick Breheny
- Department of Biostatistics, University of Iowa, Iowa City, United States
- Albert J Erives
- Department of Biology, University of Iowa, Iowa City, United States
- Robert A Cornell
- Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
- Interdisciplinary Program in Molecular Medicine, University of Iowa, Iowa City, United States

7
Mejía-Guerra MK, Buckler ES. A k-mer grammar analysis to uncover maize regulatory architecture. BMC Plant Biology 2019; 19:103. [PMID: 30876396 PMCID: PMC6419808 DOI: 10.1186/s12870-019-1693-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 04/25/2018] [Accepted: 02/21/2019] [Indexed: 05/06/2023]
Abstract
BACKGROUND Only a small percentage of the genome sequence is involved in regulation of gene expression, but to biochemically identify this portion is expensive and laborious. In species like maize, with diverse intergenic regions and lots of repetitive elements, this is an especially challenging problem that limits the use of the data from one line to the other. While regulatory regions are rare, they do have characteristic chromatin contexts and sequence organization (the grammar) with which they can be identified. RESULTS We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features - k-mers. To do this, we borrowed two approaches from the field of natural language processing: (1) "bag-of-words" which is commonly used for differentially weighting key words in tasks like sentiment analyses, and (2) a vector-space model using word2vec (vector-k-mers), that captures semantic and linguistic relationships between words. We built "bag-of-k-mers" and "vector-k-mers" models that distinguish between regulatory and non-regulatory regions with an average accuracy above 90%. Our "bag-of-k-mers" achieved higher overall accuracy, while the "vector-k-mers" models were more useful in highlighting key groups of sequences within the regulatory regions. CONCLUSIONS These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.
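To make the "bag-of-k-mers" idea concrete, the following is a minimal sketch of turning one DNA sequence into k-mer counts, the kind of feature vector such a classifier could consume. The function name, the default `k=6`, and the choice to skip windows containing ambiguous bases are assumptions for illustration; the abstract does not state the k-mer length or weighting scheme the authors used.

```python
from collections import Counter


def bag_of_kmers(seq, k=6):
    """Count every overlapping k-mer in seq, ignoring windows with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows that contain ambiguous symbols such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Toy sequence with one ambiguous base, for illustration only.
example = bag_of_kmers("ACGTACGTNACG", k=3)
```

As in bag-of-words text models, the order of k-mers along the sequence is discarded; only their frequencies are retained.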
Affiliation(s)
- Edward S. Buckler
- Institute for Genomic Diversity, Cornell University, 175 Biotechnology Building, Ithaca, 14853 NY USA
- USDA-ARS, Research Geneticist, USDA ARS Robert Holley Center, Ithaca, 14853 NY USA
- Department of Plant Breeding and Genetics, Cornell University, 159 Biotechnology Building, Ithaca, 14853 NY USA

8
Liu Q, Xia F, Yin Q, Jiang R. Chromatin accessibility prediction via a hybrid deep convolutional neural network. Bioinformatics 2018; 34:732-738. [PMID: 29069282 PMCID: PMC6192215 DOI: 10.1093/bioinformatics/btx679] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Received: 06/08/2017] [Accepted: 10/20/2017] [Indexed: 11/26/2022] Open
Abstract
Motivation: A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies. Results: We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparisons with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. In addition, we visualize the convolutional kernels and show the match between identified sequence signatures and known motifs. We finally demonstrate the sensitivity of our model in finding causative noncoding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases. Availability and implementation: Deopen is freely available at https://github.com/kimmo1019/Deopen. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qiao Liu
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing 100084, China
- Fei Xia
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Qijin Yin
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing 100084, China
- Rui Jiang
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing 100084, China

9
Jia Y, Li H, Wang J, Meng H, Yang Z. Spectrum structures and biological functions of 8-mers in the human genome. Genomics 2018. [PMID: 29522801 DOI: 10.1016/j.ygeno.2018.03.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/30/2022]
Abstract
The spectra of k-mer frequencies can reveal the structures and evolution of genome sequences. We confirmed that the trimodal spectrum of 8-mers in human genome sequences is distinguished only by the CG2, CG1 and CG0 8-mer sets, which contain 2, 1 or 0 CpG, respectively. This phenomenon is called the independent selection law. The three types of CG 8-mers were considered as different functional elements. We conjectured that (1) nucleosome binding motifs are mainly characterized by CG1 8-mers and (2) the core structural units of CpG island (CGI) sequences are predominantly characterized by CG2 8-mers. To validate our conjectures, nucleosome-occupied sequences and CGI sequences were extracted, and sequence parameters were then constructed from the information of each of the three CG 8-mer sets. ROC analysis showed that CG1 8-mers are preferentially found in nucleosome-occupied segments (AUC > 0.7) and CG2 8-mers are preferentially found in CGI sequences (AUC > 0.99). This validates our conjectures in principle.
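The CG0/CG1/CG2 grouping described above can be sketched as a simple count of CpG dinucleotides per 8-mer. The function names below are hypothetical, and labelling 8-mers with three or more CpGs as `CG3`, `CG4` is an assumption made here for completeness; the abstract names only the CG0, CG1 and CG2 sets.

```python
from collections import Counter


def cpg_class(kmer):
    """Label a k-mer by its number of CpG (5'-CG-3') dinucleotides, e.g. 'CG1'."""
    # 'CG' cannot overlap with itself, so str.count gives the exact CpG count.
    return f"CG{kmer.upper().count('CG')}"


def class_spectrum(seq, k=8):
    """Tally how many overlapping k-mers of seq fall into each CpG class."""
    seq = seq.upper()
    tally = Counter()
    for i in range(len(seq) - k + 1):
        tally[cpg_class(seq[i:i + k])] += 1
    return tally


# Toy sequence, for illustration only: both of its 8-mers contain one CpG.
spectrum = class_spectrum("AACGTTTTA")
```

Applied genome-wide, the per-class tallies would give the three 8-mer sets whose frequency distributions the abstract describes as forming the trimodal spectrum.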
Affiliation(s)
- Yun Jia
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; College of Science, Inner Mongolia University of Technology, Hohhot 010051, China
- Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
- Jingfeng Wang
- College of Science, Inner Mongolia University of Technology, Hohhot 010051, China
- Hu Meng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Zhenhua Yang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China