1
Huo Q, Song R, Ma Z. Recent advances in exploring transcriptional regulatory landscape of crops. Frontiers in Plant Science 2024; 15:1421503. [PMID: 38903438 PMCID: PMC11188431 DOI: 10.3389/fpls.2024.1421503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 05/23/2024] [Indexed: 06/22/2024]
Abstract
Crop breeding entails developing and selecting plant varieties with improved agronomic traits. Modern molecular techniques, such as genome editing, enable more efficient manipulation of plant phenotype by altering the expression of particular regulatory or functional genes. Hence, it is essential to thoroughly comprehend the transcriptional regulatory mechanisms that underpin these traits. In the multi-omics era, a large amount of omics data has been generated for diverse crop species, including genomics, epigenomics, transcriptomics, proteomics, and single-cell omics. The abundant data resources and the emergence of advanced computational tools offer unprecedented opportunities for obtaining a holistic view and profound understanding of the regulatory processes linked to desirable traits. This review focuses on integrated network approaches that utilize multi-omics data to investigate gene expression regulation. Various types of regulatory networks and their inference methods are discussed, focusing on recent advancements in crop plants. The integration of multi-omics data has been proven to be crucial for the construction of high-confidence regulatory networks. With the refinement of these methodologies, they will significantly enhance crop breeding efforts and contribute to global food security.
Affiliation(s)
- Zeyang Ma
- State Key Laboratory of Maize Bio-breeding, Frontiers Science Center for Molecular Design Breeding, Joint International Research Laboratory of Crop Molecular Breeding, National Maize Improvement Center, College of Agronomy and Biotechnology, China Agricultural University, Beijing, China
2
Vorontsov IE, Eliseeva IA, Zinkevich A, Nikonov M, Abramov S, Boytsov A, Kamenets V, Kasianova A, Kolmykov S, Yevshin I, Favorov A, Medvedeva YA, Jolma A, Kolpakov F, Makeev V, Kulakovskiy I. HOCOMOCO in 2024: a rebuild of the curated collection of binding models for human and mouse transcription factors. Nucleic Acids Res 2024; 52:D154-D163. [PMID: 37971293 PMCID: PMC10767914 DOI: 10.1093/nar/gkad1077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 10/17/2023] [Accepted: 10/26/2023] [Indexed: 11/19/2023]
Abstract
We present a major update of the HOCOMOCO collection that provides DNA binding specificity patterns of 949 human transcription factors and 720 mouse orthologs. To make this release, we performed motif discovery in peak sets that originated from 14 183 ChIP-Seq experiments and reads from 2554 HT-SELEX experiments yielding more than 400 thousand candidate motifs. The candidate motifs were annotated according to their similarity to known motifs and the hierarchy of DNA-binding domains of the respective transcription factors. Next, the motifs underwent human expert curation to stratify distinct motif subtypes and remove non-informative patterns and common artifacts. Finally, the curated subset of 100 thousand motifs was supplied to the automated benchmarking to select the best-performing motifs for each transcription factor. The resulting HOCOMOCO v12 core collection contains 1443 verified position weight matrices, including distinct subtypes of DNA binding motifs for particular transcription factors. In addition to the core collection, HOCOMOCO v12 provides motif sets optimized for the recognition of binding sites in vivo and in vitro, and for annotation of regulatory sequence variants. HOCOMOCO is available at https://hocomoco12.autosome.org and https://hocomoco.autosome.org.
Affiliation(s)
- Ilya E Vorontsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Irina A Eliseeva
- Institute of Protein Research, Russian Academy of Sciences, 142290 Pushchino, Russia
- Arsenii Zinkevich
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991 Moscow, Russia
- Mikhail Nikonov
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991 Moscow, Russia
- Sergey Abramov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Altius Institute for Biomedical Sciences, 98121 Seattle, WA, USA
- Alexandr Boytsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Altius Institute for Biomedical Sciences, 98121 Seattle, WA, USA
- Vasily Kamenets
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Moscow Institute of Physics and Technology, 141700 Dolgoprudny, Russia
- Institute of Biochemistry and Genetics of the Ufa Federal Research Centre of the Russian Academy of Sciences, 450054 Ufa, Russia
- Alexandra Kasianova
- Skolkovo Institute of Science and Technology, 121205 Moscow, Russia
- Institute for Information Transmission Problems of the Russian Academy of Sciences, 127051 Moscow, Russia
- Semyon Kolmykov
- Department of Computational Biology, Sirius University of Science and Technology, 354340 Sirius, Krasnodar region, Russia
- Alexander Favorov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Yulia A Medvedeva
- Research Center of Biotechnology RAS, Russian Academy of Sciences, 119071 Moscow, Russia
- Arttu Jolma
- Donnelly Centre, University of Toronto, Toronto, Ontario M5S 3E1, Canada
- Fedor Kolpakov
- Department of Computational Biology, Sirius University of Science and Technology, 354340 Sirius, Krasnodar region, Russia
- Bioinformatics Laboratory, Federal Research Center for Information and Computational Technologies, 630090 Novosibirsk, Russia
- Vsevolod J Makeev
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Moscow Institute of Physics and Technology, 141700 Dolgoprudny, Russia
- Institute of Biochemistry and Genetics of the Ufa Federal Research Centre of the Russian Academy of Sciences, 450054 Ufa, Russia
- Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, 142290 Pushchino, Russia
- Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, 420008 Kazan, Russia
3
Yang Z, Li X, Sheng L, Zhu M, Lan X, Gu F. Multiomics-integrated deep language model enables in silico genome-wide detection of transcription factor binding site in unexplored biosamples. Bioinformatics 2024; 40:btae013. [PMID: 38216534 PMCID: PMC10812877 DOI: 10.1093/bioinformatics/btae013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Revised: 12/07/2023] [Accepted: 01/11/2024] [Indexed: 01/14/2024]
Abstract
MOTIVATION Transcription factor binding sites (TFBS) are regulatory elements that have a significant impact on transcription regulation and cell fate determination. Canonical motifs, biological experiments, and computational methods have made it possible to discover TFBS. However, most existing in silico TFBS prediction models are solely DNA-based and are trained and used within the same biosample, so they fail to infer TFBS in experimentally unexplored biosamples. RESULTS Here, we propose TFBS prediction by modified TransFormer (TFTF), a multimodal deep language architecture that integrates multiomics information in epigenetic studies. In comparison to existing computational techniques, TFTF has state-of-the-art accuracy, and is also the first approach to accurately perform genome-wide detection for cell-type and species-specific TFBS in experimentally unexplored biosamples. Compared to peak calling methods, TFTF consistently discovers true TFBS in a threshold-tuning-free way, with higher recall rates. The underlying mechanism of TFTF reveals greater attention to the targeted TF's motif region in TFBS, and general attention to the entire peak region in non-TFBS. TFTF can benefit from the integration of broader and more diverse data for improvement and can be applied to multiple epigenetic scenarios. AVAILABILITY AND IMPLEMENTATION We provide a web server (https://tftf.ibreed.cn/) for users to utilize the TFTF model. Users can train the TFTF model and discover TFBS with their own data.
Affiliation(s)
- Zikun Yang
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
- Xin Li
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
- Lele Sheng
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
- Ming Zhu
- Department of Basic Medical Science, School of Medicine, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Joint Center for Life Sciences, Tsinghua University, Beijing 100084, China
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Xun Lan
- Department of Basic Medical Science, School of Medicine, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Joint Center for Life Sciences, Tsinghua University, Beijing 100084, China
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Fei Gu
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
4
Breton TS, Fike S, Francis M, Patnaude M, Murray CA, DiMaggio MA. Characterizing the SREB G protein-coupled receptor family in fish: Brain gene expression and genomic differences in upstream transcription factor binding sites. Comp Biochem Physiol A Mol Integr Physiol 2023; 285:111507. [PMID: 37611891 PMCID: PMC10529039 DOI: 10.1016/j.cbpa.2023.111507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 07/12/2023] [Accepted: 08/20/2023] [Indexed: 08/25/2023]
Abstract
The SREB (Super-conserved Receptors Expressed in Brain) family of orphan G protein-coupled receptors is highly conserved in vertebrates and consists of three members: SREB1 (orphan designation GPR27), SREB2 (GPR85), and SREB3 (GPR173). SREBs are associated with processes ranging from neuronal plasticity to reproductive control. Relatively little is known about similarities across the entire family, or how mammalian gene expression patterns compare to non-mammalian vertebrates. In fish, this system may be particularly complex, as some species have gained a fourth member (SREB3B) while others have lost genes. To better understand the system, the present study aimed to: 1) use qPCR to characterize sreb and related gene expression patterns in the brains of three fish species with different systems, and 2) identify possible differences in transcriptional regulation among the receptors, using upstream transcription factor binding sites across 70 ray-finned fish genomes. Overall, regional patterns of sreb expression were abundant in forebrain-related areas. However, some species-specific patterns were detected, such as abundant expression of receptors in zebrafish (Danio rerio) hypothalamic-containing sections, and divergence between sreb3a and sreb3b in pufferfish (Dichotomyctere nigroviridis). In addition, a gene possibly related to the system (dkk3a) was spatially correlated with the receptors in all three species. Genomic regions upstream of sreb2 and sreb3b, but largely not sreb1 or sreb3a, contained many highly conserved transcription factor binding sites. These results provide novel information about expression differences and transcriptional regulation across fish that may inform future research to better understand these receptors.
Affiliation(s)
- Timothy S Breton
- Division of Natural Sciences, University of Maine at Farmington, Farmington, ME 04938, USA.
- Samantha Fike
- Division of Natural Sciences, University of Maine at Farmington, Farmington, ME 04938, USA
- Mullein Francis
- Division of Natural Sciences, University of Maine at Farmington, Farmington, ME 04938, USA
- Michael Patnaude
- Division of Natural Sciences, University of Maine at Farmington, Farmington, ME 04938, USA
- Casey A Murray
- Tropical Aquaculture Laboratory, Program in Fisheries and Aquatic Sciences, School of Forest, Fisheries, and Geomatics Sciences, Institute of Food and Agricultural Sciences, University of Florida, Ruskin, FL 33570, USA
- Matthew A DiMaggio
- Tropical Aquaculture Laboratory, Program in Fisheries and Aquatic Sciences, School of Forest, Fisheries, and Geomatics Sciences, Institute of Food and Agricultural Sciences, University of Florida, Ruskin, FL 33570, USA
5
Gamache J, Gingerich D, Shwab EK, Barrera J, Garrett ME, Hume C, Crawford GE, Ashley-Koch AE, Chiba-Falek O. Integrative single-nucleus multi-omics analysis prioritizes candidate cis and trans regulatory networks and their target genes in Alzheimer's disease brains. Cell Biosci 2023; 13:185. [PMID: 37789374 PMCID: PMC10546724 DOI: 10.1186/s13578-023-01120-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/26/2023] [Accepted: 08/30/2023] [Indexed: 10/05/2023]
Abstract
BACKGROUND The genetic underpinnings of late-onset Alzheimer's disease (LOAD) are yet to be fully elucidated. Although numerous LOAD-associated loci have been discovered, the causal variants and their target genes remain largely unknown. Since the brain is composed of heterogenous cell subtypes, it is imperative to study the brain on a cell subtype specific level to explore the biological processes underlying LOAD. METHODS Here, we present the largest parallel single-nucleus (sn) multi-omics study to simultaneously profile gene expression (snRNA-seq) and chromatin accessibility (snATAC-seq) to date, using nuclei from 12 normal and 12 LOAD brains. We identified cell subtype clusters based on gene expression and chromatin accessibility profiles and characterized cell subtype-specific LOAD-associated differentially expressed genes (DEGs), differentially accessible peaks (DAPs) and cis co-accessibility networks (CCANs). RESULTS Integrative analysis defined disease-relevant CCANs in multiple cell subtypes and discovered LOAD-associated cell subtype-specific candidate cis regulatory elements (cCREs), their candidate target genes, and trans-interacting transcription factors (TFs), some of which, including ELK1, JUN, and SMAD4 in excitatory neurons, were also LOAD-DEGs. Finally, we focused on a subset of cell subtype-specific CCANs that overlap known LOAD-GWAS regions and catalogued putative functional SNPs changing the affinities of TF motifs within LOAD-cCREs linked to LOAD-DEGs, including APOE and MYO1E in a specific subtype of microglia and BIN1 in a subpopulation of oligodendrocytes. CONCLUSIONS To our knowledge, this study represents the most comprehensive systematic interrogation to date of regulatory networks and the impact of genetic variants on gene dysregulation in LOAD at a cell subtype resolution. 
Our findings reveal crosstalk between epigenetic, genomic, and transcriptomic determinants of LOAD pathogenesis and define catalogues of candidate genes, cCREs, and variants involved in LOAD genetic etiology and the cell subtypes in which they act to exert their pathogenic effects. Overall, these results suggest that cell subtype-specific cis-trans interactions between regulatory elements and TFs, and the genes dysregulated by these networks contribute to the development of LOAD.
Affiliation(s)
- Julia Gamache
- Division of Translational Brain Sciences, Department of Neurology, Duke University Medical Center, DUMC Box 2900, Durham, NC, 27710, USA
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27708, USA
- Daniel Gingerich
- Division of Translational Brain Sciences, Department of Neurology, Duke University Medical Center, DUMC Box 2900, Durham, NC, 27710, USA
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27708, USA
- E Keats Shwab
- Division of Translational Brain Sciences, Department of Neurology, Duke University Medical Center, DUMC Box 2900, Durham, NC, 27710, USA
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27708, USA
- Julio Barrera
- Division of Translational Brain Sciences, Department of Neurology, Duke University Medical Center, DUMC Box 2900, Durham, NC, 27710, USA
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27708, USA
- Melanie E Garrett
- Duke Molecular Physiology Institute, Duke University Medical Center, DUMC Box 104775, Durham, NC, 27701, USA
- Cordelia Hume
- Division of Translational Brain Sciences, Department of Neurology, Duke University Medical Center, DUMC Box 2900, Durham, NC, 27710, USA
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27708, USA
- Gregory E Crawford
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27708, USA.
- Department of Pediatrics, Division of Medical Genetics, Duke University Medical Center, DUMC Box 3382, Durham, NC, 27708, USA.
- Center for Advanced Genomic Technologies, Duke University Medical Center, Durham, NC, 27708, USA.
- Allison E Ashley-Koch
- Duke Molecular Physiology Institute, Duke University Medical Center, DUMC Box 104775, Durham, NC, 27701, USA.
- Department of Medicine, Duke University Medical Center, Durham, NC, 27708, USA.
- Ornit Chiba-Falek
- Division of Translational Brain Sciences, Department of Neurology, Duke University Medical Center, DUMC Box 2900, Durham, NC, 27710, USA.
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27708, USA.
6
Zhang B, Zhu X, Chen Z, Zhang H, Huang J, Huang J. RiceTFtarget: A rice transcription factor-target prediction server based on coexpression and machine learning. Plant Physiology 2023; 193:190-194. [PMID: 37294915 DOI: 10.1093/plphys/kiad332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 05/03/2023] [Accepted: 05/26/2023] [Indexed: 06/11/2023]
Abstract
The online webserver RiceTFtarget realizes rice transcription factor–target prediction based on coexpression, pattern matching, and machine learning.
Affiliation(s)
- Baoyi Zhang
- State Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Jiangsu Province Engineering Research Center of Seed Industry Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
- Xueai Zhu
- State Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Jiangsu Province Engineering Research Center of Seed Industry Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
- Zixin Chen
- College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
- Hongsheng Zhang
- State Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Jiangsu Province Engineering Research Center of Seed Industry Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
- Junxian Huang
- College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
- Ji Huang
- State Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Jiangsu Province Engineering Research Center of Seed Industry Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
- Jiangsu Key Laboratory for Information Agriculture, Nanjing Agricultural University, Nanjing 210095, China
7
Sahana G, Cai Z, Sanchez MP, Bouwman AC, Boichard D. Invited review: Good practices in genome-wide association studies to identify candidate sequence variants in dairy cattle. J Dairy Sci 2023:S0022-0302(23)00357-0. [PMID: 37349208 DOI: 10.3168/jds.2022-22694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Accepted: 02/01/2023] [Indexed: 06/24/2023]
Abstract
Genotype data from dairy cattle selection programs have greatly facilitated GWAS to identify variants related to economic traits. Results can enhance the accuracy of genomic prediction, analyze more complex models that go beyond additive effects, elucidate the genetic architecture of a trait, and finally, decipher the underlying biology of traits. The entire process, comprising data generation, quality control, statistical analyses, interpretation of association results, and linking results to biology should be designed and executed to minimize the generation of false-positive and false-negative associations and misleading links to biological processes. This review aims to provide general guidelines for data analysis that address data quality control, association tests, adjustment for population stratification, and significance evaluation to improve the reliability of conclusions. We also provide guidance on post-GWAS strategy and the interpretation of results. These guidelines are tailored to dairy cattle, which are characterized by long-range linkage disequilibrium, large half-sib families, and routinely collected phenotypes, requiring different approaches than those applied in human GWAS. We discuss common limitations and challenges that have been overlooked in the analysis and interpretation of GWAS to identify candidate sequence variants in dairy cattle.
Affiliation(s)
- G Sahana
- Aarhus University, Center for Quantitative Genetic and Genomics, 8830 Tjele, Denmark.
- Z Cai
- Aarhus University, Center for Quantitative Genetic and Genomics, 8830 Tjele, Denmark
- M P Sanchez
- Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350 Jouy-en-Josas, France
- A C Bouwman
- Wageningen University & Research, Animal Breeding and Genomics, 6700 AH Wageningen, the Netherlands
- D Boichard
- Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350 Jouy-en-Josas, France
8
Liao J, Wang Q, Wu F, Huang Z. In Silico Methods for Identification of Potential Active Sites of Therapeutic Targets. Molecules 2022; 27:7103. [PMID: 36296697 PMCID: PMC9609013 DOI: 10.3390/molecules27207103] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/30/2022] [Revised: 08/12/2022] [Accepted: 08/25/2022] [Indexed: 07/30/2023]
Abstract
Target identification is an important step in drug discovery, and computer-aided drug target identification methods are attracting more attention compared with traditional drug target identification methods, which are time-consuming and costly. Computer-aided drug target identification methods can greatly reduce the searching scope of experimental targets and associated costs by identifying the diseases-related targets and their binding sites and evaluating the druggability of the predicted active sites for clinical trials. In this review, we introduce the principles of computer-based active site identification methods, including the identification of binding sites and assessment of druggability. We provide some guidelines for selecting methods for the identification of binding sites and assessment of druggability. In addition, we list the databases and tools commonly used with these methods, present examples of individual and combined applications, and compare the methods and tools. Finally, we discuss the challenges and limitations of binding site identification and druggability assessment at the current stage and provide some recommendations and future perspectives.
Affiliation(s)
- Jianbo Liao
- Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
- The Second School of Clinical Medicine, Guangdong Medical University, Dongguan 523808, China
- Qinyu Wang
- Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
- Fengxu Wu
- Hubei Key Laboratory of Wudang Local Chinese Medicine Research, School of Pharmaceutical Sciences, Hubei University of Medicine, Shiyan 442000, China
- Zunnan Huang
- Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
- Marine Biomedical Research Institute of Guangdong Zhanjiang, Zhanjiang 524023, China
9
Ruengsrichaiya B, Nukoolkit C, Kalapanulak S, Saithong T. Plant-DTI: Extending the landscape of TF protein and DNA interaction in plants by a machine learning-based approach. Frontiers in Plant Science 2022; 13:970018. [PMID: 36082286 PMCID: PMC9445498 DOI: 10.3389/fpls.2022.970018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Accepted: 08/01/2022] [Indexed: 06/15/2023]
Abstract
As sessile organisms, plants possess elaborate transcriptional regulatory systems that allow them to adapt to variable surrounding environments. Current understanding of plant regulatory mechanisms is greatly constrained by limited knowledge of transcription factor (TF)-DNA interactions. To mitigate this problem, a Plant-DTI predictor (Plant DBD-TFBS Interaction) was developed here as the first machine-learning model that covered the largest experimental datasets of 30 plant TF families, including 7 plant-specific DNA binding domain (DBD) types, and their transcription factor binding sites (TFBSs). Plant-DTI introduced a novel TFBS feature construction, called TFBS base-preference, which enhanced the specificity of TFBS to DBD types. The proposed model showed better predictive performance with the TFBS base-preference than with the simple binary representation. Plant-DTI was validated with 22 independent ChIP-seq datasets. It accurately predicted the measured DBD-TFBS pairs along with their TFBS motifs, and effectively predicted interactions of other TFs containing similar DBD types. Compared with existing state-of-the-art methods, Plant-DTI prediction showed a figure of merit in sensitivity and specificity with respect to the position weight matrix (PWM) and TSPTFBS methods. Finally, the proposed Plant-DTI model helped to fill the knowledge gap in the regulatory mechanisms of the cassava sucrose synthase 1 gene (MeSUS1). Plant-DTI predicted MeERF72 as a regulator of MeSUS1, consistent with the yeast one-hybrid (Y1H) experiment. Taken together, Plant-DTI would help facilitate the prediction of TF-TFBS and TF-target gene (TG) interactions, thereby accelerating the study of transcriptional regulatory systems in plant species.
Affiliation(s)
- Bhukrit Ruengsrichaiya
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology and School of Information Technology, King Mongkut’s University of Technology Thonburi (Bang KhunThian), Bangkok, Thailand
- Chakarida Nukoolkit
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology and School of Information Technology, King Mongkut’s University of Technology Thonburi (Bang KhunThian), Bangkok, Thailand
- School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
- Saowalak Kalapanulak
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology and School of Information Technology, King Mongkut’s University of Technology Thonburi (Bang KhunThian), Bangkok, Thailand
- Center for Agricultural Systems Biology, Systems Biology and Bioinformatics Research Group, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi (Bang KhunThian), Bangkok, Thailand
- Treenut Saithong
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology and School of Information Technology, King Mongkut’s University of Technology Thonburi (Bang KhunThian), Bangkok, Thailand
- Center for Agricultural Systems Biology, Systems Biology and Bioinformatics Research Group, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi (Bang KhunThian), Bangkok, Thailand
10
Liu S, Fan Y, Duan M, Wang Y, Su G, Ren Y, Huang L, Zhou F. AcneGrader: An ensemble pruning of the deep learning base models to grade acne. Skin Res Technol 2022; 28:677-688. [PMID: 35639819 PMCID: PMC9907630 DOI: 10.1111/srt.13166] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/02/2022] [Accepted: 05/03/2022] [Indexed: 12/21/2022]
Abstract
BACKGROUND Acne is one of the most common skin lesions in adolescents. Some severe or inflammatory acne leads to scars, which may have major impacts on patients' quality of life or even job prospects. Grading acne plays an important role in diagnosis, and the diagnosis is made by counting acne lesions. This is labor-intensive work in which dermatologists can easily make mistakes, so it is very important to develop automatic diagnosis methods. Ensemble learning may improve the prediction results of the base models, but its time complexity is relatively high. The ensemble pruning strategy may solve this computational challenge by removing the redundant base models. MATERIALS AND METHODS This study proposed a novel ensemble pruning framework of deep learning models to accurately detect and grade acne using images. First, we train multiple base models and prune the redundant models according to the performance and diversity of the models. Then, we construct new features of the training data using the base models selected in the previous step. Next, we further remove redundant models with a feature selection algorithm. Finally, we integrate all the remaining base models using classifiers. The ensemble pruning algorithm was proposed to prune the deep learning base models. RESULTS The experimental data showed that the ensemble pruned framework achieved a prediction accuracy of 85.82% on the acne dataset, better than the existing studies. To verify our method's effectiveness, we tested it on a skin cancer dataset, where it greatly outperformed the state-of-the-art methods. CONCLUSION The method we proposed is used to grade acne. Our method outperforms state-of-the-art methods on two datasets, and it can also remove redundant models to reduce computational complexity.
Affiliation(s)
- Shuai Liu
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, P.R. China
| | - Yusi Fan
- College of Software, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, P.R. China
| | - Meiyu Duan
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, P.R. China
| | - Yueying Wang
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, P.R. China
| | - Guoxiong Su
- Beijing Dr. of Acne Medical Research Institute, Beijing, China
| | - Yanjiao Ren
- College of Information Technology (Smart Agriculture Research Institute), Jilin Agricultural University, Changchun, Jilin, China
| | - Lan Huang
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, P.R. China
| | - Fengfeng Zhou
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, P.R. China
|
11
|
Zhang Y, Wang Z, Zeng Y, Liu Y, Xiong S, Wang M, Zhou J, Zou Q. A novel convolution attention model for predicting transcription factor binding sites by combination of sequence and shape. Brief Bioinform 2021; 23:6470969. [PMID: 34929739 DOI: 10.1093/bib/bbab525] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/14/2021] [Revised: 10/28/2021] [Accepted: 11/13/2021] [Indexed: 12/17/2022]
Abstract
The discovery of putative transcription factor binding sites (TFBSs) is important for understanding the underlying binding mechanism and cellular functions. Recently, many computational methods have been proposed to jointly account for DNA sequence and shape properties in TFBS prediction. However, these methods fail to fully utilize the latent features derived from both sequence and shape profiles and have limitations in interpretability and knowledge discovery. To this end, we present a novel Deep Convolution Attention network combining Sequence and Shape, dubbed D-SSCA, for precisely predicting putative TFBSs. Experiments conducted on 165 ENCODE ChIP-seq datasets reveal that D-SSCA significantly outperforms several state-of-the-art methods in predicting TFBSs, and justify the utility of the channel attention module for feature refinement. In addition, a thorough analysis of the contribution of five shapes to TFBS prediction demonstrates that shape features can improve the predictive power for transcription factor-DNA binding. Furthermore, D-SSCA can realize the cross-cell line prediction of TFBSs, indicating the occupancy of common interplay patterns concerning both sequence and shape across various cell lines. The source code of D-SSCA can be found at https://github.com/MoonLord0525/.
Affiliation(s)
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China.,School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
| | - Zixuan Wang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Yuanqi Zeng
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Yuhang Liu
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Shuwen Xiong
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Maocheng Wang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Jiliu Zhou
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
|
12
|
Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021; 37:2112-2120. [PMID: 33538820 PMCID: PMC11025658 DOI: 10.1093/bioinformatics/btab083] [Citation(s) in RCA: 169] [Impact Index Per Article: 56.3] [Received: 09/10/2020] [Revised: 12/31/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022]
Abstract
MOTIVATION Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. RESULTS To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. AVAILABILITY AND IMPLEMENTATION The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yanrong Ji
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
| | - Zhihan Zhou
- Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
| | - Han Liu
- Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
| | - Ramana V Davuluri
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA
|
13
|
Zheng SQ, Chen HX, Liu XC, Yang Q, He GW. Identification of variants of ISL1 gene promoter and cellular functions in isolated ventricular septal defects. Am J Physiol Cell Physiol 2021; 321:C443-C452. [PMID: 34260301 DOI: 10.1152/ajpcell.00167.2021] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/22/2022]
Abstract
Ventricular septal defects (VSDs) are the most common congenital heart defects (CHDs). Studies have documented that ISL1 has a crucial impact on cardiac growth, but the role of variants in the ISL1 gene promoter in patients with VSD has not been explored. In 400 subjects (200 patients with isolated and sporadic VSDs and 200 healthy controls), we investigated the ISL1 gene promoter variants and performed cellular functional experiments by using the dual-luciferase reporter assay to verify the impact on gene expression. In the ISL1 promoter, five variants were found only in patients with VSD by sequencing. Cellular functional experiments demonstrated that three variants decreased the transcriptional activity of the ISL1 promoter (P < 0.05). Further analysis with the online JASPAR database demonstrated that a cluster of putative binding sites for transcription factors may be altered by these variants, possibly resulting in change of ISL1 protein expression and VSD formation. Our study has, for the first time, identified novel variants in the ISL1 gene promoter region in Han Chinese patients with isolated and sporadic VSD. In addition, the cellular functional experiments, electrophoretic mobility shift assay, and bioinformatic analysis have demonstrated that these variants significantly alter the expression of the ISL1 gene and affect the binding of transcription factors, likely resulting in VSD. Therefore, this study may provide new insights into the role of the gene promoter region for a better understanding of the genetic basis of the formation of CHDs and may promote further investigations on the mechanism of the formation of CHDs.
Affiliation(s)
- Si-Qiang Zheng
- The Institute of Cardiovascular Diseases & Department of Cardiovascular Surgery, TEDA International Cardiovascular Hospital, Tianjin University & Chinese Academy of Medical Sciences, Tianjin, People's Republic of China
| | - Huan-Xin Chen
- The Institute of Cardiovascular Diseases & Department of Cardiovascular Surgery, TEDA International Cardiovascular Hospital, Tianjin University & Chinese Academy of Medical Sciences, Tianjin, People's Republic of China
| | - Xiao-Cheng Liu
- The Institute of Cardiovascular Diseases & Department of Cardiovascular Surgery, TEDA International Cardiovascular Hospital, Tianjin University & Chinese Academy of Medical Sciences, Tianjin, People's Republic of China
| | - Qin Yang
- The Institute of Cardiovascular Diseases & Department of Cardiovascular Surgery, TEDA International Cardiovascular Hospital, Tianjin University & Chinese Academy of Medical Sciences, Tianjin, People's Republic of China
| | - Guo-Wei He
- The Institute of Cardiovascular Diseases & Department of Cardiovascular Surgery, TEDA International Cardiovascular Hospital, Tianjin University & Chinese Academy of Medical Sciences, Tianjin, People's Republic of China.,Drug Research and Development Center, Wannan Medical College, Wuhu, People's Republic of China.,Department of Surgery, Oregon Health and Science University, Portland, Oregon
|
14
|
Sequence-based GWAS and post-GWAS analyses reveal a key role of SLC37A1, ANKH, and regulatory regions on bovine milk mineral content. Sci Rep 2021; 11:7537. [PMID: 33824377 PMCID: PMC8024349 DOI: 10.1038/s41598-021-87078-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/03/2020] [Accepted: 03/23/2021] [Indexed: 12/14/2022]
Abstract
The mineral composition of bovine milk plays an important role in determining its nutritional and cheese-making value. Concentrations of the main minerals predicted from mid-infrared spectra produced during milk recording, combined with cow genotypes, provide a unique opportunity to decipher the genetic determinism of these traits. The present study included 1 million test-day predictions of Ca, Mg, P, K, Na, and citrate content from 126,876 Montbéliarde cows, of which 19,586 had genotype data available. All investigated traits were highly heritable (0.50-0.58), with the exception of Na (0.32). A sequence-based genome-wide association study (GWAS) detected 50 QTL (18 affecting two to five traits) and positional candidate genes and variants, mostly located in non-coding sequences. In silico post-GWAS analyses highlighted 877 variants that could be regulatory SNPs altering transcription factor (TF) binding sites or located in non-coding RNA (mainly lncRNA). Furthermore, we found 47 positional candidate genes and 45 TFs highly expressed in mammary gland compared to 90 other bovine tissues. Among the mammary-specific genes, SLC37A1 and ANKH, encoding proteins involved in ion transport were located in the most significant QTL. This study therefore highlights a comprehensive set of functional candidate genes and variants that affect milk mineral content.
|
15
|
Long P, Zhang L, Huang B, Chen Q, Liu H. Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites. Nucleic Acids Res 2021; 48:12604-12617. [PMID: 33264415 PMCID: PMC7736823 DOI: 10.1093/nar/gkaa1134] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/22/2020] [Revised: 09/18/2020] [Accepted: 11/10/2020] [Indexed: 01/11/2023]
Abstract
We report an approach to predict DNA specificity of the tetracycline repressor (TetR) family transcription regulators (TFRs). First, a genome sequence-based method was streamlined with quantitative P-values defined to filter out reliable predictions. Then, a framework was introduced to incorporate structural data and to train a statistical energy function to score the pairing between TFR and TFR binding site (TFBS) based on sequences. When the predictions were benchmarked against experiments, TFBSs for 29 out of 30 TFRs were correctly predicted by either the genome sequence-based or the statistical energy-based method. Using P-values or Z-scores as indicators, we estimate that 59.6% of TFRs are covered with relatively reliable predictions by at least one of the two methods, while only 28.7% are covered by the genome sequence-based method alone. Our approach predicts a large number of new TFBSs which cannot be correctly retrieved from public databases such as FootprintDB. High-throughput experimental assays suggest that the statistical energy can model the TFBSs of a significant number of TFRs reliably. Thus the energy function may be applied to search for new TFBSs in respective genomes. It is possible to extend our approach to other transcription factor families with sufficient structural information.
Affiliation(s)
- Pengpeng Long
- School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
| | - Lu Zhang
- School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
| | - Bin Huang
- School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
| | - Quan Chen
- School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China.,Hefei National Laboratory for Physical Sciences at the Microscale, Hefei, Anhui 230026, China
| | - Haiyan Liu
- School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China.,Hefei National Laboratory for Physical Sciences at the Microscale, Hefei, Anhui 230026, China.,School of Data Science, University of Science and Technology of China, Hefei, Anhui 230026, China
|
16
|
Ramírez-Ayala LC, Rocha D, Ramos-Onsins SE, Leno-Colorado J, Charles M, Bouchez O, Rodríguez-Valera Y, Pérez-Enciso M, Ramayo-Caldas Y. Whole-genome sequencing reveals insights into the adaptation of French Charolais cattle to Cuban tropical conditions. Genet Sel Evol 2021; 53:3. [PMID: 33397281 PMCID: PMC7784321 DOI: 10.1186/s12711-020-00597-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/03/2020] [Accepted: 12/11/2020] [Indexed: 02/01/2023]
Abstract
Background In the early 20th century, Cuban farmers imported Charolais cattle (CHFR) directly from France. These animals are now known as Chacuba (CHCU) and have become adapted to the rough environmental tropical conditions in Cuba. These conditions include long periods of drought and food shortage with extreme temperatures that European taurine cattle have difficulty coping with. Results In this study, we used whole-genome sequence data from 12 CHCU individuals together with 60 whole-genome sequences from six additional taurine, indicus and crossed breeds to estimate the genetic diversity, structure and accurate ancestral origin of the CHCU animals. Although CHCU animals are assumed to form a closed population, the results of our admixture analysis indicate a limited introgression of Bos indicus. We used the extended haplotype homozygosity (EHH) approach to identify regions in the genome that may have had an important role in the adaptation of CHCU to tropical conditions. Putative selection events occurred in genomic regions with a high proportion of Bos indicus, but they were not sufficient to explain adaptation of CHCU to tropical conditions by Bos indicus introgression only. EHH suggested signals of potential adaptation in genomic windows that include genes of taurine origin involved in thermogenesis (ATP9A, GABBR1, PGR, PTPN1 and UCP1) and hair development (CCHCR1 and CDSN). Within these genes, we identified single nucleotide polymorphisms (SNPs) that may have a functional impact and contribute to some of the observed phenotypic differences between CHCU and CHFR animals. Conclusions Whole-genome data confirm that CHCU cattle are closely related to Charolais from France (CHFR) and Canada, but also reveal a limited introgression of Bos indicus genes in CHCU. We observed possible signals of recent adaptation to tropical conditions between CHCU and CHFR founder populations, which were largely independent of the Bos indicus introgression. 
Finally, we report candidate genes and variants that may have a functional impact and explain some of the phenotypic differences observed between CHCU and CHFR cattle.
Affiliation(s)
- Lino C Ramírez-Ayala
- Plant and Animal Genomics, Centre de Recerca en Agrigenòmica (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Spain
| | - Dominique Rocha
- Université Paris-Saclay, INRAE, Jouy-En-Josas, AgroParisTech, GABI, 78350, France
| | - Sebas E Ramos-Onsins
- Plant and Animal Genomics, Centre de Recerca en Agrigenòmica (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Spain
| | - Jordi Leno-Colorado
- Plant and Animal Genomics, Centre de Recerca en Agrigenòmica (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Spain
| | - Mathieu Charles
- Université Paris-Saclay, INRAE, Jouy-En-Josas, AgroParisTech, GABI, 78350, France.,INRAE, SIGENAE, Jouy-En-Josas, 78350, France
| | - Olivier Bouchez
- INRAE, GeT-PlaGe, Genotoul, Castanet-Tolosan, US, 1426, France
| | | | - Miguel Pérez-Enciso
- Plant and Animal Genomics, Centre de Recerca en Agrigenòmica (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Spain.,Institut Català de Recerca I Estudis Avançats (ICREA), Barcelona, Spain
| | - Yuliaxis Ramayo-Caldas
- Université Paris-Saclay, INRAE, Jouy-En-Josas, AgroParisTech, GABI, 78350, France.,Animal Breeding and Genetics Program, Institute for Research and Technology in Food and Agriculture (IRTA), Torre Marimon, Caldes De Montbui, 08140, Spain
|
17
|
Xiao P, Cai X, Rajasekaran S. EMS3: An Improved Algorithm for Finding Edit-Distance Based Motifs. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:27-37. [PMID: 32931433 DOI: 10.1109/tcbb.2020.3024222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Discovering patterns in biological sequences is a crucial step to extract useful information from them. Motifs can be viewed as patterns that occur exactly or with minor changes across some or all of the biological sequences. Motif search has numerous applications including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity among families of proteins, etc. The general problem of motif search is intractable. One of the most studied models of motif search proposed in the literature is Edit-distance based Motif Search (EMS). In EMS, the goal is to find all the patterns of length l that occur with an edit-distance of at most d in each of the input sequences. EMS algorithms existing in the literature do not scale well on challenging instances and large datasets. In this paper, the current state-of-the-art EMS solver is advanced by exploiting the idea of dimension reduction. A novel idea to reduce the cardinality of the alphabet is proposed. The algorithm we propose, EMS3, is an exact algorithm, i.e., it finds all the motifs present in the input sequences. EMS3 can also be viewed as a divide and conquer algorithm. In this paper, we provide theoretical analyses to establish the efficiency of EMS3. Extensive experiments on standard benchmark datasets (synthetic and real-world) show that the proposed algorithm outperforms the existing state-of-the-art algorithm (EMS2).
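The EMS problem statement in this abstract (all patterns of length l that occur within edit distance d in every input sequence) can be illustrated with a brute-force sketch. This is not EMS3 and uses none of its alphabet-reduction ideas; it simply enumerates every candidate over an assumed DNA alphabet, so it is only practical for very small l. The function names `edit_distance`, `occurs_within`, and `naive_ems` are hypothetical and chosen for illustration.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def occurs_within(motif: str, seq: str, d: int) -> bool:
    """True if some substring of seq is within edit distance d of motif.

    Only substrings of length l-d .. l+d can qualify, so only those are tried.
    """
    l = len(motif)
    for length in range(max(1, l - d), l + d + 1):
        for start in range(len(seq) - length + 1):
            if edit_distance(motif, seq[start:start + length]) <= d:
                return True
    return False


def naive_ems(sequences, l: int, d: int, alphabet: str = "ACGT"):
    """Exhaustive (l, d) edit-distance motif search; exponential in l."""
    motifs = []
    for letters in product(alphabet, repeat=l):
        motif = "".join(letters)
        if all(occurs_within(motif, s, d) for s in sequences):
            motifs.append(motif)
    return motifs


seqs = ["ACGTAC", "TTACGT", "ACGAAA"]
exact_motifs = naive_ems(seqs, l=3, d=0)   # 3-mers shared exactly by all sequences
fuzzy_motifs = naive_ems(seqs, l=4, d=1)   # 4-mers allowing one edit per sequence
```

The candidate space grows as 4^l, which is one way to see why the abstract stresses that existing solvers scale poorly and why exact algorithms need to prune or compress that space.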
|
18
|
Mariadassou M, Ramayo-Caldas Y, Charles M, Féménia M, Renand G, Rocha D. Detection of selection signatures in Limousin cattle using whole-genome resequencing. Anim Genet 2020; 51:815-819. [PMID: 32686174 DOI: 10.1111/age.12982] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Accepted: 06/16/2020] [Indexed: 12/27/2022]
Abstract
Limousin, a renowned beef breed originating from central France, has been selectively bred over the last 100 years to improve economically important traits. We used whole-genome sequencing data from 10 unrelated Limousin bull calves to detect polymorphisms and identify regions under selection. A total of 13 943 766 variants were identified. Moreover, 311 852 bi-allelic SNPs and 92 229 indels located on autosomes were fixed for the alternative allele in all sequenced animals, including the previously reported missense deleterious F94L mutation in MSTN. We performed a whole-genome screen to discover genomic regions with excess homozygosity, using the pooled heterozygosity score and identified 171 different candidate selective sweeps. In total, 68 candidate genes were found in only 57 of these regions, indicating that a large fraction of the genome under selection might lie in non-coding regions and suggesting that a majority of adaptive mutations might be regulatory in nature. Many QTL were found within candidate selective sweep regions, including QTL associated with shear force or carcass weight. Among the putative selective sweeps, we located genes (MSTN, NCKAP5, RUNX2) that potentially contribute to important phenotypes in Limousin. Several candidate regions and genes under selection were also found in previous genome-wide selection scans performed in Limousin. In addition, we were able to pinpoint candidate causative regulatory polymorphisms in GRIK3 and RUNX2 that might have been under selection. Our results will contribute to improved understanding of the mechanisms and targets of artificial selection and will facilitate the interpretation of GWASs performed in Limousin.
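The pooled heterozygosity score mentioned in this abstract is commonly computed per genomic window as Hp = 2 * sum(n_major) * sum(n_minor) / (sum(n_major) + sum(n_minor))^2, where the sums run over the SNPs in the window. The sketch below assumes that standard definition; the abstract does not give the exact window size, filtering, or normalisation the authors used, and the function name `pooled_heterozygosity` and the toy counts are hypothetical.

```python
def pooled_heterozygosity(allele_counts):
    """Pooled heterozygosity (Hp) for one window.

    allele_counts: iterable of (count_allele_1, count_allele_2) pairs,
    one pair per bi-allelic SNP in the window (assumed input layout).
    """
    n_major = 0
    n_minor = 0
    for a, b in allele_counts:
        n_major += max(a, b)  # most frequent allele at this SNP
        n_minor += min(a, b)  # least frequent allele at this SNP
    total = n_major + n_minor
    if total == 0:
        raise ValueError("window contains no allele observations")
    return 2 * n_major * n_minor / total ** 2


# Toy windows: Hp approaches 0 when one allele is (nearly) fixed.
hp_fixed = pooled_heterozygosity([(10, 0), (8, 0)])
hp_balanced = pooled_heterozygosity([(5, 5), (5, 5)])
hp_mixed = pooled_heterozygosity([(9, 1), (7, 3)])
```

Windows with unusually low Hp relative to the genome-wide distribution are the ones typically flagged as candidate selective sweeps; in practice the scores are often standardised before thresholding, which this sketch leaves out.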
Affiliation(s)
- M Mariadassou
- INRAE, MaIAGE, Université Paris-Saclay, Jouy-en-Josas, F-78350, France
| | - Y Ramayo-Caldas
- INRAE, AgroParisTech, GABI, Université Paris-Saclay, Jouy-en-Josas, F-78350, France.,Animal Breeding and Genetics Program, Institute for Research and Technology in Food and Agriculture, Torre Marimon, Caldes de Montbui, 08140, Spain
| | - M Charles
- INRAE, AgroParisTech, GABI, Université Paris-Saclay, Jouy-en-Josas, F-78350, France.,INRAE, SIGENAE, Université Paris-Saclay, Jouy-en-Josas, F-78350, France
| | - M Féménia
- INRAE, AgroParisTech, GABI, Université Paris-Saclay, Jouy-en-Josas, F-78350, France
| | - G Renand
- INRAE, AgroParisTech, GABI, Université Paris-Saclay, Jouy-en-Josas, F-78350, France
| | - D Rocha
- INRAE, AgroParisTech, GABI, Université Paris-Saclay, Jouy-en-Josas, F-78350, France
|
19
|
Mahood EH, Kruse LH, Moghe GD. Machine learning: A powerful tool for gene function prediction in plants. APPLICATIONS IN PLANT SCIENCES 2020; 8:e11376. [PMID: 32765975 PMCID: PMC7394712 DOI: 10.1002/aps3.11376] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 10/01/2019] [Accepted: 03/19/2020] [Indexed: 05/06/2023]
Abstract
Recent advances in sequencing and informatic technologies have led to a deluge of publicly available genomic data. While it is now relatively easy to sequence, assemble, and identify genic regions in diploid plant genomes, functional annotation of these genes is still a challenge. Over the past decade, there has been a steady increase in studies utilizing machine learning algorithms for various aspects of functional prediction, because these algorithms are able to integrate large amounts of heterogeneous data and detect patterns inconspicuous through rule-based approaches. The goal of this review is to introduce experimental plant biologists to machine learning, by describing how it is currently being used in gene function prediction to gain novel biological insights. In this review, we discuss specific applications of machine learning in identifying structural features in sequenced genomes, predicting interactions between different cellular components, and predicting gene function and organismal phenotypes. Finally, we also propose strategies for stimulating functional discovery using machine learning-based approaches in plants.
Affiliation(s)
- Elizabeth H. Mahood
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
| | - Lars H. Kruse
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
| | - Gaurav D. Moghe
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
|
20
|
In silico based screening of WRKY genes for identifying functional genes regulated by WRKY under salt stress. Comput Biol Chem 2019; 83:107131. [DOI: 10.1016/j.compbiolchem.2019.107131] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/22/2019] [Revised: 08/18/2019] [Accepted: 09/18/2019] [Indexed: 11/21/2022]
|
21
|
Gearing LJ, Cumming HE, Chapman R, Finkel AM, Woodhouse IB, Luu K, Gould JA, Forster SC, Hertzog PJ. CiiiDER: A tool for predicting and analysing transcription factor binding sites. PLoS One 2019; 14:e0215495. [PMID: 31483836 PMCID: PMC6726224 DOI: 10.1371/journal.pone.0215495] [Citation(s) in RCA: 110] [Impact Index Per Article: 22.0] [Received: 03/31/2019] [Accepted: 08/05/2019] [Indexed: 12/30/2022]
Abstract
The availability of large amounts of high-throughput genomic, transcriptomic and epigenomic data has provided opportunity to understand regulation of the cellular transcriptome with an unprecedented level of detail. As a result, research has advanced from identifying gene expression patterns associated with particular conditions to elucidating signalling pathways that regulate expression. There are over 1,000 transcription factors (TFs) in vertebrates that play a role in this regulation. Determining which of these are likely to be controlling a set of genes can be assisted by computational prediction, utilising experimentally verified binding site motifs. Here we present CiiiDER, an integrated computational toolkit for transcription factor binding analysis, written in the Java programming language, to make it independent of computer operating system. It is operated through an intuitive graphical user interface with interactive, high-quality visual outputs, making it accessible to all researchers. CiiiDER predicts transcription factor binding sites (TFBSs) across regulatory regions of interest, such as promoters and enhancers derived from any species. It can perform an enrichment analysis to identify TFs that are significantly over- or under-represented in comparison to a bespoke background set and thereby elucidate pathways regulating sets of genes of pathophysiological importance.
Affiliation(s)
- Linden J. Gearing
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Helen E. Cumming
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Ross Chapman
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Alexander M. Finkel
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Isaac B. Woodhouse
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Kevin Luu
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Jodee A. Gould
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Samuel C. Forster
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Paul J. Hertzog
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
|
22
|
Albalawi F, Chahid A, Guo X, Albaradei S, Magana-Mora A, Jankovic BR, Uludag M, Van Neste C, Essack M, Laleg-Kirati TM, Bajic VB. Hybrid model for efficient prediction of poly(A) signals in human genomic DNA. Methods 2019; 166:31-39. [PMID: 30991099 DOI: 10.1016/j.ymeth.2019.04.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/15/2018] [Revised: 03/12/2019] [Accepted: 04/01/2019] [Indexed: 12/15/2022]
Abstract
Polyadenylation signals (PAS) are found in most protein-coding and some non-coding genes in eukaryotes. Their accurate recognition improves understanding of gene regulation mechanisms and recognition of the 3'-end of transcribed gene regions where premature or alternate transcription ends may lead to various diseases. Although different methods and tools for in-silico prediction of genomic signals have been proposed, the correct identification of PAS in genomic DNA remains challenging due to a vast number of non-relevant hexamers identical to PAS hexamers. In this study, we developed a novel method for PAS recognition. The method is implemented in a hybrid PAS recognition model (HybPAS), which is based on deep neural networks (DNNs) and logistic regression models (LRMs). One such model is developed for each of the 12 most frequent human PAS hexamers. DNN models performed best for eight PAS types (including the two most frequent PAS hexamers), while LRMs performed best for the remaining four PAS types. The new models use different combinations of signal processing-based, statistical, and sequence-based features as input. The results obtained on human genomic data show that HybPAS outperforms the well-tuned state-of-the-art Omni-PolyA models, reducing the classification error for different PAS hexamers by up to 57.35% for 10 out of 12 PAS types, with Omni-PolyA models being better for two PAS types. For the most frequent PAS types, 'AATAAA' and 'ATTAAA', HybPAS reduced the error rate by 35.14% and 34.48%, respectively. On average, HybPAS reduces the error by 30.29%. HybPAS is implemented partly in Python and partly in MATLAB and is available at https://github.com/EMANG-KAUST/PolyA_Prediction_LRM_DNN.
Affiliation(s)
- Fahad Albalawi
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Taif University, Electrical Engineering, Taif 21944, Saudi Arabia
- Abderrazak Chahid
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Xingang Guo
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Somayah Albaradei
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Arturo Magana-Mora
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia
- Boris R Jankovic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Mahmut Uludag
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Christophe Van Neste
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Ghent University, Center for Medical Genetics Ghent (CMGG), B-9000 Ghent, Belgium
- Magbubah Essack
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Taous-Meriem Laleg-Kirati
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Vladimir B Bajic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
|
23
|
Wong KC. DNA Motif Recognition Modeling from Protein Sequences. iScience 2018; 7:198-211. [PMID: 30267681 PMCID: PMC6153143 DOI: 10.1016/j.isci.2018.09.003] [Received: 07/05/2018] [Revised: 08/08/2018] [Accepted: 09/04/2018] [Indexed: 12/31/2022]
Abstract
Although existing work on DNA motif discovery from DNA sequences is plentiful, mechanistic knowledge to infer DNA motifs from protein sequences across multiple DNA-binding domain families without conducting any wet-lab experiments is still lacking. Therefore, k-spectrum recognition modeling is proposed to address these issues at the highest possible resolution. The k-spectrum model can capture DNA motif patterns from protein sequences at a resolution at which local sequence context and nucleotide dependency can be taken into account completely. Multiple evaluation metrics are adopted and measured on millions of k-mer binding intensities from 92 proteins across 5 DNA-binding families (i.e., bHLH, bZIP, ETS, Forkhead, and Homeodomain), demonstrating its competitive edge. In addition, it not only contributes to DNA motif recognition modeling but can also help prioritize the observed or even unobserved binding of single nucleotide variants on transcription factor binding sites in a genome-wide manner. DNA motif modeling from protein is fundamental for understanding gene regulation. A framework is proposed at the highest possible sequence resolution for the first time. It is validated on millions of k-mer intensities from 92 proteins across 5 families. It can prioritize the unobserved regulatory single nucleotide variants on DNA motifs.
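As background for the k-mer terminology above, the following minimal sketch counts every overlapping k-mer in a DNA sequence. This count table is what is commonly called a k-mer spectrum in string-kernel work. Whether it matches the paper's own definition of the k-spectrum model is an assumption, and the function name `kmer_spectrum` is hypothetical.

```python
from collections import Counter


def kmer_spectrum(sequence, k):
    """Count every overlapping k-mer in a sequence.

    This is a generic k-mer count table, shown only to illustrate the
    terminology; it is not the recognition model from the cited paper.
    """
    seq = sequence.upper()
    # A sequence shorter than k yields an empty range, hence an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    # Hypothetical toy sequence, not data from the paper.
    for kmer, count in sorted(kmer_spectrum("ACGTACG", 3).items()):
        print(kmer, count)
```

The binding intensities mentioned in the abstract are measured per k-mer, so a table keyed by k-mer is the natural unit of comparison.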
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong.
|
24
|
Raghunath A, Sundarraj K, Nagarajan R, Arfuso F, Bian J, Kumar AP, Sethi G, Perumal E. Antioxidant response elements: Discovery, classes, regulation and potential applications. Redox Biol 2018; 17:297-314. [PMID: 29775961 PMCID: PMC6007815 DOI: 10.1016/j.redox.2018.05.002] [Received: 03/22/2018] [Revised: 04/25/2018] [Accepted: 05/05/2018] [Indexed: 12/20/2022]
Abstract
Exposure to antioxidants and xenobiotics triggers the expression of a myriad of genes encoding antioxidant proteins, detoxifying enzymes, and xenobiotic transporters to offer protection against oxidative stress. This articulated universal mechanism is regulated through the cis-acting elements in an array of Nrf2 target genes called antioxidant response elements (AREs), which play a critical role in redox homeostasis. Though the Keap1/Nrf2/ARE system involves many players, AREs hold the key in transcriptional regulation of cytoprotective genes. ARE-mediated reporter constructs have been widely used, including for xenobiotic profiling and Nrf2 activator screening. The complexity of AREs arises from the presence of other regulatory elements within the AREs. The diversity in the ARE sequences not only brings regulatory selectivity of diverse transcription factors, but also confers functional complexity in the Keap1/Nrf2/ARE pathway. The different transcription factors either homodimerize or heterodimerize to bind the AREs. Depending on the nature of their partners, they may activate or suppress transcription. Attention is required for a deeper mechanistic understanding of ARE-mediated gene regulation. The computational methods for identification and analysis of AREs are still in their infancy. Investigations are required to determine whether epigenetic mechanisms play a role in the regulation of genes mediated through AREs. Studies of polymorphisms in AREs that lead to oxidative stress-related diseases are warranted. A thorough understanding of AREs will pave the way for the development of therapeutic agents against cancer, neurodegenerative, cardiovascular, metabolic and other diseases with oxidative stress.
Affiliation(s)
- Azhwar Raghunath
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641046, Tamilnadu, India
- Kiruthika Sundarraj
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641046, Tamilnadu, India
- Raju Nagarajan
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600036, Tamilnadu, India
- Frank Arfuso
- Stem Cell and Cancer Biology Laboratory, School of Biomedical Sciences, Curtin Health Innovation Research Institute, Curtin University, Perth, WA 6009, Australia
- Jinsong Bian
- Department of Pharmacology, Yong Loo Lin School of Medicine, National University of Singapore, 117600 Singapore, Singapore
- Alan P Kumar
- Department of Pharmacology, Yong Loo Lin School of Medicine, National University of Singapore, 117600 Singapore, Singapore; Cancer Science Institute of Singapore, National University of Singapore, Singapore 117599, Singapore; Medical Science Cluster, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Curtin Medical School, Faculty of Health Sciences, Curtin University, Perth, WA, Australia
- Gautam Sethi
- Department of Pharmacology, Yong Loo Lin School of Medicine, National University of Singapore, 117600 Singapore, Singapore
- Ekambaram Perumal
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641046, Tamilnadu, India
|