1
Moeckel C, Mouratidis I, Chantzi N, Uzun Y, Georgakopoulos-Soares I. Advances in computational and experimental approaches for deciphering transcriptional regulatory networks: Understanding the roles of cis-regulatory elements is essential, and recent research utilizing MPRAs, STARR-seq, CRISPR-Cas9, and machine learning has yielded valuable insights. Bioessays 2024; 46:e2300210. [PMID: 38715516 DOI: 10.1002/bies.202300210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 04/22/2024] [Accepted: 04/23/2024] [Indexed: 05/16/2024]
Abstract
Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.
Affiliation(s)
- Camille Moeckel
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA
- Ioannis Mouratidis
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
- Nikol Chantzi
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA
- Yasin Uzun
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
  - Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA
- Ilias Georgakopoulos-Soares
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
2
Hu W, Li Y, Wu Y, Guan L, Li M. A deep learning model for DNA enhancer prediction based on nucleotide position aware feature encoding. iScience 2024; 27:110030. [PMID: 38868182 PMCID: PMC11167433 DOI: 10.1016/j.isci.2024.110030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Revised: 04/23/2024] [Accepted: 05/16/2024] [Indexed: 06/14/2024] Open
Abstract
Enhancers are genomic DNA elements that regulate the expression of neighboring genes and are crucial for biological processes such as cell differentiation and stress response. However, current machine learning methods for predicting DNA enhancers often underutilize hidden features in gene sequences, limiting model accuracy. Hence, this article proposes the PDCNN model, a deep learning-based enhancer prediction method. PDCNN extracts statistical nucleotide representations from gene sequences, discerning positional distribution information of nucleotides in modifier-like DNA sequences. Built on a convolutional neural network structure, PDCNN employs dual convolutional and fully connected layers. The cross-entropy loss is minimized iteratively with a gradient descent algorithm, enhancing prediction accuracy. Model parameters are fine-tuned to select optimal combinations for training, achieving over 95% accuracy. Comparative analysis with traditional methods and existing models demonstrates PDCNN's robust feature extraction capability. It outperforms advanced machine learning methods in identifying DNA enhancers, presenting an effective method with broad implications for genomics, biology, and medical research.
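The abstract does not spell out how the positional nucleotide statistics are computed, so the following is only a minimal sketch of one plausible position-aware encoding, not PDCNN's actual feature extractor. It assumes equal-length training sequences; the names `position_frequencies` and `encode` are hypothetical. Each nucleotide is replaced by how often it occurs at that position in the training set.

```python
from collections import Counter

ALPHABET = "ACGT"


def position_frequencies(seqs):
    """Per-position nucleotide frequencies over equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    table = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        table.append({nt: counts.get(nt, 0) / len(seqs) for nt in ALPHABET})
    return table


def encode(seq, table):
    """Replace each nucleotide by its training-set frequency at that position.

    Assumes len(seq) <= len(table); unknown symbols (e.g. 'N') map to 0.0.
    """
    return [table[pos].get(nt, 0.0) for pos, nt in enumerate(seq)]
```

A vector produced by `encode` could then be fed to a convolutional classifier; that downstream step is not shown here.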
Affiliation(s)
- Wenxing Hu
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
- Yelin Li
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
- Yan Wu
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
- Lixin Guan
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
- Mengshan Li
  - College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, Jiangxi, China
3
Sinha R, Pal RK, De RK. A novel method addressing NGS-based mappability bias for sensitive detection of DNA alterations. J Bioinform Comput Biol 2024; 22:2450009. [PMID: 39030667 DOI: 10.1142/s0219720024500094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/21/2024]
Abstract
A turning point in cancer research was the introduction of massively parallel sequencing technology, which greatly reduced the cost and time of genome sequencing. This widened the scope for detecting and analyzing the role of structural alterations in cancer. However, certain biases exist in NGS-based approaches and badly affect the CNV identification process. Moreover, DNA repeats in CNV regions need special attention, as they degrade the performance of the majority of existing CNV detection tools, even after a generalized bias correction method is applied. This motivated the present work, in which a novel method has been designed to address DNA repeats, and thereby the mappability bias, in CNV regions. The method consists of three phases. The first phase computes the alignment information of uniquely mapped DNA reads, considering the base quality and base mismatch parameters at nucleotide-level precision. The second and third phases use a novel approach to allocate the non-uniquely mapped reads to an optimal region of the DNA repeats based on a probabilistic membership model. The proposed method is capable of identifying CNVs in coding as well as non-coding regions of the DNA, and of detecting CNVs in DNA repeat regions. The methodology achieves a sensitivity greater than [Formula: see text] in the simulations performed. On real data, the detected variants are validated against the Database of Genomic Variants, where the percentage overlap is also greater than 95%, and the method achieves much better breakpoint prediction than other popular bias correction CNV detection methods.
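The abstract does not define the probabilistic membership model, so the sketch below only illustrates the general idea of allocating a non-uniquely mapped read among candidate repeat regions. It assumes, purely for illustration, that a region's membership probability is proportional to its count of uniquely mapped reads plus a pseudocount. The function names and the weighting rule are hypothetical and are not taken from the paper.

```python
def membership_probabilities(candidates, unique_counts, pseudocount=1.0):
    """Probability that a multi-mapped read belongs to each candidate region.

    Assumed rule: weight = unique-read count of the region + pseudocount.
    """
    if not candidates:
        raise ValueError("a read needs at least one candidate region")
    weights = {r: unique_counts.get(r, 0) + pseudocount for r in candidates}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def assign_read(candidates, unique_counts, pseudocount=1.0):
    """Hard-assign the read to the region with the highest membership."""
    probs = membership_probabilities(candidates, unique_counts, pseudocount)
    return max(probs, key=probs.get)
```

With unique counts of 9 and 4 for two candidate regions and the default pseudocount, the memberships are 10/15 and 5/15, so the read is assigned to the first region.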
Affiliation(s)
- Rituparna Sinha
  - Information Technology, Heritage Institute of Technology, Anandapur Kolkata, West Bengal, India
- Rajat Kumar Pal
  - Computer Science and Engineering Department, University of Calcutta, Kolkata, India
- Rajat Kumar De
  - Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
4
Ren Y, Li C, Nanayakkara Sapugahawatte D, Zhu C, Spänig S, Jamrozy D, Rothen J, Daubenberger CA, Bentley SD, Ip M, Heider D. Predicting hosts and cross-species transmission of Streptococcus agalactiae by interpretable machine learning. Comput Biol Med 2024; 171:108185. [PMID: 38401454 DOI: 10.1016/j.compbiomed.2024.108185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 02/13/2024] [Accepted: 02/18/2024] [Indexed: 02/26/2024]
Abstract
BACKGROUND: Streptococcus agalactiae, commonly known as Group B Streptococcus (GBS), exhibits a broad host range, manifesting as both a beneficial commensal and an opportunistic pathogen across various species. In humans, it poses significant risks, causing neonatal sepsis and meningitis, along with severe infections in adults. Additionally, it impacts livestock by inducing mastitis in bovines and contributing to epidemic mortality in fish populations. Despite its wide host spectrum, the mechanisms enabling GBS to adapt to specific hosts remain inadequately elucidated. Therefore, the development of a rapid and accurate method that differentiates GBS strains associated with particular animal hosts based on genome-wide information holds immense potential. Such a tool would not only bolster identification and containment efforts during GBS outbreaks but also deepen our comprehension of the bacteria's host adaptations spanning humans, livestock, and other natural animal reservoirs. METHODS AND RESULTS: Here, we developed three machine learning models, random forest (RF), logistic regression (LR), and support vector machine (SVM), based on genome-wide mutation data. These models enabled precise prediction of the host origin of GBS, accurately distinguishing between human, bovine, fish, and pig hosts. Moreover, we conducted an interpretable machine learning analysis using SHapley Additive exPlanations (SHAP) and variant annotation to uncover the most influential genomic features and associated genes for each host. Additionally, by meticulously examining misclassified samples, we gained valuable insights into the dynamics of host transmission and the potential for zoonotic infections. CONCLUSIONS: Our study underscores the effectiveness of random forest (RF) and logistic regression (LR) models based on mutation data for accurately predicting GBS host origins. Additionally, we identify the key features associated with each GBS host, thereby enhancing our understanding of the bacteria's host-specific adaptations.
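SHAP attributes a prediction to input features using Shapley values. As a rough illustration of that underlying quantity, the sketch below computes exact Shapley values by enumerating feature subsets, with absent features replaced by a baseline value. This is not the SHAP library and not the authors' pipeline: `shapley_values`, `predict`, and `baseline` are hypothetical names, and the enumeration is exponential in the number of features, so it only suits toy inputs.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

For a linear model, each attribution reduces to the coefficient times the feature's deviation from the baseline, and the attributions sum to the difference between the prediction and the baseline prediction.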
Affiliation(s)
- Yunxiao Ren
  - Department for Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg, Germany
- Carmen Li
  - Department of Microbiology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
- Chendi Zhu
  - Department of Microbiology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
- Sebastian Spänig
  - Department for Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg, Germany
- Dorota Jamrozy
  - Parasites and Microbes Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
- Julian Rothen
  - Swiss Tropical and Public Health Institute (Swiss TPH) Basel, Department of Medical Parasitology and Infection Biology, 4002, Basel, Switzerland; University of Basel, 4002, Basel, Switzerland
- Claudia A Daubenberger
  - Swiss Tropical and Public Health Institute (Swiss TPH) Basel, Department of Medical Parasitology and Infection Biology, 4002, Basel, Switzerland; University of Basel, 4002, Basel, Switzerland
- Stephen D Bentley
  - Parasites and Microbes Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
- Margaret Ip
  - Department of Microbiology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
- Dominik Heider
  - Department for Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg, Germany; Institute for Computer Science, University of Düsseldorf, 40211, Düsseldorf, Germany; Center for Digital Health, Heinrich Heine University Düsseldorf, Moorenstr. 5, 40225, Düsseldorf, Germany
5
Jiang J, Pei H, Li J, Li M, Zou Q, Lv Z. FEOpti-ACVP: identification of novel anti-coronavirus peptide sequences based on feature engineering and optimization. Brief Bioinform 2024; 25:bbae037. [PMID: 38366802 PMCID: PMC10939380 DOI: 10.1093/bib/bbae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 12/27/2023] [Accepted: 01/17/2024] [Indexed: 02/18/2024] Open
Abstract
Anti-coronavirus peptides (ACVPs) represent a relatively novel approach to inhibiting the adsorption and fusion of the virus with human cells. Several peptide-based inhibitors have shown promise as potential therapeutic drug candidates. However, identifying such peptides in laboratory experiments is both costly and time-consuming. Therefore, there is growing interest in using computational methods to predict ACVPs. Here, we describe a model for the prediction of ACVPs that is based on the combination of feature engineering (FE) optimization and deep representation learning. FEOpti-ACVP was pre-trained using two feature extraction frameworks. In the next step, several machine learning approaches were tested to construct the final algorithm. The final version of FEOpti-ACVP outperformed existing methods used for ACVP prediction, and it has the potential to become a valuable tool in ACVP drug design. A user-friendly webserver of FEOpti-ACVP can be accessed at http://servers.aibiochem.net/soft/FEOpti-ACVP/.
Affiliation(s)
- Jici Jiang
  - College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Hongdi Pei
  - College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Jiayu Li
  - College of Life Science, Sichuan University, Chengdu 610065, China
- Mingxin Li
  - College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Zhibin Lv
  - College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
6
Ramakrishnan A, Wangensteen G, Kim S, Nestler EJ, Shen L. DeepRegFinder: deep learning-based regulatory elements finder. Bioinformatics Advances 2024; 4:vbae007. [PMID: 38343388 PMCID: PMC10858349 DOI: 10.1093/bioadv/vbae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 12/06/2023] [Accepted: 01/12/2024] [Indexed: 06/15/2024]
Abstract
Summary: Enhancers and promoters are important classes of DNA regulatory elements (DREs) that govern gene expression. Identifying them at a genomic scale is a critical task in bioinformatics. The DREs often exhibit unique histone mark binding patterns, which can be captured by high-throughput ChIP-seq experiments. To account for the variation and noise among the binding sites, machine learning models are trained on known enhancer/promoter sites using histone mark ChIP-seq data and then predict enhancers/promoters in other genomic regions. To this end, we have developed a highly customizable program named DeepRegFinder, which automates the entire process of data processing, model training, and prediction. We have employed convolutional and recurrent neural networks for model training and prediction. DeepRegFinder further categorizes enhancers and promoters into active and poised states, a unique and valuable feature for researchers. Our method demonstrates improved precision and recall in comparison to existing algorithms for enhancer prediction across multiple cell types. Moreover, our pipeline is modular and eliminates the tedious steps involved in preprocessing, making it easier for users to apply it to their data quickly. Availability and implementation: https://github.com/shenlab-sinai/DeepRegFinder.
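The comparison above is stated in terms of precision and recall. As a reminder of how those two metrics are defined for a binary enhancer/non-enhancer call, here is a minimal sketch. The function name and the label convention (1 = enhancer) are assumptions for illustration and are not part of DeepRegFinder.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one positive class.

    precision = TP / (TP + FP), recall = TP / (TP + FN);
    either metric is reported as 0.0 when its denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For a multi-class setting such as active versus poised enhancers and promoters, the same function could be called once per class with a different `positive` label.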
Affiliation(s)
- Aarthi Ramakrishnan
  - Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- George Wangensteen
  - Department of Computer Science, Brown University, Providence, RI 02912, United States
- Sarah Kim
  - Cancer Program, Broad Institute, Cambridge, MA 02142, United States
- Eric J Nestler
  - Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Li Shen
  - Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
7
Raes A, Athanasiou G, Azari-Dolatabad N, Sadeghi H, Gonzalez Andueza S, Arcos JL, Cerquides J, Chaitanya Pavani K, Opsomer G, Bogado Pascottini O, Smits K, Angel-Velez D, Van Soom A. Manual versus deep learning measurements to evaluate cumulus expansion of bovine oocytes and its relationship with embryo development in vitro. Comput Biol Med 2024; 168:107785. [PMID: 38056209 DOI: 10.1016/j.compbiomed.2023.107785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 11/20/2023] [Accepted: 11/28/2023] [Indexed: 12/08/2023]
Abstract
Cumulus expansion is an important indicator of oocyte maturation and has been suggested to be indicative of greater oocyte developmental capacity. Although multiple methods have been described to assess cumulus expansion, none of them is considered a gold standard. Additionally, these methods are subjective and time-consuming. In this manuscript, the reliability of three cumulus expansion measurement methods was assessed, and a deep learning model was created to automatically perform the measurement. Cumulus expansion of 232 cumulus-oocyte complexes was evaluated by three independent observers using three methods: (1) measurement of the cumulus area, (2) measurement of three distances between the zona pellucida and outer cumulus, and (3) scoring cumulus expansion on a 5-point Likert scale. The reliability of the methods was calculated in terms of intraclass-correlation coefficients (ICC) for both inter- and intra-observer agreements. The area method resulted in the best overall inter-observer agreement with an ICC of 0.89 versus 0.54 and 0.30 for the 3-distance and scoring methods, respectively. Therefore, the area method served as the base to create a deep learning model, AI-xpansion, which reaches a human-level performance in terms of average rank, bias and variance. To evaluate the accuracy of the methods, the results of cumulus expansion calculations were linked to embryonic development. Cumulus expansion had increased significantly in oocytes that achieved successful embryo development when measured by AI-xpansion, the area- or 3-distance method, while this was not the case for the scoring method. Measuring the area is the most reliable method to manually evaluate cumulus expansion, whilst deep learning automatically performs the calculation with human-level precision and high accuracy and could therefore be a valuable prospective tool for embryologists.
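The abstract reports intraclass-correlation coefficients but does not say which ICC form was used. The sketch below assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), computed from the usual ANOVA mean squares. The function name is hypothetical, and the input is a table with one row per cumulus-oocyte complex and one column per observer.

```python
def icc_2_1(ratings):
    """ICC(2,1) for an n-subjects by k-raters table of scores.

    Assumes n >= 2, k >= 2, and that the subjects are not all identical.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares for subjects (rows), raters (columns), and residual error.
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    sse = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because this form measures absolute agreement, a constant offset between observers lowers the coefficient even when their rankings match exactly.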
Affiliation(s)
- Annelies Raes
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
- Georgios Athanasiou
  - Artificial Intelligence Research Institute (IIIA-CSIC), 08193, Bellaterra, Spain; Department of Computer Science, Universitat Autonoma de Barcelona, Spain
- Nima Azari-Dolatabad
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
- Hafez Sadeghi
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
- Sebastian Gonzalez Andueza
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
- Josep Lluis Arcos
  - Artificial Intelligence Research Institute (IIIA-CSIC), 08193, Bellaterra, Spain
- Jesus Cerquides
  - Artificial Intelligence Research Institute (IIIA-CSIC), 08193, Bellaterra, Spain
- Krishna Chaitanya Pavani
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
- Geert Opsomer
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
- Osvaldo Bogado Pascottini
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
- Katrien Smits
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
- Daniel Angel-Velez
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium; Research Group in Animal Sciences-INCA-CES, Universidad CES, Medellin, 050021, Colombia
- Ann Van Soom
  - Department of Internal Medicine, Reproduction and Population Medicine, Faculty of Veterinary Medicine, Ghent University, Merelbeke, Belgium
8
Ma J, Kong D, Wu F, Bao L, Yuan J, Liu Y. Densely connected convolutional networks for ultrasound image based lesion segmentation. Comput Biol Med 2024; 168:107725. [PMID: 38006827 DOI: 10.1016/j.compbiomed.2023.107725] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Revised: 11/03/2023] [Accepted: 11/15/2023] [Indexed: 11/27/2023]
Abstract
Delineating lesion boundaries plays a central role in diagnosing thyroid and breast cancers, making related therapy plans, and evaluating therapeutic effects. However, manually annotating low-quality ultrasound (US) images is often time-consuming and error-prone, with limited reproducibility, given high speckle noise, heterogeneous appearances, ambiguous boundaries, etc., especially for nodular lesions with huge intra-class variance. Accurate lesion segmentation from US images is hence desirable but challenging in clinical practice. In this study, we propose a new densely connected convolutional network architecture (called MDenseNet) to automatically segment nodular lesions from 2D US images, which is first pre-trained on the ImageNet database (called PMDenseNet) and then retrained on the given US image datasets. Moreover, we also designed a deep MDenseNet with a pre-training strategy (PDMDenseNet) for segmentation of thyroid and breast nodules by adding a dense block to increase the depth of our MDenseNet. Extensive experiments demonstrate that the proposed MDenseNet-based method can accurately extract multiple nodular lesions, even those with complex shapes, from input thyroid and breast US images. Moreover, additional experiments show that the introduced MDenseNet-based method also outperforms three state-of-the-art convolutional neural networks in terms of accuracy and reproducibility. Meanwhile, promising results in nodular lesion segmentation from thyroid and breast US images illustrate its great potential in many other clinical segmentation tasks.
Affiliation(s)
- Jinlian Ma
  - School of Integrated Circuits, Shandong University, Jinan 250101, China; Shenzhen Research Institute of Shandong University, A301 Virtual University Park in South District of Shenzhen, China; State Key Lab of CAD&CG, College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
- Dexing Kong
  - School of Mathematical Sciences, Zhejiang University, Hangzhou 310027, China
- Fa Wu
  - School of Mathematical Sciences, Zhejiang University, Hangzhou 310027, China
- Lingyun Bao
  - Department of Ultrasound, Hangzhou First Peoples Hospital, Zhejiang University, Hangzhou, China
- Jing Yuan
  - School of Mathematics and Statistics, Xidian University, China
- Yusheng Liu
  - State Key Lab of CAD&CG, College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
9
Zhu J, Yang Y. Imputation for Single-cell RNA-seq Data with Non-negative Matrix Factorization and Transfer Learning. J Bioinform Comput Biol 2023; 21:2350029. [PMID: 38248911 DOI: 10.1142/s0219720023500294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/23/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) has proven to be an effective technology for investigating heterogeneity and transcriptome dynamics owing to its single-cell resolution. However, one of the major problems with data obtained by scRNA-seq is the excess of zeros in the count matrix, which greatly hinders downstream analysis. Here, we present a method that integrates non-negative matrix factorization and transfer learning (NMFTL) to impute scRNA-seq data. It borrows gene expression information from an additional dataset and adds graph-regularized terms to the decomposed matrices. These strategies not only maintain the intrinsic geometrical structure of the data itself but also further improve the accuracy of estimating the expression values by adding the transfer term to the model. Analysis of real data demonstrates that the proposed method outperforms existing matrix-factorization-based imputation methods in recovering dropout entries and in preserving gene-to-gene and cell-to-cell relationships, and that it also performs well in downstream analysis such as cell clustering. For convenience, we have implemented the "NMFTL" method as R scripts, which are available at https://github.com/FocusPaka/NMFTL.
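To make the matrix-factorization building block concrete, the sketch below runs plain masked NMF with multiplicative updates and fills the masked entries from the low-rank reconstruction. It is not NMFTL: the graph-regularized and transfer terms described above are omitted, every zero is treated as missing (a simplifying assumption, since some zeros are true zeros), and the names `masked_nmf` and `impute` are hypothetical. The authors' implementation is in R; Python is used here only for illustration.

```python
import random


def masked_nmf(X, mask, rank=2, iters=200, seed=0, eps=1e-9):
    """Factor X ~ W H using only entries where mask is 1.

    Returns W, H and the masked squared-error loss after each iteration.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]

    def reconstruct():
        return [[sum(W[i][k] * H[k][j] for k in range(rank))
                 for j in range(m)] for i in range(n)]

    def loss(R):
        return sum(mask[i][j] * (X[i][j] - R[i][j]) ** 2
                   for i in range(n) for j in range(m))

    history = [loss(reconstruct())]
    for _ in range(iters):
        R = reconstruct()  # all W updates use the same old reconstruction
        for i in range(n):
            for k in range(rank):
                num = sum(mask[i][j] * X[i][j] * H[k][j] for j in range(m))
                den = sum(mask[i][j] * R[i][j] * H[k][j] for j in range(m))
                W[i][k] *= num / (den + eps)
        R = reconstruct()  # refresh before updating H
        for k in range(rank):
            for j in range(m):
                num = sum(mask[i][j] * X[i][j] * W[i][k] for i in range(n))
                den = sum(mask[i][j] * R[i][j] * W[i][k] for i in range(n))
                H[k][j] *= num / (den + eps)
        history.append(loss(reconstruct()))
    return W, H, history


def impute(X, mask, W, H):
    """Keep observed entries; fill masked-out entries from W H."""
    rank = len(H)
    return [[X[i][j] if mask[i][j]
             else sum(W[i][k] * H[k][j] for k in range(rank))
             for j in range(len(X[0]))] for i in range(len(X))]
```

A call such as `masked_nmf(X, mask, rank=2)` followed by `impute(X, mask, W, H)` leaves observed counts untouched and replaces only the entries flagged as missing.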
Affiliation(s)
- Jiadi Zhu
  - School of Mathematics and Statistics, Xidian University, Xi'an, Shaanxi, P. R. China
- Youlong Yang
  - School of Mathematics and Statistics, Xidian University, Xi'an, Shaanxi, P. R. China
10
Lazaros K, Vlamos P, Vrahatis AG. Methods for cell-type annotation on scRNA-seq data: A recent overview. J Bioinform Comput Biol 2023; 21:2340002. [PMID: 37743364 DOI: 10.1142/s0219720023400024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
The evolution of single-cell technology is ongoing, continually generating massive amounts of data that reveal many mysteries surrounding intricate diseases. However, the drawbacks of these technologies continue to constrain us. Among these, annotating cell types in single-cell gene expression data poses a substantial challenge, despite the myriad of tools at our disposal. The rapid growth in data, resources, and tools has consequently brought about significant changes in this area over the years. In our study, we spotlight all noteworthy cell type annotation techniques developed over the past four years. We provide an overview of the latest trends in this field, showcasing the most advanced methods in taxonomy. Our research underscores the demand for additional tools that incorporate a biological context and also predicts that the rising trend of graph neural network approaches will likely lead this research field in the coming years.
Affiliation(s)
- Konstantinos Lazaros
  - Bioinformatics and Human Electrophysiology Laboratory, Department of Informatics, Ionian University, 49100 Corfu, Greece
- Panagiotis Vlamos
  - Bioinformatics and Human Electrophysiology Laboratory, Department of Informatics, Ionian University, 49100 Corfu, Greece
- Aristidis G Vrahatis
  - Bioinformatics and Human Electrophysiology Laboratory, Department of Informatics, Ionian University, 49100 Corfu, Greece
11
Liu Y, Wang Z, Yuan H, Zhu G, Zhang Y. HEAP: a task adaptive-based explainable deep learning framework for enhancer activity prediction. Brief Bioinform 2023; 24:bbad286. [PMID: 37539835 DOI: 10.1093/bib/bbad286] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/15/2023] [Revised: 07/05/2023] [Accepted: 07/21/2023] [Indexed: 08/05/2023] Open
Abstract
Enhancers are crucial cis-regulatory elements that control gene expression in a cell-type-specific manner. Despite extensive genetic and computational studies, accurately predicting enhancer activity in different cell types remains a challenge, and the grammar of enhancers is still poorly understood. Here, we present HEAP (high-resolution enhancer activity prediction), an explainable deep learning framework for predicting enhancers and exploring enhancer grammar. The framework includes three modules that use grammar-based reasoning for enhancer prediction. The algorithm can incorporate DNA sequences and epigenetic modifications to obtain better accuracy. We use a novel two-step multi-task learning method, task adaptive parameter sharing (TAPS), to efficiently predict enhancers in different cell types. We first train a shared model with all cell-type datasets. Then we adapt to specific tasks by adding several task-specific subset layers. Experiments demonstrate that HEAP outperforms published methods and showcase the effectiveness of TAPS, especially for cell types with limited training samples. Notably, the explainable framework HEAP utilizes post-hoc interpretation to provide insights into the prediction mechanisms from three perspectives: data, model architecture, and algorithm, leading to a better understanding of model decisions and enhancer grammar. We expect HEAP to be a valuable tool for gaining insight into the complex mechanisms of enhancer activity.
Affiliation(s)
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Zixuan Wang
  - College of Electronics and Information Engineering, Sichuan University, 610065, Chengdu, China
- Hao Yuan
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Guiquan Zhu
  - West China Hospital of Stomatology, Sichuan University, 610041, Chengdu, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
12
Phan LT, Oh C, He T, Manavalan B. A comprehensive revisit of the machine-learning tools developed for the identification of enhancers in the human genome. Proteomics 2023; 23:e2200409. [PMID: 37021401 DOI: 10.1002/pmic.202200409] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/10/2023] [Revised: 03/18/2023] [Accepted: 03/27/2023] [Indexed: 04/07/2023]
Abstract
Enhancers are non-coding DNA elements that play a crucial role in enhancing the transcription rate of a specific gene in the genome. Experiments for identifying enhancers can be restricted by their conditions and involve complicated, time-consuming, laborious, and costly steps. To overcome these challenges, computational platforms have been developed to complement experimental methods that enable high-throughput identification of enhancers. Over the last few years, the development of various enhancer computational tools has resulted in significant progress in predicting putative enhancers. Thus, researchers are now able to use a variety of strategies to enhance and advance enhancer study. In this review, an overview of machine learning (ML)-based prediction methods for enhancer identification and related databases has been provided. The existing enhancer-prediction methods have also been reviewed regarding their algorithms, feature selection processes, validation techniques, and software utility. In addition, the advantages and drawbacks of these ML approaches and guidelines for developing bioinformatic tools have been highlighted for a more efficient enhancer prediction. This review will serve as a useful resource for experimentalists in selecting the appropriate ML tool for their study, and for bioinformaticians in developing more accurate and advanced ML-based predictors.
Affiliation(s)
- Le Thi Phan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| | - Changmin Oh
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
| | - Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do, South Korea
|
13
|
Alakuş TB. A Novel Repetition Frequency-Based DNA Encoding Scheme to Predict Human and Mouse DNA Enhancers with Deep Learning. Biomimetics (Basel) 2023; 8:218. [PMID: 37366813 DOI: 10.3390/biomimetics8020218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Revised: 05/18/2023] [Accepted: 05/22/2023] [Indexed: 06/28/2023]
Abstract
Recent studies have shown that DNA enhancers have an important role in the regulation of gene expression. They are responsible for different important biological elements and processes such as development, homeostasis, and embryogenesis. However, experimental prediction of these DNA enhancers is time-consuming and costly as it requires laboratory work. Therefore, researchers started to look for alternative ways and started to apply computation-based deep learning algorithms to this field. Yet, the inconsistency and unsuccessful prediction performance of computational-based approaches among various cell lines led to the investigation of these approaches as well. Therefore, in this study, a novel DNA encoding scheme was proposed, and solutions were sought to the problems mentioned and DNA enhancers were predicted with BiLSTM. The study consisted of four different stages for two scenarios. In the first stage, DNA enhancer data were obtained. In the second stage, DNA sequences were converted to numerical representations by both the proposed encoding scheme and various DNA encoding schemes including EIIP, integer number, and atomic number. In the third stage, the BiLSTM model was designed, and the data were classified. In the final stage, the performance of DNA encoding schemes was determined by accuracy, precision, recall, F1-score, CSI, MCC, G-mean, Kappa coefficient, and AUC scores. In the first scenario, it was determined whether the DNA enhancers belonged to humans or mice. As a result of the prediction process, the highest performance was achieved with the proposed DNA encoding scheme, and an accuracy of 92.16% and an AUC score of 0.85 were calculated, respectively. The closest accuracy score to the proposed scheme was obtained with the EIIP DNA encoding scheme and the result was observed as 89.14%. The AUC score of this scheme was measured as 0.87. 
Among the remaining DNA encoding schemes, the atomic number showed an accuracy score of 86.61%, while this rate decreased to 76.96% with the integer scheme. The AUC values of these schemes were 0.84 and 0.82, respectively. In the second scenario, it was determined whether there was a DNA enhancer and, if so, it was decided to which species this enhancer belonged. In this scenario, the highest accuracy score was obtained with the proposed DNA encoding scheme and the result was 84.59%. Moreover, the AUC score of the proposed scheme was determined as 0.92. EIIP and integer DNA encoding schemes showed accuracy scores of 77.80% and 73.68%, respectively, while their AUC scores were close to 0.90. The most ineffective prediction was performed with the atomic number and the accuracy score of this scheme was calculated as 68.27%. Finally, the AUC score of this scheme was 0.81. At the end of the study, it was observed that the proposed DNA encoding scheme was successful and effective in predicting DNA enhancers.
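The baseline encodings named in this abstract (EIIP, integer number, and atomic number) each map a nucleotide to a single number. The following is a minimal sketch of that mapping, not the author's implementation: the EIIP and atomic-number values are the ones commonly used in the literature, the integer assignment shown (T=0, C=1, A=2, G=3) is one common convention and is an assumption here, and the function name `encode` is hypothetical. The paper's proposed repetition frequency-based scheme is not specified in the abstract and is therefore not sketched.

```python
# Per-nucleotide lookup tables for three baseline DNA encodings.
# EIIP: electron-ion interaction pseudopotential values commonly cited.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
# Atomic number: sum of atomic numbers of the atoms in each base.
ATOMIC = {"A": 70, "C": 58, "G": 78, "T": 66}
# Integer: one common convention (an assumption; papers differ).
INTEGER = {"T": 0, "C": 1, "A": 2, "G": 3}

SCHEMES = {"eiip": EIIP, "atomic": ATOMIC, "integer": INTEGER}


def encode(sequence, scheme):
    """Convert a DNA string into a list of numbers using the named scheme.

    Raises KeyError for an unknown scheme or a base outside A/C/G/T.
    """
    table = SCHEMES[scheme]
    return [table[base] for base in sequence.upper()]


if __name__ == "__main__":
    for name in SCHEMES:
        print(name, encode("ACGT", name))
```

The resulting numeric vectors would then be padded or truncated to a fixed length before being passed to a sequence model such as a BiLSTM; that step is omitted here because the abstract does not describe it.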
Affiliation(s)
- Talha Burak Alakuş
- Department of Software Engineering, Faculty of Engineering, Kırklareli University, 39100 Kırklareli, Turkey
|
14
|
Wu P, Nie Z, Huang Z, Zhang X. CircPCBL: Identification of Plant CircRNAs with a CNN-BiGRU-GLT Model. Plants (Basel) 2023; 12:1652. [PMID: 37111874 PMCID: PMC10143888 DOI: 10.3390/plants12081652] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 04/10/2023] [Accepted: 04/13/2023] [Indexed: 06/19/2023]
Abstract
Circular RNAs (circRNAs), which are produced post-splicing of pre-mRNAs, are strongly linked to the emergence of several tumor types. The initial stage in conducting follow-up studies involves identifying circRNAs. Currently, animals are the primary target of most established circRNA recognition technologies. However, the sequence features of plant circRNAs differ from those of animal circRNAs, making it impossible to detect plant circRNAs. For example, there are non-GT/AG splicing signals at circRNA junction sites and few reverse complementary sequences and repetitive elements in the flanking intron sequences of plant circRNAs. In addition, there have been few studies on circRNAs in plants, and thus it is urgent to create a plant-specific method for identifying circRNAs. In this study, we propose CircPCBL, a deep-learning approach that only uses raw sequences to distinguish between circRNAs found in plants and other lncRNAs. CircPCBL comprises two separate detectors: a CNN-BiGRU detector and a GLT detector. The CNN-BiGRU detector takes in the one-hot encoding of the RNA sequence as the input, while the GLT detector uses k-mer (k = 1-4) features. The output matrices of the two submodels are then concatenated and ultimately pass through a fully connected layer to produce the final output. To verify the generalization performance of the model, we evaluated CircPCBL using several datasets, and the results revealed that it had an F1 of 85.40% on the validation dataset composed of six different plant species and 85.88%, 75.87%, and 86.83% on the three cross-species independent test sets composed of Cucumis sativus, Populus trichocarpa, and Gossypium raimondii, respectively. With an accuracy of 90.9% and 90%, respectively, CircPCBL successfully predicted ten of the eleven circRNAs of experimentally reported Poncirus trifoliata and nine of the ten lncRNAs of rice on the real set. CircPCBL could potentially contribute to the identification of circRNAs in plants. In addition, it is remarkable that CircPCBL also achieved an average accuracy of 94.08% on the human datasets, which is also an excellent result, implying its potential application in animal datasets. Ultimately, CircPCBL is available as a web server, from which the data and source code can also be downloaded free of charge.
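The two input representations mentioned in this abstract, one-hot encoding for the CNN-BiGRU detector and k-mer features with k from 1 to 4 for the GLT detector, can be illustrated as follows. This is a sketch under stated assumptions rather than the CircPCBL code: the function names `one_hot` and `kmer_features` are hypothetical, k-mer counts are normalized to frequencies (the abstract does not say whether raw counts or frequencies are used), and any T in the input is treated as U.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet; T is converted to U on input


def one_hot(sequence):
    """Return one 4-element 0/1 row per nucleotide, in ACGU order."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper().replace("T", "U"):
        row = [0] * len(ALPHABET)
        row[index[base]] = 1  # KeyError for bases outside ACGU
        rows.append(row)
    return rows


def kmer_features(sequence, k_values=(1, 2, 3, 4)):
    """Return concatenated k-mer frequencies for each k in k_values.

    For k = 1..4 this yields 4 + 16 + 64 + 256 = 340 features.
    Normalizing counts to frequencies is an assumption of this sketch.
    """
    sequence = sequence.upper().replace("T", "U")
    features = []
    for k in k_values:
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        windows = max(len(sequence) - k + 1, 0)
        for i in range(windows):
            kmer = sequence[i:i + k]
            if kmer in counts:  # skip windows with ambiguous bases
                counts[kmer] += 1
        features.extend(
            (counts[m] / windows if windows else 0.0) for m in kmers
        )
    return features


if __name__ == "__main__":
    print(one_hot("ACGU"))
    print(len(kmer_features("ACGUACGUAC")))
```

Keeping the k-mer vector at a fixed length regardless of sequence length is what lets it feed a fully connected layer directly, whereas the one-hot matrix grows with the sequence and suits convolutional or recurrent layers.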
Affiliation(s)
- Pengpeng Wu
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
- School of Life Science, Anhui Agricultural University, Hefei 230036, China
| | - Zhenjun Nie
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
- School of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
| | - Zhiqiang Huang
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
- School of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
| | - Xiaodan Zhang
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
- School of Information and Computer Science, Anhui Agricultural University, Hefei 230036, China
|
15
|
Sokhansanj BA, Rosen GL. Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning. Comput Biol Med 2022; 149:105969. [PMID: 36041271 PMCID: PMC9384346 DOI: 10.1016/j.compbiomed.2022.105969] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/06/2022] [Revised: 07/11/2022] [Accepted: 08/13/2022] [Indexed: 11/17/2022]
Abstract
Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies which do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes "patient status" metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that, in fact, temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper poses a potential solution: efficient mixed effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrates that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed effect XGBoost, LightGBM, random forests, and elastic net logistic regression models.
Affiliation(s)
- Bahrad A Sokhansanj
- Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America.
| | - Gail L Rosen
- Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America.
|