1
Xia Y, Liu Y, Li T, He S, Chang H, Wang Y, Zhang Y, Ge W. Assessing parameter efficient methods for pre-trained language model in annotating scRNA-seq data. Methods 2024; 228:12-21. [PMID: 38759908 DOI: 10.1016/j.ymeth.2024.05.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2024] [Revised: 04/28/2024] [Accepted: 05/10/2024] [Indexed: 05/19/2024] Open
Abstract
Annotating cell types of single-cell RNA sequencing (scRNA-seq) data is crucial for studying cellular heterogeneity in the tumor microenvironment. Recently, large-scale pre-trained language models (PLMs) have achieved significant progress in cell-type annotation of scRNA-seq data. This approach effectively addresses previous methods' shortcomings in performance and generalization. However, fine-tuning PLMs for different downstream tasks demands considerable computational resources, rendering it impractical. Hence, a new research branch introduces parameter-efficient fine-tuning (PEFT). This involves optimizing a few parameters while leaving the majority unchanged, leading to substantial reductions in computational expenses. Here, we utilize scBERT, a large-scale pre-trained model, to explore the capabilities of three PEFT methods in scRNA-seq cell type annotation. Extensive benchmark studies across several datasets demonstrate the superior applicability of PEFT methods. Furthermore, downstream analysis using models obtained through PEFT showcases their utility in novel cell type discovery and model interpretability for potential marker genes. Our findings underscore the considerable potential of PEFT in PLM-based cell type annotation, presenting novel perspectives for the analysis of scRNA-seq data.
Affiliation(s)
- Yucheng Xia
  - Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu, 610209, China
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Tianhao Li
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Sihan He
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Hong Chang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yaqing Wang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Wenyi Ge
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
2
Peng B, Sun G, Fan Y. iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model. BMC Bioinformatics 2024; 25:224. [PMID: 38918692 PMCID: PMC11201334 DOI: 10.1186/s12859-024-05849-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 06/19/2024] [Indexed: 06/27/2024] Open
Abstract
Promoters are essential elements of DNA sequences, usually located in the immediate vicinity of gene transcription start sites, and play a critical role in the regulation of gene transcription. Their importance in molecular biology and genetics has attracted considerable research interest, and there is a consensus on the need for computational methods that identify promoters efficiently. Still, existing methods suffer from imbalanced recognition capabilities for positive and negative samples, and their recognition performance can be further improved. We conducted research on E. coli promoters and proposed a more advanced prediction model, iProL, based on the Longformer pre-trained model from the field of natural language processing. iProL does not rely on prior biological knowledge but simply treats promoter DNA sequences as plain text to identify promoters. It also combines one-dimensional convolutional neural networks and bidirectional long short-term memory to extract both local and global features. Experimental results show that iProL has a more balanced and superior performance than currently published methods. Additionally, we constructed a novel independent test set following the previous specification and compared iProL with three existing methods on this independent test set.
Affiliation(s)
- Binchao Peng
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Guicong Sun
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Yongxian Fan
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
3
Li F, Zhang J, Li K, Peng Y, Zhang H, Xu Y, Yu Y, Zhang Y, Liu Z, Wang Y, Huang L, Zhou F. GANSamples-ac4C: Enhancing ac4C site prediction via generative adversarial networks and transfer learning. Anal Biochem 2024; 689:115495. [PMID: 38431142 DOI: 10.1016/j.ab.2024.115495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2023] [Revised: 02/18/2024] [Accepted: 02/22/2024] [Indexed: 03/05/2024]
Abstract
The RNA modification N4-acetylcytidine (ac4C) is enzymatically catalyzed by N-acetyltransferase 10 (NAT10) and plays an essential role across tRNA, rRNA, and mRNA. It influences various cellular functions, including mRNA stability and rRNA biosynthesis. Wet-lab detection of ac4C modification sites is highly resource-intensive and costly. Therefore, various machine learning and deep learning techniques have been employed for computational detection of ac4C modification sites. Known ac4C modification sites are too limited in number to train an accurate and stable prediction model. This study introduces GANSamples-ac4C, a novel framework that combines transfer learning with a generative adversarial network (GAN) to generate synthetic RNA sequences for training a better ac4C modification site prediction model. Comparative analysis reveals that GANSamples-ac4C outperforms existing state-of-the-art methods in identifying ac4C sites. Moreover, our results underscore the potential of synthetic data in mitigating the issue of data scarcity for biological sequence prediction tasks. Another major advantage of GANSamples-ac4C is its interpretable decision logic. Multi-faceted interpretability analyses detect key regions in the ac4C sequences that influence the discrimination between positive and negative samples, a pronounced enrichment of G in these regions, and ac4C-associated motifs. These findings may offer novel insights for ac4C research. The GANSamples-ac4C framework and its source code are publicly accessible at http://www.healthinformaticslab.org/supp/.
Affiliation(s)
- Fei Li
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Jiale Zhang
  - College of Software, Jilin University, Changchun, Jilin, 130012, China
- Kewei Li
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Yu Peng
  - College of Software, Jilin University, Changchun, Jilin, 130012, China
- Haotian Zhang
  - College of Software, Jilin University, Changchun, Jilin, 130012, China
- Yiping Xu
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Yue Yu
  - College of Software, Jilin University, Changchun, Jilin, 130012, China
- Yuteng Zhang
  - College of Software, Jilin University, Changchun, Jilin, 130012, China
- Zewen Liu
  - College of Software, Jilin University, Changchun, Jilin, 130012, China
- Ying Wang
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Lan Huang
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Fengfeng Zhou
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
  - School of Biology and Engineering, Guizhou Medical University, Guiyang, 550025, Guizhou, China
4
Tenekeci S, Tekir S. Identifying promoter and enhancer sequences by graph convolutional networks. Comput Biol Chem 2024; 110:108040. [PMID: 38430611 DOI: 10.1016/j.compbiolchem.2024.108040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 01/09/2024] [Accepted: 02/27/2024] [Indexed: 03/05/2024]
Abstract
Identification of promoters, enhancers, and their interactions helps understand genetic regulation. This study proposes a graph-based semi-supervised learning model (GCN4EPI) for the enhancer-promoter classification problem. We adopt a graph convolutional network (GCN) architecture to integrate interaction information with sequence features. Nodes of the constructed graph hold word embeddings of DNA sequences while edges hold the Enhancer-Promoter Interaction (EPI) information. By means of semi-supervised learning, much less data (16%) and time are needed in model training. Comparisons on a benchmark dataset of six human cell lines show that the proposed approach outperforms the state-of-the-art methods by a large margin (10% higher F1 score) and trains the fastest (up to 3 times faster). Moreover, GCN4EPI's performance on cross-cell line data is also better than the baselines (3% higher F1 score). Our qualitative analyses with graph explainability models indicate that GCN4EPI learns from both text and graph structure. The results suggest that integrating interaction information with sequence features improves predictive performance and compensates for a smaller number of training instances.
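To make the graph convolution idea in this abstract concrete, the following minimal sketch applies one propagation step of the commonly used normalized-adjacency rule, ReLU(D^-1/2 (A + I) D^-1/2 H W), in plain Python. The two-node graph, the feature values, and the identity weight matrix are hypothetical, and nothing here should be read as GCN4EPI's actual implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 @ H @ W)."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Symmetric normalization by node degree (degree includes the self-loop).
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbor features, apply the linear map, then ReLU.
    out = matmul(matmul(norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical toy graph: node 0 (an enhancer) linked to node 1 (a promoter).
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for sequence embeddings
weights = [[1.0, 0.0], [0.0, 1.0]]   # identity, so only the mixing is visible
hidden = gcn_layer(adj, features, weights)
```

After one step each node's representation is a blend of its own features and its neighbor's, which is how edge (interaction) information reaches the sequence features.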
Affiliation(s)
- Samet Tenekeci
  - Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye
- Selma Tekir
  - Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye
5
Yoshizaki M, Kuriya Y, Yamamoto M, Watanabe N, Araki M. Development of method using language processing techniques for extracting information on drug-health food product interactions. Br J Clin Pharmacol 2024; 90:1514-1524. [PMID: 38504605 DOI: 10.1111/bcp.16032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 12/25/2023] [Accepted: 01/22/2024] [Indexed: 03/21/2024] Open
Abstract
AIMS: Health food products (HFPs) are foods and products related to maintaining and promoting health. HFPs may sometimes cause unforeseen adverse health effects by interacting with drugs. Considering the importance of information on the interactions between HFPs and drugs, this study aimed to establish a workflow to extract information on Drug-HFP Interactions (DHIs) from open resources. METHODS: First, information on drugs, enzymes, their interactions, and known DHIs was collected from multiple public databases and literature sources. Next, a network consisting of enzymes, HFPs, and drugs was constructed, assuming enzymes as candidates for hubs in Drug-HFP interactions (Method 1). Furthermore, we developed methods to analyze the biomedical context of each drug and HFP to predict potential DHIs from among the DHIs obtained in Method 1 by applying BioWordVec, a widely used biomedical terminology quantifier (Method 2-1 and 2-2). RESULTS: 44,965 DHIs (30% known) were identified in Method 1, including 38 metabolic enzymes, 157 HFPs, and 1256 drugs. Method 2-1 selected 7401 DHIs (17% known) from the DHIs of Method 1, while Method 2-2 chose 2819 DHIs (30% known). Because the methods rest on different assumptions, with Method 2-1 specifically selecting HFPs that interact with specific enzymes and Method 2-2 specifically selecting HFPs with functions similar to those of drugs, the proposed methods extracted a wide variety of DHIs. CONCLUSIONS: By integrating the results of language processing techniques with those of the network analysis, a workflow to efficiently extract unknown and known DHIs was constructed.
Affiliation(s)
- Mari Yoshizaki
  - Biological Science and Technology, Life and Materials Systems Engineering, Graduate School of Advanced Technology and Science, Tokushima University, Tokushima City, Tokushima Prefecture, Japan
  - Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, Settsu City, Osaka Prefecture, Japan
- Yuki Kuriya
  - Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, Settsu City, Osaka Prefecture, Japan
- Masaki Yamamoto
  - Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, Settsu City, Osaka Prefecture, Japan
- Naoki Watanabe
  - Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, Settsu City, Osaka Prefecture, Japan
- Michihiro Araki
  - Artificial Intelligence Center for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health and Nutrition, Settsu City, Osaka Prefecture, Japan
6
Li Y, Wei X, Yang Q, Xiong A, Li X, Zou Q, Cui F, Zhang Z. msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol 2024; 22:126. [PMID: 38816885 DOI: 10.1186/s12915-024-01923-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Accepted: 05/21/2024] [Indexed: 06/01/2024] Open
Abstract
BACKGROUND: A promoter is a specific sequence in DNA that has transcriptional regulatory functions, playing a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means for identifying promoters, offering a more efficient alternative to labor-intensive biological approaches. RESULTS: In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, effectively addressing the issue of insufficient DNA sequence information extraction in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used in the DNABERT model for promoter identification and strength prediction. Our model achieves accuracy rates of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, enhancing its interpretability. CONCLUSIONS: msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.
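The tokenization and soft-voting steps mentioned in this abstract can be sketched as follows. The sketch assumes that "multi-scale" refers to overlapping k-mer tokens of several lengths, which the abstract does not state explicitly; the per-model probabilities are invented for illustration, and the function names are hypothetical rather than taken from msBERT-Promoter.

```python
def kmer_tokens(seq, k):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def soft_vote(prob_lists):
    """Average class probabilities across models; return (label, averaged probs)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    # The predicted label is the class with the highest averaged probability.
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg


# Tokenize the same sequence at several scales (assumed k values).
tokens_by_scale = {k: kmer_tokens("ACGTAC", k) for k in (3, 4, 5)}

# Hypothetical outputs [P(non-promoter), P(promoter)] from one model per scale.
probs = [[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]]
label, avg = soft_vote(probs)
```

Unlike hard (majority) voting, soft voting lets a confident model outweigh a marginal one: here the second model votes "non-promoter", but the averaged probability still favors the promoter class.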
Affiliation(s)
- Yazi Li
  - School of Mathematics and Statistics, Hainan University, Haikou, 570228, China
- Xiaoman Wei
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Qinglin Yang
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- An Xiong
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Xingfeng Li
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
7
Akay A, Reddy HN, Galloway R, Kozyra J, Jackson AW. Predicting DNA toehold-mediated strand displacement rate constants using a DNA-BERT transformer deep learning model. Heliyon 2024; 10:e28443. [PMID: 38560216 PMCID: PMC10981123 DOI: 10.1016/j.heliyon.2024.e28443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 03/15/2024] [Accepted: 03/19/2024] [Indexed: 04/04/2024] Open
Abstract
Dynamic DNA nanotechnology is driving exciting developments in molecular computing, cargo delivery, sensing and detection. Combining this innovative area of research with the progress made in machine learning will aid in the design of sophisticated DNA machinery. Herein, we present a novel framework based on a transformer architecture and a deep learning model which can predict the rate constant of toehold-mediated strand displacement, the underlying process in dynamic DNA nanotechnology. Initially, a dataset of 4450 DNA sequences and corresponding rate constants was generated in silico using KinDA. Subsequently, a 1D convolutional neural network was trained using specific local features and DNA-BERT sequence embeddings to produce predicted rate constants. As a result, the newly trained deep learning model predicted toehold-mediated strand displacement rate constants with a root mean square error of 0.76 during testing. These findings demonstrate that DNA-BERT can improve prediction accuracy, negating the need for extensive computational simulations or experimentation. Finally, the impact of various local features during model training is discussed, and a detailed comparison between the one-hot encoding and DNA-BERT sequence representation methods is presented.
Affiliation(s)
- Ali Akay
  - Nanovery Limited, United Kingdom
  - Universita Degli Studi di Trento, Italy
8
Shen C, Mao D, Tang J, Liao Z, Chen S. Prediction of LncRNA-Protein Interactions Based on Kernel Combinations and Graph Convolutional Networks. IEEE J Biomed Health Inform 2024; 28:1937-1948. [PMID: 37327093 DOI: 10.1109/jbhi.2023.3286917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
The complexes of long non-coding RNAs bound to proteins can be involved in regulating life activities at various stages of organisms. However, given the growing number of lncRNAs and proteins, verifying LncRNA-Protein Interactions (LPI) with traditional biological experiments is time-consuming and laborious. Therefore, with the improvement of computing power, LPI prediction has met new development opportunities. Building on state-of-the-art works, this article proposes a framework called LncRNA-Protein Interactions based on Kernel Combinations and Graph Convolutional Networks (LPI-KCGCN). We first construct kernel matrices from the sequence features, sequence similarity features, expression features, and gene ontology of both lncRNAs and proteins. We then reconstruct the existing kernel matrices as the input of the next step. Combined with known LPIs, the reconstructed similarity matrices, which can be used as features of the topology map of the LPI network, are exploited to extract potential representations in the lncRNA and protein spaces using a two-layer Graph Convolutional Network. The predicted matrix is finally obtained by training the network to produce scoring matrices w.r.t. lncRNAs and proteins. Different LPI-KCGCN variants are ensembled to derive the final prediction results and are tested on balanced and unbalanced datasets. The 5-fold cross-validation shows that the optimal feature information combination on a dataset with 15.5% positive samples has an AUC value of 0.9714 and an AUPR value of 0.9216. On another highly unbalanced dataset with only 5% positive samples, LPI-KCGCN also outperformed the state-of-the-art works, achieving an AUC value of 0.9907 and an AUPR value of 0.9267.
9
Nopour R. Screening ovarian cancer by using risk factors: machine learning assists. Biomed Eng Online 2024; 23:18. [PMID: 38347611 PMCID: PMC10863117 DOI: 10.1186/s12938-024-01219-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 02/06/2024] [Indexed: 02/15/2024] Open
Abstract
BACKGROUND AND AIM: Ovarian cancer (OC) is a prevalent and aggressive malignancy that poses a significant public health challenge. The lack of preventive strategies for OC increases morbidity, mortality, and other negative consequences. Screening for OC through risk prediction could be a powerful preventive strategy, but it has not received much attention. This study therefore aimed to use machine learning approaches as predictive assistance solutions to screen groups at high risk of OC and achieve practical preventive purposes. MATERIALS AND METHODS: As this study is data-driven and retrospective in nature, we used data on 1516 women with suspected OC from a single database covering six clinical settings in Sari City from 2015 to 2019. Six machine learning (ML) algorithms, including XG-Boost, Random Forest (RF), J-48, support vector machine (SVM), K-nearest neighbor (KNN), and artificial neural network (ANN), were used to construct prediction models for OC. To choose the best model for predicting OC, we compared the prediction models using the area under the receiver operating characteristic curve (AU-ROC). RESULTS: The experimental results showed that XG-Boost, with AU-ROC = 0.93 (95% CI = [0.91-0.95]), was the best-performing model for predicting OC. CONCLUSIONS: ML approaches offer significant predictive efficiency and interoperability for building effective preventive strategies based on screening groups at high risk of OC.
Affiliation(s)
- Raoof Nopour
  - Department of Health Information Management, Student Research Committee, School of Health Management and Information Sciences Branch, Iran University of Medical Sciences, Tehran, Iran
10
Thadikemalla VSG, Focke NK, Tummala S. A 3D Sparse Autoencoder for Fully Automated Quality Control of Affine Registrations in Big Data Brain MRI Studies. Journal of Imaging Informatics in Medicine 2024; 37:412-427. [PMID: 38343221 DOI: 10.1007/s10278-023-00933-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Revised: 10/13/2023] [Accepted: 10/24/2023] [Indexed: 03/02/2024]
Abstract
This paper presents a fully automated pipeline using a sparse convolutional autoencoder for quality control (QC) of affine registrations in large-scale T1-weighted (T1w) and T2-weighted (T2w) magnetic resonance imaging (MRI) studies. Here, a customized 3D convolutional encoder-decoder (autoencoder) framework is proposed and the network is trained in a fully unsupervised manner. For cross-validating the proposed model, we used 1000 correctly aligned MRI images of the human connectome project young adult (HCP-YA) dataset. We proposed that the quality of the registration is proportional to the reconstruction error of the autoencoder. Further, to make this method applicable to unseen datasets, we have proposed dataset-specific optimal threshold calculation (using the reconstruction error) from ROC analysis that requires a subset of the correctly aligned and artificially generated misalignments specific to that dataset. The calculated optimum threshold is used for testing the quality of remaining affine registrations from the corresponding datasets. The proposed framework was tested on four unseen datasets from autism brain imaging data exchange (ABIDE I, 215 subjects), information eXtraction from images (IXI, 577 subjects), Open Access Series of Imaging Studies (OASIS4, 646 subjects), and "Food and Brain" study (77 subjects). The framework has achieved excellent performance for T1w and T2w affine registrations with an accuracy of 100% for HCP-YA. Further, we evaluated the generality of the model on four unseen datasets and obtained accuracies of 81.81% for ABIDE I (only T1w), 93.45% (T1w) and 81.75% (T2w) for OASIS4, and 92.59% for "Food and Brain" study (only T1w) and in the range 88-97% for IXI (for both T1w and T2w and stratified concerning scanner vendor and magnetic field strengths). Moreover, the real failures from "Food and Brain" and OASIS4 datasets were detected with sensitivities of 100% and 80% for T1w and T2w, respectively. In addition, AUCs of > 0.88 in all scenarios were obtained during threshold calculation on the four test sets.
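The dataset-specific threshold calculation described in this abstract can be illustrated with a small sketch. It assumes that a higher reconstruction error indicates a misaligned image and that the optimum is chosen by maximizing Youden's J (TPR minus FPR); the abstract specifies only "ROC analysis", so that criterion, the function name, and the error values below are all hypothetical.

```python
def best_threshold(errors, labels):
    """Pick the reconstruction-error cutoff that maximizes Youden's J = TPR - FPR.

    labels: 1 = misaligned (positive), 0 = correctly aligned (negative).
    A sample is flagged as misaligned when its error >= threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_j, best_t = -1.0, None
    # Every observed error value is a candidate cutoff on the ROC curve.
    for t in sorted(set(errors)):
        tp = sum(1 for e, y in zip(errors, labels) if e >= t and y == 1)
        fp = sum(1 for e, y in zip(errors, labels) if e >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j


# Hypothetical reconstruction errors: three aligned scans, three synthetic misalignments.
errors = [0.10, 0.12, 0.15, 0.30, 0.35, 0.50]
labels = [0, 0, 0, 1, 1, 1]
threshold, j = best_threshold(errors, labels)
```

In this toy case the two groups are perfectly separable, so the selected cutoff is the smallest error among the misaligned samples. Real data would overlap, and the cutoff would trade sensitivity against false alarms.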
Affiliation(s)
- Venkata Sainath Gupta Thadikemalla
  - Department of Electronics and Communication Engineering, Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, Andhra Pradesh, India
- Niels K Focke
  - Clinic for Neurology, University Medical Center, Göttingen, Germany
- Sudhakar Tummala
  - Department of Electronics and Communication Engineering, School of Engineering and Sciences, SRM University-AP, Andhra Pradesh, India
11
Huang G, Tang X, Zheng P. DeepHLAPred: a deep learning-based method for non-classical HLA binder prediction. BMC Genomics 2023; 24:706. [PMID: 37993812 PMCID: PMC10666343 DOI: 10.1186/s12864-023-09796-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Accepted: 11/08/2023] [Indexed: 11/24/2023] Open
Abstract
Human leukocyte antigen (HLA) is closely involved in regulating the human immune system. Despite great advances in detecting classical HLA Class I binders, there are few methods or toolkits for recognizing non-classical HLA Class I binders. To fill this gap, we have developed a deep learning-based tool called DeepHLAPred. DeepHLAPred used electron-ion interaction pseudo potential, integer numerical mapping, and accumulated amino acid frequency as the initial representation of non-classical HLA binder sequences. A deep learning module was then used to refine these into high-level representations. The module comprised two parallel convolutional neural networks, each followed by a maximum pooling layer, a dropout layer, and a bi-directional long short-term memory network. The experimental results showed that DeepHLAPred reached state-of-the-art performance on the cross-validation test and the independent test. Extensive testing demonstrated the rationality of DeepHLAPred. We further analyzed the sequence patterns of non-classical HLA Class I binders by information entropy. The information entropy of non-classical HLA binder sequences reflected sequence patterns to a certain extent. In addition, we have developed a user-friendly webserver for convenient use, which is available at http://www.biolscience.cn/DeepHLApred/. The tool and the analysis are helpful for detecting non-classical HLA Class I binders. The source code and data are available at https://github.com/tangxingyu0/DeepHLApred.
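For readers unfamiliar with the entropy analysis mentioned in this abstract, the sketch below computes the Shannon entropy of the residues observed at each position across a set of equal-length peptides. Treating the analysis as per-position entropy is an assumption, since the abstract does not say how the entropy was computed, and the peptides are invented examples rather than real HLA binders.

```python
import math
from collections import Counter


def position_entropy(sequences):
    """Shannon entropy (in bits) of the residues at each position.

    Assumes all sequences have the same length. Low entropy marks a
    conserved position; high entropy marks a variable one.
    """
    length = len(sequences[0])
    entropies = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


# Hypothetical 4-residue peptides: position 0 is fully conserved.
peptides = ["ALDK", "ALEK", "AMDR", "AMER"]
entropies = position_entropy(peptides)
```

A position where every peptide carries the same residue has zero entropy, while a position split evenly between two residues has one bit, so conserved anchor-like positions stand out as dips in the profile.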
Affiliation(s)
- Guohua Huang
  - School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha, Hunan, 410215, China
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Xingyu Tang
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Peijie Zheng
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
12
Frisby TS, Langmead CJ. Identifying promising sequences for protein engineering using a deep transformer protein language model. Proteins 2023; 91:1471-1486. [PMID: 37337902 DOI: 10.1002/prot.26536] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/16/2023] [Revised: 05/10/2023] [Accepted: 05/23/2023] [Indexed: 06/21/2023]
Abstract
Protein engineers aim to discover and design novel sequences with targeted, desirable properties. Given the near-limitless size of the protein sequence landscape, it is no surprise that these desirable sequences are often a relative rarity. This makes identifying such sequences a costly and time-consuming endeavor. In this work, we show how to use a deep transformer protein language model to identify sequences that have the most promise. Specifically, we use the model's self-attention map to calculate a Promise Score that weights the relative importance of a given sequence according to predicted interactions with a specified binding partner. This Promise Score can then be used to identify strong binders worthy of further study and experimentation. We use the Promise Score within two protein engineering contexts: Nanobody (Nb) discovery and protein optimization. With Nb discovery, we show how the Promise Score provides an effective way to select lead sequences from Nb repertoires. With protein optimization, we show how to use the Promise Score to select site-specific mutagenesis experiments that identify a high percentage of improved sequences. In both cases, we also show how the self-attention map used to calculate the Promise Score can indicate which regions of a protein are involved in intermolecular interactions that drive the targeted property. Finally, we describe how to fine-tune the transformer protein language model to learn a predictive model for the targeted property, and discuss the capabilities and limitations of fine-tuning with and without knowledge transfer within the context of protein engineering.
Affiliation(s)
- Trevor S Frisby
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | | |
|
13
|
Wu Q, Wang J, Sun Z, Xiao L, Ying W, Shi J. Immunotherapy Efficacy Prediction for Non-Small Cell Lung Cancer Using Multi-View Adaptive Weighted Graph Convolutional Networks. IEEE J Biomed Health Inform 2023; 27:5564-5575. [PMID: 37643107 DOI: 10.1109/jbhi.2023.3309840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Immunotherapy is an effective way to treat non-small cell lung cancer (NSCLC). The efficacy of immunotherapy differs from person to person and may cause side effects, making it important to predict the efficacy of immunotherapy before surgery. Radiomics based on machine learning has been successfully used to predict the efficacy of NSCLC immunotherapy. However, most studies only considered the radiomic features of the individual patient, ignoring the inter-patient correlations. Besides, they usually concatenated different features as the input of a single-view model, failing to consider the complex correlation among features of multiple types. To this end, we propose a multi-view adaptive weighted graph convolutional network (MVAW-GCN) for the prediction of NSCLC immunotherapy efficacy. Specifically, we group the radiomic features into several views according to the type of the filtered images they were extracted from. We construct a graph in each view based on the radiomic features and phenotypic information. An attention mechanism is introduced to automatically assign weights to each view. Considering the view-shared and view-specific knowledge of radiomic features, we propose separable graph convolution that decomposes the output of the last convolution layer into two components, i.e., the view-shared and view-specific outputs. We maximize the consistency and enhance the diversity among different views in the learning procedure. The proposed MVAW-GCN is evaluated on 107 NSCLC patients, including 52 patients with valid efficacy and 55 patients with invalid efficacy. Our method achieved an accuracy of 77.27% and an area under the curve (AUC) of 0.7780, indicating its effectiveness in NSCLC immunotherapy efficacy prediction.
|
14
|
Xia F, Guo F, Liu Z, Zeng J, Ma X, Yu C, Li C. Enhanced CT combined with texture analysis for differential diagnosis of pleomorphic adenoma and adenolymphoma. BMC Med Imaging 2023; 23:169. [PMID: 37891554 PMCID: PMC10612226 DOI: 10.1186/s12880-023-01129-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Accepted: 10/18/2023] [Indexed: 10/29/2023] Open
Abstract
OBJECTIVE This study sought to evaluate the value of the general characteristics of enhanced CT images and the histogram parameters of each phase in distinguishing pleomorphic adenoma (PA) from adenolymphoma (AL). METHODS The imaging features and histogram parameters of preoperative enhanced CT images in 20 patients with PA and 29 patients with AL were analyzed. Tumor morphology and histogram parameters of PA and AL were compared. Receiver operating characteristic (ROC) analysis, with area under the curve (AUC), sensitivity, and specificity, was used to determine the differential diagnostic effect of single-phase or multi-phase parameter combinations. RESULTS The differences in CT value and net enhancement value in the arterial phase (AP) were significant (p < 0.05). Among the nine histogram parameters of each phase, the flat sweep phase (FSP) and AP mean and 10th, 50th, 90th, and 99th percentiles, the AP variance, and the venous phase (VP) kurtosis also differed significantly (p < 0.05). ROC analysis showed that, for single-parameter differential diagnosis, the 90th percentile of FSP had the largest AUC (0.870). In multi-parameter combination diagnosis, the combination of the mean value of FSP, the 90th percentile of AP, and the kurtosis of VP had the best diagnostic efficacy, with an AUC of 0.925 and a sensitivity and specificity of 0.900 and 0.850, respectively. CONCLUSION The histogram analysis of enhanced CT images is valuable for the differentiation of PA and AL. Moreover, the combination of single-phase parameters or multi-phase parameters can improve the differential diagnosis efficiency.
Affiliation(s)
- Feifei Xia
- Department of Oral and Maxillofacial Surgery, the First Affiliated Hospital of Shihezi University, Shihezi, 832000, China
| | - Foqing Guo
- Department of Oral and Maxillofacial Surgery, the First Affiliated Hospital of Shihezi University, Shihezi, 832000, China
| | - Zhe Liu
- Department of Oral and Maxillofacial Surgery, the First Affiliated Hospital of Shihezi University, Shihezi, 832000, China
| | - Jie Zeng
- Department of Oral and Maxillofacial Surgery, the First Affiliated Hospital of Shihezi University, Shihezi, 832000, China
| | - Xuehua Ma
- Department of Oral and Maxillofacial Surgery, the First Affiliated Hospital of Shihezi University, Shihezi, 832000, China
| | - Chongqing Yu
- Department of Oral and Maxillofacial Surgery, the First Affiliated Hospital of Shihezi University, Shihezi, 832000, China
| | - Changxue Li
- Department of Oral and Maxillofacial Surgery, the First Affiliated Hospital of Shihezi University, Shihezi, 832000, China.
| |
|
15
|
Jia J, Wei Z, Sun M. EMDL_m6Am: identifying N6,2'-O-dimethyladenosine sites based on stacking ensemble deep learning. BMC Bioinformatics 2023; 24:397. [PMID: 37880673 PMCID: PMC10598967 DOI: 10.1186/s12859-023-05543-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Accepted: 10/20/2023] [Indexed: 10/27/2023] Open
Abstract
BACKGROUND N6,2'-O-dimethyladenosine (m6Am) is an abundant RNA methylation modification on vertebrate mRNAs and is present in the transcription initiation region of mRNAs. It has recently been experimentally shown to be associated with several human disorders, including obesity and stomach cancer, among others. As a result, correctly identifying m6Am sites is important for understanding RNA regulation. RESULTS This study proposes a novel deep learning-based m6Am prediction model, EMDL_m6Am, which employs one-hot encoding to express the feature map of the RNA sequence and recognizes m6Am sites by integrating different CNN models, including DenseNet, Inflated Convolutional Network (DCNN) and Deep Multiscale Residual Network (MSRN), via stacking. The sensitivity (Sn), specificity (Sp), accuracy (ACC), Matthews correlation coefficient (MCC) and area under the curve (AUC) of our model on the training data set reach 86.62%, 88.94%, 87.78%, 0.7590 and 0.8778, respectively, and the prediction results on the independent test set are 82.25%, 79.72%, 80.98%, 0.6199, and 0.8211. CONCLUSIONS The experimental results demonstrated that EMDL_m6Am greatly improved the predictive performance for m6Am sites and could provide a valuable reference for the next part of the study. The source code and experimental data are available at: https://github.com/13133989982/EMDL-m6Am .
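The one-hot encoding step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the helper name `one_hot_encode`, the A, C, G, U channel order, and the all-zero row for unknown bases are assumptions made for illustration.

```python
# Illustrative sketch only; the channel order A, C, G, U is an assumption.
NUCLEOTIDES = "ACGU"


def one_hot_encode(sequence: str) -> list[list[int]]:
    """Return one 4-element row per base; unknown bases become all zeros."""
    rows = []
    # Treat DNA-style T as U so that both alphabets map to the same channels.
    for base in sequence.upper().replace("T", "U"):
        row = [0] * len(NUCLEOTIDES)
        idx = NUCLEOTIDES.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for base, row in zip("AUGCN", one_hot_encode("AUGCN")):
        print(base, row)
```

Each sequence of length L becomes an L x 4 matrix, which is the kind of input a convolutional model can consume directly.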
Affiliation(s)
- Jianhua Jia
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China.
| | - Zhangying Wei
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China.
| | - Mingwei Sun
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
| |
|
16
|
Sun M, Hu H, Pang W, Zhou Y. ACP-BC: A Model for Accurate Identification of Anticancer Peptides Based on Fusion Features of Bidirectional Long Short-Term Memory and Chemically Derived Information. Int J Mol Sci 2023; 24:15447. [PMID: 37895128 PMCID: PMC10607064 DOI: 10.3390/ijms242015447] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/12/2023] [Revised: 09/10/2023] [Accepted: 10/20/2023] [Indexed: 10/29/2023] Open
Abstract
Anticancer peptides (ACPs) have been proven to possess potent anticancer activities. Although computational methods have emerged for rapid ACP identification, their accuracy still needs improvement. In this study, we propose a model called ACP-BC, a three-channel end-to-end model that utilizes various combinations of data augmentation techniques. In the first channel, features are extracted from the raw sequence using a bidirectional long short-term memory network. In the second channel, the entire sequence is converted into a chemical molecular formula, which is further simplified using Simplified Molecular Input Line Entry System notation to obtain deep abstract features through Bidirectional Encoder Representations from Transformers (BERT). In the third channel, we manually selected four effective features according to dipeptide composition, binary profile feature, k-mer sparse matrix, and pseudo amino acid composition. Notably, the application of chemical BERT in predicting ACPs is novel and successfully integrated into our model. To validate the performance of our model, we selected two benchmark datasets, ACPs740 and ACPs240. ACP-BC achieved prediction accuracies of 87% and 90% on these two datasets, respectively, representing improvements of 1.3% and 7% compared to existing state-of-the-art methods on these datasets. Systematic comparative experiments show that ACP-BC can effectively identify anticancer peptides.
Affiliation(s)
- Mingwei Sun
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (M.S.); (H.H.)
| | - Haoyuan Hu
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (M.S.); (H.H.)
| | - Wei Pang
- School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK;
| | - You Zhou
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (M.S.); (H.H.)
- College of Software, Jilin University, Changchun 130012, China
| |
|
17
|
Wu Q, Chang Y, Yang C, Liu H, Chen F, Dong H, Chen C, Luo Q. Adjuvant chemotherapy or no adjuvant chemotherapy? A prediction model for the risk stratification of recurrence or metastasis of nasopharyngeal carcinoma combining MRI radiomics with clinical factors. PLoS One 2023; 18:e0287031. [PMID: 37751422 PMCID: PMC10522047 DOI: 10.1371/journal.pone.0287031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2023] [Accepted: 05/28/2023] [Indexed: 09/28/2023] Open
Abstract
BACKGROUND Should adjuvant chemotherapy (AC) be offered to nasopharyngeal carcinoma (NPC) patients? Different guidelines provide different recommendations. METHODS In this retrospective study, a total of 140 patients were enrolled and followed for 3 years, with 24 clinical features being collected. The imaging features on the enhanced-MRI sequence were extracted using the PyRadiomics platform. The Pearson correlation coefficient and a random forest were used to filter the features associated with recurrence or metastasis. A clinical-radiomics model (CRM) was constructed by Cox multivariable analysis in the training cohort and was validated in the validation cohort. All patients were divided into high- and low-risk groups by the median Rad-score of the model. Kaplan-Meier survival curves were used to compare the 3-year recurrence or metastasis free rate (RMFR) of patients with or without AC in the high- and low-risk groups. RESULTS In total, 960 imaging features were extracted. A CRM was constructed from nine features (seven imaging features and two clinical factors). In the training cohort, the area under curve (AUC) of CRM for 3-year RMFR was 0.872 (P <0.001), and the sensitivity and specificity were 0.935 and 0.672, respectively; in the validation cohort, the AUC was 0.864 (P <0.001), and the sensitivity and specificity were 1.00 and 0.75, respectively. The Kaplan-Meier curves showed that the 3-year RMFR and 3-year cancer specific survival (CSS) rate in the high-risk group were significantly lower than those in the low-risk group (P <0.001). In the high-risk group, patients who received AC had greater 3-year RMFR than those who did not receive AC (78.6% vs. 48.1%) (p = 0.03). CONCLUSION With respect to increasing RMFR, a prediction model for NPC based on two clinical factors and seven imaging features suggested that AC should be added for patients in the high-risk group but not for those in the low-risk group.
Affiliation(s)
- Qiaoyuan Wu
- The Public Experimental Center of Medicine, Department of Pathology, Affiliated Hospital of Zunyi Medical University, Zunyi, Guizhou, P. R. China
| | - Yonghu Chang
- School of Medical Information Engineering of Zunyi Medical University, Zunyi Medical University, Zunyi, Guizhou, P. R. China
| | - Cheng Yang
- The Third Clinical Medical College of Ningxia Medical University, Yinchuan, Ningxia, P. R. China
| | - Heng Liu
- Department of Radiology, Affiliated Hospital of Zunyi Medical University, Zunyi, Guizhou, P. R. China
| | - Fang Chen
- The Public Experimental Center of Medicine, Department of Pathology, Affiliated Hospital of Zunyi Medical University, Zunyi, Guizhou, P. R. China
| | - Hui Dong
- The Public Experimental Center of Medicine, Department of Pathology, Affiliated Hospital of Zunyi Medical University, Zunyi, Guizhou, P. R. China
| | - Cheng Chen
- Department of Thoracic Surgery, Affiliated Hospital of Zunyi Medical University, Zunyi, Guizhou, P.R. China
| | - Qing Luo
- The Public Experimental Center of Medicine, Department of Pathology, Affiliated Hospital of Zunyi Medical University, Zunyi, Guizhou, P. R. China
| |
|
18
|
Sharif Rahmani E, Lawarde A, Lingasamy P, Moreno SV, Salumets A, Modhukur V. MBMethPred: a computational framework for the accurate classification of childhood medulloblastoma subgroups using data integration and AI-based approaches. Front Genet 2023; 14:1233657. [PMID: 37745846 PMCID: PMC10513500 DOI: 10.3389/fgene.2023.1233657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Accepted: 08/24/2023] [Indexed: 09/26/2023] Open
Abstract
Childhood medulloblastoma is a malignant form of brain tumor that is widely classified into four subgroups based on molecular and genetic characteristics. Accurate classification of these subgroups is crucial for appropriate treatment, monitoring plans, and targeted therapies. However, misclassification between groups 3 and 4 is common. To address this issue, an AI-based R package called MBMethPred was developed based on DNA methylation and gene expression profiles of 763 medulloblastoma samples to classify subgroups using machine learning and neural network models. The developed prediction models achieved a classification accuracy of over 96% for subgroup classification by using 399 CpGs as prediction biomarkers. We also assessed the prognostic relevance of prediction biomarkers using survival analysis. Furthermore, we identified subgroup-specific drivers of medulloblastoma using functional enrichment analysis, Shapley values, and gene network analysis. In particular, the genes involved in the nervous system development process have the potential to separate medulloblastoma subgroups with 99% accuracy. Notably, our analysis identified 16 genes that were specifically significant for subgroup classification, including EP300, CXCR4, WNT4, ZIC4, MEIS1, SLC8A1, NFASC, ASCL2, KIF5C, SYNGAP1, SEMA4F, ROR1, DPYSL4, ARTN, RTN4RL1, and TLX2. Our findings contribute to enhanced survival outcomes for patients with medulloblastoma. Continued research and validation efforts are needed to further refine and expand the utility of our approach in other cancer types, advancing personalized medicine in pediatric oncology.
Affiliation(s)
| | - Ankita Lawarde
- Competence Centre on Health Technologies, Tartu, Estonia
- Department of Obstetrics and Gynecology, Institute of Clinical Medicine, University of Tartu, Tartu, Estonia
| | | | - Sergio Vela Moreno
- Competence Centre on Health Technologies, Tartu, Estonia
- Department of Obstetrics and Gynecology, Institute of Clinical Medicine, University of Tartu, Tartu, Estonia
| | - Andres Salumets
- Competence Centre on Health Technologies, Tartu, Estonia
- Department of Obstetrics and Gynecology, Institute of Clinical Medicine, University of Tartu, Tartu, Estonia
- Division of Obstetrics and Gynecology, Department of Clinical Science, Intervention and Technology, Karolinska Institute and Karolinska University Hospital, Stockholm, Sweden
| | - Vijayachitra Modhukur
- Competence Centre on Health Technologies, Tartu, Estonia
- Department of Obstetrics and Gynecology, Institute of Clinical Medicine, University of Tartu, Tartu, Estonia
| |
|
19
|
Zhang P, Wu H. IChrom-Deep: An Attention-Based Deep Learning Model for Identifying Chromatin Interactions. IEEE J Biomed Health Inform 2023; 27:4559-4568. [PMID: 37402191 DOI: 10.1109/jbhi.2023.3292299] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 07/06/2023]
Abstract
Identification of chromatin interactions is crucial for advancing our knowledge of gene regulation. However, due to the limitations of high-throughput experimental techniques, there is an urgent need to develop computational methods for predicting chromatin interactions. In this study, we propose a novel attention-based deep learning model, termed IChrom-Deep, to identify chromatin interactions using sequence features and genomic features. The experimental results based on the datasets of three cell lines demonstrate that the IChrom-Deep achieves satisfactory performance and is superior to the previous methods. We also investigate the effect of DNA sequence and associated features and genomic features on chromatin interactions, and highlight the applicable scenarios of some features, such as sequence conservation and distance. Moreover, we identify a few genomic features that are extremely important across different cell lines, and IChrom-Deep achieves comparable performance with only these significant genomic features versus using all genomic features. It is believed that IChrom-Deep can serve as a useful tool for future studies that seek to identify chromatin interactions.
|
20
|
Jia J, Wei Z, Cao X. EMDL-ac4C: identifying N4-acetylcytidine based on ensemble two-branch residual connection DenseNet and attention. Front Genet 2023; 14:1232038. [PMID: 37519885 PMCID: PMC10372626 DOI: 10.3389/fgene.2023.1232038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 06/29/2023] [Indexed: 08/01/2023] Open
Abstract
Introduction: N4-acetylcytidine (ac4C) is a critical acetylation modification that has an essential function in protein translation and is associated with a number of human diseases. Methods: Identifying ac4C sites by biological experiments is cumbersome and costly, and the performance of several existing computational models needs to be improved. Therefore, we propose a new deep learning tool, EMDL-ac4C, to predict ac4C sites. It applies a simple one-hot encoding to an unbalanced dataset and uses a downsampled ensemble deep learning network to extract important features and identify ac4C sites. The base learner of this ensemble model consists of a modified DenseNet and Squeeze-and-Excitation Networks. In addition, we innovatively add a convolutional residual structure in parallel with the dense block to achieve the effect of two-layer feature extraction. Results: The average accuracy (Acc), Matthews correlation coefficient (MCC), and area under the curve (AUC) of EMDL-ac4C on ten independent testing sets are 80.84%, 61.77%, and 87.94%, respectively. Discussion: Multiple experimental comparisons indicate that EMDL-ac4C outperforms existing predictors and greatly improves the predictive performance for ac4C sites. At the same time, EMDL-ac4C could provide a valuable reference for the next part of the study. The source code and experimental data are available at: https://github.com/13133989982/EMDLac4C.
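For readers unfamiliar with the reported metrics, the following sketch shows how sensitivity, specificity, accuracy, and the Matthews correlation coefficient are conventionally computed from a binary confusion matrix. The function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Standard binary metrics; assumes both classes occur in the data."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in classification_metrics(tp=80, tn=90, fp=10, fn=20).items():
        print(f"{name}: {value:.4f}")
```

MCC ranges from -1 to 1, so a value quoted as a percentage (as in this abstract) corresponds to the same quantity multiplied by 100.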
Affiliation(s)
- Jianhua Jia
- *Correspondence: Jianhua Jia, ; Zhangying Wei,
| | | | | |
|
21
|
Sun X, Zhao J, Guo C, Zhu X. Early Prediction of Epilepsy after Encephalitis in Childhood Based on EEG and Clinical Features. Emerg Med Int 2023; 2023:8862598. [PMID: 37485251 PMCID: PMC10359137 DOI: 10.1155/2023/8862598] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Revised: 03/20/2023] [Accepted: 03/30/2023] [Indexed: 07/25/2023] Open
Abstract
Objective The present study was designed to establish and evaluate an early prediction model of epilepsy after encephalitis in childhood based on electroencephalogram (EEG) and clinical features. Methods 255 patients with encephalitis were randomly divided into training and verification sets and were divided into postencephalitic epilepsy (PE) and no postencephalitic epilepsy (non-PE) groups according to whether epilepsy occurred one year after discharge. Univariate and multivariate logistic regression analyses were used to screen the risk factors for PE. The identified risk factors were used to establish and verify a model. Results This study included 255 patients with encephalitis, including 209 in the non-PE group and 46 in the PE group. Univariate and multivariate logistic regression analysis showed that hemoglobin (OR = 0.968, 95% CI = 0.951-0.958), epilepsy frequency (OR = 0.968, 95% CI = 0.951-0.958), and EEG slow wave/fast wave frequency (S/F) in the occipital region were independent influencing factors for PE (P < 0.05). The prediction model is based on the above factors: -0.031 × hemoglobin - 2.113 × epilepsy frequency + 7.836 × occipital region S/F + 1.595. In the training set and the validation set, the area under the ROC curve (AUC) of the model for the diagnosis of PE was 0.835 and 0.712, respectively. Conclusion The peripheral blood hemoglobin, the number of epileptic seizures in the acute stage of encephalitis, and EEG slow wave/fast wave frequencies can be used as predictors of epilepsy after encephalitis.
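The linear predictor quoted in this abstract can be evaluated as a small worked example. The coefficients are copied from the abstract; everything else is an assumption: the abstract does not state units or variable coding, the logistic link is assumed because the model came from logistic regression, and the input values below are hypothetical.

```python
import math


def pe_linear_score(hemoglobin: float, epilepsy_frequency: float, occipital_sf: float) -> float:
    """Linear predictor exactly as printed in the abstract."""
    return (-0.031 * hemoglobin
            - 2.113 * epilepsy_frequency
            + 7.836 * occipital_sf
            + 1.595)


def pe_probability(score: float) -> float:
    """Assumed logistic link mapping the score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    # Hypothetical inputs for illustration; not patient data from the study.
    score = pe_linear_score(hemoglobin=120.0, epilepsy_frequency=1.0, occipital_sf=0.5)
    print(f"score = {score:.3f}, probability = {pe_probability(score):.3f}")
```

This is only an arithmetic illustration of the published formula and should not be used for clinical decisions.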
Affiliation(s)
- Xiaojuan Sun
- Department of Pediatrics, The Second Affiliated Hospital of Nantong University, Nantong First People's Hospital, Nantong, Jiangsu, China
| | - Jinhua Zhao
- Department of Pediatrics, The Second Affiliated Hospital of Nantong University, Nantong First People's Hospital, Nantong, Jiangsu, China
| | - Chunyun Guo
- Department of Pediatrics, The Second Affiliated Hospital of Nantong University, Nantong First People's Hospital, Nantong, Jiangsu, China
| | - Xiaoxiao Zhu
- Department of Pediatrics, The Second Affiliated Hospital of Nantong University, Nantong First People's Hospital, Nantong, Jiangsu, China
| |
|
22
|
Karlsen ST, Rau MH, Sánchez BJ, Jensen K, Zeidan AA. From genotype to phenotype: computational approaches for inferring microbial traits relevant to the food industry. FEMS Microbiol Rev 2023; 47:fuad030. [PMID: 37286882 PMCID: PMC10337747 DOI: 10.1093/femsre/fuad030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 05/31/2023] [Accepted: 06/06/2023] [Indexed: 06/09/2023] Open
Abstract
When selecting microbial strains for the production of fermented foods, various microbial phenotypes need to be taken into account to achieve target product characteristics, such as biosafety, flavor, texture, and health-promoting effects. Through continuous advances in sequencing technologies, microbial whole-genome sequences of increasing quality can now be obtained both cheaper and faster, which increases the relevance of genome-based characterization of microbial phenotypes. Prediction of microbial phenotypes from genome sequences makes it possible to quickly screen large strain collections in silico to identify candidates with desirable traits. Several microbial phenotypes relevant to the production of fermented foods can be predicted using knowledge-based approaches, leveraging our existing understanding of the genetic and molecular mechanisms underlying those phenotypes. In the absence of this knowledge, data-driven approaches can be applied to estimate genotype-phenotype relationships based on large experimental datasets. Here, we review computational methods that implement knowledge- and data-driven approaches for phenotype prediction, as well as methods that combine elements from both approaches. Furthermore, we provide examples of how these methods have been applied in industrial biotechnology, with special focus on the fermented food industry.
Affiliation(s)
- Signe T Karlsen
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| | - Martin H Rau
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| | - Benjamín J Sánchez
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| | - Kristian Jensen
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| | - Ahmad A Zeidan
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| |
|
23
|
Qian Y, Shang T, Guo F, Wang C, Cui Z, Ding Y, Wu H. Identification of DNA-binding protein based multiple kernel model. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:13149-13170. [PMID: 37501482 DOI: 10.3934/mbe.2023586] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/29/2023]
Abstract
DNA-binding proteins (DBPs) play a critical role in the development of drugs for treating genetic diseases and in DNA biology research. It is therefore essential to predict DNA-binding proteins more accurately and efficiently. In this paper, a Laplacian Local Kernel Alignment-based Restricted Kernel Machine (LapLKA-RKM) is proposed to predict DBPs. In detail, we first extract features from the protein sequence using six methods. Second, the Radial Basis Function (RBF) kernel function is utilized to construct pre-defined kernel metrics. Then, these metrics are combined linearly by weights calculated by LapLKA. Finally, the fused kernel is input to RKM for training and prediction. Independent tests and leave-one-out cross-validation were used to validate the performance of our method on a small dataset and two large datasets. Importantly, we built an online platform to host our model, which is now freely accessible via http://8.130.69.121:8082/.
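The kernel construction and linear fusion steps described here can be sketched as follows. This is a minimal illustration, not the LapLKA-RKM method: the function names, the toy feature views, the `gamma` value, and the fixed weights are all assumptions, whereas in the paper the weights are computed by LapLKA.

```python
import math

Matrix = list[list[float]]


def rbf_kernel(samples: Matrix, gamma: float) -> Matrix:
    """K[i][j] = exp(-gamma * ||x_i - x_j||^2) for one feature view."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in samples]
        for x in samples
    ]


def fuse_kernels(kernels: list[Matrix], weights: list[float]) -> Matrix:
    """Weighted linear combination of same-sized kernel matrices."""
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for k, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two toy feature views of the same two samples (hypothetical values).
    view_a = [[0.0, 1.0], [1.0, 0.0]]
    view_b = [[0.5], [1.5]]
    kernels = [rbf_kernel(view_a, gamma=0.5), rbf_kernel(view_b, gamma=0.5)]
    # Fixed illustrative weights; the paper learns them instead.
    for row in fuse_kernels(kernels, weights=[0.6, 0.4]):
        print([round(v, 4) for v in row])
```

When the weights are non-negative, the fused matrix remains a valid kernel, which is why a weighted sum is a common way to combine views before passing them to a kernel machine.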
Affiliation(s)
- Yuqing Qian
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Tingting Shang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Fei Guo
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Chunliang Wang
- The Second Affiliated Hospital of Soochow University, Suzhou, China
| | - Zhiming Cui
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| |
|
24
|
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023] Open
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Affiliation(s)
- Gabrielle D Smith
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Wan Hern Ching
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
| | - Paola Cornejo-Páramo
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Emily S Wong
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia.
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia.
| |
|
25
|
Zhu J, Ge M, Chang Z, Dong W. CRCNet: Global-local context and multi-modality cross attention for polyp segmentation. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2023.104593] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2023]
|
26
|
Sadad T, Aurangzeb RA, Safran M, Alfarhood S, Kim J. Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models. Biomedicines 2023; 11:biomedicines11051323. [PMID: 37238994 DOI: 10.3390/biomedicines11051323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 04/18/2023] [Accepted: 04/25/2023] [Indexed: 05/28/2023] Open
Abstract
Viruses infect millions of people worldwide each year, and some can lead to cancer or increase the risk of cancer. As viruses have highly mutable genomes, new viruses may emerge in the future, such as COVID-19 and influenza. Traditional virology relies on predefined rules to identify viruses, but new viruses may be completely or partially divergent from the reference genome, rendering statistical methods and similarity calculations insufficient for all genome sequences. Identifying DNA/RNA-based viral sequences is a crucial step in differentiating different types of lethal pathogens, including their variants and strains. While various tools in bioinformatics can align them, expert biologists are required to interpret the results. Computational virology is a scientific field that studies viruses, their origins, and drug discovery, where machine learning plays a crucial role in extracting domain- and task-specific features to tackle this challenge. This paper proposes a genome analysis system that uses advanced deep learning to identify dozens of viruses. The system uses nucleotide sequences from the NCBI GenBank database and a BERT tokenizer to extract features from the sequences by breaking them down into tokens. We also generated synthetic data for viruses with small sample sizes. The proposed system has two components: a scratch BERT architecture specifically designed for DNA analysis, which is used to learn the next codons unsupervised, and a classifier that identifies important features and understands the relationship between genotype and phenotype. Our system achieved an accuracy of 97.69% in identifying viral sequences.
Affiliation(s)
- Tariq Sadad
- Department of Computer Science, University of Engineering & Technology, Mardan 23200, Pakistan
- Raja Atif Aurangzeb
- Department of Computer Science & Software Engineering, International Islamic University Islamabad, Islamabad 44000, Pakistan
- Mejdl Safran
- Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
- Sultan Alfarhood
- Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
- Jungsuk Kim
- Department of Biomedical Engineering, Gachon University, Seongnam-si 13120, Republic of Korea
|
27
|
Enhanced Preprocessing Approach Using Ensemble Machine Learning Algorithms for Detecting Liver Disease. Biomedicines 2023; 11:biomedicines11020581. [PMID: 36831118 PMCID: PMC9953600 DOI: 10.3390/biomedicines11020581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Revised: 01/30/2023] [Accepted: 01/31/2023] [Indexed: 02/18/2023] Open
Abstract
There has been a sharp increase in liver disease globally, and many people are dying without even knowing that they have it. Because its symptoms are limited, liver disease is extremely difficult to detect until the very last stage. With early detection, patients can begin treatment sooner, which can save their lives. Ensemble learning algorithms have become increasingly popular because they perform better than traditional machine learning algorithms. In this context, this paper proposes a novel architecture based on ensemble learning and enhanced preprocessing to predict liver disease using the Indian Liver Patient Dataset (ILPD). Six ensemble learning algorithms are applied to the ILPD, and their results are compared with those reported in existing studies. The proposed model uses several data preprocessing methods, such as data balancing, feature scaling, and feature selection, to improve accuracy with appropriate imputations. Multivariate imputation is applied to fill in missing values. A log1p transformation is applied to skewed columns, along with standardization, min-max scaling, maximum absolute scaling, and robust scaling. Features are selected using several methods, including univariate selection, feature importance, and a correlation matrix. Gradient Boosting, XGBoost, Bagging, Random Forest, Extra Tree, and Stacking ensemble learning algorithms are then trained on the preprocessed data. The results of the six models were compared with each other, as well as with the models used in other research works. The proposed models using the extra tree classifier and random forest outperformed the other methods, with the highest testing accuracies of 91.82% and 86.06%, respectively, portraying our method as a real-world solution for detecting liver disease.
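The transformation and scaling steps named in the abstract can be sketched for a single numeric column as follows. This is a standard-library illustration rather than the authors' code: the function names are hypothetical, the population standard deviation is assumed for standardization, and robust scaling is assumed to mean centering on the median and dividing by the interquartile range.

```python
import math
from statistics import mean, median, pstdev, quantiles
from typing import List


def log1p_transform(values: List[float]) -> List[float]:
    """log(1 + x), which compresses the long right tail of a skewed column."""
    return [math.log1p(v) for v in values]


def standardize(values: List[float]) -> List[float]:
    """Zero mean, unit variance (population standard deviation assumed)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]


def min_max_scale(values: List[float]) -> List[float]:
    """Map the column linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def max_abs_scale(values: List[float]) -> List[float]:
    """Divide by the largest absolute value, giving a range within [-1, 1]."""
    m = max(abs(v) for v in values)
    return [v / m if m else 0.0 for v in values]


def robust_scale(values: List[float]) -> List[float]:
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    med = median(values)
    return [(v - med) / iqr if iqr else 0.0 for v in values]
```

Robust scaling is the least sensitive of the four scalers to outliers, because the median and interquartile range barely move when a few extreme values are added.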
|
28
|
Jubair S, Domaratzki M. Crop genomic selection with deep learning and environmental data: A survey. Front Artif Intell 2023; 5:1040295. [PMID: 36703955 PMCID: PMC9871498 DOI: 10.3389/frai.2022.1040295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Accepted: 12/22/2022] [Indexed: 01/12/2023] Open
Abstract
Machine learning techniques for crop genomic selection, especially for single-environment plants, are well-developed. These machine learning models, which use dense genome-wide markers to predict phenotype, routinely perform well on single-environment datasets, especially for complex traits affected by multiple markers. In contrast, machine learning models, especially deep learning models, that predict crop phenotype from datasets spanning different environmental conditions have only recently emerged. Models that can accept heterogeneous data sources, such as temperature, soil conditions, and precipitation, are natural choices for modeling genotype-by-environment (GxE) interactions in multi-environment prediction. Here, we review emerging deep learning techniques that incorporate environmental data directly into genomic selection models.
Affiliation(s)
- Sheikh Jubair
- Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada (correspondence: Sheikh Jubair)
- Mike Domaratzki
- Department of Computer Science, University of Western Ontario, London, ON, Canada
|
29
|
Yang TH, Yu YH, Wu SH, Zhang FY. CFA: An explainable deep learning model for annotating the transcriptional roles of cis-regulatory modules based on epigenetic codes. Comput Biol Med 2023; 152:106375. [PMID: 36502693 DOI: 10.1016/j.compbiomed.2022.106375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 11/07/2022] [Accepted: 11/27/2022] [Indexed: 11/30/2022]
Abstract
Metazoan gene expression is controlled by modular DNA segments called cis-regulatory modules (CRMs). CRMs can take on promoter, enhancer, or insulator roles, generating additional layers of regulation in transcription. Experiments for understanding CRM roles are low-throughput and costly, so large-scale investigation of CRM function still depends on computational methods. However, existing in silico tools recognize only enhancers or only promoters, and thus accumulate errors when promoter, enhancer, and insulator roles are considered together. Currently, no algorithm can consider these CRM roles concurrently. In this research, we developed the CRM Function Annotator (CFA) model. CFA provides complete CRM transcriptional role labeling based on the interpretation of epigenetic profiles. We demonstrated that CFA achieves high performance (test macro auROC/auPRC = 94.1%/90.3%) and outperforms existing tools in promoter/enhancer/insulator identification. Inspection of CFA also shows that, when labeling CRM roles, it recognizes explainable epigenetic codes consistent with previous findings. By considering higher-order combinations of the epigenetic codes, CFA significantly reduces false-positive rates in CRM transcriptional role annotation. CFA is available at https://github.com/cobisLab/CFA/.
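The macro auROC reported in the abstract is commonly computed by scoring each class one-vs-rest and averaging the per-class values with equal weight. The sketch below shows that calculation under this assumption; it is not taken from the CFA repository, and the function names are hypothetical. The per-class auROC uses the rank (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive example scores higher, with ties counted as one half.

```python
from typing import List, Sequence


def binary_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """auROC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("auROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auroc(
    y_true: Sequence[str],
    score_rows: Sequence[Sequence[float]],
    classes: List[str],
) -> float:
    """Unweighted mean of one-vs-rest auROC over all classes.

    score_rows[i][j] is the model's score for sample i and classes[j].
    """
    per_class = []
    for j, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[j] for row in score_rows]
        per_class.append(binary_auroc(labels, scores))
    return sum(per_class) / len(per_class)
```

Because every class contributes equally to the average, a rare role such as insulator weighs as much as a frequent one, which is the usual reason for reporting macro rather than micro averages on imbalanced labels.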
Affiliation(s)
- Tzu-Hsien Yang
- Department of Biomedical Engineering, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan.
- Yu-Huai Yu
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
- Sheng-Hang Wu
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
- Fang-Yuan Zhang
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
|
30
|
Cuproptosis-Related LncRNA Signature for Predicting Prognosis of Hepatocellular Carcinoma: A Comprehensive Analysis. DISEASE MARKERS 2022; 2022:3265212. [PMID: 36452343 PMCID: PMC9705118 DOI: 10.1155/2022/3265212] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/16/2022] [Revised: 10/21/2022] [Accepted: 10/25/2022] [Indexed: 11/23/2022]
Abstract
Hepatocellular carcinoma (HCC) is one of the most common malignant tumors worldwide and has a poor prognosis. Cuproptosis is a novel mode of cell death that has only recently been discovered. Considering the critical role of lncRNAs in liver cancer development, the aim of this study was to construct a prognostic signature based on cuproptosis-related lncRNAs (CRlncRNAs). We downloaded RNA-sequencing data and corresponding clinical information of patients with HCC from The Cancer Genome Atlas (TCGA) database. To verify the robustness of the model, we added an external validation set obtained from the Gene Expression Omnibus (GEO): GSE40144. In addition, we identified the cuproptosis-related genes (CRGs) based on previous reports. Pearson correlation analysis, univariate Cox regression, and least absolute shrinkage and selection operator (LASSO) Cox regression analysis were used to screen for genes associated with prognosis. On this basis, multivariate Cox regression and stepAIC were used to further construct and optimize the prognostic model. The simplified signature with the lowest Akaike information criterion (AIC) value was taken as the prognostic signature. Seven different algorithms were used to perform immune infiltration analysis. The single-sample Gene Set Enrichment Analysis (ssGSEA) algorithm was used to identify differences in immune function between the high- and low-risk groups. Finally, in vitro experiments were performed by quantitative real-time PCR (qRT-PCR) analysis using HCC cell lines to validate the expression of prognostic genes. We identified 3 lncRNAs (CYTOR, LINC00205, and LINC01184) as independent risk factors for HCC. The areas under the receiver operating characteristic (ROC) curves (AUCs) at 1, 3, and 5 years were 0.717, 0.633, and 0.607, respectively. The expression levels of 41 immune checkpoints differed significantly between the high- and low-risk groups, as did sensitivity to immunotherapy. The risk model could also serve as a promising predictor of immunotherapeutic response, as verified by the TIDE algorithm (p < 0.001). Overall, we propose a CRlncRNA-related signature that can be used to predict the prognosis of HCC patients and that was validated in an external cohort and in in vitro experiments.
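Two pieces of arithmetic behind a signature of this kind can be sketched briefly: the AIC used to pick the simplified model, and the per-patient risk score used to form high- and low-risk groups. This is a hedged illustration, not the authors' pipeline. The abstract reports no coefficients, so any coefficients passed in are placeholders, and the median cutoff is an assumption (it is a common convention, but the abstract does not state the cutoff). The function names are hypothetical.

```python
from statistics import median
from typing import Dict


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def select_by_aic(candidates: Dict[str, float]) -> str:
    """Return the name of the candidate model with the lowest AIC."""
    return min(candidates, key=candidates.get)


def risk_score(expression: Dict[str, float], coefficients: Dict[str, float]) -> float:
    """Linear predictor of a Cox model: sum of coefficient * expression."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def split_by_median(scores: Dict[str, float]) -> Dict[str, str]:
    """Label each patient 'high' if above the median score, else 'low'.

    The median cutoff is an assumption made for this sketch.
    """
    cutoff = median(scores.values())
    return {pid: "high" if s > cutoff else "low" for pid, s in scores.items()}
```

In use, `coefficients` would map the three lncRNAs (CYTOR, LINC00205, and LINC01184) to the coefficients fitted by the multivariate Cox regression, and `expression` would hold one patient's expression values for the same genes.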
|