1
Vardaxis I, Simovski B, Anzar I, Stratford R, Clancy T. Deep learning of antibody epitopes using positional permutation vectors. Comput Struct Biotechnol J 2024; 23:2695-2707. [PMID: 39035832 PMCID: PMC11260035 DOI: 10.1016/j.csbj.2024.06.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 06/04/2024] [Accepted: 06/04/2024] [Indexed: 07/23/2024]
Abstract
Background: The accurate computational prediction of B cell epitopes can vastly reduce the cost and time required for identifying potential epitope candidates for the design of vaccines and immunodiagnostics. However, current computational tools for B cell epitope prediction perform poorly and are not fit for purpose; there remains enormous room for improvement and a need for superior prediction strategies. Results: Here we propose a novel approach that improves B cell epitope prediction by encoding epitopes as binary positional permutation vectors that represent the position and structural properties of the amino acids within a protein antigen sequence that interact with an antibody. This approach supersedes the traditional method of defining epitopes as scores per amino acid on a protein sequence, where each score reflects each amino acid's predicted probability of partaking in a B cell epitope antibody interaction. In addition to defining epitopes as binary positional permutation vectors, the approach also uses the 3D macrostructure features of the unbound protein structures, and in turn uses these features to train another deep learning model on the corresponding antibody-bound protein 3D structures. This enables the algorithm to learn the key structural and physicochemical features of the unbound protein and embedded epitope that initiate the antibody binding process, helping to eliminate "induced fit" biases in the training data. We demonstrate that the strategy predicts B cell epitopes with improved accuracy compared to the existing tools. Additionally, we show that this approach reliably identifies the majority of experimentally verified epitopes on the spike protein of SARS-CoV-2 not seen by the model during training and generalizes in a very robust manner on dissimilar data not seen by the model during training.
Conclusions: With the approach described herein, a primary protein sequence and a query positional permutation vector encoding a putative epitope are sufficient to predict B cell epitopes in a reliable manner, potentially advancing the use of computational prediction of B cell epitopes in biomedical research applications.
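To make the idea of a binary positional vector concrete, the following is a minimal sketch that marks which residues of an antigen sequence belong to a putative epitope. The function name `encode_epitope`, the 0-based indexing, and the toy sequence are assumptions made for illustration only; the published method also encodes structural properties, which are not modeled here.

```python
from typing import Iterable, List


def encode_epitope(sequence: str, epitope_positions: Iterable[int]) -> List[int]:
    """Return a binary vector with 1 at each epitope residue (0-based positions)."""
    vector = [0] * len(sequence)
    for pos in epitope_positions:
        if not 0 <= pos < len(sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        vector[pos] = 1
    return vector


# Hypothetical antigen fragment with a discontinuous three-residue epitope.
antigen = "MFVFLVLLPL"
vector = encode_epitope(antigen, [1, 2, 7])
print(vector)  # [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
```

A vector of this kind has the same length as the sequence, so a candidate epitope can be passed to a model alongside the sequence itself rather than being read off per-residue scores.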
Affiliation(s)
- Ioannis Vardaxis
  - NEC OncoImmunity AS, Oslo Cancer Cluster, Ullernchausseen 64/66, Oslo 0379, Norway
- Boris Simovski
  - NEC OncoImmunity AS, Oslo Cancer Cluster, Ullernchausseen 64/66, Oslo 0379, Norway
- Irantzu Anzar
  - NEC OncoImmunity AS, Oslo Cancer Cluster, Ullernchausseen 64/66, Oslo 0379, Norway
- Richard Stratford
  - NEC OncoImmunity AS, Oslo Cancer Cluster, Ullernchausseen 64/66, Oslo 0379, Norway
- Trevor Clancy
  - NEC OncoImmunity AS, Oslo Cancer Cluster, Ullernchausseen 64/66, Oslo 0379, Norway
  - Department of Vaccine Informatics, Institute for Tropical Medicine, Nagasaki University, Japan
2
Kang G, Baek SH, Kim YH, Kim DH, Park JW. Genetic Risk Assessment of Nonsyndromic Cleft Lip with or without Cleft Palate by Linking Genetic Networks and Deep Learning Models. Int J Mol Sci 2023; 24:ijms24054557. [PMID: 36901988 PMCID: PMC10003462 DOI: 10.3390/ijms24054557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2023] [Revised: 02/13/2023] [Accepted: 02/20/2023] [Indexed: 03/02/2023]
Abstract
Recent deep learning algorithms have further improved risk classification capabilities. However, an appropriate feature selection method is required to overcome dimensionality issues in population-based genetic studies. In this Korean case-control study of nonsyndromic cleft lip with or without cleft palate (NSCL/P), we compared the predictive performance of models that were developed by using the genetic-algorithm-optimized neural networks ensemble (GANNE) technique with those models that were generated by eight conventional risk classification methods, including polygenic risk score (PRS), random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), and deep-learning-based artificial neural network (ANN). GANNE, which is capable of automatic input SNP selection, exhibited the highest predictive power, especially in the 10-SNP model (AUC of 88.2%), thus improving the AUC by 23% and 17% compared to PRS and ANN, respectively. Genes mapped with input SNPs that were selected by using a genetic algorithm (GA) were functionally validated for risks of developing NSCL/P in gene ontology and protein-protein interaction (PPI) network analyses. The IRF6 gene, which is most frequently selected via GA, was also a major hub gene in the PPI network. Genes such as RUNX2, MTHFR, PVRL1, TGFB3, and TBX22 significantly contributed to predicting NSCL/P risk. GANNE is an efficient disease risk classification method using a minimum optimal set of SNPs; however, further validation studies are needed to ensure the clinical utility of the model for predicting NSCL/P risk.
Affiliation(s)
- Geon Kang
  - Department of Medical Genetics, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
- Seung-Hak Baek
  - Department of Orthodontics, School of Dentistry, Seoul National University, Seoul 03080, Republic of Korea
- Young Ho Kim
  - Department of Orthodontics, The Institute of Oral Health Science, Samsung Medical Center, School of Medicine, Sungkyunkwan University, Seoul 06351, Republic of Korea
- Dong-Hyun Kim
  - Department of Social and Preventive Medicine, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
- Ji Wan Park
  - Department of Medical Genetics, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
  - Corresponding author
3
Bukhari SNH, Jain A, Haq E, Mehbodniya A, Webber J. Machine Learning Techniques for the Prediction of B-Cell and T-Cell Epitopes as Potential Vaccine Targets with a Specific Focus on SARS-CoV-2 Pathogen: A Review. Pathogens 2022; 11:146. [PMID: 35215090 PMCID: PMC8879824 DOI: 10.3390/pathogens11020146] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 12/13/2021] [Revised: 01/19/2022] [Accepted: 01/21/2022] [Indexed: 02/01/2023]
Abstract
Only the part of an antigen (a protein molecule found on the surface of a pathogen) that is composed of epitopes specific to T and B cells is recognized by the human immune system (HIS). Identification of epitopes is considered critical for designing an epitope-based peptide vaccine (EBPV). Although there are a number of vaccine types, EBPVs have received less attention thus far. It is important to mention that EBPVs have a great deal of untapped potential for boosting vaccination safety; they are also less expensive and take a short time to produce. Thus, in order to quickly contain global pandemics such as the ongoing outbreak of coronavirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), as well as epidemics and endemics, EBPVs are considered promising vaccine types. The high mutation rate of SARS-CoV-2 has posed a great challenge to public health worldwide because either the composition of existing vaccines has to be changed or a new vaccine has to be developed to protect against its different variants. In such scenarios, time being the critical factor, EBPVs can be a promising alternative. To design an effective and viable EBPV against different strains of a pathogen, it is important to identify the putative T- and B-cell epitopes. Using the wet-lab experimental approach to identify these epitopes is time-consuming and costly because the experimental screening of a vast number of potential epitope candidates is required. Fortunately, various available machine learning (ML)-based prediction methods have reduced the burden related to the epitope mapping process by decreasing the potential epitope candidate list for experimental trials. Moreover, these methods are also cost-effective, scalable, and fast. This paper presents a systematic review of various state-of-the-art and relevant ML-based methods and tools for predicting T- and B-cell epitopes. Special emphasis is placed on highlighting and analyzing various models for predicting epitopes of SARS-CoV-2, the causative agent of COVID-19. Based on the various methods and tools discussed, future research directions for epitope prediction are presented.
Affiliation(s)
- Syed Nisar Hussain Bukhari
  - University Institute of Computing, Chandigarh University, NH-95, Chandigarh-Ludhiana Highway, Mohali 140413, India
- Amit Jain
  - University Institute of Computing, Chandigarh University, NH-95, Chandigarh-Ludhiana Highway, Mohali 140413, India
- Ehtishamul Haq
  - Department of Biotechnology, University of Kashmir, Srinagar 190006, India
- Abolfazl Mehbodniya
  - Department of Electronics and Communication Engineering, Kuwait College of Science and Technology, Kuwait City 20185145, Kuwait
- Julian Webber
  - Graduate School of Engineering Science, Osaka University, Osaka 560-8531, Japan
4
Cai L, Ren X, Fu X, Peng L, Gao M, Zeng X. iEnhancer-XG: interpretable sequence-based enhancers and their strength predictor. Bioinformatics 2021; 37:1060-1067. [PMID: 33119044 DOI: 10.1093/bioinformatics/btaa914] [Citation(s) in RCA: 56] [Impact Index Per Article: 14.0] [Received: 08/20/2020] [Revised: 09/30/2020] [Accepted: 10/15/2020] [Indexed: 01/10/2023]
Abstract
MOTIVATION: Enhancers are non-coding DNA fragments with high position variability and free scattering. They play an important role in controlling gene expression. As machine learning has become more widely used in identifying enhancers, a number of bioinformatic tools have been developed. Although several models for identifying enhancers and their strengths have been proposed, their accuracy and efficiency have yet to be improved. RESULTS: We propose a two-layer predictor called 'iEnhancer-XG.' It comprises a first-layer predictor (for identifying enhancers) and a second-layer classifier (for their strength) and uses 'XGBoost' as a base classifier and five feature extraction methods, namely, k-Spectrum Profile, Mismatch k-tuple, Subsequence Profile, Position-specific scoring matrix (PSSM) and Pseudo dinucleotide composition (PseDNC). Each method has an independent output. We place the feature vector matrix into the ensemble learning for fusion. We use 'SHapley Additive exPlanations' (SHAP) to provide interpretability for previously black-box machine learning methods and improve their credibility. The accuracies of the ensemble learning method are 0.811 (first layer) and 0.657 (second layer). The rigorous 10-fold cross-validation confirms that the proposed method is significantly better than existing technologies. AVAILABILITY AND IMPLEMENTATION: The source code and dataset for the enhancer predictions have been uploaded to https://github.com/jimmyrate/ienhancer-xg. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
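Of the five feature extraction methods named above, the k-spectrum profile is the simplest to illustrate: it counts every length-k substring of a DNA sequence. The sketch below is a generic version under stated assumptions (normalized frequencies, lexicographic k-mer order, a function name `k_spectrum_profile` chosen here); it is not taken from the iEnhancer-XG source code.

```python
from collections import Counter
from itertools import product
from typing import List


def k_spectrum_profile(sequence: str, k: int = 2, alphabet: str = "ACGT") -> List[float]:
    """Return the frequency of every k-mer over `alphabet`, in lexicographic order."""
    sequence = sequence.upper()
    n_windows = max(len(sequence) - k + 1, 0)
    counts = Counter(sequence[i:i + k] for i in range(n_windows))
    # Windows containing symbols outside the alphabet (e.g. N) are counted in
    # the denominator but have no slot in the returned vector.
    denominator = max(n_windows, 1)
    return [counts["".join(kmer)] / denominator for kmer in product(alphabet, repeat=k)]


profile = k_spectrum_profile("ACGTAC", k=2)
print(len(profile))  # 16 dinucleotides
print(profile[1])    # frequency of "AC": 2 of 5 windows
```

The resulting fixed-length vector (4^k entries for DNA) can be fed to any classifier, which is what makes this family of features convenient for ensemble methods.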
Affiliation(s)
- Lijun Cai
  - College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
- Xuanbai Ren
  - College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
- Xiangzheng Fu
  - College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
- Li Peng
  - College of Computer Science and Engineering, Hunan University of Science and Technology, 411103 XiangTan, China
- Mingyu Gao
  - College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
5
Haji Abdolvahab M, Venselaar H, Fazeli A, Arab SS, Behmanesh M. Point Mutation Approach to Reduce Antigenicity of Interferon Beta. Int J Pept Res Ther 2020. [DOI: 10.1007/s10989-019-09938-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/24/2023]
6
Uncovering the Tumor Antigen Landscape: What to Know about the Discovery Process. Cancers (Basel) 2020; 12:cancers12061660. [PMID: 32585818 PMCID: PMC7352969 DOI: 10.3390/cancers12061660] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 04/29/2020] [Revised: 06/11/2020] [Accepted: 06/20/2020] [Indexed: 12/14/2022]
Abstract
According to the latest available data, cancer is the second leading cause of death, highlighting the need for novel cancer therapeutic approaches. In this context, immunotherapy is emerging as a reliable first-line treatment for many cancers, particularly metastatic melanoma. Indeed, cancer immunotherapy has attracted great interest following the recent clinical approval of antibodies targeting immune checkpoint molecules, such as PD-1, PD-L1, and CTLA-4, that release the brakes of the immune system, thus reviving a field otherwise poorly explored. Cancer immunotherapy mainly relies on the generation and stimulation of cytotoxic CD8 T lymphocytes (CTLs) within the tumor microenvironment (TME), priming T cells and establishing efficient and durable anti-tumor immunity. Therefore, there is a clear need to define and identify immunogenic T cell epitopes to use in therapeutic cancer vaccines. Naturally presented antigens in the human leucocyte antigen-1 (HLA-I) complex on the tumor surface are the main protagonists in evoking a specific anti-tumor CD8+ T cell response. However, the methodologies for their identification have been a major bottleneck for their reliable characterization. Consequently, the field of antigen discovery has yet to improve. The current review is intended to define what are known today as tumor antigens, with a main focus on CTL antigenic peptides. We also review the techniques developed and employed to date for antigen discovery, exploring both the direct elution of HLA-I peptides and the in silico prediction of epitopes. Finally, the last part of the review analyzes the future challenges and direction of the antigen discovery field.
7
PredLnc-GFStack: A Global Sequence Feature Based on a Stacked Ensemble Learning Method for Predicting lncRNAs from Transcripts. Genes (Basel) 2019; 10:genes10090672. [PMID: 31484412 PMCID: PMC6770532 DOI: 10.3390/genes10090672] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 07/14/2019] [Revised: 08/05/2019] [Accepted: 08/28/2019] [Indexed: 11/16/2022]
Abstract
Long non-coding RNAs (lncRNAs) are a class of RNAs longer than 200 base pairs (bps) that do not encode proteins; nevertheless, lncRNAs have many vital biological functions. A large number of novel transcripts were discovered as a result of the development of high-throughput sequencing technology. Under this circumstance, computational methods for lncRNA prediction are in great demand. In this paper, we consider global sequence features and propose a stacked ensemble learning-based method to predict lncRNAs from transcripts, abbreviated as PredLnc-GFStack. We extract the critical features from the candidate feature list using the genetic algorithm (GA) and then employ the stacked ensemble learning method to construct the PredLnc-GFStack model. Computational experimental results show that PredLnc-GFStack outperforms several state-of-the-art methods for lncRNA prediction. Furthermore, PredLnc-GFStack demonstrates an outstanding ability for cross-species ncRNA prediction.
8
Quan Y, Luo ZH, Yang QY, Li J, Zhu Q, Liu YM, Lv BM, Cui ZJ, Qin X, Xu YH, Zhu LD, Zhang HY. Systems Chemical Genetics-Based Drug Discovery: Prioritizing Agents Targeting Multiple/Reliable Disease-Associated Genes as Drug Candidates. Front Genet 2019; 10:474. [PMID: 31191604 PMCID: PMC6549477 DOI: 10.3389/fgene.2019.00474] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/07/2019] [Accepted: 05/01/2019] [Indexed: 01/10/2023]
Abstract
Genetic disease genes are considered a promising source of drug targets. Most diseases are caused by more than one pathogenic factor; thus, it is reasonable to consider that chemical agents targeting multiple disease genes are more likely to have desired activities. This is supported by a comprehensive analysis of the relationships between agent activity/druggability and target genetic characteristics. The therapeutic potential of agents increases steadily with an increasing number of targeted disease genes, and can be further enhanced by strengthened genetic links between targets and diseases. By using multi-label classification models for genetics-based drug activity prediction, we provide universal tools for prioritizing drug candidates. All of the documented data and the machine-learning prediction service are available at SCG-Drug (http://zhanglab.hzau.edu.cn/scgdrug).
Affiliation(s)
- Yuan Quan
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Zhi-Hui Luo
  - College of Life Sciences and Technology, Huazhong Agricultural University, Wuhan, China
- Qing-Yong Yang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Jiang Li
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Qiang Zhu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Ye-Mao Liu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Bo-Min Lv
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Ze-Jia Cui
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Xuan Qin
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Yan-Hua Xu
  - Sci-meds Biopharmaceutical Co., Ltd., Wuhan, China
- Li-Da Zhu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
- Hong-Yu Zhang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
9
Qu K, Wei L, Yu J, Wang C. Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods. Front Plant Sci 2019; 9:1961. [PMID: 30687359 PMCID: PMC6335366 DOI: 10.3389/fpls.2018.01961] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/20/2018] [Accepted: 12/17/2018] [Indexed: 05/04/2023]
Abstract
Motivation: Pentatricopeptide repeat (PPR) is a triangular pentapeptide repeat domain that plays a vital role in plant growth. In this study, we seek to identify PPR coding genes and proteins using a mixture of feature extraction methods. We use four single feature extraction methods focusing on the sequence, physical, and chemical properties as well as the amino acid composition, and mix the features. The Max-Relevant-Max-Distance (MRMD) technique is applied to reduce the feature dimension. Classification uses the random forest, J48, and naïve Bayes with 10-fold cross-validation. Results: Combining two of the feature extraction methods with the random forest classifier produces the highest area under the curve of 0.9848. Using MRMD to reduce the dimension improves this metric for J48 and naïve Bayes, but has little effect on the random forest results. Availability and Implementation: The webserver is available at: http://server.malab.cn/MixedPPR/index.jsp.
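Amino acid composition, one of the feature types mentioned in this abstract, is the fraction of each of the 20 standard residues in a protein. The sketch below shows that single feature in a generic form; the function name and the toy input are illustrative assumptions, and the other three extraction methods and the MRMD reduction step are not reproduced.

```python
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(protein: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in `protein`."""
    protein = protein.upper()
    if not protein:
        raise ValueError("protein sequence must not be empty")
    return {aa: protein.count(aa) / len(protein) for aa in AMINO_ACIDS}


composition = amino_acid_composition("AAGK")  # toy sequence, not a real PPR protein
print(composition["A"])  # 0.5
print(composition["G"])  # 0.25
```

Because the output always has 20 entries regardless of protein length, it can be concatenated with other fixed-length descriptors before dimension reduction.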
Affiliation(s)
- Kaiyang Qu
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jiantao Yu
  - College of Information Engineering, North-West A&F University, Yangling, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
10
Tang G, Shi J, Wu W, Yue X, Zhang W. Sequence-based bacterial small RNAs prediction using ensemble learning strategies. BMC Bioinformatics 2018; 19:503. [PMID: 30577759 PMCID: PMC6302447 DOI: 10.1186/s12859-018-2535-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 01/08/2023]
Abstract
Background: Bacterial small non-coding RNAs (sRNAs) have emerged as important elements in diverse physiological processes, including growth, development, cell proliferation, differentiation, metabolic reactions and carbon metabolism, and have attracted great attention. Accurate prediction of sRNAs is important and challenging, and helps to explore the functions and mechanisms of sRNAs. Results: In this paper, we utilize a variety of sRNA sequence-derived features to develop ensemble learning methods for sRNA prediction. First, we compile a balanced dataset and four imbalanced datasets. Then, we investigate various sRNA sequence-derived features, such as spectrum profile, mismatch profile, reverse complement k-mer and pseudo nucleotide composition. Finally, we consider two ensemble learning strategies to integrate all features for building ensemble learning models for sRNA prediction. One is the weighted average ensemble method (WAEM), which uses the linear weighted sum of outputs from the individual feature-based predictors to predict sRNAs. The other is the neural network ensemble method (NNEM), which trains a deep neural network by combining diverse features. In the computational experiments, we evaluate our methods on these five datasets by using 5-fold cross validation. WAEM and NNEM can produce better results than existing state-of-the-art sRNA prediction methods. Conclusions: WAEM and NNEM have great potential for sRNA prediction, and are helpful for understanding the biological mechanism of bacteria.
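The weighted average ensemble described above combines the outputs of several feature-specific predictors as a linear weighted sum. A minimal sketch of that combination step follows; dividing by the sum of the weights is an assumption made here so the result stays on the same scale as the inputs, and the scores and weights are invented for illustration. How the paper chooses its weights is not covered.

```python
from typing import List, Sequence


def weighted_average_ensemble(
    predictions: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine per-predictor scores; predictions[i][j] is predictor i's score for sample j."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per predictor")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(predictions[0])
    return [
        sum(w * scores[j] for w, scores in zip(weights, predictions)) / total
        for j in range(n_samples)
    ]


# Two hypothetical predictors scoring two candidate sequences.
scores = [[0.9, 0.2], [0.5, 0.4]]
combined = weighted_average_ensemble(scores, weights=[3.0, 1.0])
print(combined)  # approximately [0.8, 0.25]
```

With equal weights this reduces to a plain average, so the weights express how much each feature-based predictor is trusted.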
Affiliation(s)
- Guifeng Tang
  - School of Computer Science, Wuhan University, Wuhan, 430072, China
- Jingwen Shi
  - School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
- Wenjian Wu
  - Electronic Information School, Wuhan University, Wuhan, 430072, China
- Xiang Yue
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210, USA
- Wen Zhang
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
11
Zhang W, Yue X, Tang G, Wu W, Huang F, Zhang X. SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions. PLoS Comput Biol 2018; 14:e1006616. [PMID: 30533006 PMCID: PMC6331124 DOI: 10.1371/journal.pcbi.1006616] [Citation(s) in RCA: 103] [Impact Index Per Article: 14.7] [Received: 06/19/2018] [Revised: 01/14/2019] [Accepted: 11/02/2018] [Indexed: 01/12/2023]
Abstract
LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. Existing computational methods utilize multiple lncRNA features or multiple protein features to predict lncRNA-protein interactions, but features are not available for all lncRNAs or proteins; most existing methods are not capable of predicting interacting proteins (or lncRNAs) for new lncRNAs (or proteins), which don’t have known interactions. In this paper, we propose the sequence-based feature projection ensemble learning method, “SFPEL-LPI”, to predict lncRNA-protein interactions. First, SFPEL-LPI extracts lncRNA sequence-based features and protein sequence-based features. Second, SFPEL-LPI calculates multiple lncRNA-lncRNA similarities and protein-protein similarities by using lncRNA sequences, protein sequences and known lncRNA-protein interactions. Then, SFPEL-LPI combines multiple similarities and multiple features with a feature projection ensemble learning frame. In computational experiments, SFPEL-LPI accurately predicts lncRNA-protein associations and outperforms other state-of-the-art methods. More importantly, SFPEL-LPI can be applied to new lncRNAs (or proteins). The case studies demonstrate that our method can discover novel lncRNA-protein interactions, which are confirmed by literature. Finally, we construct a user-friendly web server, available at http://www.bioinfotech.cn/SFPEL-LPI/.
LncRNA-protein interactions play important roles in post-transcriptional gene regulation, poly-adenylation, splicing and translation. Identification of lncRNA-protein interactions helps to understand lncRNA-related activities. In this paper, we propose a novel computational method “SFPEL-LPI” to predict lncRNA-protein interactions. SFPEL-LPI makes use of lncRNA sequences, protein sequences and known lncRNA-protein associations to extract features and calculate similarities for lncRNAs and proteins, and then combines them with a feature projection ensemble learning frame. SFPEL-LPI can predict unobserved interactions between lncRNAs and proteins, and can also make predictions for new lncRNAs (or proteins), which have no interactions with any proteins (or lncRNAs). SFPEL-LPI achieves high accuracy on the benchmark dataset when evaluated by five-fold cross validation, and outperforms state-of-the-art methods. The case studies demonstrate that SFPEL-LPI can discover novel associations, which are confirmed by literature. To facilitate lncRNA-protein interaction prediction, we develop a user-friendly web server, available at http://www.bioinfotech.cn/SFPEL-LPI/.
Affiliation(s)
- Wen Zhang
  - College of Informatics, Huazhong Agricultural University, Wuhan, China
  - School of Computer Science, Wuhan University, Wuhan, China
  - Corresponding author (WZ)
- Xiang Yue
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, United States of America
- Guifeng Tang
  - School of Computer Science, Wuhan University, Wuhan, China
- Wenjian Wu
  - Electronic Information School, Wuhan University, Wuhan, China
- Feng Huang
  - School of Computer Science, Wuhan University, Wuhan, China
- Xining Zhang
  - School of Computer Science, Wuhan University, Wuhan, China
  - Corresponding author (XZ)
12
Manifold regularized matrix factorization for drug-drug interaction prediction. J Biomed Inform 2018; 88:90-97. [DOI: 10.1016/j.jbi.2018.11.005] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 11/28/2017] [Revised: 11/03/2018] [Accepted: 11/11/2018] [Indexed: 12/20/2022]
13
Blanco JL, Porto-Pazos AB, Pazos A, Fernandez-Lozano C. Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection. Sci Rep 2018; 8:15688. [PMID: 30356060 PMCID: PMC6200741 DOI: 10.1038/s41598-018-33911-z] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 03/29/2018] [Accepted: 10/06/2018] [Indexed: 12/22/2022]
Abstract
Screening and in silico modeling are critical activities for the reduction of experimental costs. They also speed up research notably and strengthen the theoretical framework, thus allowing researchers to numerically quantify the importance of a particular subset of information. For example, in fields such as cancer and other highly prevalent diseases, having a reliable prediction method is crucial. The objective of this paper is to classify peptide sequences according to their anti-angiogenic activity to understand the underlying principles via machine learning. First, the peptide sequences were converted into three types of numerical molecular descriptors based on the amino acid composition. We performed different experiments with the descriptors and merged them to obtain baseline results for the performance of the models, particularly of each molecular descriptor subset. A feature selection process was applied to reduce the dimensionality of the problem and remove noisy features, which are highly prevalent in biological problems. After a robust machine learning experimental design under equal conditions (nested resampling, cross-validation, hyperparameter tuning and different runs), we significantly outperformed the best previously published anti-angiogenic model with a generalized linear model via coordinate descent (glmnet), achieving a mean AUC value greater than 0.96 and an accuracy of 0.86 with 200 molecular descriptors, mixed from the three groups. A final analysis with the top-40 discriminative anti-angiogenic activity peptides is presented along with a discussion of the feature selection process and the individual importance of each molecular descriptor. According to our findings, anti-angiogenic activity peptides are strongly associated with amino acid sequences SP, LSL, PF, DIT, PC, GH, RQ, QD, TC, SC, AS, CLD, ST, MF, GRE, IQ, CQ and HG.
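The di- and tripeptide motifs listed at the end of this abstract can be checked against any peptide with a simple overlapping substring count. The sketch below does only that descriptive step; the motif list is copied from the abstract, while the function name and the example peptide are hypothetical, and nothing here reproduces the glmnet model or its descriptors.

```python
from typing import Dict, Sequence

# Motifs reported in the abstract as associated with anti-angiogenic activity.
MOTIFS = ["SP", "LSL", "PF", "DIT", "PC", "GH", "RQ", "QD", "TC",
          "SC", "AS", "CLD", "ST", "MF", "GRE", "IQ", "CQ", "HG"]


def count_motifs(peptide: str, motifs: Sequence[str] = MOTIFS) -> Dict[str, int]:
    """Count overlapping occurrences of each motif; motifs that are absent are omitted."""
    peptide = peptide.upper()
    found = {}
    for motif in motifs:
        hits = sum(
            1 for i in range(len(peptide) - len(motif) + 1)
            if peptide.startswith(motif, i)
        )
        if hits:
            found[motif] = hits
    return found


print(count_motifs("ASPLSLSP"))  # {'SP': 2, 'LSL': 1, 'AS': 1}
```

A count like this only flags the presence of a motif; whether a peptide is actually anti-angiogenic would still depend on the trained classifier and experimental validation.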
Collapse
Affiliation(s)
- Jose Liñares Blanco
- Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain
| | - Ana B Porto-Pazos
- Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain.,Instituto de Investigación Biomédica de A Coruña (INIBIC). Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
| | - Alejandro Pazos
- Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain.,Instituto de Investigación Biomédica de A Coruña (INIBIC). Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
| | - Carlos Fernandez-Lozano
- Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain.,Instituto de Investigación Biomédica de A Coruña (INIBIC). Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain.
|
14
|
Tahir M, Hayat M, Khan SA. iNuc-ext-PseTNC: an efficient ensemble model for identification of nucleosome positioning by extending the concept of Chou's PseAAC to pseudo-tri-nucleotide composition. Mol Genet Genomics 2018; 294:199-210. [PMID: 30291426 DOI: 10.1007/s00438-018-1498-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2018] [Accepted: 09/28/2018] [Indexed: 10/28/2022]
Abstract
The nucleosome is a central element of eukaryotic chromatin, which is composed of histone proteins and DNA molecules. It performs vital roles in many eukaryotic intra-nuclear processes, for instance, chromatin structure formation and transcriptional regulation. Identification of nucleosome positioning via wet lab is difficult; so, the attention is diverted towards accurate, intelligent, automated prediction. In this regard, a novel intelligent automated model "iNuc-ext-PseTNC" is developed to identify the nucleosome positioning in genomes accurately. In this predictor, the sequences of DNA are mathematically represented by two different discrete feature extraction techniques, namely pseudo-tri-nucleotide composition (PseTNC) and pseudo-di-nucleotide composition. Several contemporary machine learning algorithms were examined. Further, the predictions of individual classifiers were integrated through an evolutionary genetic algorithm. The success rates of the ensemble model are higher than those of the individual classifiers. After analyzing the prediction results, it is noticed that the iNuc-ext-PseTNC model has achieved better performance in combination with the PseTNC feature space, with accuracies of 94.3%, 93.14%, and 88.60% using the six-fold cross-validation test for the three benchmark datasets S1, S2, and S3, respectively. The achieved outcomes show that the results of the iNuc-ext-PseTNC model are prominent compared to the existing methods reported in the literature so far. It is ascertained that the proposed model might be a more fruitful and practical tool for academia and research.
Affiliation(s)
- Muhammad Tahir
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan.
| | - Sher Afzal Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
|
15
|
He W, Ju Y, Zeng X, Liu X, Zou Q. Sc-ncDNAPred: A Sequence-Based Predictor for Identifying Non-coding DNA in Saccharomyces cerevisiae. Front Microbiol 2018; 9:2174. [PMID: 30258427 PMCID: PMC6144933 DOI: 10.3389/fmicb.2018.02174] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Accepted: 08/24/2018] [Indexed: 12/22/2022] Open
Abstract
With the rapid development of high-speed sequencing technologies and the implementation of many whole genome sequencing projects, research in genomics is advancing from genome sequencing to genome synthesis. Synthetic biology technologies such as DNA-based molecular assemblies, genome editing technology, directional evolution technology and DNA storage technology, and other cutting-edge technologies emerge in succession. Especially the rapid growth and development of DNA assembly technology may greatly push forward the success of artificial life. Meanwhile, DNA assembly technology needs a large number of target sequences of known information as data support. Non-coding DNA (ncDNA) sequences occupy most of the organism genomes, thus accurately recognizing them is necessary. Although experimental methods have been proposed to detect ncDNA sequences, they are expensive for performing genome-wide detections. Thus, it is necessary to develop machine-learning methods for predicting non-coding DNA sequences. In this study, we collected the ncDNA benchmark dataset of Saccharomyces cerevisiae and reported a support vector machine-based predictor, called Sc-ncDNAPred, for predicting ncDNA sequences. The optimal feature extraction strategy was selected from a group that included mononucleotide, dimer, trimer, tetramer, pentamer, and hexamer, using the support vector machine learning method. Sc-ncDNAPred achieved an overall accuracy of 0.98. For the convenience of users, an online web-server has been built at: http://server.malab.cn/Sc_ncDNAPred/index.jsp.
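The mononucleotide-to-hexamer feature families mentioned in this abstract are all instances of k-mer composition. The following is a minimal sketch of that encoding, not the Sc-ncDNAPred implementation: the function names `kmer_composition` and `multi_k_features` are hypothetical, and the handling of ambiguous bases (windows containing characters other than A, C, G, T are skipped) is an assumption.

```python
from itertools import product


def kmer_composition(sequence, k):
    """Return normalized k-mer frequencies of a DNA sequence in a fixed order."""
    alphabet = "ACGT"
    # Enumerate all 4**k possible k-mers so every sequence maps to the same layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def multi_k_features(sequence, ks=(1, 2, 3)):
    """Concatenate k-mer compositions for several values of k into one vector."""
    features = []
    for k in ks:
        features.extend(kmer_composition(sequence, k))
    return features


# Hypothetical toy sequence, used only to show the vector length.
vector = multi_k_features("ACGTACGTTGCA")
print(len(vector))  # 4 + 16 + 64 entries
```

A vector of this kind could be passed to any classifier, such as a support vector machine. Comparing values of k, as the abstract describes, would amount to training one model per feature family and keeping the best-performing one.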
Affiliation(s)
- Wenying He
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Ying Ju
- School of Information Science and Technology, Xiamen University, Xiamen, China
| | - Xiangxiang Zeng
- School of Information Science and Technology, Xiamen University, Xiamen, China
| | - Xiangrong Liu
- School of Information Science and Technology, Xiamen University, Xiamen, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China.,Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China
|
16
|
Niu M, Li Y, Wang C, Han K. RFAmyloid: A Web Server for Predicting Amyloid Proteins. Int J Mol Sci 2018; 19:ijms19072071. [PMID: 30013015 PMCID: PMC6073578 DOI: 10.3390/ijms19072071] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2018] [Revised: 07/10/2018] [Accepted: 07/12/2018] [Indexed: 12/22/2022] Open
Abstract
Amyloid is an insoluble fibrous protein and its mis-aggregation can lead to some diseases, such as Alzheimer’s disease and Creutzfeldt–Jakob’s disease. Therefore, the identification of amyloid is essential for the discovery and understanding of disease. We established a novel predictor called RFAmy based on random forest to identify amyloid, and it employed SVMProt 188-D feature extraction method based on protein composition and physicochemical properties and pse-in-one feature extraction method based on amino acid composition, autocorrelation pseudo acid composition, profile-based features and predicted structures features. In the ten-fold cross-validation test, RFAmy’s overall accuracy was 89.19% and F-measure was 0.891. Results were obtained by comparison experiments with other feature, classifiers, and existing methods. This shows the effectiveness of RFAmy in predicting amyloid protein. The RFAmy proposed in this paper can be accessed through the URL http://server.malab.cn/RFAmyloid/.
Affiliation(s)
- Mengting Niu
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China.
| | - Yanjuan Li
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China.
| | - Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150040, China.
| | - Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150040, China.
|
17
|
Wei L, Chen H, Su R. M6APred-EL: A Sequence-Based Predictor for Identifying N6-methyladenosine Sites Using Ensemble Learning. MOLECULAR THERAPY-NUCLEIC ACIDS 2018; 12:635-644. [PMID: 30081234 PMCID: PMC6082921 DOI: 10.1016/j.omtn.2018.07.004] [Citation(s) in RCA: 145] [Impact Index Per Article: 20.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/31/2018] [Revised: 07/03/2018] [Accepted: 07/03/2018] [Indexed: 12/28/2022]
Abstract
N6-methyladenosine (m6A) modification is the most abundant RNA methylation modification and involves various biological processes, such as RNA splicing and degradation. Recent studies have demonstrated the feasibility of identifying m6A peaks using high-throughput sequencing techniques. However, such techniques cannot accurately identify specific methylated sites, which is important for a better understanding of m6A functions. In this study, we develop a novel machine learning-based predictor called M6APred-EL for the identification of m6A sites. To predict m6A sites accurately within genomic sequences, we trained an ensemble of three support vector machine classifiers that explore the position-specific information and physical chemical information from position-specific k-mer nucleotide propensity, physical-chemical properties, and ring-function-hydrogen-chemical properties. We examined and compared the performance of our predictor with other state-of-the-art methods of benchmarking datasets. Comparative results showed that the proposed M6APred-EL performed more accurately for m6A site identification. Moreover, a user-friendly web server that implements the proposed M6APred-EL is well established and is currently available at http://server.malab.cn/M6APred-EL/. It is expected to be a practical and effective tool for the investigation of m6A functional mechanisms.
Affiliation(s)
- Leyi Wei
- School of Computer Science and Technology, Tianjin University, Tianjin, China; State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, China
| | - Huangrong Chen
- School of Computer Science and Technology, Tianjin University, Tianjin, China
| | - Ran Su
- School of Computer Software, Tianjin University, Tianjin, China; State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, China.
|
18
|
Zhang W, Yue X, Lin W, Wu W, Liu R, Huang F, Liu F. Predicting drug-disease associations by using similarity constrained matrix factorization. BMC Bioinformatics 2018; 19:233. [PMID: 29914348 PMCID: PMC6006580 DOI: 10.1186/s12859-018-2220-4] [Citation(s) in RCA: 135] [Impact Index Per Article: 19.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2017] [Accepted: 05/28/2018] [Indexed: 02/06/2023] Open
Abstract
Background Drug-disease associations provide important information for the drug discovery. Wet experiments that identify drug-disease associations are time-consuming and expensive. However, many drug-disease associations are still unobserved or unknown. The development of computational methods for predicting unobserved drug-disease associations is an important and urgent task. Results In this paper, we proposed a similarity constrained matrix factorization method for the drug-disease association prediction (SCMFDD), which makes use of known drug-disease associations, drug features and disease semantic information. SCMFDD projects the drug-disease association relationship into two low-rank spaces, which uncover latent features for drugs and diseases, and then introduces drug feature-based similarities and disease semantic similarity as constraints for drugs and diseases in low-rank spaces. Different from the classic matrix factorization technique, SCMFDD takes the biological context of the problem into account. In computational experiments, the proposed method can produce high-accuracy performances on benchmark datasets, and outperform existing state-of-the-art prediction methods when evaluated by five-fold cross validation and independent testing. Conclusion We developed a user-friendly web server by using known associations collected from the CTD database, available at http://www.bioinfotech.cn/SCMFDD/. The case studies show that the server can find out novel associations, which are not included in the CTD database.
Affiliation(s)
- Wen Zhang
- School of Computer Science, Wuhan University, Wuhan, 430072, China.
| | - Xiang Yue
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Weiran Lin
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Wenjian Wu
- School of Electronic Information, Wuhan University, Wuhan, 430072, China
| | - Ruoqi Liu
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Feng Huang
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Feng Liu
- School of Computer Science, Wuhan University, Wuhan, 430072, China.
|
19
|
Zhang S, Zhuang W, Xu Z. Prediction of DNase I hypersensitive sites in plant genome using multiple modes of pseudo components. Anal Biochem 2018; 549:149-156. [DOI: 10.1016/j.ab.2018.03.025] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2018] [Revised: 03/23/2018] [Accepted: 03/27/2018] [Indexed: 12/25/2022]
|
20
|
Kar P, Ruiz-Perez L, Arooj M, Mancera RL. Current methods for the prediction of T-cell epitopes. Pept Sci (Hoboken) 2018. [DOI: 10.1002/pep2.24046] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Affiliation(s)
- Prattusha Kar
- School of Pharmacy and Biomedical Sciences; Curtin Health Innovation Research Institute and Curtin Institute for Computation, Curtin University; Perth Western Australia 6845 Australia
| | - Lanie Ruiz-Perez
- School of Pharmacy and Biomedical Sciences; Curtin Health Innovation Research Institute and Curtin Institute for Computation, Curtin University; Perth Western Australia 6845 Australia
| | - Mahreen Arooj
- School of Pharmacy and Biomedical Sciences; Curtin Health Innovation Research Institute and Curtin Institute for Computation, Curtin University; Perth Western Australia 6845 Australia
| | - Ricardo L. Mancera
- School of Pharmacy and Biomedical Sciences; Curtin Health Innovation Research Institute and Curtin Institute for Computation, Curtin University; Perth Western Australia 6845 Australia
|
21
|
The linear neighborhood propagation method for predicting long non-coding RNA–protein interactions. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.07.065] [Citation(s) in RCA: 105] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]
|
22
|
Zhang W, Chen Y, Li D. Drug-Target Interaction Prediction through Label Propagation with Linear Neighborhood Information. Molecules 2017; 22:molecules22122056. [PMID: 29186828 PMCID: PMC6149680 DOI: 10.3390/molecules22122056] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2017] [Revised: 11/19/2017] [Accepted: 11/20/2017] [Indexed: 11/16/2022] Open
Abstract
Interactions between drugs and target proteins provide important information for the drug discovery. Currently, experiments identified only a small number of drug-target interactions. Therefore, the development of computational methods for drug-target interaction prediction is an urgent task of theoretical interest and practical significance. In this paper, we propose a label propagation method with linear neighborhood information (LPLNI) for predicting unobserved drug-target interactions. Firstly, we calculate drug-drug linear neighborhood similarity in the feature spaces, by considering how to reconstruct data points from neighbors. Then, we take similarities as the manifold of drugs, and assume the manifold unchanged in the interaction space. At last, we predict unobserved interactions between known drugs and targets by using drug-drug linear neighborhood similarity and known drug-target interactions. The experiments show that LPLNI can utilize only known drug-target interactions to make high-accuracy predictions on four benchmark datasets. Furthermore, we consider incorporating chemical structures into LPLNI models. Experimental results demonstrate that the model with integrated information (LPLNI-II) can produce improved performances, better than other state-of-the-art methods. The known drug-target interactions are an important information source for computational predictions. The usefulness of the proposed method is demonstrated by cross validation and the case study.
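The propagation step described in this abstract can be illustrated with a generic label propagation update. This is a sketch under stated assumptions, not the LPLNI code: the drug-drug similarity matrix is taken as given (the linear neighborhood reconstruction that produces it is not shown), and the update rule F <- alpha * W * F + (1 - alpha) * Y, the function name `propagate_labels`, and the parameter values are illustrative choices.

```python
def propagate_labels(similarity, known, alpha=0.5, iterations=100):
    """Iterate F <- alpha * W F + (1 - alpha) * Y with W the row-normalized similarity."""
    n = len(similarity)
    m = len(known[0])
    # Row-normalize so that each drug's neighbor weights sum to 1.
    weights = []
    for row in similarity:
        row_sum = sum(row)
        weights.append([v / row_sum if row_sum > 0 else 0.0 for v in row])
    scores = [[float(v) for v in row] for row in known]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            new_row = []
            for t in range(m):
                # Score borrowed from similar drugs for target t.
                neighbor_term = sum(weights[i][j] * scores[j][t] for j in range(n))
                new_row.append(alpha * neighbor_term + (1 - alpha) * known[i][t])
            updated.append(new_row)
        scores = updated
    return scores


# Hypothetical toy data: 3 drugs, 2 targets; drug 2 has no known interactions.
similarity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
known = [
    [1, 0],
    [1, 0],
    [0, 0],
]
scores = propagate_labels(similarity, known)
print(scores[2])  # drug 2 inherits a positive score for target 0 from drug 1
```

Because the weights are row-normalized and the known labels lie in [0, 1], every score stays in [0, 1]. Unobserved drug-target pairs can then be ranked by score.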
Affiliation(s)
- Wen Zhang
- School of Computer, Wuhan University, Wuhan 430072, China.
| | - Yanlin Chen
- School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China.
| | - Dingfang Li
- School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China.
|
23
|
Wang JY, Chen LL, Zhou XH. Identifying prognostic signature in ovarian cancer using DirGenerank. Oncotarget 2017; 8:46398-46413. [PMID: 28615526 PMCID: PMC5542276 DOI: 10.18632/oncotarget.18189] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2017] [Accepted: 04/26/2017] [Indexed: 12/27/2022] Open
Abstract
Identifying the prognostic genes in cancer is essential not only for the treatment of cancer patients, but also for drug discovery. However, it's still a big challenge to select the prognostic genes that can distinguish the risk of cancer patients across various data sets because of tumor heterogeneity. In this situation, the selected genes whose expression levels are statistically related to prognostic risks may be passengers. In this paper, based on gene expression data and prognostic data of ovarian cancer patients, we used conditional mutual information to construct gene dependency network in which the nodes (genes) with more out-degrees have more chances to be the modulators of cancer prognosis. After that, we proposed DirGenerank (Generank in direct network) algorithm, which concerns both the gene dependency network and genes' correlations to prognostic risks, to identify the gene signature that can predict the prognostic risks of ovarian cancer patients. Using ovarian cancer data set from TCGA (The Cancer Genome Atlas) as training data set, 40 genes with the highest importance were selected as prognostic signature. Survival analysis of these patients divided by the prognostic signature in testing data set and four independent data sets showed the signature can distinguish the prognostic risks of cancer patients significantly. Enrichment analysis of the signature with curated cancer genes and the drugs selected by CMAP showed the genes in the signature may be drug targets for therapy. In summary, we have proposed a useful pipeline to identify prognostic genes of cancer patients.
Affiliation(s)
- Jian-Yong Wang
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, P.R. China
| | - Ling-Ling Chen
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, P.R. China
| | - Xiong-Hui Zhou
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, P.R. China
|
24
|
Wang W, Sun L, Zhang S, Zhang H, Shi J, Xu T, Li K. Analysis and prediction of single-stranded and double-stranded DNA binding proteins based on protein sequences. BMC Bioinformatics 2017; 18:300. [PMID: 28606086 PMCID: PMC5469069 DOI: 10.1186/s12859-017-1715-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2017] [Accepted: 06/06/2017] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND DNA-binding proteins perform important functions in a great number of biological activities. DNA-binding proteins can interact with ssDNA (single-stranded DNA) or dsDNA (double-stranded DNA), and DNA-binding proteins can be categorized as single-stranded DNA-binding proteins (SSBs) and double-stranded DNA-binding proteins (DSBs). The identification of DNA-binding proteins from amino acid sequences can help to annotate protein functions and understand the binding specificity. In this study, we systematically consider a variety of schemes to represent protein sequences: OAAC (overall amino acid composition) features, dipeptide compositions, PSSM (position-specific scoring matrix profiles) and split amino acid composition (SAA), and then we adopt SVM (support vector machine) and RF (random forest) classification model to distinguish SSBs from DSBs. RESULTS Our results suggest that some sequence features can significantly differentiate DSBs and SSBs. Evaluated by 10 fold cross-validation on the benchmark datasets, our prediction method can achieve the accuracy of 88.7% and AUC (area under the curve) of 0.919. Moreover, our method has good performance in independent testing. CONCLUSIONS Using various sequence-derived features, a novel method is proposed to distinguish DSBs and SSBs accurately. The method also explores novel features, which could be helpful to discover the binding specificity of DNA-binding proteins.
Affiliation(s)
- Wei Wang
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, Henan Province 453007 China
- Laboratory of Computation Intelligence and Information Processing, Engineering Technology Research Center for Computing Intelligence and Data Mining, Xinxiang, Henan Province 453007 China
| | - Lin Sun
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, Henan Province 453007 China
| | - Shiguang Zhang
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, Henan Province 453007 China
| | - Hongjun Zhang
- School of Aviation Engineering, Anyang University, Anyang, Henan Province 455000 China
| | - Jinling Shi
- School of International Education, Xuchang University, Xuchang, Henan Province 461000 China
| | - Tianhe Xu
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, Henan Province 453007 China
| | - Keliang Li
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, Henan Province 453007 China
|
25
|
Li D, Luo L, Zhang W, Liu F, Luo F. A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. BMC Bioinformatics 2016; 17:329. [PMID: 27578422 PMCID: PMC5006569 DOI: 10.1186/s12859-016-1206-3] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2016] [Accepted: 08/24/2016] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Predicting piwi-interacting RNA (piRNA) is an important topic in the small non-coding RNAs, which provides clues for understanding the generation mechanism of gamete. To the best of our knowledge, several machine learning approaches have been proposed for the piRNA prediction, but there is still room for improvements. RESULTS In this paper, we develop a genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. We construct datasets for three species: Human, Mouse and Drosophila. For each species, we compile the balanced dataset and imbalanced dataset, and thus obtain six datasets to build and evaluate prediction models. In the computational experiments, the genetic algorithm-based weighted ensemble method achieves 10-fold cross validation AUC of 0.932, 0.937 and 0.995 on the balanced Human dataset, Mouse dataset and Drosophila dataset, respectively, and achieves AUC of 0.935, 0.939 and 0.996 on the imbalanced datasets of three species. Further, we use the prediction models trained on the Mouse dataset to identify piRNAs of other species, and the models demonstrate the good performances in the cross-species prediction. CONCLUSIONS Compared with other state-of-the-art methods, our method can lead to better performances. In conclusion, the proposed method is promising for the transposon-derived piRNA prediction. The source codes and datasets are available in https://github.com/zw9977129/piRNAPredictor .
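A weighted ensemble of the kind this abstract describes combines the scores of several base classifiers and is judged by AUC. The sketch below shows only the combination and the AUC computation. The genetic algorithm that searches for the weights is not shown, the weights are fixed by hand, and the function names and toy scores are hypothetical.

```python
def weighted_ensemble(score_lists, weights):
    """Combine per-classifier scores with non-negative weights normalized to sum to 1."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    n_samples = len(score_lists[0])
    return [
        sum(w * scores[i] for w, scores in zip(normalized, score_lists))
        for i in range(n_samples)
    ]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical scores from two base classifiers on four samples.
labels = [1, 1, 0, 0]
clf_a = [0.9, 0.4, 0.5, 0.1]
clf_b = [0.6, 0.8, 0.3, 0.7]
combined = weighted_ensemble([clf_a, clf_b], weights=[1.0, 1.0])
print(auc(labels, clf_a), auc(labels, clf_b), auc(labels, combined))
```

In this toy case each classifier misranks a different pair, so the equal-weight average ranks every positive above every negative. A weight search, whether by genetic algorithm or otherwise, would use a validation AUC like this one as its fitness function.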
Affiliation(s)
- Dingfang Li
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
| | - Longqiang Luo
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
| | - Wen Zhang
- State Key Lab of Software Engineering, Wuhan University, Wuhan, 430072 China
- School of Computer, Wuhan University, Wuhan, 430072 China
| | - Feng Liu
- International School of Software, Wuhan University, Wuhan, 430072 China
| | - Fei Luo
- State Key Lab of Software Engineering, Wuhan University, Wuhan, 430072 China
- School of Computer, Wuhan University, Wuhan, 430072 China
|
26
|
Luo L, Li D, Zhang W, Tu S, Zhu X, Tian G. Accurate Prediction of Transposon-Derived piRNAs by Integrating Various Sequential and Physicochemical Features. PLoS One 2016; 11:e0153268. [PMID: 27074043 PMCID: PMC4830532 DOI: 10.1371/journal.pone.0153268] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2015] [Accepted: 03/25/2016] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND Piwi-interacting RNA (piRNA) is the largest class of small non-coding RNA molecules. The transposon-derived piRNA prediction can enrich the research contents of small ncRNAs as well as help to further understand generation mechanism of gamete. METHODS In this paper, we attempt to differentiate transposon-derived piRNAs from non-piRNAs based on their sequential and physicochemical features by using machine learning methods. We explore six sequence-derived features, i.e. spectrum profile, mismatch profile, subsequence profile, position-specific scoring matrix, pseudo dinucleotide composition and local structure-sequence triplet elements, and systematically evaluate their performances for transposon-derived piRNA prediction. Finally, we consider two approaches: direct combination and ensemble learning to integrate useful features and achieve high-accuracy prediction models. RESULTS We construct three datasets, covering three species: Human, Mouse and Drosophila, and evaluate the performances of prediction models by 10-fold cross validation. In the computational experiments, direct combination models achieve AUC of 0.917, 0.922 and 0.992 on Human, Mouse and Drosophila, respectively; ensemble learning models achieve AUC of 0.922, 0.926 and 0.994 on the three datasets. CONCLUSIONS Compared with other state-of-the-art methods, our methods can lead to better performances. In conclusion, the proposed methods are promising for the transposon-derived piRNA prediction. The source codes and datasets are available in S1 File.
Affiliation(s)
- Longqiang Luo
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
| | - Dingfang Li
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
| | - Wen Zhang
- School of Computer, Wuhan University, Wuhan, 430072, China
- Research Institute of Shenzhen, Wuhan University, Shenzhen, 518057, China
| | - Shikui Tu
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, 368 Plantation Street, Worcester, Massachusetts, 01605, United States of America
| | - Xiaopeng Zhu
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, 368 Plantation Street, Worcester, Massachusetts, 01605, United States of America
| | - Gang Tian
- School of Computer, Wuhan University, Wuhan, 430072, China
|
27
|
Abstract
String-of-beads polypeptides allow convenient delivery of epitope-based vaccines. The success of a polypeptide relies on efficient processing: constituent epitopes need to be recovered while avoiding neo-epitopes from epitope junctions. Spacers between epitopes are employed to ensure this, but spacer selection is non-trivial. We present a framework to determine optimally the length and sequence of a spacer through multi-objective optimization for human leukocyte antigen class I restricted polypeptides. The method yields string-of-bead vaccines with flexible spacer lengths that increase the predicted epitope recovery rate fivefold while reducing the immunogenicity from neo-epitopes by 44 % compared to designs without spacers.
|
28
|
Zhang W, Liu F, Luo L, Zhang J. Predicting drug side effects by multi-label learning and ensemble learning. BMC Bioinformatics 2015; 16:365. [PMID: 26537615 PMCID: PMC4634905 DOI: 10.1186/s12859-015-0774-y] [Citation(s) in RCA: 100] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2015] [Accepted: 10/14/2015] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Predicting drug side effects is an important topic in the drug discovery. Although several machine learning methods have been proposed to predict side effects, there is still space for improvements. Firstly, the side effect prediction is a multi-label learning task, and we can adopt the multi-label learning techniques for it. Secondly, drug-related features are associated with side effects, and feature dimensions have specific biological meanings. Recognizing critical dimensions and reducing irrelevant dimensions may help to reveal the causes of side effects. METHODS In this paper, we propose a novel method 'feature selection-based multi-label k-nearest neighbor method' (FS-MLKNN), which can simultaneously determine critical feature dimensions and construct high-accuracy multi-label prediction models. RESULTS Computational experiments demonstrate that FS-MLKNN leads to good performances as well as explainable results. To achieve better performances, we further develop the ensemble learning model by integrating individual feature-based FS-MLKNN models. When compared with other state-of-the-art methods, the ensemble method produces better performances on benchmark datasets. CONCLUSIONS In conclusion, FS-MLKNN and the ensemble method are promising tools for the side effect prediction. The source code and datasets are available in the Additional file 1.
Affiliation(s)
- Wen Zhang
- School of Computer, Wuhan University, Wuhan, 430072, China.,Research Institute of Shenzhen, Wuhan University, Shenzhen, 518057, China.
| | - Feng Liu
- International School of Software, Wuhan University, Wuhan, 430072, China.
| | - Longqiang Luo
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China.
| | - Jingxia Zhang
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China.
|