1
Nath A, Bora U. RNAinsecta: A tool for prediction of precursor microRNA in insects and search for their target in the model organism Drosophila melanogaster. PLoS One 2023; 18:e0287323. [PMID: 37812647 PMCID: PMC10561860 DOI: 10.1371/journal.pone.0287323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Accepted: 06/03/2023] [Indexed: 10/11/2023]
Abstract
INTRODUCTION AND BACKGROUND: Pre-microRNAs are the hairpin loops from which microRNAs are produced; microRNAs have been found to negatively regulate gene expression in several organisms. In insects, microRNAs participate in several biological processes including metamorphosis, reproduction, and immune response. Numerous tools have been designed in recent years to predict novel pre-microRNA using binary machine learning classifiers, where prediction models are trained with true and pseudo pre-microRNA hairpin loops. Currently, there is no existing tool designed exclusively for insect pre-microRNA detection. AIM: To apply machine learning algorithms to develop an open source tool for prediction of novel precursor microRNA in insects and to search for their miRNA targets in the model insect organism, Drosophila melanogaster. METHODS: Machine learning algorithms such as Random Forest, Support Vector Machine, Logistic Regression and K-Nearest Neighbours were trained on features of true and false insect pre-microRNAs with 10-fold cross-validation on SMOTE and Near-Miss datasets. miRNA target IDs were collected from miRTarBase and their corresponding transcripts were collected from FlyBase. We used the miRanda algorithm for target searching. RESULTS: In our experiment, SMOTE performed significantly better than Near-Miss and was therefore used for modelling. We kept the best-performing parameters after obtaining initial mean cross-validation accuracy scores >90%. The models trained with Support Vector Machine achieved an accuracy of 92.19%, while Random Forest attained an accuracy of 80.28% on our validation dataset. These models are hosted online as a web application called RNAinsecta. Further, a target search in Drosophila melanogaster for the predicted pre-microRNA is provided in RNAinsecta.
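The SMOTE oversampling step named in the methods can be illustrated with a minimal sketch. This is not the RNAinsecta implementation: the function name `smote_oversample`, the neighbour count `k`, and the toy feature vectors are assumptions made purely for illustration. The idea is that each synthetic minority-class sample is interpolated between an existing minority sample and one of its nearest minority neighbours.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority-class neighbours.

    Illustrative sketch only; the name and defaults are hypothetical.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority samples by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Made-up two-feature vectors standing in for true pre-microRNA hairpin features.
    true_hairpins = [[0.30, -25.0], [0.35, -28.0], [0.32, -30.0], [0.40, -27.0]]
    for row in smote_oversample(true_hairpins, n_new=3):
        print([round(v, 3) for v in row])
```

When oversampling is combined with cross-validation, it is usually applied to the training folds only, so that synthetic points do not leak into the held-out fold; the abstract does not state how the cited study handled this.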
Affiliation(s)
- Adhiraj Nath
  - Department of BSBE, IIT Guwahati, North Guwahati, Assam, India
- Utpal Bora
  - Department of BSBE, IIT Guwahati, North Guwahati, Assam, India
2
Omer A. MicroRNAs as powerful tool against COVID-19: Computational perspective. WIREs Mech Dis 2023; 15:e1621. [PMID: 37345625 DOI: 10.1002/wsbm.1621] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/15/2022] [Revised: 04/13/2023] [Accepted: 05/23/2023] [Indexed: 06/23/2023]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the virus responsible for the current pandemic, COVID-19. MiRNAs, a component of RNAi technology, belong to the family of short, noncoding ssRNAs and may be crucial in the battle against this global threat, since they are involved in regulating complex biochemical pathways and may prevent viral proliferation, translation, and host expression. The complicated metabolic pathways are modulated by the activity of many proteins, mRNAs, and miRNAs working together in miRNA-mediated genetic control. The amount of omics data has increased dramatically in recent years. This massive, linked, yet complex metabolic regulatory network data offers a wealth of opportunity for iterative analysis; hence, extensive, in-depth, but time-efficient screening is necessary to acquire fresh discoveries, and this is readily performed with the use of bioinformatics. We have reviewed the literature on microRNAs, bioinformatics, and COVID-19 infection to summarize (1) the function of miRNAs in combating COVID-19, (2) the use of computational methods in combating COVID-19 in certain noteworthy studies, and (3) the computational tools used by these studies against COVID-19 for several purposes. This article is categorized under: Infectious Diseases > Computational Models.
Affiliation(s)
- Ankur Omer
  - Government College Silodi, MPHED, Katni, Madhya Pradesh, India
3
Saini S, Khurana S, Saini D, Rajput S, Thakur CJ, Singh J, Jaswal A, Kapoor Y, Kumar V, Saini A. In silico analysis of genomic landscape of SARS-CoV-2 and its variant of concerns (Delta and Omicron) reveals changes in the coding potential of miRNAs and their target genes. Gene 2023; 853:147097. [PMID: 36470485 PMCID: PMC9721428 DOI: 10.1016/j.gene.2022.147097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2022] [Revised: 11/24/2022] [Accepted: 11/29/2022] [Indexed: 12/12/2022]
Abstract
COVID-19-related morbidity and mortality continue due to the emergence of new variants of SARS-CoV-2. In the last few years, viral miRNAs have been the centre of study to understand the disease pathophysiology. In this work, we aimed to predict the change in coding potential of the viral miRNAs in SARS-CoV-2's variants of concern (VOCs), Delta and Omicron, compared to the Reference (Wuhan origin) strain using bioinformatics tools. After ab initio screening with the VMir tool and validation, we retrieved 22, 6, and 6 pre-miRNAs for Reference, Delta, and Omicron. Most of the predicted unique pre-miRNAs of Delta and Omicron were found to be encoded from the terminal and origin of the genomic sequence, respectively. Mature miRNAs identified by MatureBayes from the unique pre-miRNAs were used for target identification using miRDB. A total of 1786, 216, and 143 high-confidence target genes were captured for GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) analysis. The GO and KEGG pathway term analysis revealed the involvement of genes targeted by Delta miRNAs in pathways such as Human cytomegalovirus infection, Breast cancer, Apoptosis, Neurotrophin signaling, and Axon guidance, whereas the Sphingolipid signaling pathway was found for Omicron. Furthermore, we focussed our analysis on target genes that were validated through GEO's (Gene Expression Omnibus) DEGs (Differentially Expressed Genes) dataset, in which the FGL2, TNSF12, OGN, GDF11, and BMP11 target genes were found to be down-regulated by Reference miRNAs and YAE1 and RSU1 by Delta. A few genes were also validated in the up-regulated gene set of the GEO dataset, in which MMP14, TNFRSF21, SGMS1, and TMEM192 were related to Reference, whereas ZEB2 was detected in all three strains. This study thus provides an in silico analysis that deciphered the unique pre-miRNAs in Delta and Omicron compared to Reference. However, the findings need future wet lab studies for validation.
Affiliation(s)
- Sandeep Saini
  - Department of Bioinformatics, Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh 160030, India; Department of Biophysics, Panjab University, Sector 25, Chandigarh 160014, India
- Savi Khurana
  - Department of Bioinformatics, Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh 160030, India
- Dikshant Saini
  - Department of Bioinformatics, Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh 160030, India
- Saru Rajput
  - Department of Bioinformatics, Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh 160030, India
- Chander Jyoti Thakur
  - Department of Bioinformatics, Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh 160030, India
- Jeevisha Singh
  - Department of Bioinformatics, Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh 160030, India
- Akanksha Jaswal
  - Department of Bioinformatics, Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh 160030, India
- Yogesh Kapoor
  - Department of Engineering and Technology, Shoolini University, Solan, Himachal Pradesh, India
- Varinder Kumar
  - Department of Bioinformatics, Goswami Ganesh Dutta Sanatan Dharma College, Sector 32, Chandigarh 160030, India
- Avneet Saini
  - Department of Biophysics, Panjab University, Sector 25, Chandigarh 160014, India
4
Samami E, Pourali G, Arabpour M, Fanipakdel A, Shahidsales S, Javadinia SA, Hassanian SM, Mohammadparast S, Avan A. The Potential Diagnostic and Prognostic Value of Circulating MicroRNAs in the Assessment of Patients With Prostate Cancer: Rational and Progress. Front Oncol 2022; 11:716831. [PMID: 35186706 PMCID: PMC8855122 DOI: 10.3389/fonc.2021.716831] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/29/2021] [Accepted: 12/31/2021] [Indexed: 12/20/2022]
Abstract
Prostate cancer (P.C.) is one of the most frequently diagnosed cancers among men and the leading cause of death, with an annual incidence of 1.4 million worldwide. Prostate-specific antigen is used for screening and diagnosis of prostate disease, although it is associated with several limitations. Thus, identification of novel biomarkers is warranted for diagnosis of patients at earlier stages. MicroRNAs (miRNAs) have recently emerged as potential biomarkers. It has been shown that these small molecules can circulate in body fluids and prognosticate the risk of developing P.C. Several miRNAs, including miR-20a, miR-21, miR-375, miR-378, and miR-141, have been proposed to be expressed in prostate cancer. This review summarizes the current knowledge about possible molecular mechanisms and the potential application of tissue-specific and circulating microRNAs as diagnostic, prognostic, and therapeutic targets in prostate cancer.
Affiliation(s)
- Elham Samami
  - Network of Immunity in Infection, Malignancy and Autoimmunity (NIIMA), Universal Scientific Education and Research Network (USERN), Tehran University of Medical Sciences, Tehran, Iran
- Ghazaleh Pourali
  - Cancer Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Mahla Arabpour
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Azar Fanipakdel
  - Cancer Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Seyed Alireza Javadinia
  - Vasei Clinical Research Development Unit, Sabzevar University of Medical Sciences, Sabzevar, Iran
- Seyed Mahdi Hassanian
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
- Saeid Mohammadparast
  - Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham, Birmingham, AL, United States
- Amir Avan
  - Metabolic Syndrome Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
  - Basic Medical Sciences Institute, Mashhad University of Medical Sciences, Mashhad, Iran
  - *Correspondence: Amir Avan,
5
Gharbi S, Mohammadi Z, Dezaki MS, Dokanehiifard S, Dabiri S, Korsching E. Characterization of the first microRNA in human CDH1 that affects cell cycle and apoptosis and indicates breast cancers progression. J Cell Biochem 2022; 123:657-672. [PMID: 34997630 DOI: 10.1002/jcb.30211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2021] [Revised: 11/26/2021] [Accepted: 12/21/2021] [Indexed: 11/12/2022]
Abstract
The E-cadherin protein (Cadherin 1, gene: CDH1), a master regulator of human epithelial homeostasis, contributes to the epithelial-mesenchymal transition (EMT), which confers migratory features on cells. The EMT is central to many pathophysiological changes in cancer. Therefore, a better understanding of this regulatory scenario is beneficial for therapeutic regimens. The CDH1 gene is approximately 100 kbp long and consists of 16 exons with a relatively large second intron. Since no microRNA (miRNA) has been identified in CDH1 up to now, we screened the CDH1 gene for promising miRNA hairpin structures in silico. Out of the 27 hairpin structures we identified, one stable RNA fold with a promising sequence motif was selected for experimental verification. The exogenous validation of the hairpin sequence was performed by transfection of HEK293T cells, and the mature miRNA sequences could be verified by quantitative polymerase chain reaction. The endogenous expression of the mature miRNA, provisionally named CDH1-i2-miR-1, could be confirmed in two normal (HEK293T, HUVEK) and five cancer cell lines (MCF7, MDA-MB-231, SW480, HT-29, A549). The functional characterization by the 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide assay showed a suppression of HEK293T cell proliferation. A flow cytometry-based approach showed the ability of CDH1-i2-miR-1 to arrest transfected cells in a G2/M state, while annexin staining indicated an apoptotic effect. BAX and PTEN expression levels were affected following overexpression of the new miRNA. The in vivo expression level was assessed in 35 breast tumor tissues and their paired nonmalignant marginal parts. A fourfold downregulation in the tumor specimens compared to their marginal controls could be observed. It can be concluded that the sequence of the hub gene CDH1 harbors at least one miRNA, and possibly more, relevant to the pathophysiology of breast cancer.
Affiliation(s)
- Sedigheh Gharbi
  - Department of Biology, Faculty of Sciences, Shahid Bahonar University of Kerman, Kerman, Iran
- Zahra Mohammadi
  - Department of Biology, Faculty of Sciences, Shahid Bahonar University of Kerman, Kerman, Iran
- Maryam Saedi Dezaki
  - Department of Biology, Faculty of Sciences, Shahid Bahonar University of Kerman, Kerman, Iran
- Sadat Dokanehiifard
  - Department of Human Genetics, Sylvester Comprehensive Cancer Center, University of Miami Miller School of Medicine, Miami, Florida, USA
- Shahriar Dabiri
  - Department of Pathology, Pathology and Stem Cell Research Center, Kerman University of Medical Sciences, Kerman, Iran
- Eberhard Korsching
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
6
Saçar Demirci MD. Computational Detection of Pre-microRNAs. Methods Mol Biol 2022; 2257:167-174. [PMID: 34432278 DOI: 10.1007/978-1-0716-1170-8_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
MicroRNA (miRNA) studies have been one of the most popular research areas in recent years. Although thousands of miRNAs have been detected in several species, the majority remains unidentified. Thus, finding novel miRNAs is a vital element for investigating miRNA mediated posttranscriptional gene regulation machineries. Furthermore, experimental methods have challenging inadequacies in their capability to detect rare miRNAs, and are also limited to the state of the organism under examination (e.g., tissue type, developmental stage, stress-disease conditions). These issues have initiated the creation of high-level computational methodologies endeavoring to distinguish potential miRNAs in silico. On the other hand, most of these tools suffer from high numbers of false positives and/or false negatives and as a result they do not provide enough confidence for validating all their predictions experimentally. In this chapter, computational difficulties in detection of pre-miRNAs are discussed and a machine learning based approach that has been designed to address these issues is reviewed.
7
Islam MS, Khan MAAK. Computational analysis revealed miRNAs produced by Chikungunya virus target genes associated with antiviral immune responses and cell cycle regulation. Comput Biol Chem 2021; 92:107462. [PMID: 33640797 DOI: 10.1016/j.compbiolchem.2021.107462] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/11/2020] [Revised: 02/11/2021] [Accepted: 02/17/2021] [Indexed: 11/18/2022]
Abstract
Chikungunya virus (CHIKV), which causes chikungunya fever, is an alphavirus of the Togaviridae family with a single-stranded RNA genome. Mosquitoes of the Aedes species act as vectors for this virus, which can be found in the blood and can be passed from an infected person to a mosquito through mosquito bites. CHIKV has drawn much attention recently because of its potential to cause an epidemic. As the detailed mechanism of its pathogenesis inside the host system is still lacking, in this in silico research we hypothesized that CHIKV might create miRNAs that target genes associated with host cellular regulatory pathways, thereby providing the virus with prolonged refuge. Using bioinformatics approaches, we found several putative miRNAs produced by CHIKV. We then predicted the host genes targeted by these miRNAs. Functional enrichment analysis of these targeted genes shows the involvement of several biological pathways regulating antiviral immune stimulation, cellular proliferation, and the cell cycle, which may provide the virus with prolonged refuge and facilitate its pathogenesis, in turn leading to disease conditions. Finally, we analyzed a publicly available microarray dataset (GSE49985) to determine the altered expression levels of the targeted genes and found genes associated with pathways such as cell differentiation, phagocytosis, T-cell activation, response to cytokine, autophagy, Toll-like receptor signaling, RIG-I-like receptor signaling and apoptosis. Our findings present novel miRNAs and their targeted genes, which upon experimental validation could facilitate the development of new therapeutics to combat CHIKV infection and minimize CHIKV-mediated diseases.
Affiliation(s)
- Md Sajedul Islam
  - Department of Biochemistry & Biotechnology, University of Barishal, Barishal, 8254, Bangladesh
8
Stegmayer G, Di Persia LE, Rubiolo M, Gerard M, Pividori M, Yones C, Bugnon LA, Rodriguez T, Raad J, Milone DH. Predicting novel microRNA: a comprehensive comparison of machine learning approaches. Brief Bioinform 2020; 20:1607-1620. [PMID: 29800232 DOI: 10.1093/bib/bby037] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 10/24/2017] [Revised: 03/26/2018] [Indexed: 12/25/2022]
Abstract
MOTIVATION: The importance of microRNAs (miRNAs) is widely recognized in the community nowadays because these short segments of RNA can play several roles in almost all biological processes. The computational prediction of novel miRNAs involves training a classifier for identifying sequences having the highest chance of being precursors of miRNAs (pre-miRNAs). The big issue with this task is that well-known pre-miRNAs are usually few in comparison with the hundreds of thousands of candidate sequences in a genome, which results in high class imbalance. This imbalance has a strong influence on most standard classifiers, and if it is not properly addressed in the model and the experiments, not only can the reported performance be completely unrealistic, but the classifier will also not work properly for pre-miRNA prediction. Another important issue is that most of the machine learning (ML) approaches already used (supervised methods) require both positive and negative examples. The selection of positive examples is straightforward (well-known pre-miRNAs). However, it is difficult to build a representative set of negative examples because they should be sequences with hairpin structure that do not contain a pre-miRNA. RESULTS: This review provides a comprehensive study and comparative assessment of methods from two ML approaches for dealing with the prediction of novel pre-miRNAs: supervised and unsupervised training. We present and analyze the ML proposals that have appeared during the past 10 years in literature. They have been compared in several prediction tasks involving two model genomes and increasing imbalance levels. This work provides a review of existing ML approaches for pre-miRNA prediction and fair comparisons of the classifiers with the same features and data sets, instead of just a review of published software tools.
The results and the discussion can help the community to select the most adequate bioinformatics approach according to the prediction task at hand. The comparative results obtained suggest that from low to mid-imbalance levels between classes, supervised methods can be the best. However, at very high imbalance levels, closer to real case scenarios, models including unsupervised and deep learning can provide better performance.
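The point about unrealistic reported performance under class imbalance can be made concrete with a small worked example. The counts below are hypothetical and not taken from the review: 100 true pre-miRNAs among 100,000 other hairpins, scored by a classifier assumed to have 90% sensitivity and 99% specificity. Accuracy comes out near 99%, while precision is only about 8%, because false positives outnumber true positives by roughly eleven to one.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 positives (90 found, 10 missed) and
    # 100,000 negatives (1,000 wrongly flagged, 99,000 correctly rejected).
    metrics = classification_metrics(tp=90, fp=1000, tn=99000, fn=10)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

Precision, recall and F1 are therefore more informative than accuracy when candidate hairpins vastly outnumber known pre-miRNAs.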
Affiliation(s)
- Georgina Stegmayer
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Leandro E Di Persia
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Mariano Rubiolo
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Matias Gerard
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Milton Pividori
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Cristian Yones
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Leandro A Bugnon
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Tadeo Rodriguez
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Jonathan Raad
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Diego H Milone
  - sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
9
Khan MAAK, Sany MRU, Islam MS, Islam ABMMK. Epigenetic Regulator miRNA Pattern Differences Among SARS-CoV, SARS-CoV-2, and SARS-CoV-2 World-Wide Isolates Delineated the Mystery Behind the Epic Pathogenicity and Distinct Clinical Characteristics of Pandemic COVID-19. Front Genet 2020; 11:765. [PMID: 32765592 PMCID: PMC7381279 DOI: 10.3389/fgene.2020.00765] [Citation(s) in RCA: 122] [Impact Index Per Article: 30.5] [Received: 04/06/2020] [Accepted: 06/29/2020] [Indexed: 12/13/2022]
Abstract
A detailed understanding of the molecular mechanism of SARS-CoV-2 pathogenesis is still elusive, and there is a need to address its deadly nature and to design effective therapeutics. Here, we present a study that elucidates the interplay between the SARS-CoV and SARS-CoV-2 viruses' and host's miRNAs, an epigenetic regulator, as a mode of pathogenesis; and we explored how the SARS-CoV and SARS-CoV-2 infections differ in terms of their miRNA-mediated interactions with the host and the implications this has in terms of disease complexity. We have utilized computational approaches to predict potential host and viral miRNAs and their possible roles in different important functional pathways. We have identified several putative host antiviral miRNAs that can target the SARS viruses and also predicted SARS viruses-encoded miRNAs targeting host genes. In silico predicted targets were also integrated with SARS-infected human cell microarray and RNA-seq gene expression data. A comparison between the host miRNA binding profiles on 67 different SARS-CoV-2 genomes from 24 different countries with respective country's normalized death count surprisingly uncovered some miRNA clusters, which are associated with increased death rates. We have found that induced cellular miRNAs can be both a boon and a bane to the host immunity, as they have possible roles in neutralizing the viral threat; conversely, they can also function as proviral factors. On the other hand, from over representation analysis, our study revealed that although both SARS-CoV and SARS-CoV-2 viral miRNAs could target broad immune-signaling pathways; only some of the SARS-CoV-2 miRNAs are found to uniquely target some immune-signaling pathways, such as autophagy, IFN-I signaling, etc., which might suggest their immune-escape mechanisms for prolonged latency inside some hosts without any symptoms of COVID-19. 
Furthermore, SARS-CoV-2 can modulate several important cellular pathways that might lead to the increased anomalies in patients with comorbidities like cardiovascular diseases, diabetes, breathing complications, etc. This might suggest that miRNAs can be a key epigenetic modulator behind the overcomplications amongst the COVID-19 patients. Our results support that miRNAs of host and SARS-CoV-2 can indeed play a role in the pathogenesis which can be further concluded with more experiments. These results will also be useful in designing RNA therapeutics to alleviate the complications from COVID-19.
Affiliation(s)
- Md Rabi Us Sany
  - Department of Genetic Engineering & Biotechnology, University of Dhaka, Dhaka, Bangladesh
- Md Shafiqul Islam
  - Department of Genetic Engineering & Biotechnology, University of Dhaka, Dhaka, Bangladesh
10
Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform 2020; 20:1280-1294. [PMID: 29272359 DOI: 10.1093/bib/bbx165] [Citation(s) in RCA: 188] [Impact Index Per Article: 47.0] [Received: 09/08/2017] [Revised: 11/08/2017] [Indexed: 01/07/2023]
Abstract
With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems is how to computationally analyze their structures and functions. Machine learning techniques are playing key roles in this field. Typically, predictors based on machine learning techniques involve three main steps: feature extraction, predictor construction and performance evaluation. Although several Web servers and stand-alone tools have been developed to facilitate biological sequence analysis, they focus only on individual steps. In this study, a powerful Web server called BioSeq-Analysis (http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/) has been proposed to automatically complete the three main steps for constructing a predictor. The user only needs to upload the benchmark data set. BioSeq-Analysis can generate the optimized predictor based on the benchmark data set, and the performance measures can be reported as well. Furthermore, to maximize user convenience, a stand-alone program was also released, which can be downloaded from http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/download/ and can be run directly on Windows, Linux and UNIX. Applied to three sequence analysis tasks, experimental results showed that the predictors generated by BioSeq-Analysis even outperformed some state-of-the-art methods. It is anticipated that BioSeq-Analysis will become a useful tool for biological sequence analysis.
11
Liu B, Chen S, Yan K, Weng F. iRO-PsekGCC: Identify DNA Replication Origins Based on Pseudo k-Tuple GC Composition. Front Genet 2019; 10:842. [PMID: 31620165 PMCID: PMC6759546 DOI: 10.3389/fgene.2019.00842] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 07/05/2019] [Accepted: 08/13/2019] [Indexed: 11/22/2022]
Abstract
Identification of replication origins plays a key role in understanding the mechanism of DNA replication. This task is of great significance in DNA sequence analysis. Because of its importance, some computational approaches have been introduced. Among these predictors, the iRO-3wPseKNC predictor is the first discriminative method that is able to correctly identify entire replication origins. To further improve its predictive performance, in this study we proposed the Pseudo k-tuple GC Composition (PsekGCC) approach to capture the "GC asymmetry bias" of yeast species by considering both the GC skew and the sequence order effects of k-tuple GC Composition (k-GCC). Based on PsekGCC, we proposed a new predictor called iRO-PsekGCC to identify DNA replication origins. A rigorous jackknife test on two yeast species benchmark datasets (Saccharomyces cerevisiae, Pichia pastoris) indicated that iRO-PsekGCC outperformed iRO-3wPseKNC. It can be anticipated that iRO-PsekGCC will be a useful tool for DNA replication origin identification. Availability and implementation: The web-server for the iRO-PsekGCC predictor was established, and it can be accessed at http://bliulab.net/iRO-PsekGCC/.
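The GC skew mentioned above is commonly defined over a sequence window as (G - C) / (G + C). The sketch below computes it in non-overlapping or sliding windows. It illustrates only this standard quantity, not the PsekGCC feature itself, whose exact definition is not given in the abstract; the function names and window parameters are assumptions.

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for a DNA string, or 0.0 if it has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def windowed_gc_skew(seq, window, step):
    """Return the GC skew of each full-length window along the sequence."""
    return [
        gc_skew(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Made-up sequence: a G-rich stretch followed by a C-rich stretch.
    example = "GGGGATGGCCCCATCC"
    print(windowed_gc_skew(example, window=8, step=4))
```

A sign change in the windowed skew along a sequence is the kind of strand asymmetry signal that origin predictors can exploit.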
Affiliation(s)
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
- Shengyu Chen
  - School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, IN, United States
- Ke Yan
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Fan Weng
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
12
Computational Resources for Prediction and Analysis of Functional miRNA and Their Targetome. Methods Mol Biol 2019; 1912:215-250. [PMID: 30635896 DOI: 10.1007/978-1-4939-8982-9_9] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 12/14/2022]
Abstract
MicroRNAs are evolutionarily conserved, endogenously produced, noncoding RNAs (ncRNAs) of approximately 19-24 nucleotides (nts) in length, known to silence complementary target sequences. Their deregulated expression is reported in various disease conditions and thus has therapeutic implications. In the last decade, various computational resources have been published in this field. In this chapter, we have reviewed bioinformatics resources, i.e., miRNA-centered databases, algorithms, and tools to predict miRNA targets. The first section lists more than 75 databases, which mainly cover information regarding miRNA registries, targets, disease associations, differential expression, interactions with other noncoding RNAs, and all-in-one resources. In the algorithms section, we have compiled about 140 algorithms from eight subcategories, viz. for the prediction of precursor (pre-) and mature miRNAs. These algorithms are developed on various sequence, structure, and thermodynamic features incorporated into different machine learning techniques (MLTs). In addition, computational identification of miRNAs from high-throughput next generation sequencing (NGS) data and their variants, viz. isomiRs, differential expression, miR-SNPs, and functional annotation, are discussed. Prediction and analysis of miRNAs and their associated targets are also evaluated in the miR-targets section, providing knowledge regarding novel miRNA targets and complex host-pathogen interactions. In conclusion, we have provided a comprehensive review of in silico resources published in miRNA research to help the scientific community stay up to date and choose the appropriate tool according to their needs.
|
13
|
Abstract
Background: DNA-binding proteins, which bind to DNA, exist widely in living cells and participate in many DNA-related cell activities, for instance DNA replication, transcription, recombination, and DNA repair. Objective: Given the importance of DNA-binding proteins, their prediction has been a popular research topic over the past decades. In this article, we review current machine-learning methods for the prediction of DNA-binding proteins in terms of feature representation methods, classifiers, measurements, datasets and existing web servers. Method: Prediction methods for DNA-binding proteins can be divided into two types: those based on amino acid composition and those based on protein structure. In this article, we follow these two types in introducing the application of machine learning to DNA-binding protein prediction. Results: Machine learning plays an important role in the classification of DNA-binding proteins. The best ACC is above 80%. Conclusion: Machine learning can be widely used in many aspects of biological information, especially in protein classification. Some issues should be considered in future work. First, the relationship between the number of features and performance must be explored. Second, because many features are used to predict DNA-binding proteins, solutions for high-dimensional feature spaces should be proposed.
Affiliation(s)
- Kaiyang Qu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
- College of Intelligence and Computing, Tianjin University, Tianjin, China
|
14
|
Yang W, Zhu XJ, Huang J, Ding H, Lin H. A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization. Curr Bioinform 2019. [DOI: 10.2174/1574893613666181113131415] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Indexed: 01/08/2023]
Abstract
Background: The location of proteins in a cell can provide important clues to their functions in various biological processes. Thus, the application of machine learning methods to the prediction of protein subcellular localization has become a hotspot in bioinformatics. As one of the key organelles, the Golgi apparatus is in charge of protein storage, packaging, and distribution. Objective: Identifying the location of proteins in the Golgi apparatus will provide in-depth insights into their functions. Thus, machine learning-based methods for predicting protein location in the Golgi apparatus have been extensively explored. The development of protein sub-Golgi apparatus localization prediction should be reviewed to provide a complete background for the field. Method: The benchmark datasets, feature extraction, machine learning methods and published results were summarized. Results: We briefly introduced the recent progress in protein sub-Golgi apparatus localization prediction using machine learning methods and discussed the advantages and disadvantages of these methods. Conclusion: We pointed out the prospects of machine learning methods in protein sub-Golgi localization prediction.
Affiliation(s)
- Wuritu Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Xiao-Juan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Jian Huang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
|
15
|
Fu X, Zhu W, Cai L, Liao B, Peng L, Chen Y, Yang J. Improved Pre-miRNAs Identification Through Mutual Information of Pre-miRNA Sequences and Structures. Front Genet 2019; 10:119. [PMID: 30858864 PMCID: PMC6397858 DOI: 10.3389/fgene.2019.00119] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 10/15/2018] [Accepted: 02/04/2019] [Indexed: 11/30/2022] Open
Abstract
Playing critical roles as post-transcriptional regulators, microRNAs (miRNAs) are a family of short non-coding RNAs that are derived from longer transcripts called precursor miRNAs (pre-miRNAs). Experimental methods to identify pre-miRNAs are expensive and time-consuming, which presents the need for computational alternatives. In recent years, the accuracy of computational methods to predict pre-miRNAs has been increasing significantly. However, there are still several drawbacks. First, these methods usually only consider base frequencies or sequence information while ignoring the information between bases. Second, feature extraction methods based on secondary structures usually only consider the global characteristics while ignoring the mutual influence of the local structures. Third, methods integrating high-dimensional feature information are computationally inefficient. In this study, we have proposed a novel mutual information-based feature representation algorithm for pre-miRNA sequences and secondary structures, which is capable of capturing the interactions between sequence bases and local features of the RNA secondary structure. In addition, the feature space is smaller than that of most popular methods, which makes our method computationally more efficient than its competitors. Finally, we applied these features to train a support vector machine model to predict pre-miRNAs and compared the results with other popular predictors. As a result, our method outperforms others based on both 5-fold cross-validation and the Jackknife test.
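To make the idea of "information between bases" concrete, the following is a minimal sketch of one possible mutual-information feature: the mutual information between each base and its right-hand neighbour in a single sequence. It is an illustration only, not the feature set of the paper; the function name and the restriction to adjacent bases of the primary sequence (ignoring secondary structure) are assumptions.

```python
import math
from collections import Counter


def adjacent_base_mutual_information(seq: str) -> float:
    """Mutual information (in bits) between a base and its right-hand neighbour.

    Hypothetical illustration: only adjacent positions of the primary
    sequence are considered; secondary structure is not used.
    """
    seq = seq.upper().replace("T", "U")
    pairs = [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]
    if not pairs:
        return 0.0

    n = len(pairs)
    pair_counts = Counter(pairs)
    left_counts = Counter(a for a, _ in pairs)    # marginal of the first base
    right_counts = Counter(b for _, b in pairs)   # marginal of the second base

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = left_counts[a] / n
        p_b = right_counts[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


if __name__ == "__main__":
    # A strictly alternating sequence: each base predicts its neighbour.
    print(adjacent_base_mutual_information("AUAUAUAU"))
    # A homopolymer carries no information between neighbours.
    print(adjacent_base_mutual_information("AAAA"))
```

A single scalar like this would normally be one entry in a larger feature vector passed to a classifier such as a support vector machine.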
Affiliation(s)
- Xiangzheng Fu
- College of Information Science and Engineering, Hunan University, Changsha, China
- Wen Zhu
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Lijun Cai
- College of Information Science and Engineering, Hunan University, Changsha, China
- Bo Liao
- College of Information Science and Engineering, Hunan University, Changsha, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Lihong Peng
- School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Yifan Chen
- College of Information Science and Engineering, Hunan University, Changsha, China
- Jialiang Yang
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, United States
|
16
|
Xu L, Liang G, Liao C, Chen GD, Chang CC. k-Skip-n-Gram-RF: A Random Forest Based Method for Alzheimer's Disease Protein Identification. Front Genet 2019; 10:33. [PMID: 30809242 PMCID: PMC6379451 DOI: 10.3389/fgene.2019.00033] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 10/10/2018] [Accepted: 01/17/2019] [Indexed: 11/18/2022] Open
Abstract
In this paper, a computational method based on machine learning for identifying Alzheimer's disease genes is proposed. Most existing machine learning based methods predict Alzheimer's disease genes using structural magnetic resonance imaging (MRI). Most of these methods attain acceptable results, but they are expensive and time consuming. Thus, we propose a computational method that identifies Alzheimer's disease genes from the sequence information of proteins and classifies the feature vectors with a random forest. In the proposed method, the protein information is extracted by adaptive k-skip-n-gram features. Experimental results show that the proposed method attains an accuracy of 85.5% on the selected UniProt dataset.
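The abstract does not define the adaptive k-skip-n-gram features, so the following is only a rough sketch of the general skip-gram idea for n = 2: residue pairs separated by 0 to k skipped positions are counted and normalized into a 400-dimensional vector. The function name, the fixed n = 2, and the normalization are assumptions, not the authors' exact formulation.

```python
from collections import Counter
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def k_skip_bigram_features(seq: str, k: int) -> List[float]:
    """Normalized counts of residue pairs with 0..k residues skipped between them.

    Hypothetical illustration of a k-skip-2-gram; the result has
    20 * 20 = 400 entries ordered AA, AC, AD, ..., YY.
    """
    counts = Counter()
    total = 0
    for gap in range(k + 1):
        for i in range(len(seq) - gap - 1):
            a, b = seq[i], seq[i + gap + 1]
            # Ignore pairs containing non-standard residue codes.
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                counts[a + b] += 1
                total += 1
    return [
        counts[a + b] / total if total else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


if __name__ == "__main__":
    features = k_skip_bigram_features("ACDACD", k=1)
    print(len(features), sum(features))
```

A vector of this kind could then be passed to any standard classifier, for example a random forest, as the abstract describes.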
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Guangmin Liang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Changrui Liao
- Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen, China
- Gin-Den Chen
- Department of Obstetrics and Gynecology, Chung Shan Medical University Hospital, Taichung, Taiwan
- Chi-Chang Chang
- School of Medical Informatics, Chung Shan Medical University, Taichung, Taiwan
- IT Office, Chung Shan Medical University Hospital, Taichung, Taiwan
|
17
|
Saçar Demirci MD, Yousef M, Allmer J. Computational Prediction of Functional MicroRNA-mRNA Interactions. Methods Mol Biol 2019; 1912:175-196. [PMID: 30635894 DOI: 10.1007/978-1-4939-8982-9_7] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/20/2022]
Abstract
Proteins have a strong influence on the phenotype and their aberrant expression leads to diseases. MicroRNAs (miRNAs) are short RNA sequences which posttranscriptionally regulate protein expression. This regulation is driven by miRNAs acting as recognition sequences for their target mRNAs within a larger regulatory machinery. A miRNA can have many target mRNAs and an mRNA can be targeted by many miRNAs which makes it difficult to experimentally discover all miRNA-mRNA interactions. Therefore, computational methods have been developed for miRNA detection and miRNA target prediction. An abundance of available computational tools makes selection difficult. Additionally, interactions are not currently the focus of investigation although they more accurately define the regulation than pre-miRNA detection or target prediction could perform alone. We define an interaction including the miRNA source and the mRNA target. We present computational methods allowing the investigation of these interactions as well as how they can be used to extend regulatory pathways. Finally, we present a list of points that should be taken into account when investigating miRNA-mRNA interactions. In the future, this may lead to better understanding of functional interactions which may pave the way for disease marker discovery and design of miRNA-based drugs.
Affiliation(s)
- Malik Yousef
- Department of Community Information Systems, Zefat Academic College, Zefat, Israel
- Jens Allmer
- Applied Bioinformatics, Bioscience, Wageningen University & Research, Wageningen, The Netherlands.
|
18
|
Fu X, Liao B, Zhu W, Cai L. New 3D graphical representation for RNA structure analysis and its application in the pre-miRNA identification of plants. RSC Adv 2018; 8:30833-30841. [PMID: 35548744 PMCID: PMC9085476 DOI: 10.1039/c8ra04138e] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/15/2018] [Accepted: 08/24/2018] [Indexed: 11/26/2022] Open
Abstract
MicroRNAs (miRNAs) are a family of short non-coding RNAs that play significant roles as post-transcriptional regulators. Consequently, various methods have been proposed to identify precursor miRNAs (pre-miRNAs), among which the comparative studies of miRNA structures are the most important. To measure and classify the structural similarity of miRNAs, we propose a new three-dimensional (3D) graphical representation of the secondary structure of miRNAs, in which an miRNA secondary structure is initially transformed into a characteristic sequence based on physicochemical properties and frequency of base. A numerical characterization of the 3D graph is used to represent the miRNA secondary structure. We then utilize a novel Euclidean distance method based on this expression to compute the distance of different miRNA sequences for the sequence similarity analysis. Finally, we use this sequence similarity analysis method to identify plant pre-miRNAs among three commonly used datasets. Results show that the method is reasonable and effective.
Affiliation(s)
- Xiangzheng Fu
- College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
- Bo Liao
- College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
- Wen Zhu
- College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
- Lijun Cai
- College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
|
19
|
Rorbach G, Unold O, Konopka BM. Distinguishing mirtrons from canonical miRNAs with data exploration and machine learning methods. Sci Rep 2018; 8:7560. [PMID: 29765080 PMCID: PMC5953923 DOI: 10.1038/s41598-018-25578-3] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 11/22/2017] [Accepted: 04/13/2018] [Indexed: 12/13/2022] Open
Abstract
Mirtrons are non-canonical microRNAs encoded in introns, the biogenesis of which starts with splicing. They are not processed by Drosha and enter the canonical pathway at the Exportin-5 level. Mirtrons are much less evolutionarily conserved than canonical miRNAs. Due to these differences, canonical miRNA predictors are not applicable to mirtron prediction. Identification of the differences is important for designing mirtron prediction algorithms and may help to improve the understanding of mirtron functioning. So far, only simple, single-feature comparisons have been reported. These are insensitive to complex feature relations. We quantified miRNAs with 25 features and showed that it is impossible to distinguish the two miRNA species using simple thresholds on any single feature. However, when using Principal Component Analysis, mirtrons and canonical miRNAs are grouped separately. Moreover, several methodologically diverse machine learning classifiers delivered high classification performance. Using feature selection algorithms we found features (e.g. bulges in the stem region), previously reported as divergent between the two classes, that did not contribute to improving classification accuracy, which suggests that they are not biologically meaningful. Finally, we proposed a combination of the most important features (including Guanine content, hairpin free energy and hairpin length) which convey a specific pattern, crucial for identifying mirtrons.
Affiliation(s)
- Grzegorz Rorbach
- Department of Computer Engineering, Faculty of Electronics, Wroclaw University of Science and Technology, Wroclaw, Poland
- Olgierd Unold
- Department of Computer Engineering, Faculty of Electronics, Wroclaw University of Science and Technology, Wroclaw, Poland
- Bogumil M Konopka
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland.
|
20
|
Which statistical significance test best detects oncomiRNAs in cancer tissues? An exploratory analysis. Oncotarget 2018; 7:85613-85623. [PMID: 27784000 PMCID: PMC5356763 DOI: 10.18632/oncotarget.12828] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 09/24/2016] [Accepted: 10/14/2016] [Indexed: 01/09/2023] Open
Abstract
MicroRNAs (miRNAs) often exert their oncogenic and tumor suppressor functions by suppressing the expression of protein-coding genes in cancers and thus have a strong association with cancer generation, development and metastasis. Through comprehensively understanding differentially expressed miRNAs (oncomiRNAs) in tumor tissues, we can elucidate the underlying molecular mechanisms of tumorigenesis and develop novel strategies for cancer diagnosis and treatment. The differential expression of miRNAs can now be analyzed through numerous statistical significance tests based on different principles, which are also available in various R packages. However, the results can be notably different. In this study, we compared miRNAs obtained from 6 common significance tests/R packages (t-test, Limma, DESeq, edgeR, LRT and MARS) with the miRNAs archived in two databases: the HMDD 2.0 database, which collects experimentally validated differentially expressed miRNAs, and the Infer microRNA-disease association database, which contains potential disease-associated miRNAs obtained by network forecasting. Finally, we found that the MARS method in the DEGseq package identified differentially expressed miRNAs more effectively than the other common methods.
|
21
|
Wei L, Tang J, Zou Q. SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides. BMC Genomics 2017. [PMID: 29513192 PMCID: PMC5657092 DOI: 10.1186/s12864-017-4128-1] [Citation(s) in RCA: 76] [Impact Index Per Article: 10.9] [Indexed: 02/08/2023] Open
Abstract
Background: Cell-penetrating peptides (CPPs) are short peptides (5–30 amino acids) that can enter almost any cell without significant damage. On account of their high delivery efficiency, CPPs are promising candidates for gene therapy and cancer treatment. Accordingly, techniques that correctly predict CPPs are anticipated to accelerate CPP applications in future therapeutics. Recently, computational methods have reportedly been successful in predicting CPPs. Unfortunately, the predictive performance of existing methods is neither satisfactory nor reliable enough to identify CPPs accurately. Results: In this study, we propose a novel computational predictor called SkipCPP-Pred to further improve the predictive performance. The novelty of the proposed predictor is a sequence-based feature representation algorithm called adaptive k-skip-n-gram that sufficiently captures the intrinsic correlation information of residues. By fusing the proposed adaptive skip features with a random forest (RF) classifier, we successfully construct the prediction model of SkipCPP-Pred. The jackknife results demonstrate that the accuracy of the proposed SkipCPP-Pred is 3.6% higher than that of state-of-the-art CPP predictors. Moreover, we construct a high-quality benchmark dataset by reducing the data redundancy and enhancing the similarity between the positive and negative classes. Using this dataset to build prediction models, we can successfully avoid the performance bias present in existing methods and yield a promising predictive model. Conclusions: The proposed SkipCPP-Pred is a simple and fast sequence-based predictor featuring the adaptive k-skip-n-gram model for the improved prediction of CPPs. Currently, SkipCPP-Pred is publicly available from an online webserver (http://server.malab.cn/SkipCPP-Pred/Index.html).
Electronic supplementary material: The online version of this article (10.1186/s12864-017-4128-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Leyi Wei
- School of Computer Science and Technology, Tianjin University, Tianjin, 30050, China; State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, 300074, China
- Jijun Tang
- School of Computer Science and Technology, Tianjin University, Tianjin, 30050, China
- Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, 30050, China.
|
22
|
Saçar Demirci MD, Baumbach J, Allmer J. On the performance of pre-microRNA detection algorithms. Nat Commun 2017; 8:330. [PMID: 28839141 PMCID: PMC5571158 DOI: 10.1038/s41467-017-00403-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 05/29/2017] [Accepted: 06/23/2017] [Indexed: 01/31/2023] Open
Abstract
MicroRNAs are crucial for post-transcriptional gene regulation, and their dysregulation has been associated with diseases like cancer and, therefore, their analysis has become popular. The experimental discovery of miRNAs is cumbersome and, thus, many computational tools have been proposed. Here we assess 13 ab initio pre-miRNA detection approaches using all relevant, published, and novel data sets while judging algorithm performance based on ten intrinsic performance measures. We present an extensible framework, izMiR, which allows for the unbiased comparison of existing algorithms, adding new ones, and combining multiple approaches into ensemble methods. In an exhaustive attempt, we condense the results of millions of computations and show that no method is clearly superior; however, we provide a guideline for biomedical researchers to select a tool. Finally, we demonstrate that combining all of the methods into one ensemble approach, for the first time, allows reliable purely computational pre-miRNA detection in large eukaryotic genomes. As the experimental discovery of microRNAs (miRNAs) is cumbersome, computational tools have been developed for the prediction of pre-miRNAs. Here the authors develop a framework to assess the performance of existing and novel pre-miRNA prediction tools and provide guidelines for selecting an appropriate approach for a given data set.
Affiliation(s)
- Jan Baumbach
- Computational Systems Biology, Max Planck Institute for Informatics, 66123, Saarbrücken, Germany.
- Computational Biology, University of Southern Denmark, DK-5230, Odense M, Denmark.
- Jens Allmer
- Molecular Biology and Genetics, Izmir Institute of Technology, Urla, Izmir, 35430, Turkey
- Bionia Incorporated, IZTEKGEB A8, Urla, Izmir, 35430, Turkey
|
23
|
Barman RK, Mukhopadhyay A, Das S. An improved method for identification of small non-coding RNAs in bacteria using support vector machine. Sci Rep 2017; 7:46070. [PMID: 28383059 PMCID: PMC5382675 DOI: 10.1038/srep46070] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 09/22/2016] [Accepted: 03/08/2017] [Indexed: 12/25/2022] Open
Abstract
Bacterial small non-coding RNAs (sRNAs) are not translated into proteins, but act as functional RNAs. They are involved in diverse biological processes like virulence, stress response and quorum sensing. Several high-throughput techniques have enabled identification of sRNAs in bacteria, but experimental detection remains a challenge and is grossly incomplete for most species. Thus, there is a need to develop computational tools to predict bacterial sRNAs. Here, we propose a computational method to identify sRNAs in bacteria using a support vector machine (SVM) classifier. The primary sequence and secondary structure features of experimentally-validated sRNAs of Salmonella Typhimurium LT2 (SLT2) were used to build the optimal SVM model. We found that a tri-nucleotide composition feature of sRNAs achieved an accuracy of 88.35% for SLT2. We also validated the SVM model on the experimentally-detected sRNAs of E. coli and Salmonella Typhi. The proposed model attained an accuracy of 81.25% and 88.82% for E. coli K-12 and S. Typhi Ty2, respectively. We confirmed that this method significantly improved the identification of sRNAs in bacteria. Furthermore, we used a sliding window-based method and identified sRNAs from the complete genomes of SLT2, S. Typhi Ty2 and E. coli K-12 with sensitivities of 89.09%, 83.33% and 67.39%, respectively.
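The two sequence-level ideas in this abstract, tri-nucleotide composition and a sliding-window genome scan, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function names are hypothetical, and the window and step sizes are placeholders because the abstract does not give them.

```python
from itertools import product

# All 64 trinucleotides in a fixed order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(seq):
    """Return the 64 overlapping-trinucleotide frequencies of a sequence."""
    seq = seq.upper().replace("U", "T")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows containing ambiguous bases such as N
            counts[tri] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TRINUCLEOTIDES]


def sliding_windows(genome, window, step):
    """Yield (start, subsequence) pairs for fixed-size windows along a genome."""
    for start in range(0, len(genome) - window + 1, step):
        yield start, genome[start:start + window]


if __name__ == "__main__":
    # Placeholder window and step sizes, chosen only for the demonstration.
    for start, sub in sliding_windows("ACGTACGTACGT", window=8, step=4):
        features = trinucleotide_composition(sub)
        print(start, sub, round(sum(features), 3))
```

In a full scan, each window's feature vector would be scored by a trained classifier (an SVM in the abstract) and high-scoring windows reported as candidate sRNAs.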
Affiliation(s)
- Ranjan Kumar Barman
- Biomedical Informatics Centre, National Institute Of Cholera and Enteric Diseases, Kolkata, West Bengal, India
- Anirban Mukhopadhyay
- Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India
- Santasabuj Das
- Biomedical Informatics Centre, National Institute Of Cholera and Enteric Diseases, Kolkata, West Bengal, India; Division of Clinical Medicine, National Institute of Cholera and Enteric Diseases, Kolkata, West Bengal, India
|
24
|
Das JK, Pal Choudhury P. Chemical property based sequence characterization of PpcA and its homolog proteins PpcB-E: A mathematical approach. PLoS One 2017; 12:e0175031. [PMID: 28362850 PMCID: PMC5376323 DOI: 10.1371/journal.pone.0175031] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 12/16/2016] [Accepted: 03/20/2017] [Indexed: 11/19/2022] Open
Abstract
Periplasmic c7 type cytochrome A (PpcA) protein is determined in Geobacter sulfurreducens along with its four other homologs (PpcB-E). From the crystal structure viewpoint, the observation emerges that PpcA protein can bind with deoxycholate (DXCA), while its homologs do not. However, the reason for this has yet to be established with certainty from primary protein sequence information. This study is based primarily on primary protein sequence analysis through the chemical basis of the embedded amino acids. Firstly, we look for the chemical group specific score of amino acids. Along with this, we have developed a new methodology for phylogenetic analysis based on chemical group dissimilarities of amino acids. This new methodology is applied to the cytochrome c7 family members and pinpoints how a particular sequence differs from the others. Secondly, we build a graph theoretic model using amino acid sequences, which is also applied to the cytochrome c7 family members, and some unique characteristics and their domains are highlighted. Thirdly, we search for unique patterns as subsequences which are common among the group or specific to an individual member. In all cases, we are able to show some distinct features that set PpcA apart from its homologs and may account for its binding with deoxycholate. Similarly, some notable features of the structurally dissimilar protein PpcD compared to the other homologs are also brought out. Further, as the five members of the cytochrome family are homologous proteins, they must share some significant features, which are also enumerated in this study.
Affiliation(s)
- Jayanta Kumar Das
- Applied Statistics Unit, Indian Statistical Institute, 203 B.T Road, Kolkata-700108, West Bengal, India
- Pabitra Pal Choudhury
- Applied Statistics Unit, Indian Statistical Institute, 203 B.T Road, Kolkata-700108, West Bengal, India
|
25
|
Saçar Demirci MD, Allmer J. Delineating the impact of machine learning elements in pre-microRNA detection. PeerJ 2017; 5:e3131. [PMID: 28367373 PMCID: PMC5374968 DOI: 10.7717/peerj.3131] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 11/10/2016] [Accepted: 02/28/2017] [Indexed: 01/06/2023] Open
Abstract
Gene regulation modulates RNA expression via transcription factors. Post-transcriptional gene regulation in turn influences the amount of protein product through, for example, microRNAs (miRNAs). Experimental establishment of miRNAs and their effects is complicated and even futile when aiming to establish the entirety of miRNA target interactions. Therefore, computational approaches have been proposed. Many such tools rely on machine learning (ML) which involves example selection, feature extraction, model training, algorithm selection, and parameter optimization. Different ML algorithms have been used for model training on various example sets, more than 1,000 features describing pre-miRNAs have been proposed and different training and testing schemes have been used for model establishment. For pre-miRNA detection, negative examples cannot easily be established causing a problem for two class classification algorithms. There is also no consensus on what ML approach works best and, therefore, we set forth and established the impact of the different parts involved in ML on model performance. Furthermore, we established two new negative datasets and analyzed the impact of using them for training and testing. It was our aim to attach an order of importance to the parts involved in ML for pre-miRNA detection, but instead we found that all parts are intricately connected and their contributions cannot be easily untangled leading us to suggest that when attempting ML-based pre-miRNA detection many scenarios need to be explored.
Affiliation(s)
- Jens Allmer
- Department of Molecular Biology and Genetics, Izmir Institute of Technology, Urla, Izmir, Turkey; Bionia Incorporated, IZTEKGEB A8, Urla, Izmir, Turkey
|
26
|
Abstract
The secondary structure of an RNA molecule represents the base-pairing interactions within the molecule and fundamentally determines its overall structure. In this chapter, we overview the main approaches and existing tools for predicting RNA secondary structures, as well as methods for identifying noncoding RNAs from genomic sequences or RNA sequencing data. We then focus on the identification of a well-known class of small noncoding RNAs, namely microRNAs, which play very important roles in many biological processes by post-transcriptionally regulating the expression of genes, and whose dysregulation has been shown to be involved in several human diseases.
Affiliation(s)
- Fariza Tahi
- IBISC, UEVE/Genopole, 23 bv. de France, 91000, Evry, France.
- IPS2, University of Paris-Saclay, 91190, Gif-sur-Yvette, France.
- Van Du T Tran
- Vital-IT group, SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Anouar Boucheham
- IBISC, UEVE/Genopole, 23 bv. de France, 91000, Evry, France
- College of NTIC, Constantine University 2, Constantine, Algeria
|
27
|
Zou Q, Wan S, Ju Y, Tang J, Zeng X. Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy. BMC Syst Biol 2016; 10:114. [PMID: 28155714 PMCID: PMC5259984 DOI: 10.1186/s12918-016-0353-5] [Citation(s) in RCA: 135] [Impact Index Per Article: 16.9] [Indexed: 02/08/2023]
Abstract
Background: It is essential to discover protein function from novel primary sequences. Wet lab experimental procedures are not only time-consuming but also costly, so reliably predicting protein structure and function based only on amino acid sequence has significant value. TATA-binding protein (TBP) is a kind of DNA binding protein that plays a key role in transcription regulation. Our study proposes an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently. This method can serve as a guide for identifying specific proteins with computational intelligence strategies. Results: Firstly, we proposed novel fingerprint features for TBP based on pseudo amino acid composition, physicochemical properties, and secondary structure. Secondly, hierarchical feature dimensionality reduction strategies were employed to further improve performance. Currently, Pretata achieves 92.92% TATA-binding protein prediction accuracy, which is better than all other existing methods. Conclusions: The experiments demonstrate that our method can greatly improve prediction accuracy and speed, thus making large-scale NGS data prediction practical. A web server has been developed for other researchers and can be accessed at http://server.malab.cn/preTata/.
Affiliation(s)
- Quan Zou
  - School of Computer Science and Technology, Tianjin University, Tianjin, China
- Shixiang Wan
  - School of Computer Science and Technology, Tianjin University, Tianjin, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, 518055, China
- Ying Ju
  - School of Information Science and Engineering, Xiamen University, Xiamen, China
- Jijun Tang
  - School of Computer Science and Technology, Tianjin University, Tianjin, China
  - School of Computational Science and Engineering, University of South Carolina, Columbia, USA
- Xiangxiang Zeng
  - School of Information Science and Engineering, Xiamen University, Xiamen, China
|
28
|
Wei L, Bowen Z, Zhiyong C, Gao X, Liao M. Exploring local discriminative information from evolutionary profiles for cytokine–receptor interaction prediction. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.02.078] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 01/08/2023]
|
29
|
DNABP: Identification of DNA-Binding Proteins Based on Feature Selection Using a Random Forest and Predicting Binding Residues. PLoS One 2016; 11:e0167345. [PMID: 27907159 PMCID: PMC5132331 DOI: 10.1371/journal.pone.0167345] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 06/28/2016] [Accepted: 11/12/2016] [Indexed: 12/24/2022] Open
Abstract
DNA-binding proteins are fundamentally important in cellular processes. Several computational-based methods have been developed to improve the prediction of DNA-binding proteins in previous years. However, insufficient work has been done on the prediction of DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using the random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect information about the conservation of physicochemical properties of the amino acids, and the binding propensity of DNA-binding residues and non-binding propensities of non-binding residues. The comparisons with each feature demonstrated that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during the model construction. The results showed that the DNABP model could achieve 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. High prediction accuracy and performance comparisons with previous research suggested that DNABP could be a useful approach to identify DNA-binding proteins from sequence information. The DNABP web server system is freely available at http://www.cbi.seu.edu.cn/DNABP/.
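The four figures quoted in this abstract (accuracy, sensitivity, specificity, Matthews correlation coefficient) all derive from one binary confusion matrix. The following is a minimal sketch of the standard definitions, not the DNABP authors' code; the function name `binary_metrics` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/tn/fp/fn are true positives, true negatives, false positives and
    false negatives. The name and signature are illustrative only.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

Because MCC uses all four cells of the matrix, it can be low even when accuracy looks high on an imbalanced dataset, which is presumably why abstracts in this list report it alongside accuracy.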
|
30
|
Liu B, Liu Y, Jin X, Wang X, Liu B. iRSpot-DACC: a computational predictor for recombination hot/cold spots identification based on dinucleotide-based auto-cross covariance. Sci Rep 2016; 6:33483. [PMID: 27641752 PMCID: PMC5027590 DOI: 10.1038/srep33483] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 03/17/2016] [Accepted: 08/25/2016] [Indexed: 01/01/2023] Open
Abstract
Meiotic recombination is unevenly distributed across the genome. Genomic regions that exhibit relatively high frequencies of recombination are called hotspots, whereas those with relatively low frequencies of recombination are called coldspots. Identifying hotspots and coldspots therefore provides useful information for studying the mechanism of recombination. In this study, we proposed a computational predictor called iRSpot-DACC to predict hot/cold spots across the yeast genome. It combines Support Vector Machines (SVMs) with a feature called dinucleotide-based auto-cross covariance (DACC), which incorporates global sequence-order information and fifteen local DNA properties into the predictor. Combined with Principal Component Analysis (PCA), its performance was further improved. Experimental results on a benchmark dataset showed that iRSpot-DACC can achieve an accuracy of 82.7%, outperforming some highly related methods.
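As a rough illustration of what a dinucleotide-based auto-cross covariance feature computes, the sketch below maps a DNA sequence to a series of per-dinucleotide property values and then takes a lagged covariance of that series. It assumes the commonly used definition (mean-centred products averaged over the L - lag - 1 available positions) and is not the iRSpot-DACC implementation. The GC-count "property" is a stand-in for illustration only, not one of the paper's fifteen properties, and all function names are hypothetical.

```python
# Placeholder property: number of G/C bases in each dinucleotide.
# A real DACC feature would use measured physicochemical properties instead.
GC_COUNT = {a + b: (a in "GC") + (b in "GC") for a in "ACGT" for b in "ACGT"}


def dinucleotide_series(seq, prop):
    """Property value of every overlapping dinucleotide in seq."""
    return [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]


def cross_covariance(x, y, lag):
    """Lagged covariance between two equal-length property series.

    With x == y this is the auto covariance (AC); with two different
    properties it is the cross covariance (CC).
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("series must have equal length")
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(x) - 1")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    terms = ((x[i] - mean_x) * (y[i + lag] - mean_y) for i in range(n - lag))
    return sum(terms) / (n - lag)


def auto_covariance(x, lag):
    """Auto covariance of one property series at the given lag."""
    return cross_covariance(x, x, lag)
```

Concatenating the AC values for each property and the CC values for each ordered pair of properties, over lags 1 up to some maximum, yields a fixed-length vector regardless of sequence length, which is what makes such features usable as SVM input.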
Affiliation(s)
- Bingquan Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150080, China
- Yumeng Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Xiaopeng Jin
  - School of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China
- Xiaolong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
|
31
|
dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Sci Rep 2016; 6:32333. [PMID: 27581095 PMCID: PMC5007510 DOI: 10.1038/srep32333] [Citation(s) in RCA: 71] [Impact Index Per Article: 8.9] [Received: 03/17/2016] [Accepted: 08/04/2016] [Indexed: 11/09/2022] Open
Abstract
Protein remote homology detection is an important task in computational proteomics. Some computational methods have been proposed, which detect remote homology proteins based on different features and algorithms. As noted in previous studies, their predictive results are complementary to each other. Therefore, it is intriguing to explore whether these methods can be combined into one package so as to further enhance the performance power and application convenience. In view of this, we introduced a protein representation called profile-based pseudo protein sequence to extract the evolutionary information from the relevant profiles. Based on the concept of pseudo proteins, a new predictor, called “dRHP-PseRA”, was developed by combining four state-of-the-art predictors (PSI-BLAST, HHblits, Hmmer, and Coma) via the rank aggregation approach. Cross-validation tests on a SCOP benchmark dataset have demonstrated that the new predictor has remarkably outperformed any of the existing methods for the same purpose on ROC50 scores. Accordingly, it is anticipated that dRHP-PseRA holds very high potential to become a useful high throughput tool for detecting remote homology proteins. For the convenience of most experimental scientists, a web-server for dRHP-PseRA has been established at http://bioinformatics.hitsz.edu.cn/dRHP-PseRA/.
|
32
|
Recombination Hotspot/Coldspot Identification Combining Three Different Pseudocomponents via an Ensemble Learning Approach. Biomed Res Int 2016; 2016:8527435. [PMID: 27648451 PMCID: PMC5015011 DOI: 10.1155/2016/8527435] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/30/2016] [Accepted: 07/11/2016] [Indexed: 11/17/2022]
Abstract
Recombination presents a nonuniform distribution across the genome. Genomic regions that present relatively higher frequencies of recombination are called hotspots while those with relatively lower frequencies of recombination are recombination coldspots. Therefore, the identification of hotspots/coldspots could provide useful information for the study of the mechanism of recombination. In this study, a new computational predictor called SVM-EL was proposed to identify hotspots/coldspots across the yeast genome. It combined Support Vector Machines (SVMs) and Ensemble Learning (EL) based on three features including basic kmer (Kmer), dinucleotide-based auto-cross covariance (DACC), and pseudo dinucleotide composition (PseDNC). These features are able to incorporate the nucleic acid composition and their order information into the predictor. The proposed SVM-EL achieves an accuracy of 82.89% on a widely used benchmark dataset, which outperforms some related methods.
|
33
|
Identifying the Types of Ion Channel-Targeted Conotoxins by Incorporating New Properties of Residues into Pseudo Amino Acid Composition. Biomed Res Int 2016; 2016:3981478. [PMID: 27631006 PMCID: PMC5008028 DOI: 10.1155/2016/3981478] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 07/13/2016] [Accepted: 07/31/2016] [Indexed: 12/31/2022]
Abstract
Conotoxins are a kind of neurotoxin that can specifically interact with potassium, sodium, and calcium channels. They have become potential drug candidates for treating diseases such as chronic pain, epilepsy, and cardiovascular diseases. Thus, correctly identifying the types of ion channel-targeted conotoxins will provide important clues for understanding their function and finding potential drugs. Based on this consideration, we developed a new computational method to rapidly and accurately predict the types of ion channel-targeted conotoxins. Three new kinds of residue properties were proposed for use in pseudo amino acid composition to formulate conotoxin samples. The support vector machine was used as the classifier. A feature selection technique based on F-score was used to optimize the features. Jackknife cross-validated results showed that an overall accuracy of 94.6% was achieved, which is higher than other published results, demonstrating that the proposed method is superior to published methods. Hence the current method may play a complementary role to other existing methods for recognizing the types of ion channel-targeted conotoxins.
|
34
|
Identification of Secretory Proteins in Mycobacterium tuberculosis Using Pseudo Amino Acid Composition. Biomed Res Int 2016; 2016:5413903. [PMID: 27597968 PMCID: PMC4997101 DOI: 10.1155/2016/5413903] [Citation(s) in RCA: 97] [Impact Index Per Article: 12.1] [Received: 04/23/2016] [Accepted: 07/18/2016] [Indexed: 11/17/2022]
Abstract
Tuberculosis kills millions of people every year and ranks among the most serious public health problems. Recent findings suggest that secretory proteins of Mycobacterium tuberculosis may serve the purpose of developing specific vaccines and drugs due to their antigenicity. In response to this global infectious disease, we focused on the identification of secretory proteins in Mycobacterium tuberculosis. A novel method called MycoSec was designed by incorporating g-gap dipeptide compositions into pseudo amino acid composition. An analysis of variance-based technique was applied in the process of feature selection, and a total of 374 optimal features were obtained and used for constructing the final prediction model. In the jackknife test, MycoSec yielded good performance, with an area under the receiver operating characteristic curve of 0.93, demonstrating that the proposed system is powerful and robust. For users' convenience, the web server MycoSec was established, and a manual on how to use it was provided to help users avoid unnecessary trouble.
|
35
|
In Silico Prediction of Gamma-Aminobutyric Acid Type-A Receptors Using Novel Machine-Learning-Based SVM and GBDT Approaches. Biomed Res Int 2016; 2016:2375268. [PMID: 27579307 PMCID: PMC4992803 DOI: 10.1155/2016/2375268] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 04/24/2016] [Revised: 06/08/2016] [Accepted: 06/19/2016] [Indexed: 11/17/2022]
Abstract
Gamma-aminobutyric acid type-A receptors (GABAARs) belong to the multisubunit membrane-spanning ligand-gated ion channels (LGICs), which act as the principal mediators of rapid inhibitory synaptic transmission in the human brain. Therefore, predicting the category of GABAARs from the protein amino acid sequence alone would be very helpful for the recognition and study of novel receptors. Based on the proteins' physicochemical properties and amino acid composition and position, a GABAAR classifier was first constructed using a 188-dimensional (188D) algorithm at 90% cd-hit identity and compared with pseudo-amino acid composition (PseAAC) and ProtrWeb web-based algorithms for human GABAAR proteins. Then, four classifiers, including gradient boosting decision tree (GBDT), random forest (RF), a library for support vector machine (libSVM), and k-nearest neighbor (k-NN), were compared on the dataset at the lower cd-hit identity of 40%. This work obtained the highest correctly classified rate at 96.8% and the highest specificity at 99.29%. However, the values of sensitivity, accuracy, and Matthews correlation coefficient were slightly lower than those of PseAAC and ProtrWeb; GBDT and libSVM performed slightly better than RF and k-NN on the second dataset. In conclusion, a GABAAR classifier was successfully constructed using only the protein sequence information.
|
36
|
Identification of apolipoprotein using feature selection technique. Sci Rep 2016; 6:30441. [PMID: 27443605 PMCID: PMC4957217 DOI: 10.1038/srep30441] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 04/25/2016] [Accepted: 07/01/2016] [Indexed: 12/16/2022] Open
Abstract
Apolipoprotein is a kind of protein that transports lipids through the lymphatic and circulatory systems. Abnormal expression levels of apolipoprotein always cause angiocardiopathy. Thus, correct recognition of apolipoproteins from proteomic data is crucial to understanding the cardiovascular system and to drug design. This study aims to develop a computational model to predict apolipoproteins. In the model, apolipoproteins and non-apolipoproteins were collected to form a benchmark dataset. On the basis of the dataset, we extracted the g-gap dipeptide composition information from residue sequences to formulate protein samples. To exclude redundant information or noise, an analysis of variance (ANOVA)-based feature selection technique was proposed to find the best feature subset. The support vector machine (SVM) was selected as the discrimination algorithm. Results show that a sensitivity of 96.2% and a specificity of 99.3% were achieved in five-fold cross-validation. These findings open new perspectives for improving apolipoprotein prediction by considering the specific dipeptides. We expect that these findings will help to improve drug development for angiocardiopathy.
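The g-gap dipeptide composition mentioned in this abstract can be sketched as follows, assuming the usual definition: the frequency of each ordered residue pair separated by g intervening positions, normalised by the L - g - 1 pairs a sequence of length L contains. This is an illustrative sketch, not the authors' code; the function name and the choice to skip non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(seq, g):
    """Frequencies of the 400 residue pairs separated by g positions.

    g = 0 gives the ordinary adjacent dipeptide composition. Pairs that
    contain a non-standard residue are skipped (an assumption of this sketch).
    """
    total = len(seq) - g - 1
    if g < 0 or total <= 0:
        raise ValueError("sequence too short for the requested gap")
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
    return {pair: n / total for pair, n in counts.items()}
```

Each sequence becomes a 400-dimensional vector for a given g, so a feature selection step such as the ANOVA-based ranking described above is a natural way to reduce it before training an SVM.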
|
37
|
Accurate Identification of Cancerlectins through Hybrid Machine Learning Technology. Int J Genomics 2016; 2016:7604641. [PMID: 27478823 PMCID: PMC4961832 DOI: 10.1155/2016/7604641] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 04/18/2016] [Revised: 05/24/2016] [Accepted: 06/14/2016] [Indexed: 01/03/2023] Open
Abstract
Cancerlectins are cancer-related proteins that function as lectins. They have been identified through computational identification techniques, but these techniques have sometimes failed to identify proteins because of sequence diversity among the cancerlectins. Advanced machine learning identification methods, such as support vector machine and basic sequence features (n-gram), have also been used to identify cancerlectins. In this study, various protein fingerprint features and advanced classifiers, including ensemble learning techniques, were utilized to identify this group of proteins. We improved the prediction accuracy of the original feature extraction methods and classification algorithms by more than 10% on average. Our work provides a basis for the computational identification of cancerlectins and reveals the power of hybrid machine learning techniques in computational proteomics.
|
38
|
Identification of Bacterial Cell Wall Lyases via Pseudo Amino Acid Composition. Biomed Res Int 2016; 2016:1654623. [PMID: 27437396 PMCID: PMC4942628 DOI: 10.1155/2016/1654623] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Received: 04/24/2016] [Accepted: 05/30/2016] [Indexed: 11/18/2022]
Abstract
Owing to the abuse of antibiotics, drug resistance in pathogenic bacteria is becoming more and more serious, and a more reasonable way to address this issue is needed. Because they can destroy the bacterial cell structure and then kill the infectious bacterium, bacterial cell wall lyases are suitable candidates for antibacterial agents. Thus, it is urgent to develop an accurate and efficient computational method to predict the lyases. With this in mind, in this paper a set of objective and rigorous data was collected by searching the Universal Protein Resource (the UniProt database), after which a feature selection technique based on the analysis of variance (ANOVA) was used to obtain the optimal feature subset. Finally, the support vector machine (SVM) was used to perform prediction. The jackknife cross-validated results showed that an optimal average accuracy of 84.82% was achieved, with a sensitivity of 76.47% and a specificity of 93.16%. For the convenience of other scholars, we built a free online server called Lypred. We believe that Lypred will become a practical tool for research on cell wall lyases and the development of antimicrobial agents.
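Jackknife cross-validation, which this and several other abstracts in the list report, holds out each sample once and trains on all the others. Here is a minimal sketch of that loop. The abstract's classifier is an SVM; the 1-nearest-neighbour predictor below is a deliberately simple stand-in so the example stays self-contained, and both function names are hypothetical.

```python
def jackknife_accuracy(features, labels, predict):
    """Leave-one-out accuracy.

    predict(train_x, train_y, query) must return a label for query after
    seeing only the training samples. Any classifier can be plugged in.
    """
    correct = 0
    for i in range(len(features)):
        # Train on every sample except the i-th, then test on the i-th.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier: label of the closest training sample."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x
    ]
    return train_y[distances.index(min(distances))]
```

For example, `jackknife_accuracy([[0.0], [0.1], [1.0], [1.1]], [0, 0, 1, 1], nearest_neighbor_predict)` evaluates four train/test splits, one per sample. The procedure is deterministic for a given dataset, unlike randomly partitioned k-fold validation, at the cost of training the model once per sample.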
|
39
|
Identification of polycystic ovary syndrome potential drug targets based on pathobiological similarity in the protein-protein interaction network. Oncotarget 2016; 7:37906-37919. [PMID: 27191267 PMCID: PMC5122359 DOI: 10.18632/oncotarget.9353] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 01/04/2016] [Accepted: 04/28/2016] [Indexed: 12/13/2022] Open
Abstract
Polycystic ovary syndrome (PCOS) is one of the most common endocrinological disorders in women of reproductive age. PCOS and Type 2 Diabetes (T2D) are closely linked at multiple levels and possess high pathobiological similarity. Here, we put forward a new computational approach based on pathobiological similarity to identify PCOS potential drug target modules (PPDT-Modules) and PCOS potential drug targets in the protein-protein interaction network (PPIN). From the systems level and biological background, 1 PPDT-Module and 22 PCOS potential drug targets were identified, 21 of which were verified by the literature to be associated with the pathogenesis of PCOS. Forty-two drugs targeting 13 PCOS potential drug targets have been investigated experimentally or clinically for PCOS. When evaluated on independent datasets, the whole PPDT-Module and the 22 PCOS potential drug targets could not only reveal the drug response but also distinguish between normal and disease states. Our identified PPDT-Module and PCOS potential drug targets should shed light on the treatment of PCOS, and our approach should provide valuable insights for research on the pathogenesis and drug response of other diseases.
|
40
|
Improving classification of mature microRNA by solving class imbalance problem. Sci Rep 2016; 6:25941. [PMID: 27181057 PMCID: PMC4867574 DOI: 10.1038/srep25941] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/18/2016] [Accepted: 04/22/2016] [Indexed: 11/29/2022] Open
Abstract
MicroRNAs (miRNAs) are non-coding RNAs of ~20–25 nucleotides that regulate gene expression at the post-transcriptional level. The accuracy of identifying the start site of a mature miRNA from a given pre-miRNA remains low. It is worth noting that mature miRNA prediction is a class-imbalanced problem, which also contributes to the unsatisfactory performance of existing methods. We improved the prediction accuracy of the classifier using balanced datasets and presented MatFind, which identifies 5′ mature miRNA candidates from their pre-miRNAs based on ensemble SVM classifiers built on the idea of AdaBoost. Firstly, the balanced dataset was extracted based on the K-nearest neighbor algorithm. Secondly, multiple SVM classifiers were trained in sequence using the balanced datasets based on the represented features. Finally, all SVM classifiers were combined to form the ensemble classifier. Our results on an independent testing dataset show that the proposed method is more efficient than one that does not treat the class imbalance problem. Moreover, MatFind achieves much higher classification accuracy than three other approaches. The ensemble SVM classifiers and balanced datasets can solve the class-imbalance problem and improve classifier performance for mature miRNA identification. MatFind is an accurate and fast method for 5′ mature miRNA identification.
|
41
|
Chen J, Liu B, Huang D. Protein Remote Homology Detection Based on an Ensemble Learning Approach. Biomed Res Int 2016; 2016:5813645. [PMID: 27294123 PMCID: PMC4875977 DOI: 10.1155/2016/5813645] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/29/2016] [Accepted: 02/21/2016] [Indexed: 12/15/2022]
Abstract
Protein remote homology detection is one of the central problems in bioinformatics. Although some computational methods have been proposed, the problem is still far from being solved. In this paper, an ensemble classifier for protein remote homology detection, called SVM-Ensemble, was proposed with a weighted voting strategy. SVM-Ensemble combines three basic classifiers based on different feature spaces: Kmer, ACC, and SC-PseAAC. These features consider the characteristics of proteins from various perspectives, incorporating both the sequence composition and the sequence-order information along the protein sequences. Experimental results on a widely used benchmark dataset showed that the proposed SVM-Ensemble clearly improves predictive performance for protein remote homology detection. Moreover, it achieved the best performance and outperformed other state-of-the-art methods.
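A weighted voting strategy of the kind this abstract describes can be sketched as a weighted average of the base classifiers' scores. The abstract does not say how SVM-Ensemble sets its weights or decision threshold, so the values and the function name below are purely illustrative assumptions.

```python
def weighted_vote(scores, weights, threshold=0.5):
    """Combine per-classifier positive-class scores by weighted average.

    scores:  one score in [0, 1] per base classifier.
    weights: one non-negative weight per base classifier (hypothetical here;
             they might, for instance, reflect validation performance).
    Returns the combined score and the resulting binary decision.
    """
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(w * s for w, s in zip(weights, scores)) / total
    return combined, combined >= threshold
```

With three base classifiers scoring a sample 0.9, 0.4 and 0.3 and weights 0.5, 0.3 and 0.2, the combined score is about 0.63. The ensemble therefore predicts the positive class even though two of the three classifiers lean negative, which is the practical difference from unweighted majority voting.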
Affiliation(s)
- Junjie Chen
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Bingquan Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Dong Huang
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
|
42
|
Chen W, Tang H, Lin H. MethyRNA: a web server for identification of N6-methyladenosine sites. J Biomol Struct Dyn 2016; 35:683-687. [DOI: 10.1080/07391102.2016.1157761] [Citation(s) in RCA: 74] [Impact Index Per Article: 9.3] [Indexed: 10/22/2022]
Affiliation(s)
- Wei Chen
  - Department of Physics, School of Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063009, China
- Hua Tang
  - Department of Pathophysiology, Sichuan Medical University, Luzhou 646000, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
43
|
Liu B, Wang S, Dong Q, Li S, Liu X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Trans Nanobioscience 2016; 15:328-334. [PMID: 28113908 DOI: 10.1109/tnb.2016.2555951] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Indexed: 11/10/2022]
Abstract
DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. With the rapid development of next-generation sequencing techniques, the number of protein sequences is increasing at an unprecedented rate. It is therefore necessary to develop computational methods that identify DNA-binding proteins based only on protein sequence information. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) and the auto-cross covariance transformation. The protein sequences are first converted into a profile-based protein representation, and then converted into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. The sequence-order effect can be effectively captured by this scheme. These vectors are then fed into the SVM to discriminate DNA-binding proteins from non-DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and a Matthews correlation coefficient of 0.5 in a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset show that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA-binding protein identification.
|
44
|
Modeling Dynamic Systems with Efficient Ensembles of Process-Based Models. PLoS One 2016; 11:e0153507. [PMID: 27078633 PMCID: PMC4831761 DOI: 10.1371/journal.pone.0153507] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 11/03/2015] [Accepted: 03/30/2016] [Indexed: 11/19/2022] Open
Abstract
Ensembles are a well established machine learning paradigm, leading to accurate and robust models, predominantly applied to predictive modeling tasks. Ensemble models comprise a finite set of diverse predictive models whose combined output is expected to yield an improved predictive performance as compared to an individual model. In this paper, we propose a new method for learning ensembles of process-based models of dynamic systems. The process-based modeling paradigm employs domain-specific knowledge to automatically learn models of dynamic systems from time-series observational data. Previous work has shown that ensembles based on sampling observational data (i.e., bagging and boosting), significantly improve predictive performance of process-based models. However, this improvement comes at the cost of a substantial increase of the computational time needed for learning. To address this problem, the paper proposes a method that aims at efficiently learning ensembles of process-based models, while maintaining their accurate long-term predictive performance. This is achieved by constructing ensembles with sampling domain-specific knowledge instead of sampling data. We apply the proposed method to and evaluate its performance on a set of problems of automated predictive modeling in three lake ecosystems using a library of process-based knowledge for modeling population dynamics. The experimental results identify the optimal design decisions regarding the learning algorithm. The results also show that the proposed ensembles yield significantly more accurate predictions of population dynamics as compared to individual process-based models. Finally, while their predictive performance is comparable to the one of ensembles obtained with the state-of-the-art methods of bagging and boosting, they are substantially more efficient.
|
45
|
Che Y, Ju Y, Xuan P, Long R, Xing F. Identification of Multi-Functional Enzyme with Multi-Label Classifier. PLoS One 2016; 11:e0153503. [PMID: 27078147 PMCID: PMC4831692 DOI: 10.1371/journal.pone.0153503] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 02/22/2016] [Accepted: 03/30/2016] [Indexed: 11/23/2022] Open
Abstract
Enzymes are important and effective biological catalyst proteins participating in almost all active cell processes. Identification of multi-functional enzymes is essential in understanding the function of enzymes. Machine learning methods perform better in protein structure and function prediction than traditional biological wet experiments. Thus, in this study, we explore an efficient and effective machine learning method to categorize enzymes according to their function. Multi-functional enzymes are predicted with a special machine learning strategy, namely, multi-label classifier. Sequence features are extracted from a position-specific scoring matrix with autocross-covariance transformation. Experiment results show that the proposed method obtains an accuracy rate of 94.1% in classifying six main functional classes through five cross-validation tests and outperforms state-of-the-art methods. In addition, 91.25% accuracy is achieved in multi-functional enzyme prediction, which is often ignored in other enzyme function prediction studies. The online prediction server and datasets can be accessed from the link http://server.malab.cn/MEC/.
Affiliation(s)
- Yuxin Che
  - School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
- Ying Ju
  - School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
- Ping Xuan
  - School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
- Ren Long
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Fei Xing
  - School of Aerospace Engineering, Xiamen University, Xiamen, Fujian 361005, China
|
46
|
Wang R, Xu Y, Liu B. Recombination spot identification Based on gapped k-mers. Sci Rep 2016; 6:23934. [PMID: 27030570 PMCID: PMC4814916 DOI: 10.1038/srep23934] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 09/14/2015] [Accepted: 03/16/2016] [Indexed: 12/14/2022] Open
Abstract
Recombination is crucial for biological evolution because it provides many new combinations of genetic diversity. Accurate identification of recombination spots is useful for studying DNA function. To improve prediction accuracy, researchers have proposed several computational methods for recombination spot identification. The k-mer feature is one of the most useful features for modeling the properties and function of DNA sequences. However, it suffers from an inherent limitation. If the word length k is large, the occurrence count of each k-mer is close to a binary variable, with a few k-mers present once and most k-mers absent. This usually causes a sparsity problem and reduces classification accuracy. To solve this problem, we add gaps into k-mers and introduce a new feature called the gapped k-mer (GKM) for identification of recombination spots. Using this feature, we present a new predictor called SVM-GKM, which combines gapped k-mers and a Support Vector Machine (SVM) for recombination spot identification. Experimental results on a widely used benchmark dataset show that SVM-GKM outperforms other highly related predictors. Therefore, SVM-GKM would be a powerful predictor for computational genomics.
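The gapped k-mer idea described in this abstract can be sketched as follows. This is a minimal illustration under one common definition (a window of length `l` in which only `k` positions are informative and the rest are masked); the abstract does not give the exact gap scheme used by SVM-GKM, so the function name `gapped_kmer_counts` and the parameters `l`, `k`, and `gap` are assumptions.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l=3, k=2, gap="-"):
    """Count gapped k-mers in a DNA sequence.

    Assumed scheme (not confirmed by the abstract): slide a window of
    length l over seq and, for every way of choosing k informative
    positions, mask the remaining l - k positions with the gap symbol.
    """
    if not 0 < k <= l:
        raise ValueError("need 0 < k <= l")
    masks = list(combinations(range(l), k))  # all informative-position sets
    counts = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in masks:
            kept = set(keep)
            pattern = "".join(
                ch if i in kept else gap for i, ch in enumerate(window)
            )
            counts[pattern] += 1
    return counts


# Example: each length-3 window contributes three patterns with one gap.
example_counts = gapped_kmer_counts("ACGTACGT", l=3, k=2)
```

Because masked positions match any base, many distinct full-length words share one gapped pattern, so counts are less sparse than plain k-mer counts at the same window length. The resulting count vector could then be passed to any SVM implementation.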
Affiliation(s)
- Rong Wang, School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Yong Xu, School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Bin Liu, School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
|
47
|
DephosSite: a machine learning approach for discovering phosphotase-specific dephosphorylation sites. Sci Rep 2016; 6:23510. [PMID: 27002216 PMCID: PMC4802303 DOI: 10.1038/srep23510] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 12/10/2015] [Accepted: 03/08/2016] [Indexed: 12/20/2022] Open
Abstract
Protein dephosphorylation, the inverse process of phosphorylation, plays a crucial role in a myriad of cellular processes, including the mitotic cycle, proliferation, differentiation, and cell growth. Compared with tyrosine kinase substrate and phosphorylation site prediction, there is a paucity of studies focusing on computational methods for predicting protein tyrosine phosphatase substrates and dephosphorylation sites. In this work, we developed two elegant models for predicting the substrate dephosphorylation sites of three specific phosphatases, namely, PTP1B, SHP-1, and SHP-2. The first predictor is called MGPS-DEPHOS, which is modified from the GPS (Group-based Prediction System) algorithm with an interpretable capability. The second predictor is called CKSAAP-DEPHOS, which is built by combining a support vector machine (SVM) with the composition of k-spaced amino acid pairs (CKSAAP) encoding scheme. Benchmarking experiments using jackknife cross validation and 30 repeats of 5-fold cross validation tests show that MGPS-DEPHOS and CKSAAP-DEPHOS achieved AUC values of 0.921, 0.914 and 0.912 for predicting dephosphorylation sites of the three phosphatases PTP1B, SHP-1, and SHP-2, respectively. Both methods outperformed the previously developed kNN-DEPHOS algorithm. In addition, a web server implementing our algorithms is publicly available at http://genomics.fzu.edu.cn/dephossite/ for the research community.
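The CKSAAP encoding named in this abstract can be illustrated with a short sketch. It assumes the usual formulation (for each gap size k, the frequency of every ordered amino acid pair separated by k residues); the function name `cksaap`, the `k_max` parameter, and the choice to normalise by the number of valid pairs are assumptions rather than details from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(seq, k_max=2):
    """Composition of k-spaced amino acid pairs for one peptide window.

    For each gap k in 0..k_max, count ordered pairs (seq[i], seq[i + k + 1])
    and divide by the number of pairs made of standard residues.
    Returns a dict keyed by (k, pair) with 400 * (k_max + 1) entries.
    """
    features = {}
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        n_pairs = 0
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard symbols
                counts[pair] += 1
                n_pairs += 1
        for pair, count in counts.items():
            # Assumed normalisation: fraction of valid pairs at this gap.
            features[(k, pair)] = count / n_pairs if n_pairs else 0.0
    return features


# Example: a short hypothetical peptide window around a candidate site.
window_features = cksaap("ACDEYGHIK", k_max=2)
```

Flattening the dictionary values in a fixed key order gives a numeric vector that could serve as input to an SVM, which is the combination the abstract describes.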
|
48
|
Liu B, Fang L. WITHDRAWN: Identification of microRNA precursor based on gapped n-tuple structure status composition kernel. Comput Biol Chem 2016:S1476-9271(16)30036-6. [PMID: 26935400 DOI: 10.1016/j.compbiolchem.2016.02.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2016] [Accepted: 02/01/2016] [Indexed: 10/22/2022]
Abstract
This article has been withdrawn at the request of the author(s) and/or editor. The Publisher apologizes for any inconvenience this may cause. The full Elsevier Policy on Article Withdrawal can be found at http://www.elsevier.com/locate/withdrawalpolicy.
Affiliation(s)
- Bin Liu, School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China; Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Longyun Fang, School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
|