1
Möbus L, Serra A, Fratello M, Pavel A, Federico A, Greco D. A Multi-Dimensional Approach to Map Disease Relationships Challenges Classical Disease Views. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2024:e2401754. [PMID: 38840452 DOI: 10.1002/advs.202401754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 04/05/2024] [Indexed: 06/07/2024]
Abstract
The categorization of human diseases is mainly based on the affected organ system and phenotypic characteristics. This limits the view to pathological manifestations and neglects mechanistic relationships that are crucial for developing therapeutic strategies. This work aims to advance the understanding of diseases and their relatedness beyond traditional phenotypic views. To this end, the similarity among 502 diseases is mapped using six different data dimensions encompassing molecular, clinical, and pharmacological information retrieved from public sources. Multiple distance measures and multi-view clustering are used to assess the patterns of disease relatedness. The integration of all six dimensions into a consensus map of disease relationships reveals a disease view that diverges from the International Classification of Diseases (ICD), emphasizing the novel insights offered by a multi-view disease map. Disease features such as genes, pathways, and chemicals that are enriched in distinct disease groups are identified. Finally, an evaluation of the most similar diseases for three candidate diseases common in the Western population shows concordance with known epidemiological associations and reveals rare features shared between Type 2 diabetes (T2D) and Alzheimer's disease. A revision of disease relationships holds promise for facilitating the reconstruction of comorbidity patterns, repurposing drugs, and advancing drug discovery in the future.
Affiliation(s)
- Lena Möbus
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Faculty of Medicine and Health Technology, Tampere University, Tampere, 33520, Finland
- Angela Serra
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Faculty of Medicine and Health Technology, Tampere University, Tampere, 33520, Finland
  - Tampere Institute for Advanced Study, Tampere University, Tampere, 33520, Finland
  - Division of Pharmaceutical Biosciences, Faculty of Pharmacy, University of Helsinki, Helsinki, 00790, Finland
- Michele Fratello
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Faculty of Medicine and Health Technology, Tampere University, Tampere, 33520, Finland
- Alisa Pavel
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Faculty of Medicine and Health Technology, Tampere University, Tampere, 33520, Finland
  - Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, 2800, Denmark
- Antonio Federico
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Faculty of Medicine and Health Technology, Tampere University, Tampere, 33520, Finland
  - Tampere Institute for Advanced Study, Tampere University, Tampere, 33520, Finland
  - Division of Pharmaceutical Biosciences, Faculty of Pharmacy, University of Helsinki, Helsinki, 00790, Finland
- Dario Greco
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Faculty of Medicine and Health Technology, Tampere University, Tampere, 33520, Finland
  - Division of Pharmaceutical Biosciences, Faculty of Pharmacy, University of Helsinki, Helsinki, 00790, Finland
  - Institute of Biotechnology, University of Helsinki, Helsinki, 00790, Finland
2
da Silva Rosa SC, Barzegar Behrooz A, Guedes S, Vitorino R, Ghavami S. Prioritization of genes for translation: a computational approach. Expert Rev Proteomics 2024; 21:125-147. [PMID: 38563427 DOI: 10.1080/14789450.2024.2337004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 02/21/2024] [Indexed: 04/04/2024]
Abstract
INTRODUCTION Gene identification for genetic diseases is critical for the development of new diagnostic approaches and personalized treatment options. Prioritization of gene translation is an important consideration in the molecular biology field, allowing researchers to focus on the most promising candidates for further investigation. AREAS COVERED In this paper, we discussed different approaches to prioritize genes for translation, including the use of computational tools and machine learning algorithms, as well as experimental techniques such as knockdown and overexpression studies. We also explored the potential biases and limitations of these approaches and proposed strategies to improve the accuracy and reliability of gene prioritization methods. Although numerous computational methods have been developed for this purpose, there is a need for computational methods that incorporate tissue-specific information to enable more accurate prioritization of candidate genes. Such methods should provide tissue-specific predictions, insights into underlying disease mechanisms, and more accurate prioritization of genes. EXPERT OPINION Using advanced computational tools and machine learning algorithms to prioritize genes, we can identify potential targets for therapeutic intervention of complex diseases. This represents an up-and-coming method for drug development and personalized medicine.
Affiliation(s)
- Simone C da Silva Rosa
  - Department of Human Anatomy and Cell Science, Max Rady College of Medicine, Rady Faculty of Health Science, University of Manitoba, Winnipeg, Canada
- Amir Barzegar Behrooz
  - Department of Human Anatomy and Cell Science, Max Rady College of Medicine, Rady Faculty of Health Science, University of Manitoba, Winnipeg, Canada
  - Electrophysiology Research Center, Neuroscience Institute, Tehran University of Medical Sciences, Tehran, Iran
- Sofia Guedes
  - LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro, Portugal
- Rui Vitorino
  - LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro, Portugal
  - Department of Medical Sciences, Institute of Biomedicine-iBiMED, University of Aveiro, Aveiro, Portugal
  - UnIC@RISE, Department of Surgery and Physiology, Faculty of Medicine of the University of Porto, Porto, Portugal
- Saeid Ghavami
  - Department of Human Anatomy and Cell Science, Max Rady College of Medicine, Rady Faculty of Health Science, University of Manitoba, Winnipeg, Canada
  - Faculty of Medicine in Zabrze, Academia of Silesia, Katowice, Poland
  - Research Institute of Oncology and Hematology, Cancer Care Manitoba, University of Manitoba, Winnipeg, Canada
3
Visonà G, Bouzigon E, Demenais F, Schweikert G. Network propagation for GWAS analysis: a practical guide to leveraging molecular networks for disease gene discovery. Brief Bioinform 2024; 25:bbae014. [PMID: 38340090 PMCID: PMC10858647 DOI: 10.1093/bib/bbae014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Revised: 12/28/2023] [Accepted: 01/08/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) have enabled large-scale analysis of the role of genetic variants in human disease. Despite impressive methodological advances, subsequent clinical interpretation and application remain challenging when GWAS suffer from a lack of statistical power. In recent years, however, the use of information diffusion algorithms with molecular networks has led to fruitful insights on disease genes. RESULTS We present an overview of the design choices and pitfalls that prove crucial in the application of network propagation methods to GWAS summary statistics. We highlight general trends from the literature and present benchmark experiments that expand on these insights, using three diseases and five molecular networks as a case study. We verify that, if the GWAS summary statistics are of sufficient quality, gene-level scores based on GWAS P-values offer advantages over a set of 'seed' disease genes that are not weighted by the associated P-values. Beyond that, the size and the density of the networks prove to be important factors for consideration. Finally, we explore several ensemble methods and show that combining multiple networks may improve the network propagation approach.
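The propagation step that such methods build on can be sketched as a random walk with restart. This is a minimal illustration under stated assumptions, not the benchmark setup of the article: the four-gene toy network, the gene names `G1` to `G4`, the single seed score, and the restart probability of 0.5 are all invented for readability.

```python
def random_walk_with_restart(adjacency, seed_scores, restart=0.5, tol=1e-12, max_iter=1000):
    """Propagate seed scores over an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours
               (assumption: every node has at least one neighbour).
    seed_scores: dict of non-negative starting scores, for example
                 gene-level scores derived from GWAS P-values.
    """
    nodes = sorted(adjacency)
    total = sum(seed_scores.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one seed score must be positive")
    # Normalise the seeds so that the scores form a probability distribution.
    start = {n: seed_scores.get(n, 0.0) / total for n in nodes}
    current = dict(start)
    for _ in range(max_iter):
        updated = {}
        for n in nodes:
            # Each neighbour m spreads its current score evenly over its own edges.
            inflow = sum(current[m] / len(adjacency[m]) for m in adjacency[n])
            updated[n] = (1.0 - restart) * inflow + restart * start[n]
        change = sum(abs(updated[n] - current[n]) for n in nodes)
        current = updated
        if change < tol:
            break
    return current


# Hypothetical path-shaped network G1 - G2 - G3 - G4 with a single seed gene.
toy_network = {"G1": {"G2"}, "G2": {"G1", "G3"}, "G3": {"G2", "G4"}, "G4": {"G3"}}
scores = random_walk_with_restart(toy_network, {"G1": 1.0})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Passing weighted scores instead of a single seed only changes the `seed_scores` argument, which is the contrast between P-value-based gene scores and unweighted seed sets that the abstract describes.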
Affiliation(s)
- Giovanni Visonà
  - Empirical Inference, Max-Planck Institute for Intelligent Systems, Tübingen 72076, Germany
4
Liu Y, Li H, Zeng T, Wang Y, Zhang H, Wan Y, Shi Z, Cao R, Tang H. Integrated bulk and single-cell transcriptomes reveal pyroptotic signature in prognosis and therapeutic options of hepatocellular carcinoma by combining deep learning. Brief Bioinform 2023; 25:bbad487. [PMID: 38197309 PMCID: PMC10777172 DOI: 10.1093/bib/bbad487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 11/22/2023] [Accepted: 11/30/2023] [Indexed: 01/11/2024] Open
Abstract
Although some pyroptosis-related (PR) prognostic models for cancers have been reported, pyroptosis-based features have not been fully characterized at the single-cell level in hepatocellular carcinoma (HCC). In this study, by deeply integrating single-cell and bulk transcriptome data, we systematically investigated the significance of the shared pyroptotic signature at both single-cell and bulk levels in HCC prognosis. Based on the pyroptotic signature, a robust PR risk system was constructed to quantify the prognostic risk of individual patients. To further verify the capacity of the pyroptotic signature to predict patients' prognosis, an attention mechanism-based deep neural network classification model was constructed. The mechanisms underlying the prognostic differences among patients with distinct PR risk were dissected in terms of tumor stemness, cancer pathways, transcriptional regulation, immune infiltration and cell communications. A nomogram model combining PR risk with clinicopathologic data was constructed to evaluate the prognosis of individual patients in the clinic. The PR risk could also evaluate therapeutic response to neoadjuvant therapies in HCC patients. In conclusion, the constructed PR risk system enables a comprehensive assessment of tumor microenvironment characteristics, accurate prognosis prediction and rational therapeutic options in HCC.
Affiliation(s)
- Yang Liu
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
- Hanlin Li
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
- Tianyu Zeng
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
- Yang Wang
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
- Hongqi Zhang
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Ying Wan
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
- Zheng Shi
  - Clinical Genetics Laboratory, Clinical Medical College & Affiliated Hospital, Chengdu University, Chengdu 610106, China
- Renzhi Cao
  - Department of Computer Science, Pacific Lutheran University, Tacoma, Washington 98447, USA
- Hua Tang
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
  - Basic Medicine Research Innovation Center for Cardiometabolic Diseases, Ministry of Education, Luzhou 646000, China
  - Medical Engineering & Medical Informatics Integration and Transformational Medicine Key Laboratory of Luzhou City, Luzhou 646000, China
5
Cingiz MÖ. k-Strong Inference Algorithm: A Hybrid Information Theory Based Gene Network Inference Algorithm. Mol Biotechnol 2023:10.1007/s12033-023-00929-2. [PMID: 37950851 DOI: 10.1007/s12033-023-00929-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Accepted: 10/05/2023] [Indexed: 11/13/2023]
Abstract
Gene networks allow researchers to understand the underlying mechanisms between diseases and genes while reducing the need for wet lab experiments. Numerous gene network inference (GNI) algorithms have been presented in the literature to infer accurate gene networks. We proposed a hybrid GNI algorithm, k-Strong Inference Algorithm (ksia), to infer more reliable and robust gene networks from omics datasets. To increase reliability, ksia integrates Pearson correlation coefficient (PCC) and Spearman rank correlation coefficient (SCC) scores to determine mutual information scores between molecules, which increases the diversity of relation predictions. To infer a more robust gene network, ksia applies three different elimination steps to remove redundant and spurious relations between genes. The performance of ksia was evaluated on a microbe microarrays database in an overlap analysis with other GNI algorithms, namely ARACNE, C3NET, CLR, and MRNET. Ksia inferred fewer relations because of its strict elimination steps. However, ksia generally performed better on Escherichia coli (E. coli) and Saccharomyces cerevisiae (yeast) gene expression datasets in terms of F-measure and precision values. The integration of association estimator scores and three elimination stages slightly increases the performance of ksia-based gene networks. The ksia R package and its user manual are available at https://github.com/ozgurcingiz/ksia.
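The two association estimators named in the abstract can be illustrated with a short, dependency-free sketch. It only shows how PCC and SCC differ on the same pair of expression profiles; how ksia combines them and performs its elimination steps is not described here, so none of that is reproduced. The expression values and variable names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression profiles with a monotonic but non-linear relation.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]
pcc = pearson(gene_a, gene_b)
scc = spearman(gene_a, gene_b)
print(round(pcc, 4), round(scc, 4))
```

On this toy pair the rank-based score is higher than the linear one, which is the kind of complementary signal that motivates using both estimators.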
Affiliation(s)
- Mustafa Özgür Cingiz
  - Computer Engineering Department, Faculty of Engineering and Natural Sciences, Bursa Technical University, Mimar Sinan Campus, Yildirim, 16310, Bursa, Turkey
6
Sánchez-Valle J, Valencia A. Molecular bases of comorbidities: present and future perspectives. Trends Genet 2023; 39:773-786. [PMID: 37482451 DOI: 10.1016/j.tig.2023.06.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/13/2023] [Revised: 06/12/2023] [Accepted: 06/12/2023] [Indexed: 07/25/2023]
Abstract
Co-occurrence of diseases decreases patient quality of life, complicates treatment choices, and increases mortality. Analyses of electronic health records present a complex scenario of comorbidity relationships that vary by age, sex, and cohort under study. The study of similarities between diseases using 'omics data, such as genes altered in diseases, gene expression, proteome, and microbiome, is fundamental to uncovering the origin of, and potential treatment for, comorbidities. Recent studies have produced a first generation of genetic interpretations for as much as 46% of the comorbidities described in large cohorts. Integrating different sources of molecular information and using artificial intelligence (AI) methods are promising approaches for the study of comorbidities. They may help to improve the treatment of comorbidities, including the potential repositioning of drugs.
Affiliation(s)
- Jon Sánchez-Valle
  - Life Sciences Department, Barcelona Supercomputing Center, Barcelona, 08034, Spain
- Alfonso Valencia
  - Life Sciences Department, Barcelona Supercomputing Center, Barcelona, 08034, Spain
  - ICREA, Barcelona, 08010, Spain
7
Wang C, Zou Q, Ju Y, Shi H. Enhancer-FRL: Improved and Robust Identification of Enhancers and Their Activities Using Feature Representation Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:967-975. [PMID: 36063523 DOI: 10.1109/tcbb.2022.3204365] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Enhancers are crucial for precise regulation of gene expression, but enhancer identification and strength prediction are challenging because of their free distribution and the tremendous number of similar fractions in the genome. Although several bioinformatics tools have been developed, shortfalls in these models remain, and their performance needs further improvement. In the present study, a two-layer predictor called Enhancer-FRL was proposed for identifying enhancers (enhancers or nonenhancers) and their activities (strong or weak). More specifically, to build an efficient model, the feature representation learning scheme was applied to generate a 50D probabilistic vector based on 10 feature encodings and five machine learning algorithms. Subsequently, the multiview probabilistic features were integrated to construct the final prediction model. Compared with the single feature-based model, Enhancer-FRL showed significant performance improvement and model robustness. Performance assessment on the independent test dataset indicated that the proposed model outperformed state-of-the-art available toolkits. The webserver Enhancer-FRL is freely accessible at http://lab.malab.cn/~wangchao/softwares/Enhancer-FRL/. The code and datasets can be downloaded from the webserver page or from GitHub at https://github.com/wangchao-malab/Enhancer-FRL/.
8
Fu M, Yan Y, Olde Loohuis LM, Chang TS. Defining the distance between diseases using SNOMED CT embeddings. J Biomed Inform 2023; 139:104307. [PMID: 36738869 DOI: 10.1016/j.jbi.2023.104307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2022] [Revised: 12/10/2022] [Accepted: 01/29/2023] [Indexed: 02/05/2023]
Abstract
Characterizing disease relationships is essential to biomedical research to understand disease etiology and improve clinical decision-making. Measurements of distance between disease pairs enable valuable research tasks, such as subgrouping patients and identifying common time courses of disease onset. Distance metrics developed in prior work focused on smaller, targeted disease sets. Distance metrics covering all diseases have not yet been defined, which limits applications to a broader disease spectrum. Our current study defines disease distances for all disease pairs within the International Classification of Diseases, version 10 (ICD-10), the diagnostic classification system universally used in electronic health records. Our proposed distance is computed based on a biomedical ontology, SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms), which can also be viewed as a structured knowledge graph. We compared the knowledge graph-based metric to three other distance metrics based on the hierarchical structure of ICD, clinical comorbidity, and genetic correlation, to evaluate how each may capture similar or unique aspects of disease relationships. We show that our knowledge graph-based distance metric captures known phenotypic, clinical, and molecular characteristics at a finer granularity than the other three. With the continued growth in the use of electronic health records data for research, we believe that our distance metric will play an important role in subgrouping patients for precision health and in enabling individualized disease prevention and treatments.
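One common way to turn concept embeddings into a pairwise disease distance is cosine distance, sketched below. This is an assumption for illustration only: the abstract does not state which distance function the authors use, and the three-dimensional vectors are invented, not real SNOMED CT embeddings. Only the ICD-10 codes (E10 type 1 diabetes, E11 type 2 diabetes, G30 Alzheimer's disease) are real.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical embeddings keyed by ICD-10 code (values invented for illustration).
embeddings = {
    "E10": [0.9, 0.1, 0.0],
    "E11": [0.8, 0.3, 0.1],
    "G30": [0.1, 0.2, 0.9],
}

# Distance for every unordered pair of codes.
codes = sorted(embeddings)
distances = {
    (a, b): cosine_distance(embeddings[a], embeddings[b])
    for i, a in enumerate(codes)
    for b in codes[i + 1:]
}
closest_pair = min(distances, key=distances.get)
print(closest_pair)
```

With these toy vectors the two diabetes codes form the closest pair; a real analysis would replace `embeddings` with vectors learned from the ontology.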
Affiliation(s)
- Mingzhou Fu
  - Movement Disorders Program, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
  - Medical Informatics Home Area, Department of Bioinformatics, University of California, Los Angeles, CA, USA
- Yu Yan
  - Medical Informatics Home Area, Department of Bioinformatics, University of California, Los Angeles, CA, USA
- Loes M Olde Loohuis
  - Center for Neurobehavioral Genetics, Semel Institute, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Timothy S Chang
  - Movement Disorders Program, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
9
Abstract
Developing personalized diagnostic strategies and targeted treatments requires a deep understanding of disease biology and the ability to dissect the relationship between molecular and genetic factors and their phenotypic consequences. However, such knowledge is fragmented across publications, non-standardized repositories, and evolving ontologies describing various scales of biological organization between genotypes and clinical phenotypes. Here, we present PrimeKG, a multimodal knowledge graph for precision medicine analyses. PrimeKG integrates 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action, considerably expanding previous efforts in disease-rooted knowledge graphs. PrimeKG contains an abundance of 'indications', 'contraindications', and 'off-label use' drug-disease edges that are lacking in other knowledge graphs and can support AI analyses of how drugs affect disease-associated networks. We supplement PrimeKG's graph structure with language descriptions of clinical guidelines to enable multimodal analyses and provide instructions for continual updates of PrimeKG as new data become available.
10
Yuan Q, Chen K, Yu Y, Le NQK, Chua MCH. Prediction of anticancer peptides based on an ensemble model of deep learning and machine learning using ordinal positional encoding. Brief Bioinform 2023; 24:6987656. [PMID: 36642410 DOI: 10.1093/bib/bbac630] [Citation(s) in RCA: 32] [Impact Index Per Article: 32.0] [Received: 08/03/2022] [Revised: 12/01/2022] [Accepted: 12/28/2022] [Indexed: 01/17/2023] Open
Abstract
Anticancer peptides (ACPs) are peptides that have been demonstrated to have anticancer activities. Using ACPs to prevent cancer could be a viable alternative to conventional cancer treatments because they are safer and display higher selectivity. Because identifying ACPs in the laboratory is expensive and lengthy, this study proposes a computational method to predict ACPs from sequence information. The process includes the input of the peptide sequences, feature extraction in terms of ordinal encoding with positional information and handcrafted features, and finally feature selection. The whole model comprises two modules: deep learning and machine learning. The deep learning module contains two channels: bidirectional long short-term memory (BiLSTM) and convolutional neural network (CNN). Light Gradient Boosting Machine (LightGBM) is used in the machine learning module. Finally, the classification results of the three paths are combined by voting in the model ensemble layer. This study provides insights into ACP prediction using a novel method and shows promising performance. It used a benchmark dataset for further exploration and improvement compared with previous studies. The final model has an accuracy of 0.7895, sensitivity of 0.8153 and specificity of 0.7676, an increase of at least 2% over state-of-the-art studies in all metrics. Hence, this paper presents a novel method that can potentially predict ACPs more effectively and efficiently. The work and source codes are made available to the community of researchers and developers at https://github.com/khanhlee/acp-ope/.
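The voting step and the three reported metrics can be sketched in a few lines. This is a minimal illustration assuming a simple majority vote over binary predictions; the abstract does not specify the voting rule, and the labels and per-model predictions below are invented rather than taken from the benchmark dataset.

```python
def majority_vote(*predictions):
    """Element-wise majority vote over an odd number of binary prediction lists."""
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on positives) and specificity (recall on negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels (1 = ACP, 0 = non-ACP) and predictions from three paths.
y_true = [1, 1, 1, 0, 0, 0]
bilstm_pred = [1, 1, 0, 0, 0, 1]
cnn_pred = [1, 0, 1, 0, 1, 0]
lightgbm_pred = [1, 1, 1, 1, 0, 0]

ensemble_pred = majority_vote(bilstm_pred, cnn_pred, lightgbm_pred)
ensemble_metrics = confusion_metrics(y_true, ensemble_pred)
bilstm_metrics = confusion_metrics(y_true, bilstm_pred)
print(ensemble_pred, ensemble_metrics)
```

In this toy case each path makes at least one mistake, but the mistakes fall on different peptides, so the vote recovers every label. That is the usual argument for ensembling models whose errors are not strongly correlated.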
Affiliation(s)
- Qitong Yuan
  - Institute of Systems Science, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Keyi Chen
  - Institute of Systems Science, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Yimin Yu
  - Institute of Systems Science, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, 250 Wuxing St, 106, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, 250 Wuxing St, 106, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, 252 Wuxing St, 110, Taipei, Taiwan
- Matthew Chin Heng Chua
  - Institute of Systems Science, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
11
Lin W, Hu S, Wu Z, Xu Z, Zhong Y, Lv Z, Qiu W, Xiao X. iCancer-Pred: A tool for identifying cancer and its type using DNA methylation. Genomics 2022; 114:110486. [PMID: 36126833 DOI: 10.1016/j.ygeno.2022.110486] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/30/2022] [Revised: 09/11/2022] [Accepted: 09/16/2022] [Indexed: 01/14/2023]
Abstract
DNA methylation is an important epigenetic modification that occurs in the early stages of tumor formation, so finding the relationship between DNA methylation and cancer is of great significance. This paper proposes a novel model, iCancer-Pred, to identify cancer and further classify its type. DNA methylation datasets for 7 cancer types were collected from The Cancer Genome Atlas (TCGA). The coefficient of variation is first used to reduce the number of features, and then the elastic network is applied to select important features. Finally, a fully connected neural network is constructed with these selected features. In predicting seven types of cancers, iCancer-Pred achieved an overall accuracy of over 97% with 5-fold cross-validation. For convenience, a user-friendly web server is available at http://bioinfo.jcu.edu.cn/cancer or http://121.36.221.79/cancer/. The source code is freely available for download at https://github.com/Huerhu/iCancer-Pred.
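The first feature-reduction step, filtering by coefficient of variation (standard deviation divided by mean), can be sketched as follows. This is an illustration under assumptions: the abstract gives neither the threshold nor the direction of the filter, so the sketch keeps features whose variation across samples is at least a hypothetical threshold of 0.1, and the probe names and methylation values are invented.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is zero)."""
    average = mean(values)
    return pstdev(values) / average if average else 0.0


def filter_by_cv(features, threshold):
    """Keep only features whose coefficient of variation reaches the threshold."""
    return {
        name: values
        for name, values in features.items()
        if coefficient_of_variation(values) >= threshold
    }


# Hypothetical methylation levels (beta values) for two probes across four samples.
methylation = {
    "probe_a": [0.50, 0.51, 0.49, 0.50],  # nearly constant, little information
    "probe_b": [0.10, 0.90, 0.20, 0.80],  # varies strongly between samples
}
kept = filter_by_cv(methylation, threshold=0.1)
print(list(kept))
```

Features that pass this filter would then go to a sparse selector and the classifier described in the abstract; those later stages are not shown here.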
Affiliation(s)
- Weizhong Lin
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Siqin Hu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Zhicheng Wu
  - Wuhan Ammunition Life Science & Technology Co., Ltd., Wuhan 430000, China
- Zhaochun Xu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Yu Zhong
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Zhe Lv
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Wangren Qiu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
- Xuan Xiao
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333000, China
12
Cheng X, Qu J, Song S, Bian Z. Neighborhood-based inference and restricted Boltzmann machine for microbe and drug associations prediction. PeerJ 2022; 10:e13848. [PMID: 35990901 PMCID: PMC9387521 DOI: 10.7717/peerj.13848] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/19/2022] [Accepted: 07/14/2022] [Indexed: 01/18/2023] Open
Abstract
Background Efficient identification of microbe-drug associations is critical for drug development and for solving the problem of antimicrobial resistance. Traditional wet-lab methods require a lot of money and labor to identify potential microbe-drug associations. With the development of machine learning and the publication of large amounts of biological data, computational methods have become feasible. Methods In this article, we proposed a computational model of neighborhood-based inference (NI) and restricted Boltzmann machine (RBM) to predict potential microbe-drug associations (NIRBMMDA) by using integrated microbe similarity, integrated drug similarity and known microbe-drug associations. First, NI was used to obtain a score matrix of potential microbe-drug associations by using different thresholds to find similar neighbors for a drug or microbe. Second, RBM was employed to obtain another score matrix of potential microbe-drug associations based on the contrastive divergence algorithm and the sigmoid function. Because the generalization ability of an individual method is poor, we used ensemble learning to integrate the two score matrices and predict potential microbe-drug associations more accurately. In particular, NI can fully utilize similar (neighbor) information of a drug or microbe, and RBM can learn the potential probability distribution hidden in known microbe-drug associations. Moreover, ensemble learning integrates the individual predictors into a stronger predictor. Results In global leave-one-out cross validation (LOOCV), NIRBMMDA achieved an area under the receiver operating characteristic curve (AUC) of 0.8666, 0.9413 and 0.9557 for the DrugVirus, MDAD and aBiofilm datasets, respectively. In local LOOCV, AUCs of 0.8512, 0.9204 and 0.9414 were obtained for NIRBMMDA on the DrugVirus, MDAD and aBiofilm datasets, respectively. In five-fold cross validation, NIRBMMDA achieved AUCs (mean ± standard deviation) of 0.8569 ± 0.0027, 0.9248 ± 0.0014 and 0.9369 ± 0.0020 on the DrugVirus, MDAD and aBiofilm datasets, respectively. Moreover, a case study for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) showed that 13 out of the top 20 predicted drugs were verified by searching the literature. The other two case studies indicated that 17 and 17 out of the top 20 predicted microbes for the drugs ciprofloxacin and minocycline, respectively, were confirmed by published literature.
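The AUC values reported above summarize how well predicted scores rank known associations above unknown ones. A minimal sketch of that computation follows, using the pairwise-comparison (Mann-Whitney) form of the AUC; the labels and scores are invented and are not results from the DrugVirus, MDAD or aBiofilm datasets.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out microbe-drug pairs: 1 = known association, 0 = unknown.
held_out_labels = [1, 1, 0, 1, 0, 0]
predicted_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
toy_auc = auc(held_out_labels, predicted_scores)
print(round(toy_auc, 4))
```

Eight of the nine positive-negative pairs are ordered correctly here, giving 8/9. In a cross-validation setting this function would be applied to each held-out fold and the results averaged.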
Affiliation(s)
- Xiaolong Cheng
  - School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou, Jiangsu, China
- Jia Qu
  - School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou, Jiangsu, China
- Shuangbao Song
  - School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou, Jiangsu, China
- Zekang Bian
  - School of AI & Computer Science, Jiangnan University, Wuxi, Jiangsu, China
13
Network-Based Methods for Approaching Human Pathologies from a Phenotypic Point of View. Genes (Basel) 2022; 13:genes13061081. [PMID: 35741843 PMCID: PMC9222217 DOI: 10.3390/genes13061081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 01/27/2023] Open
Abstract
Network and systemic approaches to studying human pathologies are helping us to gain insight into the molecular mechanisms of and potential therapeutic interventions for human diseases, especially for complex diseases where large numbers of genes are involved. The complex human pathological landscape is traditionally partitioned into discrete “diseases”; however, that partition is sometimes problematic, as diseases are highly heterogeneous and can differ greatly from one patient to another. Moreover, for many pathological states, the set of symptoms (phenotypes) manifested by the patient is not enough to diagnose a particular disease. On the contrary, phenotypes, by definition, are directly observable and can be closer to the molecular basis of the pathology. These clinical phenotypes are also important for personalised medicine, as they can help stratify patients and design personalised interventions. For these reasons, network and systemic approaches to pathologies are gradually incorporating phenotypic information. This review covers the current landscape of phenotype-centred network approaches to study different aspects of human diseases.
|
14
|
Xiang J, Zhang J, Zhao Y, Wu FX, Li M. Biomedical data, computational methods and tools for evaluating disease-disease associations. Brief Bioinform 2022; 23:6522999. [PMID: 35136949 DOI: 10.1093/bib/bbac006] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/19/2021] [Revised: 01/04/2022] [Accepted: 01/05/2022] [Indexed: 12/12/2022] Open
Abstract
In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, many computational methods and tools/platforms have been developed to reveal intrinsic relationships between diseases, which can provide useful insights into the study of complex diseases, e.g. understanding the molecular mechanisms of diseases and discovering new treatments. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms. Computational methods that use different types of biomedical data, from phenotype to genotype, can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups according to the biomedical data they use: disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for the computation and analysis of disease-disease associations. Finally, we discuss and summarize research on disease-disease associations. This review provides a systematic overview of current disease association research, which could promote the development and application of computational methods and tools/platforms for disease-disease associations.
Affiliation(s)
- Ju Xiang
  - School of Computer Science and Engineering, Central South University, China
- Jiashuai Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Yichao Zhao
  - School of Computer Science and Engineering, Central South University, China
- Fang-Xiang Wu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Min Li
  - Division of Biomedical Engineering and Department of Mechanical Engineering at University of Saskatchewan, Saskatoon, Canada
|
15
|
Zhang S, Zhang J, Zhang Q, Liang Y, Du Y, Wang G. Identification of Prognostic Biomarkers for Bladder Cancer Based on DNA Methylation Profile. Front Cell Dev Biol 2022; 9:817086. [PMID: 35174173 PMCID: PMC8841402 DOI: 10.3389/fcell.2021.817086] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/17/2021] [Accepted: 12/22/2021] [Indexed: 12/14/2022] Open
Abstract
Background: DNA methylation is an important epigenetic modification that plays an important role in regulating gene expression at the transcriptional level. In tumor research, changes in DNA methylation have been found to lead to abnormalities in gene structure and function, which can provide early warning of tumorigenesis. Our study aims to explore the relationship between the occurrence and development of tumors and the level of DNA methylation. Moreover, this study provides a set of prognostic biomarkers that can more accurately predict the survival and health of patients after treatment. Methods: Datasets of bladder cancer patients and control samples were collected from the TCGA database, and differential analysis was employed to obtain genes with differential DNA methylation levels between tumor samples and normal samples. A protein-protein interaction network was then constructed, and potential tumor markers were obtained by extracting hub genes from its subnetworks. A Cox proportional hazards regression model and survival analysis were used to construct the prognostic model and screen out the prognostic markers of bladder cancer, so as to provide a reference for tumor prognosis monitoring and the improvement of treatment plans. Results: In this study, we found that DNA methylation was indeed related to the occurrence of bladder cancer. Genes with differential DNA methylation could serve as potential biomarkers for bladder cancer. Through univariate and multivariate Cox proportional hazards regression analysis, we concluded that FASLG and PRKCZ can be used as prognostic biomarkers for bladder cancer. Patients can be classified into a high- or low-risk group using this two-gene prognostic model. By detecting the methylation status of these genes, we can evaluate the survival of patients. Conclusion: The analysis in our study indicates that the methylation status of tumor-related genes can be used as a prognostic biomarker of bladder cancer.
Affiliation(s)
- Shumei Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Jingyu Zhang
  - Department of Neurology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
- Qichao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yingjian Liang
  - Department of General Surgery, The First Affiliated Hospital of Harbin Medical University, Harbin, China
  - Key Laboratory of Hepatosplenic Surgery, Ministry of Education, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- Youwen Du
  - School of Life Sciences, Anhui Medical University, Hefei, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
  - *Correspondence: Guohua Wang
|
16
|
Zhao Z, Yang W, Zhai Y, Liang Y, Zhao Y. Identify DNA-Binding Proteins Through the Extreme Gradient Boosting Algorithm. Front Genet 2022; 12:821996. [PMID: 35154264 PMCID: PMC8837382 DOI: 10.3389/fgene.2021.821996] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/25/2021] [Accepted: 12/07/2021] [Indexed: 12/13/2022] Open
Abstract
The exploration of DNA-binding proteins (DBPs) is an important aspect of studying biological life activities. Research on life activities requires the support of scientific research results on DBPs. The decline in many life activities is closely related to DBPs. DBPs are generally identified through biochemical experiments, an approach that is inefficient and requires considerable manpower, material resources and time. At present, several computational approaches have been developed to detect DBPs, among which machine learning (ML) algorithm-based computational techniques have shown excellent performance. In our experiments, our method uses fewer features and simpler recognition methods than other methods and simultaneously obtains satisfactory results. First, we use six feature extraction methods to extract sequence features from the same group of DBPs. Then, this feature information is spliced together, and the data are standardized. Finally, the extreme gradient boosting (XGBoost) model is used to construct an effective predictive model. Compared with other excellent methods, our proposed method has achieved better results. The accuracy achieved by our method is 78.26% for PDB2272 and 85.48% for PDB186. The accuracy of the experimental results achieved by our strategy is similar to that of previous detection methods.
Affiliation(s)
- Ziye Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Wen Yang
  - International Medical Center, Shenzhen University General Hospital, Shenzhen, China
- Yixiao Zhai
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yingjian Liang
  - Department of Obstetrics and Gynecology, The First Affiliated Hospital of Harbin Medical University, Harbin, China
  - *Correspondence: Yingjian Liang; Yuming Zhao
- Yuming Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
  - *Correspondence: Yingjian Liang; Yuming Zhao
|
17
|
Zhang Z, Gong Y, Gao B, Li H, Gao W, Zhao Y, Dong B. SNAREs-SAP: SNARE Proteins Identification With PSSM Profiles. Front Genet 2022; 12:809001. [PMID: 34987554 PMCID: PMC8721734 DOI: 10.3389/fgene.2021.809001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/04/2021] [Accepted: 11/15/2021] [Indexed: 12/20/2022] Open
Abstract
Soluble N-ethylmaleimide sensitive factor activating protein receptor (SNARE) proteins are a large family of transmembrane proteins located in organelles and vesicles. The important roles of SNARE proteins include initiating the vesicle fusion process and activating and fusing proteins as they undergo exocytosis activity, and SNARE proteins are also vital for the transport regulation of membrane proteins and non-regulatory vesicles. Therefore, establishing a method to efficiently identify SNARE proteins is of great significance. However, the identification accuracy of existing methods such as SNARE CNN is not satisfactory. In our study, we developed a method based on a support vector machine (SVM) that can effectively recognize SNARE proteins. We used the position-specific scoring matrix (PSSM) method to extract features of SNARE protein sequences, used the support vector machine recursive elimination correlation bias reduction (SVM-RFE-CBR) algorithm to rank the importance of features, and then screened out the optimal subset of feature data based on the sorted results. We input the feature data into the model when building it, used 10-fold cross-validation for training, and tested model performance on an independent dataset. In independent tests, our method identified SNARE proteins with a sensitivity of 68%, specificity of 94%, accuracy of 92%, area under the curve (AUC) of 84%, and Matthews correlation coefficient (MCC) of 0.48. The results of the experiment show that the common evaluation indicators of our method are excellent, indicating that our method performs better than other existing classification methods in identifying SNARE proteins.
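The evaluation indicators named in this abstract (sensitivity, specificity, accuracy, MCC) all derive from one confusion matrix. The following is a minimal sketch of those formulas; the function name and the counts are hypothetical and are not taken from the paper's independent test set.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical, imbalanced counts chosen only for illustration.
metrics = binary_metrics(tp=68, fn=32, tn=940, fp=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes like these, accuracy stays high while MCC is noticeably lower, which is why MCC is usually reported alongside accuracy.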
Affiliation(s)
- Zixiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yue Gong
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Bo Gao
  - Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China
- Hongfei Li
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Wentao Gao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yuming Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Benzhi Dong
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
|
18
|
Du J, Lin D, Yuan R, Chen X, Liu X, Yan J. Graph Embedding Based Novel Gene Discovery Associated With Diabetes Mellitus. Front Genet 2021; 12:779186. [PMID: 34899863 PMCID: PMC8657768 DOI: 10.3389/fgene.2021.779186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2021] [Accepted: 10/20/2021] [Indexed: 11/25/2022] Open
Abstract
Diabetes mellitus is a group of complex metabolic disorders that has affected hundreds of millions of patients worldwide. The underlying pathogenesis of the various types of diabetes is still unclear, which hinders the development of more efficient therapies. Although many genes have been found to be associated with diabetes mellitus, more novel genes still need to be discovered to obtain a complete picture of the underlying mechanism. With the development of complex molecular networks, network-based disease-gene prediction methods have been widely proposed. However, most existing methods are based on the guilt-by-association hypothesis and often handcraft node features based on local topological structures. Advances in graph embedding techniques have enabled automatic global feature extraction from molecular networks. Inspired by the successful applications of cutting-edge graph embedding methods to complex diseases, we proposed a computational framework to investigate novel genes associated with diabetes mellitus. There are three main steps in the framework: network feature extraction based on graph embedding methods; feature denoising and regeneration using a stacked autoencoder; and disease-gene prediction based on machine learning classifiers. We compared the performance of different graph embedding methods and machine learning classifiers and designed the best workflow for predicting genes associated with diabetes mellitus. Functional enrichment analysis based on Human Phenotype Ontology (HPO), KEGG, and GO biological process, together with a publication search, further evaluated the predicted novel genes.
Affiliation(s)
- Jing Yan
  - Zhejiang Hospital, Hangzhou, China
  - Zhejiang Provincial Key Lab of Geriatrics, Zhejiang Hospital, Hangzhou, China
|
19
|
Zhang H, Xu R, Ding M, Zhang Y. Prediction of Gastric Cancer-Related Proteins Based on Graph Fusion Method. Front Cell Dev Biol 2021; 9:739715. [PMID: 34790662 PMCID: PMC8591485 DOI: 10.3389/fcell.2021.739715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2021] [Accepted: 08/02/2021] [Indexed: 12/09/2022] Open
Abstract
Gastric cancer is a common malignant tumor of the digestive system with no specific symptoms. Because knowledge of its pathogenesis is limited, patients are usually diagnosed at an advanced stage and lack effective treatment options. The proteome has unique tissue and time specificity and can reflect the influence of external factors, which makes it a potential biomarker for early diagnosis. Therefore, discovering gastric cancer-related proteins could greatly help researchers design drugs and develop an early diagnosis kit. However, identifying gastric cancer-related proteins by biological experiments is time- and money-consuming. With the rapid growth of data, mining knowledge from proteomics data on a large scale through computational methods has become a hot issue. Based on the hypothesis that the stronger the association between two proteins, the more likely they are to be associated with the same disease, in this paper we constructed both a disease similarity network and a protein interaction network. Then, Graph Convolutional Networks (GCN) were applied to extract topological features of these networks. Finally, XGBoost was used to identify the relationship between proteins and gastric cancer. Results of 10-fold cross-validation experiments show a high area under the curve (AUC) (0.85) and area under the precision-recall (AUPR) curve (0.76) for our method, which supports its effectiveness.
Affiliation(s)
- Hao Zhang
  - Endoscopy Center, China-Japan Union Hospital of Jilin University, Changchun, China
- Ruisi Xu
  - Endoscopy Center, China-Japan Union Hospital of Jilin University, Changchun, China
- Meng Ding
  - Endoscopy Center, China-Japan Union Hospital of Jilin University, Changchun, China
- Ying Zhang
  - Endoscopy Center, China-Japan Union Hospital of Jilin University, Changchun, China
|
20
|
ReRF-Pred: predicting amyloidogenic regions of proteins based on their pseudo amino acid composition and tripeptide composition. BMC Bioinformatics 2021; 22:545. [PMID: 34753427 PMCID: PMC8579573 DOI: 10.1186/s12859-021-04446-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/30/2021] [Accepted: 10/13/2021] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Amyloids are insoluble fibrillar aggregates that are highly associated with complex human diseases, such as Alzheimer's disease, Parkinson's disease, and type II diabetes. Recently, many studies reported that some specific regions of amino acid sequences may be responsible for the amyloidosis of proteins. Identifying these amyloidogenic regions has become very important for elucidating the mechanism of amyloids. Accordingly, several computational methods have been put forward to discover amyloidogenic regions. The majority of these methods predict amyloidogenic regions based on the physicochemical properties of amino acids. In fact, the position, order, and correlation of amino acids may also influence the amyloidosis of proteins and should also be considered when detecting amyloidogenic regions. RESULTS To address this problem, we proposed a novel machine-learning approach for predicting amyloidogenic regions, called ReRF-Pred. Firstly, the pseudo amino acid composition (PseAAC) was exploited to characterize the physicochemical properties and correlation of amino acids. Secondly, tripeptide composition (TPC) was employed to represent the order and position of amino acids. To improve the distinguishability of TPC, all possible tripeptides were analyzed by the binomial distribution method, and only those with a significantly different distribution between positive and negative samples were retained. Finally, all samples were characterized by the PseAAC and TPC of their amino acid sequences, and a random forest-based amyloidogenic region predictor was trained on these samples. Validation experiments showed that the feature set consisting of PseAAC and TPC is the most distinguishable one for detecting amyloidosis. Meanwhile, random forest is superior to the other classifiers considered on almost all metrics. To validate the effectiveness of our model, ReRF-Pred is compared with a series of gold-standard methods on two datasets: Pep-251 and Reg33. The results suggested that our method has the best overall performance and makes significant improvements in discovering amyloidogenic regions. CONCLUSIONS The advantages of our method are mainly attributed to the fact that PseAAC and TPC can successfully describe the differences between amyloids and other proteins. The ReRF-Pred server can be accessed at http://106.12.83.135:8080/ReRF-Pred/.
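The binomial-distribution filtering of tripeptides mentioned in this abstract can be sketched as follows. This is one common formulation of the idea, not necessarily the authors' exact statistic: for each tripeptide, compute the binomial tail probability of seeing at least its observed count in one class if occurrences were spread across classes in proportion to class size, and keep tripeptides whose tail probability is small. The function names, the 0.05 threshold, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from math import comb

def tripeptide_counts(seqs):
    """Count overlapping tripeptides across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - 2):
            counts[s[i:i + 3]] += 1
    return counts

def binomial_tail(n, k, q):
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, m) * q**m * (1 - q) ** (n - m) for m in range(k, n + 1))

def select_tripeptides(pos_seqs, neg_seqs, alpha=0.05):
    """Keep tripeptides that are over-represented in either class (assumed threshold alpha)."""
    pos, neg = tripeptide_counts(pos_seqs), tripeptide_counts(neg_seqs)
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    q_pos = total_pos / (total_pos + total_neg)  # expected share of the positive class
    selected = {}
    for tri in set(pos) | set(neg):
        n = pos[tri] + neg[tri]
        p = min(binomial_tail(n, pos[tri], q_pos),
                binomial_tail(n, neg[tri], 1 - q_pos))
        if p < alpha:
            selected[tri] = p
    return selected

# Toy sequences, invented for illustration only.
positives = ["VVVVVVVV", "VVVVVVAA"]
negatives = ["GSGSGSGS", "PGPGPGPG"]
selected = select_tripeptides(positives, negatives)
print(selected)
```

In a real pipeline the retained tripeptides would define the TPC feature columns that are concatenated with PseAAC before training the classifier.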
|
21
|
Qiu S, Li M, Jin S, Lu H, Hu Y. Rheumatoid Arthritis and Cardio-Cerebrovascular Disease: A Mendelian Randomization Study. Front Genet 2021; 12:745224. [PMID: 34745219 PMCID: PMC8567962 DOI: 10.3389/fgene.2021.745224] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 07/21/2021] [Accepted: 08/20/2021] [Indexed: 01/05/2023] Open
Abstract
Significant genetic association exists between rheumatoid arthritis (RA) and cardiovascular disease. The associated mechanisms include common inflammatory mediators, changes in lipoprotein composition and function, immune responses, etc. However, the causality of RA and vascular/heart problems remains unknown. Herein, we performed Mendelian randomization (MR) analysis using a large-scale RA genome-wide association study (GWAS) dataset (462,933 cases and 457,732 controls) and six cardio-cerebrovascular disease GWAS datasets, including age angina (461,880 cases and 447,052 controls), hypertension (461,880 cases and 337,653 controls), age heart attack (10,693 cases and 451,187 controls), abnormalities of heartbeat (461,880 cases and 361,194 controls), stroke (7,055 cases and 454,825 controls), and coronary heart disease (361,194 cases and 351,037 controls) from United Kingdom biobank. We further carried out heterogeneity and sensitivity analyses. We confirmed the causality of RA with age angina (OR = 1.17, 95% CI: 1.04–1.33, p = 1.07E−02), hypertension (OR = 1.45, 95% CI: 1.20–1.75, p = 9.64E−05), age heart attack (OR = 1.15, 95% CI: 1.05–1.26, p = 3.56E−03), abnormalities of heartbeat (OR = 1.07, 95% CI: 1.01–1.12, p = 1.49E−02), stroke (OR = 1.06, 95% CI: 1.01–1.12, p = 2.79E−02), and coronary heart disease (OR = 1.19, 95% CI: 1.01–1.39, p = 3.33E−02), contributing to the understanding of the overlapping genetic mechanisms and therapeutic approaches between RA and cardiovascular disease.
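The odds ratios, confidence intervals, and p-values reported in this abstract are linked by standard normal-approximation arithmetic on the log-odds scale. The sketch below shows that relationship for the hypertension estimate; the function name is hypothetical, and because the published OR and CI are rounded, the recovered p-value only approximates the reported one.

```python
import math

def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover log-OR, standard error, z-score and two-sided p from an OR and its 95% CI."""
    beta = math.log(odds_ratio)
    # A 95% CI spans 2 * z_crit standard errors on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return beta, se, z, p

# Hypertension estimate quoted in the abstract: OR = 1.45, 95% CI 1.20-1.75.
beta, se, z, p = p_from_or_ci(1.45, 1.20, 1.75)
print(f"beta={beta:.4f} se={se:.4f} z={z:.2f} p={p:.2e}")
```

The result is on the order of 1e-4, in line with the reported p = 9.64E−05 once rounding of the inputs is taken into account.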
Affiliation(s)
- Shizheng Qiu
  - School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
- Meijie Li
  - Department of Neurology, Xuanwu Hospital, Capital Medical University, Beijing, China
- Shunshan Jin
  - General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
- Haoyu Lu
  - School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
- Yang Hu
  - School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
|
22
|
Liu T, Chen J, Zhang Q, Hippe K, Hunt C, Le T, Cao R, Tang H. The Development of Machine Learning Methods in discriminating Secretory Proteins of Malaria Parasite. Curr Med Chem 2021; 29:807-821. [PMID: 34636289 DOI: 10.2174/0929867328666211005140625] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/24/2021] [Revised: 07/28/2021] [Accepted: 08/15/2021] [Indexed: 11/22/2022]
Abstract
Malaria caused by Plasmodium falciparum is one of the major infectious diseases in the world. An effective method for predicting the secretory proteins of malaria parasites is essential for developing effective cures and treatments. Biochemical assays can provide details for accurate identification of the secretory proteins, but these methods are expensive and time-consuming. In this paper, we summarize the machine learning-based identification algorithms and compare the construction strategies of different computational methods. We also discuss the use of machine learning to improve the ability of algorithms to identify proteins secreted by malaria parasites.
Affiliation(s)
- Ting Liu
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Jiamao Chen
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Qian Zhang
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Kyle Hippe
  - Department of Computer Science, Pacific Lutheran University, United States
- Cassandra Hunt
  - Department of Computer Science, Pacific Lutheran University, United States
- Thu Le
  - Department of Computer Science, Pacific Lutheran University, United States
- Renzhi Cao
  - Department of Computer Science, Pacific Lutheran University, United States
- Hua Tang
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
|
23
|
Zhao YW, Zhang S, Ding H. Recent development of machine learning methods in sumoylation sites prediction. Curr Med Chem 2021; 29:894-907. [PMID: 34525906 DOI: 10.2174/0929867328666210915112030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/10/2021] [Revised: 07/24/2021] [Accepted: 08/07/2021] [Indexed: 11/22/2022]
Abstract
Sumoylation is an important reversible post-translational modification of proteins that mediates a variety of cellular processes. Sumo-modified proteins can change their subcellular localization, activity and stability. Sumoylation also plays an important role in cellular processes such as transcriptional regulation and signal transduction. Abnormal sumoylation is involved in many diseases, including neurodegeneration and immune-related diseases, as well as the development of cancer. Therefore, identification of sumoylation sites (SUMO sites) is fundamental to understanding their molecular mechanisms and regulatory roles. In contrast to labor-intensive and costly experimental approaches, in silico prediction of sumoylation sites has attracted much attention for its accuracy, convenience and speed. At present, many computational prediction models have been used to identify SUMO sites, but they have not been comprehensively summarized and reviewed. Therefore, the research progress of the relevant models is summarized and discussed in this paper. We briefly summarize the development of bioinformatics methods for sumoylation site prediction, focusing mainly on benchmark dataset construction, feature extraction, machine learning methods, published results and online tools. We hope the review will be helpful to experimental researchers.
Affiliation(s)
- Yi-Wei Zhao
  - School of Medicine, University of Electronic Science and Technology of China, Chengdu 610054, China
- Shihua Zhang
  - College of Life Science and Health, Wuhan University of Science and Technology, Wuhan 430065, China
- Hui Ding
  - School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
24
|
Yang YH, Wang JS, Yuan SS, Liu ML, Su W, Lin H, Zhang ZY. A Survey for Predicting ATP Binding Residues of Proteins Using Machine Learning Methods. Curr Med Chem 2021; 29:789-806. [PMID: 34514982 DOI: 10.2174/0929867328666210910125802] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/24/2021] [Revised: 06/29/2021] [Accepted: 07/04/2021] [Indexed: 11/22/2022]
Abstract
Protein-ligand interactions are necessary for the majority of protein functions. Adenosine-5'-triphosphate (ATP) is one such ligand; it plays a vital role as a coenzyme in providing energy for cellular activities, catalyzing biological reactions and signaling. Knowing the ATP binding residues of proteins is helpful for the annotation of protein function and for drug design. However, given the huge influx of protein sequences into databases in the post-genome era, experimentally identifying ATP binding residues is costly and time-consuming. To address this problem, computational methods have been developed to predict ATP binding residues. In this review, we briefly summarize the application of machine learning methods to detecting ATP binding residues of proteins. We expect this review will be helpful for further research.
Affiliation(s)
- Yu-He Yang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Jia-Shu Wang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Shi-Shi Yuan
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Meng-Lu Liu
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Su
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lin
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhao-Yue Zhang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
25
|
Wang T, Liu Y, Ruan J, Dong X, Wang Y, Peng J. A pipeline for RNA-seq based eQTL analysis with automated quality control procedures. BMC Bioinformatics 2021; 22:403. [PMID: 34433407 PMCID: PMC8386049 DOI: 10.1186/s12859-021-04307-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 06/13/2021] [Accepted: 07/06/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Advances in expression quantitative trait loci (eQTL) studies have provided valuable insights into the mechanisms of disease- and trait-associated genetic variants. However, it remains challenging for researchers with a limited computational background to evaluate and control the quality of multi-source heterogeneous eQTL raw data. There is an urgent need for a powerful and user-friendly tool that automatically processes raw datasets in various formats and performs eQTL mapping afterward. RESULTS In this work, we present a pipeline for eQTL analysis, termed eQTLQC, featuring automated data preprocessing for both genotype data and gene expression data. Our pipeline provides a set of quality control and normalization approaches, and utilizes automated techniques to reduce manual intervention. We demonstrate the utility and robustness of this pipeline by performing eQTL case studies using multiple independent real-world datasets with RNA-seq data and whole genome sequencing (WGS) based genotype data. CONCLUSIONS eQTLQC provides a reliable computational workflow for eQTL analysis. It provides standard quality control and normalization as well as eQTL mapping procedures for eQTL raw data in multiple formats. The source code, demo data, and instructions are freely available at https://github.com/stormlovetao/eQTLQC.
Affiliation(s)
- Tao Wang
  - School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Road, Chang’an District, Xi’an, China
  - School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi St., Harbin, China
- Yongzhuang Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi St., Harbin, China
- Junpeng Ruan
  - School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Road, Chang’an District, Xi’an, China
- Xianjun Dong
  - Brigham and Women’s Hospital, Harvard Medical School, 75 Francis St., Boston, USA
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, West Dazhi St., Harbin, China
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Road, Chang’an District, Xi’an, China
|
26
|
Li Y, Pu F, Wang J, Zhou Z, Zhang C, He F, Ma Z, Zhang J. Machine Learning Methods in Prediction of Protein Palmitoylation Sites: A Brief Review. Curr Pharm Des 2021; 27:2189-2198. [PMID: 33183190 DOI: 10.2174/1381612826666201112142826] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/13/2020] [Accepted: 07/27/2020] [Indexed: 11/22/2022]
Abstract
Protein palmitoylation is a fundamental and reversible post-translational lipid modification that is involved in a series of biological processes. Although a large number of experimental studies have explored the molecular mechanism behind the palmitoylation process, computational methods have attracted much attention for their good performance in predicting palmitoylation sites compared with expensive and time-consuming biochemical experiments. The prediction of protein palmitoylation sites helps to reveal the biological mechanism of palmitoylation. Therefore, research on the application of machine learning methods to predict palmitoylation sites has become a hot topic in bioinformatics and has promoted development in related fields. In this review, we briefly introduce recent developments in predicting protein palmitoylation sites using machine learning-based methods and discuss their benefits and drawbacks. A perspective on machine learning-based methods for predicting palmitoylation sites is also provided. We hope the review can serve as a guide for related fields.
Affiliation(s)
- Yanwen Li
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Feng Pu
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Jingru Wang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Zhiguo Zhou
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Chunhua Zhang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Fei He
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Zhiqiang Ma
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Jingbo Zhang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| |
|
27
|
Yang H, Qi C, Li B, Cheng L. Non-coding RNAs as Novel Biomarkers in Cancer Drug Resistance. Curr Med Chem 2021; 29:837-848. [PMID: 34348605 DOI: 10.2174/0929867328666210804090644] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/22/2021] [Revised: 06/09/2021] [Accepted: 06/15/2021] [Indexed: 11/22/2022]
Abstract
Chemotherapy is often the primary and most effective anticancer treatment; however, drug resistance remains a major obstacle to it being curative. Recent studies have demonstrated that non-coding RNAs (ncRNAs), especially microRNAs and long non-coding RNAs, are involved in drug resistance of tumor cells in many ways, such as modulation of apoptosis, drug efflux and metabolism, epithelial-to-mesenchymal transition, DNA repair, and cell cycle progression. Exploring the relationships between ncRNAs and drug resistance will not only contribute to our understanding of the mechanisms of drug resistance and provide ncRNA biomarkers of chemoresistance, but will also help realize personalized anticancer treatment regimens. Due to the high cost and low efficiency of biological experimentation, many researchers have opted to use computational methods to identify ncRNA biomarkers associated with drug resistance. In this review, we summarize recent discoveries related to ncRNA-mediated drug resistance and highlight the computational methods and resources available for ncRNA biomarkers involved in chemoresistance.
Affiliation(s)
- Haixiu Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
| | - Changlu Qi
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
| | - Boyan Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
| | - Liang Cheng
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
| |
|
28
|
Jiang Y, Zheng B, Yang Y, Li X, Han J. Identification of Somatic Mutation-Driven Immune Cells by Integrating Genomic and Transcriptome Data. Front Cell Dev Biol 2021; 9:715275. [PMID: 34368166 PMCID: PMC8335569 DOI: 10.3389/fcell.2021.715275] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/26/2021] [Accepted: 06/25/2021] [Indexed: 01/08/2023]
Abstract
Tumor somatic mutations in protein-coding regions may generate neoantigens, which may trigger an antitumor immune cell response. Increasing evidence indicates that the immune cell response may profoundly influence tumor progression. However, there are no computational tools to systematically identify immune cells driven by specific somatic mutations. A computational method is urgently needed to comprehensively detect tumor-infiltrating immune cells driven by specific somatic mutations in cancer. We developed a novel software package (SMDIC) that enables the automated identification of somatic mutation-driven immune cells. SMDIC provides a novel pipeline to discover mutation-specific immune cells by integrating genomic and transcriptome data. The operation modes include inference of the relative abundance matrix of tumor-infiltrating immune cells, detection of differentially abundant immune cells with respect to gene mutation status, conversion of the abundance matrix of significantly dysregulated cells into two binary matrices (one for upregulated and one for downregulated cells), identification of somatic mutation-driven immune cells by comparing the gene mutation status with each immune cell in the binary matrices across all samples, and visualization of immune cell abundance in samples with different mutation status for each gene. SMDIC provides a user-friendly tool to identify somatic mutation-specific immune cell responses. SMDIC may help in understanding the mechanisms underlying the anticancer immune response and in finding targets for cancer immunotherapy. SMDIC is implemented as an R-based tool that is freely available from the CRAN website https://CRAN.R-project.org/package=SMDIC.
Affiliation(s)
- Ying Jiang
- College of Basic Medical Science, Heilongjiang University of Chinese Medicine, Harbin, China
| | - Baotong Zheng
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Yang Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Xiangmei Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Junwei Han
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| |
|
29
|
Liang X, Li F, Chen J, Li J, Wu H, Li S, Song J, Liu Q. Large-scale comparative review and assessment of computational methods for anti-cancer peptide identification. Brief Bioinform 2021; 22:bbaa312. [PMID: 33316035 PMCID: PMC8294543 DOI: 10.1093/bib/bbaa312] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 08/18/2020] [Revised: 09/30/2020] [Accepted: 08/25/2020] [Indexed: 12/13/2022]
Abstract
Anti-cancer peptides (ACPs) are known as potential therapeutics for cancer. Due to their unique ability to target cancer cells without affecting healthy cells directly, they have been extensively studied. Many peptide-based drugs are currently being evaluated in preclinical and clinical trials. Accurate identification of ACPs has received considerable attention in recent years; as such, a number of machine learning-based methods for in silico identification of ACPs have been developed. These methods promote research on the mechanism of ACP therapeutics against cancer to some extent. The methods differ greatly in terms of their training/testing datasets, machine learning algorithms, feature encoding schemes, feature selection methods and evaluation strategies. Therefore, it is desirable to summarize the advantages and disadvantages of the existing methods and to provide useful insights and suggestions for the development and improvement of novel computational tools to characterize and identify ACPs. With this in mind, we first comprehensively investigate 16 state-of-the-art predictors for ACPs in terms of their core algorithms, feature encoding schemes, performance evaluation metrics and webserver/software usability. Then, a comprehensive performance assessment is conducted to evaluate the robustness and scalability of the existing predictors using a well-prepared benchmark dataset. We provide potential strategies for improving model performance. Moreover, we propose a novel ensemble learning framework, termed ACPredStackL, for the accurate identification of ACPs. ACPredStackL is developed based on the stacking ensemble strategy combined with SVM, Naïve Bayesian, lightGBM and KNN. Empirical benchmarking experiments against the state-of-the-art methods demonstrate that ACPredStackL achieves a comparative performance for predicting ACPs.
The webserver and source code of ACPredStackL are freely available at http://bigdata.biocie.cn/ACPredStackL/ and https://github.com/liangxiaoq/ACPredStackL, respectively.
Affiliation(s)
- Xiao Liang
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling, Shaanxi 712100, China
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC 3800, Australia
- Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Victoria, Australia
| | - Jinxiang Chen
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Junlong Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Hao Wu
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Shuqin Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling, Shaanxi 712100, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling, Shaanxi 712100, China
| |
|
30
|
Zulfiqar H, Yuan SS, Huang QL, Sun ZJ, Dao FY, Yu XL, Lin H. Identification of cyclin protein using gradient boost decision tree algorithm. Comput Struct Biotechnol J 2021; 19:4123-4131. [PMID: 34527186 PMCID: PMC8346528 DOI: 10.1016/j.csbj.2021.07.013] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 03/31/2021] [Revised: 07/15/2021] [Accepted: 07/15/2021] [Indexed: 12/12/2022]
Abstract
Cyclin proteins regulate the cell cycle by forming complexes with cyclin-dependent kinases, which activates cell cycle progression. Correct recognition of cyclin proteins could provide key clues for studying their functions. However, their sequences share low similarity, which results in poor prediction by sequence similarity-based methods. Thus, a machine learning model for identifying cyclin proteins is urgently needed. This study aimed to develop a computational model to discriminate cyclin proteins from non-cyclin proteins. In our model, protein sequences were encoded by seven kinds of features: amino acid composition, composition of k-spaced amino acid pairs, tripeptide composition, pseudo amino acid composition, Geary correlation, normalized Moreau-Broto autocorrelation and composition/transition/distribution. Afterward, these features were optimized by using analysis of variance (ANOVA) and minimum redundancy maximum relevance (mRMR) with the incremental feature selection (IFS) technique. A gradient boost decision tree (GBDT) classifier was trained on the optimal features. Five-fold cross-validated results showed that our model identifies cyclins with an accuracy of 93.06% and an AUC value of 0.971, which are higher than those of the two recent studies on the same data.
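Two of the sequence encodings named in this abstract, amino acid composition and the composition of k-spaced amino acid pairs, can be illustrated with a minimal Python sketch. This follows the common textbook definitions and is not the authors' code; the function names and the choice to skip non-standard residues are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> list[float]:
    """Fraction of each standard amino acid in the sequence (20 values)."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def k_spaced_pair_composition(sequence: str, k: int) -> dict[str, float]:
    """Frequency of residue pairs separated by exactly k other residues (400 values)."""
    # Assumption: non-standard residues are dropped before pairing.
    seq = [ch for ch in sequence.upper() if ch in AMINO_ACIDS]
    n_pairs = len(seq) - k - 1
    if n_pairs <= 0:
        raise ValueError("sequence is too short for this gap size")
    counts = Counter((seq[i], seq[i + k + 1]) for i in range(n_pairs))
    return {a + b: counts[(a, b)] / n_pairs
            for a, b in product(AMINO_ACIDS, repeat=2)}
```

Vectors of this kind would then be concatenated with the other encodings before feature selection and classifier training.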
Affiliation(s)
- Hasan Zulfiqar
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Shi Yuan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Qin-Lai Huang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Jie Sun
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiao-Long Yu
- School of Materials Science and Engineering, Hainan University, Haikou 570228, China
| | - Hao Lin
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
31
|
Yang H, Tong F, Qi C, Wang P, Li J, Cheng L. Prioritizing Disease-Related Microbes Based on the Topological Properties of a Comprehensive Network. Front Microbiol 2021; 12:685549. [PMID: 34326821 PMCID: PMC8315281 DOI: 10.3389/fmicb.2021.685549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 05/10/2021] [Indexed: 01/09/2023]
Abstract
Many microbes are parasitic within the human body, engaging in various physiological processes and playing an important role in human diseases. The discovery of new microbe-disease associations aids our understanding of disease pathogenesis. Computational methods can be applied in such investigations, thereby avoiding the time-consuming and laborious nature of experimental methods. In this study, we constructed a comprehensive microbe-disease network by integrating known microbe-disease associations from three large-scale databases (Peryton, Disbiome, and gutMDisorder), and extended the random walk with restart to the network for prioritizing unknown microbe-disease associations. The area under the curve values of the leave-one-out cross-validation and the fivefold cross-validation exceeded 0.9370 and 0.9366, respectively, indicating the high performance of this method. In case studies of inflammatory bowel disease, asthma, and obesity, all of which are widely studied diseases, some prioritized disease-related microbes were validated by recent literature. This suggests that our method is effective at prioritizing novel disease-related microbes and may offer further insight into disease pathogenesis.
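The random walk with restart mentioned in this abstract can be sketched as follows. This is the standard formulation, p(t+1) = (1 - r) W p(t) + r p(0), on a toy undirected graph, not the extended variant the authors describe; the function name, the restart probability of 0.7, and the node labels in the usage note are illustrative assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adjacency: dict mapping each node to a list of its neighbours
               (every neighbour must also appear as a key).
    seeds:     set of nodes the walker jumps back to with probability `restart`.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term: a fraction `restart` of the mass returns to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        # Walk term: each node spreads the rest of its mass evenly to neighbours.
        for u in nodes:
            neighbours = adjacency[u]
            if not neighbours:
                continue  # isolated node: its mass is dropped in this sketch
            share = (1.0 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p
```

For example, on a chain `{"disease": ["m1"], "m1": ["disease", "m2"], "m2": ["m1", "m3"], "m3": ["m2"]}` seeded at `"disease"`, the returned scores decrease with distance from the seed, which is the ranking signal used to prioritize candidate nodes.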
Affiliation(s)
- Haixiu Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Fan Tong
- Academy of Military Medical Science, Beijing, China
| | - Changlu Qi
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Ping Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Jiangyu Li
- Academy of Military Medical Science, Beijing, China
| | - Liang Cheng
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- NHC and CAMS Key Laboratory of Molecular Probe and Targeted Theranostics, Harbin Medical University, Harbin, China
| |
|
32
|
Zhu Z, Han X, Cheng L. Identification of gene signature associated with type 2 diabetes mellitus by integrating mutation and expression data. Curr Gene Ther 2021; 22:51-58. [PMID: 34238156 DOI: 10.2174/1566523221666210707140839] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/15/2021] [Revised: 04/08/2021] [Accepted: 04/18/2021] [Indexed: 11/22/2022]
Abstract
Type 2 diabetes mellitus (T2DM) is a chronic disease. Molecular diagnosis should be helpful for the treatment of T2DM patients. With the development of sequencing technology, a large number of differentially expressed genes have been identified from expression data. However, machine learning methods can only identify a locally optimal solution as the signature. Inherited mutation information can better reflect the relationship between genes and diseases. Therefore, mutation information needs to be integrated to identify the signature more accurately. To this end, we integrated genome-wide association study (GWAS) data and expression data, combined with expression quantitative trait loci (eQTL) technology, to obtain a T2DM predictive signature (T2DMSig-10). First, we used GWAS data to obtain a list of T2DM susceptibility loci. Then, we used eQTL technology to obtain risk single nucleotide polymorphisms (SNPs) and combined them with pancreatic β-cell gene expression data to obtain 10 protein-coding genes. Next, we combined these genes with equal weights. Receiver operating characteristic (ROC) analysis, the single-gene removal and addition method, gene ontology function enrichment, and protein-protein interaction network analysis were then used to verify the results, which showed that T2DMSig-10 had an excellent predictive effect on T2DM (AUC=0.99) and was highly robust. In short, we obtained a predictive signature of T2DM and further verified it.
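The idea of combining signature genes with equal weights and evaluating the resulting score by ROC analysis can be sketched as below. This assumes the equal-weight combination is a plain mean of expression values, which the abstract does not specify; the function names and any gene identifiers used with them are hypothetical.

```python
def signature_score(expression, genes):
    """Equal-weight score: the mean expression of the signature genes.

    expression: dict mapping gene name -> expression value for one sample.
    genes:      list of signature gene names (hypothetical identifiers).
    """
    return sum(expression[g] for g in genes) / len(genes)


def roc_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney formulation.

    Equals the probability that a randomly chosen case (label 1) scores
    higher than a randomly chosen control (label 0); ties count as 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
               for c in cases for d in controls)
    return wins / (len(cases) * len(controls))
```

In a single-gene removal check of the kind the abstract mentions, one would recompute `roc_auc` after dropping each gene from `genes` in turn and compare the values.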
Affiliation(s)
- Zijun Zhu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
| | - Xudong Han
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
| | - Liang Cheng
- NHC and CAMS Key Laboratory of Molecular Probe and Targeted Theranostics, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
| |
|
33
|
Zong Y, Li X. Identification of Causal Genes of COVID-19 Using the SMR Method. Front Genet 2021; 12:690349. [PMID: 34290742 PMCID: PMC8287881 DOI: 10.3389/fgene.2021.690349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/02/2021] [Accepted: 05/07/2021] [Indexed: 01/03/2023]
Abstract
Since the first report of COVID-19 in December 2019, more than 100 million people have been infected with SARS-CoV-2. Despite ongoing research, there is still limited knowledge about the genetic causes of COVID-19. To resolve this problem, we applied the SMR method to analyze the genes involved in COVID-19 pathogenesis by the integration of multiple omics data. Here, we assessed the SNPs associated with COVID-19 risk from the GWAS data of Spanish and Italian patients and lung eQTL data from the GTEx project. Then, GWAS and eQTL data were integrated by the summary-data-based Mendelian randomization (SMR) method using SNPs as instrumental variables (IVs). As a result, six protein-coding and five non-protein-coding genes regulated by nine SNPs were identified as significant risk factors for COVID-19. Functional analysis of these genes showed that UQCRH participates in cardiac muscle contraction, PPA2 is closely related to sudden cardiac failure (SCD), and OGT, as the interacting gene partner of PANO1, is associated with neurological disease. Observational studies show that myocardial damage, SCD, and neurological disease often occur in COVID-19 patients. Thus, our findings provide a potential molecular mechanism for understanding the complications of COVID-19.
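The core SMR calculation for a single SNP used as an instrumental variable can be sketched as below. It follows the commonly cited form of the test (effect estimate b_xy = b_zy / b_zx and an approximate one-degree-of-freedom chi-square statistic built from the two z-scores); it is not the code used in this study, and the function and parameter names are assumptions.

```python
import math


def smr_test(b_zx, se_zx, b_zy, se_zy):
    """Single-SNP SMR estimate and approximate test.

    b_zx, se_zx: SNP effect on gene expression (eQTL) and its standard error.
    b_zy, se_zy: SNP effect on the trait (GWAS) and its standard error.
    Returns (b_xy, t_smr, p_value).
    """
    if b_zx == 0:
        raise ValueError("the instrument must have a non-zero eQTL effect")
    z_zx = b_zx / se_zx
    z_zy = b_zy / se_zy
    # Estimated effect of gene expression on the trait.
    b_xy = b_zy / b_zx
    # Approximate chi-square statistic with 1 degree of freedom.
    t_smr = (z_zy ** 2 * z_zx ** 2) / (z_zy ** 2 + z_zx ** 2)
    # Upper tail of chi-square(1): P(Z^2 > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, t_smr, p_value
```

Because the statistic is bounded by the smaller of the two squared z-scores, a gene reaches significance only when the SNP is strongly associated with both expression and the trait.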
Affiliation(s)
- Yan Zong
- Department of Infectious Diseases, Yiwu Central Hospital, Jinhua, China
| | - Xiaofei Li
- Department of Infectious Diseases, Yiwu Central Hospital, Jinhua, China
| |
|
34
|
Ru X, Ye X, Sakurai T, Zou Q, Xu L, Lin C. Current status and future prospects of drug-target interaction prediction. Brief Funct Genomics 2021; 20:312-322. [PMID: 34189559 DOI: 10.1093/bfgp/elab031] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/08/2021] [Revised: 06/01/2021] [Accepted: 06/04/2021] [Indexed: 01/09/2023]
Abstract
Drug-target interaction prediction is important for drug development and drug repurposing. Many computational methods have been proposed for drug-target interaction prediction because of their potential to reduce time and cost. In this review, we introduce the molecular docking and machine learning-based methods that have been widely applied to drug-target interaction prediction. In particular, machine learning-based methods are divided into different types according to the data processing form and task type. For each type of method, we provide a specific description and propose some solutions to improve its capability. Knowledge of heterogeneous networks and learning to rank is also summarized in this review. As far as we know, this is the first comprehensive review that summarizes knowledge of heterogeneous networks and learning to rank in drug-target interaction prediction. Moreover, we propose three aspects that can be explored in depth in future research.
Affiliation(s)
| | - Xiucai Ye
- Department of Computer Science, and Center for Artificial Intelligence Research (C-AIR), University of Tsukuba
| | - Tetsuya Sakurai
- Department of Computer Science, and Center for Artificial Intelligence Research (C-AIR), University of Tsukuba
| | - Quan Zou
- University of Electronic Science and Technology of China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic
| | | |
|
35
|
Hunt C, Montgomery S, Berkenpas JW, Sigafoos N, Oakley JC, Espinosa J, Justice N, Kishaba K, Hippe K, Si D, Hou J, Ding H, Cao R. Recent Progress of Machine Learning in Gene Therapy. Curr Gene Ther 2021; 22:132-143. [PMID: 34161210 DOI: 10.2174/1566523221666210622164133] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/02/2021] [Revised: 03/15/2021] [Accepted: 04/02/2021] [Indexed: 11/22/2022]
Abstract
With new developments in biomedical technology, altering genes with techniques like CRISPR is now a viable therapeutic treatment. At the same time, whole genome sequencing is becoming increasingly cheap, resulting in rapid advancement in gene therapy and editing in precision medicine. Thus, understanding the current industry and academic applications of gene therapy provides an important backdrop to future scientific developments. Additionally, machine learning and artificial intelligence techniques allow for the reduction of time and money spent in the development of new gene therapy products and techniques. In this paper, we survey the current progress of gene therapy treatments for several diseases and explore machine learning applications in gene therapy. We also discuss the ethical implications of gene therapy and the use of machine learning in precision medicine. Machine learning and gene therapy are both topics gaining popularity in various publications, and we conclude that there is still room for continued research and application of machine learning techniques in the gene therapy field.
Affiliation(s)
- Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| | - Sandra Montgomery
- Department of Physics, Pacific Lutheran University, Tacoma, WA, United States
| | | | - Noel Sigafoos
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| | - John Christian Oakley
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| | - Jacob Espinosa
- Department of Mathematics, Pacific Lutheran University, Tacoma, WA, United States
| | - Nicola Justice
- Department of Mathematics, Pacific Lutheran University, Tacoma, WA, United States
| | - Kiyomi Kishaba
- Department of Humanities, Pacific Lutheran University, Tacoma, WA, United States
| | - Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| | - Dong Si
- Division of Computing Software Systems, University of Washington-Bothell, Bothell, WA, United States
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, MO, United States
| | - Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| |
|
36
|
Xu L, Ru X, Song R. Application of Machine Learning for Drug-Target Interaction Prediction. Front Genet 2021; 12:680117. [PMID: 34234813 PMCID: PMC8255962 DOI: 10.3389/fgene.2021.680117] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/13/2021] [Accepted: 05/28/2021] [Indexed: 11/13/2022]
Abstract
Exploring drug–target interactions by biomedical experiments requires a lot of human, financial, and material resources. To save time and cost to meet the needs of the present generation, machine learning methods have been introduced into the prediction of drug–target interactions. The large amount of available drug and target data in existing databases, the evolving and innovative computer technologies, and the inherent characteristics of various types of machine learning have made machine learning techniques the mainstream method for drug–target interaction prediction research. In this review, details of the specific applications of machine learning in drug–target interaction prediction are summarized, the characteristics of each algorithm are analyzed, and the issues that need to be further addressed and explored for future research are discussed. The aim of this review is to provide a sound basis for the construction of high-performance models.
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Xiaoqing Ru
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| | - Rong Song
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| |
|
37
|
Wang X, Xin B, Tan W, Xu Z, Li K, Li F, Zhong W, Peng S. DeepR2cov: deep representation learning on heterogeneous drug networks to discover anti-inflammatory agents for COVID-19. Brief Bioinform 2021; 22:6296505. [PMID: 34117734 PMCID: PMC8344611 DOI: 10.1093/bib/bbab226] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 01/17/2021] [Revised: 05/14/2021] [Accepted: 05/24/2021] [Indexed: 02/06/2023]
Abstract
Recent studies have demonstrated that the excessive inflammatory response is an important factor of death in coronavirus disease 2019 (COVID-19) patients. In this study, we propose a deep representation learning approach on heterogeneous drug networks, termed DeepR2cov, to discover potential agents for treating the excessive inflammatory response in COVID-19 patients. This work explores the multi-hub characteristic of a heterogeneous drug network integrating eight unique networks. Inspired by the multi-hub characteristic, we design 3 billion special meta paths to train a deep representation model for learning low-dimensional vectors that integrate long-range structure dependency and complex semantic relations among network nodes. Based on the representation vectors and transcriptomics data, we predict 22 drugs that bind to tumor necrosis factor-α or interleukin-6. Their therapeutic associations with the inflammation storm in COVID-19 patients and their molecular binding models are further validated via data from PubMed publications, ongoing clinical trials and a docking program. In addition, the results on five biomedical applications suggest that DeepR2cov significantly outperforms five existing representation approaches. In summary, DeepR2cov is a powerful network representation approach and holds the potential to accelerate treatment of the inflammatory responses in COVID-19 patients. The source code and data can be downloaded from https://github.com/pengsl-lab/DeepR2cov.git.
Affiliation(s)
- Xiaoqi Wang
- College of Computer Science and Electronic Engineering, Hunan University, China
| | - Bin Xin
- College of Computer Science and Electronic Engineering, Hunan University, China
| | - Weihong Tan
- Chinese Academy of Sciences in the College of Chemistry and Chemical Engineering, College of Biology, Hunan University, China
| | - Zhijian Xu
- Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, China
| | - Kenli Li
- College of Computer Science and Electronic Engineering, Hunan University, China
| | - Fei Li
- Computer Network Information Center, Chinese Academy of Sciences, China
| | - Wu Zhong
- National Engineering Research Center for the Emergency Drug, Beijing Institute of Pharmacology and Toxicology, China
| | - Shaoliang Peng
- College of Computer Science and Electronic Engineering, Hunan University, China
| |
|
38
|
Ao C, Zou Q, Yu L. RFhy-m2G: Identification of RNA N2-methylguanosine modification sites based on random forest and hybrid features. Methods 2021; 203:32-39. [PMID: 34033879 DOI: 10.1016/j.ymeth.2021.05.016] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 04/18/2021] [Revised: 05/04/2021] [Accepted: 05/20/2021] [Indexed: 12/31/2022]
Abstract
N2-methylguanosine (m2G) is a post-transcriptional modification of RNA that is found in eukaryotes and archaea. The biological function of m2G modification discovered so far is to control and stabilize the three-dimensional structure of tRNA and the dynamic barrier of reverse transcription. To discover additional biological functions of m2G, it is necessary to develop time-saving and labor-saving computational tools to identify m2G. In this paper, based on hybrid features and a random forest, a novel predictor, RFhy-m2G, was developed to identify the m2G modification sites for three species. The hybrid feature used by the predictor fuses three feature encodings: ENAC, PseDNC, and NPPS. These three features capture primary sequence derivation properties, physicochemical properties, and position-specific properties. Since the hybrid features contain redundant features, MRMD2.0 is used for optimal feature selection. Feature analysis shows that the optimal hybrid features still contain all three kinds of properties, and that the hybrid features can more accurately identify m2G modification sites and improve prediction performance. In five-fold cross-validation and independent testing of the prediction model, the accuracies obtained were 0.9982 and 0.9417, respectively. The robustness of the predictor is demonstrated by comparisons with other predictors.
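Of the three encodings named in this abstract, ENAC (enhanced nucleic acid composition) is the simplest to illustrate. The sketch below uses the usual sliding-window definition, in which the frequency of each nucleotide is computed inside a fixed-size window that moves along the sequence one position at a time. The window size of 5, the conversion of T to U, and the function name are assumptions; the abstract does not state the settings the authors used.

```python
def enac(sequence: str, window: int = 5, alphabet: str = "ACGU") -> list[float]:
    """Sliding-window nucleotide composition.

    Returns (len(sequence) - window + 1) * len(alphabet) values: for each
    window position, the fraction of the window occupied by each base.
    """
    seq = sequence.upper().replace("T", "U")  # treat DNA-style input as RNA
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    features = []
    for start in range(len(seq) - window + 1):
        win = seq[start:start + window]
        features.extend(win.count(base) / window for base in alphabet)
    return features
```

Because each window contributes its own block of values, the encoding keeps coarse positional information that a single whole-sequence composition vector would lose.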
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China.
| |
|
39
|
Zhang J, Sun M, Zhao Y, Geng G, Hu Y. Identification of Gingivitis-Related Genes Across Human Tissues Based on the Summary Mendelian Randomization. Front Cell Dev Biol 2021; 8:624766. [PMID: 34026747 PMCID: PMC8134671 DOI: 10.3389/fcell.2020.624766] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/01/2020] [Accepted: 12/02/2020] [Indexed: 11/13/2022] Open
Abstract
Periodontal diseases are among the most frequent inflammatory diseases affecting children and adolescents; they affect the supporting structures of the teeth, lead to tooth loss, and contribute to systemic inflammation. Gingivitis is the most common periodontal infection. It is mainly caused by a substance produced by microbial plaque, systemic disorders, and genetic abnormalities in the host. Identifying gingivitis-related genes across human tissues is significant not only for understanding disease mechanisms but also for disease development and clinical diagnosis. The genome-wide association study (GWAS) is a commonly used method to mine disease-related genetic variants. However, due to factors such as linkage disequilibrium, it is difficult for GWAS to identify genes directly related to the disease. Hence, we constructed a data integration method that uses summary Mendelian randomization (SMR) to combine GWAS with expression quantitative trait locus (eQTL) data to identify gingivitis-related genes. Five eQTL studies from different human tissues and one GWAS study were referenced in this paper. This study identified several candidate SNPs and genes related to gingivitis in a tissue-specific or cross-tissue manner. We also analyzed and explained the functions of these genes. The R program for the SMR method has been uploaded to GitHub (https://github.com/hxdde/SMR).
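The abstract does not state the test statistic, so the following is only a sketch of the approximate SMR test as it is commonly formulated: for the top cis-eQTL of a gene, the GWAS and eQTL z-scores are combined into a statistic that is compared against a chi-square distribution with one degree of freedom. The function name `smr_test` and the example z-scores are hypothetical; the authors' R program may differ in detail.

```python
import math
from typing import Tuple


def smr_test(z_gwas: float, z_eqtl: float) -> Tuple[float, float]:
    """Return (T_SMR, p_value) from GWAS and eQTL z-scores of one SNP."""
    num = (z_gwas ** 2) * (z_eqtl ** 2)
    den = z_gwas ** 2 + z_eqtl ** 2
    if den == 0:
        return 0.0, 1.0
    t_smr = num / den
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(t_smr / 2.0))
    return t_smr, p_value


t, p = smr_test(3.0, 4.0)  # hypothetical z-scores
print(round(t, 2))
```

Because the statistic is bounded by the smaller of the two squared z-scores, a gene is flagged only when both the GWAS and the eQTL association are strong at the same variant.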
Affiliation(s)
- Jiahui Zhang
- Department of Stomatology and Dental Hygiene, The Fourth Affiliated Hospital, Harbin Medical University, Harbin, China
| | - Mingai Sun
- General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
| | - Yuanyuan Zhao
- General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
| | - Guannan Geng
- Department of Endocrinology, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Yang Hu
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| |
|
40
|
Zulfiqar H, Khan RS, Hassan F, Hippe K, Hunt C, Ding H, Song XM, Cao R. Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:3348-3363. [PMID: 34198389 DOI: 10.3934/mbe.2021167] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 04/24/2023]
Abstract
N4-methylcytosine (4mC) is a DNA modification that can regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded by k-mer, enhanced nucleic acid composition, and composition of k-spaced nucleic acid pairs. Subsequently, these features were optimized by using minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) and five-fold cross-validation. The optimal features obtained were fed into a random forest classifier to discriminate 4mC from non-4mC sites in mouse. On the independent dataset, our model yielded an overall accuracy of 85.41%, approximately 3.8% and 6.3% higher than the two existing models, i4mC-Mouse and 4mCpred-EL, respectively. The data and source code of the model can be freely downloaded from https://github.com/linDing-groups/model_4mc.
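To make the "composition of k-spaced nucleic acid pairs" encoding more concrete, here is a minimal sketch that counts nucleotide pairs separated by a fixed gap and normalizes by the number of pairs. The function name `cksnap`, the dictionary output, and the choice to skip pairs containing non-ACGT characters are assumptions for illustration, not details from the paper or its repository.

```python
from itertools import product
from typing import Dict


def cksnap(seq: str, gap: int) -> Dict[str, float]:
    """Frequency of each of the 16 nucleotide pairs separated by `gap` bases."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=2)}
    total = len(seq) - gap - 1  # number of pairs at this spacing
    if total <= 0:
        raise ValueError("sequence too short for this gap")
    for i in range(total):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:  # pairs with ambiguous bases are ignored
            counts[pair] += 1
    return {pair: n / total for pair, n in counts.items()}


feats = cksnap("ACGTAC", gap=1)
print(feats["AG"])
```

In practice the vectors for several gap values would be concatenated with the other encodings before feature selection.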
Affiliation(s)
- Hasan Zulfiqar
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Rida Sarwar Khan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Farwa Hassan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| | - Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| | - Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiao-Ming Song
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- School of Life Sciences, North China University of Science and Technology, Tangshan, Hebei 063210, China
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| |
|
41
|
Chen Z, Shen Z, Zhang Z, Zhao D, Xu L, Zhang L. RNA-Associated Co-expression Network Identifies Novel Biomarkers for Digestive System Cancer. Front Genet 2021; 12:659788. [PMID: 33841514 PMCID: PMC8033200 DOI: 10.3389/fgene.2021.659788] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/28/2021] [Accepted: 02/25/2021] [Indexed: 01/04/2023] Open
Abstract
Cancers of the digestive system are malignant diseases. Our study focused on colon cancer, esophageal cancer (ESCC), rectal cancer, gastric cancer (GC), and rectosigmoid junction cancer to identify possible biomarkers for these diseases. The transcriptome data were downloaded from The Cancer Genome Atlas (TCGA) database, and a network was constructed using the WGCNA algorithm. Two significant modules were found, and coexpression networks were constructed. CytoHubba was used to identify hub genes of the two networks. GO analysis suggested that the network genes were involved in metabolic processes, biological regulation, membrane, and protein binding. KEGG analysis indicated that the significant pathways were the calcium signaling pathway, fatty acid biosynthesis, pathways in cancer, and insulin resistance. The most significant hub genes of the two networks included hsa-let-7b-3p, hsa-miR-378a-5p, hsa-miR-26a-5p, hsa-miR-382-5p, and hsa-miR-29b-2-5p, and SECISBP2L, NCOA1, HERC1, HIPK3, and MBNL1, respectively. These genes were predicted to be associated with tumor prognosis in this patient population.
Affiliation(s)
- Zheng Chen
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Zijie Shen
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Zilong Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Da Zhao
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Lijun Zhang
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China
| |
|
42
|
Niu K, Luo X, Zhang S, Teng Z, Zhang T, Zhao Y. iEnhancer-EBLSTM: Identifying Enhancers and Strengths by Ensembles of Bidirectional Long Short-Term Memory. Front Genet 2021; 12:665498. [PMID: 33833783 PMCID: PMC8021722 DOI: 10.3389/fgene.2021.665498] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/08/2021] [Accepted: 03/01/2021] [Indexed: 12/26/2022] Open
Abstract
Enhancers are regulatory DNA sequences that can be bound by specific proteins named transcription factors (TFs). The interactions between enhancers and TFs regulate specific genes by increasing the target gene expression. Therefore, enhancer identification and classification have been a critical issue in the enhancer field. Unfortunately, suitable methods to identify enhancers are still lacking. Previous research has mainly focused on features of enhancer function and interactions and has ignored sequence information. Recurrent neural network (RNN) and long short-term memory (LSTM) models are currently the most common methods for processing time series data, and LSTM is better suited than a plain RNN to DNA sequences. In this paper, we take advantage of LSTM to build a method named iEnhancer-EBLSTM (ensembles of bidirectional LSTM) to identify enhancers. iEnhancer-EBLSTM consists of two steps. First, we extract subsequences as features by sliding a 3-mer window along the DNA sequence. Second, the EBLSTM model is used to identify enhancers from the candidate input sequences. We use the dataset from the study of Quang H et al. as the benchmark. The experimental results on this dataset demonstrate the efficiency of our proposed model.
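The first step described in the abstract, sliding a 3-mer window along the sequence, can be sketched as follows. The stride of 1, the integer vocabulary, and the names `kmer_tokens` and `encode` are assumptions for this example; the paper's actual preprocessing may differ.

```python
from itertools import product
from typing import List

K = 3
# Hypothetical vocabulary: every 3-mer over ACGT mapped to an integer id.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def kmer_tokens(seq: str, k: int = K) -> List[str]:
    """Overlapping k-mers obtained with a stride-1 sliding window."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def encode(seq: str) -> List[int]:
    """Map a DNA sequence to the id sequence a recurrent model could consume."""
    return [VOCAB[token] for token in kmer_tokens(seq.upper())]


ids = encode("ACGTA")
print(kmer_tokens("ACGTA"))
```

Unlike a k-mer frequency vector, this representation keeps the order of the subsequences, which is what a bidirectional LSTM operates on. Sequences containing characters outside `ACGT` raise a `KeyError` in this sketch.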
Affiliation(s)
- Kun Niu
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Ximei Luo
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Shumei Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yuming Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| |
|
43
|
Jiao S, Wu S, Huang S, Liu M, Gao B. Advances in the Identification of Circular RNAs and Research Into circRNAs in Human Diseases. Front Genet 2021; 12:665233. [PMID: 33815488 PMCID: PMC8017306 DOI: 10.3389/fgene.2021.665233] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 02/07/2021] [Accepted: 03/01/2021] [Indexed: 12/14/2022] Open
Abstract
Circular RNAs (circRNAs) are a class of endogenous non-coding RNAs (ncRNAs) with a closed-loop structure that are mainly produced by variable processing of precursor mRNAs (pre-mRNAs). They are widely present in all eukaryotes and are very stable. Currently, circRNA studies have become a hotspot in RNA research. It has been reported that circRNAs constitute a significant proportion of transcript expression, and some are significantly more abundantly expressed than other transcripts. CircRNAs have regulatory roles in gene expression and critical biological functions in the development of organisms, such as acting as microRNA sponges or as endogenous RNAs and biomarkers. As such, they may have useful functions in the diagnosis and treatment of diseases. CircRNAs have been found to play an important role in the development of several diseases, including atherosclerosis, neurological disorders, diabetes, and cancer. In this paper, we review the status of circRNA research, describe circRNA-related databases and the identification of circRNAs, discuss the role of circRNAs in human diseases such as colon cancer, atherosclerosis, and gastric cancer, and identify remaining research questions related to circRNAs.
Affiliation(s)
- Shihu Jiao
- Hainan Key Laboratory for Computational Science and Application, Hainan Normal University, Haikou, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Song Wu
- Director of Preventive Treatment of Disease Centre, Qinhuangdao Hospital of Traditional Chinese Medicine, Qinhuangdao, China
| | - Shan Huang
- Department of Neurology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China
| | - Mingyang Liu
- Department of Internal Medicine-Oncology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Bo Gao
- Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China
| |
|
44
|
Wang X, Yang Y, Liu J, Wang G. The stacking strategy-based hybrid framework for identifying non-coding RNAs. Brief Bioinform 2021; 22:6165004. [PMID: 33693454 DOI: 10.1093/bib/bbab023] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 11/04/2020] [Revised: 01/16/2021] [Indexed: 12/12/2022] Open
Abstract
With the development of next-generation sequencing technology, a large number of transcripts need to be analyzed, and distinguishing non-coding ribonucleic acids (ncRNAs) from coding RNAs has been a challenge. For non-model organisms, many existing methods cannot identify ncRNAs because transcriptional data are lacking. Therefore, in addition to using deoxyribonucleic acid-based and RNA-based features, we proposed a hybrid framework based on the stacking strategy to identify ncRNAs and added eight new features based on predicted peptides. The framework is a stacked two-layer classifier that combines random forest (RF), LightGBM, XGBoost and logistic regression (LR) models. We used this framework to build two types of models. We tested the cross-species ncRNA identification model on six different species: human, mouse, zebrafish, fruit fly, worm and Arabidopsis. Compared with other tools, our model was the best on the Arabidopsis, worm and zebrafish datasets, with accuracies of 98.36%, 99.65% and 94.12%. For performance metrics analysis, the datasets of the six species were considered as a whole, and the sensitivity, accuracy, precision and F1 values of our model were the best. For the plant-specific ncRNA identification model, the average values of the six metrics across the two experiments were all greater than 95%, which demonstrates that it can be used to identify ncRNAs in plants. These results indicate that the hybrid framework is applicable to both animals and plants and has significant advantages in cross-species ncRNA identification.
Affiliation(s)
- Xin Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yang Yang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Jian Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Guohua Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
|
45
|
Niu M, Lin Y, Zou Q. sgRNACNN: identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks. PLANT MOLECULAR BIOLOGY 2021; 105:483-495. [PMID: 33385273 DOI: 10.1007/s11103-020-01102-y] [Citation(s) in RCA: 65] [Impact Index Per Article: 21.7] [Received: 05/17/2020] [Accepted: 12/01/2020] [Indexed: 06/12/2023]
Abstract
KEY MESSAGE: We proposed an ensemble convolutional neural network model to identify sgRNA high on-target activity in four crops, using one-hot encoding and k-mers for sequence encoding. As an important component of the CRISPR/Cas9 system, single-guide RNA (sgRNA) plays an important role in gene redirection and editing. sgRNA has contributed to the improvement of agronomic species, but effective bioinformatics tools for identifying sgRNA activity in agronomic species are lacking. Therefore, it is necessary to develop a machine learning-based method to identify sgRNA high on-target activity. In this work, we proposed a simple convolutional neural network method to identify sgRNA high on-target activity. Our study used one-hot encoding and k-mers for sequence data conversion and a voting algorithm to construct the convolutional neural network ensemble model sgRNACNN for the prediction of sgRNA activity. The ensemble model sgRNACNN was used for predictions in four crops: Glycine max, Zea mays, Sorghum bicolor and Triticum aestivum. The accuracy rates for the four crops were 82.43%, 80.33%, 78.25% and 87.49%, respectively. The experimental results showed that sgRNACNN can identify sgRNAs with high on-target activity in agronomic data and can meet the demands of sgRNA activity prediction in agronomy to a certain extent. These results have certain significance for guiding crop gene editing and academic research. The source code and relevant dataset can be found at https://github.com/nmt315320/sgRNACNN.git.
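Two ingredients named in the abstract, one-hot sequence encoding and a voting step over several models, can be sketched in a few lines. The abstract only says "a voting algorithm", so the hard majority vote below is an assumption, as are the function names `one_hot` and `majority_vote` and the example labels.

```python
from collections import Counter
from typing import List, Sequence, Tuple

ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot(seq: str) -> List[Tuple[int, int, int, int]]:
    """Encode each base as a 4-dimensional indicator vector."""
    return [ONE_HOT[base] for base in seq.upper()]


def majority_vote(labels: Sequence[int]) -> int:
    """Hard voting: return the label predicted by most ensemble members."""
    return Counter(labels).most_common(1)[0][0]


encoded = one_hot("AC")
decision = majority_vote([1, 0, 1])  # hypothetical outputs of three CNNs
print(encoded, decision)
```

Using an odd number of ensemble members avoids ties in a binary high/low activity decision.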
Affiliation(s)
- Mengting Niu
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Yuan Lin
- Department of System Integration, Sparebanken Vest, Bergen, Norway.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
| |
|
46
|
Chen CX, Sun LN, Hou XX, Du PC, Wang XL, Du XC, Yu YF, Cai RK, Yu L, Li TJ, Luo MN, Shen Y, Lu C, Li Q, Zhang C, Gao HF, Ma X, Lin H, Cao ZF. Prevention and Control of Pathogens Based on Big-Data Mining and Visualization Analysis. Front Mol Biosci 2021; 7:626595. [PMID: 33718431 PMCID: PMC7947816 DOI: 10.3389/fmolb.2020.626595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2020] [Accepted: 12/21/2020] [Indexed: 11/13/2022] Open
Abstract
Morbidity and mortality caused by infectious diseases rank first among all human illnesses. Many pathogenic mechanisms remain unclear, while misuse of antibiotics has led to the emergence of drug-resistant strains. Infectious diseases spread rapidly and pathogens mutate quickly, posing new threats to human health. However, with the increasing use of high-throughput screening of pathogen genomes, research based on big data mining and visualization analysis has gradually become a hot topic for studies of infectious disease prevention and control. In this paper, the framework was applied to four infectious pathogens (Fusobacterium, Streptococcus, Neisseria, and Streptococcus salivarius) through five functions: 1) genome annotation, 2) phylogeny analysis based on core genome, 3) analysis of structure differences between genomes, 4) prediction of virulence genes/factors with their pathogenic mechanisms, and 5) prediction of resistance genes/factors with their signaling pathways. The experiments were carried out from three angles: phylogeny (macro perspective), structure differences of genomes (micro perspective), and virulence and drug-resistance characteristics (prediction perspective). Therefore, the framework can not only provide evidence to support the rapid identification of new or unknown pathogens, and thus play a role in the prevention and control of infectious diseases, but also help to recommend the most appropriate strains for clinical and scientific research. This paper presents a new framework for genome information visualization analysis based on big data mining technology that accommodates the depth and breadth of molecular-level pathogen research.
Affiliation(s)
- Cui-Xia Chen
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Li-Na Sun
- National Institute for Communicable Disease Control and Prevention, Beijing, China
| | - Xue-Xin Hou
- National Institute for Communicable Disease Control and Prevention, Beijing, China
| | | | - Xiao-Long Wang
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China
| | - Xiao-Chen Du
- Shanghai Jiaotong University School of Medicine, Shanghai, China
| | - Yu-Fei Yu
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Rui-Kun Cai
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Lei Yu
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Tian-Jun Li
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Min-Na Luo
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Yue Shen
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Chao Lu
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Qian Li
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Chuan Zhang
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Hua-Fang Gao
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Xu Ma
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| | - Hao Lin
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zong-Fu Cao
- National Research Institute for Family Planning, Beijing, China; National Center of Human Genetic Resources, Beijing, China
| |
|
47
|
Huang Q, Zhou W, Guo F, Xu L, Zhang L. 6mA-Pred: identifying DNA N6-methyladenine sites based on deep learning. PeerJ 2021; 9:e10813. [PMID: 33604189 PMCID: PMC7866889 DOI: 10.7717/peerj.10813] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/28/2020] [Accepted: 12/30/2020] [Indexed: 01/03/2023] Open
Abstract
With the accumulation of data on 6mA modification sites, an increasing number of scholars have begun to focus on the identification of 6mA sites. Despite the recognized importance of 6mA sites, methods for their identification remain lacking, with most existing methods being aimed at their identification in individual species. In the present study, we aimed to develop an identification method suitable for multiple species. Based on previous research, we propose a method for 6mA site recognition. Our experiments prove that the proposed 6mA-Pred method is effective for identifying 6mA sites in genes from taxa such as rice, Mus musculus, and human. A series of experimental results show that 6mA-Pred is an excellent method. We provide the source code used in the study, which can be obtained from http://39.100.246.211:5004/6mA_Pred/.
Affiliation(s)
- Qianfei Huang
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Wenyang Zhou
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| |
|
48
|
Jing XY, Li FM. Predicting Cell Wall Lytic Enzymes Using Combined Features. Front Bioeng Biotechnol 2021; 8:627335. [PMID: 33585423 PMCID: PMC7874139 DOI: 10.3389/fbioe.2020.627335] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/09/2020] [Accepted: 12/04/2020] [Indexed: 11/13/2022] Open
Abstract
Due to the overuse of antibiotics, people are worried that existing antibiotics will become ineffective against pathogens with the rapid rise of antibiotic-resistant strains. The use of cell wall lytic enzymes to destroy bacteria has become a viable alternative to avoid the crisis of antimicrobial resistance. In this paper, an improved method for cell wall lytic enzymes prediction was proposed and the amino acid composition (AAC), the dipeptide composition (DC), the position-specific score matrix auto-covariance (PSSM-AC), and the auto-covariance average chemical shift (acACS) were selected to predict the cell wall lytic enzymes with support vector machine (SVM). In order to overcome the imbalanced data classification problems and remove redundant or irrelevant features, the synthetic minority over-sampling technique (SMOTE) was used to balance the dataset. The F-score was used to select features. The Sn, Sp, MCC, and Acc were 99.35%, 99.02%, 0.98, and 99.19% with jackknife test using the optimized combination feature AAC+DC+acACS+PSSM-AC. The Sn, Sp, MCC, and Acc of cell wall lytic enzymes in our predictive model were higher than those in existing methods. This improved method may be helpful for protein function prediction.
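For readers unfamiliar with the four scores reported in the abstract, the sketch below computes sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC) from a confusion matrix using their standard definitions. The function name `binary_metrics` and the example counts are hypothetical and are not the paper's data.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard Sn, Sp, Acc and MCC for a binary classifier."""
    sn = tp / (tp + fn)            # sensitivity (recall on positives)
    sp = tn / (tn + fp)            # specificity (recall on negatives)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


scores = binary_metrics(tp=90, tn=80, fp=20, fn=10)  # made-up counts
print({name: round(value, 4) for name, value in scores.items()})
```

In a jackknife (leave-one-out) test, each sample is predicted by a model trained on all the others, and the counts are accumulated over the whole dataset before these formulas are applied.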
Affiliation(s)
- Xiao-Yang Jing
- College of Science, Inner Mongolia Agricultural University, Hohhot, China
| | - Feng-Min Li
- College of Science, Inner Mongolia Agricultural University, Hohhot, China
| |
|
49
|
Lv Z, Cui F, Zou Q, Zhang L, Xu L. Anticancer peptides prediction with deep representation learning features. Brief Bioinform 2021; 22:6126754. [PMID: 33529337 DOI: 10.1093/bib/bbab008] [Citation(s) in RCA: 61] [Impact Index Per Article: 20.3] [Received: 10/06/2020] [Revised: 12/20/2020] [Accepted: 01/05/2021] [Indexed: 12/13/2022] Open
Abstract
Anticancer peptides constitute one of the most promising therapeutic agents for combating common human cancers. Using wet experiments to verify whether a peptide displays anticancer characteristics is time-consuming and costly. Hence, in this study, we proposed a computational method named iACP-DRLF (identify anticancer peptides via deep representation learning features), which uses the light gradient boosting machine algorithm and deep representation learning features. Two kinds of sequence embedding technologies were used, namely soft symmetric alignment embedding and unified representation (UniRep) embedding, both of which involve deep neural network models based on long short-term memory networks and their derived networks. The results showed that the use of deep representation learning features greatly improved the capability of the models to discriminate anticancer peptides from other peptides. Also, UMAP (uniform manifold approximation and projection for dimension reduction) and SHAP (Shapley additive explanations) analysis showed that UniRep has an advantage over other features for anticancer peptide identification. The Python script and pretrained models can be downloaded from https://github.com/zhibinlv/iACP-DRLF or from http://public.aibiochem.net/iACP-DRLF/.
Affiliation(s)
- Zhibin Lv
- University of Electronic Science and Technology of China
| | - Feifei Cui
- University of Electronic Science and Technology of China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences at University of Electronic Science and Technology of China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic
| |
|
50
|
Cui F, Zhang Z, Zou Q. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning. Brief Funct Genomics 2021; 20:61-73. [PMID: 33527980 DOI: 10.1093/bfgp/elaa030] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 12/01/2020] [Revised: 12/16/2020] [Accepted: 12/18/2020] [Indexed: 11/12/2022] Open
Abstract
Deep learning has been increasingly used in bioinformatics, especially in sequence-based protein prediction tasks, as large amounts of biological data are available and deep learning techniques have been developed rapidly in recent years. For sequence-based protein prediction tasks, the selection of a suitable model architecture is essential, whereas sequence data representation is a major factor in controlling model performance. Here, we summarized all the main approaches that are used to represent protein sequence data (amino acid sequence encoding or embedding), which include end-to-end embedding methods, non-contextual embedding methods and embedding methods that use transfer learning and others that are applied for some specific tasks (such as protein sequence embedding based on extracted features for protein structure predictions and graph convolutional network-based embedding for drug discovery tasks). We have also reviewed the architectures of various types of embedding models theoretically and the development of these types of sequence embedding approaches to facilitate researchers and users in selecting the model that best suits their requirements.
Affiliation(s)
- Feifei Cui
- University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Zilong Zhang
- University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| |
|