1
Zhu YH, Zhang C, Liu Y, Omenn GS, Freddolino PL, Yu DJ, Zhang Y. TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction. Genomics, Proteomics & Bioinformatics 2022; 20:1013-1027. [PMID: 35568117 PMCID: PMC10025770 DOI: 10.1016/j.gpb.2022.03.001] [Received: 08/20/2021] [Revised: 03/02/2022] [Accepted: 04/16/2022] [Indexed: 01/13/2023]
Abstract
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we propose a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes through the integration of four complementary pipelines built on transcript expression profiles, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematode) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond that of current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
Affiliation(s)
- Yi-Heng Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yan Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Gilbert S Omenn
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Departments of Internal Medicine and Human Genetics, and School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
- Peter L Freddolino
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
2
Ma Z, Lu YY, Wang Y, Lin R, Yang Z, Zhang F, Wang Y. Metric learning for comparing genomic data with triplet network. Brief Bioinform 2022; 23:6679451. [DOI: 10.1093/bib/bbac345] [Received: 03/22/2022] [Revised: 07/20/2022] [Accepted: 07/26/2022] [Indexed: 11/13/2022]
Abstract
Many biological applications are essentially pairwise comparison problems, such as inferring evolutionary relationships from genomic sequences, binning contigs in metagenomic data, and identifying cell types from single-cell gene expression profiles. Pairwise comparison requires a suitable dissimilarity metric. However, no single metric suits every biological application, so the metric should be learned from data specific to the application of interest. In this study, we propose MEtric Learning with Triplet network (MELT), which learns a nonlinear mapping from the original space to an embedding space that keeps similar data close together and dissimilar data far apart. MELT is a weakly supervised, data-driven comparison framework that learns a more adaptive and accurate dissimilarity measure without label information, for cases where supervised methods are not applicable. We applied MELT to three typical genomic data comparison tasks that lack distinctive grouping information: hierarchical genomic sequences, longitudinal microbiome samples, and longitudinal single-cell gene expression profiles. In these experiments, MELT demonstrated its empirical utility in comparison with many widely used dissimilarity metrics. MELT is expected to accommodate a broader set of applications in large-scale genomic comparisons. MELT is available at https://github.com/Ying-Lab/MELT.
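The idea of keeping similar data close and dissimilar data far apart in the embedding space is commonly expressed as a triplet margin loss. The sketch below uses that standard formulation; the abstract does not state MELT's exact loss, network architecture, or margin, so the function names, the Euclidean distance, and `margin=1.0` are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings (illustrative values, not from the paper).
anchor = [0.0, 0.0]
similar = [0.0, 1.0]      # distance 1 from the anchor
dissimilar = [3.0, 4.0]   # distance 5 from the anchor

loss_satisfied = triplet_loss(anchor, similar, dissimilar)
loss_violated = triplet_loss(anchor, dissimilar, similar)
```

In a trained triplet network the three inputs would be passed through the same nonlinear mapping before the loss is computed; here the embeddings are given directly to keep the example self-contained.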
Affiliation(s)
- Zhi Ma
  - Department of Automation, Xiamen University, China
  - National Institute for Data Science in Health and Medicine, Xiamen University
- Yang Young Lu
  - Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Yiwen Wang
  - Department of Automation, Xiamen University, China
- Renhao Lin
  - Department of Automation, Xiamen University, China
- Zizi Yang
  - Department of Automation, Xiamen University, China
- Fang Zhang
  - Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Ying Wang
  - Department of Automation, Xiamen University, China
  - National Institute for Data Science in Health and Medicine, Xiamen University
  - Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, Fujian 361005, China
  - Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, 361100, China
3
Torkamanian-Afshar M, Nematzadeh S, Tabarzad M, Najafi A, Lanjanian H, Masoudi-Nejad A. In silico design of novel aptamers utilizing a hybrid method of machine learning and genetic algorithm. Mol Divers 2021; 25:1395-1407. [PMID: 33554306 DOI: 10.1007/s11030-021-10192-9] [Received: 10/11/2020] [Accepted: 01/28/2021] [Indexed: 12/29/2022]
Abstract
Aptamers can be regarded as efficient substitutes for monoclonal antibodies in many diagnostic and therapeutic applications. Because SELEX (systematic evolution of ligands by exponential enrichment) is tedious and costly, in silico methods have been developed to speed up the enrichment process. However, most of these methods make no attempt to design novel aptamers. Moreover, some target proteins may not have any binding RNA candidates in nature, so a reductive mechanism is needed to generate novel aptamer pools, from the enormous number of possible nucleotide combinations, for in vitro examination. We applied a genetic algorithm (GA) with an embedded binding-predictor fitness function to the in silico design of RNA aptamers. As a case study, all steps were carried out to generate an aptamer pool against the aminopeptidase N (CD13) biomarker. First, the model was developed based on sequential and structural features of known RNA-protein complexes. Then, using RNA sequences from complexes with positive prediction results as the first generation, novel aptamers were designed and the top-ranked sequences were selected. The 76-mer aptamer with the highest fitness value scored 3 to 6 times higher than its parent oligonucleotides. The reliability of the obtained sequences was confirmed using docking and molecular dynamics simulation. The proposed method offers a simplified contribution to the oligonucleotide-aptamer design process, and it can also serve as a basis for designing novel aptamers against a wide range of biomarkers.
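The loop described above (score a pool of sequences, keep the best, recombine and mutate to form the next generation) can be sketched as a toy genetic algorithm. Everything specific here is an assumption: `toy_fitness` is a hypothetical stand-in (G/C fraction) for the paper's machine-learned binding predictor, and the population size, mutation rate, and elitism settings are illustrative rather than taken from the study.

```python
import random

ALPHABET = "ACGU"  # RNA nucleotides


def toy_fitness(seq):
    """Hypothetical stand-in for a binding predictor: fraction of G/C."""
    return sum(ch in "GC" for ch in seq) / len(seq)


def mutate(seq, rate, rng):
    """Replace each position with a random nucleotide with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else ch for ch in seq)


def crossover(a, b, rng):
    """Single-point crossover between two equal-length parents."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(population, fitness, generations=30, elite=2, rate=0.05, seed=0):
    """Run a simple elitist GA; return the best sequence and per-generation best scores."""
    rng = random.Random(seed)
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: max(elite, len(ranked) // 2)]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
            for _ in range(len(population) - elite)
        ]
        # Elitism: the top sequences survive unchanged into the next generation.
        population = ranked[:elite] + children
    return max(population, key=fitness), history


# Random first generation (the paper instead seeds with predicted binders).
init_rng = random.Random(1)
start = ["".join(init_rng.choice(ALPHABET) for _ in range(20)) for _ in range(12)]
best, history = evolve(start, toy_fitness)
```

Because the top sequences are copied forward unchanged, the best score recorded in `history` never decreases from one generation to the next, which makes the toy run easy to sanity-check.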
Affiliation(s)
- Mahsa Torkamanian-Afshar
  - Department of Bioinformatics, Kish International Campus, University of Tehran, Kish Island, Iran
  - Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
  - Department of Computer Technologies, Beykent University, Istanbul, Turkey
- Sajjad Nematzadeh
  - Department of Computer Technologies, Beykent University, Istanbul, Turkey
- Maryam Tabarzad
  - Protein Technology Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Ali Najafi
  - Molecular Biology Research Center, Systems Biology and Poisonings Institute, Tehran, Iran
- Hossein Lanjanian
  - Cellular and Molecular Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Ali Masoudi-Nejad
  - Department of Bioinformatics, Kish International Campus, University of Tehran, Kish Island, Iran
  - Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
4
Power system events classification using genetic algorithm based feature weighting technique for support vector machine. Heliyon 2021; 7:e05936. [PMID: 33490688 PMCID: PMC7810784 DOI: 10.1016/j.heliyon.2021.e05936] [Received: 10/05/2020] [Revised: 12/09/2020] [Accepted: 01/06/2021] [Indexed: 11/23/2022]
Abstract
Ensuring that power systems operate efficiently under stable and secure conditions has become a key challenge worldwide. Unwanted events, including injections and faults, especially within the generation and transmission domains, are major causes of instability. The earlier operators can identify and accurately diagnose these events, the faster they can take corrective measures to prevent large-scale blackouts and avoidable loss of life and equipment. This paper presents a hybrid classification technique that combines a support vector machine (SVM) with an evolutionary genetic algorithm (GA) to detect and classify unwanted power system events in an accurate yet straightforward manner. In the proposed approach, the features of two high-dimensional synchrophasor datasets are first reduced using principal component analysis. The reduced features are then weighted by relevance, and the dominant weights are identified heuristically by the GA to improve classification results. The weighted, dominant features selected by the GA are then used to train a linear SVM and a radial basis function kernel SVM to classify unwanted events. The performance of the proposed GA-SVM model was evaluated and compared with other models using key classification metrics. The high classification results validate the proposed method. The experimental results indicate that the proposed model improves the overall classification rate of unwanted events in power systems and that using the GA as a feature weighting tool offers a significant improvement in classification performance.
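Feature weighting of this kind needs a fitness function that scores a candidate weight vector by how well a classifier performs on the reweighted features. The sketch below shows only that scoring step, under stated assumptions: a leave-one-out 1-nearest-neighbour classifier stands in for the paper's SVMs, the data are invented toy points rather than synchrophasor measurements, and the PCA and GA search steps are omitted.

```python
def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance with a non-negative weight per feature."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def loo_accuracy(samples, labels, weights):
    """Leave-one-out 1-nearest-neighbour accuracy under the given feature weights.

    A GA could use this value as the fitness of a candidate weight vector.
    """
    correct = 0
    for i, query in enumerate(samples):
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: weighted_sq_dist(query, samples[j], weights),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(samples)


# Toy data: feature 0 separates the classes, feature 1 is large-scale noise.
samples = [(0.0, 5.0), (0.1, -4.0), (0.2, 9.0), (1.0, 5.1), (1.1, -4.1), (1.2, 9.1)]
labels = [0, 0, 0, 1, 1, 1]

uniform = loo_accuracy(samples, labels, (1.0, 1.0))     # noise dominates distances
reweighted = loo_accuracy(samples, labels, (1.0, 0.0))  # noisy feature switched off
```

On this toy set, equal weights let the noisy feature decide every neighbour, while zeroing its weight recovers the class structure; a search procedure such as a GA would explore weight vectors between these extremes.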
5
Makrodimitris S, Reinders MJT, van Ham RCHJ. Metric learning on expression data for gene function prediction. Bioinformatics 2020; 36:1182-1190. [PMID: 31562759 PMCID: PMC7703756 DOI: 10.1093/bioinformatics/btz731] [Received: 06/04/2019] [Revised: 08/31/2019] [Accepted: 09/25/2019] [Indexed: 01/02/2023]
Abstract
Motivation: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest.
Results: To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for Guilt-By-Association-based function prediction. More specifically, if two genes are both annotated with a given GO term, MLC tries to maximize their weighted co-expression; if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method performs particularly well on more specific terms, which are the most interesting. Finally, by inspecting the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel relevant conditions, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa.
Availability and implementation: MLC is available as a Python package at www.github.com/stamakro/MLC.
Supplementary information: Supplementary data are available at Bioinformatics online.
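A weighted co-expression measure of this kind can be illustrated with the standard weighted Pearson correlation, in which each expression sample contributes in proportion to its weight. The abstract does not give MLC's exact objective or how the weights are learned, so the formula below is the textbook definition and the expression values and weights are invented for illustration.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of x and y with per-sample weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression of two genes across five samples.
gene_a = [1.0, 2.0, 3.0, 9.0, 1.0]
gene_b = [2.0, 4.0, 6.0, 1.0, 8.0]

# Equal weights reproduce the ordinary Pearson correlation.
unweighted = weighted_pearson(gene_a, gene_b, [1, 1, 1, 1, 1])
# Weights that keep only the first three samples (assumed informative for a term).
focused = weighted_pearson(gene_a, gene_b, [1, 1, 1, 0, 0])
```

Over all five samples the two genes look anti-correlated, but restricting the weights to the first three samples reveals a perfect positive relationship, which is the effect a term-specific sample weighting aims to exploit.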
Affiliation(s)
- Stavros Makrodimitris
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628 XE, The Netherlands
  - Keygene N.V., Wageningen 6708 PW, The Netherlands
- Marcel J T Reinders
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628 XE, The Netherlands
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333 ZC, The Netherlands
- Roeland C H J van Ham
  - Delft Bioinformatics Lab, Delft University of Technology, Delft 2628 XE, The Netherlands
  - Keygene N.V., Wageningen 6708 PW, The Netherlands