1
Ofer D, Linial M. Automated annotation of disease subtypes. J Biomed Inform 2024; 154:104650. [PMID: 38701887 DOI: 10.1016/j.jbi.2024.104650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 03/28/2024] [Accepted: 04/29/2024] [Indexed: 05/05/2024]
Abstract
BACKGROUND: Distinguishing diseases into distinct subtypes is crucial for study and effective treatment strategies. The Open Targets Platform (OT) integrates biomedical, genetic, and biochemical datasets to empower disease ontologies, classifications, and potential gene targets. Nevertheless, many disease annotations are incomplete, requiring laborious expert medical input. This challenge is especially pronounced for rare and orphan diseases, where resources are scarce.
METHODS: We present a machine learning approach to identifying diseases with potential subtypes, using the approximately 23,000 diseases documented in OT. We derive novel features for predicting diseases with subtypes using direct evidence. Machine learning models were applied to analyze feature importance and evaluate predictive performance for discovering both known and novel disease subtypes.
RESULTS: Our model achieves a high (89.4%) ROC AUC (Area Under the Receiver Operating Characteristic Curve) in identifying known disease subtypes. We integrated pre-trained deep-learning language models and showed their benefits. Moreover, we identify 515 disease candidates predicted to possess previously unannotated subtypes.
CONCLUSIONS: Our models can partition diseases into distinct subtypes. This methodology enables a robust, scalable approach for improving knowledge-based annotations and a comprehensive assessment of disease ontology tiers. Our candidates are attractive targets for further study and personalized medicine, potentially aiding in the unveiling of new therapeutic indications for sought-after targets.
Affiliation(s)
- Dan Ofer
  - Department of Biological Chemistry, The Life Science Institute, The Hebrew University of Jerusalem, Israel.
- Michal Linial
  - Department of Biological Chemistry, The Life Science Institute, The Hebrew University of Jerusalem, Israel.
2
Michael-Pitschaze T, Cohen N, Ofer D, Hoshen Y, Linial M. Detecting anomalous proteins using deep representations. NAR Genom Bioinform 2024; 6:lqae021. [PMID: 38486884 PMCID: PMC10939404 DOI: 10.1093/nargab/lqae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Revised: 11/17/2023] [Accepted: 02/23/2024] [Indexed: 03/17/2024] Open
Abstract
Many advances in biomedicine can be attributed to identifying unusual proteins and genes. Many of these proteins' unique properties were discovered by manual inspection, which is becoming infeasible at the scale of modern protein datasets. Here, we propose to tackle this challenge using anomaly detection methods that automatically identify unexpected properties. We adopt a state-of-the-art anomaly detection paradigm from computer vision, to highlight unusual proteins. We generate meaningful representations without labeled inputs, using pretrained deep neural network models. We apply these protein language models (pLM) to detect anomalies in function, phylogenetic families, and segmentation tasks. We compute protein anomaly scores to highlight human prion-like proteins, distinguish viral proteins from their host proteome, and mark non-classical ion/metal binding proteins and enzymes. Other tasks concern segmentation of protein sequences into folded and unstructured regions. We provide candidates for rare functionality (e.g. prion proteins). Additionally, we show the anomaly score is useful in 3D folding-related segmentation. Our novel method shows improved performance over strong baselines and has objectively high performance across a variety of tasks. We conclude that the combination of pLM and anomaly detection techniques is a valid method for discovering a range of global and local protein characteristics.
Affiliation(s)
- Tomer Michael-Pitschaze
  - The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Niv Cohen
  - The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Dan Ofer
  - Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
- Yedid Hoshen
  - The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Michal Linial
  - Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
3
Mazzi Esquinca ME, Correa CN, Marques de Barros G, Montenegro H, Mantovani de Castro L. Multiomic Approach for Bioprospection: Investigation of Toxins and Peptides of Brazilian Sea Anemone Bunodosoma caissarum. Mar Drugs 2023; 21:md21030197. [PMID: 36976246 PMCID: PMC10058367 DOI: 10.3390/md21030197] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 02/10/2023] [Revised: 03/16/2023] [Accepted: 03/20/2023] [Indexed: 03/29/2023] Open
Abstract
Sea anemones are sessile invertebrates of the phylum Cnidaria, and their survival and evolutionary success are closely tied to their ability to produce and quickly inoculate venom containing potent toxins. In this study, a multi-omics approach was applied to characterize the protein composition of the tentacles and mucus of Bunodosoma caissarum, a species of sea anemone from the Brazilian coast. The tentacle transcriptome yielded 23,444 annotated genes, of which 1% showed similarity with toxins or proteins related to toxin activity. In the proteome analysis, 430 polypeptides were consistently identified: 316 of them were more abundant in the tentacles while 114 were enriched in the mucus. Tentacle proteins were mostly enzymes, followed by DNA- and RNA-associated proteins, while in the mucus most proteins were toxins. In addition, peptidomics allowed the identification of large and small fragments of mature toxins, neuropeptides, and intracellular peptides. In conclusion, integrated omics identified previously unknown or uncharacterized genes in addition to 23 toxin-like proteins of therapeutic potential, improving the understanding of tentacle and mucus composition of sea anemones.
Affiliation(s)
- Maria Eduarda Mazzi Esquinca
  - Department of Biological and Environmental Sciences, Bioscience Institute, Sao Paulo State University (UNESP), Sao Vicente 11330-900, SP, Brazil
- Claudia Neves Correa
  - Department of Biological and Environmental Sciences, Bioscience Institute, Sao Paulo State University (UNESP), Sao Vicente 11330-900, SP, Brazil
  - Biodiversity of Coastal Environments Postgraduate Program, Department of Biological and Environmental Sciences, Bioscience Institute, Sao Paulo State University (UNESP), Sao Vicente 11330-900, SP, Brazil
- Gabriel Marques de Barros
  - Department of Biological and Environmental Sciences, Bioscience Institute, Sao Paulo State University (UNESP), Sao Vicente 11330-900, SP, Brazil
  - Biodiversity of Coastal Environments Postgraduate Program, Department of Biological and Environmental Sciences, Bioscience Institute, Sao Paulo State University (UNESP), Sao Vicente 11330-900, SP, Brazil
- Leandro Mantovani de Castro
  - Department of Biological and Environmental Sciences, Bioscience Institute, Sao Paulo State University (UNESP), Sao Vicente 11330-900, SP, Brazil
  - Biodiversity of Coastal Environments Postgraduate Program, Department of Biological and Environmental Sciences, Bioscience Institute, Sao Paulo State University (UNESP), Sao Vicente 11330-900, SP, Brazil
4
Jones DAB, Moolhuijzen PM, Hane JK. Remote homology clustering identifies lowly conserved families of effector proteins in plant-pathogenic fungi. Microb Genom 2021; 7. [PMID: 34468307 PMCID: PMC8715435 DOI: 10.1099/mgen.0.000637] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/14/2022] Open
Abstract
Plant diseases caused by fungal pathogens are typically initiated by molecular interactions between 'effector' molecules released by a pathogen and receptor molecules on or within the plant host cell. In many cases these effector-receptor interactions directly determine host resistance or susceptibility. The search for fungal effector proteins is a developing area in fungal-plant pathology, with more than 165 distinct confirmed fungal effector proteins in the public domain. For a small number of these, novel effectors can be rapidly discovered across multiple fungal species through the identification of known effector homologues. However, many have no detectable homology by standard sequence-based search methods. This study employs a novel comparison method (RemEff) that is capable of identifying protein families with greater sensitivity than traditional homology-inference methods, leveraging a growing pool of confirmed fungal effector data to enable the prediction of novel fungal effector candidates by protein family association. Resources relating to the RemEff method and data used in this study are available from https://figshare.com/projects/Effector_protein_remote_homology/87965.
Affiliation(s)
- Darcy A B Jones
  - Centre for Crop & Disease Management, School of Molecular & Life Sciences, Curtin University, Perth, Australia
- Paula M Moolhuijzen
  - Centre for Crop & Disease Management, School of Molecular & Life Sciences, Curtin University, Perth, Australia
- James K Hane
  - Centre for Crop & Disease Management, School of Molecular & Life Sciences, Curtin University, Perth, Australia
  - Curtin Institute for Computation, Curtin University, Perth, Australia
5
Cole TJ, Brewer MS. TOXIFY: a deep learning approach to classify animal venom proteins. PeerJ 2019; 7:e7200. [PMID: 31293833 PMCID: PMC6601600 DOI: 10.7717/peerj.7200] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/18/2019] [Accepted: 05/29/2019] [Indexed: 01/06/2023] Open
Abstract
In the era of Next-Generation Sequencing and shotgun proteomics, the sequences of animal toxigenic proteins are being generated at rates exceeding the pace of traditional means for empirical toxicity verification. To facilitate the automation of toxin identification from protein sequences, we trained Recurrent Neural Networks with Gated Recurrent Units on publicly available datasets. The resulting models are available via the novel software package TOXIFY, allowing users to infer the probability of a given protein sequence being a venom protein. TOXIFY is more than 20X faster and uses over an order of magnitude less memory than previously published methods. Additionally, TOXIFY is more accurate, precise, and sensitive at classifying venom proteins.
Affiliation(s)
- T Jeffrey Cole
  - Department of Biology, East Carolina University, Greenville, NC, United States of America
- Michael S Brewer
  - Department of Biology, East Carolina University, Greenville, NC, United States of America
6
Fingerhut LCHW, Strugnell JM, Faou P, Labiaga ÁR, Zhang J, Cooke IR. Shotgun Proteomics Analysis of Saliva and Salivary Gland Tissue from the Common Octopus Octopus vulgaris. J Proteome Res 2018; 17:3866-3876. [PMID: 30220204 DOI: 10.1021/acs.jproteome.8b00525] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/30/2022]
Abstract
The salivary apparatus of the common octopus (Octopus vulgaris) has been the subject of biochemical study for over a century. A combination of bioassays, behavioral studies and molecular analysis on O. vulgaris and related species suggests that its proteome should contain a mixture of highly potent neurotoxins and degradative proteins. However, a lack of genomic and transcriptomic data has meant that the amino acid sequences of these proteins remain almost entirely unknown. To address this, we assembled the posterior salivary gland transcriptome of O. vulgaris and combined it with high resolution mass spectrometry data from the posterior and anterior salivary glands of two adults, the posterior salivary glands of six paralarvae and the saliva from a single adult. We identified a total of 2810 protein groups from across this range of salivary tissues and age classes, including 84 with homology to known venom protein families. Additionally, we found 21 short secreted cysteine rich protein groups of which 12 were specific to cephalopods. By combining protein expression data with phylogenetic analysis we demonstrate that serine proteases expanded dramatically within the cephalopod lineage and that cephalopod specific proteins are strongly associated with the salivary apparatus.
Affiliation(s)
- Legana C H W Fingerhut
  - Department of Molecular and Cell Biology, James Cook University, Townsville, Queensland 4811, Australia
- Jan M Strugnell
  - Centre for Sustainable Tropical Fisheries and Aquaculture, College of Science and Engineering, James Cook University, Townsville, Queensland 4811, Australia
  - Department of Ecology, Environment and Evolution, School of Life Sciences, La Trobe University, Melbourne, Victoria 3086, Australia
- Pierre Faou
  - Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Victoria 3086, Australia
- Álvaro Roura Labiaga
  - Department of Ecology and Marine Biodiversity, Instituto de Investigaciones Marinas de Vigo (IIM-CSIC), Vigo 36208, Spain
- Jia Zhang
  - Department of Molecular and Cell Biology, James Cook University, Townsville, Queensland 4811, Australia
- Ira R Cooke
  - Department of Molecular and Cell Biology, James Cook University, Townsville, Queensland 4811, Australia
  - Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Victoria 3086, Australia
7
Classes, Databases, and Prediction Methods of Pharmaceutically and Commercially Important Cystine-Stabilized Peptides. Toxins (Basel) 2018; 10:toxins10060251. [PMID: 29921767 PMCID: PMC6024828 DOI: 10.3390/toxins10060251] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/16/2018] [Revised: 06/12/2018] [Accepted: 06/14/2018] [Indexed: 12/13/2022] Open
Abstract
Cystine-stabilized peptides represent a large family of peptides characterized by high structural stability and bactericidal, fungicidal, or insecticidal properties. Found throughout a wide range of taxa, this broad and functionally important family can be subclassified into distinct groups depending on the number and type of their cystine bonding patterns, their tertiary structures, and/or their species of origin. Furthermore, proteins related to the cystine-stabilized family are under-represented in the literature because they are difficult to isolate and identify. As a result, there have been several recent attempts to collate them into data resources and build analytic tools for their dynamic prediction. Ultimately, the identification and delivery of new members of this family will lead to their growing inclusion in the repertoire of commercially viable alternatives to antibiotics and environmentally safe insecticides. This review of the literature and current state of cystine-stabilized peptide biology aims to better describe peptide subfamilies, identify databases and analytics resources associated with specific cystine-stabilized peptides, and highlight their current commercial success.
8
Peigneur S, Tytgat J. Toxins in Drug Discovery and Pharmacology. Toxins (Basel) 2018; 10:toxins10030126. [PMID: 29547537 PMCID: PMC5869414 DOI: 10.3390/toxins10030126] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 03/12/2018] [Revised: 03/13/2018] [Accepted: 03/14/2018] [Indexed: 12/18/2022] Open
Abstract
Venoms from marine and terrestrial animals (cone snails, scorpions, spiders, snakes, centipedes, cnidarian, etc.) can be seen as an untapped cocktail of biologically active compounds, being increasingly recognized as a new emerging source of peptide-based therapeutics.
Affiliation(s)
- Steve Peigneur
  - Toxicology and Pharmacology, University of Leuven (KU Leuven), Campus Gasthuisberg, P.O. Box 922, Herestraat 49, 3000 Leuven, Belgium.
- Jan Tytgat
  - Toxicology and Pharmacology, University of Leuven (KU Leuven), Campus Gasthuisberg, P.O. Box 922, Herestraat 49, 3000 Leuven, Belgium.