1
|
Fernández-Díaz R, Cossio-Pérez R, Agoni C, Lam HT, Lopez V, Shields DC. AutoPeptideML: a study on how to build more trustworthy peptide bioactivity predictors. BIOINFORMATICS (OXFORD, ENGLAND) 2024; 40:btae555. [PMID: 39292535 PMCID: PMC11438549 DOI: 10.1093/bioinformatics/btae555] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/24/2024] [Revised: 08/08/2024] [Accepted: 09/17/2024] [Indexed: 09/20/2024]
Abstract
MOTIVATION Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build their own custom models. We examine different steps in the development life-cycle of peptide bioactivity binary predictors and identify key steps where automation cannot only result in a more accessible method, but also more robust and interpretable evaluation leading to more trustworthy models. RESULTS We present a new automated method for drawing negative peptides that achieves better balance between specificity and generalization than current alternatives. We study the effect of homology-based partitioning for generating the training and testing data subsets and demonstrate that model performance is overestimated when no such homology correction is used, which indicates that prior studies may have overestimated their performance when applied to new peptide sequences. We also conduct a systematic analysis of different protein language models as peptide representation methods and find that they can serve as better descriptors than a naive alternative, but that there is no significant difference across models with different sizes or algorithms. Finally, we demonstrate that an ensemble of optimized traditional machine learning algorithms can compete with more complex neural network models, while being more computationally efficient. We integrate these findings into AutoPeptideML, an easy-to-use AutoML tool to allow researchers without a computational background to build new predictive models for peptide bioactivity in a matter of minutes. AVAILABILITY AND IMPLEMENTATION Source code, documentation, and data are available at https://github.com/IBM/AutoPeptideML and a dedicated web-server at http://peptide.ucd.ie/AutoPeptideML. A static version of the software to ensure the reproduction of the results is available at https://zenodo.org/records/13363975.
Collapse
Affiliation(s)
- Raúl Fernández-Díaz
- IBM Research, Dublin, Dublin D15 HN66, Ireland
- School of Medicine, University College Dublin, Dublin D04 C1P1, Ireland
- Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Dublin D04 C1P, Ireland
- The SFI Centre for Research Training in Genomics Data Science, Ireland
| | - Rodrigo Cossio-Pérez
- School of Medicine, University College Dublin, Dublin D04 C1P1, Ireland
- Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Dublin D04 C1P, Ireland
- Department of Science and Technology, National University of Quilmes, Bernal B1876, Provincia de Buenos Aires, Argentina
| | - Clement Agoni
- School of Medicine, University College Dublin, Dublin D04 C1P1, Ireland
- Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Dublin D04 C1P, Ireland
- Discipline of Pharmaceutical Sciences, School of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
| | | | | | - Denis C Shields
- School of Medicine, University College Dublin, Dublin D04 C1P1, Ireland
- Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Dublin D04 C1P, Ireland
| |
Collapse
|
2
|
Ferrer Florensa A, Almagro Armenteros J, Nielsen H, Aarestrup F, Clausen P. SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects. NAR Genom Bioinform 2024; 6:lqae106. [PMID: 39157582 PMCID: PMC11327874 DOI: 10.1093/nargab/lqae106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2024] [Revised: 07/26/2024] [Accepted: 08/05/2024] [Indexed: 08/20/2024] Open
Abstract
The use of deep learning models in computational biology has increased massively in recent years, and it is expected to continue with the current advances in the fields such as Natural Language Processing. These models, although able to draw complex relations between input and target, are also inclined to learn noisy deviations from the pool of data used during their development. In order to assess their performance on unseen data (their capacity to generalize), it is common to split the available data randomly into development (train/validation) and test sets. This procedure, although standard, has been shown to produce dubious assessments of generalization due to the existing similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of two state-of-the-art models on bioinformatics, not only confirming the consequences of randomly splitting databases on the model assessment, but expanding those repercussions to the model development. SpanSeq is available at https://github.com/genomicepidemiology/SpanSeq.
Collapse
Affiliation(s)
- Alfred Ferrer Florensa
- Research Group for Genomic Epidemiology, DTU National Food Institute, Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kongens Lyngby, Denmark
| | - Jose Juan Almagro Armenteros
- Informatics and Predictive Sciences Research, Bristol Myers Squibb Company, Calle Isaac Newton 4, 41092 Sevilla, Spain
| | - Henrik Nielsen
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kongens Lyngby, Denmark
| | - Frank Møller Aarestrup
- Research Group for Genomic Epidemiology, DTU National Food Institute, Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kongens Lyngby, Denmark
| | - Philip Thomas Lanken Conradsen Clausen
- Research Group for Genomic Epidemiology, DTU National Food Institute, Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kongens Lyngby, Denmark
| |
Collapse
|
3
|
Bernett J, Blumenthal DB, Grimm DG, Haselbeck F, Joeres R, Kalinina OV, List M. Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods 2024; 21:1444-1453. [PMID: 39122953 DOI: 10.1038/s41592-024-02362-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2024] [Accepted: 06/26/2024] [Indexed: 08/12/2024]
Abstract
Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets due to their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.
Collapse
Affiliation(s)
- Judith Bernett
- TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - David B Blumenthal
- Department Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.
| | - Dominik G Grimm
- TUM Campus Straubing for Biotechnology and Sustainability, Technical University of Munich, Straubing, Germany.
- Bioinformatics, Weihenstephan-Triesdorf University of Applied Sciences, Straubing, Germany.
- TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
| | - Florian Haselbeck
- TUM Campus Straubing for Biotechnology and Sustainability, Technical University of Munich, Straubing, Germany
- Bioinformatics, Weihenstephan-Triesdorf University of Applied Sciences, Straubing, Germany
- Smart Farming, Weihenstephan-Triesdorf University of Applied Sciences, Freising, Germany
| | - Roman Joeres
- Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden
- Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg, Sweden
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
| | - Olga V Kalinina
- Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany.
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany.
- Medical Faculty, Saarland University, Homburg, Germany.
| | - Markus List
- TUM School of Life Sciences, Technical University of Munich, Freising, Germany.
- Munich Data Science Institute (MDSI), Technical University of Munich, Garching, Germany.
| |
Collapse
|
4
|
Ødum MT, Teufel F, Thumuluri V, Almagro Armenteros JJ, Johansen AR, Winther O, Nielsen H. DeepLoc 2.1: multi-label membrane protein type prediction using protein language models. Nucleic Acids Res 2024; 52:W215-W220. [PMID: 38587188 PMCID: PMC11223819 DOI: 10.1093/nar/gkae237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Revised: 03/06/2024] [Accepted: 03/21/2024] [Indexed: 04/09/2024] Open
Abstract
DeepLoc 2.0 is a popular web server for the prediction of protein subcellular localization and sorting signals. Here, we introduce DeepLoc 2.1, which additionally classifies the input proteins into the membrane protein types Transmembrane, Peripheral, Lipid-anchored and Soluble. Leveraging pre-trained transformer-based protein language models, the server utilizes a three-stage architecture for sequence-based, multi-label predictions. Comparative evaluations with other established tools on a test set of 4933 eukaryotic protein sequences, constructed following stringent homology partitioning, demonstrate state-of-the-art performance. Notably, DeepLoc 2.1 outperforms existing models, with the larger ProtT5 model exhibiting a marginal advantage over the ESM-1B model. The web server is available at https://services.healthtech.dtu.dk/services/DeepLoc-2.1.
Collapse
Affiliation(s)
- Marius Thrane Ødum
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
| | - Felix Teufel
- Bioinformatics Centre, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
- Digital Science & Innovation, Novo Nordisk A/S, 2760 Måløv, Denmark
| | | | - José Juan Almagro Armenteros
- Bristol Myers Squibb Company, Informatics and Predictive Sciences Research, Calle Isaac Newton 4, Sevilla 41092, Spain
| | | | - Ole Winther
- Bioinformatics Centre, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
- Department of Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), 2100 Copenhagen, Denmark
- Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
| | - Henrik Nielsen
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
| |
Collapse
|
5
|
Nielsen H, Teufel F, Brunak S, von Heijne G. SignalP: The Evolution of a Web Server. Methods Mol Biol 2024; 2836:331-367. [PMID: 38995548 DOI: 10.1007/978-1-0716-4007-4_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/13/2024]
Abstract
SignalP ( https://services.healthtech.dtu.dk/services/SignalP-6.0/ ) is a very popular prediction method for signal peptides, the intrinsic signals that make proteins secretory. The SignalP web server has existed since 1995 and is now in its sixth major version. In this historical account, we (three authors who have taken part in the entire journey plus the first author of the latest version) describe the differences between the versions and discuss the various decisions taken along the way.
Collapse
Affiliation(s)
- Henrik Nielsen
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark.
| | - Felix Teufel
- Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Digital Science & Innovation, Novo Nordisk A/S, Malov, Denmark
| | - Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Gunnar von Heijne
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Science for Life Laboratory, Stockholm University, Solna, Sweden
| |
Collapse
|
6
|
Klie A, Laub D, Talwar JV, Stites H, Jores T, Solvason JJ, Farley EK, Carter H. Predictive analyses of regulatory sequences with EUGENe. NATURE COMPUTATIONAL SCIENCE 2023; 3:946-956. [PMID: 38177592 PMCID: PMC10768637 DOI: 10.1038/s43588-023-00544-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/12/2023] [Accepted: 09/27/2023] [Indexed: 01/06/2024]
Abstract
Deep learning has become a popular tool to study cis-regulatory function. Yet efforts to design software for deep-learning analyses in regulatory genomics that are findable, accessible, interoperable and reusable (FAIR) have fallen short of fully meeting these criteria. Here we present elucidating the utility of genomic elements with neural nets (EUGENe), a FAIR toolkit for the analysis of genomic sequences with deep learning. EUGENe consists of a set of modules and subpackages for executing the key functionality of a genomics deep learning workflow: (1) extracting, transforming and loading sequence data from many common file formats; (2) instantiating, initializing and training diverse model architectures; and (3) evaluating and interpreting model behavior. We designed EUGENe as a simple, flexible and extensible interface for streamlining and customizing end-to-end deep-learning sequence analyses, and illustrate these principles through application of the toolkit to three predictive modeling tasks. We hope that EUGENe represents a springboard towards a collaborative ecosystem for deep-learning applications in genomics research.
Collapse
Affiliation(s)
- Adam Klie
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
| | - David Laub
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
| | - James V Talwar
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
| | | | - Tobias Jores
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Joe J Solvason
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- Department of Molecular Biology, University of California San Diego, La Jolla, CA, USA
| | - Emma K Farley
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- Department of Molecular Biology, University of California San Diego, La Jolla, CA, USA
| | - Hannah Carter
- Department of Medicine, University of California San Diego, La Jolla, CA, USA.
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|