1
Abdul-Khalek N, Wimmer R, Overgaard MT, Gregersen Echers S. Insight on physicochemical properties governing peptide MS1 response in HPLC-ESI-MS/MS: A deep learning approach. Comput Struct Biotechnol J 2023; 21:3715-3727. [PMID: 37560124 PMCID: PMC10407266 DOI: 10.1016/j.csbj.2023.07.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 08/11/2023]
Abstract
Accurate and absolute quantification of peptides in complex mixtures using quantitative mass spectrometry (MS)-based methods requires foreground knowledge and isotopically labeled standards, thereby increasing analytical expenses, time consumption, and labor, thus limiting the number of peptides that can be accurately quantified. This originates from differential ionization efficiency between peptides and thus, understanding the physicochemical properties that influence the ionization and response in MS analysis is essential for developing less restrictive label-free quantitative methods. Here, we used equimolar peptide pool repository data to develop a deep learning model capable of identifying amino acids influencing the MS1 response. By using an encoder-decoder with an attention mechanism and correlating attention weights with amino acid physicochemical properties, we obtain insight on properties governing the peptide-level MS1 response within the datasets. While the problem cannot be described by one single set of amino acids and properties, distinct patterns were reproducibly obtained. Properties are grouped in three main categories related to peptide hydrophobicity, charge, and structural propensities. Moreover, our model can predict MS1 intensity output under defined conditions based solely on peptide sequence input. Using a refined training dataset, the model predicted log-transformed peptide MS1 intensities with an average error of 9.7 ± 0.5% based on 5-fold cross validation, and outperformed random forest and ridge regression models on both log-transformed and real scale data. This work demonstrates how deep learning can facilitate identification of physicochemical properties influencing peptide MS1 responses, but also illustrates how sequence-based response prediction and label-free peptide-level quantification may impact future workflows within quantitative proteomics.
Affiliation(s)
- Naim Abdul-Khalek
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
- Reinhard Wimmer
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
2
Neely BA, Dorfer V, Martens L, Bludau I, Bouwmeester R, Degroeve S, Deutsch EW, Gessulat S, Käll L, Palczynski P, Payne SH, Rehfeldt TG, Schmidt T, Schwämmle V, Uszkoreit J, Vizcaíno JA, Wilhelm M, Palmblad M. Toward an Integrated Machine Learning Model of a Proteomics Experiment. J Proteome Res 2023; 22:681-696. [PMID: 36744821 PMCID: PMC9990124 DOI: 10.1021/acs.jproteome.2c00711] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 11/01/2022] [Indexed: 02/07/2023]
Abstract
In recent years machine learning has made extensive progress in modeling many aspects of mass spectrometry data. We brought together proteomics data generators, repository managers, and machine learning experts in a workshop with the goals to evaluate and explore machine learning applications for realistic modeling of data from multidimensional mass spectrometry-based proteomics analysis of any sample or organism. Following this sample-to-data roadmap helped identify knowledge gaps and define needs. Being able to generate bespoke and realistic synthetic data has legitimate and important uses in system suitability, method development, and algorithm benchmarking, while also posing critical ethical questions. The interdisciplinary nature of the workshop informed discussions of what is currently possible and future opportunities and challenges. In the following perspective we summarize these discussions in the hope of conveying our excitement about the potential of machine learning in proteomics and to inspire future research.
Affiliation(s)
- Benjamin A. Neely
- National Institute of Standards and Technology, Charleston, South Carolina 29412, United States
- Viktoria Dorfer
- Bioinformatics Research Group, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria
- Lennart Martens
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
- Department of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
- Isabell Bludau
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, 82152 Martinsried, Germany
- Robbin Bouwmeester
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
- Department of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
- Sven Degroeve
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
- Department of Biomolecular Medicine, Faculty of Health Sciences and Medicine, Ghent University, 9000 Ghent, Belgium
- Eric W. Deutsch
- Institute for Systems Biology, Seattle, Washington 98109, United States
- Lukas Käll
- Science for Life Laboratory, KTH - Royal Institute of Technology, 171 21 Solna, Sweden
- Pawel Palczynski
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Samuel H. Payne
- Department of Biology, Brigham Young University, Provo, Utah 84602, United States
- Tobias Greisager Rehfeldt
- Institute for Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
- Veit Schwämmle
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Julian Uszkoreit
- Medical Proteome Analysis, Center for Protein Diagnostics (ProDi), Ruhr University Bochum, 44801 Bochum, Germany
- Medizinisches Proteom-Center, Medical Faculty, Ruhr University Bochum, 44801 Bochum, Germany
- Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Mathias Wilhelm
- Computational Mass Spectrometry, Technical University of Munich (TUM), 85354 Freising, Germany
- Magnus Palmblad
- Leiden University Medical Center, Postbus 9600, 2300 RC Leiden, The Netherlands
3
Rehfeldt T, Gabriels R, Bouwmeester R, Gessulat S, Neely BA, Palmblad M, Perez-Riverol Y, Schmidt T, Vizcaíno JA, Deutsch EW. ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. J Proteome Res 2023; 22:632-636. [PMID: 36693629 PMCID: PMC9903315 DOI: 10.1021/acs.jproteome.2c00629] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/04/2022] [Indexed: 01/26/2023]
Abstract
Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based liquid chromatography (LC) coupled to mass spectrometry (MS) data sets, due to the high levels of data reduction that occur between raw data and machine learning-ready data. Since predictive proteomics is an emerging field, when predicting peptide behavior in LC-MS setups, each lab often uses unique and complex data processing pipelines in order to maximize performance, at the cost of accessibility and reproducibility. For this reason we introduce ProteomicsML, an online resource for proteomics-based data sets and tutorials across most of the currently explored physicochemical peptide properties. This community-driven resource makes it simple to access data in easy-to-process formats, and contains easy-to-follow tutorials that allow new users to interact with even the most advanced algorithms in the field. ProteomicsML provides data sets that are useful for comparing state-of-the-art machine learning algorithms, as well as providing introductory material for teachers and newcomers to the field alike. The platform is freely available at https://www.proteomicsml.org/, and we welcome the entire proteomics community to contribute to the project at https://github.com/ProteomicsML/ProteomicsML.
Affiliation(s)
- Tobias G. Rehfeldt
- Institute for Mathematics and Computer Science, University of Southern Denmark, 5000 Odense, Denmark
- Ralf Gabriels
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
- Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
- Robbin Bouwmeester
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
- Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
- Benjamin A. Neely
- National Institute of Standards and Technology, Charleston, South Carolina 29412, United States
- Magnus Palmblad
- Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Eric W. Deutsch
- Institute for Systems Biology, Seattle, Washington 98109, United States
4
Bilbao A, Ross DH, Lee JY, Donor MT, Williams SM, Zhu Y, Ibrahim YM, Smith RD, Zheng X. MZA: A Data Conversion Tool to Facilitate Software Development and Artificial Intelligence Research in Multidimensional Mass Spectrometry. J Proteome Res 2023; 22:508-513. [PMID: 36414245 PMCID: PMC9898216 DOI: 10.1021/acs.jproteome.2c00313] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 11/24/2022]
Abstract
Modern mass spectrometry-based workflows employing hybrid instrumentation and orthogonal separations collect multidimensional data, potentially allowing deeper understanding in omics studies through adoption of artificial intelligence methods. However, the large volume of these rich spectra challenges existing data storage and access technologies, therefore precluding informatics advancements. We present MZA (pronounced m-za), the mass-to-charge (m/z) generic data storage and access tool designed to facilitate software development and artificial intelligence research in multidimensional mass spectrometry measurements. Composed of a data conversion tool and a simple file structure based on the HDF5 format, MZA provides easy, cross-platform and cross-programming language access to raw MS-data, enabling fast development of new tools in data science programming languages such as Python and R. The software executable, example MS-data and example Python and R scripts are freely available at https://github.com/PNNL-m-q/mza.
Affiliation(s)
- Aivett Bilbao
- Pacific Northwest National Laboratory, Richland, WA, 99352, USA. Corresponding authors: Aivett Bilbao – Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA, 99352, United States; Xueyun Zheng – Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, United States
- Dylan H. Ross
- Pacific Northwest National Laboratory, Richland, WA, 99352, USA
- Joon-Yong Lee
- Pacific Northwest National Laboratory, Richland, WA, 99352, USA
- Micah T. Donor
- Pacific Northwest National Laboratory, Richland, WA, 99352, USA
- Ying Zhu
- Pacific Northwest National Laboratory, Richland, WA, 99352, USA
- Xueyun Zheng
- Pacific Northwest National Laboratory, Richland, WA, 99352, USA
5
Palmblad M, Böcker S, Degroeve S, Kohlbacher O, Käll L, Noble WS, Wilhelm M. Interpretation of the DOME Recommendations for Machine Learning in Proteomics and Metabolomics. J Proteome Res 2022; 21:1204-1207. [PMID: 35119864 PMCID: PMC8981311 DOI: 10.1021/acs.jproteome.1c00900] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
Machine learning is increasingly applied in proteomics and metabolomics to predict molecular structure, function, and physicochemical properties, including behavior in chromatography, ion mobility, and tandem mass spectrometry. These must be described in sufficient detail to apply or evaluate the performance of trained models. Here we look at and interpret the recently published and general DOME (Data, Optimization, Model, Evaluation) recommendations for conducting and reporting on machine learning in the specific context of proteomics and metabolomics.
Affiliation(s)
- Magnus Palmblad
- Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC, Leiden, The Netherlands
- Sebastian Böcker
- Faculty of Mathematics and Computer Science, Friedrich Schiller University, 07743 Jena, Germany
- Sven Degroeve
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium and Department of Biomolecular Medicine, Ghent University, 9052 Ghent, Belgium
- Oliver Kohlbacher
- Eberhard Karls Universität Tübingen, WSI/ZBIT, 72076 Tübingen, Germany
- Lukas Käll
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology (KTH), 171 21 Solna, Sweden
- William Stafford Noble
- Department of Genome Sciences and the Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195-5065, United States
- Mathias Wilhelm
- Computational Mass Spectrometry, Technical University of Munich (TUM), 85354 Freising, Germany