1
Xiao D, Lin M, Liu C, Geddes TA, Burchfield J, Parker B, Humphrey SJ, Yang P. SnapKin: a snapshot deep learning ensemble for kinase-substrate prediction from phosphoproteomics data. NAR Genom Bioinform 2023; 5:lqad099. [PMID: 37954574 PMCID: PMC10632189 DOI: 10.1093/nargab/lqad099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 09/18/2023] [Accepted: 10/25/2023] [Indexed: 11/14/2023]
Abstract
A major challenge in mass spectrometry-based phosphoproteomics lies in identifying the substrates of kinases, as currently only a small fraction of substrates identified can be confidently linked with a known kinase. Machine learning techniques are promising approaches for leveraging large-scale phosphoproteomics data to computationally predict substrates of kinases. However, the small number of experimentally validated kinase substrates (true positive) and the high data noise in many phosphoproteomics datasets together limit their applicability and utility. Here, we aim to develop advanced kinase-substrate prediction methods to address these challenges. Using a collection of seven large phosphoproteomics datasets, and both traditional and deep learning models, we first demonstrate that a 'pseudo-positive' learning strategy for alleviating small sample size is effective at improving model predictive performance. We next show that a data resampling-based ensemble learning strategy is useful for improving model stability while further enhancing prediction. Lastly, we introduce an ensemble deep learning model ('SnapKin') by incorporating the above two learning strategies into a 'snapshot' ensemble learning algorithm. We propose SnapKin, an ensemble deep learning method, for predicting substrates of kinases from large-scale phosphoproteomics data. We demonstrate that SnapKin consistently outperforms existing methods in kinase-substrate prediction. SnapKin is freely available at https://github.com/PYangLab/SnapKin.
Affiliation(s)
- Di Xiao
  - Computational Systems Biology Group, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Michael Lin
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Chunlei Liu
  - Computational Systems Biology Group, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
- Thomas A Geddes
  - Computational Systems Biology Group, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - School of Environmental and Life Sciences, The University of Sydney, Sydney, NSW 2006, Australia
- James G Burchfield
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - School of Environmental and Life Sciences, The University of Sydney, Sydney, NSW 2006, Australia
- Benjamin L Parker
  - Centre for Muscle Research, Department of Anatomy and Physiology, School of Biomedical Sciences, Melbourne, VIC 3010, Australia
- Sean J Humphrey
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
  - School of Environmental and Life Sciences, The University of Sydney, Sydney, NSW 2006, Australia
  - Murdoch Children’s Research Institute, The Royal Children’s Hospital, Melbourne, VIC, 3052, Australia
- Pengyi Yang
  - Computational Systems Biology Group, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
2
Xiao D, Chen C, Yang P. Computational systems approach towards phosphoproteomics and their downstream regulation. Proteomics 2023; 23:e2200068. [PMID: 35580145 DOI: 10.1002/pmic.202200068] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/17/2022] [Revised: 04/26/2022] [Accepted: 05/03/2022] [Indexed: 11/07/2022]
Abstract
Protein phosphorylation plays an essential role in modulating cell signalling and its downstream transcriptional and translational regulations. Until recently, protein phosphorylation has been studied mostly using low-throughput biochemical assays. The advancement of mass spectrometry (MS)-based phosphoproteomics transformed the field by enabling measurement of proteome-wide phosphorylation events, where tens of thousands of phosphosites are routinely identified and quantified in an experiment. This has brought a significant challenge in analysing large-scale phosphoproteomic data, making computational methods and systems approaches integral parts of phosphoproteomics. Previous works have primarily focused on reviewing the experimental techniques in MS-based phosphoproteomics, yet a systematic survey of the computational landscape in this field is still missing. Here, we review computational methods and tools, and systems approaches that have been developed for phosphoproteomics data analysis. We categorise them into four aspects including data processing, functional analysis, phosphoproteome annotation and their integration with other omics, and in each aspect, we discuss the key methods and example studies. Lastly, we highlight some of the potential research directions on which future work would make a significant contribution to this fast-growing field. We hope this review provides a useful snapshot of the field of computational systems phosphoproteomics and stimulates new research that drives future development.
Affiliation(s)
- Di Xiao
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
- Carissa Chen
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
- Pengyi Yang
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
  - School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia