1
Piersma SR, Valles-Marti A, Rolfs F, Pham TV, Henneman AA, Jiménez CR. Inferring kinase activity from phosphoproteomic data: Tool comparison and recent applications. MASS SPECTROMETRY REVIEWS 2024; 43:725-751. [PMID: 36156810 DOI: 10.1002/mas.21808] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 06/16/2023]
Abstract
Aberrant cellular signaling pathways are a hallmark of cancer and other diseases. One of the most important signaling mechanisms involves protein phosphorylation/dephosphorylation. Protein phosphorylation is catalyzed by protein kinases, and over 530 protein kinases have been identified in the human genome. Aberrant kinase activity is one of the drivers of tumorigenesis and cancer progression and results in altered phosphorylation abundance of downstream substrates. Upstream kinase activity can be inferred from the global collection of phosphorylated substrates. Mass spectrometry-based phosphoproteomic experiments nowadays routinely allow identification and quantitation of >10k phosphosites per biological sample. This substrate phosphorylation footprint can be used to infer upstream kinase activities using tools like Kinase Substrate Enrichment Analysis (KSEA), Posttranslational Modification Substrate Enrichment Analysis (PTM-SEA), and Integrative Inferred Kinase Activity Analysis (INKA). Since the topic of kinase activity inference is very active, with many new approaches reported in the past 3 years, we would like to give an overview of the field. In this review, an inventory of kinase activity inference tools, their underlying algorithms, statistical frameworks, kinase-substrate databases, and user-friendliness is presented. The most widely used tools are compared in depth. Subsequently, recent applications of the tools are described, focusing on clinical tissues and hematological samples. Two main application areas for kinase activity inference tools can be discerned. (1) Maximal biological insights can be obtained from large data sets with group comparisons using multiple complementary tools (e.g., PTM-SEA and KSEA or INKA). (2) In the oncology context, where personalized treatment requires analysis of single samples, INKA, for example, has emerged as a tool that can prioritize actionable kinases for targeted inhibition.
Affiliation(s)
- Sander R Piersma
  - OncoProteomics Laboratory Amsterdam UMC, Vrije Universiteit, Amsterdam, The Netherlands
- Andrea Valles-Marti
  - OncoProteomics Laboratory Amsterdam UMC, Vrije Universiteit, Amsterdam, The Netherlands
- Frank Rolfs
  - OncoProteomics Laboratory Amsterdam UMC, Vrije Universiteit, Amsterdam, The Netherlands
- Thang V Pham
  - OncoProteomics Laboratory Amsterdam UMC, Vrije Universiteit, Amsterdam, The Netherlands
- Alex A Henneman
  - OncoProteomics Laboratory Amsterdam UMC, Vrije Universiteit, Amsterdam, The Netherlands
- Connie R Jiménez
  - OncoProteomics Laboratory Amsterdam UMC, Vrije Universiteit, Amsterdam, The Netherlands

2
Perron N, Kirst M, Chen S. Bringing CAM photosynthesis to the table: Paving the way for resilient and productive agricultural systems in a changing climate. PLANT COMMUNICATIONS 2024; 5:100772. [PMID: 37990498 PMCID: PMC10943566 DOI: 10.1016/j.xplc.2023.100772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 07/27/2023] [Accepted: 11/20/2023] [Indexed: 11/23/2023]
Abstract
Modern agricultural systems are directly threatened by global climate change and the resulting freshwater crisis. A considerable challenge in the coming years will be to develop crops that can cope with the consequences of declining freshwater resources and changing temperatures. One approach to meeting this challenge may lie in our understanding of plant photosynthetic adaptations and water use efficiency. Plants from various taxa have evolved crassulacean acid metabolism (CAM), a water-conserving adaptation of photosynthetic carbon dioxide fixation that enables plants to thrive under semi-arid or seasonally drought-prone conditions. Although past research on CAM has led to a better understanding of the inner workings of plant resilience and adaptation to stress, successful introduction of this pathway into C3 or C4 plants has not been reported. The recent revolution in molecular, systems, and synthetic biology, as well as innovations in high-throughput data generation and mining, creates new opportunities to uncover the minimum genetic tool kit required to introduce CAM traits into drought-sensitive crops. Here, we propose four complementary research avenues to uncover this tool kit. First, genomes and computational methods should be used to improve understanding of the nature of variations that drive CAM evolution. Second, single-cell 'omics technologies offer the possibility for in-depth characterization of the mechanisms that trigger environmentally controlled CAM induction. Third, the rapid increase in new 'omics data enables a comprehensive, multimodal exploration of CAM. Finally, the expansion of functional genomics methods is paving the way for integration of CAM into farming systems.
Affiliation(s)
- Noé Perron
  - Plant Molecular and Cellular Biology Program, University of Florida, Gainesville, FL 32608, USA
- Matias Kirst
  - Plant Molecular and Cellular Biology Program, University of Florida, Gainesville, FL 32608, USA; School of Forest, Fisheries and Geomatics Sciences, University of Florida, Gainesville, FL 32603, USA
- Sixue Chen
  - Department of Biology, University of Mississippi, Oxford, MS 38677-1848, USA

3
Grunfeld N, Levine E, Libby E. Experimental measurement and computational prediction of bacterial Hanks-type Ser/Thr signaling system regulatory targets. Mol Microbiol 2024:10.1111/mmi.15220. [PMID: 38167835 PMCID: PMC11219531 DOI: 10.1111/mmi.15220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Revised: 12/15/2023] [Accepted: 12/17/2023] [Indexed: 01/05/2024]
Abstract
Bacteria possess diverse classes of signaling systems that they use to sense and respond to their environments and execute properly timed developmental transitions. One widespread and evolutionarily ancient class of signaling systems are the Hanks-type Ser/Thr kinases, also sometimes termed "eukaryotic-like" due to their homology with eukaryotic kinases. In diverse bacterial species, these signaling systems function as critical regulators of general cellular processes such as metabolism, growth and division, developmental transitions such as sporulation, biofilm formation, and virulence, as well as antibiotic tolerance. This multifaceted regulation is due to the ability of a single Hanks-type Ser/Thr kinase to post-translationally modify the activity of multiple proteins, resulting in the coordinated regulation of diverse cellular pathways. However, in part due to their deep integration with cellular physiology, to date, we have a relatively limited understanding of the timing, regulatory hierarchy, the complete list of targets of a given kinase, as well as the potential regulatory overlap between the often multiple kinases present in a single organism. In this review, we discuss experimental methods and curated datasets aimed at elucidating the targets of these signaling pathways and approaches for using these datasets to develop computational models for quantitative predictions of target motifs. We emphasize novel approaches and opportunities for collecting data suitable for the creation of new predictive computational models applicable to diverse species.
Affiliation(s)
- Noam Grunfeld
  - Department of Bioengineering, Northeastern University, Boston MA USA
- Erel Levine
  - Department of Bioengineering, Northeastern University, Boston MA USA
  - Department of Chemical Engineering, Northeastern University, Boston MA USA
- Elizabeth Libby
  - Department of Bioengineering, Northeastern University, Boston MA USA

4
Varshney N, Mishra AK. Deep Learning in Phosphoproteomics: Methods and Application in Cancer Drug Discovery. Proteomes 2023; 11:proteomes11020016. [PMID: 37218921 DOI: 10.3390/proteomes11020016] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/13/2023] [Revised: 04/24/2023] [Accepted: 04/25/2023] [Indexed: 05/24/2023]
Abstract
Protein phosphorylation is a key post-translational modification (PTM) that is a central regulatory mechanism of many cellular signaling pathways. Several protein kinases and phosphatases precisely control this biochemical process. Defects in the functions of these proteins have been implicated in many diseases, including cancer. Mass spectrometry (MS)-based analysis of biological samples provides in-depth coverage of the phosphoproteome. A large amount of MS data available in public repositories has unveiled big data in the field of phosphoproteomics. To address the challenges associated with handling large data and expanding confidence in phosphorylation site prediction, the development of many computational algorithms and machine learning-based approaches has gained momentum in recent years. Together, the emergence of experimental methods with high resolution and sensitivity and data mining algorithms has provided robust analytical platforms for quantitative proteomics. In this review, we compile a comprehensive collection of bioinformatic resources used for the prediction of phosphorylation sites, and their potential therapeutic applications in the context of cancer.
Affiliation(s)
- Neha Varshney
  - Division of Biological Sciences, Department of Cellular and Molecular Medicine, University of California, San Diego, CA 93093, USA
  - Ludwig Institute for Cancer Research, La Jolla, CA 92093, USA
- Abhinava K Mishra
  - Molecular, Cellular and Developmental Biology Department, University of California, Santa Barbara, CA 93106, USA

5
Xiao D, Chen C, Yang P. Computational systems approach towards phosphoproteomics and their downstream regulation. Proteomics 2023; 23:e2200068. [PMID: 35580145 DOI: 10.1002/pmic.202200068] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 02/17/2022] [Revised: 04/26/2022] [Accepted: 05/03/2022] [Indexed: 11/07/2022]
Abstract
Protein phosphorylation plays an essential role in modulating cell signalling and its downstream transcriptional and translational regulations. Until recently, protein phosphorylation has been studied mostly using low-throughput biochemical assays. The advancement of mass spectrometry (MS)-based phosphoproteomics transformed the field by enabling measurement of proteome-wide phosphorylation events, where tens of thousands of phosphosites are routinely identified and quantified in an experiment. This has brought a significant challenge in analysing large-scale phosphoproteomic data, making computational methods and systems approaches integral parts of phosphoproteomics. Previous works have primarily focused on reviewing the experimental techniques in MS-based phosphoproteomics, yet a systematic survey of the computational landscape in this field is still missing. Here, we review computational methods and tools, and systems approaches that have been developed for phosphoproteomics data analysis. We categorise them into four aspects including data processing, functional analysis, phosphoproteome annotation and their integration with other omics, and in each aspect, we discuss the key methods and example studies. Lastly, we highlight some of the potential research directions on which future work would make a significant contribution to this fast-growing field. We hope this review provides a useful snapshot of the field of computational systems phosphoproteomics and stimulates new research that drives future development.
Affiliation(s)
- Di Xiao
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia; Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
- Carissa Chen
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia; Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
- Pengyi Yang
  - Computational Systems Biology Group, Children's Medical Research Institute, The University of Sydney, Westmead, New South Wales, Australia; Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia; School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia

6
Ayati M, Yilmaz S, Blasco Tavares Pereira Lopes F, Chance M, Koyuturk M. Prediction of Kinase-Substrate Associations Using The Functional Landscape of Kinases and Phosphorylation Sites. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2023; 28:73-84. [PMID: 36540966 PMCID: PMC9782723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/25/2022]
Abstract
Protein phosphorylation is a key post-translational modification that plays a central role in many cellular processes. With recent advances in biotechnology, thousands of phosphorylated sites can be identified and quantified in a given sample, enabling proteome-wide screening of cellular signaling. However, for most (> 90%) of the phosphorylation sites that are identified in these experiments, the kinase(s) that target these sites are unknown. To broadly utilize available structural, functional, evolutionary, and contextual information in predicting kinase-substrate associations (KSAs), we develop a network-based machine learning framework. Our framework integrates a multitude of data sources to characterize the landscape of functional relationships and associations among phosphosites and kinases. To construct a phosphosite-phosphosite association network, we use sequence similarity, shared biological pathways, co-evolution, co-occurrence, and co-phosphorylation of phosphosites across different biological states. To construct a kinase-kinase association network, we integrate protein-protein interactions, shared biological pathways, and membership in common kinase families. We use node embeddings computed from these heterogeneous networks to train machine learning models for predicting kinase-substrate associations. Our systematic computational experiments using the PhosphoSitePlus database show that the resulting algorithm, NetKSA, outperforms two state-of-the-art algorithms, including KinomeXplorer and LinkPhinder, in overall KSA prediction. By stratifying the ranking of kinases, NetKSA also enables annotation of phosphosites that are targeted by relatively less-studied kinases. Availability: The code and data are available at compbio.case.edu/NetKSA/.
Affiliation(s)
- Marzieh Ayati
  - Department of Computer Science, University of Texas Rio Grande Valley, Edinburg, TX, USA

7
Mini-review: Recent advances in post-translational modification site prediction based on deep learning. Comput Struct Biotechnol J 2022; 20:3522-3532. [PMID: 35860402 PMCID: PMC9284371 DOI: 10.1016/j.csbj.2022.06.045] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/11/2022] [Revised: 06/21/2022] [Accepted: 06/21/2022] [Indexed: 11/23/2022]
Abstract
Post-translational modifications (PTMs) are closely linked to numerous diseases, playing a significant role in regulating protein structures, activities, and functions. Therefore, the identification of PTMs is crucial for understanding the mechanisms of cell biology and disease therapy. Compared to traditional machine learning methods, the deep learning approaches for PTM prediction provide accurate and rapid screening, guiding the downstream wet experiments to leverage the screened information for focused studies. In this paper, we reviewed the recent works in deep learning to identify phosphorylation, acetylation, ubiquitination, and other PTM types. In addition, we summarized PTM databases and discussed future directions with critical insights.
Key Words
- AAindex, Amino acid index
- ATP, Adenosine triphosphate
- AUC, Area under curve
- Ac, Acetylation
- BE, Binary encoding
- BLOSUM, Blocks substitution matrix
- Bi-LSTM, Bidirectional LSTM
- CKSAAP, Composition of k-spaced amino acid Pairs
- CNN, Convolutional neural network
- CNNOH, CNN with the one-hot encoding
- CNNWE, CNN with the word-embedding encoding
- CNNrgb, CNN red green blue
- CV, Cross-validation
- DC-CNN, Densely connected convolutional neural network
- DL, Deep learning
- DNNs, Deep neural networks
- Deep learning
- E. coli, Escherichia coli
- EBGW, Encoding based on grouped weight
- EGAAC, Enhanced grouped amino acids content
- IG, Information gain
- K, Lysine
- KNN, k nearest neighbor
- LASSO, Least absolute shrinkage and selection operator
- LSTM, Long short-term memory
- LSTMWE, LSTM with the word-embedding encoding
- M.musculus, Mus musculus
- MDC, Modular densely connected convolutional networks
- MDCAN, Multilane dense convolutional attention network
- ML, Machine learning
- MLP, Multilayer perceptron
- MMI, Multivariate mutual information
- Machine learning
- Mass spectrometry
- NMBroto, Normalized Moreau-Broto autocorrelation
- P, Proline
- PSP, PhosphoSitePlus
- PSSM, Position-specific scoring matrix
- PTM, Post-translational modifications
- Ph, Phosphorylation
- Post-translational modification
- Prediction
- PseAAC, Pseudo-amino acid composition
- R, Arginine
- RF, Random forest
- RNN, Recurrent neural network
- ROC, Receiver operating characteristic
- S, Serine
- S. typhimurium, Salmonella typhimurium
- S.cerevisiae, Saccharomyces cerevisiae
- SE, Squeeze and excitation
- SEV, Split to Equal Validation
- ST, Source and target
- SUMO, Small ubiquitin-like modifier
- SVM, Support vector machines
- T, Threonine
- Ub, Ubiquitination
- Y, Tyrosine
- ZSL, Zero-shot learning

8
Ma R, Li S, Li W, Yao L, Huang HD, Lee TY. KinasePhos 3.0: Redesign and Expansion of the Prediction on Kinase-specific Phosphorylation Sites. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022:S1672-0229(22)00081-X. [PMID: 35781048 PMCID: PMC10373160 DOI: 10.1016/j.gpb.2022.06.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/21/2021] [Revised: 05/30/2022] [Accepted: 06/27/2022] [Indexed: 06/04/2023]
Abstract
The purpose of this work is to enhance KinasePhos, a machine learning-based kinase-specific phosphorylation site prediction tool. Experimentally verified kinase-specific phosphorylation data were collected from PhosphoSitePlus, UniProtKB, the Group-based Prediction System 5.0, and Phospho.ELM. In total, 41,421 experimentally verified kinase-specific phosphorylation sites were identified. A total of 1380 unique kinases were identified, including 753 with existing classification information from KinBase and the remaining 627 annotated by building a phylogenetic tree. Based on this kinase classification, a total of 771 predictive models were built at the individual, family, and group levels, using at least 15 experimentally verified substrate sites in positive training datasets. The improved models demonstrated their effectiveness compared with other prediction tools. For example, the prediction of sites phosphorylated by the protein kinase B, casein kinase 2, and protein kinase A families had accuracies of 94.5%, 92.5%, and 90.0%, respectively. The average prediction accuracy for all 771 models was 87.2%. For enhancing interpretability, the SHapley Additive exPlanations (SHAP) method was employed to assess feature importance. The web interface of KinasePhos 3.0 has been redesigned to provide comprehensive annotations of kinase-specific phosphorylation sites on multiple proteins. Additionally, considering the large scale of phosphoproteomic data, a downloadable prediction tool is available at https://awi.cuhk.edu.cn/KinasePhos/download.html or https://github.com/tom-209/KinasePhos-3.0-executable-file.
Affiliation(s)
- Renfei Ma
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China; School of Life Sciences, University of Science and Technology of China, Hefei 230027, China
- Shangfu Li
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- Wenshuo Li
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China
- Lantian Yao
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China
- Hsien-Da Huang
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China; School of Life and Health Sciences, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- Tzong-Yi Lee
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China; School of Life and Health Sciences, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China

9
Hu Y, Chapman A, Wen G, Hall DW. What Can Knowledge Bring to Machine Learning?—A Survey of Low-shot Learning for Structured Data. ACM T INTEL SYST TEC 2022. [DOI: 10.1145/3510030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Abstract
Supervised machine learning has several drawbacks that make it difficult to use in many situations. Drawbacks include heavy reliance on massive training data, limited generalizability, and poor expressiveness of high-level semantics. Low-shot Learning attempts to address these drawbacks. Low-shot learning allows the model to obtain good predictive power with very little or no training data, where structured knowledge plays a key role as a high-level semantic representation of human. This article will review the fundamental factors of low-shot learning technologies, with a focus on the operation of structured knowledge under different low-shot conditions. We also introduce other techniques relevant to low-shot learning. Finally, we point out the limitations of low-shot learning, the prospects and gaps of industrial applications, and future research directions.
Affiliation(s)
- Yang Hu
  - University of Southampton, United Kingdom and South China University of Technology, Guangzhou, Guangdong, China
- Adriane Chapman
  - University of Southampton, Southampton, Hampshire, United Kingdom
- Guihua Wen
  - South China University of Technology, Guangzhou, Guangdong, China
- Dame Wendy Hall
  - University of Southampton, Southampton, Hampshire, United Kingdom

10
Naga D, Muster W, Musvasva E, Ecker GF. Off-targetP ML: an open source machine learning framework for off-target panel safety assessment of small molecules. J Cheminform 2022; 14:27. [PMID: 35525988 PMCID: PMC9077900 DOI: 10.1186/s13321-022-00603-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/29/2021] [Accepted: 03/26/2022] [Indexed: 11/10/2022]
Abstract
Unpredicted drug safety issues constitute the majority of failures in the pharmaceutical industry according to several studies. Some of these preclinical safety issues could be attributed to the non-selective binding of compounds to targets other than their intended therapeutic target, causing undesired adverse events. Consequently, pharmaceutical companies routinely run in-vitro safety screens to detect off-target activities prior to preclinical and clinical studies. Here we present an open source machine learning framework aiming at the prediction of activities in our in-house panel of 50 off-targets for ~4000 compounds, directly from their structure. This framework is intended to guide chemists in the drug design process prior to synthesis and to accelerate drug discovery. We also present a set of ML approaches that require minimum programming experience for deployment. The workflow incorporates different ML approaches such as deep learning and automated machine learning. It also accommodates common issues faced in bioactivity prediction, such as data imbalance, inter-target duplicated measurements and duplicated public compound identifiers. Throughout the workflow development, we explore and compare the capability of Neural Networks and AutoML in constructing prediction models for fifty off-targets of different protein classes, different dataset sizes, and high-class imbalance. Outcomes from different methods are compared in terms of efficiency and efficacy. The most important challenges and factors impacting model construction and performance, in addition to suggestions on how to overcome such challenges, are also discussed.
Affiliation(s)
- Doha Naga
  - Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland; Department of Pharmaceutical Sciences, University of Vienna, Vienna, Austria
- Wolfgang Muster
  - Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
- Eunice Musvasva
  - Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
- Gerhard F Ecker
  - Department of Pharmaceutical Sciences, University of Vienna, Vienna, Austria

11
Urban J. A review on recent trends in the phosphoproteomics workflow. From sample preparation to data analysis. Anal Chim Acta 2022; 1199:338857. [DOI: 10.1016/j.aca.2021.338857] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 03/07/2021] [Revised: 07/14/2021] [Accepted: 07/15/2021] [Indexed: 12/12/2022]

12
Yakimovich A, Beaugnon A, Huang Y, Ozkirimli E. Labels in a haystack: Approaches beyond supervised learning in biomedical applications. PATTERNS (NEW YORK, N.Y.) 2021; 2:100383. [PMID: 34950904 PMCID: PMC8672145 DOI: 10.1016/j.patter.2021.100383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Recent advances in biomedical machine learning demonstrate great potential for data-driven techniques in health care and biomedical research. However, this potential has thus far been hampered by both the scarcity of annotated data in the biomedical domain and the diversity of the domain's subfields. While unsupervised learning is capable of finding unknown patterns in the data by design, supervised learning requires human annotation to achieve the desired performance through training. With the latter performing vastly better than the former, the need for annotated datasets is high, but they are costly and laborious to obtain. This review explores a family of approaches existing between the supervised and the unsupervised problem setting. The goal of these algorithms is to make more efficient use of the available labeled data. The advantages and limitations of each approach are addressed and perspectives are provided.
Affiliation(s)
- Artur Yakimovich
  - Roche Pharma International Informatics, Roche Products Limited, Welwyn Garden City, UK
- Anaël Beaugnon
  - Roche Pharma International Informatics, Roche, Boulogne-Billancourt, France
- Yi Huang
  - Roche Pharma International Informatics, Roche (China) Holding Ltd., Shanghai, China
- Elif Ozkirimli
  - Roche Pharma International Informatics, F. Hoffmann-La Roche AG, Kaiseraugst, Switzerland

13
Petrovsky DV, Kopylov AT, Rudnev VR, Stepanov AA, Kulikova LI, Malsagova KA, Kaysheva AL. Managing of Unassigned Mass Spectrometric Data by Neural Network for Cancer Phenotypes Classification. J Pers Med 2021; 11:1288. [PMID: 34945760 PMCID: PMC8707435 DOI: 10.3390/jpm11121288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2021] [Revised: 11/24/2021] [Accepted: 12/01/2021] [Indexed: 11/17/2022]
Abstract
Mass spectrometric profiling provides information on the protein and metabolic composition of biological samples. However, the weak efficiency of computational algorithms in correlating tandem spectra to molecular components (proteins and metabolites) dramatically limits the use of "omics" profiling for the classification of nosologies. The development of machine learning methods for the intelligent analysis of raw mass spectrometric (HPLC-MS/MS) measurements without involving the stages of preprocessing and data identification seems promising. In our study, we tested the application of neural networks of two types, a 1D residual convolutional neural network (CNN) and a 3D CNN, for the classification of three cancers by analyzing metabolomic-proteomic HPLC-MS/MS data. In this work, we showed that both neural networks could classify the phenotypes of gender-mixed oncology, kidney cancer, gender-specific oncology, ovarian cancer, and the phenotype of a healthy person by analyzing 'omics' data in 'mgf' data format. The created models effectively recognized oncopathologies with a model accuracy of 0.95. Information was obtained on the remoteness of the studied phenotypes. The closest in the experiment were ovarian cancer, kidney cancer, and prostate cancer/kidney cancer. In contrast, the healthy phenotype was the most distant from cancer phenotypes and ovarian and prostate cancers. The neural network makes it possible to not only classify the studied phenotypes, but also to determine their similarity (distance matrix), thus overcoming algorithmic barriers in identifying HPLC-MS/MS spectra. Neural networks are versatile and can be applied to standard experimental data formats obtained using different analytical platforms.
Collapse
Affiliation(s)
- Denis V. Petrovsky
- Biobanking Group, Branch of Institute of Biomedical Chemistry “Scientific and Education Center”, 109028 Moscow, Russia; (D.V.P.); (A.T.K.); (V.R.R.); (A.A.S.); (L.I.K.); (A.L.K.)
| | - Arthur T. Kopylov
- Biobanking Group, Branch of Institute of Biomedical Chemistry “Scientific and Education Center”, 109028 Moscow, Russia; (D.V.P.); (A.T.K.); (V.R.R.); (A.A.S.); (L.I.K.); (A.L.K.)
| | - Vladimir R. Rudnev
- Biobanking Group, Branch of Institute of Biomedical Chemistry “Scientific and Education Center”, 109028 Moscow, Russia; (D.V.P.); (A.T.K.); (V.R.R.); (A.A.S.); (L.I.K.); (A.L.K.)
- Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142290 Moscow, Russia
- Alexander A. Stepanov
- Biobanking Group, Branch of Institute of Biomedical Chemistry “Scientific and Education Center”, 109028 Moscow, Russia; (D.V.P.); (A.T.K.); (V.R.R.); (A.A.S.); (L.I.K.); (A.L.K.)
- Liudmila I. Kulikova
- Biobanking Group, Branch of Institute of Biomedical Chemistry “Scientific and Education Center”, 109028 Moscow, Russia; (D.V.P.); (A.T.K.); (V.R.R.); (A.A.S.); (L.I.K.); (A.L.K.)
- Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142290 Moscow, Russia
- Kristina A. Malsagova
- Biobanking Group, Branch of Institute of Biomedical Chemistry “Scientific and Education Center”, 109028 Moscow, Russia; (D.V.P.); (A.T.K.); (V.R.R.); (A.A.S.); (L.I.K.); (A.L.K.)
- Anna L. Kaysheva
- Biobanking Group, Branch of Institute of Biomedical Chemistry “Scientific and Education Center”, 109028 Moscow, Russia; (D.V.P.); (A.T.K.); (V.R.R.); (A.A.S.); (L.I.K.); (A.L.K.)
|
14
|
Yang H, Wang M, Liu X, Zhao XM, Li A. PhosIDN: an integrated deep neural network for improving protein phosphorylation site prediction by combining sequence and protein-protein interaction information. Bioinformatics 2021; 37:4668-4676. [PMID: 34320631 PMCID: PMC8665744 DOI: 10.1093/bioinformatics/btab551] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 11/23/2020] [Revised: 06/22/2021] [Accepted: 07/27/2021] [Indexed: 11/29/2022] Open
Abstract
Motivation: Phosphorylation is one of the most studied post-translational modifications, which plays a pivotal role in various cellular processes. Recently, deep learning methods have achieved great success in prediction of phosphorylation sites, but most of them are based on convolutional neural networks that may not capture enough information about long-range dependencies between residues in a protein sequence. In addition, existing deep learning methods only make use of sequence information for predicting phosphorylation sites, and it is highly desirable to develop a deep learning architecture that can combine heterogeneous sequence and protein–protein interaction (PPI) information for more accurate phosphorylation site prediction.
Results: We present a novel integrated deep neural network named PhosIDN, for phosphorylation site prediction by extracting and combining sequence and PPI information. In PhosIDN, a sequence feature encoding sub-network is proposed to capture not only local patterns but also long-range dependencies from protein sequences. Meanwhile, useful PPI features are also extracted in PhosIDN by a PPI feature encoding sub-network adopting a multi-layer deep neural network. Moreover, to effectively combine sequence and PPI information, a heterogeneous feature combination sub-network is introduced to fully exploit the complex associations between sequence and PPI features, and their combined features are used for final prediction. Comprehensive experiment results demonstrate that the proposed PhosIDN significantly improves the prediction performance of phosphorylation sites and compares favorably with existing general and kinase-specific phosphorylation site prediction methods.
Availability and implementation: PhosIDN is freely available at https://github.com/ustchangyuanyang/PhosIDN.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hangyuan Yang
- School of Information Science and Technology, University of Science and Technology of China, Hefei, AH 230027, China
- Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei, AH 230027, China
- Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, AH 230027, China
- Xia Liu
- School of Information Science and Technology, University of Science and Technology of China, Hefei, AH 230027, China
- Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence and Frontiers Center for Brain Science, China
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei, AH 230027, China
- Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, AH 230027, China
|
15
|
Song B, Li Z, Lin X, Wang J, Wang T, Fu X. Pretraining model for biological sequence data. Brief Funct Genomics 2021; 20:181-195. [PMID: 34050350 PMCID: PMC8194843 DOI: 10.1093/bfgp/elab025] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 03/23/2021] [Revised: 04/13/2021] [Accepted: 04/21/2021] [Indexed: 12/26/2022] Open
Abstract
With the development of high-throughput sequencing technology, biological sequence data reflecting life information have become increasingly accessible. Particularly against the background of the COVID-19 pandemic, biological sequence data play an important role in detecting diseases, analyzing mechanisms and discovering specific drugs. In recent years, pretraining models that emerged in natural language processing have attracted widespread attention in many research fields, not only because they decrease training cost but also because they improve performance on downstream tasks. Pretraining models are used to embed biological sequences and to extract features from large biological sequence corpora in order to comprehensively understand biological sequence data. In this survey, we provide a broad review of pretraining models for biological sequence data. We first introduce biological sequences and the corresponding datasets, including a brief description and an accessible link for each. Subsequently, we systematically summarize popular pretraining models for biological sequences in four categories: CNN, word2vec, LSTM and Transformer. Then, we present some applications of the proposed pretraining models to downstream tasks to explain the role of pretraining models. Next, we provide a novel pretraining scheme for protein sequences and a multitask benchmark for protein pretraining models. Finally, we discuss the challenges and future directions in pretraining models for biological sequences.
Affiliation(s)
- Xiangzheng Fu
- Corresponding author: Xiangzheng Fu, College of Information Science and Engineering, Hunan University, Changsha, Hunan, China. Tel: 86-0731-88821907; E-mail:
|
16
|
Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J 2021; 19:3198-3208. [PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 02/26/2021] [Revised: 05/10/2021] [Accepted: 05/20/2021] [Indexed: 12/16/2022] Open
Abstract
Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.
Affiliation(s)
- Hitoshi Iuchi
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Taro Matsutani
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Keisuke Yamada
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Natsuki Iwano
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Shunsuke Sumi
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
- Shion Hosoda
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Shitao Zhao
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan
- Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
- Michiaki Hamada
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
|
17
|
Yılmaz S, Ayati M, Schlatzer D, Çiçek AE, Chance MR, Koyutürk M. Robust inference of kinase activity using functional networks. Nat Commun 2021; 12:1177. [PMID: 33608514 PMCID: PMC7895941 DOI: 10.1038/s41467-021-21211-6] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 05/01/2020] [Accepted: 01/15/2021] [Indexed: 12/17/2022] Open
Abstract
Mass spectrometry enables high-throughput screening of phosphoproteins across a broad range of biological contexts. When complemented by computational algorithms, phospho-proteomic data allow the inference of kinase activity, facilitating the identification of dysregulated kinases in various diseases including cancer, Alzheimer’s disease and Parkinson’s disease. To enhance the reliability of kinase activity inference, we present a network-based framework, RoKAI, that integrates various sources of functional information to capture coordinated changes in signaling. Through computational experiments, we show that phosphorylation of sites in the functional neighborhood of a kinase is significantly predictive of its activity. The incorporation of this knowledge in RoKAI consistently enhances the accuracy of kinase activity inference methods while making them more robust to missing annotations and quantifications. This enables the identification of understudied kinases and will likely lead to the development of novel kinase inhibitors for targeted therapy of many diseases. RoKAI is available as a web-based tool at http://rokai.io.
Kinases drive fundamental changes in cell state, but predicting kinase activity based on substrate-level changes can be challenging. Here the authors introduce a computational framework that utilizes similarities between substrates to robustly infer kinase activity.
Affiliation(s)
- Serhan Yılmaz
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
- Marzieh Ayati
- Department of Computer Science, University of Texas Rio Grande Valley, Edinburg, TX, USA
- Daniela Schlatzer
- Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, USA
- A Ercüment Çiçek
- Department of Computer Engineering, Bilkent University, Ankara, Turkey
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Mark R Chance
- Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, USA
- Department of Nutrition, Case Western Reserve University, Cleveland, OH, USA
- Mehmet Koyutürk
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
- Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, USA
|
18
|
Wen B, Zeng W, Liao Y, Shi Z, Savage SR, Jiang W, Zhang B. Deep Learning in Proteomics. Proteomics 2020; 20:e1900335. [PMID: 32939979 PMCID: PMC7757195 DOI: 10.1002/pmic.201900335] [Citation(s) in RCA: 70] [Impact Index Per Article: 17.5] [Received: 05/27/2020] [Revised: 09/14/2020] [Indexed: 12/17/2022]
Abstract
Proteomics, the study of all the proteins in biological systems, is becoming a data-rich science. Protein sequences and structures are comprehensively catalogued in online databases. With recent advancements in tandem mass spectrometry (MS) technology, protein expression and post-translational modifications (PTMs) can be studied in a variety of biological systems at the global scale. Sophisticated computational algorithms are needed to translate the vast amount of data into novel biological insights. Deep learning automatically extracts data representations at high levels of abstraction from data, and it thrives in data-rich scientific research domains. Here, a comprehensive overview of deep learning applications in proteomics, including retention time prediction, MS/MS spectrum prediction, de novo peptide sequencing, PTM prediction, major histocompatibility complex-peptide binding prediction, and protein structure prediction, is provided. Limitations and the future directions of deep learning in proteomics are also discussed. This review will provide readers an overview of deep learning and how it can be used to analyze proteomics data.
Affiliation(s)
- Bo Wen
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen‐Feng Zeng
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Chinese Academy of Sciences, Institute of Computing Technology, Beijing 100190, China
- Yuxing Liao
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Zhiao Shi
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Sara R. Savage
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen Jiang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
|