1
Zhao Y, Ansarullah, Kumar P, Mahoney JM, He H, Baker C, George J, Li S. Causal network perturbation analysis identifies known and novel type-2 diabetes driver genes. bioRxiv 2024:2024.05.22.595431. [PMID: 38826370 PMCID: PMC11142180 DOI: 10.1101/2024.05.22.595431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
The molecular pathogenesis of diabetes is multifactorial, involving genetic predisposition and environmental factors that are not yet fully understood. However, pancreatic β-cell failure remains among the primary reasons underlying the progression of type-2 diabetes (T2D), making β-cell dysfunction an attractive target for diabetes treatment. To identify genetic contributors to β-cell dysfunction, we investigated single-cell gene expression changes in β-cells from healthy (C57BL/6J) and diabetic (NZO/HlLtJ) mice fed a normal or a high-fat, high-sugar (HFHS) diet. Our study presents an innovative integration of the causal network perturbation assessment (ssNPA) framework with meta-cell transcriptome analysis to explore the genetic underpinnings of T2D. By generating a reference causal network and applying in silico perturbation, we identified novel genes implicated in T2D and validated our candidates using the Knockout Mouse Phenotyping (KOMP) Project database.
Affiliation(s)
- Yue Zhao
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Ansarullah
  - Center for Biometric Analysis, The Jackson Laboratory, Bar Harbor, ME, USA
- Parveen Kumar
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Hao He
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Candice Baker
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Joshy George
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Sheng Li
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
  - Department of Genetics and Genome Sciences, University of Connecticut School of Medicine, Farmington, CT, USA
2
Prabhu H, Bhosale H, Sane A, Dhadwal R, Ramakrishnan V, Valadi J. Protein feature engineering framework for AMPylation site prediction. Sci Rep 2024; 14:8695. [PMID: 38622194 PMCID: PMC11369087 DOI: 10.1038/s41598-024-58450-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Accepted: 03/29/2024] [Indexed: 04/17/2024]
Abstract
AMPylation is a biologically significant yet understudied post-translational modification in which an adenosine monophosphate (AMP) group is added, primarily to tyrosine and threonine residues. While recent work has illuminated the prevalence and functional impacts of AMPylation, experimental identification of AMPylation sites remains challenging. Computational prediction techniques provide a faster alternative approach. The predictive performance of machine learning models is highly dependent on the features used to represent the raw amino acid sequences. In this work, we introduce a novel feature extraction pipeline to encode the key properties relevant to AMPylation site prediction. We utilize a recently published dataset of curated AMPylation sites to develop our feature generation framework. We demonstrate the utility of our extracted features by training various machine learning classifiers on various numerical representations of the raw sequences extracted with the help of our framework. Tenfold cross-validation is used to evaluate each model's capability to distinguish between AMPylated and non-AMPylated sites. The top-performing set of extracted features achieved an MCC of 0.58, an accuracy of 0.8, an AUC-ROC of 0.85 and an F1 score of 0.73. Further, we elucidate the behaviour of the model on the set of features consisting of monogram and bigram counts for various representations using SHapley Additive exPlanations.
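As an illustration of the monogram and bigram count features mentioned in this abstract, the sketch below counts overlapping 1-grams and 2-grams in a residue window. It is a minimal example and not the authors' pipeline: the function name `ngram_counts` and the example window are hypothetical, and the paper applies such counts to several sequence representations, not only to raw residues.

```python
from collections import Counter


def ngram_counts(sequence: str, n: int) -> Counter:
    """Count overlapping n-grams (n consecutive symbols) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical residue window; a real window would be centred on a candidate site.
window = "MKTAYIAKQR"
monograms = ngram_counts(window, 1)  # single-residue counts
bigrams = ngram_counts(window, 2)    # adjacent-residue-pair counts
```

Each count can then serve as one column of a feature vector for a classifier.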
Affiliation(s)
- Hardik Prabhu
  - Computing and Data Sciences, FLAME University, Pune, 412115, India
  - Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science, Bengaluru, 560012, India
- Aamod Sane
  - Computing and Data Sciences, FLAME University, Pune, 412115, India
- Renu Dhadwal
  - Computing and Data Sciences, FLAME University, Pune, 412115, India
- Vigneshwar Ramakrishnan
  - Bioinformatics Center, School of Chemical and Biotechnology, SASTRA Deemed to be University, Thanjavur, 613401, India
- Jayaraman Valadi
  - Computing and Data Sciences, FLAME University, Pune, 412115, India
3
Chandra A, Tünnermann L, Löfstedt T, Gratz R. Transformer-based deep learning for predicting protein properties in the life sciences. eLife 2023; 12:82819. [PMID: 36651724 PMCID: PMC9848389 DOI: 10.7554/elife.82819] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 08/22/2022] [Accepted: 01/06/2023] [Indexed: 01/19/2023]
Abstract
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and proteins with known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field of natural language processing is growing quickly because of developments in a class of models based on a particular model, the Transformer. We review recent developments and the use of large-scale Transformer models in applications for predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how the Transformer models have quickly proven to be a very promising way to unravel information hidden in the sequences of amino acids.
Affiliation(s)
- Abel Chandra
  - Department of Computing Science, Umeå University, Umeå, Sweden
- Laura Tünnermann
  - Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Tommy Löfstedt
  - Department of Computing Science, Umeå University, Umeå, Sweden
- Regina Gratz
  - Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
  - Department of Forest Ecology and Management, Swedish University of Agricultural Sciences, Umeå, Sweden
4
Li M, Wu Z, Wang W, Lu K, Zhang J, Zhou Y, Chen Z, Li D, Zheng S, Chen P, Wang B. Protein-Protein Interaction Sites Prediction Based on an Under-Sampling Strategy and Random Forest Algorithm. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3646-3654. [PMID: 34705656 DOI: 10.1109/tcbb.2021.3123269] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Computational methods for predicting protein-protein interaction sites can avoid the high cost and long turnaround of traditional experimental approaches. However, the serious class imbalance between interface and non-interface residues in protein sequences limits the prediction performance of these methods. This work therefore proposes a new strategy, NearMiss-based under-sampling for imbalanced datasets combined with Random Forest classification (NM-RF), to predict protein interaction sites. Residues in protein sequences are represented by PSSM-derived features, hydropathy index (HI) and relative solvent accessibility (RSA). To resolve the class imbalance problem, an under-sampling method based on the NearMiss algorithm removes some non-interface residues, and the random forest algorithm then performs binary classification on the balanced feature datasets. Experiments show that the accuracy of the NM-RF model reaches 87.6% and 84.3% on Dtestset72 and PDBtestset164, respectively, which demonstrates the effectiveness of the proposed NM-RF method in differentiating interface from non-interface residues.
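The under-sampling step described in this abstract can be sketched as follows. The abstract does not say which NearMiss variant was used, so this example assumes NearMiss-1, which keeps the majority-class samples with the smallest mean distance to their k nearest minority-class samples. The function name `nearmiss1`, the value of `k` and the toy 2-D points are all hypothetical; the paper's feature vectors are built from PSSM-derived features, HI and RSA.

```python
import math


def nearmiss1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points that are closest, on average,
    to their k nearest minority points (NearMiss-1, assumed variant).
    Assumes `minority` is non-empty."""
    def mean_dist_to_nearest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    return sorted(majority, key=mean_dist_to_nearest_minority)[:n_keep]


# Toy feature vectors (hypothetical, for illustration only).
interface = [(0.0, 1.0), (1.0, 0.0)]                              # minority class
non_interface = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0)]  # majority class

# Under-sample the majority class down to the size of the minority class.
kept = nearmiss1(non_interface, interface, n_keep=len(interface), k=2)
```

The balanced set (`interface` plus `kept`) would then be passed to a random forest classifier.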
5
Hu J, Zhang J, Yang Y, Liang T, Huang T, He C, Wang F, Liu H, Zhang T. Prediction of Communication Impairment in Children With Bilateral Cerebral Palsy Using Multivariate Lesion- and Connectome-Based Approaches: Protocol for a Multicenter Prospective Cohort Study. Front Hum Neurosci 2022; 16:788037. [PMID: 35173593 PMCID: PMC8841608 DOI: 10.3389/fnhum.2022.788037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2021] [Accepted: 01/10/2022] [Indexed: 12/05/2022]
Abstract
BACKGROUND: Bilateral cerebral palsy (BCP) is the most common type of CP in children and is often accompanied by different degrees of communication impairment. Several studies have attempted to identify children at high risk for communication impairment. However, most prediction factors are qualitative and subjective and may be influenced by rater bias. Individualized objective diagnostic and/or prediction methods are still lacking, and an effective method is urgently needed to guide clinical diagnosis and treatment. The aim of this study is to develop and validate an objective, individual-based model for the prediction of communication impairment in children with BCP by the time they enter school.
METHODS: A multicenter prospective cohort study will be conducted in four Chinese hospitals. A total of 178 children with BCP will undergo advanced brain magnetic resonance imaging (MRI) at baseline (corrected age, before the age of 2 years). At school entry, communication performance will be assessed by a communication function classification system (CFCS). Three-quarters of children with BCP will be allocated as a training cohort, whereas the remaining children will be allocated as a test cohort. Multivariate lesion- and connectome-based approaches, which have shown good predictive ability of language performance in stroke patients, will be applied to extract features from MR images for each child with BCP. Multiple machine learning models using extracted features to predict communication impairment for each child with BCP will be constructed using data from the training cohort and externally validated using data from the test cohort. Prediction accuracy across models in the test cohort will be statistically compared.
DISCUSSION: The findings of the study may lead to the development of several translational tools that can individually predict communication impairment in children newly diagnosed with BCP to ensure that these children receive early, targeted therapeutic intervention before they begin school.
TRIAL REGISTRATION: The study has been registered with the Chinese Clinical Trial Registry (ChiCTR2100049497).
Affiliation(s)
- Jie Hu
  - Department of Radiology, Medical Imaging Center of Guizhou Province, The Affiliated Hospital of Zunyi Medical University, Zunyi, China
- Jingjing Zhang
  - Department of Radiology, Medical Imaging Center of Guizhou Province, The Affiliated Hospital of Zunyi Medical University, Zunyi, China
- Yanli Yang
  - Department of Radiology, Medical Imaging Center of Guizhou Province, The Affiliated Hospital of Zunyi Medical University, Zunyi, China
- Ting Liang
  - Department of Diagnostic Radiology, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, China
- Tingting Huang
  - Department of Radiology, The First Affiliated Hospital of Henan University of TCM, Zhengzhou, China
- Cheng He
  - Department of Radiology, Chongqing University Central Hospital, Chongqing, China
- Fuqin Wang
  - Department of Radiology, Medical Imaging Center of Guizhou Province, The Affiliated Hospital of Zunyi Medical University, Zunyi, China
- Heng Liu
  - Department of Radiology, Medical Imaging Center of Guizhou Province, The Affiliated Hospital of Zunyi Medical University, Zunyi, China
- Tijiang Zhang
  - Department of Radiology, Medical Imaging Center of Guizhou Province, The Affiliated Hospital of Zunyi Medical University, Zunyi, China
6
Yang Y, Wang H, Li W, Wang X, Wei S, Liu Y, Xu Y. Prediction and analysis of multiple protein lysine modified sites based on conditional Wasserstein generative adversarial networks. BMC Bioinformatics 2021; 22:171. [PMID: 33789579 PMCID: PMC8010967 DOI: 10.1186/s12859-021-04101-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/28/2020] [Accepted: 03/23/2021] [Indexed: 01/05/2023]
Abstract
BACKGROUND: Protein post-translational modification (PTM) is key to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of the in-depth study and analysis of PTMs in proteins.
METHOD: We proposed a new multi-classification machine learning pipeline, MultiLyGAN, to identify seven types of lysine modified sites. Using eight different sequential and five structural construction methods, 1497 valid features remained after filtering by Pearson correlation coefficient. To solve the data imbalance problem, two influential deep generative methods, Conditional Generative Adversarial Network (CGAN) and Conditional Wasserstein Generative Adversarial Network (CWGAN), were leveraged and compared to generate new samples for the types with fewer samples. Finally, the random forest algorithm was used to predict the seven categories.
RESULTS: In the tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. Additionally, an accumulated feature importance analysis reported that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN.
CONCLUSIONS: The CWGAN greatly improved the predictive performance in all experiments. Features derived from the CKSAAP, PWM and structure schemes are the most informative and made the greatest contribution to the prediction of PTM.
Affiliation(s)
- Yingxi Yang
  - Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Hui Wang
  - Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080, China
- Wen Li
  - Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Xiaobo Wang
  - Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Shizhao Wei
  - No. 15 Research Institute, China Electronics Technology Group Corporation, Beijing, 100083, China
- Yulong Liu
  - No. 15 Research Institute, China Electronics Technology Group Corporation, Beijing, 100083, China
- Yan Xu
  - Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
7
Deng A, Zhang H, Wang W, Zhang J, Fan D, Chen P, Wang B. Developing Computational Model to Predict Protein-Protein Interaction Sites Based on the XGBoost Algorithm. Int J Mol Sci 2020; 21:E2274. [PMID: 32218345 PMCID: PMC7178137 DOI: 10.3390/ijms21072274] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 02/03/2020] [Revised: 03/10/2020] [Accepted: 03/23/2020] [Indexed: 12/27/2022]
Abstract
The study of protein-protein interaction is of great biological significance, and the prediction of protein-protein interaction sites can promote the understanding of cell biological activity and will be helpful for drug development. However, uneven distribution between interaction and non-interaction sites is common because only a small number of protein interactions have been confirmed by experimental techniques, which greatly affects the predictive capability of computational methods. In this work, two imbalanced data processing strategies based on XGBoost algorithm were proposed to re-balance the original dataset from inherent relationship between positive and negative samples for the prediction of protein-protein interaction sites. Herein, a feature extraction method was applied to represent the protein interaction sites based on evolutionary conservatism of proteins, and the influence of overlapping regions of positive and negative samples was considered in prediction performance. Our method showed good prediction performance, such as prediction accuracy of 0.807 and MCC of 0.614, on an original dataset with 10,455 surface residues but only 2297 interface residues. Experimental results demonstrated the effectiveness of our XGBoost-based method.
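Accuracy alone can mislead on an imbalanced dataset such as the one described here, which is one reason to report MCC alongside it. The sketch below computes both from a binary confusion matrix. The baseline counts are an assumption for illustration only: they treat the 2297 interface residues as a subset of the 10,455 surface residues and describe a hypothetical classifier that labels every residue as non-interface. They are not results from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.
    Returns 0.0 when the denominator is zero (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical "always non-interface" baseline on the assumed class sizes.
baseline = dict(tp=0, tn=10455 - 2297, fp=0, fn=2297)
print(round(accuracy(**baseline), 3), mcc(**baseline))
```

Under these assumptions the baseline scores an accuracy of about 0.78 but an MCC of 0, so the two metrics together give a fuller picture than accuracy alone.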
Affiliation(s)
- Aijun Deng
  - Key Laboratory of Metallurgical Emission Reduction & Resources Recycling (Anhui University of Technology), Ministry of Education, Ma'anshan 243002, China
  - School of Metallurgical Engineering, Anhui University of Technology, Ma'anshan 243032, China
  - Department of Engineering, University of Leicester, Leicester LE1 7RH, UK
- Huan Zhang
  - School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan 243032, China
- Wenyan Wang
  - School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan 243032, China
- Jun Zhang
  - Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei 230032, China
- Dingdong Fan
  - School of Metallurgical Engineering, Anhui University of Technology, Ma'anshan 243032, China
- Peng Chen
  - Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei 230032, China
- Bing Wang
  - Key Laboratory of Metallurgical Emission Reduction & Resources Recycling (Anhui University of Technology), Ministry of Education, Ma'anshan 243002, China
  - School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan 243032, China
  - Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei 230032, China
8
Andreatta M, Alvarez B, Nielsen M. GibbsCluster: unsupervised clustering and alignment of peptide sequences. Nucleic Acids Res 2017; 45:W458-W463. [PMID: 28407089 PMCID: PMC5570237 DOI: 10.1093/nar/gkx248] [Citation(s) in RCA: 129] [Impact Index Per Article: 25.8] [Received: 01/26/2017] [Accepted: 04/11/2017] [Indexed: 01/17/2023]
Abstract
Receptor interactions with short linear peptide fragments (ligands) are at the base of many biological signaling processes. Conserved and information-rich amino acid patterns, commonly called sequence motifs, shape and regulate these interactions. Because of the properties of a receptor-ligand system or of the assay used to interrogate it, experimental data often contain multiple sequence motifs. GibbsCluster is a powerful tool for unsupervised motif discovery because it can simultaneously cluster and align peptide data. The GibbsCluster 2.0 presented here is an improved version incorporating insertion and deletions accounting for variations in motif length in the peptide input. In basic terms, the program takes as input a set of peptide sequences and clusters them into meaningful groups. It returns the optimal number of clusters it identified, together with the sequence alignment and sequence motif characterizing each cluster. Several parameters are available to customize cluster analysis, including adjustable penalties for small clusters and overlapping groups and a trash cluster to remove outliers. As an example application, we used the server to deconvolute multiple specificities in large-scale peptidome data generated by mass spectrometry. The server is available at http://www.cbs.dtu.dk/services/GibbsCluster-2.0.
Affiliation(s)
- Massimo Andreatta
  - Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, 1650 San Martín, Argentina
- Bruno Alvarez
  - Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, 1650 San Martín, Argentina
- Morten Nielsen
  - Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, 1650 San Martín, Argentina
  - Department of Bio and Health Informatics, Technical University of Denmark, DK-2800 Lyngby, Denmark
9
Lee LYH, Loscalzo J. Network Medicine in Pathobiology. Am J Pathol 2019; 189:1311-1326. [PMID: 31014954 DOI: 10.1016/j.ajpath.2019.03.009] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 01/31/2019] [Accepted: 03/05/2019] [Indexed: 12/11/2022]
Abstract
The past decade has witnessed exponential growth in the generation of high-throughput human data across almost all known dimensions of biological systems. The discipline of network medicine has rapidly evolved in parallel, providing an unbiased, comprehensive biological framework through which to interrogate and integrate systematically these large-scale, multi-omic data to enhance our understanding of disease mechanisms and to design drugs that reflect a deep knowledge of molecular pathobiology. In this review, we discuss the key principles of network medicine and the human disease network and explore the latest applications of network medicine in this multi-omic era. We also highlight the current conceptual and technological challenges, which serve as exciting opportunities by which to improve and expand the network-based applications beyond the artificial boundaries of the current state of human pathobiology.
Affiliation(s)
- Joseph Loscalzo
  - Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts