1
Santini BL, Wendel S, Halbwedl N, Knipp A, Zacharias M. cPEPmatch Webserver: A comprehensive tool and database to aid rational design of cyclic peptides for drug discovery. Comput Struct Biotechnol J 2024; 23:3155-3162. [PMID: 39253058 PMCID: PMC11381751 DOI: 10.1016/j.csbj.2024.08.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Revised: 08/09/2024] [Accepted: 08/09/2024] [Indexed: 09/11/2024]
Abstract
Cyclic peptides have emerged as versatile scaffolds in drug discovery due to their stability and specificity. Here, we present the cPEPmatch webserver (accessible at https://t38webservices.nat.tum.de/cpepmatch/), an easy-to-use interface for the rational design of cyclic peptides targeting protein-protein interactions combined with a semi-quantitative evaluation of binding stability. This platform also offers access to a comprehensive database of cyclic peptide crystal structures. We demonstrate the webserver's utility through a series of case studies involving medically relevant protein systems, highlighting its potential to significantly advance drug discovery efforts.
Affiliation(s)
- Brianda L Santini
  - Center for Functional Protein Assemblies, Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany
- Stephanie Wendel
  - Center for Functional Protein Assemblies, Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany
- Niklas Halbwedl
  - Center for Functional Protein Assemblies, Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany
- Asha Knipp
  - Center for Functional Protein Assemblies, Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany
- Martin Zacharias
  - Center for Functional Protein Assemblies, Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany

2
Naha S, Kaur S, Bhattacharya R, Cheemanapalli S, Iyyappan Y. ANPS: machine learning based server for identification of anti-nutritional proteins in plants. Funct Integr Genomics 2024; 24:201. [PMID: 39453508 DOI: 10.1007/s10142-024-01474-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2024] [Revised: 10/09/2024] [Accepted: 10/10/2024] [Indexed: 10/26/2024]
Abstract
Anti-nutritional factors are inherently present in almost all major crops and impede the absorption of crucial vitamins and minerals upon human consumption. The anti-nutrients commonly found in food crops include saponins, tannins, lectins, and phytates. Currently, no computational server exists for identifying proteins that encode anti-nutritional factors in plants. This study therefore presents a computational approach for distinguishing proteins encoding anti-nutritional factors from those providing essential nutrients. Machine learning algorithms were employed to identify plant-specific anti-nutritional factor proteins from protein sequences using compositional features. Extreme gradient boosting achieved a five-fold cross-validation training performance of 94.34% AUC-ROC and 94.13% AUC-PR, surpassing other methods such as support vector machine, random forest, and adaptive boosting. These results suggest that the proposed approach is highly reliable in predicting plant-specific anti-nutritional factor proteins. The resulting prediction models have led to the development of an online server named ANPS, freely available at https://nipb-bi.icar.gov.in.
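The five-fold cross-validated AUC-ROC reported in this abstract can be illustrated with a minimal, standard-library-only sketch. It is not the ANPS pipeline: the synthetic data, the toy nearest-centroid scorer (`fit_centroid_scorer`), and the helper names are all assumptions made for illustration, standing in for the compositional features and the gradient-boosting model described above.

```python
import random
from statistics import mean


def auc_roc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking (Mann-Whitney) form."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Fraction of positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, seed=0):
    """Split example indices into k folds with similar class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(features, labels, fit, k=5, seed=0):
    """Mean held-out AUC-ROC over k folds; `fit` must return a scoring function."""
    aucs = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        score = fit([features[i] for i in train], [labels[i] for i in train])
        aucs.append(auc_roc([labels[i] for i in held_out],
                            [score(features[i]) for i in held_out]))
    return mean(aucs)


def fit_centroid_scorer(X, y):
    """Toy stand-in model (hypothetical): closer to the positive centroid scores higher."""
    def centroid(cls):
        rows = [x for x, label in zip(X, y) if label == cls]
        return [mean(col) for col in zip(*rows)]
    c_pos, c_neg = centroid(1), centroid(0)

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        return d_neg - d_pos
    return score


# Synthetic two-class data: 50 positives and 50 negatives with 4 features each.
rng = random.Random(42)
labels = [1] * 50 + [0] * 50
features = [[rng.gauss(1.0 if y else 0.0, 1.0) for _ in range(4)] for y in labels]
cv_auc = cross_validated_auc(features, labels, fit_centroid_scorer, k=5)
print(f"5-fold AUC-ROC: {cv_auc:.3f}")
```

Each example is scored only by a model that never saw it during fitting, which is what makes the averaged AUC an estimate of performance on unseen sequences rather than a measure of fit to the training data.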
Affiliation(s)
- Sanchita Naha
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, 110012, India
- Sarvjeet Kaur
  - ICAR-National Institute for Plant Biotechnology, Pusa, New Delhi, 110012, India
- Yuvaraj Iyyappan
  - ICAR-National Institute for Plant Biotechnology, Pusa, New Delhi, 110012, India

3
Chen YC, Sargsyan K, Wright JD, Chen YH, Huang YS, Lim C. PPI-hotspot ID for detecting protein-protein interaction hot spots from the free protein structure. eLife 2024; 13:RP96643. [PMID: 39283314 PMCID: PMC11405013 DOI: 10.7554/elife.96643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2024]
Abstract
Experimental detection of residues critical for protein-protein interactions (PPI) is a time-consuming, costly, and labor-intensive process. Hence, high-throughput PPI-hot spot prediction methods have been developed, but they have been validated using relatively small datasets, which may compromise their predictive reliability. Here, we introduce PPI-hotspotID, a novel method for identifying PPI-hot spots using the free protein structure, and validated it on the largest collection of experimentally confirmed PPI-hot spots to date. We explored the possibility of detecting PPI-hot spots using (i) FTMap in the PPI mode, which identifies hot spots on protein-protein interfaces from the free protein structure, and (ii) the interface residues predicted by AlphaFold-Multimer. PPI-hotspotID yielded better performance than FTMap and SPOTONE, a webserver for predicting PPI-hot spots given the protein sequence. When combined with the AlphaFold-Multimer-predicted interface residues, PPI-hotspotID yielded better performance than either method alone. Furthermore, we experimentally verified several PPI-hotspotID-predicted PPI-hot spots of eukaryotic elongation factor 2. Notably, PPI-hotspotID can reveal PPI-hot spots not obvious from complex structures, including those in indirect contact with binding partners. PPI-hotspotID serves as a valuable tool for understanding PPI mechanisms and aiding drug design. It is available as a web server (https://ppihotspotid.limlab.dnsalias.org/) and open-source code (https://github.com/wrigjz/ppihotspotid/).
Affiliation(s)
- Yao Chi Chen
  - Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
- Karen Sargsyan
  - Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
- Jon D Wright
  - Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
- Yu-Hsien Chen
  - Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
- Yi-Shuian Huang
  - Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
- Carmay Lim
  - Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan

4
Nandi S, Bhaduri S, Das D, Ghosh P, Mandal M, Mitra P. Deciphering the Lexicon of Protein Targets: A Review on Multifaceted Drug Discovery in the Era of Artificial Intelligence. Mol Pharm 2024; 21:1563-1590. [PMID: 38466810 DOI: 10.1021/acs.molpharmaceut.3c01161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/13/2024]
Abstract
Understanding protein sequence and structure is essential for understanding protein-protein interactions (PPIs), which are central to many biological processes and diseases. Targeting protein binding hot spots, which regulate signaling and growth, with rational drug design is promising. Rational drug design uses structural data and computational tools to study protein binding sites and protein interfaces to design inhibitors that can change these interactions, thereby potentially leading to therapeutic approaches. Artificial intelligence (AI), such as machine learning (ML) and deep learning (DL), has advanced drug discovery and design by providing computational resources and methods. Quantum chemistry is essential for drug reactivity, toxicology, drug screening, and quantitative structure-activity relationship (QSAR) properties. This review discusses the methodologies and challenges of identifying and characterizing hot spots and binding sites. It also explores the strategies and applications of artificial-intelligence-based rational drug design technologies that target proteins and PPI binding hot spots, and it provides insights for drug design with therapeutic implications. In a case study on the discovery of drug molecules for cancer treatment, we have also described the pathological conditions of heat shock protein 27 (HSP27) and matrix metalloproteinases (MMP2 and MMP9) and designed inhibitors of these proteins using the drug discovery paradigm. Additionally, the implications of benzothiazole derivatives for anticancer drug design and discovery are discussed.
Affiliation(s)
- Suvendu Nandi
  - School of Medical Science and Technology, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India
- Soumyadeep Bhaduri
  - Centre for Computational and Data Sciences, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India
- Debraj Das
  - Centre for Computational and Data Sciences, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India
- Priya Ghosh
  - School of Medical Science and Technology, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India
- Mahitosh Mandal
  - School of Medical Science and Technology, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India
- Pralay Mitra
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India

5
Chen J, Kuhn LA, Raschka S. Techniques for Developing Reliable Machine Learning Classifiers Applied to Understanding and Predicting Protein:Protein Interaction Hot Spots. Methods Mol Biol 2024; 2714:235-268. [PMID: 37676603 DOI: 10.1007/978-1-0716-3441-7_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/08/2023]
Abstract
With machine learning now transforming the sciences, successful prediction of biological structure or activity is mainly limited by the extent and quality of data available for training, the astute choice of features for prediction, and thorough assessment of the robustness of prediction on a variety of new cases. In this chapter, we address these issues while developing and sharing protocols to build a robust dataset and rigorously compare several predictive classifiers using the open-source Python machine learning library, scikit-learn. We show how to evaluate whether enough data has been used for training and whether the classifier has been overfit to training data. The most telling experiment is 500-fold repartitioning of the training and test sets, followed by prediction, which gives a good indication of whether a classifier performs consistently well on different datasets. An intuitive method is used to quantify which features are most important for correct prediction. The resulting well-trained classifier, hotspotter, can robustly predict the small subset of amino acid residues on the surface of a protein that are energetically most important for binding a protein partner: the interaction hot spots. Hotspotter has been trained and tested here on a curated dataset assembled from 1046 non-redundant alanine scanning mutation sites with experimentally measured change in binding free energy values from 97 different protein complexes; this dataset is available to download. The accessible surface area of the wild-type residue at a given site and its degree of evolutionary conservation proved the most important features to identify hot spots. A variant classifier was trained and validated for proteins where only the amino acid sequence is available, augmented by secondary structure assignment. This version of hotspotter requiring fewer features is almost as robust as the structure-based classifier.
Application to the ACE2 (angiotensin converting enzyme 2) receptor, which mediates COVID-19 virus entry into human cells, identified the critical hot spot triad of ACE2 residues at the center of the small interface with the CoV-2 spike protein. Hotspotter results can be used to guide the strategic design of protein interfaces and ligands and also to identify likely interfacial residues for protein:protein docking.
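The 500-fold repartitioning experiment described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the chapter uses scikit-learn and the hotspotter features, whereas this sketch uses the Python standard library, synthetic one-feature data, and a toy nearest-centroid rule (`nearest_centroid_accuracy`), all of which are hypothetical stand-ins.

```python
import random
from statistics import mean, stdev


def nearest_centroid_accuracy(train, test):
    """Fit a one-feature nearest-centroid rule on `train`; return accuracy on `test`."""
    c1 = mean(x for x, y in train if y == 1)
    c0 = mean(x for x, y in train if y == 0)
    # A point is predicted positive when it lies closer to the positive centroid.
    correct = sum((abs(x - c1) < abs(x - c0)) == (y == 1) for x, y in test)
    return correct / len(test)


def repartition_scores(data, n_repeats=500, test_fraction=0.2, seed=0):
    """Re-split the data at random n_repeats times and score each split."""
    rng = random.Random(seed)
    n_test = max(1, int(len(data) * test_fraction))
    scores = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(nearest_centroid_accuracy(train, test))
    return scores


# Synthetic data: 100 positives centred at 2.0 and 100 negatives centred at 0.0.
rng = random.Random(1)
data = [(rng.gauss(2.0 if y else 0.0, 1.0), y) for y in [1] * 100 + [0] * 100]
scores = repartition_scores(data)
print(f"accuracy over {len(scores)} splits: {mean(scores):.3f} +/- {stdev(scores):.3f}")
```

A narrow spread of scores across the repartitions suggests that performance does not hinge on one fortunate train/test split; a wide spread would indicate sensitivity to which examples end up in the training set.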
Affiliation(s)
- Jiaxing Chen
  - Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, PA, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
- Leslie A Kuhn
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
- Sebastian Raschka
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA

6
Chandra A, Sharma A, Dehzangi I, Tsunoda T, Sattar A. PepCNN deep learning tool for predicting peptide binding residues in proteins using sequence, structural, and language model features. Sci Rep 2023; 13:20882. [PMID: 38016996 PMCID: PMC10684570 DOI: 10.1038/s41598-023-47624-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/07/2023] [Accepted: 11/16/2023] [Indexed: 11/30/2023]
Abstract
Protein-peptide interactions play a crucial role in various cellular processes and are implicated in abnormal cellular behaviors leading to diseases such as cancer. Therefore, understanding these interactions is vital for both functional genomics and drug discovery efforts. Despite a significant increase in the availability of protein-peptide complexes, experimental methods for studying these interactions remain laborious, time-consuming, and expensive. Computational methods offer a complementary approach but often fall short in terms of prediction accuracy. To address these challenges, we introduce PepCNN, a deep learning-based prediction model that incorporates structural and sequence-based information from primary protein sequences. By combining half-sphere exposure, position-specific scoring matrices from a multiple-sequence alignment tool, and embeddings from a pre-trained protein language model, PepCNN outperforms state-of-the-art methods in terms of specificity, precision, and AUC. The PepCNN software and datasets are publicly available at https://github.com/abelavit/PepCNN.git.
Affiliation(s)
- Abel Chandra
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
- Alok Sharma
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Iman Dehzangi
  - Department of Computer Science, Rutgers University, Camden, NJ, USA
  - Center for Computational and Integrative Biology, Rutgers University, Camden, USA
- Tatsuhiko Tsunoda
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  - Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
- Abdul Sattar
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia

7
Pinto ÉSM, Krause MJ, Dorn M, Feltes BC. The nucleotide excision repair proteins through the lens of molecular dynamics simulations. DNA Repair (Amst) 2023; 127:103510. [PMID: 37148846 DOI: 10.1016/j.dnarep.2023.103510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Revised: 04/07/2023] [Accepted: 04/23/2023] [Indexed: 05/08/2023]
Abstract
Mutations that affect the proteins responsible for the nucleotide excision repair (NER) pathway can lead to diseases such as xeroderma pigmentosum, trichothiodystrophy, Cockayne syndrome, and cerebro-oculo-facio-skeletal syndrome. Hence, understanding their molecular behavior is needed to elucidate these diseases' phenotypes and how the NER pathway is organized and coordinated. Molecular dynamics techniques enable the study of different protein conformations, adaptable to any research question, shedding light on the dynamics of biomolecules. However, as important as they are, molecular dynamics studies focused on DNA repair pathways are only now becoming widespread. Currently, there are no review articles compiling the advancements made in molecular dynamics approaches applied to NER and discussing: (i) how this technique is currently employed in the field of DNA repair, focusing on NER proteins; (ii) which technical setups are being employed, and their strengths and limitations; (iii) which insights or information they are providing to understand the NER pathway or NER-associated proteins; (iv) which open questions this technique would be suited to answer; and (v) where we can go from here. These questions become even more crucial considering the numerous 3D structures of the NER pathway's proteins published in recent years. In this work, we tackle each of these questions, reviewing and critically discussing the results published in the context of the NER pathway.
Affiliation(s)
- Mathias J Krause
  - Institute for Applied and Numerical Mathematics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Márcio Dorn
  - Center for Biotechnology, Federal University of Rio Grande do Sul, RS, Brazil
  - Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil
  - National Institute of Science and Technology - Forensic Science, Porto Alegre, RS, Brazil
- Bruno César Feltes
  - Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil

8
Hou Z, Yin W, Hao Z, Fan K, Sun N, Sun P, Li H. Molecular Simulation Study on the Interaction between Porcine CR1-like and C3b. Molecules 2023; 28:2183. [PMID: 36903431 PMCID: PMC10005376 DOI: 10.3390/molecules28052183] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/17/2023] [Revised: 02/22/2023] [Accepted: 02/23/2023] [Indexed: 03/02/2023]
Abstract
The molecular basis of porcine red blood cell immune adhesion function stems from the complement receptor type 1-like (CR1-like) on its cell membrane. The ligand for CR1-like is C3b, which is produced by the cleavage of complement C3; however, the molecular mechanism of the immune adhesion of porcine erythrocytes is still unclear. Here, homology modeling was used to construct three-dimensional models of C3b and two fragments of CR1-like. An interaction model of C3b-CR1-like was constructed by molecular docking, and molecular structure optimization was achieved using molecular dynamics simulation. A simulated alanine mutation scan revealed that the amino acids Tyr761, Arg763, Phe765, Thr789, and Val873 of CR1-like SCR 12-14 and the amino acid residues Tyr1210, Asn1244, Val1249, Thr1253, Tyr1267, Val1322, and Val1339 of CR1-like SCR 19-21 are key residues involved in the interaction of porcine C3b with CR1-like. This study investigated the interaction between porcine CR1-like and C3b using molecular simulation to clarify the molecular mechanism of the immune adhesion of porcine erythrocytes.
Affiliation(s)
- Zhen Hou
  - Shanxi Key Lab for Modernization of TCVM, College of Veterinary Medicine, Shanxi Agricultural University, Jinzhong 030801, China
- Wei Yin
  - Shanxi Key Lab for Modernization of TCVM, College of Veterinary Medicine, Shanxi Agricultural University, Jinzhong 030801, China
- Zhili Hao
  - College of Veterinary Medicine, Jilin University, Changchun 130015, China
- Kuohai Fan
  - Shanxi Key Lab for Modernization of TCVM, College of Veterinary Medicine, Shanxi Agricultural University, Jinzhong 030801, China
- Na Sun
  - Shanxi Key Lab for Modernization of TCVM, College of Veterinary Medicine, Shanxi Agricultural University, Jinzhong 030801, China
- Panpan Sun
  - Shanxi Key Lab for Modernization of TCVM, College of Veterinary Medicine, Shanxi Agricultural University, Jinzhong 030801, China
- Hongquan Li
  - Shanxi Key Lab for Modernization of TCVM, College of Veterinary Medicine, Shanxi Agricultural University, Jinzhong 030801, China
  - Correspondence: Tel.: +86-3546289210

9
Sharaha U, Abu-Aqil G, Suleiman M, Riesenberg K, Lapidot I, Huleihel M, Salman A. Rapid determination of Proteus mirabilis susceptibility to antibiotics using infrared spectroscopy in tandem with random forest. J Biophotonics 2023; 16:e202200198. [PMID: 36169094 DOI: 10.1002/jbio.202200198] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/28/2022] [Revised: 09/24/2022] [Accepted: 09/26/2022] [Indexed: 06/16/2023]
Abstract
Bacterial infections cause serious illnesses that are treated with antibiotics. Currently used methods for detecting bacterial antibiotic susceptibility consume 48-72 h, leading to overuse of antibiotics. Thus, many bacterial species have acquired resistance to a broad range of available antibiotics. There is an urgent need to develop efficient methods for rapid determination of bacterial susceptibility to antibiotics. The combination of machine learning and Fourier-transform infrared (FTIR) spectroscopy has generated a promising diagnostic approach in medicine and biology. Our main goal is to examine the potential of FTIR spectroscopy to determine the susceptibility of urinary tract infection-Proteus mirabilis to a specific range of antibiotics, within about 20 min after 24 h culture and identification. We measured the infrared spectra of 489 different P. mirabilis isolates and used random forest to analyze this spectral database. A classification success rate of ~84% was achieved in differentiating between the resistant and sensitive isolates based on their susceptibility to ceftazidime, ceftriaxone, cefuroxime, cefuroxime axetil, cephalexin, ciprofloxacin, gentamicin, and sulfamethoxazole antibiotics in a time span of 24 h instead of 48 h.
Affiliation(s)
- Uraib Sharaha
  - Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- George Abu-Aqil
  - Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Manal Suleiman
  - Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Klaris Riesenberg
  - Internal Medicine E, Soroka University Medical Center, Beer-Sheva, Israel
- Itshak Lapidot
  - Department of Electrical and Electronics Engineering, ACLP-Afeka Center for Language Processing, Afeka Tel-Aviv Academic College of Engineering, Tel-Aviv, Israel
- Mahmoud Huleihel
  - Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Ahmad Salman
  - Department of Physics, SCE - Shamoon College of Engineering, Beer-Sheva, Israel

10
Bowling GC, Rands MG, Dobi A, Eldhose B. Emerging Developments in ETS-Positive Prostate Cancer Therapy. Mol Cancer Ther 2023; 22:168-178. [PMID: 36511830 DOI: 10.1158/1535-7163.mct-22-0527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2022] [Revised: 10/26/2022] [Accepted: 12/07/2022] [Indexed: 12/15/2022]
Abstract
Prostate cancer is a global health concern with a low survival rate in its advanced stages. Even though second-generation androgen receptor-axis inhibitors serve as the mainstay treatment options, most metastatic cases progress to castration-resistant prostate cancer after their initial treatment response, with poor prognostic outcomes. Hence, there is a dire need to develop effective inhibitors that target the causal oncogenes involved in prostate cancer initiation and progression. Molecular-targeted therapy against E-26 transformation-specific (ETS) transcription factors, particularly ETS-related gene (ERG), has gained wide attention as a potential treatment strategy. ETS rearrangements with the male hormone-responsive transmembrane protease serine 2 promoter define a significant number of prostate cancer cases and are responsible for cancer initiation and progression. Notably, inhibition of ETS activity has been shown to reduce tumorigenesis, highlighting its potential as a clinical therapeutic target. In this review, we recapitulate the various targeted drug approaches, including small molecules, peptidomimetics, nucleic acids, and many others, aimed at suppressing ETS activity. Several inhibitors have demonstrated ERG antagonist activity in prostate cancer, but further investigations into their molecular mechanisms and impacts on nontumor ETS-containing tissues are warranted.
Affiliation(s)
- Gartrell C Bowling
  - School of Medicine, Uniformed Services University of the Health Sciences, Bethesda, Maryland
  - Center for Prostate Disease Research, Murtha Cancer Center Research Program, Department of Surgery, Uniformed Services University of the Health Sciences, Bethesda, Maryland
- Mitchell G Rands
  - School of Medicine, Uniformed Services University of the Health Sciences, Bethesda, Maryland
- Albert Dobi
  - Center for Prostate Disease Research, Murtha Cancer Center Research Program, Department of Surgery, Uniformed Services University of the Health Sciences, Bethesda, Maryland
  - Henry M. Jackson Foundation for the Advancement of Military Medicine, Inc., Bethesda, Maryland
- Binil Eldhose
  - Center for Prostate Disease Research, Murtha Cancer Center Research Program, Department of Surgery, Uniformed Services University of the Health Sciences, Bethesda, Maryland
  - Henry M. Jackson Foundation for the Advancement of Military Medicine, Inc., Bethesda, Maryland

11
Leander M, Liu Z, Cui Q, Raman S. Deep mutational scanning and machine learning reveal structural and molecular rules governing allosteric hotspots in homologous proteins. eLife 2022; 11:e79932. [PMID: 36226916 PMCID: PMC9662819 DOI: 10.7554/elife.79932] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 05/03/2022] [Accepted: 10/13/2022] [Indexed: 01/29/2023]
Abstract
A fundamental question in protein science is where allosteric hotspots - residues critical for allosteric signaling - are located, and what properties differentiate them. We carried out deep mutational scanning (DMS) of four homologous bacterial allosteric transcription factors (aTFs) to identify hotspots and built a machine learning model with these data to glean the structural and molecular properties of allosteric hotspots. We found hotspots to be distributed protein-wide rather than being restricted to 'pathways' linking allosteric and active sites as is commonly assumed. Despite structural homology, the location of hotspots was not superimposable across the aTFs. However, common signatures emerged when comparing hotspots coincident with long-range interactions, suggesting that the allosteric mechanism is conserved among the homologs despite differences in molecular details. Machine learning with our large DMS datasets revealed global structural and dynamic properties to be stronger predictors of whether a residue is a hotspot than local and physicochemical properties. Furthermore, a model trained on one protein can predict hotspots in a homolog. In summary, the overall allosteric mechanism is embedded in the structural fold of the aTF family, but the finer molecular details are sequence-specific.
Affiliation(s)
- Megan Leander
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, United States
- Zhuang Liu
  - Department of Physics, Boston University, Boston, United States
- Qiang Cui
  - Department of Physics, Boston University, Boston, United States
  - Department of Chemistry, Boston University, Boston, United States
- Srivatsan Raman
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, United States
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, United States
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, United States

12
Kurban H, Kurban M, Dalkilic MM. Rapidly predicting Kohn-Sham total energy using data-centric AI. Sci Rep 2022; 12:14403. [PMID: 36002504 PMCID: PMC9402589 DOI: 10.1038/s41598-022-18366-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Accepted: 08/10/2022] [Indexed: 11/28/2022]
Abstract
Predicting material properties by solving the Kohn-Sham (KS) equation, which is the basis of modern computational approaches to electronic structures, has provided significant improvements in materials sciences. Despite their contributions, both DFT and DFTB calculations are limited by the number of electrons and atoms, which translates into increasingly longer run-times. In this work we introduce a novel, data-centric machine learning framework that is used to rapidly and accurately predict the KS total energy of anatase TiO2 nanoparticles (NPs) at different temperatures using only a small amount of theoretical data. The proposed framework, which we call co-modeling, eliminates the need for experimental data and is general enough to be used over any NPs to determine electronic structure and, consequently, to study physical and chemical properties more efficiently. We include a web service to demonstrate the effectiveness of our approach.
Affiliation(s)
- Hasan Kurban
  - Applied Data Science Department, San José State University, San Jose, CA, 95192, USA
  - Computer Science Department, Indiana University, Bloomington, IN, 47405, US
- Mustafa Kurban
  - Department of Electrical and Electronics Engineering, Kırşehir Ahi Evran University, 40100, Kırşehir, Turkey
- Mehmet M Dalkilic
  - Computer Science Department, Indiana University, Bloomington, IN, 47405, US

13
Liu SH, Xiao Z, Mishra SK, Mitchell JC, Smith JC, Quarles LD, Petridis L. Identification of Small-Molecule Inhibitors of Fibroblast Growth Factor 23 Signaling via In Silico Hot Spot Prediction and Molecular Docking to α-Klotho. J Chem Inf Model 2022; 62:3627-3637. [PMID: 35868851 PMCID: PMC10018682 DOI: 10.1021/acs.jcim.2c00633] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/30/2022]
Abstract
Fibroblast growth factor 23 (FGF23) is a therapeutic target for treating hereditary and acquired hypophosphatemic disorders, such as X-linked hypophosphatemic (XLH) rickets and tumor-induced osteomalacia (TIO), respectively. FGF23-induced hypophosphatemia is mediated by signaling through a ternary complex formed by FGF23, the FGF receptor (FGFR), and α-Klotho. Currently, disorders of excess FGF23 are treated with an FGF23-blocking antibody, burosumab. Small-molecule drugs that disrupt protein/protein interactions necessary for the ternary complex formation offer an alternative to disrupting FGF23 signaling. In this study, the FGF23:α-Klotho interface was targeted to identify small-molecule protein/protein interaction inhibitors since it was computationally predicted to have a large fraction of hot spots and two druggable residues on α-Klotho. We further identified Tyr433 on the KL1 domain of α-Klotho as a promising hot spot and α-Klotho as an appropriate drug-binding target at this interface. Subsequently, we performed in silico docking of ∼5.5 million compounds from the ZINC database to the interface region of α-Klotho from the ternary crystal structure. Following docking, 24 and 20 compounds were in the final list based on the lowest binding free energies to α-Klotho and the largest number of contacts with Tyr433, respectively. Five compounds were assessed experimentally by their FGF23-mediated extracellular signal-regulated kinase (ERK) activities in vitro, and two of these reduced activities significantly. Both these compounds were predicted to have favorable binding affinities to α-Klotho but not have a large number of contacts with the hot spot Tyr433. ZINC12409120 was found experimentally to disrupt FGF23:α-Klotho interaction to reduce FGF23-mediated ERK activities by 70% and have a half maximal inhibitory concentration (IC50) of 5.0 ± 0.23 μM. 
Molecular dynamics (MD) simulations of the ZINC12409120:α-Klotho complex starting from in silico docking poses reveal that the ligand exhibits contacts with residues on the KL1 domain, the KL1-KL2 linker, and the KL2 domain of α-Klotho simultaneously, thereby possibly disrupting the regular function of α-Klotho and impeding FGF23:α-Klotho interaction. ZINC12409120 is a candidate for lead optimization.
Affiliation(s)
- Shih-Hsien Liu
- UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Department of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, Tennessee 37996, United States
| | - Zhousheng Xiao
- Department of Medicine, College of Medicine, University of Tennessee Health Science Center, Memphis, Tennessee 38163, United States
| | - Sambit K Mishra
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
| | - Julie C Mitchell
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
| | - Jeremy C Smith
- UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Department of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, Tennessee 37996, United States
| | - L Darryl Quarles
- Department of Medicine, College of Medicine, University of Tennessee Health Science Center, Memphis, Tennessee 38163, United States
| | - Loukas Petridis
- UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Department of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, Tennessee 37996, United States
|
14
|
Improving Path Loss Prediction Using Environmental Feature Extraction from Satellite Images: Hand-Crafted vs. Convolutional Neural Network. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12157685] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/10/2022]
Abstract
There is an increased exploration of the potential of wireless communication networks in the automation of daily human tasks via the Internet of Things. Such implementations are only possible with the proper design of networks. Path loss prediction is a key factor in the design of networks with parameters such as cell radius, antenna heights, and the number of cell sites that can be set. As path loss is affected by the environment, satellite images of network locations are used in developing path loss prediction models such that environmental effects are captured. We developed a path loss model based on the Extreme Gradient Boosting (XGBoost) algorithm, whose inputs are numeric (non-image) features that influence path loss and features extracted from images composed of four tiled satellite images of points along the transmitter to receiver path. The model can predict path loss for multiple frequencies, antenna heights, and environments such that it can be incorporated into Radio Planning Tools. Various feature extraction methods that included CNN and hand-crafted and their combinations were applied to the images in order to determine the best input features, which, when combined with non-image features, will result in the best XGBoost model. Although hand-crafted features have the advantage of not requiring a large volume of data as no training is involved in them, they failed in this application as their use led to a reduction in accuracy. However, the best model was obtained when image features extracted using CNN and GLCM were combined with the non-image features, resulting in an RMSE improvement of 9.4272% against a model with non-image features only without satellite images. The XGBoost model performed better than Random Forest (RF), Extreme Learning Trees (ET), Gradient Boosting, and K Nearest Neighbor (KNN) based on the combination of CNN, GLCM, and non-image features. 
Further analysis using the Shapley Additive Explanations (SHAP) revealed that features extracted from the satellite images using CNN had the highest contribution toward the XGBoost model’s output. The variation in values of features with output path loss values was presented using SHAP summary plots. Interactions were also observed between some features based on their dependence plots from the computed SHAP values. This information, when further explored, could serve as the basis for the development of an explainable/glass box path loss model.
|
15
|
Zhang H, Zou Q, Ju Y, Song C, Chen D. Distance-based support vector machine to predict DNA N6-methyladenine modification. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220404145517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background:
DNA N6-methyladenine plays an important role in the restriction-modification system, which protects the cell against invading foreign DNA. Experimental detection methods are time-consuming and costly, which has motivated the development of computational methods. Support vector machine theory has received extensive attention in the bioinformatics field due to its solid theoretical foundation and many good characteristics.
Objective:
General machine learning methods include an important feature-extraction step. This research omits that step and replaces it with an easy-to-obtain sequence distance matrix to obtain better results.
Method:
First, sequence alignment was used to obtain the similarity matrix. Then a novel transformation turned the similarity matrix into a distance matrix. Next, the similarity-distance matrix was made positive semi-definite so that it could be used as the kernel matrix. Finally, the LIBSVM software was applied to solve the support vector machine.
Results:
The five-fold cross-validation of this model on rice and mouse data achieved excellent accuracy rates of 92.04% and 96.51%, respectively. This shows that the DB-SVM method has obvious advantages compared with traditional machine learning methods. Meanwhile, this model achieved accuracies of 0.943, 0.982, and 0.818, Matthews correlation coefficients of 0.944, 0.982, and 0.838, and F1 scores of 0.942, 0.982, and 0.840 for the rice, M. musculus, and cross-species genome datasets, respectively.
Conclusion:
These outcomes show that this model outperforms iIM-CNN and csDMA, the latest methods for predicting DNA 6mA modification.
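The Method steps above (similarity matrix, distance transform, positive semi-definite repair) can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the abstract does not specify the transformation or the repair, so the code assumes a simple distance d = 1 - s, a Gaussian-style kernel of that distance, and a conservative diagonal shift based on Gershgorin's circle theorem. All function names and the toy similarity values are hypothetical.

```python
import math

def similarity_to_distance(sim):
    """Assumed transform: d_ij = 1 - s_ij for similarities scaled to [0, 1]."""
    return [[1.0 - s for s in row] for row in sim]

def distance_kernel(dist, gamma=1.0):
    """Gaussian-style kernel built directly from pairwise distances."""
    return [[math.exp(-gamma * d * d) for d in row] for row in dist]

def gershgorin_lower_bound(mat):
    """Lower bound on the smallest eigenvalue of a symmetric matrix."""
    return min(
        row[i] - sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(mat)
    )

def make_psd(mat):
    """Shift the diagonal until every Gershgorin disc lies at or above zero."""
    shift = max(0.0, -gershgorin_lower_bound(mat))
    return [
        [v + shift if i == j else v for j, v in enumerate(row)]
        for i, row in enumerate(mat)
    ]

# Toy similarity matrix for three sequences (made-up values).
sim = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.7],
    [0.8, 0.7, 1.0],
]
kernel = make_psd(distance_kernel(similarity_to_distance(sim)))
```

The diagonal shift is deliberately crude: it guarantees a positive semi-definite result but usually shifts more than strictly necessary, whereas an eigenvalue-based repair would be tighter. A matrix repaired this way could then be handed to an SVM solver that accepts precomputed kernels, which LIBSVM does.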
Affiliation(s)
- Haoyu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chenggang Song
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China
|
16
|
Fan Z, Chiong R, Hu Z, Keivanian F, Chiong F. Body fat prediction through feature extraction based on anthropometric and laboratory measurements. PLoS One 2022; 17:e0263333. [PMID: 35192644 PMCID: PMC8863283 DOI: 10.1371/journal.pone.0263333] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/01/2021] [Accepted: 01/17/2022] [Indexed: 01/15/2023] Open
Abstract
Obesity, associated with having excess body fat, is a critical public health problem that can cause serious diseases. Although a range of techniques for body fat estimation have been developed to assess obesity, these typically involve high-cost tests requiring special equipment. Thus, the accurate prediction of body fat percentage based on easily accessed body measurements is important for assessing obesity and its related diseases. By considering the characteristics of different features (e.g. body measurements), this study investigates the effectiveness of feature extraction for body fat prediction. It evaluates the performance of three feature extraction approaches by comparing four well-known prediction models. Experimental results based on two real-world body fat datasets show that the prediction models perform better on incorporating feature extraction for body fat prediction, in terms of the mean absolute error, standard deviation, root mean square error and robustness. These results confirm that feature extraction is an effective pre-processing step for predicting body fat. In addition, statistical analysis confirms that feature extraction significantly improves the performance of prediction methods. Moreover, the increase in the number of extracted features results in further, albeit slight, improvements to the prediction models. The findings of this study provide a baseline for future research in related areas.
Affiliation(s)
- Zongwen Fan
- School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia
- College of Computer Science and Technology, Huaqiao University, Xiamen, China
| | - Raymond Chiong
- School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia
| | - Zhongyi Hu
- School of Information Management, Wuhan University, Wuhan, China
| | - Farshid Keivanian
- School of Information and Physical Sciences, The University of Newcastle, Callaghan, NSW, Australia
|
17
|
A two-step ensemble learning for predicting protein hot spot residues from whole protein sequence. Amino Acids 2022; 54:765-776. [DOI: 10.1007/s00726-022-03129-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2020] [Accepted: 01/17/2022] [Indexed: 11/26/2022]
|
18
|
Ovek D, Abali Z, Zeylan ME, Keskin O, Gursoy A, Tuncbag N. Artificial intelligence based methods for hot spot prediction. Curr Opin Struct Biol 2021; 72:209-218. [PMID: 34954608 DOI: 10.1016/j.sbi.2021.11.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/05/2021] [Revised: 10/07/2021] [Accepted: 11/08/2021] [Indexed: 11/29/2022]
Abstract
Proteins interact through their interfaces to fulfill essential functions in the cell. They bind to their partners in a highly specific manner and form complexes that have a profound effect on understanding the biological pathways they are involved in. Any abnormal interactions may cause diseases. Therefore, the identification of small molecules which modulate protein interactions through their interfaces has high therapeutic potential. However, discovering such molecules is challenging. Most protein-protein binding affinity is attributed to a small set of amino acids found in protein interfaces known as hot spots. Recent studies demonstrate that drug-like small molecules specifically may bind to hot spots. Therefore, hot spot prediction is crucial. As experimental data accumulates, artificial intelligence begins to be used for computational hot spot prediction. First, we review machine learning and deep learning for computational hot spot prediction and then explain the significance of hot spots toward drug design.
Affiliation(s)
- Damla Ovek
- College of Engineering, Koc University, 34450 Istanbul, Turkey
| | - Zeynep Abali
- College of Engineering, Koc University, 34450 Istanbul, Turkey
| | - Ozlem Keskin
- College of Engineering, Koc University, 34450 Istanbul, Turkey.
| | - Attila Gursoy
- College of Engineering, Koc University, 34450 Istanbul, Turkey.
| | - Nurcan Tuncbag
- College of Engineering, Koc University, 34450 Istanbul, Turkey; School of Medicine, Koc University, 34450 Istanbul, Turkey.
|
19
|
Tarimo CS, Bhuyan SS, Li Q, Ren W, Mahande MJ, Wu J. Combining Resampling Strategies and Ensemble Machine Learning Methods to Enhance Prediction of Neonates with a Low Apgar Score After Induction of Labor in Northern Tanzania. Risk Manag Healthc Policy 2021; 14:3711-3720. [PMID: 34522147 PMCID: PMC8434924 DOI: 10.2147/rmhp.s331077] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/29/2021] [Accepted: 08/26/2021] [Indexed: 11/23/2022] Open
Abstract
Objective: The goal of this study was to establish the most efficient boosting method in predicting neonatal low Apgar scores following labor induction intervention and to assess whether resampling strategies would improve the predictive performance of the selected boosting algorithms. Methods: A total of 7716 singleton births delivered from 2000 to 2015 were analyzed. Cesarean deliveries following labor induction, deliveries with abnormal presentation, and deliveries with missing Apgar score or delivery mode information were excluded. We examined the effect of resampling approaches or data preprocessing on predicting low Apgar scores, specifically the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, and the random undersampling (RUS) technique. Sensitivity, specificity, precision, area under the receiver operating characteristic curve (AUROC), F-score, positive predictive value (PPV), negative predictive value (NPV), and accuracy of the three boosting-based ensemble methods were used to evaluate their discriminative ability. The ensemble learning models tested include adaptive boosting (AdaBoost), gradient boosting (GB), and extreme gradient boosting (XGBoost). Results: The prevalence of low (<7) Apgar scores was 9.5% (n = 733). The prediction models performed similarly in their baseline mode. Following the application of resampling techniques, borderline-SMOTE significantly improved the predictive performance of all the boosting-based ensemble methods under observation in terms of sensitivity, F1-score, AUROC, and PPV. Conclusion: Policymakers, healthcare informaticians, and neonatologists should consider implementing data preprocessing strategies when predicting a neonatal outcome with imbalanced data to enhance efficiency. The process may be more effective when the borderline-SMOTE technique is deployed on the selected ensemble classifiers.
However, future research may focus on testing additional resampling techniques, performing feature engineering, variable selection and optimizing further the ensemble learning hyperparameters.
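The core idea of the SMOTE resampling named in this abstract can be illustrated with a short, dependency-free sketch: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration under assumed defaults (the function name, `k`, the seed, and the toy points are hypothetical), not the pipeline used in the study.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic

# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority points, it always falls inside the region already spanned by the minority class. Borderline-SMOTE follows the same interpolation step but, broadly, restricts the base points to minority samples that lie near the class boundary.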
Affiliation(s)
- Clifford Silver Tarimo
- Department of Epidemiology and Health Statistics, Zhengzhou University, Zhengzhou, People's Republic of China; Department of Science and Laboratory Technology, Dar es Salaam Institute of Technology, Dar es Salaam, Tanzania
| | - Soumitra S Bhuyan
- Edward J. Bloustein School of Planning and Public Policy, Rutgers University, New Brunswick, NJ, USA
| | - Quanman Li
- Department of Epidemiology and Health Statistics, Zhengzhou University, Zhengzhou, People's Republic of China
| | - Weicun Ren
- College of Sanquan, Xinxiang Medical University, Xinxiang, People's Republic of China
| | - Michael Johnson Mahande
- Department of Epidemiology and Applied Biostatistics, Kilimanjaro Christian Medical University College, Moshi, Tanzania
| | - Jian Wu
- Department of Epidemiology and Health Statistics, Zhengzhou University, Zhengzhou, People's Republic of China
|
20
|
Feng Y, Wang Z, Yang N, Liu S, Yan J, Song J, Yang S, Zhang Y. Identification of Biomarkers for Cervical Cancer Radiotherapy Resistance Based on RNA Sequencing Data. Front Cell Dev Biol 2021; 9:724172. [PMID: 34414195 PMCID: PMC8369412 DOI: 10.3389/fcell.2021.724172] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/12/2021] [Accepted: 07/14/2021] [Indexed: 11/28/2022] Open
Abstract
Cervical cancer, a common gynecological malignancy, threatens the health and lives of women. Resistance to radiotherapy is the primary cause of treatment failure and is mainly related to differences in the inherent vulnerability of tumors after radiotherapy. Here, we investigated signature genes associated with poor response to radiotherapy by analyzing an independent cervical cancer dataset from the Gene Expression Omnibus, including pre-irradiation and mid-irradiation information. A total of 316 significantly differentially expressed genes were identified. The correlations between these genes were investigated through Pearson correlation analysis. Subsequently, a random forest model was used to determine cancer-related genes, and all genes were ranked by random forest scoring. The top 30 candidate genes were selected for uncovering their biological functions. Functional enrichment analysis revealed that the biological functions were chiefly enriched in tumor immune responses, such as cellular defense response, negative regulation of immune system process, T cell activation, neutrophil activation involved in immune response, regulation of antigen processing and presentation, and peptidyl-tyrosine autophosphorylation. Finally, the top 30 genes were screened and analyzed through literature verification. After validation, 10 genes (KLRK1, LCK, KIF20A, CD247, FASLG, CD163, ZAP70, CD8B, ZNF683, and F10) met our objective. Overall, the present research confirmed that integrated bioinformatics methods can contribute to the understanding of the molecular mechanisms and potential therapeutic targets underlying radiotherapy resistance in cervical cancer.
Affiliation(s)
- Yue Feng
- Department of Gynecological Radiotherapy, Harbin Medical University Cancer Hospital, Harbin, China
| | - Zhao Wang
- Department of Gynecological Radiotherapy, Harbin Medical University Cancer Hospital, Harbin, China
| | - Nan Yang
- Department of Gynecological Radiotherapy, Harbin Medical University Cancer Hospital, Harbin, China
| | - Sijia Liu
- Department of Gynecological Radiotherapy, Harbin Medical University Cancer Hospital, Harbin, China
| | - Jiazhuo Yan
- Department of Gynecological Radiotherapy, Harbin Medical University Cancer Hospital, Harbin, China
| | - Jiayu Song
- Department of Gynecological Radiotherapy, Harbin Medical University Cancer Hospital, Harbin, China
| | - Shanshan Yang
- Department of Gynecological Radiotherapy, Harbin Medical University Cancer Hospital, Harbin, China
| | - Yunyan Zhang
- Department of Gynecological Radiotherapy, Harbin Medical University Cancer Hospital, Harbin, China
|
21
|
Mahapatra S, Sahu SS. Integrating Resonant Recognition Model and Stockwell Transform for Localization of Hotspots in Tubulin. IEEE Trans Nanobioscience 2021; 20:345-353. [PMID: 33950844 DOI: 10.1109/tnb.2021.3077710] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Abstract
Tubulin is a promising target for designing anti-cancer drugs. Identification of hotspots in multifunctional Tubulin protein provides insights for new drug discovery. Although machine learning techniques have shown significant results in prediction, they fail to identify the hotspots corresponding to a particular biological function. This paper presents a signal processing technique combining resonant recognition model (RRM) and Stockwell Transform (ST) for the identification of hotspots corresponding to a particular functionality. The characteristic frequency (CF) representing a specific biological function is determined using the RRM. Then the spectrum of the protein sequence is computed using ST. The CF is filtered from the ST spectrum using a time-frequency mask. The energy peaks in the filtered sequence represent the hotspots. The hotspots predicted by the proposed method are compared with the experimentally detected binding residues of Tubulin stabilizing drug Taxol and destabilizing drug Colchicine present in the Tubulin protein. Out of the 53 experimentally identified hotspots, 60% are predicted by the proposed method whereas around 20% are predicted by existing machine learning based methods. Additionally, the proposed method predicts some new hot spots, which may be investigated.
|
22
|
Xu H, Qing X, Wang Q, Li C, Lai L. Dimerization of PHGDH via the catalytic unit is essential for its enzymatic function. J Biol Chem 2021; 296:100572. [PMID: 33753166 PMCID: PMC8081924 DOI: 10.1016/j.jbc.2021.100572] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/06/2021] [Revised: 03/12/2021] [Accepted: 03/18/2021] [Indexed: 11/25/2022] Open
Abstract
Human D-3-phosphoglycerate dehydrogenase (PHGDH), a key enzyme in de novo serine biosynthesis, is amplified in various cancers and serves as a potential target for anticancer drug development. To facilitate this process, more information is needed on the basic biochemistry of this enzyme. For example, PHGDH was found to form tetramers in solution and the structure of its catalytic unit (sPHGDH) was solved as a dimer. However, how the oligomeric states affect PHGDH enzyme activity remains elusive. We studied the dependence of PHGDH enzymatic activity on its oligomeric states. We found that sPHGDH forms a mixture of monomers and dimers in solution with a dimer dissociation constant of ∼0.58 μM, with the enzyme activity depending on the dimer content. We computationally identified hotspot residues at the sPHGDH dimer interface. Single-point mutants at these sites disrupt dimer formation and abolish enzyme activity. Molecular dynamics simulations showed that dimer formation facilitates substrate binding and maintains the correct conformation required for enzyme catalysis. We further showed that the full-length PHGDH exists as a dynamic mixture of monomers, dimers, and tetramers in solution with enzyme concentration-dependent activity. Mutations that can completely disrupt the sPHGDH dimer show different abilities to interrupt the full-length PHGDH tetramer. Among them, E108A and I121A can also disrupt the oligomeric structures of the full-length PHGDH and abolish its enzyme activity. Our study indicates that disrupting the oligomeric structure of PHGDH serves as a novel strategy for PHGDH drug design and the hotspot residues identified can guide the design process.
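To give a feel for what a dimer dissociation constant of about 0.58 μM implies, the following sketch computes the fraction of subunits found in dimers at a given total protein concentration. It assumes a simple two-state equilibrium 2M ⇌ D with Kd = [M]²/[D] and concentrations expressed per subunit; the abstract does not state which convention the authors used, so the function name and these definitions are assumptions made for illustration.

```python
import math

def dimer_fraction(total_uM, kd_uM=0.58):
    """Fraction of subunits in dimers for 2M <-> D with Kd = [M]^2 / [D]."""
    # Mass balance: total = [M] + 2[D] = [M] + 2[M]^2 / Kd  (quadratic in [M])
    monomer = kd_uM * (math.sqrt(1.0 + 8.0 * total_uM / kd_uM) - 1.0) / 4.0
    dimer = monomer ** 2 / kd_uM
    return 2.0 * dimer / total_uM

for c in (0.058, 0.58, 5.8):
    print(f"{c:6.3f} uM total -> {dimer_fraction(c):.2f} in dimers")
```

Under these assumptions, half of the subunits are dimeric when the total concentration equals Kd, and 80% are dimeric at ten times Kd, which is consistent with an activity that rises with enzyme concentration when only the dimer is active.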
Affiliation(s)
- Hanyu Xu
- BNLMS, Peking-Tsinghua Center for Life Sciences at College of Chemistry and Molecular Engineering, Peking University, Beijing, China
| | - Xiaoyu Qing
- BNLMS, Peking-Tsinghua Center for Life Sciences at College of Chemistry and Molecular Engineering, Peking University, Beijing, China
| | - Qian Wang
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Chunmei Li
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Luhua Lai
- BNLMS, Peking-Tsinghua Center for Life Sciences at College of Chemistry and Molecular Engineering, Peking University, Beijing, China; Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China.
|
23
|
Preto AJ, Matos-Filipe P, de Almeida JG, Mourão J, Moreira IS. Predicting Hot Spots Using a Deep Neural Network Approach. Methods Mol Biol 2021; 2190:267-288. [PMID: 32804371 DOI: 10.1007/978-1-0716-0826-5_13] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Targeting protein-protein interactions is a challenging and crucial task in the drug discovery process. A good starting point for rational drug design is the identification of hot spots (HS) at protein-protein interfaces, typically conserved residues that contribute most significantly to the binding. In this chapter, we describe step by step an in-house pipeline for HS prediction that uses only sequence-based features from the well-known SpotOn dataset of soluble proteins (Moreira et al., Sci Rep 7:8007, 2017), through the implementation of a deep neural network. The presented pipeline is divided into three steps: (1) feature extraction, (2) deep learning classification, and (3) model evaluation. We present all the available resources, including code snippets, the main dataset, and the free and open-source modules/packages necessary for full replication of the protocol. The users should be able to develop an HS prediction model with accuracy, precision, recall, and AUROC of 0.96, 0.93, 0.91, and 0.86, respectively.
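The model-evaluation step mentioned in this abstract relies on standard confusion-matrix scores. The sketch below shows how accuracy, precision, and recall (plus a few related scores) are derived from true/false positive and negative counts. It is a generic illustration, not code from the chapter; the function name and the example counts are made up, and AUROC is left out because it needs ranked prediction scores rather than counts.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification scores from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient: balanced even for skewed classes
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts: 45 true hot spots and 55 non-hot spots.
scores = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Reporting several of these together matters for hot spot data, where one class is typically much rarer than the other and accuracy alone can look good for a model that rarely predicts the rare class.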
Affiliation(s)
- António J Preto
- Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Institute for Interdisciplinary Research, University of Coimbra, Coimbra, Portugal
| | - Pedro Matos-Filipe
- Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
| | - José G de Almeida
- Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
| | - Joana Mourão
- Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Institute for Interdisciplinary Research, University of Coimbra, Coimbra, Portugal
| | - Irina S Moreira
- Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal.
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal.
- University of Coimbra, Department of Life Sciences, University of Coimbra, Coimbra, Portugal.
|
24
|
Heat Loss Coefficient Estimation Applied to Existing Buildings through Machine Learning Models. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10248968] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/16/2022]
Abstract
The Heat Loss Coefficient (HLC) characterizes the envelope efficiency of a building under in-use conditions, and it represents one of the main causes of the performance gap between the building design and its real operation. Accurate estimations of the HLC contribute to optimizing the energy consumption of a building. In this context, the application of black-box models in building energy analysis has been consolidated in recent years. The aim of this paper is to estimate the HLC of an existing building through the prediction of building thermal demands using a methodology based on Machine Learning (ML) models. Specifically, three different ML methods are applied to a public library in the northwest of Spain and compared: eXtreme Gradient Boosting (XGBoost), Support Vector Regression (SVR), and Multi-Layer Perceptron (MLP) neural network. Furthermore, the accuracy of the results is measured, on the one hand, using both CV(RMSE) and Normalized Mean Bias Error (NMBE), as advised by ASHRAE, for thermal demand predictions and, on the other, an absolute error for HLC estimations. The main novelty of this paper lies in the estimation of the HLC of a building from thermal demand predictions, which reduces the requirement for monitoring. The results show that the most accurate model is capable of estimating the HLC of the building with an absolute error between 4 and 6%.
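The two accuracy measures named in this abstract can be written out in a few lines. The sketch below follows the commonly cited forms, with the error sums divided by (n - p) and normalized by the mean of the measured values. The choice p = 1, the sign convention (measured minus predicted), and the sample values are assumptions for illustration and are not taken from the paper.

```python
import math

def cv_rmse(measured, predicted, p=1):
    """Coefficient of variation of the RMSE, in percent."""
    n = len(measured)
    mean = sum(measured) / n
    sse = sum((m - s) ** 2 for m, s in zip(measured, predicted))
    return 100.0 * math.sqrt(sse / (n - p)) / mean

def nmbe(measured, predicted, p=1):
    """Normalized mean bias error, in percent (positive = under-prediction)."""
    n = len(measured)
    mean = sum(measured) / n
    bias = sum(m - s for m, s in zip(measured, predicted))
    return 100.0 * bias / ((n - p) * mean)

# Made-up thermal demand values (e.g. kWh) for four time steps.
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]
print(f"CV(RMSE) = {cv_rmse(measured, predicted):.2f}%")
print(f"NMBE     = {nmbe(measured, predicted):.2f}%")
```

The toy data show why both measures are reported together: the over- and under-predictions cancel, so the NMBE is zero even though the CV(RMSE) is clearly non-zero.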
|
25
|
Agbaria AH, Beck G, Lapidot I, Rich DH, Kapelushnik J, Mordechai S, Salman A, Huleihel M. Diagnosis of inaccessible infections using infrared microscopy of white blood cells and machine learning algorithms. Analyst 2020; 145:6955-6967. [PMID: 32852502 DOI: 10.1039/d0an00752h] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/14/2022]
Abstract
Physicians subjectively diagnose the etiology of inaccessible infections, where sampling is not feasible (such as pneumonia, sinusitis, cholecystitis, and peritonitis), as bacterial or viral. The diagnosis is based on their experience with medical markers such as blood counts and clinical symptoms, since it is harder to obtain swabs and reliable laboratory results in most cases. In this study, infrared spectroscopy with machine learning algorithms was used for the rapid and objective diagnosis of the etiology of inaccessible infections, which also enables an assessment of the error in physicians' subjective diagnoses of these infections. Our approach allows for diagnosis of the etiology of both accessible and inaccessible infections based on an analysis of the innate immune system response through infrared spectroscopy measurements of white blood cell (WBC) samples. In the present study, we examined 343 individuals: 113 controls, 89 with inaccessible bacterial infections, 54 with accessible bacterial infections, 60 with inaccessible viral infections, and 27 with accessible viral infections. The results show that it is possible to differentiate between controls and infections (combined bacterial and viral) with 95% accuracy, and to diagnose the etiology of accessible infections as bacterial or viral with >94% sensitivity and >90% specificity within one hour after the collection of the blood sample, with an error rate <6%. Based on our approach, the error rate of the physicians' subjective diagnosis of the etiology of inaccessible infections was found to be >23%.
Affiliation(s)
- Adam H Agbaria
- Department of Physics, Ben-Gurion University, Beer-Sheva 84105, Israel
|
26
|
Guo Z, Wang P, Liu Z, Zhao Y. Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction. Front Bioeng Biotechnol 2020; 8:584807. [PMID: 33195148 PMCID: PMC7642589 DOI: 10.3389/fbioe.2020.584807] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 07/18/2020] [Accepted: 09/11/2020] [Indexed: 01/19/2023] Open
Abstract
Thermophilicity is a very important property of proteins, as it sometimes determines denaturation and cell death. Thus, methods for predicting thermophilic proteins and non-thermophilic proteins are of interest and can contribute to the design and engineering of proteins. In this article, we describe the use of feature dimension reduction technology and LIBSVM to identify thermophilic proteins. The highest accuracy obtained by cross-validation was 96.02% with 119 parameters. When using only 16 features, we obtained an accuracy of 93.33%. We discuss the importance of the different characteristics in identification and report a comparison of the performance of support vector machine to that of other methods.
Affiliation(s)
- Zifan Guo
- School of Aeronautics and Astronautics, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Pingping Wang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Zhendong Liu
- School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China
- Yuming Zhao
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
|
27
|
Wu R, Prabhu R, Ozkan A, Sitharam M. Rapid prediction of crucial hotspot interactions for icosahedral viral capsid self-assembly by energy landscape atlasing validated by mutagenesis. PLoS Comput Biol 2020; 16:e1008357. [PMID: 33079933 PMCID: PMC7598928 DOI: 10.1371/journal.pcbi.1008357] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2020] [Revised: 10/30/2020] [Accepted: 09/22/2020] [Indexed: 02/07/2023] Open
Abstract
Icosahedral viruses are under a micrometer in diameter, their infectious genome encapsulated by a shell assembled by a multiscale process, starting from an integer multiple of 60 viral capsid or coat protein (VP) monomers. We predict and validate inter-atomic hotspot interactions between VP monomers that are important for the assembly of 3 types of icosahedral viral capsids: Adeno-Associated Virus serotype 2 (AAV2) and Minute Virus of Mice (MVM), both T = 1 single-stranded DNA viruses, and Brome Mosaic Virus (BMV), a T = 3 single-stranded RNA virus. Experimental validation is by in-vitro, site-directed mutagenesis data found in the literature. We combine ab-initio predictions at two scales: at the interface-scale, we predict the importance (cruciality) of an interaction for successful subassembly across each interface between symmetry-related VP monomers; and at the capsid-scale, we predict the cruciality of an interface for successful capsid assembly. At the interface-scale, we measure cruciality by changes in the capsid free-energy landscape partition function when an interaction is removed. The partition function computation uses atlases of interface subassembly landscapes, rapidly generated by a novel geometric method and curated open-source software EASAL (efficient atlasing and search of assembly landscapes). At the capsid-scale, cruciality of an interface for successful assembly of the capsid is based on combinatorial entropy. Our study goes all the way from resource-light, multiscale computational predictions of crucial hotspot inter-atomic interactions to validation using data on the effect of site-directed mutagenesis on capsid assembly. By reliably and rapidly narrowing down target interactions (no more than 1.5 hours per interface on a laptop with an Intel Core i5-2500K @ 3.2 GHz CPU and 8 GB of RAM), our predictions can inform and reduce time-consuming in-vitro and in-vivo experiments, or more computationally intensive in-silico analyses.
Affiliation(s)
- Ruijin Wu
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, United States of America
- Rahul Prabhu
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, United States of America
- Aysegul Ozkan
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, United States of America
- Meera Sitharam
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, United States of America
|
28
|
Karle AC. Applying MAPPs Assays to Assess Drug Immunogenicity. Front Immunol 2020; 11:698. [PMID: 32373128 PMCID: PMC7186346 DOI: 10.3389/fimmu.2020.00698] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 01/22/2020] [Accepted: 03/27/2020] [Indexed: 01/08/2023] Open
Abstract
Immunogenicity against biotherapeutic proteins (BPs) and the potential outcome for the patient are difficult to predict. In vitro assays that can help to assess the immunogenic potential of BPs are not yet used routinely during drug development. MAPPs (MHC-associated peptide proteomics) is one of the assays best characterized regarding its value for assessing immunogenicity potential. This review focuses on recent studies that have employed human HLA class II-MAPPs assays to rank biotherapeutic candidates, investigate clinical immunogenicity, and understand mechanistic root causes of immunogenicity. Advantages and challenges of the technology are discussed, as well as the different areas of application.
Affiliation(s)
- Anette C Karle
- Novartis Institute for Biomedical Research, Novartis Pharma AG, Basel, Switzerland
|
29
|
Salman A, Lapidot I, Shufan E, Agbaria AH, Porat Katz BS, Mordechai S. Potential of infrared microscopy to differentiate between dementia with Lewy bodies and Alzheimer's diseases using peripheral blood samples and machine learning algorithms. JOURNAL OF BIOMEDICAL OPTICS 2020; 25:1-15. [PMID: 32329265 PMCID: PMC7177186 DOI: 10.1117/1.jbo.25.4.046501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2020] [Accepted: 04/09/2020] [Indexed: 06/11/2023]
Abstract
SIGNIFICANCE: Accurate and objective identification of Alzheimer's disease (AD) and dementia with Lewy bodies (DLB) is of major clinical importance due to the current lack of low-cost and noninvasive diagnostic tools to differentiate between the two. Developing an approach for such identification can have a great impact in the field of dementia diseases as it would offer physicians a routine objective test to support their diagnoses. The problem is especially acute because these two dementias have some common symptoms and characteristics, which can lead to misdiagnosis of DLB as AD and vice versa, mainly at their early stages. AIM: The aim is to evaluate the potential of mid-infrared (IR) spectroscopy in tandem with machine learning algorithms as a sensitive method to detect minor changes in the biochemical structures that accompany the development of AD and DLB based on a simple peripheral blood test, thus improving the diagnostic accuracy of differentiation between DLB and AD. APPROACH: IR microspectroscopy was used to examine white blood cells and plasma isolated from 56 individuals: 26 controls, 20 AD patients, and 10 DLB patients. The measured spectra were analyzed via machine learning. RESULTS: Our encouraging results show that it is possible to differentiate between dementia (AD and DLB) and controls with an ∼86% success rate and between DLB and AD patients with a success rate of better than 93%. CONCLUSIONS: The success of this method makes it possible to suggest a new, simple, and powerful tool for the mental health professional, with the potential to improve the reliability and objectivity of diagnoses of both AD and DLB.
Affiliation(s)
- Ahmad Salman
- Shamoon College of Engineering, Department of Physics, Beer-Sheva, Israel
- Itshak Lapidot
- Afeka Tel-Aviv Academic College of Engineering, Afeka Center for Language Processing, Department of Electrical and Electronics Engineering, Tel-Aviv, Israel
- Elad Shufan
- Shamoon College of Engineering, Department of Physics, Beer-Sheva, Israel
- Adam H. Agbaria
- Ben-Gurion University of the Negev, Department of Physics, Faculty of Natural Sciences, Beer-Sheva, Israel
- Bat-Sheva Porat Katz
- The Hebrew University of Jerusalem, School of Nutritional Sciences, The Robert H. Smith Faculty of Agriculture, Food, and Environment, Rehovot, Israel
- Kaplan Medical Center, Rehovot, Israel
- Shaul Mordechai
- Ben-Gurion University of the Negev, Department of Physics, Faculty of Natural Sciences, Beer-Sheva, Israel
|
30
|
Su M, Lyles JT, Petit III RA, Peterson J, Hargita M, Tang H, Solis-Lemus C, Quave CL, Read TD. Genomic analysis of variability in Delta-toxin levels between Staphylococcus aureus strains. PeerJ 2020; 8:e8717. [PMID: 32231873 PMCID: PMC7100594 DOI: 10.7717/peerj.8717] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 10/08/2019] [Accepted: 02/10/2020] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND: The delta-toxin (δ-toxin) of Staphylococcus aureus is the only hemolysin shown to cause mast cell degranulation and is linked to atopic dermatitis, a chronic inflammatory skin disease. We sought to characterize variation in δ-toxin production across S. aureus strains and identify genetic loci potentially associated with differences between strains. METHODS: A set of 124 S. aureus strains was genome-sequenced and δ-toxin levels in stationary phase supernatants determined by high performance liquid chromatography (HPLC). SNPs and kmers were associated with differences in toxin production using four genome-wide association study (GWAS) methods. Transposon mutations in candidate genes were tested for their δ-toxin levels. We constructed XGBoost models to predict toxin production based on genetic loci discovered to be potentially associated with the phenotype. RESULTS: The S. aureus strain set encompassed 40 sequence types (STs) in 23 clonal complexes (CCs). δ-toxin production ranged from barely detectable levels to >90,000 units, with a median of >8,000 units. CC30 had significantly lower levels of toxin production than average while CC45 and CC121 were higher. MSSA (methicillin sensitive) strains had higher δ-toxin production than MRSA (methicillin resistant) strains. Through multiple GWAS approaches, 45 genes were found to be potentially associated with toxicity. Machine learning models using loci discovered through GWAS as features were able to predict δ-toxin production (as a high/low binary phenotype) with a precision of .875 and specificity of .990 but recall of .333. We discovered that mutants in the carA gene, encoding the small chain of carbamoyl phosphate synthase, completely abolished toxin production and toxicity in Caenorhabditis elegans. CONCLUSIONS: The amount of stationary phase production of the toxin is a strain-specific phenotype likely affected by a complex interaction of a number of genes with different levels of effect. We discovered new candidate genes that potentially play a role in modulating production. We report for the first time that the product of the carA gene is necessary for δ-toxin production in USA300. This work lays a foundation for future work on understanding toxin regulation in S. aureus and prediction of phenotypes from genomic sequences.
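The precision, specificity, and recall figures quoted in this abstract are all ratios over a binary confusion matrix. The following is a minimal sketch of how such values are computed. The counts are hypothetical, chosen only so that the ratios match the reported numbers; they are not taken from the paper, and the function name `binary_metrics` is an illustrative assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (precision, recall, specificity) from confusion-matrix counts."""
    precision = tp / (tp + fp)    # fraction of predicted "high" that are truly high
    recall = tp / (tp + fn)       # fraction of truly high strains that were found
    specificity = tn / (tn + fp)  # fraction of truly low strains labelled low
    return precision, recall, specificity


# Hypothetical counts (NOT the study's data), picked to reproduce the ratios.
precision, recall, specificity = binary_metrics(tp=7, fp=1, fn=14, tn=99)
print(f"precision={precision:.3f} recall={recall:.3f} specificity={specificity:.3f}")
# prints: precision=0.875 recall=0.333 specificity=0.990
```

Under these assumed counts, high precision with low recall means the model rarely mislabels a low producer as high but misses most high producers.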
Affiliation(s)
- Michelle Su
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
- James T. Lyles
- Center for the Study of Human Health, College of Arts and Sciences, Emory University, Atlanta, GA, United States of America
- Robert A. Petit III
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
- Jessica Peterson
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
- Michelle Hargita
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
- Huaqiao Tang
- Center for the Study of Human Health, College of Arts and Sciences, Emory University, Atlanta, GA, United States of America
- Claudia Solis-Lemus
- Department of Human Genetics, School of Medicine, Emory University, Atlanta, GA, United States of America
- Cassandra L. Quave
- Center for the Study of Human Health, College of Arts and Sciences, Emory University, Atlanta, GA, United States of America
- Department of Dermatology, School of Medicine, Emory University, Atlanta, GA, United States of America
- Timothy D. Read
- Division of Infectious Diseases, Department of Medicine, School of Medicine, Emory University, Atlanta, GA, United States of America
- Department of Human Genetics, School of Medicine, Emory University, Atlanta, GA, United States of America
|
31
|
Zeng F, Fang G, Yao L. A Deep Neural Network for Identifying DNA N4-Methylcytosine Sites. Front Genet 2020; 11:209. [PMID: 32211035 PMCID: PMC7067889 DOI: 10.3389/fgene.2020.00209] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/22/2019] [Accepted: 02/21/2020] [Indexed: 11/25/2022] Open
Abstract
Motivation: N4-methylcytosine (4mC) plays an important role in host defense and transcriptional regulation. Accurate identification of 4mC sites provides a more comprehensive understanding of its biological effects. At present, traditional machine learning algorithms are used for 4mC site prediction, but their complexity is relatively high, which makes them unsuitable for processing large data sets, and their prediction accuracy needs to be improved. Therefore, it is necessary to develop a new and effective method to accurately identify 4mC sites. Results: In this work, we collected a large number of 4mC sites and non-4mC sites of Caenorhabditis elegans (C. elegans) from the latest MethSMRT website, which greatly expanded the C. elegans dataset, and developed a hybrid deep neural network framework named 4mcDeep-CBI to identify 4mC sites. To obtain high-dimensional information about the features, we input the preliminary extracted features into a Convolutional Neural Network (CNN) and a Bidirectional Long Short Term Memory network (BLSTM) to generate advanced features. Taking the advanced features as algorithm input, we propose an integrated algorithm to improve feature representation. Experimental results on a large new dataset show that the proposed predictor achieves generally better performance in identifying 4mC sites than the state-of-the-art predictor. Notably, this is the first study to identify 4mC sites using a deep neural network. Moreover, our model runs much faster than the state-of-the-art predictor.
Affiliation(s)
- Feng Zeng
- School of Computer Science and Engineering, Central South University, Changsha, China
- Guanyun Fang
- School of Computer Science and Engineering, Central South University, Changsha, China
- Lan Yao
- College of Mathematics and Econometrics, Hunan University, Changsha, China
|
32
|
Wang HT, Xiao FH, Li GH, Kong QP. Identification of DNA N6-methyladenine sites by integration of sequence features. Epigenetics Chromatin 2020; 13:8. [PMID: 32093759 PMCID: PMC7038560 DOI: 10.1186/s13072-020-00330-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/14/2019] [Accepted: 02/03/2020] [Indexed: 02/21/2023] Open
Abstract
Background: An increasing number of nucleic acid modifications have been profiled with the development of sequencing technologies. DNA N6-methyladenine (6mA), which is a prevalent epigenetic modification, plays important roles in a series of biological processes. So far, identification of DNA 6mA relies primarily on time-consuming and expensive experimental approaches. However, in silico methods can be implemented to conduct preliminary screening to save experimental resources and time, especially given the rapid accumulation of sequencing data. Results: In this study, we constructed a 6mA predictor, p6mA, from a series of sequence-based features, including physicochemical properties, position-specific triple-nucleotide propensity (PSTNP), and electron–ion interaction pseudopotential (EIIP). We performed maximum relevance maximum distance (MRMD) analysis to select key features and used the Extreme Gradient Boosting (XGBoost) algorithm to build our predictor. Results demonstrated that p6mA outperformed other existing predictors on different datasets. Conclusions: p6mA can predict the methylation status of DNA adenines using only sequence files. It may be used as a tool to help study 6mA distribution patterns. Users can download it from https://github.com/Konglab404/p6mA.
Affiliation(s)
- Hao-Tian Wang
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China; Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China; Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing, 100049, China
- Fu-Hui Xiao
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China; Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
- Gong-Hua Li
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China; Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
- Qing-Peng Kong
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China; Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China; KIZ/CUHK Joint Laboratory of Bioresources and Molecular Research in Common Diseases, Kunming, 650223, China
|
33
|
Agbaria AH, Rosen GB, Lapidot I, Rich DH, Mordechai S, Kapelushnik J, Huleihel M, Salman A. Rapid diagnosis of infection etiology in febrile pediatric oncology patients using infrared spectroscopy of leukocytes. JOURNAL OF BIOPHOTONICS 2020; 13:e201900215. [PMID: 31566906 DOI: 10.1002/jbio.201900215] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/31/2019] [Revised: 08/27/2019] [Accepted: 09/15/2019] [Indexed: 06/10/2023]
Abstract
Rapid diagnosis of the etiology of infection is highly important for effective treatment of infected patients. Bacterial and viral infections are serious diseases that can cause death in many cases. The human immune system deals with many viral and bacterial infections that cause no symptoms and pass quietly without treatment. However, oncology patients undergoing chemotherapy have a very weak immune system caused by leukopenia, and even a minor pathogen infection threatens their lives. For this reason, physicians tend to immediately prescribe several types of antibiotics for febrile pediatric oncology patients (FPOPs). Uncontrolled use of antibiotics is one of the major contributors to the development of resistant bacteria. Therefore, for oncology patients, a rapid and objective diagnosis of the etiology of the infection is extremely critical. Current identification methods are time-consuming (>24 h). In this study, the potential of mid-infrared spectroscopy in tandem with machine learning algorithms is evaluated for rapid and objective diagnosis of the etiology of infections in FPOPs using simple peripheral blood samples. Our results show that infrared spectroscopy enables the diagnosis of the etiology of infection as bacterial or viral within 70 minutes after the collection of the blood sample with 93% sensitivity and 88% specificity.
Affiliation(s)
- Adam H Agbaria
- Department of Physics, Ben-Gurion University, Beer-Sheva, Israel
- Guy Beck Rosen
- Department of Hematology, Soroka University Medical Center, Beer-Sheva, Israel
- Itshak Lapidot
- Department of Electrical and Electronics Engineering, ACLP-Afeka Center for Language Processing, Afeka Tel-Aviv Academic College of Engineering, Tel-Aviv, Israel
- Daniel H Rich
- Department of Physics, Ben-Gurion University, Beer-Sheva, Israel
- Shaul Mordechai
- Department of Physics, Ben-Gurion University, Beer-Sheva, Israel
- Joseph Kapelushnik
- Department of Hematology, Soroka University Medical Center, Beer-Sheva, Israel
- Mahmoud Huleihel
- Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Ahmad Salman
- Department of Physics, SCE-Sami Shamoon College of Engineering, Beer-Sheva, Israel
|
34
|
Huang Q, Zhang J, Wei L, Guo F, Zou Q. 6mA-RicePred: A Method for Identifying DNA N6-Methyladenine Sites in the Rice Genome Based on Feature Fusion. FRONTIERS IN PLANT SCIENCE 2020; 11:4. [PMID: 32076430 PMCID: PMC7006724 DOI: 10.3389/fpls.2020.00004] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 12/04/2019] [Accepted: 01/06/2020] [Indexed: 06/01/2023]
Abstract
MOTIVATION: The biological function of N6-methyladenine DNA (6mA) in plants is largely unknown. Rice is one of the most important crops worldwide and is a model species for molecular and genetic studies. There are few methods for 6mA site recognition in the rice genome, and an effective computational method is needed. RESULTS: In this paper, we propose a new computational method called 6mA-Pred to identify 6mA sites in the rice genome. 6mA-Pred employs a feature fusion method to combine advantageous features from other methods and thus obtain a new feature to identify 6mA sites. This method achieved an accuracy of 87.27% in the identification of 6mA sites with 10-fold cross-validation and achieved an accuracy of 85.6% in independent test sets.
Affiliation(s)
- Qianfei Huang
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
35
|
PreDBA: A heterogeneous ensemble approach for predicting protein-DNA binding affinity. Sci Rep 2020; 10:1278. [PMID: 31992738 PMCID: PMC6987227 DOI: 10.1038/s41598-020-57778-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 01/11/2019] [Accepted: 01/06/2020] [Indexed: 11/17/2022] Open
Abstract
The interaction between protein and DNA plays an essential role in various critical natural processes, such as DNA replication, transcription, splicing, and repair. Studying the binding affinity of proteins to DNA helps to understand the recognition mechanism of protein-DNA complexes. Since experimentally measured protein-DNA binding affinity data still have many limitations, accurate and reliable computational methods are required. We therefore propose a computational approach, PreDBA, that predicts protein-DNA binding affinity effectively using heterogeneous ensemble models. One hundred protein-DNA complexes were manually collected from the related literature as a data set for protein-DNA binding affinity. Then, 52 sequence and structural features were obtained, and the correlation between these 52 features and protein-DNA binding affinity was calculated. Furthermore, we found that the protein-DNA binding affinity is affected by the DNA structure of the complex. We classify all protein-DNA complexes into five classes based on the structure of the DNA bound in the complex. In each class, a stacked heterogeneous ensemble model is constructed based on the obtained features. Finally, based on the binding affinity data set, we used leave-one-out cross-validation to evaluate the proposed method comprehensively. In the five classes, the Pearson correlation coefficients of our method range from 0.735 to 0.926. We have demonstrated the advantages of the proposed method compared to other machine learning methods and currently existing protein-DNA binding affinity prediction approaches.
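The evaluation protocol named in this abstract, leave-one-out cross-validation scored by the Pearson correlation coefficient, can be sketched as follows. This is only an illustration under stated assumptions: a one-feature least-squares line stands in for the stacked heterogeneous ensemble, the toy values are invented, and the helper names (`pearson`, `fit_line`, `loo_predictions`) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept (stand-in model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def loo_predictions(x, y):
    """Predict each sample with a model trained on all the other samples."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return preds


# Invented toy data: one feature value and one affinity value per complex.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
affinity = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

preds = loo_predictions(feature, affinity)
r = pearson(affinity, preds)
print(f"leave-one-out Pearson r = {r:.3f}")
```

Because every prediction comes from a model that never saw the held-out sample, the resulting correlation is a less optimistic estimate than one computed on the training data, which matters for a data set of only about one hundred complexes.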
|
36
|
Basith S, Manavalan B, Hwan Shin T, Lee G. Machine intelligence in peptide therapeutics: A next‐generation tool for rapid disease screening. Med Res Rev 2020; 40:1276-1314. [DOI: 10.1002/med.21658] [Citation(s) in RCA: 139] [Impact Index Per Article: 34.8] [Received: 10/18/2019] [Revised: 11/26/2019] [Accepted: 12/16/2019] [Indexed: 12/12/2022]
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Tae Hwan Shin
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
|
37
|
Barreto CAV, Baptista SJ, Preto AJ, Matos-Filipe P, Mourão J, Melo R, Moreira I. Prediction and targeting of GPCR oligomer interfaces. PROGRESS IN MOLECULAR BIOLOGY AND TRANSLATIONAL SCIENCE 2020; 169:105-149. [PMID: 31952684 DOI: 10.1016/bs.pmbts.2019.11.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/18/2022]
Abstract
GPCR oligomerization has emerged as a hot topic in the GPCR field in recent years. Receptors that are part of these oligomers can influence each other's function, although it is not yet entirely understood how these interactions work. The existence of such a highly complex network of interactions between GPCRs generates the possibility of alternative targets for new therapeutic approaches. However, challenges still exist in the characterization of these complexes, especially at the interface level. Different experimental approaches, such as FRET or BRET, are usually combined to study GPCR oligomer interactions. Computational methods have been applied as a useful tool for retrieving information from GPCR sequences and the few X-ray-resolved oligomeric structures that are accessible, as well as for predicting new and trustworthy GPCR oligomeric interfaces. Machine-learning (ML) approaches have recently helped with some hindrances of other methods. By joining and evaluating multiple structure-, sequence- and co-evolution-based features in the same algorithm, it is possible to dilute the issues of particular structures and residues that arise from the experimental methodology into all-encompassing algorithms capable of accurately predicting GPCR-GPCR interfaces. All these methods, used as a single or a combined approach, provide useful information about GPCR oligomerization and its role in GPCR function and dynamics. Altogether, we present experimental, computational and machine-learning methods used to study oligomer interfaces, as well as strategies that have been used to target these dynamic complexes.
Affiliation(s)
- Carlos A V Barreto
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Salete J Baptista
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal; Centro de Ciências e Tecnologias Nucleares, Instituto Superior Técnico, Universidade de Lisboa, CTN, LRS, Portugal
- António José Preto
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Pedro Matos-Filipe
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Joana Mourão
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal; Institute for Interdisciplinary Research, University of Coimbra, Coimbra, Portugal
- Rita Melo
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal; Centro de Ciências e Tecnologias Nucleares, Instituto Superior Técnico, Universidade de Lisboa, CTN, LRS, Portugal
- Irina Moreira
- Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal; Science and Technology Faculty, University of Coimbra, Coimbra, Portugal
|
38
|
Deng L, Zhong G, Liu C, Luo J, Liu H. MADOKA: an ultra-fast approach for large-scale protein structure similarity searching. BMC Bioinformatics 2019; 20:662. [PMID: 31870277 PMCID: PMC6929402 DOI: 10.1186/s12859-019-3235-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/11/2019] [Accepted: 11/14/2019] [Indexed: 01/22/2023] Open
Abstract
Background: Protein comparative analysis and similarity searches play essential roles in structural bioinformatics. Several algorithms for protein structure alignment have been developed in recent years. However, given the rapid growth of protein structure data, improving overall comparison performance and running efficiency with massive sequences is still challenging. Results: Here, we propose MADOKA, an ultra-fast approach for massive structural neighbor searching using a novel two-phase algorithm. Initially, we apply a fast alignment between pairwise structures. Then, we employ a score to select the more similar pairs and carry out a more accurate fragment-based residue-level alignment on them. MADOKA performs about 6–100 times faster than existing methods, including TM-align and SAL, in massive alignments. Moreover, the quality of structural alignment of MADOKA is better than that of the existing algorithms in terms of TM-score and number of aligned residues. We also developed a web server to search structural neighbors in the PDB database (about 360,000 protein chains in total), with additional features such as 3D structure alignment visualization. The MADOKA web server is freely available at: http://madoka.denglab.org/ Conclusions: MADOKA is an efficient approach to search for protein structure similarity. In addition, we provide a parallel implementation of MADOKA which exploits the power of multi-core CPUs.
Affiliation(s)
- Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Guolun Zhong
- School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Chenzhe Liu
- School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Judong Luo
- Department of Radiation Oncology, the Affiliated Changzhou No.2 People's Hospital of Nanjing Medical University, Changzhou, China
- Hui Liu
- Lab of Information Management, Changzhou University, Changzhou, 213164, China

39
Zhao Z, Xu Y, Zhao Y. SXGBsite: Prediction of Protein-Ligand Binding Sites Using Sequence Information and Extreme Gradient Boosting. Genes (Basel) 2019; 10:E965. [PMID: 31771119 PMCID: PMC6947422 DOI: 10.3390/genes10120965] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/05/2019] [Revised: 10/19/2019] [Accepted: 11/19/2019] [Indexed: 12/13/2022] Open
Abstract
The prediction of protein-ligand binding sites is important in drug discovery and drug design. Computational methods for protein-ligand binding site prediction are inexpensive and fast compared with experimental methods. This paper proposes a new computational method, SXGBsite, which combines the synthetic minority over-sampling technique (SMOTE) with Extreme Gradient Boosting (XGBoost). SXGBsite uses the position-specific scoring matrix discrete cosine transform (PSSM-DCT) and predicted solvent accessibility (PSA) to extract features containing sequence information. A new balanced dataset was generated by SMOTE to improve classifier performance, and a prediction model was constructed using XGBoost. Parallel computing and regularization techniques enabled high-quality and fast predictions and mitigated the overfitting caused by SMOTE. An evaluation on independent test sets for 12 different types of ligand binding sites showed that SXGBsite performs similarly to the existing methods on eight of the independent test sets with a faster computation time. SXGBsite may be applied as a complement to biological experiments.
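The oversampling step named in the abstract above can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. This is not the authors' implementation; the function name `smote_sketch`, the toy points, and the choice of `k` are assumptions made only for illustration.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority neighbors.

    Hypothetical helper for illustration only, not the SXGBsite code.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbor = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbor)))
    return synthetic


# Toy minority class: four points on the unit square (made-up data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority points, it stays inside the region spanned by the minority class. The same property means many synthetic points can cluster around a few real ones, which is one way oversampling may encourage the overfitting that the abstract says regularization mitigated.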
Affiliation(s)
- Yonghong Xu
- School of Electrical Engineering, Yanshan University, Qinhuangdao 066004, China

40
Schaduangrat N, Nantasenamat C, Prachayasittikul V, Shoombuatong W. Meta-iAVP: A Sequence-Based Meta-Predictor for Improving the Prediction of Antiviral Peptides Using Effective Feature Representation. Int J Mol Sci 2019; 20:ijms20225743. [PMID: 31731751 PMCID: PMC6888698 DOI: 10.3390/ijms20225743] [Citation(s) in RCA: 77] [Impact Index Per Article: 15.4] [Received: 10/24/2019] [Revised: 11/07/2019] [Accepted: 11/13/2019] [Indexed: 12/31/2022] Open
Abstract
In spite of the large-scale production and widespread distribution of vaccines and antiviral drugs, viral infections remain a prominent cause of human disease. Recently, antiviral peptides (AVPs) have emerged as influential antiviral agents due to their extraordinary advantages. With the avalanche of newly found peptide sequences in the post-genomic era, there is a great demand for a sequence-based predictor for the timely identification of AVPs, as this information is very useful for both basic research and drug development. In this study, we propose a novel sequence-based meta-predictor with an effective feature representation, called Meta-iAVP, for the accurate prediction of AVPs from given peptide sequences. Herein, the effective feature representation was extracted from a set of prediction scores derived from various machine learning algorithms and types of features. To the best of our knowledge, the model proposed herein represents the first meta-based approach for the prediction of AVPs. An overall accuracy and Matthews correlation coefficient of 95.20% and 0.90, respectively, were achieved on the independent test set of an objective benchmark dataset. Comparative analysis suggested that Meta-iAVP was superior to existing methods and therefore represents a useful tool for AVP prediction. Finally, in an effort to facilitate high-throughput prediction of AVPs, the model was deployed as the Meta-iAVP web server and is made freely available online at http://codes.bio/meta-iavp/ where users can submit query peptide sequences to determine the likelihood that these peptides are AVPs.
Affiliation(s)
- Nalini Schaduangrat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Virapong Prachayasittikul
- Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Correspondence: Tel.: +66-2441-4371 (ext. 2715)

41
Pérez-Rodríguez M, Dirchwolf PM, Silva TV, Villafañe RN, Neto JAG, Pellerano RG, Ferreira EC. Brown rice authenticity evaluation by spark discharge-laser-induced breakdown spectroscopy. Food Chem 2019; 297:124960. [PMID: 31253301 DOI: 10.1016/j.foodchem.2019.124960] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/29/2019] [Revised: 05/17/2019] [Accepted: 06/07/2019] [Indexed: 01/15/2023]
Abstract
Rice is the most consumed food worldwide; therefore, its protected designation of origin (PDO) is very useful. Laser-induced breakdown spectroscopy (LIBS) is an interesting analytical technique for PDO certification, since it provides fast multielemental analysis requiring minimal sample treatment. In this work, LIBS spectral data from rice analysis were evaluated for PDO certification of Argentine brown rice. Samples from two PDOs were analyzed by LIBS coupled to spark discharge. The spectral data were selected by extreme gradient boosting (XGBoost), an algorithm currently used in machine learning but rarely applied to chemical problems. Emission lines of C, Ca, Fe, Mg and Na were selected, and the best classification performance was obtained using the k-nearest neighbor (k-NN) algorithm. The developed method provided 84% accuracy, 100% sensitivity and 78% specificity in the classification of test samples. Furthermore, it is simple, clean and can be easily applied for rice certification.
Affiliation(s)
- Michael Pérez-Rodríguez
- Institute of Basic and Applied Chemistry of the Northeast of Argentina (IQUIBA-NEA), National Scientific and Technical Research Council (CONICET), Faculty of Exact and Natural Science and Surveying, National University of the Northeast - UNNE, Av. Libertad 5470, 3400 Corrientes, Argentina
- Pamela Maia Dirchwolf
- Faculty of Agricultural Sciences, UNNE, Sgto. Cabral 2131, 3400 Corrientes, Argentina
- Tiago Varão Silva
- São Paulo State University - UNESP, Chemistry Institute of Araraquara, R. Prof. Francisco Degni 55, 14800-900 Araraquara, SP, Brazil
- Roxana Noelia Villafañe
- Institute of Basic and Applied Chemistry of the Northeast of Argentina (IQUIBA-NEA), National Scientific and Technical Research Council (CONICET), Faculty of Exact and Natural Science and Surveying, National University of the Northeast - UNNE, Av. Libertad 5470, 3400 Corrientes, Argentina
- José Anchieta Gomes Neto
- São Paulo State University - UNESP, Chemistry Institute of Araraquara, R. Prof. Francisco Degni 55, 14800-900 Araraquara, SP, Brazil
- Roberto Gerardo Pellerano
- Institute of Basic and Applied Chemistry of the Northeast of Argentina (IQUIBA-NEA), National Scientific and Technical Research Council (CONICET), Faculty of Exact and Natural Science and Surveying, National University of the Northeast - UNNE, Av. Libertad 5470, 3400 Corrientes, Argentina
- Edilene Cristina Ferreira
- São Paulo State University - UNESP, Chemistry Institute of Araraquara, R. Prof. Francisco Degni 55, 14800-900 Araraquara, SP, Brazil

42
Lv Z, Jin S, Ding H, Zou Q. A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features. Front Bioeng Biotechnol 2019; 7:215. [PMID: 31552241 PMCID: PMC6737778 DOI: 10.3389/fbioe.2019.00215] [Citation(s) in RCA: 80] [Impact Index Per Article: 16.0] [Received: 07/23/2019] [Accepted: 08/22/2019] [Indexed: 02/01/2023] Open
Abstract
To gain insight into the malfunction of the Golgi apparatus and its relationship to various genetic and neurodegenerative diseases, the identification of sub-Golgi proteins, both cis-Golgi and trans-Golgi proteins, is of great significance. In this study, a state-of-the-art random forest sub-Golgi protein classifier, rfGPT, was developed. The rfGPT used 2-gap dipeptide and split amino acid composition for the feature vectors and was combined with the synthetic minority over-sampling technique (SMOTE) and an analysis of variance (ANOVA) feature selection method. The rfGPT was trained on a sub-Golgi protein sequence data set (137 sequences) with sequence identity less than 25%. For the optimal rfGPT classifier with 93 features, the accuracy (ACC) was 90.5%; the Matthews correlation coefficient (MCC) was 0.811; the sensitivity (Sn) was 92.6%; and the specificity (Sp) was 88.4%. The independent testing scores for the rfGPT were ACC = 90.6%; MCC = 0.696; Sn = 96.1%; and Sp = 69.2%. Although the independent testing accuracy was 4.4% lower than that for the best reported sub-Golgi classifier trained on a data set with 40% sequence identity (304 sequences), the rfGPT is currently the top sub-Golgi protein predictor utilizing feature vectors without any position-specific scoring matrix and its derivative features. Therefore, the rfGPT is a more practical tool, because no sequence alignment against tens of millions of protein sequences is required. To date, the rfGPT is the Golgi classifier with the best independent testing scores among those optimized by training on smaller benchmark data sets. Feature importance analysis indicates that the non-polar and aliphatic residues composition, the (aromatic residues) + (non-polar, aliphatic residues) dipeptide and the aromatic residues composition between the NH2-terminal and COOH-terminal of protein sequences are the three top biological features for distinguishing the sub-Golgi proteins.
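The four scores reported in the abstract above (ACC, MCC, Sn, Sp) all derive from a binary confusion matrix. The following is a minimal sketch of the standard definitions; the counts passed in are made-up numbers for illustration and are not taken from the paper, and `binary_metrics` is a hypothetical helper name.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification scores from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)   # accuracy
    sn = tp / (tp + fn)                     # sensitivity (recall on positives)
    sp = tn / (tn + fp)                     # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = binary_metrics(tp=45, fn=5, tn=40, fp=10)
```

Unlike accuracy, MCC uses all four cells of the matrix, so it drops when one class is predicted much worse than the other. That may help explain why the abstract's independent test reports a high ACC alongside a noticeably lower MCC when Sp is well below Sn.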
Affiliation(s)
- Zhibin Lv
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Shunshan Jin
- Department of Neurology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Hui Ding
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China

43
Deng L, Yang W, Liu H. PredPRBA: Prediction of Protein-RNA Binding Affinity Using Gradient Boosted Regression Trees. Front Genet 2019; 10:637. [PMID: 31428122 PMCID: PMC6688581 DOI: 10.3389/fgene.2019.00637] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/03/2019] [Accepted: 06/18/2019] [Indexed: 01/24/2023] Open
Abstract
Protein-RNA interactions play essential roles in many biological processes. Quantifying the binding affinity of protein-RNA complexes helps in understanding protein-RNA recognition mechanisms and in identifying strong binding partners. Because experimentally measured protein-RNA binding affinity data are still limited, there is a pressing demand for accurate and reliable computational approaches. In this paper, we propose a computational approach, PredPRBA, which can effectively predict protein-RNA binding affinity using gradient boosted regression trees. We build a dataset of protein-RNA binding affinity that includes 103 protein-RNA complex structures manually collected from the related literature. Then, we generate 37 kinds of sequence and structural features and explore the relationship between the features and protein-RNA binding affinity. We find that the binding affinity mainly depends on the structure of RNA molecules. According to the type of RNA bound to the protein in each complex, we split the 103 protein-RNA complexes into six categories. For each category, we build a gradient boosted regression tree (GBRT) model based on the generated features. We perform a comprehensive evaluation of the proposed method on the binding affinity dataset using leave-one-out cross-validation. We show that PredPRBA achieves correlations ranging from 0.723 to 0.897 among the six categories, which is significantly better than other typical regression methods and the pioneering protein-RNA binding affinity predictor SPOT-Seq-RNA. In addition, a user-friendly web server has been developed to predict the binding affinity of protein-RNA complexes. The PredPRBA webserver is freely available at http://PredPRBA.denglab.org/.
Affiliation(s)
- Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, China; School of Software, Xinjiang University, Urumqi, China
- Wenyi Yang
- School of Computer Science and Engineering, Central South University, Changsha, China
- Hui Liu
- Lab of Information Management, Changzhou University, Changzhou, China

44
Nembrini S. On what to permute in test-based approaches for variable importance measures in Random Forests. Bioinformatics 2019; 35:2701-2705. [PMID: 30561510 DOI: 10.1093/bioinformatics/bty1025] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/09/2018] [Revised: 12/09/2018] [Accepted: 12/12/2018] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: In bioinformatics applications, it is currently customary to permute the outcome variable in order to produce inference on covariates when testing novel methods or statistics whose distributions are poorly known. The seminal publication of Altmann et al. in Bioinformatics uses the same permutation scheme to obtain P-values that can be treated as a corrected measure of feature importance to rectify the bias of the Gini variable importance in Random Forests. Since then, this method has also been used in applied work to draw statistical conclusions on variable importance measures from the resulting P-values. RESULTS: In this paper, we show that permuting the outcome may produce unexpected results, including P-values with undesirable properties, and illustrate how more refined permutation schemes can be appropriate to obtain desirable results, including high power in discovering relevant variables. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
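The outcome-permutation scheme that the abstract above examines can be sketched as follows. This is a simplified illustration under stated assumptions: the absolute Pearson correlation stands in for a Random Forest importance measure (it is not the Gini importance), the data are synthetic, and the function names are hypothetical. It shows only the scheme being scrutinized, not the refined schemes the paper proposes.

```python
import random


def abs_corr(x, y):
    """Stand-in importance statistic: absolute Pearson correlation.

    Assumes neither x nor y is constant (no zero-variance guard).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5


def outcome_permutation_pvalue(x, y, importance=abs_corr, n_perm=199, seed=0):
    """P-value of the observed importance against outcome-permuted importances."""
    rng = random.Random(seed)
    observed = importance(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the link between every covariate and y
        if importance(x, y_perm) >= observed:
            hits += 1
    # Add-one correction so the estimated P-value is never exactly zero.
    return (1 + hits) / (1 + n_perm)


# Synthetic example: a covariate that is an exact linear function of the outcome.
y = [float(i) for i in range(20)]
x_signal = [2.0 * v + 1.0 for v in y]
p_signal = outcome_permutation_pvalue(x_signal, y)
```

Note that shuffling the outcome destroys its association with all covariates at once, so the resulting null distribution corresponds to "no covariate is related to the outcome" rather than to a statement about one covariate given the others. This is the kind of property the abstract suggests can lead to unexpected P-value behavior.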
Affiliation(s)
- Stefano Nembrini
- Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, Emerging Pathogens Institute, University of Florida, Gainesville, FL, USA

45
Wekesa JS, Luan Y, Chen M, Meng J. A Hybrid Prediction Method for Plant lncRNA-Protein Interaction. Cells 2019; 8:E521. [PMID: 31151273 PMCID: PMC6627874 DOI: 10.3390/cells8060521] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 04/26/2019] [Revised: 05/22/2019] [Accepted: 05/29/2019] [Indexed: 01/23/2023] Open
Abstract
The identification and analysis of long non-protein-coding RNAs (lncRNAs) are pervasive in transcriptome studies due to their roles in biological processes. In particular, lncRNA-protein interaction has plausible relevance to gene expression regulation and to cellular processes such as pathogen resistance in plants. While lncRNA-protein interaction has been studied in animals, there has yet to be extensive research in plants. In this paper, we propose a novel plant lncRNA-protein interaction prediction method, namely PLRPIM, which combines deep learning and shallow machine learning methods. The selection of an optimal feature subset and subsequent efficient compression are significant challenges for deep learning models. The proposed method adopts k-mer features and extracts high-level abstract sequence-based features using a stacked sparse autoencoder. Based on the extracted features, the fusion of random forest (RF) and light gradient boosting machine (LGBM) is used to build the prediction model. Performance is evaluated on Arabidopsis thaliana and Zea mays datasets. Experimental results demonstrate PLRPIM's superiority over other prediction tools on the two datasets. Based on 5-fold cross-validation, we obtain 89.98% and 93.44% accuracy and 0.954 and 0.982 AUC for Arabidopsis thaliana and Zea mays, respectively. PLRPIM predicts potential lncRNA-protein interaction pairs effectively, which can facilitate lncRNA-related research including function prediction.
Affiliation(s)
- Jael Sanyanda Wekesa
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, Liaoning, China
- Department of Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya
- Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian 116023, Liaoning, China
- Ming Chen
- College of Life Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, Liaoning, China

46
Zhang X, Gan Y, Zou G, Guan J, Zhou S. Genome-wide analysis of epigenetic dynamics across human developmental stages and tissues. BMC Genomics 2019; 20:221. [PMID: 30967107 PMCID: PMC6457072 DOI: 10.1186/s12864-019-5472-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND: The epigenome is highly dynamic during the early stages of embryonic development. Epigenetic modifications provide the necessary regulation for lineage specification and enable the maintenance of cellular identity. Given the rapid accumulation of genome-wide epigenomic modification maps across the cellular differentiation process, there is an urgent need to characterize epigenetic dynamics and reveal their impacts on differential gene regulation. METHODS: We proposed DiffEM, a computational method for differential analysis of epigenetic modifications, and identified highly dynamic modification sites along the cellular differentiation process. We applied this approach to investigate 6 epigenetic marks in 20 kinds of human early developmental stages and tissues, including hESCs, 4 hESC-derived lineages and 15 human primary tissues. RESULTS: We identified highly dynamic modification sites where different cell types exhibit distinctive modification patterns, and found that these highly dynamic sites are enriched in genes related to cellular development and differentiation. Further, to evaluate the effectiveness of our method, we correlated the dynamics scores of epigenetic modifications with the variance of gene expression, and compared the results of our method with those of the existing algorithms. The comparison results demonstrate the power of our method in evaluating epigenetic dynamics and identifying highly dynamic regions along the cell differentiation process.
Affiliation(s)
- Xia Zhang
- School of Computer Engineering and Science, Shanghai University, Shanghai, China
- Yanglan Gan
- School of Computer Science and Technology, Donghua University, Shanghai, China
- Guobing Zou
- School of Computer Engineering and Science, Shanghai University, Shanghai, China
- Jihong Guan
- Department of Computer Science and Technology, Tongji University, Shanghai, China
- Shuigeng Zhou
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai, China

47
Deng L, Sui Y, Zhang J. XGBPRH: Prediction of Binding Hot Spots at Protein-RNA Interfaces Utilizing Extreme Gradient Boosting. Genes (Basel) 2019; 10:genes10030242. [PMID: 30901953 PMCID: PMC6471955 DOI: 10.3390/genes10030242] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/16/2019] [Revised: 03/14/2019] [Accepted: 03/15/2019] [Indexed: 01/24/2023] Open
Abstract
Hot spot residues in protein-RNA complexes are vitally important for investigating the underlying molecular recognition mechanism. Accurately identifying protein-RNA binding hot spots is critical for drug design and protein engineering. Although some progress has been made by utilizing various available features and a series of machine learning approaches, these methods are still in their infancy. In this paper, we present a new computational method named XGBPRH, which is based on the eXtreme Gradient Boosting (XGBoost) algorithm and can effectively predict hot spot residues in protein-RNA interfaces utilizing an optimal set of properties. First, we download 47 protein-RNA complexes and calculate a total of 156 sequence, structure, exposure, and network features. Next, we adopt a two-step feature selection algorithm to extract a combination of 6 optimal features from these 156 features. Compared with the state-of-the-art approaches, XGBPRH achieves better performance, with an area under the ROC curve (AUC) score of 0.817 and an F1-score of 0.802 on the independent test set. We also apply XGBPRH to two case studies. The results demonstrate that the method can effectively identify novel energy hot spots.
Affiliation(s)
- Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha 410075, China
- Yuanchao Sui
- School of Computer Science and Engineering, Central South University, Changsha 410075, China
- Jingpu Zhang
- School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan 467000, China

48
Xu L, Liang G, Liao C, Chen GD, Chang CC. k-Skip-n-Gram-RF: A Random Forest Based Method for Alzheimer's Disease Protein Identification. Front Genet 2019; 10:33. [PMID: 30809242 PMCID: PMC6379451 DOI: 10.3389/fgene.2019.00033] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 10/10/2018] [Accepted: 01/17/2019] [Indexed: 11/18/2022] Open
Abstract
In this paper, a computational method based on machine learning for identifying Alzheimer's disease genes is proposed. Most existing machine learning based methods predict Alzheimer's disease genes using structural magnetic resonance imaging (MRI). These methods attain acceptable results, but they are expensive and time consuming. We therefore propose a computational method that identifies Alzheimer's disease genes from the sequence information of proteins and classifies the feature vectors with a random forest. In the proposed method, the protein information is extracted as adaptive k-skip-n-gram features. Experimental results show that the proposed method attains an accuracy of 85.5% on the selected UniProt dataset.
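To make the feature type in the abstract above more concrete, the following sketch computes skip-gram frequencies for the simplest case, n = 2: ordered residue pairs separated by 0 to k skipped residues. It is an assumption-laden illustration, not the paper's "adaptive" variant, whose details the abstract does not give; the function name and the toy sequence are hypothetical.

```python
from collections import Counter


def k_skip_bigram_features(seq, k):
    """Relative frequencies of residue pairs with 0..k residues skipped between them."""
    counts = Counter()
    for gap in range(k + 1):        # number of skipped residues
        step = gap + 1              # index distance between the two residues
        for i in range(len(seq) - step):
            counts[seq[i] + seq[i + step]] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}


# Toy sequence (made up): with k=1 this yields 4 adjacent pairs and 3 one-skip pairs.
features = k_skip_bigram_features("MKVLA", k=1)
```

In practice, such a dictionary would be expanded into a fixed-length vector over all 400 possible residue pairs before being passed to a classifier such as a random forest, so that every protein maps to the same feature space regardless of its length.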
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Guangmin Liang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Changrui Liao
- Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen, China
- Gin-Den Chen
- Department of Obstetrics and Gynecology, Chung Shan Medical University Hospital, Taichung, Taiwan
- Chi-Chang Chang
- School of Medical Informatics, Chung Shan Medical University, Taichung, Taiwan
- IT Office, Chung Shan Medical University Hospital, Taichung, Taiwan

49
Qu K, Wei L, Yu J, Wang C. Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods. Front Plant Sci 2019; 9:1961. [PMID: 30687359 PMCID: PMC6335366 DOI: 10.3389/fpls.2018.01961] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/20/2018] [Accepted: 12/17/2018] [Indexed: 05/04/2023]
Abstract
Motivation: Pentatricopeptide repeat (PPR) is a triangular pentapeptide repeat domain that plays a vital role in plant growth. In this study, we seek to identify PPR coding genes and proteins using a mixture of feature extraction methods. We use four single feature extraction methods focusing on the sequence, physical, and chemical properties as well as the amino acid composition, and mix the features. The Max-Relevant-Max-Distance (MRMD) technique is applied to reduce the feature dimension. Classification uses the random forest, J48, and naïve Bayes with 10-fold cross-validation. Results: Combining two of the feature extraction methods with the random forest classifier produces the highest area under the curve of 0.9848. Using MRMD to reduce the dimension improves this metric for J48 and naïve Bayes, but has little effect on the random forest results. Availability and Implementation: The webserver is available at: http://server.malab.cn/MixedPPR/index.jsp.
Affiliation(s)
- Kaiyang Qu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jiantao Yu
- College of Information Engineering, North-West A&F University, Yangling, China
- Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States

50
Integrating Multiple Interaction Networks for Gene Function Inference. Molecules 2018; 24:molecules24010030. [PMID: 30577643 PMCID: PMC6337127 DOI: 10.3390/molecules24010030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/21/2018] [Revised: 12/19/2018] [Accepted: 12/20/2018] [Indexed: 01/17/2023] Open
Abstract
In the past few decades, the number and variety of genomic and proteomic data available have increased dramatically. Molecular or functional interaction networks are usually constructed from high-throughput data, and the topological structure of these interaction networks provides a wealth of information for inferring the function of genes or proteins. Analyzing such association networks is a widely used way to mine functional information about genes or proteins. However, how to combine multiple heterogeneous networks to achieve more accurate predictions remains an urgent, unresolved challenge. In this paper, we present a method named ReprsentConcat to improve function inference by integrating multiple interaction networks. The low-dimensional representation of each node in each network is extracted; these representations from multiple networks are then concatenated and fed to gcForest, which augments feature vectors by cascading and automatically determines the number of cascade levels. We experimentally compare ReprsentConcat with a state-of-the-art method, showing that it achieves competitive results on the yeast and human datasets. Moreover, it is robust to its hyperparameters, including the number of dimensions.