1
Roddy JW, Rich DH, Wheeler TJ. nail: software for high-speed, high-sensitivity protein sequence annotation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.27.577580. [PMID: 38352323 PMCID: PMC10862755 DOI: 10.1101/2024.01.27.577580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/22/2024]
Abstract
"Fast is fine, but accuracy is final." -- Wyatt Earp. Background The extreme diversity of newly sequenced organisms and considerable scale of modern sequence databases lead to a tension between competing needs for sensitivity and speed in sequence annotation, with multiple tools displacing the venerable BLAST software suite on one axis or another. Alignment based on profile hidden Markov models (pHMMs) has demonstrated state-of-the-art sensitivity, while recent algorithmic advances have resulted in hyper-fast annotation tools with sensitivity close to that of BLAST. Results Here, we introduce a new tool that bridges the gap between advances in these two directions, reaching speeds comparable to fast annotation methods such as MMseqs2 while retaining most of the sensitivity offered by pHMMs. The tool, called nail, implements a heuristic approximation of the pHMM Forward/Backward (FB) algorithm by identifying a sparse subset of the cells in the FB dynamic programming matrix that contains most of the probability mass. The method produces an accurate approximation of pHMM scores and E-values with high speed and small memory requirements. On a protein benchmark, nail recovers the majority of the recall difference between MMseqs2 and HMMER, with run time ~26x faster than HMMER3 (only ~2.4x slower than MMseqs2's sensitive variant). nail is released under the open BSD-3-clause license and is available for download at https://github.com/TravisWheelerLab/nail.
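The central idea in this abstract, keeping only the cells of a dynamic programming matrix that hold most of the probability mass, can be illustrated with a toy sketch. This is not nail's actual procedure, which the abstract does not detail. The function name `sparse_mass_cells`, the 0.9 threshold, the toy matrix, and the greedy largest-first selection are all assumptions made purely for illustration.

```python
def sparse_mass_cells(matrix, mass_fraction=0.99):
    """Return the smallest set of (row, col) cells whose values sum to at
    least mass_fraction of the matrix total (hypothetical helper)."""
    cells = [((i, j), v) for i, row in enumerate(matrix) for j, v in enumerate(row)]
    total = sum(v for _, v in cells)
    if total <= 0:
        return set()
    # Taking the largest cells first yields the fewest cells for a given mass.
    cells.sort(key=lambda c: c[1], reverse=True)
    kept, acc = set(), 0.0
    for pos, v in cells:
        if acc >= mass_fraction * total:
            break
        kept.add(pos)
        acc += v
    return kept


# Toy "posterior" matrix: most of the mass lies along the diagonal.
toy = [
    [0.50, 0.02, 0.00],
    [0.03, 0.30, 0.01],
    [0.00, 0.02, 0.12],
]
kept = sparse_mass_cells(toy, mass_fraction=0.9)
print(sorted(kept))
```

In this toy case three of the nine cells already cover 90% of the mass, which is the kind of sparsity a heuristic of this family would exploit; a real tool has to find such cells without first filling the dense matrix.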
Affiliation(s)
- Jack W Roddy
  - R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA
- David H Rich
  - Department of Computer Science, University of Montana, Missoula, Montana, USA
- Travis J Wheeler
  - R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA
  - Department of Computer Science, University of Montana, Missoula, Montana, USA
2
Jia K, Kilinc M, Jernigan RL. New alignment method for remote protein sequences by the direct use of pairwise sequence correlations and substitutions. FRONTIERS IN BIOINFORMATICS 2023; 3:1227193. [PMID: 37900964 PMCID: PMC10602800 DOI: 10.3389/fbinf.2023.1227193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2023] [Accepted: 08/14/2023] [Indexed: 10/31/2023] Open
Abstract
Understanding protein sequences and how they relate to the functions of proteins is extremely important. One of the most basic operations in bioinformatics is sequence alignment, and usually the first thing learned from an alignment is which positions are the most conserved; these are often critical parts of the structure, such as enzyme active site residues. In addition, the contact pairs in a protein usually correspond closely to the correlations between residue positions in the multiple sequence alignment, and these usually change in a systematic and coordinated way: if one position changes, then the other member of the pair also changes to compensate. In the present work, these correlated pairs are taken as anchor points for a new type of sequence alignment. The main advantage of the method is that it combines the remote homolog detection from our method PROST with pairwise sequence substitutions in the rigorous method from Kleinjung et al. We show a few examples of the resulting sequence alignments, and how they can lead to improvements in alignments for function, even for a disordered protein.
Affiliation(s)
- Kejue Jia
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
- Mesih Kilinc
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, United States
- Robert L. Jernigan
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, United States
3
Hardies SC, Cho BC, Jang GI, Wang Z, Hwang CY. Identification of Structural and Morphogenesis Genes of Sulfitobacter Phage ΦGT1 and Placement within the Evolutionary History of the Podoviruses. Viruses 2023; 15:1475. [PMID: 37515163 PMCID: PMC10386132 DOI: 10.3390/v15071475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Revised: 06/23/2023] [Accepted: 06/28/2023] [Indexed: 07/30/2023] Open
Abstract
ΦGT1 is a lytic podovirus of an alphaproteobacterial Sulfitobacter species, with few closely matching sequences among characterized phages, thus defying a useful description by simple sequence clustering methods. The history of the ΦGT1 core structure module was reconstructed using timetrees, including numerous related prospective prophages, to flesh out the evolutionary lineages spanning from the origin of the ejectosomal podovirus >3.2 Gya to the present genes of ΦGT1 and its closest relatives. A peculiarity of the ΦGT1 structural proteome is that it contains two paralogous tubular tail A (tubeA) proteins. The origin of the dual tubeA arrangement was traced to a recombination between two more ancient podoviral lineages occurring ~0.7 Gya in the alphaproteobacterial order Rhizobiales. Descendants of the ancestral dual A recombinant were tracked forward forming both temperate and lytic phage clusters and exhibiting both vertical transmission with patchy persistence and horizontal transfer with respect to host taxonomy. The two ancestral lineages were traced backward, making junctions with a major metagenomic podoviral family, the LUZ24-like gammaproteobacterial phages, and Myxococcal phage Mx8, and finally joining near the origin of podoviruses with P22. With these most conservative among phage genes, deviations from uncomplicated vertical and nonrecombinant descent are numerous but countable. The use of timetrees allowed conceptualization of the phage's evolution in the context of a sequence of ancestors spanning the time of life on Earth.
Affiliation(s)
- Stephen C Hardies
  - Department of Biochemistry and Structural Biology, UT Health, San Antonio, TX 78229, USA
- Byung Cheol Cho
  - Microbial Oceanography Laboratory, School of Earth and Environmental Sciences and Research Institute of Oceanography, Seoul National University, Seoul 08826, Republic of Korea
  - Saemangeum Environmental Research Center, Kunsan National University, Gunsan 54150, Republic of Korea
- Gwang Il Jang
  - Aquatic Disease Control Division, National Fishery Products Quality Management Service, Busan 46083, Republic of Korea
- Zhiqing Wang
  - National Cryo-EM Facility, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., Frederick, MD 21702, USA
- Chung Yeon Hwang
  - Microbial Oceanography Laboratory, School of Earth and Environmental Sciences and Research Institute of Oceanography, Seoul National University, Seoul 08826, Republic of Korea
4
Kumar L, Brenner N, Sledzieski S, Olaosebikan M, Roger LM, Lynn-Goin M, Klein-Seetharaman R, Berger B, Putnam H, Yang J, Lewinski NA, Singh R, Daniels NM, Cowen L, Klein-Seetharaman J. Transfer of knowledge from model organisms to evolutionarily distant non-model organisms: The coral Pocillopora damicornis membrane signaling receptome. PLoS One 2023; 18:e0270965. [PMID: 36735673 PMCID: PMC9897584 DOI: 10.1371/journal.pone.0270965] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/28/2021] [Accepted: 06/21/2022] [Indexed: 02/04/2023] Open
Abstract
With the ease of gene sequencing and the technology available to study and manipulate non-model organisms, the extension of the methodological toolbox required to translate our understanding of model organisms to non-model organisms has become an urgent problem. For example, mining of large coral and their symbiont sequence data is a challenge, but also provides an opportunity for understanding functionality and evolution of these and other non-model organisms. Much more information than for any other eukaryotic species is available for humans, especially related to signal transduction and diseases. However, the coral cnidarian host and humans diverged over 700 million years ago, and homologies between proteins in the two species are therefore often in the gray zone, or at least often undetectable with traditional BLAST searches. We introduce a two-stage approach to identifying putative coral homologues of human proteins. First, through remote homology detection using Hidden Markov Models, we identify candidate human homologues in the cnidarian genome. However, for many proteins, the human genome alone contains multiple family members with similar or even more divergence in sequence. In the second stage, therefore, we filter the remote homology results based on the functional and structural plausibility of each coral candidate, shortlisting the coral proteins likely to have conserved some of the functions of the human proteins. We demonstrate our approach with a pipeline for mapping membrane receptors in humans to membrane receptors in corals, with specific focus on the stony coral, P. damicornis. More than 1000 human membrane receptors mapped to 335 coral receptors, including 151 G protein coupled receptors (GPCRs).
To validate specific sub-families, we chose opsin proteins, representative GPCRs that confer light sensitivity, and Toll-like receptors, representative non-GPCRs, which function in the immune response, and their ability to communicate with microorganisms. Through detailed structure-function analysis of their ligand-binding pockets and downstream signaling cascades, we selected those candidate remote homologues likely to carry out related functions in the corals. This pipeline may prove generally useful for other non-model organisms, such as to support the growing field of synthetic biology.
Affiliation(s)
- Lokender Kumar
  - Department of Chemistry, Colorado School of Mines, Golden, CO, United States of America
- Nathanael Brenner
  - Department of Chemistry, Colorado School of Mines, Golden, CO, United States of America
- Samuel Sledzieski
  - MIT Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, United States of America
- Monsurat Olaosebikan
  - Department of Computer Science, Tufts University, Medford, MA, United States of America
- Liza M. Roger
  - Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, VA, United States of America
- Matthew Lynn-Goin
  - Department of Chemistry, Colorado School of Mines, Golden, CO, United States of America
- Bonnie Berger
  - MIT Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, United States of America
- Hollie Putnam
  - Department of Biological Sciences, University of Rhode Island, South Kingstown, RI, United States of America
- Jinkyu Yang
  - Department of Aeronautics & Astronautics, University of Washington, Seattle, WA, United States of America
- Nastassja A. Lewinski
  - Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, VA, United States of America
- Rohit Singh
  - MIT Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, United States of America
- Noah M. Daniels
  - Department of Computer Science and Statistics, University of Rhode Island, South Kingstown, RI, United States of America
- Lenore Cowen
  - Department of Computer Science, Tufts University, Medford, MA, United States of America
- Judith Klein-Seetharaman
  - Department of Chemistry, Colorado School of Mines, Golden, CO, United States of America
5
Tomal JH, Welch WJ, Zamar RH. Robust ranking by ensembling of diverse models and assessment metrics. J STAT COMPUT SIM 2022. [DOI: 10.1080/00949655.2022.2093873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Affiliation(s)
- Jabed H. Tomal
  - Department of Mathematics and Statistics, Thompson Rivers University, Kamloops, British Columbia, Canada
- William J. Welch
  - Department of Statistics, The University of British Columbia, Vancouver, British Columbia, Canada
- Ruben H. Zamar
  - Department of Statistics, The University of British Columbia, Vancouver, British Columbia, Canada
6
Zhao X, Yang J, Li X, Li G, Sun Z, Chen Y, Chen Y, Xia M, Li Y, Yao L, Hou H. Identification and expression analysis of GARP superfamily genes in response to nitrogen and phosphorus stress in Spirodela polyrhiza. BMC PLANT BIOLOGY 2022; 22:308. [PMID: 35751022 PMCID: PMC9233324 DOI: 10.1186/s12870-022-03696-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/22/2022] [Accepted: 06/13/2022] [Indexed: 06/12/2023]
Abstract
BACKGROUND GARP transcription factors perform critical roles in plant development and response to environmental stimuli, especially in phosphorus (P) and nitrogen (N) sensing and uptake. Spirodela polyrhiza (giant duckweed) is widely used for phytoremediation and biomass production due to its rapid growth and efficient N and P removal capacities. However, there has not yet been a comprehensive analysis of the GARP gene family in S. polyrhiza. RESULTS We conducted a comprehensive study of GARP superfamily genes in S. polyrhiza. First, we investigated 35 SpGARP genes which have been classified into three groups based on their gene structures, conserved motifs, and phylogenetic relationship. Then, we identified the duplication events, performed the synteny analysis, and calculated the Ka/Ks ratio in these SpGARP genes. The regulatory and co-expression networks of SpGARPs were further constructed using cis-acting element analysis and weighted correlation network analysis (WGCNA). Finally, the expression patterns of SpGARP genes were analyzed using RNA-seq data and qRT-PCR, and several NIGT1 transcription factors were found to be involved in both N and P starvation responses. CONCLUSIONS The study provides insight into the evolution and function of the GARP superfamily in S. polyrhiza, and lays the foundation for the further functional verification of SpGARP genes.
Affiliation(s)
- Xuyao Zhao
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- Jingjing Yang
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- Xiaozhe Li
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- Gaojie Li
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
- Zuoliang Sun
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Yan Chen
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Yimeng Chen
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Manli Xia
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Yixian Li
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Lunguang Yao
  - Henan Key Laboratory of Ecological Security for Water Source Region of Mid-Line of South-to-North Diversion Project of Henan Province, Collaborative Innovation Center of Water Security for Water Source Region of Mid-Line of South-to-North Diversion Project of Henan Province, Nanyang Normal University, Nanyang, 473061, China
- Hongwei Hou
  - The State Key Laboratory of Freshwater Ecology and Biotechnology, The Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, 430072, China
7
Pang Y, Liu B. SelfAT-Fold: Protein Fold Recognition Based on Residue-Based and Motif-Based Self-Attention Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1861-1869. [PMID: 33090951 DOI: 10.1109/tcbb.2020.3031888] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 06/11/2023]
Abstract
Protein fold recognition is a fundamental and crucial step of tertiary structure determination. In this regard, several computational predictors have been proposed. Recently, predictive performance has been markedly improved by the fold-specific features generated by deep learning techniques. However, these methods fail to measure the global associations among residues or motifs along the protein sequences. Furthermore, these deep learning techniques are often treated as black boxes without interpretability. Inspired by the similarities between protein sequences and natural language sentences, we applied the self-attention mechanism derived from the natural language processing (NLP) field to protein fold recognition. The motif-based self-attention network (MSAN) and the residue-based self-attention network (RSAN) were constructed based on a training set to capture the global associations among the structure motifs and residues along the protein sequences, respectively. The fold-specific attention features trained and generated from the training set were then combined with Support Vector Machines (SVMs) to predict the samples in the widely used LE benchmark dataset, which is fully independent from the training set. Experimental results showed that the proposed two SelfAT-Fold predictors outperformed 34 existing state-of-the-art computational predictors. The two SelfAT-Fold predictors were further tested on an independent dataset, SCOP_TEST, and achieved stable performance. Furthermore, the fold-specific attention features can be used to analyse the characteristics of protein folds. The trained models and data of SelfAT-Fold can be downloaded from http://bliulab.net/selfAT_fold/.
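The self-attention mechanism mentioned in this abstract can be sketched in a few lines. The block below is generic scaled dot-product self-attention, not the MSAN or RSAN architecture, whose details the abstract does not give. Using the raw inputs as queries, keys, and values (no learned projections) and the toy three-residue input are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.

    X is a list of equal-length feature vectors, one per sequence position.
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average over all positions.
        outputs.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return outputs, weights


# Toy input: three positions with 2-dimensional features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)
```

Because every position attends to every other position, the attention weights link distant residues or motifs directly, which is the "global association" property the abstract refers to.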
8
Yadav NS, Kumar P, Singh I. Structural and functional analysis of protein. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00026-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
9
Liu Y, Han K, Zhu YH, Zhang Y, Shen LC, Song J, Yu DJ. Improving protein fold recognition using triplet network and ensemble deep learning. Brief Bioinform 2021; 22:bbab248. [PMID: 34226918 PMCID: PMC8768454 DOI: 10.1093/bib/bbab248] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/10/2021] [Revised: 06/04/2021] [Indexed: 12/24/2022] Open
Abstract
Protein fold recognition is a critical step toward protein structure and function prediction, aiming at providing the most likely fold type of the query protein. In recent years, the development of deep learning (DL) techniques has led to massive advances in this important field, and accordingly, the sensitivity of protein fold recognition has been dramatically improved. Most DL-based methods take an intermediate bottleneck layer as the feature representation of proteins with new fold types. However, this strategy is indirect, inefficient and conditional on the assumption that the bottleneck layer's representation is a good representation of proteins with new fold types. To address the above problem, in this work, we develop a new computational framework by combining triplet network and ensemble DL. We first train a DL-based model, termed FoldNet, which employs triplet loss to train the deep convolutional network. FoldNet directly optimizes the protein fold embedding itself, placing proteins with the same fold type closer to each other than those with different fold types in the new protein embedding space. Subsequently, using the trained FoldNet, we implement a new residue-residue contact-assisted predictor, termed FoldTR, which improves protein fold recognition. Furthermore, we propose a new ensemble DL method, termed FSD_XGBoost, which combines protein fold embedding with the other two discriminative fold-specific features extracted by two DL-based methods SSAfold and DeepFR. The Top 1 sensitivity of FSD_XGBoost increases to 74.8% at the fold level, which is ~9% higher than that of the state-of-the-art method. Together, the results suggest that fold-specific features extracted by different DL methods complement each other, and their combination can further improve fold recognition at the fold level. The implemented web server of FoldTR and benchmark datasets are publicly available at http://csbio.njust.edu.cn/bioinf/foldtr/.
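The triplet loss named in this abstract has a standard form, sketched below for a single (anchor, positive, negative) triple. The abstract does not state FoldNet's distance function or margin, so the Euclidean distance, the margin of 1.0, and the toy 2-dimensional embeddings are assumptions for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the same-fold example (positive) is closer to
    the anchor than the different-fold example (negative) by at least
    the margin.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Well-separated triple: positive is near the anchor, negative is far away.
good = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Violating triple: the negative is closer to the anchor than the positive.
bad = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
print(good, bad)
```

Minimizing this quantity over many triples pulls same-fold embeddings together and pushes different-fold embeddings apart, which matches the abstract's description of optimizing the embedding directly.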
Affiliation(s)
- Jiangning Song
  - Corresponding authors: Dong-Jun Yu, School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China; Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Dong-Jun Yu
  - Corresponding authors: Dong-Jun Yu, School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China; Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
10
Structural Studies of the Phage G Tail Demonstrate an Atypical Tail Contraction. Viruses 2021; 13:v13102094. [PMID: 34696524 PMCID: PMC8570332 DOI: 10.3390/v13102094] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/28/2021] [Revised: 10/07/2021] [Accepted: 10/14/2021] [Indexed: 01/28/2023] Open
Abstract
Phage G is recognized as having a remarkably large genome and capsid size among isolated, propagated phages. Negative stain electron microscopy of the host–phage G interaction reveals tail sheaths that are contracted towards the distal tip and decoupled from the head–neck region. This is different from the typical myophage tail contraction, where the sheath contracts upward, while being linked to the head–neck region. Our cryo-EM structures of the non-contracted and contracted tail sheath show that: (1) The protein fold of the sheath protein is very similar to its counterpart in smaller, contractile phages such as T4 and phi812; (2) Phage G’s sheath structures in the non-contracted and contracted states are similar to those of phage T4. Similarity to other myophages is confirmed by a comparison-based study of the tail sheath’s helical symmetry, the sheath protein’s evolutionary timetree, and the organization of genes involved in tail morphogenesis. Atypical phage G tail contraction could be due to a missing anchor point at the upper end of the tail sheath that allows the decoupling of the sheath from the head–neck region. Explaining the atypical tail contraction requires further investigation of the phage G sheath anchor points.
11
Dong AY, Wang Z, Huang JJ, Song BA, Hao GF. Bioinformatic tools support decision-making in plant disease management. TRENDS IN PLANT SCIENCE 2021; 26:953-967. [PMID: 34039514 DOI: 10.1016/j.tplants.2021.05.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/14/2020] [Revised: 02/10/2021] [Accepted: 05/01/2021] [Indexed: 06/12/2023]
Abstract
Food loss due to pathogens is a major concern in agriculture, creating a need for advanced disease detection and prevention measures to minimize pathogen damage to plants. Novel bioinformatic tools have opened doors for the low-cost rapid identification of pathogens and prevention of disease. The number of these tools is growing fast, and a comprehensive and comparative summary of these resources is currently lacking. Here, we review all current bioinformatic tools used to identify the mechanisms of pathogen pathogenicity, plant resistance protein identification, and the detection and treatment of plant disease. We compare functionality, data volume, data sources, performance, and applicability of all tools to provide a comprehensive toolbox for researchers in plant disease management.
Affiliation(s)
- An-Yu Dong
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang 550025, P. R. China
- Zheng Wang
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang 550025, P. R. China
- Jun-Jie Huang
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang 550025, P. R. China
- Bao-An Song
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang 550025, P. R. China
- Ge-Fei Hao
  - State Key Laboratory Breeding Base of Green Pesticide and Agricultural Bioengineering, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang 550025, P. R. China
12
Schäffer AA, McVeigh R, Robbertse B, Schoch CL, Johnston A, Underwood BA, Karsch-Mizrachi I, Nawrocki EP. Ribovore: ribosomal RNA sequence analysis for GenBank submissions and database curation. BMC Bioinformatics 2021; 22:400. [PMID: 34384346 PMCID: PMC8359073 DOI: 10.1186/s12859-021-04316-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/12/2021] [Accepted: 08/03/2021] [Indexed: 02/01/2023] Open
Abstract
Background The DNA sequences encoding ribosomal RNA genes (rRNAs) are commonly used as markers to identify species, including in metagenomics samples that may combine many organismal communities. The 16S small subunit ribosomal RNA (SSU rRNA) gene is typically used to identify bacterial and archaeal species. The nuclear 18S SSU rRNA gene, and 28S large subunit (LSU) rRNA gene have been used as DNA barcodes and for phylogenetic studies in different eukaryote taxonomic groups. Because of their popularity, the National Center for Biotechnology Information (NCBI) receives a disproportionate number of rRNA sequence submissions and BLAST queries. These sequences vary in quality, length, origin (nuclear, mitochondria, plastid), and organism source and can represent any region of the ribosomal cistron. Results To improve the timely verification of quality, origin and loci boundaries, we developed Ribovore, a software package for sequence analysis of rRNA sequences. The ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal SSU rRNA. The ribodbmaker program is used to create high-quality datasets of rRNAs from different taxonomic groups. Key algorithmic steps include comparing candidate sequences against rRNA sequence profile hidden Markov models (HMMs) and covariance models of rRNA sequence and secondary-structure conservation, as well as other tests. Nine freely available blastn rRNA databases created and maintained with Ribovore are used for checking incoming GenBank submissions and used by the blastn browser interface at NCBI. Since 2018, Ribovore has been used to analyze more than 50 million prokaryotic SSU rRNA sequences submitted to GenBank, and to select at least 10,435 fungal rRNA RefSeq records from type material of 8350 taxa. Conclusion Ribovore combines single-sequence and profile-based methods to improve GenBank processing and analysis of rRNA sequences. 
It is a standalone, portable, and extensible software package for the alignment, classification and validation of rRNA sequences. Researchers planning on submitting SSU rRNA sequences to GenBank are encouraged to download and use Ribovore to analyze their sequences prior to submission to determine which sequences are likely to be automatically accepted into GenBank.
Affiliation(s)
- Alejandro A Schäffer
  - Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Richard McVeigh
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Barbara Robbertse
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Conrad L Schoch
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Anjanette Johnston
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Beverly A Underwood
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Ilene Karsch-Mizrachi
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Eric P Nawrocki
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
13
Shao J, Yan K, Liu B. FoldRec-C2C: protein fold recognition by combining cluster-to-cluster model and protein similarity network. Brief Bioinform 2021; 22:5873289. [PMID: 32685972 PMCID: PMC7454262 DOI: 10.1093/bib/bbaa144] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 04/07/2020] [Revised: 05/26/2020] [Accepted: 06/11/2020] [Indexed: 12/27/2022] Open
Abstract
As a key to studying protein structures, protein fold recognition plays an important role in predicting the protein structures associated with COVID-19 and other important structures. However, the existing computational predictors only focus on the pairwise similarity between proteins or the similarity between two groups of proteins from two folds, whereas the homology relationship among proteins has a hierarchical structure. The global protein similarity network will contribute to the performance improvement. In this study, we proposed a predictor called FoldRec-C2C to globally incorporate the interactions among proteins into the prediction. For the FoldRec-C2C predictor, the protein fold recognition problem is treated as an information retrieval task in natural language processing. The initial ranking results were generated by a supervised ranking algorithm, Learning to Rank, and then three re-ranking algorithms were performed on the ranking lists to adjust the results globally based on the protein similarity network, including the seq-to-seq model, the seq-to-cluster model and the cluster-to-cluster model (C2C). When tested on a widely used and rigorous benchmark dataset, the LINDAHL dataset, FoldRec-C2C outperforms 34 other state-of-the-art methods in this field. The source code and data of FoldRec-C2C can be downloaded from http://bliulab.net/FoldRec-C2C/download.
Affiliation(s)
- Jiangyi Shao
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
|
14
|
Zohra Smaili F, Tian S, Roy A, Alazmi M, Arold ST, Mukherjee S, Scott Hefty P, Chen W, Gao X. QAUST: Protein Function Prediction Using Structure Similarity, Protein Interaction, and Functional Motifs. Genomics Proteomics & Bioinformatics 2021; 19:998-1011. [PMID: 33631427 PMCID: PMC9403031 DOI: 10.1016/j.gpb.2021.02.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/11/2018] [Revised: 04/03/2019] [Accepted: 05/17/2019] [Indexed: 11/25/2022]
Abstract
The number of available protein sequences in public databases is increasing exponentially. However, a significant percentage of these sequences lack functional annotation, which is essential for the understanding of how biological systems operate. Here, we propose a novel method, Quantitative Annotation of Unknown STructure (QAUST), to infer protein functions, specifically Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. QAUST uses three sources of information: structure information encoded by global and local structure similarity search, biological network information inferred by protein–protein interaction data, and sequence information extracted from functionally discriminative sequence motifs. These three pieces of information are combined by consensus averaging to make the final prediction. Our approach has been tested on 500 protein targets from the Critical Assessment of Functional Annotation (CAFA) benchmark set. The results show that our method provides accurate functional annotation and outperforms other prediction methods based on sequence similarity search or threading. We further demonstrate that a previously unknown function of human tripartite motif-containing 22 (TRIM22) protein predicted by QAUST can be experimentally validated.
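Note: the consensus-averaging step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each of the three information sources yields a confidence score in [0, 1] per candidate term, and that a term a source does not report counts as 0 for that source. The function name, the treatment of missing terms, and the toy scores are hypothetical and are not taken from QAUST itself.

```python
from typing import Dict, List


def consensus_average(score_sets: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-term scores across sources.

    Assumption for illustration: a term missing from a source
    contributes a score of 0.0 for that source.
    """
    terms = set().union(*score_sets)
    n_sources = len(score_sets)
    return {
        term: sum(scores.get(term, 0.0) for scores in score_sets) / n_sources
        for term in terms
    }


# Toy scores (made up) for three hypothetical sources.
structure_scores = {"GO:0003824": 0.9, "GO:0005515": 0.6}
network_scores = {"GO:0005515": 0.9}
motif_scores = {"GO:0003824": 0.6, "GO:0016740": 0.3}

consensus = consensus_average([structure_scores, network_scores, motif_scores])

# Rank candidate terms by their averaged score, highest first.
ranked_terms = sorted(consensus.items(), key=lambda item: item[1], reverse=True)
```

An unweighted mean is only one possible reading of "consensus averaging"; a weighted scheme would change nothing but the reduction inside the dictionary comprehension.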
Affiliation(s)
- Fatima Zohra Smaili
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Shuye Tian
- Department of Biology, Southern University of Science and Technology of China (SUSTC), Shenzhen 518055, China
- Ambrish Roy
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Meshari Alazmi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia; College of Computer Science and Engineering, University of Hail, Hail 55476, Saudi Arabia
- Stefan T Arold
- Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Srayanta Mukherjee
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- P Scott Hefty
- Department of Molecular Bioscience, University of Kansas, Lawrence, KS 66047, USA
- Wei Chen
- Department of Biology, Southern University of Science and Technology of China (SUSTC), Shenzhen 518055, China
- Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
|
15
|
Shao J, Liu B. ProtFold-DFG: protein fold recognition by combining Directed Fusion Graph and PageRank algorithm. Brief Bioinform 2020; 22:5901980. [PMID: 32892224 DOI: 10.1093/bib/bbaa192] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 06/25/2020] [Revised: 07/16/2020] [Accepted: 07/28/2020] [Indexed: 12/27/2022] Open
Abstract
As one of the most important tasks in protein structure prediction, protein fold recognition has attracted more and more attention. In this regard, some computational predictors have been proposed with the development of machine learning and artificial intelligence techniques. However, these existing computational methods still suffer from some disadvantages. To address them, we propose a new network-based predictor called ProtFold-DFG for protein fold recognition. We propose the Directed Fusion Graph (DFG) to fuse the ranking lists generated by different methods, which employs the transitive closure to incorporate more relationships among proteins and uses the KL divergence to calculate the relationship between two proteins so as to improve its generalization ability. Finally, the PageRank algorithm is performed on the DFG to accurately recognize the protein folds by considering the global interactions among proteins in the DFG. Tested on a widely used and rigorous benchmark dataset, the LINDAHL dataset, ProtFold-DFG outperforms the other 35 competing methods, indicating that ProtFold-DFG will be a useful method for protein fold recognition. The source code and data of ProtFold-DFG can be downloaded from http://bliulab.net/ProtFold-DFG/download.
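Note: the PageRank step named in the abstract above can be sketched as a plain power iteration over a small directed graph. This is a generic textbook PageRank, not the ProtFold-DFG implementation; the toy graph, the node names, the edge direction, and the damping factor of 0.85 are assumptions chosen for illustration.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             tol: float = 1e-10, max_iter: int = 200) -> Dict[str, float]:
    """Power-iteration PageRank on a directed graph given as adjacency lists."""
    nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        # Each node passes its rank in equal shares along its outgoing edges.
        for u in nodes:
            targets = graph.get(u, [])
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        converged = sum(abs(new_rank[u] - rank[u]) for u in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical toy graph: an edge u -> v means "u's ranking list supports v".
toy_graph = {
    "query": ["t1", "t2"],
    "t1": ["t2"],
    "t2": ["t1"],
    "t3": ["t2"],
}
ranks = pagerank(toy_graph)
best_template = max(ranks, key=ranks.get)
```

In this toy graph the node with the most incoming support ends up with the highest score, which is the global effect the abstract attributes to running PageRank on the fused graph.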
Affiliation(s)
- Jiangyi Shao
- School of Computer Science and Technology, Beijing Institute of Technology, China
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
|
16
|
Yang C, Binder FC, Gu M, Elliott TJ. Measures of distinguishability between stochastic processes. Phys Rev E 2020; 101:062137. [PMID: 32688504 DOI: 10.1103/physreve.101.062137] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/19/2019] [Accepted: 05/04/2020] [Indexed: 11/07/2022]
Abstract
Quantifying how distinguishable two stochastic processes are is at the heart of many fields, such as machine learning and quantitative finance. While several measures have been proposed for this task, none have universal applicability and ease of use. In this article, we suggest a set of requirements for a well-behaved measure of process distinguishability. Moreover, we propose a family of measures, called divergence rates, that satisfy all of these requirements. Focusing on a particular member of this family, the coemission divergence rate, we show that it can be computed efficiently, behaves qualitatively similarly to other commonly used measures in their regimes of applicability, and remains well behaved in scenarios where other measures break down.
Affiliation(s)
- Chengran Yang
- School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; Complexity Institute, Nanyang Technological University, Singapore 637335
- Felix C Binder
- Institute for Quantum Optics and Quantum Information (IQOQI) Vienna, Austrian Academy of Sciences, Boltzmanngasse 3, 1090 Vienna, Austria
- Mile Gu
- School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; Complexity Institute, Nanyang Technological University, Singapore 637335; Centre for Quantum Technologies, National University of Singapore, 3 Science Drive 2, Singapore 117543
- Thomas J Elliott
- School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; Complexity Institute, Nanyang Technological University, Singapore 637335
|
17
|
Pandurangan AP, Stahlhacke J, Oates ME, Smithers B, Gough J. The SUPERFAMILY 2.0 database: a significant proteome update and a new webserver. Nucleic Acids Res 2020; 47:D490-D494. [PMID: 30445555 PMCID: PMC6324026 DOI: 10.1093/nar/gky1130] [Citation(s) in RCA: 98] [Impact Index Per Article: 24.5] [Received: 09/24/2018] [Accepted: 10/25/2018] [Indexed: 01/09/2023] Open
Abstract
Here, we present a major update to the SUPERFAMILY database and the webserver. We describe the addition of a new SUPERFAMILY 2.0 profile HMM library containing a total of 27 623 HMMs. The database now includes Superfamily domain annotations for millions of protein sequences taken from the Universal Protein Resource Knowledgebase (UniProtKB) and the National Center for Biotechnology Information (NCBI). This addition constitutes about 51 and 45 million distinct protein sequences obtained from UniProtKB and NCBI respectively. Currently, the database contains annotations for 63 244 and 102 151 complete genomes taken from UniProtKB and NCBI respectively. The current sequence collection and genome update is the biggest so far in the history of SUPERFAMILY updates. In order to deal with the massive wealth of information, here we introduce a new SUPERFAMILY 2.0 webserver (http://supfam.org). Currently, the webserver mainly focuses on the search, retrieval and display of Superfamily annotation for the entire sequence and genome collection in the database.
Affiliation(s)
- Matt E Oates
- Computer Science, University of Bristol, Bristol BS8 1UB, UK
- Ben Smithers
- Computer Science, University of Bristol, Bristol BS8 1UB, UK
- Julian Gough
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK
|
18
|
Santana Silva RJ, Micheli F. RRGPredictor, a set-theory-based tool for predicting pathogen-associated molecular pattern receptors (PRRs) and resistance (R) proteins from plants. Genomics 2020; 112:2666-2676. [DOI: 10.1016/j.ygeno.2020.03.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/02/2019] [Revised: 02/11/2020] [Accepted: 03/01/2020] [Indexed: 12/22/2022]
|
19
|
Waman VP, Blundell TL, Buchan DWA, Gough J, Jones D, Kelley L, Murzin A, Pandurangan AP, Sillitoe I, Sternberg M, Torres P, Orengo C. The Genome3D Consortium for Structural Annotations of Selected Model Organisms. Methods Mol Biol 2020; 2165:27-67. [PMID: 32621218 DOI: 10.1007/978-1-0716-0708-4_3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/11/2023]
Abstract
The Genome3D consortium is a collaborative project involving protein structure prediction and annotation resources developed by six world-leading structural bioinformatics groups based in the United Kingdom (namely Blundell, Murzin, Gough, Sternberg, Orengo, and Jones). The main objective of Genome3D is to serve as a common portal that provides both predicted models and annotations of proteins in model organisms, using several resources developed by these labs such as CATH-Gene3D, DOMSERF, pDomTHREADER, PHYRE, SUPERFAMILY, FUGUE/TOCATTA, and VIVACE. These resources primarily use SCOP- and/or CATH-based protein domain assignments. Another objective of Genome3D is to compare structural classifications of protein domains in the CATH and SCOP databases and to provide a consensus mapping of CATH and SCOP protein superfamilies. CATH/SCOP mapping analyses led to the identification of a total of 1429 consensus superfamilies. Currently, Genome3D provides structural annotations for ten model organisms, including Homo sapiens, Arabidopsis thaliana, Mus musculus, Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Plasmodium falciparum, Staphylococcus aureus, and Schizosaccharomyces pombe. Thus, Genome3D serves as a common gateway to each structure prediction/annotation resource and allows users to perform comparative assessment of the predictions. It thereby helps researchers broaden their perspective on structure/function predictions for their query protein of interest in selected model organisms.
Affiliation(s)
- Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, UK
- Tom L Blundell
- Department of Biochemistry, University of Cambridge, Cambridge, UK
- Daniel W A Buchan
- Department of Computer Science, University College London, London, UK
- Julian Gough
- MRC Laboratory of Molecular Biology, Cambridge, UK
- David Jones
- Department of Computer Science, University College London, London, UK
- Lawrence Kelley
- Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, UK
- Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, UK
- Michael Sternberg
- Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, UK
- Pedro Torres
- Department of Biochemistry, University of Cambridge, Cambridge, UK
- Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, UK
|
20
|
Liu B, Zhu Y, Yan K. Fold-LTR-TCP: protein fold recognition based on triadic closure principle. Brief Bioinform 2019; 21:2185-2193. [DOI: 10.1093/bib/bbz139] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 09/03/2019] [Revised: 10/01/2019] [Accepted: 10/09/2019] [Indexed: 11/13/2022] Open
Abstract
As an important task in protein structure and function studies, protein fold recognition has attracted more and more attention. The existing computational predictors in this field treat this task as a multi-classification problem, ignoring the relationship among proteins in the dataset. However, previous studies showed that their relationship is critical for protein homology analysis. In this study, the protein fold recognition is treated as an information retrieval task. The Learning to Rank model (LTR) was employed to retrieve the query protein against the template proteins to find the template proteins in the same fold with the query protein in a supervised manner. The triadic closure principle (TCP) was performed on the ranking list generated by the LTR to improve its accuracy by considering the relationship among the query protein and the template proteins in the ranking list. Finally, a predictor called Fold-LTR-TCP was proposed. The rigorous test on the LE benchmark dataset showed that the Fold-LTR-TCP predictor achieved an accuracy of 73.2%, outperforming all the other competing methods.
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
- Yulin Zhu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
|
21
|
Li CC, Liu B. MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks. Brief Bioinform 2019; 21:2133-2141. [PMID: 31774907 DOI: 10.1093/bib/bbz133] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 07/19/2019] [Revised: 09/16/2019] [Accepted: 09/17/2019] [Indexed: 12/31/2022] Open
Abstract
Protein fold recognition is one of the most critical tasks to explore the structures and functions of the proteins based on their primary sequence information. The existing protein fold recognition approaches rely on features reflecting the characteristics of protein folds. However, the feature extraction methods are still the bottleneck of the performance improvement of these methods. In this paper, we proposed two new feature extraction methods called MotifCNN and MotifDCNN to extract more discriminative fold-specific features based on structural motif kernels to construct the motif-based convolutional neural networks (CNNs). The pairwise sequence similarity scores calculated based on fold-specific features are then fed into support vector machines to construct the predictor for fold recognition, and a predictor called MotifCNN-fold has been proposed. Experimental results on the benchmark dataset showed that MotifCNN-fold obviously outperformed all the other competing methods. In particular, the fold-specific features extracted by MotifCNN and MotifDCNN are more discriminative than the fold-specific features extracted by other deep learning techniques, indicating that incorporating the structural motifs into the CNN is able to capture the characteristics of protein folds.
Affiliation(s)
- Chen-Chen Li
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
|
22
|
Long S, Tian P. Protein secondary structure prediction with context convolutional neural network. RSC Adv 2019; 9:38391-38396. [PMID: 35540205 PMCID: PMC9075825 DOI: 10.1039/c9ra05218f] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/09/2019] [Accepted: 11/18/2019] [Indexed: 11/21/2022] Open
Abstract
Protein secondary structure (SS) prediction is important for studying protein structure and function. Both traditional machine learning methods and deep learning neural networks have been utilized and great progress has been achieved in approaching the theoretical limit. Convolutional and recurrent neural networks are two major types of deep learning architectures with comparable prediction accuracy but different training procedures to achieve optimal performance. We are interested in seeking a novel architectural style with competitive performance and in understanding the performance of different architectures with similar training procedures. We constructed a context convolutional neural network (Contextnet) and compared its performance with popular models (e.g. convolutional neural network, recurrent neural network, conditional neural fields) under similar training procedures on a Jpred dataset. The Contextnet was proven to be highly competitive. Additionally, we retrained the network with the Cullpdb dataset and compared it with Jpred, ReportX, the Spider3 server and the MUFold-SS method; the Contextnet was found to have higher Q3 accuracy on a CASP13 dataset. Training procedures were found to have a significant impact on the accuracy of the Contextnet.
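Note: the Q3 accuracy referred to in the abstract above is the fraction of residues whose three-state secondary structure label (helix H, strand E, coil C) is predicted correctly. The sketch below shows that calculation on a made-up pair of label strings; the function name and the example sequences are hypothetical and are not data from the paper.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label (H, E or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed labels must have equal length")
    if not observed:
        raise ValueError("label strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Made-up ten-residue example with a single mismatch at the fourth residue.
observed_labels = "HHHHEEEECC"
predicted_labels = "HHHEEEEECC"
q3 = q3_accuracy(predicted_labels, observed_labels)  # 9 of 10 residues agree
```

Q3 is often reported as a percentage, which is this fraction multiplied by 100.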
Affiliation(s)
- Pu Tian
- School of Life Science, School of Artificial Intelligence, Jilin University 2699 Qian-jin Street Changchun China 130012
|
23
|
Lemke T, Berg A, Jain A, Peter C. EncoderMap(II): Visualizing Important Molecular Motions with Improved Generation of Protein Conformations. J Chem Inf Model 2019; 59:4550-4560. [PMID: 31647645 DOI: 10.1021/acs.jcim.9b00675] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/17/2022]
Abstract
Dimensionality reduction can be used to project high-dimensional molecular data into a simplified, low-dimensional map. One feature of our recently introduced dimensionality reduction technique EncoderMap, which relies on the combination of an autoencoder with multidimensional scaling, is its ability to do the reverse. It is able to generate conformations for any selected points in the low-dimensional map. This transfers the simplified, low-dimensional map back into the high-dimensional conformational space. Although the output is again high-dimensional, certain aspects of the simplification are preserved. The generated conformations only mirror the most dominant conformational differences that determine the positions of conformational states in the low-dimensional map. This allows depicting such differences and, in consequence, visualizing molecular motions, and gives a unique perspective on high-dimensional conformational data. In our previous work, protein conformations described in backbone dihedral angle space were used as the input for EncoderMap, and conformations were also generated in this space. For large proteins, however, the generation of conformations is inaccurate with this approach due to the local character of backbone dihedral angles. Here, we present an improved variant of EncoderMap which is able to generate large protein conformations that are accurate in short-range and long-range orders. This is achieved by differentiable reconstruction of Cartesian coordinates from the generated dihedrals, which allows adding a contribution to the cost function that monitors the accuracy of all pairwise distances between the Cα-atoms of the generated conformations. The improved capabilities to generate conformations of large, even multidomain, proteins are demonstrated for two examples: diubiquitin and a part of the Ssa1 Hsp70 yeast chaperone. We show that the improved variant of EncoderMap can nicely visualize motions of protein domains relative to each other but is also able to highlight important conformational changes within the individual domains.
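Note: the pairwise-distance contribution to the cost function described in the abstract above can be illustrated with a small sketch. It compares all pairwise Cα distances of a generated conformation with those of a reference conformation. The mean absolute difference used here, the function names, and the three-atom toy coordinates are assumptions for illustration; the abstract does not give the exact functional form, and the actual method relies on a differentiable reconstruction, which this plain-Python version does not provide.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pairwise_distances(coords: Sequence[Point]) -> List[float]:
    """Euclidean distances between every pair of atoms, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def distance_cost(generated: Sequence[Point], reference: Sequence[Point]) -> float:
    """Mean absolute difference between corresponding pairwise distances."""
    if len(generated) != len(reference):
        raise ValueError("conformations must contain the same number of atoms")
    gen = pairwise_distances(generated)
    ref = pairwise_distances(reference)
    if not ref:
        return 0.0
    return sum(abs(g - r) for g, r in zip(gen, ref)) / len(ref)


# Toy three-atom "conformations" (coordinates are made up).
reference_ca = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
shifted_ca = [(x + 10.0, y, z) for x, y, z in reference_ca]   # rigid translation
collapsed_ca = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

cost_shifted = distance_cost(shifted_ca, reference_ca)
cost_collapsed = distance_cost(collapsed_ca, reference_ca)
```

Because the cost depends only on interatomic distances, a rigid translation of the whole conformation leaves it unchanged, while an error that moves one atom far from its reference position is penalized through every pair that atom participates in, including long-range ones.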
Affiliation(s)
- Tobias Lemke
- Theoretical Chemistry, University of Konstanz, 78547 Konstanz, Baden-Württemberg, Germany
- Andrej Berg
- Theoretical Chemistry, University of Konstanz, 78547 Konstanz, Baden-Württemberg, Germany
- Alok Jain
- Theoretical Chemistry, University of Konstanz, 78547 Konstanz, Baden-Württemberg, Germany; Department of Biotechnology, National Institute of Pharmaceutical Education and Research Ahmedabad, Gandhinagar, Gujarat 382355, India
- Christine Peter
- Theoretical Chemistry, University of Konstanz, 78547 Konstanz, Baden-Württemberg, Germany
|
24
|
Liu B, Li CC, Yan K. DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks. Brief Bioinform 2019; 21:1733-1741. [DOI: 10.1093/bib/bbz098] [Citation(s) in RCA: 106] [Impact Index Per Article: 21.2] [Received: 05/23/2019] [Revised: 06/27/2019] [Accepted: 07/06/2019] [Indexed: 12/30/2022] Open
Abstract
Protein fold recognition is critical for studying the structures and functions of proteins. The existing protein fold recognition approaches failed to efficiently calculate the pairwise sequence similarity scores of the proteins in the same fold sharing low sequence similarities. Furthermore, the existing feature vectorization strategies are not able to measure the global relationships among proteins from different protein folds. In this article, we proposed a new computational predictor called DeepSVM-fold for protein fold recognition by introducing a new feature vector based on the pairwise sequence similarity scores calculated from the fold-specific features extracted by deep learning networks. The feature vectors are then fed into a support vector machine to construct the predictor. Experimental results on the benchmark dataset (LE) show that DeepSVM-fold obviously outperforms all the other competing methods.
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
- Chen-Chen Li
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
|
25
|
Zheng W, Zhang C, Bell EW, Zhang Y. I-TASSER gateway: A protein structure and function prediction server powered by XSEDE. Future Generations Computer Systems : FGCS 2019; 99:73-85. [PMID: 31427836 PMCID: PMC6699767 DOI: 10.1016/j.future.2019.04.011] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Indexed: 05/12/2023]
Abstract
There is an increasing gap between the number of known protein sequences and the number of proteins with experimentally characterized structure and function. To alleviate this issue, we have developed the I-TASSER gateway, an online server for automated and reliable protein structure and function prediction. For a given sequence, I-TASSER starts with template recognition from a known structure library, followed by full-length atomic model construction by iterative assembly simulations of the continuous structural fragments excised from the template alignments. Functional insights are then derived from comparative matching of the predicted model with a library of proteins with known function. The I-TASSER pipeline has been recently integrated with the XSEDE Gateway system to accommodate pressing demand from the user community and increasing computing costs. This report summarizes the configuration of the I-TASSER Gateway with the XSEDE-Comet supercomputer cluster, together with an overview of the I-TASSER method and milestones of its development.
|
26
|
Zeng C, Zou L. An account of in silico identification tools of secreted effector proteins in bacteria and future challenges. Brief Bioinform 2019; 20:110-129. [PMID: 28981574 DOI: 10.1093/bib/bbx078] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 03/10/2017] [Indexed: 01/08/2023] Open
Abstract
Bacterial pathogens secrete numerous effector proteins via six secretion systems, type I to type VI secretion systems, to adapt to new environments or to promote virulence by bacterium-host interactions. Many computational approaches have been used in the identification of effector proteins before the subsequent experimental verification because they tolerate laborious biological procedures and are genome scale, automated and highly efficient. Prevalent examples include machine learning methods and statistical techniques. In this article, we summarize the computational progress toward predicting secreted effector proteins in bacteria, opening with an introduction to the features that are used to discriminate effectors from non-effectors. The mechanisms, contributions and deficiencies of previously developed detection tools are presented, and the tools are further benchmarked on a curated testing data set. According to the results of benchmarking, potential improvements of the prediction performance are discussed, which include (1) more informative features for discriminating the effectors from non-effectors; (2) the construction of a comprehensive training data set for the machine learning algorithms; (3) the advancement of reliable prediction methods and (4) a better interpretation of the mechanisms behind the molecular processes. The future of in silico identification of bacterial secreted effectors includes both opportunities and challenges.
Affiliation(s)
- Cong Zeng
- Bioinformatics Center, Third Military Medical University (TMMU), China
|
27
|
Khalid RR, Maryam A, Fadouloglou VE, Siddiqi AR, Zhang Y. Cryo-EM density map fitting driven in-silico structure of human soluble guanylate cyclase (hsGC) reveals functional aspects of inter-domain cross talk upon NO binding. J Mol Graph Model 2019; 90:109-119. [PMID: 31055154 PMCID: PMC7956049 DOI: 10.1016/j.jmgm.2019.04.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 02/01/2019] [Revised: 04/05/2019] [Accepted: 04/17/2019] [Indexed: 01/19/2023]
Abstract
The human soluble guanylate cyclase (hsGC) is a heterodimeric heme-containing enzyme which regulates many important physiological processes. In eukaryotes, hsGC is the only known receptor for nitric oxide (NO) signaling. Improper NO signaling results in various disease conditions such as neurodegeneration, hypertension, stroke and erectile dysfunction. To understand the mechanisms of these diseases, structure determination of the hsGC dimer complex is crucial. However, all attempts so far at experimental structure determination of the protein have been unsuccessful. The current study explores the possibility of modeling the quaternary structure of hsGC using a hybrid approach that combines state-of-the-art protein structure prediction tools with cryo-EM experimental data. The resultant 3D model shows close consistency with structural and functional insights extracted from biochemical experimental data. Overall, the atomic-level complex structure determination of hsGC helps to unveil the inter-domain communication upon NO binding, which should be useful for elucidating the biological function of this important enzyme and for developing new treatments against hsGC-associated human diseases.
Affiliation(s)
- Rana Rehan Khalid
- Department of Biosciences, COMSATS University, Islamabad, 45550, Pakistan; Department of Biostatistics and Medical Informatics, Acibadem Universitesi, Istanbul, 34752, Turkey; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109-2218, USA.
- Arooma Maryam
- Department of Biosciences, COMSATS University, Islamabad, 45550, Pakistan; Department of Pharmaceutical Chemistry, Biruni Universitesi, Istanbul, 34010, Turkey.
- Vasiliki E Fadouloglou
- Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Alexandroupolis, 68100, Greece.
- Abdul Rauf Siddiqi
- Department of Biosciences, COMSATS University, Islamabad, 45550, Pakistan.
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109-2218, USA.
|
28
|
Bordenave CD, Granados Mendoza C, Jiménez Bremont JF, Gárriz A, Rodríguez AA. Defining novel plant polyamine oxidase subfamilies through molecular modeling and sequence analysis. BMC Evol Biol 2019; 19:28. [PMID: 30665356 PMCID: PMC6341606 DOI: 10.1186/s12862-019-1361-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/24/2018] [Accepted: 01/14/2019] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND The polyamine oxidases (PAOs) catalyze the oxidative deamination of the polyamines (PAs) spermine (Spm) and spermidine (Spd). Most of the phylogenetic studies performed to analyze the plant PAO family took into account only a limited number and/or taxonomic representation of plant PAO sequences. RESULTS Here, we constructed a plant PAO protein sequence database and identified four subfamilies. Subfamily PAO back conversion 1 (PAObc1) was present in every lineage included in these analyses, suggesting that BC-type PAOs might play an important role in plants, although their precise function is unknown. Subfamily PAObc2 was exclusively present in vascular plants, suggesting that t-Spm oxidase activity might play an important role in the development of the vascular system. The only terminal catabolism (TC) PAO subfamily (subfamily PAOtc) was lost in Superasterids but was present in all other land plants. This indicated that the TC-type reactions are fundamental for land plants and that their function could be taken over by other enzymes in Superasterids. Subfamily PAObc3 was the result of a gene duplication event preceding Angiosperm diversification, followed by a gene extinction in Monocots. Differentially conserved protein motifs were found for each subfamily of plant PAOs. The automatic assignment using these motifs was found to be comparable to the assignment by rough clustering performed in this work. CONCLUSIONS The results presented in this work revealed that the plant PAO family is larger than previously thought. They also provide important background information for future structure-function and evolutionary investigations and lay a foundation for the deeper characterization of each plant PAO subfamily.
Affiliation(s)
- Cesar Daniel Bordenave
- Laboratorio de Fisiología de Estrés Abiótico en Plantas, Unidad de Biotecnología, INTECH - CONICET - UNSAM, Intendente Marino KM 8.2 - B7130IWA Chascomús, Buenos Aires, Argentina
- Carolina Granados Mendoza
- Departamento de Botánica, Instituto de Biología, Universidad Nacional Autónoma de México, Apartado Postal 70-367, Coyoacán, 04510, México City, Mexico
- Juan Francisco Jiménez Bremont
- División de Biología Molecular, Instituto Potosino de Investigación Científica y Tecnológica (IPICYT), San Luis Potosí, Mexico
- Andrés Gárriz
- Laboratorio de Fisiología de Estrés Abiótico en Plantas, Unidad de Biotecnología, INTECH - CONICET - UNSAM, Intendente Marino KM 8.2 - B7130IWA Chascomús, Buenos Aires, Argentina
- Andrés Alberto Rodríguez
- Laboratorio de Fisiología de Estrés Abiótico en Plantas, Unidad de Biotecnología, INTECH - CONICET - UNSAM, Intendente Marino KM 8.2 - B7130IWA Chascomús, Buenos Aires, Argentina.
|
29
|
Liu B, Chen J, Guo M, Wang X. Protein Remote Homology Detection and Fold Recognition Based on Sequence-Order Frequency Matrix. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:292-300. [PMID: 29990004 DOI: 10.1109/tcbb.2017.2765331] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 06/08/2023]
Abstract
Protein remote homology detection and fold recognition are two critical tasks in the study of protein structures and functions. Currently, profile-based methods achieve state-of-the-art performance in these fields. However, the widely used sequence profiles, like the position-specific frequency matrix (PSFM) and position-specific scoring matrix (PSSM), ignore the sequence-order effects along the protein sequence. In this study, we propose a novel profile, called the sequence-order frequency matrix (SOFM), to extract the sequence-order information of neighboring residues from a multiple sequence alignment (MSA). Combined with two profile feature extraction approaches, top-n-grams and the Smith-Waterman algorithm, the SOFMs are applied to protein remote homology detection and fold recognition, and two predictors called SOFM-Top and SOFM-SW are proposed. Experimental results show that SOFM contains more information than other profiles, and these two predictors outperform other state-of-the-art methods. It is anticipated that SOFM will become a very useful profile in the study of protein structures and functions.
|
30
|
Ansell BRE, Pope BJ, Georgeson P, Emery-Corbin SJ, Jex AR. Annotation of the Giardia proteome through structure-based homology and machine learning. Gigascience 2019; 8:5232230. [PMID: 30520990 PMCID: PMC6312909 DOI: 10.1093/gigascience/giy150] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 08/03/2018] [Accepted: 11/21/2018] [Indexed: 11/14/2022] Open
Abstract
Background Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures. Aims Our aim was to predict the proteome of a neglected human pathogen, Giardia duodenalis, and stratify predicted structures into high- and lower-confidence categories using a variety of metrics in isolation and combination. Methods We used the I-TASSER suite to predict structural models for ∼5,000 proteins encoded in G. duodenalis and identify their closest empirically-determined structural homologues in the Protein Data Bank. Models were assigned to high- or lower-confidence categories depending on the presence of matching protein family (Pfam) domains in query and reference peptides. Metrics output from the suite and derived metrics were assessed for their ability to predict the high-confidence category individually, and in combination through development of a random forest classifier. Results We identified 1,095 high-confidence models including 212 hypothetical proteins. Amino acid identity between query and reference peptides was the greatest individual predictor of high-confidence status; however, the random forest classifier outperformed any metric in isolation (area under the receiver operating characteristic curve = 0.976) and identified a subset of 305 high-confidence-like models, corresponding to false-positive predictions. 
High-confidence models exhibited greater transcriptional abundance, and the classifier generalized across species, indicating the broad utility of this approach for automatically stratifying predicted structures. Additional structure-based clustering was used to cross-check confidence predictions in an expanded family of Nek kinases. Several high-confidence-like proteins yielded substantial new insight into mechanisms of redox balance in G. duodenalis, a system central to the efficacy of limited anti-giardial drugs. Conclusion Structural proteomics combined with machine learning can aid genome annotation for genetically divergent organisms, including human pathogens, and stratify predicted structures to promote efficient allocation of limited resources for experimental investigation.
Affiliation(s)
- Brendan R E Ansell
- Population Health and Immunity Division, Walter & Eliza Hall Institute of Medical Research, 1G Royal Pde, Parkville, VIC 3052, Australia
- Bernard J Pope
- Melbourne Bioinformatics, 187 Grattan St, University of Melbourne, VIC 3010, Australia; Centre for Cancer Research, Victorian Comprehensive Cancer Centre, 305 Grattan St, Melbourne, VIC 3000, Australia; Department of Clinical Pathology, University of Melbourne, 305 Grattan St, Melbourne, VIC 3000, Australia; Department of Medicine, Central Clinical School, Monash University, 99 Commercial Rd, Melbourne, VIC 3004, Australia
- Peter Georgeson
- Melbourne Bioinformatics, 187 Grattan St, University of Melbourne, VIC 3010, Australia; Centre for Cancer Research, Victorian Comprehensive Cancer Centre, 305 Grattan St, Melbourne, VIC 3000, Australia; Department of Clinical Pathology, University of Melbourne, 305 Grattan St, Melbourne, VIC 3000, Australia
- Samantha J Emery-Corbin
- Population Health and Immunity Division, Walter & Eliza Hall Institute of Medical Research, 1G Royal Pde, Parkville, VIC 3052, Australia
- Aaron R Jex
- Population Health and Immunity Division, Walter & Eliza Hall Institute of Medical Research, 1G Royal Pde, Parkville, VIC 3052, Australia; Faculty of Veterinary and Agricultural Sciences, Cnr Park Drive & Flemington Rd, University of Melbourne, VIC 3010, Australia
|
31
|
Flot M, Mishra A, Kuchi AS, Hoque MT. StackSSSPred: A Stacking-Based Prediction of Supersecondary Structure from Sequence. Methods Mol Biol 2019; 1958:101-122. [PMID: 30945215 DOI: 10.1007/978-1-4939-9161-7_5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/09/2023]
Abstract
Supersecondary structure (SSS) refers to specific geometric arrangements of several secondary structure (SS) elements that are connected by loops. The SSS can provide useful information about the spatial structure and function of a protein. As such, the SSS is a bridge between the secondary structure and tertiary structure. In this chapter, we propose a stacking-based machine learning method for the prediction of two types of SSSs, namely, β-hairpins and β-α-β, from the protein sequence based on comprehensive feature encoding. To encode protein residues, we utilize key features such as solvent accessibility, conservation profile, half surface exposure, torsion angle fluctuation, disorder probabilities, and more. The usefulness of the proposed approach is assessed using a widely used threefold cross-validation technique. The obtained empirical result shows that the proposed approach is useful and prediction can be improved further.
Affiliation(s)
- Michael Flot
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Avdesh Mishra
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Aditi Sharma Kuchi
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Md Tamjidul Hoque
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
|
32
|
Sun S, Wu Q, Peng Z, Yang J. Enhanced prediction of RNA solvent accessibility with long short-term memory neural networks and improved sequence profiles. Bioinformatics 2018; 35:1686-1691. [DOI: 10.1093/bioinformatics/bty876] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 11/29/2017] [Revised: 09/11/2018] [Accepted: 10/13/2018] [Indexed: 11/14/2022] Open
Affiliation(s)
- Saisai Sun
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Qi Wu
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Zhenling Peng
- Center for Applied Mathematics, Tianjin University, Tianjin, China
- Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, China
|
33
|
Peng H, Zheng Y, Blumenstein M, Tao D, Li J. CRISPR/Cas9 cleavage efficiency regression through boosting algorithms and Markov sequence profiling. Bioinformatics 2018; 34:3069-3077. [PMID: 29672669 DOI: 10.1093/bioinformatics/bty298] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 10/10/2017] [Accepted: 04/12/2018] [Indexed: 12/26/2022] Open
Abstract
Motivation The CRISPR/Cas9 system is a widely used genome editing tool. A prediction problem of great interest for this system is how to select optimal single-guide RNAs (sgRNAs) such that their cleavage efficiency is high while the off-target effect is low. Results This work proposed a two-step averaging method (TSAM) for the regression of cleavage efficiencies of a set of sgRNAs by averaging the efficiency scores predicted by a boosting algorithm and those predicted by a support vector machine (SVM). We also proposed to use profiled Markov properties as novel features to capture the global characteristics of sgRNAs. These new features are combined with the outstanding features ranked by the boosting algorithm for the training of the SVM regressor. TSAM improved the mean Spearman correlation coefficients compared with the state-of-the-art performance on benchmark datasets containing thousands of human, mouse and zebrafish sgRNAs. Our method can also be converted to make binary distinctions between efficient and inefficient sgRNAs, with performance superior to the existing methods. The analysis reveals that highly efficient sgRNAs have a lower melting temperature at the middle of the spacer, cut at parts of the genome closer to the 5'-end and contain more 'A' but fewer 'G' compared with inefficient ones. Comprehensive further analysis also demonstrates that our tool can predict an sgRNA's cutting efficiency with consistently good performance regardless of whether it is expressed from a U6 promoter in cells or from a T7 promoter in vitro. Availability and implementation Online tool is available at http://www.aai-bioinfo.com/CRISPR/. Python and Matlab source codes are freely available at https://github.com/penn-hui/TSAM. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hui Peng
- Faculty of Engineering and Information Technology, Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW, Australia
- Yi Zheng
- Faculty of Engineering and Information Technology, Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW, Australia
- Michael Blumenstein
- Faculty of Engineering and Information Technology, Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW, Australia
- Dacheng Tao
- Faculty of Engineering and Information Technologies, School of Information Technologies, University of Sydney, Darlington, NSW, Australia
- Jinyan Li
- Faculty of Engineering and Information Technology, Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW, Australia
|
34
|
Thomas JA, Orwenyo J, Wang LX, Black LW. The Odd "RB" Phage-Identification of Arabinosylation as a New Epigenetic Modification of DNA in T4-Like Phage RB69. Viruses 2018; 10:v10060313. [PMID: 29890699 PMCID: PMC6024577 DOI: 10.3390/v10060313] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 04/30/2018] [Revised: 06/04/2018] [Accepted: 06/06/2018] [Indexed: 11/17/2022] Open
Abstract
In bacteriophages related to T4, hydroxymethylcytosine (hmC) is incorporated into the genomic DNA during DNA replication and is then further modified to glucosyl-hmC by phage-encoded glucosyltransferases. Previous studies have shown that RB69 shares a core set of genes with T4 and relatives. However, unlike the other “RB” phages, RB69 is unable to recombine its DNA with T4 or with the other “RB” isolates. In addition, despite having homologs to the T4 enzymes used to synthesize hmC, RB69 has no identified homolog to known glucosyltransferase genes. In this study we sought to understand the basis for RB69’s behavior using high-pH anion exchange chromatography (HPAEC) and mass spectrometry. Our analyses identified a novel phage epigenetic DNA sugar modification in RB69 DNA, which we have designated arabinosyl-hmC (ara-hmC). We sought a putative glucosyltransferase responsible for this novel modification and determined that RB69 also has a novel transferase gene, ORF003c, that is likely responsible for the arabinosyl-specific modification. We propose that ara-hmC was responsible for RB69 being unable to participate in genetic exchange with other hmC-containing T-even phages, and for its described incipient speciation. The RB69 ara-hmC also likely protects its DNA from some anti-phage type-IV restriction endonucleases. Several T4-related phages, such as E. coli phage JS09 and Shigella phage Shf125875, have homologs to RB69 ORF003c, suggesting the ara-hmC modification may be relatively common in T4-related phages and highlighting the importance of further work to understand the role of this modification and the biochemical pathway responsible for its production.
Affiliation(s)
- Julie A Thomas
- Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, 108 N. Greene St., Baltimore, MD 21201, USA.
- Gosnell School of Life Sciences, Rochester Institute of Technology, 85 Lomb Memorial Drive, Rochester, NY 14623, USA.
- Jared Orwenyo
- Institute of Human Virology, University of Maryland School of Medicine, 725 West Lombard Street, Baltimore, MD 21201, USA.
- Department of Chemistry and Biochemistry, University of Maryland, 8051 Regents Drive, College Park, MD 20742, USA.
- Lai-Xi Wang
- Institute of Human Virology, University of Maryland School of Medicine, 725 West Lombard Street, Baltimore, MD 21201, USA.
- Department of Chemistry and Biochemistry, University of Maryland, 8051 Regents Drive, College Park, MD 20742, USA.
- Lindsay W Black
- Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, 108 N. Greene St., Baltimore, MD 21201, USA.
|
35
|
Hafsa NE, Berjanskii MV, Arndt D, Wishart DS. Rapid and reliable protein structure determination via chemical shift threading. JOURNAL OF BIOMOLECULAR NMR 2018; 70:33-51. [PMID: 29196969 DOI: 10.1007/s10858-017-0154-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/15/2017] [Accepted: 11/14/2017] [Indexed: 06/07/2023]
Abstract
Protein structure determination using nuclear magnetic resonance (NMR) spectroscopy can be both time-consuming and labor-intensive. Here we demonstrate how chemical shift threading can permit rapid, robust, and accurate protein structure determination using only chemical shift data. Threading is a relatively old bioinformatics technique that uses a combination of sequence information and predicted (or experimentally acquired) low-resolution structural data to generate high-resolution 3D protein structures. The key motivations behind using NMR chemical shifts for protein threading lie in the fact that they are easy to measure, they are available prior to 3D structure determination, and they contain vital structural information. The method we have developed uses not only sequence and chemical shift similarity but also chemical shift-derived secondary structure, shift-derived super-secondary structure, and shift-derived accessible surface area to generate a high-quality protein structure regardless of the sequence similarity (or lack thereof) to a known structure already in the PDB. The method (called E-Thrifty) was found to be very fast (often < 10 min/structure) and to significantly outperform other shift-based or threading-based structure determination methods in terms of top template model accuracy, with an average TM-score of 0.68 (vs. 0.50-0.62 for other methods). Coupled with recent developments in chemical shift refinement, these results suggest that protein structure determination using only NMR chemical shifts is becoming increasingly practical and reliable. E-Thrifty is available as a web server at http://ethrifty.ca.
Affiliation(s)
- Noor E Hafsa
- Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
- Mark V Berjanskii
- Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G 2E9, Canada
- David Arndt
- Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G 2E9, Canada
- David S Wishart
- Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada.
- Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G 2E9, Canada.
|
36
|
Wang Z, Hardies SC, Fokine A, Klose T, Jiang W, Cho BC, Rossmann MG. Structure of the Marine Siphovirus TW1: Evolution of Capsid-Stabilizing Proteins and Tail Spikes. Structure 2017; 26:238-248.e3. [PMID: 29290487 DOI: 10.1016/j.str.2017.12.001] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 08/09/2017] [Revised: 10/16/2017] [Accepted: 12/01/2017] [Indexed: 01/08/2023]
Abstract
Marine bacteriophage TW1 belongs to the Siphoviridae family and infects Pseudoalteromonas phenolica. Mass spectrometry analysis has identified 16 different proteins in the TW1 virion. Functions of most of these proteins have been predicted by bioinformatic methods. A 3.6 Å resolution cryoelectron microscopy map of the icosahedrally averaged TW1 head showed the atomic structures of the major capsid protein, gp57∗, and the capsid-stabilizing protein, gp56. The gp57∗ structure is similar to that of the phage HK97 capsid protein. The gp56 protein has two domains, each having folds similar to that of the N-terminal part of phage λ gpD, indicating a common ancestry. The first gp56 domain clamps adjacent capsomers together, whereas the second domain is required for trimerization. A 6-fold-averaged reconstruction of the distal part of the tail showed that TW1 has six tail spikes, which are unusual for siphophages but are similar to the podophages P22 and Sf6, suggesting a common evolutionary origin of these spikes.
Affiliation(s)
- Zhiqing Wang
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA
- Stephen C Hardies
- Department of Biochemistry, The University of Texas Health Science Center, San Antonio, TX 78229, USA
- Andrei Fokine
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA
- Thomas Klose
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA
- Wen Jiang
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA
- Byung Cheol Cho
- School of Earth and Environmental Sciences and Research Institute of Oceanography, Seoul National University, Seoul 151-742, Korea
- Michael G Rossmann
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA.
|
37
|
Ali B, Desmond MI, Mallory SA, Benítez AD, Buckley LJ, Weintraub ST, Osier MV, Black LW, Thomas JA. To Be or Not To Be T4: Evidence of a Complex Evolutionary Pathway of Head Structure and Assembly in Giant Salmonella Virus SPN3US. Front Microbiol 2017; 8:2251. [PMID: 29187846 PMCID: PMC5694885 DOI: 10.3389/fmicb.2017.02251] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 08/18/2017] [Accepted: 10/31/2017] [Indexed: 11/18/2022] Open
Abstract
Giant Salmonella phage SPN3US has a 240-kb dsDNA genome and a large complex virion composed of many proteins for which the functions of most are undefined. We recently determined that SPN3US shares a core set of genes with related giant phages and sequenced and characterized 18 amber mutants to facilitate its use as a genetic model system. Notably, SPN3US and related giant phages contain a bolus of ejection proteins within their heads, including a multi-subunit virion RNA polymerase (vRNAP), that enter the host cell with the DNA during infection. In this study, we characterized the SPN3US virion using mass spectrometry to gain insight into its head composition and the features that its head shares with those of related giant phages and with T4 phage. SPN3US has only homologs to the T4 proteins critical for prohead shell formation, the portal and major capsid proteins, as well as to the major enzymes essential for head maturation, the prohead protease and large terminase subunit. Eight of ~50 SPN3US head proteins were found to undergo proteolytic processing at a cleavage motif by the prohead protease gp245. Gp245 undergoes auto-cleavage of its C-terminus, suggesting this is a conserved activation and/or maturation feature of related phage proteases. Analyses of essential head gene mutants showed that the five subunits of the vRNAP must be assembled for any subunit to be incorporated into the prohead, although the assembled vRNAP must then undergo subsequent major conformational rearrangements in the DNA packed capsid to allow ejection through the ~30 Å diameter tail tube for transcription from the injected DNA. In addition, ejection protein candidate gp243 was found to play a critical role in head assembly. 
Our analyses of the vRNAP and gp243 mutants highlighted an unexpected dichotomy in giant phage head maturation: while all analyzed giant phages have a homologous protease that processes major capsid and portal proteins, processing of ejection proteins is not always a stable/defining feature. Our identification in SPN3US, and related phages, of a diverged paralog to the prohead protease further hints toward a complicated evolutionary pathway for giant phage head structure and assembly.
Affiliation(s)
- Bazla Ali
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, United States
- Maxim I Desmond
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, United States
- Sara A Mallory
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, United States
- Andrea D Benítez
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, United States
- Larry J Buckley
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, United States
- Susan T Weintraub
- Biochemistry, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States
- Michael V Osier
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, United States
- Lindsay W Black
- University of Maryland School of Medicine, Baltimore, MD, United States
- Julie A Thomas
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, United States
|
38
|
Seppälä S, Wilken SE, Knop D, Solomon KV, O’Malley MA. The importance of sourcing enzymes from non-conventional fungi for metabolic engineering and biomass breakdown. Metab Eng 2017; 44:45-59. [DOI: 10.1016/j.ymben.2017.09.008] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 06/13/2017] [Revised: 09/16/2017] [Accepted: 09/16/2017] [Indexed: 10/18/2022]
|
39
|
Lovato P, Cristani M, Bicego M. Soft Ngram Representation and Modeling for Protein Remote Homology Detection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1482-1488. [PMID: 27483459 DOI: 10.1109/tcbb.2016.2595575] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/06/2023]
Abstract
Remote homology detection represents a central problem in bioinformatics, where the challenge is to detect functionally related proteins when their sequence similarity is low. Recent solutions employ representations derived from the sequence profile, obtained by replacing each amino acid of the sequence by the corresponding most probable amino acid in the profile. However, the information contained in the profile could be exploited more deeply, provided that there is a representation able to capture and properly model such crucial evolutionary information. In this paper, we propose a novel profile-based representation for sequences, called soft Ngram. This representation, which extends the traditional Ngram scheme (obtained by grouping N consecutive amino acids), permits considering all of the evolutionary information in the profile: this is achieved by extracting Ngrams from the whole profile, equipping them with a weight directly computed from the corresponding evolutionary frequencies. We illustrate two different approaches to model the proposed representation and to derive a feature vector, which can be effectively used for classification using a support vector machine (SVM). A thorough evaluation on three benchmarks demonstrates that the new approach outperforms other Ngram-based methods, and shows very promising results also in comparison with a broader spectrum of techniques.
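Editor's illustration (not part of the cited abstract): the soft Ngram idea described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `soft_ngram_counts`, the toy profile, and the choice of weighting each Ngram by the product of its per-position frequencies are all hypothetical; the abstract says only that weights are computed from the corresponding evolutionary frequencies.

```python
from collections import defaultdict
from itertools import product


def soft_ngram_counts(profile, n=2):
    """Accumulate weighted N-gram counts from a sequence profile.

    profile: list of dicts, one per sequence position, mapping an amino
             acid letter to its frequency at that position.
    Assumption: an N-gram's weight is the product of the frequencies of
    its residues in the n consecutive profile columns it spans.
    """
    counts = defaultdict(float)
    for start in range(len(profile) - n + 1):
        window = profile[start:start + n]
        # Keep only residues with non-zero frequency in each column.
        columns = [[(aa, f) for aa, f in col.items() if f > 0] for col in window]
        # Enumerate every residue combination across the window.
        for combo in product(*columns):
            ngram = "".join(aa for aa, _ in combo)
            weight = 1.0
            for _, freq in combo:
                weight *= freq
            counts[ngram] += weight
    return dict(counts)


# Toy three-column profile (hypothetical values).
profile = [
    {"A": 0.5, "G": 0.5},
    {"L": 1.0},
    {"K": 0.25, "R": 0.75},
]
counts = soft_ngram_counts(profile, n=2)
```

Under this weighting, each window contributes a total weight of 1 when its columns are normalized, so the resulting dictionary can be read as a soft histogram over Ngrams and passed to a classifier as a feature vector. A hard Ngram scheme would instead keep only the most probable residue per column.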
|
40
|
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017; 35:1026-1028. [PMID: 29035372 DOI: 10.1038/nbt.3988] [Citation(s) in RCA: 1486] [Impact Index Per Article: 212.3] [Indexed: 11/08/2022]
|
41
|
Li S, Chen J, Liu B. Protein remote homology detection based on bidirectional long short-term memory. BMC Bioinformatics 2017; 18:443. [PMID: 29017445 PMCID: PMC5634958 DOI: 10.1186/s12859-017-1842-2] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 07/19/2017] [Accepted: 09/21/2017] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Protein remote homology detection plays a vital role in studies of protein structures and functions. Almost all of the traditional machine learning methods require fixed-length features to represent the protein sequences. However, it is never an easy task to extract discriminative features with limited knowledge of proteins. On the other hand, deep learning techniques have demonstrated their advantage in automatically learning representations. It is worthwhile to explore the application of deep learning techniques to protein remote homology detection. RESULTS In this study, we employ Bidirectional Long Short-Term Memory (BLSTM) to learn effective features from pseudo proteins, and propose a predictor called ProDec-BLSTM, which includes an input layer, a bidirectional LSTM layer, a time-distributed dense layer, and an output layer. This neural network can automatically extract discriminative features by using the bidirectional LSTM and the time-distributed dense layer. CONCLUSION Experimental results on a widely used benchmark dataset show that ProDec-BLSTM outperforms other related methods in terms of both the mean ROC and mean ROC50 scores. This promising result shows that ProDec-BLSTM is a useful tool for protein remote homology detection. Furthermore, the hidden patterns learnt by ProDec-BLSTM can be interpreted and visualized, and therefore, additional useful information can be obtained.
Affiliation(s)
- Shumin Li
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China
| | - Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China.
| |
|
42
|
Abstract
With the widespread use of smartphones, more and more mobile apps are developed every day, playing an increasingly important role in changing our lifestyles and business models. As a result, developing effective mobile app recommender systems has become a hot research topic in both industry and academia. Compared with existing studies of mobile app recommendation, our research aims to improve recommendation effectiveness by analyzing a psychological trait of human beings, exploratory behavior, which refers to a type of variety-seeking behavior in unfamiliar domains. To this end, we propose a novel probabilistic model named Goal-oriented Exploratory Model (GEM), integrating exploratory behavior identification with personalized item recommendation. An algorithm combining collapsed Gibbs sampling and Expectation Maximization is developed for model learning and inference. In extensive experiments conducted on a real dataset, the proposed model demonstrates superior recommendation performance and good interpretability compared with state-of-the-art recommendation methods. Moreover, empirical analyses of exploratory behavior find that individuals with a strong exploratory tendency exhibit behavioral patterns of variety seeking, risk taking, and higher involvement. In addition, mobile apps that are less popular or in the long tail have greater potential to arouse exploratory behavior in individuals.
|
43
|
Genomewide Mutational Diversity in Escherichia coli Population Evolving in Prolonged Stationary Phase. mSphere 2017; 2:mSphere00059-17. [PMID: 28567442 PMCID: PMC5444009 DOI: 10.1128/msphere.00059-17] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 02/05/2017] [Accepted: 05/05/2017] [Indexed: 11/20/2022] Open
Abstract
Prolonged stationary phase is an approximation of natural environments presenting a range of stresses. Survival in prolonged stationary phase requires alternative metabolic pathways for survival. This study describes the repertoire of mutations accumulating in starving Escherichia coli populations in lysogeny broth. A wide range of mutations accumulates over the course of 1 month in stationary phase. Single nucleotide polymorphisms (SNPs) constitute 64% of all mutations. A majority of these mutations are nonsynonymous and are located at conserved loci. There is an increase in genetic diversity in the evolving populations over time. Computer simulations of evolution in stationary phase suggest that the maximum frequency of mutations observed in our experimental populations cannot be explained by neutral drift. Moreover, there is frequent genetic parallelism across populations, suggesting that these mutations are under positive selection. Finally, functional analysis of mutations suggests that regulatory mutations are frequent targets of selection.
IMPORTANCE Prolonged stationary phase in bacteria, contrary to its name, is highly dynamic, with extreme nutrient limitation as a predominant stress. Stationary-phase cultures adapt by rapidly selecting a mutation(s) that confers a growth advantage in stationary phase (GASP). The phenotypic diversity of starving E. coli populations has been studied in detail; however, only a few mutations that accumulate in prolonged stationary phase have been described. This study documented the spectrum of mutations appearing in Escherichia coli during 28 days of prolonged starvation. The genetic diversity of the population increases over time in stationary phase to an extent that cannot be explained by random, neutral drift. This suggests that prolonged stationary phase offers a great model system to study adaptive evolution by natural selection.
|
44
|
Rossi MF, Mello B, Schrago CG. Performance of Hidden Markov Models in Recovering the Standard Classification of Glycoside Hydrolases. Evol Bioinform Online 2017; 13:1176934317703401. [PMID: 28469382 PMCID: PMC5404901 DOI: 10.1177/1176934317703401] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/29/2016] [Accepted: 03/09/2017] [Indexed: 12/02/2022] Open
Abstract
Glycoside hydrolases (GHs) are carbohydrate-active enzymes that assist the hydrolysis of glycoside bonds of complex sugars into carbohydrates. The current standard GH family classification is available in the CAZy database, which is based on the similarities of amino acid sequences and curated semi-automatically. However, with the exponential increase in data availability from genome sequences, automated classification methods are required for the fast annotation of coding sequences. Currently, the dbCAN database offers automatic annotations of signature domains from CAZy-defined classifications using a statistical approach, the hidden Markov models (HMMs). However, dbCAN does not contain the entire set of CAZy GH families. Moreover, no evaluation has been conducted so far of the viability of using HMM profiles as a means of automatically assigning GH amino acid sequences to the standard CAZy GH family classification itself. In this work, we performed a meta-analysis in which amino acid sequences from CAZy-defined GH families were used to build HMM family-specific profiles. We then queried a set with ~300 000 GH sequences against our database of HMM profiles estimated from CAZy families. We conducted the same evaluation against the available dbCAN HMM profiles. Our analyses recovered 65% of matches with the standard CAZy classification, whereas dbCAN HMMs resulted in 61% of matches. We also provided an analysis of the types of errors commonly found when HMMs are used to recover CAZy-based classifications. Although the performance of HMM was good, further developments are necessary for a fully automated classification of GH, allowing the standardization of GH classification among protein databases.
Affiliation(s)
- Mariana Fonseca Rossi
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| | - Beatriz Mello
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| | - Carlos G Schrago
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| |
|
45
|
Computational Prediction of the Heterodimeric and Higher-Order Structure of gpE1/gpE2 Envelope Glycoproteins Encoded by Hepatitis C Virus. J Virol 2017; 91:JVI.02309-16. [PMID: 28148799 DOI: 10.1128/jvi.02309-16] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 11/28/2016] [Accepted: 01/25/2017] [Indexed: 12/24/2022] Open
Abstract
Despite the recent success of newly developed direct-acting antivirals against hepatitis C, the disease continues to be a global health threat due to the lack of diagnosis of most carriers and the high cost of treatment. The heterodimer formed by glycoproteins E1 and E2 within the hepatitis C virus (HCV) lipid envelope is a potential vaccine candidate and antiviral target. While the structure of E1/E2 has not yet been resolved, partial crystal structures of the E1 and E2 ectodomains have been determined. The unresolved parts of the structure are within the realm of what can be modeled with current computational modeling tools. Furthermore, a variety of additional experimental data is available to support computational predictions of E1/E2 structure, such as data from antibody binding studies, cryo-electron microscopy (cryo-EM), mutational analyses, peptide binding analysis, linker-scanning mutagenesis, and nuclear magnetic resonance (NMR) studies. In accordance with these rich experimental data, we have built an in silico model of the full-length E1/E2 heterodimer. Our model supports that E1/E2 assembles into a trimer, which was previously suggested from a study by Falson and coworkers (P. Falson, B. Bartosch, K. Alsaleh, B. A. Tews, A. Loquet, Y. Ciczora, L. Riva, C. Montigny, C. Montpellier, G. Duverlie, E. I. Pecheur, M. le Maire, F. L. Cosset, J. Dubuisson, and F. Penin, J. Virol. 89:10333-10346, 2015, https://doi.org/10.1128/JVI.00991-15). Size exclusion chromatography and Western blotting data obtained by using purified recombinant E1/E2 support our hypothesis. Our model suggests that during virus assembly, the trimer of E1/E2 may be further assembled into a pentamer, with 12 pentamers comprising a single HCV virion. We anticipate that this new model will provide a useful framework for HCV envelope structure and the development of antiviral strategies.
IMPORTANCE One hundred fifty million people have been estimated to be infected with hepatitis C virus, and many more are at risk for infection. A better understanding of the structure of the HCV envelope, which is responsible for attachment and fusion, could aid in the development of a vaccine and/or new treatments for this disease. We draw upon computational techniques to predict a full-length model of the E1/E2 heterodimer based on the partial crystal structures of the envelope glycoproteins E1 and E2. E1/E2 has been widely studied experimentally, and this provides valuable data, which has assisted us in our modeling. Our proposed structure is used to suggest the organization of the HCV envelope. We also present new experimental data from size exclusion chromatography that support our computational prediction of a trimeric oligomeric state of E1/E2.
|
46
|
Schmitt I, Lumbsch HT, Søchting U. Phylogeny of the lichen genus Placopsis and its allies based on Bayesian analyses of nuclear and mitochondrial sequences. Mycologia 2017. [DOI: 10.1080/15572536.2004.11833042] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 10/19/2022]
Affiliation(s)
| | | | - Ulrik Søchting
- Department of Mycology, Botanical Institute, University of Copenhagen, Ø. Farimagsgade 2D, DK-1353 Copenhagen K, Denmark
| |
|
47
|
Schmitt I, Mueller G, Lumbsch HT. Ascoma morphology is homoplaseous and phylogenetically misleading in some pyrenocarpous lichens. Mycologia 2017. [DOI: 10.1080/15572536.2006.11832813] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 10/19/2022]
Affiliation(s)
| | | | - H. Thorsten Lumbsch
- Department of Botany, The Field Museum, 1400 S. Lake Shore Drive, Chicago, Illinois 60605
| |
|
48
|
CATH-Gene3D: Generation of the Resource and Its Use in Obtaining Structural and Functional Annotations for Protein Sequences. Methods Mol Biol 2017; 1558:79-110. [PMID: 28150234 DOI: 10.1007/978-1-4939-6783-4_4] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Indexed: 12/05/2022]
Abstract
This chapter describes the generation of the data in the CATH-Gene3D online resource and how it can be used to study protein domains and their evolutionary relationships. Methods will be presented for: comparing protein structures, recognizing homologs, predicting domain structures within protein sequences, and subclassifying superfamilies into functionally pure families, together with a guide on using the webpages.
|
49
|
The genome of the Gulf pipefish enables understanding of evolutionary innovations. Genome Biol 2016; 17:258. [PMID: 27993155 PMCID: PMC5168715 DOI: 10.1186/s13059-016-1126-6] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 08/11/2016] [Accepted: 12/05/2016] [Indexed: 11/10/2022] Open
Abstract
Background Evolutionary origins of derived morphologies ultimately stem from changes in protein structure, gene regulation, and gene content. A well-assembled, annotated reference genome is a central resource for pursuing these molecular phenomena underlying phenotypic evolution. We explored the genome of the Gulf pipefish (Syngnathus scovelli), which belongs to family Syngnathidae (pipefishes, seahorses, and seadragons). These fishes have dramatically derived bodies and a remarkable novelty among vertebrates, the male brood pouch. Results We produce a reference genome, condensed into chromosomes, for the Gulf pipefish. Gene losses and other changes have occurred in pipefish hox and dlx clusters and in the tbx and pitx gene families, candidate mechanisms for the evolution of syngnathid traits, including an elongated axis and the loss of ribs, pelvic fins, and teeth. We measure gene expression changes in pregnant versus non-pregnant brood pouch tissue and characterize the genomic organization of duplicated metalloprotease genes (patristacins) recruited into the function of this novel structure. Phylogenetic inference using ultraconserved sequences provides an alternative hypothesis for the relationship between orders Syngnathiformes and Scombriformes. Comparisons of chromosome structure among percomorphs show that chromosome number in a pipefish ancestor became reduced via chromosomal fusions. Conclusions The collected findings from this first syngnathid reference genome open a window into the genomic underpinnings of highly derived morphologies, demonstrating that de novo production of high quality and useful reference genomes is within reach of even small research groups. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-1126-6) contains supplementary material, which is available to authorized users.
|
50
|
Abstract
Comparative protein structure modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and how to use the ModBase database of such models, and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described. © 2016 by John Wiley & Sons, Inc.
Affiliation(s)
- Benjamin Webb
- University of California at San Francisco, San Francisco, California
| | - Andrej Sali
- University of California at San Francisco, San Francisco, California
| |
|