1
Ang'ang'o LM, Herren JK, Tastan Bishop Ö. Bioinformatics analysis of the Microsporidia sp. MB genome: a malaria transmission-blocking symbiont of the Anopheles arabiensis mosquito. BMC Genomics 2024; 25:1132. [PMID: 39578727 PMCID: PMC11585130 DOI: 10.1186/s12864-024-11046-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2024] [Accepted: 11/13/2024] [Indexed: 11/24/2024] Open
Abstract
BACKGROUND The use of microsporidia as a disease-transmission-blocking tool has garnered significant attention. Microsporidia sp. MB, known for its ability to block malaria development in mosquitoes, is an optimal candidate for supplementing malaria vector control methods. This symbiont, found in Anopheles mosquitoes, can be transmitted both vertically and horizontally with minimal effects on its mosquito host. Its genome, recently sequenced from An. arabiensis, comprises a compact 5.9 Mbp. RESULTS Here, we analyze the Microsporidia sp. MB genome, highlighting its major genomic features, gene content, and protein function. The genome contains 2247 genes, predominantly encoding enzymes. Unlike other members of the Enterocytozoonida group, Microsporidia sp. MB has retained most of the genes in the glycolytic pathway. Genes involved in RNA interference (RNAi) were also identified, suggesting a mechanism for host immune suppression. Importantly, meiosis-related genes (MRG) were detected, indicating potential for sexual reproduction in this organism. Comparative analyses revealed similarities with its closest relative, Vittaforma corneae, despite key differences in host interactions. CONCLUSION This study provides an in-depth analysis of the newly sequenced Microsporidia sp. MB genome, uncovering its unique adaptations for intracellular parasitism, including retention of essential metabolic pathways and RNAi machinery. The identification of MRGs suggests the possibility of sexual reproduction, offering insights into the symbiont's evolutionary strategies. Establishing a reference genome for Microsporidia sp. MB sets the foundation for future studies on its role in malaria transmission dynamics and host-parasite interactions.
Affiliation(s)
- Lilian Mbaisi Ang'ang'o
  - Department of Biochemistry, Microbiology, and Bioinformatics, Research Unit in Bioinformatics (RUBi), Rhodes University, Makhanda, 6140, South Africa
  - International Centre of Insect Physiology and Ecology (icipe), P.O. Box 30772-00100, Nairobi, Kenya
- Jeremy Keith Herren
  - International Centre of Insect Physiology and Ecology (icipe), P.O. Box 30772-00100, Nairobi, Kenya
- Özlem Tastan Bishop
  - Department of Biochemistry, Microbiology, and Bioinformatics, Research Unit in Bioinformatics (RUBi), Rhodes University, Makhanda, 6140, South Africa
2
Bordin N, Scholes H, Rauer C, Roca-Martínez J, Sillitoe I, Orengo C. Clustering protein functional families at large scale with hierarchical approaches. Protein Sci 2024; 33:e5140. [PMID: 39145441 PMCID: PMC11325189 DOI: 10.1002/pro.5140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Revised: 07/22/2024] [Accepted: 07/24/2024] [Indexed: 08/16/2024]
Abstract
Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or the AlphaFold Database require more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, supporting future studies in protein function and evolution.
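The idea of forming a relationship tree from a distance matrix can be illustrated with a generic average-linkage agglomeration. This is only a minimal sketch of the general technique, not the CATH-eMMA, MARC, or FRAN algorithm; the function name `average_linkage_tree` and the toy distances are assumptions made for illustration.

```python
from itertools import combinations


def average_linkage_tree(labels, dist):
    """Build a binary tree (nested tuples) by average-linkage agglomeration.

    labels: list of leaf names.
    dist:   symmetric distance matrix (list of lists), e.g. derived from
            embedding or structure-comparison distances (assumed input).
    """
    # Each cluster stores (subtree, list of member leaf indices).
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)

    def mean_dist(a, b):
        # Mean of all pairwise leaf distances between clusters a and b.
        members_a, members_b = clusters[a][1], clusters[b][1]
        total = sum(dist[i][j] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Merge the two clusters that are closest on average.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: mean_dist(*pair))
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1

    return next(iter(clusters.values()))[0]


# Hypothetical toy distances between four domains.
toy_labels = ["A", "B", "C", "D"]
toy_dist = [
    [0, 1, 4, 5],
    [1, 0, 4, 5],
    [4, 4, 0, 2],
    [5, 5, 2, 0],
]
tree = average_linkage_tree(toy_labels, toy_dist)
```

Cutting such a tree at a chosen depth yields candidate clusters; how the published methods choose where to cut is not described in the abstract and is not modelled here.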
Affiliation(s)
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Harry Scholes
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Clemens Rauer
  - Institute of Structural and Molecular Biology, University College London, London, UK
  - Universidad Autonoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, Spain
- Joel Roca-Martínez
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London, UK
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London, UK
3
Nallathambi P, Umamaheswari C, Reddy B, Aarthy B, Javed M, Ravikumar P, Watpade S, Kashyap PL, Boopalakrishnan G, Kumar S, Sharma A, Kumar A. Deciphering the Genomic Landscape and Virulence Mechanisms of the Wheat Powdery Mildew Pathogen Blumeria graminis f. sp. tritici Wtn1: Insights from Integrated Genome Assembly and Conidial Transcriptomics. J Fungi (Basel) 2024; 10:267. [PMID: 38667938 PMCID: PMC11051031 DOI: 10.3390/jof10040267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Revised: 03/16/2024] [Accepted: 03/19/2024] [Indexed: 04/28/2024] Open
Abstract
A high-quality genome sequence from an Indian isolate of Blumeria graminis f. sp. tritici Wtn1, a persistent threat in wheat farming, was obtained using a hybrid method. The assembly of over 9.24 million DNA-sequence reads resulted in 93 contigs, totaling a 140.61 Mb genome size, potentially encoding 8480 genes. Notably, more than 73.80% of the genome, spanning approximately 102.14 Mb, comprises retro-elements, LTR elements, and P elements, influencing evolution and adaptation significantly. The phylogenomic analysis placed B. graminis f. sp. tritici Wtn1 in a distinct monocot-infecting clade. A total of 583 tRNA anticodon sequences were identified from the whole genome of the native virulent strain B. graminis f. sp. tritici, which comprises distinct genome features with high counts of tRNA anticodons for leucine (70), cysteine (61), alanine (58), and arginine (45), with only two stop codons (Opal and Ochre) present and the absence of the Amber stop codon. Comparative InterProScan analysis unveiled "shared and unique" proteins in B. graminis f. sp. tritici Wtn1. Identified were 7707 protein-encoding genes, annotated to different categories such as 805 effectors, 156 CAZymes, 6102 orthologous proteins, and 3180 distinct protein families (PFAMs). Among the effectors, genes like Avra10, Avrk1, Bcg-7, BEC1005, CSEP0105, CSEP0162, BEC1016, BEC1040, and HopI1 closely linked to pathogenesis and virulence were recognized. Transcriptome analysis highlighted abundant proteins associated with RNA processing and modification, post-translational modification, protein turnover, chaperones, and signal transduction. Examining the Environmental Information Processing Pathways in B. graminis f. sp. tritici Wtn1 revealed 393 genes across 33 signal transduction pathways. The key pathways included yeast MAPK signaling (53 genes), mTOR signaling (38 genes), PI3K-Akt signaling (23 genes), and AMPK signaling (21 genes). 
Additionally, pathways like FoxO, Phosphatidylinositol, the two-component system, and Ras signaling showed significant gene representation, each with 15-16 genes, key SNPs, and Indels in specific chromosomes highlighting their relevance to environmental responses and pathotype evolution. The SNP and InDel analysis resulted in about 3.56 million variants, including 3.45 million SNPs, 5050 insertions, and 5651 deletions within the whole genome of B. graminis f. sp. tritici Wtn1. These comprehensive genome and transcriptome datasets serve as crucial resources for understanding the pathogenicity, virulence effectors, retro-elements, and evolutionary origins of B. graminis f. sp. tritici Wtn1, aiding in developing robust strategies for the effective management of wheat powdery mildew.
Affiliation(s)
- Perumal Nallathambi
  - ICAR-Indian Agricultural Research Institute, Regional Station, Wellington 643231, Tamil Nadu, India; (P.N.); (C.U.); (B.A.); (P.R.)
- Chandrasekaran Umamaheswari
  - ICAR-Indian Agricultural Research Institute, Regional Station, Wellington 643231, Tamil Nadu, India; (P.N.); (C.U.); (B.A.); (P.R.)
- Bhaskar Reddy
  - ICAR-Indian Agricultural Research Institute, Pusa Campus, New Delhi 110012, Delhi, India; (M.J.); (G.B.)
- Balakrishnan Aarthy
  - ICAR-Indian Agricultural Research Institute, Regional Station, Wellington 643231, Tamil Nadu, India; (P.N.); (C.U.); (B.A.); (P.R.)
- Mohammed Javed
  - ICAR-Indian Agricultural Research Institute, Pusa Campus, New Delhi 110012, Delhi, India; (M.J.); (G.B.)
- Priya Ravikumar
  - ICAR-Indian Agricultural Research Institute, Regional Station, Wellington 643231, Tamil Nadu, India; (P.N.); (C.U.); (B.A.); (P.R.)
- Santosh Watpade
  - ICAR-Indian Agricultural Research Institute, Regional Station, Shimla 171004, Himachal Pradesh, India
- Prem Lal Kashyap
  - ICAR-Indian Institute of Wheat and Barley Research, Karnal 132001, Haryana, India; (P.L.K.); (S.K.); (A.S.)
- Sudheer Kumar
  - ICAR-Indian Institute of Wheat and Barley Research, Karnal 132001, Haryana, India; (P.L.K.); (S.K.); (A.S.)
- Anju Sharma
  - ICAR-Indian Institute of Wheat and Barley Research, Karnal 132001, Haryana, India; (P.L.K.); (S.K.); (A.S.)
- Aundy Kumar
  - ICAR-Indian Agricultural Research Institute, Pusa Campus, New Delhi 110012, Delhi, India; (M.J.); (G.B.)
4
Kilinc M, Jia K, Jernigan RL. Improved global protein homolog detection with major gains in function identification. Proc Natl Acad Sci U S A 2023; 120:e2211823120. [PMID: 36827259 PMCID: PMC9992864 DOI: 10.1073/pnas.2211823120] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 07/09/2022] [Accepted: 01/20/2023] [Indexed: 02/25/2023] Open
Abstract
There are several hundred million protein sequences, but the relationships among them are not fully available from existing homolog detection methods. There is an essential need for an improved method to push homolog detection to lower levels of sequence identity. The method used here relies on a language model to represent proteins numerically in a matrix (an embedding) and uses discrete cosine transforms to compress the data to extract the most essential part, significantly reducing the data size. This PRotein Ortholog Search Tool (PROST) is significantly faster with linear runtimes, and most importantly, computes the distances between pairs of protein sequences to yield homologs at significantly lower levels of sequence identity than previously. The extent of allosteric effects in proteins points out the importance of global aspects of structure and sequence. PROST excels at global homology detection but not at detecting local homologs. Results are validated by strong similarities between the corresponding pairs of structures. The number of remote homologs detected increased significantly and pushes the effective sequence matches more deeply into the twilight zone. Human protein sequences presently having no assigned function now find significant numbers of putative homologs for 93% of cases and structurally verified assigned functions for 76.4% of these cases. The data compression enables massive searches for homologs with short search times while yielding significant gains in the numbers of remote homologs detected. The method is sufficiently efficient to permit whole-genome/proteome comparisons. The PROST web server is accessible at https://mesihk.github.io/prost.
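The compression step described above (a per-residue embedding matrix reduced with discrete cosine transforms, then compared by distance) can be sketched as follows. This is an illustrative sketch only, not the PROST implementation: the 1/n scaling, the number of retained coefficients, the Manhattan distance, and all function names are assumptions.

```python
import math


def dct2(signal):
    """DCT-II of a 1-D sequence, scaled by 1/n so coefficient 0 is the mean."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)) / n
        for k in range(n)
    ]


def compress_embedding(matrix, keep_rows, keep_cols):
    """Apply a 2-D DCT and keep only the low-frequency corner.

    matrix: residues x features embedding (list of lists); hypothetical input.
    Returns a flat vector of length keep_rows * keep_cols, so proteins of
    different lengths map to vectors of the same size.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if keep_rows > n_rows or keep_cols > n_cols:
        raise ValueError("cannot keep more coefficients than the matrix has")
    # Transform along the feature axis, then along the sequence axis.
    by_row = [dct2(row) for row in matrix]
    by_col = [dct2([by_row[i][j] for i in range(n_rows)])
              for j in range(n_cols)]
    # by_col[j][k]: feature frequency j, sequence frequency k.
    return [by_col[j][k] for k in range(keep_rows) for j in range(keep_cols)]


def manhattan(u, v):
    """L1 distance between two compressed vectors (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Two toy "embeddings" of different lengths compress to the same size.
short = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
longer = short + [[4.0, 5.0, 6.0], [5.0, 6.0, 7.0]]
vec_short = compress_embedding(short, 2, 2)
vec_long = compress_embedding(longer, 2, 2)
distance = manhattan(vec_short, vec_long)
```

Keeping only low-frequency coefficients discards fine positional detail, which is consistent with the abstract's remark that the approach suits global rather than local homology.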
Affiliation(s)
- Mesih Kilinc
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011
- Kejue Jia
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011
- Robert L. Jernigan
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011
5
Ang’ang’o LM, Herren JK, Tastan Bishop Ö. Structural and Functional Annotation of Hypothetical Proteins from the Microsporidia Species Vittaforma corneae ATCC 50505 Using in silico Approaches. Int J Mol Sci 2023; 24:3507. [PMID: 36834914 PMCID: PMC9960886 DOI: 10.3390/ijms24043507] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/20/2022] [Revised: 01/25/2023] [Accepted: 02/06/2023] [Indexed: 02/12/2023] Open
Abstract
Microsporidia are spore-forming eukaryotes that are related to fungi but have unique traits that set them apart. They have compact genomes as a result of evolutionary gene loss associated with their complete dependency on hosts for survival. Despite having a relatively small number of genes, a disproportionately high percentage of the genes in microsporidia genomes code for proteins whose functions remain unknown (hypothetical proteins-HPs). Computational annotation of HPs has become a more efficient and cost-effective alternative to experimental investigation. This research developed a robust bioinformatics annotation pipeline of HPs from Vittaforma corneae, a clinically important microsporidian that causes ocular infections in immunocompromised individuals. Here, we describe various steps to retrieve sequences and homologs and to carry out physicochemical characterization, protein family classification, identification of motifs and domains, protein-protein interaction network analysis, and homology modelling using a variety of online resources. Classification of protein families produced consistent findings across platforms, demonstrating the accuracy of annotation utilizing in silico methods. A total of 162 out of 2034 HPs were fully annotated, with the bulk of them categorized as binding proteins, enzymes, or regulatory proteins. The protein functions of several HPs from Vittaforma corneae were accurately inferred. This improved our understanding of microsporidian HPs despite challenges related to the obligate nature of microsporidia, the absence of fully characterized genes, and the lack of homologous genes in other systems.
Affiliation(s)
- Lilian Mbaisi Ang’ang’o
  - Research Unit in Bioinformatics (RUBi), Department of Biochemistry and Microbiology, Rhodes University, Makhanda 6140, South Africa
- Jeremy Keith Herren
  - International Centre of Insect Physiology and Ecology (icipe), Nairobi P.O. Box 30772-00100, Kenya
- Özlem Tastan Bishop
  - Research Unit in Bioinformatics (RUBi), Department of Biochemistry and Microbiology, Rhodes University, Makhanda 6140, South Africa
6
AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms. Commun Biol 2023; 6:160. [PMID: 36755055 PMCID: PMC9908985 DOI: 10.1038/s42003-023-04488-9] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 07/06/2022] [Accepted: 01/16/2023] [Indexed: 02/10/2023] Open
Abstract
Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis of 618 of these, each having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increase the number of unique 'global' folds by 36%, and will provide valuable insights into structure-function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.
7
Children's Neurological Status Epilepticus and Poor Prognostic Factors through Electroencephalogram Image under Composite Domain Analysis Algorithm. Journal of Healthcare Engineering 2021; 2021:8201363. [PMID: 34868532 PMCID: PMC8639250 DOI: 10.1155/2021/8201363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2021] [Revised: 09/29/2021] [Accepted: 09/30/2021] [Indexed: 11/17/2022]
Abstract
This study aimed to analyze the application of a composite domain analysis algorithm to electroencephalogram (EEG) images of children with epilepsy and to investigate the risk factors related to poor prognosis. Seventy children with neurological epilepsy admitted to the hospital were selected as the study subjects. The EEGs of the children during the intermittent and seizure phases of epilepsy were collected to establish a composite domain analysis algorithm model, which was then applied in EEG analysis. The clinical disease type and prognosis of the children were statistically analyzed, and the risk factors that affected prognosis were investigated. The results showed that the EEG signal values of the detail coefficients (d51 and d52) and the approximate coefficient (c5) during the epileptic seizure period were markedly higher than those of the epileptic intermittent period; the EEG signal of the epileptic intermittent period was a transient waveform, which appeared as sharp waves or spikes. The EEG signal of epileptic seizures was continuous, with a composite waveform of sharp waves and spikes, and the change in amplitude of the wavelet envelope spectrum during epileptic seizures was also much higher than that of the intermittent period. The accurate identification rate, specificity, and sensitivity of EEG analysis with the composite domain algorithm were higher than those without the algorithm. Among the five types of epileptic seizures in children, the proportion of systemic tonic-clonic status was the largest, and the proportion of myoclonic status was equal to that of complex partial epileptic status, both of which were relatively small. The proportion of children with a better prognosis was 75.71% (53/70), which was higher than that of children with a poor prognosis, 24.29% (17/70).
Abnormal imaging examination (odds ratio (OR) = 3.823 and 95% confidence interval (CI) = 1.643–8.897); seizure duration greater than 1 hour (OR = 1.855 and 95% CI = 1.076–3.199); C-reactive protein (CRP) (OR = 5.089 and 95% CI = 1.507–17.187); and abnormal blood glucose (OR = 3.077 and 95% CI = 1.640–5.773) were all independent risk factors for poor prognosis (all P < 0.05). The composite domain analysis algorithm helped clinicians find the difference in the EEG signals between the epileptic seizure period and the epileptic intermittent period in a short time, thereby improving the doctor's analysis of the results, which reflects its marked superiority. In addition, abnormal imaging examinations, convulsion duration greater than 1 hour, CRP, and abnormal blood glucose were independent risk factors for poor prognosis in children. Therefore, the impact of related risk factors could be reduced clinically through prognostic review with medical advice, attention to food safety and hygiene, and improvement of children's immunity.
8
Music of metagenomics-a review of its applications, analysis pipeline, and associated tools. Funct Integr Genomics 2021; 22:3-26. [PMID: 34657989 DOI: 10.1007/s10142-021-00810-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 09/25/2021] [Accepted: 10/03/2021] [Indexed: 10/20/2022]
Abstract
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper reinforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
9
Vázquez-Campos X, Kinsela AS, Bligh MW, Payne TE, Wilkins MR, Waite TD. Genomic Insights Into the Archaea Inhabiting an Australian Radioactive Legacy Site. Front Microbiol 2021; 12:732575. [PMID: 34737728 PMCID: PMC8561730 DOI: 10.3389/fmicb.2021.732575] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/29/2021] [Accepted: 09/21/2021] [Indexed: 11/29/2022] Open
Abstract
During the 1960s, small quantities of radioactive materials were co-disposed with chemical waste at the Little Forest Legacy Site (LFLS, Sydney, Australia). The microbial function and population dynamics in a waste trench during a rainfall event have been previously investigated revealing a broad abundance of candidate and potentially undescribed taxa in this iron-rich, radionuclide-contaminated environment. Applying genome-based metagenomic methods, we recovered 37 refined archaeal MAGs, mainly from undescribed DPANN Archaea lineages without standing in nomenclature and 'Candidatus Methanoperedenaceae' (ANME-2D). Within the undescribed DPANN, the newly proposed orders 'Ca. Gugararchaeales', 'Ca. Burarchaeales' and 'Ca. Anstonellales', constitute distinct lineages with a more comprehensive central metabolism and anabolic capabilities within the 'Ca. Micrarchaeota' phylum compared to most other DPANN. The analysis of new and extant 'Ca. Methanoperedens spp.' MAGs suggests metal ions as the ancestral electron acceptors during the anaerobic oxidation of methane while the respiration of nitrate/nitrite via molybdopterin oxidoreductases would have been a secondary acquisition. The presence of genes for the biosynthesis of polyhydroxyalkanoates in most 'Ca. Methanoperedens' also appears to be a widespread characteristic of the genus for carbon accumulation. This work expands our knowledge about the roles of the Archaea at the LFLS, especially, DPANN Archaea and 'Ca. Methanoperedens', while exploring their diversity, uniqueness, potential role in elemental cycling, and evolutionary history.
Affiliation(s)
- Xabier Vázquez-Campos
  - NSW Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, The University of New South Wales, Sydney, NSW, Australia
- Andrew S. Kinsela
  - UNSW Water Research Centre, School of Civil and Environmental Engineering, The University of New South Wales, Sydney, NSW, Australia
- Mark W. Bligh
  - UNSW Water Research Centre, School of Civil and Environmental Engineering, The University of New South Wales, Sydney, NSW, Australia
- Timothy E. Payne
  - Environmental Research Theme, Australian Nuclear Science and Technology Organisation, Kirrawee DC, NSW, Australia
- Marc R. Wilkins
  - NSW Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, The University of New South Wales, Sydney, NSW, Australia
- T. David Waite
  - UNSW Water Research Centre, School of Civil and Environmental Engineering, The University of New South Wales, Sydney, NSW, Australia
10
Tassia MG, David KT, Townsend JP, Halanych KM. TIAMMAt: Leveraging biodiversity to revise protein domain models, evidence from innate immunity. Mol Biol Evol 2021; 38:5806-5818. [PMID: 34459919 PMCID: PMC8662601 DOI: 10.1093/molbev/msab258] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023] Open
Abstract
Sequence annotation is fundamental for studying the evolution of protein families, particularly when working with nonmodel species. Given the rapid, ever-increasing number of species receiving high-quality genome sequencing, accurate domain modeling that is representative of species diversity is crucial for understanding protein family sequence evolution and their inferred function(s). Here, we describe a bioinformatic tool called Taxon-Informed Adjustment of Markov Model Attributes (TIAMMAt) which revises domain profile hidden Markov models (HMMs) by incorporating homologous domain sequences from underrepresented and nonmodel species. Using innate immunity pathways as a case study, we show that revising profile HMM parameters to directly account for variation in homologs among underrepresented species provides valuable insight into the evolution of protein families. Following adjustment by TIAMMAt, domain profile HMMs exhibit changes in their per-site amino acid state emission probabilities and insertion/deletion probabilities while maintaining the overall structure of the consensus sequence. Our results show that domain revision can heavily impact evolutionary interpretations for some families (i.e., NLR’s NACHT domain), whereas impact on other domains (e.g., rel homology domain and interferon regulatory factor domains) is minimal due to high levels of sequence conservation across the sampled phylogenetic depth (i.e., Metazoa). Importantly, TIAMMAt revises target domain models to reflect homologous sequence variation using the taxonomic distribution under consideration by the user. TIAMMAt’s flexibility to revise any subset of the Pfam database using a user-defined taxonomic pool will make it a valuable tool for future protein evolution studies, particularly when incorporating (or focusing) on nonmodel species.
Affiliation(s)
- Michael G Tassia
  - Department of Biological Sciences, Auburn University, Auburn, Alabama
- Kyle T David
  - Department of Biological Sciences, Auburn University, Auburn, Alabama
- James P Townsend
  - Whitman Center, Marine Biological Laboratory, Woods Hole, Massachusetts
  - Department of Biology, Providence College, Providence, Rhode Island
11
Queirós P, Delogu F, Hickl O, May P, Wilmes P. Mantis: flexible and consensus-driven genome annotation. Gigascience 2021; 10:giab042. [PMID: 34076241 PMCID: PMC8170692 DOI: 10.1093/gigascience/giab042] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 11/04/2020] [Revised: 03/22/2021] [Accepted: 05/14/2021] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND The rapid development of the (meta-)omics fields has produced an unprecedented amount of high-resolution and high-fidelity data. Through the use of these datasets we can infer the role of previously functionally unannotated proteins from single organisms and consortia. In this context, protein function annotation can be described as the identification of regions of interest (i.e., domains) in protein sequences and the assignment of biological functions. Despite the existence of numerous tools, challenges remain in terms of speed, flexibility, and reproducibility. In the big data era, it is also increasingly important to cease limiting our findings to a single reference, coalescing knowledge from different data sources, and thus overcoming some limitations in overly relying on computationally generated data from single sources. RESULTS We implemented a protein annotation tool, Mantis, which uses database identifiers intersection and text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output. Mantis is flexible, allowing for the customization of reference data and execution parameters, and is reproducible across different research goals and user environments. We implemented a depth-first search algorithm for domain-specific annotation, which significantly improved annotation performance compared to sequence-wide annotation. The parallelized implementation of Mantis results in short runtimes while also outputting high coverage and high-quality protein function annotations. CONCLUSIONS Mantis is a protein function annotation tool that produces high-quality consensus-driven protein annotations. It is easy to set up, customize, and use, scaling from single genomes to large metagenomes. Mantis is available under the MIT license at https://github.com/PedroMTQ/mantis.
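A depth-first search over domain hits, as mentioned in the abstract above, might look like the following sketch. It enumerates combinations of non-overlapping hits on one protein and keeps the highest-scoring combination. This is a hypothetical illustration and not the Mantis code: the hit format `(start, end, score)`, the additive scoring, and the function name are all assumptions.

```python
def best_nonoverlapping(hits):
    """Pick the non-overlapping set of hits with the highest summed score.

    hits: list of (start, end, score) tuples with inclusive coordinates
          (assumed format for illustration).
    Returns (total_score, chosen_hits), with chosen_hits ordered by start.
    """
    ordered = sorted(hits)          # by start, then end
    best = (0.0, [])

    def dfs(index, last_end, score, chosen):
        nonlocal best
        # Record the current combination if it beats the best seen so far.
        if score > best[0]:
            best = (score, list(chosen))
        # Try extending the combination with each later hit in turn.
        for i in range(index, len(ordered)):
            start, end, hit_score = ordered[i]
            if start > last_end:    # does not overlap the previous pick
                chosen.append(ordered[i])
                dfs(i + 1, end, score + hit_score, chosen)
                chosen.pop()

    dfs(0, float("-inf"), 0.0, [])
    return best


# Toy example: the middle hit scores highest alone, but the two
# flanking hits together score more and do not overlap each other.
toy_hits = [(1, 50, 10.0), (40, 90, 12.0), (60, 120, 9.0)]
total, chosen = best_nonoverlapping(toy_hits)
```

The search is exhaustive, so it suits the handful of hits typically found on a single protein; a real tool would likely add pruning or different scoring.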
Affiliation(s)
- Pedro Queirós
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Francesco Delogu
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Oskar Hickl
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Paul Wilmes
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
12
An improved high-quality genome assembly and annotation of Tibetan hulless barley. Sci Data 2020; 7:139. [PMID: 32385314 PMCID: PMC7210891 DOI: 10.1038/s41597-020-0480-0] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 06/05/2019] [Accepted: 04/03/2020] [Indexed: 12/28/2022] Open
Abstract
Hulless barley (Hordeum vulgare L. var. nudum) is a barley variety with a loose husk cover on the caryopses. Because of its ease of processing and edibility, hulless barley has been locally cultivated and used as human food. For example, on the Tibetan Plateau, hulless barley is the staple food for humans and an essential livestock feed. Although the draft genome of hulless barley has been sequenced, the assembly remains fragmented. Here, we report an improved high-quality assembly and annotation of the Tibetan hulless barley genome using more than 67X PacBio long reads. The N50 contig length of the new assembly is more than 19 times larger than that of other available barley assemblies. The new genome assembly also shows high gene completeness and high genome collinearity with the previously reported barley genome. The new genome assembly and annotation will not only remove major hurdles in genetic analysis and breeding of hulless barley, but will also serve as a key resource for studying barley genomics and genetics.
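For readers unfamiliar with the N50 statistic used above, it is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. A minimal sketch of that standard definition follows; the function name and the toy lengths are illustrative and are not taken from the barley assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the cumulative length reaches at least half of the
    total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of contigs")


# Toy contig lengths (arbitrary units), total 54: 10 + 9 + 8 = 27 reaches half.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

A larger N50 for the same total length means the assembly is made of fewer, longer pieces, which is why the abstract reports it as a measure of contiguity.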
|
13
|
Ugarte A, Vicedomini R, Bernardes J, Carbone A. A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling. Microbiome 2018; 6:149. [PMID: 30153857 PMCID: PMC6114274 DOI: 10.1186/s40168-018-0532-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 02/27/2018] [Accepted: 08/13/2018] [Indexed: 05/23/2023]
Abstract
BACKGROUND Biochemical and regulatory pathways have until recently been thought of and modelled within one cell type, one organism and one species. This vision is being dramatically changed by the advent of whole microbiome sequencing studies, which reveal the role of symbiotic microbial populations in fundamental biochemical functions. The new landscape we face requires the reconstruction of biochemical and regulatory pathways at the community level in a given environment. To understand how environmental factors affect the genetic material and the dynamics of expression from one environment to another, we want to evaluate the quantity of protein-coding gene sequences or transcripts associated with a given pathway by precisely estimating the abundance of protein domains, including their weak presence or absence, in environmental samples. RESULTS MetaCLADE is a novel profile-based domain annotation pipeline based on a multi-source domain annotation strategy. It applies directly to reads and improves identification of the catalog of functions in microbiomes. MetaCLADE is applied to simulated data and to more than ten metagenomic and metatranscriptomic datasets from different environments, where it outperforms InterProScan in the number of annotated domains. It is compared to the state-of-the-art non-profile-based and profile-based methods, UProC and HMM-GRASPx, showing complementary predictions to UProC. A combination of MetaCLADE and UProC further improves the functional annotation of environmental samples. CONCLUSIONS Learning about the functional activity of environmental microbial communities is a crucial step towards understanding microbial interactions and large-scale environmental impact. MetaCLADE has been explicitly designed for metagenomic and metatranscriptomic data and allows for the discovery of patterns in divergent sequences, thanks to its multi-source strategy. MetaCLADE substantially improves on current domain annotation methods and reaches a fine degree of accuracy in the annotation of very different environments such as soil and marine ecosystems, ancient metagenomes and human tissues.
Affiliation(s)
- Ari Ugarte
- Sorbonne Université, UPMC-Univ P6, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 Place Jussieu, Paris, 75005 France
| | - Riccardo Vicedomini
- Sorbonne Université, UPMC-Univ P6, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 Place Jussieu, Paris, 75005 France
- Sorbonne Université, UPMC-Univ P6, CNRS, Institut des Sciences du Calcul et des Donnees, 4 Place Jussieu, Paris, 75005 France
| | - Juliana Bernardes
- Sorbonne Université, UPMC-Univ P6, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 Place Jussieu, Paris, 75005 France
| | - Alessandra Carbone
- Sorbonne Université, UPMC-Univ P6, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 Place Jussieu, Paris, 75005 France
- Institut Universitaire de France, Paris, 75005 France
|
14
|
Raimondi D, Orlando G, Moreau Y, Vranken WF. Ultra-fast global homology detection with Discrete Cosine Transform and Dynamic Time Warping. Bioinformatics 2018; 34:3118-3125. [DOI: 10.1093/bioinformatics/bty309] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 12/06/2017] [Accepted: 04/18/2018] [Indexed: 11/14/2022]
Affiliation(s)
- Daniele Raimondi
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- ESAT-STADIUS, KU Leuven, Leuven, Belgium
- Machine Learning Group, Université Libre De Bruxelles, Brussels, Belgium
| | - Gabriele Orlando
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Machine Learning Group, Université Libre De Bruxelles, Brussels, Belgium
| | - Yves Moreau
- ESAT-STADIUS, KU Leuven, Leuven, Belgium
- Imec, Leuven, Belgium
| | - Wim F Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
|
15
|
Holliday GL, Brown SD, Akiva E, Mischel D, Hicks MA, Morris JH, Huang CC, Meng EC, Pegg SCH, Ferrin TE, Babbitt PC. Biocuration in the structure-function linkage database: the anatomy of a superfamily. Database: The Journal of Biological Databases and Curation 2017; 2017:3074783. [PMID: 28365730 PMCID: PMC5467563 DOI: 10.1093/database/bax006] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 10/28/2016] [Accepted: 01/23/2017] [Indexed: 12/11/2022]
Abstract
With ever-increasing amounts of sequence data available in both the primary literature and sequence repositories, there is a bottleneck in annotating molecular function to a sequence. This article describes the biocuration process and methods used in the structure-function linkage database (SFLD) to help address some of the challenges. We discuss how the hierarchy within the SFLD allows us to infer detailed functional properties for functionally diverse enzyme superfamilies in which all members are homologous, conserve an aspect of their chemical function and have associated conserved structural features that enable the chemistry. Also presented is the Enzyme Structure-Function Ontology (ESFO), which has been designed to capture the relationships between enzyme sequence, structure and function that underlie the SFLD and is used to guide the biocuration processes within the SFLD. Database URL: http://sfld.rbvi.ucsf.edu/
Affiliation(s)
- Gemma L Holliday
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94143, USA
| | - Shoshana D Brown
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94143, USA
| | - Eyal Akiva
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94143, USA
| | - David Mischel
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94143, USA
| | - Michael A Hicks
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94143, USA; Human Longevity, Inc, San Diego, CA 92121, USA
| | - John H Morris
- Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, CA 94143, USA
| | - Conrad C Huang
- Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, CA 94143, USA
| | - Elaine C Meng
- Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, CA 94143, USA
| | - Thomas E Ferrin
- Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, CA 94143, USA; California Institute for Quantitative Biosciences, University of California, San Francisco, CA 94158, USA
| | - Patricia C Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94143, USA; Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, CA 94143, USA; California Institute for Quantitative Biosciences, University of California, San Francisco, CA 94158, USA
|
16
|
Ziehm M, Kaur S, Ivanov DK, Ballester PJ, Marcus D, Partridge L, Thornton JM. Drug repurposing for aging research using model organisms. Aging Cell 2017. [PMID: 28620943 PMCID: PMC5595691 DOI: 10.1111/acel.12626] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 02/03/2023]
Abstract
Many increasingly prevalent diseases share a common risk factor: age. However, little is known about pharmaceutical interventions against aging, even though many genes and pathways have been shown to be important in the aging process and numerous studies demonstrate that genetic interventions can lead to a healthier aging phenotype. An important challenge is to assess the potential to repurpose existing drugs for initial testing on model organisms, where such experiments are possible. To this end, we present a new approach to rank drug-like compounds with known mammalian targets according to their likelihood to modulate aging in the invertebrates Caenorhabditis elegans and Drosophila. Our approach combines information on genetic effects on aging, orthology relationships and sequence conservation, 3D protein structures, drug binding and bioavailability. Overall, we rank 743 different drug-like compounds for their likelihood to modulate aging. We provide various lines of evidence for the successful enrichment of our ranking for compounds modulating aging, despite sparse public data suitable for validation. The top-ranked compounds are thus prime candidates for in vivo testing of their effects on lifespan in C. elegans or Drosophila. As such, these compounds are promising as research tools and ultimately a step towards identifying drugs for healthier human aging.
Affiliation(s)
- Matthias Ziehm
- European Molecular Biology Laboratory; European Bioinformatics Institute (EMBL-EBI); The Genome Campus Hinxton, Cambridge CB10 1SD UK
- Department of Genetics, Evolution and Environment; Institute of Healthy Ageing; University College London; Gower Street London WC1E 6BT UK
| | - Satwant Kaur
- European Molecular Biology Laboratory; European Bioinformatics Institute (EMBL-EBI); The Genome Campus Hinxton, Cambridge CB10 1SD UK
| | - Dobril K. Ivanov
- European Molecular Biology Laboratory; European Bioinformatics Institute (EMBL-EBI); The Genome Campus Hinxton, Cambridge CB10 1SD UK
| | - Pedro J. Ballester
- European Molecular Biology Laboratory; European Bioinformatics Institute (EMBL-EBI); The Genome Campus Hinxton, Cambridge CB10 1SD UK
| | - David Marcus
- European Molecular Biology Laboratory; European Bioinformatics Institute (EMBL-EBI); The Genome Campus Hinxton, Cambridge CB10 1SD UK
| | - Linda Partridge
- Department of Genetics, Evolution and Environment; Institute of Healthy Ageing; University College London; Gower Street London WC1E 6BT UK
- Max Planck Institute for Biology of Ageing; Joseph-Stelzmann-Str. 9b 50931 Cologne Germany
| | - Janet M. Thornton
- European Molecular Biology Laboratory; European Bioinformatics Institute (EMBL-EBI); The Genome Campus Hinxton, Cambridge CB10 1SD UK
|
17
|
CATH-Gene3D: Generation of the Resource and Its Use in Obtaining Structural and Functional Annotations for Protein Sequences. Methods Mol Biol 2017; 1558:79-110. [PMID: 28150234 DOI: 10.1007/978-1-4939-6783-4_4] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 12/05/2022]
Abstract
This chapter describes the generation of the data in the CATH-Gene3D online resource and how it can be used to study protein domains and their evolutionary relationships. Methods will be presented for: comparing protein structures, recognizing homologs, predicting domain structures within protein sequences, and subclassifying superfamilies into functionally pure families, together with a guide on using the webpages.
|
18
|
Bernardes J, Zaverucha G, Vaquero C, Carbone A. Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence. PLoS Comput Biol 2016; 12:e1005038. [PMID: 27472895 PMCID: PMC4966962 DOI: 10.1371/journal.pcbi.1005038] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 07/17/2015] [Accepted: 06/28/2016] [Indexed: 11/30/2022]
Abstract
Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, attempts at annotation fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: (1) for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences; (2) a decision-making protocol combines outcomes obtained from multiple models; (3) a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to a 30% improvement in the total number of Pfam domain predictions on the whole genome. The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
Current sequence databases contain hundreds of billions of nucleotides coding for genes, and a classification of these sequences is a primary problem in genomics. A reasonable way to organize these sequences is through their predicted domains, but the identification of domains in very divergent sequences, spanning the entire phylogenetic tree of species, is a difficult problem. By generating multiple probabilistic models for a domain, describing the spread of evolutionary patterns in different phylogenetic clades, we can effectively explore domains that are likely to be coded in gene sequences. Through a machine learning approach and optimization techniques, coding for expected evolutionary constraints, we filter the many possibilities of domain identification found for a gene and propose the most likely domain architecture associated with it. The application of this novel approach to the full genome of Plasmodium falciparum and to sequences from three SCOP datasets highlights the interest of exploring multiple pathways of domain evolution with the aim of extracting biological information from genomic sequences. Our new computational approach was developed with the hope of providing a novel tier of accurate and precise tools that complement existing tools such as HMMer, HHblits and PSI-BLAST, by exploring in a novel way the large amount of sequence data available. The existence of powerful databases for sequences, domains and architectures helps make this hope a reality.
Affiliation(s)
- Juliana Bernardes
- Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France
- * E-mail: (JB); (AC)
| | - Gerson Zaverucha
- COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
| | - Catherine Vaquero
- Sorbonne Universités, UPMC Univ-Paris 6, INSERM U1135, CNRS ERL 8255, Centre d’Immunologie et des Maladies Infectieuses (CIMI-Paris), Paris, France
| | - Alessandra Carbone
- Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France
- Institut Universitaire de France, Paris, France
- * E-mail: (JB); (AC)
|
19
|
Mao Y, Yang X, Liu Y, Yan Y, Du Z, Han Y, Song Y, Zhou L, Cui Y, Yang R. Reannotation of Yersinia pestis Strain 91001 Based on Omics Data. Am J Trop Med Hyg 2016; 95:562-70. [PMID: 27382076 DOI: 10.4269/ajtmh.16-0215] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 03/16/2016] [Accepted: 05/17/2016] [Indexed: 12/16/2022]
Abstract
Yersinia pestis is among the most dangerous human pathogens, and systematic research of this pathogen is important in bacterial pathogenomics research. To fully interpret the biological functions, physiological characteristics, and pathogenesis of Y. pestis, a comprehensive annotation of its entire genome is necessary. The emergence of omics-based research has brought new opportunities to better annotate the genome of this pathogen. Here, the complete genome of Y. pestis strain 91001 was reannotated using genomics and proteogenomics data. One hundred and thirty-seven unreliable coding sequences were removed, and 41 homologous genes were relocated with their translational initiation sites, while the functions of seven pseudogenes and 392 hypothetical genes were revised. Moreover, annotations of noncoding RNAs, repeat sequences, and transposable elements have also been incorporated. The reannotated results are freely available at http://tody.bmi.ac.cn.
Affiliation(s)
- Yiqing Mao
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China. Center of Information Technology, Beijing Institute of Health and Medical Information, Beijing, People's Republic of China
| | - Xianwei Yang
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China
| | - Yang Liu
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, People's Republic of China
| | - Yanfeng Yan
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China
| | - Zongmin Du
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China
| | - Yanping Han
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China
| | - Yajun Song
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China
| | - Lei Zhou
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China
| | - Yujun Cui
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China.
| | - Ruifu Yang
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People's Republic of China.
|
20
|
Thakur S, Guttman DS. A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies. BMC Bioinformatics 2016; 17:260. [PMID: 27363390 PMCID: PMC4929753 DOI: 10.1186/s12859-016-1142-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 09/01/2015] [Accepted: 06/22/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While there are a number of excellent tools for performing this task, most scale poorly when faced with hundreds of genome sequences, and many require extensive manual curation. RESULTS We have developed a de-novo genome analysis pipeline (DeNoGAP) for the automated, iterative and high-throughput analysis of data from comparative genomics projects involving hundreds of whole genome sequences. The pipeline is designed to perform reference-assisted and de novo gene prediction, homolog protein family assignment, ortholog prediction, functional annotation, and pan-genome analysis using a range of proven tools and databases. While most existing methods scale quadratically with the number of genomes because they rely on pairwise comparisons among predicted protein sequences, DeNoGAP scales linearly because the homology assignment is based on iteratively refined hidden Markov models. This iterative clustering strategy enables DeNoGAP to handle a very large number of genomes using minimal computational resources. Moreover, the modular structure of the pipeline permits easy updates as new analysis programs become available. CONCLUSION DeNoGAP integrates bioinformatics tools and databases for comparative analysis of a large number of genomes. The pipeline offers tools and algorithms for annotation and analysis of completed and draft genome sequences. The pipeline is developed using Perl, BioPerl and SQLite on Ubuntu Linux version 12.04 LTS. Currently, the software package is accompanied by a script for automated installation of the necessary external programs on Ubuntu Linux; however, the pipeline should also be compatible with other Linux and Unix systems once the necessary external programs are installed. DeNoGAP is freely available at https://sourceforge.net/projects/denogap/. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1142-2) contains supplementary material, which is available to authorized users.
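The quadratic versus linear scaling described in this abstract can be made concrete with a small counting sketch. It is a simplification, not DeNoGAP's implementation: it counts genome-level comparison steps only, and it assumes that an incremental strategy scans each genome once against the current set of family models. Both function names are hypothetical.

```python
def all_vs_all_comparisons(n_genomes):
    """Number of genome pairs when every genome is compared to every other."""
    return n_genomes * (n_genomes - 1) // 2


def incremental_model_scans(n_genomes):
    """Number of scans when each genome is scanned once against the
    current set of family models (assumed strategy, for illustration)."""
    return n_genomes


# Compare how the two counts grow with the number of genomes.
for n in (10, 100, 1000):
    print(n, all_vs_all_comparisons(n), incremental_model_scans(n))
```

The cost of each individual scan also matters in practice, so these counts indicate the growth trend rather than actual run time.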
Affiliation(s)
- Shalabh Thakur
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
| | - David S Guttman
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada; Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, Canada
|
21
|
Saripella GV, Sonnhammer ELL, Forslund K. Benchmarking the next generation of homology inference tools. Bioinformatics 2016; 32:2636-41. [PMID: 27256311 PMCID: PMC5013910 DOI: 10.1093/bioinformatics/btw305] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 07/02/2015] [Accepted: 05/05/2016] [Indexed: 12/21/2022]
Abstract
Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the 'next generation' of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile-based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA. Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics were calculated. We further measured congruence of domain architecture assignments in the three domain databases. Results: CS-BLAST and PHMMER had the highest overall accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization. Conclusion: Profile methods are superior at inferring remote homologs, but the difference in accuracy between methods is relatively small. PHMMER and CS-BLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. Availability and Implementation: Benchmark datasets and all scripts are available at http://sonnhammer.org/download/Homology_benchmark. Contact: forslund@embl.de. Supplementary information: Supplementary data are available at Bioinformatics online.
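The evaluation described in this abstract, scoring labelled homologous and non-homologous pairs with a tool and then computing performance metrics, can be sketched as follows. The abstract does not state which metrics were used, so sensitivity and specificity are an assumption here, as are the function names, the score threshold and the toy data.

```python
def confusion_counts(is_homolog, scores, threshold):
    """Count (tp, fp, tn, fn) when pairs scoring >= threshold are called homologous."""
    tp = fp = tn = fn = 0
    for truth, score in zip(is_homolog, scores):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and not truth:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of truly homologous pairs that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-homologous pairs that were correctly rejected."""
    return tn / (tn + fp)


# Hypothetical labels and alignment scores for five protein pairs.
labels = [True, True, False, False, True]
scores = [50.0, 10.0, 5.0, 40.0, 30.0]
tp, fp, tn, fn = confusion_counts(labels, scores, threshold=20.0)
print(sensitivity(tp, fn), specificity(tn, fp))
```

Sweeping the threshold over the observed scores would trace out the trade-off between the two metrics for a given tool.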
Affiliation(s)
- Ganapathi Varma Saripella
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Stockholm SE-10691, Sweden
| | - Erik L L Sonnhammer
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Stockholm SE-10691, Sweden
| | - Kristoffer Forslund
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg 69117, Germany
|
22
|
Lee D, Das S, Dawson NL, Dobrijevic D, Ward J, Orengo C. Novel Computational Protocols for Functionally Classifying and Characterising Serine Beta-Lactamases. PLoS Comput Biol 2016; 12:e1004926. [PMID: 27332861 PMCID: PMC4917113 DOI: 10.1371/journal.pcbi.1004926] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 12/02/2015] [Accepted: 04/19/2016] [Indexed: 11/23/2022]
Abstract
Beta-lactamases represent the main bacterial mechanism of resistance to beta-lactam antibiotics and are a significant challenge to modern medicine. We have developed an automated classification and analysis protocol that exploits structure- and sequence-based approaches and allows us to propose a grouping of serine beta-lactamases that more consistently captures and rationalizes the three existing classification schemes: Classes (A, C and D, which vary in their implementation of the mechanism of action); Types (which largely reflect evolutionary distance measured by sequence similarity); and Variant groups (which largely correspond with the Bush-Jacoby clinical groups). Our analysis platform exploits a suite of in-house and public tools to identify Functional Determinants (FDs), i.e. residue sites responsible for conferring different phenotypes between different classes, different types and different variants. We focused on Class A beta-lactamases, the most highly populated and clinically relevant class, to identify FDs implicated in the distinct phenotypes associated with different Class A Types and Variants. We show that our FunFHMMer method can separate the known beta-lactamase classes and identify those positions likely to be responsible for the different implementations of the mechanism of action in these enzymes. Two novel algorithms, ASSP and SSPA, allow detection of FD sites likely to contribute to the broadening of the substrate profiles. Using our approaches, we recognise 151 Class A types in UniProt. Finally, we used our beta-lactamase FunFams and ASSP profiles to detect 4 novel Class A types in microbiome samples. Our platforms have been validated by literature studies, in silico analysis and some targeted experimental verification. Although developed for the serine beta-lactamases, they could be used to classify and analyse any diverse protein superfamily where sub-families have diverged over both long and short evolutionary timescales.
Affiliation(s)
- David Lee
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Sayoni Das
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Natalie L. Dawson
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Dragana Dobrijevic
- Department of Biochemical Engineering, University College London, London, United Kingdom
| | - John Ward
- Department of Biochemical Engineering, University College London, London, United Kingdom
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
|
23
|
Syme RA, Tan KC, Hane JK, Dodhia K, Stoll T, Hastie M, Furuki E, Ellwood SR, Williams AH, Tan YF, Testa AC, Gorman JJ, Oliver RP. Comprehensive Annotation of the Parastagonospora nodorum Reference Genome Using Next-Generation Genomics, Transcriptomics and Proteogenomics. PLoS One 2016; 11:e0147221. [PMID: 26840125 PMCID: PMC4739733 DOI: 10.1371/journal.pone.0147221] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 09/05/2015] [Accepted: 12/30/2015] [Indexed: 11/29/2022]
Abstract
Parastagonospora nodorum, the causal agent of Septoria nodorum blotch (SNB), is an economically important pathogen of wheat (Triticum spp.), and a model for the study of necrotrophic pathology and genome evolution. The reference P. nodorum strain SN15 was the first Dothideomycete with a published genome sequence, and has been used as the basis for comparison within and between species. Here we present an updated reference genome assembly, with SNP and indel errors in the underlying assembly corrected using deep resequencing data, as well as extensive manual annotation of gene models using transcriptomic and proteomic sources of evidence (https://github.com/robsyme/Parastagonospora_nodorum_SN15). The updated assembly and annotation include 8,366 genes with modified protein sequences and 866 new genes. This study shows the benefits of using a wide variety of experimental methods allied to expert curation to generate a reliable set of gene models.
Affiliation(s)
- Robert A. Syme
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
| | - Kar-Chun Tan
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
| | - James K. Hane
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
- Curtin Institute for Computation, Curtin University, Bentley, WA, Australia
| | - Kejal Dodhia
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
| | - Thomas Stoll
- Protein Discovery Centre, QIMR Berghofer Medical Research Institute, Herston, Qld, Australia
| | - Marcus Hastie
- Protein Discovery Centre, QIMR Berghofer Medical Research Institute, Herston, Qld, Australia
| | - Eiko Furuki
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
| | - Simon R. Ellwood
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
| | - Angela H. Williams
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
| | - Alison C. Testa
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
| | - Jeffrey J. Gorman
- Protein Discovery Centre, QIMR Berghofer Medical Research Institute, Herston, Qld, Australia
| | - Richard P. Oliver
- Centre for Crop & Disease Management, Department of Environment and Agriculture, Curtin University, Bentley, WA, Australia
|
24
|
Lam SD, Dawson NL, Das S, Sillitoe I, Ashford P, Lee D, Lehtinen S, Orengo CA, Lees JG. Gene3D: expanding the utility of domain assignments. Nucleic Acids Res 2016; 44:D404-9. [PMID: 26578585 PMCID: PMC4702871 DOI: 10.1093/nar/gkv1231] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Received: 09/25/2015] [Revised: 10/29/2015] [Accepted: 10/30/2015] [Indexed: 12/21/2022] Open
Abstract
Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing 2737 CATH superfamilies. Gene3D has previously featured in the Database issue of NAR and here we report updates to the website and database. The current Gene3D (v14) release has expanded its domain assignments to ∼20,000 cellular genomes and over 43 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains. Additionally, the structural models have been expanded to include an extra model organism (Drosophila melanogaster). We also document a number of additional visualization tools in the Gene3D website.
Affiliation(s)
- Su Datt Lam
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK
| | - Natalie L Dawson
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK
| | - Sayoni Das
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK
| | - Paul Ashford
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK
| | - David Lee
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK
| | - Sonja Lehtinen
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK; Department of Infectious Disease Epidemiology, Imperial College, St Mary's Campus, Norfolk Place, London W2 1PG, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK
| | - Jonathan G Lees
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK
|
25
|
Das S, Orengo CA. Protein function annotation using protein domain family resources. Methods 2016; 93:24-34. [DOI: 10.1016/j.ymeth.2015.09.029] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 07/02/2015] [Revised: 09/28/2015] [Accepted: 09/29/2015] [Indexed: 01/25/2023] Open
|
26
|
Beattie KE, De Ferrari L, Mitchell JBO. Why do Sequence Signatures Predict Enzyme Mechanism? Homology versus Chemistry. Evol Bioinform Online 2015; 11:267-74. [PMID: 26740739 PMCID: PMC4696837 DOI: 10.4137/ebo.s31482] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/09/2015] [Revised: 11/04/2015] [Accepted: 11/08/2015] [Indexed: 01/25/2023] Open
Abstract
First, we identify InterPro sequence signatures representing evolutionary relatedness and, second, signatures identifying specific chemical machinery. Thus, we predict the chemical mechanisms of enzyme-catalyzed reactions from catalytic and non-catalytic subsets of InterPro signatures. We first scanned our 249 sequences using InterProScan and then used the MACiE database to identify those amino acid residues that are important for catalysis. The sequences were mutated in silico to replace these catalytic residues with glycine and then again scanned using InterProScan. Those signature matches from the original scan that disappeared on mutation were called catalytic. Mechanism was predicted using all signatures, only the 78 “catalytic” signatures, or only the 519 “non-catalytic” signatures. The non-catalytic signatures gave indistinguishable results from those for the whole feature set, with precision of 0.991 and sensitivity of 0.970. The catalytic signatures alone gave less impressive predictivity, with precision and sensitivity of 0.791 and 0.735, respectively. These results show that our successful prediction of enzyme mechanism is mostly by homology rather than by identifying catalytic machinery.
Affiliation(s)
- Kirsten E Beattie
- Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, Scotland, UK
| | - Luna De Ferrari
- Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, Scotland, UK
| | - John B O Mitchell
- Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, Scotland, UK
|
27
|
Yue J, Liu J, Ban R, Tang W, Deng L, Fei Z, Liu Y. Kiwifruit Information Resource (KIR): a comparative platform for kiwifruit genomics. Database (Oxford) 2015; 2015:bav113. [PMID: 26656885 PMCID: PMC4674624 DOI: 10.1093/database/bav113] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 08/03/2015] [Accepted: 11/05/2015] [Indexed: 12/22/2022]
Abstract
The Kiwifruit Information Resource (KIR) is dedicated to maintain and integrate comprehensive datasets on genomics, functional genomics and transcriptomics of kiwifruit (Actinidiaceae). KIR serves as a central access point for existing/new genomic and genetic data. KIR also provides researchers with a variety of visualization and analysis tools. Current developments include the updated genome structure of Actinidia chinensis cv. Hongyang and its newest genome annotation, putative transcripts, gene expression, physical markers of genetic traits as well as relevant publications based on the latest genome assembly. Nine thousand five hundred and forty-seven new transcripts are detected and 21 132 old transcripts are changed. At the present release, the next-generation transcriptome sequencing data has been incorporated into gene models and splice variants. Protein–protein interactions are also identified based on experimentally determined orthologous interactions. Furthermore, the experimental results reported in peer-reviewed literature are manually extracted and integrated within a well-developed query page. In total, 122 identifications are currently associated, including commonly used gene names and symbols. All KIR datasets are helpful to facilitate a broad range of kiwifruit research topics and freely available to the research community. Database URL: http://bdg.hfut.edu.cn/kir/index.html.
Affiliation(s)
- Junyang Yue
- School of Biotechnology and Food Engineering, Hefei University of Technology, Hefei 230009, China
| | - Jian Liu
- School of Biotechnology and Food Engineering, Hefei University of Technology, Hefei 230009, China
| | - Rongjun Ban
- School of Information Science and Technology, University of Science and Technology of China, Hefei 230009, China
| | - Wei Tang
- School of Biotechnology and Food Engineering, Hefei University of Technology, Hefei 230009, China
| | - Lin Deng
- Information and Network Center, Hefei University of Technology, Hefei 230009, China
| | - Zhangjun Fei
- Boyce Thompson Institute for Plant Research and USDA-ARS Robert W. Holley Center, Tower Road, Cornell University Campus, Ithaca, NY 14853, USA
| | - Yongsheng Liu
- School of Biotechnology and Food Engineering, Hefei University of Technology, Hefei 230009, China; Ministry of Education Key Laboratory for Bio-Resource and Eco-Environment, College of Life Science and State Key Laboratory of Hydraulics and Mountain River Engineering, Sichuan University, Chengdu 610064, China
|
28
|
Das S, Dawson NL, Orengo CA. Diversity in protein domain superfamilies. Curr Opin Genet Dev 2015; 35:40-9. [PMID: 26451979 PMCID: PMC4686048 DOI: 10.1016/j.gde.2015.09.005] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 07/09/2015] [Revised: 09/07/2015] [Accepted: 09/08/2015] [Indexed: 01/25/2023]
Abstract
Whilst ∼93% of domain superfamilies appear to be relatively structurally and functionally conserved based on the available data from the CATH-Gene3D domain classification resource, the remainder are much more diverse. In this review, we consider how domains in some of the most ubiquitous and promiscuous superfamilies have evolved, in particular the plasticity in their functional sites and surfaces which expands the repertoire of molecules they interact with and actions performed on them. To what extent can we identify a core function for these superfamilies which would allow us to develop a ‘domain grammar of function’ whereby a protein's biological role can be proposed from its constituent domains? Clearly the first step is to understand the extent to which these components vary and how changes in their molecular make-up modifies function.
Affiliation(s)
- Sayoni Das
- Institute of Structural and Molecular Biology, UCL, 627 Darwin Building, Gower Street, WC1E 6BT, UK
| | - Natalie L Dawson
- Institute of Structural and Molecular Biology, UCL, 627 Darwin Building, Gower Street, WC1E 6BT, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, UCL, 627 Darwin Building, Gower Street, WC1E 6BT, UK.
|
29
|
Scaiewicz A, Levitt M. The language of the protein universe. Curr Opin Genet Dev 2015; 35:50-6. [PMID: 26451980 DOI: 10.1016/j.gde.2015.08.010] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 06/30/2015] [Revised: 08/20/2015] [Accepted: 08/25/2015] [Indexed: 11/17/2022]
Abstract
Proteins, the main cell machinery which play a major role in nearly every cellular process, have always been a central focus in biology. We live in the post-genomic era, and inferring information from massive data sets is a steadily growing universal challenge. The increasing availability of fully sequenced genomes can be regarded as the 'Rosetta Stone' of the protein universe, allowing the understanding of genomes and their evolution, just as the original Rosetta Stone allowed Champollion to decipher the ancient Egyptian hieroglyphics. In this review, we consider aspects of the protein domain architectures repertoire that are closely related to those of human languages and aim to provide some insights about the language of proteins.
Affiliation(s)
- Andrea Scaiewicz
- Department of Structural Biology, Stanford University, Stanford, CA 94305-5126, United States
| | - Michael Levitt
- Department of Structural Biology, Stanford University, Stanford, CA 94305-5126, United States.
|
30
|
Lees JG, Ranea JA, Orengo CA. Identifying and characterising key alternative splicing events in Drosophila development. BMC Genomics 2015; 16:608. [PMID: 26275604 PMCID: PMC4537583 DOI: 10.1186/s12864-015-1674-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 02/23/2015] [Accepted: 05/29/2015] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND In complex Metazoans a given gene frequently codes for multiple protein isoforms, through processes such as alternative splicing. Large scale functional annotation of these isoforms is a key challenge for functional genomics. This annotation gap is increasing with the large numbers of multi transcript genes being identified by technologies such as RNASeq. Furthermore attempts to characterise the functions of splicing in an organism are complicated by the difficulty in distinguishing functional isoforms from those produced by splicing errors or transcription noise. Tools to help prioritise candidate isoforms for testing are largely absent. RESULTS In this study we implement a Time-course Switch (TS) score for ranking isoforms by their likelihood of producing additional functions based on their developmental expression profiles, as reported by modENCODE. The TS score allows us to better investigate functional roles of different isoforms expressed in multi transcript genes. From this analysis, we find that isoforms with high TS scores have sequence feature changes consistent with more deterministic splicing and functional changes and tend to gain domains or whole exons which could carry additional functions. Furthermore these functions appear to be particularly important for essential regulatory roles, establishing functional isoform switching as key for regulatory processes. Based on the TS score we develop a Transcript Annotations Pipeline for Alternative Splicing (TAPAS) that identifies functional neighbourhoods of potentially interesting isoforms. CONCLUSIONS We have identified a subset of protein isoforms which appear to have high functional significance, particularly in regulation. This has been made possible through the development of novel methods that make use of transcript expression profiles. 
The methods and analyses we present here represent important first steps in the development of tools to address the near complete lack of isoform specific function annotation. In turn the tools allow us to better characterise the regulatory functions of alternative splicing in more detail.
Affiliation(s)
- Jonathan G Lees
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK.
| | - Juan A Ranea
- Department of Molecular Biology and Biochemistry-CIBER de Enfermedades Raras, University of Malaga, Malaga, 29071, Spain.
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Gower Street, London, WC1E 6BT, UK.
|
31
|
The history of the CATH structural classification of protein domains. Biochimie 2015; 119:209-17. [PMID: 26253692 PMCID: PMC4678953 DOI: 10.1016/j.biochi.2015.08.004] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 07/01/2015] [Accepted: 08/01/2015] [Indexed: 11/21/2022]
Abstract
This article presents a historical review of the protein structure classification database CATH. Together with the SCOP database, CATH remains comprehensive and reasonably up-to-date with the now more than 100,000 protein structures in the PDB. We review the expansion of the CATH and SCOP resources to capture predicted domain structures in the genome sequence data and to provide information on the likely functions of proteins mediated by their constituent domains. The establishment of comprehensive function annotation resources has also meant that domain families can be functionally annotated allowing insights into functional divergence and evolution within protein families.
|
32
|
Das S, Lee D, Sillitoe I, Dawson NL, Lees JG, Orengo CA. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation. Bioinformatics 2015; 31:3460-7. [PMID: 26139634 PMCID: PMC4612221 DOI: 10.1093/bioinformatics/btv398] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Received: 05/02/2015] [Accepted: 06/24/2015] [Indexed: 11/18/2022] Open
Abstract
Motivation: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer. Results: FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications. This has been validated using known functional information. The conserved positions predicted by the FunFams are also found to be enriched in known functional residues. Moreover, the functional annotations provided by the FunFams are found to be more precise than other domain-based resources. FunFHMMer currently identifies 110 439 FunFams in 2735 superfamilies which can be used to functionally annotate >16 million domain sequences. Availability and implementation: All FunFam annotation data are made available through the CATH webpages (http://www.cathdb.info). The FunFHMMer webserver (http://www.cathdb.info/search/by_funfhmmer) allows users to submit query sequences for assignment to a CATH FunFam. Contact: sayoni.das.12@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sayoni Das
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - David Lee
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - Natalie L Dawson
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - Jonathan G Lees
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
|
33
|
Das S, Sillitoe I, Lee D, Lees JG, Dawson NL, Ward J, Orengo CA. CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Res 2015; 43:W148-53. [PMID: 25964299 PMCID: PMC4489299 DOI: 10.1093/nar/gkv488] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 02/14/2015] [Accepted: 05/02/2015] [Indexed: 12/20/2022] Open
Abstract
The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence–structure–function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.
Affiliation(s)
- Sayoni Das
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - David Lee
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - Jonathan G Lees
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - Natalie L Dawson
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
| | - John Ward
- Department of Biochemical Engineering, UCL, Gower Street, WC1E 6BT, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
|
34
|
Haçarız O, Akgün M, Kavak P, Yüksel B, Sağıroğlu MŞ. Comparative transcriptome profiling approach to glean virulence and immunomodulation-related genes of Fasciola hepatica. BMC Genomics 2015; 16:366. [PMID: 25956885 PMCID: PMC4429430 DOI: 10.1186/s12864-015-1539-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 03/10/2015] [Accepted: 04/15/2015] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Fasciola hepatica causes chronic liver disease, fasciolosis, leading to significant losses in the livestock economy and concerns for human health in many countries. The identification of F. hepatica genes involved in the parasite's virulence through modulation of host immune system is utmost important to comprehend evasion mechanisms of the parasite and develop more effective strategies against fasciolosis. In this study, to identify the parasite's putative virulence genes which are associated with host immunomodulation, we explored whole transcriptome of an adult F. hepatica using current transcriptome profiling approaches integrated with detailed in silico analyses. In brief, the comparison of the parasite transcripts with the specialised public databases containing sequence data of non-parasitic organisms (Dugesiidae species and Caenorhabditis elegans) or of numerous pathogens and investigation of the sequences in terms of nucleotide evolution (directional selection) and cytokine signaling relation were conducted. RESULTS NGS of the whole transcriptome resulted in 19,534,766 sequence reads, yielding a total of 40,260 transcripts (N₅₀ = 522 bp). A number of the parasite transcripts (n = 1,671) were predicted to be virulence-related on the basis of the exclusive homology with the pathogen-associated data, positive selection or relationship with cytokine signaling. Of these, a group of the virulence-related genes (n = 62), not previously described, were found likely to be associated with immunomodulation based on in silico functional categorisation, showing significant sequence similarities with various immune receptors (i.e. MHC I class, TGF-β receptor, toll/interleukin-1 receptor, T-cell receptor, TNF receptor, and IL-18 receptor accessory protein), cytokines (i.e. TGF-β, interleukin-4/interleukin-13 and TNF-α), cluster of differentiations (e.g. CD48 and CD147) or molecules associated with other immunomodulatory mechanisms (such as regulation of macrophage activation). Some of the genes (n = 5) appeared to be under positive selection (Ka/Ks > 1), imitating proteins associated with cytokine signaling (through sequence homologies with thrombospondin type 1, toll/interleukin-1 receptor, TGF-β receptor and CD147). CONCLUSIONS With a comparative transcriptome profiling approach, we have identified a number of potential immunomodulator genes of F. hepatica (n = 62), which are firstly described here, could be employed for the development of better strategies (including RNAi) in the battle against both zoonotically and economically important disease, fasciolosis.
Affiliation(s)
- Orçun Haçarız
- TÜBİTAK Marmara Research Center, Genetic Engineering and Biotechnology Institute, P.O. Box 21, 41470, Gebze, Kocaeli, Turkey.
| | - Mete Akgün
- TÜBİTAK Marmara Research Center, Information Technologies Institute, Gebze, Kocaeli, Turkey.
| | - Pınar Kavak
- TÜBİTAK Marmara Research Center, Information Technologies Institute, Gebze, Kocaeli, Turkey.
| | - Bayram Yüksel
- TÜBİTAK Marmara Research Center, Genetic Engineering and Biotechnology Institute, P.O. Box 21, 41470, Gebze, Kocaeli, Turkey.
| | - Mahmut Şamil Sağıroğlu
- TÜBİTAK Marmara Research Center, Information Technologies Institute, Gebze, Kocaeli, Turkey.
|
35
|
Cheng H, Liao Y, Schaeffer RD, Grishin NV. Manual classification strategies in the ECOD database. Proteins 2015; 83:1238-51. [PMID: 25917548 DOI: 10.1002/prot.24818] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Received: 02/19/2015] [Revised: 03/30/2015] [Accepted: 04/19/2015] [Indexed: 12/28/2022]
Abstract
ECOD (Evolutionary Classification Of protein Domains) is a comprehensive and up-to-date protein structure classification database. The majority of new structures released from the PDB (Protein Data Bank) each week already have close homologs in the ECOD hierarchy and thus can be reliably partitioned into domains and classified by software without manual intervention. However, those proteins that lack confidently detectable homologs require careful analysis by experts. Although many bioinformatics resources rely on expert curation to some degree, specific examples of how this curation occurs and in what cases it is necessary are not always described. Here, we illustrate the manual classification strategy in ECOD by example, focusing on two major issues in protein classification: domain partitioning and the relationship between homology and similarity scores. Most examples show recently released and manually classified PDB structures. We discuss multi-domain proteins, discordance between sequence and structural similarities, difficulties with assessing homology with scores, and integral membrane proteins homologous to soluble proteins. By timely assimilation of newly available structures into its hierarchy, ECOD strives to provide a most accurate and updated view of the protein structure world as a result of combined computational and expert-driven analysis.
Affiliation(s)
- Hua Cheng
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390
| | - Yuxing Liao
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, 75390
| | - R Dustin Schaeffer
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390
| | - Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390; Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, 75390
|
36
|
Meng X, Wang C, Rahman SU, Wang Y, Wang A, Tao S. Genome-wide identification and evolution of HECT genes in soybean. Int J Mol Sci 2015; 16:8517-35. [PMID: 25894222 PMCID: PMC4425094 DOI: 10.3390/ijms16048517] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 03/09/2015] [Revised: 04/13/2015] [Accepted: 04/13/2015] [Indexed: 01/10/2023] Open
Abstract
Proteins containing domains homologous to the E6-associated protein (E6-AP) carboxyl terminus (HECT) are an important class of E3 ubiquitin ligases involved in the ubiquitin proteasome pathway. HECT-type E3s play crucial roles in plant growth and development. However, current understanding of plant HECT genes and their evolution is very limited. In this study, we performed a genome-wide analysis of the HECT domain-containing genes in soybean. Using high-quality genome sequences, we identified 19 soybean HECT genes. The predicted HECT genes were distributed unevenly across 15 of 20 chromosomes. Nineteen of these genes were inferred to be segmentally duplicated gene pairs, suggesting that in soybean, segmental duplications have made a significant contribution to the expansion of the HECT gene family. Phylogenetic analysis showed that these HECT genes can be divided into seven groups, among which gene structure and domain architecture was relatively well-conserved. The Ka/Ks ratios show that after the duplication events, duplicated HECT genes underwent purifying selection. Moreover, expression analysis reveals that 15 of the HECT genes in soybean are differentially expressed in 14 tissues, and are often highly expressed in the flowers and roots. In summary, this work provides useful information on which further functional studies of soybean HECT genes can be based.
Affiliation(s)
- Xianwen Meng
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Bioinformatics Center, Northwest A&F University, Yangling 712100, China.
| | - Chen Wang
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Bioinformatics Center, Northwest A&F University, Yangling 712100, China.
| | - Siddiq Ur Rahman
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Bioinformatics Center, Northwest A&F University, Yangling 712100, China.
| | - Yaxu Wang
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Bioinformatics Center, Northwest A&F University, Yangling 712100, China.
| | - Ailan Wang
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Bioinformatics Center, Northwest A&F University, Yangling 712100, China.
| | - Shiheng Tao
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Bioinformatics Center, Northwest A&F University, Yangling 712100, China.
| |
Collapse
|
37
|
Hutchins JRA. What's that gene (or protein)? Online resources for exploring functions of genes, transcripts, and proteins. Mol Biol Cell 2014; 25:1187-201. [PMID: 24723265 PMCID: PMC3982986 DOI: 10.1091/mbc.e13-10-0602] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 12/19/2022]
Abstract
The genomic era has enabled research projects that use approaches including genome-scale screens, microarray analysis, next-generation sequencing, and mass spectrometry-based proteomics to discover genes and proteins involved in biological processes. Such methods generate data sets of gene, transcript, or protein hits that researchers wish to explore to understand their properties and functions and thus their possible roles in biological systems of interest. Recent years have seen a profusion of Internet-based resources to aid this process. This review takes the viewpoint of the curious biologist wishing to explore the properties of protein-coding genes and their products, identified using genome-based technologies. Ten key questions are asked about each hit, addressing functions, phenotypes, expression, evolutionary conservation, disease association, protein structure, interactors, posttranslational modifications, and inhibitors. Answers are provided by presenting the latest publicly available resources, together with methods for hit-specific and data set-wide information retrieval, suited to any genome-based analytical technique and experimental species. The utility of these resources is demonstrated for 20 factors regulating cell proliferation. Results obtained using some of these are discussed in more depth using the p53 tumor suppressor as an example. This flexible and universally applicable approach for characterizing experimental hits helps researchers to maximize the potential of their projects for biological discovery.
Affiliation(s)
- James R A Hutchins
- Institute of Human Genetics, Centre National de la Recherche Scientifique (CNRS), 34396 Montpellier, France
38
Abstract
A key reason three-dimensional (3-D) protein structures are annotated with supporting or derived information is to understand the molecular basis of protein function. To this end, protein structure annotation databases curate key facts and observations, based on community-accepted standards, about the ~100,000 3-D experimental protein structures to date. This review will introduce the primary structure repositories, databases, and value-added structural annotation databases, as well as the range of information they provide. The different levels of annotation data (primary vs. derived vs. inferred) and how they should all be considered accordingly will also be described.
Affiliation(s)
- Margaret J. Gabanyi
- Center for Integrative Proteomics Research, Rutgers, The State University of New Jersey, Piscataway, NJ 08854 USA
- Helen M. Berman
- Center for Integrative Proteomics Research, Rutgers, The State University of New Jersey, Piscataway, NJ 08854 USA
39
Murcha MW, Narsai R, Devenish J, Kubiszewski-Jakubiak S, Whelan J. MPIC: a mitochondrial protein import components database for plant and non-plant species. Plant & Cell Physiology 2015; 56:e10. [PMID: 25435547 DOI: 10.1093/pcp/pcu186] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 06/04/2023]
Abstract
In the 2 billion years since the endosymbiotic event that gave rise to mitochondria, variations in mitochondrial protein import have evolved across different species. With the genomes of an increasing number of plant species sequenced, it is possible to gain novel insights into mitochondrial protein import pathways. We have generated the Mitochondrial Protein Import Components (MPIC) Database (DB; http://www.plantenergy.uwa.edu.au/applications/mpic) providing searchable information on the protein import apparatus of plant and non-plant mitochondria. An in silico analysis was carried out, comparing the mitochondrial protein import apparatus from 24 species representing various lineages from Saccharomyces cerevisiae (yeast) and algae to Homo sapiens (human) and higher plants, including Arabidopsis thaliana (Arabidopsis), Oryza sativa (rice) and other more recently sequenced plant species. Each of these species was extensively searched and manually assembled for analysis in the MPIC DB. The database presents an interactive diagram in a user-friendly manner, allowing users to select their import component of interest. The MPIC DB presents an extensive resource facilitating detailed investigation of the mitochondrial protein import machinery and allowing patterns of conservation and divergence to be recognized that would otherwise have been missed. To demonstrate the usefulness of the MPIC DB, we present a comparative analysis of the mitochondrial protein import machinery in plants and non-plant species, revealing plant-specific features that have evolved.
Affiliation(s)
- Monika W Murcha
- Australian Research Council Centre of Excellence in Plant Energy Biology, Bayliss Building M316, University of Western Australia, 35 Stirling Highway, Crawley 6009, Western Australia, Australia
- Reena Narsai
- Department of Botany, Australian Research Council Centre of Excellence in Plant Energy Biology, School of Life Science, La Trobe University, Bundoora 3083, Victoria, Australia
- James Devenish
- Australian Research Council Centre of Excellence in Plant Energy Biology, Bayliss Building M316, University of Western Australia, 35 Stirling Highway, Crawley 6009, Western Australia, Australia
- Szymon Kubiszewski-Jakubiak
- Australian Research Council Centre of Excellence in Plant Energy Biology, Bayliss Building M316, University of Western Australia, 35 Stirling Highway, Crawley 6009, Western Australia, Australia
- James Whelan
- Department of Botany, Australian Research Council Centre of Excellence in Plant Energy Biology, School of Life Science, La Trobe University, Bundoora 3083, Victoria, Australia
40
Mitchell A, Chang HY, Daugherty L, Fraser M, Hunter S, Lopez R, McAnulla C, McMenamin C, Nuka G, Pesseat S, Sangrador-Vegas A, Scheremetjew M, Rato C, Yong SY, Bateman A, Punta M, Attwood TK, Sigrist CJA, Redaschi N, Rivoire C, Xenarios I, Kahn D, Guyot D, Bork P, Letunic I, Gough J, Oates M, Haft D, Huang H, Natale DA, Wu CH, Orengo C, Sillitoe I, Mi H, Thomas PD, Finn RD. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res 2014; 43:D213-21. [PMID: 25428371 PMCID: PMC4383996 DOI: 10.1093/nar/gku1243] [Citation(s) in RCA: 961] [Impact Index Per Article: 87.4] [Indexed: 11/30/2022]
Abstract
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36 766 member database signatures integrated into 26 238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.
Affiliation(s)
- Alex Mitchell
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Hsin-Yu Chang
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Louise Daugherty
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Matthew Fraser
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Sarah Hunter
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Rodrigo Lopez
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Craig McAnulla
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Conor McMenamin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Gift Nuka
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Sebastien Pesseat
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Amaia Sangrador-Vegas
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Maxim Scheremetjew
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Claudia Rato
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Siew-Yit Yong
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Marco Punta
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Teresa K Attwood
- Faculty of Life Science and School of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Christian J A Sigrist
- Swiss Institute of Bioinformatics (SIB), CMU - Rue Michel-Servet, 1211 Geneva 4, Switzerland
- Nicole Redaschi
- Swiss Institute of Bioinformatics (SIB), CMU - Rue Michel-Servet, 1211 Geneva 4, Switzerland
- Catherine Rivoire
- Swiss Institute of Bioinformatics (SIB), CMU - Rue Michel-Servet, 1211 Geneva 4, Switzerland
- Ioannis Xenarios
- Swiss Institute of Bioinformatics (SIB), CMU - Rue Michel-Servet, 1211 Geneva 4, Switzerland; Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland; Department of Biochemistry, University of Geneva, 1211 Geneva, Switzerland
- Daniel Kahn
- Pôle Rhône-Alpin de Bio-Informatique (PRABI), Batiment G. Mendel, Universite Claude Bernard, 43 bd du 11 novembre 1918, 69622 Villeurbanne Cedex, France
- Dominique Guyot
- Pôle Rhône-Alpin de Bio-Informatique (PRABI), Batiment G. Mendel, Universite Claude Bernard, 43 bd du 11 novembre 1918, 69622 Villeurbanne Cedex, France
- Peer Bork
- European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Ivica Letunic
- European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Julian Gough
- Department of Computer Science, University of Bristol, Woodland Road, Bristol, BS8 1UB, UK
- Matt Oates
- Department of Computer Science, University of Bristol, Woodland Road, Bristol, BS8 1UB, UK
- Daniel Haft
- J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD 20850, USA
- Hongzhan Huang
- Protein Information Resource (PIR), Georgetown University Medical Center, Washington, DC 20007, USA
- Darren A Natale
- Protein Information Resource (PIR), Georgetown University Medical Center, Washington, DC 20007, USA
- Cathy H Wu
- Protein Information Resource (PIR), Georgetown University Medical Center, Washington, DC 20007, USA; Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
- Christine Orengo
- Structural and Molecular Biology Department, University College London, University of London, London, WC1E 6BT, UK
- Ian Sillitoe
- Structural and Molecular Biology Department, University College London, University of London, London, WC1E 6BT, UK
- Huaiyu Mi
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA
- Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA
- Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
41
Lewis TE, Sillitoe I, Andreeva A, Blundell TL, Buchan DWA, Chothia C, Cozzetto D, Dana JM, Filippis I, Gough J, Jones DT, Kelley LA, Kleywegt GJ, Minneci F, Mistry J, Murzin AG, Ochoa-Montaño B, Oates ME, Punta M, Rackham OJL, Stahlhacke J, Sternberg MJE, Velankar S, Orengo C. Genome3D: exploiting structure to help users understand their sequences. Nucleic Acids Res 2014; 43:D382-6. [PMID: 25348407 PMCID: PMC4384030 DOI: 10.1093/nar/gku973] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Indexed: 12/03/2022]
Abstract
Genome3D (http://www.genome3d.eu) is a collaborative resource that provides predicted domain annotations and structural models for key sequences. Since introducing Genome3D in a previous NAR paper, we have substantially extended and improved the resource. We have annotated representatives from Pfam families to improve coverage of diverse sequences and added a fast sequence search to the website to allow users to find Genome3D-annotated sequences similar to their own. We have improved and extended the Genome3D data, enlarging the source data set from three model organisms to 10, and adding VIVACE, a resource new to Genome3D. We have analysed and updated Genome3D's SCOP/CATH mapping. Finally, we have improved the superposition tools, which now give users a more powerful interface for investigating similarities and differences between structural models.
Affiliation(s)
- Tony E Lewis
- Institute of Structural and Molecular Biology, UCL, 636 Darwin Building, Gower Street, London, WC1E 6BT, UK
- Ian Sillitoe
- Institute of Structural and Molecular Biology, UCL, 636 Darwin Building, Gower Street, London, WC1E 6BT, UK
- Antonina Andreeva
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 0QH, UK
- Tom L Blundell
- Department of Biochemistry, University of Cambridge, Old Addenbrooke's Site, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
- Daniel W A Buchan
- Department of Computer Science, UCL, Gower Street, London, WC1E 6BT, UK
- Cyrus Chothia
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 0QH, UK
- Domenico Cozzetto
- Department of Computer Science, UCL, Gower Street, London, WC1E 6BT, UK
- José M Dana
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Ioannis Filippis
- Centre for Bioinformatics, Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK
- Julian Gough
- Department of Computer Science, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK
- David T Jones
- Institute of Structural and Molecular Biology, UCL, 636 Darwin Building, Gower Street, London, WC1E 6BT, UK; Department of Computer Science, UCL, Gower Street, London, WC1E 6BT, UK
- Lawrence A Kelley
- Centre for Bioinformatics, Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK
- Gerard J Kleywegt
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Federico Minneci
- Department of Computer Science, UCL, Gower Street, London, WC1E 6BT, UK
- Jaina Mistry
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Alexey G Murzin
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 0QH, UK
- Bernardo Ochoa-Montaño
- Department of Biochemistry, University of Cambridge, Old Addenbrooke's Site, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
- Matt E Oates
- Department of Computer Science, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK
- Marco Punta
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Owen J L Rackham
- MRC Clinical Sciences Centre, Hammersmith Hospital Campus, Du Cane Road, London, W12 0NN, UK
- Jonathan Stahlhacke
- Department of Computer Science, University of Bristol, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK
- Michael J E Sternberg
- Centre for Bioinformatics, Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK
- Sameer Velankar
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Christine Orengo
- Institute of Structural and Molecular Biology, UCL, 636 Darwin Building, Gower Street, London, WC1E 6BT, UK
42
De Ferrari L, Mitchell JBO. From sequence to enzyme mechanism using multi-label machine learning. BMC Bioinformatics 2014; 15:150. [PMID: 24885296 PMCID: PMC4229970 DOI: 10.1186/1471-2105-15-150] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 03/06/2014] [Accepted: 05/07/2014] [Indexed: 12/30/2022]
Abstract
Background In this work we predict enzyme function at the level of chemical mechanism, providing a finer granularity of annotation than traditional Enzyme Commission (EC) classes. Hence we can predict not only whether a putative enzyme in a newly sequenced organism has the potential to perform a certain reaction, but how the reaction is performed, using which cofactors and with susceptibility to which drugs or inhibitors, details with important consequences for drug and enzyme design. Work that predicts enzyme catalytic activity based on 3D protein structure features limits the prediction of mechanism to proteins already having either a solved structure or a close relative suitable for homology modelling. Results In this study, we evaluate whether sequence identity, InterPro or Catalytic Site Atlas sequence signatures provide enough information for bulk prediction of enzyme mechanism. By splitting MACiE (Mechanism, Annotation and Classification in Enzymes database) mechanism labels to a finer granularity, which includes the role of the protein chain in the overall enzyme complex, the method can predict at 96% accuracy (and 96% micro-averaged precision, 99.9% macro-averaged recall) the MACiE mechanism definitions of 248 proteins available in the MACiE, EzCatDb (Database of Enzyme Catalytic Mechanisms) and SFLD (Structure Function Linkage Database) databases using an off-the-shelf K-Nearest Neighbours multi-label algorithm. Conclusion We find that InterPro signatures are critical for accurate prediction of enzyme mechanism. We also find that incorporating Catalytic Site Atlas attributes does not seem to provide additional accuracy. The software code (ml2db), data and results are available online at http://sourceforge.net/projects/ml2db/ and as supplementary files.
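The abstract reports both micro-averaged and macro-averaged scores for a multi-label classifier. The difference between the two can be sketched as below. This is an illustrative implementation under simple assumptions, not the ml2db code; the function name and the toy labels `mech_A` and `mech_B` are hypothetical.

```python
from typing import Dict, List, Set


def micro_macro_scores(true_labels: List[Set[str]],
                       pred_labels: List[Set[str]]) -> Dict[str, float]:
    """Micro- and macro-averaged precision and recall for multi-label output.

    Each list element is the set of labels for one protein.
    Micro-averaging pools counts over all labels before dividing;
    macro-averaging computes the score per label, then takes the plain mean.
    """
    labels = sorted(set().union(*true_labels, *pred_labels))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, pred in zip(true_labels, pred_labels):
        for label in labels:
            if label in pred and label in truth:
                tp[label] += 1
            elif label in pred:
                fp[label] += 1
            elif label in truth:
                fn[label] += 1

    def ratio(num: int, den: int) -> float:
        # Convention assumed here: an undefined score counts as 0.0.
        return num / den if den else 0.0

    n = len(labels) or 1  # avoid division by zero when there are no labels
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return {
        "micro_precision": ratio(total_tp, total_tp + total_fp),
        "micro_recall": ratio(total_tp, total_tp + total_fn),
        "macro_precision": sum(ratio(tp[l], tp[l] + fp[l]) for l in labels) / n,
        "macro_recall": sum(ratio(tp[l], tp[l] + fn[l]) for l in labels) / n,
    }


# Toy data: four proteins, two made-up mechanism labels.
truth = [{"mech_A"}, {"mech_A"}, {"mech_A"}, {"mech_B"}]
preds = [{"mech_A"}, {"mech_A"}, {"mech_A", "mech_B"}, {"mech_A"}]
scores = micro_macro_scores(truth, preds)
print(scores)
```

Because macro-averaging weights every label equally, a rare label that is predicted poorly lowers the macro score much more than the micro score, which is why the two are usually reported together.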
Affiliation(s)
- Luna De Ferrari
- Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, Scotland KY16 9ST, UK.