1
Schütze K, Heinzinger M, Steinegger M, Rost B. Nearest neighbor search on embeddings rapidly identifies distant protein relations. Frontiers in Bioinformatics 2022; 2:1033775. [PMID: 36466147 PMCID: PMC9714024 DOI: 10.3389/fbinf.2022.1033775] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 08/31/2022] [Accepted: 10/31/2022] [Indexed: 11/29/2023]
Abstract
Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as "homology detection") have used sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges of applying it to multi-domain proteins and presented some ideas on how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others.
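The nearest neighbor search named in this abstract can be illustrated with a minimal Python sketch. It assumes one fixed-length embedding per protein, cosine similarity as the metric, and an exhaustive scan over the database; the abstract specifies none of these choices, and all names and vectors below (`protA`, `nearest_neighbors`, the 3-dimensional toy embeddings) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, database, k=1):
    """Return the k database entries most similar to the query as (name, score) pairs.

    Assumption: brute-force scan; large databases would need an index instead.
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy per-protein embeddings (made-up values; real pLM embeddings are much longer).
database = {
    "protA": [0.9, 0.1, 0.0],
    "protB": [0.0, 1.0, 0.2],
    "protC": [0.1, 0.0, 1.0],
}
query = [1.0, 0.2, 0.1]
hits = nearest_neighbors(query, database, k=2)
```

In this toy setup the query lies closest to `protA`; whether a hit reflects a true remote relation would still have to be judged against a similarity threshold, which the sketch does not attempt.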
Affiliation(s)
- Konstantin Schütze
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Munich, Germany
- Michael Heinzinger
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
- Burkhard Rost
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Germany & TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany
2
Marquet C, Heinzinger M, Olenyi T, Dallago C, Erckert K, Bernhofer M, Nechaev D, Rost B. Embeddings from protein language models predict conservation and variant effects. Hum Genet 2022; 141:1629-1647. [PMID: 34967936 PMCID: PMC8716573 DOI: 10.1007/s00439-021-02411-y] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Received: 06/01/2021] [Accepted: 12/06/2021] [Indexed: 12/13/2022]
Abstract
The emergence of SARS-CoV-2 variants stressed the demand for tools that help interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy.
Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA , and PredictProtein.
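To make the idea of combining three per-variant inputs in a logistic regression concrete, here is a minimal Python sketch. It is not the published VESPA model: the weights, the bias, the feature scaling, and the function name `effect_score` are hypothetical placeholders chosen only so that the example runs.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def effect_score(conservation, blosum62, mask_prob, weights, bias):
    """Combine three features of one variant into a score in (0, 1).

    conservation: predicted conservation of the residue position (assumed 0..1)
    blosum62:     BLOSUM62 substitution score for wild type -> variant
    mask_prob:    pLM probability of the variant residue at the masked position
    """
    features = (conservation, blosum62, mask_prob)
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# HYPOTHETICAL parameters, not taken from the paper.
HYPOTHETICAL_WEIGHTS = (2.0, -0.3, -1.5)
HYPOTHETICAL_BIAS = -0.5

# A dissimilar swap at a conserved position vs. a similar swap at a variable position.
conserved_swap = effect_score(0.9, -3, 0.01, HYPOTHETICAL_WEIGHTS, HYPOTHETICAL_BIAS)
variable_swap = effect_score(0.1, 2, 0.40, HYPOTHETICAL_WEIGHTS, HYPOTHETICAL_BIAS)
```

With these placeholder parameters the variant at the conserved position receives the higher score, which is the qualitative behavior one would expect from such a combination.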
Affiliation(s)
- Céline Marquet
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Michael Heinzinger
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Tobias Olenyi
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Christian Dallago
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Kyra Erckert
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Michael Bernhofer
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Dmitrii Nechaev
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748, Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748, Munich, Germany
  - TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
3
Littmann M, Heinzinger M, Dallago C, Weissenow K, Rost B. Protein embeddings and deep learning predict binding residues for various ligand classes. Sci Rep 2021; 11:23916. [PMID: 34903827 PMCID: PMC8668950 DOI: 10.1038/s41598-021-03431-4] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 09/06/2021] [Accepted: 12/02/2021] [Indexed: 01/27/2023]
Abstract
One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress, many binding sites remain obscure. Here, we proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model (pLM) ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed MSA-based predictions. Combination with homology-based inference increased performance to F1 = 48 ± 3% (95% CI) and MCC = 0.46 ± 0.04 when merging all three ligand classes into one. All results were confirmed by three independent data sets. Focusing on very reliably predicted residues could complement experimental evidence: for the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when ignoring the problem of missing experimental annotations. The new method bindEmbed21 is fast, simple, and broadly applicable, using neither structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding and predicted about 6% of all residues as binding to metal ions, nucleic acids, or small molecules.
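The two performance measures quoted in this abstract, F1 and MCC, both derive from the confusion counts of a binary per-residue prediction. The following Python sketch shows the standard formulas on made-up labels (1 = binding, 0 = non-binding); the example data and function names are illustrative and unrelated to the paper's data sets.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up per-residue labels for one short protein.
observed = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(observed, predicted)
f1 = f1_score(tp, fp, fn)
correlation = mcc(tp, tn, fp, fn)
```

Unlike F1, MCC also rewards correctly predicted non-binding residues, which matters when binding residues are a small minority of all residues.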
Affiliation(s)
- Maria Littmann
  - Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Michael Heinzinger
  - Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Christian Dallago
  - Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Konstantin Weissenow
  - Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748, Munich, Germany
  - TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
  - Department of Biochemistry and Molecular Biophysics, Columbia University, 701 West, 168th Street, New York, NY, 10032, USA
4
Computational and experimental elucidation of Plasmodium falciparum phosphoethanolamine methyltransferase inhibitors: Pivotal drug target. PLoS One 2019; 14:e0221032. [PMID: 31437171 PMCID: PMC6705855 DOI: 10.1371/journal.pone.0221032] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/16/2019] [Accepted: 07/29/2019] [Indexed: 11/19/2022]
Abstract
INTRODUCTION Plasmodium falciparum synthesizes phosphatidylcholine for membrane development through the serine decarboxylase-phosphoethanolamine methyltransferase pathway for growth in the human host. Phosphoethanolamine methyltransferase (PfPMT) is a crucial enzyme for the synthesis of phosphocholine, a precursor of phosphatidylcholine, and is considered a pivotal drug target because it is absent in the host. Inhibition of PfPMT may kill the malaria parasite and is therefore being considered as a potential target for rational antimalarial drug design. METHODS In this study, we used computer-aided drug design (CADD) approaches to identify potential PfPMT inhibitors from the Asinex compound library, virtually screened for ADMET properties and docking affinity. The selected compounds were tested in vitro for schizonticidal, gametocidal, and cytotoxic activity. Nontoxic compounds were further studied for PfPMT enzyme specificity and for antimalarial efficacy against P. berghei in an albino mouse model. RESULTS We identified two nontoxic competitive PfPMT inhibitors, ASN.1 and ASN.3, with better schizonticidal and gametocidal activity, which inhibited PfPMT at IC50 values of 1.49 μM and 2.31 μM, respectively. A promising reduction in parasitaemia was found in both the oral (50 and 10 mg/kg) and intravenous (IV; 5 and 1 mg/kg) groups; however, growth inhibition was better in the intravenous groups. CONCLUSION We report compounds containing pyridinyl-pyrimidine and phenyl-furan scaffolds as potential inhibitors of PfPMT; they may thus act as promising antimalarial candidates that can be further optimized and used as leads for template-based antimalarial drug development.
5
Reeb J, Hecht M, Mahlich Y, Bromberg Y, Rost B. Predicted Molecular Effects of Sequence Variants Link to System Level of Disease. PLoS Comput Biol 2016; 12:e1005047. [PMID: 27536940 PMCID: PMC4990455 DOI: 10.1371/journal.pcbi.1005047] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 01/15/2016] [Accepted: 07/04/2016] [Indexed: 11/19/2022]
Abstract
Developments in experimental and computational biology are advancing our understanding of how protein sequence variation impacts molecular protein function. However, the leap from the micro level of molecular function to the macro level of the whole organism, e.g. disease, remains barred. Here, we present new results emphasizing earlier work that suggested some links from molecular function to disease. We focused on non-synonymous single nucleotide variants, also referred to as single amino acid variants (SAVs). Building upon OMIA (Online Mendelian Inheritance in Animals), we introduced a curated set of 117 disease-causing SAVs in animals. Methods optimized to capture effects upon molecular function often correctly predict human (OMIM) and animal (OMIA) Mendelian disease-causing variants. We also predicted effects of human disease-causing variants in the mouse model, i.e. we put OMIM SAVs into mouse orthologs. Overall, fewer variants were predicted with effect in the model organism than in the original organism. Our results, along with other recent studies, demonstrate that predictions of molecular effects capture some important aspects of disease. Thus, in silico methods focusing on the micro level of molecular function can help to understand the macro system level of disease.
The variations in the genetic sequence between individuals affect the gene product, i.e. the protein, differently. Some variants have no measurable effect (are neutral), while others affect protein function. Some of those effects are so severe that they cause so-called monogenic Mendelian diseases, i.e. diseases triggered by a single letter change. Some in silico methods predict the molecular impact of sequence variation. However, both experimental and computational analyses struggle to generalize from the effect upon molecular protein function to the effect upon the organism, such as a disease. Here, we confirmed that methods predicting molecular effects correctly capture the type of effects causing Mendelian diseases in human and introduced a data set for animal diseases that was also captured by prediction methods. Fewer effects were predicted when human variants were tested in silico in an animal model (here mouse). This is important to know because “mouse models” are commonly used to study human diseases. Overall, we provided some evidence for a link between the molecular level and some types of disease.
Affiliation(s)
- Jonas Reeb
  - Department of Informatics, Bioinformatics & Computational Biology - i12, Technische Universität München, Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Technische Universität München, Garching, Germany
- Maximilian Hecht
  - Department of Informatics, Bioinformatics & Computational Biology - i12, Technische Universität München, Garching/Munich, Germany
- Yannick Mahlich
  - Department of Informatics, Bioinformatics & Computational Biology - i12, Technische Universität München, Garching/Munich, Germany
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey, United States of America
  - Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany
- Yana Bromberg
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, New Jersey, United States of America
  - Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics & Computational Biology - i12, Technische Universität München, Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany
  - Institute for Food and Plant Sciences WZW, Technische Universität München, Weihenstephan, Freising, Germany
6
Ahmad M, Jung LT, Bhuiyan MAA. On fuzzy semantic similarity measure for DNA coding. Comput Biol Med 2015; 69:144-51. [PMID: 26773936 DOI: 10.1016/j.compbiomed.2015.12.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/14/2015] [Revised: 12/22/2015] [Accepted: 12/23/2015] [Indexed: 11/28/2022]
Abstract
A coding measure scheme numerically translates the DNA sequence to a time domain signal for protein coding regions identification. A number of coding measure schemes based on numerology, geometry, fixed mapping, statistical characteristics and chemical attributes of nucleotides have been proposed in recent decades. Such coding measure schemes lack the biologically meaningful aspects of nucleotide data and hence do not significantly discriminate coding regions from non-coding regions. This paper presents a novel fuzzy semantic similarity measure (FSSM) coding scheme centering on FSSM codons' clustering and genetic code context of nucleotides. Certain natural characteristics of nucleotides i.e. appearance as a unique combination of triplets, preserving special structure and occurrence, and ability to own and share density distributions in codons have been exploited in FSSM. The nucleotides' fuzzy behaviors, semantic similarities and defuzzification based on the center of gravity of nucleotides revealed a strong correlation between nucleotides in codons. The proposed FSSM coding scheme attains a significant enhancement in coding regions identification i.e. 36-133% as compared to other existing coding measure schemes tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms.
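For readers unfamiliar with coding measure schemes, the Python sketch below shows one classic fixed-mapping scheme of the kind this abstract contrasts itself with, not the FSSM scheme itself. It assumes binary indicator sequences (one per nucleotide) and uses the spectral power at period 3 as the coding signal; this choice of mapping and all names are assumptions made for illustration.

```python
import cmath


def indicator_sequences(dna):
    """Map a DNA string to four binary signals, one per nucleotide (a fixed mapping)."""
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ACGT"}


def period3_power(dna):
    """Summed spectral power of the four indicator signals at period 3, per base.

    Evaluates the discrete Fourier transform at frequency N/3, where
    protein-coding regions tend to show a peak because of codon structure.
    """
    n = len(dna)
    if n == 0:
        return 0.0
    total = 0.0
    for signal in indicator_sequences(dna).values():
        coeff = sum(x * cmath.exp(-2j * cmath.pi * pos / 3) for pos, x in enumerate(signal))
        total += abs(coeff) ** 2
    return total / n


# Toy sequences: a perfect codon-like repeat vs. a repeat with period 4.
codon_like = period3_power("ATG" * 10)
period_four = period3_power("ACGT" * 8)
```

In practice such a measure is computed in a sliding window along a genome, and windows with high period-3 power are flagged as candidate coding regions.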
Affiliation(s)
- Muneer Ahmad
  - College of Computer Sciences, King Faisal University, Saudi Arabia
- Low Tang Jung
  - Department of Computer Sciences, University Technology PETRONAS, Malaysia
7
Rappoport N, Stern A, Linial N, Linial M. Entropy-driven partitioning of the hierarchical protein space. Bioinformatics 2015; 30:i624-30. [PMID: 25161256 PMCID: PMC4147929 DOI: 10.1093/bioinformatics/btu478] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 02/03/2023]
Abstract
Motivation: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all included proteins, finding an optimal level of granularity for the purpose of focusing on protein functional groups remains elusive. Here, we ask whether knowledge-based annotations on protein families can support the automatic unsupervised methods for identifying high-quality protein families. We present a method that yields within the ProtoNet hierarchy an optimal partition of clusters, relative to manual annotation schemes. The method’s principle is to minimize the entropy-derived distance between annotation-based partitions and all available hierarchical partitions. We describe the best front (BF) partition of 2 478 328 proteins from UniRef50. Of 4 929 553 ProtoNet tree clusters, the BF based on Pfam annotations contains 26 891 clusters. The high quality of the partition is validated by the close correspondence with the set of clusters that best describe thousands of keywords of Pfam. The BF is shown to be superior to a naïve cut in the ProtoNet tree that yields a similar number of clusters. Finally, we used parameters intrinsic to the clustering process to enrich a priori the BF’s clusters. We present the entropy-based method’s benefit in overcoming the unavoidable limitations of nested clusters in ProtoNet. We suggest that this automatic information-based cluster selection can be useful for other large-scale annotation schemes, as well as for systematically testing and comparing putative families derived from alternative clustering methods. Availability and implementation: A catalog of BF clusters for thousands of Pfam keywords is provided at http://protonet.cs.huji.ac.il/bestFront/ Contact: michall@cc.huji.ac.il
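The principle of choosing the partition with the smallest entropy-derived distance to an annotation can be sketched in Python as follows. The sketch uses variation of information as the distance and compares only a few flat candidate partitions; both are simplifying assumptions, since the abstract does not give the exact distance and the best front is selected from within a tree. Labels and names are made up.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Entropy-based distance H(A|B) + H(B|A) between two partitions of the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("partitions must be non-empty and cover the same items")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (math.log(p_ab / (count_a[a] / n)) + math.log(p_ab / (count_b[b] / n)))
    return vi


# Hypothetical annotation of six proteins and three candidate cuts of a clustering tree.
annotation = ["kinase", "kinase", "kinase", "protease", "protease", "transporter"]
candidate_cuts = {
    "coarse": [0, 0, 0, 0, 0, 0],
    "medium": [0, 0, 0, 1, 1, 2],
    "fine": [0, 1, 2, 3, 4, 5],
}
best_cut = min(
    candidate_cuts,
    key=lambda name: variation_of_information(annotation, candidate_cuts[name]),
)
```

Both the overly coarse and the overly fine cut are penalized, the first for merging distinct families and the second for splitting them, so the minimum falls on the cut that mirrors the annotation.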
Affiliation(s)
- Nadav Rappoport
  - School of Computer Science and Engineering and Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, 91904, Israel
- Amos Stern
  - School of Computer Science and Engineering and Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, 91904, Israel
- Nathan Linial
  - School of Computer Science and Engineering and Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, 91904, Israel
- Michal Linial
  - School of Computer Science and Engineering and Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, 91904, Israel
8
Doerr D, Stoye J, Böcker S, Jahn K. Identifying gene clusters by discovering common intervals in indeterminate strings. BMC Genomics 2015; 15 Suppl 6:S2. [PMID: 25571793 PMCID: PMC4274641 DOI: 10.1186/1471-2164-15-s6-s2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Background Comparative analyses of chromosomal gene orders are successfully used to predict gene clusters in bacterial and fungal genomes. Present models for detecting sets of co-localized genes in chromosomal sequences require prior knowledge of gene family assignments of genes in the dataset of interest. These families are often computationally predicted on the basis of sequence similarity or higher order features of gene products. Errors introduced in this process amplify in subsequent gene order analyses and thus may deteriorate gene cluster prediction. Results In this work, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure of the genes of different genomes. We evaluate our gene family-free model against a gene family-based model on a dataset of 93 bacterial genomes. Conclusions Our model is able to detect gene clusters that would be also detected with well-established gene family-based approaches. Moreover, we show that it is able to detect conserved regions which are missed by gene family-based methods due to wrong or deficient gene family assignments.
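The notion of a common interval behind this abstract can be illustrated on its simplest case. The Python sketch below assumes two genomes given as permutations of the same genes, i.e. exact one-to-one gene family assignments, which is precisely the restriction the paper sets out to relax; the quadratic scan and all names are illustrative.

```python
def common_intervals(order_a, order_b):
    """Find gene sets (size >= 2) that are contiguous in both gene orders.

    Assumption: order_a and order_b are permutations of the same genes.
    Uses a simple quadratic scan over all segments of order_b.
    """
    position_in_a = {gene: idx for idx, gene in enumerate(order_a)}
    found = []
    n = len(order_b)
    for start in range(n):
        lo = hi = position_in_a[order_b[start]]
        for end in range(start + 1, n):
            pos = position_in_a[order_b[end]]
            lo = min(lo, pos)
            hi = max(hi, pos)
            # The segment is contiguous in order_a iff its positions there span
            # exactly as many slots as the segment has genes.
            if hi - lo == end - start:
                found.append(frozenset(order_b[start:end + 1]))
    return found


# Toy genomes with five genes each.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g3", "g1", "g2", "g5", "g4"]
clusters = common_intervals(genome_a, genome_b)
```

Here the genes g1, g2 and g3 stay together in both genomes although their internal order differs, which is the kind of conserved neighborhood reported as a gene cluster candidate.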
9
Ben-Tal N, Kolodny R. Representation of the Protein Universe using Classifications, Maps, and Networks. Isr J Chem 2014. [DOI: 10.1002/ijch.201400001] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
10
PPM-Dom: A novel method for domain position prediction. Comput Biol Chem 2013; 47:8-15. [DOI: 10.1016/j.compbiolchem.2013.06.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/07/2013] [Revised: 06/05/2013] [Accepted: 06/05/2013] [Indexed: 02/05/2023]
11
Ezkurdia I, Tress ML. Protein structural domains: definition and prediction. Current Protocols in Protein Science 2011; Chapter 2:2.14.1-2.14.16. [PMID: 22045561 DOI: 10.1002/0471140864.ps0214s66] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/06/2023]
Abstract
Recognition and prediction of structural domains in proteins is an important part of structure and function prediction. This unit lists the range of tools available for domain prediction, and describes sequence and structural analysis tools that complement domain prediction methods. Also detailed are the basic domain prediction steps, along with suggested strategies for different protein sequences and potential pitfalls in domain boundary prediction. The difficult problem of domain orientation prediction is also discussed. All the resources necessary for domain boundary prediction are accessible via publicly available Web servers and databases and do not require computational expertise.
Affiliation(s)
- Iakes Ezkurdia
  - Spanish National Cancer Research Centre (CNIO)-Structural Biology and Biocomputing Programme, Madrid, Spain
- Michael L Tress
  - Spanish National Cancer Research Centre (CNIO)-Structural Biology and Biocomputing Programme, Madrid, Spain
12
Angadi UB, Venkatesulu M. Structural SCOP superfamily level classification using unsupervised machine learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 9:601-608. [PMID: 21844638 DOI: 10.1109/tcbb.2011.114] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/31/2023]
Abstract
One of the major research directions in bioinformatics is that of assigning superfamily classification to a given set of proteins. The classification reflects the structural, evolutionary, and functional relatedness. These relationships are embodied in a hierarchical classification, such as the Structural Classification of Proteins (SCOP), which is mostly manually curated. Such a classification is essential for the structural and functional analyses of proteins. Yet a large number of proteins remain unclassified. In this study, we have proposed an unsupervised machine learning approach to classify and assign a given set of proteins to SCOP superfamilies. In the method, we have constructed a database and similarity matrix using P-values obtained from an all-against-all BLAST run and trained the network with the ART2 unsupervised learning algorithm using the rows of the similarity matrix as input vectors, enabling the trained network to classify the proteins with an F-measure between 0.82 and 0.97. The performance of ART2 has been compared with that of spectral clustering, Random forest, SVM, and HHpred. ART2 performs better than the others except HHpred. HHpred performs better than ART2 and the sum of errors is smaller than that of the other methods evaluated.
13
Protein disorder--a breakthrough invention of evolution? Curr Opin Struct Biol 2011; 21:412-8. [PMID: 21514145 DOI: 10.1016/j.sbi.2011.03.014] [Citation(s) in RCA: 112] [Impact Index Per Article: 8.6] [Received: 02/14/2011] [Revised: 03/29/2011] [Accepted: 03/29/2011] [Indexed: 11/21/2022]
Abstract
As an operational definition, we refer to regions in proteins that do not adopt regular three-dimensional structures in isolation, as disordered regions. An antipode to disorder would be 'well-structured' rather than 'ordered'. Here, we argue for the following three hypotheses. Firstly, it is more useful to picture disorder as a distinct phenomenon in structural biology than as an extreme example of protein flexibility. Secondly, there are many very different flavors of protein disorder, nevertheless, it seems advantageous to portray the universe of all possible proteins in terms of two main types: well-structured, disordered. There might be a third type 'other' but we have so far no positive evidence for this. Thirdly, nature uses protein disorder as a tool to adapt to different environments. Protein disorder is evolutionarily conserved and this maintenance of disorder is highly nontrivial. Increasingly integrating protein disorder into the toolbox of a living cell was a crucial step in the evolution from simple bacteria to complex eukaryotes. We need new advanced computational methods to study this new milestone in the advance of protein biology.
14
Bannert C, Welfle A, aus dem Spring C, Schomburg D. BrEPS: a flexible and automatic protocol to compute enzyme-specific sequence profiles for functional annotation. BMC Bioinformatics 2010; 11:589. [PMID: 21122127 PMCID: PMC3009691 DOI: 10.1186/1471-2105-11-589] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 03/25/2010] [Accepted: 12/01/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Models for the simulation of metabolic networks require the accurate prediction of enzyme function. Based on a genomic sequence, enzymatic functions of gene products are today mainly predicted by sequence database searching and operon analysis. Other methods can support these techniques: We have developed an automatic method "BrEPS" that creates highly specific sequence patterns for the functional annotation of enzymes. RESULTS The enzymes in the UniprotKB are identified and their sequences compared against each other with BLAST. The enzymes are then clustered into a number of trees, where each tree node is associated with a set of EC-numbers. The enzyme sequences in the tree nodes are aligned with ClustalW. The conserved columns of the resulting multiple alignments are used to construct sequence patterns. In the last step, we verify the quality of the patterns by computing their specificity. Patterns with low specificity are omitted and recomputed further down in the tree. The final high-quality patterns can be used for functional annotation. We ran our protocol on a recent Swiss-Prot release and show statistics, as well as a comparison to PRIAM, a probabilistic method that is also specialized on the functional annotation of enzymes. We determine the amount of true positive annotations for five common microorganisms with data from BRENDA and AMENDA serving as standard of truth. BrEPS is almost on par with PRIAM, a fact which we discuss in the context of five manually investigated cases. CONCLUSIONS Our protocol computes highly specific sequence patterns that can be used to support the functional annotation of enzymes. The main advantages of our method are that it is automatic and unsupervised, and quite fast once the patterns are evaluated. The results show that BrEPS can be a valuable addition to the reconstruction of metabolic networks.
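The step in which conserved alignment columns are turned into a sequence pattern can be sketched in Python as below. This is a simplified illustration rather than the BrEPS implementation: the pattern syntax (`x` as wildcard), the conservation threshold `min_fraction`, and the toy alignment are all assumptions.

```python
from collections import Counter


def conserved_pattern(alignment, min_fraction=1.0):
    """Build a pattern string from the conserved columns of a multiple alignment.

    A column contributes its residue if that residue occurs in at least
    min_fraction of the sequences; gaps ('-') and variable columns become 'x'.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pattern = []
    for col in range(length):
        column = [row[col] for row in alignment]
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) >= min_fraction:
            pattern.append(residue)
        else:
            pattern.append("x")
    return "".join(pattern)


# Made-up alignment of three short sequences.
toy_alignment = ["MKHGCA", "MRHGCA", "MKHGSA"]
strict = conserved_pattern(toy_alignment)
relaxed = conserved_pattern(toy_alignment, min_fraction=0.6)
```

Lowering the threshold yields longer but less specific patterns, which is why a protocol of this kind needs a separate specificity check before a pattern is used for annotation.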
Affiliation(s)
- C Bannert
- Dept. of Bioinformatics and Biochemistry, Technische Universität Braunschweig, Langer Kamp 19b, 38106 Braunschweig, Germany
- A Welfle
- Dept. of Bioinformatics and Biochemistry, Technische Universität Braunschweig, Langer Kamp 19b, 38106 Braunschweig, Germany
- C aus dem Spring
- Dept. of Bioinformatics and Biochemistry, Technische Universität Braunschweig, Langer Kamp 19b, 38106 Braunschweig, Germany
- D Schomburg
- Dept. of Bioinformatics and Biochemistry, Technische Universität Braunschweig, Langer Kamp 19b, 38106 Braunschweig, Germany
|
15
|
Naamati G, Fromer M, Linial M. Expansion of tandem repeats in sea anemone Nematostella vectensis proteome: A source for gene novelty? BMC Genomics 2009; 10:593. [PMID: 20003297 PMCID: PMC2805694 DOI: 10.1186/1471-2164-10-593] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 08/15/2009] [Accepted: 12/10/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The complete proteome of the starlet sea anemone, Nematostella vectensis, provides insights into gene invention dating back to the Cnidarian-Bilaterian ancestor. With the addition of the complete proteomes of Hydra magnipapillata and Monosiga brevicollis, the investigation of proteins having unique features in early metazoan life has become practical. We focused on the properties and the evolutionary trends of tandem repeat (TR) sequences in Cnidaria proteomes. RESULTS: We found that 11-16% of N. vectensis proteins contain tandem repeats. Most TRs cover 150 amino acid segments that are composed of basic units of 5-20 amino acids. In total, the N. vectensis proteome has about 3300 unique TR-units, but only a small fraction of them are shared with H. magnipapillata, M. brevicollis, or mammalian proteomes. The overall abundance of these TRs stands out relative to that of 14 proteomes representing the diversity among eukaryotes and within the metazoan world. TR-units are characterized by a unique composition of amino acids, with cysteine and histidine being over-represented. Structurally, most TR-segments are associated with coiled and disordered regions. Interestingly, 80% of the TR-segments can be read in more than one open reading frame. For over 100 of them, translation of the alternative frames would result in long proteins. Most domain families that are characterized as repeats in eukaryotes are found in the TR-proteomes from Nematostella and Hydra. CONCLUSIONS: While most TR-proteins have originated from prediction tools and are still awaiting experimental validation, supportive evidence exists for hundreds of TR-units in Nematostella. The existence of TR-proteins in early metazoan life may have served as a robust mode for novel genes with previously overlooked structural and functional characteristics.
|
16
|
Punta M, Love J, Handelman S, Hunt JF, Shapiro L, Hendrickson WA, Rost B. Structural genomics target selection for the New York consortium on membrane protein structure. J Struct Funct Genomics 2009; 10:255-68. [PMID: 19859826 PMCID: PMC2780672 DOI: 10.1007/s10969-009-9071-1] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Received: 04/22/2009] [Accepted: 09/30/2009] [Indexed: 01/02/2023]
Abstract
The New York Consortium on Membrane Protein Structure (NYCOMPS), a part of the Protein Structure Initiative (PSI) in the USA, has as its mission to establish a high-throughput pipeline for determination of novel integral membrane protein structures. Here we describe our current target selection protocol, which applies structural genomics approaches informed by the collective experience of our team of investigators. We first extract all annotated proteins from our reagent genomes, i.e. the 96 fully sequenced prokaryotic genomes from which we clone DNA. We filter this initial pool of sequences and obtain a list of valid targets. NYCOMPS defines valid targets as those that, among other features, have at least two predicted transmembrane helices, no predicted long disordered regions and, except for community nominated targets, no significant sequence similarity in the predicted transmembrane region to any known protein structure. Proteins that feed our experimental pipeline are selected by defining a protein seed and searching the set of all valid targets for proteins that are likely to have a transmembrane region structurally similar to that of the seed. We require sequence similarity aligning at least half of the predicted transmembrane region of seed and target. Seeds are selected according to their feasibility and/or biological interest, and they include both centrally selected targets and community nominated targets. As of December 2008, over 6,000 targets have been selected and are currently being processed by the experimental pipeline. We discuss how our target list may impact structural coverage of the membrane protein space.
Affiliation(s)
- Marco Punta
- Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY, 10032, USA.
|
17
|
Walsh I, Martin AJM, Mooney C, Rubagotti E, Vullo A, Pollastri G. Ab initio and homology based prediction of protein domains by recursive neural networks. BMC Bioinformatics 2009; 10:195. [PMID: 19558651 PMCID: PMC2711945 DOI: 10.1186/1471-2105-10-195] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 10/17/2008] [Accepted: 06/26/2009] [Indexed: 11/10/2022] Open
Abstract
Background: Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures. Results: We describe novel systems for the prediction of protein domain boundaries powered by Recursive Neural Networks. The systems rely on a combination of primary sequence and evolutionary information, predictions of structural features such as secondary structure, solvent accessibility and residue contact maps, and structural templates, both annotated for domains (from the SCOP dataset) and unannotated (from the PDB). We gauge the contribution of contact maps, and PDB and SCOP templates independently and for different ranges of template quality. We find that accurately predicted contact maps are informative for the prediction of domain boundaries, while the same is not true for contact maps predicted ab initio. We also find that gap information from PDB templates is informative, but, not surprisingly, less than SCOP annotations. We test both systems trained on templates of all qualities, and systems trained only on templates of marginal similarity to the query (less than 25% sequence identity). While the first batch of systems produces near perfect predictions in the presence of fair to good templates, the second batch outperforms or matches ab initio predictors down to essentially any level of template quality. We test all systems in 5-fold cross-validation on a large non-redundant set of multi-domain and single domain proteins. The final predictors are state-of-the-art, with a template-less prediction boundary recall of 50.8% (precision 38.7%) within ± 20 residues and a single domain recall of 80.3% (precision 78.1%). The SCOP-based predictors achieve a boundary recall of 74% (precision 77.1%), again within ± 20 residues, and classify single domain proteins as such in over 85% of cases, when we allow a mix of bad and good quality templates. If we only allow marginal templates (max 25% sequence identity to the query) the scores remain high, with boundary recall and precision of 59% and 66.3%, and 80% of all single domain proteins predicted correctly. Conclusion: The systems presented here may prove useful in large-scale annotation of protein domains in proteins of unknown structure. The methods are available as public web servers, and we plan to run them on a multi-genomic scale and make the results public in the near future.
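The abstract above reports boundary recall and precision "within ± 20 residues". The following is a minimal sketch of how such tolerance-based scores could be computed. The matching rule is an assumption made for illustration (a boundary counts as matched when any boundary on the other side lies within the tolerance); the abstract does not state the authors' exact scoring protocol, and the function name `boundary_scores` is hypothetical.

```python
def boundary_scores(predicted, observed, tolerance=20):
    """Return (recall, precision) for predicted domain boundaries.

    Assumed rule (not taken from the paper): an observed boundary is
    recovered if some prediction lies within `tolerance` residues of it,
    and a predicted boundary is correct if some observed boundary lies
    within `tolerance` residues of it.
    """
    def near(position, others):
        return any(abs(position - other) <= tolerance for other in others)

    recovered = sum(1 for o in observed if near(o, predicted))
    correct = sum(1 for p in predicted if near(p, observed))
    recall = recovered / len(observed) if observed else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision


# Toy example with made-up residue positions.
recall, precision = boundary_scores([100, 250, 400], [110, 300])
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In the toy example only the observed boundary at residue 110 has a prediction within 20 residues, so one of two observed boundaries is recovered and one of three predictions is correct.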
Affiliation(s)
- Ian Walsh
- School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland.
|
18
|
Kuzniar A, Lin K, He Y, Nijveen H, Pongor S, Leunissen JAM. ProGMap: an integrated annotation resource for protein orthology. Nucleic Acids Res 2009; 37:W428-34. [PMID: 19494185 PMCID: PMC2703891 DOI: 10.1093/nar/gkp462] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/20/2022] Open
Abstract
Current protein sequence databases employ different classification schemes that often provide conflicting annotations, especially for poorly characterized proteins. ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap) is a web-tool designed to help researchers and database annotators to assess the coherence of protein groups defined in various databases and thereby facilitate the annotation of newly sequenced proteins. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240 000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF. ProGMap combines the underlying classification schemes via a network of links constructed by a fast and fully automated mapping approach originally developed for document classification. The web interface enables queries to be made using sequence identifiers, gene symbols, protein functions or amino acid and nucleotide sequences. For the latter query type BLAST similarity search and QuickMatch identity search services have been incorporated, for finding sequences similar (or identical) to a query sequence. ProGMap is meant to help users of high throughput methodologies who deal with partially annotated genomic data.
Affiliation(s)
- Arnold Kuzniar
- Laboratory of Bioinformatics, Wageningen University and Research Centre (WUR), Dreijenlaan 3, 6703 HA Wageningen, The Netherlands
|
19
|
Fulton DL, Sundararajan S, Badis G, Hughes TR, Wasserman WW, Roach JC, Sladek R. TFCat: the curated catalog of mouse and human transcription factors. Genome Biol 2009; 10:R29. [PMID: 19284633 PMCID: PMC2691000 DOI: 10.1186/gb-2009-10-3-r29] [Citation(s) in RCA: 153] [Impact Index Per Article: 10.2] [Received: 12/05/2008] [Revised: 02/26/2009] [Accepted: 03/12/2009] [Indexed: 11/20/2022] Open
Abstract
Unravelling regulatory programs governed by transcription factors (TFs) is fundamental to understanding biological systems. TFCat is a catalog of mouse and human TFs based on a reliable core collection of annotations obtained by expert review of the scientific literature. The collection, including proven and homology-based candidate TFs, is annotated within a function-based taxonomy and DNA-binding proteins are organized within a classification system. All data and user-feedback mechanisms are available at the TFCat portal.
Affiliation(s)
- Debra L Fulton
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, University of British Columbia, Vancouver, Canada.
|
20
|
Wang M, Caetano-Anollés G. The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world. Structure 2009; 17:66-78. [PMID: 19141283 DOI: 10.1016/j.str.2008.11.008] [Citation(s) in RCA: 101] [Impact Index Per Article: 6.7] [Received: 08/17/2008] [Revised: 10/27/2008] [Accepted: 11/13/2008] [Indexed: 10/21/2022]
Abstract
Protein domains are compact evolutionary units of structure and function that usually combine in proteins to produce complex domain arrangements. In order to study their evolution, we reconstructed genome-based phylogenetic trees of architectures from a census of domain structure and organization conducted at protein fold and fold-superfamily levels in hundreds of fully sequenced genomes. These trees defined timelines of architectural discovery and revealed remarkable evolutionary patterns, including the explosive appearance of domain combinations during the rise of organismal lineages, the dominance of domain fusion processes throughout evolution, and the late appearance of a new class of multifunctional modules in Eukarya by fission of domain combinations. Our study provides a detailed account of the history and diversification of a molecular interactome and shows how the interplay of domain fusions and fissions defines an evolutionary mechanics of domain organization that is fundamentally responsible for the complexity of the protein world.
Affiliation(s)
- Minglei Wang
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
21
|
Wrzeszczynski KO, Rost B. Cell cycle kinases predicted from conserved biophysical properties. Proteins 2009; 74:655-68. [PMID: 18704950 DOI: 10.1002/prot.22181] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Machine-learning techniques can classify functionally related proteins where homology-transfer as well as sequence and structure motifs fail. Here, we present a method aimed at complementing homology-transfer in the identification of cell cycle control kinases from sequence alone. First, we identified functionally significant residues in cell cycle proteins through their high sequence conservation and biophysical properties. We then incorporated these residues and their features into support vector machines (SVM) to identify new kinases and, more specifically, to differentiate cell cycle kinases from other kinases and other proteins. As expected, the most informative residues tend to be highly conserved and tend to localize in the ATP binding regions of the kinases. Another observation confirmed that ATP binding regions are typically not found on the surface but in partially buried sites, and that this fact is correctly captured by accessibility predictions. Using these highly conserved, semi-buried residues and their biophysical properties, we could distinguish cell cycle S/T kinases from other kinase families at levels around 70-80% accuracy and 62-81% coverage. An application to the entire human proteome predicted at least 97 human proteins with limited previous annotations to be candidates for cell cycle kinases.
Affiliation(s)
- Kazimierz O Wrzeszczynski
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|
22
|
Nair R, Liu J, Soong TT, Acton TB, Everett JK, Kouranov A, Fiser A, Godzik A, Jaroszewski L, Orengo C, Montelione GT, Rost B. Structural genomics is the largest contributor of novel structural leverage. J Struct Funct Genomics 2009; 10:181-91. [PMID: 19194785 PMCID: PMC2705706 DOI: 10.1007/s10969-008-9055-6] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.6] [Received: 11/15/2008] [Accepted: 12/08/2008] [Indexed: 11/28/2022]
Abstract
The Protein Structure Initiative (PSI) at the US National Institutes of Health (NIH) is funding four large-scale centers for structural genomics (SG). These centers systematically target many large families without structural coverage, as well as very large families with inadequate structural coverage. Here, we report a few simple metrics that demonstrate how successfully these efforts optimize structural coverage: while the PSI-2 (2005-now) contributed more than 8% of all structures deposited into the PDB, it contributed over 20% of all novel structures (i.e. structures for protein sequences with no structural representative in the PDB on the date of deposition). The structural coverage of the protein universe represented by today’s UniProt (v12.8) has increased linearly from 1992 to 2008; structural genomics has contributed significantly to the maintenance of this growth rate. Success in increasing novel leverage (defined in Liu et al. in Nat Biotechnol 25:849–851, 2007) has resulted from systematic targeting of large families. PSI’s per structure contribution to novel leverage was over 4-fold higher than that for non-PSI structural biology efforts during the past 8 years. If the success of the PSI continues, it may take just another ~15 years to cover most sequences in the current UniProt database.
Affiliation(s)
- Rajesh Nair
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA
|
23
|
Loewenstein Y, Portugaly E, Fromer M, Linial M. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics 2008; 24:i41-9. [PMID: 18586742 PMCID: PMC2718652 DOI: 10.1093/bioinformatics/btn174] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.1] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: UPGMA (average linkage) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. APPLICATION: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. RESULTS: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows significant improvement over current protein family clusterings, which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. AVAILABILITY: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools, will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.
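To make the memory requirement mentioned in the abstract above concrete, here is a minimal sketch of textbook UPGMA that keeps every pairwise dissimilarity in memory. It is not the memory-constrained MC-UPGMA variant the authors describe; the function name `upgma` and the toy matrix are illustrative assumptions.

```python
from itertools import combinations


def upgma(dist):
    """Naive UPGMA on a full symmetric dissimilarity matrix (list of lists).

    Returns the merge history as (cluster_a, cluster_b, dissimilarity)
    tuples, where clusters are frozensets of leaf indices. All pairwise
    dissimilarities are held in memory, which is the limitation the
    abstract refers to.
    """
    n = len(dist)
    clusters = {i: frozenset([i]) for i in range(n)}
    # Keys are (smaller_id, larger_id) pairs of current cluster ids.
    d = {(i, j): dist[i][j] for i, j in combinations(range(n), 2)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (a, b), value = min(d.items(), key=lambda item: item[1])
        merges.append((clusters[a], clusters[b], value))
        size_a, size_b = len(clusters[a]), len(clusters[b])
        # Size-weighted average keeps the result equal to the mean over
        # all leaf pairs (average linkage).
        updated = {}
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            updated[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        merged = clusters.pop(a) | clusters.pop(b)
        for c, v in updated.items():
            d[(c, next_id)] = v  # new ids are always the largest
        clusters[next_id] = merged
        next_id += 1
    return merges


# Toy 4x4 dissimilarity matrix (made-up values).
toy = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
for left, right, value in upgma(toy):
    print(sorted(left), sorted(right), value)
```

The dictionary `d` starts with n(n-1)/2 entries, so storage grows quadratically with the number of sequences; this is the cost that makes the plain algorithm impractical for very large datasets.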
Affiliation(s)
- Yaniv Loewenstein
- School of Computer Science and Engineering, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
|
24
|
Sequence similarity network reveals common ancestry of multidomain proteins. PLoS Comput Biol 2008; 4:e1000063. [PMID: 18475320 PMCID: PMC2377100 DOI: 10.1371/journal.pcbi.1000063] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Received: 02/15/2007] [Accepted: 03/18/2008] [Indexed: 11/25/2022] Open
Abstract
We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. 
In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them. Our study demonstrates the utility of mining network structure for evolutionary information, suggesting this is a fertile approach for investigating evolutionary processes in the post-genomic era. New genes evolve through the duplication and modification of existing genes. As a result, genes that share common ancestry tend to have similar structure and function. Computational methods that use common ancestry have been extraordinarily successful in inferring function. The practice of discerning evolutionary relationships is stymied, however, by modular sequences made up of two or more domains. When two genes share some domains but not others, it is difficult to distinguish a case of common ancestry from insertion of the same domain into both genes. We present a formal framework to define how multidomain genes are related, and propose a novel method for rapid, robust characterization of evolutionary relationships. In an empirical comparison with the current state of the art, we demonstrate superior performance of our method using a large hand-curated set of sequences known to share common ancestry. The success of our method derives from its unique ability to infer evolutionary history from local topology in the sequence similarity network. This represents a departure from the view that protein family classification must be restricted to families with conserved architecture. By exploiting the structure of the sequence similarity network, our approach surmounts this limitation and opens the door to studies of the role of modularity in protein evolution.
|
25
|
Schenk G, Margraf T, Torda AE. Protein sequence and structure alignments within one framework. Algorithms Mol Biol 2008; 3:4. [PMID: 18380904 PMCID: PMC2390564 DOI: 10.1186/1748-7188-3-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 01/29/2008] [Accepted: 04/01/2008] [Indexed: 11/19/2022] Open
Abstract
Background: Protein structure alignments are usually based on very different techniques to sequence alignments. We propose a method which treats sequence, structure and even combined sequence + structure in a single framework. Using a probabilistic approach, we calculate a similarity measure which can be applied to fragments containing only protein sequence, structure or both simultaneously. Results: Proof-of-concept results are given for the different problems. For sequence alignments, the methodology is no better than conventional methods. For structure alignments, the techniques are very fast, reliable and tolerant of a range of alignment parameters. Combined sequence and structure alignments may provide a more reliable alignment for pairs of proteins where pure structural alignments can be misled by repetitive elements or apparent symmetries. Conclusion: The probabilistic framework has an elegance in principle, merging sequence and structure descriptors into a single framework. It has a practical use in fast structural alignments and a potential use in finding those examples where sequence and structural similarities apparently disagree.
|
26
|
Carrière C, Mornon JP, Venien-Bryan C, Boisset N, Callebaut I. Calcineurin B-like domains in the large regulatory α/β subunits of phosphorylase kinase. Proteins 2008; 71:1597-606. [DOI: 10.1002/prot.22006] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
|
27
|
Tress M, Cheng J, Baldi P, Joo K, Lee J, Seo JH, Lee J, Baker D, Chivian D, Kim D, Ezkurdia I. Assessment of predictions submitted for the CASP7 domain prediction category. Proteins 2008; 69 Suppl 8:137-51. [PMID: 17680686 DOI: 10.1002/prot.21675] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Indexed: 11/07/2022]
Abstract
This paper details the assessment process and evaluation results for the Critical Assessment of Protein Structure Prediction (CASP7) domain prediction category. Domain predictions were assessed using the Normalized Domain Overlap score introduced in CASP6 and the accuracy of prediction of domain break points. The results of the analysis clearly demonstrate that the best methods are able to make consistently reliable predictions when the target has a structural template, although they are less good when the domain break occurs in a region not covered by a template. The conditions of the experiment meant that it was impossible to draw any conclusions about domain prediction for free modeling targets and it was also difficult to draw many distinctions between the best groups. Two thirds of the targets submitted were single domains and hence regarded as easy to predict. Even those targets defined as having multiple domains always had at least one domain with a similar template structure.
Affiliation(s)
- Michael Tress
- Structural and Biological Computation Programme, Spanish National Cancer Research Centre, Madrid, Spain.
|
28
|
Fretwell JF, K. Ismail SM, Cummings JM, Selby TL. Characterization of a randomized FRET library for protease specificity determination. Mol Biosyst 2008; 4:862-70. [DOI: 10.1039/b709290c] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/21/2022]
|
29
|
Heger A, Korpelainen E, Hupponen T, Mattila K, Ollikainen V, Holm L. PairsDB atlas of protein sequence space. Nucleic Acids Res 2007; 36:D276-80. [PMID: 17986464 PMCID: PMC2238971 DOI: 10.1093/nar/gkm879] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022] Open
Abstract
Sequence similarity/database searching is a cornerstone of molecular biology. PairsDB is a database intended to make exploring protein sequences and their similarity relationships quick and easy. Behind PairsDB is a comprehensive collection of protein sequences and BLAST and PSI-BLAST alignments between them. Instead of running BLAST or PSI-BLAST individually on each request, results are retrieved instantaneously from a database of pre-computed alignments. Filtering options allow you to find a set of sequences satisfying a set of criteria, for example, all human proteins with solved structure and without transmembrane segments. PairsDB is continually updated and covers all sequences in UniProt. The data is stored in a MySQL relational database. Data files will be made available for download at ftp://nic.funet.fi/pub/sci/molbio. PairsDB can also be accessed interactively at http://pairsdb.csc.fi. PairsDB data is a valuable platform on which to build various downstream automated analysis pipelines. For example, the graph of all-against-all similarity relationships is the starting point for clustering protein families, delineating domains, improving alignment accuracy by consistency measures, and defining orthologous genes. Moreover, query-anchored stacked sequence alignments, profiles and consensus sequences are useful in studies of sequence conservation patterns for clues about possible functional sites.
Affiliation(s)
- Andreas Heger
- MRC Functional Genetics Unit, University of Oxford, UK
|
30
|
Wang M, Yafremava LS, Caetano-Anollés D, Mittenthal JE, Caetano-Anollés G. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res 2007; 17:1572-85. [PMID: 17908824 PMCID: PMC2045140 DOI: 10.1101/gr.6454307] [Citation(s) in RCA: 94] [Impact Index Per Article: 5.5] [Received: 03/01/2007] [Accepted: 08/23/2007] [Indexed: 11/25/2022]
Abstract
The repertoire of protein architectures in proteomes is evolutionarily conserved and capable of preserving an accurate record of genomic history. Here we use a census of protein architecture in 185 genomes that have been fully sequenced to generate genome-based phylogenies that describe the evolution of the protein world at fold (F) and fold superfamily (FSF) levels. The patterns of representation of F and FSF architectures over evolutionary history suggest three epochs in the evolution of the protein world: (1) architectural diversification, where members of an architecturally rich ancestral community diversified their protein repertoire; (2) superkingdom specification, where superkingdoms Archaea, Bacteria, and Eukarya were specified; and (3) organismal diversification, where F and FSF specific to relatively small sets of organisms appeared as the result of diversification of organismal lineages. Functional annotation of FSF along these architectural chronologies revealed patterns of discovery of biological function. Most importantly, the analysis identified an early and extensive differential loss of architectures occurring primarily in Archaea that segregates the archaeal lineage from the ancient community of organisms and establishes the first organismal divide. Reconstruction of phylogenomic trees of proteomes reflects the timeline of architectural diversification in the emerging lineages. Thus, Archaea undertook a minimalist strategy using only a small subset of the full architectural repertoire and then crystallized into a diversified superkingdom late in evolution. Our analysis also suggests a communal ancestor to all life that was molecularly complex and adopted genomic strategies currently present in Eukarya.
Affiliation(s)
- Minglei Wang
- Department of Crop Sciences, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
| | - Liudmila S. Yafremava
- Department of Crop Sciences, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
| | - Derek Caetano-Anollés
- Department of Crop Sciences, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
| | - Jay E. Mittenthal
- Department of Cell and Developmental Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
| | - Gustavo Caetano-Anollés
- Department of Crop Sciences, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
|
31
|
Mezei M, Zhou MM. Pspace: a program that assesses protein space. SOURCE CODE FOR BIOLOGY AND MEDICINE 2007; 2:6. [PMID: 17956630 PMCID: PMC2231351 DOI: 10.1186/1751-0473-2-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/06/2007] [Accepted: 10/23/2007] [Indexed: 11/10/2022]
Abstract
Background We describe a computer program named Pspace designed to a) obtain a reliable basis for the description of three-dimensional structures of a given protein family using homology modeling through selection of an optimal subset of the protein family whose structure would be determined experimentally; and b) aid in the search of orthologs by matching two sets of sequences in three different ways. Methods The prioritization is established dynamically as new sequences and new structures are becoming available through ranking proteins by their value in providing structural information about the rest of the family set. The matching can give a list of potential orthologs or it can deduce an overall optimal matching of two sets of sequences. Results The various covering strategies and ortholog searches are tested on the bromodomain family. Conclusion The possibility of extending this approach to the space of all proteins is discussed.
Affiliation(s)
- Mihaly Mezei
- Department of Structural and Chemical Biology, Mount Sinai School of Medicine, New York University, One Gustave L. Levy Place, New York, New York 10029, USA.
|
32
|
Aragues R, Sali A, Bonet J, Marti-Renom MA, Oliva B. Characterization of protein hubs by inferring interacting motifs from protein interactions. PLoS Comput Biol 2007; 3:1761-71. [PMID: 17941705 PMCID: PMC1976338 DOI: 10.1371/journal.pcbi.0030178] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.5] [Received: 04/19/2007] [Accepted: 07/27/2007] [Indexed: 12/19/2022] Open
Abstract
The characterization of protein interactions is essential for understanding biological systems. While genome-scale methods are available for identifying interacting proteins, they do not pinpoint the interacting motifs (e.g., a domain, sequence segments, a binding site, or a set of residues). Here, we develop and apply a method for delineating the interacting motifs of hub proteins (i.e., highly connected proteins). The method relies on the observation that proteins with common interaction partners tend to interact with these partners through a common interacting motif. The sole input for the method are binary protein interactions; neither sequence nor structure information is needed. The approach is evaluated by comparing the inferred interacting motifs with domain families defined for 368 proteins in the Structural Classification of Proteins (SCOP). The positive predictive value of the method for detecting proteins with common SCOP families is 75% at sensitivity of 10%. Most of the inferred interacting motifs were significantly associated with sequence patterns, which could be responsible for the common interactions. We find that yeast hubs with multiple interacting motifs are more likely to be essential than hubs with one or two interacting motifs, thus rationalizing the previously observed correlation between essentiality and the number of interacting partners of a protein. We also find that yeast hubs with multiple interacting motifs evolve slower than the average protein, contrary to the hubs with one or two interacting motifs. The proposed method will help us discover unknown interacting motifs and provide biological insights about protein hubs and their roles in interaction networks. Recent advances in experimental methods have produced a deluge of protein–protein interactions data. However, these methods do not supply information on which specific protein regions are physically in contact during the interactions. 
Identifying these regions (interfaces) is fundamental for scientific disciplines that require detailed characterizations of protein interactions. In this work, we present a computational method that identifies groups of proteins with similar interfaces. This is achieved by relying on the observation that proteins with common interaction partners tend to interact through similar interfaces. The proposed method retrieves protein interactions from public data repositories and groups proteins that share a sensible number of interacting partners. Proteins within the same group are then labeled with the same “interacting motif” identifier (iMotif). The evaluation performed using known protein domains and structural binding sites suggests that the method is better suited for proteins with multiple interacting partners (hubs). Using yeast data, we show that the cellular essentiality of a gene better correlates with the number of interacting motifs than with the absolute number of interactions.
Affiliation(s)
- Ramon Aragues
- Structural Bioinformatics Lab (GRIB), Universitat Pompeu Fabra-IMIM, Barcelona Research Park of Biomedicine (PRBB), Barcelona, Catalonia, Spain
| | - Andrej Sali
- Department of Biopharmaceutical Sciences, University of California San Francisco, San Francisco, California, United States of America
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biomedical Research, University of California San Francisco, San Francisco, California, United States of America
| | - Jaume Bonet
- Structural Bioinformatics Lab (GRIB), Universitat Pompeu Fabra-IMIM, Barcelona Research Park of Biomedicine (PRBB), Barcelona, Catalonia, Spain
| | - Marc A Marti-Renom
- Structural Genomics Unit, Bioinformatics Department, Centro de Investigación Príncipe Felipe, Valencia, Spain
- * To whom correspondence should be addressed. E-mail: (MAMR); (BO)
| | - Baldo Oliva
- Structural Bioinformatics Lab (GRIB), Universitat Pompeu Fabra-IMIM, Barcelona Research Park of Biomedicine (PRBB), Barcelona, Catalonia, Spain
- * To whom correspondence should be addressed. E-mail: (MAMR); (BO)
|
33
|
Religa TL, Johnson CM, Vu DM, Brewer SH, Dyer RB, Fersht AR. The helix-turn-helix motif as an ultrafast independently folding domain: the pathway of folding of Engrailed homeodomain. Proc Natl Acad Sci U S A 2007; 104:9272-7. [PMID: 17517666 PMCID: PMC1890484 DOI: 10.1073/pnas.0703434104] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.7] [Indexed: 11/18/2022] Open
Abstract
Helices 2 and 3 of Engrailed homeodomain (EnHD) form a helix-turn-helix (HTH) motif. This common motif is believed not to fold independently, which is the characteristic feature of a motif rather than a domain. But we found that the EnHD HTH motif is monomeric and folded in solution, having essentially the same structure as in full-length protein. It had a sigmoidal thermal denaturation transition. Both native backbone and local tertiary interactions were formed concurrently at 4 × 10⁵ s⁻¹ at 25 °C, monitored by IR and fluorescence T-jump kinetics, respectively, the same rate constant as for the fast phase in the folding of EnHD. The HTH motif, thus, is an ultrafast-folding, natural protein domain. Its independent stability and appropriate folding kinetics account for the stepwise folding of EnHD, satisfy fully the criteria for an on-pathway intermediate, and explain the changes in mechanism of folding across the homeodomain family. Experiments on mutated and engineered fragments of the parent protein with different probes allowed the assignment of the observed kinetic phases to specific events to show that EnHD is not an example of one-state downhill folding.
Affiliation(s)
- Tomasz L. Religa
- Medical Research Council Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, United Kingdom
| | - Christopher M. Johnson
- Medical Research Council Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, United Kingdom
| | - Dung M. Vu
- Chemistry Division, Los Alamos National Laboratory, Mail Stop J567, Los Alamos, NM 87545
| | - Scott H. Brewer
- Chemistry Division, Los Alamos National Laboratory, Mail Stop J567, Los Alamos, NM 87545
| | - R. Brian Dyer
- Chemistry Division, Los Alamos National Laboratory, Mail Stop J567, Los Alamos, NM 87545
| | - Alan R. Fersht
- Medical Research Council Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, United Kingdom
- To whom correspondence should be addressed. E-mail:
|
34
|
Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space? Malar J 2006; 5:110. [PMID: 17112376 PMCID: PMC1665468 DOI: 10.1186/1475-2875-5-110] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 09/29/2006] [Accepted: 11/17/2006] [Indexed: 11/21/2022] Open
Abstract
The organization and mining of malaria genomic and post-genomic data is important to significantly increase the knowledge of the biology of its causative agents, and is motivated, on a longer term, by the necessity to predict and characterize new biological targets and new drugs. Biological targets are sought in a biological space designed from the genomic data from Plasmodium falciparum, but using also the millions of genomic data from other species. Drug candidates are sought in a chemical space containing the millions of small molecules stored in public and private chemolibraries. Data management should, therefore, be as reliable and versatile as possible. In this context, five aspects of the organization and mining of malaria genomic and post-genomic data were examined: 1) the comparison of protein sequences including compositionally atypical malaria sequences, 2) the high throughput reconstruction of molecular phylogenies, 3) the representation of biological processes, particularly metabolic pathways, 4) the versatile methods to integrate genomic data, biological representations and functional profiling obtained from X-omic experiments after drug treatments and 5) the determination and prediction of protein structures and their molecular docking with drug candidate structures. Recent progress towards a grid-enabled chemogenomic knowledge space is discussed.
|
35
|
Portugaly E, Linial N, Linial M. EVEREST: a collection of evolutionary conserved protein domains. Nucleic Acids Res 2006; 35:D241-6. [PMID: 17099230 PMCID: PMC1669739 DOI: 10.1093/nar/gkl850] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022] Open
Abstract
Protein domains are subunits of proteins that recur throughout the protein world. There are many definitions attempting to capture the essence of a protein domain, and several systems that identify protein domains and classify them into families. EVEREST, recently described in Portugaly et al. (2006) BMC Bioinformatics, 7, 277, is one such system that performs the task automatically, using protein sequence alone. Herein we describe EVEREST release 2.0, consisting of 20 029 families, each defined by one or more HMMs. The current EVEREST database was constructed by scanning UniProt 8.1 and all PDB sequences (total over 3 000 000 sequences) with each of the EVEREST families. EVEREST annotates 64% of all sequences, and covers 59% of all residues. EVEREST is available at . The website provides annotations given by SCOP, CATH, Pfam A and EVEREST. It allows for browsing through the families of each of those sources, graphically visualizing the domain organization of the proteins in the family. The website also provides access to analyses of relationships between domain families, within and across domain definition systems. Users can upload sequences for analysis by the set of EVEREST families. Finally, an advanced search form allows querying for families matching criteria regarding novelty, phylogenetic composition and more.
Affiliation(s)
- Elon Portugaly
- School of Computer Science & Engineering, Institute of Life Sciences, The Hebrew University of Jerusalem.
|
36
|
Wang M, Caetano-Anollés G. Global phylogeny determined by the combination of protein domains in proteomes. Mol Biol Evol 2006; 23:2444-54. [PMID: 16971695 DOI: 10.1093/molbev/msl117] [Citation(s) in RCA: 72] [Impact Index Per Article: 4.0] [Indexed: 11/13/2022] Open
Abstract
The majority of proteins consist of multiple domains that are either repeated or combined in defined order. In this study, we survey the combination of protein domains defined at fold and fold superfamily levels in 185 genomes belonging to organisms that have been fully sequenced and introduce a method that reconstructs rooted phylogenomic trees from the content and arrangement of domains in proteins at a genomic level. We find that the majority of domain combinations were unique to Archaea, Bacteria, or Eukarya, suggesting most combinations originated after life had diversified. Domain repeat and domain repeat within multidomain proteins increased notably in eukaryotes, mainly at the expense of single-domain and domain-pair proteins. This increase was mostly confined to Metazoa. We also find an unbalanced sharing of domain combinations which suggests that Eukarya is more closely related to Bacteria than to Archaea, an observation that challenges the widely assumed eukaryote-archaebacterial sisterhood relationship. The occurrence and abundance of the molecular repertoire (interactome) of domain combinations was used to generate phylogenomic trees. These global interactome-based phylogenies described organismal histories satisfactorily, revealing the tripartite nature of life, and supporting controversial evolutionary patterns, such as the Coelomata hypothesis, the grouping of plants and animals, and the Gram-positive origin of bacteria. Results suggest strongly that the process of domain combination is not random but curved by evolution, rejecting the null hypothesis of domain modules combining in the absence of natural selection or an optimality criterion.
Affiliation(s)
- Minglei Wang
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA
|
37
|
Lin K, Zhu L, Zhang DY. An initial strategy for comparing proteins at the domain architecture level. Bioinformatics 2006; 22:2081-6. [PMID: 16837531 DOI: 10.1093/bioinformatics/btl366] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Ideally, only proteins that exhibit highly similar domain architectures should be compared with one another as homologues or be classified into a single family. By combining three different indices, the Jaccard index, the Goodman-Kruskal gamma function and the domain duplicate index, into a single similarity measure, we propose a method for comparing proteins based on their domain architectures. RESULTS Evaluation of the method using the eukaryotic orthologous groups of proteins (KOGs) database indicated that it allows the automatic and efficient comparison of multiple-domain proteins, which are usually refractory to classic approaches based on sequence similarity measures. As a case study, the PDZ and LRR_1 domains are used to demonstrate how proteins containing promiscuous domains can be clearly compared using our method. For the convenience of users, a web server was set up where three different query interfaces were implemented to compare different domain architectures or proteins with domain(s), and to identify the relationships among domain architectures within a given KOG from the Clusters of Orthologous Groups of Proteins database. CONCLUSION The approach we propose is suitable for estimating the similarity of domain architectures of proteins, especially those of multidomain proteins. AVAILABILITY http://cmb.bnu.edu.cn/pdart/.
Affiliation(s)
- Kui Lin
- MOE Key Laboratory for Biodiversity Science and Ecological Engineering and College of Life Sciences, Beijing Normal University, Beijing 100875, China.
|
38
|
Portugaly E, Harel A, Linial N, Linial M. EVEREST: automatic identification and classification of protein domains in all protein sequences. BMC Bioinformatics 2006; 7:277. [PMID: 16749920 PMCID: PMC1533870 DOI: 10.1186/1471-2105-7-277] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 02/01/2006] [Accepted: 06/02/2006] [Indexed: 11/16/2022] Open
Abstract
Background Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Results Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. Conclusion The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. 
The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
Affiliation(s)
- Elon Portugaly
- School of Computer Science & Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Amir Harel
- School of Computer Science & Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Nathan Linial
- School of Computer Science & Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
|
39
|
Dehal PS, Boore JL. A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database. BMC Bioinformatics 2006; 7:201. [PMID: 16608522 PMCID: PMC1523372 DOI: 10.1186/1471-2105-7-201] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.4] [Received: 04/20/2005] [Accepted: 04/11/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We present here the PhIGs database, a phylogenomic resource for sequenced genomes. Although many methods exist for clustering gene families, very few attempt to create truly orthologous clusters sharing descent from a single ancestral gene across a range of evolutionary depths. Although these non-phylogenetic gene family clusters have been used broadly for gene annotation, errors are known to be introduced by the artifactual association of slowly evolving paralogs and lack of annotation for those more rapidly evolving. A full phylogenetic framework is necessary for accurate inference of function and for many studies that address pattern and mechanism of the evolution of the genome. The automated generation of evolutionary gene clusters, creation of gene trees, determination of orthology and paralogy relationships, and the correlation of this information with gene annotations, expression information, and genomic context is an important resource to the scientific community. DISCUSSION The PhIGs database currently contains 23 completely sequenced genomes of fungi and metazoans, containing 409,653 genes that have been grouped into 42,645 gene clusters. Each gene cluster is built such that the gene sequence distances are consistent with the known organismal relationships and in so doing, maximizing the likelihood for the clusters to represent truly orthologous genes. The PhIGs website contains tools that allow the study of genes within their phylogenetic framework through keyword searches on annotations, such as GO and InterPro assignments, and sequence similarity searches by BLAST and HMM. In addition to displaying the evolutionary relationships of the genes in each cluster, the website also allows users to view the relative physical positions of homologous genes in specified sets of genomes. SUMMARY Accurate analyses of genes and genomes can only be done within their full phylogenetic context. 
The PhIGs database and corresponding website http://phigs.org address this problem for the scientific community. Our goal is to expand the content as more genomes are sequenced and use this framework to incorporate more analyses.
Affiliation(s)
- Paramvir S Dehal
- Evolutionary Genomics Department, DOE Joint Genome Institute and Lawrence Berkeley National Laboratory, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA
| | - Jeffrey L Boore
- Evolutionary Genomics Department, DOE Joint Genome Institute and Lawrence Berkeley National Laboratory, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA
- Department of Integrative Biology, 3060 Valley Life Sciences Building, University of California, Berkeley, CA 94720, USA
|
40
|
Camoglu O, Can T, Singh AK. Integrating multi-attribute similarity networks for robust representation of the protein space. Bioinformatics 2006; 22:1585-92. [PMID: 16595556 DOI: 10.1093/bioinformatics/btl130] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION A global view of the protein space is essential for functional and evolutionary analysis of proteins. In order to achieve this, a similarity network can be built using pairwise relationships among proteins. However, existing similarity networks employ a single similarity measure and therefore their utility depends highly on the quality of the selected measure. A more robust representation of the protein space can be realized if multiple sources of information are used. RESULTS We propose a novel approach for analyzing multi-attribute similarity networks by combining random walks on graphs with Bayesian theory. A multi-attribute network is created by combining sequence and structure based similarity measures. For each attribute of the similarity network, one can compute a measure of affinity from a given protein to every other protein in the network using random walks. This process makes use of the implicit clustering information of the similarity network, and we show that it is superior to naive, local ranking methods. We then combine the computed affinities using a Bayesian framework. In particular, when we train a Bayesian model for automated classification of a novel protein, we achieve high classification accuracy and outperform single attribute networks. In addition, we demonstrate the effectiveness of our technique by comparison with a competing kernel-based information integration approach.
Affiliation(s)
- Orhan Camoglu
- Department of Computer Science, University of California Santa Barbara, 93106, USA.
|
41
|
Uchiyama I. Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes. Nucleic Acids Res 2006; 34:647-58. [PMID: 16436801 PMCID: PMC1351371 DOI: 10.1093/nar/gkj448] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.4] [Indexed: 11/18/2022] Open
Abstract
Ortholog identification is a crucial first step in comparative genomics. Here, we present a rapid method of ortholog grouping which is effective enough to allow the comparison of many genomes simultaneously. The method takes as input all-against-all similarity data and classifies genes based on the traditional hierarchical clustering algorithm UPGMA. In the course of clustering, the method detects domain fusion or fission events, and splits clusters into domains if required. The subsequent procedure splits the resulting trees such that intra-species paralogous genes are divided into different groups so as to create plausible orthologous groups. As a result, the procedure can split genes into the domains minimally required for ortholog grouping. The procedure, named DomClust, was tested using the COG database as a reference. When comparing several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, we found that our method generally showed better agreement with the COG classification. By comparing the clustering results generated from datasets of different releases, we also found that our method showed relatively good stability in comparison to the BBH-based methods.
Affiliation(s)
- Ikuo Uchiyama
- National Institute for Basic Biology, National Institutes of Natural Sciences, Nishigonaka 38, Myodaiji, Okazaki, Aichi 444-8585 Japan.
|
42
|
Gewehr JE, Zimmer R. SSEP-Domain: protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics 2005; 22:181-7. [PMID: 16267083 DOI: 10.1093/bioinformatics/bti751] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.3] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The prediction of protein domains is a crucial task for functional classification, homology-based structure prediction and structural genomics. In this paper, we present the SSEP-Domain protein domain prediction approach, which is based on the application of secondary structure element alignment (SSEA) and profile-profile alignment (PPA) in combination with InterPro pattern searches. SSEA allows rapid screening for potential domain regions while PPA provides us with the necessary specificity for selecting significant hits. The combination with InterPro patterns allows finding domain regions without solved structural templates if sequence family definitions exist. RESULTS A preliminary version of SSEP-Domain was ranked among the top-performing domain prediction servers in the CASP 6 and CAFASP 4 experiments. Evaluation of the final version shows further improvement over these results together with a significant speed-up. AVAILABILITY The server is available at http://www.bio.ifi.lmu.de/SSEP/
Affiliation(s)
- Jan E Gewehr
- Practical Informatics and Bioinformatics Group, Department of Informatics, Ludwig-Maximilians-University, Amalienstrasse 17, D-80333 Munich, Germany.
|
43
|
Dekker FJ, Koch MA, Waldmann H. Protein structure similarity clustering (PSSC) and natural product structure as inspiration sources for drug development and chemical genomics. Curr Opin Chem Biol 2005; 9:232-9. [PMID: 15939324 DOI: 10.1016/j.cbpa.2005.03.003] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.9] [Received: 01/14/2005] [Accepted: 03/22/2005] [Indexed: 02/04/2023]
Abstract
Finding small molecules that modulate protein function is of primary importance in drug development and in the emerging field of chemical genomics. To facilitate the identification of such molecules, we developed a novel strategy making use of structural conservatism found in protein domain architecture and natural product inspired compound library design. Domains and proteins identified as being structurally similar in their ligand-sensing cores are grouped in a protein structure similarity cluster (PSSC). Natural products can be considered as evolutionary pre-validated ligands for multiple proteins and therefore natural products that are known to interact with one of the PSSC member proteins are selected as guiding structures for compound library synthesis. Application of this novel strategy for compound library design provided enhanced hit rates in small compound libraries for structurally similar proteins.
Affiliation(s)
- Frank J Dekker
- Department of Chemical Biology, Max-Planck Institute of Molecular Physiology, Otto-Hahn Str. 11, D-44227 Dortmund, Germany
|
44
|
Abstract
MOTIVATION Given a large family of homologous protein sequences, many methods can divide the family into smaller groups that correspond to the different functions carried out by proteins within the family. One important problem, however, has been the absence of a general method for selecting an appropriate level of granularity, or size of the groups. RESULTS We propose a consistent way of choosing the granularity that is independent of the sequence similarity and sequence clustering method used. We study three large, well-investigated protein families: basic leucine zippers, nuclear receptors and proteins with three consecutive C2H2 zinc fingers. Our method is tested against known functional information, the experimentally determined binding specificities, using a simple scoring method. The significance of the groups is also measured by randomizing the data. Finally, we compare our algorithm against a popular method of grouping proteins, the TRIBE-MCL method. In the end, we determine that dividing the families at the proposed level of granularity creates very significant and useful groups of proteins that correspond to the different DNA-binding motifs. We expect that such groupings will be useful in studying not only DNA binding but also other protein interactions.
Affiliation(s)
- Jason E Donald
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA
|
45
|
Bae K, Mallick BK, Elsik CG. Prediction of protein interdomain linker regions by a hidden Markov model. Bioinformatics 2005; 21:2264-70. [PMID: 15746283 DOI: 10.1093/bioinformatics/bti363] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Our aim was to predict protein interdomain linker regions using sequence alone, without requiring known homology. Identifying linker regions will delineate domain boundaries, and can be used to computationally dissect proteins into domains prior to clustering them into families. We developed a hidden Markov model of linker/non-linker sequence regions using a linker index derived from amino acid propensity. We employed an efficient Bayesian estimation of the model using Markov Chain Monte Carlo, Gibbs sampling in particular, to simulate parameters from the posteriors. Our model recognizes sequence data to be continuous rather than categorical, and generates a probabilistic output. RESULTS We applied our method to a dataset of protein sequences in which domains and interdomain linkers had been delineated using the Pfam-A database. The prediction results are superior to a simpler method that also uses linker index.
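A minimal sketch of the kind of two-state linker/non-linker model this abstract describes, assuming Gaussian emissions over a continuous per-residue linker index and fixed, hand-picked parameters. It decodes with the Viterbi algorithm instead of the Bayesian MCMC (Gibbs sampling) estimation the authors report, and it returns a hard labelling, not their probabilistic output. The state names, function names, and all numeric values are illustrative assumptions, not taken from the paper.

```python
import math

# Hypothetical state labels; index 0 = non-linker, index 1 = linker.
STATES = ("non_linker", "linker")


def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution, used as the emission model."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(scores, log_start, log_trans, emission_params):
    """Return the most probable state label for each position in `scores`.

    scores          -- continuous per-residue linker index values (assumed input)
    log_start[s]    -- log probability of starting in state s
    log_trans[p][s] -- log probability of moving from state p to state s
    emission_params -- (mean, sd) of the Gaussian emission for each state
    """
    n = len(scores)
    if n == 0:
        return []
    k = len(STATES)
    best = [[0.0] * k for _ in range(n)]  # best log score ending in state s at t
    back = [[0] * k for _ in range(n)]    # predecessor state on that best path

    for s in range(k):
        best[0][s] = log_start[s] + gaussian_log_pdf(scores[0], *emission_params[s])

    for t in range(1, n):
        for s in range(k):
            candidates = [best[t - 1][p] + log_trans[p][s] for p in range(k)]
            prev = max(range(k), key=lambda p: candidates[p])
            back[t][s] = prev
            best[t][s] = candidates[prev] + gaussian_log_pdf(
                scores[t], *emission_params[s]
            )

    # Trace back from the best final state.
    state = max(range(k), key=lambda s: best[n - 1][s])
    path = [state]
    for t in range(n - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]
```

Calling `viterbi` with "sticky" transition probabilities (for example 0.9 for staying in a state) favours contiguous linker segments over isolated single-residue calls, which is the main reason to prefer an HMM over thresholding the linker index position by position.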
Affiliation(s)
- Kyounghwa Bae
- Department of Statistics, Texas A&M University, College Station, TX 77843-3143, USA
|
46
|
Krause A, Stoye J, Vingron M. Large scale hierarchical clustering of protein sequences. BMC Bioinformatics 2005; 6:15. [PMID: 15663796 PMCID: PMC547898 DOI: 10.1186/1471-2105-6-15] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.5] [Received: 08/02/2004] [Accepted: 01/22/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Searching a biological sequence database with a query sequence looking for homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines it is still virtually impossible to identify quickly and clearly a group of sequences that a given query sequence belongs to. RESULTS We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences resulting in a hierarchical clustering which is being made available for querying and browsing at http://systers.molgen.mpg.de/. CONCLUSIONS Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
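As a rough illustration of a two-level (superfamily/family) grouping on a similarity graph, the sketch below takes connected components at a loose and at a strict score cutoff and nests the strict groups inside the loose ones. This is plain threshold-based single linkage with a union-find structure, not the graph-based algorithm of the paper; the function names, the edge format `(a, b, score)`, and the cutoffs are assumptions made for the example.

```python
def connected_components(nodes, edges, min_score):
    """Group nodes linked by edges whose score is at least `min_score`."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= min_score:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    # Sort for a deterministic order of the returned clusters.
    return sorted((frozenset(g) for g in groups.values()), key=lambda g: sorted(g))


def two_level_clustering(nodes, edges, superfamily_score, family_score):
    """Map each superfamily cluster to the family clusters nested inside it."""
    if family_score < superfamily_score:
        raise ValueError("family_score must be at least superfamily_score")
    superfamilies = connected_components(nodes, edges, superfamily_score)
    families = connected_components(nodes, edges, family_score)
    # Every strict-cutoff component lies inside exactly one loose-cutoff component.
    return {sf: [f for f in families if f <= sf] for sf in superfamilies}
```

Because every edge that passes the strict cutoff also passes the loose one, the family clusters are guaranteed to nest within superfamily clusters, which is what makes the result a hierarchy.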
Affiliation(s)
- Antje Krause
- Max Planck Institute for Molecular Genetics, Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany
- TFH Wildau, Bahnhofstrasse 1, 15745 Wildau, Germany
- Jens Stoye
- Universität Bielefeld, Technische Fakultät, AG Genominformatik, Postfach 100131, 33501 Bielefeld, Germany
- Martin Vingron
- Max Planck Institute for Molecular Genetics, Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany
|
47
|
Kaplan N, Friedlich M, Fromer M, Linial M. A functional hierarchical organization of the protein sequence space. BMC Bioinformatics 2004; 5:196. [PMID: 15596019 PMCID: PMC544566 DOI: 10.1186/1471-2105-5-196] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Received: 09/10/2004] [Accepted: 12/14/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually-validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the necessary manual labor. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity. RESULTS In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) get assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust. CONCLUSIONS We present an automatically generated unbiased method that provides a hierarchical classification of all currently known proteins.
Affiliation(s)
- Noam Kaplan
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
- Moriah Friedlich
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
- Menachem Fromer
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
|
48
|
Papandreou N, Berezovsky IN, Lopes A, Eliopoulos E, Chomilier J. Universal positions in globular proteins. From observation to simulation. Eur J Biochem 2004; 271:4762-8. [PMID: 15606763 DOI: 10.1111/j.1432-1033.2004.04440.x] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Indexed: 11/28/2022]
Abstract
The description of globular protein structures as an ensemble of contiguous 'closed loops' or 'tightened end fragments' reveals fold elements crucial for the formation of stable structures and for navigating the very process of protein folding. These are the ends of the loops, which are spatially close to each other but are situated apart in the polypeptide chain by 25-30 residues. They also correlate with the locations of highly conserved hydrophobic residues (referred to as topohydrophobic), in a structural alignment of the members of a protein family. This study analysed these positions in 111 representatives of different protein folds, and then carried out dynamic Monte Carlo simulations of the first steps of the folding process, aimed at predicting the origins of the assembling folds. The simulations demonstrated that there is an obvious trend for certain sets of residues, named 'mostly interacting residues', to be buried at the early stages of the folding process. Location of these residues at the loop ends and correlation with topohydrophobic positions are demonstrated, thereby giving a route to simulations of the protein folding process.
|
49
|
Kifer I, Sasson O, Linial M. Predicting fold novelty based on ProtoNet hierarchical classification. Bioinformatics 2004; 21:1020-7. [PMID: 15539447 DOI: 10.1093/bioinformatics/bti135] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Structural genomics projects aim to solve a large number of protein structures with the ultimate objective of representing the entire protein space. The computational challenge is to identify and prioritize a small set of proteins with new, currently unknown, superfamilies or folds. RESULTS We develop a method that assigns each protein a likelihood of it belonging to a new, yet undetermined, structural superfamily. The method relies on a variant of ProtoNet, an automatic hierarchical classification scheme of all protein sequences from SwissProt. Our results show that proteins that are remote from solved structures in the ProtoNet hierarchy are more likely to belong to new superfamilies. The results are validated against SCOP releases from recent years that account for about half of the solved structures known to date. We show that our new method and the representation of ProtoNet are superior in detecting new targets, compared to our previous method using ProtoMap classification. Furthermore, our method outperforms PSI-BLAST search in detecting potential new superfamilies.
Affiliation(s)
- Ilona Kifer
- Department of Biological Chemistry, Institute of Life Sciences Jerusalem 91904, Israel
|
50
|
Liu J, Hegyi H, Acton TB, Montelione GT, Rost B. Automatic target selection for structural genomics on eukaryotes. Proteins 2004; 56:188-200. [PMID: 15211504 DOI: 10.1002/prot.20012] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.0] [Indexed: 11/06/2022]
Abstract
A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; the National Institutes of Health funded nine of these in the United States. Initiatives differ in the particular subset of "all families" on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: 1) identify all protein domain families from the five eukaryotic target organisms that are currently entirely sequenced, based on their sequence homology, 2) discard those families that can be modeled on the basis of structural information already present in the PDB, and 3) target representatives of the remaining families for structure determination. To guarantee that all members of one family share a common fold-like region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, utilizing homology to PrISM, Pfam-A, and SWISS-PROT, chopped the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. Of these fragments, 122,999 appeared to be suitable targets that were grouped into >27,000 singletons and >18,000 multifragment clusters. Thus, our results suggested that it might be necessary to determine >40,000 structures to minimally cover the subset of five eukaryotic proteomes.
Affiliation(s)
- Jinfeng Liu
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|