1. Simpkin AJ, Thomas JMH, Keegan RM, Rigden DJ. MrParse: finding homologues in the PDB and the EBI AlphaFold database for molecular replacement and more. Acta Crystallogr D Struct Biol 2022; 78:553-559. [PMID: 35503204 PMCID: PMC9063843 DOI: 10.1107/s2059798322003576] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/31/2021] [Accepted: 03/29/2022] [Indexed: 11/10/2022]
Abstract
Crystallographers have an array of search-model options for structure solution by molecular replacement (MR). The well established options of homologous experimental structures and regular secondary-structure elements or motifs are increasingly supplemented by computational modelling. Such modelling may be carried out locally or may use pre-calculated predictions retrieved from databases such as the EBI AlphaFold database. MrParse is a new pipeline to help to streamline the decision process in MR by consolidating bioinformatic predictions in one place. When reflection data are provided, MrParse can rank any experimental homologues found using eLLG, which indicates the likelihood that a given search model will work in MR. Inbuilt displays of predicted secondary structure, coiled-coil and transmembrane regions further inform the choice of MR protocol. MrParse can also identify and rank homologues in the EBI AlphaFold database, a function that will also interest other structural biologists and bioinformaticians.
2. Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins. Proc Natl Acad Sci U S A 2022; 119:2113348119. [PMID: 35074909 PMCID: PMC8795500 DOI: 10.1073/pnas.2113348119] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Accepted: 12/07/2021] [Indexed: 12/12/2022] Open
Abstract
Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy as compared to our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale three-dimensional modeling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterized regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper.
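The abstract above takes a one-hot encoding of the aligned sequences as the network's starting point. As a rough illustration of what such an input could look like, here is a minimal sketch; the alphabet ordering, the 21-symbol gap channel, the treatment of unknown residues, and the function name `one_hot_msa` are all assumptions made for this example and are not taken from the paper.

```python
# Hypothetical sketch: alphabet order and gap handling are assumptions,
# not the encoding used by the cited method.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard amino acids plus a gap symbol
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_msa(msa):
    """Encode aligned sequences as nested lists shaped [n_seqs][n_cols][21]."""
    width = len(msa[0])
    if any(len(seq) != width for seq in msa):
        raise ValueError("all MSA rows must have the same length")
    encoded = []
    for seq in msa:
        rows = []
        for aa in seq.upper():
            vec = [0] * len(ALPHABET)
            # Symbols outside the alphabet (e.g. X) fall back to the gap channel.
            vec[INDEX.get(aa, INDEX["-"])] = 1
            rows.append(vec)
        encoded.append(rows)
    return encoded


msa = ["MKV-L", "MRVAL"]  # toy two-sequence alignment
enc = one_hot_msa(msa)
```

Each alignment column becomes a vector with exactly one non-zero entry, so no hand-derived features (profiles, covariance statistics) are computed before the data reach the network.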
3. Laine E, Eismann S, Elofsson A, Grudinin S. Protein sequence-to-structure learning: Is this the end(-to-end revolution)? Proteins 2021; 89:1770-1786. [PMID: 34519095 DOI: 10.1002/prot.26235] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 05/05/2021] [Revised: 08/16/2021] [Accepted: 09/03/2021] [Indexed: 01/08/2023]
Abstract
The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.
Affiliation(s)
- Elodie Laine
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
- Stephan Eismann
  - Department of Computer Science and Applied Physics, Stanford University, Stanford, California, USA
- Arne Elofsson
  - Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Solna, Sweden
- Sergei Grudinin
  - Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, Grenoble, France
4. Zheng W, Zhang C, Li Y, Pearce R, Bell EW, Zhang Y. Folding non-homologous proteins by coupling deep-learning contact maps with I-TASSER assembly simulations. Cell Rep Methods 2021; 1:100014. [PMID: 34355210 PMCID: PMC8336924 DOI: 10.1016/j.crmeth.2021.100014] [Citation(s) in RCA: 227] [Impact Index Per Article: 75.7] [Received: 01/25/2021] [Revised: 04/22/2021] [Accepted: 05/03/2021] [Indexed: 12/23/2022]
Abstract
Structure prediction for proteins lacking homologous templates in the Protein Data Bank (PDB) remains a significant unsolved problem. We developed a protocol, C-I-TASSER, to integrate interresidue contact maps from deep neural-network learning with the cutting-edge I-TASSER fragment assembly simulations. Large-scale benchmark tests showed that C-I-TASSER can fold more than twice as many non-homologous proteins as I-TASSER, which does not use contacts. When applied to a folding experiment on 8,266 unsolved Pfam families, C-I-TASSER successfully folded 4,162 domain families, including 504 folds that are not found in the PDB. Furthermore, it created correct folds for 85% of proteins in the SARS-CoV-2 genome, despite the quick mutation rate of the virus and sparse sequence profiles. The results demonstrated the critical importance of coupling whole-genome and metagenome-based evolutionary information with optimal structure assembly simulations for solving the problem of non-homologous protein structure prediction.
Affiliation(s)
- Wei Zheng
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Robin Pearce
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Eric W. Bell
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
5. Elofsson A. Toward Characterising the Cellular 3D-Proteome. Front Bioinform 2021; 1:598878. [PMID: 36353353 PMCID: PMC9638702 DOI: 10.3389/fbinf.2021.598878] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/25/2020] [Accepted: 01/27/2021] [Indexed: 11/30/2022] Open
6. Yin X, Yi K, Zhao Y, Hu Y, Li X, He T, Liu J, Cui G. Revealing the full-length transcriptome of caucasian clover rhizome development. BMC Plant Biol 2020; 20:429. [PMID: 32938399 PMCID: PMC7493993 DOI: 10.1186/s12870-020-02637-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/11/2020] [Accepted: 09/03/2020] [Indexed: 06/02/2023]
Abstract
BACKGROUND Caucasian clover (Trifolium ambiguum M. Bieb.) is a strongly rhizomatous, low-crowned perennial leguminous and ground-covering grass. The species may be used as an ornamental plant and is resistant to cold, arid temperatures and grazing due to a well-developed underground rhizome system and a strong clonal reproduction capacity. However, the posttranscriptional mechanism of the development of the rhizome system in caucasian clover has not been comprehensively studied. Additionally, a reference genome for this species has not yet been published, which limits further exploration of many important biological processes in this plant.
RESULTS We adopted PacBio sequencing and Illumina sequencing to identify differentially expressed genes (DEGs) in five tissues, including taproot (T1), horizontal rhizome (T2), swelling of taproot (T3), rhizome bud (T4) and rhizome bud tip (T5) tissues, in the caucasian clover rhizome. In total, we obtained 19.82 GB of clean data and analysed 80,654 nonredundant transcripts. Additionally, we identified 78,209 open reading frames (ORFs), 65,227 coding sequences (CDSs), 58,276 simple sequence repeats (SSRs), 6821 alternative splicing (AS) events, 2429 long noncoding RNAs (lncRNAs) and 4501 putative transcription factors (TFs) from 64 different families. Compared with other tissues, T5 exhibited more DEGs, and co-upregulated genes in T5 are mainly annotated as involved in phenylpropanoid biosynthesis. We also identified betaine aldehyde dehydrogenase (BADH) as a highly expressed gene specific to T5. A weighted gene co-expression network analysis (WGCNA) of transcription factors was combined with physiological indicators to reveal 11 hub genes (MEgreen-GA3), three of which belong to the HB-KNOX family, that are up-regulated in T3. We analysed 276 DEGs involved in hormone signalling and transduction, and the largest number of genes are associated with the auxin (IAA) signalling pathway, with significant up-regulation in T2 and T5.
CONCLUSIONS This study contributes to our understanding of gene expression across five different tissues and provides preliminary insight into rhizome growth and development in caucasian clover.
Affiliation(s)
- Xiujie Yin
  - College of Animal Science and Technology, Northeast Agricultural University, No.600 Changjiang Street, Xiangfang District, Harbin, 150030, Heilongjiang, China
- Kun Yi
  - College of Animal Science and Technology, Northeast Agricultural University, No.600 Changjiang Street, Xiangfang District, Harbin, 150030, Heilongjiang, China
- Yihang Zhao
  - College of Animal Science and Technology, Northeast Agricultural University, No.600 Changjiang Street, Xiangfang District, Harbin, 150030, Heilongjiang, China
- Yao Hu
  - College of Animal Science and Technology, Northeast Agricultural University, No.600 Changjiang Street, Xiangfang District, Harbin, 150030, Heilongjiang, China
- Xu Li
  - College of Animal Science and Technology, Northeast Agricultural University, No.600 Changjiang Street, Xiangfang District, Harbin, 150030, Heilongjiang, China
- Taotao He
  - College of Animal Science and Technology, Northeast Agricultural University, No.600 Changjiang Street, Xiangfang District, Harbin, 150030, Heilongjiang, China
- Jiaxue Liu
  - College of Animal Science and Technology, Northeast Agricultural University, No.600 Changjiang Street, Xiangfang District, Harbin, 150030, Heilongjiang, China
- Guowen Cui
  - College of Animal Science and Technology, Northeast Agricultural University, No.600 Changjiang Street, Xiangfang District, Harbin, 150030, Heilongjiang, China
7. Kumar G, Srinivasan N, Sandhya S. Artificial protein sequences enable recognition of vicinal and distant protein functional relationships. Proteins 2020; 88:1688-1700. [PMID: 32725917 DOI: 10.1002/prot.25986] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/14/2020] [Revised: 06/29/2020] [Accepted: 07/26/2020] [Indexed: 11/07/2022]
Abstract
High divergence in protein sequences makes the detection of distant protein relationships through homology-based approaches challenging. Grouping protein sequences into families, through similarities in either sequence or 3-D structure, facilitates improved recognition of protein relationships. In addition, strategically designed protein-like sequences have been shown to bridge distant structural domain families by serving as artificial linkers. In this study, we have augmented a search database of known protein domain families with such designed sequences, with the intention of providing functional clues to domain families of unknown structure. When assessed using representative query sequences from each family, we obtain a success rate of 94% in protein domain families of known structure. Further, we demonstrate that the augmented search space enabled fold recognition for 582 families with no structural information available a priori. Additionally, we were able to provide reliable functional relationships for 610 orphan families. We discuss the application of our method in predicting functional roles through select examples for DUF4922, DUF5131, and DUF5085. Our approach also detects new associations between families that were previously not known to be related, as demonstrated through new sub-groups of the RNA polymerase domain among three distinct RNA viruses. Taken together, search databases augmented with designed sequences direct the detection of meaningful relationships between distant protein families. In turn, they enable fold recognition and offer reliable pointers to potential functional sites that may be probed further through direct mutagenesis studies.
Affiliation(s)
- Gayatri Kumar
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
- Sankaran Sandhya
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
8. Sulkowska JI. On folding of entangled proteins: knots, lassos, links and θ-curves. Curr Opin Struct Biol 2020; 60:131-141. [PMID: 32062143 DOI: 10.1016/j.sbi.2020.01.007] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 12/01/2019] [Revised: 01/02/2020] [Accepted: 01/12/2020] [Indexed: 12/15/2022]
Abstract
Around 6% of protein structures deposited in the PDB are entangled, forming knots, slipknots, lassos, links, and θ-curves. In each of these cases, the protein backbone weaves through itself in a complex way, and at some point passes through a closed loop, formed by other regions of the protein structure. Such a passing can be interpreted as crossing a topological barrier. How proteins overcome such barriers, and therefore different degrees of frustration, has challenged scientists and shed new light on the field of protein folding. In this review, we summarize the current knowledge about the free energy landscape of proteins with non-trivial topology. We describe identified mechanisms which lead proteins to self-tying. We discuss the influence of excluded volume, such as crowding and chaperones, on tying, based on available data. We briefly discuss the diversity of topological complexity of proteins and their evolution. We also list available tools to investigate non-trivial topology. Finally, we formulate intriguing and challenging questions at the boundary of biophysics, bioinformatics, biology, and mathematics, which arise from the discovery of entangled proteins.
Affiliation(s)
- Joanna Ida Sulkowska
  - Centre of New Technologies, University of Warsaw, Warsaw, Poland; Faculty of Chemistry, University of Warsaw, Warsaw, Poland
9. Simpkin AJ, Thomas JMH, Simkovic F, Keegan RM, Rigden DJ. Molecular replacement using structure predictions from databases. Acta Crystallogr D Struct Biol 2019; 75:1051-1062. [PMID: 31793899 PMCID: PMC6889911 DOI: 10.1107/s2059798319013962] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 07/22/2019] [Accepted: 10/12/2019] [Indexed: 01/19/2023] Open
Abstract
Molecular replacement (MR) is the predominant route to solution of the phase problem in macromolecular crystallography. Where the lack of a suitable homologue precludes conventional MR, one option is to predict the target structure using bioinformatics. Such modelling, in the absence of homologous templates, is called ab initio or de novo modelling. Recently, the accuracy of such models has improved significantly as a result of the availability, in many cases, of residue-contact predictions derived from evolutionary covariance analysis. Covariance-assisted ab initio models representing structurally uncharacterized Pfam families are now available on a large scale in databases, potentially representing a valuable and easily accessible supplement to the PDB as a source of search models. Here, the unconventional MR pipeline AMPLE is employed to explore the value of structure predictions in the GREMLIN and PconsFam databases. It was tested whether these deposited predictions, processed in various ways, could solve the structures of PDB entries that were subsequently deposited. The results were encouraging: nine of 27 GREMLIN cases were solved, covering target lengths of 109-355 residues and a resolution range of 1.4-2.9 Å, and with target-model shared sequence identity as low as 20%. The cluster-and-truncate approach in AMPLE proved to be essential for most successes. For the overall lower quality structure predictions in the PconsFam database, remodelling with Rosetta within the AMPLE pipeline proved to be the best approach, generating ensemble search models from single-structure deposits. Finally, it is shown that the AMPLE-obtained search models deriving from GREMLIN deposits are of sufficiently high quality to be selected by the sequence-independent MR pipeline SIMBAD. Overall, the results help to point the way towards the optimal use of the expanding databases of ab initio structure predictions.
Affiliation(s)
- Adam J. Simpkin
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Jens M. H. Thomas
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Felix Simkovic
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Ronan M. Keegan
  - STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, England
- Daniel J. Rigden
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
10. Greener JG, Kandathil SM, Jones DT. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat Commun 2019; 10:3977. [PMID: 31484923 PMCID: PMC6726615 DOI: 10.1038/s41467-019-11994-0] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 04/01/2019] [Accepted: 08/14/2019] [Indexed: 01/30/2023] Open
Abstract
The inapplicability of amino acid covariation methods to small protein families has limited their use for structural annotation of whole genomes. Recently, deep learning has shown promise in allowing accurate residue-residue contact prediction even for shallow sequence alignments. Here we introduce DMPfold, which uses deep learning to predict inter-atomic distance bounds, the main chain hydrogen bond network, and torsion angles, which it uses to build models in an iterative fashion. DMPfold produces more accurate models than two popular methods for a test set of CASP12 domains, and works just as well for transmembrane proteins. Applied to all Pfam domains without known structures, confident models for 25% of these so-called dark families were produced in under a week on a small 200 core cluster. DMPfold provides models for 16% of human proteome UniProt entries without structures, generates accurate models with fewer than 100 sequences in some cases, and is freely available.
Affiliation(s)
- Joe G Greener
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
  - The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, UK
- Shaun M Kandathil
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
  - The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, UK
- David T Jones
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
  - The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, UK
11. Lenhard B, Sternberg MJE. Computation Resources for Molecular Biology: Special Issue 2019. J Mol Biol 2019; 431:2395-2397. [PMID: 31152744 DOI: 10.1016/j.jmb.2019.05.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Boris Lenhard
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London SW7 2AZ, UK; Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London, W12 0NN, UK
- Michael J E Sternberg
  - Structural Bioinformatics Group, Centre for Integrative systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London SW7 2AZ, UK