Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, Bhowmik D, Rost B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell 2022. [PMID: 34232869 DOI: 10.1101/2020.07.12.199554] [Citation(s) in RCA: 71] [Impact Index Per Article: 35.5] [Indexed: 05/15/2023]

Cited by Other Article(s)

Kalakoti Y, Yadav S, Sundar D. TransDTI: Transformer-Based Language Models for Estimating DTIs and Building a Drug Recommendation Workflow. ACS Omega 2022;7:2706-2717. [PMID: 35097268 PMCID: PMC8792915 DOI: 10.1021/acsomega.1c05203] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 09/19/2021] [Accepted: 12/28/2021] [Indexed: 06/09/2023]

Kandathil SM, Greener JG, Lau AM, Jones DT. Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins. Proc Natl Acad Sci U S A 2022;119:e2113348119. [PMID: 35074909 PMCID: PMC8795500 DOI: 10.1073/pnas.2113348119] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 07/20/2021] [Accepted: 12/07/2021] [Indexed: 12/12/2022]

Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol 2022;40:1114-1122. [PMID: 35039677 DOI: 10.1038/s41587-021-01146-5] [Citation(s) in RCA: 65] [Impact Index Per Article: 32.5] [Received: 04/09/2021] [Accepted: 11/02/2021] [Indexed: 01/27/2023]

Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022;23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 564] [Impact Index Per Article: 282.0] [Accepted: 07/23/2021] [Indexed: 02/08/2023]

Li Y, Zhang C, Zheng W, Zhou X, Bell EW, Yu DJ, Zhang Y. Protein inter-residue contact and distance prediction by coupling complementary coevolution features with deep residual networks in CASP14. Proteins 2021;89:1911-1921. [PMID: 34382712 PMCID: PMC8616805 DOI: 10.1002/prot.26211] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 04/21/2021] [Revised: 07/24/2021] [Accepted: 08/05/2021] [Indexed: 01/12/2023]

McGee F, Hauri S, Novinger Q, Vucetic S, Levy RM, Carnevale V, Haldane A. The generative capacity of probabilistic protein sequence models. Nat Commun 2021;12:6302. [PMID: 34728624 PMCID: PMC8563988 DOI: 10.1038/s41467-021-26529-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 01/11/2021] [Accepted: 09/23/2021] [Indexed: 01/10/2023]

Jiang Y, Wang D, Wang W, Xu D. Computational methods for protein localization prediction. Comput Struct Biotechnol J 2021;19:5834-5844. [PMID: 34765098 PMCID: PMC8564054 DOI: 10.1016/j.csbj.2021.10.023] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/31/2021] [Revised: 10/12/2021] [Accepted: 10/13/2021] [Indexed: 12/16/2022]

Zheng J, Xiao X, Qiu WR. iCDI-W2vCom: Identifying the Ion Channel-Drug Interaction in Cellular Networking Based on word2vec and node2vec. Front Genet 2021;12:738274. [PMID: 34567088 PMCID: PMC8458815 DOI: 10.3389/fgene.2021.738274] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/08/2021] [Accepted: 08/02/2021] [Indexed: 12/04/2022]

Bernhofer M, Dallago C, Karl T, Satagopam V, Heinzinger M, Littmann M, Olenyi T, Qiu J, Schütze K, Yachdav G, Ashkenazy H, Ben-Tal N, Bromberg Y, Goldberg T, Kajan L, O’Donoghue S, Sander C, Schafferhans A, Schlessinger A, Vriend G, Mirdita M, Gawron P, Gu W, Jarosz Y, Trefois C, Steinegger M, Schneider R, Rost B. PredictProtein - Predicting Protein Structure and Function for 29 Years. Nucleic Acids Res 2021;49:W535-W540. [PMID: 33999203 PMCID: PMC8265159 DOI: 10.1093/nar/gkab354] [Citation(s) in RCA: 129] [Impact Index Per Article: 43.0] [Received: 02/23/2021] [Revised: 04/06/2021] [Accepted: 05/10/2021] [Indexed: 12/12/2022]

Affiliation(s)

Michael Bernhofer TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany TUM Graduate School CeDoSIA, Boltzmannstr 11, 85748 Garching, Germany
Christian Dallago TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany TUM Graduate School CeDoSIA, Boltzmannstr 11, 85748 Garching, Germany
Tim Karl TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
Venkata Satagopam Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
Michael Heinzinger TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany TUM Graduate School CeDoSIA, Boltzmannstr 11, 85748 Garching, Germany
Maria Littmann TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany TUM Graduate School CeDoSIA, Boltzmannstr 11, 85748 Garching, Germany
Tobias Olenyi TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
Jiajun Qiu TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany Department of Otolaryngology Head & Neck Surgery, The Ninth People's Hospital & Ear Institute, School of Medicine & Shanghai Key Laboratory of Translational Medicine on Ear and Nose Diseases, Shanghai Jiao Tong University, Shanghai, China
Konstantin Schütze TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
Guy Yachdav TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
Haim Ashkenazy Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, 69978 Tel Aviv, Israel
Nir Ben-Tal Department of Biochemistry & Molecular Biology, George S. Wise Faculty of Life Sciences, Tel Aviv University, 69978 Tel Aviv, Israel
Yana Bromberg Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ 08901, USA
Tatyana Goldberg TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany
Laszlo Kajan Roche Polska Sp. z o.o., Domaniewska 39B, 02–672 Warsaw, Poland
Sean O’Donoghue Garvan Institute of Medical Research, Sydney, Australia
Chris Sander Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA Department of Cell Biology, Harvard Medical School, Boston, MA 02215, USA Broad Institute of MIT and Harvard, Boston, MA 02142, USA
Andrea Schafferhans TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany HSWT (Hochschule Weihenstephan Triesdorf | University of Applied Sciences), Department of Bioengineering Sciences, Am Hofgarten 10, 85354 Freising, Germany
Avner Schlessinger Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
Gerrit Vriend BIPS, Poblacion Baco, Mindoro, Philippines
Milot Mirdita Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
Piotr Gawron Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
Wei Gu Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
Yohan Jarosz Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
Christophe Trefois Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
Martin Steinegger School of Biological Sciences, Seoul National University, Seoul, South Korea Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
Reinhard Schneider Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg ELIXIR Luxembourg (ELIXIR-LU) Node, University of Luxembourg, Campus Belval, House of Biomedicine II, 6 avenue du Swing, L-4367 Belvaux, Luxembourg
Burkhard Rost TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr 3, 85748 Garching/Munich, Germany Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany

Yamaguchi H, Saito Y. Evotuning protocols for Transformer-based variant effect prediction on multi-domain proteins. Brief Bioinform 2021;22:6309928. [PMID: 34180966 DOI: 10.1093/bib/bbab234] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/15/2021] [Revised: 05/28/2021] [Accepted: 05/30/2021] [Indexed: 12/14/2022]

Dallago C, Schütze K, Heinzinger M, Olenyi T, Littmann M, Lu AX, Yang KK, Min S, Yoon S, Morton JT, Rost B. Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets. Curr Protoc 2021;1:e113. [PMID: 33961736 DOI: 10.1002/cpz1.113] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Indexed: 02/03/2023]

Abstract

Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time with respect to previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC. 
The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline.
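
The idea behind Basic Protocol 4 (using embeddings as input features for a classifier) can be illustrated with a minimal sketch. It does not use the bio_embeddings API and is not the protocol's own code: the vectors are random synthetic stand-ins for per-protein embeddings, the two class labels are hypothetical, and a nearest-centroid rule is swapped in as about the simplest possible classifier.

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Return the sorted class labels and one mean embedding per class."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(embeddings, classes, centroids):
    """Assign each embedding the label of its nearest centroid (Euclidean)."""
    # distances has shape (n_samples, n_classes)
    distances = np.linalg.norm(
        embeddings[:, None, :] - centroids[None, :, :], axis=2
    )
    return classes[np.argmin(distances, axis=1)]

# Synthetic stand-ins for per-protein embeddings (hypothetical data, 8 dimensions).
rng = np.random.default_rng(0)
train_x = np.vstack([
    rng.normal(0.0, 0.1, (20, 8)),  # cluster for the first hypothetical class
    rng.normal(1.0, 0.1, (20, 8)),  # cluster for the second hypothetical class
])
train_y = np.array(["cytoplasm"] * 20 + ["membrane"] * 20)

classes, centroids = fit_centroids(train_x, train_y)

# Two query vectors, one placed at the center of each synthetic cluster.
test_x = np.vstack([np.zeros((1, 8)), np.ones((1, 8))])
predicted = predict(test_x, classes, centroids)
print(predicted)
```

With real embeddings, the synthetic arrays would be replaced by vectors produced by a protein LM, and a held-out test set would be needed to judge predictive ability.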

Affiliation(s)

Christian Dallago TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany
Konstantin Schütze TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany
Michael Heinzinger TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany
Tobias Olenyi TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany
Maria Littmann TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany
Amy X Lu Department of Computer Science, University of Toronto, Toronto, Canada & Vector Institute
Kevin K Yang Microsoft Research New England, Cambridge, Massachusetts
Seonwoo Min Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea
Sungroh Yoon Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
James T Morton Center for Computational Biology, Flatiron Institute, New York, New York
Burkhard Rost TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany; Columbia University, Department of Biochemistry and Molecular Biophysics, New York, New York; New York Consortium on Membrane Protein Structure (NYCOMPS), New York, New York

Song B, Li Z, Lin X, Wang J, Wang T, Fu X. Pretraining model for biological sequence data. Brief Funct Genomics 2021;20:181-195. [PMID: 34050350 PMCID: PMC8194843 DOI: 10.1093/bfgp/elab025] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 03/23/2021] [Revised: 04/13/2021] [Accepted: 04/21/2021] [Indexed: 12/26/2022]

Murvai N, Kalmar L, Szabo B, Schad E, Micsonai A, Kardos J, Buday L, Han KH, Tompa P, Tantos A. Cellular Chaperone Function of Intrinsically Disordered Dehydrin ERD14. Int J Mol Sci 2021;22:6190. [PMID: 34201246 PMCID: PMC8230022 DOI: 10.3390/ijms22126190] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/25/2021] [Revised: 06/03/2021] [Accepted: 06/04/2021] [Indexed: 12/04/2022]

Affiliation(s)

Nikoletta Murvai Research Centre for Natural Sciences, Institute of Enzymology, 1117 Budapest, Hungary; Department of Biochemistry, Institute of Biology, ELTE Eötvös Loránd University, 1117 Budapest, Hungary
Lajos Kalmar Research Centre for Natural Sciences, Institute of Enzymology, 1117 Budapest, Hungary; Department of Veterinary Medicine, University of Cambridge, Cambridge CB3 0ES, UK
Beata Szabo Research Centre for Natural Sciences, Institute of Enzymology, 1117 Budapest, Hungary
Eva Schad Research Centre for Natural Sciences, Institute of Enzymology, 1117 Budapest, Hungary
András Micsonai ELTE NAP Neuroimmunology Research Group, Department of Biochemistry, Institute of Biology, Eötvös Loránd University, 1117 Budapest, Hungary
József Kardos ELTE NAP Neuroimmunology Research Group, Department of Biochemistry, Institute of Biology, Eötvös Loránd University, 1117 Budapest, Hungary
László Buday Research Centre for Natural Sciences, Institute of Enzymology, 1117 Budapest, Hungary
Kyou-Hoon Han Biomedical Translational Research Center, Division of Convergent Biomedical Research, Korea Research Institute of Bioscience and Biotechnology, Daejeon 34141, Korea; Gene Editing Research Center, Division of Convergent Biomedical Research, Korea Research Institute of Bioscience and Biotechnology, Daejeon 34141, Korea
Peter Tompa Research Centre for Natural Sciences, Institute of Enzymology, 1117 Budapest, Hungary; VIB-VUB Center for Structural Biology (CSB), Vlaams Instituut voor Biotechnologie (VIB), 1050 Brussels, Belgium; Structural Biology Brussels (SBB), Vrije Universiteit Brussel (VUB), 1050 Brussels, Belgium
Agnes Tantos Research Centre for Natural Sciences, Institute of Enzymology, 1117 Budapest, Hungary

Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J 2021;19:3198-3208. [PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 02/26/2021] [Revised: 05/10/2021] [Accepted: 05/20/2021] [Indexed: 12/16/2022]

Affiliation(s)

Hitoshi Iuchi Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
Taro Matsutani Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Keisuke Yamada School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Natsuki Iwano Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Shunsuke Sumi Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
Shion Hosoda Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Shitao Zhao Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Tsukasa Fukunaga Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
Michiaki Hamada Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan

Min S, Kim H, Lee B, Yoon S. Protein transfer learning improves identification of heat shock protein families. PLoS One 2021;16:e0251865. [PMID: 34003870 PMCID: PMC8130922 DOI: 10.1371/journal.pone.0251865] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/26/2021] [Accepted: 05/04/2021] [Indexed: 12/16/2022]

Cai T, Lim H, Abbu KA, Qiu Y, Nussinov R, Xie L. MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization. J Chem Inf Model 2021;61:1570-1582. [PMID: 33757283 PMCID: PMC8154251 DOI: 10.1021/acs.jcim.0c01285] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/04/2020] [Indexed: 01/14/2023]

Abstract

Small molecules play a critical role in modulating biological systems. Knowledge of chemical-protein interactions helps address fundamental and practical questions in biology and medicine. However, with the rapid emergence of newly sequenced genes, the endogenous or surrogate ligands of a vast number of proteins remain unknown. Homology modeling and machine learning are two major methods for assigning new ligands to a protein but mostly fail when sequence homology between an unannotated protein and those with known functions or structures is low. In this study, we develop a new deep learning framework to predict chemical binding to evolutionary divergent unannotated proteins, whose ligand cannot be reliably predicted by existing methods. By incorporating evolutionary information into self-supervised learning of unlabeled protein sequences, we develop a novel method, distilled sequence alignment embedding (DISAE), for the protein sequence representation. DISAE can utilize all protein sequences and their multiple sequence alignment (MSA) to capture functional relationships between proteins without the knowledge of their structure and function. Followed by the DISAE pretraining, we devise a module-based fine-tuning strategy for the supervised learning of chemical-protein interactions. In the benchmark studies, DISAE significantly improves the generalizability of machine learning models and outperforms the state-of-the-art methods by a large margin. Comprehensive ablation studies suggest that the use of MSA, sequence distillation, and triplet pretraining critically contributes to the success of DISAE. The interpretability analysis of DISAE suggests that it learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-protein coupled receptors (GPCRs) and to cluster the human GPCRome by integrating their phylogenetic and ligand relationships. 
The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.

Ofer D, Brandes N, Linial M. The language of proteins: NLP, machine learning & protein sequences. Comput Struct Biotechnol J 2021;19:1750-1758. [PMID: 33897979 PMCID: PMC8050421 DOI: 10.1016/j.csbj.2021.03.022] [Citation(s) in RCA: 111] [Impact Index Per Article: 37.0] [Received: 01/28/2021] [Revised: 03/19/2021] [Accepted: 03/19/2021] [Indexed: 12/12/2022]

Hiranuma N, Park H, Baek M, Anishchenko I, Dauparas J, Baker D. Improved protein structure refinement guided by deep learning based accuracy estimation. Nat Commun 2021;12:1340. [PMID: 33637700 PMCID: PMC7910447 DOI: 10.1038/s41467-021-21511-x] [Citation(s) in RCA: 117] [Impact Index Per Article: 39.0] [Received: 08/11/2020] [Accepted: 01/18/2021] [Indexed: 11/22/2022]

Hawkins-Hooker A, Depardieu F, Baur S, Couairon G, Chen A, Bikard D. Generating functional protein variants with variational autoencoders. PLoS Comput Biol 2021;17:e1008736. [PMID: 33635868 PMCID: PMC7946179 DOI: 10.1371/journal.pcbi.1008736] [Citation(s) in RCA: 73] [Impact Index Per Article: 24.3] [Received: 08/19/2020] [Revised: 03/10/2021] [Accepted: 01/25/2021] [Indexed: 11/20/2022]

Abstract

The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.

Littmann M, Heinzinger M, Dallago C, Olenyi T, Rost B. Embeddings from deep learning transfer GO annotations beyond homology. Sci Rep 2021;11:1160. [PMID: 33441905 PMCID: PMC7806674 DOI: 10.1038/s41598-020-80786-0] [Citation(s) in RCA: 58] [Impact Index Per Article: 19.3] [Received: 09/07/2020] [Accepted: 12/24/2020] [Indexed: 11/09/2022]

Susanty M, Rajab TE, Hertadi R. A Review of Protein Structure Prediction using Deep Learning. BIO Web of Conferences 2021. [DOI: 10.1051/bioconf/20214104003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]