Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Loewenthal G, Rapoport D, Avram O, Moshe A, Wygoda E, Itzkovitch A, Israeli O, Azouri D, Cartwright RA, Mayrose I, Pupko T. A probabilistic model for indel evolution: differentiating insertions from deletions. Mol Biol Evol 2021;38:5769-5781. [PMID: 34469521 PMCID: PMC8662616 DOI: 10.1093/molbev/msab266] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open

For:	Loewenthal G, Rapoport D, Avram O, Moshe A, Wygoda E, Itzkovitch A, Israeli O, Azouri D, Cartwright RA, Mayrose I, Pupko T. A probabilistic model for indel evolution: differentiating insertions from deletions. Mol Biol Evol 2021;38:5769-5781. [PMID: 34469521 PMCID: PMC8662616 DOI: 10.1093/molbev/msab266] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open

Number

Cited by Other Article(s)

Redelings BD, Holmes I, Lunter G, Pupko T, Anisimova M. Insertions and Deletions: Computational Methods, Evolutionary Dynamics, and Biological Applications. Mol Biol Evol 2024;41:msae177. [PMID: 39172750 PMCID: PMC11385596 DOI: 10.1093/molbev/msae177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2024] [Revised: 07/02/2024] [Accepted: 07/09/2024] [Indexed: 08/24/2024] Open

Suvorov A, Schrider DR. Reliable estimation of tree branch lengths using deep neural networks. PLoS Comput Biol 2024;20:e1012337. [PMID: 39102450 PMCID: PMC11326709 DOI: 10.1371/journal.pcbi.1012337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2023] [Revised: 08/15/2024] [Accepted: 07/18/2024] [Indexed: 08/07/2024] Open

Bou Dagher L, Madern D, Malbos P, Brochier-Armanet C. Persistent homology reveals strong phylogenetic signal in 3D protein structures. PNAS NEXUS 2024;3:pgae158. [PMID: 38689707 PMCID: PMC11058471 DOI: 10.1093/pnasnexus/pgae158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/02/2023] [Accepted: 04/01/2024] [Indexed: 05/02/2024]

Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024;40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open

Abstract

MOTIVATION

Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families.

RESULTS

We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.

AVAILABILITY AND IMPLEMENTATION

Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.

Collapse

Wygoda E, Loewenthal G, Moshe A, Alburquerque M, Mayrose I, Pupko T. Statistical framework to determine indel-length distribution. Bioinformatics 2024;40:btae043. [PMID: 38269647 PMCID: PMC10868340 DOI: 10.1093/bioinformatics/btae043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Revised: 01/10/2024] [Accepted: 01/22/2024] [Indexed: 01/26/2024] Open

Trost J, Haag J, Höhler D, Jacob L, Stamatakis A, Boussau B. Simulations of Sequence Evolution: How (Un)realistic They Are and Why. Mol Biol Evol 2024;41:msad277. [PMID: 38124381 PMCID: PMC10768886 DOI: 10.1093/molbev/msad277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Revised: 11/17/2023] [Accepted: 12/08/2023] [Indexed: 12/23/2023] Open

Liu Z, Zhao Y, Zhang Y, Xu L, Zhou L, Yang W, Zhao H, Zhao J, Wang F. Development of Omni InDel and supporting database for maize. FRONTIERS IN PLANT SCIENCE 2023;14:1216505. [PMID: 37457340 PMCID: PMC10344896 DOI: 10.3389/fpls.2023.1216505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Accepted: 06/12/2023] [Indexed: 07/18/2023]

Affiliation(s)

Zhihao Liu Key Laboratory of Crop DNA Fingerprinting Innovation and Utilization (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs, Beijing Academy of Agricultural and Forest Sciences (BAAFS), Beijing, China College of Agriculture, Jilin Agricultural University, Changchun, China
Yikun Zhao Key Laboratory of Crop DNA Fingerprinting Innovation and Utilization (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs, Beijing Academy of Agricultural and Forest Sciences (BAAFS), Beijing, China
Yunlong Zhang Key Laboratory of Crop DNA Fingerprinting Innovation and Utilization (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs, Beijing Academy of Agricultural and Forest Sciences (BAAFS), Beijing, China
Liwen Xu Key Laboratory of Crop DNA Fingerprinting Innovation and Utilization (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs, Beijing Academy of Agricultural and Forest Sciences (BAAFS), Beijing, China
Ling Zhou Provincial Key Laboratory of Agrobiology, Institute of Crop Germplasm and Biotechnology, Jiangsu Academy of Agricultural Sciences, Nanjing, Jiangsu, China
Weiguang Yang College of Agriculture, Jilin Agricultural University, Changchun, China
Han Zhao Provincial Key Laboratory of Agrobiology, Institute of Crop Germplasm and Biotechnology, Jiangsu Academy of Agricultural Sciences, Nanjing, Jiangsu, China
Jiuran Zhao Key Laboratory of Crop DNA Fingerprinting Innovation and Utilization (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs, Beijing Academy of Agricultural and Forest Sciences (BAAFS), Beijing, China
Fengge Wang Key Laboratory of Crop DNA Fingerprinting Innovation and Utilization (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs, Beijing Academy of Agricultural and Forest Sciences (BAAFS), Beijing, China

Collapse

Kusakin AV, Goleva OV, Danilov LG, Krylov AV, Tsay VV, Kalinin RS, Tian NS, Eismont YA, Mukomolova AL, Chukhlovin AB, Komissarov AS, Glotov OS. The Telomeric Repeats of HHV-6A Do Not Determine the Chromosome into Which the Virus Is Integrated. Genes (Basel) 2023;14:521. [PMID: 36833448 PMCID: PMC9957103 DOI: 10.3390/genes14020521] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2022] [Revised: 02/10/2023] [Accepted: 02/13/2023] [Indexed: 02/22/2023] Open

Analysis of Copy Number Variation in the Whole Genome of Normal-Haired and Long-Haired Tianzhu White Yaks. Genes (Basel) 2022;13:genes13122405. [PMID: 36553672 PMCID: PMC9777850 DOI: 10.3390/genes13122405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2022] [Revised: 12/06/2022] [Accepted: 12/16/2022] [Indexed: 12/23/2022] Open

Loewenthal G, Wygoda E, Nagar N, Glick L, Mayrose I, Pupko T. The evolutionary dynamics that retain long neutral genomic sequences in face of indel deletion bias: a model and its application to human introns. Open Biol 2022;12:220223. [PMID: 36514983 PMCID: PMC9748784 DOI: 10.1098/rsob.220223] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open

Wittmund M, Cadet F, Davari MD. Learning Epistasis and Residue Coevolution Patterns: Current Trends and Future Perspectives for Advancing Enzyme Engineering. ACS Catal 2022. [DOI: 10.1021/acscatal.2c01426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]

Moshe A, Wygoda E, Ecker N, Loewenthal G, Avram O, Israeli O, Hazkani-Covo E, Pe’er I, Pupko T. An Approximate Bayesian Computation Approach for Modeling Genome Rearrangements. Mol Biol Evol 2022;39:msac231. [PMID: 36282896 PMCID: PMC9692237 DOI: 10.1093/molbev/msac231] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/30/2024] Open

Lichman BR. Ancestral Sequence Reconstruction for Exploring Alkaloid Evolution. Methods Mol Biol 2022;2505:165-179. [PMID: 35732944 DOI: 10.1007/978-1-0716-2349-7_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]