1. Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024; 592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]
Abstract
Heaps' law (or Herdan-Heaps' law) is a linguistic law stating that vocabulary (dictionary) size, the number of types, is a power-law function of word count, the number of tokens. Whether it holds in genomes under a given definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that of a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from other definitions, such as over-represented k-mers, which are much shorter. Although an approximate power-law distribution of protein domain sizes due to gene duplication, and the related Zipf's law, are well known, their translation to Heaps' law for DNA words is not automatic. Several other animal genomes are also shown herein to exhibit a range-limited Heaps' law with our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reached the maximum level, a deviation from Heaps' law was observed, but a quadratic regression in the log-log type-token plot fits the data perfectly. Investigation of the type-token plot and its regression coefficients could provide an alternative narrative of the reuse and redundancy of protein domains, as well as the creation of new protein domains, from a linguistic perspective.
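The type-token analysis described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `type_token_curve` and `polyfit_loglog` are hypothetical, the "words" are assumed to be any sequence of labels (for example, protein-domain identifiers), and the fit is an ordinary least-squares regression of log(types) on log(tokens). A degree-1 fit gives the Heaps exponent as its slope; a degree-2 fit corresponds to the quadratic regression in the log-log plot.

```python
import math

def type_token_curve(words):
    """Return (token count, type count) pairs along a sequence of words."""
    seen = set()
    curve = []
    for n_tokens, word in enumerate(words, start=1):
        seen.add(word)
        curve.append((n_tokens, len(seen)))
    return curve

def polyfit_loglog(curve, degree):
    """Least-squares polynomial fit of log(types) on log(tokens).

    Returns [c0, c1, ...] for log V = c0 + c1*log N + c2*(log N)**2 + ...
    With degree=1, c1 is the Heaps exponent and exp(c0) the prefactor.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    m = degree + 1
    # Normal equations: a[i][j] = sum x^(i+j), b[i] = sum y * x^i
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs

# Synthetic illustration (not data from the paper): V = 2 * N**0.6
synthetic = [(n, 2.0 * n ** 0.6) for n in (10, 100, 1000, 10000)]
log_k, beta = polyfit_loglog(synthetic, 1)    # linear fit: Heaps' law
a0, a1, a2 = polyfit_loglog(synthetic, 2)     # quadratic fit in log-log
```

On an exact power law the quadratic coefficient `a2` is essentially zero; a clearly nonzero `a2` on real data would indicate the kind of curvature away from Heaps' law that the abstract reports at maximal sample sizes.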
Affiliation(s)
- Wentian Li
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Yannis Almirantis
  - Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
- Astero Provata
  - Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
2. Tanoz I, Timsit Y. Protein Fold Usages in Ribosomes: Another Glance to the Past. Int J Mol Sci 2024; 25:8806. [PMID: 39201491 PMCID: PMC11354259 DOI: 10.3390/ijms25168806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Revised: 08/07/2024] [Accepted: 08/08/2024] [Indexed: 09/02/2024] Open
Abstract
The analysis of protein fold usage, similar to codon usage, offers profound insights into the evolution of biological systems and the origins of modern proteomes. While previous studies have examined fold distribution in modern genomes, our study focuses on the comparative distribution and usage of protein folds in ribosomes across bacteria, archaea, and eukaryotes. We identify the prevalence of certain 'super-ribosome folds,' such as the OB fold in bacteria and the SH3 domain in archaea and eukaryotes. The observed protein fold distribution in the ribosomes announces the future power-law distribution where only a few folds are highly prevalent, and most are rare. Additionally, we highlight the presence of three copies of proto-Rossmann folds in ribosomes across all kingdoms, showing its ancient and fundamental role in ribosomal structure and function. Our study also explores early mechanisms of molecular convergence, where different protein folds bind equivalent ribosomal RNA structures in ribosomes across different kingdoms. This comparative analysis enhances our understanding of ribosomal evolution, particularly the distinct evolutionary paths of the large and small subunits, and underscores the complex interplay between RNA and protein components in the transition from the RNA world to modern cellular life. Transcending the concept of folds also makes it possible to group a large number of ribosomal proteins into five categories of urfolds or metafolds, which could attest to their ancestral character and common origins. This work also demonstrates that the gradual acquisition of extensions by simple but ordered folds constitutes an inexorable evolutionary mechanism. This observation supports the idea that simple but structured ribosomal proteins preceded the development of their disordered extensions.
Affiliation(s)
- Inzhu Tanoz
  - Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, 13288 Marseille, France
- Youri Timsit
  - Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, 13288 Marseille, France
  - Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara GOSEE, 3 Rue Michel-Ange, 75016 Paris, France
3. Gómez-Márquez C, Morales JA, Romero-Gutiérrez T, Paredes O, Borrayo E. Decoding semiotic minimal genome: a non-genocentric approach. Front Microbiol 2024; 15:1356050. [PMID: 38476952 PMCID: PMC10929006 DOI: 10.3389/fmicb.2024.1356050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Accepted: 02/02/2024] [Indexed: 03/14/2024] Open
Abstract
The search for the minimum information required for an organism to sustain a cellular system network has rendered both the identification of a fixed number of known genes and those genes whose function remains to be identified. The approaches used in such search generally focus their analysis on coding genomic regions, based on the genome to proteic-product perspective. Such approaches leave other fundamental processes aside, mainly those that include higher-level information management. To cope with this limitation, a non-genocentric approach based on genomic sequence analysis using language processing tools and gene ontology may prove an effective strategy for the identification of those fundamental genomic elements for life autonomy. Additionally, this approach will provide us with an integrative analysis of the information value present in all genomic elements, regardless of their coding status.
Affiliation(s)
- Carolina Gómez-Márquez
  - Biodigital Innovation Lab, Translational Bioengineering Department, Exact Sciences and Engineering University Center, Universidad de Guadalajara, Guadalajara, Mexico
- J. Alejandro Morales
  - Biodigital Innovation Lab, Translational Bioengineering Department, Exact Sciences and Engineering University Center, Universidad de Guadalajara, Guadalajara, Mexico
- Teresa Romero-Gutiérrez
  - Biodigital Innovation Lab, Translational Bioengineering Department, Exact Sciences and Engineering University Center, Universidad de Guadalajara, Guadalajara, Mexico
  - Technological Innovation Department, Tlajomulco University Center, Universidad de Guadalajara, Guadalajara, Mexico
- Omar Paredes
  - Biodigital Innovation Lab, Translational Bioengineering Department, Exact Sciences and Engineering University Center, Universidad de Guadalajara, Guadalajara, Mexico
- Ernesto Borrayo
  - Biodigital Innovation Lab, Translational Bioengineering Department, Exact Sciences and Engineering University Center, Universidad de Guadalajara, Guadalajara, Mexico
4. Caetano-Anollés G, Claverie JM, Nasir A. A critical analysis of the current state of virus taxonomy. Front Microbiol 2023; 14:1240993. [PMID: 37601376 PMCID: PMC10435761 DOI: 10.3389/fmicb.2023.1240993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 07/20/2023] [Indexed: 08/22/2023] Open
Abstract
Taxonomical classification has preceded evolutionary understanding. For that reason, taxonomy has become a battleground fueled by knowledge gaps, technical limitations, and a priorism. Here we assess the current state of the challenging field, focusing on fallacies that are common in viral classification. We emphasize that viruses are crucial contributors to the genomic and functional makeup of holobionts, organismal communities that behave as units of biological organization. Consequently, viruses cannot be considered taxonomic units because they challenge crucial concepts of organismality and individuality. Instead, they should be considered processes that integrate virions and their hosts into life cycles. Viruses harbor phylogenetic signatures of genetic transfer that compromise monophyly and the validity of deep taxonomic ranks. A focus on building phylogenetic networks using alignment-free methodologies and molecular structure can help mitigate the impasse, at least in part. Finally, structural phylogenomic analysis challenges the polyphyletic scenario of multiple viral origins adopted by virus taxonomy, defeating a polyphyletic origin and supporting instead an ancient cellular origin of viruses. We therefore urge abandoning deep ranks and urgently reevaluating the validity of taxonomic units and principles of virus classification.
Affiliation(s)
- Gustavo Caetano-Anollés
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences and C.R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Jean-Michel Claverie
  - Structural and Genomic Information Laboratory (UMR7256), Mediterranean Institute of Microbiology (FR3479), IM2B, IOM, Aix Marseille University, CNRS, Marseille, France
5. Caetano-Anollés G. Agency in evolution of biomolecular communication. Ann N Y Acad Sci 2023; 1525:88-103. [PMID: 37219369 DOI: 10.1111/nyas.15005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Biomolecular communication demands that interactions between parts of a molecular system act as scaffolds for message transmission. It also requires an organized system of signs (a communicative agency) for creating and transmitting meaning. The emergence of agency, the capacity to act in a given context and generate end-directed behaviors, has baffled evolutionary biologists for centuries. Here, I explore its emergence with knowledge grounded in over two decades of evolutionary genomic and bioinformatic exploration. Biphasic processes of growth and diversification exist that generate hierarchy and modularity in biological systems at widely ranging time scales. Similarly, a biphasic process exists in communication that constructs a message before it can be transmitted for interpretation. Transmission dissipates matter-energy and information and involves computation. Agency emerges when molecular machinery generates hierarchical layers of vocabularies in an entangled communication network clustered around the universal Turing machine of the ribosome. Computations canalize biological systems to perform biological functions in a dissipative quest to structure long-lived occurrents. This occurs within the confines of a "triangle of persistence" that maximizes invariance with trade-offs between economy, flexibility, and robustness. Thus, learning from previous historical and circumstantial experiences unifies modules in a hierarchy that expands the agency of systems.
Affiliation(s)
- Gustavo Caetano-Anollés
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences and C. R. Woese Institute for Genomic Biology, University of Illinois, Urbana, Illinois, USA
6. Vincent D, Bui A, Ezernieks V, Shahinfar S, Luke T, Ram D, Rigas N, Panozzo J, Rochfort S, Daetwyler H, Hayden M. A community resource to mass explore the wheat grain proteome and its application to the late-maturity alpha-amylase (LMA) problem. Gigascience 2022; 12:giad084. [PMID: 37919977 PMCID: PMC10627334 DOI: 10.1093/gigascience/giad084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Revised: 08/02/2023] [Accepted: 09/19/2023] [Indexed: 11/04/2023] Open
Abstract
BACKGROUND: Late-maturity alpha-amylase (LMA) is a wheat genetic defect causing the synthesis of high isoelectric point alpha-amylase following a temperature shock during mid-grain development or prolonged cold throughout grain development, both leading to starch degradation. While the physiology is well understood, the biochemical mechanisms involved in grain LMA response remain unclear. We have applied high-throughput proteomics to 4,061 wheat flours displaying a range of LMA activities. Using an array of statistical analyses to select LMA-responsive biomarkers, we have mined them using a suite of tools applicable to wheat proteins. RESULTS: We observed that LMA-affected grains activated their primary metabolisms, such as glycolysis and gluconeogenesis; the TCA cycle, along with DNA- and RNA-binding mechanisms; and protein translation. This logically transitioned to protein folding activities driven by chaperones and protein disulfide isomerase, as well as protein assembly via dimerisation and complexing. The secondary metabolism was also mobilized with the upregulation of phytohormones and chemical and defence responses. LMA further invoked cellular structures, including ribosomes, microtubules, and chromatin. Finally, and unsurprisingly, LMA expression greatly impacted grain storage proteins, as well as starch and other carbohydrates, with the upregulation of alpha-gliadins and starch metabolism, whereas LMW glutenin, stachyose, sucrose, UDP-galactose, and UDP-glucose were downregulated. CONCLUSIONS: To our knowledge, this is not only the first proteomics study tackling the wheat LMA issue but also the largest plant-based proteomics study published to date. Logistics, technicalities, requirements, and bottlenecks of such an ambitious large-scale high-throughput proteomics experiment, along with the challenges associated with big data analyses, are discussed.
Affiliation(s)
- Delphine Vincent
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
- AnhDuyen Bui
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
- Vilnis Ezernieks
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
- Saleh Shahinfar
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
- Timothy Luke
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
- Doris Ram
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
- Nicholas Rigas
  - Agriculture Victoria Research, Grains Innovation Park, Horsham, VIC 3400, Australia
- Joe Panozzo
  - Agriculture Victoria Research, Grains Innovation Park, Horsham, VIC 3400, Australia
  - Centre for Agricultural Innovation, University of Melbourne, Parkville, VIC 3010, Australia
- Simone Rochfort
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083, Australia
- Hans Daetwyler
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083, Australia
- Matthew Hayden
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC 3083, Australia
7. Semple S, Ferrer-I-Cancho R, Gustison ML. Linguistic laws in biology. Trends Ecol Evol 2022; 37:53-66. [PMID: 34598817 PMCID: PMC8678306 DOI: 10.1016/j.tree.2021.08.012] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 04/26/2021] [Revised: 08/24/2021] [Accepted: 08/25/2021] [Indexed: 01/03/2023]
Abstract
Linguistic laws, the common statistical patterns of human language, have been investigated by quantitative linguists for nearly a century. Recently, biologists from a range of disciplines have started to explore the prevalence of these laws beyond language, finding patterns consistent with linguistic laws across multiple levels of biological organisation, from molecular (genomes, genes, and proteins) to organismal (animal behaviour) to ecological (populations and ecosystems). We propose a new conceptual framework for the study of linguistic laws in biology, comprising and integrating distinct levels of analysis, from description to prediction to theory building. Adopting this framework will provide critical new insights into the fundamental rules of organisation underpinning natural systems, unifying linguistic laws and core theory in biology.
Affiliation(s)
- Stuart Semple
  - School of Life and Health Sciences, University of Roehampton, London, UK
- Ramon Ferrer-I-Cancho
  - Complexity and Quantitative Linguistics Laboratory, Laboratory for Relational Algorithmics, Complexity, and Learning Research Group, Departament de Ciències de la Computació, Universitat Politècnica de Catalunya, 08034 Barcelona, Catalonia, Spain
- Morgan L Gustison
  - Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
8. Caetano-Anollés G, Aziz MF, Mughal F, Caetano-Anollés D. Tracing protein and proteome history with chronologies and networks: folding recapitulates evolution. Expert Rev Proteomics 2021; 18:863-880. [PMID: 34628994 DOI: 10.1080/14789450.2021.1992277] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/24/2023]
Abstract
INTRODUCTION: While the origin and evolution of proteins remain mysterious, advances in evolutionary genomics and systems biology are facilitating the historical exploration of the structure, function and organization of proteins and proteomes. Molecular chronologies are series of time events describing the history of biological systems and subsystems and the rise of biological innovations. Together with time-varying networks, these chronologies provide a window into the past. AREAS COVERED: Here, we review molecular chronologies and networks built with modern methods of phylogeny reconstruction. We discuss how chronologies of structural domain families uncover the explosive emergence of metabolism, the late rise of translation, the co-evolution of ribosomal proteins and rRNA, and the late development of the ribosomal exit tunnel; events that coincided with a tendency to shorten folding time. Evolving networks described the early emergence of domains and a late 'big bang' of domain combinations. EXPERT OPINION: Two processes, folding and recruitment, appear central to the evolutionary progression. The former increases protein persistence. The latter fosters diversity. Chronologically, protein evolution mirrors folding by combining supersecondary structures into domains, developing translation machinery to facilitate folding speed and stability, and enhancing structural complexity by establishing long-distance interactions in novel structural and architectural designs.
Affiliation(s)
- Gustavo Caetano-Anollés
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, Illinois, USA
  - C. R. Woese Institute for Genomic Biology, University of Illinois, Urbana, Illinois, USA
- M Fayez Aziz
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, Illinois, USA
- Fizza Mughal
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, Illinois, USA
- Derek Caetano-Anollés
  - Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA