1
|
Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024; 592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]
Abstract
Heaps' or Herdan-Heaps' law is a linguistic law describing the relationship between the vocabulary/dictionary size (type) and word counts (token) to be a power-law function. Its existence in genomes with certain definition of DNA words is unclear partly because the dictionary size in genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within limited range. Our definition of words in a genomic or proteomic context is different from other definitions such as over-represented k-mers which are much shorter in length. Although an approximate power-law distribution of protein domain sizes due to gene duplication and the related Zipf's law is well known, their translation to the Heaps' law in DNA words is not automatic. Several other animal genomes are shown herein also to exhibit range-limited Heaps' law with our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reach to the maximum level, a deviation from the Heaps' law was observed, but a quadratic regression in log-log type-token plot fits the data perfectly. Investigation of type-token plot and its regression coefficients could provide an alternative narrative of reusage and redundancy of protein domains as well as creation of new protein domains from a linguistic perspective.
Collapse
Affiliation(s)
- Wentian Li
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA(1); The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.
| | - Yannis Almirantis
- Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
| | - Astero Provata
- Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
| |
Collapse
|
2
|
Tanoz I, Timsit Y. Protein Fold Usages in Ribosomes: Another Glance to the Past. Int J Mol Sci 2024; 25:8806. [PMID: 39201491 PMCID: PMC11354259 DOI: 10.3390/ijms25168806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2024] [Revised: 08/07/2024] [Accepted: 08/08/2024] [Indexed: 09/02/2024] Open
Abstract
The analysis of protein fold usage, similar to codon usage, offers profound insights into the evolution of biological systems and the origins of modern proteomes. While previous studies have examined fold distribution in modern genomes, our study focuses on the comparative distribution and usage of protein folds in ribosomes across bacteria, archaea, and eukaryotes. We identify the prevalence of certain 'super-ribosome folds,' such as the OB fold in bacteria and the SH3 domain in archaea and eukaryotes. The observed protein fold distribution in the ribosomes announces the future power-law distribution where only a few folds are highly prevalent, and most are rare. Additionally, we highlight the presence of three copies of proto-Rossmann folds in ribosomes across all kingdoms, showing its ancient and fundamental role in ribosomal structure and function. Our study also explores early mechanisms of molecular convergence, where different protein folds bind equivalent ribosomal RNA structures in ribosomes across different kingdoms. This comparative analysis enhances our understanding of ribosomal evolution, particularly the distinct evolutionary paths of the large and small subunits, and underscores the complex interplay between RNA and protein components in the transition from the RNA world to modern cellular life. Transcending the concept of folds also makes it possible to group a large number of ribosomal proteins into five categories of urfolds or metafolds, which could attest to their ancestral character and common origins. This work also demonstrates that the gradual acquisition of extensions by simple but ordered folds constitutes an inexorable evolutionary mechanism. This observation supports the idea that simple but structured ribosomal proteins preceded the development of their disordered extensions.
Collapse
Affiliation(s)
- Inzhu Tanoz
- Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, 13288 Marseille, France;
| | - Youri Timsit
- Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, 13288 Marseille, France;
- Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara GOSEE, 3 Rue Michel-Ange, 75016 Paris, France
| |
Collapse
|
3
|
Barone F, Russo ET, Villegas Garcia EN, Punta M, Cozzini S, Ansuini A, Cazzaniga A. Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering. Sci Data 2024; 11:568. [PMID: 38824125 PMCID: PMC11144186 DOI: 10.1038/s41597-024-03131-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2023] [Accepted: 03/08/2024] [Indexed: 06/03/2024] Open
Abstract
Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository.
Collapse
Affiliation(s)
- Federico Barone
- Area Science Park, Padriciano, 99, 34149, Trieste, Italy
- University of Trieste, Trieste, 34127, Italy
| | | | | | - Marco Punta
- IRCCS San Raffaele Institute, Center for Omics Sciences, Milan, 20132, Italy
- IRCCS San Raffaele Institute, Unit of Immunogenetics, Leukemia Genomics and Immunobiology, Division of Immunology, Transplantation and Infectious Disease, Milan, 20132, Italy
| | | | | | | |
Collapse
|
4
|
Gollapalli P, Rudrappa S, Kumar V, Santosh Kumar HS. Domain Architecture Based Methods for Comparative Functional Genomics Toward Therapeutic Drug Target Discovery. J Mol Evol 2023; 91:598-615. [PMID: 37626222 DOI: 10.1007/s00239-023-10129-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2022] [Accepted: 08/06/2023] [Indexed: 08/27/2023]
Abstract
Genes duplicate, mutate, recombine, fuse or fission to produce new genes, or when genes are formed from de novo, novel functions arise during evolution. Researchers have tried to quantify the causes of these molecular diversification processes to know how these genes increase molecular complexity over a period of time, for instance protein domain organization. In contrast to global sequence similarity, protein domain architectures can capture key structural and functional characteristics, making them better proxies for describing functional equivalence. In Prokaryotes and eukaryotes it has proven that, domain designs are retained over significant evolutionary distances. Protein domain architectures are now being utilized to categorize and distinguish evolutionarily related proteins and find homologs among species that are evolutionarily distant from one another. Additionally, structural information stored in domain structures has accelerated homology identification and sequence search methods. Tools for functional protein annotation have been developed to discover, protein domain content, domain order, domain recurrence, and domain position as all these contribute to the prediction of protein functional accuracy. In this review, an attempt is made to summarise facts and speculations regarding the use of protein domain architecture and modularity to identify possible therapeutic targets among cellular activities based on the understanding their linked biological processes.
Collapse
Affiliation(s)
- Pavan Gollapalli
- Center for Bioinformatics and Biostatistics, Nitte (Deemed to be University), Mangalore, Karnataka, 575018, India
| | - Sushmitha Rudrappa
- Department of Biotechnology and Bioinformatics, Jnana Sahyadri Campus, Kuvempu University, Shankaraghatta, Shivamogga, Karnataka, 577451, India
| | - Vadlapudi Kumar
- Department of Biochemistry, Davangere University, Shivagangothri, Davangere, Karnataka, 577007, India
| | - Hulikal Shivashankara Santosh Kumar
- Department of Biotechnology and Bioinformatics, Jnana Sahyadri Campus, Kuvempu University, Shankaraghatta, Shivamogga, Karnataka, 577451, India.
| |
Collapse
|
5
|
Russo ET, Barone F, Bateman A, Cozzini S, Punta M, Laio A. DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets. PLoS Comput Biol 2022; 18:e1010610. [PMID: 36260616 PMCID: PMC9621593 DOI: 10.1371/journal.pcbi.1010610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2022] [Revised: 10/31/2022] [Accepted: 09/26/2022] [Indexed: 11/07/2022] Open
Abstract
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
Collapse
Affiliation(s)
| | - Federico Barone
- SISSA, Trieste, Italy
- AREA SCIENCE PARK, Trieste, Italy
- Department of Mathematics and Geosciences, University of Trieste, Trieste, Italy
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
| | | | - Marco Punta
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
- Unit of Immunogenetics, Leukemia Genomics and Immunobiology, Division of Immunology, Transplantation and Infectious Disease, IRCCS San Raffaele Scientific Institute, Milan, Italy
| | | |
Collapse
|
6
|
Semple S, Ferrer-I-Cancho R, Gustison ML. Linguistic laws in biology. Trends Ecol Evol 2022; 37:53-66. [PMID: 34598817 PMCID: PMC8678306 DOI: 10.1016/j.tree.2021.08.012] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2021] [Revised: 08/24/2021] [Accepted: 08/25/2021] [Indexed: 01/03/2023]
Abstract
Linguistic laws, the common statistical patterns of human language, have been investigated by quantitative linguists for nearly a century. Recently, biologists from a range of disciplines have started to explore the prevalence of these laws beyond language, finding patterns consistent with linguistic laws across multiple levels of biological organisation, from molecular (genomes, genes, and proteins) to organismal (animal behaviour) to ecological (populations and ecosystems). We propose a new conceptual framework for the study of linguistic laws in biology, comprising and integrating distinct levels of analysis, from description to prediction to theory building. Adopting this framework will provide critical new insights into the fundamental rules of organisation underpinning natural systems, unifying linguistic laws and core theory in biology.
Collapse
Affiliation(s)
- Stuart Semple
- School of Life and Health Sciences, University of Roehampton, London, UK.
| | - Ramon Ferrer-I-Cancho
- Complexity and Quantitative Linguistics Laboratory, Laboratory for Relational Algorithmics, Complexity, and Learning Research Group, Departament de Ciències de la Computació, Universitat Politècnica de Catalunya, 08034 Barcelona, Catalonia, Spain
| | - Morgan L Gustison
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
| |
Collapse
|
7
|
Caetano-Anollés G. The Compressed Vocabulary of Microbial Life. Front Microbiol 2021; 12:655990. [PMID: 34305827 PMCID: PMC8292947 DOI: 10.3389/fmicb.2021.655990] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2021] [Accepted: 04/27/2021] [Indexed: 12/22/2022] Open
Abstract
Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including the Zipf's law, a special case of the scale-free distribution, the Heaps' law describing sublinear growth typical of economies of scales, and the Menzerath-Altmann's law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a "triangle of persistence" describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A "causal" word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction. While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.
Collapse
Affiliation(s)
- Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, and C. R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL, United States
| |
Collapse
|
8
|
Dermauw W, Van Leeuwen T, Feyereisen R. Diversity and evolution of the P450 family in arthropods. INSECT BIOCHEMISTRY AND MOLECULAR BIOLOGY 2020; 127:103490. [PMID: 33169702 DOI: 10.1016/j.ibmb.2020.103490] [Citation(s) in RCA: 134] [Impact Index Per Article: 26.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Revised: 10/09/2020] [Accepted: 10/09/2020] [Indexed: 05/13/2023]
Abstract
The P450 family (CYP genes) of arthropods encodes diverse enzymes involved in the metabolism of foreign compounds and in essential endocrine or ecophysiological functions. The P450 sequences (CYPome) from 40 arthropod species were manually curated, including 31 complete CYPomes, and a maximum likelihood phylogeny of nearly 3000 sequences is presented. Arthropod CYPomes are assembled from members of six CYP clans of variable size, the CYP2, CYP3, CYP4 and mitochondrial clans, as well as the CYP20 and CYP16 clans that are not found in Neoptera. CYPome sizes vary from two dozen genes in some parasitic species to over 200 in species as diverse as collembolans or ticks. CYPomes are comprised of few CYP families with many genes and many CYP families with few genes, and this distribution is the result of dynamic birth and death processes. Lineage-specific expansions or blooms are found throughout the phylogeny and often result in genomic clusters that appear to form a reservoir of catalytic diversity maintained as heritable units. Among the many P450s with physiological functions, six CYP families are involved in ecdysteroid metabolism. However, five so-called Halloween genes are not universally represented and do not constitute the unique pathway of ecdysteroid biosynthesis. The diversity of arthropod CYPomes has only partially been uncovered to date and many P450s with physiological functions regulating the synthesis and degradation of endogenous signal molecules (including ecdysteroids) and semiochemicals (including pheromones and defense chemicals) remain to be discovered. Sequence diversity of arthropod P450s is extreme, and P450 sequences lacking the universally conserved Cys ligand to the heme have evolved several times. A better understanding of P450 evolution is needed to discern the relative contributions of stochastic processes and adaptive processes in shaping the size and diversity of CYPomes.
Collapse
Affiliation(s)
- Wannes Dermauw
- Laboratory of Agrozoology, Department of Plants and Crops, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
| | - Thomas Van Leeuwen
- Laboratory of Agrozoology, Department of Plants and Crops, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
| | - René Feyereisen
- Laboratory of Agrozoology, Department of Plants and Crops, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000, Ghent, Belgium; Department of Plant and Environmental Sciences, University of Copenhagen, 40 Thorvaldsensvej, DK-1871, Frederiksberg C, Copenhagen, Denmark.
| |
Collapse
|
9
|
Xiao X, Xue GF, Stamatovic B, Qiu WR. Using Cellular Automata to Simulate Domain Evolution in Proteins. Front Genet 2020; 11:515. [PMID: 32582278 PMCID: PMC7296063 DOI: 10.3389/fgene.2020.00515] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2019] [Accepted: 04/28/2020] [Indexed: 11/26/2022] Open
Abstract
Proteins play primary roles in important biological processes such as catalysis, physiological functions, and immune system functions. Thus, the research on how proteins evolved has been a nuclear question in the field of evolutionary biology. General models of protein evolution help to determine the baseline expectations for evolution of sequences, and these models have been extensively useful in sequence analysis as well as for the computer simulation of artificial sequence data sets. We have developed a new method of simulating multi-domain protein evolution, including fusions of domains, insertion, and deletion. It has been observed via the simulation test that the success rates achieved by the proposed predictor are remarkably high. For the convenience of the most experimental scientists, a user-friendly web server has been established at http://jci-bioinfo.cn/domainevo, by which users can easily get their desired results without having to go through the detailed mathematics. Through the simulation results of this website, users can predict the evolution trend of the protein domain architecture.
Collapse
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen, China
| | - Guang-Fu Xue
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen, China
| | - Biljana Stamatovic
- Faculty of Information Systems and Technologies, University of Donja Gorica, Podgorica, Montenegro
| | - Wang-Ren Qiu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen, China
| |
Collapse
|
10
|
Abstract
Darwin's theory of evolution emphasized that positive selection of functional proficiency provides the fitness that ultimately determines the structure of life, a view that has dominated biochemical thinking of enzymes as perfectly optimized for their specific functions. The 20th-century modern synthesis, structural biology, and the central dogma explained the machinery of evolution, and nearly neutral theory explained how selection competes with random fixation dynamics that produce molecular clocks essential e.g. for dating evolutionary histories. However, quantitative proteomics revealed that selection pressures not relating to optimal function play much larger roles than previously thought, acting perhaps most importantly via protein expression levels. This paper first summarizes recent progress in the 21st century toward recovering this universal selection pressure. Then, the paper argues that proteome cost minimization is the dominant, underlying 'non-function' selection pressure controlling most of the evolution of already functionally adapted living systems. A theory of proteome cost minimization is described and argued to have consequences for understanding evolutionary trade-offs, aging, cancer, and neurodegenerative protein-misfolding diseases.
Collapse
|
11
|
Zalguizuri A, Caetano-Anollés G, Lepek VC. Phylogenetic profiling, an untapped resource for the prediction of secreted proteins and its complementation with sequence-based classifiers in bacterial type III, IV and VI secretion systems. Brief Bioinform 2020; 20:1395-1402. [PMID: 29394318 DOI: 10.1093/bib/bby009] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2017] [Revised: 01/15/2018] [Indexed: 12/29/2022] Open
Abstract
In the establishment and maintenance of the interaction between pathogenic or symbiotic bacteria with a eukaryotic organism, protein substrates of specialized bacterial secretion systems called effectors play a critical role once translocated into the host cell. Proteins are also secreted to the extracellular medium by free-living bacteria or directly injected into other competing organisms to hinder or kill. In this work, we explore an approach based on the evolutionary dependence that most of the effectors maintain with their specific secretion system that analyzes the co-occurrence of any orthologous protein group and their corresponding secretion system across multiple genomes. We compared and complemented our methodology with sequence-based machine learning prediction tools for the type III, IV and VI secretion systems. Finally, we provide the predictive results for the three secretion systems in 1606 complete genomes at http://www.iib.unsam.edu.ar/orgsissec/.
Collapse
|
12
|
Gu Y, Zu J, Li Y. A novel evolutionary model for constructing gene coexpression networks with comprehensive features. BMC Bioinformatics 2019; 20:460. [PMID: 31492104 PMCID: PMC6731579 DOI: 10.1186/s12859-019-3035-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2019] [Accepted: 08/19/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Uncovering the evolutionary principles of gene coexpression network is important for our understanding of the network topological property of new genes. However, most existing evolutionary models only considered the evolution of duplication genes and only based on the degree of genes, ignoring the other key topological properties. The evolutionary mechanism by which how are new genes integrated into the ancestral networks are not yet to be comprehensively characterized. Herein, based on the human ribonucleic acid-sequencing (RNA-seq) data, we develop a new evolutionary model of gene coexpression network which considers the evolutionary process of both duplication genes and de novo genes. RESULTS Based on the human RNA-seq data, we construct a gene coexpression network consisting of 8061 genes and 638624 links. We find that there are 1394 duplication genes and 126 de novo genes in the network. Then based on human gene age data, we reproduce the evolutionary process of this gene coexpression network and develop a new evolutionary model. We find that the generation rates of duplication genes and de novo genes are approximately 3.58/Myr (Myr=Million year) and 0.31/Myr, respectively. Based on the average degree and coreness of parent genes, we find that the gene duplication is a random process. Eventually duplication genes only inherit 12.89% connections from their parent genes and the retained connections have a smaller edge betweenness. Moreover, we find that both duplication genes and de novo genes prefer to develop new interactions with genes which have a large degree and a large coreness. Our proposed model can generate an evolutionary network when the number of newly added genes or the length of evolutionary time is known. CONCLUSIONS Gene duplication and de novo genes are two dominant evolutionary forces in shaping the coexpression network. Both duplication genes and de novo genes develop new interactions through a "rich-gets-richer" mechanism in terms of degree and coreness. This mechanism leads to the scale-free property and hierarchical architecture of biomolecular network. The proposed model is able to construct a gene coexpression network with comprehensive biological characteristics.
Collapse
Affiliation(s)
- Yuexi Gu
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, 710049, People's Republic of China
| | - Jian Zu
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, 710049, People's Republic of China.
| | - Yu Li
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, 710049, People's Republic of China
| |
Collapse
|
13
|
A global map of the protein shape universe. PLoS Comput Biol 2019; 15:e1006969. [PMID: 30978181 PMCID: PMC6481876 DOI: 10.1371/journal.pcbi.1006969] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2018] [Revised: 04/24/2019] [Accepted: 03/20/2019] [Indexed: 11/19/2022] Open
Abstract
Proteins are involved in almost all functions in a living cell, and functions of proteins are realized by their tertiary structures. Obtaining a global perspective of the variety and distribution of protein structures lays a foundation for our understanding of the building principle of protein structures. In light of the rapid accumulation of low-resolution structure data from electron tomography and cryo-electron microscopy, here we map and classify three-dimensional (3D) surface shapes of proteins into a similarity space. Surface shapes of proteins were represented with 3D Zernike descriptors, mathematical moment-based invariants, which have previously been demonstrated effective for biomolecular structure similarity search. In addition to single chains of proteins, we have also analyzed the shape space occupied by protein complexes. From the mapping, we have obtained various new insights into the relationship between shapes, main-chain folds, and complex formation. The unique view obtained from shape mapping opens up new ways to understand design principles, functions, and evolution of proteins. Proteins are the major molecules involved in almost all cellular processes. In this work, we present a novel mapping of protein shapes that represents the variety and the similarities of 3D shapes of proteins and their assemblies. This mapping provides various novel insights into protein shapes including determinant factors of protein 3D shapes, which enhance our understanding of the design principles of protein shapes. The mapping will also be a valuable resource for artificial protein design as well as references for classifying medium- to low-resolution protein structure images of determined by cryo-electron microscopy and tomography.
Collapse
|
14
|
Abstract
This chapter reviews current research on how protein domain architectures evolve. We begin by summarizing work on the phylogenetic distribution of proteins, as this will directly impact which domain architectures can be formed in different species. Studies relating domain family size to occurrence have shown that they generally follow power law distributions, both within genomes and larger evolutionary groups. These findings were subsequently extended to multi-domain architectures. Genome evolution models that have been suggested to explain the shape of these distributions are reviewed, as well as evidence for selective pressure to expand certain domain families more than others. Each domain has an intrinsic combinatorial propensity, and the effects of this have been studied using measures of domain versatility or promiscuity. Next, we study the principles of protein domain architecture evolution and how these have been inferred from distributions of extant domain arrangements. Following this, we review inferences of ancestral domain architecture and the conclusions concerning domain architecture evolution mechanisms that can be drawn from these. Finally, we examine whether all known cases of a given domain architecture can be assumed to have a single common origin (monophyly) or have evolved convergently (polyphyly). We end by a discussion of some available tools for computational analysis or exploitation of protein domain architectures and their evolution.
Collapse
|
15
|
Razban RM, Gilson AI, Durfee N, Strobelt H, Dinkla K, Choi JM, Pfister H, Shakhnovich EI. ProteomeVis: a web app for exploration of protein properties from structure to sequence evolution across organisms' proteomes. Bioinformatics 2018; 34:3557-3565. [PMID: 29741573 PMCID: PMC6184454 DOI: 10.1093/bioinformatics/bty370] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2017] [Revised: 03/27/2018] [Accepted: 05/03/2018] [Indexed: 01/27/2023] Open
Abstract
Motivation Protein evolution spans time scales and its effects span the length of an organism. A web app named ProteomeVis is developed to provide a comprehensive view of protein evolution in the Saccharomyces cerevisiae and Escherichia coli proteomes. ProteomeVis interactively creates protein chain graphs, where edges between nodes represent structure and sequence similarities within user-defined ranges, to study the long time scale effects of protein structure evolution. The short time scale effects of protein sequence evolution are studied by sequence evolutionary rate (ER) correlation analyses with protein properties that span from the molecular to the organismal level. Results We demonstrate the utility and versatility of ProteomeVis by investigating the distribution of edges per node in organismal protein chain universe graphs (oPCUGs) and putative ER determinants. S.cerevisiae and E.coli oPCUGs are scale-free with scaling constants of 1.79 and 1.56, respectively. Both scaling constants can be explained by a previously reported theoretical model describing protein structure evolution. Protein abundance most strongly correlates with ER among properties in ProteomeVis, with Spearman correlations of -0.49 (P-value < 10-10) and -0.46 (P-value < 10-10) for S.cerevisiae and E.coli, respectively. This result is consistent with previous reports that found protein expression to be the most important ER determinant. Availability and implementation ProteomeVis is freely accessible at http://proteomevis.chem.harvard.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Rostam M Razban
- Department of Chemistry & Chemical Biology, Harvard University, Cambridge, MA, USA
| | - Amy I Gilson
- Department of Chemistry & Chemical Biology, Harvard University, Cambridge, MA, USA
| | - Niamh Durfee
- Department of Chemistry & Chemical Biology, Harvard University, Cambridge, MA, USA
| | - Hendrik Strobelt
- School of Engineering & Applied Sciences, Harvard University, Cambridge, MA, USA
| | - Kasper Dinkla
- School of Engineering & Applied Sciences, Harvard University, Cambridge, MA, USA
| | - Jeong-Mo Choi
- Department of Chemistry & Chemical Biology, Harvard University, Cambridge, MA, USA
| | - Hanspeter Pfister
- School of Engineering & Applied Sciences, Harvard University, Cambridge, MA, USA
| | - Eugene I Shakhnovich
- Department of Chemistry & Chemical Biology, Harvard University, Cambridge, MA, USA
| |
Collapse
|
16
|
Power-law relationship in the long-tailed sections of proton dose distributions. Sci Rep 2018; 8:10413. [PMID: 29991734 PMCID: PMC6039508 DOI: 10.1038/s41598-018-28683-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2018] [Accepted: 06/13/2018] [Indexed: 11/11/2022] Open
Abstract
The halo portion of a proton therapy dose creates a long tail in proton dose distributions, but so far study of this phenomenon has been limited. We used statistical methods and mathematical models to confirm that the long-tailed portion of proton dose distributions exhibits a power-law relationship. By analyzing 299 measured dose profiles, we found that all proton lateral dose distributions had a significant power-law scaling correlation with a high correlation coefficient in the tail. We set up a dual-mechanism model, containing both direct and indirect impact mechanisms. In the direct impact mechanism, the proximal dose deposition is mainly due to the direct impact of a proton on a particle. In the indirect mechanism, the impact of a proton on a given particle is considered in terms of the proton’s impact on a neighboring particle that then impacts the given particle. We found that the indirect impact mechanism led to a tail in the distribution exhibiting a power-law relationship because the probability of the indirect impacts was proportional to the distance; i.e., the longer the distance, the larger the indirect impact probability. Upon analyzing the experimental data, we observed that the power-law exponent increased with proton energy.
Collapse
|
17
|
Dori N, Behar H, Brot H, Louzoun Y. Family-size variability grows with collapse rate in a birth-death-catastrophe model. Phys Rev E 2018; 98:012416. [PMID: 30110815 DOI: 10.1103/physreve.98.012416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2017] [Indexed: 06/08/2023]
Abstract
Forest-fire and avalanche models support the notion that frequent catastrophes prevent the growth of very large populations and as such, prevent rare large-scale catastrophes. We show that this notion is not universal. A new model class leads to a paradigm shift in the influence of catastrophes on the family-size distribution of subpopulations. We study a simple population dynamics model where individuals, as well as a whole family, may die with a constant probability, accompanied by a logistic population growth model. We compute the characteristics of the family-size distribution in steady state and the phase diagram of the steady-state distribution and show that the family and catastrophe size variances increase with the catastrophe frequency, which is the opposite of common intuition. Frequent catastrophes are balanced by a larger net-growth rate in surviving families, leading to the exponential growth of these families. When the catastrophe rate is further increased, a second phase transition to extinction occurs when the rate of new family creations is lower than their destruction rate by catastrophes.
Collapse
Affiliation(s)
- N Dori
- Gonda Brain Research Center, Bar-Ilan University, Ramat Gan, Israel
| | - H Behar
- Department of Biology, Stanford University, Stanford, California 94305-5020, USA
| | - H Brot
- Boston Children's Hospital, Harvard Medical School, 3 Blackfan Circle, Boston, Massachusetts 02115, USA and Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
| | - Y Louzoun
- Gonda Brain Research Center, Bar-Ilan University, Ramat Gan, Israel
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
| |
Collapse
|
18
|
Škrlj B, Kunej T, Konc J. Insights from Ion Binding Site Network Analysis into Evolution and Functions of Proteins. Mol Inform 2018; 37:e1700144. [PMID: 29418080 DOI: 10.1002/minf.201700144] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2017] [Accepted: 02/01/2018] [Indexed: 01/05/2023]
Abstract
Many biological phenomena can be represented as complex networks. Using a protein binding site comparison approach, we generated a network of ion binding sites on the scale of all known protein structures from the Protein Data Bank. We found that this ion binding site similarity network is scale-free, indicating a network in which a few ion binding site scaffolds are the network hubs, and these are connected to hundreds of nodes, whereas the vast majority of nodes have only a few neighbors. Enrichment and statistical analysis of the network components and communities yielded insights into underlying processes from the functional and the structural perspective. Largest components and communities were observed to be closely related to basic metabolic processes and some of the most common structural folds, which, from the evolutionary point of view, indicates that they may be the oldest ones. Further, we derived the first comprehensive map of ion interchangeability, based on binding site similarity. Several highly interchangeable protein-ion binding site pairs emerged (e.g., Ca2+ and Mg2+ ), as well as structurally distinct ones. The constructed network of ion binding site similarities will aid in understanding the general principles of protein-ion binding sites structure, function and evolution. We demonstrate potential uses of the network on proteins involved in cancer development and immune response, where individual ions play prominent roles in disease development.
Collapse
Affiliation(s)
- Blaž Škrlj
- Department of molecular modeling, National Institute of Chemistry, Hajdrihova 19, Ljubljana, Slovenia.,Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000, Ljubljana, Slovenia
| | - Tanja Kunej
- Department of Animal Science, Biotechnical Faculty, University of Ljubljana, Slovenia
| | - Janez Konc
- Department of molecular modeling, National Institute of Chemistry, Hajdrihova 19, Ljubljana, Slovenia
| |
Collapse
|
19
|
Raanan H, Pike DH, Moore EK, Falkowski PG, Nanda V. Modular origins of biological electron transfer chains. Proc Natl Acad Sci U S A 2018; 115:1280-1285. [PMID: 29358375 PMCID: PMC5819401 DOI: 10.1073/pnas.1714225115] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
Oxidoreductases catalyze electron transfer reactions that ultimately provide the energy for life. A limited set of ancestral protein-metal modules are presumably the building blocks that evolved into this diverse protein family. However, the identity of these modules and their path to modern oxidoreductases is unknown. Using a comparative structural analysis approach, we identify a set of fundamental electron transfer modules that have evolved to form the extant oxidoreductases. Using transition metal-containing cofactors as fiducial markers, it is possible to cluster cofactor microenvironments into as few as four major modules: bacterial ferredoxin, cytochrome c, symerythrin, and plastocyanin-type folds. From structural alignments, it is challenging to ascertain whether modules evolved from a single common ancestor (homology) or arose by independent convergence on a limited set of structural forms (analogy). Additional insight into common origins is contained in the spatial adjacency network (SPAN), which is based on proximity of modules in oxidoreductases containing multiple cofactor electron transfer chains. Electron transfer chains within complex modern oxidoreductases likely evolved through repeated duplication and diversification of ancient modular units that arose in the Archean eon.
Collapse
Affiliation(s)
- Hagai Raanan
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, NJ 08901
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854
| | - Douglas H Pike
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854
| | - Eli K Moore
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, NJ 08901
| | - Paul G Falkowski
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, NJ 08901;
- Department of Earth and Planetary Sciences, Rutgers University, New Brunswick, NJ 08901
| | - Vikas Nanda
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854;
- Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers University, Piscataway, NJ 08854
| |
Collapse
|
20
|
De Lazzari E, Grilli J, Maslov S, Cosentino Lagomarsino M. Family-specific scaling laws in bacterial genomes. Nucleic Acids Res 2017; 45:7615-7622. [PMID: 28605556 PMCID: PMC5737699 DOI: 10.1093/nar/gkx510] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2017] [Accepted: 05/30/2017] [Indexed: 01/21/2023] Open
Abstract
Among several quantitative invariants found in evolutionary genomics, one of the most striking is the scaling of the overall abundance of proteins, or protein domains, sharing a specific functional annotation across genomes of given size. The size of these functional categories change, on average, as power-laws in the total number of protein-coding genes. Here, we show that such regularities are not restricted to the overall behavior of high-level functional categories, but also exist systematically at the level of single evolutionary families of protein domains. Specifically, the number of proteins within each family follows family-specific scaling laws with genome size. Functionally similar sets of families tend to follow similar scaling laws, but this is not always the case. To understand this systematically, we provide a comprehensive classification of families based on their scaling properties. Additionally, we develop a quantitative score for the heterogeneity of the scaling of families belonging to a given category or predefined group. Under the common reasonable assumption that selection is driven solely or mainly by biological function, these findings point to fine-tuned and interdependent functional roles of specific protein domains, beyond our current functional annotations. This analysis provides a deeper view on the links between evolutionary expansion of protein families and the functional constraints shaping the gene repertoire of bacterial genomes.
Collapse
Affiliation(s)
- Eleonora De Lazzari
- Sorbonne Universités, UPMC Université Paris 06, UMR 7238 Computational and Quantitative Biology, Genomic Physics Group, 4 Place Jussieu, Paris 75005, France
| | - Jacopo Grilli
- Department of Ecology and Evolution, University of Chicago, 1101 E 57th st 60637 Chicago, IL, USA
| | - Sergei Maslov
- Department of Bioengineering, Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- To whom correspondence should be addressed. Tel: +33 144277341; . Correspondence may also be addressed to Sergei Maslov. Tel: +1 217 265 5705;
| | - Marco Cosentino Lagomarsino
- Sorbonne Universités, UPMC Université Paris 06, UMR 7238 Computational and Quantitative Biology, Genomic Physics Group, 4 Place Jussieu, Paris 75005, France
- CNRS, UMR 7238, Paris, France
- FIRC Institute of Molecular Oncology (IFOM), 20139 Milan, Italy
- To whom correspondence should be addressed. Tel: +33 144277341; . Correspondence may also be addressed to Sergei Maslov. Tel: +1 217 265 5705;
| |
Collapse
|
21
|
Nasir A, Kim KM, Caetano-Anollés G. Phylogenetic Tracings of Proteome Size Support the Gradual Accretion of Protein Structural Domains and the Early Origin of Viruses from Primordial Cells. Front Microbiol 2017; 8:1178. [PMID: 28690608 PMCID: PMC5481351 DOI: 10.3389/fmicb.2017.01178] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2017] [Accepted: 06/09/2017] [Indexed: 01/05/2023] Open
Abstract
Untangling the origin and evolution of viruses remains a challenging proposition. We recently studied the global distribution of protein domain structures in thousands of completely sequenced viral and cellular proteomes with comparative genomics, phylogenomics, and multidimensional scaling methods. A tree of life describing the evolution of proteomes revealed viruses emerging from the base of the tree as a fourth supergroup of life. A tree of domains indicated an early origin of modern viral lineages from ancient cells that co-existed with the cellular ancestors. However, it was recently argued that the rooting of our trees and the basal placement of viruses was artifactually induced by small genome (proteome) size. Here we show that these claims arise from misunderstanding and misinterpretations of cladistic methodology. Trees are reconstructed unrooted, and thus, their topologies cannot be distorted a posteriori by the rooting methodology. Tracing proteome size in trees and multidimensional views of evolutionary relationships as well as tests of leaf stability and exclusion/inclusion of taxa demonstrated that the smallest proteomes were neither attracted toward the root nor caused any topological distortions of the trees. Simulations confirmed that taxa clustering patterns were independent of proteome size and were determined by the presence of known evolutionary relatives in data matrices, highlighting the need for broader taxon sampling in phylogeny reconstruction. Instead, phylogenetic tracings of proteome size revealed a slowdown in innovation of the structural domain vocabulary and four regimes of allometric scaling that reflected a Heaps law. These regimes explained increasing economies of scale in the evolutionary growth and accretion of kernel proteome repertoires of viruses and cellular organisms that resemble growth of human languages with limited vocabulary sizes. Results reconcile dynamic and static views of domain frequency distributions that are consistent with the axiom of spatiotemporal continuity that is tenet of evolutionary thinking.
Collapse
Affiliation(s)
- Arshan Nasir
- Department of Biosciences, COMSATS Institute of Information TechnologyIslamabad, Pakistan
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-ChampaignUrbana, IL, United States
| | - Kyung Mo Kim
- Division of Polar Life Sciences, Korea Polar Research InstituteIncheon, South Korea
| | - Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-ChampaignUrbana, IL, United States
| |
Collapse
|
22
|
Ye M, Zhang X, Racz GC, Jiang Q, Moret BME. NEMo: An Evolutionary Model With Modularity for PPI Networks. IEEE Trans Nanobioscience 2017; 16:131-139. [PMID: 28113347 DOI: 10.1109/tnb.2017.2656058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2025]
Abstract
Modeling the evolution of biological networks is a major challenge. Biological networks are usually represented as graphs; evolutionary events not only include addition and removal of vertices and edges but also duplication of vertices and their associated edges. Since duplication is viewed as a primary driver of genomic evolution, recent work has focused on duplication-based models. Missing from these models is any embodiment of modularity, a widely accepted attribute of biological networks. Some models spontaneously generate modular structures, but none is known to maintain and evolve them. We describe network evolution with modularity (NEMo), a new model that embodies modularity. NEMo allows modules to appear and disappear and to fission and to merge, all driven by the underlying edge-level events using a duplication-based process. We also introduce measures to compare biological networks in terms of their modular structure; we present comparisons between NEMo and existing duplication-based models and run our measuring tools on both generated and published networks.
Collapse
|
23
|
Carey M, Wu S, Gan G, Wu H. Correlation-based iterative clustering methods for time course data: The identification of temporal gene response modules for influenza infection in humans. Infect Dis Model 2016; 1:28-39. [PMID: 29928719 PMCID: PMC5963321 DOI: 10.1016/j.idm.2016.07.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2016] [Accepted: 07/08/2016] [Indexed: 12/25/2022] Open
Abstract
Many pragmatic clustering methods have been developed to group data vectors or objects into clusters so that the objects in one cluster are very similar and objects in different clusters are distinct based on some similarity measure. The availability of time course data has motivated researchers to develop methods, such as mixture and mixed-effects modelling approaches, that incorporate the temporal information contained in the shape of the trajectory of the data. However, there is still a need for the development of time-course clustering methods that can adequately deal with inhomogeneous clusters (some clusters are quite large and others are quite small). Here we propose two such methods, hierarchical clustering (IHC) and iterative pairwise-correlation clustering (IPC). We evaluate and compare the proposed methods to the Markov Cluster Algorithm (MCL) and the generalised mixed-effects model (GMM) using simulation studies and an application to a time course gene expression data set from a study containing human subjects who were challenged by a live influenza virus. We identify four types of temporal gene response modules to influenza infection in humans, i.e., single-gene modules (SGM), small-size modules (SSM), medium-size modules (MSM) and large-size modules (LSM). The LSM contain genes that perform various fundamental biological functions that are consistent across subjects. The SSM and SGM contain genes that perform either different or similar biological functions that have complex temporal responses to the virus and are unique to each subject. We show that the temporal response of the genes in the LSM have either simple patterns with a single peak or trough a consequence of the transient stimuli sustained or state-transitioning patterns pertaining to developmental cues and that these modules can differentiate the severity of disease outcomes. Additionally, the size of gene response modules follows a power-law distribution with a consistent exponent across all subjects, which reveals the presence of universality in the underlying biological principles that generated these modules.
Collapse
Affiliation(s)
- Michelle Carey
- Department of Biostatistics and Computational Biology, Crittenden Blvd, Rochester, NY 14642, USA
- Department of Mathematics and Statistics, McGill University, 805 Sherbrooke Street West, Montreal, Canada
| | - Shuang Wu
- Department of Biostatistics and Computational Biology, Crittenden Blvd, Rochester, NY 14642, USA
- Biogen, 250 Binney Street, Cambridge, MA, USA
| | - Guojun Gan
- Department of Mathematics, University of Connecticut, 196 Auditorium Road U-3009, Storrs, USA
| | - Hulin Wu
- Department of Biostatistics and Computational Biology, Crittenden Blvd, Rochester, NY 14642, USA
- Department of Biostatistics, University of Texas Health Science Center School of Public Health at Houston, 1200 Pressler Street, Houston, USA
- Corresponding author. Department of Biostatistics, University of Texas Health Science Center School of Public Health at Houston, 1200 Pressler Street, Houston, USA.
| |
Collapse
|
24
|
Li W, Fontanelli O, Miramontes P. Size distribution of function-based human gene sets and the split-merge model. ROYAL SOCIETY OPEN SCIENCE 2016; 3:160275. [PMID: 27853602 PMCID: PMC5108952 DOI: 10.1098/rsos.160275] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/22/2016] [Accepted: 07/01/2016] [Indexed: 06/06/2023]
Abstract
The sizes of paralogues-gene families produced by ancestral duplication-are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of Human Gene Nomenclature Committee (HGNC) gene sets deviate from the power-law, and can be fitted much better by a beta rank function. We propose a simple mechanism to break a power-law size distribution by a combination of splitting and merging operations. The largest gene sets are split into two to account for the subfunctional categories, and a small proportion of other gene sets are merged into larger sets as new common themes might be realized. These operations are not uncommon for a curator of gene sets. A simulation shows that iteration of these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a rank beta function. We further illustrate application of beta rank function by the example of distribution of transcription factors and drug target genes among HGNC gene families.
Collapse
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
| | - Oscar Fontanelli
- Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México, Circuito Exterior, Ciudad Universitaria, México 04510 DF, México
| | - Pedro Miramontes
- Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México, Circuito Exterior, Ciudad Universitaria, México 04510 DF, México
- Bioinformatics Group and Interdisciplinary Center for Bioinformatics, University of Leipzig, Haertelstrasse 16–18, 04107 Leipzig, Germany
| |
Collapse
|
25
|
A Dynamic Model for the Evolution of Protein Structure. J Mol Evol 2016; 82:230-43. [PMID: 27146880 DOI: 10.1007/s00239-016-9740-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2015] [Accepted: 04/12/2016] [Indexed: 10/21/2022]
Abstract
Domains are folded structures and evolutionary building blocks of protein molecules. Their three-dimensional atomic conformations, which define biological functions, can be coarse-grained into levels of a hierarchy. Here we build global dynamical models for the evolution of domains at fold and fold superfamily (FSF) levels. We fit the models with data from phylogenomic trees of domain structures and evaluate the distributions of the resulting parameters and their implications. The trees were inferred from a census of domain structures in hundreds of genomes from all three superkingdoms of life. The models used birth-death differential equations with the global abundances of structures as state variables, with one set of equations for folds and another for FSFs. Only the transitions present in the tree are assumed possible. Each fold or FSF diversifies in variants, eventually producing a new fold or FSF. The parameters specify rates of generation of variants and of new folds or FSFs. The equations were solved for the parameters by simplifying the trees to a comb-like topology, treating branches as emerging directly from a trunk. We found that the rate constants for folds and FSFs evolved similarly. These parameters showed a sharp transient change at about 1.5 Gyrs ago. This time coincides with a period in which domains massively combined in proteins and their arrangements distributed in novel lineages during the rise of organismal diversification. Our simulations suggest that exploration of protein structure space occurs through coarse-grained discoveries that undergo fine-grained elaboration.
Collapse
|
26
|
Ye M, Racz GC, Jiang Q, Zhang X, Moret BME. NEMo: An Evolutionary Model with Modularity for PPI Networks. LECTURE NOTES IN COMPUTER SCIENCE 2016:224-236. [DOI: 10.1007/978-3-319-38782-6_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/03/2025]
|
27
|
Magner A, Szpankowski W, Kihara D. On the origin of protein superfamilies and superfolds. Sci Rep 2015; 5:8166. [PMID: 25703447 PMCID: PMC4336935 DOI: 10.1038/srep08166] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2014] [Accepted: 01/08/2015] [Indexed: 11/08/2022] Open
Abstract
Distributions of protein families and folds in genomes are highly skewed, having a small number of prevalent superfamiles/superfolds and a large number of families/folds of a small size. Why are the distributions of protein families and folds skewed? Why are there only a limited number of protein families? Here, we employ an information theoretic approach to investigate the protein sequence-structure relationship that leads to the skewed distributions. We consider that protein sequences and folds constitute an information theoretic channel and computed the most efficient distribution of sequences that code all protein folds. The identified distributions of sequences and folds are found to follow a power law, consistent with those observed for proteins in nature. Importantly, the skewed distributions of sequences and folds are suggested to have different origins: the skewed distribution of sequences is due to evolutionary pressure to achieve efficient coding of necessary folds, whereas that of folds is based on the thermodynamic stability of folds. The current study provides a new information theoretic framework for proteins that could be widely applied for understanding protein sequences, structures, functions, and interactions.
Collapse
Affiliation(s)
- Abram Magner
- Department of Computer Science, Purdue University, West Lafayette, IN, 47907, USA
| | - Wojciech Szpankowski
- Department of Computer Science, Purdue University, West Lafayette, IN, 47907, USA
| | - Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN, 47907, USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA
| |
Collapse
|
28
|
Structure based annotation of Helicobacter pylori strain 26695 proteome. PLoS One 2014; 9:e115020. [PMID: 25549250 PMCID: PMC4280198 DOI: 10.1371/journal.pone.0115020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2014] [Accepted: 11/17/2014] [Indexed: 11/23/2022] Open
Abstract
The availability of complete genome sequences of H. pylori 26695 has provided a wealth of information enabling us to carry out in silico studies to identify new molecular targets for pharmaceutical treatment. In order to construe the structural and functional information of complete proteome, use of computational methods are more relevant since these methods are reliable and provide a solution to the time consuming and expensive experimental methods. Out of 1590 predicted protein coding genes in H. pylori, experimentally determined structures are available for only 145 proteins in the PDB. In the absence of experimental structures, computational studies on the three dimensional (3D) structural organization would help in deciphering the protein fold, structure and active site. Functional annotation of each protein was carried out based on structural fold and binding site based ligand association. Most of these proteins are uncharacterized in this proteome and through our annotation pipeline we were able to annotate most of them. We could assign structural folds to 464 uncharacterized proteins from an initial list of 557 sequences. Of the 1195 known structural folds present in the SCOP database, 411 (34% of all known folds) are observed in the whole H. pylori 26695 proteome, with greater inclination for domains belonging to α/β class (36.63%). Top folds include P-loop containing nucleoside triphosphate hydrolases (22.6%), TIM barrel (16.7%), transmembrane helix hairpin (16.05%), alpha-alpha superhelix (11.1%) and S-adenosyl-L-methionine-dependent methyltransferases (10.7%).
Collapse
|
29
|
Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol 2014; 10:e1003889. [PMID: 25375897 PMCID: PMC4222596 DOI: 10.1371/journal.pcbi.1003889] [Citation(s) in RCA: 132] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2014] [Accepted: 09/03/2014] [Indexed: 11/23/2022] Open
Abstract
Given sufficient large protein families, and using a global statistical inference approach, it is possible to obtain sufficient accuracy in protein residue contact predictions to predict the structure of many proteins. However, these approaches do not consider the fact that the contacts in a protein are neither randomly, nor independently distributed, but actually follow precise rules governed by the structure of the protein and thus are interdependent. Here, we present PconsC2, a novel method that uses a deep learning approach to identify protein-like contact patterns to improve contact predictions. A substantial enhancement can be seen for all contacts independently on the number of aligned sequences, residue separation or secondary structure type, but is largest for β-sheet containing proteins. In addition to being superior to earlier methods based on statistical inferences, in comparison to state of the art methods using machine learning, PconsC2 is superior for families with more than 100 effective sequence homologs. The improved contact prediction enables improved structure prediction. Here, we introduce a novel protein contact prediction method PconsC2 that, to the best of our knowledge, outperforms earlier methods. PconsC2 is based on our earlier method, PconsC, as it utilizes the same set of contact predictions from plmDCA and PSICOV. However, in contrast to PconsC, where each residue pair is analysed independently, the initial predictions are analysed in context of neighbouring residue pairs using a deep learning approach, inspired by earlier work. We find that for each layer the deep learning procedure improves the predictions. At the end, after five layers of deep learning and inclusion of a few extra features provides the best performance. An improvement can be seen for all types of proteins, independent on length, number of homologous sequences and structural class. However, the improvement is largest for β-sheet containing proteins. Most importantly the improvement brings for the first time sufficiently accurate predictions to some protein families with less than 1000 homologous sequences. PconsC2 outperforms as well state of the art machine learning based predictors for protein families larger than 100 effective sequences. PconsC2 is licensed under the GNU General Public License v3 and freely available from http://c2.pcons.net/.
Collapse
|
30
|
Abstract
The widespread exchange of genes between bacteria must have consequences on the global architecture of their genomes, which are being found in the abundant genomic data available today. Most of the expansion of bacterial protein families can be attributed to transfer events, which are positively biased for smaller evolutionary distances between genomes, and more frequent for classes that are larger, when summed over all known bacteria. Moreover, “innovation” events where horizontal transfers carry exogenous evolutionary families appear to be less frequent for larger genomes. This dynamic expansion of evolutionary families is interconnected with the acquisition of new biological functions and thus with the size and distribution of the genes’ functional categories found on a genome. This commentary presents our recent contributions to this line of work and possible future directions.
Collapse
Affiliation(s)
- Luigi Grassi
- Dipartimento di Fisica, Sapienza Università di Roma; Rome, Italy
| | | | | |
Collapse
|
31
|
Guo Z, Jiang W, Lages N, Borcherds W, Wang D. Relationship between gene duplicability and diversifiability in the topology of biochemical networks. BMC Genomics 2014; 15:577. [PMID: 25005725 PMCID: PMC4129122 DOI: 10.1186/1471-2164-15-577] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2014] [Accepted: 06/26/2014] [Indexed: 01/21/2023] Open
Abstract
Background Selective gene duplicability, the extensive expansion of a small number of gene families, is universal. Quantitatively, the number of genes (P(K)) with K duplicates in a genome decreases precipitously as K increases, and often follows a power law (P(k)∝k-α). Functional diversification, either neo- or sub-functionalization, is a major evolution route for duplicate genes. Results Using three lines of genomic datasets, we studied the relationship between gene duplicability and diversifiability in the topology of biochemical networks. First, we explored scenario where two pathways in the biochemical networks antagonize each other. Synthetic knockout of respective genes for the two pathways rescues the phenotypic defects of each individual knockout. We identified duplicate gene pairs with sufficient divergences that represent this antagonism relationship in the yeast S. cerevisiae. Such pairs overwhelmingly belong to large gene families, thus tend to have high duplicability. Second, we used distances between proteins of duplicate genes in the protein interaction network as a metric of their diversification. The higher a gene’s duplicate count, the further the proteins of this gene and its duplicates drift away from one another in the networks, which is especially true for genetically antagonizing duplicate genes. Third, we computed a sequence-homology-based clustering coefficient to quantify sequence diversifiability among duplicate genes – the lower the coefficient, the more the sequences have diverged. Duplicate count (K) of a gene is negatively correlated to the clustering coefficient of its duplicates, suggesting that gene duplicability is related to the extent of sequence divergence within the duplicate gene family. Conclusion Thus, a positive correlation exists between gene diversifiability and duplicability in the context of biochemical networks – an improvement of our understanding of gene duplicability.
Collapse
Affiliation(s)
| | | | | | | | - Degeng Wang
- Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, 8403 Floyd Curl Drive, San Antonio, TX 78229-3900, USA.
| |
Collapse
|
32
|
Merging molecular mechanism and evolution: theory and computation at the interface of biophysics and evolutionary population genetics. Curr Opin Struct Biol 2014; 26:84-91. [PMID: 24952216 DOI: 10.1016/j.sbi.2014.05.005] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2014] [Revised: 04/19/2014] [Accepted: 05/16/2014] [Indexed: 11/24/2022]
Abstract
The variation among sequences and structures in nature is both determined by physical laws and by evolutionary history. However, these two factors are traditionally investigated by disciplines with different emphasis and philosophy-molecular biophysics on one hand and evolutionary population genetics in another. Here, we review recent theoretical and computational approaches that address the crucial need to integrate these two disciplines. We first articulate the elements of these approaches. Then, we survey their contribution to our mechanistic understanding of molecular evolution, the polymorphisms in coding region, the distribution of fitness effects (DFE) of mutations, the observed folding stability of proteins in nature, and the distribution of protein folds in genomes.
Collapse
|
33
|
Light S, Basile W, Elofsson A. Orphans and new gene origination, a structural and evolutionary perspective. Curr Opin Struct Biol 2014; 26:73-83. [DOI: 10.1016/j.sbi.2014.05.006] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2014] [Revised: 05/07/2014] [Accepted: 05/16/2014] [Indexed: 12/28/2022]
|
34
|
Grilli J, Romano M, Bassetti F, Cosentino Lagomarsino M. Cross-species gene-family fluctuations reveal the dynamics of horizontal transfers. Nucleic Acids Res 2014; 42:6850-60. [PMID: 24829449 PMCID: PMC4066789 DOI: 10.1093/nar/gku378] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
Prokaryotes vary their protein repertoire mainly through horizontal transfer and gene loss. To elucidate the links between these processes and the cross-species gene-family statistics, we perform a large-scale data analysis of the cross-species variability of gene-family abundance (the number of members of the family found on a given genome). We find that abundance fluctuations are related to the rate of horizontal transfers. This is rationalized by a minimal theoretical model, which predicts this link. The families that are not captured by the model show abundance profiles that are markedly peaked around a mean value, possibly because of specific abundance selection. Based on these results, we define an abundance variability index that captures a family's evolutionary behavior (and thus some of its relevant functional properties) purely based on its cross-species abundance fluctuations. Analysis and model, combined, show a quantitative link between cross-species family abundance statistics and horizontal transfer dynamics, which can be used to analyze genome ‘flux’. Groups of families with different values of the abundance variability index correspond to genome sub-parts having different plasticity in terms of the level of horizontal exchange allowed by natural selection.
Collapse
Affiliation(s)
- Jacopo Grilli
- Dipartimento di Fisica e Astronomia "G. Galilei", Università di Padova, Via Marzolo 8, I-35131 Padova, Italy
| | - Mariacristina Romano
- Dipartimento di Fisica, Università degli Studi di Milano, via Celoria, 16, 20133 Milano, Italy
| | - Federico Bassetti
- Università di Pavia, Dipartimento di Matematica, via Ferrata 1, 27100 Pavia, Italy
| | - Marco Cosentino Lagomarsino
- CNRS, UMR 7238, Paris, France Sorbonne Universités, UPMC Université Paris 06, UMR 7238 Computational and Quantitative Biology, Genomic Physics Group, 15 rue de l'École de Médecine, Paris, France
| |
Collapse
|
35
|
Abstract
Efforts from the TB Structural Genomics Consortium together with those of tuberculosis structural biologists worldwide have led to the determination of about 350 structures, making up nearly a tenth of the pathogen's proteome. Given that knowledge of protein structures is essential to obtaining a high-resolution understanding of the underlying biology, it is desirable to have a structural view of the entire proteome. Indeed, structure prediction methods have advanced sufficiently to allow structural models of many more proteins to be built based on homology modeling and fold recognition strategies. By means of these approaches, structural models for about 2,877 proteins, making up nearly 70% of the Mycobacterium tuberculosis proteome, are available. Knowledge from bioinformatics has made significant inroads into an improved annotation of the M. tuberculosis genome and in the prediction of key protein players that interact in vital pathways, some of which are unique to the organism. Functional inferences have been made for a large number of proteins based on fold-function associations. More importantly, ligand-binding pockets of the proteins are identified and scanned against a large database, leading to binding site-based ligand associations and hence structure-based function annotation. Near proteome-wide structural models provide a global perspective of the fold distribution in the genome. New insights about the folds that predominate in the genome, as well as the fold combinations that make up multidomain proteins, are also obtained. This chapter describes the structural proteome, functional inferences drawn from it, and its applications in drug discovery.
Collapse
|
36
|
Kim KM, Nasir A, Caetano-Anollés G. The importance of using realistic evolutionary models for retrodicting proteomes. Biochimie 2014; 99:129-37. [DOI: 10.1016/j.biochi.2013.11.019] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2013] [Accepted: 11/22/2013] [Indexed: 01/16/2023]
|
37
|
Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014; 15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022] Open
Abstract
Background The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high throughput-sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. Results We observe that the proportion of non-singletons k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values for k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. Conclusion Read mappability as measured by the proportion of singletons increases steadily up to the length scale around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read lengths much slower than linear. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
Collapse
Affiliation(s)
- Wentian Li
- The Robert S, Boas Center for Genomics and Human Genetic, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA.
| | | | | |
Collapse
|
38
|
Sequence and structure space model of protein divergence driven by point mutations. J Theor Biol 2013; 330:1-8. [DOI: 10.1016/j.jtbi.2013.03.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2012] [Revised: 03/07/2013] [Accepted: 03/18/2013] [Indexed: 12/11/2022]
|
39
|
Affiliation(s)
- Rachel Kolodny
- Department of Computer Science, University of Haifa, Haifa 31905, Israel;
| | - Leonid Pereyaslavets
- Department of Structural Biology, Stanford University, Stanford, California 94305; ,
| | | | - Michael Levitt
- Department of Structural Biology, Stanford University, Stanford, California 94305; ,
| |
Collapse
|
40
|
Franzosa EA, Garamszegi S, Xia Y. Toward a three-dimensional view of protein networks between species. Front Microbiol 2012; 3:428. [PMID: 23267356 PMCID: PMC3528071 DOI: 10.3389/fmicb.2012.00428] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2012] [Accepted: 12/06/2012] [Indexed: 01/27/2023] Open
Abstract
General principles governing biomolecular interactions between species are expected to differ significantly from known principles governing the interactions within species, yet these principles remain poorly understood at the systems level. A key reason for this knowledge gap is the lack of a detailed three-dimensional (3D), atomistic view of biomolecular interaction networks between species. Recent progress in structural biology, systems biology, and computational biology has enabled accurate and large-scale construction of 3D structural models of nodes and edges for protein–protein interaction networks within and between species. The resulting within- and between-species structural interaction networks have provided new biophysical, functional, and evolutionary insights into species interactions and infectious disease. Here, we review the nascent field of between-species structural systems biology, focusing on interactions between host and pathogens such as viruses.
Collapse
|
41
|
Kruger FA, Rostom R, Overington JP. Mapping small molecule binding data to structural domains. BMC Bioinformatics 2012; 13 Suppl 17:S11. [PMID: 23282026 PMCID: PMC3521243 DOI: 10.1186/1471-2105-13-s17-s11] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND Large-scale bioactivity/SAR Open Data has recently become available, and this has allowed new analyses and approaches to be developed to help address the productivity and translational gaps of current drug discovery. One of the current limitations of these data is the relative sparsity of reported interactions per protein target, and complexities in establishing clear relationships between bioactivity and targets using bioinformatics tools. We detail in this paper the indexing of targets by the structural domains that bind (or are likely to bind) the ligand within a full-length protein. Specifically, we present a simple heuristic to map small molecule binding to Pfam domains. This profiling can be applied to all proteins within a genome to give some indications of the potential pharmacological modulation and regulation of all proteins. RESULTS In this implementation of our heuristic, ligand binding to protein targets from the ChEMBL database was mapped to structural domains as defined by profiles contained within the Pfam-A database. Our mapping suggests that the majority of assay targets within the current version of the ChEMBL database bind ligands through a small number of highly prevalent domains, and conversely the majority of Pfam domains sampled by our data play no currently established role in ligand binding. Validation studies, carried out firstly against Uniprot entries with expert binding-site annotation and secondly against entries in the wwPDB repository of crystallographic protein structures, demonstrate that our simple heuristic maps ligand binding to the correct domain in about 90 percent of all assessed cases. Using the mappings obtained with our heuristic, we have assembled ligand sets associated with each Pfam domain. CONCLUSIONS Small molecule binding has been mapped to Pfam-A domains of protein targets in the ChEMBL bioactivity database. The result of this mapping is an enriched annotation of small molecule bioactivity data and a grouping of activity classes following the Pfam-A specifications of protein domains. This is valuable for data-focused approaches in drug discovery, for example when extrapolating potential targets of a small molecule with known activity against one or few targets, or in the assessment of a potential target for drug discovery or screening studies.
Collapse
Affiliation(s)
- Felix A Kruger
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
| | | | | |
Collapse
|
42
|
Bottinelli A, Bassetti B, Lagomarsino MC, Gherardi M. Influence of homology and node age on the growth of protein-protein interaction networks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2012; 86:041919. [PMID: 23214627 DOI: 10.1103/physreve.86.041919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/15/2012] [Indexed: 06/01/2023]
Abstract
Proteins participating in a protein-protein interaction network can be grouped into homology classes following their common ancestry. Proteins added to the network correspond to genes added to the classes, so the dynamics of the two objects are intrinsically linked. Here we first introduce a statistical model describing the joint growth of the network and the partitioning of nodes into classes, which is studied through a combined mean-field and simulation approach. We then employ this unified framework to address the specific issue of the age dependence of protein interactions through the definition of three different node wiring or divergence schemes. A comparison with empirical data indicates that an age-dependent divergence move is necessary in order to reproduce the basic topological observables together with the age correlation between interacting nodes visible in empirical data. We also discuss the possibility of nontrivial joint partition and topology observables.
Collapse
|
43
|
A network synthesis model for generating protein interaction network families. PLoS One 2012; 7:e41474. [PMID: 22912671 PMCID: PMC3418285 DOI: 10.1371/journal.pone.0041474] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2011] [Accepted: 06/27/2012] [Indexed: 11/19/2022] Open
Abstract
In this work, we introduce a novel network synthesis model that can generate families of evolutionarily related synthetic protein-protein interaction (PPI) networks. Given an ancestral network, the proposed model generates the network family according to a hypothetical phylogenetic tree, where the descendant networks are obtained through duplication and divergence of their ancestors, followed by network growth using network evolution models. We demonstrate that this network synthesis model can effectively create synthetic networks whose internal and cross-network properties closely resemble those of real PPI networks. The proposed model can serve as an effective framework for generating comprehensive benchmark datasets that can be used for reliable performance assessment of comparative network analysis algorithms. Using this model, we constructed a large-scale network alignment benchmark, called NAPAbench, and evaluated the performance of several representative network alignment algorithms. Our analysis clearly shows the relative performance of the leading network algorithms, with their respective advantages and disadvantages. The algorithm and source code of the network synthesis model and the network alignment benchmark NAPAbench are publicly available at http://www.ece.tamu.edu/bjyoon/NAPAbench/.
Collapse
|
44
|
The ecology of bacterial genes and the survival of the new. INTERNATIONAL JOURNAL OF EVOLUTIONARY BIOLOGY 2012; 2012:394026. [PMID: 22900231 PMCID: PMC3415099 DOI: 10.1155/2012/394026] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 04/21/2012] [Accepted: 06/26/2012] [Indexed: 11/18/2022]
Abstract
Much of the observed variation among closely related bacterial genomes is attributable to gains and losses of genes that are acquired horizontally as well as to gene duplications and larger amplifications. The genomic flexibility that results from these mechanisms certainly contributes to the ability of bacteria to survive and adapt in varying environmental challenges. However, the duplicability and transferability of individual genes imply that natural selection should operate, not only at the organismal level, but also at the level of the gene. Genes can be considered semiautonomous entities that possess specific functional niches and evolutionary dynamics. The evolution of bacterial genes should respond both to selective pressures that favor competition, mostly among orthologs or paralogs that may occupy the same functional niches, and cooperation, with the majority of other genes coexisting in a given genome. The relative importance of either type of selection is likely to vary among different types of genes, based on the functional niches they cover and on the tightness of their association with specific organismal lineages. The frequent availability of new functional niches caused by environmental changes and biotic evolution should enable the constant diversification of gene families and the survival of new lineages of genes.
Collapse
|
45
|
Petersen SB, Neves-Petersen MT, Henriksen SB, Mortensen RJ, Geertz-Hansen HM. Scale-free behaviour of amino acid pair interactions in folded proteins. PLoS One 2012; 7:e41322. [PMID: 22848462 PMCID: PMC3406053 DOI: 10.1371/journal.pone.0041322] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2012] [Accepted: 06/20/2012] [Indexed: 11/19/2022] Open
Abstract
The protein structure is a cumulative result of interactions between amino acid residues interacting with each other through space and/or chemical bonds. Despite the large number of high resolution protein structures, the "protein structure code" has not been fully identified. Our manuscript presents a novel approach to protein structure analysis in order to identify rules for spatial packing of amino acid pairs in proteins. We have investigated 8706 high resolution non-redundant protein chains and quantified amino acid pair interactions in terms of solvent accessibility, spatial and sequence distance, secondary structure, and sequence length. The number of pairs found in a particular environment is stored in a cell in an 8 dimensional data tensor. When plotting the cell population against the number of cells that have the same population size, a scale free organization is found. When analyzing which amino acid paired residues contributed to the cells with a population above 50, pairs of Ala, Ile, Leu and Val dominate the results. This result is statistically highly significant. We postulate that such pairs form "structural stability points" in the protein structure. Our data shows that they are in buried α-helices or β-strands, in a spatial distance of 3.8-4.3Å and in a sequence distance >4 residues. We speculate that the scale free organization of the amino acid pair interactions in the 8D protein structure combined with the clear dominance of pairs of Ala, Ile, Leu and Val is important for understanding the very nature of the protein structure formation. Our observations suggest that protein structures should be considered as having a higher dimensional organization.
Collapse
Affiliation(s)
- Steffen B. Petersen
- Int Iberian Nanotechnol Laboratory INL, Braga, Portugal
- Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- The Institute for Lasers, University at Buffalo, The State University of New York at Buffalo, Photonics and Biophotonics, Buffalo, New York, United States of America
| | - Maria Teresa Neves-Petersen
- Int Iberian Nanotechnol Laboratory INL, Braga, Portugal
- Department of Biotechnology, Chemistry and Environmental Engineering, Aalborg University, Aalborg, Denmark
| | - Svend B. Henriksen
- Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
| | - Rasmus J. Mortensen
- Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Department of Infectious Disease Immunology, Statens Serum Institute, Copenhagen, Denmark
| | - Henrik M. Geertz-Hansen
- Technical University of Denmark, Novo Nordisk Foundation Center for Biosustainability Fremtidsvej, Hørsholm, Copenhagen, Denmark
| |
Collapse
|
46
|
Current understanding of the formation and adaptation of metabolic systems based on network theory. Metabolites 2012; 2:429-57. [PMID: 24957641 PMCID: PMC3901219 DOI: 10.3390/metabo2030429] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2012] [Revised: 06/26/2012] [Accepted: 07/09/2012] [Indexed: 11/17/2022] Open
Abstract
Formation and adaptation of metabolic networks has been a long-standing question in biology. With recent developments in biotechnology and bioinformatics, the understanding of metabolism is progressively becoming clearer from a network perspective. This review introduces the comprehensive metabolic world that has been revealed by a wide range of data analyses and theoretical studies; in particular, it illustrates the role of evolutionary events, such as gene duplication and horizontal gene transfer, and environmental factors, such as nutrient availability and growth conditions, in evolution of the metabolic network. Furthermore, the mathematical models for the formation and adaptation of metabolic networks have also been described, according to the current understanding from a perspective of metabolic networks. These recent findings are helpful in not only understanding the formation of metabolic networks and their adaptation, but also metabolic engineering.
Collapse
|
47
|
Garma L, Mukherjee S, Mitra P, Zhang Y. How many protein-protein interactions types exist in nature? PLoS One 2012; 7:e38913. [PMID: 22719985 PMCID: PMC3374795 DOI: 10.1371/journal.pone.0038913] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2012] [Accepted: 05/14/2012] [Indexed: 11/18/2022] Open
Abstract
“Protein quaternary structure universe” refers to the ensemble of all protein-protein complexes across all organisms in nature. The number of quaternary folds thus corresponds to the number of ways proteins physically interact with other proteins. This study focuses on answering two basic questions: Whether the number of protein-protein interactions is limited and, if yes, how many different quaternary folds exist in nature. By all-to-all sequence and structure comparisons, we grouped the protein complexes in the protein data bank (PDB) into 3,629 families and 1,761 folds. A statistical model was introduced to obtain the quantitative relation between the numbers of quaternary families and quaternary folds in nature. The total number of possible protein-protein interactions was estimated around 4,000, which indicates that the current protein repository contains only 42% of quaternary folds in nature and a full coverage needs approximately a quarter century of experimental effort. The results have important implications to the protein complex structural modeling and the structure genomics of protein-protein interactions.
Collapse
Affiliation(s)
- Leonardo Garma
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Biocenter Oulu and Department of Biochemistry, University of Oulu, Oulu, Finland
| | - Srayanta Mukherjee
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Pralay Mitra
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- * E-mail:
| |
Collapse
|
48
|
Haegeman B, Weitz JS. A neutral theory of genome evolution and the frequency distribution of genes. BMC Genomics 2012; 13:196. [PMID: 22613814 PMCID: PMC3386021 DOI: 10.1186/1471-2164-13-196] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2012] [Accepted: 05/21/2012] [Indexed: 12/31/2022] Open
Abstract
Background The gene composition of bacteria of the same species can differ significantly between isolates. Variability in gene composition can be summarized in terms of gene frequency distributions, in which individual genes are ranked according to the frequency of genomes in which they appear. Empirical gene frequency distributions possess a U-shape, such that there are many rare genes, some genes of intermediate occurrence, and many common genes. It would seem that U-shaped gene frequency distributions can be used to infer the essentiality and/or importance of a gene to a species. Here, we ask: can U-shaped gene frequency distributions, instead, arise generically via neutral processes of genome evolution? Results We introduce a neutral model of genome evolution which combines birth-death processes at the organismal level with gene uptake and loss at the genomic level. This model predicts that gene frequency distributions possess a characteristic U-shape even in the absence of selective forces driving genome and population structure. We compare the model predictions to empirical gene frequency distributions from 6 multiply sequenced species of bacterial pathogens. We fit the model with constant population size to data, matching U-shape distributions albeit without matching all quantitative features of the distribution. We find stronger model fits in the case where we consider exponentially growing populations. We also show that two alternative models which contain a "rigid" and "flexible" core component of genomes provide strong fits to gene frequency distributions. Conclusions The analysis of neutral models of genome evolution suggests that U-shaped gene frequency distributions provide less information than previously suggested regarding gene essentiality. We discuss the need for additional theory and genomic level information to disentangle the roles of evolutionary mechanisms operating within and amongst individuals in driving the dynamics of gene distributions.
Collapse
Affiliation(s)
- Bart Haegeman
- INRIA Research Team MODEMIC, UMR MISTEA, 34060 Montpellier, France.
| | | |
Collapse
|
49
|
Burke S, Elber R. Super folds, networks, and barriers. Proteins 2012; 80:463-70. [PMID: 22095563 PMCID: PMC3290721 DOI: 10.1002/prot.23212] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2011] [Revised: 08/31/2011] [Accepted: 09/22/2011] [Indexed: 11/06/2022]
Abstract
Exhaustive enumeration of sequences and folds is conducted for a simple lattice model of conformations, sequences, and energies. Examination of all foldable sequences and their nearest connected neighbors (sequences that differ by no more than a point mutation) illustrates the following: (i) There exist unusually large number of sequences that fold into a few structures (super-folds). The same observation was made experimentally and computationally using stochastic sampling and exhaustive enumeration of related models. (ii) There exist only a few large networks of connected sequences that are not restricted to one fold. These networks cover a significant fraction of fold spaces (super-networks). (iii) There exist barriers in sequence space that prevent foldable sequences of the same structure to "connect" through a series of single point mutations (super-barrier), even in the presence of the sequence connection between folds. While there is ample experimental evidence for the existence of super-folds, evidence for a super-network is just starting to emerge. The prediction of a sequence barrier is an intriguing characteristic of sequence space, suggesting that the overall sequence space may be disconnected. The implications and limitations of these observations for evolution of protein structures are discussed.
Collapse
Affiliation(s)
- Sean Burke
- Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin TX 78712
| | - Ron Elber
- Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin TX 78712
- Department of Chemistry and Biochemistry, University of Texas at Austin, Austin TX 78712
| |
Collapse
|
50
|
Abstract
Large-scale databases are available that contain homologous gene families constructed from hundreds of complete genome sequences from across the three domains of life. Here, we discuss the approaches of increasing complexity aimed at extracting information on the pattern and process of gene family evolution from such datasets. In particular, we consider the models that invoke processes of gene birth (duplication and transfer) and death (loss) to explain the evolution of gene families. First, we review birth-and-death models of family size evolution and their implications in light of the universal features of family size distribution observed across different species and the three domains of life. Subsequently, we proceed to recent developments on models capable of more completely considering information in the sequences of homologous gene families through the probabilistic reconciliation of the phylogenetic histories of individual genes with the phylogenetic history of the genomes in which they have resided. To illustrate the methods and results presented, we use data from the HOGENOM database, demonstrating that the distribution of homologous gene family sizes in the genomes of the eukaryota, archaea, and bacteria exhibits remarkably similar shapes. We show that these distributions are best described by models of gene family size evolution, where for individual genes the death (loss) rate is larger than the birth (duplication and transfer) rate but new families are continually supplied to the genome by a process of origination. Finally, we use probabilistic reconciliation methods to take into consideration additional information from gene phylogenies, and find that, for prokaryotes, the majority of birth events are the result of transfer.
Collapse
|