Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Qian J, Luscombe NM, Gerstein M. Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J Mol Biol 2001;313:673-81. [PMID: 11697896 DOI: 10.1006/jmbi.2001.5079] [Citation(s) in RCA: 206] [Impact Index Per Article: 8.6] [Indexed: 11/22/2022]

Cited by Other Article(s)

Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024;592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]

Tanoz I, Timsit Y. Protein Fold Usages in Ribosomes: Another Glance to the Past. Int J Mol Sci 2024;25:8806. [PMID: 39201491 PMCID: PMC11354259 DOI: 10.3390/ijms25168806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Revised: 08/07/2024] [Accepted: 08/08/2024] [Indexed: 09/02/2024] Open

Abstract

The analysis of protein fold usage, similar to codon usage, offers profound insights into the evolution of biological systems and the origins of modern proteomes. While previous studies have examined fold distribution in modern genomes, our study focuses on the comparative distribution and usage of protein folds in ribosomes across bacteria, archaea, and eukaryotes. We identify the prevalence of certain 'super-ribosome folds,' such as the OB fold in bacteria and the SH3 domain in archaea and eukaryotes. The observed protein fold distribution in the ribosomes announces the future power-law distribution where only a few folds are highly prevalent, and most are rare. Additionally, we highlight the presence of three copies of proto-Rossmann folds in ribosomes across all kingdoms, showing its ancient and fundamental role in ribosomal structure and function. Our study also explores early mechanisms of molecular convergence, where different protein folds bind equivalent ribosomal RNA structures in ribosomes across different kingdoms. This comparative analysis enhances our understanding of ribosomal evolution, particularly the distinct evolutionary paths of the large and small subunits, and underscores the complex interplay between RNA and protein components in the transition from the RNA world to modern cellular life. Transcending the concept of folds also makes it possible to group a large number of ribosomal proteins into five categories of urfolds or metafolds, which could attest to their ancestral character and common origins. This work also demonstrates that the gradual acquisition of extensions by simple but ordered folds constitutes an inexorable evolutionary mechanism. This observation supports the idea that simple but structured ribosomal proteins preceded the development of their disordered extensions.

Barone F, Russo ET, Villegas Garcia EN, Punta M, Cozzini S, Ansuini A, Cazzaniga A. Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering. Sci Data 2024;11:568. [PMID: 38824125 PMCID: PMC11144186 DOI: 10.1038/s41597-024-03131-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Accepted: 03/08/2024] [Indexed: 06/03/2024] Open

Gollapalli P, Rudrappa S, Kumar V, Santosh Kumar HS. Domain Architecture Based Methods for Comparative Functional Genomics Toward Therapeutic Drug Target Discovery. J Mol Evol 2023;91:598-615. [PMID: 37626222 DOI: 10.1007/s00239-023-10129-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2022] [Accepted: 08/06/2023] [Indexed: 08/27/2023]

Russo ET, Barone F, Bateman A, Cozzini S, Punta M, Laio A. DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets. PLoS Comput Biol 2022;18:e1010610. [PMID: 36260616 PMCID: PMC9621593 DOI: 10.1371/journal.pcbi.1010610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Revised: 10/31/2022] [Accepted: 09/26/2022] [Indexed: 11/07/2022] Open

Semple S, Ferrer-I-Cancho R, Gustison ML. Linguistic laws in biology. Trends Ecol Evol 2022;37:53-66. [PMID: 34598817 PMCID: PMC8678306 DOI: 10.1016/j.tree.2021.08.012] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 04/26/2021] [Revised: 08/24/2021] [Accepted: 08/25/2021] [Indexed: 01/03/2023]

Caetano-Anollés G. The Compressed Vocabulary of Microbial Life. Front Microbiol 2021;12:655990. [PMID: 34305827 PMCID: PMC8292947 DOI: 10.3389/fmicb.2021.655990] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/19/2021] [Accepted: 04/27/2021] [Indexed: 12/22/2022] Open

Abstract

Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including the Zipf's law, a special case of the scale-free distribution, the Heaps' law describing sublinear growth typical of economies of scales, and the Menzerath-Altmann's law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a "triangle of persistence" describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A "causal" word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction. 
While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.

Dermauw W, Van Leeuwen T, Feyereisen R. Diversity and evolution of the P450 family in arthropods. Insect Biochem Mol Biol 2020;127:103490. [PMID: 33169702 DOI: 10.1016/j.ibmb.2020.103490] [Citation(s) in RCA: 134] [Impact Index Per Article: 26.8] [Received: 09/15/2020] [Revised: 10/09/2020] [Accepted: 10/09/2020] [Indexed: 05/13/2023]

Abstract

The P450 family (CYP genes) of arthropods encodes diverse enzymes involved in the metabolism of foreign compounds and in essential endocrine or ecophysiological functions. The P450 sequences (CYPome) from 40 arthropod species were manually curated, including 31 complete CYPomes, and a maximum likelihood phylogeny of nearly 3000 sequences is presented. Arthropod CYPomes are assembled from members of six CYP clans of variable size, the CYP2, CYP3, CYP4 and mitochondrial clans, as well as the CYP20 and CYP16 clans that are not found in Neoptera. CYPome sizes vary from two dozen genes in some parasitic species to over 200 in species as diverse as collembolans or ticks. CYPomes are comprised of few CYP families with many genes and many CYP families with few genes, and this distribution is the result of dynamic birth and death processes. Lineage-specific expansions or blooms are found throughout the phylogeny and often result in genomic clusters that appear to form a reservoir of catalytic diversity maintained as heritable units. Among the many P450s with physiological functions, six CYP families are involved in ecdysteroid metabolism. However, five so-called Halloween genes are not universally represented and do not constitute the unique pathway of ecdysteroid biosynthesis. The diversity of arthropod CYPomes has only partially been uncovered to date and many P450s with physiological functions regulating the synthesis and degradation of endogenous signal molecules (including ecdysteroids) and semiochemicals (including pheromones and defense chemicals) remain to be discovered. Sequence diversity of arthropod P450s is extreme, and P450 sequences lacking the universally conserved Cys ligand to the heme have evolved several times. A better understanding of P450 evolution is needed to discern the relative contributions of stochastic processes and adaptive processes in shaping the size and diversity of CYPomes.

Xiao X, Xue GF, Stamatovic B, Qiu WR. Using Cellular Automata to Simulate Domain Evolution in Proteins. Front Genet 2020;11:515. [PMID: 32582278 PMCID: PMC7296063 DOI: 10.3389/fgene.2020.00515] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/24/2019] [Accepted: 04/28/2020] [Indexed: 11/26/2022] Open

Survival of the cheapest: how proteome cost minimization drives evolution. Q Rev Biophys 2020;53:e7. [PMID: 32624048 DOI: 10.1017/s0033583520000037] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/06/2022]

Zalguizuri A, Caetano-Anollés G, Lepek VC. Phylogenetic profiling, an untapped resource for the prediction of secreted proteins and its complementation with sequence-based classifiers in bacterial type III, IV and VI secretion systems. Brief Bioinform 2020;20:1395-1402. [PMID: 29394318 DOI: 10.1093/bib/bby009] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/30/2017] [Revised: 01/15/2018] [Indexed: 12/29/2022] Open

Gu Y, Zu J, Li Y. A novel evolutionary model for constructing gene coexpression networks with comprehensive features. BMC Bioinformatics 2019;20:460. [PMID: 31492104 PMCID: PMC6731579 DOI: 10.1186/s12859-019-3035-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/28/2019] [Accepted: 08/19/2019] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Uncovering the evolutionary principles of gene coexpression networks is important for understanding the topological properties of new genes in the network. However, most existing evolutionary models consider only the evolution of duplication genes and rely only on gene degree, ignoring other key topological properties. The mechanism by which new genes are integrated into ancestral networks has not yet been comprehensively characterized. Herein, based on human ribonucleic acid-sequencing (RNA-seq) data, we develop a new evolutionary model of the gene coexpression network that considers the evolutionary process of both duplication genes and de novo genes.

RESULTS

Based on the human RNA-seq data, we construct a gene coexpression network consisting of 8061 genes and 638624 links. We find that there are 1394 duplication genes and 126 de novo genes in the network. Then based on human gene age data, we reproduce the evolutionary process of this gene coexpression network and develop a new evolutionary model. We find that the generation rates of duplication genes and de novo genes are approximately 3.58/Myr (Myr=Million year) and 0.31/Myr, respectively. Based on the average degree and coreness of parent genes, we find that the gene duplication is a random process. Eventually duplication genes only inherit 12.89% connections from their parent genes and the retained connections have a smaller edge betweenness. Moreover, we find that both duplication genes and de novo genes prefer to develop new interactions with genes which have a large degree and a large coreness. Our proposed model can generate an evolutionary network when the number of newly added genes or the length of evolutionary time is known.

CONCLUSIONS

Gene duplication and de novo genes are two dominant evolutionary forces in shaping the coexpression network. Both duplication genes and de novo genes develop new interactions through a "rich-gets-richer" mechanism in terms of degree and coreness. This mechanism leads to the scale-free property and hierarchical architecture of biomolecular network. The proposed model is able to construct a gene coexpression network with comprehensive biological characteristics.

A global map of the protein shape universe. PLoS Comput Biol 2019;15:e1006969. [PMID: 30978181 PMCID: PMC6481876 DOI: 10.1371/journal.pcbi.1006969] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 10/15/2018] [Revised: 04/24/2019] [Accepted: 03/20/2019] [Indexed: 11/19/2022] Open

Evolution of Protein Domain Architectures. Methods Mol Biol 2019;1910:469-504. [PMID: 31278674 DOI: 10.1007/978-1-4939-9074-0_15] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Indexed: 12/04/2022]

Razban RM, Gilson AI, Durfee N, Strobelt H, Dinkla K, Choi JM, Pfister H, Shakhnovich EI. ProteomeVis: a web app for exploration of protein properties from structure to sequence evolution across organisms' proteomes. Bioinformatics 2018;34:3557-3565. [PMID: 29741573 PMCID: PMC6184454 DOI: 10.1093/bioinformatics/bty370] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 12/01/2017] [Revised: 03/27/2018] [Accepted: 05/03/2018] [Indexed: 01/27/2023] Open

Power-law relationship in the long-tailed sections of proton dose distributions. Sci Rep 2018;8:10413. [PMID: 29991734 PMCID: PMC6039508 DOI: 10.1038/s41598-018-28683-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/05/2018] [Accepted: 06/13/2018] [Indexed: 11/11/2022] Open

Dori N, Behar H, Brot H, Louzoun Y. Family-size variability grows with collapse rate in a birth-death-catastrophe model. Phys Rev E 2018;98:012416. [PMID: 30110815 DOI: 10.1103/physreve.98.012416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2017] [Indexed: 06/08/2023]

Škrlj B, Kunej T, Konc J. Insights from Ion Binding Site Network Analysis into Evolution and Functions of Proteins. Mol Inform 2018;37:e1700144. [PMID: 29418080 DOI: 10.1002/minf.201700144] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/27/2017] [Accepted: 02/01/2018] [Indexed: 01/05/2023]

Raanan H, Pike DH, Moore EK, Falkowski PG, Nanda V. Modular origins of biological electron transfer chains. Proc Natl Acad Sci U S A 2018;115:1280-1285. [PMID: 29358375 PMCID: PMC5819401 DOI: 10.1073/pnas.1714225115] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 01/07/2023] Open

De Lazzari E, Grilli J, Maslov S, Cosentino Lagomarsino M. Family-specific scaling laws in bacterial genomes. Nucleic Acids Res 2017;45:7615-7622. [PMID: 28605556 PMCID: PMC5737699 DOI: 10.1093/nar/gkx510] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/07/2017] [Accepted: 05/30/2017] [Indexed: 01/21/2023] Open

Nasir A, Kim KM, Caetano-Anollés G. Phylogenetic Tracings of Proteome Size Support the Gradual Accretion of Protein Structural Domains and the Early Origin of Viruses from Primordial Cells. Front Microbiol 2017;8:1178. [PMID: 28690608 PMCID: PMC5481351 DOI: 10.3389/fmicb.2017.01178] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 03/27/2017] [Accepted: 06/09/2017] [Indexed: 01/05/2023] Open

Abstract

Untangling the origin and evolution of viruses remains a challenging proposition. We recently studied the global distribution of protein domain structures in thousands of completely sequenced viral and cellular proteomes with comparative genomics, phylogenomics, and multidimensional scaling methods. A tree of life describing the evolution of proteomes revealed viruses emerging from the base of the tree as a fourth supergroup of life. A tree of domains indicated an early origin of modern viral lineages from ancient cells that co-existed with the cellular ancestors. However, it was recently argued that the rooting of our trees and the basal placement of viruses was artifactually induced by small genome (proteome) size. Here we show that these claims arise from misunderstanding and misinterpretations of cladistic methodology. Trees are reconstructed unrooted, and thus, their topologies cannot be distorted a posteriori by the rooting methodology. Tracing proteome size in trees and multidimensional views of evolutionary relationships as well as tests of leaf stability and exclusion/inclusion of taxa demonstrated that the smallest proteomes were neither attracted toward the root nor caused any topological distortions of the trees. Simulations confirmed that taxa clustering patterns were independent of proteome size and were determined by the presence of known evolutionary relatives in data matrices, highlighting the need for broader taxon sampling in phylogeny reconstruction. Instead, phylogenetic tracings of proteome size revealed a slowdown in innovation of the structural domain vocabulary and four regimes of allometric scaling that reflected a Heaps law. These regimes explained increasing economies of scale in the evolutionary growth and accretion of kernel proteome repertoires of viruses and cellular organisms that resemble growth of human languages with limited vocabulary sizes. 
Results reconcile dynamic and static views of domain frequency distributions that are consistent with the axiom of spatiotemporal continuity that is tenet of evolutionary thinking.

Ye M, Zhang X, Racz GC, Jiang Q, Moret BME. NEMo: An Evolutionary Model With Modularity for PPI Networks. IEEE Trans Nanobioscience 2017;16:131-139. [PMID: 28113347 DOI: 10.1109/tnb.2017.2656058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2025]

Carey M, Wu S, Gan G, Wu H. Correlation-based iterative clustering methods for time course data: The identification of temporal gene response modules for influenza infection in humans. Infect Dis Model 2016;1:28-39. [PMID: 29928719 PMCID: PMC5963321 DOI: 10.1016/j.idm.2016.07.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 05/09/2016] [Accepted: 07/08/2016] [Indexed: 12/25/2022] Open

Abstract

Many pragmatic clustering methods have been developed to group data vectors or objects into clusters so that the objects in one cluster are very similar and objects in different clusters are distinct based on some similarity measure. The availability of time course data has motivated researchers to develop methods, such as mixture and mixed-effects modelling approaches, that incorporate the temporal information contained in the shape of the trajectory of the data. However, there is still a need for the development of time-course clustering methods that can adequately deal with inhomogeneous clusters (some clusters are quite large and others are quite small). Here we propose two such methods, hierarchical clustering (IHC) and iterative pairwise-correlation clustering (IPC). We evaluate and compare the proposed methods to the Markov Cluster Algorithm (MCL) and the generalised mixed-effects model (GMM) using simulation studies and an application to a time course gene expression data set from a study containing human subjects who were challenged by a live influenza virus. We identify four types of temporal gene response modules to influenza infection in humans, i.e., single-gene modules (SGM), small-size modules (SSM), medium-size modules (MSM) and large-size modules (LSM). The LSM contain genes that perform various fundamental biological functions that are consistent across subjects. The SSM and SGM contain genes that perform either different or similar biological functions that have complex temporal responses to the virus and are unique to each subject. We show that the temporal response of the genes in the LSM have either simple patterns with a single peak or trough a consequence of the transient stimuli sustained or state-transitioning patterns pertaining to developmental cues and that these modules can differentiate the severity of disease outcomes. 
Additionally, the size of gene response modules follows a power-law distribution with a consistent exponent across all subjects, which reveals the presence of universality in the underlying biological principles that generated these modules.

Li W, Fontanelli O, Miramontes P. Size distribution of function-based human gene sets and the split-merge model. R Soc Open Sci 2016;3:160275. [PMID: 27853602 PMCID: PMC5108952 DOI: 10.1098/rsos.160275] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/22/2016] [Accepted: 07/01/2016] [Indexed: 06/06/2023]

A Dynamic Model for the Evolution of Protein Structure. J Mol Evol 2016;82:230-43. [PMID: 27146880 DOI: 10.1007/s00239-016-9740-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 12/31/2015] [Accepted: 04/12/2016] [Indexed: 10/21/2022]

Ye M, Racz GC, Jiang Q, Zhang X, Moret BME. NEMo: An Evolutionary Model with Modularity for PPI Networks. Lect Notes Comput Sci 2016:224-236. [DOI: 10.1007/978-3-319-38782-6_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2025]

Magner A, Szpankowski W, Kihara D. On the origin of protein superfamilies and superfolds. Sci Rep 2015;5:8166. [PMID: 25703447 PMCID: PMC4336935 DOI: 10.1038/srep08166] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 11/30/2014] [Accepted: 01/08/2015] [Indexed: 11/08/2022] Open

Structure based annotation of Helicobacter pylori strain 26695 proteome. PLoS One 2014;9:e115020. [PMID: 25549250 PMCID: PMC4280198 DOI: 10.1371/journal.pone.0115020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/10/2014] [Accepted: 11/17/2014] [Indexed: 11/23/2022] Open

Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol 2014;10:e1003889. [PMID: 25375897 PMCID: PMC4222596 DOI: 10.1371/journal.pcbi.1003889] [Citation(s) in RCA: 132] [Impact Index Per Article: 12.0] [Received: 05/01/2014] [Accepted: 09/03/2014] [Indexed: 11/23/2022] Open

Abstract

Given sufficient large protein families, and using a global statistical inference approach, it is possible to obtain sufficient accuracy in protein residue contact predictions to predict the structure of many proteins. However, these approaches do not consider the fact that the contacts in a protein are neither randomly, nor independently distributed, but actually follow precise rules governed by the structure of the protein and thus are interdependent. Here, we present PconsC2, a novel method that uses a deep learning approach to identify protein-like contact patterns to improve contact predictions. A substantial enhancement can be seen for all contacts independently on the number of aligned sequences, residue separation or secondary structure type, but is largest for β-sheet containing proteins. In addition to being superior to earlier methods based on statistical inferences, in comparison to state of the art methods using machine learning, PconsC2 is superior for families with more than 100 effective sequence homologs. The improved contact prediction enables improved structure prediction.

Here, we introduce a novel protein contact prediction method PconsC2 that, to the best of our knowledge, outperforms earlier methods. PconsC2 is based on our earlier method, PconsC, as it utilizes the same set of contact predictions from plmDCA and PSICOV. However, in contrast to PconsC, where each residue pair is analysed independently, the initial predictions are analysed in context of neighbouring residue pairs using a deep learning approach, inspired by earlier work. We find that for each layer the deep learning procedure improves the predictions. At the end, after five layers of deep learning and inclusion of a few extra features provides the best performance. An improvement can be seen for all types of proteins, independent on length, number of homologous sequences and structural class. However, the improvement is largest for β-sheet containing proteins. Most importantly the improvement brings for the first time sufficiently accurate predictions to some protein families with less than 1000 homologous sequences. PconsC2 outperforms as well state of the art machine learning based predictors for protein families larger than 100 effective sequences. PconsC2 is licensed under the GNU General Public License v3 and freely available from http://c2.pcons.net/.

Grassi L, Grilli J, Lagomarsino MC. Large-scale dynamics of horizontal transfers. Mob Genet Elements 2014;2:163-167. [PMID: 23061026 PMCID: PMC3463476 DOI: 10.4161/mge.21112] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/20/2023] Open

Guo Z, Jiang W, Lages N, Borcherds W, Wang D. Relationship between gene duplicability and diversifiability in the topology of biochemical networks. BMC Genomics 2014;15:577. [PMID: 25005725 PMCID: PMC4129122 DOI: 10.1186/1471-2164-15-577] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 01/29/2014] [Accepted: 06/26/2014] [Indexed: 01/21/2023] Open

Abstract

Background

Selective gene duplicability, the extensive expansion of a small number of gene families, is universal. Quantitatively, the number of genes ($P_K$) with $K$ duplicates in a genome decreases precipitously as $K$ increases, and often follows a power law ($P_K \propto K^{-\alpha}$). Functional diversification, either neo- or sub-functionalization, is a major evolution route for duplicate genes.

Results

Using three lines of genomic datasets, we studied the relationship between gene duplicability and diversifiability in the topology of biochemical networks. First, we explored scenario where two pathways in the biochemical networks antagonize each other. Synthetic knockout of respective genes for the two pathways rescues the phenotypic defects of each individual knockout. We identified duplicate gene pairs with sufficient divergences that represent this antagonism relationship in the yeast S. cerevisiae. Such pairs overwhelmingly belong to large gene families, thus tend to have high duplicability. Second, we used distances between proteins of duplicate genes in the protein interaction network as a metric of their diversification. The higher a gene’s duplicate count, the further the proteins of this gene and its duplicates drift away from one another in the networks, which is especially true for genetically antagonizing duplicate genes. Third, we computed a sequence-homology-based clustering coefficient to quantify sequence diversifiability among duplicate genes – the lower the coefficient, the more the sequences have diverged. Duplicate count (K) of a gene is negatively correlated to the clustering coefficient of its duplicates, suggesting that gene duplicability is related to the extent of sequence divergence within the duplicate gene family.

Conclusion

Thus, a positive correlation exists between gene diversifiability and duplicability in the context of biochemical networks, a finding that improves our understanding of gene duplicability.

Merging molecular mechanism and evolution: theory and computation at the interface of biophysics and evolutionary population genetics. Curr Opin Struct Biol 2014;26:84-91. [PMID: 24952216 DOI: 10.1016/j.sbi.2014.05.005] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2014] [Revised: 04/19/2014] [Accepted: 05/16/2014] [Indexed: 11/24/2022]

Light S, Basile W, Elofsson A. Orphans and new gene origination, a structural and evolutionary perspective. Curr Opin Struct Biol 2014;26:73-83. [DOI: 10.1016/j.sbi.2014.05.006] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2014] [Revised: 05/07/2014] [Accepted: 05/16/2014] [Indexed: 12/28/2022]

Grilli J, Romano M, Bassetti F, Cosentino Lagomarsino M. Cross-species gene-family fluctuations reveal the dynamics of horizontal transfers. Nucleic Acids Res 2014;42:6850-60. [PMID: 24829449 PMCID: PMC4066789 DOI: 10.1093/nar/gku378] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open

Structural Annotation of the Mycobacterium tuberculosis Proteome. Microbiol Spectr 2014;2. [PMID: 26105824 DOI: 10.1128/microbiolspec.mgm2-0027-2013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023] Open

Kim KM, Nasir A, Caetano-Anollés G. The importance of using realistic evolutionary models for retrodicting proteomes. Biochimie 2014;99:129-37. [DOI: 10.1016/j.biochi.2013.11.019] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2013] [Accepted: 11/22/2013] [Indexed: 01/16/2023]

Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014;15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022] Open

Abstract

Background

The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high throughput-sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp.

Results

We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank-frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications.

Conclusion

Read mappability, as measured by the proportion of singletons, increases steadily up to a length scale of around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
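The mappability measure discussed in this abstract can be illustrated on a toy sequence. The sketch below counts every k-mer in a string and reports the fraction of k-mer positions whose k-mer occurs more than once. It rests on several assumptions that the abstract does not settle: the function name `non_singleton_fraction` is hypothetical, the proportion is taken over positions rather than over distinct k-mers, and reverse complements are ignored. An in-memory counter like this would also not scale to a full human genome.

```python
from collections import Counter


def non_singleton_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer appears more than once.

    Assumption: the proportion is computed over positions, and only the
    forward strand is considered.
    """
    total = len(sequence) - k + 1  # number of k-mer start positions
    if k <= 0 or total <= 0:
        raise ValueError("k must be positive and no longer than the sequence")

    counts = Counter(sequence[i:i + k] for i in range(total))

    # Every occurrence of a repeated k-mer is a non-uniquely mappable position.
    repeated_positions = sum(c for c in counts.values() if c > 1)
    return repeated_positions / total


# Toy sequence: the 4-mer "ACGT" occurs twice, every 8-mer is unique.
toy = "ACGTACGTTT"
short_k = non_singleton_fraction(toy, 4)
long_k = non_singleton_fraction(toy, 8)
```

In the toy sequence the non-singleton fraction falls as k grows, which is the qualitative trend the abstract reports; the piecewise power-law decay itself can only be seen on genome-scale data.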

Sequence and structure space model of protein divergence driven by point mutations. J Theor Biol 2013;330:1-8. [DOI: 10.1016/j.jtbi.2013.03.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2012] [Revised: 03/07/2013] [Accepted: 03/18/2013] [Indexed: 12/11/2022]

Kolodny R, Pereyaslavets L, Samson AO, Levitt M. On the Universe of Protein Folds. Annu Rev Biophys 2013;42:559-82. [DOI: 10.1146/annurev-biophys-083012-130432] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

Franzosa EA, Garamszegi S, Xia Y. Toward a three-dimensional view of protein networks between species. Front Microbiol 2012;3:428. [PMID: 23267356 PMCID: PMC3528071 DOI: 10.3389/fmicb.2012.00428] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2012] [Accepted: 12/06/2012] [Indexed: 01/27/2023] Open

Kruger FA, Rostom R, Overington JP. Mapping small molecule binding data to structural domains. BMC Bioinformatics 2012;13 Suppl 17:S11. [PMID: 23282026 PMCID: PMC3521243 DOI: 10.1186/1471-2105-13-s17-s11] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/14/2023] Open

Abstract

Background

Large-scale bioactivity/SAR Open Data has recently become available, and this has allowed new analyses and approaches to be developed to help address the productivity and translational gaps of current drug discovery. One of the current limitations of these data is the relative sparsity of reported interactions per protein target, and complexities in establishing clear relationships between bioactivity and targets using bioinformatics tools. We detail in this paper the indexing of targets by the structural domains that bind (or are likely to bind) the ligand within a full-length protein. Specifically, we present a simple heuristic to map small molecule binding to Pfam domains. This profiling can be applied to all proteins within a genome to give some indications of the potential pharmacological modulation and regulation of all proteins.

Results

In this implementation of our heuristic, ligand binding to protein targets from the ChEMBL database was mapped to structural domains as defined by profiles contained within the Pfam-A database. Our mapping suggests that the majority of assay targets within the current version of the ChEMBL database bind ligands through a small number of highly prevalent domains, and conversely the majority of Pfam domains sampled by our data play no currently established role in ligand binding. Validation studies, carried out firstly against Uniprot entries with expert binding-site annotation and secondly against entries in the wwPDB repository of crystallographic protein structures, demonstrate that our simple heuristic maps ligand binding to the correct domain in about 90 percent of all assessed cases. Using the mappings obtained with our heuristic, we have assembled ligand sets associated with each Pfam domain.

Conclusions

Small molecule binding has been mapped to Pfam-A domains of protein targets in the ChEMBL bioactivity database. The result of this mapping is an enriched annotation of small molecule bioactivity data and a grouping of activity classes following the Pfam-A specifications of protein domains. This is valuable for data-focused approaches in drug discovery, for example when extrapolating potential targets of a small molecule with known activity against one or few targets, or in the assessment of a potential target for drug discovery or screening studies.

Bottinelli A, Bassetti B, Lagomarsino MC, Gherardi M. Influence of homology and node age on the growth of protein-protein interaction networks. Phys Rev E Stat Nonlin Soft Matter Phys 2012;86:041919. [PMID: 23214627 DOI: 10.1103/physreve.86.041919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/15/2012] [Indexed: 06/01/2023]

A network synthesis model for generating protein interaction network families. PLoS One 2012;7:e41474. [PMID: 22912671 PMCID: PMC3418285 DOI: 10.1371/journal.pone.0041474] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2011] [Accepted: 06/27/2012] [Indexed: 11/19/2022] Open

The ecology of bacterial genes and the survival of the new. Int J Evol Biol 2012;2012:394026. [PMID: 22900231 PMCID: PMC3415099 DOI: 10.1155/2012/394026] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 04/21/2012] [Accepted: 06/26/2012] [Indexed: 11/18/2022]

Petersen SB, Neves-Petersen MT, Henriksen SB, Mortensen RJ, Geertz-Hansen HM. Scale-free behaviour of amino acid pair interactions in folded proteins. PLoS One 2012;7:e41322. [PMID: 22848462 PMCID: PMC3406053 DOI: 10.1371/journal.pone.0041322] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2012] [Accepted: 06/20/2012] [Indexed: 11/19/2022] Open

Current understanding of the formation and adaptation of metabolic systems based on network theory. Metabolites 2012;2:429-57. [PMID: 24957641 PMCID: PMC3901219 DOI: 10.3390/metabo2030429] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2012] [Revised: 06/26/2012] [Accepted: 07/09/2012] [Indexed: 11/17/2022] Open

Garma L, Mukherjee S, Mitra P, Zhang Y. How many protein-protein interactions types exist in nature? PLoS One 2012;7:e38913. [PMID: 22719985 PMCID: PMC3374795 DOI: 10.1371/journal.pone.0038913] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2012] [Accepted: 05/14/2012] [Indexed: 11/18/2022] Open

Haegeman B, Weitz JS. A neutral theory of genome evolution and the frequency distribution of genes. BMC Genomics 2012;13:196. [PMID: 22613814 PMCID: PMC3386021 DOI: 10.1186/1471-2164-13-196] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2012] [Accepted: 05/21/2012] [Indexed: 12/31/2022] Open

Abstract

Background

The gene composition of bacteria of the same species can differ significantly between isolates. Variability in gene composition can be summarized in terms of gene frequency distributions, in which individual genes are ranked according to the frequency of genomes in which they appear. Empirical gene frequency distributions possess a U-shape, such that there are many rare genes, some genes of intermediate occurrence, and many common genes. It would seem that U-shaped gene frequency distributions can be used to infer the essentiality and/or importance of a gene to a species. Here, we ask: can U-shaped gene frequency distributions, instead, arise generically via neutral processes of genome evolution?

Results

We introduce a neutral model of genome evolution which combines birth-death processes at the organismal level with gene uptake and loss at the genomic level. This model predicts that gene frequency distributions possess a characteristic U-shape even in the absence of selective forces driving genome and population structure. We compare the model predictions to empirical gene frequency distributions from 6 multiply sequenced species of bacterial pathogens. We fit the model with constant population size to the data, reproducing the U-shape but not all quantitative features of the distribution. We find stronger model fits in the case where we consider exponentially growing populations. We also show that two alternative models which contain a "rigid" and "flexible" core component of genomes provide strong fits to gene frequency distributions.

Conclusions

The analysis of neutral models of genome evolution suggests that U-shaped gene frequency distributions provide less information than previously suggested regarding gene essentiality. We discuss the need for additional theory and genomic level information to disentangle the roles of evolutionary mechanisms operating within and amongst individuals in driving the dynamics of gene distributions.

Burke S, Elber R. Super folds, networks, and barriers. Proteins 2012;80:463-70. [PMID: 22095563 PMCID: PMC3290721 DOI: 10.1002/prot.23212] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2011] [Revised: 08/31/2011] [Accepted: 09/22/2011] [Indexed: 11/06/2022]

Modeling gene family evolution and reconciling phylogenetic discord. Methods Mol Biol 2012;856:29-51. [PMID: 22399454 DOI: 10.1007/978-1-61779-585-5_2] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]