1
|
Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, Kodzius R, Shimokawa K, Bajic VB, Brenner SE, Batalov S, Forrest ARR, Zavolan M, Davis MJ, Wilming LG, Aidinis V, Allen JE, Ambesi-Impiombato A, Apweiler R, Aturaliya RN, Bailey TL, Bansal M, Baxter L, Beisel KW, Bersano T, Bono H, Chalk AM, Chiu KP, Choudhary V, Christoffels A, Clutterbuck DR, Crowe ML, Dalla E, Dalrymple BP, de Bono B, Della Gatta G, di Bernardo D, Down T, Engstrom P, Fagiolini M, Faulkner G, Fletcher CF, Fukushima T, Furuno M, Futaki S, Gariboldi M, Georgii-Hemming P, Gingeras TR, Gojobori T, Green RE, Gustincich S, Harbers M, Hayashi Y, Hensch TK, Hirokawa N, Hill D, Huminiecki L, Iacono M, Ikeo K, Iwama A, Ishikawa T, Jakt M, Kanapin A, Katoh M, Kawasawa Y, Kelso J, Kitamura H, Kitano H, Kollias G, Krishnan SPT, Kruger A, Kummerfeld SK, Kurochkin IV, Lareau LF, Lazarevic D, Lipovich L, Liu J, Liuni S, McWilliam S, Madan Babu M, Madera M, Marchionni L, Matsuda H, Matsuzawa S, Miki H, Mignone F, Miyake S, Morris K, Mottagui-Tabar S, Mulder N, Nakano N, Nakauchi H, Ng P, Nilsson R, Nishiguchi S, Nishikawa S, Nori F, Ohara O, Okazaki Y, Orlando V, Pang KC, Pavan WJ, Pavesi G, Pesole G, Petrovsky N, Piazza S, Reed J, Reid JF, Ring BZ, Ringwald M, Rost B, Ruan Y, Salzberg SL, Sandelin A, Schneider C, Schönbach C, Sekiguchi K, Semple CAM, Seno S, Sessa L, Sheng Y, Shibata Y, Shimada H, Shimada K, Silva D, Sinclair B, Sperling S, Stupka E, Sugiura K, Sultana R, Takenaka Y, Taki K, Tammoja K, Tan SL, Tang S, Taylor MS, Tegner J, Teichmann SA, Ueda HR, van Nimwegen E, Verardo R, Wei CL, Yagi K, Yamanishi H, Zabarovsky E, Zhu S, Zimmer A, Hide W, Bult C, Grimmond SM, Teasdale RD, Liu ET, Brusic V, Quackenbush J, Wahlestedt C, Mattick JS, Hume DA, Kai C, Sasaki D, Tomaru Y, Fukuda S, Kanamori-Katayama M, Suzuki M, Aoki J, Arakawa T, Iida J, Imamura K, Itoh M, Kato T, Kawaji H, Kawagashira N, Kawashima T, Kojima M, Kondo S, Konno H, Nakano K, Ninomiya N, Nishio T, Okada M, Plessy C, Shibata K, Shiraki T, Suzuki S, Tagami M, Waki K, Watahiki A, Okamura-Oho Y, Suzuki H, Kawai J, Hayashizaki Y, FANTOM Consortium, RIKEN Genome Exploration Research Group and Genome Science Group (Genome Network Project Core Group). The transcriptional landscape of the mammalian genome. Science 2005; 309:1559-63. [PMID: 16141072 DOI: 10.1126/science.1112014] [Citation(s) in RCA: 2671] [Impact Index Per Article: 133.6] [Indexed: 12/29/2022]
Abstract
This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.
|
|
20 |
2671 |
2
|
Ramachandran P, Dobie R, Wilson-Kanamori JR, Dora EF, Henderson BEP, Luu NT, Portman JR, Matchett KP, Brice M, Marwick JA, Taylor RS, Efremova M, Vento-Tormo R, Carragher NO, Kendall TJ, Fallowfield JA, Harrison EM, Mole DJ, Wigmore SJ, Newsome PN, Weston CJ, Iredale JP, Tacke F, Pollard JW, Ponting CP, Marioni JC, Teichmann SA, Henderson NC. Resolving the fibrotic niche of human liver cirrhosis at single-cell level. Nature 2019; 575:512-518. [PMID: 31597160 PMCID: PMC6876711 DOI: 10.1038/s41586-019-1631-3] [Citation(s) in RCA: 1072] [Impact Index Per Article: 178.7] [Received: 09/04/2018] [Accepted: 09/04/2019] [Indexed: 12/13/2022]
Abstract
Liver cirrhosis is a major cause of death worldwide and is characterized by extensive fibrosis. There are currently no effective antifibrotic therapies available. To obtain a better understanding of the cellular and molecular mechanisms involved in disease pathogenesis and enable the discovery of therapeutic targets, here we profile the transcriptomes of more than 100,000 single human cells, yielding molecular definitions for non-parenchymal cell types that are found in healthy and cirrhotic human liver. We identify a scar-associated TREM2+CD9+ subpopulation of macrophages, which expands in liver fibrosis, differentiates from circulating monocytes and is pro-fibrogenic. We also define ACKR1+ and PLVAP+ endothelial cells that expand in cirrhosis, are topographically restricted to the fibrotic niche and enhance the transmigration of leucocytes. Multi-lineage modelling of ligand and receptor interactions between the scar-associated macrophages, endothelial cells and PDGFRα+ collagen-producing mesenchymal cells reveals intra-scar activity of several pro-fibrogenic pathways including TNFRSF12A, PDGFR and NOTCH signalling. Our work dissects unanticipated aspects of the cellular and molecular basis of human organ fibrosis at a single-cell level, and provides a conceptual framework for the discovery of rational therapeutic targets in liver cirrhosis.
|
research-article |
6 |
1072 |
3
|
Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, Howlett SK, Suchanek O, Polanski K, King HW, Mamanova L, Huang N, Szabo PA, Richardson L, Bolt L, Fasouli ES, Mahbubani KT, Prete M, Tuck L, Richoz N, Tuong ZK, Campos L, Mousa HS, Needham EJ, Pritchard S, Li T, Elmentaite R, Park J, Rahmani E, Chen D, Menon DK, Bayraktar OA, James LK, Meyer KB, Yosef N, Clatworthy MR, Sims PA, Farber DL, Saeb-Parsy K, Jones JL, Teichmann SA. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 2022; 376:eabl5197. [PMID: 35549406 PMCID: PMC7612735 DOI: 10.1126/science.abl5197] [Citation(s) in RCA: 452] [Impact Index Per Article: 150.7] [Indexed: 02/02/2023]
Abstract
Despite their crucial role in health and disease, our knowledge of immune cells within human tissues remains limited. We surveyed the immune compartment of 16 tissues from 12 adult donors by single-cell RNA sequencing and VDJ sequencing generating a dataset of ~360,000 cells. To systematically resolve immune cell heterogeneity across tissues, we developed CellTypist, a machine learning tool for rapid and precise cell type annotation. Using this approach, combined with detailed curation, we determined the tissue distribution of finely phenotyped immune cell types, revealing hitherto unappreciated tissue-specific features and clonal architecture of T and B cells. Our multitissue approach lays the foundation for identifying highly resolved immune cell types by leveraging a common reference dataset, tissue-integrated expression analysis, and antigen receptor sequencing.
|
research-article |
3 |
452 |
4
|
Apic G, Gough J, Teichmann SA. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 2001; 310:311-25. [PMID: 11428892 DOI: 10.1006/jmbi.2001.4776] [Citation(s) in RCA: 353] [Impact Index Per Article: 14.7] [Indexed: 11/22/2022]
Abstract
There is a limited repertoire of domain families that are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products, and at the level of genes, duplication, recombination, fusion and fission are the processes that produce new genes. We attempt to gain an overview of these processes by studying the evolutionary units in proteins, domains, in the protein sequences of 40 genomes. The domain and superfamily definitions in the Structural Classification of Proteins Database are used, so that we can view all pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 783 out of the 859 superfamilies in SCOP in these genomes, and the 783 families occur in 1307 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour; 209 families do not make combinations with other families. This type of pattern can be described as a scale-free network. We also study the N to C-terminal orientation of domain pairs and domain repeats. The phylogenetic distribution of domain combinations is surveyed, to establish the extent of common and kingdom-specific combinations. Of the kingdom-specific combinations, significantly more combinations consist of families present in all three kingdoms than of families present in one or two kingdoms. Hence, we are led to conclude that recombination between common families, as compared to the invention of new families and recombination among these, has also been a major contribution to the evolution of kingdom-specific and species-specific functions in organisms in all three kingdoms. Finally, we compare the set of the domain combinations in the genomes to those in the RCSB Protein Data Bank, and discuss the implications for structural genomics.
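The counting step described in this abstract (viewing adjacent domains as superfamily pairs and asking how many partners each family has) can be illustrated with a minimal sketch. The superfamily labels (`sfA` to `sfD`), the protein names, and the choice to treat a pair as unordered and to ignore same-family repeats are assumptions made for illustration; they are not taken from the paper or from SCOP.

```python
def adjacent_combinations(architectures):
    """Collect the distinct pairs of neighbouring superfamilies.

    `architectures` maps a protein name to its domains in N-to-C order.
    Pairs are stored unordered (an assumption of this sketch).
    """
    combos = set()
    for domains in architectures.values():
        for left, right in zip(domains, domains[1:]):
            combos.add(tuple(sorted((left, right))))
    return combos


def partner_counts(architectures):
    """Count, for every family, how many different families it neighbours."""
    partners = {}
    for domains in architectures.values():
        for fam in domains:
            partners.setdefault(fam, set())
        for left, right in zip(domains, domains[1:]):
            if left != right:  # same-family repeats are not counted as partners
                partners[left].add(right)
                partners[right].add(left)
    return {fam: len(p) for fam, p in partners.items()}


# Hypothetical toy proteome, not real data.
architectures = {
    "protein1": ["sfA", "sfB"],
    "protein2": ["sfB", "sfA", "sfC"],
    "protein3": ["sfC"],
    "protein4": ["sfD", "sfD"],
}

combos = adjacent_combinations(architectures)
counts = partner_counts(architectures)
print(sorted(combos))
print(counts)
```

In this toy input `sfA` is the versatile family with two partners, while `sfD` only repeats with itself and so has none; a scale-free pattern of the kind the abstract mentions would appear as many families with one or two partners and a few with many.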
|
|
24 |
353 |
5
|
Madissoon E, Wilbrey-Clark A, Miragaia RJ, Saeb-Parsy K, Mahbubani KT, Georgakopoulos N, Harding P, Polanski K, Huang N, Nowicki-Osuch K, Fitzgerald RC, Loudon KW, Ferdinand JR, Clatworthy MR, Tsingene A, van Dongen S, Dabrowska M, Patel M, Stubbington MJT, Teichmann SA, Stegle O, Meyer KB. scRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol 2019; 21:1. [PMID: 31892341 PMCID: PMC6937944 DOI: 10.1186/s13059-019-1906-x] [Citation(s) in RCA: 296] [Impact Index Per Article: 49.3] [Received: 08/09/2019] [Accepted: 11/28/2019] [Indexed: 02/02/2023]
Abstract
BACKGROUND The Human Cell Atlas is a large international collaborative effort to map all cell types of the human body. Single-cell RNA sequencing can generate high-quality data for the delivery of such an atlas. However, delays between fresh sample collection and processing may lead to poor data and difficulties in experimental design. RESULTS This study assesses the effect of cold storage on fresh healthy spleen, esophagus, and lung from ≥ 5 donors over 72 h. We collect 240,000 high-quality single-cell transcriptomes with detailed cell type annotations and whole genome sequences of donors, enabling future eQTL studies. Our data provide a valuable resource for the study of these 3 organs and will allow cross-organ comparison of cell types. We see little effect of cold ischemic time on cell yield, total number of reads per cell, and other quality control metrics in any of the tissues within the first 24 h. However, we observe a decrease in the proportions of lung T cells at 72 h, higher percentage of mitochondrial reads, and increased contamination by background ambient RNA reads in the 72-h samples in the spleen, which is cell type specific. CONCLUSIONS In conclusion, we present robust protocols for tissue preservation for up to 24 h prior to scRNA-seq analysis. This greatly facilitates the logistics of sample collection for Human Cell Atlas or clinical studies since it increases the time frames for sample processing.
|
Evaluation Study |
6 |
296 |
6
|
Park J, Teichmann SA, Hubbard T, Chothia C. Intermediate sequences increase the detection of homology between sequences. J Mol Biol 1997; 273:349-54. [PMID: 9367767 DOI: 10.1006/jmbi.1997.1288] [Citation(s) in RCA: 146] [Impact Index Per Article: 5.2] [Indexed: 02/05/2023]
Abstract
Two homologous sequences, which have diverged beyond the point where their homology can be recognised by a simple direct comparison, can be related through a third sequence that is suitably intermediate between the two. High scores, for a sequence match between the first and third sequences and between the second and the third sequences, imply that the first and second sequences are related even though their own match score is low. We have tested the usefulness of this idea using a database that contains the sequences of 971 protein domains whose structures are known and whose residue identities with each other are some 40% or less (PDB40D). On the basis of sequence and structural information, 2143 pairs of these sequences are known to have an evolutionary relationship. FASTA, in an all-against-all comparison of the sequences in the database, detected 320 (15%) of these relationships as well as three false positives (i.e. 1% error rate). Using intermediate sequences found by FASTA matches of PDB40D sequences to those in the large non-redundant OWL database we could detect 550 evolutionary relationships with an error rate of 1%. This means the intermediate sequence procedure increases the ability to recognise the evolutionary relationships amongst the PDB40D sequences by 70%.
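The core idea of the abstract, relating two sequences through a third that matches both, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the sequence names, scores and threshold are hypothetical, and all sequences sit in one pool, whereas the paper draws its intermediates from a separate large database.

```python
from itertools import combinations


def direct_hits(scores, threshold):
    """Pairs whose own match score reaches the threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}


def intermediate_hits(scores, threshold):
    """Pairs that lack a direct hit but share a high-scoring third sequence."""
    direct = direct_hits(scores, threshold)
    neighbours = {}
    for pair in direct:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    linked = set()
    for partners in neighbours.values():
        # Any two sequences that both match the same intermediate are linked.
        for a, b in combinations(sorted(partners), 2):
            pair = frozenset((a, b))
            if pair not in direct:
                linked.add(pair)
    return linked


# Hypothetical scores: seqA and seqB match seqC well but each other poorly.
scores = {
    frozenset(("seqA", "seqC")): 120.0,
    frozenset(("seqB", "seqC")): 95.0,
    frozenset(("seqA", "seqB")): 12.0,
}

direct = direct_hits(scores, threshold=50.0)
via = intermediate_hits(scores, threshold=50.0)
print(sorted(tuple(sorted(p)) for p in via))
```

Here `seqA` and `seqB` are reported as related only through `seqC`, which mirrors the situation the abstract describes for pairs whose own match score is low.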
|
|
28 |
146 |
7
|
Teichmann SA, Park J, Chothia C. Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci U S A 1998; 95:14658-63. [PMID: 9843945 PMCID: PMC24505 DOI: 10.1073/pnas.95.25.14658] [Citation(s) in RCA: 112] [Impact Index Per Article: 4.1] [Indexed: 11/18/2022]
Abstract
The parasitic bacterium Mycoplasma genitalium has a small, reduced genome with close to a basic set of genes. As a first step toward determining the families of protein domains that form the products of these genes, we have used the multiple sequence programs PSI-BLAST and GEANFAMMER to match the sequences of the 467 gene products of M. genitalium to the sequences of the domains that form proteins of known structure [Protein Data Bank (PDB) sequences]. PDB sequences (274) match all of 106 M. genitalium sequences and some parts of another 85; thus, 41% of its total sequences are matched in all or part. The evolutionary relationships of the PDB domains that match M. genitalium are described in the structural classification of proteins (SCOP) database. Using this information, we show that the domains in the matched M. genitalium sequences come from 114 superfamilies and that 58% of them have arisen by gene duplication. This level of duplication is more than twice that found by using pairwise sequence comparisons. The PDB domain matches also describe the domain structure of the matched sequences: just over a quarter contain one domain and the rest have combinations of two or more domains.
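The 41% coverage figure follows directly from the counts given in the abstract (106 sequences matched in full, another 85 matched in part, 467 gene products in total). A short check, with variable names chosen only for this illustration:

```python
total_gene_products = 467   # M. genitalium gene products, from the abstract
fully_matched = 106         # sequences matched along their whole length
partly_matched = 85         # sequences matched over some parts only

matched = fully_matched + partly_matched
fraction = matched / total_gene_products
print(f"{matched} of {total_gene_products} matched ({fraction:.1%})")
```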
|
research-article |
27 |
112 |
8
|
Abstract
New computational techniques have allowed protein folds to be assigned to all or parts of between a quarter (Caenorhabditis elegans) and a half (Mycoplasma genitalium) of the individual protein sequences in different genomes. These assignments give a new perspective on domain structures, gene duplications, protein families and protein folds in genome sequences.
|
Review |
26 |
110 |
9
|
Park J, Lappe M, Teichmann SA. Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J Mol Biol 2001; 307:929-38. [PMID: 11273711 DOI: 10.1006/jmbi.2001.4526] [Citation(s) in RCA: 108] [Impact Index Per Article: 4.5] [Indexed: 11/22/2022]
Abstract
In the postgenomic era, one of the most interesting and important challenges is to understand protein interactions on a large scale. The physical interactions between protein domains are fundamental to the workings of a cell: in multi-domain polypeptide chains, in multi-subunit proteins and in transient complexes between proteins that also exist independently. To study the large-scale patterns and evolution of interactions between protein domains, we view interactions between protein domains in terms of the interactions between structural families of evolutionarily related domains. This allows us to classify 8151 interactions between individual domains in the Protein Data Bank and the yeast Saccharomyces cerevisiae in terms of 664 types of interactions, between protein families. At least 51 interactions do not occur in the Protein Data Bank and can only be derived from the yeast data. The map of interactions between protein families has the form of a scale-free network, meaning that most protein families only interact with one or two other families, while a few families are extremely versatile in their interactions and are connected to many families. We observe that almost half of all known families engage in interactions with domains from their own family. We also see that the repertoires of interactions of domains within and between polypeptide chains overlap mostly for two specific types of protein families: enzymes and same-family interactions. This suggests that different types of protein interaction repertoires exist for structural, functional and regulatory reasons.
|
|
24 |
108 |
10
|
Fuchs M, Hafer A, Münch C, Kannenberg F, Teichmann S, Scheibner J, Stange EF, Seedorf U. Disruption of the sterol carrier protein 2 gene in mice impairs biliary lipid and hepatic cholesterol metabolism. J Biol Chem 2001; 276:48058-65. [PMID: 11673458 DOI: 10.1074/jbc.m106732200] [Citation(s) in RCA: 85] [Impact Index Per Article: 3.5] [Indexed: 11/06/2022]
Abstract
Hepatic up-regulation of sterol carrier protein 2 (Scp2) in mice promotes hypersecretion of cholesterol into bile and gallstone formation in response to a lithogenic diet. We hypothesized that Scp2 deficiency may alter biliary lipid secretion and hepatic cholesterol metabolism. Male gallstone-susceptible C57BL/6 and C57BL/6(Scp2(-/-)) knockout mice were fed a standard chow or lithogenic diet. Hepatic biles were collected to determine biliary lipid secretion rates, bile flow, and bile salt pool size. Plasma lipoprotein distribution was investigated, and gene expression of cytosolic lipid-binding proteins, lipoprotein receptors, hepatic regulatory enzymes, and intestinal cholesterol absorption was measured. Compared with chow-fed wild-type animals, C57BL/6(Scp2(-/-)) mice had higher bile flow and lower bile salt secretion rates, decreased hepatic apolipoprotein expression, increased hepatic cholesterol synthesis, and up-regulation of liver fatty acid-binding protein. In addition, the bile salt pool size was reduced and intestinal cholesterol absorption was unaltered in C57BL/6(Scp2(-/-)) mice. When C57BL/6(Scp2(-/-)) mice were challenged with a lithogenic diet, a smaller increase of hepatic free cholesterol failed to suppress cholesterol synthesis and biliary cholesterol secretion increased to a much smaller extent than phospholipid and bile salt secretion. Scp2 deficiency did not prevent gallstone formation and may be compensated in part by hepatic up-regulation of liver fatty acid-binding protein. These results support a role of Scp2 in hepatic cholesterol metabolism, biliary lipid secretion, and intracellular cholesterol distribution.
|
|
24 |
85 |
11
|
Teichmann SA, Murzin AG, Chothia C. Determination of protein function, evolution and interactions by structural genomics. Curr Opin Struct Biol 2001; 11:354-63. [PMID: 11406387 DOI: 10.1016/s0959-440x(00)00215-3] [Citation(s) in RCA: 85] [Impact Index Per Article: 3.5] [Indexed: 10/27/2022]
Abstract
The genome sequencing projects and knowledge of the entire protein repertoires of many organisms have prompted new procedures and techniques for the large-scale determination of protein structure, function and interactions. Recently, new work has been carried out on the determination of the function and evolutionary relationships of proteins by experimental structural genomics, and the discovery of protein-protein interactions by computational structural genomics.
|
Review |
24 |
85 |
12
|
Abstract
The predicted proteins of the genome of Caenorhabditis elegans were analysed by various sequence comparison methods to identify the repertoire of proteins that are members of the immunoglobulin superfamily (IgSF). The IgSF is one of the largest families of protein domain in this genome and likely to be one of the major families in other multicellular eukaryotes too. This is because members of the superfamily are involved in a variety of functions including cell-cell recognition, cell-surface receptors, muscle structure and, in higher organisms, the immune system. Sixty-four proteins with 488 I set IgSF domains were identified largely by using Hidden Markov models. The domain architectures of the protein products of these 64 genes are described. Twenty-one of these had been characterised previously. We show that another 25 are related to proteins of known function. The C. elegans IgSF proteins can be classified into five broad categories: muscle proteins, protein kinases and phosphatases, three categories of proteins involved in the development of the nervous system, leucine-rich repeat containing proteins and proteins without homologues of known function, of which there are 18. The 19 proteins involved in nervous system development that are not kinases or phosphatases are homologues of neuroglian, axonin, NCAM, wrapper, klingon, ICCR and nephrin or belong to the recently identified zig gene family. Out of the set of 64 genes, 22 are on the X chromosome. This study should be seen as an initial description of the IgSF repertoire in C. elegans, because the current gene definitions may contain a number of errors, especially in the case of long sequences, and there may be IgSF genes that have not yet been detected. However, the proteins described here do provide an overview of the bulk of the repertoire of immunoglobulin superfamily members in C. elegans, a framework for refinement and extension of the repertoire as gene and protein definitions improve, and the basis for investigations of their function and for comparisons with the repertoires of other organisms.
|
|
25 |
82 |
13
|
Abstract
Domains are the building blocks of all globular proteins, and are units of compact three-dimensional structure as well as evolutionary units. There is a limited repertoire of domain families, so that these domain families are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products. The processes that produce new genes are duplication and recombination as well as gene fusion and fission. We attempt to gain an overview of these processes by studying the structural domains in the proteins of seven genomes from the three kingdoms of life: Eubacteria, Archaea and Eukaryota. We use here the domain and superfamily definitions in Structural Classification of Proteins Database (SCOP) in order to map pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 624 out of the 764 superfamilies in SCOP in these genomes, and the 624 families occur in 585 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour. This type of pattern can be described by a scale-free network. Finally, we study domain repeats and we compare the set of the domain combinations in the genomes to those in PDB, and discuss the implications for structural genomics.
|
|
23 |
79 |
14
|
Bornberg-Bauer E, Beaussart F, Kummerfeld SK, Teichmann SA, Weiner J. The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci 2005; 62:435-45. [PMID: 15719170 PMCID: PMC11924419 DOI: 10.1007/s00018-004-4416-1] [Citation(s) in RCA: 78] [Impact Index Per Article: 3.9] [Indexed: 10/25/2022]
Abstract
Proteins are composed of domains, which are conserved evolutionary units that often also correspond to functional units and can frequently be detected with reasonable reliability using computational methods. Most proteins consist of two or more domains, giving rise to a variety of combinations of domains. Another level of complexity arises because proteins themselves can form complexes with small molecules, nucleic acids and other proteins. The networks of both domain combinations and protein interactions can be conceptualised as graphs, and these graphs can be analysed conveniently by computational methods. In this review we summarise facts and hypotheses about the evolution of domains in multi-domain proteins and protein complexes, and the tools and data resources available to study them.
|
Review |
20 |
78 |
15
|
Abstract
We report the discovery of a novel family of proteins, each member of which contains tandem pentapeptide (five residue) repeats, described by the motif A(D/N)LXX. Members of this family are both membrane bound and cytoplasmic. The function of these repeats is uncertain, but they may have a targeting or structural function rather than enzymatic activity. This family is most common in cyanobacteria, suggesting a function related to cyanobacterial-specific metabolism. Although no experimental information is available for the structure of this family, it is predicted that the tandem pentapeptide repeats will form a right-handed beta-helical structure. A structural model of the pentapeptide repeats is presented.
|
research-article |
27 |
74 |
16
|
Teichmann SA, Rison SC, Thornton JM, Riley M, Gough J, Chothia C. The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J Mol Biol 2001; 311:693-708. [PMID: 11518524 DOI: 10.1006/jmbi.2001.4912] [Citation(s) in RCA: 70] [Impact Index Per Article: 2.9] [Indexed: 11/22/2022]
Abstract
The 106 small molecule metabolic (SMM) pathways in Escherichia coli are formed by the protein products of 581 genes. We can define 722 domains, nearly all of which are homologous to proteins of known structure, that form all or part of 510 of these proteins. This information allows us to answer general questions on the structural anatomy of the SMM pathway proteins and to trace family relationships and recruitment events within and across pathways. Half the gene products contain a single domain and half are formed by combinations of between two and six domains. The 722 domains belong to one of 213 families that have between one and 51 members. Family members usually conserve their catalytic or cofactor binding properties; substrate recognition is rarely conserved. Of the 213 families, members of only a quarter occur in isolation, i.e. they form single-domain proteins. Most members of the other families combine with domains from just one or two other families and a few more versatile families can combine with several different partners. Excluding isoenzymes, more than twice as many homologues are distributed across pathways as within pathways. However, serial recruitment, with two consecutive enzymes both being recruited to another pathway, is rare and recruitment of three consecutive enzymes is not observed. Only eight of the 106 pathways have a high number of homologues. Homology between consecutive pairs of enzymes with conservation of the main substrate-binding site but change in catalytic mechanism (which would support a simple model of retrograde pathway evolution) occurs only six times in the whole set of enzymes. Most of the domains that form SMM pathways have homologues in non-SMM pathways. Taken together, these results imply a pervasive "mosaic" model for the formation of protein repertoires and pathways.
|
|
24 |
70 |
17
|
Abstract
Using the sequence information from nine completely sequenced bacterial genomes, we extract 32 protein families that are thought to contain orthologous proteins from each genome. The alignments of these 32 families are used to construct a phylogeny with the neighbor-joining algorithm. This tree has several topological features that are different from the conventional phylogeny, yet it is highly reliable according to its bootstrap values. Upon closer study of the individual families used, it is clear that the strong phylogenetic signal comes from three families, at least two of which are good candidates for horizontal transfer. The tree from the remaining 29 families consists almost entirely of noise at the level of bacterial phylum divisions, indicating that, even with large amounts of data, it may not be possible to reconstruct the prokaryote phylogeny using standard sequence-based methods.
|
|
26 |
63 |
18
|
Abstract
Telomerases are RNA-dependent polymerases that catalyse the synthesis of the telomeric DNA at the tips of eukaryotic chromosomes. The recent identification of the catalytic subunit of telomerases from several different species suggests that the core of the telomerase is conserved. The proposed sequence and structural homology between the telomerase catalytic subunit and reverse transcriptases, together with a wealth of genetic and biochemical information, has led to significant advances in our understanding of the mechanism by which telomerases synthesise telomeric DNA.
|
Review |
26 |
48 |
19
|
Park J, Teichmann SA. DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins. Bioinformatics 1998; 14:144-50. [PMID: 9545446 DOI: 10.1093/bioinformatics/14.2.144] [Citation(s) in RCA: 42] [Impact Index Per Article: 1.6] [Indexed: 02/07/2023]
Abstract
MOTIVATION: Large-scale determination of relationships between the proteins produced by genome sequences is now common. All protein sequences are matched and those that have high match scores are clustered into families. In cases where the proteins are built of several domains or duplication modules, this can lead to misleading results. Consider the very simple example of three proteins: 1, formed by duplication modules A and B; 2, formed by duplication modules B' and C; and 3, formed by duplication modules C' and D. Duplication modules B and B' are homologous, as are C and C'. Matching the sequences of 1, 2 and 3 followed by simple single-linkage clustering would put all three in the same family, even though proteins 1 and 3 are not related. This is because the different parts of 2 match 1 and 3. This paper describes a procedure, DIVCLUS, that divides such complex clusters of partially related sequences into simple clusters that contain only related duplication modules. In the example just given, it would produce two groups of sequences: the first with domains B of sequence 1 and B' of sequence 2, and the second with domains C of sequence 2 and C' of sequence 3. DIVCLUS is part of a package called GEANFAMMER, for GEnome ANalysis and protein FAMily MakER. The package automates the detection of families of duplication modules from a protein sequence database.
RESULTS: DIVCLUS has been applied to the division of single-linkage clusters generated from the protein sequences of six completely sequenced bacterial genomes. Out of 12 013 genes in these six genomes, 4563 single- and multi-domain sequences formed 1071 complex clusters. Application of the DIVCLUS program resolved these clusters into 2113 clusters corresponding to single duplication modules.
AVAILABILITY: The perl5 program and its documentation are available at the following address: http://www.mrc-lmb.cam.ac.uk/genomes/ and by anonymous ftp at ftp.mrc-lmb.cam.ac.uk in the directory /pub/genomes/Software/.
CONTACT: sat@mrc-lmb.cam.ac.uk; jong@mrc-lmb.cam.ac.uk
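The three-protein example in this abstract can be illustrated with a small Python sketch. It is not the DIVCLUS algorithm itself (which is distributed as a perl5 program); it only contrasts single-linkage clustering of whole proteins with the same clustering applied to individual duplication modules. The function name `single_linkage` and the toy data layout are assumptions made for illustration.

```python
from collections import defaultdict

def single_linkage(items, links):
    """Group items into connected components of the match graph (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for x in items:
        groups[find(x)].add(x)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

# Hypothetical toy data mirroring the abstract's example:
# protein 1 = A + B, protein 2 = B' + C, protein 3 = C' + D.
proteins = {"1": ["A", "B"], "2": ["B'", "C"], "3": ["C'", "D"]}

# Homologous module pairs, written as (protein, module) tuples.
module_matches = [
    (("1", "B"), ("2", "B'")),
    (("2", "C"), ("3", "C'")),
]

# Whole-protein view: any module match links the two proteins.
protein_links = [(a[0], b[0]) for a, b in module_matches]
whole = single_linkage(list(proteins), protein_links)

# Module-level view: cluster the modules themselves, keep matched groups.
modules = [(p, m) for p, ms in proteins.items() for m in ms]
by_module = [g for g in single_linkage(modules, module_matches) if len(g) > 1]

print(whole)      # one family containing proteins 1, 2 and 3
print(by_module)  # two groups: B with B', and C with C'
```

Clustering at the module level keeps proteins 1 and 3 apart because they share no matched module, which is the outcome the abstract describes for DIVCLUS.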
Teichmann SA, Rison SC, Thornton JM, Riley M, Gough J, Chothia C. Small-molecule metabolism: an enzyme mosaic. Trends Biotechnol 2001; 19:482-6. [PMID: 11711174 DOI: 10.1016/s0167-7799(01)01813-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.7] [Indexed: 11/25/2022]
Abstract
Escherichia coli has been a popular organism for studying metabolic pathways. In an attempt to find out more about how these pathways are constructed, the enzymes were analysed by defining their protein domains. Structural assignments and sequence comparisons were used to show that 213 domain families constitute approximately 90% of the enzymes in the small-molecule metabolic pathways. Catalytic or cofactor-binding properties between family members are often conserved, while recognition of the main substrate with change in catalytic mechanism is only observed in a few cases of consecutive enzymes in a pathway. Recruitment of domains across pathways is very common, but there is little regularity in the pattern of domains in metabolic pathways. This is analogous to a mosaic in which a stone of a certain colour is selected to fill a position in the picture.
Qian J, Stenger B, Wilson CA, Lin J, Jansen R, Teichmann SA, Park J, Krebs WG, Yu H, Alexandrov V, Echols N, Gerstein M. PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucleic Acids Res 2001; 29:1750-64. [PMID: 11292848 PMCID: PMC31319 DOI: 10.1093/nar/29.8.1750] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.6] [Received: 11/15/2000] [Revised: 02/27/2001] [Accepted: 02/27/2001] [Indexed: 11/14/2022]
Abstract
As the number of protein folds is quite limited, a mode of analysis that will be increasingly common in the future, especially with the advent of structural genomics, is to survey and re-survey the finite parts list of folds from an expanding number of perspectives. We have developed a new resource, called PartsList, that lets one dynamically perform these comparative fold surveys. It is available on the web at http://bioinfo.mbb.yale.edu/partslist and http://www.partslist.org. The system is based on the existing fold classifications and functions as a form of companion annotation for them, providing 'global views' of many already completed fold surveys. The central idea in the system is that of comparison through ranking; PartsList will rank the approximately 420 folds based on more than 180 attributes. These include: (i) occurrence in a number of completely sequenced genomes (e.g. it will show the most common folds in the worm versus yeast); (ii) occurrence in the structure databank (e.g. most common folds in the PDB); (iii) both absolute and relative gene expression information (e.g. most changing folds in expression over the cell cycle); (iv) protein-protein interactions, based on experimental data in yeast and comprehensive PDB surveys (e.g. most interacting fold); (v) sensitivity to inserted transposons; (vi) the number of functions associated with the fold (e.g. most multi-functional folds); (vii) amino acid composition (e.g. most Cys-rich folds); (viii) protein motions (e.g. most mobile folds); and (ix) the level of similarity based on a comprehensive set of structural alignments (e.g. most structurally variable folds). The integration of whole-genome expression and protein-protein interaction data with structural information is a particularly novel feature of our system. 
We provide three ways of visualizing the rankings: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a dynamic comparer for custom comparisons and a numerical rankings correlator. These allow one to directly compare very different attributes of a fold (e.g. expression level, genome occurrence and maximum motion) in the uniform numerical format of ranks. This uniform framework, in turn, highlights the way that the frequency of many of the attributes falls off with approximate power-law behavior (i.e. according to V^(-b), for attribute value V and constant exponent b), with a few folds having large values and most having small values.
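The power-law behavior mentioned in this abstract can be made concrete with a short Python sketch. It is not part of PartsList; it assumes the frequency of an attribute value V is modeled as a * V^(-b) and estimates a and b with an ordinary least-squares line in log-log space, which is a rough estimator. The function name `fit_power_law` and the synthetic data are hypothetical.

```python
import math

def fit_power_law(values, freqs):
    """Fit freqs ~ a * values**(-b) by least squares on log-log data.

    Returns (a, b). Assumes all values and freqs are strictly positive.
    """
    xs = [math.log(v) for v in values]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated from a known law (a = 1000, b = 1.5);
# these numbers are illustrative only and do not come from PartsList.
values = [1, 2, 4, 8, 16]
freqs = [1000 * v ** -1.5 for v in values]

a, b = fit_power_law(values, freqs)
print(round(a, 3), round(b, 3))
```

On real fold data the points would scatter around the line, so the fitted exponent should be read as an approximate summary rather than an exact constant.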
Smith BO, Ito Y, Raine A, Teichmann S, Ben-Tovim L, Nietlispach D, Broadhurst RW, Terada T, Kelly M, Oschkinat H, Shibata T, Yokoyama S, Laue ED. An approach to global fold determination using limited NMR data from larger proteins selectively protonated at specific residue types. Journal of Biomolecular NMR 1996; 8:360-368. [PMID: 20686886 DOI: 10.1007/bf00410335] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.2] [Received: 05/17/1996] [Accepted: 10/02/1996] [Indexed: 05/29/2023]
Abstract
A combination of calculation and experiment is used to demonstrate that the global fold of larger proteins can be rapidly determined using limited NMR data. The approach involves a combination of heteronuclear triple resonance NMR experiments with protonation of selected residue types in an otherwise completely deuterated protein. This method of labelling produces proteins with alpha-specific deuteration in the protonated residues, and the results suggest that this will improve the sensitivity of experiments involving correlation of side-chain (¹H and ¹³C) and backbone (¹H and ¹⁵N) amide resonances. It will allow the rapid assignment of backbone resonances with high sensitivity and the determination of a reasonable structural model of a protein based on limited NOE restraints, an application that is of increasing importance as data from the large number of genome sequencing projects accumulates. The method that we propose should also be of utility in extending the use of NMR spectroscopy to determine the structures of larger proteins.
Teichmann SA, Chothia C, Church GM, Park J. Fast assignment of protein structures to sequences using the intermediate sequence library PDB-ISL. Bioinformatics 2000; 16:117-24. [PMID: 10842732 DOI: 10.1093/bioinformatics/16.2.117] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022]
Abstract
MOTIVATION: For large-scale structural assignment to sequences, as in computational structural genomics, a fast yet sensitive sequence search procedure is essential. A new approach using intermediate sequences was tested as a shortcut to iterative multiple sequence search methods such as PSI-BLAST.
RESULTS: A library containing potential intermediate sequences for proteins of known structure (PDB-ISL) was constructed. The sequences in the library were collected from a large sequence database using the sequences of the domains of proteins of known structure as the query sequences and the program PSI-BLAST. Sequences of proteins of unknown structure can be matched to distantly related proteins of known structure by using pairwise sequence comparison methods to find homologues in PDB-ISL. Searches of PDB-ISL were calibrated, and the number of correct matches found at a given error rate was the same as that found by PSI-BLAST. The advantage of this library is that it uses pairwise sequence comparison methods, such as FASTA or BLAST2, and can, therefore, be searched easily and, in many cases, much more quickly than an iterative multiple sequence comparison method. The procedure is roughly 20 times faster than PSI-BLAST for small genomes and several hundred times faster for large genomes.
AVAILABILITY: Sequences can be submitted to the PDB-ISL servers at http://stash.mrc-lmb.cam.ac.uk/PDB_ISL/ or http://cyrah.ebi.ac.uk:1111/Serv/PDB_ISL/ and can be downloaded from ftp://ftp.ebi.ac.uk/pub/contrib/jong/PDB_ISL/
CONTACT: sat@mrc-lmb.cam.ac.uk and jong@ebi.ac.uk
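The idea of matching through intermediate sequences can be sketched in Python as follows. This is a toy illustration and not the PDB-ISL implementation: the real library is built with PSI-BLAST and searched with FASTA or BLAST2, whereas here the pairwise score is a simple fraction of identical positions. The names `identity` and `assign_structure`, the threshold of 0.5, and all sequences and identifiers are hypothetical.

```python
def identity(a, b):
    """Toy pairwise score: fraction of identical positions (equal lengths only)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def assign_structure(query, library, threshold=0.5):
    """Return (structure_id, score) for the best hit at or above threshold, else None.

    `library` is a list of (sequence, structure_id) pairs.
    """
    best = None
    for seq, structure_id in library:
        score = identity(query, seq)
        if score >= threshold and (best is None or score > best[1]):
            best = (structure_id, score)
    return best

# Hypothetical data: one domain of known structure and one intermediate
# sequence that was previously linked to that same structure.
known_only = [("AAAAAAAA", "d1")]
with_intermediates = known_only + [("AAAACCCC", "d1")]

query = "GGAACCCC"  # too distant from the known domain to match directly

direct = assign_structure(query, known_only)
via_library = assign_structure(query, with_intermediates)
print(direct)       # no hit against the known structure alone
print(via_library)  # hit through the intermediate sequence
```

The query is assigned to the structure only because the library contains a sequence that sits between it and the known domain, which is the shortcut the abstract attributes to PDB-ISL.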
Cousin SL, Silva F, Teichmann S, Hemmer M, Buades B, Biegert J. High-flux table-top soft x-ray source driven by sub-2-cycle, CEP stable, 1.85-μm 1-kHz pulses for carbon K-edge spectroscopy. Optics Letters 2014; 39:5383-6. [PMID: 26466278 DOI: 10.1364/ol.39.005383] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 05/23/2023]
Abstract
We report on the first table-top high-flux source of coherent soft x-ray radiation up to 400 eV, operating at 1 kHz. This source covers the carbon K-edge with a beam brilliance of (4.3±1.2)×10¹⁵ photons/s/mm²/strad/10% bandwidth and a photon flux of (1.85±0.12)×10⁷ photons/s/1% bandwidth. We use this source to demonstrate table-top x-ray near-edge fine-structure spectroscopy at the carbon K-edge of a polyimide foil and retrieve the specific absorption features corresponding to the binding orbitals of the carbon atoms in the foil.