1
|
Galperin MY, Kristensen DM, Makarova KS, Wolf YI, Koonin EV. Microbial genome analysis: the COG approach. Brief Bioinform 2020; 20:1063-1070. [PMID: 28968633 DOI: 10.1093/bib/bbx117] [Citation(s) in RCA: 144] [Impact Index Per Article: 36.0] [Received: 05/30/2017] [Revised: 08/01/2017] [Indexed: 11/15/2022] Open
Abstract
For the past 20 years, the Clusters of Orthologous Genes (COG) database has been a popular tool for microbial genome annotation and comparative genomics. Initially created for the purpose of evolutionary classification of protein families, the COGs have been used, apart from straightforward functional annotation of sequenced genomes, for such tasks as (i) unification of genome annotation in groups of related organisms; (ii) identification of missing and/or undetected genes in complete microbial genomes; (iii) analysis of genomic neighborhoods, in many cases allowing prediction of novel functional systems; (iv) analysis of metabolic pathways and prediction of alternative forms of enzymes; (v) comparison of organisms by COG functional categories; and (vi) prioritization of targets for structural and functional characterization. Here we review the principles of the COG approach and discuss its key advantages and drawbacks in microbial genome analysis.
|
2
|
Patnaik BB, Chung JM, Hwang HJ, Sang MK, Park JE, Min HR, Cho HC, Dewangan N, Baliarsingh S, Kang SW, Park SY, Jo YH, Park HS, Kim WJ, Han YS, Lee JS, Lee YS. Transcriptome analysis of air-breathing land slug, Incilaria fruhstorferi reveals functional insights into growth, immunity, and reproduction. BMC Genomics 2019; 20:154. [PMID: 30808280 PMCID: PMC6390351 DOI: 10.1186/s12864-019-5526-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/07/2018] [Accepted: 02/11/2019] [Indexed: 01/27/2023] Open
Abstract
Background: Incilaria (= Meghimatium) fruhstorferi is an air-breathing land slug found in restricted habitats of Japan, Taiwan and selected provinces of South Korea (Jeju, Chuncheon, Busan, and Deokjeokdo). The species is in decline due to depletion of forest cover, predation by natural enemies, and collection. To facilitate the conservation of the species, it is important to decide on a number of traits related to growth, immunity and reproduction addressing fitness advantage of the species.
Results: The visceral mass transcriptome of I. fruhstorferi was generated using the Illumina HiSeq 4000 sequencing platform. According to the BUSCO (Benchmarking Universal Single-Copy Orthologs) method, the transcriptome was considered complete with 91.8% of ortholog genes present (Single: 70.7%; Duplicated: 21.1%). A total of 96.79% of the raw read sequences were processed as clean reads. TransDecoder identified 197,271 contigs that contained candidate-coding regions. Of a total of 50,230 unigenes, 34,470 (68.62% of the total unigenes) were annotated to homologous proteins in the Protostome database (PANM-DB). The GO term and KEGG pathway analysis indicated genes involved in metabolism, phosphatidylinositol signalling system, aminobenzoate degradation, and T-cell receptor signalling pathway. Many genes associated with molluscan innate immunity were categorized under pathogen recognition receptor, TLR signalling pathway, MyD88 dependent pathway, endogenous ligands, immune effectors, antimicrobial peptides, apoptosis, and adaptation-related. The reproduction-associated unigenes showed homology to protein fem-1, spermatogenesis-associated protein, sperm associated antigen, and testis expressed sequences, among others. In addition, we identified key growth-related genes categorized under somatotrophic axis, muscle growth, chitinases and collagens. A total of 4822 Simple Sequence Repeats (SSRs) were also identified from the unigene sequences of I. fruhstorferi.
Conclusions: This is the first available genomic information for the non-model land slug, I. fruhstorferi, focusing on genes related to growth, immunity, and reproduction, with additional focus on microsatellites and repeating elements. The transcriptome provides access to a greater number of traits of unknown relevance in the species that could be exploited for in-depth analyses of evolutionary plasticity and making informed choices during conservation planning. This would be appropriate for understanding the dynamics of the species on a priority basis considering the ecological, health, and social benefits.
Electronic supplementary material: The online version of this article (10.1186/s12864-019-5526-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Bharat Bhusan Patnaik
- School of Biotech Sciences, Trident Academy of Creative Technology (TACT), F2-B, Chandaka Industrial Estate, Chandrasekharpur, Bhubaneswar, Odisha, 751024, India
| | - Jong Min Chung
- Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea
| | - Hee Ju Hwang
- Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea
| | - Min Kyu Sang
- Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea
| | - Jie Eun Park
- Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea
| | - Hye Rin Min
- Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea
| | - Hang Chul Cho
- Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea
| | - Neha Dewangan
- School of Biotech Sciences, Trident Academy of Creative Technology (TACT), F2-B, Chandaka Industrial Estate, Chandrasekharpur, Bhubaneswar, Odisha, 751024, India
| | - Snigdha Baliarsingh
- School of Biotech Sciences, Trident Academy of Creative Technology (TACT), F2-B, Chandaka Industrial Estate, Chandrasekharpur, Bhubaneswar, Odisha, 751024, India
| | - Se Won Kang
- Biological Resource Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), 181, Ipsin-gil, Jungeup-si, Jeollabuk-do, 56212, South Korea
| | - So Young Park
- Nakdonggang National Institute of Biological Resources, Biodiversity Conservation and Change Research Division, 137, Donam-2-gil, Sangju-si, Gyeongsangbuk-do, 37242, South Korea
| | - Yong Hun Jo
- College of Agriculture and Life Science, Chonnam National University, 77 Yongbong-ro, Buk-gu, Gwangju, 61186, South Korea
| | - Hong Seog Park
- Research Institute, GnC BIO Co., LTD, 621-6 Banseok-dong, Yuseong-gu, Daejeon, 34069, Republic of Korea
| | - Wan Jong Kim
- Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea
| | - Yeon Soo Han
- College of Agriculture and Life Science, Chonnam National University, 77 Yongbong-ro, Buk-gu, Gwangju, 61186, South Korea
| | - Jun Sang Lee
- Institute of Basic Science, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea
| | - Yong Seok Lee
- Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungchungnam-do, 31538, South Korea.
|
3
|
Tatarinova TV, Lysnyansky I, Nikolsky YV, Bolshoy A. The mysterious orphans of Mycoplasmataceae. Biol Direct 2016; 11:2. [PMID: 26747447 PMCID: PMC4706650 DOI: 10.1186/s13062-015-0104-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/23/2015] [Accepted: 12/30/2015] [Indexed: 01/08/2023] Open
Abstract
Background: The length of a protein sequence is largely determined by its function. In certain species, it may also be affected by additional factors, such as growth temperature or acidity. In 2002, it was shown that in the bacterium Escherichia coli and in the archaeon Archaeoglobus fulgidus, protein sequences with no homologs were, on average, shorter than those with homologs (BMC Evol Biol 2:20, 2002). It is now generally accepted that in bacterial and archaeal genomes the distributions of protein length are different between sequences with and without homologs. In this study, we examine this postulate by conducting a comprehensive analysis of all annotated prokaryotic genomes and by focusing on certain exceptions.
Results: We compared the distribution of lengths of “having homologs proteins” (HHPs) and “non-having homologs proteins” (orphans or ORFans) in all currently completely sequenced and COG-annotated prokaryotic genomes. As expected, the HHPs and ORFans have strikingly different length distributions in almost all genomes. As previously established, the HHPs are indeed, on average, longer than the ORFans, and the length distributions for the ORFans have a relatively narrow peak, in contrast to the HHPs, whose lengths spread over a wider range of values. However, about thirty genomes do not obey these rules. Practically all genomes of Mycoplasma and Ureaplasma have atypical ORFan distributions, with the mean length of ORFans larger than the mean length of HHPs. These genera constitute over 80% of atypical genomes.
Conclusions: We confirmed, on a broad set of genomes, the previous observation that HHPs and ORFans have different gene length distributions. We also showed that Mycoplasmataceae genomes have very distinctive distributions of ORFan lengths. We offer several possible biological explanations of this phenomenon, such as an adaptation to Mycoplasmataceae’s ecological niche, specifically its “quiet” co-existence with host organisms, resulting in long ABC transporters.
Electronic supplementary material: The online version of this article (doi:10.1186/s13062-015-0104-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Tatiana V Tatarinova
- Children's Hospital Los Angeles, Keck School of Medicine, University of Southern California, Los Angeles, 90027, CA, USA; Spatial Sciences Institute, University of Southern California, Los Angeles, 90089, CA, USA.
| | - Inna Lysnyansky
- Mycoplasma Unit, Division of Avian and Aquatic Diseases, Kimron Veterinary Institute, POB 12, Beit Dagan, 50250, Israel.
| | - Yuri V Nikolsky
- School of Systems Biology, George Mason University, 10900 University Blvd, MSN 5B3, Manassas, VA, 20110, USA; Prosapia Genetics, LLC, 534 San Andres Dr., Solana Beach, CA, 92075, USA; Vavilov Institute of General Genetics, Moscow, Russian Federation.
| | - Alexander Bolshoy
- Department of Evolutionary and Environmental Biology and Institute of Evolution, University of Haifa, Haifa, Israel.
|
4
|
Higdon R, Earl RK, Stanberry L, Hudac CM, Montague E, Stewart E, Janko I, Choiniere J, Broomall W, Kolker N, Bernier RA, Kolker E. The promise of multi-omics and clinical data integration to identify and target personalized healthcare approaches in autism spectrum disorders. OMICS: A Journal of Integrative Biology 2016; 19:197-208. [PMID: 25831060 DOI: 10.1089/omi.2015.0020] [Citation(s) in RCA: 67] [Impact Index Per Article: 8.4] [Indexed: 12/19/2022]
Abstract
Complex diseases are caused by a combination of genetic and environmental factors, creating a difficult challenge for diagnosis and defining subtypes. This review article describes how distinct disease subtypes can be identified through integration and analysis of clinical and multi-omics data. A broad shift toward molecular subtyping of disease using genetic and omics data has yielded successful results in cancer and other complex diseases. To determine molecular subtypes, patients are first classified by applying clustering methods to different types of omics data, then these results are integrated with clinical data to characterize distinct disease subtypes. An example of this molecular-data-first approach is in research on Autism Spectrum Disorder (ASD), a spectrum of social communication disorders marked by tremendous etiological and phenotypic heterogeneity. In the case of ASD, omics data such as exome sequences and gene and protein expression data are combined with clinical data such as psychometric testing and imaging to enable subtype identification. Novel ASD subtypes have been proposed, such as CHD8, using this molecular subtyping approach. Broader use of molecular subtyping in complex disease research is impeded by data heterogeneity, diversity of standards, and ineffective analysis tools. The future of molecular subtyping for ASD and other complex diseases calls for an integrated resource to identify disease mechanisms, classify new patients, and inform effective treatment options. This in turn will empower and accelerate precision medicine and personalized healthcare.
Affiliation(s)
- Roger Higdon
- 1 Bioinformatics and High-Throughput Analysis Laboratory, Seattle Children's Research Institute , Seattle, Washington
|
5
|
Haft DH. Using comparative genomics to drive new discoveries in microbiology. Curr Opin Microbiol 2015; 23:189-96. [PMID: 25617609 DOI: 10.1016/j.mib.2014.11.017] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 09/27/2014] [Revised: 11/19/2014] [Accepted: 11/20/2014] [Indexed: 01/17/2023]
Abstract
Bioinformatics looks to many microbiologists like a service industry. In this view, annotation starts with what is known from experiments in the lab, makes reasonable inferences of which genes match other genes in function, builds databases to make all that we know accessible, but creates nothing truly new. Experiments lead, then biocuration and computational biology follow. But the astounding success of genome sequencing is changing the annotation paradigm. Every genome sequenced is an intercepted coded message from the microbial world, and as all cryptographers know, it is easier to decode a thousand messages than a single message. Some biology is best discovered not by phenomenology, but by decoding genome content, forming hypotheses, and doing the first few rounds of validation computationally. Through such reasoning, a role and function may be assigned to a protein with no sequence similarity to any protein yet studied. Experimentation can follow after the discovery to cement and to extend the findings. Unfortunately, this approach remains so unfamiliar to most bench scientists that lab work and comparative genomics typically segregate to different teams working on unconnected projects. This review will discuss several themes in comparative genomics as a discovery method, including highly derived data, use of patterns of design to reason by analogy, and in silico testing of computationally generated hypotheses.
|
6
|
Galperin MY, Makarova KS, Wolf YI, Koonin EV. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res 2014; 43:D261-9. [PMID: 25428365 DOI: 10.1093/nar/gku1223] [Citation(s) in RCA: 987] [Impact Index Per Article: 98.7] [Indexed: 02/03/2023] Open
Abstract
Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign to them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and describe the range of potential functions when there were more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, and a comprehensive revision of the COG annotations and expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. 
The new version of the COGs is expected to become an important tool for microbial genomics.
Affiliation(s)
- Michael Y Galperin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Kira S Makarova
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Yuri I Wolf
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
7
|
Kucharova V, Wiker HG. Proteogenomics in microbiology: taking the right turn at the junction of genomics and proteomics. Proteomics 2014; 14:2660-2675. [PMID: 25263021 DOI: 10.1002/pmic.201400168] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 04/28/2014] [Revised: 08/18/2014] [Accepted: 09/23/2014] [Indexed: 12/14/2022]
Abstract
High-accuracy and high-throughput proteomic methods have completely changed the way we can identify and characterize proteins. MS-based proteomics can now provide a unique supplement to genomic data and add a new level of information to the interpretation of genomic sequences. Proteomics-driven genome annotation has become especially relevant in microbiology where genomes are sequenced on a daily basis and limitations of an in silico driven annotation process are well recognized. In this review paper, we outline different strategies on how one can design a proteogenomic experiment, for example on genome-sequenced (synonymous proteogenomics) versus unsequenced organisms (ortho-proteogenomics) or with the aid of other "omic" data such as RNA-seq. We touch upon many challenges that are encountered during a typical proteogenomic study, mostly concerning bioinformatics methods and downstream data analysis, but also related to creation and use of sequence databases. A large list of proteogenomic case studies of different microorganisms is provided to illustrate the mapping of MS/MS-derived peptide spectra to genomic DNA sequences. These investigations have led to accurate determination of translational initiation sites, pointed out eventual read-throughs or programmed frameshifts, detected signal peptide processing or other protein maturation events, removed questionable annotation assignments, and provided evidence for predicted hypothetical proteins.
Affiliation(s)
- Veronika Kucharova
- Department of Clinical Science, The Gade Research Group for Infection and Immunity, University of Bergen, Norway
|
8
|
Stanberry L, Rekepalli B, Liu Y, Giblock P, Higdon R, Montague E, Broomall W, Kolker N, Kolker E. Optimizing high performance computing workflow for protein functional annotation. Concurrency and Computation: Practice & Experience 2014; 26:2112-2121. [PMID: 25313296 PMCID: PMC4194055 DOI: 10.1002/cpe.3264] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/04/2023]
Abstract
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. Based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Affiliation(s)
- Larissa Stanberry
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - Bhanu Rekepalli
- Joint Institute for Computational Sciences, University of Tennessee - Oak Ridge National Laboratory (JICS UT - ORNL), DELSA Global, Oak Ridge, TN, USA
| | - Yuan Liu
- Joint Institute for Computational Sciences, University of Tennessee - Oak Ridge National Laboratory (JICS UT - ORNL), DELSA Global, Oak Ridge, TN, USA
| | | | - Roger Higdon
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - Elizabeth Montague
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - William Broomall
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - Natali Kolker
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - Eugene Kolker
- Bioinformatics & High-throughput Analysis Laboratory, SCRI, High-throughput Analysis Core, SCRI, Predictive Analytics, Seattle Children's Hospital, Departments of Pediatrics and Biomedical Informatics & Medical Education, University of Washington, DELSA Global
|
9
|
Römling U, Kjelleberg S, Normark S, Nyman L, Uhlin BE, Åkerlund B. Microbial biofilm formation: a need to act. J Intern Med 2014; 276:98-110. [PMID: 24796496 DOI: 10.1111/joim.12242] [Citation(s) in RCA: 102] [Impact Index Per Article: 10.2] [Indexed: 02/06/2023]
Affiliation(s)
- U Römling
- Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, Sweden
|
10
|
Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N, Kolker E. Unraveling the Complexities of Life Sciences Data. Big Data 2013; 1:42-50. [PMID: 27447037 DOI: 10.1089/big.2012.1505] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 06/06/2023]
Abstract
The life sciences have entered into the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab ( kolkerlab.org ) is creating partnerships that identify data challenges and solve community needs. We specialize in solutions to complex biological data challenges, as exemplified by the community resource of MOPED (Model Organism Protein Expression Database, MOPED.proteinspire.org ) and the analysis pipeline of SPIRE (Systematic Protein Investigative Research Environment, PROTEINSPIRE.org ). Our collaborative work extends into the computationally intensive tasks of analysis and visualization of millions of protein sequences through innovative implementations of sequence alignment algorithms and creation of the Protein Sequence Universe tool (PSU). Pushing into the future together with our collaborators, our lab is pursuing integration of multi-omics data and exploration of biological pathways, as well as assigning function to proteins and porting solutions to the cloud. Big data have come to the life sciences; discovering the knowledge in the data will bring breakthroughs and benefits.
Affiliation(s)
- Roger Higdon
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute , Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
- 3 Predictive Analytics, Seattle Children's , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
| | - Winston Haynes
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute , Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
- 3 Predictive Analytics, Seattle Children's , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
| | - Larissa Stanberry
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute , Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
- 3 Predictive Analytics, Seattle Children's , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
| | - Elizabeth Stewart
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
| | - Gregory Yandl
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute , Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
| | - Chris Howard
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
- 5 Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
| | - William Broomall
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
- 3 Predictive Analytics, Seattle Children's , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
| | - Natali Kolker
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
- 3 Predictive Analytics, Seattle Children's , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
| | - Eugene Kolker
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute , Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute , Seattle, Washington
- 3 Predictive Analytics, Seattle Children's , Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global) , Seattle, Washington
- 6 Departments of Biomedical Informatics & Medical Education and Pediatrics, University of Washington , Seattle, Washington
|
11
|
Ponomarenko E, Poverennaya E, Pyatnitskiy M, Lisitsa A, Moshkovskii S, Ilgisonis E, Chernobrovkin A, Archakov A. Comparative ranking of human chromosomes based on post-genomic data. OMICS: A Journal of Integrative Biology 2012; 16:604-11. [PMID: 22966780 DOI: 10.1089/omi.2012.0034] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 12/25/2022]
Abstract
The goal of the Human Proteome Project (HPP) is to fully characterize the 21,000 human protein-coding genes with respect to the estimated two million proteins they encode. As such, the HPP aims to create a comprehensive, detailed resource to help elucidate protein functions and to advance medical treatment. Similarly to the Human Genome Project (HGP), the HPP chose a chromosome-centric approach, assigning different chromosomes to different countries. Here we introduce a scoring method for chromosome ranking based on several characteristics, including relevance to health problems, existing published knowledge, and current transcriptome and proteome coverage. The score of each chromosome was computed as a weighted combination of indexes reflecting the aforementioned characteristics. The approach is tailored to the chromosome-centric HPP (C-HPP), and is advantageous in that it takes into account currently available information. We ranked the human chromosomes using the proposed score, and observed that Chr Y, Chr 13, and Chr 18 were top-ranked, whereas the scores of Chr 19, Chr 11, and Chr 17 were comparatively low. For Chr 18, selected for the Russian part of C-HPP, about 25% of the encoded genes were associated with diseases, including cancers and neurodegenerative and psychiatric diseases, as well as type 1 diabetes and essential hypertension. This ranking approach could easily be adapted to prioritize research for other sets of genes, such as metabolic pathways and functional categories.
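The scoring idea in this abstract (a weighted combination of per-chromosome indexes, followed by ranking) can be sketched as below. This is only an illustration under stated assumptions: the index names, the weights, the normalization by the weight sum, and the "higher score ranks first" convention are all hypothetical and are not taken from the cited paper.

```python
from typing import Dict, List


def chromosome_score(indexes: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted combination of indexes for one chromosome.

    Assumption: the score is the weight-normalized sum of index values.
    The cited abstract only says "weighted combination", so the exact
    form used by the authors may differ.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[name] * indexes[name] for name in weights) / total_weight


def rank_chromosomes(table: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> List[str]:
    """Return chromosome labels sorted from highest to lowest score."""
    scores = {chrom: chromosome_score(idx, weights) for chrom, idx in table.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical index values and weights, for illustration only.
example_weights = {"health_relevance": 0.5, "published_knowledge": 0.25, "omics_coverage": 0.25}
example_table = {
    "chrA": {"health_relevance": 0.8, "published_knowledge": 0.4, "omics_coverage": 0.4},
    "chrB": {"health_relevance": 0.2, "published_knowledge": 0.6, "omics_coverage": 0.6},
}
example_ranking = rank_chromosomes(example_table, example_weights)
```

Normalizing by the weight sum keeps scores comparable if the weights are later changed; dropping the normalization would not alter the ranking for a fixed set of weights.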
Affiliation(s)
- Elena Ponomarenko
- Institute of Biomedical Chemistry of Russian Academy of Medical Sciences, Moscow, Russia
|
12
|
Thuillard M, Moulton V. Identifying and reconstructing lateral transfers from distance matrices by combining the Minimum Contradiction method and Neighbor-Net. J Bioinform Comput Biol 2011; 9:453-70. [DOI: 10.1142/s0219720011005409] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 09/06/2010] [Revised: 02/01/2011] [Accepted: 02/13/2011] [Indexed: 11/18/2022]
Abstract
Identifying lateral gene transfers is an important problem in evolutionary biology. Under a simple model of evolution, the expected values of an evolutionary distance matrix describing a phylogenetic tree fulfill the so-called Kalmanson inequalities. The Minimum Contradiction method for identifying lateral gene transfers exploits the fact that lateral transfers may generate large deviations from the Kalmanson inequalities. Here a new approach is presented to deal with such cases that combines the Neighbor-Net algorithm for computing phylogenetic networks with the Minimum Contradiction method. A subset of taxa, prescribed using Neighbor-Net, is obtained by measuring how closely the Kalmanson inequalities are fulfilled by each taxon. A criterion is then used to identify the taxa, possibly involved in a lateral transfer between nonconsecutive taxa. We illustrate the utility of the new approach by applying it to a distance matrix for Archaea, Bacteria, and Eukaryota.
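As a rough illustration of the condition mentioned in this abstract, the sketch below checks the standard Kalmanson inequalities for a distance matrix under a given circular ordering of taxa: for every quadruple i, j, k, l in that order, d(i,j) + d(k,l) and d(i,l) + d(j,k) must not exceed d(i,k) + d(j,l). The function name and the tolerance parameter are assumptions made for this example. It is not the Minimum Contradiction method itself, which the abstract does not describe in enough detail to reproduce.

```python
from itertools import combinations


def kalmanson_violations(d, order, tol=1e-9):
    """List quadruples (i, j, k, l), taken in the given circular order,
    that violate either Kalmanson inequality for distance matrix d."""
    violations = []
    for a, b, c, e in combinations(range(len(order)), 4):
        i, j, k, l = order[a], order[b], order[c], order[e]
        crossing = d[i][k] + d[j][l]  # the "crossing" pair sum must be the largest
        if d[i][j] + d[k][l] > crossing + tol or d[i][l] + d[j][k] > crossing + tol:
            violations.append((i, j, k, l))
    return violations


# Toy tree metric for ((0,1),(2,3)) with unit-length edges.
d = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
good = kalmanson_violations(d, [0, 1, 2, 3])
bad = kalmanson_violations(d, [0, 2, 1, 3])
```

An ordering compatible with the tree yields no violations, while an ordering that interleaves the two cherries does; large violations of this kind are the sort of deviation the abstract associates with lateral transfers.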
Affiliation(s)
- Vincent Moulton
- School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK
|
13
|
The genetic organisation of prokaryotic two-component system signalling pathways. BMC Genomics 2010; 11:720. [PMID: 21172000 PMCID: PMC3018481 DOI: 10.1186/1471-2164-11-720] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 07/14/2010] [Accepted: 12/20/2010] [Indexed: 11/16/2022] Open
Abstract
Background Two-component systems (TCSs) are modular and diverse signalling pathways, involving a stimulus-responsive transfer of phosphoryl groups from transmitter to partner receiver domains. TCS gene and domain organisation are both potentially informative regarding biological function, interaction partnerships and molecular mechanisms. However, there is currently little understanding of the relationships between domain architecture, gene organisation and TCS pathway structure. Results Here we classify the gene and domain organisation of TCS gene loci from 1405 prokaryotic replicons (>40,000 TCS proteins). We find that 200 bp is the most appropriate distance cut-off for defining whether two TCS genes are functionally linked. More than 90% of all TCS gene loci encode just one or two transmitter and/or receiver domains, however numerous other geometries exist, often with large numbers of encoded TCS domains. Such information provides insights into the distribution of TCS domains between genes, and within genes. As expected, the organisation of TCS genes and domains is affected by phylogeny, and plasmid-encoded TCS exhibit differences in organisation from their chromosomally-encoded counterparts. Conclusions We provide here an overview of the genomic and genetic organisation of TCS domains, as a resource for further research. We also propose novel metrics that build upon TCS gene/domain organisation data and allow comparisons between genomic complements of TCSs. In particular, 'percentage orphaned TCS genes' (or 'Dissemination') and 'percentage of complex loci' (or 'Sophistication') appear to be useful discriminators, and to reflect mechanistic aspects of TCS organisation not captured by existing metrics.
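The distance cut-off mentioned in this abstract can be sketched as a simple grouping step: neighbouring genes whose intergenic gap is at most 200 bp are placed in the same locus. The tuple layout, function name, and gene names are hypothetical, and the sketch ignores strand and replicon circularity, which the paper's own procedure may well take into account.

```python
def group_into_loci(genes, max_gap=200):
    """Group (name, start, end) genes into loci.

    Genes are sorted by start coordinate; a gene joins the current locus
    when its start lies within max_gap bp of the locus end so far.
    """
    loci = []
    for name, start, end in sorted(genes, key=lambda g: g[1]):
        if loci and start - loci[-1]["end"] <= max_gap:
            loci[-1]["genes"].append(name)
            loci[-1]["end"] = max(loci[-1]["end"], end)
        else:
            loci.append({"genes": [name], "end": end})
    return [locus["genes"] for locus in loci]


# Hypothetical coordinates for three TCS genes on one replicon.
genes = [("hk1", 100, 1500), ("rr1", 1550, 2200), ("rr2", 5000, 5700)]
loci = group_into_loci(genes)
```

Here `hk1` and `rr1` are 50 bp apart and form one locus, while `rr2` lies 2,800 bp downstream and is left on its own.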
|
14
|
Comparative genome biology of a serogroup B carriage and disease strain supports a polygenic nature of meningococcal virulence. J Bacteriol 2010; 192:5363-77. [PMID: 20709895 DOI: 10.1128/jb.00883-10] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 02/06/2023] Open
Abstract
Neisseria meningitidis serogroup B strains are responsible for most meningococcal cases in the industrialized countries, and strains belonging to the clonal complex ST-41/44 are among the most prevalent serogroup B strains in carriage and disease. Here, we report the first genome and transcriptome comparison of a serogroup B carriage strain from the clonal complex ST-41/44 to the serogroup B disease strain MC58 from the clonal complex ST-32. Both genomes are highly colinear, with only three major genome rearrangements that are associated with the integration of mobile genetic elements. They further differ in about 10% of their gene content, with the highest variability in gene presence as well as gene sequence found for proteins involved in host cell interactions, including Opc, NadA, TonB-dependent receptors, RTX toxin, and two-partner secretion system proteins. Whereas housekeeping genes coding for metabolic functions were highly conserved, there were considerable differences in their expression pattern upon adhesion to human nasopharyngeal cells between both strains, including differences in energy metabolism and stress response. In line with these genomic and transcriptomic differences, both strains also showed marked differences in their in vitro infectivity and in serum resistance. Taken together, these data support the concept of a polygenic nature of meningococcal virulence comprising differences in the repertoire of adhesins as well as in the regulation of metabolic genes and suggest a prominent role for immune selection and genetic drift in shaping the meningococcal genome.
|
15
|
Galperin MY, Koonin EV. From complete genome sequence to 'complete' understanding? Trends Biotechnol 2010; 28:398-406. [PMID: 20647113 PMCID: PMC3065831 DOI: 10.1016/j.tibtech.2010.05.006] [Citation(s) in RCA: 119] [Impact Index Per Article: 8.5] [Received: 03/29/2010] [Revised: 05/18/2010] [Accepted: 05/28/2010] [Indexed: 12/29/2022]
Abstract
The rapidly accumulating genome sequence data allow researchers to address fundamental biological questions that were not even asked just a few years ago. A major problem in genomics is the widening gap between the rapid progress in genome sequencing and the comparatively slow progress in the functional characterization of sequenced genomes. Here we discuss two key questions of genome biology: whether we need more genomes, and how deep is our understanding of biology based on genomic analysis. We argue that overly specific annotations of gene functions are often less useful than the more generic, but also more robust, functional assignments based on protein family classification. We also discuss problems in understanding the functions of the remaining 'conserved hypothetical' genes.
Affiliation(s)
- Michael Y Galperin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
|
16
|
Mahadevan P, Seto D. Rapid pair-wise synteny analysis of large bacterial genomes using web-based GeneOrder4.0. BMC Res Notes 2010; 3:41. [PMID: 20178631 PMCID: PMC2844394 DOI: 10.1186/1756-0500-3-41] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 01/01/2010] [Accepted: 02/23/2010] [Indexed: 11/30/2022] Open
Abstract
Background The growing whole genome sequence databases necessitate the development of user-friendly software tools to mine these data. Web-based tools are particularly useful to wet-bench biologists as they enable platform-independent analysis of sequence data, without having to perform complex programming tasks and software compiling. Findings GeneOrder4.0 is a web-based "on-the-fly" synteny and gene order analysis tool for comparative bacterial genomics (ca. 8 Mb). It enables the visualization of synteny by plotting protein similarity scores between two genomes and it also provides visual annotation of "hypothetical" proteins from older archived genomes based on more recent annotations. Conclusions The web-based software tool GeneOrder4.0 is a user-friendly application that has been updated to allow the rapid analysis of synteny and gene order in large bacterial genomes. It is developed with the wet-bench researcher in mind.
Affiliation(s)
- Padmanabhan Mahadevan
- Department of Bioinformatics and Computational Biology, 10900 University Blvd, MSN 5B3, George Mason University, Manassas, VA 20110, USA.
|
17
|
Galperin MY, Higdon R, Kolker E. Interplay of heritage and habitat in the distribution of bacterial signal transduction systems. Mol Biosyst 2010; 6:721-8. [PMID: 20237650 DOI: 10.1039/b908047c] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Indexed: 11/21/2022]
Abstract
Comparative analysis of the complete genome sequences from a variety of poorly studied organisms aims at predicting ecological and behavioral properties of these organisms and helping in characterizing their habitats. This task requires finding appropriate descriptors that could be correlated with the core traits of each system and would allow meaningful comparisons. Using the relatively simple bacterial models, first attempts have been made to introduce suitable metrics to describe the complexity of organism's signaling machinery, which included introducing the "bacterial IQ" score. Here, we use an updated census of prokaryotic signal transduction systems to improve this parameter and evaluate its consistency within selected bacterial phyla. We also introduce a more elaborate descriptor, a set of profiles of relative abundance of members of each family of signal transduction proteins encoded in each genome. We show that these family profiles are well conserved within each genus and are often consistent within families of bacteria. Thus, they reflect evolutionary relationships between organisms as well as individual adaptations of each organism to its specific ecological niche.
Affiliation(s)
- Michael Y Galperin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, Maryland 20894, USA.
|
18
|
Louie B, Higdon R, Kolker E. A statistical model of protein sequence similarity and function similarity reveals overly-specific function predictions. PLoS One 2009; 4:e7546. [PMID: 19844580 PMCID: PMC2760442 DOI: 10.1371/journal.pone.0007546] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 09/01/2009] [Accepted: 09/13/2009] [Indexed: 12/02/2022] Open
Abstract
Background Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, current approaches for predicting function must be improved upon. One problem in particular is overly-specific function predictions which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity. Methodology Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity. Significance Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value >1e−62, non-redundant NCBI protein database (NRDB)) and only small likelihood of specific function match for proteins with low sequence similarity (bit score <54.6, e-value <1e−05, NRDB). For sequence similarity ranges in between our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function, and predicted functions for thousands of these proteins ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that were electronically based. We show that, on average, these prior function predictions are more specific (quite possibly overly-specific) compared to predictions from our model that is based on proteins with experimentally determined function.
Affiliation(s)
- Brenton Louie
- Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington, United States of America
- Predictive Analytics, Seattle Children's Hospital, University of Washington School of Medicine, Seattle, Washington, United States of America
- Roger Higdon
- Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington, United States of America
- Predictive Analytics, Seattle Children's Hospital, University of Washington School of Medicine, Seattle, Washington, United States of America
- Eugene Kolker
- Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington, United States of America
- Predictive Analytics, Seattle Children's Hospital, University of Washington School of Medicine, Seattle, Washington, United States of America
- Biomedical and Health Informatics Division, Department of Medical Education and Biomedical Informatics, University of Washington School of Medicine, Seattle, Washington, United States of America
|
19
|
Meinicke P. UFO: a web server for ultra-fast functional profiling of whole genome protein sequences. BMC Genomics 2009; 10:409. [PMID: 19725959 PMCID: PMC2744726 DOI: 10.1186/1471-2164-10-409] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 04/27/2009] [Accepted: 09/02/2009] [Indexed: 11/10/2022] Open
Abstract
Background Functional profiling is a key technique to characterize and compare the functional potential of entire genomes. The estimation of profiles according to an assignment of sequences to functional categories is a computationally expensive task because it requires the comparison of all protein sequences from a genome with a usually large database of annotated sequences or sequence families. Description Based on machine learning techniques for Pfam domain detection, the UFO web server for ultra-fast functional profiling allows researchers to process large protein sequence collections instantaneously. Besides the frequencies of Pfam and GO categories, the user also obtains the sequence specific assignments to Pfam domain families. In addition, a comparison with existing genomes provides dissimilarity scores with respect to 821 reference proteomes. Considering the underlying UFO domain detection, the results on 206 test genomes indicate a high sensitivity of the approach. In comparison with current state-of-the-art HMMs, the runtime measurements show a considerable speed up in the range of four orders of magnitude. For an average size prokaryotic genome, the computation of a functional profile together with its comparison typically requires about 10 seconds of processing time. Conclusion For the first time the UFO web server makes it possible to get a quick overview on the functional inventory of newly sequenced organisms. The genome scale comparison with a large number of precomputed profiles allows a first guess about functionally related organisms. The service is freely available and does not require user registration or specification of a valid email address.
Affiliation(s)
- Peter Meinicke
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Germany.
|
20
|
Kaddi C, Quo CF, Wang MD. Quantitative metrics for bio-modeling algorithm selection. Annu Int Conf IEEE Eng Med Biol Soc 2009; 2008:4613-6. [PMID: 19163744 DOI: 10.1109/iembs.2008.4650241] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
In this paper, we report our efforts in developing guidelines that are capable of helping researchers to select algorithms in systems biology modeling. We propose a set of metrics based on discrete observable units in terms of key bio-modeling considerations. We accomplish this by (i) reviewing classical metric definitions, (ii) implementing widely used modeling algorithms on a specific case study, and (iii) testing metrics that are a hybrid of classical metrics and key bio-modeling considerations. The modeling algorithms implemented are Michaelis-Menten kinetics, generalized mass action, flux balance analysis, and metabolic control analysis. This work extends our previous work in developing qualitative guidelines to select bio-modeling algorithms. Our results impact systems biology modeling specifically by increasing the level of confidence for users to select bio-modeling algorithms by using quantitative metrics appropriately.
Affiliation(s)
- Chanchala Kaddi
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, 30332 USA.
|
21
|
Genomes and knowledge - a questionable relationship? Trends Microbiol 2008; 16:512-9. [PMID: 18819801 DOI: 10.1016/j.tim.2008.08.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 07/04/2008] [Revised: 08/15/2008] [Accepted: 08/21/2008] [Indexed: 11/22/2022]
Abstract
The availability of bacterial genome sequences has ushered in an era of post-genomic research - accelerating and often enabling molecular genetic analyses. For bacteriologists focussing on an individual bacterium, comparing genomes has also led to a greater understanding of their favoured organism through contextualization. But how does the value of such contextualization vary with the number of available genomes? It seems that for most genome metrics, comparison against approximately 100 genomes is sufficient, with comparison against further genomes not considerably affecting the contextual knowledge gained. It appears that quality, rather than quantity, might be the most important factor when comparing genomes.
|
22
|
Abstract
Minimum contradiction matrices are a useful complement to distance-based phylogenies. A minimum contradiction matrix represents phylogenetic information under the form of an ordered distance matrix $Y_{i,j}^{n}$. A matrix element corresponds to the distance from a reference vertex $n$ to the path $(i, j)$. For an X-tree or a split network, the minimum contradiction matrix is a Robinson matrix. It therefore fulfills all the inequalities defining perfect order: $Y_{i,j}^{n} \geq Y_{i,k}^{n}$, $Y_{k,j}^{n} \geq Y_{k,i}^{n}$, i
|
23
|
Jones JT, Moens M, Mota M, Li H, Kikuchi T. Bursaphelenchus xylophilus: opportunities in comparative genomics and molecular host-parasite interactions. Mol Plant Pathol 2008; 9:357-68. [PMID: 18705876 PMCID: PMC6640334 DOI: 10.1111/j.1364-3703.2007.00461.x] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.8] [Indexed: 05/02/2023]
Abstract
Most Bursaphelenchus species are fungal feeding nematodes that colonize dead or dying trees. However, Bursaphelenchus xylophilus, the pine wood nematode, is also a pathogen of trees and is the causal agent of pine wilt disease. B. xylophilus is native to North America and here it causes little damage to trees. Where it is introduced to new regions it causes huge damage. The most severely affected areas are found in the Far East but more recently B. xylophilus has been introduced into Portugal and the potential for damage here is also high. As incidence and severity of pine wilt disease are linked to temperature we suggest that climate change is likely to exacerbate the problems caused by B. xylophilus and, in addition, will extend (northwards in Europe) the range in which pine wilt disease can occur. Here we review what is currently known about the interactions of B. xylophilus with its hosts, including recent developments in our understanding of the molecular biology of pathogenicity in the nematode. We also examine the potential developments that could be made by more widespread use of genomics tools to understand interactions between B. xylophilus, bacterial pathogens that have been implicated in disease and host trees.
Affiliation(s)
- John T Jones
- PPP Programme, SCRI, Invergowrie, Dundee DD2 5DA, UK.
|
24
|
Wilson GA, Feil EJ, Lilley AK, Field D. Large-scale comparative genomic ranking of taxonomically restricted genes (TRGs) in bacterial and archaeal genomes. PLoS One 2007; 2:e324. [PMID: 17389915 PMCID: PMC1824705 DOI: 10.1371/journal.pone.0000324] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Received: 01/26/2007] [Accepted: 02/18/2007] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Lineage-specific, or taxonomically restricted genes (TRGs), especially those that are species and strain-specific, are of special interest because they are expected to play a role in defining exclusive ecological adaptations to particular niches. Despite this, they are relatively poorly studied and little understood, in large part because many are still orphans or only have homologues in very closely related isolates. This lack of homology confounds attempts to establish the likelihood that a hypothetical gene is expressed and, if so, to determine the putative function of the protein. METHODOLOGY/PRINCIPAL FINDINGS We have developed "QIPP" ("Quality Index for Predicted Proteins"), an index that scores the "quality" of a protein based on non-homology-based criteria. QIPP can be used to assign a value between zero and one to any protein based on comparing its features to other proteins in a given genome. We have used QIPP to rank the predicted proteins in the proteomes of Bacteria and Archaea. This ranking reveals that there is a large amount of variation in QIPP scores, and identifies many high-scoring orphans as potentially "authentic" (expressed) orphans. There are significant differences in the distributions of QIPP scores between orphan and non-orphan genes for many genomes and a trend for less well-conserved genes to have lower QIPP scores. CONCLUSIONS The implication of this work is that QIPP scores can be used to further annotate predicted proteins with information that is independent of homology. Such information can be used to prioritize candidates for further analysis. Data generated for this study can be found in the OrphanMine at http://www.genomics.ceh.ac.uk/orphan_mine.
Affiliation(s)
- Gareth A Wilson
- Centre for Ecology and Hydrology (CEH) Oxford, Oxford, United Kingdom.
|