1
Abstract
Within the next decade, the genomes of 1.8 million eukaryotic species will be sequenced. Identifying genes in these sequences is essential to understand the biology of the species. This is challenging due to the transcriptional complexity of eukaryotic genomes, which encode hundreds of thousands of transcripts of multiple types. Among these, a small set of protein-coding mRNAs play a disproportionately large role in defining phenotypes. Due to their sequence conservation, orthology can be established, making it possible to define the universal catalog of eukaryotic protein-coding genes. This catalog should substantially contribute to uncovering the genomic events underlying the emergence of eukaryotic phenotypes. This piece briefly reviews the basics of protein-coding gene prediction, discusses challenges in finalizing annotation of the human genome, and proposes strategies for producing annotations across the eukaryotic Tree of Life. This lays the groundwork for obtaining the catalog of all genes: the Earth's code of life.
Affiliation(s)
- Roderic Guigó
- Bioinformatics and Genomics, Center for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology (BIST), Dr. Aiguader 88, 08003 Barcelona, Catalonia
- Universitat Pompeu Fabra (UPF), Barcelona, Catalonia
2
Vogel CM, Potthoff DB, Schäfer M, Barandun N, Vorholt JA. Protective role of the Arabidopsis leaf microbiota against a bacterial pathogen. Nat Microbiol 2021; 6:1537-1548. [PMID: 34819644 PMCID: PMC7612696 DOI: 10.1038/s41564-021-00997-7] [Citation(s) in RCA: 60] [Impact Index Per Article: 20.0] [Received: 01/27/2021] [Accepted: 10/15/2021] [Indexed: 11/08/2022]
Abstract
The aerial parts of plants are host to taxonomically structured bacterial communities. Members of the core phyllosphere microbiota can protect Arabidopsis thaliana against foliar pathogens. However, whether plant protection is widespread and to what extent the modes of protection differ among phyllosphere microorganisms are not clear. Here, we present a systematic analysis of plant protection capabilities of the At-LSPHERE, which is a collection of >200 bacterial isolates from A. thaliana, against the bacterial pathogen Pseudomonas syringae pv. tomato DC3000. In total, 224 bacterial leaf isolates were individually assessed for plant protection in a gnotobiotic system. Protection against the pathogen varied, with ~10% of leaf microbiota strains providing full protection, ~10% showing intermediate levels of protection and the remaining ~80% not markedly reducing disease phenotypes upon infection. The most protective strains were distributed across different taxonomic groups. Synthetic community experiments revealed additive effects of strains but also that a single strain can confer full protection in a community context. We also identify different mechanisms that contribute to plant protection. Although pattern-triggered immunity coreceptor signalling is involved in protection by a subset of strains, other strains protected in the absence of functional plant immunity receptors BAK1 and BKK1. Using a comparative genomics approach combined with mutagenesis, we reveal that direct bacteria-pathogen interactions contribute to plant protection by Rhizobium Leaf202. This shows that a computational approach based on the data provided can be used to identify genes of the microbiota that are important for plant protection.
Affiliation(s)
- Martin Schäfer
- Institute of Microbiology, ETH Zurich, Zurich, Switzerland
3
Lempens P, Meehan CJ, Vandelannoote K, Fissette K, de Rijk P, Van Deun A, Rigouts L, de Jong BC. Isoniazid resistance levels of Mycobacterium tuberculosis can largely be predicted by high-confidence resistance-conferring mutations. Sci Rep 2018; 8:3246. [PMID: 29459669 PMCID: PMC5818527 DOI: 10.1038/s41598-018-21378-x] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Received: 09/08/2017] [Accepted: 02/01/2018] [Indexed: 11/22/2022] Open
Abstract
The majority of Mycobacterium tuberculosis isolates resistant to isoniazid harbour a mutation in katG. Since these mutations cause a wide range of minimum inhibitory concentrations (MICs), largely below the serum level reached with higher dosing (15 mg/L upon 15–20 mg/kg), the drug might still remain partly active in the presence of a katG mutation. We therefore investigated which genetic mutations predict the level of phenotypic isoniazid resistance in clinical M. tuberculosis isolates. To this end, the association between known and unknown isoniazid resistance-conferring mutations in whole genome sequences and the isoniazid MICs of 176 isolates was examined. We found mostly moderate-level resistance, characterized by a mode of 6.4 mg/L, for the very common katG Ser315Thr mutation, and always very high MICs (≥19.2 mg/L) for the combination of katG Ser315Thr and inhA c-15t. Contrary to common belief, some isolates harbouring inhA c-15t alone also showed moderate-level resistance, particularly when combined with inhA Ser94Ala. No overt association between low-confidence or unknown mutations, except in katG, and isoniazid resistance (level) was found. Except for the rare katG deletion, line probe assay is thus not sufficiently accurate to predict the level of isoniazid resistance for a single mutation in katG or inhA.
Affiliation(s)
- Pauline Lempens
- Unit of Mycobacteriology, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium; Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Conor J Meehan
- Unit of Mycobacteriology, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
- Koen Vandelannoote
- Unit of Mycobacteriology, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
- Kristina Fissette
- Unit of Mycobacteriology, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
- Pim de Rijk
- Unit of Mycobacteriology, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
- Armand Van Deun
- Unit of Mycobacteriology, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
- Leen Rigouts
- Unit of Mycobacteriology, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium; Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Bouke C de Jong
- Unit of Mycobacteriology, Department of Biomedical Sciences, Institute of Tropical Medicine, Antwerp, Belgium
4
From Genomes to Phenotypes: Traitar, the Microbial Trait Analyzer. mSystems 2016; 1:mSystems00101-16. [PMID: 28066816 PMCID: PMC5192078 DOI: 10.1128/msystems.00101-16] [Citation(s) in RCA: 79] [Impact Index Per Article: 9.9] [Received: 07/27/2016] [Accepted: 11/12/2016] [Indexed: 01/17/2023] Open
Abstract
The number of sequenced genomes is growing exponentially, profoundly shifting the bottleneck from data generation to genome interpretation. Traits are often used to characterize and distinguish bacteria and are likely a driving factor in microbial community composition, yet little is known about the traits of most microbes. We describe Traitar, the microbial trait analyzer, which is a fully automated software package for deriving phenotypes from a genome sequence. Traitar provides phenotype classifiers to predict 67 traits related to the use of various substrates as carbon and energy sources, oxygen requirement, morphology, antibiotic susceptibility, proteolysis, and enzymatic activities. Furthermore, it suggests protein families associated with the presence of particular phenotypes. Our method uses L1-regularized L2-loss support vector machines for phenotype assignments based on phyletic patterns of protein families and their evolutionary histories across a diverse set of microbial species. We demonstrate that Traitar reliably assigns phenotypes to bacterial genomes from 572 species of eight phyla, including incomplete single-cell genomes and simulated draft genomes. We also showcase its application in metagenomics by verifying and complementing a manual metabolic reconstruction of two novel Clostridiales species based on draft genomes recovered from commercial biogas reactors. Traitar is available at https://github.com/hzi-bifo/traitar.
IMPORTANCE: Bacteria are ubiquitous in our ecosystem and have a major impact on human health, e.g., by supporting digestion in the human gut. Bacterial communities can also aid in biotechnological processes such as wastewater treatment or decontamination of polluted soils. Diverse bacteria contribute with their unique capabilities to the functioning of such ecosystems, but lab experiments to investigate those capabilities are labor-intensive. Major advances in sequencing techniques open up the opportunity to study bacteria by their genome sequences. For this purpose, we have developed Traitar, software that predicts traits of bacteria on the basis of their genomes. It is applicable to studies with tens or hundreds of bacterial genomes. Traitar may help researchers in microbiology to pinpoint the traits of interest, reducing the amount of wet lab work required.
5
Brbić M, Piškorec M, Vidulin V, Kriško A, Šmuc T, Supek F. The landscape of microbial phenotypic traits and associated genes. Nucleic Acids Res 2016; 44:10074-10090. [PMID: 27915291 PMCID: PMC5137458 DOI: 10.1093/nar/gkw964] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Received: 06/12/2016] [Revised: 09/21/2016] [Accepted: 10/11/2016] [Indexed: 12/31/2022] Open
Abstract
Bacteria and Archaea display a variety of phenotypic traits and can adapt to diverse ecological niches. However, systematic annotation of prokaryotic phenotypes is lacking. We have therefore developed ProTraits, a resource containing ∼545 000 novel phenotype inferences, spanning 424 traits assigned to 3046 bacterial and archaeal species. These annotations were assigned by a computational pipeline that associates microbes with phenotypes by text-mining the scientific literature and the broader World Wide Web, while also being able to define novel concepts from unstructured text. Moreover, the ProTraits pipeline assigns phenotypes by drawing extensively on comparative genomics, capturing patterns in gene repertoires, codon usage biases, proteome composition and co-occurrence in metagenomes. Notably, we find that gene synteny is highly predictive of many phenotypes, and highlight examples of gene neighborhoods associated with spore-forming ability. A global analysis of trait interrelatedness outlined clusters in the microbial phenotype network, suggesting common genetic underpinnings. Our extended set of phenotype annotations allows detection of 57 088 high confidence gene-trait links, which recover many known associations involving sporulation, flagella, catalase activity, aerobicity, photosynthesis and other traits. Over 99% of the commonly occurring gene families are involved in genetic interactions conditional on at least one phenotype, suggesting that epistasis has a major role in shaping microbial gene content.
Affiliation(s)
- Maria Brbić
- Division of Electronics, Ruder Boskovic Institute, 10000 Zagreb, Croatia
- Matija Piškorec
- Division of Electronics, Ruder Boskovic Institute, 10000 Zagreb, Croatia
- Vedrana Vidulin
- Division of Electronics, Ruder Boskovic Institute, 10000 Zagreb, Croatia
- Anita Kriško
- Mediterranean Institute of Life Sciences, 21000 Split, Croatia
- Tomislav Šmuc
- Division of Electronics, Ruder Boskovic Institute, 10000 Zagreb, Croatia
- Fran Supek
- Division of Electronics, Ruder Boskovic Institute, 10000 Zagreb, Croatia; EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain
6
Turaev D, Rattei T. High definition for systems biology of microbial communities: metagenomics gets genome-centric and strain-resolved. Curr Opin Biotechnol 2016; 39:174-181. [PMID: 27115497 DOI: 10.1016/j.copbio.2016.04.011] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 11/09/2015] [Revised: 04/08/2016] [Accepted: 04/12/2016] [Indexed: 11/28/2022]
Abstract
The systems biology of microbial communities, organismal communities inhabiting all ecological niches on earth, has in recent years been strongly facilitated by the rapid development of experimental, sequencing and data analysis methods. Novel experimental approaches and binning methods in metagenomics render the semi-automatic reconstructions of near-complete genomes of uncultivable bacteria possible, while advances in high-resolution amplicon analysis allow for efficient and less biased taxonomic community characterization. This will also facilitate predictive modeling approaches, hitherto limited by the low resolution of metagenomic data. In this review, we pinpoint the most promising current developments in metagenomics. They facilitate microbial systems biology towards a systemic understanding of mechanisms in microbial communities with scopes of application in many areas of our daily life.
Affiliation(s)
- Dmitrij Turaev
- Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria
- Thomas Rattei
- Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria
7
Eichinger V, Nussbaumer T, Platzer A, Jehl MA, Arnold R, Rattei T. EffectiveDB--updates and novel features for a better annotation of bacterial secreted proteins and Type III, IV, VI secretion systems. Nucleic Acids Res 2015; 44:D669-74. [PMID: 26590402 PMCID: PMC4702896 DOI: 10.1093/nar/gkv1269] [Citation(s) in RCA: 115] [Impact Index Per Article: 12.8] [Received: 09/16/2015] [Accepted: 11/03/2015] [Indexed: 11/17/2022] Open
Abstract
Protein secretion systems play a key role in the interaction of bacteria and hosts. EffectiveDB (http://effectivedb.org) contains pre-calculated predictions of bacterial secreted proteins and of intact secretion systems. Here we describe a major update of the database, which was previously featured in the NAR Database Issue. EffectiveDB bundles various tools to recognize Type III secretion signals, conserved binding sites of Type III chaperones, Type IV secretion peptides, eukaryotic-like domains and subcellular targeting signals in the host. Beyond the analysis of arbitrary protein sequence collections, the new release of EffectiveDB also provides a ‘genome-mode’, in which protein sequences from nearly complete genomes or metagenomic bins can be screened for the presence of three important secretion systems (Type III, IV, VI). EffectiveDB contains pre-calculated predictions for currently 1677 bacterial genomes from the EggNOG 4.0 database and for additional bacterial genomes from NCBI RefSeq. The new, user-friendly and informative web portal offers a submission tool for running the EffectiveDB prediction tools on user-provided data.
Affiliation(s)
- Valerie Eichinger
- Division of Computational System Biology, Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria
- Thomas Nussbaumer
- Division of Computational System Biology, Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria
- Alexander Platzer
- Division of Computational System Biology, Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria
- Marc-André Jehl
- Division of Computational System Biology, Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria
- Roland Arnold
- Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario M5G 1X8, Canada
- Thomas Rattei
- Division of Computational System Biology, Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria
8
Feldbauer R, Schulz F, Horn M, Rattei T. Prediction of microbial phenotypes based on comparative genomics. BMC Bioinformatics 2015; 16 Suppl 14:S1. [PMID: 26451672 PMCID: PMC4603748 DOI: 10.1186/1471-2105-16-s14-s1] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Indexed: 01/18/2023] Open
Abstract
The accessibility of almost complete genome sequences of uncultivable microbial species from metagenomes necessitates computational methods predicting microbial phenotypes solely based on genomic data. Here we investigate how comparative genomics can be utilized for the prediction of microbial phenotypes. The PICA framework facilitates application and comparison of different machine learning techniques for phenotypic trait prediction. We have improved and extended PICA's support vector machine plug-in and suggest its applicability to large-scale genome databases and incomplete genome sequences. We have demonstrated the stability of the predictive power for phenotypic traits, not perturbed by the rapid growth of genome databases. A new software tool facilitates the in-depth analysis of phenotype models, which associate expected and unexpected protein functions with particular traits. Most of the traits can be reliably predicted in only 60-70% complete genomes. We have established a new phenotypic model that predicts intracellular microorganisms. Thereby we could demonstrate that also independently evolved phenotypic traits, characterized by genome reduction, can be reliably predicted based on comparative genomics. Our results suggest that the extended PICA framework can be used to automatically annotate phenotypes in near-complete microbial genome sequences, as generated in large numbers in current metagenomics studies.
9
Liu R, France B, George S, Rallo R, Zhang H, Xia T, Nel AE, Bradley K, Cohen Y. Association rule mining of cellular responses induced by metal and metal oxide nanoparticles. Analyst 2014; 139:943-53. [PMID: 24260774 DOI: 10.1039/c3an01409f] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 12/12/2022]
Abstract
Relationships among fourteen different biological responses (including ten signaling pathway activities and four cytotoxicity effects) of murine macrophage (RAW264.7) and bronchial epithelial (BEAS-2B) cells exposed to six metal and metal oxide nanoparticles (NPs) were analyzed using both statistical and data mining approaches. Both the pathway activities and cytotoxicity effects were assessed using high-throughput screening (HTS) over an exposure period of up to 24 h and concentration range of 0.39-200 mg L⁻¹. HTS data were processed by outlier removal, normalization, and hit-identification (for significantly regulated cellular responses) to arrive at reliable multiparametric bioactivity profiles for the NPs. Association rule mining was then applied to the bioactivity profiles followed by a pruning process to remove redundant rules. The non-redundant association rules indicated that "significant regulation" of one or more cellular responses implies regulation of other (associated) cellular response types. Pairwise correlation analysis (via Pearson's χ² test) and self-organizing map clustering of the different cellular response types indicated consistency with the identified non-redundant association rules. Furthermore, in order to explore the potential use of association rules as a tool for data-driven hypothesis generation, specific pathway activity experiments were carried out for ZnO NPs. The experimental results confirmed the association rule identified for the p53 pathway and mitochondrial superoxide levels (via MitoSox reagent) and further revealed that blocking of the transcriptional activity of p53 lowered the MitoSox signal. The present approach of using association rule mining for data-driven hypothesis generation has important implications for streamlining multi-parameter HTS assays, improving the understanding of NP toxicity mechanisms, and selection of endpoints for the development of nanomaterial structure-activity relationships.
Affiliation(s)
- Rong Liu
- Institute of the Environment and Sustainability, University of California, Los Angeles, CA 90095, USA.
10
Franco-Duarte R, Mendes I, Umek L, Drumonde-Neves J, Zupan B, Schuller D. Computational models reveal genotype-phenotype associations in Saccharomyces cerevisiae. Yeast 2014; 31:265-77. [PMID: 24752995 DOI: 10.1002/yea.3016] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 11/26/2013] [Revised: 04/09/2014] [Accepted: 04/10/2014] [Indexed: 11/11/2022] Open
Abstract
Genome sequencing is essential to understand individual variation and to study the mechanisms that explain relations between genotype and phenotype. The accumulated knowledge from large-scale genome sequencing projects of Saccharomyces cerevisiae isolates is being used to study the mechanisms that explain such relations. Our objective was to undertake genetic characterization of 172 S. cerevisiae strains from different geographical origins and technological groups, using 11 polymorphic microsatellites, and computationally relate these data with the results of 30 phenotypic tests. Genetic characterization revealed 280 alleles, with the microsatellite ScAAT1 contributing most to intrastrain variability, together with alleles 20, 9 and 16 from the microsatellites ScAAT4, ScAAT5 and ScAAT6. These microsatellite allelic profiles are characteristic for both the phenotype and origin of yeast strains. We confirm the strength of these associations by construction and cross-validation of computational models that can predict the technological application and origin of a strain from the microsatellite allelic profile. Associations between microsatellites and specific phenotypes were scored using information gain ratios, and significant findings were confirmed by permutation tests and estimation of false discovery rates. The phenotypes associated with the highest number of alleles were the capacity to resist sulphur dioxide (tested by the capacity to grow in the presence of potassium bisulphite) and the presence of galactosidase activity. Our study demonstrates the utility of computational modelling to estimate a strain's technological group and phenotype from microsatellite allelic combinations as a tool for preliminary yeast strain selection.
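The abstract above mentions scoring allele-phenotype associations with information gain ratios. As a rough illustration only, and not the authors' implementation, the sketch below computes a gain ratio for a binary allele indicator against a phenotype label. The function names and the toy data are hypothetical, and the permutation test and false discovery rate estimation described in the abstract are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(feature, target):
    """Information gain of target given feature, divided by the feature's own entropy."""
    total = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    # Expected entropy of the target after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(target) - conditional
    split_info = entropy(feature)
    # A feature with a single value carries no split information.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: allele present (1) or absent (0) per strain,
# and the outcome of one phenotypic test per strain.
allele = [1, 1, 1, 0, 0, 0, 1, 0]
growth = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
unrelated = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]

print(gain_ratio(allele, growth))     # allele fully determines the phenotype
print(gain_ratio(allele, unrelated))  # allele carries no information
```

Dividing by the feature's own entropy penalizes features with many distinct values, which matters for microsatellites with many alleles.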
Affiliation(s)
- Ricardo Franco-Duarte
- Centre of Molecular and Environmental Biology (CBMA), Department of Biology, University of Minho, Braga, Portugal
11
Abstract
The constantly increasing volume and complexity of available biological data requires new methods for their management and analysis. An important challenge is the integration of information from different sources in order to discover possible hidden relations between already known data. In this paper we introduce a data mining approach which relates biological ontologies by mining cross and intra-ontology pairwise generalized association rules. Its advantage is sensitivity to rare associations, for these are important for biologists. We propose a new class of interestingness measures designed for hierarchically organized rules. These measures allow one to select the most important rules and to take into account rare cases. They favor rules with an actual interestingness value that exceeds the expected value. The latter is calculated taking into account the parent rule. We demonstrate this approach by applying it to the analysis of data from Gene Ontology and GPCR databases. Our objective is to discover interesting relations between two different ontologies or parts of a single ontology. The association rules that are thus discovered can provide the user with new knowledge about underlying biological processes or help improve annotation consistency. The obtained results show that produced rules represent meaningful and quite reliable associations.
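For orientation, the standard measures that association rule mining builds on (support, confidence and lift) can be sketched as follows. This is a minimal illustration over hypothetical annotation sets. It does not implement the hierarchy-aware interestingness measures proposed in the paper, whose exact form is not given in the abstract; `rule_metrics` and the placeholder term names are assumptions made for the example.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    # Lift compares the rule's confidence with the consequent's baseline frequency.
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical annotation sets, one per gene product (term names are placeholders).
annotations = [
    {"term_A", "term_B"},
    {"term_A", "term_B", "term_C"},
    {"term_A"},
    {"term_C"},
    {"term_B", "term_C"},
]

support, confidence, lift = rule_metrics(annotations, {"term_A"}, {"term_B"})
print(support, confidence, lift)
```

A lift above 1 means the consequent occurs more often alongside the antecedent than its overall frequency would suggest. Rare rules can have high lift despite low support, which is why measures sensitive to rare associations are of interest.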
12
Boon E, Meehan CJ, Whidden C, Wong DHJ, Langille MGI, Beiko RG. Interactions in the microbiome: communities of organisms and communities of genes. FEMS Microbiol Rev 2014; 38:90-118. [PMID: 23909933 PMCID: PMC4298764 DOI: 10.1111/1574-6976.12035] [Citation(s) in RCA: 119] [Impact Index Per Article: 11.9] [Received: 04/23/2013] [Revised: 07/02/2013] [Accepted: 07/10/2013] [Indexed: 12/17/2022] Open
Abstract
A central challenge in microbial community ecology is the delineation of appropriate units of biodiversity, which can be taxonomic, phylogenetic, or functional in nature. The term 'community' is applied ambiguously; in some cases, the term refers simply to a set of observed entities, while in other cases, it requires that these entities interact with one another. Microorganisms can rapidly gain and lose genes, potentially decoupling community roles from taxonomic and phylogenetic groupings. Trait-based approaches offer a useful alternative, but many traits can be defined based on gene functions, metabolic modules, and genomic properties, and the optimal set of traits to choose is often not obvious. An analysis that considers taxon assignment and traits in concert may be ideal, with the strengths of each approach offsetting the weaknesses of the other. Individual genes also merit consideration as entities in an ecological analysis, with characteristics such as diversity, turnover, and interactions modeled using genes rather than organisms as entities. We identify some promising avenues of research that are likely to yield a deeper understanding of microbial communities that shift from observation-based questions of 'Who is there?' and 'What are they doing?' to the mechanistically driven question of 'How will they respond?'
Affiliation(s)
- Eva Boon
- Department of Biology, Dalhousie University, Halifax, NS, Canada
13
Yu P, Wild DJ. Discovering associations in biomedical datasets by link-based associative classifier (LAC). PLoS One 2012; 7:e51018. [PMID: 23227228 PMCID: PMC3515483 DOI: 10.1371/journal.pone.0051018] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 04/18/2012] [Accepted: 10/31/2012] [Indexed: 11/21/2022] Open
Abstract
Associative classification mining (ACM) can be used to provide predictive models with high accuracy as well as interpretability. However, traditional ACM ignores the difference of significances among the features used for mining. Although weighted associative classification mining (WACM) addresses this issue by assigning different weights to features, most implementations can only be utilized when pre-assigned weights are available. In this paper, we propose a link-based approach to automatically derive weight information from a dataset using link-based models which treat the dataset as a bipartite model. By combining this link-based feature weighting method with a traditional ACM method–classification based on associations (CBA), a Link-based Associative Classifier (LAC) is developed. We then demonstrate the application of LAC to biomedical datasets for association discovery between chemical compounds and bioactivities or diseases. The results indicate that the novel link-based weighting method is comparable to support vector machine (SVM) and RELIEF method, and is capable of capturing significant features. Additionally, LAC is shown to produce models with high accuracies and discover interesting associations which may otherwise remain unrevealed by traditional ACM.
Affiliation(s)
- Pulan Yu
- School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States of America
- David J. Wild
- School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States of America
14
Irsoy O, Yildiz OT, Alpaydin E. Design and analysis of classifier learning experiments in bioinformatics: survey and case studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1663-1675. [PMID: 22908127 DOI: 10.1109/tcbb.2012.117] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 06/01/2023]
Abstract
In many bioinformatics applications, it is important to assess and compare the performance of algorithms trained from data, so that the conclusions drawn are unaffected by chance and are therefore significant. Both the design of such experiments and the analysis of the resulting data using statistical tests should be done carefully for the results to carry significance. In this paper, we first review the performance measures used in classification, the basics of experiment design and statistical tests. We then give the results of our survey of over 1,500 papers published in the last two years in three bioinformatics journals (including this one). Although the basics of experiment design are well understood, such as resampling instead of using a single training set and the use of different performance metrics instead of error, only 21 percent of the papers use any statistical test for comparison. In the third part, we analyze four different scenarios which we encounter frequently in the bioinformatics literature, discussing the proper statistical methodology as well as showing an example case study for each. With the supplementary software, we hope that the guidelines we discuss will play an important role in future studies.
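One common way to compare two classifiers evaluated on the same resampling folds is a paired t-test on the per-fold score differences. The sketch below is a generic illustration under that assumption, not a procedure taken from the paper or its supplementary software. The function name and the accuracy values are hypothetical, and the result would still need to be compared against a Student's t distribution with the returned degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-fold scores of two classifiers."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both classifiers must be scored on the same folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (k - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(k)), k - 1


# Hypothetical accuracies of two classifiers on the same five folds.
acc_a = [0.82, 0.85, 0.80, 0.84, 0.83]
acc_b = [0.80, 0.82, 0.79, 0.80, 0.81]

t_stat, dof = paired_t_statistic(acc_a, acc_b)
print(t_stat, dof)
```

Folds in cross-validation share training data, so the differences are not fully independent and this test can be optimistic. That is one reason careful experiment design and test selection matter.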
Affiliation(s)
- Ozan Irsoy
- Department of Computer Engineering, Boğaziçi University, Bebek 34342, Istanbul, Turkey.