1
|
Conformal Prediction with Orange. J Stat Softw 2021. [DOI: 10.18637/jss.v098.i07] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022] Open
|
2
|
Low prevalence of active COVID-19 in Slovenia: a nationwide population study of a probability-based sample. Clin Microbiol Infect 2020; 26:1514-1519. [PMID: 32688068 PMCID: PMC7367804 DOI: 10.1016/j.cmi.2020.07.013] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 06/12/2020] [Revised: 07/08/2020] [Accepted: 07/10/2020] [Indexed: 12/03/2022]
Abstract
Objectives Accurate population-level assessment of the coronavirus disease 2019 (COVID-19) burden is fundamental for navigating the path forward during the ongoing pandemic, but current knowledge is scant. We conducted the first nationwide population study using a probability-based sample to assess active severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, combined with a longitudinal follow-up of the entire cohort over the next 6 months. Baseline SARS-CoV-2 RNA testing results and the first 3-week follow-up results are presented. Methods A probability-based sample of the Slovenian population comprising data from 2.1 million people was selected from the Central Population Register (n = 3000). SARS-CoV-2 RNA was detected in nasopharyngeal samples using the cobas 6800 SARS-CoV-2 assay. Each participant filled in a detailed baseline questionnaire with basic sociodemographic data and detailed medical history compatible with COVID-19. After 3 weeks, participants were interviewed for the presence of COVID-19–compatible clinical symptoms and signs, including in household members, and offered immediate testing for SARS-CoV-2 RNA if indicated. Results A total of 1368 individuals (46%) consented to participate and completed the questionnaire. Two of 1366 participants tested positive for SARS-CoV-2 RNA (prevalence 0.15%; posterior mean 0.18%, 95% Bayesian confidence interval 0.03–0.47; 95% highest density region (HDR) 0.01–0.41). No newly diagnosed infections occurred in the cohort during the first 3-week follow-up round. Conclusions The low prevalence of active COVID-19 infections found in this study accurately predicted the dynamics of the epidemic in Slovenia over the subsequent month. Properly designed and timely executed studies using probability-based samples combined with routine target-testing figures provide reliable data that can be used to make informed decisions on relaxing or strengthening disease mitigation strategies.
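The posterior summary reported in this abstract can be approximated with a conjugate Beta-Binomial update. The sketch below is an illustration only, not the authors' code. The abstract does not name the prior, so a Jeffreys prior Beta(0.5, 0.5) is assumed here; with that assumption the posterior mean rounds to the reported 0.18%. The function names are hypothetical, and the equal-tailed interval is obtained by simple midpoint-rule integration rather than a statistics library.

```python
import math

POSITIVES = 2      # positive tests reported in the abstract
TESTED = 1366      # participants tested

# Assumption: Jeffreys prior Beta(0.5, 0.5); the abstract does not state the prior.
PRIOR_A, PRIOR_B = 0.5, 0.5


def beta_log_pdf(x, a, b):
    """Log density of Beta(a, b) at 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x)


def beta_quantile(q, a, b, upper=0.02, steps=200_000):
    """Approximate the q-quantile by midpoint-rule integration on (0, upper).

    The range (0, 0.02) is chosen because virtually all posterior mass
    lies there for these counts; it is not a general-purpose setting.
    """
    width = upper / steps
    cumulative = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        cumulative += math.exp(beta_log_pdf(x, a, b)) * width
        if cumulative >= q:
            return x
    raise ValueError("quantile lies above the integration range")


# Conjugate update: add positives to a, negatives to b.
post_a = PRIOR_A + POSITIVES
post_b = PRIOR_B + (TESTED - POSITIVES)

raw_prevalence = POSITIVES / TESTED
posterior_mean = post_a / (post_a + post_b)
ci_low = beta_quantile(0.025, post_a, post_b)
ci_high = beta_quantile(0.975, post_a, post_b)

print(f"raw prevalence: {100 * raw_prevalence:.2f}%")
print(f"posterior mean: {100 * posterior_mean:.2f}%")
print(f"95% equal-tailed interval: {100 * ci_low:.2f}% to {100 * ci_high:.2f}%")
```

The highest density region quoted in the abstract is a different interval (the narrowest one containing 95% of the posterior mass) and is not computed here.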
|
3
|
Abstract
BACKGROUND Personalized, precision, P4, or stratified medicine is understood as a medical approach in which patients are stratified based on their disease subtype, risk, prognosis, or treatment response using specialized diagnostic tests. The key idea is to base medical decisions on individual patient characteristics, including molecular and behavioral biomarkers, rather than on population averages. Personalized medicine is deeply connected to and dependent on data science, specifically machine learning (often named Artificial Intelligence in the mainstream media). While during recent years there has been a lot of enthusiasm about the potential of 'big data' and machine learning-based solutions, there exist only a few examples that impact current clinical practice. The lack of impact on clinical practice can largely be attributed to insufficient performance of predictive models, difficulties in interpreting complex model predictions, and lack of validation via prospective clinical trials that demonstrate a clear benefit compared to the standard of care. In this paper, we review the potential of state-of-the-art data science approaches for personalized medicine, discuss open challenges, and highlight directions that may help to overcome them in the future. CONCLUSIONS There is a need for an interdisciplinary effort, including data scientists, physicians, patient advocates, regulatory agencies, and health insurance organizations. Partially unrealistic expectations and concerns about data science-based solutions need to be better managed. In parallel, computational methods must advance further to provide direct benefit to clinical practice.
|
4
|
Predicting Patient’s Long-Term Clinical Status after Hip Arthroplasty Using Hierarchical Decision Modelling and Data Mining. Methods Inf Med 2018. [DOI: 10.1055/s-0038-1634460] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/17/2022]
Abstract
Construction of a prognostic model is presented for the long-term outcome after femoral neck fracture treatment with implantation of hip endoprosthesis. While the model is induced from the follow-up data, we show that the use of additional expert knowledge is absolutely crucial to obtain good predictive accuracy. A schema is proposed where domain knowledge is encoded as a hierarchical decision model of which only a part is induced from the data while the rest is specified by the expert. Although applied to the hip endoprosthesis domain, the proposed schema is general and can be used for the construction of other prognostic models where both follow-up data and human expertise are available.
|
5
|
A comprehensive structural, biochemical and biological profiling of the human NUDIX hydrolase family. Nat Commun 2017; 8:1541. [PMID: 29142246 PMCID: PMC5688067 DOI: 10.1038/s41467-017-01642-w] [Citation(s) in RCA: 86] [Impact Index Per Article: 12.3] [Received: 10/12/2016] [Accepted: 10/06/2017] [Indexed: 01/04/2023] Open
Abstract
The NUDIX enzymes are involved in cellular metabolism and homeostasis, as well as mRNA processing. Although highly conserved throughout all organisms, their biological roles and biochemical redundancies remain largely unclear. To address this, we globally resolve their individual properties and inter-relationships. We purify 18 of the human NUDIX proteins and screen 52 substrates, providing a substrate redundancy map. Using crystal structures, we generate sequence alignment analyses revealing four major structural classes. To a certain extent, their substrate preference redundancies correlate with structural classes, thus linking structure and activity relationships. To elucidate interdependence among the NUDIX hydrolases, we pairwise deplete them generating an epistatic interaction map, evaluate cell cycle perturbations upon knockdown in normal and cancer cells, and analyse their protein and mRNA expression in normal and cancer tissues. Using a novel FUSION algorithm, we integrate all data creating a comprehensive NUDIX enzyme profile map, which will prove fundamental to understanding their biological functionality.
|
6
|
Abstract
Motivation: The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene–disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling. Results: We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous datasets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene–disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics. Availability and implementation: Source code is at http://github.com/marinkaz/medusa Contact: marinka@cs.stanford.edu, blaz.zupan@fri.uni-lj.si
|
7
|
dictyExpress: a web-based platform for sequence data management and analytics in Dictyostelium and beyond. BMC Bioinformatics 2017; 18:291. [PMID: 28578698 PMCID: PMC5457571 DOI: 10.1186/s12859-017-1706-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 11/30/2016] [Accepted: 05/23/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Dictyostelium discoideum, a soil-dwelling social amoeba, is a model for the study of numerous biological processes. Research in the field has benefited mightily from the adoption of next-generation sequencing for genomics and transcriptomics. Dictyostelium biologists now face the widespread challenges of analyzing and exploring high dimensional data sets to generate hypotheses and discover novel insights. RESULTS We present dictyExpress (2.0), a web application designed for exploratory analysis of gene expression data, as well as data from related experiments such as Chromatin Immunoprecipitation sequencing (ChIP-Seq). The application features visualization modules that include time course expression profiles, clustering, gene ontology enrichment analysis, differential expression analysis and comparison of experiments. All visualizations are interactive and interconnected, such that the selection of genes in one module propagates instantly to visualizations in other modules. dictyExpress currently stores the data from over 800 Dictyostelium experiments and is embedded within a general-purpose software framework for management of next-generation sequencing data. dictyExpress allows users to explore their data in a broader context by reciprocal linking with dictyBase, a repository of Dictyostelium genomic data. In addition, we introduce a companion application called GenBoard, an intuitive graphic user interface for data management and bioinformatics analysis. CONCLUSIONS dictyExpress and GenBoard enable broad adoption of next-generation sequencing-based inquiries by the Dictyostelium research community. Labs without the means to undertake deep sequencing projects can mine the data available to the public. The entire information flow, from raw sequence data to hypothesis testing, can be accomplished in an efficient workspace. The software framework is generalizable and represents a useful approach for any research community. To encourage wider usage, the backend is open-source, available for extension and further development by bioinformaticians and data scientists.
|
8
|
Programming social behavior by the maternal fragile X protein. GENES, BRAIN, AND BEHAVIOR 2016; 15:578-87. [PMID: 27198123 PMCID: PMC9879598 DOI: 10.1111/gbb.12298] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 12/30/2015] [Revised: 05/11/2016] [Accepted: 05/12/2016] [Indexed: 01/28/2023]
Abstract
The developing fetus and neonate are highly sensitive to maternal environment. Besides the well-documented effects of maternal stress, nutrition and infections, maternal mutations, by altering the fetal, perinatal and/or early postnatal environment, can impact the behavior of genetically normal offspring. Mutation/premutation in the X-linked FMR1 (encoding the translational regulator FMRP) in females, although primarily responsible for causing fragile X syndrome (FXS) in their children, may also elicit such maternal effects. We showed that a deficit in maternal FMRP in mice results in hyperactivity in the genetically normal offspring. To test if maternal FMRP has a broader intergenerational effect, we measured social behavior, a core dimension of neurodevelopmental disorders, in offspring of FMRP-deficient dams. We found that male offspring of Fmr1(+/-) mothers, independent of their own Fmr1 genotype, exhibit increased approach and reduced avoidance toward conspecific strangers, reminiscent of 'indiscriminate friendliness' or the lack of stranger anxiety, diagnosed in neglected children and in patients with Asperger's and Williams syndrome. Furthermore, social interaction failed to activate mesolimbic/amygdala regions, encoding social aversion, in these mice, providing a neurobiological basis for the behavioral abnormality. This work identifies a novel role for FMRP that extends its function beyond the well-established genetic function into intergenerational non-genetic inheritance/programming of social behavior and the corresponding neuronal circuit. As FXS premutation and some psychiatric conditions that can be associated with reduced FMRP expression are more prevalent in mothers than full FMR1 mutation, our findings potentially broaden the significance of FMRP-dependent programming of social behavior beyond the FXS population.
|
9
|
ISDN2014_0324: Non‐genetic transmission of abnormal social phenotype in a mouse model of FXS. Int J Dev Neurosci 2015. [DOI: 10.1016/j.ijdevneu.2015.04.271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022] Open
|
10
|
Leaps and lulls in the developmental transcriptome of Dictyostelium discoideum. BMC Genomics 2015; 16:294. [PMID: 25887420 PMCID: PMC4403905 DOI: 10.1186/s12864-015-1491-7] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Received: 10/09/2014] [Accepted: 03/26/2015] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Development of the soil amoeba Dictyostelium discoideum is triggered by starvation. When placed on a solid substrate, the starving solitary amoebae cease growth, communicate via extracellular cAMP, aggregate by tens of thousands and develop into multicellular organisms. Early phases of the developmental program are often studied in cells starved in suspension while cAMP is provided exogenously. Previous studies revealed massive shifts in the transcriptome under both developmental conditions and a close relationship between gene expression and morphogenesis, but were limited by the sampling frequency and the resolution of the methods. RESULTS Here, we combine the superior depth and specificity of RNA-seq-based analysis of mRNA abundance with high frequency sampling during filter development and cAMP pulsing in suspension. We found that the developmental transcriptome exhibits mostly gradual changes interspersed by a few instances of large shifts. For each time point we treated the entire transcriptome as a single phenotype, and were able to characterize development as groups of similar time points separated by gaps. The grouped time points represented gradual changes in mRNA abundance, or molecular phenotype, and the gaps represented times during which many genes are differentially expressed rapidly, and thus the phenotype changes dramatically. Comparing developmental experiments revealed that gene expression in filter-developed cells lagged behind that in cells treated with exogenous cAMP in suspension. The high sampling frequency revealed many genes whose regulation is reproducibly more complex than indicated by previous studies. Gene Ontology enrichment analysis suggested that the transition to multicellularity coincided with rapid accumulation of transcripts associated with DNA processes and mitosis. Later development included the up-regulation of organic signaling molecules and co-factor biosynthesis. Our analysis also demonstrated a high level of synchrony among the developing structures throughout development. CONCLUSIONS Our data describe D. discoideum development as a series of coordinated cellular and multicellular activities. Coordination occurred within fields of aggregating cells and among multicellular bodies, such as mounds or migratory slugs that experience both cell-cell contact and various soluble signaling regimes. These time courses, sampled at the highest temporal resolution to date in this system, provide a comprehensive resource for studies of developmental gene expression.
|
11
|
Genome-Wide Localization Study of Yeast Pex11 Identifies Peroxisome-Mitochondria Interactions through the ERMES Complex. J Mol Biol 2015; 427:2072-87. [PMID: 25769804 PMCID: PMC4429955 DOI: 10.1016/j.jmb.2015.03.004] [Citation(s) in RCA: 104] [Impact Index Per Article: 11.6] [Received: 10/31/2014] [Revised: 03/01/2015] [Accepted: 03/04/2015] [Indexed: 11/29/2022]
Abstract
Pex11 is a peroxin that regulates the number of peroxisomes in eukaryotic cells. Recently, it was found that a mutation in one of the three mammalian paralogs, PEX11β, results in a neurological disorder. The molecular function of Pex11, however, is not known. Saccharomyces cerevisiae Pex11 has been shown to recruit to peroxisomes the mitochondrial fission machinery, thus enabling proliferation of peroxisomes. This process is essential for efficient fatty acid β-oxidation. In this study, we used high-content microscopy on a genome-wide scale to determine the subcellular localization pattern of yeast Pex11 in all non-essential gene deletion mutants, as well as in temperature-sensitive essential gene mutants. Pex11 localization and morphology of peroxisomes were profoundly affected by mutations in 104 different genes that were functionally classified. A group of genes encompassing MDM10, MDM12 and MDM34 that encode the mitochondrial and cytosolic components of the ERMES complex was analyzed in greater detail. Deletion of these genes caused a specifically altered Pex11 localization pattern, whereas deletion of MMM1, the gene encoding the fourth, endoplasmic-reticulum-associated component of the complex, did not result in an altered Pex11 localization or peroxisome morphology phenotype. Moreover, we found that Pex11 and Mdm34 physically interact and that Pex11 plays a role in establishing the contact sites between peroxisomes and mitochondria through the ERMES complex. Based on these results, we propose that the mitochondrial/cytosolic components of the ERMES complex establish a direct interaction between mitochondria and peroxisomes through Pex11. Molecular function of Pex11, a protein with roles in metabolism and disease, is unknown. Genome-wide screening determined subcellular localization of Pex11-GFP in yeast. Mutants defective in components of the ERMES complex show altered Pex11 localization. Pex11 physically interacts with the ERMES complex component Mdm34. ERMES complex and Pex11 mediate interaction between mitochondria and peroxisomes.
|
12
|
Heterogeneous computing architecture for fast detection of SNP-SNP interactions. BMC Bioinformatics 2014; 15:216. [PMID: 24964802 PMCID: PMC4230497 DOI: 10.1186/1471-2105-15-216] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 11/14/2013] [Accepted: 06/19/2014] [Indexed: 12/04/2022] Open
Abstract
Background The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphic Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. Results We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. Conclusions General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.
|
13
|
Computational models reveal genotype-phenotype associations in Saccharomyces cerevisiae. Yeast 2014; 31:265-77. [PMID: 24752995 DOI: 10.1002/yea.3016] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 11/26/2013] [Revised: 04/09/2014] [Accepted: 04/10/2014] [Indexed: 11/11/2022] Open
Abstract
Genome sequencing is essential to understand individual variation and to study the mechanisms that explain relations between genotype and phenotype. The accumulated knowledge from large-scale genome sequencing projects of Saccharomyces cerevisiae isolates is being used to study the mechanisms that explain such relations. Our objective was to undertake genetic characterization of 172 S. cerevisiae strains from different geographical origins and technological groups, using 11 polymorphic microsatellites, and computationally relate these data with the results of 30 phenotypic tests. Genetic characterization revealed 280 alleles, with the microsatellite ScAAT1 contributing most to intrastrain variability, together with alleles 20, 9 and 16 from the microsatellites ScAAT4, ScAAT5 and ScAAT6. These microsatellite allelic profiles are characteristic for both the phenotype and origin of yeast strains. We confirm the strength of these associations by construction and cross-validation of computational models that can predict the technological application and origin of a strain from the microsatellite allelic profile. Associations between microsatellites and specific phenotypes were scored using information gain ratios, and significant findings were confirmed by permutation tests and estimation of false discovery rates. The phenotypes associated with a higher number of alleles were the capacity to resist sulphur dioxide (tested by the capacity to grow in the presence of potassium bisulphite) and the presence of galactosidase activity. Our study demonstrates the utility of computational modelling to estimate a strain's technological group and phenotype from microsatellite allelic combinations as tools for preliminary yeast strain selection.
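The scoring step mentioned in this abstract (information gain ratio, checked by a permutation test) can be sketched as follows. This is a generic illustration rather than the authors' pipeline: the function names, the toy data and the choice of 1000 permutations are all assumptions, and the false discovery rate estimation is not covered.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(feature, target):
    """Information gain of target given feature, divided by the feature's entropy."""
    total = len(target)
    conditional = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += len(subset) / total * entropy(subset)
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(target) - conditional) / split_info


def permutation_p_value(feature, target, permutations=1000, seed=0):
    """Share of label shuffles scoring at least as high as the observed data."""
    rng = random.Random(seed)
    observed = gain_ratio(feature, target)
    shuffled = list(target)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if gain_ratio(feature, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (permutations + 1)


# Hypothetical toy data: presence of one allele vs. one phenotypic test result.
allele_present = [1, 1, 1, 1, 0, 0, 0, 0]
grows_in_bisulphite = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

score = gain_ratio(allele_present, grows_in_bisulphite)
p_value = permutation_p_value(allele_present, grows_in_bisulphite)
print(f"gain ratio = {score:.2f}, permutation p = {p_value:.3f}")
```

Dividing by the feature's own entropy penalises markers with many distinct values, which matters for microsatellites with many alleles.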
|
14
|
Integrative clustering by nonnegative matrix factorization can reveal coherent functional groups from gene profile data. IEEE J Biomed Health Inform 2014; 19:698-708. [PMID: 24733033 DOI: 10.1109/jbhi.2014.2316508] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Recent developments in molecular biology and techniques for genome-wide data acquisition have resulted in an abundance of data to profile genes and predict their function. These datasets may come from diverse sources and it is an open question how to address them jointly and fuse them into a joint prediction model. A prevailing technique to identify groups of related genes that exhibit similar profiles is profile-based clustering. Cluster inference may benefit from consensus across different clustering models. In this paper, we propose a technique that develops separate gene clusters from each of the available data sources and then fuses them by means of nonnegative matrix factorization. We use gene profile data on the budding yeast S. cerevisiae to demonstrate that this approach can successfully integrate heterogeneous datasets and yield high-quality clusters that could otherwise not be inferred by simply merging the gene profiles prior to clustering.
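One way to picture the fusion step is to stack the cluster memberships from each data source into a single nonnegative matrix and factorize it. The sketch below is a minimal illustration under that assumption; the abstract does not specify how the per-source clusterings are encoded or which NMF variant is used, so the indicator encoding, the standard multiplicative updates, the helper names and the toy labels are all hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize V ~ WH with multiplicative updates (Frobenius objective)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def indicator_columns(labels):
    """One 0/1 column per cluster found in a single data source."""
    clusters = sorted(set(labels))
    return [[1.0 if lab == c else 0.0 for c in clusters] for lab in labels]


# Hypothetical cluster labels for six genes from two data sources;
# the sources disagree about the third gene.
source_a = [0, 0, 0, 1, 1, 1]
source_b = [0, 0, 1, 1, 1, 1]

stacked = [row_a + row_b for row_a, row_b in
           zip(indicator_columns(source_a), indicator_columns(source_b))]
w, h = nmf(stacked, rank=2)

# Each gene joins the fused cluster with the largest weight in its row of W.
consensus = [max(range(2), key=lambda k: row[k]) for row in w]
print(consensus)
```

The result depends on the random initialisation, so in practice the factorization would typically be repeated from several seeds.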
|
15
|
OCT4 and the acquisition of oocyte developmental competence during folliculogenesis. THE INTERNATIONAL JOURNAL OF DEVELOPMENTAL BIOLOGY 2013; 56:853-8. [PMID: 23417407 DOI: 10.1387/ijdb.120174mz] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 10/27/2022]
Abstract
The role that the transcription factor OCT4 plays during oocyte growth is as yet unknown. In this review, we summarise the data on its potential role in the acquisition of oocyte developmental competence in the mouse. These studies describe the presence in MII oocytes and 2-cell embryos of an OCT4 transcriptional network that might be part of the molecular signature of maternal origin on which the inner cell mass and the embryonic stem cell-associated pluripotency is assembled and shaped. The Oct4-gene regulatory network thus provides a connection between eggs, early preimplantation embryos and embryonic stem cells.
|
16
|
Abstract
ATP-binding cassette (ABC) transporters can translocate a broad spectrum of molecules across the cell membrane including physiological cargo and toxins. ABC transporters are known for the role they play in resistance towards anticancer agents in chemotherapy of cancer patients. There are 68 ABC transporters annotated in the genome of the social amoeba Dictyostelium discoideum. We have characterized more than half of these ABC transporters through a systematic study of mutations in their genes. We have analyzed morphological and transcriptional phenotypes for these mutants during growth and development and found that most of the mutants exhibited rather subtle phenotypes. A few of the genes may share physiological functions, as reflected in their transcriptional phenotypes. Since most of the abc-transporter mutants showed subtle morphological phenotypes, we utilized these transcriptional phenotypes to identify genes that are important for development by looking for transcripts whose abundance was unperturbed in most of the mutants. We found a set of 668 genes that includes many validated D. discoideum developmental genes. We have also found that abcG6 and abcG18 may have potential roles in intercellular signaling during terminal differentiation of spores and stalks.
|
17
|
Computational models for prediction of yeast strain potential for winemaking from phenotypic profiles. PLoS One 2013; 8:e66523. [PMID: 23874393 PMCID: PMC3713011 DOI: 10.1371/journal.pone.0066523] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 01/27/2013] [Accepted: 05/06/2013] [Indexed: 11/29/2022] Open
Abstract
Saccharomyces cerevisiae strains from diverse natural habitats harbour a vast amount of phenotypic diversity, driven by interactions between yeast and the respective environment. In grape juice fermentations, strains are exposed to a wide array of biotic and abiotic stressors, which may lead to strain selection and generate naturally arising strain diversity. Certain phenotypes are of particular interest for the winemaking industry and could be identified by screening a large number of different strains. The objective of the present work was to use data mining approaches to identify those phenotypic tests that are most useful to predict a strain's potential for winemaking. We have constituted an S. cerevisiae collection comprising 172 strains of worldwide geographical origins or technological applications. Their phenotype was screened by considering 30 physiological traits that are important from an oenological point of view. Growth in the presence of potassium bisulphite, growth at 40°C, and resistance to ethanol contributed most to strain variability, as shown by the principal component analysis. In the hierarchical clustering of phenotypic profiles the strains isolated from the same wines and vineyards were scattered throughout all clusters, whereas commercial winemaking strains tended to co-cluster. The Mann-Whitney test revealed significant associations between phenotypic results and a strain's technological application or origin. A naïve Bayesian classifier identified 3 of the 30 phenotypic tests of growth in iprodion (0.05 mg/mL), cycloheximide (0.1 µg/mL) and potassium bisulphite (150 mg/mL) that provided most information for the assignment of a strain to the group of commercial strains. The probability of a strain being assigned to this group was 27% using the entire phenotypic profile and increased to 95% when only results from the three tests were considered. Results show the usefulness of computational approaches to simplify strain selection procedures.
|
18
|
Bacterial discrimination by dictyostelid amoebae reveals the complexity of ancient interspecies interactions. Curr Biol 2013; 23:862-72. [PMID: 23664307 DOI: 10.1016/j.cub.2013.04.034] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 01/03/2013] [Revised: 03/12/2013] [Accepted: 04/11/2013] [Indexed: 10/26/2022]
Abstract
BACKGROUND Amoebae and bacteria interact within predator-prey and host-pathogen relationships, but the general response of amoeba to bacteria is not well understood. The amoeba Dictyostelium discoideum feeds on, and is colonized by, diverse bacterial species, including Gram-positive [Gram(+)] and Gram-negative [Gram(-)] bacteria, two major groups of bacteria that differ in structure and macromolecular composition. RESULTS Transcriptional profiling of D. discoideum revealed sets of genes whose expression is enriched in amoebae interacting with different species of bacteria, including sets that appear specific to amoebae interacting with Gram(+) or with Gram(-) bacteria. In a genetic screen utilizing the growth of mutant amoebae on a variety of bacteria as a phenotypic readout, we identified amoebal genes that are only required for growth on Gram(+) bacteria, including one that encodes the cell-surface protein gp130, as well as several genes that are only required for growth on Gram(-) bacteria, including one that encodes a putative lysozyme, AlyL. These genes are required for parts of the transcriptional response of wild-type amoebae, and this allowed their classification into potential response pathways. CONCLUSIONS We have defined genes that are critical for amoebal survival during feeding on Gram(+), or Gram(-), bacteria that we propose form part of a regulatory network that allows D. discoideum to elicit specific cellular responses to different species of bacteria in order to optimize survival.
|
19
|
Interest propagation for knowledge extraction and representation. Stud Health Technol Inform 2013; 186:182-186. [PMID: 23542994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Due to the increasing number of available biomedical data repositories, providing a comprehensive and intuitive access to information is still a demanding task for Information Retrieval systems. In this work we present an interactive data exploration system that retrieves relevant information by propagating the user's interest within a network. The developed techniques have been applied to two different retrieval tasks useful for biomedical research: the prioritization of proteins related to a disease of interest and the search of publications in the literature. The method relies on a network of biomedical entities, scoring of entities of interest by the user, and score propagation. The assessment of the relevance of the retrieved information confirmed a high accuracy of the presented algorithms for both the domains considered.
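As a rough illustration of what score propagation over an entity network can look like, the sketch below spreads a user-assigned interest score from seed entities to their neighbours. The abstract does not specify the propagation rule, so the update used here (a damped, restart-style iteration) and all names (`propagate`, `damping`, the toy entities) are assumptions for illustration only, not the published algorithm.

```python
def propagate(graph, seeds, damping=0.5, iterations=20):
    """Spread interest scores over an undirected network (illustrative sketch).

    graph: dict mapping each node to a list of its neighbours (symmetric).
    seeds: dict mapping user-scored nodes to their interest score.
    """
    scores = dict.fromkeys(graph, 0.0)
    scores.update(seeds)
    for _ in range(iterations):
        updated = {}
        for node, neighbours in graph.items():
            # Each neighbour shares its score equally among its own neighbours.
            spread = sum(scores[n] / len(graph[n]) for n in neighbours)
            # Mix the user's original interest with the propagated score.
            updated[node] = (1 - damping) * seeds.get(node, 0.0) + damping * spread
        scores = updated
    return scores


# Hypothetical toy network: one disease of interest and four proteins.
graph = {
    "disease": ["proteinA", "proteinB"],
    "proteinA": ["disease", "proteinC"],
    "proteinB": ["disease"],
    "proteinC": ["proteinA"],
    "proteinD": [],  # disconnected, so it should receive no score
}
scores = propagate(graph, seeds={"disease": 1.0})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy network, entities closer to the scored disease rank higher, and the disconnected entity keeps a score of zero.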
|
20
|
Supporting regenerative medicine by integrative dimensionality reduction. Methods Inf Med 2012; 51:341-7. [PMID: 22773076 DOI: 10.3414/me11-02-0045] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/08/2011] [Accepted: 05/04/2012] [Indexed: 01/03/2023]
Abstract
OBJECTIVE The assessment of the developmental potential of stem cells is a crucial step towards their clinical application in regenerative medicine. It has been demonstrated that genome-wide expression profiles can predict the cellular differentiation stage by means of dimensionality reduction methods. Here we show that these techniques can be further strengthened to support decision making with i) a novel strategy for gene selection; ii) methods for combining the evidence from multiple data sets. METHODS We propose to exploit dimensionality reduction methods for the selection of genes specifically activated in different stages of differentiation. To obtain an integrated predictive model, the expression values of the selected genes from multiple data sets are combined. We investigated distinct approaches that either aggregate data sets or use learning ensembles. RESULTS We analyzed the performance of the proposed methods on six publicly available data sets. The selection procedure identified a reduced subset of genes whose expression values gave rise to an accurate stage prediction. The assessment of predictive accuracy demonstrated a high quality of predictions for most of the data integration methods presented. CONCLUSION The experimental results highlighted the main strengths of the proposed approaches. These include the ability to predict the true staging by combining multiple training data sets when this could not be inferred from a single data source, and to focus the analysis on a reduced list of genes of similar predictive performance.
|
21
|
Knowledge-based bioinformatics for the study of mammalian oocytes. Int J Dev Biol 2012; 56:859-66. [DOI: 10.1387/ijdb.120138fm] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/27/2022]
|
22
|
BzpF is a CREB-like transcription factor that regulates spore maturation and stability in Dictyostelium. Dev Biol 2011; 358:137-46. [PMID: 21810415 PMCID: PMC3180911 DOI: 10.1016/j.ydbio.2011.07.017] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 03/28/2011] [Revised: 07/08/2011] [Accepted: 07/13/2011] [Indexed: 12/31/2022]
Abstract
The cAMP response element-binding protein (CREB) is a highly conserved transcription factor that integrates signaling through the cAMP-dependent protein kinase A (PKA) in many eukaryotes. PKA plays a critical role in Dictyostelium development but no CREB homologue has been identified in this system. Here we show that Dictyostelium utilizes a CREB-like protein, BzpF, to integrate PKA signaling during late development. bzpF(-) mutants produce compromised spores, which are extremely unstable and germination defective. Previously, we have found that BzpF binds the canonical CRE motif in vitro. In this paper, we determined the DNA binding specificity of BzpF using protein binding microarray (PBM) and showed that the motif with the highest specificity is a CRE-like sequence. BzpF is necessary to activate the transcription of at least 15 PKA-regulated, late-developmental target genes whose promoters contain BzpF binding motifs. BzpF is sufficient to activate two of these genes. The comparison of RNA sequencing data between wild type and bzpF(-) mutant revealed that the mutant fails to express 205 genes, many of which encode cellulose-binding and sugar-binding proteins. We propose that BzpF is a CREB-like transcription factor that regulates spore maturation and stability in a PKA-related manner.
|
23
|
miR669a and miR669q prevent skeletal muscle differentiation in postnatal cardiac progenitors. J Cell Biol 2011; 193:1197-212. [PMID: 21708977 PMCID: PMC3216340 DOI: 10.1083/jcb.201011099] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.8] [Indexed: 11/22/2022]
Abstract
Postnatal heart stem and progenitor cells are a potential therapeutic tool for cardiomyopathies, but little is known about the mechanisms that control cardiac differentiation. Recent work has highlighted an important role for microribonucleic acids (miRNAs) as regulators of cardiac and skeletal myogenesis. In this paper, we isolated cardiac progenitors from neonatal β-sarcoglycan (Sgcb)-null mouse hearts affected by dilated cardiomyopathy. Unexpectedly, Sgcb-null cardiac progenitors spontaneously differentiated into skeletal muscle fibers both in vitro and when transplanted into regenerating muscles or infarcted hearts. Differentiation potential correlated with the absence of expression of a novel miRNA, miR669q, and with down-regulation of miR669a. Other miRNAs are known to promote myogenesis, but only miR669a and miR669q act upstream of myogenic regulatory factors to prevent myogenesis by directly targeting the MyoD 3' untranslated region. This finding reveals an added level of complexity in the mechanism of the fate choice of mesoderm progenitors and suggests that using endogenous cardiac stem cells therapeutically will require specially tailored procedures for certain genetic diseases.
|
24
|
Stage prediction of embryonic stem cell differentiation from genome-wide expression data. Bioinformatics 2011; 27:2546-53. [PMID: 21765096 DOI: 10.1093/bioinformatics/btr422] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 02/07/2023]
Abstract
MOTIVATION The developmental stage of a cell can be determined by cellular morphology or various other observable indicators. Such classical markers could be complemented with modern surrogates, like whole-genome transcription profiles, that can encode the state of the entire organism and provide increased quantitative resolution. Recent findings suggest that such profiles provide sufficient information to reliably predict the cell's developmental stage. RESULTS We use whole-genome transcription data and several data projection methods to infer differentiation stage prediction models for embryonic cells. Given a transcription profile of an uncharacterized cell, these models can then predict its developmental stage. In a series of experiments comprising 14 datasets from the Gene Expression Omnibus, we demonstrate that the approach is robust and has excellent prediction ability both within a specific cell line and across different cell lines. AVAILABILITY Model inference and computational evaluation procedures in the form of Python scripts and accompanying datasets are available at http://www.biolab.si/supp/stagerank. CONTACT blaz.zupan@fri.uni-lj.si SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
25
|
|
26
|
Abstract
SNPsyn (http://snpsyn.biolab.si) is an interactive software tool for the discovery of synergistic pairs of single nucleotide polymorphisms (SNPs) from large genome-wide case-control association studies (GWAS) data on complex diseases. Synergy among SNPs is estimated using an information-theoretic approach called interaction analysis. SNPsyn is both a stand-alone C++/Flash application and a web server. The computationally intensive part is implemented in C++ and can run in parallel on a dedicated cluster or grid. The graphical user interface is written in Adobe Flash Builder 4 and can run in most web browsers or as a stand-alone application. The SNPsyn web server hosts the Flash application, receives GWAS data submissions, invokes the interaction analysis and serves result files. The user can explore details on identified synergistic pairs of SNPs, perform gene set enrichment analysis and interact with the constructed SNP synergy network.
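The abstract does not give the formula behind interaction analysis. One common information-theoretic formulation of synergy is the interaction gain I(A,B;C) - I(A;C) - I(B;C), where A and B are two SNPs and C is the phenotype; whether SNPsyn uses exactly this definition is an assumption here. A minimal sketch on toy data (function names are hypothetical):

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def interaction_gain(snp_a, snp_b, phenotype):
    """Information the SNP pair gives about the phenotype beyond each SNP alone."""
    pair = list(zip(snp_a, snp_b))
    return (mutual_information(pair, phenotype)
            - mutual_information(snp_a, phenotype)
            - mutual_information(snp_b, phenotype))


# Toy data: the phenotype is the XOR of the two SNPs, so neither SNP is
# informative alone but the pair determines the phenotype completely.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
gain = interaction_gain(snp_a, snp_b, phenotype)

# Redundant case: both SNPs carry the same information about the phenotype.
redundant = interaction_gain(snp_a, snp_a, snp_a)
```

A positive value indicates synergy between the two SNPs; a negative value indicates that they carry redundant information about the phenotype.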
|
27
|
iCLIP--transcriptome-wide mapping of protein-RNA interactions with individual nucleotide resolution. J Vis Exp 2011:2638. [PMID: 21559008 PMCID: PMC3169244 DOI: 10.3791/2638] [Citation(s) in RCA: 136] [Impact Index Per Article: 10.5] [Indexed: 01/16/2023] Open
Abstract
The unique composition and spatial arrangement of RNA-binding proteins (RBPs) on a transcript guide the diverse aspects of post-transcriptional regulation [1]. Therefore, an essential step towards understanding transcript regulation at the molecular level is to gain positional information on the binding sites of RBPs [2]. Protein-RNA interactions can be studied using biochemical methods, but these approaches do not address RNA binding in its native cellular context. Initial attempts to study protein-RNA complexes in their cellular environment employed affinity purification or immunoprecipitation combined with differential display or microarray analysis (RIP-CHIP) [3-5]. These approaches were prone to identifying indirect or non-physiological interactions [6]. In order to increase the specificity and positional resolution, a strategy referred to as CLIP (UV cross-linking and immunoprecipitation) was introduced [7,8]. CLIP combines UV cross-linking of proteins and RNA molecules with rigorous purification schemes including denaturing polyacrylamide gel electrophoresis. In combination with high-throughput sequencing technologies, CLIP has proven to be a powerful tool to study protein-RNA interactions on a genome-wide scale (referred to as HITS-CLIP or CLIP-seq) [9,10]. Recently, PAR-CLIP was introduced, which uses photoreactive ribonucleoside analogs for cross-linking [11,12]. Despite the high specificity of the obtained data, CLIP experiments often generate cDNA libraries of limited sequence complexity. This is partly due to the restricted amount of co-purified RNA and the two inefficient RNA ligation reactions required for library preparation. In addition, primer extension assays indicated that many cDNAs truncate prematurely at the cross-linked nucleotide [13]. Such truncated cDNAs are lost during the standard CLIP library preparation protocol.
We recently developed iCLIP (individual-nucleotide resolution CLIP), which captures the truncated cDNAs by replacing one of the inefficient intermolecular RNA ligation steps with a more efficient intramolecular cDNA circularization (Figure 1) [14]. Importantly, sequencing the truncated cDNAs provides insights into the position of the cross-link site at nucleotide resolution. We successfully applied iCLIP to study hnRNP C particle organization on a genome-wide scale and assess its role in splicing regulation [14].
|
28
|
|
29
|
Biomechanical and clinical alterations of the hip joint following femoral neck fracture and implantation of bipolar hip endoprosthesis. Coll Antropol 2010; 34:931-935. [PMID: 20977085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
The implantation of a bipolar partial hip endoprosthesis is a treatment of choice for displaced medial femoral neck fracture. We present an experimental study that assesses and compares biomechanical and clinical status in the periods before and after hip fracture and implantation of a bipolar partial hip endoprosthesis. This study encompassed 75 patients who suffered from an acute medial femoral neck fracture and were treated with the implantation of a bipolar partial hip endoprosthesis. Their biomechanical status (stress distribution on the hip joint weight-bearing area) and clinical status (Harris Hip Score) were estimated for the time prior to the injury and assessed at the follow-up examination that was, on average, carried out 40 months after the operation. Despite ageing, the observed Harris Hip Score at the follow-up examination was higher than that estimated prior to the injury (77.9 > 69.6; p = 0.006). Similarly, the hip stress was reduced (2.3 MPa < 2.7 MPa; p = 0.001). While this reduction can be attributed to a loss of weight due to late ageing, the principal improvement came from the operative treatment and corresponding restoration of the biomechanical properties of the hip joint. The implantation of a bipolar partial hip endoprosthesis for patients with displaced medial femoral neck fractures improves the biomechanical and clinical features of the hip, which should be kept in mind when deciding on treatment.
|
30
|
iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol 2010; 17:909-15. [PMID: 20601959 PMCID: PMC3000544 DOI: 10.1038/nsmb.1838] [Citation(s) in RCA: 836] [Impact Index Per Article: 59.7] [Received: 12/07/2009] [Accepted: 04/22/2010] [Indexed: 01/27/2023]
Abstract
In the nucleus of eukaryotic cells, nascent transcripts are associated with heterogeneous nuclear ribonucleoprotein (hnRNP) particles that are nucleated by hnRNP C. Despite their abundance, however, it remained unclear whether these particles control pre-mRNA processing. Here, we developed individual-nucleotide resolution UV-cross-linking and immunoprecipitation (iCLIP) to study the role of hnRNP C in splicing regulation. iCLIP data demonstrate that hnRNP C recognizes uridine tracts with a defined long-range spacing consistent with hnRNP particle organization. hnRNP particles assemble on both introns and exons, but remain generally excluded from splice sites. Integration of transcriptome-wide iCLIP data and alternative splicing profiles into an ‘RNA map’ indicates how the positioning of hnRNP particles determines their effect on inclusion of alternative exons. The ability of high-resolution iCLIP data to provide insights into the mechanism of this regulation holds promise for studies of other higher-order ribonucleoprotein complexes.
|
31
|
Computational approaches for the genetic and phenotypic characterization of a Saccharomyces cerevisiae wine yeast collection. Yeast 2010; 26:675-92. [PMID: 19894212 DOI: 10.1002/yea.1728] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022] Open
Abstract
Within this study, we have used a set of computational techniques to relate the genotypes and phenotypes of natural populations of Saccharomyces cerevisiae, using allelic information from 11 microsatellite loci and results from 24 phenotypic tests. A group of 103 strains was obtained from a larger S. cerevisiae winemaking strain collection by clustering with self-organizing maps. These strains were further characterized regarding their allelic combinations for 11 microsatellites and analysed in phenotypic screens that included taxonomic criteria (carbon and nitrogen assimilation tests, growth at different temperatures) and tests with biotechnological relevance (ethanol resistance, H₂S or aromatic precursors formation). Phenotypic variability was rather high and each strain showed a unique phenotypic profile. The results, expressed as optical density (A₆₄₀) after 22 h of growth, were in agreement with taxonomic data, although with some exceptions, since a few strains were capable of consuming arabinose and ribose to a small extent. Based on microsatellite allelic information, a naïve Bayesian classifier correctly assigned (AUC = 0.81, p < 10⁻⁸) most of the strains to the vineyard from where they were isolated, despite their close location (50-100 km). We also identified subgroups of strains with similar values of a phenotypic feature and microsatellite allelic pattern (AUC > 0.75). Subgroups were found for strains with low ethanol resistance, growth at 30°C and growth in media containing galactose, raffinose or urea. The results demonstrate that computational approaches can be used to establish genotype-phenotype relations and to make predictions about a strain's biotechnological potential.
|
32
|
New components of the Dictyostelium PKA pathway revealed by Bayesian analysis of expression data. BMC Bioinformatics 2010; 11:163. [PMID: 20356373 PMCID: PMC2873529 DOI: 10.1186/1471-2105-11-163] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 11/16/2009] [Accepted: 03/31/2010] [Indexed: 11/30/2022] Open
Abstract
Background Identifying candidate genes in genetic networks is important for understanding regulation and biological function. Large gene expression datasets contain relevant information about genetic networks, but mining the data is not a trivial task. Algorithms that infer Bayesian networks from expression data are powerful tools for learning complex genetic networks, since they can incorporate prior knowledge and uncover higher-order dependencies among genes. However, these algorithms are computationally demanding, so novel techniques that allow targeted exploration for discovering new members of known pathways are essential. Results Here we describe a Bayesian network approach that addresses a specific network within a large dataset to discover new components. Our algorithm draws individual genes from a large gene-expression repository, and ranks them as potential members of a known pathway. We apply this method to discover new components of the cAMP-dependent protein kinase (PKA) pathway, a central regulator of Dictyostelium discoideum development. The PKA network is well studied in D. discoideum but the transcriptional networks that regulate PKA activity and the transcriptional outcomes of PKA function are largely unknown. Most of the genes highly ranked by our method encode either known components of the PKA pathway or are good candidates. We tested 5 uncharacterized highly ranked genes by creating mutant strains and identified a candidate cAMP-response element-binding protein, yet undiscovered in D. discoideum, and a histidine kinase, a candidate upstream regulator of PKA activity. Conclusions The single-gene expansion method is useful in identifying new components of known pathways. The method takes advantage of the Bayesian framework to incorporate prior biological knowledge and discovers higher-order dependencies among genes while greatly reducing the computational resources required to process high-throughput datasets.
|
33
|
Conserved developmental transcriptomes in evolutionarily divergent species. Genome Biol 2010; 11:R35. [PMID: 20236529 PMCID: PMC2864575 DOI: 10.1186/gb-2010-11-3-r35] [Citation(s) in RCA: 130] [Impact Index Per Article: 9.3] [Received: 12/16/2009] [Revised: 02/11/2010] [Accepted: 03/17/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Evolutionarily divergent organisms often share developmental anatomies despite vast differences between their genome sequences. The social amoebae Dictyostelium discoideum and Dictyostelium purpureum have similar developmental morphologies although their genomes are as divergent as those of man and jawed fish. RESULTS Here we show that the anatomical similarities are accompanied by extensive transcriptome conservation. Using RNA sequencing we compared the abundance and developmental regulation of all the transcripts in the two species. In both species, most genes are developmentally regulated and the greatest expression changes occur during the transition from unicellularity to multicellularity. The developmental regulation of transcription is highly conserved between orthologs in the two species. In addition to timing of expression, the level of mRNA production is also conserved between orthologs and is consistent with the intuitive notion that transcript abundance correlates with the amount of protein required. Furthermore, the conservation of transcriptomes extends to cell-type specific expression. CONCLUSIONS These findings suggest that developmental programs are remarkably conserved at the transcriptome level, considering the great evolutionary distance between the genomes. Moreover, this transcriptional conservation may be responsible for the similar developmental anatomies of Dictyostelium discoideum and Dictyostelium purpureum.
|
34
|
Does replication groups scoring reduce false positive rate in SNP interaction discovery? BMC Genomics 2010; 11:58. [PMID: 20092660 PMCID: PMC2823693 DOI: 10.1186/1471-2164-11-58] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 08/24/2009] [Accepted: 01/22/2010] [Indexed: 11/20/2022] Open
Abstract
Background Computational methods that infer single nucleotide polymorphism (SNP) interactions from phenotype data may uncover new biological mechanisms in non-Mendelian diseases. However, practical aspects of such analysis face many problems. Present experimental studies typically use SNP arrays with hundreds of thousands of SNPs but record only hundreds of samples. Candidate SNP pairs inferred by interaction analysis may include a high proportion of false positives. Recently, Gayan et al. (2008) proposed to reduce the number of false positives by combining results of interaction analysis performed on subsets of data (replication groups), rather than analyzing the entire data set directly. If performing as hypothesized, replication groups scoring could improve interaction analysis and also any type of feature ranking and selection procedure in systems biology. Because Gayan et al. do not compare their approach to the standard interaction analysis techniques, we here investigate whether replication groups indeed reduce the number of reported false positive interactions. Results A set of simulated and false interaction-imputed experimental SNP data sets was used to compare the inference of SNP-SNP interactions by means of replication groups to the standard approach where the entire data set was directly used to score all candidate SNP pairs. In all our experiments, the inference of interactions from the entire data set (i.e., without using the replication groups) reported fewer false positives. Conclusions Compared with the direct scoring approach, the use of replication groups does not reduce false positive rates and may, depending on the data set, perform worse.
|
35
|
Text Mining approaches for automated literature knowledge extraction and representation. Stud Health Technol Inform 2010; 160:954-958. [PMID: 20841825] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2023]
Abstract
Due to the overwhelming volume of published scientific papers, information tools for automated literature analysis are essential to support current biomedical research. We have developed a knowledge extraction tool to help researchers discover useful information that can support their reasoning process. The tool is composed of a search engine based on Text Mining and Natural Language Processing techniques, and an analysis module that processes the search results in order to build annotation similarity networks. We tested our approach on the available knowledge about the genetic mechanism of cardiac diseases, where the target is to find both known and possible hypothetical relations between specific candidate genes and the trait of interest. We show that the system i) is able to effectively retrieve medical concepts and genes and ii) plays a relevant role in assisting researchers in the formulation and evaluation of novel literature-based hypotheses.
|
36
|
dictyExpress: a Dictyostelium discoideum gene expression database with an explorative data analysis web-based interface. BMC Bioinformatics 2009; 10:265. [PMID: 19706156 PMCID: PMC2738683 DOI: 10.1186/1471-2105-10-265] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.1] [Received: 04/03/2009] [Accepted: 08/25/2009] [Indexed: 11/25/2022] Open
Abstract
Background Bioinformatics often leverages recent advancements in computer science to support biologists in their scientific discovery process. Such efforts include the development of easy-to-use web interfaces to biomedical databases. Recent advancements in interactive web technologies require us to rethink the standard submit-and-wait paradigm, and craft bioinformatics web applications that share analytical and interactive power with their desktop relatives, while retaining simplicity and availability. Results We have developed dictyExpress, a web application that features a graphical, highly interactive explorative interface to our database that consists of more than 1000 Dictyostelium discoideum gene expression experiments. In dictyExpress, the user can select experiments and genes, perform gene clustering, view gene expression profiles across time, view gene co-expression networks, perform analyses of Gene Ontology term enrichment, and simultaneously display expression profiles for a selected gene in various experiments. Most importantly, these tasks are achieved through web applications whose components are seamlessly interlinked and immediately respond to events triggered by the user, thus providing a powerful explorative data analysis environment. Conclusion dictyExpress is a precursor for a new generation of web-based bioinformatics applications with simple but powerful interactive interfaces that resemble those of modern desktop applications. While dictyExpress serves mainly the Dictyostelium research community, it is relatively easy to adapt it to other datasets. We propose that the design ideas behind dictyExpress will influence the development of similar applications for other model organisms.
|
37
|
Rule-based clustering for gene promoter structure discovery. Methods Inf Med 2009; 48:229-35. [PMID: 19387502 DOI: 10.3414/me9225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
BACKGROUND The genetic cellular response to internal and external changes is determined by the sequence and structure of gene-regulatory promoter regions. OBJECTIVES Using data on gene-regulatory elements (i.e., either putative or known transcription factor binding sites) and data on gene expression profiles we can discover structural elements in promoter regions and infer the underlying programs of gene regulation. Such hypotheses obtained in silico can greatly assist us in experiment planning. The principal obstacle for such approaches is the combinatorial explosion in different combinations of promoter elements to be examined. METHODS Stemming from several state-of-the-art machine learning approaches we here propose a heuristic, rule-based clustering method that uses gene expression similarity to guide the search for informative structures in promoters, thus exploring only the most promising parts of the vast and expressively rich rule-space. RESULTS We present the utility of the method in the analysis of gene expression data on budding yeast S. cerevisiae where cells were induced to proliferate peroxisomes. CONCLUSIONS We demonstrate that the proposed approach is able to infer informative relations uncovering relatively complex structures in gene promoter regions that regulate gene expression.
|
38
|
Polymorphic members of the lag gene family mediate kin discrimination in Dictyostelium. Curr Biol 2009; 19:567-72. [PMID: 19285397 DOI: 10.1016/j.cub.2009.02.037] [Citation(s) in RCA: 109] [Impact Index Per Article: 7.3] [Received: 10/31/2008] [Revised: 01/02/2009] [Accepted: 02/04/2009] [Indexed: 01/01/2023]
Abstract
Self and kin discrimination are observed in most kingdoms of life and are mediated by highly polymorphic plasma membrane proteins. Sequence polymorphism, which is essential for effective recognition, is maintained by balancing selection. Dictyostelium discoideum are social amoebas that propagate as unicellular organisms but aggregate upon starvation and form fruiting bodies with viable spores and dead stalk cells. Aggregative development exposes Dictyostelium to the perils of chimerism, including cheating, which raises questions about how the victims survive in nature and how social cooperation persists. Dictyostelids can minimize the cost of chimerism by preferential cooperation with kin, but the mechanisms of kin discrimination are largely unknown. Dictyostelium lag genes encode transmembrane proteins with multiple immunoglobulin (Ig) repeats that participate in cell adhesion and signaling. Here, we describe their role in kin discrimination. We show that lagB1 and lagC1 are highly polymorphic in natural populations and that their sequence dissimilarity correlates well with wild-strain segregation. Deleting lagB1 and lagC1 results in strain segregation in chimeras with wild-type cells, whereas elimination of the nearly invariant homolog lagD1 has no such consequences. These findings reveal an early evolutionary origin of kin discrimination and provide insight into the mechanism of social recognition and immunity.
|
39
|
dictyBase--a Dictyostelium bioinformatics resource update. Nucleic Acids Res 2009; 37:D515-9. [PMID: 18974179 PMCID: PMC2686522 DOI: 10.1093/nar/gkn844] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.5] [Received: 09/10/2008] [Revised: 10/14/2008] [Accepted: 10/15/2008] [Indexed: 12/14/2022] Open
Abstract
dictyBase (http://dictybase.org) is the model organism database for Dictyostelium discoideum. It houses the complete genome sequence, ESTs and the entire body of literature relevant to Dictyostelium. This information is curated to provide accurate gene models and functional annotations, with the goal of fully annotating the genome. This dictyBase update describes the annotations and features implemented since 2006, including improved strain and phenotype representation, integration of predicted transcriptional regulatory elements, protein domain information, biochemical pathways, improved searching and a wiki tool that allows members of the research community to provide annotations.
|
40
|
Report on EU-USA workshop: how systems biology can advance cancer research (27 October 2008). Mol Oncol 2008; 3:9-17. [PMID: 19383362 DOI: 10.1016/j.molonc.2008.11.003] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.2] [Received: 11/26/2008] [Accepted: 11/26/2008] [Indexed: 11/29/2022] Open
Abstract
The main conclusion is that systems biology approaches can indeed advance cancer research, having already proved successful in a very wide variety of cancer-related areas, and are likely to prove superior to many current research strategies. Major points include: Systems biology and computational approaches can make important contributions to research and development in key clinical aspects of cancer and of cancer treatment, and should be developed for understanding and application to diagnosis, biomarkers, cancer progression, drug development and treatment strategies. Development of new measurement technologies is central to successful systems approaches, and should be strongly encouraged. The systems view of disease combined with these new technologies and novel computational tools will over the next 5-20 years lead to medicine that is predictive, personalized, preventive and participatory (P4 medicine). Major initiatives are in progress to gather extremely wide ranges of data for both somatic and germ-line genetic variations, as well as gene, transcript, protein and metabolite expression profiles that are cancer-relevant. Electronic databases and repositories play a central role to store and analyze these data. These resources need to be developed and sustained. Understanding cellular pathways is crucial in cancer research, and these pathways need to be considered in the context of the progression of cancer at various stages. At all stages of cancer progression, major areas require modelling via systems and developmental biology methods including immune system reactions, angiogenesis and tumour progression. A number of mathematical models of an analytical or computational nature have been developed that can give detailed insights into the dynamics of cancer-relevant systems. These models should be further integrated across multiple levels of biological organization in conjunction with analysis of laboratory and clinical data. Biomarkers represent major tools in determining the presence of cancer, its progression and the responses to treatments. There is a need for sets of high-quality annotated clinical samples, enabling comparisons across different diseases and the quantitative simulation of major pathways leading to biomarker development and analysis of drug effects. Education is recognized as a key component in the success of any systems biology programme, especially for applications to cancer research. It is recognized that a balance needs to be found between the need to be interdisciplinary and the necessity of having extensive specialist knowledge in particular areas. A proposal from this workshop is to explore one or more types of cancer over the full scale of their progression, for example glioblastoma or colon cancer. Such an exemplar project would require all the experimental and computational tools available for the generation and analysis of quantitative data over the entire hierarchy of biological information. These tools and approaches could be mobilized to understand, detect and treat cancerous processes and establish methods applicable across a wide range of cancers.
|
41
|
|
42
|
Towards knowledge-based gene expression data mining. J Biomed Inform 2007; 40:787-802. [PMID: 17683991 DOI: 10.1016/j.jbi.2007.06.005] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Received: 09/29/2006] [Revised: 04/20/2007] [Accepted: 06/06/2007] [Indexed: 11/24/2022]
Abstract
The field of gene expression data analysis has grown in the past few years from being purely data-centric to integrative, aiming at complementing microarray analysis with data and knowledge from diverse available sources. In this review, we report on the plethora of gene expression data mining techniques and focus on their evolution toward knowledge-based data analysis approaches. In particular, we discuss recent developments in gene expression-based analysis methods used in association and classification studies, phenotyping and reverse engineering of gene networks.
|
43
|
Abstract
MOTIVATION The genome of the social amoeba Dictyostelium discoideum contains an unusually large number of polyketide synthase (PKS) genes. An analysis of the genes is a first step towards understanding the biological roles of their products and exploiting novel products. RESULTS A total of 45 Type I iterative PKS genes were found, 5 of which are probably pseudogenes. Catalytic domains that are homologous with known PKS sequences as well as possible novel domains were identified. The genes often occurred in clusters of 2-5 genes, where members of the cluster had very similar sequences. The D. discoideum PKS genes formed a clade distinct from fungal and bacterial genes. All nine genes examined by RT-PCR were expressed, although at different developmental stages. The promoters of PKS genes were much more divergent than the structural genes, although we have identified motifs that are unique to some PKS gene promoters.
|
44
|
Abstract
MOTIVATION Methods for analyzing cancer microarray data often face two distinct challenges: the models they infer need to perform well when classifying new tissue samples while at the same time providing an insight into the patterns and gene interactions hidden in the data. State-of-the-art supervised data mining methods often cover well only one of these aspects, motivating the development of methods where predictive models with a solid classification performance would be easily communicated to the domain expert. RESULTS Data visualization may provide for an excellent approach to knowledge discovery and analysis of class-labeled data. We have previously developed an approach called VizRank that can score and rank point-based visualizations according to degree of separation of data instances of different class. We here extend VizRank with techniques to uncover outliers, score features (genes) and perform classification, as well as to demonstrate that the proposed approach is well suited for cancer microarray analysis. Using VizRank and radviz visualization on a set of previously published cancer microarray data sets, we were able to find simple, interpretable data projections that include only a small subset of genes yet do clearly differentiate among different cancer types. We also report that our approach to classification through visualization achieves performance that is comparable to state-of-the-art supervised data mining techniques. AVAILABILITY VizRank and radviz are implemented as part of the Orange data mining suite (http://www.ailab.si/orange). SUPPLEMENTARY INFORMATION Supplementary data are available from http://www.ailab.si/supp/bi-cancer.
|
45
|
FreeViz--an intelligent multivariate visualization approach to explorative analysis of biomedical data. J Biomed Inform 2007; 40:661-71. [PMID: 17531544 DOI: 10.1016/j.jbi.2007.03.010] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.9] [Received: 09/22/2006] [Revised: 01/24/2007] [Accepted: 03/28/2007] [Indexed: 11/30/2022]
Abstract
Visualization can largely improve biomedical data analysis. It plays a crucial role in explorative data analysis and may support various data mining tasks. The paper presents FreeViz, an optimization method that finds linear projection and associated scatterplot that best separates instances of different class. In a single graph, the resulting FreeViz visualization can provide a global view of the classification problem being studied, reveal interesting relations between classes and features, uncover feature interactions, and provide information about intra-class similarities. The paper gives mathematical foundations of FreeViz, and presents its utility on various biomedical data sets.
|
46
|
Abstract
BACKGROUND There is no standard triage method for earthquake victims with crush injuries because of a scarcity of epidemiologic and quantitative data. We conducted a retrospective cohort study to develop predictive models based on clinical data for crush injury in the Kobe earthquake. METHODS The medical records of 372 patients with crush injuries from the Kobe earthquake were retrospectively analyzed. Twenty-one risk factors were assessed with logistic regression analysis for three outcomes relating to crush syndrome. Two types of predictive triage models--initial evaluation in the field and secondary assessment at the hospital--were developed using logistic regression analysis. Classification accuracy, Brier score and area under the receiver operating characteristic curve (AUC) were used to evaluate the model. RESULTS The initial triage model, which includes pulse rate, delayed rescue, and abnormal urine color, has an AUC of 0.73. The secondary model, which includes WBC, tachycardia, abnormal urine color, and hyperkalemia, shows an AUC of 0.76. CONCLUSIONS These triage models may be especially useful to nondisaster experts for distinguishing earthquake victims at high risk of severe crush syndrome from those at lower risk. Application of the model may allow relief workers to better utilize limited medical and transportation resources in the aftermath of a disaster.
|
47
|
Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform 2006; 77:81-97. [PMID: 17188928 DOI: 10.1016/j.ijmedinf.2006.11.006] [Citation(s) in RCA: 300] [Impact Index Per Article: 16.7] [Received: 10/27/2006] [Accepted: 11/17/2006] [Indexed: 12/18/2022]
Abstract
BACKGROUND The widespread availability of new computational methods and tools for data analysis and predictive modeling requires medical informatics researchers and practitioners to systematically select the most appropriate strategy to cope with clinical prediction problems. In particular, the collection of methods known as 'data mining' offers methodological and technical solutions to deal with the analysis of medical data and construction of prediction models. A large variety of these methods requires general and simple guidelines that may help practitioners in the appropriate selection of data mining tools, construction and validation of predictive models, along with the dissemination of predictive models within clinical environments. PURPOSE The goal of this review is to discuss the extent and role of the research area of predictive data mining and to propose a framework to cope with the problems of constructing, assessing and exploiting data mining models in clinical medicine. METHODS We review the recent relevant work published in the area of predictive data mining in clinical medicine, highlighting critical issues and summarizing the approaches in a set of learned lessons. RESULTS The paper provides a comprehensive review of the state of the art of predictive data mining in clinical medicine and gives guidelines to carry out data mining studies in this field. CONCLUSIONS Predictive data mining is becoming an essential instrument for researchers and clinical practitioners in medicine. Understanding the main issues underlying these methods and the application of agreed and standardized procedures is mandatory for their deployment and the dissemination of results. Thanks to the integration of molecular and clinical data taking place within genomic medicine, the area has recently not only gained a fresh impulse but also a new set of complex problems it needs to address.
|
48
|
|
49
|
Abstract
Methylation of cytosine residues in DNA plays a critical role in the silencing of gene expression, organization of chromatin structure, and cellular differentiation of eukaryotes. Previous studies failed to detect 5-methylcytosine in Dictyostelium genomic DNA, but the recent sequencing of the Dictyostelium genome revealed a candidate DNA methyltransferase gene (dnmA). The genome sequence also uncovered an unusual distribution of potential methylation sites, CpG islands, throughout the genome. DnmA belongs to the Dnmt2 subfamily and contains all the catalytic motifs necessary for cytosine methyltransferases. Dnmt2 activity is typically weak in Drosophila melanogaster, mouse, and human cells and the gene function in these systems is unknown. We have investigated the methylation status of Dictyostelium genomic DNA with antibodies raised against 5-methylcytosine and detected low levels of the modified nucleotide. We also found that DNA methylation increased during development. We searched the genome for potential methylation sites and found them in retrotransposable elements and in several other genes. Using Southern blot analysis with methylation-sensitive and -insensitive restriction endonucleases, we found that the DIRS retrotransposon and the guaB gene were indeed methylated. We then mutated the dnmA gene and found that DNA methylation was reduced to about 50% of the wild-type level. The mutant cells exhibited morphological defects in late development, indicating that DNA methylation has a regulatory role in Dictyostelium development. Our findings establish a role for a Dnmt2 methyltransferase in eukaryotic development.
|
50
|
Abstract
GenePath is a web-based application for the analysis of mutant-based experiments and synthesis of genetic networks. Here, we introduce GenePath and describe a number of new approaches, including conflict resolution, handling cyclic pathways, confidence level assignment, what-if analysis and new experiment proposal. We illustrate the key concepts using data from a study of adhesion genes in Dictyostelium discoideum and show that GenePath discovered genetic interactions that were ignored in the original publication. GenePath is available at http://www.genepath.org/genepath2.
|