1
|
DeePhys: A machine learning-assisted platform for electrophysiological phenotyping of human neuronal networks. Stem Cell Reports 2024; 19:285-298. [PMID: 38278155 PMCID: PMC10874850 DOI: 10.1016/j.stemcr.2023.12.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2022] [Revised: 12/18/2023] [Accepted: 12/20/2023] [Indexed: 01/28/2024] Open
Abstract
Reproducible functional assays to study in vitro neuronal networks represent an important cornerstone in the quest to develop physiologically relevant cellular models of human diseases. Here, we introduce DeePhys, a MATLAB-based analysis tool for data-driven functional phenotyping of in vitro neuronal cultures recorded by high-density microelectrode arrays. DeePhys is a modular workflow that offers a range of techniques to extract features from spike-sorted data, allowing for the examination of functional phenotypes both at the individual cell and network levels, as well as across development. In addition, DeePhys incorporates the capability to integrate novel features and to use machine-learning-assisted approaches, which facilitates a comprehensive evaluation of pharmacological interventions. To illustrate its practical application, we apply DeePhys to human induced pluripotent stem cell-derived dopaminergic neurons obtained from both patients and healthy individuals and showcase how DeePhys enables phenotypic screenings.
|
2
|
Persistent complement dysregulation with signs of thromboinflammation in active Long Covid. Science 2024; 383:eadg7942. [PMID: 38236961 DOI: 10.1126/science.adg7942] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/21/2023] [Accepted: 11/24/2023] [Indexed: 01/23/2024]
Abstract
Long Covid is a debilitating condition of unknown etiology. We performed multimodal proteomics analyses of blood serum from COVID-19 patients followed up to 12 months after confirmed severe acute respiratory syndrome coronavirus 2 infection. Analysis of >6500 proteins in 268 longitudinal samples revealed dysregulated activation of the complement system, an innate immune protection and homeostasis mechanism, in individuals experiencing Long Covid. Thus, active Long Covid was characterized by terminal complement system dysregulation and ongoing activation of the alternative and classical complement pathways, the latter associated with increased antibody titers against several herpesviruses possibly stimulating this pathway. Moreover, markers of hemolysis, tissue injury, platelet activation, and monocyte-platelet aggregates were increased in Long Covid. Machine learning confirmed complement and thromboinflammatory proteins as top biomarkers, warranting diagnostic and therapeutic interrogation of these systems.
|
3
|
Multimodal learning in clinical proteomics: enhancing antimicrobial resistance prediction models with chemical information. Bioinformatics 2023; 39:btad717. [PMID: 38001023 PMCID: PMC10724849 DOI: 10.1093/bioinformatics/btad717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 11/08/2023] [Accepted: 11/23/2023] [Indexed: 11/26/2023] Open
Abstract
MOTIVATION Large-scale clinical proteomics datasets of infectious pathogens, combined with antimicrobial resistance outcomes, have recently opened the door for machine learning models which aim to improve clinical treatment by predicting resistance early. However, existing prediction frameworks typically train a separate model for each antimicrobial and species in order to predict a pathogen's resistance outcome, resulting in missed opportunities for chemical knowledge transfer and generalizability. RESULTS We demonstrate the effectiveness of multimodal learning over proteomic and chemical features by exploring two clinically relevant tasks for our proposed deep learning models: drug recommendation and generalized resistance prediction. By adopting this multi-view representation of the pathogenic samples and leveraging the scale of the available datasets, our models outperformed the previous single-drug and single-species predictive models by statistically significant margins. We extensively validated the multi-drug setting, highlighting the challenges in generalizing beyond the training data distribution, and quantitatively demonstrate how suitable representations of antimicrobial drugs constitute a crucial tool in the development of clinically relevant predictive models. AVAILABILITY AND IMPLEMENTATION The code used to produce the results presented in this article is available at https://github.com/BorgwardtLab/MultimodalAMR.
|
4
|
Predicting sepsis using deep learning across international sites: a retrospective development and validation study. EClinicalMedicine 2023; 62:102124. [PMID: 37588623 PMCID: PMC10425671 DOI: 10.1016/j.eclinm.2023.102124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 06/29/2023] [Accepted: 07/17/2023] [Indexed: 08/18/2023] Open
Abstract
Background When sepsis is detected, organ damage may have progressed to irreversible stages, leading to poor prognosis. The use of machine learning for predicting sepsis early has shown promise; however, international validations are missing. Methods This was a retrospective, observational, multi-centre cohort study. We developed and externally validated a deep learning system for the prediction of sepsis in the intensive care unit (ICU). To our knowledge, our analysis represents the first international, multi-centre in-ICU cohort study for sepsis prediction using deep learning. Our dataset contains 136,478 unique ICU admissions, representing a refined and harmonised subset of four large ICU databases comprising data collected from ICUs in the US, the Netherlands, and Switzerland between 2001 and 2016. Using the international consensus definition Sepsis-3, we derived hourly-resolved sepsis annotations, amounting to 25,694 (18.8%) patient stays with sepsis. We compared our approach to clinical baselines as well as machine learning baselines and performed an extensive internal and external statistical validation within and across databases, reporting area under the receiver-operating-characteristic curve (AUC). Findings Averaged over sites, our model was able to predict sepsis with an AUC of 0.846 (95% confidence interval [CI], 0.841-0.852) on a held-out validation cohort internal to each site, and an AUC of 0.761 (95% CI, 0.746-0.770) when validating externally across sites. Given access to a small fine-tuning set (10% per site), the transfer to target sites was improved to an AUC of 0.807 (95% CI, 0.801-0.813). Our model raised 1.4 false alerts per true alert and detected 80% of the septic patients 3.7 h (95% CI, 3.0-4.3) prior to the onset of sepsis, opening a vital window for intervention.
Interpretation By monitoring clinical and laboratory measurements in a retrospective simulation of a real-time prediction scenario, a deep learning system for the detection of sepsis generalised to previously unseen ICU cohorts, internationally. Funding This study was funded by the Personalized Health and Related Technologies (PHRT) strategic focus area of the ETH domain.
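As a reading aid, the alert ratio reported in the abstract above can be converted into an implied precision, and the stated prevalence can be rechecked from the counts. The following is a minimal Python sketch of that arithmetic only; the helper name `precision_from_false_alert_ratio` is hypothetical, and treating every alert as one counted prediction is an assumption, not a detail taken from the study.

```python
def precision_from_false_alert_ratio(false_per_true: float) -> float:
    """Precision implied by a ratio of false alerts to true alerts.

    precision = TP / (TP + FP) = 1 / (1 + FP / TP)
    """
    if false_per_true < 0:
        raise ValueError("ratio must be non-negative")
    return 1.0 / (1.0 + false_per_true)


# 1.4 false alerts per true alert, as reported in the abstract.
precision = precision_from_false_alert_ratio(1.4)
print(f"Implied precision: {precision:.1%}")

# Share of patient stays with sepsis, from the counts in the abstract.
prevalence = 25_694 / 136_478
print(f"Sepsis prevalence: {prevalence:.1%}")
```

Under these assumptions, roughly two in five alerts would correspond to a true sepsis case, which is the quantity a clinician weighing alert fatigue would likely care about.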
|
5
|
Higher-order genetic interaction discovery with network-based biological priors. Bioinformatics 2023; 39:i523-i533. [PMID: 37387173 PMCID: PMC10311320 DOI: 10.1093/bioinformatics/btad273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Complex phenotypes, such as many common diseases and morphological traits, are controlled by multiple genetic factors, namely genetic mutations and genes, and are influenced by environmental conditions. Deciphering the genetics underlying such traits requires a systemic approach, where many different genetic factors and their interactions are considered simultaneously. Many association mapping techniques available nowadays follow this reasoning, but have some severe limitations. In particular, they require binary encodings for the genetic markers, forcing the user to decide beforehand whether to use, e.g. a recessive or a dominant encoding. Moreover, most methods cannot include any biological prior or are limited to testing only lower-order interactions among genes for association with the phenotype, potentially missing a large number of marker combinations. RESULTS We propose HOGImine, a novel algorithm that expands the class of discoverable genetic meta-markers by considering higher-order interactions of genes and by allowing multiple encodings for the genetic variants. Our experimental evaluation shows that the algorithm has a substantially higher statistical power compared to previous methods, allowing it to discover genetic mutations statistically associated with the phenotype at hand that could not be found before. Our method can exploit prior biological knowledge on gene interactions, such as protein-protein interaction networks, genetic pathways, and protein complexes, to restrict its search space. Since computing higher-order gene interactions poses a high computational burden, we also develop a more efficient search strategy and support computation to make our approach applicable in practice, leading to substantial runtime improvements compared to state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION Code and data are available at https://github.com/BorgwardtLab/HOGImine.
|
6
|
networkGWAS: A network-based approach to discover genetic associations. Bioinformatics 2023:7191773. [PMID: 37285313 DOI: 10.1093/bioinformatics/btad370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 04/11/2023] [Accepted: 06/06/2023] [Indexed: 06/09/2023] Open
Abstract
MOTIVATION While the search for associations between genetic markers and complex traits has led to the discovery of tens of thousands of trait-related genetic variants, the vast majority of these only explain a small fraction of the observed phenotypic variation. One possible strategy to overcome this while leveraging biological prior is to aggregate the effects of several genetic markers and to test entire genes, pathways or (sub)networks of genes for association with a phenotype. The latter, network-based genome-wide association studies, in particular suffer from a vast search space and an inherent multiple testing problem. As a consequence, current approaches are either based on greedy feature selection, thereby risking that they miss relevant associations, or neglect multiple testing correction, which can lead to an abundance of false positive findings. RESULTS To address the shortcomings of current approaches to network-based genome-wide association studies, we propose networkGWAS, a computationally efficient and statistically sound approach to network-based genome-wide association studies using mixed models and neighborhood aggregation. It allows for population structure correction and for well-calibrated p-values, which are obtained through circular and degree-preserving network permutations. networkGWAS successfully detects known associations on diverse synthetic phenotypes, as well as known and novel genes in phenotypes from S. cerevisiae and H. sapiens. It thereby enables the systematic combination of gene-based genome-wide association studies with biological network information. AVAILABILITY https://github.com/BorgwardtLab/networkGWAS.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
7
|
Multi-modal deep learning improves grain yield prediction in wheat breeding by fusing genomics and phenomics. Bioinformatics 2023:7176366. [PMID: 37220903 DOI: 10.1093/bioinformatics/btad336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2022] [Revised: 02/15/2023] [Accepted: 05/22/2023] [Indexed: 05/25/2023] Open
Abstract
MOTIVATION Developing new crop varieties with superior performance is highly important to ensure robust and sustainable global food security. The speed of variety development is limited by long field cycles and advanced generation selections in plant breeding programs. While methods to predict yield from genotype or phenotype data have been proposed, improved performance and integrated models are needed. RESULTS We propose a machine learning model that leverages both genotype and phenotype measurements by fusing genetic variants with multiple data sources collected by unmanned aerial systems. We use a deep multiple instance learning framework with an attention mechanism that sheds light on the importance given to each input during prediction, enhancing interpretability. Our model reaches 0.754 ± 0.024 Pearson correlation coefficient when predicting yield in similar environmental conditions; a 34.8% improvement over the genotype-only linear baseline (0.559 ± 0.050). We further predict yield on new lines in an unseen environment using only genotypes, obtaining a prediction accuracy of 0.386 ± 0.010, a 13.5% improvement over the linear baseline. Our multi-modal deep learning architecture efficiently accounts for plant health and environment, distilling the genetic contribution and providing excellent predictions. Yield prediction algorithms leveraging phenotypic observations during training therefore promise to improve breeding programs, ultimately speeding up delivery of improved varieties. AVAILABILITY AND IMPLEMENTATION Available at https://github.com/BorgwardtLab/PheGeMIL (code) and https://doi.org/doi:10.5061/dryad.kprr4xh5p (data).
|
8
|
SIMBSIG: similarity search and clustering for biobank-scale data. Bioinformatics 2023; 39:6969101. [PMID: 36610707 PMCID: PMC9825260 DOI: 10.1093/bioinformatics/btac829] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2022] [Revised: 12/14/2022] [Accepted: 12/22/2022] [Indexed: 12/24/2022] Open
Abstract
SUMMARY In many modern bioinformatics applications, such as statistical genetics or single-cell analysis, one frequently encounters datasets which are orders of magnitude too large for conventional in-memory analysis. To tackle this challenge, we introduce SIMBSIG (SIMilarity Batched Search Integrated GPU), a highly scalable Python package which provides a scikit-learn-like interface for out-of-core, GPU-enabled similarity searches, principal component analysis and clustering. Due to the PyTorch backend, it is highly modular and particularly tailored to many data types with a particular focus on biobank data analysis. AVAILABILITY AND IMPLEMENTATION SIMBSIG is freely available from PyPI and its source code and documentation can be found on GitHub (https://github.com/BorgwardtLab/simbsig) under a BSD-3 license.
|
9
|
reComBat: batch-effect removal in large-scale multi-source gene-expression data integration. BIOINFORMATICS ADVANCES 2022; 2:vbac071. [PMID: 36699372 PMCID: PMC9710604 DOI: 10.1093/bioadv/vbac071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Revised: 09/01/2022] [Accepted: 09/26/2022] [Indexed: 01/28/2023]
Abstract
Motivation With the steadily increasing abundance of omics data produced all over the world under vastly different experimental conditions residing in public databases, a crucial step in many data-driven bioinformatics applications is that of data integration. The challenge of batch-effect removal for entire databases lies in the large number of batches and biological variation, which can result in design matrix singularity. This problem cannot currently be solved satisfactorily by any common batch-correction algorithm. Results We present reComBat, a regularized version of the empirical Bayes method to overcome this limitation and benchmark it against popular approaches for the harmonization of public gene-expression data (both microarray and bulk RNA-seq) of the human opportunistic pathogen Pseudomonas aeruginosa. Batch effects are successfully mitigated while biologically meaningful gene-expression variation is retained. reComBat fills the gap in batch-correction approaches applicable to large-scale, public omics databases and opens up new avenues for data-driven analysis of complex biological processes beyond the scope of a single study. Availability and implementation The code is available at https://github.com/BorgwardtLab/reComBat, all data and evaluation code can be found at https://github.com/BorgwardtLab/batchCorrectionPublicData. Supplementary information Supplementary data are available at Bioinformatics Advances online.
|
10
|
Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning. Nat Med 2022; 28:164-174. [PMID: 35013613 DOI: 10.1038/s41591-021-01619-9] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Received: 08/08/2020] [Accepted: 11/08/2021] [Indexed: 12/20/2022]
Abstract
Early use of effective antimicrobial treatments is critical for the outcome of infections and the prevention of treatment resistance. Antimicrobial resistance testing enables the selection of optimal antibiotic treatments, but current culture-based techniques can take up to 72 hours to generate results. We have developed a novel machine learning approach to predict antimicrobial resistance directly from matrix-assisted laser desorption/ionization-time of flight (MALDI-TOF) mass spectra profiles of clinical isolates. We trained calibrated classifiers on a newly created publicly available database of mass spectra profiles from the clinically most relevant isolates with linked antimicrobial susceptibility phenotypes. This dataset combines more than 300,000 mass spectra with more than 750,000 antimicrobial resistance phenotypes from four medical institutions. Validation on a panel of clinically important pathogens, including Staphylococcus aureus, Escherichia coli and Klebsiella pneumoniae, resulted in areas under the receiver operating characteristic curve of 0.80, 0.74 and 0.74, respectively, demonstrating the potential of using machine learning to substantially accelerate antimicrobial resistance determination and change of clinical management. Furthermore, a retrospective clinical case study of 63 patients found that implementing this approach would have changed the clinical treatment in nine cases, which would have been beneficial in eight cases (89%). MALDI-TOF mass spectra-based machine learning may thus be an important new tool for treatment optimization and antibiotic stewardship.
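For readers unfamiliar with the metric reported above, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The Python sketch below computes it that way; the function `roc_auc` and the toy labels and scores are illustrative assumptions and are unrelated to the study's data or code.

```python
from itertools import product
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = resistant, 0 = susceptible.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(toy_labels, toy_scores)
print(f"Toy AUC: {auc:.3f}")
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; a sort-based implementation would be the usual choice at the scale of hundreds of thousands of spectra.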
|
11
|
Early Prediction of Sepsis in the ICU Using Machine Learning: A Systematic Review. Front Med (Lausanne) 2021; 8:607952. [PMID: 34124082 PMCID: PMC8193357 DOI: 10.3389/fmed.2021.607952] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 09/18/2020] [Accepted: 03/04/2021] [Indexed: 12/12/2022] Open
Abstract
Background: Sepsis is among the leading causes of death in intensive care units (ICUs) worldwide and its recognition, particularly in the early stages of the disease, remains a medical challenge. The growing abundance of available digital health data has created a setting in which machine learning can be used for digital biomarker discovery, with the ultimate goal to advance the early recognition of sepsis. Objective: To systematically review and evaluate studies employing machine learning for the prediction of sepsis in the ICU. Data Sources: Using Embase, Google Scholar, PubMed/Medline, Scopus, and Web of Science, we systematically searched the existing literature for machine learning-driven sepsis onset prediction for patients in the ICU. Study Eligibility Criteria: All peer-reviewed articles using machine learning for the prediction of sepsis onset in adult ICU patients were included. Studies focusing on patient populations outside the ICU were excluded. Study Appraisal and Synthesis Methods: A systematic review was performed according to the PRISMA guidelines. Moreover, a quality assessment of all eligible studies was performed. Results: Out of 974 identified articles, 22 and 21 met the criteria to be included in the systematic review and quality assessment, respectively. A multitude of machine learning algorithms were applied to refine the early prediction of sepsis. The quality of the studies ranged from "poor" (satisfying ≤ 40% of the quality criteria) to "very good" (satisfying ≥ 90% of the quality criteria). The majority of the studies (n = 19, 86.4%) employed an offline training scenario combined with a horizon evaluation, while two studies implemented an online scenario (n = 2, 9.1%). The massive inter-study heterogeneity in terms of model development, sepsis definition, prediction time windows, and outcomes precluded a meta-analysis. Last, only two studies provided publicly accessible source code and data sources fostering reproducibility.
Limitations: Articles were only eligible for inclusion when employing machine learning algorithms for the prediction of sepsis onset in the ICU. This restriction led to the exclusion of studies focusing on the prediction of septic shock, sepsis-related mortality, and patient populations outside the ICU. Conclusions and Key Findings: A growing number of studies employ machine learning to optimize the early prediction of sepsis through digital biomarker discovery. This review, however, highlights several shortcomings of the current approaches, including low comparability and reproducibility. Finally, we gather recommendations on how these challenges can be addressed before deploying these models in prospective analyses. Systematic Review Registration Number: CRD42020200133.
|
12
|
Network-guided search for genetic heterogeneity between gene pairs. Bioinformatics 2021; 37:57-65. [PMID: 32573681 PMCID: PMC8034561 DOI: 10.1093/bioinformatics/btaa581] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/31/2019] [Revised: 05/19/2020] [Accepted: 06/15/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Correlating genetic loci with a disease phenotype is a common approach to improve our understanding of the genetics underlying complex diseases. Standard analyses mostly ignore two aspects, namely genetic heterogeneity and interactions between loci. Genetic heterogeneity, the phenomenon that genetic variants at different loci lead to the same phenotype, promises to increase statistical power by aggregating low-signal variants. Incorporating interactions between loci results in a computational and statistical bottleneck due to the vast amount of candidate interactions. RESULTS We propose a novel method SiNIMin that addresses these two aspects by finding pairs of interacting genes that are, upon combination, associated with a phenotype of interest under a model of genetic heterogeneity. We guide the interaction search using biological prior knowledge in the form of protein-protein interaction networks. Our method controls type I error and outperforms state-of-the-art methods with respect to statistical power. Additionally, we find novel associations for multiple Arabidopsis thaliana phenotypes, and, with an adapted variant of SiNIMin, for a study of rare variants in migraine patients. AVAILABILITY AND IMPLEMENTATION Code available at https://github.com/BorgwardtLab/SiNIMin. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
13
|
Biological network analysis with deep learning. Brief Bioinform 2021; 22:1515-1530. [PMID: 33169146 PMCID: PMC7986589 DOI: 10.1093/bib/bbaa257] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 04/30/2020] [Revised: 08/26/2020] [Accepted: 09/11/2020] [Indexed: 12/17/2022] Open
Abstract
Recent advancements in experimental high-throughput technologies have expanded the availability and quantity of molecular data in biology. Given the importance of interactions in biological processes, such as the interactions between proteins or the bonds within a chemical compound, this data is often represented in the form of a biological network. The rise of this data has created a need for new computational tools to analyze networks. One major trend in the field is to use deep learning for this goal and, more specifically, to use methods that work with networks, the so-called graph neural networks (GNNs). In this article, we describe biological networks and review the principles and underlying algorithms of GNNs. We then discuss domains in bioinformatics in which graph neural networks are frequently being applied at the moment, such as protein function prediction, protein-protein interaction prediction and in silico drug discovery and development. Finally, we highlight application areas such as gene regulatory networks and disease diagnosis where deep learning is emerging as a new tool to answer classic questions like gene interaction prediction and automatic disease prediction from data.
|
14
|
Biological network analysis with deep learning. Brief Bioinform 2021; 22:1515-1530. [PMID: 33169146 DOI: 10.1145/3447548.3467442] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2020] [Revised: 08/26/2020] [Accepted: 09/11/2020] [Indexed: 05/28/2023] Open
|
15
|
Enhancing statistical power in temporal biomarker discovery through representative shapelet mining. Bioinformatics 2021; 36:i840-i848. [PMID: 33381811 PMCID: PMC7773478 DOI: 10.1093/bioinformatics/btaa815] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023] Open
Abstract
Motivation Temporal biomarker discovery in longitudinal data is based on detecting reoccurring trajectories, the so-called shapelets. The search for shapelets requires considering all subsequences in the data. While the accompanying issue of multiple testing has been mitigated in previous work, the redundancy and overlap of the detected shapelets result in an a priori unbounded number of highly similar and structurally meaningless shapelets. As a consequence, current temporal biomarker discovery methods are impractical and underpowered. Results We find that the pre- or post-processing of shapelets does not sufficiently increase the power and practical utility. Consequently, we present a novel method for temporal biomarker discovery: Statistically Significant Submodular Subset Shapelet Mining (S5M), which retrieves short subsequences that (i) occur in the data, (ii) are statistically significantly associated with the phenotype and (iii) are of manageable quantity while maximizing structural diversity. Structural diversity is achieved by pruning non-representative shapelets via submodular optimization. This increases the statistical power and utility of S5M compared to state-of-the-art approaches on simulated and real-world datasets. For patients admitted to the intensive care unit (ICU) showing signs of severe organ failure, we find temporal patterns in the sequential organ failure assessment score that are associated with in-ICU mortality. Availability and implementation S5M is an option in the Python package of S3M: github.com/BorgwardtLab/S3M.
|
16
|
Abstract
MOTIVATION Microbial species identification based on matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) mass spectrometry (MS) has become a standard tool in clinical microbiology. The resulting MALDI-TOF mass spectra also harbour the potential to deliver prediction results for other phenotypes, such as antibiotic resistance. However, the development of machine learning algorithms specifically tailored to MALDI-TOF MS-based phenotype prediction is still in its infancy. Moreover, current spectral pre-processing typically involves a parameter-heavy chain of operations without analyzing their influence on the prediction results. In addition, classification algorithms lack quantification of uncertainty, which is indispensable for predictions potentially influencing patient treatment. RESULTS We present a novel prediction method for antimicrobial resistance based on MALDI-TOF mass spectra. First, we compare the complex conventional pre-processing to a new approach that exploits topological information and requires only a single parameter, namely the number of peaks of a spectrum to keep. Second, we introduce PIKE, the peak information kernel, a similarity measure specifically tailored to MALDI-TOF mass spectra which, combined with a Gaussian process classifier, provides well-calibrated uncertainty estimates about predictions. We demonstrate the utility of our approach by predicting antibiotic resistance of three clinically highly relevant bacterial species. Our method consistently outperforms competitor approaches, while demonstrating improved performance and security by rejecting out-of-distribution samples, such as bacterial species that are not represented in the training data. Ultimately, our method could contribute to an earlier and precise antimicrobial treatment in clinical patient care. AVAILABILITY AND IMPLEMENTATION We make our code publicly available as an easy-to-use Python package under https://github.com/BorgwardtLab/maldi_PIKE.
|
17
|
Machine Learning for Biomedical Time Series Classification: From Shapelets to Deep Learning. Methods Mol Biol 2021; 2190:33-71. [PMID: 32804360 DOI: 10.1007/978-1-0716-0826-5_2] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/12/2022]
Abstract
With the biomedical field generating large quantities of time series data, there has been a growing interest in developing and refining machine learning methods that allow its mining and exploitation. Classification is one of the most important and challenging machine learning tasks related to time series. Many biomedical phenomena, such as the brain's activity or blood pressure, change over time. The objective of this chapter is to provide a gentle introduction to time series classification. In the first part we describe the characteristics of time series data and challenges in its analysis. The second part provides an overview of common machine learning methods used for time series classification. A real-world use case, the early recognition of sepsis, demonstrates the applicability of the methods discussed.
|
18
|
Using routine MRI data of depressed patients to predict individual responses to electroconvulsive therapy. Exp Neurol 2020; 335:113505. [PMID: 33068570 DOI: 10.1016/j.expneurol.2020.113505] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/31/2020] [Revised: 10/07/2020] [Accepted: 10/07/2020] [Indexed: 12/30/2022]
Abstract
Electroconvulsive therapy (ECT) is one of the most effective treatments in cases of severe and treatment-resistant major depression. Between 60% and 80% of patients respond to ECT, but the procedure is demanding and robust prediction of ECT responses would be of great clinical value. Predictions based on neuroimaging data have recently come into focus, but still face methodological and practical limitations that are hampering the translation into clinical practice. In this retrospective study, we investigated the feasibility of ECT response prediction using structural magnetic resonance imaging (sMRI) data that was collected during ECT routine examinations. We applied machine learning techniques to predict individual treatment outcomes in a cohort of N = 71 ECT patients, N = 39 of whom responded to the treatment. sMRI-based classification of ECT responders and non-responders reached an accuracy of 69% (sensitivity: 67%; specificity: 72%). Classification on additionally investigated clinical variables had no predictive power. Since dichotomization of patients into ECT responders and non-responders is debatable due to many patients only showing a partial response, we additionally performed a post-hoc regression-based prediction analysis on continuous symptom improvements. This analysis yielded a significant relationship between true and predicted treatment outcomes and might be a promising alternative to dichotomization of patients. Based on our results, we argue that the prediction of individual ECT responses based on routine sMRI holds promise to overcome important limitations that are currently hampering the translation of such treatment biomarkers into everyday clinical practice. Finally, we discuss how the results of such predictive data analysis could best support the clinician's decision on whether a patient should be treated with ECT.
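The reported accuracy, sensitivity and specificity can be checked against each other, because overall accuracy is the class-size-weighted mean of sensitivity and specificity. The following is a minimal sketch of that arithmetic, using only the figures stated in the abstract; the function name `overall_accuracy` is illustrative and not taken from the study.

```python
# Cohort sizes as stated in the abstract.
n_total = 71
n_responders = 39
n_nonresponders = n_total - n_responders  # 32 non-responders

# Reported classifier performance.
sensitivity = 0.67  # fraction of responders correctly identified
specificity = 0.72  # fraction of non-responders correctly identified


def overall_accuracy(sens, spec, n_pos, n_neg):
    """Accuracy as the class-size-weighted mean of sensitivity and specificity."""
    return (sens * n_pos + spec * n_neg) / (n_pos + n_neg)


accuracy = overall_accuracy(sensitivity, specificity, n_responders, n_nonresponders)
print(f"{accuracy:.1%}")  # prints 69.3%
```

The result rounds to the 69% accuracy given in the abstract, so the three reported figures are mutually consistent for this cohort split.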
|
19
|
Machine learning for microbial identification and antimicrobial susceptibility testing on MALDI-TOF mass spectra: a systematic review. Clin Microbiol Infect 2020; 26:1310-1317. [DOI: 10.1016/j.cmi.2020.03.014] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 12/03/2019] [Revised: 03/05/2020] [Accepted: 03/13/2020] [Indexed: 01/12/2023]
|
20
|
Comorbidities, clinical signs and symptoms, laboratory findings, imaging features, treatment strategies, and outcomes in adult and pediatric patients with COVID-19: A systematic review and meta-analysis. Travel Med Infect Dis 2020; 37:101825. [PMID: 32763496 PMCID: PMC7402237 DOI: 10.1016/j.tmaid.2020.101825] [Citation(s) in RCA: 96] [Impact Index Per Article: 24.0] [Received: 05/14/2020] [Revised: 07/09/2020] [Accepted: 07/27/2020] [Indexed: 01/08/2023]
Abstract
INTRODUCTION: Since December 2019, a novel coronavirus (SARS-CoV-2) has triggered a world-wide pandemic with an enormous medical and societal-economic toll. Thus, our aim was to gather all available information regarding comorbidities, clinical signs and symptoms, outcomes, laboratory findings, imaging features, and treatments in patients with coronavirus disease 2019 (COVID-19). METHODS: EMBASE, PubMed/Medline, Scopus, and Web of Science were searched for studies published in any language between December 1st, 2019 and March 28th, 2020. Original studies were included if the exposure of interest was an infection with SARS-CoV-2 or confirmed COVID-19. The primary outcome was the risk ratio of comorbidities, clinical signs and symptoms, laboratory findings, imaging features, treatments, outcomes, and complications associated with COVID-19 morbidity and mortality. We performed random-effects pairwise meta-analyses for proportions and relative risks, I², τ², and Cochran's Q, sensitivity analyses, and assessed publication bias. RESULTS: 148 studies met the inclusion criteria for the systematic review and meta-analysis with 12,149 patients (5,739 female) and a median age of 47.0 [35.0-64.6] years. 617 patients died from COVID-19 and its complications. 297 patients were reported as asymptomatic. Older age (SMD: 1.25 [0.78-1.72]; p < 0.001), being male (RR = 1.32 [1.13-1.54], p = 0.005) and pre-existing comorbidity (RR = 1.69 [1.48-1.94]; p < 0.001) were identified as risk factors of in-hospital mortality. The heterogeneity between studies varied substantially (I²; range: 1.5-98.2%). Publication bias was only found in eight studies (Egger's test: p < 0.05). CONCLUSIONS: Our meta-analyses revealed important risk factors that are associated with severity and mortality of COVID-19.
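The heterogeneity statistics named in the methods can be illustrated with a short sketch. It computes Cochran's Q, I² and the DerSimonian-Laird estimate of τ² from per-study effects and variances. The abstract does not say which τ² estimator was used, so DerSimonian-Laird is an assumption here, and the three input studies are invented purely for illustration; they are not data from the review.

```python
def heterogeneity(effects, variances):
    """Return (Cochran's Q, I^2 in percent, DerSimonian-Laird tau^2).

    effects:   per-study effect estimates (e.g. log risk ratios)
    variances: per-study sampling variances of those estimates
    """
    if len(effects) < 2 or len(effects) != len(variances):
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    w_sum = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / w_sum
    # Q: weighted squared deviations from the fixed-effect pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # DerSimonian-Laird between-study variance (an assumed estimator).
    c = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - df) / c)
    return q, i2, tau2


# Hypothetical log risk ratios and variances, for illustration only.
q, i2, tau2 = heterogeneity([0.10, 0.30, 0.50], [0.01, 0.01, 0.01])
print(f"Q={q:.2f}, I2={i2:.1f}%, tau2={tau2:.3f}")  # Q=8.00, I2=75.0%, tau2=0.030
```

An I² of 75% in this toy example would fall in the upper part of the 1.5-98.2% range reported across the review's analyses.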
|
21
|
SPHN/PHRT: Forming a Swiss-Wide Infrastructure for Data-Driven Sepsis Research. Stud Health Technol Inform 2020; 270:1163-1167. [PMID: 32570564 DOI: 10.3233/shti200346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Sepsis is a highly heterogeneous syndrome with variable causes and outcomes. As part of the SPHN/PHRT funding program, we aim to build a highly interoperable, interconnected network for data collection, exchange and analysis of patients on intensive care units in order to predict sepsis onset and mortality earlier. All five University Hospitals, Universities, the Swiss Institute of Bioinformatics and ETH Zurich are involved in this multi-disciplinary project. With two prospective clinical observational studies, we test our infrastructure setup, gradually improve the framework, and generate relevant data for research.
|
22
|
Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping. Nat Commun 2020; 11:3551. [PMID: 32669542 PMCID: PMC7363850 DOI: 10.1038/s41467-020-17222-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 02/11/2020] [Accepted: 06/13/2020] [Indexed: 01/23/2023] Open
Abstract
Predicting effects of gene regulatory elements (GREs) is a longstanding challenge in biology. Machine learning may address this, but requires large datasets linking GREs to their quantitative function. However, experimental methods to generate such datasets are either application-specific or technically complex and error-prone. Here, we introduce DNA-based phenotypic recording as a widely applicable, practicable approach to generate large-scale sequence-function datasets. We use a site-specific recombinase to directly record a GRE's effect in DNA, enabling readout of both sequence and quantitative function for extremely large GRE-sets via next-generation sequencing. We record translation kinetics of over 300,000 bacterial ribosome binding sites (RBSs) in >2.7 million sequence-function pairs in a single experiment. Further, we introduce a deep learning approach employing ensembling and uncertainty modelling that predicts RBS function with high accuracy, outperforming state-of-the-art methods. DNA-based phenotypic recording combined with deep learning represents a major advance in our ability to predict function from genetic sequence.
|
23
|
Abstract
MOTIVATION: Gaining a comprehensive understanding of the genetics underlying cancer development and progression is a central goal of biomedical research. Its accomplishment promises key mechanistic, diagnostic and therapeutic insights. One major step in this direction is the identification of genes that drive the emergence of tumors upon mutation. Recent advances in the field of computational biology have shown the potential of combining genetic summary statistics that represent the mutational burden in genes with biological networks, such as protein-protein interaction networks, to identify cancer driver genes. Those approaches superimpose the summary statistics on the nodes in the network, followed by an unsupervised propagation of the node scores through the network. However, this unsupervised setting does not leverage any knowledge of well-established cancer genes, a potentially valuable resource to improve the identification of novel cancer drivers. RESULTS: We develop a novel node embedding that enables classification of cancer driver genes in a supervised setting. The embedding combines a representation of the mutation score distribution in a node's local neighborhood with network propagation. We leverage the knowledge of well-established cancer driver genes to define a positive class, resulting in a partially labeled dataset, and develop a cross-validation scheme to enable supervised prediction. The proposed node embedding followed by a supervised classification improves the predictive performance compared with baseline methods and yields a set of promising genes that constitute candidates for further biological validation. AVAILABILITY AND IMPLEMENTATION: Code available at https://github.com/BorgwardtLab/MoProEmbeddings. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
|
24
|
CASMAP: detection of statistically significant combinations of SNPs in association mapping. Bioinformatics 2020; 35:2680-2682. [PMID: 30541062 PMCID: PMC6662083 DOI: 10.1093/bioinformatics/bty1020] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/08/2018] [Revised: 10/18/2018] [Accepted: 12/10/2018] [Indexed: 11/14/2022] Open
Abstract
SUMMARY: Combinatorial association mapping aims to assess the statistical association of higher-order interactions of genetic markers with a phenotype of interest. This article presents combinatorial association mapping (CASMAP), a software package that leverages recent advances in significant pattern mining to overcome the statistical and computational challenges that have hindered combinatorial association mapping. CASMAP can be used to perform region-based association studies and to detect higher-order epistatic interactions of genetic variants. Most importantly, unlike other existing significant pattern mining-based tools, CASMAP allows for the correction for categorical covariates such as age or gender, making it suitable for genome-wide association studies. AVAILABILITY AND IMPLEMENTATION: The R and Python packages can be downloaded from our GitHub repository http://github.com/BorgwardtLab/CASMAP. The R package is also available on CRAN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
|
25
|
Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med 2020; 26:364-373. [DOI: 10.1038/s41591-020-0789-4] [Citation(s) in RCA: 113] [Impact Index Per Article: 28.3] [Received: 04/09/2019] [Accepted: 02/04/2020] [Indexed: 01/12/2023]
|
26
|
AraPheno and the AraGWAS Catalog 2020: a major database update including RNA-Seq and knockout mutation data for Arabidopsis thaliana. Nucleic Acids Res 2020; 48:D1063-D1068. [PMID: 31642487 PMCID: PMC7145550 DOI: 10.1093/nar/gkz925] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 09/06/2019] [Revised: 09/26/2019] [Accepted: 10/08/2019] [Indexed: 12/23/2022] Open
Abstract
Genome-wide association studies (GWAS) are integral for studying genotype-phenotype relationships and gaining a deeper understanding of the genetic architecture underlying trait variation. A plethora of genetic associations between distinct loci and various traits have been successfully discovered and published for the model plant Arabidopsis thaliana. This success and the free availability of full genomes and phenotypic data for more than 1,000 different natural inbred lines led to the development of several data repositories. AraPheno (https://arapheno.1001genomes.org) serves as a central repository of population-scale phenotypes in A. thaliana, while the AraGWAS Catalog (https://aragwas.1001genomes.org) provides a publicly available, manually curated and standardized collection of marker-trait associations for all available phenotypes from AraPheno. In this major update, we introduce the next generation of both platforms, including new data, features and tools. We included novel results on associations between knockout-mutations and all AraPheno traits. Furthermore, AraPheno has been extended to display RNA-Seq data for hundreds of accessions, providing expression information for over 28 000 genes for these accessions. All data, including the imputed genotype matrix used for GWAS, are easily downloadable via the respective databases.
|
27
|
Graph Kernels: State-of-the-Art and Future Challenges. FOUNDATIONS AND TRENDS® IN MACHINE LEARNING 2020. [DOI: 10.1561/2200000076] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/01/2022]
|
28
|
Pretransplant Kinetics of Anti-HLA Antibodies in Patients on the Waiting List for Kidney Transplantation. J Am Soc Nephrol 2019; 30:2262-2274. [PMID: 31653784 DOI: 10.1681/asn.2019060594] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/12/2019] [Accepted: 08/19/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND: Patients on organ transplant waiting lists are evaluated for preexisting alloimmunity to minimize episodes of acute and chronic rejection by regularly monitoring for changes in alloimmune status. There are few studies on how alloimmunity changes over time in patients on kidney allograft waiting lists, and an apparent lack of research-based evidence supporting currently used monitoring intervals. METHODS: To investigate the dynamics of alloimmune responses directed at HLA antigens, we retrospectively evaluated data on anti-HLA antibodies measured by the single-antigen bead assay from 627 waitlisted patients who subsequently received a kidney transplant at University Hospital Zurich, Switzerland, between 2008 and 2017. Our analysis focused on a filtered dataset comprising 467 patients who had at least two assay measurements. RESULTS: Within the filtered dataset, we analyzed potential changes in mean fluorescence intensity values (reflecting bound anti-HLA antibodies) between consecutive measurements for individual patients in relation to the time interval between measurements. Using multiple approaches, we found no correlation between these two factors. However, when we stratified the dataset on the basis of documented previous immunizing events (transplant, pregnancy, or transfusion), we found significant differences in the magnitude of change in alloimmune status, especially among patients with a previous transplant versus patients without such a history. Further efforts to cluster patients according to statistical properties related to alloimmune status kinetics were unsuccessful, indicating considerable complexity in individual variability. CONCLUSIONS: Alloimmune kinetics in patients on a kidney transplant waiting list do not appear to be related to the interval between measurements, but are instead associated with alloimmunization history. This suggests that an individualized strategy for alloimmune status monitoring may be preferable to currently used intervals.
|
29
|
Abstract
Motivation: Large-scale screenings of cancer cell lines with detailed molecular profiles against libraries of pharmacological compounds are currently being performed in order to gain a better understanding of the genetic component of drug response and to enhance our ability to recommend therapies given a patient's molecular profile. These comprehensive screens differ from the clinical setting in which (i) medical records only contain the response of a patient to very few drugs, (ii) drugs are recommended by doctors based on their expert judgment and (iii) selecting the most promising therapy is often more important than accurately predicting the sensitivity to all potential drugs. Current regression models for drug sensitivity prediction fail to account for these three properties. Results: We present a machine learning approach, named Kernelized Rank Learning (KRL), that ranks drugs based on their predicted effect per cell line (patient), circumventing the difficult problem of precisely predicting the sensitivity to the given drug. Our approach outperforms several state-of-the-art predictors in drug recommendation, particularly if the training dataset is sparse, and generalizes to patient data. Our work phrases personalized drug recommendation as a new type of machine learning problem with translational potential to the clinic. Availability and implementation: The Python implementation of KRL and scripts for running our experiments are available at https://github.com/BorgwardtLab/Kernelized-Rank-Learning. Supplementary information: Supplementary data are available at Bioinformatics online.
|
30
|
Abstract
Motivation: Most modern intensive care units record the physiological and vital signs of patients. These data can be used to extract signatures, commonly known as biomarkers, that help physicians understand the biological complexity of many syndromes. However, most biological biomarkers suffer from either poor predictive performance or weak explanatory power. Recent developments in time series classification focus on discovering shapelets, i.e. subsequences that are most predictive in terms of class membership. Shapelets have the advantage of combining a high predictive performance with an interpretable component, namely their shape. Currently, most shapelet discovery methods do not rely on statistical tests to verify the significance of individual shapelets. Therefore, identifying associations between the shapelets of physiological biomarkers and patients that exhibit certain phenotypes of interest enables the discovery and subsequent ranking of physiological signatures that are interpretable, statistically validated and accurate predictors of clinical endpoints. Results: We present a novel and scalable method for scanning time series and identifying discriminative patterns that are statistically significant. The significance of a shapelet is evaluated while considering the problem of multiple hypothesis testing and mitigating it by efficiently pruning untestable shapelet candidates with Tarone's method. We demonstrate the utility of our method by discovering patterns in three of a patient's vital signs (heart rate, respiratory rate and systolic blood pressure) that are indicators of the severity of a future sepsis event, i.e. an inflammatory response to an infective agent that can lead to organ failure and death, if not treated in time. Availability and implementation: We make our method and the scripts that are required to reproduce the experiments publicly available at https://github.com/BorgwardtLab/S3M. Supplementary information: Supplementary data are available at Bioinformatics online.
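The pruning idea behind Tarone's method can be sketched in a few lines. A candidate pattern that occurs in only a handful of samples has a minimum attainable p-value that can never fall below the significance threshold, so it need not count towards the multiple-testing correction. The sketch below assumes a one-sided Fisher's exact test and a brute-force search over the correction factor; the abstract does not state which test the authors use, and the function names and example counts are hypothetical.

```python
from math import comb


def min_attainable_pvalue(x, n, n1):
    """Smallest one-sided Fisher p-value for a pattern present in x of n samples,
    when n1 of the n samples belong to the positive class."""
    a = min(x, n1)  # most extreme table: occurrences packed into the positive class
    # Hypergeometric probability of that single most extreme table.
    return comb(n1, a) * comb(n - n1, x - a) / comb(n, x)


def tarone_threshold(supports, n, n1, alpha=0.05):
    """Return (corrected threshold, k): k is the smallest integer such that at most
    k candidates are testable, i.e. have minimum p-value <= alpha / k."""
    pmins = [min_attainable_pvalue(x, n, n1) for x in supports]
    k = 1
    while sum(p <= alpha / k for p in pmins) > k:
        k += 1
    return alpha / k, k


# Hypothetical example: 6 candidate patterns in 20 samples, 10 of them positive.
supports = [1, 2, 5, 8, 10, 10]
threshold, k = tarone_threshold(supports, n=20, n1=10)
print(threshold, k)  # corrected threshold 0.05/4 with k = 4
```

In this toy case the two rarest patterns are untestable, so the correction factor drops from 6 (Bonferroni, threshold about 0.0083) to 4 (threshold 0.0125), which is the gain in statistical power the abstract alludes to.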
|
31
|
The AraGWAS Catalog: a curated and standardized Arabidopsis thaliana GWAS catalog. Nucleic Acids Res 2019; 46:D1150-D1156. [PMID: 29059333 PMCID: PMC5753280 DOI: 10.1093/nar/gkx954] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 08/14/2017] [Accepted: 10/06/2017] [Indexed: 12/21/2022] Open
Abstract
The abundance of high-quality genotype and phenotype data for the model organism Arabidopsis thaliana enables scientists to study the genetic architecture of many complex traits at an unprecedented level of detail using genome-wide association studies (GWAS). GWAS have been a great success in A. thaliana and many SNP-trait associations have been published. With the AraGWAS Catalog (https://aragwas.1001genomes.org) we provide a publicly available, manually curated and standardized GWAS catalog for all publicly available phenotypes from the central A. thaliana phenotype repository, AraPheno. All GWAS have been recomputed on the latest imputed genotype release of the 1001 Genomes Consortium using a standardized GWAS pipeline to ensure comparability between results. The catalog currently includes 167 phenotypes and more than 222 000 SNP-trait associations with P < 10⁻⁴, of which 3887 are significantly associated using permutation-based thresholds. The AraGWAS Catalog can be accessed via a modern web-interface and provides various features to easily access, download and visualize the results and summary statistics across GWAS.
|
32
|
Introduction to the special issue for the ECML PKDD 2019 journal track. Data Min Knowl Discov 2019. [DOI: 10.1007/s10618-019-00642-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
33
|
Introduction to the special issue for the ECML PKDD 2019 journal track. Mach Learn 2019. [DOI: 10.1007/s10994-019-05831-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
|
34
|
|
35
|
graphkernels: R and Python packages for graph comparison. Bioinformatics 2018; 34:530-532. [PMID: 29028902 PMCID: PMC5860361 DOI: 10.1093/bioinformatics/btx602] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 06/10/2017] [Accepted: 09/19/2017] [Indexed: 11/12/2022] Open
Abstract
Summary: Measuring the similarity of graphs is a fundamental step in the analysis of graph-structured data, which is omnipresent in computational biology. Graph kernels have been proposed as a powerful and efficient approach to this problem of graph comparison. Here we provide graphkernels, the first R and Python graph kernel libraries including baseline kernels such as label histogram based kernels, classic graph kernels such as random walk based kernels, and the state-of-the-art Weisfeiler-Lehman graph kernel. The core of all graph kernels is implemented in C++ for efficiency. Using the kernel matrices computed by the package, we can easily perform tasks such as classification, regression and clustering on graph-structured samples. Availability and implementation: The R and Python packages including source code are available at https://CRAN.R-project.org/package=graphkernels and https://pypi.python.org/pypi/graphkernels. Contact: mahito@nii.ac.jp or elisabetta.ghisu@bsse.ethz.ch. Supplementary information: Supplementary data are available online at Bioinformatics.
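To make the notion of a baseline graph kernel concrete, here is a minimal sketch of a vertex label histogram kernel written in plain Python. It deliberately does not use the graphkernels package or imitate its API; the function name, the representation of a graph as a list of vertex labels, and the example graphs are all assumptions made for illustration.

```python
from collections import Counter


def label_histogram_kernel(labels_a, labels_b):
    """Dot product of the vertex label histograms of two graphs.

    Each graph is represented only by the list of its vertex labels;
    edges are ignored, which is what makes this a baseline kernel.
    """
    hist_a, hist_b = Counter(labels_a), Counter(labels_b)
    shared = hist_a.keys() & hist_b.keys()
    return sum(hist_a[label] * hist_b[label] for label in shared)


# Hypothetical toy graphs, given by their vertex labels.
g1 = ["C", "C", "O"]
g2 = ["C", "O", "O", "N"]
graphs = [g1, g2]

# Kernel (Gram) matrix over all pairs of graphs.
kernel_matrix = [[label_histogram_kernel(a, b) for b in graphs] for a in graphs]
print(kernel_matrix)  # [[5, 4], [4, 6]]
```

A matrix of this form is what a kernel method such as a support vector machine would consume for the classification, regression and clustering tasks mentioned in the abstract.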
|
36
|
Aberrant working memory processing in major depression: evidence from multivoxel pattern classification. Neuropsychopharmacology 2018; 43:1972-1979. [PMID: 29777198 PMCID: PMC6046039 DOI: 10.1038/s41386-018-0081-1] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 11/30/2017] [Revised: 04/20/2018] [Accepted: 04/23/2018] [Indexed: 01/01/2023]
Abstract
Major depressive disorder (MDD) is often accompanied by severe impairments in working memory (WM). Neuroimaging studies investigating the mechanisms underlying these impairments have produced conflicting results. It remains unclear whether MDD patients show hyper- or hypoactivity in WM-related brain regions and how potential aberrations in WM processing may contribute to the characteristic dysregulation of cognition-emotion interactions implicated in the maintenance of the disorder. In order to shed light on these questions and to overcome limitations of previous studies, we applied a multivoxel pattern classification approach to investigate brain activity in large samples of MDD patients (N = 57) and matched healthy controls (N = 61) during a WM task that incorporated positive, negative, and neutral stimuli. Results showed that patients can be distinguished from healthy controls with good classification accuracy based on functional activation patterns. ROI analyses based on the classification weight maps showed that during WM, patients had higher activity in the left DLPFC and the dorsal ACC. Furthermore, regions of the default-mode network (DMN) were less deactivated in patients. As no performance differences were observed, we conclude that patients required more effort, indexed by more activity in WM-related regions, to successfully perform the task. This increased effort might be related to difficulties in suppressing task-irrelevant information reflected by reduced deactivation of regions within the DMN. Effects were most pronounced for negative and neutral stimuli, thus pointing toward important implications of aberrations in WM processes in cognition-emotion interactions in MDD.
|
37
|
Genome-wide genetic heterogeneity discovery with categorical covariates. Bioinformatics 2018; 33:1820-1828. [PMID: 28200033 PMCID: PMC5870548 DOI: 10.1093/bioinformatics/btx071] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 04/01/2016] [Accepted: 02/08/2017] [Indexed: 12/30/2022] Open
Abstract
Motivation: Genetic heterogeneity is the phenomenon that distinct genetic variants may give rise to the same phenotype. The recently introduced algorithm Fast Automatic Interval Search (FAIS) enables the genome-wide search of candidate regions for genetic heterogeneity in the form of any contiguous sequence of variants, and achieves high computational efficiency and statistical power. Although FAIS can test all possible genomic regions for association with a phenotype, a key limitation is its inability to correct for confounders such as gender or population structure, which may lead to numerous false-positive associations. Results: We propose FastCMH, a method that overcomes this problem by properly accounting for categorical confounders, while still retaining statistical power and computational efficiency. Experiments comparing FastCMH with FAIS and multiple kinds of burden tests on simulated data, as well as on human and Arabidopsis samples, demonstrate that FastCMH can drastically reduce genomic inflation and discover associations that are missed by standard burden tests. Availability and Implementation: An R package fastcmh is available on CRAN and the source code can be found at: https://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/fastcmh.html. Supplementary information: Supplementary data are available at Bioinformatics online.
|
38
|
AraPheno: a public database for Arabidopsis thaliana phenotypes. Nucleic Acids Res 2016; 45:D1054-D1059. [PMID: 27924043 PMCID: PMC5210660 DOI: 10.1093/nar/gkw986] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Received: 08/15/2016] [Revised: 10/11/2016] [Accepted: 10/18/2016] [Indexed: 12/11/2022] Open
Abstract
Natural genetic variation makes it possible to discover evolutionary changes that have been maintained in a population because they are advantageous. To understand genotype–phenotype relationships and to investigate trait architecture, the existence of both high-resolution genotypic and phenotypic data is necessary. Arabidopsis thaliana is a prime model for these purposes. This herb naturally occurs across much of the Eurasian continent and North America. Thus, it is exposed to a wide range of environmental factors and has been subject to natural selection under distinct conditions. Full genome sequencing data for more than 1000 different natural inbred lines are available, and this has encouraged the distributed generation of many types of phenotypic data. To leverage these data for meta-analyses, AraPheno (https://arapheno.1001genomes.org) provides a central repository of population-scale phenotypes for A. thaliana inbred lines. AraPheno includes various features to easily access, download and visualize the phenotypic data. This will facilitate a comparative analysis of the many different types of phenotypic data, which is the basis for further enhancing our understanding of the genotype–phenotype map.
|
39
|
Abstract
Motivation: Predicting disease phenotypes from genotypes is a key challenge in medical applications in the postgenomic era. Large training datasets of patients that have been both genotyped and phenotyped are the key requisite when aiming for high prediction accuracy. With current genotyping projects producing genetic data for hundreds of thousands of patients, large-scale phenotyping has become the bottleneck in disease phenotype prediction. Results: Here we present an approach for imputing missing disease phenotypes given the genotype of a patient. Our approach is based on co-training, which predicts the phenotype of unlabeled patients based on a second class of information, e.g. clinical health record information. Augmenting training datasets by this type of in silico phenotyping can lead to significant improvements in prediction accuracy. We demonstrate this on a dataset of patients with two diagnostic types of migraine, termed migraine with aura and migraine without aura, from the International Headache Genetics Consortium. Conclusions: Imputing missing disease phenotypes for patients via co-training leads to larger training datasets and improved prediction accuracy in phenotype prediction. Availability and implementation: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/co-training.html. Contact: karsten.borgwardt@bsse.ethz.ch or menno.witteveen@bsse.ethz.ch. Supplementary information: Supplementary data are available at Bioinformatics online.
|
40
|
Abstract
Motivation: Genetic heterogeneity, the fact that several sequence variants give rise to the same phenotype, is a phenomenon that is of the utmost interest in the analysis of complex phenotypes. Current approaches for finding regions in the genome that exhibit genetic heterogeneity suffer from at least one of two shortcomings: (i) they require the definition of an exact interval in the genome that is to be tested for genetic heterogeneity, potentially missing intervals of high relevance, or (ii) they suffer from an enormous multiple hypothesis testing problem due to the large number of potential candidate intervals being tested, which results in either many false positives or a lack of power to detect true intervals. Results: Here, we present an approach that overcomes both problems: it allows one to automatically find all contiguous sequences of single nucleotide polymorphisms in the genome that are jointly associated with the phenotype. It also solves both the inherent computational efficiency problem and the statistical problem of multiple hypothesis testing, which are both caused by the huge number of candidate intervals. We demonstrate on Arabidopsis thaliana genome-wide association study data that our approach can discover regions that exhibit genetic heterogeneity and would be missed by single-locus mapping. Conclusions: Our novel approach can contribute to the genome-wide discovery of intervals that are involved in the genetic heterogeneity underlying complex phenotypes. Availability and implementation: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/sis.html. Contact: felipe.llinares@bsse.ethz.ch Supplementary information: Supplementary data are available at Bioinformatics online.
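To make the multiple-testing problem concrete, the following naive sketch enumerates every contiguous SNP interval, collapses it into a single carrier indicator and tests it with Fisher's exact test under a Bonferroni correction. It is a brute-force baseline, not the method of the paper: the OR-style collapsing of an interval, the choice of test, the half-open `(start, end)` convention and the toy data are all assumptions made for illustration, and none of the paper's efficiency or testability machinery is reproduced.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are at most as likely as the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


def scan_intervals(genotypes, phenotype, max_len):
    """Test every interval [start, end) of up to max_len SNPs; return sorted (p, start, end)."""
    n_snps = len(genotypes[0])
    results = []
    for start in range(n_snps):
        for end in range(start + 1, min(start + max_len, n_snps) + 1):
            # A sample is a carrier if it has a minor allele anywhere in the interval.
            carrier = [int(any(g[start:end])) for g in genotypes]
            a = sum(1 for m, y in zip(carrier, phenotype) if m and y)
            b = sum(1 for m, y in zip(carrier, phenotype) if m and not y)
            c = sum(1 for m, y in zip(carrier, phenotype) if not m and y)
            d = sum(1 for m, y in zip(carrier, phenotype) if not m and not y)
            results.append((fisher_exact_two_sided(a, b, c, d), start, end))
    return sorted(results)


# Toy data: half of the cases carry a variant at SNP 1, the other half at SNP 2.
genotypes = [
    [1, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0],  # cases
    [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 1, 0],  # cases
    [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0],  # controls
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],  # controls
]
phenotype = [1] * 6 + [0] * 6

results = scan_intervals(genotypes, phenotype, max_len=4)
best_p, best_start, best_end = results[0]
bonferroni_threshold = 0.05 / len(results)
print(best_start, best_end, best_p, best_p < bonferroni_threshold)
```

In this toy setting the interval covering SNPs 1 and 2 is the top hit, while neither SNP reaches significance on its own. The number of candidate intervals grows quadratically with the number of SNPs, which is why a plain Bonferroni correction over all intervals loses power on genome-wide data.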
|
41
|
Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage. PLoS Genet 2015; 11:e1004920. [PMID: 25569172 PMCID: PMC4287485 DOI: 10.1371/journal.pgen.1004920] [Citation(s) in RCA: 104] [Impact Index Per Article: 11.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2014] [Accepted: 11/24/2014] [Indexed: 01/17/2023] Open
Abstract
There has been much excitement about the possibility that exposure to specific environments can induce an ecological memory in the form of wholesale, genome-wide epigenetic changes that are maintained over many generations. In the model plant Arabidopsis thaliana, numerous heritable DNA methylation differences have been identified in greenhouse-grown isogenic lines, but it remains unknown how natural, highly variable environments affect the rate and spectrum of such changes. Here we present detailed methylome analyses in a geographically dispersed A. thaliana population that constitutes a collection of near-isogenic lines, diverged for at least a century from a common ancestor. Methylome variation largely reflected genetic distance, and was in many aspects similar to that of lines raised in uniform conditions. Thus, even when plants are grown in varying and diverse natural sites, genome-wide epigenetic variation accumulates mostly in a clock-like manner, and epigenetic divergence thus parallels the pattern of genome-wide DNA sequence divergence.
It continues to be hotly debated to what extent environmentally induced epigenetic change is stably inherited and thereby contributes to short-term adaptation. It has been shown before that natural Arabidopsis thaliana lines differ substantially in their methylation profiles. How much of this is independent of genetic changes remains, however, unclear, especially given that there is very little conservation of methylation between species, simply because the methylated sequences themselves, mostly repeats, are not conserved over millions of years. On the other hand, there is no doubt that artificially induced epialleles can contribute to phenotypic variation. To investigate whether epigenetic differentiation, at least in the short term, proceeds very differently from genetic variation, and whether genome-wide epigenetic fingerprints can be used to uncover local adaptation, we have taken advantage of a near-clonal North American A. thaliana population that has diverged under natural conditions for at least a century. We found that both patterns and rates of methylome variation were in many aspects similar to those of lines grown in stable environments, which suggests that environment-induced changes are only minor contributors to durable genome-wide heritable epigenetic variation.
|
42
|
Abstract
Deep transcriptome sequencing (RNA-Seq) has become a vital tool for studying the state of cells in the context of varying environments, genotypes and other factors. RNA-Seq profiling data enable identification of novel isoforms, quantification of known isoforms and detection of changes in transcriptional or RNA-processing activity. Existing approaches to detect differential isoform abundance between samples either require a complete isoform annotation or fall short in providing statistically robust and calibrated significance estimates. Here, we propose a suite of statistical tests to address these open needs: a parametric test that uses known isoform annotations to detect changes in relative isoform abundance and a non-parametric test that detects differential read coverages and can be applied when isoform annotations are not available. Both methods account for the discrete nature of read counts and the inherent biological variability. We demonstrate that these tests compare favorably to previous methods, both in terms of accuracy and statistical calibrations. We use these techniques to analyze RNA-Seq libraries from Arabidopsis thaliana and Drosophila melanogaster. The identified differential RNA processing events were consistent with RT–qPCR measurements and previous studies. The proposed toolkit is available from http://bioweb.me/rdiff and enables in-depth analyses of transcriptomes, with or without available isoform annotation.
|
43
|
Detecting regulatory gene-environment interactions with unmeasured environmental factors. Bioinformatics 2013; 29:1382-9. [PMID: 23559640 DOI: 10.1093/bioinformatics/btt148] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
MOTIVATION Genomic studies have revealed a substantial heritable component of the transcriptional state of the cell. To fully understand the genetic regulation of gene expression variability, it is important to study the effect of genotype in the context of external factors such as alternative environmental conditions. In model systems, explicit environmental perturbations have been considered for this purpose, allowing direct tests for environment-specific genetic effects. However, such experiments are limited to species that can be profiled in controlled environments, hampering their use in important systems such as humans. Moreover, even in seemingly tightly regulated experimental conditions, subtle environmental perturbations cannot be ruled out, and hence unknown environmental influences are frequent. Here, we propose a model-based approach to simultaneously infer unmeasured environmental factors from gene expression profiles and use them in genetic analyses, identifying environment-specific associations between polymorphic loci and individual gene expression traits. RESULTS In extensive simulation studies, we show that our method is able to accurately reconstruct environmental factors and their interactions with genotype in a variety of settings. We further illustrate the use of our model in a real-world dataset in which one environmental factor has been explicitly experimentally controlled. Our method is able to accurately reconstruct the true underlying environmental factor even if it is not given as an input, allowing the detection of genuine genotype-environment interactions. In addition to the known environmental factor, we find unmeasured factors involved in novel genotype-environment interactions. Our results suggest that interactions with both known and unknown environmental factors significantly contribute to gene expression variability. AVAILABILITY AND IMPLEMENTATION Software available at http://pmbio.github.io/envGPLVM/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
44
|
Accurate indel prediction using paired-end short reads. BMC Genomics 2013; 14:132. [PMID: 23442375 PMCID: PMC3614465 DOI: 10.1186/1471-2164-14-132] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2012] [Accepted: 02/06/2013] [Indexed: 11/12/2022] Open
Abstract
Background: One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives. Results: Here, we present a machine learning based method which is able to discover and distinguish true from false indel candidates in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project (http://www.1001genomes.org) in Arabidopsis thaliana. Conclusion: In this work we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that missing classification may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/.
|
45
|
A Lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics 2012; 29:206-14. [PMID: 23175758 DOI: 10.1093/bioinformatics/bts669] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Exploring the genetic basis of heritable traits remains one of the central challenges in biomedical research. In traits with simple Mendelian architectures, single polymorphic loci explain a significant fraction of the phenotypic variability. However, many traits of interest seem to be subject to multifactorial control by groups of genetic loci. Accurate detection of such multivariate associations is non-trivial and often compromised by limited statistical power. At the same time, confounding influences, such as population structure, cause spurious association signals that result in false-positive findings. RESULTS We propose LMM-Lasso, a linear mixed model that allows for both multi-locus mapping and correction for confounding effects. Our approach is simple and free of tuning parameters; it effectively controls for population structure and scales to genome-wide datasets. LMM-Lasso simultaneously discovers likely causal variants and allows for multi-marker-based phenotype prediction from genotype. We demonstrate the practical use of LMM-Lasso in genome-wide association studies in Arabidopsis thaliana and linkage mapping in mouse, where our method achieves significantly more accurate phenotype prediction for 91% of the considered phenotypes. At the same time, our model dissects the phenotypic variability into components that result from individual single nucleotide polymorphism effects and population structure. Enrichment of known candidate genes suggests that the individual associations retrieved by LMM-Lasso are likely to be genuine. AVAILABILITY Code available under http://webdav.tuebingen.mpg.de/u/karsten/Forschung/research.html. CONTACT rakitsch@tuebingen.mpg.de, ippert@microsoft.com or stegle@ebi.ac.uk SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
46
|
Arabidopsis defense against Botrytis cinerea: chronology and regulation deciphered by high-resolution temporal transcriptomic analysis. THE PLANT CELL 2012; 24:3530-57. [PMID: 23023172 PMCID: PMC3480286 DOI: 10.1105/tpc.112.102046] [Citation(s) in RCA: 232] [Impact Index Per Article: 19.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/29/2012] [Revised: 08/14/2012] [Accepted: 09/07/2012] [Indexed: 05/18/2023]
Abstract
Transcriptional reprogramming forms a major part of a plant's response to pathogen infection. Many individual components and pathways operating during plant defense have been identified, but our knowledge of how these different components interact is still rudimentary. We generated a high-resolution time series of gene expression profiles from a single Arabidopsis thaliana leaf during infection by the necrotrophic fungal pathogen Botrytis cinerea. Approximately one-third of the Arabidopsis genome is differentially expressed during the first 48 h after infection, with the majority of changes in gene expression occurring before significant lesion development. We used computational tools to obtain a detailed chronology of the defense response against B. cinerea, highlighting the times at which signaling and metabolic processes change, and identify transcription factor families operating at different times after infection. Motif enrichment and network inference predicted regulatory interactions, and testing of one such prediction identified a role for TGA3 in defense against necrotrophic pathogens. These data provide an unprecedented level of detail about transcriptional changes during a defense response and are suited to systems biology analyses to generate predictive models of the gene regulatory networks mediating the Arabidopsis response to B. cinerea.
|
47
|
|
48
|
ccSVM: correcting Support Vector Machines for confounding factors in biological data classification. Bioinformatics 2011; 27:i342-8. [PMID: 21685091 PMCID: PMC3117385 DOI: 10.1093/bioinformatics/btr204] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Motivation: Classifying biological data into different groups is a central task of bioinformatics: for instance, to predict the function of a gene or protein, the disease state of a patient or the phenotype of an individual based on its genotype. Support Vector Machines are a widespread approach for classifying biological data, due to their high accuracy, their ability to deal with structured data such as strings, and the ease of integrating various types of data. However, it is unclear how to correct for confounding factors such as population structure, age, gender or experimental conditions in Support Vector Machine classification. Results: In this article, we present a Support Vector Machine classifier that can correct the prediction for observed confounding factors. This is achieved by minimizing the statistical dependence between the classifier and the confounding factors. We prove that this formulation can be transformed into a standard Support Vector Machine with rescaled input data. In our experiments, our confounder correcting SVM (ccSVM) improves tumor diagnosis based on samples from different labs, tuberculosis diagnosis in patients of varying age, ethnicity and gender, and phenotype prediction in the presence of population structure and outperforms state-of-the-art methods in terms of prediction accuracy. Availability: A ccSVM-implementation in MATLAB is available from http://webdav.tuebingen.mpg.de/u/karsten/Forschung/ISMB11_ccSVM/. Contact: limin.li@tuebingen.mpg.de; karsten.borgwardt@tuebingen.mpg.de
|
49
|
Abstract
MOTIVATION In recent years, numerous genome-wide association studies have been conducted to identify the genetic makeup that explains phenotypic differences observed in human populations. Analytical tests on single loci are readily available and embedded in common genome analysis software toolsets. The search for significant epistasis (gene-gene interactions) still poses a computational challenge for modern day computing systems, due to the large number of hypotheses that have to be tested. RESULTS In this article, we present an approach to epistasis detection by exhaustive testing of all possible SNP pairs. The search strategy based on the Hilbert-Schmidt Independence Criterion can help delineate various forms of statistical dependence between the genetic markers and the phenotype. The actual implementation of this search is done on the highly parallelized architecture available on graphics processing units, rendering the completion of the full search feasible within a day. AVAILABILITY The program is available at http://www.mpipsykl.mpg.de/epigpuhsic/. CONTACT tony@mpipsykl.mpg.de.
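The exhaustive pairwise search described in this abstract can be sketched on the CPU with the standard biased empirical estimator of the Hilbert-Schmidt Independence Criterion, trace(KHLH)/(n-1)^2. This is an illustrative sketch only: the delta kernels on joint genotypes and on the phenotype, the function names and the toy data are assumptions, and the GPU parallelization of the published program is not reproduced.

```python
from itertools import combinations


def delta_kernel(values):
    """Gram matrix that is 1 where two samples share the same value, else 0."""
    return [[1.0 if a == b else 0.0 for b in values] for a in values]


def centered(gram):
    """Double-center a symmetric Gram matrix (H K H with H = I - 11^T / n)."""
    n = len(gram)
    row_mean = [sum(row) / n for row in gram]
    grand_mean = sum(row_mean) / n
    # The matrix is symmetric, so column means equal row means.
    return [[gram[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def hsic(x, y):
    """Biased empirical HSIC between two samples, using delta kernels on both."""
    n = len(x)
    kc = centered(delta_kernel(x))
    lc = centered(delta_kernel(y))
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


def scan_pairs(snps, phenotype):
    """Score every SNP pair by HSIC between the joint genotype and the phenotype."""
    scores = {}
    for i, j in combinations(range(len(snps)), 2):
        joint = list(zip(snps[i], snps[j]))
        scores[(i, j)] = hsic(joint, phenotype)
    return scores


# Toy data: the phenotype is the XOR of SNP 0 and SNP 1; SNP 2 is unrelated.
snps = [
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]
phenotype = [0, 1, 1, 0, 0, 1, 1, 0]

pair_scores = scan_pairs(snps, phenotype)
best_pair = max(pair_scores, key=pair_scores.get)
print(best_pair, pair_scores[best_pair])
```

In this toy example each SNP is marginally independent of the phenotype, so a single-locus test sees nothing, while the joint genotype of SNPs 0 and 1 carries the full signal. The number of pairs grows quadratically with the number of SNPs, which is the computational burden the abstract refers to.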
|
50
|
The genome of the simian and human malaria parasite Plasmodium knowlesi. Nature 2008; 455:799-803. [PMID: 18843368 PMCID: PMC2656934 DOI: 10.1038/nature07306] [Citation(s) in RCA: 309] [Impact Index Per Article: 19.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2008] [Accepted: 07/30/2008] [Indexed: 11/08/2022]
Abstract
Plasmodium knowlesi is an intracellular malaria parasite whose natural vertebrate host is Macaca fascicularis (the 'kra' monkey); however, it is now increasingly recognized as a significant cause of human malaria, particularly in southeast Asia. Plasmodium knowlesi was the first malaria parasite species in which antigenic variation was demonstrated, and it has a close phylogenetic relationship to Plasmodium vivax, the second most important species of human malaria parasite (reviewed in ref. 4). Despite their relatedness, there are important phenotypic differences between them, such as host blood cell preference, absence of a dormant liver stage or 'hypnozoite' in P. knowlesi, and length of the asexual cycle (reviewed in ref. 4). Here we present an analysis of the P. knowlesi (H strain, Pk1(A+) clone) nuclear genome sequence. This is the first monkey malaria parasite genome to be described, and it provides an opportunity for comparison with the recently completed P. vivax genome and other sequenced Plasmodium genomes. In contrast to other Plasmodium genomes, putative variant antigen families are dispersed throughout the genome and are associated with intrachromosomal telomere repeats. One of these families, the KIRs, contains sequences that collectively match over one-half of the host CD99 extracellular domain, which may represent an unusual form of molecular mimicry.
|