1
|
A transcriptomic based deconvolution framework for assessing differentiation stages and drug responses of AML. NPJ Precis Oncol 2024; 8:105. [PMID: 38762545 PMCID: PMC11102519 DOI: 10.1038/s41698-024-00596-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 05/03/2024] [Indexed: 05/20/2024] Open
Abstract
The diagnostic spectrum for AML patients is increasingly based on genetic abnormalities due to their prognostic and predictive value. However, information on the maturational arrest of the AML blast phenotype has started to regain importance due to its predictive power for drug responses. Here, we deconvolute 1350 bulk RNA-seq samples from five independent AML cohorts on a single-cell healthy BM reference and demonstrate that the morphological differentiation stages (FAB) could be faithfully reconstituted using estimated cell compositions (ECCs). Moreover, we show that the ECCs reliably predict ex-vivo drug resistances, as demonstrated for resistance to Venetoclax, a BCL-2 inhibitor, specifically in AML with a CD14+ monocyte phenotype. We validate these predictions using LUMC proteomics data by showing that BCL-2 protein abundance is split into two distinct clusters for NPM1-mutated AML at the extremes of CD14+ monocyte percentages, which could be crucial for Venetoclax dosing in patients. Our results suggest that Venetoclax resistance predictions can also be extended to AML without recurrent genetic abnormalities and possibly to MDS-related and secondary AML. Lastly, we show that Ven/Aza-treated patients with CD14+ monocyte-dominated AML have significantly lower overall survival. Collectively, we propose a framework for joint mutation and maturation stage modeling that could be used as a blueprint for testing sensitivity to new agents across the various subtypes of AML.
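The idea of estimating cell compositions from a bulk profile can be illustrated with a minimal sketch. The abstract does not name the deconvolution algorithm, so the code below is only a generic stand-in: it assumes the bulk sample is a non-negative mixture of reference cell-type profiles and solves the resulting non-negative least-squares problem with multiplicative updates. The function name `estimate_cell_composition`, the toy `signature` matrix, and the mixing fractions are all hypothetical.

```python
def estimate_cell_composition(signature, bulk, n_iter=500):
    """Estimate non-negative cell-type fractions for one bulk sample.

    signature: genes x cell_types list of lists (non-negative reference profiles)
    bulk:      list of non-negative expression values, one per gene
    Generic non-negative least squares via multiplicative updates;
    an illustrative stand-in, not the method used in the paper.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # Precompute W^T b and W^T W once, since they do not change between iterations.
    wtb = [sum(signature[g][k] * bulk[g] for g in range(n_genes))
           for k in range(n_types)]
    wtw = [[sum(signature[g][i] * signature[g][j] for g in range(n_genes))
            for j in range(n_types)] for i in range(n_types)]
    # Start from a uniform composition; updates keep every entry non-negative.
    h = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        denom = [sum(wtw[i][j] * h[j] for j in range(n_types))
                 for i in range(n_types)]
        h = [h[i] * wtb[i] / denom[i] if denom[i] > 0 else 0.0
             for i in range(n_types)]
    # Rescale so the estimates read as fractions that sum to one.
    total = sum(h)
    return [x / total for x in h] if total > 0 else h


# Toy reference: 3 genes x 2 cell types (hypothetical values).
signature = [[10.0, 1.0], [1.0, 8.0], [5.0, 5.0]]
true_fractions = [0.7, 0.3]
# Simulate a noise-free bulk sample as a weighted sum of the two profiles.
bulk = [sum(w * f for w, f in zip(row, true_fractions)) for row in signature]
estimated = estimate_cell_composition(signature, bulk)
```

On this noise-free toy mixture the estimate should approach the simulated fractions; real bulk RNA-seq data would additionally need normalisation and marker-gene selection, which this sketch leaves out.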
|
2
|
Mapping AML heterogeneity - multi-cohort transcriptomic analysis identifies novel clusters and divergent ex-vivo drug responses. Leukemia 2024; 38:751-761. [PMID: 38360865 DOI: 10.1038/s41375-024-02137-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 12/28/2023] [Accepted: 01/04/2024] [Indexed: 02/17/2024]
Abstract
Subtyping of acute myeloid leukaemia (AML) is predominantly based on recurrent genetic abnormalities, but recent literature indicates that transcriptomic phenotyping holds immense potential to further refine AML classification. Here we integrated five AML transcriptomic datasets with corresponding genetic information to provide an overview (n = 1224) of the transcriptomic AML landscape. Consensus clustering identified 17 robust patient clusters which improved identification of CEBPA-mutated patients with favourable outcomes, and uncovered transcriptomic subtypes for KMT2A rearrangements (2), NPM1 mutations (5), and AML with myelodysplasia-related changes (AML-MRC) (5). Transcriptomic subtypes of KMT2A, NPM1 and AML-MRC showed distinct mutational profiles, cell type differentiation arrests and immune properties, suggesting differences in underlying disease biology. Moreover, our transcriptomic clusters show differences in ex-vivo drug responses, even when corrected for differentiation arrest, and capture differences in drug response better than genetic classification. In conclusion, our findings underscore the importance of transcriptomics in AML subtyping and offer a basis for future research and personalised treatment strategies. Our transcriptomic compendium is publicly available and we supply an R package to project clusters to new transcriptomic studies.
|
3
|
Predicting cell population-specific gene expression from genomic sequence. FRONTIERS IN BIOINFORMATICS 2024; 4:1347276. [PMID: 38501113 PMCID: PMC10944912 DOI: 10.3389/fbinf.2024.1347276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Accepted: 01/23/2024] [Indexed: 03/20/2024] Open
Abstract
Most regulatory elements, especially enhancer sequences, are cell population-specific. One could even argue that a distinct set of regulatory elements is what defines a cell population. However, discovering which non-coding regions of the DNA are essential in which context, and as a result, which genes are expressed, is a difficult task. Some computational models tackle this problem by predicting gene expression directly from the genomic sequence. These models are currently limited to predicting bulk measurements and mainly make tissue-specific predictions. Here, we present a model that leverages single-cell RNA-sequencing data to predict gene expression. We show that cell population-specific models outperform tissue-specific models, especially when the expression profile of a cell population and the corresponding tissue are dissimilar. Further, we show that our model can prioritize GWAS variants and learn motifs of transcription factor binding sites. We envision that our model can be useful for delineating cell population-specific regulatory elements.
|
4
|
Evaluating the effectiveness of pre-operative diagnosis of ovarian cancer using minimally invasive liquid biopsies by combining serum human epididymis protein 4 and cell-free DNA in patients with an ovarian mass. Int J Gynecol Cancer 2024:ijgc-2023-005073. [PMID: 38388177 DOI: 10.1136/ijgc-2023-005073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/24/2024] Open
Abstract
OBJECTIVE To assess the feasibility of scalable, objective, and minimally invasive liquid biopsy-derived biomarkers such as cell-free DNA copy number profiles, human epididymis protein 4 (HE4), and cancer antigen 125 (CA125) for pre-operative risk assessment of early-stage ovarian cancer in a clinically representative and diagnostically challenging population and to compare the performance of these biomarkers with the Risk of Malignancy Index (RMI). METHODS In this case-control study, we included 100 patients with an ovarian mass clinically suspected to be early-stage ovarian cancer. Of these 100 patients, 50 were confirmed to have a malignant mass (cases) and 50 had a benign mass (controls). Using WisecondorX, an algorithm used extensively in non-invasive prenatal testing, we calculated the benign-calibrated copy number profile abnormality score. This score represents how different a sample is from benign controls based on copy number profiles. We combined this score with HE4 serum concentration to separate cases and controls. RESULTS Combining the benign-calibrated copy number profile abnormality score with HE4, we obtained a model with a significantly higher sensitivity (42% vs 0%; p<0.002) at 99% specificity as compared with the RMI that is currently employed in clinical practice. Investigating performance in subgroups, we observed especially large differences in the advanced stage and non-high-grade serous ovarian cancer groups. CONCLUSION This study demonstrates that cell-free DNA can be successfully employed to perform pre-operative risk of malignancy assessment for ovarian masses; however, results warrant validation in a more extensive clinical study.
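Reporting sensitivity at a fixed specificity, as in the results above, amounts to choosing a score threshold from the control group and then counting how many cases exceed it. The sketch below shows that calculation on made-up scores; the function name, the toy values, and the convention that a score strictly above the threshold counts as a positive call are assumptions for illustration, not details taken from the study.

```python
import math


def sensitivity_at_specificity(case_scores, control_scores, target_specificity):
    """Return (threshold, achieved specificity, sensitivity).

    The threshold is the smallest control score that leaves at least
    `target_specificity` of the controls at or below it; a sample is
    called positive when its score is strictly above the threshold.
    """
    controls = sorted(control_scores)
    n_controls = len(controls)
    # Number of controls that must be classified as negative.
    k = math.ceil(target_specificity * n_controls)
    threshold = controls[k - 1] if k > 0 else float("-inf")
    specificity = sum(s <= threshold for s in controls) / n_controls
    sensitivity = sum(s > threshold for s in case_scores) / len(case_scores)
    return threshold, specificity, sensitivity


# Hypothetical risk scores for benign controls and malignant cases.
control_scores = [0.1, 0.2, 0.3, 0.9]
case_scores = [0.25, 0.28, 0.7, 0.95]
result = sensitivity_at_specificity(case_scores, control_scores, 0.75)
```

With only a handful of controls the achievable specificity moves in coarse steps, which is one reason a 99% specificity operating point needs a sufficiently large control group to be estimated reliably.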
|
5
|
Machine learning-based biomarker profile derived from 4210 serially measured proteins predicts clinical outcome of patients with heart failure. EUROPEAN HEART JOURNAL. DIGITAL HEALTH 2023; 4:444-454. [PMID: 38045440 PMCID: PMC10689916 DOI: 10.1093/ehjdh/ztad056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 09/06/2023] [Accepted: 10/03/2023] [Indexed: 12/05/2023]
Abstract
Aims Risk assessment tools are needed for timely identification of patients with heart failure (HF) with reduced ejection fraction (HFrEF) who are at high risk of adverse events. In this study, we aim to derive a small set out of 4210 repeatedly measured proteins, which, along with clinical characteristics and established biomarkers, carry optimal prognostic capacity for adverse events, in patients with HFrEF. Methods and results In 382 patients, we performed repeated blood sampling (median follow-up: 2.1 years) and applied an aptamer-based multiplex proteomic approach. We used machine learning to select the optimal set of predictors for the primary endpoint (PEP: composite of cardiovascular death, heart transplantation, left ventricular assist device implantation, and HF hospitalization). The association between repeated measures of selected proteins and PEP was investigated by multivariable joint models. Internal validation (cross-validated c-index) and external validation (Henry Ford HF PharmacoGenomic Registry cohort) were performed. Nine proteins were selected in addition to the MAGGIC risk score, N-terminal pro-hormone B-type natriuretic peptide, and troponin T: suppression of tumourigenicity 2, tryptophanyl-tRNA synthetase cytoplasmic, histone H2A Type 3, angiotensinogen, deltex-1, thrombospondin-4, ADAMTS-like protein 2, anthrax toxin receptor 1, and cathepsin D. N-terminal pro-hormone B-type natriuretic peptide and angiotensinogen showed the strongest associations [hazard ratio (95% confidence interval): 1.96 (1.17-3.40) and 0.66 (0.49-0.88), respectively]. The multivariable model yielded a c-index of 0.85 upon internal validation and c-indices up to 0.80 upon external validation. The c-index was higher than that of a model containing established risk factors (P = 0.021). 
Conclusion Nine serially measured proteins captured the most essential prognostic information for the occurrence of adverse events in patients with HFrEF, and provided incremental value for HF prognostication beyond established risk factors. These proteins could be used for dynamic, individual risk assessment in a prospective setting. These findings also illustrate the potential value of relatively 'novel' biomarkers for prognostication. Clinical Trial Registration https://clinicaltrials.gov/ct2/show/NCT01851538?term=nCT01851538&draw=2&rank=1
|
6
|
Technical Report: A Comprehensive Comparison between Different Quantification Versions of Nightingale Health's 1H-NMR Metabolomics Platform. Metabolites 2023; 13:1181. [PMID: 38132863 PMCID: PMC10745109 DOI: 10.3390/metabo13121181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 11/07/2023] [Accepted: 11/17/2023] [Indexed: 12/23/2023] Open
Abstract
1H-NMR metabolomics data is increasingly used to track health and disease. Nightingale Health, a major supplier of 1H-NMR metabolomics, has recently updated the quantification strategy to further align with clinical standards. Such updates, however, might influence backward replicability, particularly affecting studies with repeated measures. Using data from the BBMRI-NL consortium (~28,000 samples from 28 cohorts), we compared Nightingale data, originally released in 2014 and 2016, with a re-quantified version released in 2020, both of which were based on the same NMR spectra. Apart from two discontinued and twenty-three new analytes, we generally observe a high concordance between quantification versions with 73 out of 222 (33%) analytes showing a mean ρ > 0.9 across all cohorts. Conversely, five analytes consistently showed lower Spearman's correlations (ρ < 0.7) between versions, namely acetoacetate, LDL-L, saturated fatty acids, S-HDL-C, and sphingomyelins. Furthermore, previously trained multi-analyte scores, such as MetaboAge or MetaboHealth, might be particularly sensitive to platform changes. Whereas MetaboHealth replicated well, the MetaboAge score had to be retrained due to use of discontinued analytes. Notably, both scores in the re-quantified data recapitulated mortality associations observed previously. In conclusion, we urge caution when combining different platform versions, to avoid mixing analytes that differ in units or have been discontinued.
|
7
|
Causal inference using observational intensive care unit data: a scoping review and recommendations for future practice. NPJ Digit Med 2023; 6:221. [PMID: 38012221 PMCID: PMC10682453 DOI: 10.1038/s41746-023-00961-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 11/05/2023] [Indexed: 11/29/2023] Open
Abstract
This scoping review focuses on the essential role of models for causal inference in shaping actionable artificial intelligence (AI) designed to aid clinicians in decision-making. The objective was to identify and evaluate the reporting quality of studies introducing models for causal inference in intensive care units (ICUs), and to provide recommendations to improve the future landscape of research practices in this domain. To achieve this, we searched various databases including Embase, MEDLINE ALL, Web of Science Core Collection, Google Scholar, medRxiv, bioRxiv, arXiv, and the ACM Digital Library. Studies involving models for causal inference addressing time-varying treatments in the adult ICU were reviewed. Data extraction encompassed the study settings and methodologies applied. Furthermore, we assessed reporting quality of target trial components (i.e., eligibility criteria, treatment strategies, follow-up period, outcome, and analysis plan) and main causal assumptions (i.e., conditional exchangeability, positivity, and consistency). Among the 2184 titles screened, 79 studies met the inclusion criteria. The methodologies used were G methods (61%) and reinforcement learning methods (39%). Studies considered both static (51%) and dynamic treatment regimes (49%). Only 30 (38%) of the studies reported all five target trial components, and only seven (9%) studies mentioned all three causal assumptions. To achieve actionable AI in the ICU, we advocate careful consideration of the causal question of interest, describing this research question as a target trial emulation, usage of appropriate causal inference methods, and acknowledgement of the causal assumptions (and examination of potential violations of them).
|
8
|
The correlation between neuropathology levels and cognitive performance in centenarians. Alzheimers Dement 2023; 19:5036-5047. [PMID: 37092333 DOI: 10.1002/alz.13087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2023] [Revised: 03/15/2023] [Accepted: 03/20/2023] [Indexed: 04/25/2023]
Abstract
INTRODUCTION Neuropathological substrates associated with neurodegeneration occur in brains of the oldest old. How does this affect cognitive performance? METHODS The 100-plus Study is an ongoing longitudinal cohort study of centenarians who self-report to be cognitively healthy; post mortem brain donation is optional. In 85 centenarian brains, we explored the correlations between the levels of 11 neuropathological substrates with ante mortem performance on 12 neuropsychological tests. RESULTS Levels of neuropathological substrates varied: we observed levels up to Thal-amyloid beta phase 5, Braak-neurofibrillary tangle (NFT) stage V, Consortium to Establish a Registry for Alzheimer's Disease (CERAD)-neuritic plaque score 3, Thal-cerebral amyloid angiopathy stage 3, Tar-DNA binding protein 43 (TDP-43) stage 3, hippocampal sclerosis stage 1, Braak-Lewy bodies stage 6, atherosclerosis stage 3, cerebral infarcts stage 1, and cerebral atrophy stage 2. Granulovacuolar degeneration occurred in all centenarians. Some high performers had the highest neuropathology scores. DISCUSSION Only Braak-NFT stage and limbic-predominant age-related TDP-43 encephalopathy (LATE) pathology associated significantly with performance across multiple cognitive domains. Of all cognitive tests, the clock-drawing test was particularly sensitive to levels of multiple neuropathologies.
|
9
|
Determining epitope specificity of T-cell receptors with transformers. Bioinformatics 2023; 39:btad632. [PMID: 37847663 PMCID: PMC10636277 DOI: 10.1093/bioinformatics/btad632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2023] [Revised: 09/09/2023] [Accepted: 10/16/2023] [Indexed: 10/19/2023] Open
Abstract
SUMMARY T-cell receptors (TCRs) on T cells recognize and bind to epitopes presented by the major histocompatibility complex in case of an infection or cancer. However, the high diversity of TCRs, as well as their unique and complex binding mechanisms underlying epitope recognition, make it difficult to predict the binding between TCRs and epitopes. Here, we present the utility of transformers, a deep learning strategy that incorporates an attention mechanism that learns the informative features, and show that these models pre-trained on a large set of protein sequences outperform current strategies. We compared three pre-trained auto-encoder transformer models (ProtBERT, ProtAlbert, and ProtElectra) and one pre-trained auto-regressive transformer model (ProtXLNet) to predict the binding specificity of TCRs to 25 epitopes from the VDJdb database (human and murine). Two additional modifications were performed to incorporate gene usage of the TCRs in the four transformer models. Of all 12 transformer implementations (four models with three different modifications), a modified version of the ProtXLNet model could predict TCR-epitope pairs with the highest accuracy (weighted F1 score 0.55 simultaneously considering all 25 epitopes). The modification included additional features representing the gene names for the TCRs. We also showed that the basic implementation of transformers outperformed the previously available methods, i.e. TCRGP, TCRdist, and DeepTCR, developed for the same biological problem, especially for the hard-to-classify labels. We show that the proficiency of transformers in attention learning can be made operational in a complex biological setting like TCR binding prediction. Further ingenuity in utilizing the full potential of transformers, either through attention head visualization or introducing additional features, can extend T-cell research avenues. 
AVAILABILITY AND IMPLEMENTATION Data and code are available on https://github.com/InduKhatri/tcrformer.
|
10
|
Epigenetic and Metabolomic Biomarkers for Biological Age: A Comparative Analysis of Mortality and Frailty Risk. J Gerontol A Biol Sci Med Sci 2023; 78:1753-1762. [PMID: 37303208 PMCID: PMC10562890 DOI: 10.1093/gerona/glad137] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 12/22/2022] [Indexed: 06/13/2023] Open
Abstract
Biological age captures a person's age-related risk of unfavorable outcomes using biophysiological information. Multivariate biological age measures include frailty scores and molecular biomarkers. These measures are often studied in isolation, but here we present a large-scale study comparing them. In 2 prospective cohorts (n = 3 222), we compared how well epigenetic (DNAm Horvath, DNAm Hannum, DNAm Lin, DNAm epiTOC, DNAm PhenoAge, DNAm DunedinPoAm, DNAm GrimAge, and DNAm Zhang) and metabolomic-based (MetaboAge and MetaboHealth) biomarkers reflect biological age, as represented by 5 frailty measures and overall mortality. Biomarkers trained on outcomes with biophysiological and/or mortality information outperformed age-trained biomarkers in frailty reflection and mortality prediction. DNAm GrimAge and MetaboHealth, trained on mortality, showed the strongest association with these outcomes. The associations of DNAm GrimAge and MetaboHealth with frailty and mortality were independent of each other and of the frailty score mimicking clinical geriatric assessment. Epigenetic, metabolomic, and clinical biological age markers seem to capture different aspects of aging. These findings suggest that mortality-trained molecular markers may provide a novel phenotype reflecting biological age and strengthen current clinical geriatric health and well-being assessment.
|
11
|
Benchmarking variational AutoEncoders on cancer transcriptomics data. PLoS One 2023; 18:e0292126. [PMID: 37796856 PMCID: PMC10553230 DOI: 10.1371/journal.pone.0292126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2023] [Accepted: 09/13/2023] [Indexed: 10/07/2023] Open
Abstract
Deep generative models, such as variational autoencoders (VAE), have gained increasing attention in computational biology due to their ability to capture complex data manifolds which subsequently can be used to achieve better performance in downstream tasks, such as cancer type prediction or subtyping of cancer. However, these models are difficult to train due to the large number of hyperparameters that need to be tuned. To get a better understanding of the importance of the different hyperparameters, we examined six different VAE models when trained on TCGA transcriptomics data and evaluated on the downstream tasks of cluster agreement with cancer subtypes and survival analysis. We studied the effect of the latent space dimensionality, learning rate, optimizer, initialization and activation function on the quality of subsequent downstream tasks on the TCGA samples. We found β-TCVAE and DIP-VAE to have a good performance, on average, despite being more sensitive to hyperparameters selection. Based on these experiments, we derived recommendations for selecting the different hyperparameters settings. To ensure generalization, we tested all hyperparameter configurations on the GTEx dataset. We found a significant correlation (ρ = 0.7) between the hyperparameter effects on clustering performance in the TCGA and GTEx datasets. This highlights the robustness and generalizability of our recommendations. In addition, we examined whether the learned latent spaces capture biologically relevant information. Hereto, we measured the correlation and mutual information of the different representations with various data characteristics such as gender, age, days to metastasis, immune infiltration, and mutation signatures. We found that for all models the latent factors, in general, do not uniquely correlate with one of the data characteristics nor capture separable information in the latent factors even for models specifically designed for disentanglement.
|
12
|
Cell type matching across species using protein embeddings and transfer learning. Bioinformatics 2023; 39:i404-i412. [PMID: 37387141 DOI: 10.1093/bioinformatics/btad248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Knowing the relation between cell types is crucial for translating experimental results from mice to humans. Establishing cell type matches, however, is hindered by the biological differences between the species. A substantial amount of evolutionary information between genes that could be used to align the species is discarded by most of the current methods since they only use one-to-one orthologous genes. Some methods try to retain the information by explicitly including the relation between genes, however, not without caveats. RESULTS In this work, we present a model to transfer and align cell types in cross-species analysis (TACTiCS). First, TACTiCS uses a natural language processing model to match genes using their protein sequences. Next, TACTiCS employs a neural network to classify cell types within a species. Afterward, TACTiCS uses transfer learning to propagate cell type labels between species. We applied TACTiCS on scRNA-seq data of the primary motor cortex of human, mouse, and marmoset. Our model can accurately match and align cell types on these datasets. Moreover, our model outperforms Seurat and the state-of-the-art method SAMap. Finally, we show that our gene matching method results in better cell type matches than BLAST in our model. AVAILABILITY AND IMPLEMENTATION The implementation is available on GitHub (https://github.com/kbiharie/TACTiCS). The preprocessed datasets and trained models can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.7582460).
|
13
|
Machine learning-based somatic variant calling in cell-free DNA of metastatic breast cancer patients using large NGS panels. Sci Rep 2023; 13:10424. [PMID: 37369746 DOI: 10.1038/s41598-023-37409-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Accepted: 06/21/2023] [Indexed: 06/29/2023] Open
Abstract
Next generation sequencing of cell-free DNA (cfDNA) is a promising method for treatment monitoring and therapy selection in metastatic breast cancer (MBC). However, distinguishing tumor-specific variants from sequencing artefacts and germline variation with low false discovery rate is challenging when using large targeted sequencing panels covering many tumor suppressor genes. To address this, we built a machine learning model to remove false positive variant calls and augmented it with additional filters to ensure selection of tumor-derived variants. We used cfDNA of 70 MBC patients profiled with both the small targeted Oncomine breast panel (Thermofisher) and the much larger Qiaseq Human Breast Cancer Panel (Qiagen). The model was trained on the panels' common regions using Oncomine hotspot mutations as ground truth. Applied to Qiaseq data, it achieved 35% sensitivity and 36% precision, outperforming basic filtering. For 20 patients we used germline DNA to filter for somatic variants and obtained 245 variants in total, while our model found seven variants, of which six were also detected using the germline strategy. In ten tumor-free individuals, our method detected in total one (potentially germline) variant, in contrast to 521 variants detected without our model. These results indicate that our model largely detects somatic variants.
|
14
|
Identifying Aging and Alzheimer Disease-Associated Somatic Variations in Excitatory Neurons From the Human Frontal Cortex. Neurol Genet 2023; 9:e200066. [PMID: 37123987 PMCID: PMC10136684 DOI: 10.1212/nxg.0000000000200066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2022] [Accepted: 02/03/2023] [Indexed: 05/02/2023]
Abstract
Background and Objectives With age, somatic mutations accumulated in human brain cells can lead to various neurologic disorders and brain tumors. Because the incidence rate of Alzheimer disease (AD) increases exponentially with age, investigating the association between AD and the accumulation of somatic mutation can help understand the etiology of AD. Methods We designed a somatic mutation detection workflow by contrasting genotypes derived from whole-genome sequencing (WGS) data with genotypes derived from scRNA-seq data and applied this workflow to 76 participants from the Religious Order Study and the Rush Memory and Aging Project (ROSMAP) cohort. We focused only on excitatory neurons, the dominant cell type in the scRNA-seq data. Results We identified 196 sites that harbored at least 1 individual with an excitatory neuron-specific somatic mutation (ENSM), and these 196 sites were mapped to 127 genes. The single base substitution (SBS) pattern of the putative ENSMs was best explained by signature SBS5 from the Catalogue of Somatic Mutations in Cancer (COSMIC) mutational signatures, a clock-like pattern correlating with the age of the individual. The count of ENSMs per individual also showed an increasing trend with age. Among the mutated sites, we found 2 sites that tend to have more mutations in older individuals (16:6899517 [RBFOX1], p = 0.04; 4:21788463 [KCNIP4], p < 0.05). In addition, 2 sites were found to have a higher odds ratio to detect a somatic mutation in AD samples (6:73374221 [KCNQ5], p = 0.01 and 13:36667102 [DCLK1], p = 0.02). Thirty-two genes that harbor somatic mutations unique to AD and the KCNQ5 and DCLK1 genes were used for gene ontology (GO)-term enrichment analysis. We found the AD-specific ENSMs enriched in the GO-terms "vocalization behavior" and "intraspecies interaction between organisms." Of interest, we observed both age-specific and AD-specific ENSMs enriched in the K+ channel-associated genes.
Discussion Our results show that combining scRNA-seq and WGS data can successfully detect putative somatic mutations. The putative somatic mutations detected from ROSMAP data set have provided new insights into the association of AD and aging with brain somatic mutagenesis.
|
15
|
Single-cell RNA sequencing data reveals rewiring of transcriptional relationships in Alzheimer's Disease associated with risk variants. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.05.15.23289992. [PMID: 37292975 PMCID: PMC10246028 DOI: 10.1101/2023.05.15.23289992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Understanding how genetic risk variants contribute to Alzheimer's Disease etiology remains a challenge. Single-cell RNA sequencing (scRNAseq) allows for the investigation of cell type specific effects of genomic risk loci on gene expression. Using seven scRNAseq datasets totalling >1.3 million cells, we investigated differential correlation of genes between healthy individuals and individuals diagnosed with Alzheimer's Disease. Using the number of differential correlations of a gene to estimate its involvement and potential impact, we present a prioritization scheme for identifying probable causal genes near genomic risk loci. Besides prioritizing genes, our approach pin-points specific cell types and provides insight into the rewiring of gene-gene relationships associated with Alzheimer's.
|
16
|
Consequences and opportunities arising due to sparser single-cell RNA-seq datasets. Genome Biol 2023; 24:86. [PMID: 37085823 PMCID: PMC10120229 DOI: 10.1186/s13059-023-02933-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2022] [Accepted: 04/10/2023] [Indexed: 04/23/2023] Open
Abstract
With the number of cells measured in single-cell RNA sequencing (scRNA-seq) datasets increasing exponentially and concurrent increased sparsity due to more zero counts being measured for many genes, we demonstrate here that downstream analyses on binary-based gene expression give similar results as count-based analyses. Moreover, a binary representation scales up to ~ 50-fold more cells that can be analyzed using the same computational resources. We also highlight the possibilities provided by binarized scRNA-seq data. Development of specialized tools for bit-aware implementations of downstream analytical tasks will enable a more fine-grained resolution of biological heterogeneity.
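To make the notion of a "bit-aware" analysis concrete, here is a minimal sketch of what working on binarized expression could look like. It reduces a small count matrix to detected/not-detected, packs each gene's detection pattern across cells into a single integer, and compares genes with bitwise operations. The helper names and the toy matrix are hypothetical, and the Jaccard similarity is just one example of a downstream statistic; the abstract does not specify which analyses or data structures the authors used.

```python
def binarize(counts):
    """Map a cells x genes count matrix to 0/1 (gene detected in cell or not)."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


def pack_gene_bits(binary):
    """Pack each gene's detection pattern across cells into one Python int.

    Bit i of a gene's mask is set when the gene is detected in cell i.
    """
    n_genes = len(binary[0])
    masks = [0] * n_genes
    for cell_idx, row in enumerate(binary):
        for gene_idx, bit in enumerate(row):
            if bit:
                masks[gene_idx] |= 1 << cell_idx
    return masks


def jaccard(mask_a, mask_b):
    """Jaccard similarity of two detection patterns using bitwise AND/OR."""
    union = bin(mask_a | mask_b).count("1")
    if union == 0:
        return 0.0
    return bin(mask_a & mask_b).count("1") / union


# Toy count matrix: 4 cells (rows) x 3 genes (columns), hypothetical values.
counts = [
    [0, 3, 1],
    [2, 0, 4],
    [0, 0, 7],
    [5, 1, 0],
]
binary = binarize(counts)
masks = pack_gene_bits(binary)
similarity_g0_g2 = jaccard(masks[0], masks[2])
```

Storing one bit per cell instead of a full count is where the memory saving comes from; a production implementation would more likely use packed arrays than Python integers.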
|
17
|
Resilience and resistance to the accumulation of amyloid plaques and neurofibrillary tangles in centenarians: An age-continuous perspective. Alzheimers Dement 2022. [PMID: 36583547 DOI: 10.1002/alz.12899] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/01/2022] [Revised: 11/08/2022] [Accepted: 11/15/2022] [Indexed: 12/31/2022]
Abstract
INTRODUCTION With increasing age, neuropathological substrates associated with Alzheimer's disease (AD) accumulate in brains of cognitively healthy individuals: are they resilient, or resistant to AD-associated neuropathologies?
METHODS In 85 centenarian brains, we correlated NIA (amyloid) stages, Braak (neurofibrillary tangle) stages, and CERAD (neuritic plaque) scores with cognitive performance close to death as determined by Mini-Mental State Examination (MMSE) scores. We assessed centenarian brains against 2131 brains from AD patients, non-AD demented, and non-demented individuals in an age continuum ranging from 16 to 100+ years.
RESULTS With age, brains from non-demented individuals reached the NIA and Braak stages observed in AD patients, while CERAD scores remained lower. In centenarians, NIA stages varied (22.4% were the highest stage 3), Braak stages rarely exceeded stage IV (5.9% were V), and CERAD scores rarely exceeded 2 (4.7% were 3); within these distributions, we observed no correlation with the MMSE (NIA: P = 0.60; Braak: P = 0.08; CERAD: P = 0.16).
DISCUSSION Cognitive health can be maintained despite the accumulation of high levels of AD-related neuropathological substrates.
HIGHLIGHTS Cognitively healthy elderly have AD neuropathology levels similar to AD patients. AD neuropathology loads do not correlate with cognitive performance in centenarians. Some centenarians are resilient to the highest levels of AD neuropathology.
|
18
|
Exome sequencing identifies rare damaging variants in ATP8B4 and ABCA1 as risk factors for Alzheimer's disease. Nat Genet 2022; 54:1786-1794. [PMID: 36411364 PMCID: PMC9729101 DOI: 10.1038/s41588-022-01208-7] [Citation(s) in RCA: 41] [Impact Index Per Article: 20.5] [Received: 07/31/2021] [Accepted: 09/19/2022] [Indexed: 11/22/2022]
Abstract
Alzheimer's disease (AD), the leading cause of dementia, has an estimated heritability of approximately 70% [1]. The genetic component of AD has been mainly assessed using genome-wide association studies, which do not capture the risk contributed by rare variants [2]. Here, we compared the gene-based burden of rare damaging variants in exome sequencing data from 32,558 individuals: 16,036 AD cases and 16,522 controls. Next to variants in TREM2, SORL1 and ABCA7, we observed a significant association of rare, predicted damaging variants in ATP8B4 and ABCA1 with AD risk, and a suggestive signal in ADAM10. Additionally, the rare-variant burden in RIN3, CLU, ZCWPW1 and ACE highlighted these genes as potential drivers of respective AD-genome-wide association study loci. Variants associated with the strongest effect on AD risk, in particular loss-of-function variants, are enriched in early-onset AD cases. Our results provide additional evidence for a major role for amyloid-β precursor protein processing, amyloid-β aggregation, lipid metabolism and microglial function in AD.
|
19
|
Single-cell immune profiling reveals thymus-seeding populations, T cell commitment, and multilineage development in the human thymus. Sci Immunol 2022; 7:eade0182. [DOI: 10.1126/sciimmunol.ade0182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
T cell development in the mouse thymus has been studied extensively, but less is known regarding T cell development in the human thymus. We used a combination of single-cell techniques and functional assays to perform deep immune profiling of human T cell development, focusing on the initial stages of prelineage commitment. We identified three thymus-seeding progenitor populations that also have counterparts in the bone marrow. In addition, we found that the human thymus physiologically supports the development of monocytes, dendritic cells, and NK cells, as well as limited development of B cells. These results are an important step toward monitoring and guiding regenerative therapies in patients after hematopoietic stem cell transplantation.
|
20
|
PLIS: A metabolomic response monitor to a lifestyle intervention study in older adults. FASEB J 2022; 36:e22578. [PMID: 36183353 DOI: 10.1096/fj.202201037r] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 09/07/2022] [Accepted: 09/19/2022] [Indexed: 11/11/2022]
Abstract
The response to lifestyle intervention studies is often heterogeneous, especially in older adults. Subtle responses that may represent a health gain for individuals are not always detected by classical health variables, stressing the need for novel biomarkers that detect intermediate changes in metabolic, inflammatory, and immunity-related health. Here, our aim was to develop and validate a molecular multivariate biomarker maximally sensitive to the individual effect of a lifestyle intervention: the Personalized Lifestyle Intervention Status (PLIS). We used 1H-NMR fasting blood metabolite measurements from before and after the 13-week combined physical and nutritional Growing Old TOgether (GOTO) lifestyle intervention study in combination with a fivefold cross-validation and a bootstrapping method to train a separate PLIS score for men and women. The PLIS scores consisted of 14 and four metabolites for females and males, respectively. Performance of the PLIS score in tracking health gain was illustrated by association of the sex-specific PLIS scores with several classical metabolic health markers, such as BMI, trunk fat%, fasting HDL cholesterol, and fasting insulin, the primary outcome of the GOTO study. We also showed that the baseline PLIS score indicated which participants respond positively to the intervention. Finally, we explored PLIS in an independent physical activity lifestyle intervention study, showing similar, albeit remarkably weaker, associations of PLIS with classical metabolic health markers. To conclude, we found that the sex-specific PLIS score was able to track the individual short-term metabolic health gain of the GOTO lifestyle intervention study. The methodology used to train the PLIS score potentially provides a useful instrument to track personal responses and predict the participant's health benefit in lifestyle interventions similar to the GOTO study.
|
21
|
Development and validation of an early warning model for hospitalized COVID-19 patients: a multi-center retrospective cohort study. Intensive Care Med Exp 2022; 10:38. [PMID: 36117237 PMCID: PMC9482891 DOI: 10.1186/s40635-022-00465-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Accepted: 08/22/2022] [Indexed: 12/15/2022] Open
Abstract
Background Timely identification of deteriorating COVID-19 patients is needed to guide changes in clinical management and admission to intensive care units (ICUs). There is significant concern that widely used Early warning scores (EWSs) underestimate illness severity in COVID-19 patients and therefore, we developed an early warning model specifically for COVID-19 patients.
Methods We retrospectively collected electronic medical record data to extract predictors and used these to fit a random forest model. To simulate the situation in which the model would have been developed after the first and implemented during the second COVID-19 ‘wave’ in the Netherlands, we performed a temporal validation by splitting all included patients into groups admitted before and after August 1, 2020. Furthermore, we propose a method for dynamic model updating to retain model performance over time. We evaluated model discrimination and calibration, performed a decision curve analysis, and quantified the importance of predictors using SHapley Additive exPlanations values.
Results We included 3514 COVID-19 patient admissions from six Dutch hospitals between February 2020 and May 2021, and included a total of 18 predictors for model fitting. The model showed a higher discriminative performance in terms of partial area under the receiver operating characteristic curve (0.82 [0.80–0.84]) compared to the National early warning score (0.72 [0.69–0.74]) and the Modified early warning score (0.67 [0.65–0.69]), a greater net benefit over a range of clinically relevant model thresholds, and relatively good calibration (intercept = 0.03 [−0.09 to 0.14], slope = 0.79 [0.73–0.86]).
Conclusions This study shows the potential benefit of moving from early warning models for the general inpatient population to models for specific patient groups. Further (independent) validation of the model is needed.
Supplementary Information The online version contains supplementary material available at 10.1186/s40635-022-00465-4.
|
22
|
Metabolomic predictors of phenotypic traits can replace and complement measured clinical variables in population-scale expression profiling studies. BMC Genomics 2022; 23:546. [PMID: 35907790 PMCID: PMC9339202 DOI: 10.1186/s12864-022-08771-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2022] [Accepted: 07/12/2022] [Indexed: 11/10/2022] Open
Abstract
Population-scale expression profiling studies can provide valuable insights into biological and disease-underlying mechanisms. The availability of phenotypic traits is essential for studying clinical effects. Therefore, missing, incomplete, or inaccurate phenotypic information can make analyses challenging and prevent RNA-seq or other omics data from being reused. A possible solution is to use predictors that infer clinical or behavioral phenotypic traits from molecular data. While such predictors have been developed based on different omics data types and are being applied in various studies, metabolomics-based surrogates are less commonly used than predictors based on DNA methylation profiles.
In this study, we inferred 17 traits, including diabetes status and exposure to lipid medication, using previously trained metabolomic predictors. We evaluated whether these metabolomic surrogates can be used as an alternative to reported information for studying the respective phenotypes using expression profiling data of four population cohorts. For the majority of the 17 traits, the metabolomic surrogates performed similarly to the reported phenotypes in terms of effect sizes, number of significant associations, replication rates, and significantly enriched pathways.
The application of metabolomics-derived surrogate outcomes opens new possibilities for reuse of multi-omics data sets. In studies where availability of clinical metadata is limited, missing or incomplete information can be complemented by these surrogates, thereby increasing the size of available data sets. Additionally, the availability of such surrogates could be used to correct for potential biological confounding. In the future, it would be interesting to further investigate the use of molecular predictors across different omics types and cohorts.
|
23
|
Integration of metabolomics with genomics: Metabolic gene prioritization using metabolomics data and genomic variant (CADD) scores. Mol Genet Metab 2022; 136:199-218. [PMID: 35660124 DOI: 10.1016/j.ymgme.2022.05.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2021] [Revised: 04/06/2022] [Accepted: 05/17/2022] [Indexed: 11/30/2022]
Abstract
The integration of metabolomics data with sequencing data is a key step towards improving the diagnostic process for finding the disease-causing genetic variant(s) in patients suspected of having an inborn error of metabolism (IEM). The measured metabolite levels could provide additional phenotypical evidence to elucidate the degree of pathogenicity for variants found in genes associated with metabolic processes. We present a computational approach, called Reafect, that calculates for each reaction in a metabolic pathway a score indicating whether that reaction is deficient or not. When calculating this score, Reafect takes multiple factors into account: the magnitude and sign of alterations in the metabolite levels, the reaction distances between metabolites and reactions in the pathway, and the biochemical directionality of the reactions. We applied Reafect to untargeted metabolomics data of 72 patient samples with a known IEM and found that in 81% of the cases the correct deficient enzyme was ranked within the top 5% of all considered enzyme deficiencies. Next, we integrated Reafect with Combined Annotation Dependent Depletion (CADD) scores (a measure for gene variant deleteriousness) and ranked the metabolic genes of 27 IEM patients. We observed that this integrated approach significantly improved the prioritization of the genes containing the disease-causing variant when compared with the two approaches individually. For 15/27 IEM patients the correct affected gene was ranked within the top 0.25% of the set of potentially affected genes. Together, our findings suggest that metabolomics data improves the identification of affected genes in patients suffering from IEM.
|
24
|
MiMIR: R-shiny application to infer risk factors and endpoints from Nightingale Health's 1H-NMR Metabolomics data. Bioinformatics 2022; 38:3847-3849. [PMID: 35695757 PMCID: PMC9344846 DOI: 10.1093/bioinformatics/btac388] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/14/2022] [Revised: 06/02/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION 1H-NMR metabolomics is rapidly becoming a standard resource in large epidemiological studies to acquire metabolic profiles in large numbers of samples in a relatively low-priced and standardized manner. Concomitantly, metabolomics-based models that capture disease risk or clinical risk factors are increasingly being developed. These developments raise the need for a user-friendly toolbox to inspect new 1H-NMR metabolomics data and project a wide array of previously established risk models.
RESULTS We present MiMIR (Metabolomics-based Models for Imputing Risk), a graphical user interface that provides an intuitive framework for ad-hoc statistical analysis of Nightingale Health's 1H-NMR metabolomics data and allows for the projection and calibration of 24 pre-trained metabolomics-based models, without requiring any prior programming knowledge.
AVAILABILITY The R-shiny package is available on CRAN or downloadable at https://github.com/DanieleBizzarri/MiMIR, together with an extensive user manual (also available as Supplementary Documents to the paper).
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
25
|
A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history. J Am Med Inform Assoc 2022; 29:761-769. [PMID: 35139533 PMCID: PMC9122640 DOI: 10.1093/jamia/ocac008] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/28/2021] [Revised: 11/24/2021] [Accepted: 01/27/2022] [Indexed: 11/23/2022] Open
Abstract
OBJECTIVE To facilitate patient disease subset and risk factor identification by constructing a pipeline which is generalizable, provides easily interpretable results, and allows replication by overcoming electronic health records (EHRs) batch effects. MATERIAL AND METHODS We used 1872 billing codes in EHRs of 102 880 patients from 12 healthcare systems. Using tools borrowed from single-cell omics, we mitigated center-specific batch effects and performed clustering to identify patients with highly similar medical history patterns across the various centers. Our visualization method (PheSpec) depicts the phenotypic profile of clusters, applies a novel filtering of noninformative codes (Ranked Scope Pervasion), and indicates the most distinguishing features. RESULTS We observed 114 clinically meaningful profiles, for example, linking prostate hyperplasia with cancer and diabetes with cardiovascular problems and grouping pediatric developmental disorders. Our framework identified disease subsets, exemplified by 6 "other headache" clusters, where phenotypic profiles suggested different underlying mechanisms: migraine, convulsion, injury, eye problems, joint pain, and pituitary gland disorders. Phenotypic patterns replicated well, with high correlations of ≥0.75 to an average of 6 (2-8) of the 12 different cohorts, demonstrating the consistency with which our method discovers disease history profiles. DISCUSSION Costly clinical research ventures should be based on solid hypotheses. We repurpose methods from single-cell omics to build these hypotheses from observational EHR data, distilling useful information from complex data. CONCLUSION We establish a generalizable pipeline for the identification and replication of clinically meaningful (sub)phenotypes from widely available high-dimensional billing codes. This approach overcomes datatype problems and produces comprehensive visualizations of validation-ready phenotypes.
|
26
|
New insights into the genetic etiology of Alzheimer's disease and related dementias. Nat Genet 2022; 54:412-436. [PMID: 35379992 PMCID: PMC9005347 DOI: 10.1038/s41588-022-01024-z] [Citation(s) in RCA: 647] [Impact Index Per Article: 323.5] [Received: 06/16/2021] [Accepted: 01/27/2022] [Indexed: 02/08/2023]
Abstract
Characterization of the genetic landscape of Alzheimer's disease (AD) and related dementias (ADD) provides a unique opportunity for a better understanding of the associated pathophysiological processes. We performed a two-stage genome-wide association study totaling 111,326 clinically diagnosed/'proxy' AD cases and 677,663 controls. We found 75 risk loci, of which 42 were new at the time of analysis. Pathway enrichment analyses confirmed the involvement of amyloid/tau pathways and highlighted microglia implication. Gene prioritization in the new loci identified 31 genes that were suggestive of new genetically associated processes, including the tumor necrosis factor alpha pathway through the linear ubiquitin chain assembly complex. We also built a new genetic risk score associated with the risk of future AD/dementia or progression from mild cognitive impairment to AD/dementia. The improvement in prediction led to a 1.6- to 1.9-fold increase in AD risk from the lowest to the highest decile, in addition to effects of age and the APOE ε4 allele.
|
27
|
A hidden layer of structural variation in transposable elements reveals potential genetic modifiers in human disease-risk loci. Genome Res 2022; 32:656-670. [PMID: 35332097 PMCID: PMC8997352 DOI: 10.1101/gr.275515.121] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/14/2021] [Accepted: 01/28/2022] [Indexed: 11/24/2022]
Abstract
Genome-wide association studies (GWAS) have been highly informative in discovering disease-associated loci but are not designed to capture all structural variations in the human genome. Using long-read sequencing data, we discovered widespread structural variation within SINE-VNTR-Alu (SVA) elements, a class of great ape-specific transposable elements with gene-regulatory roles, which represents a major source of structural variability in the human population. We highlight the presence of structurally variable SVAs (SV-SVAs) in neurological disease-associated loci, and we further associate SV-SVAs to disease-associated SNPs and differential gene expression using luciferase assays and expression quantitative trait loci data. Finally, we genetically deleted SV-SVAs in the BIN1 and CD2AP Alzheimer's disease-associated risk loci and in the BCKDK Parkinson's disease-associated risk locus and assessed multiple aspects of their gene-regulatory influence in a human neuronal context. Together, this study reveals a novel layer of genetic variation in transposable elements that may contribute to identification of the structural variants that are the actual drivers of disease associations of GWAS loci.
|
28
|
scMoC: single-cell multi-omics clustering. Bioinformatics Advances 2022; 2:vbac011. [PMID: 36699396 PMCID: PMC9710707 DOI: 10.1093/bioadv/vbac011] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/23/2021] [Revised: 01/27/2022] [Accepted: 02/11/2022] [Indexed: 01/28/2023]
Abstract
Motivation Single-cell multi-omics assays simultaneously measure different molecular features from the same cell. A key question is how to benefit from the complementary data available and perform cross-modal clustering of cells.
Results We propose Single-Cell Multi-omics Clustering (scMoC), an approach to identify cell clusters from data with co-measurements of scRNA-seq and scATAC-seq from the same cell. We overcome the high sparsity of the scATAC-seq data by using an imputation strategy that exploits the less-sparse scRNA-seq data available from the same cell. Subsequently, scMoC identifies clusters of cells by merging clusterings derived from both data domains individually. We tested scMoC on datasets generated using different protocols with variable data sparsity levels. We show that scMoC (i) is able to generate informative scATAC-seq data due to its RNA-guided imputation strategy and (ii) results in integrated clusters based on both RNA and ATAC information that are biologically meaningful either from the RNA or from the ATAC perspective.
Availability and implementation The data used in this manuscript are publicly available, and we refer to the original manuscripts for their description and availability. For convenience, the sci-CAR data are available at NCBI GEO under accession number GSE117089. The SNARE-seq data are available at NCBI GEO under accession number GSE126074. The 10X multiome data are available at the following link: https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-3-k-1-standard-2-0-0.
Supplementary information Supplementary data are available at Bioinformatics Advances online.
|
29
|
The Effect of Alzheimer's Disease-Associated Genetic Variants on Longevity. Front Genet 2022; 12:748781. [PMID: 34992629 PMCID: PMC8724252 DOI: 10.3389/fgene.2021.748781] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/28/2021] [Accepted: 11/24/2021] [Indexed: 12/22/2022] Open
Abstract
Human longevity is influenced by the genetic risk of age-related diseases. As Alzheimer’s disease (AD) represents a common condition at old age, an interplay between genetic factors affecting AD and longevity is expected. We explored this interplay by studying the prevalence of AD-associated single-nucleotide-polymorphisms (SNPs) in cognitively healthy centenarians, and replicated findings in a parental-longevity GWAS. We found that 28/38 SNPs that increased AD-risk also associated with lower odds of longevity. For each SNP, we express the imbalance between AD- and longevity-risk as an effect-size distribution. Based on these distributions, we grouped the SNPs in three groups: 17 SNPs increased AD-risk more than they decreased longevity-risk, and were enriched for β-amyloid metabolism and immune signaling; 11 variants reported a larger longevity-effect compared to their AD-effect, were enriched for endocytosis/immune-signaling, and were previously associated with other age-related diseases. Unexpectedly, 10 variants associated with an increased risk of AD and higher odds of longevity. Altogether, we show that different AD-associated SNPs have different effects on longevity, including SNPs that may confer general neuro-protective functions against AD and other age-related diseases.
|
30
|
Demystifying machine learning for mortality prediction. Crit Care 2021; 25:447. [PMID: 34949229 PMCID: PMC8697544 DOI: 10.1186/s13054-021-03868-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/19/2021] [Accepted: 11/27/2021] [Indexed: 11/24/2022] Open
|
31
|
Single-Cell Transcriptomics Links Loss of Human Pancreatic β-Cell Identity to ER Stress. Cells 2021; 10:3585. [PMID: 34944092 PMCID: PMC8700697 DOI: 10.3390/cells10123585] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/10/2021] [Revised: 11/25/2021] [Accepted: 12/10/2021] [Indexed: 11/30/2022] Open
Abstract
The maintenance of pancreatic islet architecture is crucial for proper β-cell function. We previously reported that disruption of human islet integrity could result in altered β-cell identity. Here we combine β-cell lineage tracing and single-cell transcriptomics to investigate the mechanisms underlying this process in primary human islet cells. Using drug-induced ER stress and cytoskeleton modification models, we demonstrate that altering the islet structure triggers an unfolding protein response that causes the downregulation of β-cell maturity genes. Collectively, our findings illustrate the close relationship between endoplasmic reticulum homeostasis and β-cell phenotype, and strengthen the concept of altered β-cell identity as a mechanism underlying the loss of functional β-cell mass.
|
32
|
Robust deep learning model for prognostic stratification of pancreatic ductal adenocarcinoma patients. iScience 2021; 24:103415. [PMID: 34901786 PMCID: PMC8637475 DOI: 10.1016/j.isci.2021.103415] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/07/2021] [Revised: 09/27/2021] [Accepted: 11/05/2021] [Indexed: 02/07/2023] Open
Abstract
A major challenge for treating patients with pancreatic ductal adenocarcinoma (PDAC) is the unpredictability of their prognoses due to high heterogeneity. We present Multi-Omics DEep Learning for Prognosis-correlated subtyping (MODEL-P) to identify PDAC subtypes and to predict prognoses of new patients. MODEL-P was trained on autoencoder integrated multi-omics of 146 patients with PDAC together with their survival outcome. Using MODEL-P, we identified two PDAC subtypes with distinct survival outcomes (median survival 10.1 and 22.7 months, respectively, log rank p = 1 × 10^-6), which correspond to DNA damage repair and immune response. We rigorously validated MODEL-P by stratifying patients in five independent datasets into these two survival groups and achieved significant survival difference, which is superior to current practice and other subtyping schemas. We believe the subtype-specific signatures would facilitate PDAC pathogenesis discovery, and MODEL-P can provide clinicians the prognoses information in the treatment decision-making to better gauge the benefits versus the risks.
We developed DL-based MODEL-P to identify prognosis-correlated PDAC subtypes. The identified subtypes related to DNA damage repair and immune response processes. MODEL-P stratified patients from independent datasets into distinct survival groups. MODEL-P could be used in clinics to aid treatment decision-making.
|
33
|
Predicting patient response with models trained on cell lines and patient-derived xenografts by nonlinear transfer learning. Proc Natl Acad Sci U S A 2021; 118:e2106682118. [PMID: 34873056 PMCID: PMC8670522 DOI: 10.1073/pnas.2106682118] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Accepted: 10/18/2021] [Indexed: 12/13/2022] Open
Abstract
Preclinical models have been the workhorse of cancer research, producing massive amounts of drug response data. Unfortunately, translating response biomarkers derived from these datasets to human tumors has proven to be particularly challenging. To address this challenge, we developed TRANSACT, a computational framework that builds a consensus space to capture biological processes common to preclinical models and human tumors and exploits this space to construct drug response predictors that robustly transfer from preclinical models to human tumors. TRANSACT performs favorably compared to four competing approaches, including two deep learning approaches, on a set of 23 drug prediction challenges on The Cancer Genome Atlas and 226 metastatic tumors from the Hartwig Medical Foundation. We demonstrate that response predictions deliver a robust performance for a number of therapies of high clinical importance: platinum-based chemotherapies, gemcitabine, and paclitaxel. In contrast to other approaches, we demonstrate the interpretability of the TRANSACT predictors by correctly identifying known biomarkers of targeted therapies, and we propose potential mechanisms that mediate the resistance to two chemotherapeutic agents.
|
34
|
Differential analysis of binarized single-cell RNA sequencing data captures biological variation. NAR Genom Bioinform 2021; 3:lqab118. [PMID: 34988441 PMCID: PMC8693570 DOI: 10.1093/nargab/lqab118] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/28/2021] [Revised: 11/04/2021] [Accepted: 12/03/2021] [Indexed: 12/11/2022] Open
Abstract
Single-cell RNA sequencing data is characterized by a large number of zero counts, yet there is growing evidence that these zeros reflect biological variation rather than technical artifacts. We propose to use binarized expression profiles to identify the effects of biological variation in single-cell RNA sequencing data. Using 16 publicly available and simulated datasets, we show that a binarized representation of single-cell expression data accurately represents biological variation and reveals the relative abundance of transcripts more robustly than counts.
|
35
|
Longitudinal Dynamics of Human B-Cell Response at the Single-Cell Level in Response to Tdap Vaccination. Vaccines (Basel) 2021; 9:1352. [PMID: 34835283 PMCID: PMC8617659 DOI: 10.3390/vaccines9111352] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/08/2021] [Revised: 11/08/2021] [Accepted: 11/13/2021] [Indexed: 01/28/2023] Open
Abstract
To mount an adequate immune response against pathogens, stepwise mutation and selection processes are crucial functions of the adaptive immune system. To better characterize a successful vaccination response, we performed longitudinal (days 0, 5, 7, 10, and 14 after Boostrix vaccination) analysis of the single-cell transcriptome as well as the B-cell receptor (BCR) repertoire (scBCR-rep) in plasma cells of an immunized donor and compared it with baseline B-cell characteristics as well as flow cytometry findings. Based on the flow cytometry knowledge and literature findings, we discriminated individual B-cell subsets in the transcriptomics data, traced the maturation of plasmablasts/plasma cells (PB/PCs) over time, and identified the pathways associated with plasma cell maturation. We observed that the repertoire in PB/PCs differed from the baseline B-cell repertoire, e.g., in the expansion of unique clones at post-vaccination visits, high usage of IGHG1 in expanded clones, increased class-switching events post-vaccination (represented by clonotypes spanning multiple IGHC classes), and positive selection of CDR3 sequences over time. Importantly, Variable gene family-based clustering of BCRs gave a measure similar to gene-based clustering but improved the clustering of BCRs, as BCRs from duplicated Variable gene families could be clustered together. Finally, we developed a query tool to dissect the immune response to the components of the Boostrix vaccine. Using this tool, we could identify the BCRs related to anti-tetanus and anti-pertussis toxoid BCRs. Collectively, we developed a bioinformatic workflow which allows description of the key features of an ongoing (longitudinal) immune response, such as activation of PB/PCs, Ig class switching, somatic hypermutation, and clonal expansion, all of which are hallmarks of antigen exposure, followed by mutation and selection processes.
|
36
|
Transcriptomic Signatures Associated With Regional Cortical Thickness Changes in Parkinson's Disease. Front Neurosci 2021; 15:733501. [PMID: 34658772 PMCID: PMC8519261 DOI: 10.3389/fnins.2021.733501] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/30/2021] [Accepted: 09/08/2021] [Indexed: 11/16/2022] Open
Abstract
Cortical atrophy is a common manifestation in Parkinson's disease (PD), particularly in advanced stages of the disease. To elucidate the molecular underpinnings of cortical thickness changes in PD, we performed an integrated analysis of brain-wide healthy transcriptomic data from the Allen Human Brain Atlas and patterns of cortical thickness based on T1-weighted anatomical MRI data of 149 PD patients and 369 controls. For this purpose, we used partial least squares regression to identify gene expression patterns correlated with cortical thickness changes. In addition, we identified gene expression patterns underlying the relationship between cortical thickness and clinical domains of PD. Our results show that genes whose expression in the healthy brain is associated with cortical thickness changes in PD are enriched in biological pathways related to sumoylation, regulation of mitotic cell cycle, mitochondrial translation, DNA damage responses, and ER-Golgi traffic. The associated pathways were highly related to each other and all belong to cellular maintenance mechanisms. The expression of genes within most pathways was negatively correlated with cortical thickness changes, showing higher expression in regions associated with decreased cortical thickness (atrophy). On the other hand, sumoylation pathways were positively correlated with cortical thickness changes, showing higher expression in regions with increased cortical thickness (hypertrophy). Our findings suggest that alterations in the balanced interplay of these mechanisms play a role in changes of cortical thickness in PD and possibly influence motor and cognitive functions.
|
37
|
Abstract
BACKGROUND AND AIMS: Protein profiling in patients with inflammatory bowel diseases [IBD] for diagnostic and therapeutic purposes is underexplored. This study analysed the association between phenotype, genotype, and the plasma proteome in IBD. METHODS: A total of 92 inflammation-related proteins were quantified in plasma of 1028 patients with IBD (567 Crohn's disease [CD]; 461 ulcerative colitis [UC]) and 148 healthy individuals to assess protein-phenotype associations. Corresponding whole-exome sequencing and global screening array data of 919 patients with IBD were included to analyse the effect of genetics on protein levels (protein quantitative trait loci [pQTL] analysis). Intestinal mucosal RNA sequencing and faecal metagenomic data were used for complementary analyses. RESULTS: Thirty-two proteins were differentially abundant between IBD and healthy individuals, of which 22 proteins were independent of active inflammation; 69 proteins were associated with 15 demographic and clinical factors. Fibroblast growth factor-19 levels were decreased in CD patients with ileal disease or a history of ileocecal resection. Thirteen novel cis-pQTLs were identified and 10 replicated from previous studies. One trans-pQTL of the fucosyltransferase 2 [FUT2] gene [rs602662] and two independent cis-pQTLs of C-C motif chemokine 25 [CCL25] affected plasma CCL25 levels. Intestinal gene expression data revealed an overlapping cis-expression [e]QTL-variant [rs3745387] of the CCL25 gene. The FUT2 rs602662 trans-pQTL was associated with reduced abundances of faecal butyrate-producing bacteria. CONCLUSIONS: This study shows that genotype and multiple disease phenotypes strongly associate with the plasma inflammatory proteome in IBD, and identifies disease-associated pathways that may help to improve disease management in the future.
|
38
|
Genetics Contributes to Concomitant Pathology and Clinical Presentation in Dementia with Lewy Bodies. J Alzheimers Dis 2021; 83:269-279. [PMID: 34308904 PMCID: PMC8461715 DOI: 10.3233/jad-210365] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/21/2022]
Abstract
Background: Dementia with Lewy bodies (DLB) is a complex, progressive neurodegenerative disease with considerable phenotypic, pathological, and genetic heterogeneity. Objective: We tested whether genetic variants in part explain the heterogeneity in DLB. Methods: We tested the effects of variants previously associated with DLB (near APOE, GBA, and SNCA) and polygenic risk scores for Alzheimer’s disease (AD-PRS) and Parkinson’s disease (PD-PRS). We studied 190 probable DLB patients from the Alzheimer’s dementia cohort and compared them to 2,552 control subjects. The p-tau/Aβ1–42 ratio in cerebrospinal fluid was used as an in vivo proxy to separate DLB cases into DLB with concomitant AD pathology (DLB-AD) or DLB without AD (DLB-pure). We studied the clinical measures age, Mini-Mental State Examination (MMSE), and the presence of core symptoms at diagnosis and disease duration. Results: We found that all studied genetic factors were significantly associated with DLB risk (all-DLB). Second, we stratified the DLB patients by the presence of concomitant AD pathology and found that APOE ɛ4 and the AD-PRS were associated specifically with DLB-AD, but less with DLB-pure. In addition, the GBA p.E365K variant showed a strong association with DLB-pure and less with DLB-AD. Last, we studied the clinical measures and found that APOE ɛ4 was associated with reduced MMSE, higher odds of fluctuations, and a shorter disease duration. In addition, the GBA p.E365K variant reduced the age at onset by 5.7 years, but the other variants and the PRS were not associated with clinical features. Conclusion: These findings increase our understanding of the pathological and clinical heterogeneity in DLB.
|
39
|
Polygenic Risk Score of Longevity Predicts Longer Survival Across an Age Continuum. J Gerontol A Biol Sci Med Sci 2021; 76:750-759. [PMID: 33216869 PMCID: PMC8087277 DOI: 10.1093/gerona/glaa289] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 09/09/2020] [Indexed: 12/17/2022] Open
Abstract
Studying the genome of centenarians may give insights into the molecular mechanisms underlying extreme human longevity and the escape of age-related diseases. Here, we set out to construct polygenic risk scores (PRSs) for longevity and to investigate the functions of longevity-associated variants. Using a cohort of centenarians with maintained cognitive health (N = 343), a population-matched cohort of older adults from 5 cohorts (N = 2905), and summary statistics data from genome-wide association studies on parental longevity, we constructed a PRS including 330 variants that significantly discriminated between centenarians and older adults. This PRS was also associated with longer survival in an independent sample of younger individuals (p = .02), leading up to a 4-year difference in survival based on common genetic factors only. We show that this PRS was, in part, able to compensate for the deleterious effect of the APOE-ε4 allele. Using an integrative framework, we annotated the 330 variants included in this PRS by the genes they associate with. We find that they are enriched with genes associated with cellular differentiation, developmental processes, and cellular response to stress. Together, our results indicate that an extended human life span is, in part, the result of a constellation of variants each exerting small advantageous effects on aging-related biological mechanisms that maintain overall health and decrease the risk of age-related diseases.
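As a rough illustration of how a polygenic risk score is commonly computed, the sketch below sums effect-allele dosages weighted by per-variant effect sizes. The variant identifiers, weights, and genotypes are hypothetical, and the weighting scheme is the generic textbook form, not necessarily the one used to build the 330-variant PRS described above.

```python
# Hypothetical per-variant effect sizes (for example, log odds ratios).
weights = {"rs_a": 0.20, "rs_b": -0.10, "rs_c": 0.05}

# Hypothetical genotypes: number of effect alleles (0, 1, or 2) per variant.
individuals = {
    "person_1": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    "person_2": {"rs_a": 0, "rs_b": 2, "rs_c": 1},
}


def polygenic_score(dosages, variant_weights):
    """Weighted sum of effect-allele dosages; missing variants count as 0."""
    return sum(
        weight * dosages.get(variant, 0)
        for variant, weight in variant_weights.items()
    )


scores = {
    name: polygenic_score(dosages, weights)
    for name, dosages in individuals.items()
}
```

Treating a missing genotype as zero dosage is a simplification made for this sketch; real pipelines usually impute missing genotypes or use the population mean dosage.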
|
40
|
snpXplorer: a web application to explore human SNP-associations and annotate SNP-sets. Nucleic Acids Res 2021; 49:W603-W612. [PMID: 34048563 PMCID: PMC8262737 DOI: 10.1093/nar/gkab410] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/03/2021] [Revised: 04/19/2021] [Accepted: 05/01/2021] [Indexed: 02/06/2023] Open
Abstract
Genetic association studies are frequently used to study the genetic basis of numerous human phenotypes. However, the rapid interrogation of how well a certain genomic region associates across traits as well as the interpretation of genetic associations is often complex and requires the integration of multiple sources of annotation, which involves advanced bioinformatic skills. We developed snpXplorer, an easy-to-use web-server application for exploring single nucleotide polymorphism (SNP) association statistics and functionally annotating sets of SNPs. snpXplorer can superimpose association statistics from multiple studies and display regional information including SNP associations, structural variations, recombination rates, eQTL, linkage disequilibrium patterns, genes and gene-expressions per tissue. By overlaying multiple GWAS studies, snpXplorer can be used to compare levels of association across different traits, which may help the interpretation of variant consequences. Given a list of SNPs, snpXplorer can also be used to perform variant-to-gene mapping and gene-set enrichment analysis to identify molecular pathways that are overrepresented in the list of input SNPs. snpXplorer is freely available at https://snpxplorer.net. Source code, documentation, example files and tutorial videos are available within the Help section of snpXplorer and at https://github.com/TesiNicco/snpXplorer.
|
41
|
Population matched (pm) germline allelic variants of immunoglobulin (IG) loci: Relevance in infectious diseases and vaccination studies in human populations. Genes Immun 2021; 22:172-186. [PMID: 34120151 PMCID: PMC8196923 DOI: 10.1038/s41435-021-00143-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/01/2021] [Revised: 05/12/2021] [Accepted: 06/01/2021] [Indexed: 02/05/2023]
Abstract
Immunoglobulin (IG) loci harbor inter-individual allelic variants in many different germline IG variable, diversity and joining genes of the IG heavy (IGH), kappa (IGK) and lambda (IGL) loci, which together form the genetic basis of the highly diverse antigen-specific B-cell receptors. These allelic variants can be shared between or be specific to human populations. Current immunogenetics resources gather the germline alleles but lack their population specificity, which limits disease-association studies related to immune responses in different human populations. Therefore, we systematically identified germline alleles from 26 different human populations around the world, profiled by "1000 Genomes" data. We identified 409 IGHV, 179 IGKV, and 199 IGLV germline alleles supported by at least seven haplotypes. The diversity of germline alleles is the highest in Africans. Remarkably, the variants in the identified novel alleles show strikingly conserved patterns, the same as found in other IG databases, suggesting evolutionary selection processes over time. We could relate the genetic variants to population-specific immune responses, e.g. IGHV1-69 for flu in Africans. The population matched IG (pmIG) resource will enhance our understanding of the SHM-related B-cell receptor selection processes in (infectious) diseases and vaccination within and between different human populations.
|
42
|
Abstract
Supervised methods are increasingly used to identify cell populations in single-cell data. Yet, current methods are limited in their ability to learn from multiple datasets simultaneously, are hampered by the annotation of datasets at different resolutions, and do not preserve annotations when retrained on new datasets. The latter point is especially important as researchers cannot rely on downstream analysis performed using earlier versions of the dataset. Here, we present scHPL, a hierarchical progressive learning method which allows continuous learning from single-cell data by leveraging the different resolutions of annotations across multiple datasets to learn and continuously update a classification tree. We evaluate the classification and tree learning performance using simulated as well as real datasets and show that scHPL can successfully learn known cellular hierarchies from multiple datasets while preserving the original annotations. scHPL is available at https://github.com/lcmmichielsen/scHPL.
|
43
|
Publisher Correction: A meta-analysis of genome-wide association studies identifies multiple longevity genes. Nat Commun 2021; 12:2463. [PMID: 33893282 PMCID: PMC8065049 DOI: 10.1038/s41467-021-22613-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/02/2022] Open
Abstract
A Correction to this paper has been published: https://doi.org/10.1038/s41467-021-22613-2
|
44
|
Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 2021; 37:162-170. [PMID: 32797179 PMCID: PMC8055213 DOI: 10.1093/bioinformatics/btaa701] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 04/08/2020] [Revised: 07/10/2020] [Accepted: 08/12/2020] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large number of protein sequences without functional labels is available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
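To make the hand-crafted baselines named in the abstract concrete, here is a minimal sketch of two of them: one-hot encoding of amino acids and k-mer counts. The function names and the example sequence are hypothetical, and the code is not taken from the linked repository.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1
        encoded.append(row)
    return encoded


def kmer_counts(sequence, k=2):
    """Count all overlapping substrings of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


seq = "MKVMK"  # hypothetical short peptide
encoding = one_hot(seq)
dipeptides = kmer_counts(seq, k=2)
```

One-hot encoding keeps residue order but grows with sequence length, while k-mer counts give a fixed-size summary that discards position; that trade-off is one reason learned embeddings are an attractive alternative.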
|
45
|
CBA: Cluster-Guided Batch Alignment for Single Cell RNA-seq. Front Genet 2021; 12:644211. [PMID: 33927748 PMCID: PMC8076908 DOI: 10.3389/fgene.2021.644211] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/20/2020] [Accepted: 03/15/2021] [Indexed: 11/30/2022] Open
Abstract
The power of single-cell RNA sequencing (scRNA-seq) in detecting cell heterogeneity or developmental process is becoming more and more evident every day. The granularity of this knowledge is further propelled when combining two batches of scRNA-seq into a single large dataset. This strategy is however hampered by technical differences between these batches. Typically, these batch effects are resolved by matching similar cells across the different batches. Current approaches, however, do not take into account that we can constrain this matching further as cells can also be matched on their cell type identity. We use an auto-encoder to embed two batches in the same space such that cells are matched. To accomplish this, we use a loss function that preserves: (1) cell-cell distances within each of the two batches, as well as (2) cell-cell distances between two batches when the cells are of the same cell-type. The cell-type guidance is unsupervised, i.e., a cell-type is defined as a cluster in the original batch. We evaluated the performance of our cluster-guided batch alignment (CBA) using pancreas and mouse cell atlas datasets, against six state-of-the-art single cell alignment methods: Seurat v3, BBKNN, Scanorama, Harmony, LIGER, and BERMUDA. Compared to other approaches, CBA preserves the cluster separation in the original datasets while still being able to align the two datasets. We confirm that this separation is biologically meaningful by identifying relevant differential expression of genes for these preserved clusters.
|
46
|
Cingulate networks associated with gray matter loss in Parkinson's disease show high expression of cholinergic genes in the healthy brain. Eur J Neurosci 2021; 53:3727-3739. [PMID: 33792979 PMCID: PMC8251922 DOI: 10.1111/ejn.15216] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/30/2020] [Revised: 03/16/2021] [Accepted: 03/21/2021] [Indexed: 12/25/2022]
Abstract
Structural covariance networks are able to identify functionally organized brain regions by gray matter volume covariance across a population. We examined the transcriptomic signature of such anatomical networks in the healthy brain using postmortem microarray data from the Allen Human Brain Atlas. A previous study revealed that a posterior cingulate network and anterior cingulate network showed decreased gray matter in brains of Parkinson's disease patients. Therefore, we examined these two anatomical networks to understand the underlying molecular processes that may be involved in Parkinson's disease. Whole brain transcriptomics from the healthy brain revealed upregulation of genes associated with serotonin, GPCR, GABA, glutamate, and RAS-signaling pathways. Our results also suggest involvement of the cholinergic circuit, in which genes NPPA, SOSTDC1, and TYRP1 may play a functional role. Finally, both networks were enriched for genes associated with neuropsychiatric disorders that overlap with Parkinson's disease symptoms. The identified genes and pathways contribute to healthy functions of the posterior and anterior cingulate networks and disruptions to these functions may in turn contribute to the pathological and clinical events observed in Parkinson's disease.
|
47
|
Erratum: Untangling biological factors influencing trajectory inference from single cell data. NAR Genom Bioinform 2021; 2:lqaa102. [PMID: 33577627 PMCID: PMC7679063 DOI: 10.1093/nargab/lqaa102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022] Open
|
48
|
SCHNEL: scalable clustering of high dimensional single-cell data. Bioinformatics 2020; 36:i849-i856. [PMID: 33381821 DOI: 10.1093/bioinformatics/btaa816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/07/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Single-cell data measure multiple cellular markers at the single-cell level for thousands to millions of cells. Identification of distinct cell populations is a key step for further biological understanding, usually performed by clustering this data. Dimensionality reduction based clustering tools are either not scalable to large datasets containing millions of cells, or not fully automated, requiring an initial manual estimation of the number of clusters. Graph clustering tools provide automated and reliable clustering for single-cell data, but scale poorly to large datasets. RESULTS: We developed SCHNEL, a scalable, reliable and automated clustering tool for high-dimensional single-cell data. SCHNEL transforms large high-dimensional data to a hierarchy of datasets containing subsets of data points following the original data manifold. The novel approach of SCHNEL combines this hierarchical representation of the data with graph clustering, making graph clustering scalable to millions of cells. Using seven different cytometry datasets, SCHNEL outperformed three popular clustering tools for cytometry data, and was able to produce meaningful clustering results for datasets of 3.5 and 17.2 million cells within workable time frames. In addition, we show that SCHNEL is a general clustering tool by applying it to single-cell RNA sequencing data, as well as a popular machine learning benchmark dataset MNIST. AVAILABILITY AND IMPLEMENTATION: Implementation is available on GitHub (https://github.com/biovault/SCHNELpy). All datasets used in this study are publicly available. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
|
49
|
Using Out-of-Batch Reference Populations to Improve Untargeted Metabolomics for Screening Inborn Errors of Metabolism. Metabolites 2020; 11:metabo11010008. [PMID: 33375624 PMCID: PMC7824495 DOI: 10.3390/metabo11010008] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/27/2020] [Revised: 12/14/2020] [Accepted: 12/18/2020] [Indexed: 01/15/2023] Open
Abstract
Untargeted metabolomics is an emerging technology in the laboratory diagnosis of inborn errors of metabolism (IEM). Analysis of a large number of reference samples is crucial for correcting variations in metabolite concentrations that result from factors such as diet, age, and gender in order to judge whether metabolite levels are abnormal. However, a large number of reference samples requires the use of out-of-batch samples, which is hampered by the semi-quantitative nature of untargeted metabolomics data, i.e., technical variations between batches. Methods to merge and accurately normalize data from multiple batches are urgently needed. Based on six metrics, we compared the existing normalization methods on their ability to reduce the batch effects from nine independently processed batches. Many of those showed marginal performance, which motivated us to develop Metchalizer, a normalization method that uses 10 stable isotope-labeled internal standards and a mixed effect model. In addition, we propose a regression model with age and sex as covariates fitted on reference samples that were obtained from all nine batches. Metchalizer applied to log-transformed data showed the most promising performance on batch effect removal, as well as in the detection of 195 known biomarkers across 49 IEM patient samples, and performed at least similarly to an approach utilizing 15 within-batch reference samples. Furthermore, our regression model indicates that 6.5-37% of the considered features showed significant age-dependent variations. Our comprehensive comparison of normalization methods showed that our Log-Metchalizer approach enables the use of out-of-batch reference samples to establish clinically relevant reference values for metabolite concentrations. These findings open up the possibility of using large-scale out-of-batch reference samples in a clinical setting, increasing the throughput and detection accuracy.
|
50
|
Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study. JMIR Med Inform 2020; 8:e23930. [PMID: 33252349 PMCID: PMC7735897 DOI: 10.2196/23930] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 08/28/2020] [Revised: 10/18/2020] [Accepted: 10/24/2020] [Indexed: 11/18/2022] Open
Abstract
Background: Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries. Objective: The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records. Methods: Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation. Results: For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). Applying this support vector machine classifier to the test data resulted in a similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94), and with this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best (AUROC 0.94; AUPRC 0.85; F1 score 0.82) in the training set, and applied to the test data, resulted once again in good results (F1 score 0.67; PPV 0.97). Conclusions: We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations for limited costs. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries for low cost from already available data in electronic health record systems.
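The evaluation metrics named in the abstract can be illustrated with a small sketch that scores a naïve word-matching baseline by positive predictive value (precision), recall, and F1. The clinical notes, labels, and function names below are invented for illustration and do not come from the Leiden or Erlangen data; the study's actual word-matching rule is not specified here.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (PPV, recall, F1) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def word_match(texts, keyword):
    """Flag a note as positive if it contains the keyword (case-insensitive)."""
    return [1 if keyword in text.lower() else 0 for text in texts]


# Hypothetical free-text notes with hypothetical gold-standard labels.
notes = [
    "Patient with rheumatoid arthritis, started methotrexate.",
    "No signs of rheumatoid arthritis; osteoarthritis of the knee.",
    "Seropositive RA, erosive disease.",
    "Gout flare in the left foot.",
]
labels = [1, 0, 1, 0]

predictions = word_match(notes, "rheumatoid arthritis")
ppv, recall, f1 = precision_recall_f1(labels, predictions)
```

In this toy case the keyword match flags a negated mention and misses an abbreviation, the kind of errors that a classifier trained on the free text may be able to reduce.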
|