76
Nentwich M, Zschornak M, Weigel T, Köhler T, Novikov D, Meyer DC, Richter C. Treatment of multiple-beam X-ray diffraction in energy-dependent measurements. J Synchrotron Radiat 2024; 31:28-34. [PMID: 38095667 PMCID: PMC10833431 DOI: 10.1107/s1600577523009670] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/02/2023] [Accepted: 11/06/2023] [Indexed: 01/09/2024]
Abstract
During X-ray diffraction experiments on single crystals, the diffracted beam intensities may be affected by multiple-beam X-ray diffraction (MBD). This effect is particularly frequent at higher X-ray energies and for larger unit cells. The appearance of this so-called Renninger effect often impairs the interpretation of diffracted intensities. This applies in particular to energy spectra analysed in resonant experiments, since during scans of the incident photon energy these conditions are necessarily met for specific X-ray energies. This effect can be addressed by carefully avoiding multiple-beam reflection conditions at a given X-ray energy and a given position in reciprocal space. However, areas which are (nearly) free of MBD are not always available. This article presents a universal concept of data acquisition and post-processing for resonant X-ray diffraction experiments. Our concept facilitates the reliable determination of kinematic (MBD-free) resonant diffraction intensities even at relatively high energies which, in turn, enables the study of higher absorption edges. This way, the applicability of resonant diffraction, e.g. to reveal the local atomic and electronic structure or chemical environment, is extended for a vast majority of crystalline materials. The potential of this approach compared with conventional data reduction is demonstrated by measurements at the Ta L₃ edge of well-studied lithium tantalate, LiTaO₃.
77
Nemoto T, Ocari T, Planul A, Tekinsoy M, Zin EA, Dalkara D, Ferrari U. ACIDES: on-line monitoring of forward genetic screens for protein engineering. Nat Commun 2023; 14:8504. [PMID: 38148337 PMCID: PMC10751290 DOI: 10.1038/s41467-023-43967-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2023] [Accepted: 11/24/2023] [Indexed: 12/28/2023] Open
Abstract
Forward genetic screens of mutated variants are a versatile strategy for protein engineering and investigation, which has been successfully applied to various studies like directed evolution (DE) and deep mutational scanning (DMS). While next-generation sequencing can track millions of variants during the screening rounds, the vast and noisy nature of the sequencing data impedes the estimation of the performance of individual variants. Here, we propose ACIDES that combines statistical inference and in-silico simulations to improve performance estimation in the library selection process by attributing accurate statistical scores to individual variants. We tested ACIDES first on a random-peptide-insertion experiment and then on multiple public datasets from DE and DMS studies. ACIDES allows experimentalists to reliably estimate variant performance on the fly and can aid protein engineering and research pipelines in a range of applications, including gene therapy.
78
Hejret V, Varadarajan NM, Klimentova E, Gresova K, Giassa IC, Vanacova S, Alexiou P. Analysis of chimeric reads characterises the diverse targetome of AGO2-mediated regulation. Sci Rep 2023; 13:22895. [PMID: 38129478 PMCID: PMC10739727 DOI: 10.1038/s41598-023-49757-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Accepted: 12/12/2023] [Indexed: 12/23/2023] Open
Abstract
Argonaute proteins are instrumental in regulating RNA stability and translation. AGO2, the major mammalian Argonaute protein, is known to primarily associate with microRNAs, a family of small RNA 'guide' sequences, and identifies its targets primarily via a 'seed' mediated partial complementarity process. Despite numerous studies, a definitive experimental dataset of AGO2 'guide'-'target' interactions remains elusive. Our study employs two experimental methods, AGO2 CLASH and AGO2 eCLIP, to generate thousands of AGO2 target sites verified by chimeric reads. These chimeric reads contain both the AGO2 loaded small RNA 'guide' and the target sequence, providing a robust resource for modeling AGO2 binding preferences. Our novel analysis pipeline reveals thousands of AGO2 target sites driven by microRNAs and a significant number of AGO2 'guides' derived from fragments of other small RNAs such as tRNAs, YRNAs, snoRNAs, rRNAs, and more. We utilize convolutional neural networks to train machine learning models that accurately predict the binding potential for each 'guide' class and experimentally validate several interactions. In conclusion, our comprehensive analysis of the AGO2 targetome broadens our understanding of its 'guide' repertoire and potential function in development and disease. Moreover, we offer practical bioinformatic tools for future experiments and the prediction of AGO2 targets. All data and code from this study are freely available at https://github.com/ML-Bioinfo-CEITEC/HybriDetector/.
79
Rischke S, Poor SM, Gurke R, Hahnefeld L, Köhm M, Ultsch A, Geisslinger G, Behrens F, Lötsch J. Machine learning identifies right index finger tenderness as key signal of DAS28-CRP based psoriatic arthritis activity. Sci Rep 2023; 13:22710. [PMID: 38123604 PMCID: PMC10733369 DOI: 10.1038/s41598-023-49574-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 12/09/2023] [Indexed: 12/23/2023] Open
Abstract
Psoriatic arthritis (PsA) is a chronic inflammatory systemic disease whose activity is often assessed using the Disease Activity Score 28 (DAS28-CRP). The present study was designed to investigate the significance of individual components within the score for PsA activity. A cohort of 80 PsA patients (44 women and 36 men, aged 56.3 ± 12 years) with a range of disease activity from remission to moderate was analyzed using unsupervised and supervised methods applied to the DAS28-CRP components. Machine learning-based permutation importance identified tenderness in the metacarpophalangeal joint of the right index finger as the most informative item of the DAS28-CRP for PsA activity staging. This symptom alone allowed a machine learned (random forests) classifier to identify PsA remission with 67% balanced accuracy in new cases. Projection of the DAS28-CRP data onto an emergent self-organizing map of artificial neurons identified outliers, which following augmentation of group sizes by emergent self-organizing maps based generative artificial intelligence (AI) could be defined as subgroups particularly characterized by either tenderness or swelling of specific joints. AI-assisted re-evaluation of the DAS28-CRP for PsA has narrowed the score items to a most relevant symptom, and generative AI has been useful for identifying and characterizing small subgroups of patients whose symptom patterns differ from the majority. These findings represent an important step toward precision medicine that can address outliers.
80
Xu Z, Tang S, Liu C, Zhang Q, Gu H, Li X, Di Z, Li Z. Temporal segmentation of EEG based on functional connectivity network structure. Sci Rep 2023; 13:22566. [PMID: 38114604 PMCID: PMC10730570 DOI: 10.1038/s41598-023-49891-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 12/13/2023] [Indexed: 12/21/2023] Open
Abstract
In the study of brain functional connectivity networks, it is assumed that a network is built from a data window in which activity is stationary. However, brain activity is non-stationary over sufficiently large time periods. Addressing the analysis of electroencephalography (EEG) data, we propose a data segmentation method based on functional connectivity network structure. The goal of segmentation is to ensure that within a window of analysis, there is similar network structure. We designed an intuitive and flexible graph distance measure to quantify the difference in network structure between two analysis windows. This measure is modular: a variety of node importance indices can be plugged into it. We use a reference window versus sliding window comparison approach to detect changes, as indicated by outliers in the distribution of graph distance values. Performance of our segmentation method was tested in simulated EEG data and real EEG data from a drone piloting experiment (using correlation or phase-locking value as the functional connectivity strength metric). We compared our method under various node importance measures and against matrix-based dissimilarity metrics that use singular value decomposition on the connectivity matrix. The results show the graph distance approach worked better than matrix-based approaches; graph distance based on partial node centrality was most sensitive to network structural changes, especially when connectivity matrix values change little. The proposed method provides EEG data segmentation tailored for detecting changes in terms of functional connectivity networks. Our study provides a new perspective on EEG segmentation, one that is based on functional connectivity network structure differences.
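The reference-window versus sliding-window comparison can be sketched as follows. This is a minimal illustration and not the authors' measure: it assumes absolute Pearson correlation as the connectivity strength, normalized node strength as the plugged-in node importance index, and an L1 difference as the graph distance. All function names and the toy windows are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sample lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant channel: correlation undefined")
    return sxy / math.sqrt(sxx * syy)

def connectivity(window):
    """window: list of channels (lists of samples) -> |r| matrix, zero diagonal."""
    n = len(window)
    return [[0.0 if i == j else abs(pearson(window[i], window[j]))
             for j in range(n)] for i in range(n)]

def node_importance(matrix):
    """Assumed importance index: node strength, normalized to sum to 1."""
    strengths = [sum(row) for row in matrix]
    total = sum(strengths)
    if total == 0:
        raise ValueError("network has no connections")
    return [s / total for s in strengths]

def graph_distance(window_a, window_b):
    """L1 difference between the node-importance profiles of two windows."""
    imp_a = node_importance(connectivity(window_a))
    imp_b = node_importance(connectivity(window_b))
    return sum(abs(a - b) for a, b in zip(imp_a, imp_b))

# Toy data: three channels, four samples each. In `shifted`, the roles of
# channels 1 and 2 are swapped, so the network structure differs.
reference = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
shifted = [[1, 2, 3, 4], [1, -1, 1, -1], [2, 4, 6, 8]]
print(graph_distance(reference, reference))  # 0.0 for identical windows
print(graph_distance(reference, shifted))    # positive: structure changed
```

In a segmentation setting, a boundary would be flagged when the distance between the reference window and the current sliding window is an outlier relative to the distribution of previously observed distances.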
81
Corbe M, Boncompain G, Perez F, Del Nery E, Genovesio A. Transfer learning for versatile and training free high content screening analyses. Sci Rep 2023; 13:22599. [PMID: 38114550 PMCID: PMC10730630 DOI: 10.1038/s41598-023-49554-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Accepted: 12/09/2023] [Indexed: 12/21/2023] Open
Abstract
High content screening (HCS) is a technology that automates cell biology experiments at large scale. A high content screen produces a large number of microscopy images of cells under many conditions and requires that a dedicated image and data analysis workflow be designed for each assay to select hits. This heavy data analysis step remains challenging and has been recognized as one of the burdens hindering the adoption of HCS. In this work we propose a solution to hit selection by using transfer learning without additional training. A pretrained residual network is employed to encode each image of a screen into a discriminant representation. The deep features obtained are then corrected to account for well plate bias and misalignment. We then propose two training-free pipelines dedicated to the two main categories of HCS for compound selection: with or without positive control. When a positive control is available, it is used alongside the negative control to compute a linear discriminant axis, thus building a classifier without training. Once all samples are projected onto this axis, the conditions that best reproduce the positive control can be selected. When no positive control is available, the Mahalanobis distance is computed from each sample to the negative control distribution. The latter provides a metric to identify the conditions that alter the negative control's cell phenotype. This metric is subsequently used to categorize hits through a clustering step. Given the lack of available ground truth in HCS, we provide a qualitative comparison of the results obtained using this approach with results obtained with handcrafted image analysis features for compounds and siRNA screens with or without control. Our results suggest that the fully automated and generic pipeline we propose offers a good alternative to handcrafted dedicated image analysis approaches. Furthermore, we demonstrate that this solution selects conditions of interest that had not been identified using the primary dedicated analysis. Altogether, this approach provides a fully automated, reproducible, versatile and comprehensive alternative analysis solution for HCS encompassing compound-based or downregulation screens, with or without positive controls, without the need for training or cell detection, or the development of a dedicated image analysis workflow.
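The control-free branch described in this abstract scores each sample by its Mahalanobis distance to the negative-control distribution. A minimal sketch of that distance, assuming two-dimensional features for readability (the paper works with deep features from a pretrained residual network); the function names and toy values are hypothetical and this is not the authors' pipeline:

```python
import math

def mean_and_cov_2d(points):
    """Mean vector and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, mean, cov):
    """sqrt(d^T S^-1 d), using the closed-form inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Toy negative-control features (hypothetical values).
negative_controls = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = mean_and_cov_2d(negative_controls)

print(mahalanobis_2d((1.0, 1.0), mean, cov))  # 0.0: at the control mean
print(mahalanobis_2d((3.0, 1.0), mean, cov))  # larger: candidate hit
```

Samples with a large distance are those whose phenotype departs from the negative control; the abstract states that this metric is then passed to a clustering step to categorize hits.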
82
Liang Z, Liang C. Design and implementation of load intensity monitoring platform supported by big data technology in stage training for women's sitting volleyball. Sci Rep 2023; 13:22382. [PMID: 38104202 PMCID: PMC10725414 DOI: 10.1038/s41598-023-50057-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Accepted: 12/14/2023] [Indexed: 12/19/2023] Open
Abstract
This study aims to discuss load intensity monitoring in the training process of sitting volleyball, to help coaches understand the training status of athletes, and to provide a scientific basis for the follow-up training plan. Through big data technology, the physiological changes of athletes can be more accurately grasped. This includes classification and summary of exercise load intensity and an experimental study of the relationship between heart rate and rating of perceived exertion (RPE). Monitoring the training process of a provincial women's sitting volleyball team shows a significant positive correlation between athletes' RPE and average heart rate. This result shows that by monitoring the change in heart rate and RPE of athletes, athletes' training state and physical condition can be more accurately understood. The results, obtained through big data technology and monitoring experiments, show that heart rate and RPE are effective monitoring indicators that can scientifically reflect the load intensity during sitting volleyball training. The conclusions provide coaches with a more scientific basis for making training plans and useful references for sports involving people with disabilities.
83
Zhu Y, Bi D, Saunders M, Ji Y. Prediction of chronic kidney disease progression using recurrent neural network and electronic health records. Sci Rep 2023; 13:22091. [PMID: 38086905 PMCID: PMC10716428 DOI: 10.1038/s41598-023-49271-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Accepted: 12/06/2023] [Indexed: 12/18/2023] Open
Abstract
Chronic kidney disease (CKD) is a progressive loss in kidney function. Early detection of patients who will progress to late-stage CKD is of paramount importance for patient care. To address this, we develop a pipeline to process longitudinal electronic health records (EHRs) and construct recurrent neural network (RNN) models to predict CKD progression from stages II/III to stages IV/V. The RNN model generates predictions based on time-series records of patients, including repeated lab tests and other clinical variables. Our investigation reveals that using a single variable, the recorded estimated glomerular filtration rate (eGFR) over time, the RNN model achieves an average area under the receiver operating characteristic curve (AUROC) of 0.957 for predicting future CKD progression. When additional clinical variables, such as demographics, vital information, lab test results, and health behaviors, are incorporated, the average AUROC increases to 0.967. In both scenarios, the standard deviation of the AUROC across cross-validation trials is less than 0.01, indicating a stable and high prediction accuracy. Our analysis results demonstrate the proposed RNN model outperforms existing standard approaches, including static and dynamic Cox proportional hazards models, random forest, and LightGBM. The utilization of the RNN model and the time-series data of previous eGFR measurements underscores its potential as a straightforward and effective tool for assessing the clinical risk of CKD patients concerning their disease progression.
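For readers unfamiliar with the AUROC figures quoted in this abstract, the metric can be computed directly from its pairwise definition: the probability that a randomly chosen progressing patient receives a higher score than a randomly chosen non-progressing one, with ties counted as one half. The sketch below uses toy labels and scores, not data from the study, and the function name `auroc` is hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for the positive class (e.g. progression), 0 otherwise.
    scores: model outputs, higher meaning more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))

# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The quadratic loop is kept for clarity; a rank-based formulation gives the same value in O(n log n) for larger cohorts.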
84
Dunn T, Narayanasamy S. vcfdist: accurately benchmarking phased small variant calls in human genomes. Nat Commun 2023; 14:8149. [PMID: 38071244 PMCID: PMC10710436 DOI: 10.1038/s41467-023-43876-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/13/2023] [Accepted: 11/22/2023] [Indexed: 12/18/2023] Open
Abstract
Accurately benchmarking small variant calling accuracy is critical for the continued improvement of human whole genome sequencing. In this work, we show that current variant calling evaluations are biased towards certain variant representations and may misrepresent the relative performance of different variant calling pipelines. We propose solutions, first exploring the affine gap parameter design space for complex variant representation and suggesting a standard. Next, we present our tool vcfdist and demonstrate the importance of enforcing local phasing for evaluation accuracy. We then introduce the notion of partial credit for mostly-correct calls and present an algorithm for clustering dependent variants. Lastly, we motivate using alignment distance metrics to supplement precision-recall curves for understanding variant calling performance. We evaluate the performance of 64 phased Truth Challenge V2 submissions and show that vcfdist improves measured insertion and deletion performance consistency across variant representations from R² = 0.97243 for baseline vcfeval to R² = 0.99996 for vcfdist.
85
Sajedi S, Ebrahimi G, Roudi R, Mehta I, Heshmat A, Samimi H, Kazempour S, Zainulabadeen A, Docking TR, Arora SP, Cigarroa F, Seshadri S, Karsan A, Zare H. Integrating DNA methylation and gene expression data in a single gene network using the iNETgrate package. Sci Rep 2023; 13:21721. [PMID: 38066050 PMCID: PMC10709411 DOI: 10.1038/s41598-023-48237-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023] Open
Abstract
Analyzing different omics data types independently is often too restrictive to allow for detection of subtle, but consistent, variations that are coherently supported by different assays. Integrating multi-omics data in one model can increase statistical power. However, designing such a model is challenging because different omics are measured at different levels. We developed the iNETgrate package (https://bioconductor.org/packages/iNETgrate/) that efficiently integrates transcriptome and DNA methylation data in a single gene network. Applying iNETgrate to five independent datasets improved prognostication compared to common clinical gold standards and a patient similarity network approach.
86
Jeckel H, Nosho K, Neuhaus K, Hastewell AD, Skinner DJ, Saha D, Netter N, Paczia N, Dunkel J, Drescher K. Simultaneous spatiotemporal transcriptomics and microscopy of Bacillus subtilis swarm development reveal cooperation across generations. Nat Microbiol 2023; 8:2378-2391. [PMID: 37973866 PMCID: PMC10686836 DOI: 10.1038/s41564-023-01518-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Accepted: 10/09/2023] [Indexed: 11/19/2023]
Abstract
Development of microbial communities is a complex multiscale phenomenon with wide-ranging biomedical and ecological implications. How biological and physical processes determine emergent spatial structures in microbial communities remains poorly understood due to a lack of simultaneous measurements of gene expression and cellular behaviour in space and time. Here we combined live-cell microscopy with a robotic arm for spatiotemporal sampling, which enabled us to simultaneously acquire phenotypic imaging data and spatiotemporal transcriptomes during Bacillus subtilis swarm development. Quantitative characterization of the spatiotemporal gene expression patterns revealed correlations with cellular and collective properties, and phenotypic subpopulations. By integrating these data with spatiotemporal metabolome measurements, we discovered a spatiotemporal cross-feeding mechanism fuelling swarm development: during their migration, earlier generations deposit metabolites which are consumed by later generations that swarm across the same location. These results highlight the importance of spatiotemporal effects during the emergence of phenotypic subpopulations and their interactions in bacterial communities.
87
Menden K, Francescatto M, Nyima T, Blauwendraat C, Dhingra A, Castillo-Lizardo M, Fernandes N, Kaurani L, Kronenberg-Versteeg D, Atasu B, Sadikoglou E, Borroni B, Rodriguez-Nieto S, Simon-Sanchez J, Fischer A, Craig DW, Neumann M, Bonn S, Rizzu P, Heutink P. A multi-omics dataset for the analysis of frontotemporal dementia genetic subtypes. Sci Data 2023; 10:849. [PMID: 38040703 PMCID: PMC10692098 DOI: 10.1038/s41597-023-02598-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Accepted: 09/26/2023] [Indexed: 12/03/2023] Open
Abstract
Understanding the molecular mechanisms underlying frontotemporal dementia (FTD) is essential for the development of successful therapies. Systematic studies on human post-mortem brain tissue of patients with genetic subtypes of FTD are currently lacking. The Risk and Modifying Factors of Frontotemporal Dementia (RiMod-FTD) consortium has therefore generated a multi-omics dataset for genetic subtypes of FTD to identify common and distinct molecular mechanisms disturbed in disease. Here, we present multi-omics datasets generated from the frontal lobe of post-mortem human brain tissue from patients with mutations in MAPT, GRN and C9orf72 and healthy controls. This data resource consists of four datasets generated with different technologies to capture the transcriptome by RNA-seq, small RNA-seq, CAGE-seq, and methylation profiling. We show concrete examples of how to use the resulting data, confirm current knowledge about FTD, and identify new processes for further investigation. This extensive multi-omics dataset holds great value to reveal new research avenues for this devastating disease.
88
Ang MY, Takeuchi F, Kato N. Deciphering the genetic landscape of obesity: a data-driven approach to identifying plausible causal genes and therapeutic targets. J Hum Genet 2023; 68:823-833. [PMID: 37620670 PMCID: PMC10678330 DOI: 10.1038/s10038-023-01189-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Revised: 08/08/2023] [Accepted: 08/15/2023] [Indexed: 08/26/2023]
Abstract
OBJECTIVES: Genome-wide association studies (GWAS) have successfully revealed numerous susceptibility loci for obesity. However, identifying the causal genes, pathways, and tissues/cell types responsible for these associations remains a challenge, and standardized analysis workflows are lacking. Additionally, due to limited treatment options for obesity, there is a need for the development of new pharmacological therapies. This study aimed to address these issues by performing step-wise utilization of knowledgebases for gene prioritization and assessing the potential relevance of key obesity genes as therapeutic targets.
METHODS AND RESULTS: First, we generated a list of 28,787 obesity-associated SNPs from the publicly available GWAS dataset (approximately 800,000 individuals in the GIANT meta-analysis). Then, we prioritized 1372 genes with significant in silico evidence against genomic and transcriptomic data, including transcriptionally regulated genes in the brain from transcriptome-wide association studies. In further narrowing down the gene list, we selected key genes, which we found to be useful for the discovery of potential drug seeds as demonstrated in lipid GWAS separately. We thus identified 74 key genes for obesity, which are highly interconnected and enriched in several biological processes that contribute to obesity, including energy expenditure and homeostasis. Of 74 key genes, 37 had not been reported for the pathophysiology of obesity. Finally, by drug-gene interaction analysis, we detected 23 (of 74) key genes that are potential targets for 78 approved and marketed drugs.
CONCLUSIONS: Our results provide valuable insights into new treatment options for obesity through a data-driven approach that integrates multiple up-to-date knowledgebases.
89
Zhou Z, Zhong Y, Zhang Z, Ren X. Spatial transcriptomics deconvolution at single-cell resolution using Redeconve. Nat Commun 2023; 14:7930. [PMID: 38040768 PMCID: PMC10692090 DOI: 10.1038/s41467-023-43600-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Accepted: 11/14/2023] [Indexed: 12/03/2023] Open
Abstract
Computational deconvolution with single-cell RNA sequencing data as reference is pivotal to interpreting spatial transcriptomics data, but the current methods are limited to cell-type resolution. Here we present Redeconve, an algorithm to deconvolute spatial transcriptomics data at single-cell resolution, enabling interpretation of spatial transcriptomics data with thousands of nuanced cell states. We benchmark Redeconve with the state-of-the-art algorithms on diverse spatial transcriptomics platforms and datasets and demonstrate the superiority of Redeconve in terms of accuracy, resolution, robustness, and speed. Application to a human pancreatic cancer dataset reveals cancer-clone-specific T cell infiltration, and application to lymph node samples identifies differential cytotoxic T cells between IgA+ and IgG+ spots, providing novel insights into tumor immunology and the regulatory mechanisms underlying antibody class switch.
90
Sun Y, Wiese M, Hmadi R, Karayol R, Seyfferth J, Martinez Greene JA, Erdogdu NU, Deboutte W, Arrigoni L, Holz H, Renschler G, Hirsch N, Foertsch A, Basilicata MF, Stehle T, Shvedunova M, Bella C, Pessoa Rodrigues C, Schwalb B, Cramer P, Manke T, Akhtar A. MSL2 ensures biallelic gene expression in mammals. Nature 2023; 624:173-181. [PMID: 38030723 PMCID: PMC10700137 DOI: 10.1038/s41586-023-06781-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2022] [Accepted: 10/24/2023] [Indexed: 12/01/2023]
Abstract
In diploid organisms, biallelic gene expression enables the production of adequate levels of mRNA [1,2]. This is essential for haploinsufficient genes, which require biallelic expression for optimal function to prevent the onset of developmental disorders [1,3]. Whether and how a biallelic or monoallelic state is determined in a cell-type-specific manner at individual loci remains unclear. MSL2 is known for dosage compensation of the male X chromosome in flies. Here we identify a role of MSL2 in regulating allelic expression in mammals. Allele-specific bulk and single-cell analyses in mouse neural progenitor cells revealed that, in addition to the targets showing biallelic downregulation, a class of genes transitions from biallelic to monoallelic expression after MSL2 loss. Many of these genes are haploinsufficient. In the absence of MSL2, one allele remains active, retaining active histone modifications and transcription factor binding, whereas the other allele is silenced, exhibiting loss of promoter-enhancer contacts and the acquisition of DNA methylation. Msl2-knockout mice show perinatal lethality and heterogeneous phenotypes during embryonic development, supporting a role for MSL2 in regulating gene dosage. The role of MSL2 in preserving biallelic expression of specific dosage-sensitive genes sets the stage for further investigation of other factors that are involved in allelic dosage compensation in mammalian cells, with considerable implications for human disease.
91
Janes RW, Wallace BA. DichroPipeline: A suite of online and downloadable tools and resources for protein circular dichroism spectroscopic data analyses, interpretations, and their interoperability with other bioinformatics tools and resources. Protein Sci 2023; 32:e4817. [PMID: 37881887 PMCID: PMC10680340 DOI: 10.1002/pro.4817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 09/30/2023] [Indexed: 10/27/2023]
Abstract
Circular Dichroism (CD) spectroscopy is a widely-used method for characterizing individual protein structures in solutions, membranes, films and macromolecular complexes, as well as for probing macromolecular interactions, conformational changes associated with binding substrates, and in different functionally-related environments. This paper describes a series of related computational and display tools that have been developed over many years to aid in those characterizations and functional interpretations. The new DichroPipeline described herein links a series of format-compatible data processing, analysis, and display tools to enable users to facilely produce the spectra, which can then be made available in the Protein Circular Dichroism Data Bank (https://pcddb.cryst.bbk.ac.uk/) resource, in which the CD spectral and associated metadata for each entry are linked to other structural and functional data bases including the Protein Data Bank (PDB), and the UniProt sequence data base, amongst others. These tools and resources thus provide the basis for a wide range of traceable structural characterizations of soluble, membrane and intrinsically-disordered proteins.
92
Mason L, Hicks B, Almeida JS. EpiVECS: exploring spatiotemporal epidemiological data using cluster embedding and interactive visualization. Sci Rep 2023; 13:21193. [PMID: 38040776 PMCID: PMC10692107 DOI: 10.1038/s41598-023-48484-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Accepted: 11/27/2023] [Indexed: 12/03/2023] Open
Abstract
The analysis of data over space and time is a core part of descriptive epidemiology, but the complexity of spatiotemporal data makes this challenging. There is a need for methods that simplify the exploration of such data for tasks such as surveillance and hypothesis generation. In this paper, we use combined clustering and dimensionality reduction methods (hereafter referred to as 'cluster embedding' methods) to spatially visualize patterns in epidemiological time-series data. We compare several cluster embedding techniques to see which performs best along a variety of internal cluster validation metrics. We find that methods based on k-means clustering generally perform better than self-organizing maps on real world epidemiological data, with some minor exceptions. We also introduce EpiVECS, a tool which allows the user to perform cluster embedding and explore the results using interactive visualization. EpiVECS is available as a privacy preserving, in-browser open source web application at https://episphere.github.io/epivecs .
|
93
|
Wevers D, Ramautar R, Clark C, Hankemeier T, Ali A. Opportunities and challenges for sample preparation and enrichment in mass spectrometry for single-cell metabolomics. Electrophoresis 2023; 44:2000-2024. [PMID: 37667867 DOI: 10.1002/elps.202300105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Revised: 08/08/2023] [Accepted: 08/19/2023] [Indexed: 09/06/2023]
Abstract
Single-cell heterogeneity in metabolism, drug resistance and disease type creates a need for analytical techniques for single-cell analysis. As the metabolome provides the closest view of the status quo in the cell, studying the metabolome at single-cell resolution may unravel said heterogeneity. A challenge in single-cell metabolome analysis is that metabolites cannot be amplified, so one needs to deal with picolitre volumes and a wide range of analyte concentrations. Due to its high sensitivity and resolution, MS is preferred in single-cell metabolomics. Large numbers of cells need to be analysed for proper statistics; this requires high-throughput analysis, and hence automation of the analytical workflow. Significant advances in (micro)sampling methods, CE and ion mobility spectrometry have been made, some of which have been applied in high-throughput analyses. Microfluidics has enabled automation of cell picking and metabolite extraction; image recognition has enabled automated cell identification. Many techniques have been used for data analysis, varying from conventional techniques to novel combinations of advanced chemometric approaches. Steps have been taken towards making data more findable, accessible, interoperable and reusable, but significant opportunities for improvement remain. Herein, advances in single-cell analysis workflows and data analysis are discussed, and recommendations are made based on the experimental goal.
|
94
|
Bayer FP, Gander M, Kuster B, The M. CurveCurator: a recalibrated F-statistic to assess, classify, and explore significance of dose-response curves. Nat Commun 2023; 14:7902. [PMID: 38036588 PMCID: PMC10689459 DOI: 10.1038/s41467-023-43696-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Accepted: 11/16/2023] [Indexed: 12/02/2023]
Abstract
Dose-response curves are key metrics in pharmacology and biology to assess phenotypic or molecular actions of bioactive compounds in a quantitative fashion. Yet, it is often unclear whether or not a measured response significantly differs from a curve without regulation, particularly in high-throughput applications or unstable assays. Treating potency and effect size estimates from random and true curves with the same level of confidence can lead to incorrect hypotheses and issues in training machine learning models. Here, we present CurveCurator, an open-source software that provides reliable dose-response characteristics by computing p-values and false discovery rates based on a recalibrated F-statistic and a target-decoy procedure that considers dataset-specific effect size distributions. The application of CurveCurator to three large-scale datasets enables a systematic drug mode of action analysis and demonstrates its scalable utility across several application areas, facilitated by a performant, interactive dashboard for fast data exploration.
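As a rough illustration of what it means to test a measured response against a curve without regulation, the sketch below computes the classical, uncalibrated F-statistic that compares a fitted dose-response curve with a flat mean-only model. It is not CurveCurator's recalibrated statistic or its target-decoy procedure. The function name `f_statistic`, the default parameter counts (four for a log-logistic curve, one for the flat null), and the example numbers are assumptions made for illustration only.

```python
def f_statistic(observed, fitted, n_params_curve=4, n_params_null=1):
    """Classical F-statistic for a fitted curve versus a flat (mean-only) null model.

    observed: measured responses, one per dose
    fitted:   responses predicted by the fitted curve at the same doses
    """
    n = len(observed)
    if n != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    if n <= n_params_curve:
        raise ValueError("need more data points than curve parameters")

    mean = sum(observed) / n
    # Residual sum of squares of the null model (horizontal line at the mean).
    rss_null = sum((y - mean) ** 2 for y in observed)
    # Residual sum of squares of the fitted dose-response curve.
    rss_curve = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    if rss_curve == 0:
        return float("inf")  # perfect fit: the ratio is unbounded

    df_num = n_params_curve - n_params_null  # extra parameters of the curve
    df_den = n - n_params_curve              # residual degrees of freedom
    return ((rss_null - rss_curve) / df_num) / (rss_curve / df_den)


# Hypothetical example: eight doses, response dropping from about 1.0 to about 0.2.
observed = [1.02, 0.98, 1.01, 0.90, 0.60, 0.30, 0.22, 0.19]
fitted = [1.00, 1.00, 0.99, 0.91, 0.61, 0.31, 0.21, 0.20]
print(f_statistic(observed, fitted))
```

A large value indicates that the curve explains the data much better than a horizontal line; turning it into a p-value or false discovery rate is where the recalibration described in the abstract would come in.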
|
95
|
Xu Z, Li Q, Marchionni L, Wang K. PhenoSV: interpretable phenotype-aware model for the prioritization of genes affected by structural variants. Nat Commun 2023; 14:7805. [PMID: 38016949 PMCID: PMC10684511 DOI: 10.1038/s41467-023-43651-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Accepted: 11/15/2023] [Indexed: 11/30/2023]
Abstract
Structural variants (SVs) represent a major source of genetic variation associated with phenotypic diversity and disease susceptibility. While long-read sequencing can discover over 20,000 SVs per human genome, interpreting their functional consequences remains challenging. Existing methods for identifying disease-related SVs focus on deletion/duplication only and cannot prioritize individual genes affected by SVs, especially for noncoding SVs. Here, we introduce PhenoSV, a phenotype-aware machine-learning model that interprets all major types of SVs and genes affected. PhenoSV segments and annotates SVs with diverse genomic features and employs a transformer-based architecture to predict their impacts under a multiple-instance learning framework. With phenotype information, PhenoSV further utilizes gene-phenotype associations to prioritize phenotype-related SVs. Evaluation on extensive human SV datasets covering all SV types demonstrates PhenoSV's superior performance over competing methods. Applications in diseases suggest that PhenoSV can determine disease-related genes from SVs. A web server and a command-line tool for PhenoSV are available at https://phenosv.wglab.org .
|
96
|
Dondi A, Lischetti U, Jacob F, Singer F, Borgsmüller N, Coelho R, Heinzelmann-Schwarz V, Beisel C, Beerenwinkel N. Detection of isoforms and genomic alterations by high-throughput full-length single-cell RNA sequencing in ovarian cancer. Nat Commun 2023; 14:7780. [PMID: 38012143 PMCID: PMC10682465 DOI: 10.1038/s41467-023-43387-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2023] [Accepted: 11/07/2023] [Indexed: 11/29/2023]
Abstract
Understanding the complex background of cancer requires genotype-phenotype information at single-cell resolution. Here, we perform long-read single-cell RNA sequencing (scRNA-seq) on clinical samples from three ovarian cancer patients presenting with omental metastasis and increase the PacBio sequencing depth to 12,000 reads per cell. Our approach captures 152,000 isoforms, of which over 52,000 were not previously reported. Isoform-level analysis accounting for non-coding isoforms reveals 20% overestimation of protein-coding gene expression on average. We also detect cell type-specific isoform and poly-adenylation site usage in tumor and mesothelial cells, and find that mesothelial cells transition into cancer-associated fibroblasts in the metastasis, partly through the TGF-β/miR-29/Collagen axis. Furthermore, we identify gene fusions, including an experimentally validated IGF2BP2::TESPA1 fusion, which is misclassified as high TESPA1 expression in matched short-read data, and call mutations confirmed by targeted NGS cancer gene panel results. With these findings, we envision long-read scRNA-seq becoming increasingly relevant in oncology and personalized medicine.
|
97
|
Xu H, Wang S, Fang M, Luo S, Chen C, Wan S, Wang R, Tang M, Xue T, Li B, Lin J, Qu K. SPACEL: deep learning-based characterization of spatial transcriptome architectures. Nat Commun 2023; 14:7603. [PMID: 37990022 PMCID: PMC10663563 DOI: 10.1038/s41467-023-43220-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2022] [Accepted: 11/03/2023] [Indexed: 11/23/2023]
Abstract
Spatial transcriptomics (ST) technologies detect mRNA expression in single cells/spots while preserving their two-dimensional (2D) spatial coordinates, allowing researchers to study the spatial distribution of the transcriptome in tissues; however, joint analysis of multiple ST slices and aligning them to construct a three-dimensional (3D) stack of the tissue still remain a challenge. Here, we introduce spatial architecture characterization by deep learning (SPACEL) for ST data analysis. SPACEL comprises three modules: Spoint embeds a multiple-layer perceptron with a probabilistic model to deconvolute cell type composition for each spot in a single ST slice; Splane employs a graph convolutional network approach and an adversarial learning algorithm to identify spatial domains that are transcriptomically and spatially coherent across multiple ST slices; and Scube automatically transforms the spatial coordinate systems of consecutive slices and stacks them together to construct a 3D architecture of the tissue. Comparisons against 19 state-of-the-art methods using both simulated and real ST datasets from various tissues and ST technologies demonstrate that SPACEL outperforms the others for cell type deconvolution, for spatial domain identification, and for 3D alignment, thus showcasing SPACEL as a valuable integrated toolkit for ST data processing and analysis.
|
98
|
Xiang X, Lu B, Song D, Li J, Shu K, Pu D. Evaluating the performance of low-frequency variant calling tools for the detection of variants from short-read deep sequencing data. Sci Rep 2023; 13:20444. [PMID: 37993475 PMCID: PMC10665316 DOI: 10.1038/s41598-023-47135-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 11/09/2023] [Indexed: 11/24/2023]
Abstract
Detection of low-frequency variants with high accuracy plays an important role in biomedical research and clinical practice. However, it is challenging to do so with next-generation sequencing (NGS) approaches due to the high error rates of NGS. To accurately distinguish low-level true variants from these errors, many statistical tools for calling low-frequency variants have been proposed, but a systematic performance comparison of these tools has not yet been performed. Here, we evaluated four raw-reads-based variant callers (SiNVICT, outLyzer, Pisces, and LoFreq) and four UMI-based variant callers (DeepSNVMiner, MAGERI, smCounter2, and UMI-VarCal) considering their capability to call single nucleotide variants (SNVs) with allelic frequency as low as 0.025% in deep sequencing data. We analyzed a total of 54 simulated datasets with various sequencing depths and variant allele frequencies (VAFs), two reference datasets, and Horizon Tru-Q sample data. The results showed that the UMI-based callers, except smCounter2, outperformed the raw-reads-based callers regarding detection limit. Sequencing depth had almost no effect on the UMI-based callers but significantly influenced the raw-reads-based callers. Regardless of the sequencing depth, MAGERI showed the fastest analysis, while smCounter2 consistently took the longest to finish the variant calling process. Overall, DeepSNVMiner and UMI-VarCal performed the best, with sensitivity and precision of 88% and 100% for DeepSNVMiner and 84% and 100% for UMI-VarCal. In conclusion, the UMI-based callers, except smCounter2, outperformed the raw-reads-based callers in terms of sensitivity and precision. We recommend using DeepSNVMiner and UMI-VarCal for low-frequency variant detection. The results provide important information regarding future directions for reliable low-frequency variant detection and algorithm development, which is critical in genetics-based medical research and clinical applications.
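The sensitivity and precision figures quoted above follow the usual definitions: sensitivity is the fraction of true variants that a caller reports, and precision is the fraction of reported variants that are true. A minimal sketch of that calculation is given below; the function name `sensitivity_and_precision`, the tuple representation of a variant, and the example calls are hypothetical and are not taken from the study's evaluation code.

```python
def sensitivity_and_precision(called, truth):
    """Return (sensitivity, precision) for called variants against a truth set.

    Variants are hashable tuples, here assumed to be (chrom, pos, ref, alt).
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)  # true variants that were called
    fp = len(called - truth)  # calls that are not in the truth set
    fn = len(truth - called)  # true variants that were missed
    sensitivity = tp / (tp + fn) if truth else float("nan")
    precision = tp / (tp + fp) if called else float("nan")
    return sensitivity, precision


# Hypothetical truth set of five SNVs; the caller finds four and reports no false positives.
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A"),
         ("chr3", 900, "T", "C"), ("chr7", 42, "A", "T")]
called = truth[:4]
print(sensitivity_and_precision(called, truth))  # (0.8, 1.0)
```

Matching on the full (chrom, pos, ref, alt) tuple is one simple convention; a real benchmark would also need to normalize variant representation before comparing.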
|
99
|
Li R, Gibson JM. Predicting Groundwater PFOA Exposure Risks with Bayesian Networks: Empirical Impact of Data Preprocessing on Model Performance. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2023; 57:18329-18338. [PMID: 37594027 DOI: 10.1021/acs.est.3c00348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/19/2023]
Abstract
The plethora of data on PFASs in environmental samples collected in response to growing concern about these chemicals could enable the training of machine-learning models for predicting exposure risks. However, differences in sampling and analysis methods across data sets must be reconciled through data preprocessing, and little information is available about how such manipulations affect the resulting models. This study evaluates how data preprocessing influences machine-learned Bayesian network models of PFOA in groundwater. We link 19 years of PFOA measurements from Minnesota, USA, to publicly available information about potential PFOA sources and factors that may influence their environmental fate. Nine different preprocessing methods were tested, and the resulting data sets were used to train models to predict the probability of PFOA ≥ 35 ppt, the 2017 Minnesota health advisory level. Different preprocessing approaches produced varying model structures with significantly different accuracies. Nonetheless, models showed similar relationships between predictor variables and PFOA exposure risks, and all models were relatively accurate, distinguishing wells at high risk from those at low risk for 82.0% to 89.0% of test data samples. There was a trade-off between data quality and model performance since a stricter data screening strategy decreased the sample size for model training.
|
100
|
Delamare-Deboutteville J, Meemetta W, Pimsannil K, Sangpo P, Gan HM, Mohan CV, Dong HT, Senapin S. A multiplexed RT-PCR assay for nanopore whole genome sequencing of Tilapia lake virus (TiLV). Sci Rep 2023; 13:20276. [PMID: 37985860 PMCID: PMC10661697 DOI: 10.1038/s41598-023-47425-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 11/14/2023] [Indexed: 11/22/2023]
Abstract
Tilapia lake virus (TiLV) is a highly contagious viral pathogen that affects tilapia, a globally significant and affordable source of fish protein. To prevent the introduction and spread of TiLV and limit its impact, there is an urgent need for increased surveillance, improved biosecurity measures, and continuous development of effective diagnostic and rapid sequencing methods. In this study, we have developed a multiplexed RT-PCR assay that can amplify all ten complete genomic segments of TiLV from various sources of isolation. The amplicons generated using this approach were immediately subjected to real-time sequencing on the Nanopore system. By using this approach, we have recovered and assembled 10 TiLV genomes from total RNA extracted from naturally TiLV-infected tilapia fish, concentrated tilapia rearing water, and cell culture. Our phylogenetic analysis, covering more than 36 TiLV genomes, both newly sequenced and publicly available, provides new insights into the high genetic diversity of TiLV. This work is an essential stepping stone towards integrating rapid and real-time Nanopore-based amplicon sequencing into routine genomic surveillance of TiLV, as well as future vaccine development.
|