101
|
Nur-A-Alam M, Nasir MK, Ahsan M, Based MA, Haider J, Kowalski M. Ensemble classification of integrated CT scan datasets in detecting COVID-19 using feature fusion from contourlet transform and CNN. Sci Rep 2023; 13:20063. [PMID: 37973820 PMCID: PMC10654719 DOI: 10.1038/s41598-023-47183-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Accepted: 11/09/2023] [Indexed: 11/19/2023] Open
Abstract
The COVID-19 disease caused by coronavirus is constantly changing due to the emergence of different variants, and thousands of people are dying every day worldwide. Early detection of this new form of pulmonary disease can reduce the mortality rate. In this paper, an automated method based on machine learning (ML) and deep learning (DL) has been developed to detect COVID-19 using computed tomography (CT) scan images extracted from three publicly available datasets (a total of 11,407 images: 7397 COVID-19 images and 4010 normal images). An unsupervised clustering approach, a modified region-based clustering technique, has been proposed for segmenting COVID-19 CT scan images. Furthermore, the contourlet transform and a convolutional neural network (CNN) have been employed to extract features individually from the segmented CT scan images and to fuse them into one feature vector. A binary differential evolution (BDE) approach has been employed as a feature optimization technique to obtain comprehensible features from the fused feature vector. Finally, an ML/DL-based ensemble classifier using the bagging technique has been employed to detect COVID-19 from the CT images. Fivefold and generalization cross-validation techniques have been used for validation. Classification experiments have also been conducted with several pre-trained models (AlexNet, ResNet50, GoogleNet, VGG16, VGG19), and the ensemble classifier with fused features was found to provide state-of-the-art performance with an accuracy of 99.98%.
|
102
|
Darevsky DM, Hu DA, Gomez FA, Davies MR, Liu X, Feeley BT. Algorithmic assessment of shoulder function using smartphone video capture and machine learning. Sci Rep 2023; 13:19986. [PMID: 37968288 PMCID: PMC10652003 DOI: 10.1038/s41598-023-46966-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 11/07/2023] [Indexed: 11/17/2023] Open
Abstract
Tears within the stabilizing muscles of the shoulder, known as the rotator cuff (RC), are the most common cause of shoulder pain, often presenting in older patients and requiring expensive advanced imaging for diagnosis. Despite the high prevalence of RC tears within the elderly population, there is no previously published work examining shoulder kinematics using markerless motion capture in the context of shoulder injury. Here we show that a simple string pulling behavior task, where subjects pull a string using hand-over-hand motions, provides a reliable readout of shoulder mobility across animals and humans. We find that both mice and humans with RC tears exhibit decreased movement amplitude, prolonged movement time, and quantitative changes in waveform shape during string pulling task performance. In rodents, we further note the degradation of low dimensional, temporally coordinated movements after injury. Furthermore, a logistic regression model built on our biomarker ensemble succeeds in classifying human patients as having an RC tear with > 90% accuracy. Our results demonstrate how a combined framework bridging animal models, motion capture, convolutional neural networks, and algorithmic assessment of movement quality enables future research into the development of smartphone-based, at-home diagnostic tests for shoulder injury.
|
103
|
Joo SH, Song JW, Shin K, Kim MJ, Lee J, Song YW. Knee osteoarthritis with a high grade of Kellgren-Lawrence score is associated with a worse frailty status, KNHANES 2010-2013. Sci Rep 2023; 13:19714. [PMID: 37953320 PMCID: PMC10641064 DOI: 10.1038/s41598-023-46558-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 11/02/2023] [Indexed: 11/14/2023] Open
Abstract
Frailty as a syndrome of physical decline in late life is associated with adverse health outcomes. Knee osteoarthritis (KOA) could contribute to frailty conditions. The objective of this study was to evaluate the impact of KOA on frailty risk in a Korean National Health and Nutrition Examination Survey (KNHANES) cohort. In this study (N = 11,910; mean age 64.10 years [95% CI 63.94-64.27]; female 6,752 [56.69%]), KOA patients were defined as those aged 40 years or older in the Korean population data of KNHANES with knee joint pain and Kellgren-Lawrence (K-L) grade 2 or more on plain radiographic images. The frailty index was calculated using 46 items related to co-morbidities and laboratory parameters. The impact of KOA on frailty risk was evaluated with logistic regression analyses. The prevalence of KOA was 35.6% [95% CI 34.7-36.46]. In polytomous logistic regression, the relative risk ratio (RRR) of KOA was significantly increased in the pre-frail group (2.76, 95% CI 2.30-3.31) and the frail group (7.28, 95% CI 5.90-8.98). RRR of frailty was significantly increased in patients with K-L grade 3 (1.36, 95% CI 1.13-1.63) and K-L grade 4 (2.19, 95% CI 1.72-2.79). Older age, higher BMI, smoking status, alcohol intake, low-income status, higher WBC count, higher platelet count, higher serum creatinine level and low estimated GFR were significantly associated with increased frailty risk. High hemoglobin and regular walking habits were associated with decreased frailty risk in KOA patients. In this large observational population-based survey cohort, KOA is linked to an increased risk of frailty syndrome. We found a significant connection between KOA and frailty syndrome. These results show that we need to think about the overall health of people with KOA and give them special care to prevent frailty syndrome.
|
104
|
Hou W, Ji Z, Chen Z, Wherry EJ, Hicks SC, Ji H. A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples. Nat Commun 2023; 14:7286. [PMID: 37949861 PMCID: PMC10638410 DOI: 10.1038/s41467-023-42841-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/11/2021] [Accepted: 10/24/2023] [Indexed: 11/12/2023] Open
Abstract
Pseudotime analysis with single-cell RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many methods have been developed to infer the pseudotemporal trajectories of cells within a biological sample, it remains a challenge to compare pseudotemporal patterns with multiple samples (or replicates) across different experimental conditions. Here, we introduce Lamian, a comprehensive and statistically-rigorous computational framework for differential multi-sample pseudotime analysis. Lamian can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions while adjusting for batch effects, and to detect changes in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that are not generalizable to new samples. Using both real scRNA-seq and simulation data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.
|
105
|
Yehudi Y, Hughes-Noehrer L, Goble C, Jay C. Subjective data models in bioinformatics and how wet lab and computational biologists conceptualise data. Sci Data 2023; 10:756. [PMID: 37919302 PMCID: PMC10622411 DOI: 10.1038/s41597-023-02627-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 10/09/2023] [Indexed: 11/04/2023] Open
Abstract
Biological science produces "big data" in varied formats, which necessitates using computational tools to process, integrate, and analyse data. Researchers using computational biology tools range from those using computers for communication, to those writing analysis code. We examine differences in how researchers conceptualise the same data, which we call "subjective data models". We interviewed 22 people with biological experience and varied levels of computational experience, and found that many had fluid subjective data models that changed depending on circumstance. Surprisingly, results did not cluster around participants' computational experience levels. People did not consistently map entities from abstract data models to the real-world entities in files, and certain data identifier formats were easier to infer meaning from than others. Real-world implications: 1) software engineers should design interfaces for task performance, emulating popular user interfaces, rather than targeting professional backgrounds; 2) when insufficient context is provided, people may guess what data means, whether or not they are correct, emphasising the importance of contextual metadata to remove the need for erroneous guesswork.
|
106
|
Karin J, Bornfeld Y, Nitzan M. scPrisma infers, filters and enhances topological signals in single-cell data using spectral template matching. Nat Biotechnol 2023; 41:1645-1654. [PMID: 36849830 PMCID: PMC10635821 DOI: 10.1038/s41587-023-01663-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/07/2022] [Accepted: 01/06/2023] [Indexed: 03/01/2023]
Abstract
Single-cell RNA sequencing has been instrumental in uncovering cellular spatiotemporal context. This task is challenging as cells simultaneously encode multiple, potentially cross-interfering, biological signals. Here we propose scPrisma, a spectral computational method that uses topological priors to decouple, enhance and filter different classes of biological processes in single-cell data, such as periodic and linear signals. We apply scPrisma to the analysis of the cell cycle in HeLa cells, circadian rhythm and spatial zonation in liver lobules, diurnal cycle in Chlamydomonas and circadian rhythm in the suprachiasmatic nucleus in the brain. scPrisma can be used to distinguish mixed cellular populations by specific characteristics such as cell type and uncover regulatory networks and cell-cell interactions specific to predefined biological signals, such as the circadian rhythm. We show scPrisma's flexibility in incorporating prior knowledge, inference of topologically informative genes and generalization to additional diverse templates and systems. scPrisma can be used as a stand-alone workflow for signal analysis and as a prior step for downstream single-cell analysis.
|
107
|
Dann E, Cujba AM, Oliver AJ, Meyer KB, Teichmann SA, Marioni JC. Precise identification of cell states altered in disease using healthy single-cell references. Nat Genet 2023; 55:1998-2008. [PMID: 37828140 PMCID: PMC10632138 DOI: 10.1038/s41588-023-01523-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2022] [Accepted: 09/05/2023] [Indexed: 10/14/2023]
Abstract
Joint analysis of single-cell genomics data from diseased tissues and a healthy reference can reveal altered cell states. We investigate whether integrated collections of data from healthy individuals (cell atlases) are suitable references for disease-state identification and whether matched control samples are needed to minimize false discoveries. We demonstrate that using a reference atlas for latent space learning followed by differential analysis against matched controls leads to improved identification of disease-associated cells, especially with multiple perturbed cell types. Additionally, when an atlas is available, reducing control sample numbers does not increase false discovery rates. Jointly analyzing data from a COVID-19 cohort and a blood cell atlas, we improve detection of infection-related cell states linked to distinct clinical severities. Similarly, we studied disease states in pulmonary fibrosis using a healthy lung atlas, characterizing two distinct aberrant basal states. Our analysis provides guidelines for designing disease cohort studies and optimizing cell atlas use.
|
108
|
Nikopoulou C, Kleinenkuhnen N, Parekh S, Sandoval T, Ziegenhain C, Schneider F, Giavalisco P, Donahue KF, Vesting AJ, Kirchner M, Bozukova M, Vossen C, Altmüller J, Wunderlich T, Sandberg R, Kondylis V, Tresch A, Tessarz P. Spatial and single-cell profiling of the metabolome, transcriptome and epigenome of the aging mouse liver. NATURE AGING 2023; 3:1430-1445. [PMID: 37946043 PMCID: PMC10645594 DOI: 10.1038/s43587-023-00513-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Accepted: 09/27/2023] [Indexed: 11/12/2023]
Abstract
Tissues within an organism and even cell types within a tissue can age with different velocities. However, it is unclear whether cells of one type experience different aging trajectories within a tissue depending on their spatial location. Here, we used spatial transcriptomics in combination with single-cell ATAC-seq and RNA-seq, lipidomics and functional assays to address how cells in the male murine liver are affected by age-related changes in the microenvironment. Integration of the datasets revealed zonation-specific and age-related changes in metabolic states, the epigenome and transcriptome. The epigenome changed in a zonation-dependent manner and functionally, periportal hepatocytes were characterized by decreased mitochondrial fitness, whereas pericentral hepatocytes accumulated large lipid droplets. Together, we provide evidence that changing microenvironments within a tissue exert strong influences on their resident cells that can shape epigenetic, metabolic and phenotypic outputs.
|
109
|
Zhang S, Vasudevan S, Tan SPH, Olivo M. Fiber optic probe-based ATR-FTIR spectroscopy for rapid breast cancer detection: A pilot study. JOURNAL OF BIOPHOTONICS 2023; 16:e202300199. [PMID: 37496212 DOI: 10.1002/jbio.202300199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Revised: 07/11/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023]
Abstract
Breast cancer diagnosis is crucial for timely treatment and improved outcomes. This paper proposes a novel approach for rapid breast cancer diagnosis using optical fiber probe-based attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectroscopy from 750 to 4000 cm⁻¹. The technique enables direct analysis of tissue samples, eliminating the need for microtome sectioning and staining, thus saving time and resources. By capturing molecular fingerprint information, various machine-learning models were used to analyze the spectroscopic data to classify cancerous and non-cancerous tissues accurately. Comparing deparaffinized and paraffinized samples reveals the impact of sample preparation and experimental methods. The study demonstrates a strong correlation between the cancerous nature of a sample and its ATR-FTIR spectrum, suggesting its potential for breast cancer diagnosis (sensitivity of 74.2% and specificity of 78.3%). The proposed approach holds promise for integration into clinical operations, providing a rapid method for preliminary breast cancer diagnosis.
|
110
|
Blanco-Míguez A, Beghini F, Cumbo F, McIver LJ, Thompson KN, Zolfo M, Manghi P, Dubois L, Huang KD, Thomas AM, Nickols WA, Piccinno G, Piperni E, Punčochář M, Valles-Colomer M, Tett A, Giordano F, Davies R, Wolf J, Berry SE, Spector TD, Franzosa EA, Pasolli E, Asnicar F, Huttenhower C, Segata N. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat Biotechnol 2023; 41:1633-1644. [PMID: 36823356 PMCID: PMC10635831 DOI: 10.1038/s41587-023-01688-w] [Citation(s) in RCA: 109] [Impact Index Per Article: 109.0] [Received: 06/07/2022] [Accepted: 01/20/2023] [Indexed: 02/25/2023]
Abstract
Metagenomic assembly enables new organism discovery from microbial communities, but it can only capture few abundant organisms from most metagenomes. Here we present MetaPhlAn 4, which integrates information from metagenome assemblies and microbial isolate genomes for more comprehensive metagenomic taxonomic profiling. From a curated collection of 1.01 M prokaryotic reference and metagenome-assembled genomes, we define unique marker genes for 26,970 species-level genome bins, 4,992 of them taxonomically unidentified at the species level. MetaPhlAn 4 explains ~20% more reads in most international human gut microbiomes and >40% in less-characterized environments such as the rumen microbiome and proves more accurate than available alternatives on synthetic evaluations while also reliably quantifying organisms with no cultured isolates. Application of the method to >24,500 metagenomes highlights previously undetected species to be strong biomarkers for host conditions and lifestyles in human and mouse microbiomes and shows that even previously uncharacterized species can be genetically profiled at the resolution of single microbial strains.
|
111
|
Cavinato L, Massi MC, Sollini M, Kirienko M, Ieva F. Dual adversarial deconfounding autoencoder for joint batch-effects removal from multi-center and multi-scanner radiomics data. Sci Rep 2023; 13:18857. [PMID: 37914758 PMCID: PMC10620174 DOI: 10.1038/s41598-023-45983-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 10/26/2023] [Indexed: 11/03/2023] Open
Abstract
Medical imaging represents the primary tool for investigating and monitoring several diseases, including cancer. The advances in quantitative image analysis have developed towards the extraction of biomarkers able to support clinical decisions. To produce robust results, multi-center studies are often set up. However, the imaging information must be denoised from confounding factors (known as batch effects), such as scanner-specific and center-specific influences. Moreover, in non-solid cancers, like lymphomas, effective biomarkers require an imaging-based representation of the disease that accounts for its multi-site spreading over the patient's body. In this work, we address the dual-factor deconfusion problem and we propose a deconfusion algorithm to harmonize the imaging information of patients affected by Hodgkin Lymphoma in a multi-center setting. We show that the proposed model successfully denoises data from domain-specific variability (p-value < 0.001) while it coherently preserves the spatial relationship between imaging descriptions of peer lesions (p-value = 0), which is a strong prognostic biomarker for tumor heterogeneity assessment. This harmonization step significantly improves the performance of prognostic models with respect to state-of-the-art methods, enabling the building of exhaustive patient representations and delivering more accurate analyses (p-values < 0.001 in training, p-values < 0.05 in testing). This work lays the groundwork for performing large-scale and reproducible analyses on multi-center data that are urgently needed to support the translation of imaging-based biomarkers into clinical practice as effective prognostic tools. The code is available on GitHub at https://github.com/LaraCavinato/Dual-ADAE.
|
112
|
Wu H, Lu Y, Duan Z, Wu J, Lin M, Wu Y, Han S, Li T, Fan Y, Hu X, Xiao H, Feng J, Lu Z, Kong D, Li S. Nanopore long-read RNA sequencing reveals functional alternative splicing variants in human vascular smooth muscle cells. Commun Biol 2023; 6:1104. [PMID: 37907652 PMCID: PMC10618188 DOI: 10.1038/s42003-023-05481-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Accepted: 10/18/2023] [Indexed: 11/02/2023] Open
Abstract
Vascular smooth muscle cells (VSMCs) are the major contributor to vascular repair and remodeling and show a high level of phenotypic plasticity. Abnormalities in VSMC plasticity can lead to multiple cardiovascular diseases, wherein alternative splicing plays important roles. However, alternative splicing variants in VSMC plasticity are not fully understood. Here we systematically characterized the long-read transcriptome and its dysregulation in human aortic smooth muscle cells (HASMCs) by employing Oxford Nanopore Technologies long-read RNA sequencing in HASMCs that were separately treated with platelet-derived growth factor, transforming growth factor, and hsa-miR-221-3P transfection. Our analysis reveals frequent alternative splicing events and thousands of unannotated transcripts generated from alternative splicing. HASMCs treated with different factors exhibit distinct transcriptional reprogramming modulated by alternative splicing. We also found that unannotated transcripts produce different open reading frames compared to the annotated transcripts. Finally, we experimentally validated the unannotated transcript derived from gene CISD1, namely CISD1-u, which plays a role in the phenotypic switch of HASMCs. Our study characterizes the phenotypic modulation of HASMCs from the perspective of the long-read transcriptome, which would promote the understanding and the manipulation of HASMC plasticity in cardiovascular diseases.
|
113
|
Nolte P, Brettmacher M, Gröger CJ, Gellhaus T, Svetlove A, Schilling AF, Alves F, Rußmann C, Dullin C. Spatial correlation of 2D hard-tissue histology with 3D microCT scans through 3D printed phantoms. Sci Rep 2023; 13:18479. [PMID: 37898676 PMCID: PMC10613209 DOI: 10.1038/s41598-023-45518-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Accepted: 10/20/2023] [Indexed: 10/30/2023] Open
Abstract
Hard-tissue histology (the analysis of thin two-dimensional (2D) sections) is hampered by the opaque nature of most biological specimens, especially bone. Therefore, the cutting process cannot be directed to regions of interest. In addition, the applied cutting-grinding method is characterized by significant material loss. As a result, relevant structures might be missed or destroyed, and 3D features can hardly be evaluated. Here, we present a novel workflow, based on conventional microCT scans of the specimen prior to the cutting process, to be used for the analysis of 3D structural features and for directing the sectioning process to the regions of interest. 3D printed fiducial markers, embedded together with the specimen in resin, are utilized to retrospectively register the obtained 2D histological images into the 3D anatomical context. This not only allows the cutting position to be identified, but also enables the co-registration of the cell and extracellular matrix morphological analysis to local 3D information obtained from the microCT data. We have successfully applied our new approach to assess hard-tissue specimens of different species. After matching the predicted microCT cut plane with the histology image, we validated the high accuracy of the registration process by computing quality measures, namely the Jaccard and Dice similarity coefficients, achieving average scores of 0.90 ± 0.04 and 0.95 ± 0.02, respectively. Thus, we believe that the novel, easy-to-implement correlative imaging approach holds great potential for improving the reliability and diagnostic power of classical hard-tissue histology.
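As an illustration of the two overlap measures named in this abstract, the following minimal sketch computes the Jaccard and Dice coefficients for a pair of binary masks. The function name, the nested-list mask format, and the toy masks are assumptions made for this example; they are not taken from the paper, and the authors' own registration pipeline is not reproduced here.

```python
def jaccard_and_dice(mask_a, mask_b):
    """Return (Jaccard, Dice) for two equally sized binary masks.

    Masks are nested lists of 0/1 values (an assumed, illustrative format).
    """
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(a and b for a, b in zip(flat_a, flat_b))
    size_a = sum(flat_a)
    size_b = sum(flat_b)
    union = size_a + size_b - intersection

    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0

    jaccard = intersection / union
    dice = 2 * intersection / (size_a + size_b)
    return jaccard, dice


# Toy masks standing in for a predicted cut plane and a histology section.
predicted_plane = [[0, 1, 1],
                   [0, 1, 1],
                   [0, 0, 0]]
histology_mask = [[0, 1, 1],
                  [0, 1, 0],
                  [0, 0, 0]]

jaccard, dice = jaccard_and_dice(predicted_plane, histology_mask)
print(f"Jaccard = {jaccard:.3f}, Dice = {dice:.3f}")
```

For a single pair of masks the two measures are related by Dice = 2J / (1 + J), so Dice is never smaller than Jaccard, which is consistent with the ordering of the two average scores reported above.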
|
114
|
Barbour RL, Graber HL. Hemoglobin signal network mapping reveals novel indicators for precision medicine. Sci Rep 2023; 13:18257. [PMID: 37880310 PMCID: PMC10600136 DOI: 10.1038/s41598-023-43694-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Accepted: 09/27/2023] [Indexed: 10/27/2023] Open
Abstract
Precision medicine currently relies on a mix of deep phenotyping strategies to guide more individualized healthcare. Despite being widely available and information-rich, physiological time-series measures are often overlooked as a resource to extend insights gained from such measures. Here we have explored resting-state hemoglobin measures applied to intact whole breasts for two subject groups - women with confirmed breast cancer and control subjects - with the goal of achieving a more detailed assessment of the cancer phenotype from a non-invasive measure. Invoked is a novel ordinal partition network method applied to multivariate measures that generates a Markov chain, thereby providing access to quantitative descriptions of short-term dynamics in the form of several classes of adjacency matrices. Exploration of these and their associated co-dependent behaviors unexpectedly reveals features of structured dynamics, some of which are shown to exhibit enzyme-like behaviors and sensitivity to recognized molecular markers of disease. Thus, findings obtained strongly indicate that despite the use of a macroscale sensing method, features more typical of molecular-cellular processes can be identified. Discussed are factors unique to our approach that favor a deeper depiction of tissue phenotypes, its extension to other forms of physiological time-series measures, and its expected utility to advance goals of precision medicine.
|
115
|
Hutchings C, Dawson CS, Krueger T, Lilley KS, Breckels LM. A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data. F1000Res 2023; 12:1402. [PMID: 38021401 PMCID: PMC10683783 DOI: 10.12688/f1000research.139116.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/15/2023] [Indexed: 12/01/2023] Open
Abstract
Background: Expression proteomics involves the global evaluation of protein abundances within a system. In turn, differential expression analysis can be used to investigate changes in protein abundance upon perturbation to such a system. Methods: Here, we provide a workflow for the processing, analysis and interpretation of quantitative mass spectrometry-based expression proteomics data. This workflow utilizes open-source R software packages from the Bioconductor project and guides users end-to-end and step-by-step through every stage of the analyses. As a use-case we generated expression proteomics data from HEK293 cells with and without a treatment. Of note, the experiment included cellular proteins labelled using tandem mass tag (TMT) technology and secreted proteins quantified using label-free quantitation (LFQ). Results: The workflow explains the software infrastructure before focusing on data import, pre-processing and quality control. This is done individually for TMT and LFQ datasets. The application of statistical differential expression analysis is demonstrated, followed by interpretation via gene ontology enrichment analysis. Conclusions: A comprehensive workflow for the processing, analysis and interpretation of expression proteomics is presented. The workflow is a valuable resource for the proteomics community, and specifically for beginners who are at least familiar with R and who wish to understand and make data-driven decisions with regard to their analyses.
|
116
|
Ziaei Jam H, Li Y, DeVito R, Mousavi N, Ma N, Lujumba I, Adam Y, Maksimov M, Huang B, Dolzhenko E, Qiu Y, Kakembo FE, Joseph H, Onyido B, Adeyemi J, Bakhtiari M, Park J, Javadzadeh S, Jjingo D, Adebiyi E, Bafna V, Gymrek M. A deep population reference panel of tandem repeat variation. Nat Commun 2023; 14:6711. [PMID: 37872149 PMCID: PMC10593948 DOI: 10.1038/s41467-023-42278-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 03/10/2023] [Accepted: 10/05/2023] [Indexed: 10/25/2023] Open
Abstract
Tandem repeats (TRs) represent one of the largest sources of genetic variation in humans and are implicated in a range of phenotypes. Here we present a deep characterization of TR variation based on high coverage whole genome sequencing from 3550 diverse individuals from the 1000 Genomes Project and H3Africa cohorts. We develop a method, EnsembleTR, to integrate genotypes from four separate methods resulting in high-quality genotypes at more than 1.7 million TR loci. Our catalog reveals novel sequence features influencing TR heterozygosity, identifies population-specific trinucleotide expansions, and finds hundreds of novel eQTL signals. Finally, we generate a phased haplotype panel which can be used to impute most TRs from nearby single nucleotide polymorphisms (SNPs) with high accuracy. Overall, the TR genotypes and reference haplotype panel generated here will serve as valuable resources for future genome-wide and population-wide studies of TRs and their role in human phenotypes.
|
117
|
Emara TZ, Trinh T, Huang JZ. Geographically distributed data management to support large-scale data analysis. Sci Rep 2023; 13:17783. [PMID: 37853092 PMCID: PMC10584813 DOI: 10.1038/s41598-023-44789-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 10/12/2023] [Indexed: 10/20/2023] Open
Abstract
Nowadays, several companies prefer storing their data in multiple data centers with replication for many reasons. Data that spans various data centers ensures the fastest possible response time for customers and workforces who are geographically separated. It also protects the information from loss in case a single data center experiences a disaster. However, the amount of data is increasing at a rapid pace, which leads to challenges in storage, analysis, and various processing tasks. In this paper, we propose and design a geographically distributed data management framework to manage the massive data stored and distributed among geo-distributed data centers. The goal of the proposed framework is to enable efficient use of the distributed data blocks for various data analysis tasks. The architecture of the proposed framework is composed of a grid of geo-distributed data centers connected to a data controller (DCtrl). The DCtrl is responsible for organizing and managing the block replicas across the geo-distributed data centers. We use the BDMS system as the installed system on the distributed data centers. BDMS stores the big data file as a set of random sample data blocks, each being a random sample of the whole data file. Then, DCtrl distributes these data blocks into multiple data centers with replication. In analyzing a big data file distributed based on the proposed framework, we randomly select, on any data center, a sample of data blocks replicated from other data centers. We use simulation results to demonstrate the performance of the proposed framework in big data analysis across geo-distributed data centers.
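The block-based idea described in this abstract can be sketched roughly as follows: shuffle the records once so that every block is itself a random sample of the whole file, assign each block to several data centers, and then analyse only a few randomly chosen blocks. This is a toy sketch under stated assumptions, not the BDMS or DCtrl implementation; the function names, the data center labels, the replication factor, and the use of a mean as the analysis task are all hypothetical.

```python
import random


def make_random_sample_blocks(records, num_blocks, seed=0):
    """Shuffle once, then deal records round-robin so each block is a random sample."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def place_replicas(num_blocks, data_centers, replicas, seed=0):
    """Assign each block id to `replicas` distinct data centers (hypothetical policy)."""
    rng = random.Random(seed)
    return {block_id: rng.sample(data_centers, replicas)
            for block_id in range(num_blocks)}


# Toy "big data file": the integers 0..999.
records = list(range(1000))
blocks = make_random_sample_blocks(records, num_blocks=10)

# Hypothetical data center names and replication factor.
placement = place_replicas(len(blocks), ["dc-a", "dc-b", "dc-c"], replicas=2)

# Analyse only a random subset of blocks instead of the whole file.
sampled_ids = random.Random(1).sample(range(len(blocks)), 3)
sampled_values = [value for block_id in sampled_ids for value in blocks[block_id]]
estimate = sum(sampled_values) / len(sampled_values)

print("blocks analysed:", sampled_ids)
print("estimated mean:", estimate, "| true mean:", sum(records) / len(records))
```

Because each block is drawn from a single global shuffle, a statistic computed on a handful of blocks approximates the statistic on the full file, which is the property that makes block-level sampling across data centers useful.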
|
118
|
Papež J, Labounek R, Jabandžiev P, Česká K, Slabá K, Ošlejšková H, Aulická Š, Nestrašil I. Multivariate linear mixture models for the prediction of febrile seizure risk and recurrence: a prospective case-control study. Sci Rep 2023; 13:17372. [PMID: 37833343 PMCID: PMC10576023 DOI: 10.1038/s41598-023-43599-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/20/2023] [Accepted: 09/26/2023] [Indexed: 10/15/2023] Open
Abstract
Our goal was to identify highly accurate empirical models for the prediction of the risk of febrile seizure (FS) and FS recurrence. In a prospective, three-arm, case-control study, we enrolled 162 children (age 25.8 ± 17.1 months old, 71 females). Participants formed one case group (patients with FS) and two control groups (febrile patients without seizures and healthy controls). The impact of blood iron status, peak body temperature, and participants' demographics on FS risk and recurrence was investigated with univariate and multivariate statistics. Serum iron concentration, iron saturation, and unsaturated iron-binding capacity differed between the three investigated groups (pFWE < 0.05). These serum analytes were key variables in the design of novel multivariate linear mixture models. The models classified FS risk with higher accuracy than univariate approaches. The designed bi-linear classifier achieved a sensitivity/specificity of 82%/89% and was closest to the gold-standard classifier. A multivariate model assessing FS recurrence provided a difference (pFWE < 0.05) with a separating sensitivity/specificity of 72%/69%. Iron deficiency, height percentile, and age were significant FS risk factors. In addition, height percentile and hemoglobin concentration were linked to FS recurrence. Novel multivariate models utilizing blood iron status and demographic variables predicted FS risk and recurrence among infants and young children with fever.
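For readers unfamiliar with the two metrics quoted in this abstract, the following Python sketch shows how sensitivity and specificity are computed from confusion-matrix counts. The counts in the usage example are illustrative assumptions chosen to give round numbers; they are not taken from the study.

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual positives classified as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives classified as negative."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not the study's data).
    tp, fn = 41, 9    # cases correctly flagged / cases missed
    tn, fp = 89, 11   # controls correctly cleared / controls wrongly flagged
    print(f"sensitivity = {sensitivity(tp, fn):.2f}")
    print(f"specificity = {specificity(tn, fp):.2f}")
```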
|
119
|
Zhang Z, Gautam A, Lim SM, Hilty C. Analysis of Large Data Sets in a Physical Chemistry Laboratory NMR Experiment Using Python. Journal of Chemical Education 2023; 100:4109-4113. [PMID: 38357475 PMCID: PMC10862468 DOI: 10.1021/acs.jchemed.3c00586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2023] [Revised: 09/01/2023] [Indexed: 02/16/2024]
Abstract
We describe an update to an experiment demonstrating low-field NMR spectroscopy in the undergraduate physical chemistry laboratory. A Python-based data processing and analysis protocol is developed for this experiment. The Python language is used in fillable worksheets in the notebook software JupyterLab, providing an interactive means for students to work with the measured data step by step. The protocol teaches methods for the analysis of large data sets in science or engineering, a topic that is absent from traditional chemistry curricula. Python is among the most widely used modern tools for data analysis. In addition, its open-source nature reduces the barriers for adoption in an educational laboratory.
|
120
|
Fan A, Huang Y, Xu F, Bom S. Soft-Sensing Regression Model: From Sensor to Wafer Metrology Forecasting. Sensors (Basel, Switzerland) 2023; 23:8363. [PMID: 37896457 PMCID: PMC10611205 DOI: 10.3390/s23208363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 09/30/2023] [Accepted: 10/06/2023] [Indexed: 10/29/2023]
Abstract
The semiconductor industry is one of the most rapidly evolving and capital-intensive market sectors. Effective inspection and metrology are necessary to improve product yield, increase product quality and reduce costs. In recent years, many types of semiconductor manufacturing equipment have been fitted with sensors to facilitate real-time monitoring of the production processes. These production-state and equipment-state sensor data provide an opportunity to apply machine-learning technologies in various domains, such as anomaly/fault detection, maintenance scheduling and quality prediction. In this work, we focus on the soft-sensing regression problem in metrology systems, which uses sensor data collected during wafer processing steps to predict upcoming inspection measurements that were previously obtained in wafer inspection and metrology systems. We propose a regressor based on a long short-term memory (LSTM) network and devise two distinct loss functions for training the model. Although the assessment of our prediction errors by engineers is subjective, a novel piece-wise evaluation metric is introduced to evaluate model accuracy in a mathematical way. Our experimental results show that the proposed model is capable of achieving both accurate and early prediction across various types of inspections in complicated manufacturing processes.
|
121
|
Brehm W, Triviño J, Krahn JM, Usón I, Diederichs K. XDSGUI: a graphical user interface for XDS, SHELX and ARCIMBOLDO. J Appl Crystallogr 2023; 56:1585-1594. [PMID: 37791359 PMCID: PMC10543682 DOI: 10.1107/s1600576723007057] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/05/2023] [Accepted: 08/08/2023] [Indexed: 10/05/2023] Open
Abstract
XDSGUI is a lightweight graphical user interface (GUI) for the XDS, SHELX and ARCIMBOLDO program packages that serves both novice and experienced users in obtaining optimal processing and phasing results for X-ray, neutron and electron diffraction data. The design of the program enables data processing and phasing without command line usage, and supports advanced command flows in a simple user-modifiable and user-extensible way. The GUI supplies graphical information based on the tabular log output of the programs, which is more intuitive, comprehensible and efficient than text output can be.
|
122
|
Mages S, Moriel N, Avraham-Davidi I, Murray E, Watter J, Chen F, Rozenblatt-Rosen O, Klughammer J, Regev A, Nitzan M. TACCO unifies annotation transfer and decomposition of cell identities for single-cell and spatial omics. Nat Biotechnol 2023; 41:1465-1473. [PMID: 36797494 PMCID: PMC10513360 DOI: 10.1038/s41587-023-01657-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/15/2022] [Accepted: 01/02/2023] [Indexed: 02/18/2023]
Abstract
Transferring annotations of single-cell-, spatial- and multi-omics data is often challenging owing both to technical limitations, such as low spatial resolution or high dropout fraction, and to biological variations, such as continuous spectra of cell states. Based on the concept that these data are often best described as continuous mixtures of cells or molecules, we present a computational framework for the transfer of annotations to cells and their combinations (TACCO), which consists of an optimal transport model extended with different wrappers to annotate a wide variety of data. We apply TACCO to identify cell types and states, decipher spatiomolecular tissue structure at the cell and molecular level and resolve differentiation trajectories using synthetic and biological datasets. While matching or exceeding the accuracy of specialized tools for the individual tasks, TACCO reduces the computational requirements by up to an order of magnitude and scales to larger datasets (for example, considering the runtime of annotation transfer for 1 M simulated dropout observations).
|
123
|
Adebamowo CA, Callier S, Akintola S, Maduka O, Jegede A, Arima C, Ogundiran T, Adebamowo SN. The promise of data science for health research in Africa. Nat Commun 2023; 14:6084. [PMID: 37770478 PMCID: PMC10539491 DOI: 10.1038/s41467-023-41809-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2021] [Accepted: 09/15/2023] [Indexed: 09/30/2023] Open
Abstract
Data science health research promises tremendous benefits for African populations, but its implementation is fraught with substantial ethical governance risks that could thwart the delivery of these anticipated benefits. We discuss emerging efforts to build ethical governance frameworks for data science health research in Africa and the opportunities to advance these through investments by African governments and institutions, international funding organizations and collaborations for research and capacity development.
|
124
|
Saleh H, Amer E, Abuhmed T, Ali A, Al-Fuqaha A, El-Sappagh S. Computer aided progression detection model based on optimized deep LSTM ensemble model and the fusion of multivariate time series data. Sci Rep 2023; 13:16336. [PMID: 37770490 PMCID: PMC10539296 DOI: 10.1038/s41598-023-42796-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Accepted: 09/14/2023] [Indexed: 09/30/2023] Open
Abstract
Alzheimer's disease (AD) is the most common form of dementia. Early and accurate detection of AD is crucial to plan for disease-modifying therapies that could prevent or delay the conversion to severe stages of the disease. Because AD is a chronic disease, a patient's multivariate time series data, including neuroimaging, genetics, cognitive scores, and neuropsychological battery, provide a complete profile of the patient's status. These data have been used to build machine learning and deep learning (DL) models for the early detection of the disease. However, these models still have limited performance and are not stable enough to be trusted in real medical settings. The literature shows that DL models outperform classical machine learning models, and that ensemble learning achieves better results than standalone models. This study proposes a novel deep stacking framework that combines multiple DL models to accurately predict AD at an early stage. The study uses long short-term memory (LSTM) models as base models over patients' multivariate time series data to learn the deep longitudinal features. Each base LSTM classifier has been optimized with a Bayesian optimizer on a different feature set. As a result, the final optimized ensemble model employs heterogeneous base models that are trained on heterogeneous data. The performance of the resulting ensemble model has been explored using a cohort of 685 patients from the University of Washington's National Alzheimer's Coordinating Center dataset. Compared to the classical machine learning models and base LSTM classifiers, the proposed ensemble model achieves the highest testing results (i.e., 82.02, 82.25, 82.02, and 82.12 for accuracy, precision, recall, and F1-score, respectively). The resulting model improves on the performance reported in the state-of-the-art literature, and it could be used to build an accurate clinical decision support tool that can assist domain experts in AD progression detection.
|
125
|
Montemurro A, Povlsen HR, Jessen LE, Nielsen M. Benchmarking data-driven filtering for denoising of TCRpMHC single-cell data. Sci Rep 2023; 13:16147. [PMID: 37752190 PMCID: PMC10522655 DOI: 10.1038/s41598-023-43048-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 09/18/2023] [Indexed: 09/28/2023] Open
Abstract
Pairing of the T cell receptor (TCR) with its cognate peptide-MHC (pMHC) is a cornerstone in T cell-mediated immunity. Recently, single-cell sequencing coupled with DNA-barcoded MHC multimer staining has enabled high-throughput studies of T cell specificities. However, the immense variability of TCR-pMHC interactions, combined with the relatively low signal-to-noise ratio in the data generated using current technologies, complicates these studies. Several approaches have been proposed for denoising single-cell TCR-pMHC specificity data. Here, we present a benchmark evaluating two such denoising methods, ICON and ITRAP. We applied and evaluated the methods on publicly available immune profiling data provided by 10x Genomics. We find that both methods identified approximately 75% of the raw data as noise. We analyzed both internal metrics developed for the purpose and performance on independent data using machine learning methods trained on the raw and denoised 10x data. We find an increased signal-to-noise ratio comparing the denoised to the raw data for both methods, and demonstrate an overall superior performance of the ITRAP method in terms of both data consistency and performance. In conclusion, this study demonstrates that improving the data quality from high-throughput studies of TCR-pMHC specificity by denoising is paramount in increasing our understanding of T cell-mediated immunity.
|