251
|
Morvan M, Jacomo AL, Souque C, Wade MJ, Hoffmann T, Pouwels K, Lilley C, Singer AC, Porter J, Evens NP, Walker DI, Bunce JT, Engeli A, Grimsley J, O'Reilly KM, Danon L. An analysis of 45 large-scale wastewater sites in England to estimate SARS-CoV-2 community prevalence. Nat Commun 2022; 13:4313. [PMID: 35879277 PMCID: PMC9312315 DOI: 10.1038/s41467-022-31753-y] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 08/01/2021] [Accepted: 06/28/2022] [Indexed: 12/23/2022] Open
Abstract
Accurate surveillance of the COVID-19 pandemic can be weakened by under-reporting of cases, particularly due to asymptomatic or pre-symptomatic infections, resulting in bias. Quantification of SARS-CoV-2 RNA in wastewater can be used to infer infection prevalence, but uncertainty in sensitivity and considerable variability have meant that accurate measurement remains elusive. Here, we use data from 45 sewage sites in England, covering 31% of the population, and estimate SARS-CoV-2 prevalence to within 1.1% of estimates from representative prevalence surveys (with 95% confidence). Using machine learning and phenomenological models, we show that differences between sampled sites, particularly the wastewater flow rate, influence prevalence estimation and require careful interpretation. We find that SARS-CoV-2 signals in wastewater appear 4-5 days earlier than in clinical testing data but are coincident with prevalence surveys, suggesting that wastewater surveillance can be a leading indicator for symptomatic viral infections. Surveillance for viruses in wastewater complements and strengthens clinical surveillance, with significant implications for public health.
|
252
|
Huang H, Wang Y, Rudin C, Browne EP. Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization. Commun Biol 2022; 5:719. [PMID: 35853932 PMCID: PMC9296444 DOI: 10.1038/s42003-022-03628-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 09/03/2021] [Accepted: 06/23/2022] [Indexed: 12/11/2022] Open
Abstract
Dimension reduction (DR) algorithms project data from high dimensions to lower dimensions to enable visualization of interesting high-dimensional structure. DR algorithms are widely used for analysis of single-cell transcriptomic data. Despite widespread use of DR algorithms such as t-SNE and UMAP, these algorithms have characteristics that lead to lack of trust: they do not preserve important aspects of high-dimensional structure and are sensitive to arbitrary user choices. Given the importance of gaining insights from DR, DR methods should be evaluated carefully before trusting their results. In this paper, we introduce and perform a systematic evaluation of popular DR methods, including t-SNE, art-SNE, UMAP, PaCMAP, TriMap and ForceAtlas2. Our evaluation considers five components: preservation of local structure, preservation of global structure, sensitivity to parameter choices, sensitivity to preprocessing choices, and computational efficiency. This evaluation can help us to choose DR tools that align with the scientific goals of the user.
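The local-structure component of such an evaluation can be approximated with a neighbourhood-preservation score. The sketch below is an illustrative metric only, using PCA as a stand-in embedder rather than the paper's benchmark suite:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def knn_preservation(X_high, X_low, k=10):
    """Mean fraction of each point's k nearest neighbours in the original
    space that survive in the low-dimensional embedding."""
    idx_high = NearestNeighbors(n_neighbors=k).fit(X_high).kneighbors(
        return_distance=False)  # querying the training set excludes self
    idx_low = NearestNeighbors(n_neighbors=k).fit(X_low).kneighbors(
        return_distance=False)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(idx_high, idx_low)]
    return float(np.mean(overlap))

X = load_iris().data
X_2d = PCA(n_components=2).fit_transform(X)  # stand-in for t-SNE/UMAP/PaCMAP
local_score = knn_preservation(X, X_2d, k=10)
```

The same score computed for several embedders and several `k` values gives a simple, comparable view of local-structure preservation.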
|
253
|
Neishabouri A, Nguyen J, Samuelsson J, Guthrie T, Biggs M, Wyatt J, Cross D, Karas M, Migueles JH, Khan S, Guo CC. Quantification of acceleration as activity counts in ActiGraph wearable. Sci Rep 2022; 12:11958. [PMID: 35831446 PMCID: PMC9279376 DOI: 10.1038/s41598-022-16003-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 03/08/2022] [Accepted: 07/04/2022] [Indexed: 11/09/2022] Open
Abstract
Digital clinical measures based on data collected by wearable devices have seen rapid growth in both clinical trials and healthcare. The most widely used wearable-based measures are epoch-based physical activity counts computed from accelerometer data. Even though activity counts have been the backbone of thousands of clinical and epidemiological studies, there are large variations in the algorithms that compute counts and in their associated parameters, many of which have been kept proprietary by device providers. This lack of transparency has hindered comparability between studies using different devices and limited their broader clinical applicability. ActiGraph devices have been the most widely used wearable accelerometer devices for over two decades. Recognizing the importance of data transparency, interpretability and interoperability to both research and clinical use, we here describe the detailed counts algorithms of five generations of ActiGraph devices, going back to the first AM7164 model, and publish the current counts algorithm in ActiGraph's ActiLife and CentrePoint software as a standalone Python package for research use. We believe that this material will provide a useful resource for the research community, accelerate digital health science and facilitate clinical applications of wearable accelerometry.
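A minimal sketch of an epoch-based counts pipeline is shown below; the filter design, deadband and scale factor are illustrative placeholders, not ActiGraph's published parameters:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def epoch_counts(acc, fs, epoch_s=10, band=(0.25, 2.5), deadband=0.05, scale=128):
    """Simplified activity-count pipeline: band-pass filter, rectify,
    apply a deadband threshold, then sum within fixed-length epochs."""
    b, a = butter(2, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    rectified = np.abs(filtfilt(b, a, acc))
    rectified[rectified < deadband] = 0.0          # suppress small wiggles
    n_epoch = int(epoch_s * fs)                    # samples per epoch
    n = len(rectified) // n_epoch
    per_epoch = rectified[: n * n_epoch].reshape(n, n_epoch).sum(axis=1)
    return (per_epoch * scale).astype(int)

fs = 30.0                                          # Hz
t = np.arange(0, 60, 1 / fs)                       # one minute of data
acc = 0.5 * np.sin(2 * np.pi * 1.0 * t)            # 1 Hz "walking" signal, g units
counts = epoch_counts(acc, fs)                     # six 10-s epochs
```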
|
254
|
Said S, Pazoki R, Karhunen V, Võsa U, Ligthart S, Bodinier B, Koskeridis F, Welsh P, Alizadeh BZ, Chasman DI, Sattar N, Chadeau-Hyam M, Evangelou E, Jarvelin MR, Elliott P, Tzoulaki I, Dehghan A. Author Correction: Genetic analysis of over half a million people characterises C-reactive protein loci. Nat Commun 2022; 13:3865. [PMID: 35790731 PMCID: PMC9256682 DOI: 10.1038/s41467-022-31706-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] Open
|
255
|
Huang Y, Qian X, Wang X, Wang T, Lounder SJ, Ravindran T, Demitrack Z, McCutcheon J, Asatekin A, Li B. Electrospraying Zwitterionic Copolymers as an Effective Biofouling Control for Accurate and Continuous Monitoring of Wastewater Dynamics in a Real-Time and Long-Term Manner. Environ Sci Technol 2022; 56:8176-8186. [PMID: 35576931 DOI: 10.1021/acs.est.2c01501] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/15/2023]
Abstract
Long-term continuous monitoring (LTCM) of water quality can provide high-fidelity datasets essential for executing swift control and enhancing system efficiency. One roadblock for LTCM using solid-state ion-selective electrode (S-ISE) sensors is biofouling on the sensor surface, which perturbs analyte mass transfer and degrades sensor reading accuracy. This study improved the anti-biofouling properties of S-ISE sensors by precisely coating a self-assembled channel-type zwitterionic copolymer, poly(trifluoroethyl methacrylate-random-sulfobetaine methacrylate) (PTFEMA-r-SBMA), onto the sensor surface using electrospray. The PTFEMA-r-SBMA membrane exhibits exceptional permeability and selectivity to primary ions in aqueous solutions. NH4+ S-ISE sensors with this anti-fouling zwitterionic layer were tested in real wastewater for 55 consecutive days, exhibiting sensitivity close to the theoretical value (59.18 mV/dec) and long-term stability (error <4 mg/L). Furthermore, a denoising data processing algorithm (DDPA) was developed to further improve sensor accuracy, reducing the S-ISE sensor error to only 1.2 mg/L after 50 days of real wastewater analysis. Based on dynamic energy cost function and carbon footprint models, LTCM is expected to reduce NH4+ discharge by 44.9%, energy consumption by 12.8%, and greenhouse gas emissions by 26.7% under normal operating conditions. This study unveils an innovative LTCM methodology by integrating advanced materials (anti-fouling layer coating) with sensor data processing (DDPA).
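The DDPA itself is not specified in the abstract; as an assumption-laden stand-in, a sliding-median filter illustrates how post-processing can shrink sensor error in the presence of fouling-like spikes:

```python
import numpy as np

def denoise_median(signal, window=7):
    """Sliding-median denoiser (illustrative stand-in for the paper's DDPA)."""
    pad = window // 2
    padded = np.pad(signal, pad, mode="edge")
    return np.array([np.median(padded[i:i + window]) for i in range(len(signal))])

rng = np.random.default_rng(0)
true_nh4 = np.full(200, 25.0)                    # steady 25 mg/L NH4+ level
noisy = true_nh4 + rng.normal(0, 2.0, 200)       # baseline sensor noise
noisy[::20] += 15.0                              # fouling-like spikes
clean = denoise_median(noisy, window=7)

err_before = float(np.mean(np.abs(noisy - true_nh4)))
err_after = float(np.mean(np.abs(clean - true_nh4)))
```

A median window rejects isolated spikes that a moving average would smear across neighbouring readings.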
|
256
|
Sun H, Poudel S, Vanderwall D, Lee DG, Li Y, Peng J. 29-Plex tandem mass tag mass spectrometry enabling accurate quantification by interference correction. Proteomics 2022; 22:e2100243. [PMID: 35723178 DOI: 10.1002/pmic.202100243] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 01/31/2022] [Revised: 06/14/2022] [Accepted: 06/15/2022] [Indexed: 12/14/2022]
Abstract
Tandem mass tag (TMT) mass spectrometry is a mainstream isobaric chemical labeling strategy for profiling proteomes. Here we present a 29-plex TMT method to combine the 11-plex and 18-plex labeling strategies. The 29-plex method was examined with a pooled sample composed of 1×, 3×, and 10× Escherichia coli peptides with 100× human background peptides, which generated two E. coli datasets (TMT11 and TMT18), displaying the distorted ratios of 1.0:1.7:4.2 and 1.0:1.8:4.9, respectively. This ratio compression from the expected 1:3:10 ratios was caused by co-isolated TMT-labeled ions (i.e., noise). Interestingly, the mixture of two TMT sets produced MS/MS spectra with unique features for the noise detection: (i) in TMT11-labeled spectra, TMT18-specific reporter ions (e.g., 135N) were shown as the noise; (ii) in TMT18-labeled spectra, the TMT11/TMT18-shared reporter ions (e.g., 131C) typically exhibited higher intensities than TMT18-specific reporter ions, due to contaminated TMT11-labeled ions in these shared channels. We further estimated the noise levels contributed by both TMT11- and TMT18-labeled peptides, and corrected reporter ion intensities in every spectrum. Finally, the anticipated 1:3:10 ratios were largely restored. This strategy was also validated using another 29-plex sample with 1:5 ratios. Thus the 29-plex method expands the TMT throughput and enhances the quantitative accuracy.
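The correction idea can be sketched as follows: reporter channels that should be empty in a given spectrum reveal the co-isolation noise level, which is then subtracted from the shared channels. The numbers below are toy values, not the paper's estimator:

```python
import numpy as np

def correct_reporters(shared, empty_channels, n_empty):
    """Estimate co-isolation noise from channels that should be empty and
    subtract the per-channel average from the shared reporter intensities."""
    noise_per_channel = np.sum(empty_channels) / n_empty
    corrected = np.asarray(shared, dtype=float) - noise_per_channel
    return np.clip(corrected, 0.0, None)  # intensities cannot go negative

# Hypothetical TMT11-labelled spectrum with true 1:3:10 ratios plus flat noise
true_signal = np.array([100.0, 300.0, 1000.0])
observed = true_signal + 80.0                       # ratio-compressed readout
empty_tmt18_channels = np.array([78.0, 82.0, 80.0])  # noise-only channels
corrected = correct_reporters(observed, empty_tmt18_channels, 3)
```

After subtraction the expected 1:3:10 ratios are restored, mirroring the decompression the abstract describes.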
|
257
|
Errekagorri I, Castellano J, Los Arcos A, Rico-González M, Pino-Ortega J. Different Sampling Frequencies to Calculate Collective Tactical Variables during Competition: A Case of an Official Female's Soccer Match. Sensors (Basel) 2022; 22:4508. [PMID: 35746288 PMCID: PMC9230581 DOI: 10.3390/s22124508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 06/07/2022] [Accepted: 06/13/2022] [Indexed: 02/05/2023]
Abstract
The objective of the study was to assess the impact of the sampling frequency on the outcomes of collective tactical variables during an official women’s soccer match. To do this, the first half (lasting 46 min) of an official league match of a semi-professional soccer team belonging to the Women’s Second Division of Spain (Reto Iberdrola) was analysed. The collective variables recorded were classified into three main groups: point-related variable (i.e., change in geometrical centre position (cGCp)), distance-related variables (i.e., width, length, height, distance from the goalkeeper to the near defender and mean distance between players), and area-related variables (i.e., surface area). Each variable was measured using eight different sampling frequencies: data every 100 (10 Hz), 200 (5 Hz), 250 (4 Hz), 400 (2.5 Hz), 500 (2 Hz), 1000 (1 Hz), 2000 (0.5 Hz), and 4000 ms (0.25 Hz). With the exception of cGCp, the outcomes of the collective tactical variables did not vary depending on the sampling frequency used (p > 0.05; Effect Size < 0.001). The results suggest that a sampling frequency of 0.5 Hz would be sufficient to measure the collective tactical variables that assess distance and area during an official soccer match.
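Downsampling positional data before computing an area-related variable can be sketched like this (random positions stand in for real tracking data; `ConvexHull.volume` returns the area for 2-D input):

```python
import numpy as np
from scipy.spatial import ConvexHull

def mean_surface_area(positions, step=1):
    """Mean convex-hull area of player positions, downsampled by `step`
    (e.g. step=20 turns 10 Hz data into 0.5 Hz)."""
    frames = positions[::step]
    # For 2-D point sets, ConvexHull.volume is the enclosed area
    return float(np.mean([ConvexHull(f).volume for f in frames]))

rng = np.random.default_rng(1)
# 600 frames (60 s at 10 Hz) of 10 outfield players on a 105 x 68 m pitch
positions = rng.uniform([0, 0], [105, 68], size=(600, 10, 2))

area_10hz = mean_surface_area(positions, step=1)
area_05hz = mean_surface_area(positions, step=20)
```

Comparing the two means mirrors the paper's finding that area-related variables change little when the sampling frequency drops.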
|
258
|
Walzer M, García-Seisdedos D, Prakash A, Brack P, Crowther P, Graham RL, George N, Mohammed S, Moreno P, Papatheodorou I, Hubbard SJ, Vizcaíno JA. Implementing the reuse of public DIA proteomics datasets: from the PRIDE database to Expression Atlas. Sci Data 2022; 9:335. [PMID: 35701420 PMCID: PMC9197839 DOI: 10.1038/s41597-022-01380-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/15/2021] [Accepted: 05/12/2022] [Indexed: 11/14/2022] Open
Abstract
The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Unlike Data Dependent Acquisition datasets, the re-use of DIA datasets has been rather limited to date, despite its high potential, due to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.
|
259
|
Enhancing the REMBRANDT MRI collection with expert segmentation labels and quantitative radiomic features. Sci Data 2022; 9:338. [PMID: 35701399 PMCID: PMC9198015 DOI: 10.1038/s41597-022-01415-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 05/24/2022] [Indexed: 01/26/2023] Open
Abstract
Malignancy of the brain and CNS is unfortunately a common diagnosis. A large subset of these lesions are high-grade tumors which portend poor prognoses and low survival rates, and are estimated to be the tenth leading cause of death worldwide. The complex nature of the brain tissue environment in which these lesions arise offers a rich opportunity for translational research. Magnetic Resonance Imaging (MRI) can provide a comprehensive view of the abnormal regions in the brain; therefore, its applications in translational brain cancer research are considered essential for the diagnosis and monitoring of disease. Recent years have seen rapid growth in the field of radiogenomics, especially in cancer, and scientists have been able to successfully integrate the quantitative data extracted from medical images (also known as radiomics) with genomics to answer new and clinically relevant questions. In this paper, we took raw MRI scans from the public-domain REMBRANDT data collection and performed volumetric segmentation to identify subregions of the brain. Radiomic features were then extracted to represent the MRIs in a quantitative yet summarized format. The resulting dataset enables further biomedical and integrative data analysis, and is being made public via the NeuroImaging Tools & Resources Collaboratory (NITRC) repository ( https://www.nitrc.org/projects/rembrandt_brain/ ).
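As a hedged sketch of the kind of first-order radiomic features such a pipeline extracts from a segmented subregion (synthetic volume here; real pipelines compute many more feature classes):

```python
import numpy as np

def first_order_features(image, mask, bins=32):
    """First-order radiomic features over a segmented subregion."""
    roi = image[mask > 0].astype(float)
    hist, _ = np.histogram(roi, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins before the entropy sum
    return {
        "mean": float(roi.mean()),
        "variance": float(roi.var()),
        "skewness": float(((roi - roi.mean()) ** 3).mean() / roi.std() ** 3),
        "entropy": float(-(p * np.log2(p)).sum()),
    }

rng = np.random.default_rng(0)
vol = rng.normal(100.0, 10.0, (32, 32, 32))      # synthetic "MRI" volume
zz, yy, xx = np.mgrid[:32, :32, :32]
mask = (zz - 16) ** 2 + (yy - 16) ** 2 + (xx - 16) ** 2 < 8 ** 2
vol[mask] += 50.0                                 # hyperintense "lesion"
feats = first_order_features(vol, mask)
```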
|
260
|
Ko PC, Lin PC, Do HT, Huang YF. P2P Lending Default Prediction Based on AI and Statistical Models. Entropy 2022; 24:e24060801. [PMID: 35741522 PMCID: PMC9222552 DOI: 10.3390/e24060801] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/27/2022] [Revised: 06/01/2022] [Accepted: 06/06/2022] [Indexed: 02/01/2023]
Abstract
Peer-to-peer lending (P2P lending) has proliferated in recent years thanks to Fintech and big data advancements. However, P2P lending platforms are not tightly governed by relevant laws yet, as their development speed has far exceeded that of regulations. Therefore, P2P lending operations are still subject to risks. This paper proposes prediction models to mitigate the risks of default and asymmetric information on P2P lending platforms. Specifically, we designed sophisticated procedures to pre-process mass data extracted from Lending Club in 2018 Q3–2019 Q2. After that, three statistical models, namely, Logistic Regression, Bayesian Classifier, and Linear Discriminant Analysis (LDA), and five AI models, namely, Decision Tree, Random Forest, LightGBM, Artificial Neural Network (ANN), and Convolutional Neural Network (CNN), were utilized for data analysis. The loan statuses of Lending Club’s customers were rationally classified. To evaluate the models, we adopted the confusion matrix series of metrics, AUC-ROC curve, Kolmogorov–Smirnov chart (KS), and Student’s t-test. Empirical studies show that LightGBM produces the best performance and is 2.91% more accurate than the other models, resulting in a revenue improvement of nearly USD 24 million for Lending Club. Student’s t-test proves that the differences between models are statistically significant.
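The evaluation metrics named above can be reproduced in outline. The sketch below uses scikit-learn's gradient boosting on synthetic data as a stand-in for LightGBM on Lending Club records:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for pre-processed loan records (not Lending Club data)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85],
                           random_state=0)        # roughly 15% defaults
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]          # predicted default probability

auc = roc_auc_score(y_te, scores)
fpr, tpr, _ = roc_curve(y_te, scores)
ks = float(np.max(tpr - fpr))                     # Kolmogorov-Smirnov statistic
```

The KS statistic is the maximum vertical gap between the cumulative score distributions of defaulters and non-defaulters, the same chart the authors report.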
|
261
|
D'Ascenzo L, Popova AM, Abernathy S, Sheng K, Limbach PA, Williamson JR. Pytheas: a software package for the automated analysis of RNA sequences and modifications via tandem mass spectrometry. Nat Commun 2022; 13:2424. [PMID: 35505047 PMCID: PMC9065004 DOI: 10.1038/s41467-022-30057-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/02/2021] [Accepted: 04/12/2022] [Indexed: 12/23/2022] Open
Abstract
Mass spectrometry is an important method for analysis of modified nucleosides ubiquitously present in cellular RNAs, in particular for ribosomal and transfer RNAs that play crucial roles in mRNA translation and decoding. Furthermore, modifications have effect on the lifetimes of nucleic acids in plasma and cells and are consequently incorporated into RNA therapeutics. To provide an analytical tool for sequence characterization of modified RNAs, we developed Pytheas, an open-source software package for automated analysis of tandem MS data for RNA. The main features of Pytheas are flexible handling of isotope labeling and RNA modifications, with false discovery rate statistical validation based on sequence decoys. We demonstrate bottom-up mass spectrometry characterization of diverse RNA sequences, with broad applications in the biology of stable RNAs, and quality control of RNA therapeutics and mRNA vaccines.
|
262
|
Huang Y, Wang X, Xiang W, Wang T, Otis C, Sarge L, Lei Y, Li B. Forward-Looking Roadmaps for Long-Term Continuous Water Quality Monitoring: Bottlenecks, Innovations, and Prospects in a Critical Review. Environ Sci Technol 2022; 56:5334-5354. [PMID: 35442035 PMCID: PMC9063115 DOI: 10.1021/acs.est.1c07857] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/17/2021] [Revised: 04/05/2022] [Accepted: 04/06/2022] [Indexed: 05/29/2023]
Abstract
Long-term continuous monitoring (LTCM) of water quality can bring far-reaching influences on water ecosystems by providing spatiotemporal data sets of diverse parameters and enabling operation of water and wastewater treatment processes in an energy-saving and cost-effective manner. However, current water monitoring technologies are deficient for long-term accuracy in data collection and processing capability. Inadequate LTCM data impedes water quality assessment and hinders the stakeholders and decision makers from foreseeing emerging problems and executing efficient control methodologies. To tackle this challenge, this review provides a forward-looking roadmap highlighting vital innovations toward LTCM, and elaborates on the impacts of LTCM through a three-hierarchy perspective: data, parameters, and systems. First, we demonstrate the critical needs and challenges of LTCM in natural resource water, drinking water, and wastewater systems, and differentiate LTCM from existing short-term and discrete monitoring techniques. We then elucidate three steps to achieve LTCM in water systems, consisting of data acquisition (water sensors), data processing (machine learning algorithms), and data application (with modeling and process control as two examples). Finally, we explore future opportunities of LTCM in four key domains, water, energy, sensing, and data, and underscore strategies to transfer scientific discoveries to general end-users.
|
263
|
Kabbara A, Robert G, Khalil M, Verin M, Benquet P, Hassan M. An electroencephalography connectome predictive model of major depressive disorder severity. Sci Rep 2022; 12:6816. [PMID: 35473962 PMCID: PMC9042869 DOI: 10.1038/s41598-022-10949-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/18/2021] [Accepted: 04/05/2022] [Indexed: 11/21/2022] Open
Abstract
Emerging evidence shows that major depressive disorder (MDD) is associated with disruptions of brain structural and functional networks, rather than impairment of isolated brain regions. Thus, connectome-based models capable of predicting depression severity at the individual level can be clinically useful. Here, we applied a machine-learning approach to predict the severity of depression using resting-state networks derived from source-reconstructed electroencephalography (EEG) signals. Using regression models and three independent EEG datasets (N = 328), we tested whether resting-state functional connectivity could predict individual depression scores. On the first dataset, results showed that individual scores could be reasonably predicted (r = 0.6, p = 4 × 10⁻¹⁸) using intrinsic functional connectivity in the EEG alpha band (8-13 Hz). In particular, the brain regions which contributed most to the predictive network belong to the default mode network. We further tested the predictive potential of the established model by conducting two external validations (N1 = 53, N2 = 154). Results showed statistically significant correlations between the predicted and measured depression scale scores (r1 = 0.52, r2 = 0.44, p < 0.001). These findings lay the foundation for developing generalizable and scientifically interpretable EEG network-based markers that can ultimately support clinicians in a biologically based characterization of MDD.
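The connectome-predictive workflow (regressing a clinical score on connectivity features under cross-validation, then correlating predicted with measured scores) can be sketched on synthetic data:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_subj, n_edges = 150, 300          # subjects x connectivity edges (toy sizes)
conn = rng.normal(size=(n_subj, n_edges))          # alpha-band connectivity
w = np.zeros(n_edges)
w[:10] = 1.0                                       # a few predictive edges
score_true = conn @ w + rng.normal(0, 2.0, n_subj)  # depression scale scores

# Out-of-fold predictions, so the correlation is not optimistically biased
pred = cross_val_predict(Ridge(alpha=10.0), conn, score_true, cv=5)
r, p = pearsonr(score_true, pred)
```

External validation, as in the paper, would apply the fitted model unchanged to a dataset never seen during training.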
|
264
|
Said S, Pazoki R, Karhunen V, Võsa U, Ligthart S, Bodinier B, Koskeridis F, Welsh P, Alizadeh BZ, Chasman DI, Sattar N, Chadeau-Hyam M, Evangelou E, Jarvelin MR, Elliott P, Tzoulaki I, Dehghan A. Genetic analysis of over half a million people characterises C-reactive protein loci. Nat Commun 2022; 13:2198. [PMID: 35459240 PMCID: PMC9033829 DOI: 10.1038/s41467-022-29650-5] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Received: 04/22/2021] [Accepted: 03/25/2022] [Indexed: 01/08/2023] Open
Abstract
Chronic low-grade inflammation is linked to a multitude of chronic diseases. We report the largest genome-wide association study (GWAS) on C-reactive protein (CRP), a marker of systemic inflammation, in UK Biobank participants (N = 427,367, European descent) and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium (total N = 575,531, European descent). We identify 266 independent loci, of which 211 have not previously been reported. Gene-set analysis highlighted 42 gene sets associated with CRP levels (p ≤ 3.2 × 10⁻⁶), and tissue expression analysis indicated a strong association of CRP-related genes with liver and whole-blood gene expression. A phenome-wide association study identified 27 clinical outcomes associated with genetically determined CRP, and subsequent Mendelian randomisation analyses supported a causal association with schizophrenia, chronic airway obstruction and prostate cancer. Our findings identify genetic loci and functional properties of chronic low-grade inflammation and provide evidence for causal associations with a range of diseases.
|
265
|
Bogdanovic B, Eftimov T, Simjanoska M. In-depth insights into Alzheimer's disease by using explainable machine learning approach. Sci Rep 2022; 12:6508. [PMID: 35444165 PMCID: PMC9021280 DOI: 10.1038/s41598-022-10202-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/30/2021] [Accepted: 04/04/2022] [Indexed: 11/09/2022] Open
Abstract
Alzheimer's disease is still a field of research with many open questions. The complexity of the disease prevents early diagnosis before visible symptoms affecting the individual's cognitive capabilities occur. This research presents an in-depth analysis of a large data set encompassing medical, cognitive and lifestyle measurements from more than 12,000 individuals. Several hypotheses were established and their validity examined against the obtained results. The importance of appropriate experimental design is strongly stressed in the research. Thus, a sequence of methods for handling missing data, redundancy, data imbalance and correlation analysis was applied for appropriate preprocessing of the data set, and an XGBoost model was then trained and evaluated, with special attention to hyperparameter tuning. The model was explained using the Shapley values produced by the SHAP method. XGBoost produced an F1-score of 0.84 and as such is highly competitive among those published in the literature. This achievement, however, was not the main contribution of this paper. The goal of this research was to perform global and local interpretation of the intelligent model and derive valuable conclusions about the established hypotheses. These methods led to a single scheme which presents either the positive or negative influence of the values of each of the features whose importance was confirmed by means of Shapley values. This scheme might be considered an additional source of knowledge for physicians and other experts concerned with the exact diagnosis of early-stage Alzheimer's disease. The conclusions derived from the intelligent model's data-driven interpretability confronted all the established hypotheses. This research clearly shows the importance of an explainable machine learning approach that opens the black box and unveils the relationships among the features and the diagnoses.
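SHAP approximates Shapley values efficiently for tree ensembles; for a toy model, the exact definition (the average marginal contribution of a feature over all coalitions) can be computed directly. The model below is illustrative, not the paper's:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one instance: average marginal
    contribution of each feature, with absent features set to a baseline."""
    n = len(x)

    def f(subset):
        z = list(baseline)
        for i in subset:
            z[i] = x[i]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (f(S + (i,)) - f(S))
        phi.append(total)
    return phi

# Toy risk model: risk = 2 * feature_0 + 3 * feature_1 (hypothetical)
predict = lambda z: 2 * z[0] + 3 * z[1]
phi = shapley_values(predict, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

For a linear model the attributions reduce to coefficient times deviation from baseline, and they sum to the prediction gap (the efficiency property SHAP relies on).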
|
266
|
Ping Z, Chen S, Zhou G, Huang X, Zhu SJ, Zhang H, Lee HH, Lan Z, Cui J, Chen T, Zhang W, Yang H, Xu X, Church GM, Shen Y. Towards practical and robust DNA-based data archiving using the yin-yang codec system. Nat Comput Sci 2022; 2:234-242. [PMID: 38177542 PMCID: PMC10766522 DOI: 10.1038/s43588-022-00231-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 05/18/2021] [Accepted: 03/18/2022] [Indexed: 01/06/2024]
Abstract
DNA is a promising data storage medium due to its remarkable durability and space-efficient storage. Early bit-to-base transcoding schemes have primarily pursued information density, at the expense of introducing biocompatibility challenges or decoding failure. Here we propose a robust transcoding algorithm named the yin-yang codec, using two rules to encode two binary bits into one nucleotide, to generate DNA sequences that are highly compatible with synthesis and sequencing technologies. We encoded two representative file formats and stored them in vitro as 200 nt oligo pools and in vivo as a ~54 kbp DNA fragment in yeast cells. Sequencing results show that the yin-yang codec exhibits high robustness and reliability for a wide variety of data types, with an average recovery rate of 99.9% above 10⁴ molecule copies and an achieved recovery rate of 87.53% at ≤10² copies. Additionally, the in vivo storage demonstration achieved an experimentally measured physical density close to the theoretical maximum.
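The published rule tables are more elaborate, but the two-rule idea (one bit chooses the base class, the second bit plus the previous base chooses within the class) can be sketched as follows. This simplified codec is an assumption for illustration, not the actual yin-yang rules:

```python
# Illustrative two-rule codec: rule 1, the "yang" bit, selects purine {A,G}
# vs pyrimidine {C,T}; rule 2, the "yin" bit, selects within the pair,
# conditioned on the previous base so bit runs do not force homopolymers.
PURINE, PYRIMIDINE = ("A", "G"), ("C", "T")

def encode(yang_bits, yin_bits):
    seq, prev = [], "A"
    for b1, b2 in zip(yang_bits, yin_bits):
        pair = PURINE if b1 == "0" else PYRIMIDINE
        flip = int(b2) ^ (prev in ("G", "T"))   # condition on previous base
        nt = pair[flip]
        seq.append(nt)
        prev = nt
    return "".join(seq)

def decode(seq):
    yang, yin, prev = [], [], "A"
    for nt in seq:
        yang.append("0" if nt in PURINE else "1")
        yin.append(str(int((nt in ("G", "T")) ^ (prev in ("G", "T")))))
        prev = nt
    return "".join(yang), "".join(yin)

dna = encode("0101", "0011")   # two 4-bit streams -> four nucleotides
```

Two bits per nucleotide is the same information density the abstract cites; the conditioning step hints at how the real codec steers sequences toward synthesis-friendly compositions.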
|
267
|
Nordin ND, Abdullah F, Zan MSD, A Bakar AA, Krivosheev AI, Barkov FL, Konstantinov YA. Improving Prediction Accuracy and Extraction Precision of Frequency Shift from Low-SNR Brillouin Gain Spectra in Distributed Structural Health Monitoring. Sensors (Basel) 2022; 22:s22072677. [PMID: 35408291 PMCID: PMC9003443 DOI: 10.3390/s22072677] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/22/2022] [Revised: 03/25/2022] [Accepted: 03/25/2022] [Indexed: 02/04/2023]
Abstract
In this paper, we studied the possibility of increasing the Brillouin frequency shift (BFS) detection accuracy of distributed fibre-optic sensors through the separate and joint use of different algorithms for finding the spectral maximum: Lorentzian curve fitting (LCF, including the Levenberg–Marquardt (LM) method), the backward correlation technique (BWC) and a machine learning algorithm, the generalized linear model (GLM). The study was carried out on real spectra subjected to the subsequent addition of extreme digital noise. The precision and accuracy of the LM and BWC methods were studied by varying the signal-to-noise ratio (SNR) and by incorporating the GLM method into the processing steps. We found that applying the methods in sequence improves the accuracy of the derived sensor temperature by tenths of a degree to several degrees Celsius (or, on the BFS scale, by MHz), an effect manifested at signal-to-noise ratios between 0 and 20 dB. Double processing (BWC + GLM) is more effective for positive SNR values (in dB), giving a gain in BFS measurement precision of about 0.4 °C (428 kHz or 9.3 με); for BWC + GLM, the difference in precision between single and double processing for SNRs below 2.6 dB is about 1.5 °C (1.6 MHz or 35 με). In this case, double processing is more effective at all SNRs. The technique's potential applications in structural health monitoring (SHM) of concrete structures and in other areas of metrology and sensing are also discussed.
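Lorentzian curve fitting of a noisy Brillouin gain spectrum can be sketched with a standard nonlinear least-squares call (scipy's `curve_fit` defaults to a Levenberg–Marquardt-style solver for unconstrained problems); the spectrum below is simulated, not the paper's data:

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(f, f_b, gamma, a):
    """Brillouin gain profile: peak at the BFS f_b with linewidth gamma."""
    return a * (gamma / 2) ** 2 / ((f - f_b) ** 2 + (gamma / 2) ** 2)

rng = np.random.default_rng(0)
f = np.linspace(10.6, 11.0, 400)               # GHz scan around the BFS
true_bfs = 10.82
signal = lorentzian(f, true_bfs, 0.03, 1.0)
noisy = signal + rng.normal(0, 0.15, f.size)   # low-SNR spectrum

p0 = [f[np.argmax(noisy)], 0.05, 1.0]          # peak-based initial guess
popt, _ = curve_fit(lorentzian, f, noisy, p0=p0)
bfs_est = float(popt[0])
```

A correlation-based pre-estimate of the peak position (as in BWC) makes a good initial guess, which is one reason chaining the methods helps at low SNR.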
|
268
|
Zhang X, Jenkins GJ, Hakim CH, Duan D, Yao G. Four-limb wireless IMU sensor system for automatic gait detection in canines. Sci Rep 2022; 12:4788. [PMID: 35314731 PMCID: PMC8938443 DOI: 10.1038/s41598-022-08676-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/30/2021] [Accepted: 03/10/2022] [Indexed: 12/24/2022] Open
Abstract
This study aims to develop a 4-limb canine gait analysis system using wireless inertial measurement units (IMUs). 3D-printed sensor holders were designed to ensure quick and consistent sensor mounting. Signal analysis algorithms were developed to automatically determine the timing of swing start and end in a stride. To evaluate the accuracy of the new system, a synchronized study was conducted in which stride parameters in four dogs were measured simultaneously using the 4-limb IMU system and a pressure-sensor-based walkway gait system. The results showed that stride parameters measured by the two systems were highly correlated. Bland-Altman analyses revealed a nominal mean measurement bias between the two systems in both forelimbs and hindlimbs. Overall, the disagreement between the two systems was less than 10% of the mean value in over 92% of the data points acquired from forelimbs. The same performance was observed in hindlimbs except for one parameter, due to small mean values. We demonstrated that this 4-limb system could successfully visualize overall gait types and identify rapid gait changes in dogs. This method provides an effective, low-cost tool for gait studies in veterinary applications and in translational studies using dog models of neuromuscular diseases.
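Automatic swing detection from IMU data can be sketched as a threshold on angular-rate magnitude (a simplified stand-in for the paper's algorithms; the signal below is synthetic):

```python
import numpy as np

def detect_swings(gyro_mag, threshold):
    """Return (start, end) sample indices of swing phases, defined as runs
    where angular-rate magnitude exceeds the threshold. Assumes the signal
    starts and ends below the threshold (stance at both ends)."""
    active = (np.asarray(gyro_mag) > threshold).astype(int)
    d = np.diff(active)
    starts = np.flatnonzero(d == 1) + 1    # rising edges: swing start
    ends = np.flatnonzero(d == -1) + 1     # falling edges: swing end
    return list(zip(starts, ends))

fs = 100                                   # Hz
phase = np.arange(4 * fs) % fs             # four 1-s strides
gyro_mag = np.where((phase >= 30) & (phase < 70), 150.0, 10.0)  # deg/s
swings = detect_swings(gyro_mag, threshold=50.0)
durations_s = [(e - s) / fs for s, e in swings]
```

Per-limb swing timings from all four sensors can then be combined to derive stride time, stance/swing ratio and gait type.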
Collapse
|
269
|
Wagner AS, Waite LK, Wierzba M, Hoffstaedter F, Waite AQ, Poldrack B, Eickhoff SB, Hanke M. FAIRly big: A framework for computationally reproducible processing of large-scale data. Sci Data 2022; 9:80. [PMID: 35277501 PMCID: PMC8917149 DOI: 10.1038/s41597-022-01163-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2021] [Accepted: 02/11/2022] [Indexed: 11/30/2022] Open
Abstract
Large-scale datasets present unique opportunities to perform scientific investigations with unprecedented breadth. However, they also pose considerable challenges for the findability, accessibility, interoperability, and reusability (FAIR) of research outcomes due to infrastructure limitations, data usage constraints, or software license restrictions. Here we introduce a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates. The framework attempts to minimize platform idiosyncrasies and performance-related complexities. It affords the capture of machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes, as well as be re-executed independent of the original computing infrastructure. We demonstrate the framework's performance using two showcases: one highlighting data sharing and transparency (using the studyforrest.org dataset) and another highlighting scalability (using the largest public brain imaging dataset available: the UK Biobank dataset).
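The idea of a machine-actionable provenance record can be illustrated with a toy analogue (this is not DataLad's actual format or API): run a command and store the command line together with checksums of its inputs and outputs, so the step can later be verified or re-executed elsewhere.

```python
import hashlib
import json
import subprocess
from pathlib import Path

def sha256(path):
    """Content checksum used to pin a file's state in the record."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def run_with_provenance(cmd, inputs, outputs, record_file="record.json"):
    """Run a shell command and keep a machine-actionable record of it:
    the command line plus checksums of its inputs and outputs."""
    record = {"cmd": cmd, "inputs": {p: sha256(p) for p in inputs}}
    subprocess.run(cmd, shell=True, check=True)
    record["outputs"] = {p: sha256(p) for p in outputs}
    Path(record_file).write_text(json.dumps(record, indent=2))
    return record

# toy example: POSIX `cp` stands in for a real processing step
Path("in.txt").write_text("raw data\n")
rec = run_with_provenance("cp in.txt out.txt", ["in.txt"], ["out.txt"])
```

DataLad's `datalad run` captures this kind of record inside version control, which is what makes the framework's results re-executable independent of the original infrastructure.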
Collapse
|
270
|
Raredon MSB, Yang J, Garritano J, Wang M, Kushnir D, Schupp JC, Adams TS, Greaney AM, Leiby KL, Kaminski N, Kluger Y, Levchenko A, Niklason LE. Computation and visualization of cell-cell signaling topologies in single-cell systems data using Connectome. Sci Rep 2022; 12:4187. [PMID: 35264704 PMCID: PMC8906120 DOI: 10.1038/s41598-022-07959-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2021] [Accepted: 02/28/2022] [Indexed: 12/11/2022] Open
Abstract
Single-cell RNA-sequencing data has revolutionized our ability to understand the patterns of cell-cell and ligand-receptor connectivity that influence the function of tissues and organs. However, the quantification and visualization of these patterns in a way that informs tissue biology are major computational and epistemological challenges. Here, we present Connectome, a software package for R which facilitates rapid calculation and interactive exploration of cell-cell signaling network topologies contained in single-cell RNA-sequencing data. Connectome can be used with any reference set of known ligand-receptor mechanisms. It has built-in functionality to facilitate differential and comparative connectomics, in which signaling networks are compared between tissue systems. Connectome focuses on computational and graphical tools designed to analyze and explore cell-cell connectivity patterns across disparate single-cell datasets and reveal biologic insight. We present approaches to quantify focused network topologies and discuss some of the biologic theory leading to their design.
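The basic quantity behind such a cell-cell signaling network can be sketched generically (Connectome itself is an R package; the cell types, genes, and expression values below are invented): score each directed edge for one ligand-receptor pair as ligand expression in the sender times receptor expression in the receiver.

```python
import numpy as np

# toy mean-expression table: one row of values per gene, one column per cell type
cell_types = ["Fibroblast", "Endothelial", "Macrophage"]
genes = {"VEGFA": [2.0, 0.1, 0.5],   # ligand
         "KDR":   [0.0, 3.0, 0.1]}   # its receptor

def edge_weights(ligand, receptor):
    """Score every directed sender->receiver edge for one ligand-receptor
    pair as sender ligand expression times receiver receptor expression
    (one simple convention among several possible scalings)."""
    lig = np.array(genes[ligand])
    rec = np.array(genes[receptor])
    return {(s, r): lig[i] * rec[j]
            for i, s in enumerate(cell_types)
            for j, r in enumerate(cell_types)}

w = edge_weights("VEGFA", "KDR")
```

Collecting these weights over a full reference set of ligand-receptor mechanisms yields the network topology that a tool like Connectome then filters, compares, and visualizes.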
Collapse
|
271
|
Charnaud S, Munro JE, Semenec L, Mazhari R, Brewster J, Bourke C, Ruybal-Pesántez S, James R, Lautu-Gumal D, Karunajeewa H, Mueller I, Bahlo M. PacBio long-read amplicon sequencing enables scalable high-resolution population allele typing of the complex CYP2D6 locus. Commun Biol 2022; 5:168. [PMID: 35217695 PMCID: PMC8881578 DOI: 10.1038/s42003-022-03102-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2021] [Accepted: 02/01/2022] [Indexed: 01/31/2023] Open
Abstract
The CYP2D6 enzyme is estimated to metabolize 25% of commonly used pharmaceuticals and is of intense pharmacogenetic interest due to the polymorphic nature of the CYP2D6 gene. Accurate allele typing of CYP2D6 has proved challenging due to frequent copy number variants (CNVs) and paralogous pseudogenes. SNP-arrays, qPCR and short-read sequencing have been employed to interrogate CYP2D6; however, these technologies are unable to capture longer-range information. Long-read sequencing using the PacBio Single Molecule Real Time (SMRT) sequencing platform has yielded promising results for CYP2D6 allele typing. However, previous studies have been limited in scale and have employed nascent data processing pipelines. We present a robust data processing pipeline, "PLASTER", for accurate allele typing of SMRT-sequenced amplicons. We demonstrate the pipeline by typing CYP2D6 alleles in a large cohort of 377 Solomon Islanders. This pharmacogenetic method will improve drug safety and efficacy through screening prior to drug administration.
Collapse
|
272
|
Fleischer CE. A data processing approach with built-in spatial resolution reduction methods to construct energy system models. OPEN RESEARCH EUROPE 2022; 1:36. [PMID: 37645144 PMCID: PMC10446009 DOI: 10.12688/openreseurope.13420.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 02/02/2022] [Indexed: 08/31/2023]
Abstract
Introduction: Data processing is a crucial step in energy system modelling, preparing input data from various sources into the format needed to formulate a model. Multiple open-source web-hosted databases offer pre-processed input data within the European context. However, the number of documented open-source data processing workflows that allow for the construction of energy system models with specified spatial resolution reduction methods is still limited. Methods: The first step of the data processing method builds a dataset using web-hosted pre-processed data and open-source software. The second step aggregates the dataset using a specified spatial aggregation method. The spatially aggregated dataset is used as input data to construct sector-coupled energy system models. Results: To demonstrate the data processing approach, three power and heat optimisation models of Germany were constructed using it. Significant variation in the generation, transmission and storage capacity of electricity was observed between the optimisation results of the energy system models. Conclusions: This paper presents a novel data processing approach to construct sector-coupled energy system models with integrated spatial aggregation methods.
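The spatial aggregation step can be illustrated with a minimal, hypothetical sketch: given a node-to-region mapping, sum nodal demand within each region, sum the capacity of transmission lines crossing a region border, and drop intra-region lines. The node names, numbers, and aggregation convention are illustrative, not the paper's.

```python
from collections import defaultdict

# toy node-level data: node-to-region mapping, per-node demand (GWh),
# and transmission lines as (node_a, node_b, capacity_GW)
node_region = {"N1": "North", "N2": "North", "N3": "South"}
demand = {"N1": 10.0, "N2": 5.0, "N3": 7.0}
lines = [("N1", "N3", 2.0), ("N2", "N3", 1.5), ("N1", "N2", 4.0)]

def aggregate(node_region, demand, lines):
    """Collapse nodes into regions: sum demand within a region, sum the
    capacity of cross-border lines, and drop intra-region lines."""
    agg_demand = defaultdict(float)
    for node, d in demand.items():
        agg_demand[node_region[node]] += d
    agg_lines = defaultdict(float)
    for a, b, cap in lines:
        ra, rb = node_region[a], node_region[b]
        if ra != rb:                           # keep only border-crossing lines
            agg_lines[tuple(sorted((ra, rb)))] += cap
    return dict(agg_demand), dict(agg_lines)

regions, corridors = aggregate(node_region, demand, lines)
```

Choices like this one (e.g. dropping intra-region lines versus converting them into a regional transfer limit) are exactly why different aggregation methods yield the capacity differences reported in the Results.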
Collapse
|
273
|
Li Y, Zaheri S, Nguyen K, Liu L, Hassanipour F, Pace BS, Bleris L. Machine learning-based approaches for identifying human blood cells harboring CRISPR-mediated fetal chromatin domain ablations. Sci Rep 2022; 12:1481. [PMID: 35087158 PMCID: PMC8795181 DOI: 10.1038/s41598-022-05575-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2021] [Accepted: 12/17/2021] [Indexed: 11/08/2022] Open
Abstract
Two common hemoglobinopathies, sickle cell disease (SCD) and β-thalassemia, arise from genetic mutations within the β-globin gene. In this work, we identified a 500-bp motif (Fetal Chromatin Domain, FCD) upstream of the human γ-globin locus and showed that the removal of this motif using CRISPR technology reactivates the expression of γ-globin. Next, we present two different cell morphology-based machine learning approaches that can be used to identify human blood cells (KU-812) that harbor CRISPR-mediated FCD genetic modifications. Three candidate models from the first approach, which uses a multilayer perceptron algorithm (MLP 20-26, MLP 26-18, and MLP 30-26) and flow cytometry-derived cellular data, yielded 0.83 precision, 0.80 recall, 0.82 accuracy, and 0.90 area under the ROC (receiver operating characteristic) curve when predicting the edited cells. In comparison, the candidate model from the second approach, which uses deep learning (T2D5) and DIC microscopy-derived imaging data, performed with lower accuracy (0.80) and ROC AUC (0.87). We envision that equivalent machine learning-based models can complement currently available genotyping protocols for specific genetic modifications that result in morphological changes in human cells.
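A minimal stand-in for the first approach, assuming scikit-learn and synthetic two-class "flow-cytometry-like" features (not the study's data): train an MLP whose hidden-layer sizes echo the "MLP 20-26" naming and score it on held-out cells.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for flow-cytometry features: edited cells are shifted
# in feature space relative to unedited ones (4 invented channels)
rng = np.random.default_rng(0)
unedited = rng.normal(0.0, 1.0, (500, 4))
edited = rng.normal(1.5, 1.0, (500, 4))
X = np.vstack([unedited, edited])
y = np.array([0] * 500 + [1] * 500)          # 1 = CRISPR-edited

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# two hidden layers of 20 and 26 units, echoing the "MLP 20-26" naming
clf = MLPClassifier(hidden_layer_sizes=(20, 26), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

In the real setting the features would come from gated flow-cytometry measurements, and precision/recall/ROC AUC would be reported alongside accuracy as in the abstract.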
Collapse
|
274
|
Barr KB, Chiang N, Bertozzi AL, Gilles J, Osher SJ, Weiss PS. Extraction of Hidden Science from Nanoscale Images. THE JOURNAL OF PHYSICAL CHEMISTRY. C, NANOMATERIALS AND INTERFACES 2022; 126:3-13. [PMID: 35633819 PMCID: PMC9135097 DOI: 10.1021/acs.jpcc.1c08712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Scanning probe microscopies and spectroscopies enable investigation of surfaces and even buried interfaces down to the scale of chemical-bonding interactions, and this capability has been enhanced with the support of computational algorithms for data acquisition and image processing to explore physical, chemical, and biological phenomena. Here, we describe how scanning probe techniques have been enhanced by some of these recent algorithmic improvements. One improvement to the data acquisition algorithm is to advance beyond a simple rastering framework by using spirals at constant angular velocity and then switching to constant linear velocity, which limits the piezo creep and hysteresis issues seen in traditional acquisition methods. One can also use image-processing techniques to model the distortions that appear from tip motion effects and to correct these images. Another image-processing algorithm we discuss enables researchers to segment images by domains and subdomains, thereby highlighting reactive and interesting disordered sites at domain boundaries. Lastly, we discuss algorithms used to examine the dipole direction of individual molecules and surface domains, hydrogen-bonding interactions, and molecular tilt. The computational algorithms used for scanning probe techniques are still improving rapidly and are incorporating machine learning at the next level of iteration. That said, the algorithms are not yet able to perform live adjustments during data recording that could enhance the microscopy and spectroscopic imaging methods significantly.
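The hybrid spiral acquisition can be sketched with a toy trajectory generator (made-up parameters, not the published algorithm): drive an Archimedean spiral at constant angular velocity until the tip speed reaches a cap, then switch to constant linear velocity so the tip speed stays fixed.

```python
import numpy as np

def spiral_scan(b=1.0, omega=2.0, v_max=5.0, dt=1e-3, n=5000):
    """Archimedean spiral r = b*theta, driven at constant angular velocity
    until the tip speed reaches v_max, then at constant linear velocity.
    Returns arrays of x, y tip positions."""
    theta = 0.0
    xs, ys = [], []
    for _ in range(n):
        speed = b * omega * np.sqrt(1 + theta**2)  # tip speed in CAV mode
        if speed < v_max:
            theta += omega * dt                    # constant angular velocity
        else:
            # constant linear velocity: scale dtheta so |velocity| = v_max
            theta += v_max / (b * np.sqrt(1 + theta**2)) * dt
        r = b * theta
        xs.append(r * np.cos(theta))
        ys.append(r * np.sin(theta))
    return np.array(xs), np.array(ys)

x, y = spiral_scan()
```

Capping the tip speed in the outer turns is what keeps the piezo drive smooth; the inner turns stay at constant angular velocity, where the speed is naturally low.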
Collapse
|
275
|
Cresswell K, Domínguez Hernández A, Williams R, Sheikh A. Key Challenges and Opportunities for Cloud Technology in Health Care: Semistructured Interview Study. JMIR Hum Factors 2022; 9:e31246. [PMID: 34989688 PMCID: PMC8778568 DOI: 10.2196/31246] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2021] [Revised: 09/14/2021] [Accepted: 10/02/2021] [Indexed: 01/27/2023] Open
Abstract
Background The use of cloud computing (involving storage and processing of data on the internet) in health care has increasingly been highlighted as having great potential in facilitating data-driven innovations. Although some provider organizations are reaping the benefits of using cloud providers to store and process their data, others are lagging behind. Objective We aim to explore the existing challenges and barriers to the use of cloud computing in health care settings and investigate how perceived risks can be addressed. Methods We conducted a qualitative case study of cloud computing in health care settings, interviewing a range of individuals with perspectives on supply, implementation, adoption, and integration of cloud technology. Data were collected through a series of in-depth semistructured interviews exploring current applications, implementation approaches, challenges encountered, and visions for the future. The interviews were transcribed and thematically analyzed using NVivo 12 (QSR International). We coded the data based on a sociotechnical coding framework developed in related work. Results We interviewed 23 individuals between September 2020 and November 2020, including professionals working across major cloud providers, health care provider organizations, innovators, small and medium-sized software vendors, and academic institutions. The participants were united by a common vision of a cloud-enabled ecosystem of applications and by drivers surrounding data-driven innovation. The identified barriers to progress included the cost of data migration and skill gaps to implement cloud technologies within provider organizations, the cultural shift required to move to externally hosted services, a lack of user pull as many benefits were not visible to those providing frontline care, and a lack of interoperability standards and central regulations. 
Conclusions Implementations need to be viewed as a digitally enabled transformation of services, driven by skill development, organizational change management, and user engagement, to facilitate the implementation and exploitation of cloud-based infrastructures and to maximize returns on investment.
Collapse
|