51. Liu F, Yuan C, Chen H, Yang F. Prediction of linear B-cell epitopes based on protein sequence features and BERT embeddings. Sci Rep 2024;14:2464. PMID: 38291341; PMCID: PMC10828400; DOI: 10.1038/s41598-024-53028-w. Received 09/06/2023; accepted 01/26/2024.
Abstract
Linear B-cell epitopes (BCEs) play a key role in the development of peptide vaccines and immunodiagnostic reagents. The accurate identification of linear BCEs is therefore of great importance in the prevention of infectious diseases and the diagnosis of related conditions. Experimental methods for identifying BCEs are expensive and time-consuming and cannot meet the demand for identification across large-scale protein sequence data, so an efficient and accurate computational method is needed to rapidly identify linear BCE sequences. In this work, we developed LBCE-BERT, a new linear BCE prediction method based on peptide sequence information and embedding information from the BERT natural language model, using an XGBoost classifier. The model was trained on three benchmark datasets for hyperparameter selection and was subsequently evaluated on several test datasets. The results indicate that our proposed method outperforms others in terms of AUROC and accuracy. The LBCE-BERT model is publicly available at: https://github.com/Lfang111/LBCE-BERT .
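The feature pipeline this abstract describes — hand-crafted sequence features concatenated with language-model embeddings, then fed to a gradient-boosted classifier — can be sketched as follows. This is a minimal illustration, not the authors' code: the amino-acid-composition feature, the function names, and the mock 8-dimensional "BERT" embedding are assumptions, and the XGBoost classifier itself is omitted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(peptide: str) -> list[float]:
    """Amino-acid composition: fraction of each of the 20 standard residues."""
    counts = Counter(peptide)
    n = len(peptide)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

def build_feature_vector(peptide: str, bert_embedding: list[float]) -> list[float]:
    """Concatenate hand-crafted sequence features with a language-model
    embedding, in the spirit of hybrid epitope predictors."""
    return aac_features(peptide) + list(bert_embedding)

# Toy example: a 9-mer peptide and a mock 8-dimensional embedding.
vec = build_feature_vector("ACDEFGHIK", [0.1] * 8)
```

In a real pipeline the embedding would come from a protein language model and `vec` would be one row of the matrix passed to the classifier.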
52. Ennis D, Shmorak S, Jantscher-Krenn E, Yassour M. Longitudinal quantification of Bifidobacterium longum subsp. infantis reveals late colonization in the infant gut independent of maternal milk HMO composition. Nat Commun 2024;15:894. PMID: 38291346; PMCID: PMC10827747; DOI: 10.1038/s41467-024-45209-y. Received 07/19/2023; accepted 01/15/2024.
Abstract
Breast milk contains human milk oligosaccharides (HMOs) that cannot be digested by infants, yet nourish their developing gut microbiome. While Bifidobacterium are the best-known utilizers of individual HMOs, a longitudinal study examining the evolving microbial community at high-resolution coupled with mothers' milk HMO composition is lacking. Here, we developed a high-throughput method to quantify Bifidobacterium longum subsp. infantis (BL. infantis), a proficient HMO-utilizer, and applied it to a longitudinal cohort consisting of 21 mother-infant dyads. We observed substantial changes in the infant gut microbiome over the course of several months, while the HMO composition in mothers' milk remained relatively stable. Although Bifidobacterium species significantly influenced sample variation, no specific HMOs correlated with Bifidobacterium species abundance. Surprisingly, we found that BL. infantis colonization began late in the breastfeeding period both in our cohort and in other geographic locations, highlighting the importance of focusing on BL. infantis dynamics in the infant gut.
53. Babaei Rikan S, Sorayaie Azar A, Naemi A, Bagherzadeh Mohasefi J, Pirnejad H, Wiil UK. Survival prediction of glioblastoma patients using modern deep learning and machine learning techniques. Sci Rep 2024;14:2371. PMID: 38287149; PMCID: PMC10824760; DOI: 10.1038/s41598-024-53006-2. Received 04/19/2023; accepted 01/25/2024.
Abstract
In this study, we utilized data from the Surveillance, Epidemiology, and End Results (SEER) database to predict survival outcomes of glioblastoma patients. To assess dataset skewness and detect feature importance, we applied Pearson's second coefficient of skewness and the Ordinary Least Squares method, respectively. Using two sampling strategies, holdout and five-fold cross-validation, we developed five machine learning (ML) models alongside a feed-forward deep neural network (DNN) for multiclass classification and regression prediction of glioblastoma patient survival. After balancing the classification and regression datasets, we obtained 46,340 and 28,573 samples, respectively. Shapley additive explanations (SHAP) were then used to explain the decision-making process of the best model. In both classification and regression tasks, and across both holdout and cross-validation sampling strategies, the DNN consistently outperformed the ML models. Notably, classification accuracy was 90.25% and 90.22% for holdout and five-fold cross-validation, respectively, while the corresponding R2 values were 0.6565 and 0.6622. SHAP analysis revealed age at diagnosis as the most influential feature in the DNN's survival predictions. These findings suggest that the DNN holds promise as a practical auxiliary tool for clinicians, aiding them in optimal decision-making concerning the treatment and care trajectories for glioblastoma patients.
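SHAP itself requires the `shap` package and a trained model; as a rough stand-in that conveys the same question ("how much does destroying one feature hurt the model?"), here is a permutation-importance sketch on synthetic survival-like data in which feature 0 plays the role of "age at diagnosis" and drives the outcome by construction. Everything below is an illustrative assumption, not the study's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 0 fully determines the binary outcome.
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

def accuracy(model, X, y):
    return float(np.mean(model(X) == y))

# A trivial "trained" model: threshold on a least-squares linear score.
w = np.linalg.lstsq(X, y - 0.5, rcond=None)[0]
model = lambda X: (X @ w > 0).astype(int)

base = accuracy(model, X, y)
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
    importance.append(base - accuracy(model, Xp, y))
```

As expected, shuffling feature 0 collapses accuracy while shuffling the noise features barely matters, mirroring how SHAP singled out age at diagnosis.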
54. Tyler SR, Lozano-Ojalvo D, Guccione E, Schadt EE. Anti-correlated feature selection prevents false discovery of subpopulations in scRNAseq. Nat Commun 2024;15:699. PMID: 38267438; PMCID: PMC10808220; DOI: 10.1038/s41467-023-43406-9. Received 12/13/2022; accepted 11/07/2023.
Abstract
While sub-clustering cell populations has become popular in single-cell omics, negative controls for this process are lacking. Popular feature-selection/clustering algorithms fail the null-dataset problem, allowing erroneous subdivisions of homogeneous clusters until nearly every cell is called its own cluster. Using real and synthetic datasets, we find that anti-correlated gene selection reduces or eliminates erroneous subdivisions, increases marker-gene selection efficacy, and efficiently scales to millions of cells.
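One way to realize the idea of anti-correlated gene selection — keep only genes that have at least one strongly anti-correlated partner, since pure noise rarely does — is sketched below. The threshold of -0.5 and the function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def anticorrelated_features(X: np.ndarray, threshold: float = -0.5) -> list[int]:
    """Keep features (genes) with at least one strongly anti-correlated
    partner; uncorrelated noise features are discarded."""
    corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlations
    np.fill_diagonal(corr, 0.0)
    return [j for j in range(X.shape[1]) if corr[:, j].min() <= threshold]

rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)
X = np.column_stack([
    a,                                   # gene 0
    -a + 0.1 * rng.normal(size=n),       # gene 1: anti-correlated with gene 0
    rng.normal(size=n),                  # gene 2: pure noise
])
selected = anticorrelated_features(X)
```

The noise gene is dropped, so a downstream clustering step never sees the features that drive spurious subdivisions.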
55. Sun Z, Zhang L, Wang R, Wang Z, Liang X, Gao J. Identification of shared pathogenetic mechanisms between COVID-19 and IC through bioinformatics and system biology. Sci Rep 2024;14:2114. PMID: 38267482; PMCID: PMC10808107; DOI: 10.1038/s41598-024-52625-z. Received 09/22/2023; accepted 01/22/2024.
Abstract
COVID-19 has caused substantial global mortality since 2019, and cystitis has emerged as a contributing factor in SARS-CoV-2 infection and COVID-19 complications; the molecular links between interstitial cystitis (IC) and COVID-19, however, remain unclear. This study investigates the molecular mechanisms of COVID-19-associated cystitis (CAC) and identifies candidate drugs using bioinformatics and systems biology. We obtained gene expression profiles for IC (GSE11783) and COVID-19 (GSE147507) from the Gene Expression Omnibus (GEO) database and identified the differentially expressed genes (DEGs) common to both conditions, extracting a number of key genes from this group. We then performed Gene Ontology (GO) functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis on the DEGs, and constructed a protein-protein interaction (PPI) network, a transcription factor (TF)-gene regulatory network, a TF-miRNA regulatory network, and a gene-disease association network. Hub genes were identified and extracted from the PPI network and used to build nomogram diagnostic prediction models. The DSigDB database was used to predict candidate molecular drugs associated with the common DEGs, and Receiver Operating Characteristic (ROC) curves were used to assess how accurately the hub genes and nomogram models diagnose IC and COVID-19. An IC dataset (GSE57560) and a COVID-19 dataset (GSE171110) were selected to validate the models' diagnostic accuracy. In total, 198 overlapping DEGs were found and selected for further analysis. FCER1G, ITGAM, LCP2, LILRB2, MNDA, SPI1, and TYROBP were screened as hub genes. The nomogram model built on these seven hub genes demonstrates significant utility as a diagnostic prediction model for both IC and COVID-19, and multiple candidate molecular drugs associated with the common DEGs were discovered. These pathways, hub genes, and models may provide new perspectives for future research into mechanisms and guide personalised and effective therapeutics for IC patients infected with COVID-19.
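Extracting hub genes from a PPI network, as done above, commonly means ranking nodes by degree (or a related centrality). A minimal degree-based sketch on a toy edge list — the edges below are invented for illustration, not the study's network:

```python
from collections import Counter

def hub_genes(edges, k=3):
    """Rank genes by degree in a protein-protein interaction network."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [gene for gene, _ in degree.most_common(k)]

# Toy PPI edge list using gene symbols from the abstract; the connectivity
# is made up for the example.
edges = [("TYROBP", "FCER1G"), ("TYROBP", "ITGAM"), ("TYROBP", "SPI1"),
         ("ITGAM", "FCER1G"), ("LCP2", "FCER1G")]
top = hub_genes(edges, k=2)
```

Real workflows (e.g. Cytoscape plugins) combine several centrality measures, but degree ranking is the core idea.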
56. Aarthy M, Pandiyan GN, Paramasivan R, Kumar A, Gupta B. Identification and prioritisation of potential vaccine candidates using subtractive proteomics and designing of a multi-epitope vaccine against Wuchereria bancrofti. Sci Rep 2024;14:1970. PMID: 38263422; PMCID: PMC10806236; DOI: 10.1038/s41598-024-52457-x. Received 06/26/2023; accepted 01/18/2024.
Abstract
This study employed subtractive proteomics and immunoinformatics to analyze the Wuchereria bancrofti proteome and identify potential therapeutic targets, with a focus on designing a vaccine against the parasite species. A comprehensive bioinformatics analysis of the parasite's proteome identified 51 probable therapeutic targets, among which "Kunitz/bovine pancreatic trypsin inhibitor domain-containing protein" was identified as the most promising vaccine candidate. The candidate protein was used to design a multi-epitope vaccine, incorporating B-cell and T-cell epitopes identified through various tools. The vaccine construct underwent extensive analysis of its antigenic, physical, and chemical features, including the determination of secondary and tertiary structures. Docking and molecular dynamics simulations were performed with HLA alleles, Toll-like receptor 4 (TLR4), and TLR3 to assess its potential to elicit the human immune response. Immune simulation analysis confirmed the predicted vaccine's strong binding affinity with immunoglobulins, indicating its potential efficacy in generating an immune response. However, experimental validation and testing of this multi-epitope vaccine construct would be needed to assess its potential against W. bancrofti and even for a broader range of lymphatic filarial infections given the similarities between W. bancrofti and Brugia.
57. Leung YY, Naj AC, Chou YF, Valladares O, Schmidt M, Hamilton-Nelson K, Wheeler N, Lin H, Gangadharan P, Qu L, Clark K, Kuzma AB, Lee WP, Cantwell L, Nicaretta H, Haines J, Farrer L, Seshadri S, Brkanac Z, Cruchaga C, Pericak-Vance M, Mayeux RP, Bush WS, Destefano A, Martin E, Schellenberg GD, Wang LS. Human whole-exome genotype data for Alzheimer's disease. Nat Commun 2024;15:684. PMID: 38263370; PMCID: PMC10805795; DOI: 10.1038/s41467-024-44781-7. Received 12/19/2022; accepted 01/02/2024.
Abstract
The heterogeneity of whole-exome sequencing (WES) data generation methods presents a challenge to joint analysis. Here we present a bioinformatics strategy for joint-calling 20,504 WES samples collected across nine studies and sequenced using ten capture kits in fourteen sequencing centers in the Alzheimer's Disease Sequencing Project. The jointly genotyped variant call format (VCF) file contains only positions within the union of the capture kits, and was then processed specifically to account for the batch effects arising from the use of different capture kits across studies. We identified 8.2 million autosomal variants. 96.82% of the variants are high-quality and are located in 28,579 Ensembl transcripts; 41% of the variants are intronic and 1.8% have CADD scores > 30, indicating high predicted pathogenicity. Here we show that our new strategy can generate high-quality data from these diversely generated WES samples. The improved ability to combine data sequenced in different batches benefits the whole genomics research community.
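Restricting a joint VCF to "positions within the union of capture kits" is, at its core, an interval-union computation. A minimal sketch (coordinates and kit contents are invented for illustration):

```python
def union_of_kits(kits):
    """Merge the targeted regions of several capture kits into their union;
    only positions inside this union would be retained in the joint VCF."""
    intervals = sorted(iv for kit in kits for iv in kit)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

kit_a = [(100, 200), (300, 400)]
kit_b = [(150, 250), (500, 600)]
regions = union_of_kits([kit_a, kit_b])
```

In practice this is done per chromosome with BED files and tools such as bedtools, but the merge logic is the same.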
58. Ashraf H, Waris A, Gilani SO, Shafiq U, Iqbal J, Kamavuako EN, Berrouche Y, Brüls O, Boutaayamou M, Niazi IK. Optimizing the performance of convolutional neural network for enhanced gesture recognition using sEMG. Sci Rep 2024;14:2020. PMID: 38263441; PMCID: PMC10805798; DOI: 10.1038/s41598-024-52405-9. Received 09/28/2023; accepted 01/18/2024.
Abstract
Deep neural networks (DNNs) have demonstrated higher performance results when compared to traditional approaches for implementing robust myoelectric control (MEC) systems. However, the delay induced by optimising a MEC remains a concern for real-time applications. As a result, an optimised DNN architecture based on fine-tuned hyperparameters is required. This study investigates the optimal configuration of convolutional neural network (CNN)-based MEC by proposing an effective data segmentation technique and a generalised set of hyperparameters. Firstly, two segmentation strategies (disjoint and overlap) and various segment and overlap sizes were studied to optimise segmentation parameters. Secondly, to address the challenge of optimising the hyperparameters of a DNN-based MEC system, the problem has been abstracted as an optimisation problem, and Bayesian optimisation has been used to solve it. From 20 healthy people, ten surface electromyography (sEMG) grasping movements abstracted from daily life were chosen as the target gesture set. With an ideal segment size of 200 ms and an overlap size of 80%, the results show that the overlap segmentation technique outperforms the disjoint segmentation technique (p-value < 0.05). In comparison to manual (12.76 ± 4.66), grid (0.10 ± 0.03), and random (0.12 ± 0.05) search hyperparameters optimisation strategies, the proposed optimisation technique resulted in a mean classification error rate (CER) of 0.08 ± 0.03 across all subjects. In addition, a generalised CNN architecture with an optimal set of hyperparameters is proposed. When tested separately on all individuals, the single generalised CNN architecture produced an overall CER of 0.09 ± 0.03. 
This study's significance lies in its contribution to the field of EMG signal processing by demonstrating the superiority of the overlap segmentation technique, optimizing CNN hyperparameters through Bayesian optimization, and offering practical insights for improving prosthetic control and human-computer interfaces.
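The disjoint versus overlap segmentation strategies compared above can be sketched with a simple windowing function. The 1 kHz sampling rate is an assumption made so that the paper's 200 ms window corresponds to 200 samples; with 80% overlap the hop between windows is 40 samples.

```python
def segment(signal, window, overlap):
    """Split a 1-D signal into fixed-size windows.

    overlap is a fraction in [0, 1): 0.0 gives disjoint windows,
    0.8 gives the 80% overlap used in the paper."""
    step = max(1, round(window * (1 - overlap)))
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, step)]

sig = list(range(1000))          # 1 s of fake sEMG at an assumed 1 kHz rate
disjoint = segment(sig, window=200, overlap=0.0)    # hop = 200 samples
overlapped = segment(sig, window=200, overlap=0.8)  # hop = 40 samples
```

Overlapping multiplies the number of training windows (here 21 versus 5 from the same second of signal), which is one reason it tends to help CNN classifiers.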
59. Sabary O, Yucovich A, Shapira G, Yaakobi E. Reconstruction algorithms for DNA-storage systems. Sci Rep 2024;14:1951. PMID: 38263421; PMCID: PMC10806084; DOI: 10.1038/s41598-024-51730-3. Received 06/24/2023; accepted 01/09/2024.
Abstract
Motivated by DNA storage systems, this work presents the DNA reconstruction problem, in which a length-n string is transmitted through the DNA-storage channel, which introduces deletion, insertion, and substitution errors. The channel generates multiple noisy copies of the transmitted string, called traces. A DNA reconstruction algorithm is a mapping that receives t traces as input and produces an estimate of the original string; the goal is to minimize the edit distance between the original string and the algorithm's estimate. In this work, we present several new algorithms for this problem. Our algorithms look globally at the entire sequence of traces and use dynamic programming algorithms, originally developed for the shortest common supersequence and longest common subsequence problems, to decode the original string. Our algorithms place no limitations on the input or the number of traces, and they perform well even for error probabilities as high as 0.27. The algorithms have been tested on simulated data, on data from previous DNA storage experiments, and on a newly synthesized dataset, and are shown to outperform previous algorithms in reconstruction accuracy.
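The longest common subsequence (LCS) dynamic program mentioned above is the core primitive such reconstruction algorithms build on: symbols shared by two traces are likely to belong to the original strand. A textbook sketch (the example traces are invented, and real reconstruction combines many traces with more machinery):

```python
def lcs(s: str, t: str) -> str:
    """Longest common subsequence of two strings via dynamic programming."""
    m, n = len(s), len(t)
    # dp[i][j] holds an LCS of s[:i] and t[:j].
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if s[i] == t[j]:
                dp[i + 1][j + 1] = dp[i][j] + s[i]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

# Two noisy traces of the same hypothetical DNA strand:
common = lcs("ACGTTGCA", "ACTTGGCA")
```

Here the two 8-base traces disagree in one position each, and their LCS retains the 7 bases they agree on; a reconstruction algorithm would use such alignments to vote on the inserted/deleted symbols.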
60. Wang H, Gao C, Dantona C, Hull B, Sun J. DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ Digit Med 2024;7:16. PMID: 38253711; PMCID: PMC10803802; DOI: 10.1038/s41746-023-00989-3. Received 09/29/2023; accepted 12/19/2023.
Abstract
In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) is pivotal, but its assignment process is inefficient. This study introduces DRG-LLaMA, an advanced large language model (LLM) fine-tuned on clinical notes to enhance DRG assignment. Utilizing LLaMA as the foundational model and optimizing it through Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries, our DRG-LLaMA-7B model exhibited a noteworthy macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged Area Under the Curve (AUC) of 0.986, with a maximum input token length of 512. This model surpassed the performance of prior leading models in DRG prediction, showing relative improvements of 40.3% and 35.7% in macro-averaged F1 score compared to ClinicalBERT and CAML, respectively. Applied to base DRG and complication or comorbidity (CC)/major complication or comorbidity (MCC) prediction, DRG-LLaMA achieved top-1 prediction accuracies of 67.8% and 67.5%, respectively. Additionally, our findings indicate that DRG-LLaMA's performance improves with increased model parameters and input context lengths.
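The Low-Rank Adaptation (LoRA) used above fine-tunes a frozen weight matrix W by learning a low-rank update scaled by alpha/r, so only d*r + r*d parameters are trained instead of d*d. A numpy sketch of the arithmetic (dimensions, rank, and scaling are illustrative; in real LoRA the up-projection B is initialized to zero so training starts from the pretrained behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16              # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))          # frozen pretrained weight matrix
A = rng.normal(size=(d, r)) * 0.01   # trainable down-projection
B = rng.normal(size=(r, d))          # trainable up-projection (zero-init in
                                     # real LoRA; random here for illustration)

delta = (alpha / r) * (A @ B)        # rank <= r update
W_adapted = W + delta                # weights actually used at inference
```

With d = 64 and r = 8, the update touches all 64x64 entries of W yet has only 1,024 trainable parameters, which is why LoRA makes fine-tuning a 7B-parameter model tractable.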
61. Liu M, Srivastava G, Ramanujam J, Brylinski M. Augmented drug combination dataset to improve the performance of machine learning models predicting synergistic anticancer effects. Sci Rep 2024;14:1668. PMID: 38238448; PMCID: PMC10796434; DOI: 10.1038/s41598-024-51940-9. Received 10/23/2023; accepted 01/11/2024.
Abstract
Combination therapy has gained popularity in cancer treatment as it enhances the treatment efficacy and overcomes drug resistance. Although machine learning (ML) techniques have become an indispensable tool for discovering new drug combinations, the data on drug combination therapy currently available may be insufficient to build high-precision models. We developed a data augmentation protocol to unbiasedly scale up the existing anti-cancer drug synergy dataset. Using a new drug similarity metric, we augmented the synergy data by substituting a compound in a drug combination instance with another molecule that exhibits highly similar pharmacological effects. Using this protocol, we were able to upscale the AZ-DREAM Challenges dataset from 8798 to 6,016,697 drug combinations. Comprehensive performance evaluations show that ML models trained on the augmented data consistently achieve higher accuracy than those trained solely on the original dataset. Our data augmentation protocol provides a systematic and unbiased approach to generating more diverse and larger-scale drug combination datasets, enabling the development of more precise and effective ML models. The protocol presented in this study could serve as a foundation for future research aimed at discovering novel and effective drug combinations for cancer treatment.
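The substitution-based augmentation described above — swap one drug in a measured combination for a pharmacologically similar one and keep the synergy score — can be sketched as follows. The similarity map, drug names, and score are invented for illustration; the paper derives similarity from its own pharmacological metric.

```python
def augment(combos, similar):
    """Create new combination records by swapping in similar drugs,
    keeping the measured synergy score of the original pair."""
    out = list(combos)
    for drug_a, drug_b, score in combos:
        for sub in similar.get(drug_a, []):
            out.append((sub, drug_b, score))
        for sub in similar.get(drug_b, []):
            out.append((drug_a, sub, score))
    return out

# One measured combination and one assumed high-similarity drug pair.
combos = [("gefitinib", "trametinib", 12.4)]
similar = {"gefitinib": ["erlotinib"]}
augmented = augment(combos, similar)
```

Each similar drug multiplies the dataset, which is how 8,798 measured combinations can grow into millions of training instances.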
62. Ghosh T, Han Y, Raju V, Hossain D, McCrory MA, Higgins J, Boushey C, Delp EJ, Sazonov E. Integrated image and sensor-based food intake detection in free-living. Sci Rep 2024;14:1665. PMID: 38238423; PMCID: PMC10796396; DOI: 10.1038/s41598-024-51687-3. Received 11/19/2021; accepted 01/08/2024.
Abstract
The first step in any dietary monitoring system is the automatic detection of eating episodes. To detect eating episodes, either sensor data or images can be used, and either method can result in false-positive detection. This study aims to reduce the number of false positives in the detection of eating episodes by a wearable sensor, Automatic Ingestion Monitor v2 (AIM-2). Thirty participants wore the AIM-2 for two days each (pseudo-free-living and free-living). The eating episodes were detected by three methods: (1) recognition of solid foods and beverages in images captured by AIM-2; (2) recognition of chewing from the AIM-2 accelerometer sensor; and (3) hierarchical classification to combine confidence scores from image and accelerometer classifiers. The integration of image- and sensor-based methods achieved 94.59% sensitivity, 70.47% precision, and 80.77% F1-score in the free-living environment, which is significantly better than either of the original methods (8% higher sensitivity). The proposed method successfully reduces the number of false positives in the detection of eating episodes.
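The fusion step — combining confidence scores from the image and accelerometer classifiers — can be illustrated with a simple two-level rule: accept when either detector is very confident, or when both are moderately confident. The thresholds and the rule itself are assumptions for illustration; the paper uses a trained hierarchical classifier rather than fixed cutoffs.

```python
def detect_eating(image_conf: float, chew_conf: float,
                  solo: float = 0.9, joint: float = 0.5) -> bool:
    """Fuse image and chewing-sensor confidences into one decision."""
    if image_conf >= solo or chew_conf >= solo:
        return True                      # one detector alone is sure
    return image_conf >= joint and chew_conf >= joint  # both agree weakly
```

Requiring agreement in the mid-confidence band is what suppresses the false positives that each modality produces on its own.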
63. Ali L, Javeed A, Noor A, Rauf HT, Kadry S, Gandomi AH. Parkinson's disease detection based on features refinement through L1 regularized SVM and deep neural network. Sci Rep 2024;14:1333. PMID: 38228772; PMCID: PMC10791701; DOI: 10.1038/s41598-024-51600-y. Received 06/05/2023; accepted 01/07/2024.
Abstract
In previous studies, replicated and multiple types of speech data have been used for Parkinson's disease (PD) detection. However, two main problems in these studies are lower PD detection accuracy and inappropriate validation methodologies leading to unreliable results. This study discusses the effects of inappropriate validation methodologies used in previous studies and highlights the use of appropriate alternative validation methods that would ensure generalization. To enhance PD detection accuracy, we propose a two-stage diagnostic system that refines the extracted set of features through an L1-regularized linear support vector machine and classifies the refined subset of features through a deep neural network. To rigorously evaluate the effectiveness of the proposed diagnostic system, experiments are performed on two different voice recording-based benchmark datasets. For both datasets, the proposed diagnostic system achieves 100% accuracy under leave-one-subject-out (LOSO) cross-validation (CV) and 97.5% accuracy under k-fold CV. The results show that the proposed system outperforms the existing methods regarding PD detection accuracy. The results suggest that the proposed diagnostic system is essential to improving non-invasive diagnostic decision support in PD.
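The validation point above hinges on leave-one-subject-out (LOSO) CV: every recording of a held-out subject is excluded from training, so the model cannot exploit speaker identity. A minimal fold generator (subject IDs are invented for the example):

```python
def loso_folds(subject_ids):
    """Leave-one-subject-out cross-validation folds.

    Each fold holds out every sample of one subject, so a subject never
    appears in both train and test indices."""
    subjects = sorted(set(subject_ids))
    folds = []
    for held_out in subjects:
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        folds.append((train, test))
    return folds

# Six samples from three subjects:
ids = ["s1", "s1", "s2", "s2", "s2", "s3"]
folds = loso_folds(ids)
```

Plain k-fold CV on the same data could place replicated recordings of one speaker in both splits, which is precisely the leakage the authors warn against (scikit-learn's `LeaveOneGroupOut` implements the same scheme).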
64. Kini AS, Prema KV, Pai SN. Early stage black pepper leaf disease prediction based on transfer learning using ConvNets. Sci Rep 2024;14:1404. PMID: 38228767; PMCID: PMC10791634; DOI: 10.1038/s41598-024-51884-0. Received 08/17/2023; accepted 01/10/2024.
Abstract
Plants are exposed to diseases, insect pests, and fungi, which cause heavy crop damage and result in various leaf diseases. Leaf diseases can be diagnosed at an early stage with the aid of a smart computer vision system, enabling timely disease prevention. Black pepper is a medicinal plant that is extensively used in Ayurvedic medicine because of its therapeutic properties. The proposed work presents an intelligent transfer learning technique, implemented with state-of-the-art deep convolutional neural networks, to predict the presence of prominent diseases in black pepper leaves. The publicly available ImageNet dataset is used to train the deep neural networks, which are then applied to a newly developed black pepper leaf image dataset. The dataset consists of real-world leaf images taken directly in the fields and annotated under the supervision of an expert. The leaf diseases considered are anthracnose, slow wilt, early-stage phytophthora, phytophthora, and yellowing. The hyperparameters tuned in the deep learning models include the initial learning rate, optimization algorithm, batch size, number of epochs, and the training/validation split. With a learning rate of 0.001, accuracy ranges from 99.1 to 99.7% for the Inception V3, GoogleNet, SqueezeNet, and ResNet18 models; the proposed ResNet18 model outperforms all other models at 99.67% accuracy, with high validation accuracy and low validation loss throughout. This work represents an advance for agriculture: a cutting-edge deep neural network approach for early-stage black pepper leaf disease identification and prediction.
65. Meisburger SP, Ando N. Scaling and merging macromolecular diffuse scattering with mdx2. bioRxiv [preprint] 2024:2024.01.16.575887. PMID: 38293202; PMCID: PMC10827198; DOI: 10.1101/2024.01.16.575887.
Abstract
Diffuse scattering is a promising method to gain additional insight into protein dynamics from macromolecular crystallography (MX) experiments. Bragg intensities yield the average electron density, while the diffuse scattering can be processed to obtain a three-dimensional reciprocal space map, which is further analyzed to determine correlated motion. To make diffuse scattering techniques more accessible, we have created data processing software called mdx2 that is both convenient to use and simple to extend and modify. Mdx2 is written in Python, and it interfaces with DIALS to implement self-contained data reduction workflows. Data are stored in NeXus format for software interchange and convenient visualization. Mdx2 can be run on the command line or imported as a package, for instance to encapsulate a complete workflow in a Jupyter notebook for reproducible computing and education. Here, we describe mdx2 version 1.0, a new release incorporating state-of-the-art techniques for data reduction. We describe the implementation of a complete multi-crystal scaling and merging workflow, and test the methods using a high-redundancy dataset from cubic insulin. We show that redundancy can be leveraged during scaling to correct systematic errors and to obtain accurate and reproducible measurements of weak diffuse signals.
66. Taber CB, Sharma S, Raval MS, Senbel S, Keefe A, Shah J, Patterson E, Nolan J, Sertac Artan N, Kaya T. A holistic approach to performance prediction in collegiate athletics: player, team, and conference perspectives. Sci Rep 2024;14:1162. PMID: 38216641; PMCID: PMC10786827; DOI: 10.1038/s41598-024-51658-8. Received 05/23/2023; accepted 01/08/2024.
Abstract
Predictive sports data analytics can be revolutionary for sports performance. Existing literature discusses players' or teams' performance, independently or in tandem. Using machine learning (ML), this paper aims to holistically evaluate player-, team-, and conference (season)-level performances in Division-1 Women's basketball. The players were monitored and tested through a full competitive year. Performance was quantified at the player level using the reactive strength index modified (RSImod), at the team level by the game score (GS) metric, and at the conference level through Player Efficiency Rating (PER). The data include parameters from training, subjective stress, sleep, and recovery (WHOOP straps), in-game statistics (Polar monitors), and countermovement jumps. We used data balancing techniques and an Extreme Gradient Boosting (XGB) classifier to predict RSImod and GS with greater than 90% accuracy and a 0.9 F1 score. The XGB regressor predicted PER with an MSE of 0.026 and an R2 of 0.680. An ensemble of Random Forest, XGB, and correlation analysis identified feature importance at all levels, and Partial Dependence Plots were used to understand the impact of each feature on the target variable. Quantifying and predicting performance at all levels will allow coaches to monitor athlete readiness and help improve training.
67. Abohassan M, El-Basyouny K. Leveraging LiDAR-Based Simulations to Quantify the Complexity of the Static Environment for Autonomous Vehicles in Rural Settings. Sensors (Basel) 2024;24:452. PMID: 38257547; PMCID: PMC10820782; DOI: 10.3390/s24020452. Received 11/25/2023; revised 01/05/2024; accepted 01/09/2024.
Abstract
This paper uses virtual simulations to examine the interaction between autonomous vehicles (AVs) and their surrounding environment. A framework was developed to estimate the environment's complexity by calculating the real-time data processing requirements for AVs to navigate effectively. The VISTA simulator was used to synthesize viewpoints to replicate the captured environment accurately. With an emphasis on static physical features, roadways were dissected into relevant road features (RRFs) and the full environment (FE) to study the impact of roadside features on scene complexity and to demonstrate the gravity of wildlife-vehicle collisions (WVCs) for AVs. The results indicate that roadside features can increase environmental complexity by up to 400%. Adding a single lane to the road increased processing requirements by 12.3-16.5%. Crest vertical curves decrease data rates due to occlusion challenges, with a reported average of 4.2% data loss, while sag curves can increase complexity by 7%. In horizontal curves, roadside occlusion contributed to severe loss of road information, decreasing data rate requirements by as much as 19%. As for weather conditions, heavy rain increased the AV's processing demands by a staggering 240% compared to normal weather. AV developers and government agencies can exploit the findings of this study to better tailor AV designs and meet the necessary infrastructure requirements.
Collapse
|
68
|
Berenguer A, Morejón A, Tomás D, Mazón JN. Using Large Language Models to Enhance the Reusability of Sensor Data. SENSORS (BASEL, SWITZERLAND) 2024; 24:347. [PMID: 38257439 PMCID: PMC10818398 DOI: 10.3390/s24020347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Revised: 12/31/2023] [Accepted: 01/05/2024] [Indexed: 01/24/2024]
Abstract
The Internet of Things generates vast data volumes via diverse sensors, yet its potential remains unexploited for innovative data-driven products and services. Limitations arise from sensor-dependent data handling by manufacturers and user companies, hindering third-party access and comprehension. Initiatives like the European Data Act aim to enable high-quality access to sensor-generated data by regulating accuracy, completeness, and relevance while respecting intellectual property rights. Despite data availability, interoperability challenges impede sensor data reusability. For instance, sensor data shared in HTML formats requires intricate, time-consuming processing to attain reusable formats like JSON or XML. This study introduces a methodology aimed at converting raw sensor data extracted from web portals into structured formats, thereby enhancing data reusability. The approach utilises large language models to derive structured formats from sensor data initially presented in non-interoperable formats. The effectiveness of these language models was assessed through quantitative and qualitative evaluations in a use case involving meteorological data. In the proposed experiments, GPT-4, the best-performing LLM tested, demonstrated the feasibility of this methodology, achieving a precision of 93.51% and a recall of 85.33% in converting HTML to JSON/XML, thus confirming its potential in obtaining reusable sensor data.
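The precision and recall figures above come from comparing LLM-extracted records against gold-standard structured data. A minimal sketch of that evaluation step, with invented field names and values (the paper's actual metrics may be computed at a different granularity):

```python
# Sketch of the quantitative evaluation described above: compare the
# key-value pairs an LLM extracts from HTML against a gold JSON record and
# report precision/recall. The records below are invented for illustration.

def precision_recall(predicted: dict, gold: dict):
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {"station": "A1", "temp_c": "21.4", "humidity": "63", "wind_kmh": "12"}
pred = {"station": "A1", "temp_c": "21.4", "humidity": "64"}  # one wrong field

p, r = precision_recall(pred, gold)
# p = 2/3 (two of three predicted fields correct)
# r = 2/4 (two of four gold fields recovered)
```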
Collapse
|
69
|
Jaros R, Tomicova E, Martinek R. Template subtraction based methods for non-invasive fetal electrocardiography extraction. Sci Rep 2024; 14:630. [PMID: 38182757 PMCID: PMC10770155 DOI: 10.1038/s41598-024-51213-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2023] [Accepted: 01/02/2024] [Indexed: 01/07/2024] Open
Abstract
Assessment of fetal heart rate (fHR) through non-invasive fetal electrocardiography (fECG) is a challenging task. This study compares the performance of five template subtraction (TS) methods on a Labor dataset (12 5-min recordings) and a Pregnancy dataset (10 20-min recordings). The methods include TS without adaptation, TS using singular value decomposition (TSsvd), TS using linear prediction (TSlp), TS using a scaling factor (TSsc), and sequential analysis (SA). The influence of the chosen maternal wavelet for the continuous wavelet transform (CWT) detector is also compared. The F1 score was used to measure performance. Each recording in both datasets consisted of four signals, resulting in a total comparison of 88 signals for the TS-based methods. The study reported the following results: F1 = 95.71% with TS, F1 = 95.93% with TSsvd, F1 = 95.30% with TSlp, F1 = 95.82% with TSsc, and F1 = 95.99% with SA. The study identified gaus3 as the most suitable maternal wavelet for fetal R-peak detection using the CWT detector. Furthermore, the study classified signals from the tested datasets into categories of high, medium, and low quality, providing valuable insights for subsequent fECG signal extraction. This research contributes to advancing the understanding of non-invasive fECG signal processing and lays the groundwork for improving fetal monitoring in clinical settings.
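The baseline variant compared above, plain template subtraction without adaptation, can be sketched in a few lines: average the maternal beats around known maternal R-peaks and subtract that template at each peak, leaving the fetal residual. Peak locations are assumed given (e.g. from a CWT detector); the function name and toy signal are illustrative, not from the paper:

```python
import numpy as np

# Minimal sketch of plain template subtraction ("TS without adaptation"):
# average the maternal beats around known maternal R-peaks, then subtract
# the template at each peak. The fetal component survives in the residual.

def template_subtract(signal, m_peaks, half_win):
    beats = np.stack([signal[p - half_win:p + half_win] for p in m_peaks])
    template = beats.mean(axis=0)            # average maternal beat
    out = signal.copy()
    for p in m_peaks:
        out[p - half_win:p + half_win] -= template
    return out

# Toy example: identical maternal beats cancel exactly.
sig = np.zeros(100)
peaks = [20, 50, 80]
for p in peaks:
    sig[p - 5:p + 5] += np.hanning(10)       # same beat shape at every peak
residual = template_subtract(sig, peaks, 5)
```

In real recordings the maternal beats vary, which is exactly why the adaptive variants (SVD, linear prediction, scaling) exist.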
Collapse
|
70
|
Fernández Requena B, Nadeem S, Reddy VP, Naidoo V, Glasgow JN, Steyn AJC, Barbas C, Gonzalez-Riano C. LiLA: lipid lung-based ATLAS built through a comprehensive workflow designed for an accurate lipid annotation. Commun Biol 2024; 7:45. [PMID: 38182666 PMCID: PMC10770321 DOI: 10.1038/s42003-023-05680-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Accepted: 12/06/2023] [Indexed: 01/07/2024] Open
Abstract
Accurate lipid annotation is crucial for understanding the role of lipids in health and disease and identifying therapeutic targets. However, annotating the wide variety of lipid species in biological samples remains challenging in untargeted lipidomic studies. In this work, we present a lipid annotation workflow based on LC-MS and MS/MS strategies, the combination of four bioinformatic tools, and a decision tree to support the accurate annotation and semi-quantification of the lipid species present in lung tissue from control mice. The proposed workflow allowed us to generate a lipid lung-based ATLAS (LiLA), which was then employed to unveil the lipidomic signatures of Mycobacterium tuberculosis infection at two different time points for a deeper understanding of the disease progression. This workflow, combined with manual inspection strategies of MS/MS data, can enhance the annotation process for lipidomic studies and guide the generation of sample-specific lipidome maps. LiLA serves as a freely available data resource that can be employed in future studies to address lipidomic alterations in mouse lung tissue.
Collapse
|
71
|
Witman N, Zhou C, Häneke T, Xiao Y, Huang X, Rohner E, Sohlmér J, Grote Beverborg N, Lehtinen ML, Chien KR, Sahara M. Author Correction: Placental growth factor exerts a dual function for cardiomyogenesis and vasculogenesis during heart development. Nat Commun 2024; 15:283. [PMID: 38177121 PMCID: PMC10766948 DOI: 10.1038/s41467-023-44507-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2024] Open
|
72
|
Yuan Y, Du J, Luo J, Zhu Y, Huang Q, Zhang M. Discrimination of missing data types in metabolomics data based on particle swarm optimization algorithm and XGBoost model. Sci Rep 2024; 14:152. [PMID: 38168582 PMCID: PMC10762217 DOI: 10.1038/s41598-023-50646-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Accepted: 12/22/2023] [Indexed: 01/05/2024] Open
Abstract
In the field of data analysis, large numbers of missing values are often encountered, and in metabolomics data this problem is especially prominent. Data imputation is a common way to deal with missing metabolomics data, but traditional imputation methods usually ignore the differences between missing types, so the imputation results are unsatisfactory. To discriminate the missing types of metabolomics data, a missing data classification model (PX-MDC) based on the particle swarm optimization algorithm and XGBoost is proposed in this paper. First, the largest subset of complete data is obtained by screening out the missing values in a given data set, and the particle swarm optimization algorithm is then used to search for the concentration threshold of missing data and the proportion of low-concentration deletions relative to overall deletions. Next, missing data are simulated based on the search results. Finally, an XGBoost model is trained on the feature set proposed in this paper to build a classifier for the missing data. The experimental results show that, in the concentration-threshold search, the particle swarm optimization algorithm matches the traditional enumeration method in accuracy while significantly reducing the search time. Compared with current mainstream methods, the PX-MDC model designed in this paper exhibits higher accuracy and is able to distinguish different deletion types for the same metabolite. This study is expected to make an important breakthrough in metabolomics data imputation and provide strong support for research in related fields.
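The threshold search described above is a one-dimensional particle swarm optimization. A self-contained sketch of that search step, with a stand-in objective (the paper's actual objective involves simulated missingness patterns, and all parameter values here are illustrative):

```python
import random

# Toy sketch of the particle-swarm search step: particles explore a 1-D
# concentration-threshold space to minimize an objective. The quadratic
# objective below is a stand-in with a known minimum.

def pso_search(objective, lo, hi, n_particles=20, iters=60, seed=0):
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                        # personal bests
    gbest = min(pos, key=objective)       # global best
    for _ in range(iters):
        for i in range(n_particles):
            # inertia + cognitive + social terms (standard PSO update)
            vel[i] = (0.5 * vel[i]
                      + 1.5 * rng.random() * (pbest[i] - pos[i])
                      + 1.5 * rng.random() * (gbest - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            if objective(pos[i]) < objective(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=objective)
    return gbest

# Stand-in objective with a known minimum at threshold = 3.0
best = pso_search(lambda t: (t - 3.0) ** 2, 0.0, 10.0)
```

Against enumeration over a fine grid, the swarm evaluates far fewer candidate thresholds, which is the speed advantage the abstract reports.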
Collapse
|
73
|
Yi Y, Billor N, Ekstrom A, Zheng J. CW_ICA: an efficient dimensionality determination method for independent component analysis. Sci Rep 2024; 14:143. [PMID: 38167428 PMCID: PMC10762178 DOI: 10.1038/s41598-023-49355-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Accepted: 12/07/2023] [Indexed: 01/05/2024] Open
Abstract
Independent component analysis (ICA) is a widely used blind source separation method for signal pre-processing. The determination of the number of independent components (ICs) is crucial for achieving optimal performance, as an incorrect choice can result in either under-decomposition or over-decomposition. In this study, we propose a robust method to automatically determine the optimal number of ICs, named column-wise independent component analysis (CW_ICA). CW_ICA divides the mixed signals into two blocks and applies ICA separately to each block. A quantitative measure, derived from the rank-based correlation matrix computed from the ICs of the two blocks, is utilized to determine the optimal number of ICs. The proposed method is validated and compared with existing determination methods using simulation and scalp EEG data. The results demonstrate that CW_ICA is a reliable and robust approach for determining the optimal number of ICs. It offers computational efficiency and can be seamlessly integrated with different ICA methods.
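The core idea, components that are genuine should reappear in both blocks while noise components should not, can be sketched with a rank-based cross-correlation matrix. In this sketch the ICA step itself is omitted and the "components" are synthetic; the function names and the 0.9 threshold are assumptions, not the paper's actual criterion:

```python
import numpy as np

# Sketch of the CW_ICA matching criterion: given components estimated
# separately from two halves of the data, build a rank-based (Spearman)
# cross-correlation matrix and count components that reliably reappear
# in both blocks.

def spearman_matrix(A, B):
    """Spearman correlation between columns of A and columns of B."""
    rank = lambda M: np.argsort(np.argsort(M, axis=0), axis=0).astype(float)
    Ra, Rb = rank(A), rank(B)
    Ra -= Ra.mean(axis=0)
    Rb -= Rb.mean(axis=0)
    norms = np.outer(np.linalg.norm(Ra, axis=0), np.linalg.norm(Rb, axis=0))
    return (Ra.T @ Rb) / norms

def n_reproducible(A, B, thresh=0.9):
    """Components whose best cross-block match exceeds the threshold."""
    M = np.abs(spearman_matrix(A, B))
    return int((M.max(axis=1) > thresh).sum())

t = np.linspace(0, 1, 500)
rng = np.random.default_rng(0)
true_src = np.column_stack([np.sin(2 * np.pi * 5 * t),
                            np.sign(np.sin(2 * np.pi * 3 * t))])
block1 = np.column_stack([true_src, rng.normal(size=(500, 2))])  # 2 real + 2 noise
block2 = np.column_stack([true_src, rng.normal(size=(500, 2))])
k = n_reproducible(block1, block2)  # only the two genuine sources match
```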
Collapse
|
74
|
Wanichthanarak K, In-on A, Fan S, Fiehn O, Wangwiwatsin A, Khoomrung S. Data processing solutions to render metabolomics more quantitative: case studies in food and clinical metabolomics using Metabox 2.0. Gigascience 2024; 13:giae005. [PMID: 38488666 PMCID: PMC10941642 DOI: 10.1093/gigascience/giae005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2023] [Revised: 12/22/2023] [Accepted: 02/02/2024] [Indexed: 03/18/2024] Open
Abstract
In classic semiquantitative metabolomics, metabolite intensities are affected by biological factors and other unwanted variations. A systematic evaluation of the data processing methods is crucial to identify adequate processing procedures for a given experimental setup. Current comparative studies are mostly focused on peak area data but not on absolute concentrations. In this study, we evaluated data processing methods to produce outputs that were most similar to the corresponding absolute quantified data. We examined the data distribution characteristics, fold difference patterns between 2 metabolites, and sample variance. We used 2 metabolomic datasets from a retail milk study and a lupus nephritis cohort as test cases. When studying the impact of data normalization, transformation, scaling, and combinations of these methods, we found that the cross-contribution compensating multiple standard normalization (ccmn) method, followed by square root data transformation, was most appropriate for a well-controlled study such as the milk study dataset. Regarding the lupus nephritis cohort study, only ccmn normalization could slightly improve the data quality of the noisy cohort. Since the assessment accounted for the resemblance between processed data and the corresponding absolute quantified data, our results provide a helpful guideline for processing metabolomic datasets within a similar context (food and clinical metabolomics). Finally, we introduce Metabox 2.0, which enables thorough analysis of metabolomic data, including data processing, biomarker analysis, integrative analysis, and data interpretation. It was successfully used to process and analyze the data in this study. An online web version is available at http://metsysbio.com/metabox.
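The winning processing chain above is a per-sample normalization followed by a square-root transformation. A hedged sketch of that chain on a synthetic samples-by-metabolites matrix; a simple median fold-change normalization stands in here for ccmn, which is considerably more involved:

```python
import numpy as np

# Hedged sketch of the processing chain evaluated above: per-sample
# normalization (simple median fold-change as a stand-in for ccmn)
# followed by square-root transformation to stabilize variance.

def normalize_then_sqrt(X):
    ref = np.median(X, axis=0)               # reference metabolite profile
    factors = np.median(X / ref, axis=1)     # per-sample fold change vs reference
    return np.sqrt(X / factors[:, None])     # remove dilution, then transform

rng = np.random.default_rng(1)
base = rng.lognormal(mean=2.0, sigma=0.5, size=(6, 20))   # 6 samples x 20 metabolites
dilution = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 2.0])       # unwanted per-sample variation
X = base * dilution[:, None]
Y = normalize_then_sqrt(X)
```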
Collapse
|
75
|
Ringbauer H, Huang Y, Akbari A, Mallick S, Olalde I, Patterson N, Reich D. Accurate detection of identity-by-descent segments in human ancient DNA. Nat Genet 2024; 56:143-151. [PMID: 38123640 PMCID: PMC10786714 DOI: 10.1038/s41588-023-01582-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Accepted: 10/20/2023] [Indexed: 12/23/2023]
Abstract
Long DNA segments shared between two individuals, known as identity-by-descent (IBD), reveal recent genealogical connections. Here we introduce ancIBD, a method for identifying IBD segments in ancient human DNA (aDNA) using a hidden Markov model and imputed genotype probabilities. We demonstrate that ancIBD accurately identifies IBD segments >8 cM for aDNA data with an average depth of >0.25× for whole-genome sequencing or >1× for 1240k single nucleotide polymorphism capture data. Applying ancIBD to 4,248 ancient Eurasian individuals, we identify relatives up to the sixth degree and genealogical connections between archaeological groups. Notably, we reveal long IBD sharing between Corded Ware and Yamnaya groups, indicating that the Yamnaya herders of the Pontic-Caspian Steppe and the Steppe-related ancestry in various European Corded Ware groups share substantial co-ancestry within only a few hundred years. These results show that detecting IBD segments can generate powerful insights into the growing aDNA record, both on a small scale relevant to life stories and on a large scale relevant to major cultural-historical events.
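The modelling idea behind ancIBD, a hidden Markov model over genotype observations, can be illustrated with a tiny two-state (IBD vs non-IBD) Viterbi decoder. This is a didactic sketch only: the observations here are binarized allele sharing, and the transition and emission probabilities are invented, not the published parameters, which operate on imputed genotype probabilities:

```python
import numpy as np

# Minimal two-state HMM sketch: decode per-site observations of whether two
# individuals share an allele into IBD / non-IBD segments via Viterbi.
# All probabilities below are illustrative.

def viterbi_ibd(obs, p_stay=0.99, p_share_ibd=0.95, p_share_bg=0.6):
    log_emit = np.log(np.array([[1 - p_share_bg, p_share_bg],      # state 0: non-IBD
                                [1 - p_share_ibd, p_share_ibd]]))  # state 1: IBD
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))
    n = len(obs)
    dp = np.full((n, 2), -np.inf)
    back = np.zeros((n, 2), dtype=int)
    dp[0] = np.log([0.99, 0.01]) + log_emit[:, obs[0]]
    for i in range(1, n):
        for s in range(2):
            scores = dp[i - 1] + trans[:, s]
            back[i, s] = scores.argmax()
            dp[i, s] = scores.max() + log_emit[s, obs[i]]
    path = [int(dp[-1].argmax())]
    for i in range(n - 1, 0, -1):          # backtrack
        path.append(back[i, path[-1]])
    return path[::-1]

# A long run of shared alleles in the middle decodes as an IBD segment.
obs = [1, 0, 1, 0, 1] * 10 + [1] * 50 + [0, 1, 0, 1, 1] * 10
states = viterbi_ibd(obs)
```

The sticky transitions (p_stay near 1) are what make the decoder favour long contiguous segments, mirroring the >8 cM segment-length focus described above.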
Collapse
|