51. Liu F, Yuan C, Chen H, Yang F. Prediction of linear B-cell epitopes based on protein sequence features and BERT embeddings. Sci Rep 2024;14:2464. PMID: 38291341; PMCID: PMC10828400; DOI: 10.1038/s41598-024-53028-w. Received 09/06/2023; accepted 01/26/2024.
Abstract
Linear B-cell epitopes (BCEs) play a key role in the development of peptide vaccines and immunodiagnostic reagents. The accurate identification of linear BCEs is therefore of great importance in the prevention of infectious diseases and the diagnosis of related conditions. Experimental methods for identifying BCEs are expensive and time-consuming and cannot meet the demand for identification across large-scale protein sequence data, so an efficient and accurate computational method is needed to rapidly identify linear BCE sequences. In this work, we developed LBCE-BERT, a new linear BCE prediction method based on peptide sequence information and embedding information from the BERT natural language model, using an XGBoost classifier. The model was trained on three benchmark datasets for hyperparameter selection and was subsequently evaluated on several test datasets. The results indicate that our proposed method outperforms others in terms of AUROC and accuracy. The LBCE-BERT model is publicly available at: https://github.com/Lfang111/LBCE-BERT .
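The feature pipeline this abstract describes — hand-crafted sequence features concatenated with language-model embeddings, then fed to a gradient-boosted classifier — can be sketched as follows. This is a minimal illustration, not the authors' code: the amino-acid-composition feature, the function names, and the mock 8-dimensional "BERT" embedding are assumptions, and the XGBoost classifier itself is omitted.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(peptide: str) -> list[float]:
    """Amino-acid composition: fraction of each of the 20 standard residues."""
    counts = Counter(peptide)
    n = len(peptide)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

def build_feature_vector(peptide: str, bert_embedding: list[float]) -> list[float]:
    """Concatenate hand-crafted sequence features with a language-model
    embedding, in the spirit of hybrid epitope predictors."""
    return aac_features(peptide) + list(bert_embedding)

# Toy example: a 9-mer peptide and a mock 8-dimensional embedding.
vec = build_feature_vector("ACDEFGHIK", [0.1] * 8)
```

In a real pipeline the embedding would come from a protein language model and `vec` would be one row of the matrix passed to the classifier.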
52. Ennis D, Shmorak S, Jantscher-Krenn E, Yassour M. Longitudinal quantification of Bifidobacterium longum subsp. infantis reveals late colonization in the infant gut independent of maternal milk HMO composition. Nat Commun 2024;15:894. PMID: 38291346; PMCID: PMC10827747; DOI: 10.1038/s41467-024-45209-y. Received 07/19/2023; accepted 01/15/2024.
Abstract
Breast milk contains human milk oligosaccharides (HMOs) that cannot be digested by infants, yet nourish their developing gut microbiome. While Bifidobacterium are the best-known utilizers of individual HMOs, a longitudinal study examining the evolving microbial community at high-resolution coupled with mothers' milk HMO composition is lacking. Here, we developed a high-throughput method to quantify Bifidobacterium longum subsp. infantis (BL. infantis), a proficient HMO-utilizer, and applied it to a longitudinal cohort consisting of 21 mother-infant dyads. We observed substantial changes in the infant gut microbiome over the course of several months, while the HMO composition in mothers' milk remained relatively stable. Although Bifidobacterium species significantly influenced sample variation, no specific HMOs correlated with Bifidobacterium species abundance. Surprisingly, we found that BL. infantis colonization began late in the breastfeeding period both in our cohort and in other geographic locations, highlighting the importance of focusing on BL. infantis dynamics in the infant gut.
53. Babaei Rikan S, Sorayaie Azar A, Naemi A, Bagherzadeh Mohasefi J, Pirnejad H, Wiil UK. Survival prediction of glioblastoma patients using modern deep learning and machine learning techniques. Sci Rep 2024;14:2371. PMID: 38287149; PMCID: PMC10824760; DOI: 10.1038/s41598-024-53006-2. Received 04/19/2023; accepted 01/25/2024.
Abstract
In this study, we utilized data from the Surveillance, Epidemiology, and End Results (SEER) database to predict survival outcomes of glioblastoma patients. To assess dataset skewness and detect feature importance, we applied Pearson's second coefficient of skewness and the Ordinary Least Squares method, respectively. Using two sampling strategies, holdout and five-fold cross-validation, we developed five machine learning (ML) models alongside a feed-forward deep neural network (DNN) for multiclass classification and regression prediction of glioblastoma patient survival. After balancing the classification and regression datasets, we obtained 46,340 and 28,573 samples, respectively. Shapley additive explanations (SHAP) were then used to explain the decision-making process of the best model. In both classification and regression tasks, and across both holdout and cross-validation sampling strategies, the DNN consistently outperformed the ML models. Notably, classification accuracy was 90.25% and 90.22% for holdout and five-fold cross-validation, respectively, while the corresponding R2 values were 0.6565 and 0.6622. SHAP analysis revealed age at diagnosis as the most influential feature in the DNN's survival predictions. These findings suggest that the DNN holds promise as a practical auxiliary tool for clinicians, aiding them in optimal decision-making concerning the treatment and care trajectories for glioblastoma patients.
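SHAP itself requires the `shap` package and a trained model; as a rough stand-in that conveys the same question ("how much does destroying one feature hurt the model?"), here is a permutation-importance sketch on synthetic survival-like data in which feature 0 plays the role of "age at diagnosis" and drives the outcome by construction. Everything below is an illustrative assumption, not the study's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 0 fully determines the binary outcome.
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

def accuracy(model, X, y):
    return float(np.mean(model(X) == y))

# A trivial "trained" model: threshold on a least-squares linear score.
w = np.linalg.lstsq(X, y - 0.5, rcond=None)[0]
model = lambda X: (X @ w > 0).astype(int)

base = accuracy(model, X, y)
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
    importance.append(base - accuracy(model, Xp, y))
```

As expected, shuffling feature 0 collapses accuracy while shuffling the noise features barely matters, mirroring how SHAP singled out age at diagnosis.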
54. Tyler SR, Lozano-Ojalvo D, Guccione E, Schadt EE. Anti-correlated feature selection prevents false discovery of subpopulations in scRNAseq. Nat Commun 2024;15:699. PMID: 38267438; PMCID: PMC10808220; DOI: 10.1038/s41467-023-43406-9. Received 12/13/2022; accepted 11/07/2023.
Abstract
While sub-clustering cell populations has become popular in single-cell omics, negative controls for this process are lacking. Popular feature-selection/clustering algorithms fail the null-dataset problem, allowing erroneous subdivisions of homogeneous clusters until nearly every cell is called its own cluster. Using real and synthetic datasets, we find that anti-correlated gene selection reduces or eliminates erroneous subdivisions, increases marker-gene selection efficacy, and efficiently scales to millions of cells.
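One way to realize the idea of anti-correlated gene selection — keep only genes that have at least one strongly anti-correlated partner, since pure noise rarely does — is sketched below. The threshold of -0.5 and the function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def anticorrelated_features(X: np.ndarray, threshold: float = -0.5) -> list[int]:
    """Keep features (genes) with at least one strongly anti-correlated
    partner; uncorrelated noise features are discarded."""
    corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlations
    np.fill_diagonal(corr, 0.0)
    return [j for j in range(X.shape[1]) if corr[:, j].min() <= threshold]

rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)
X = np.column_stack([
    a,                                   # gene 0
    -a + 0.1 * rng.normal(size=n),       # gene 1: anti-correlated with gene 0
    rng.normal(size=n),                  # gene 2: pure noise
])
selected = anticorrelated_features(X)
```

The noise gene is dropped, so a downstream clustering step never sees the features that drive spurious subdivisions.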
55. Sun Z, Zhang L, Wang R, Wang Z, Liang X, Gao J. Identification of shared pathogenetic mechanisms between COVID-19 and IC through bioinformatics and system biology. Sci Rep 2024;14:2114. PMID: 38267482; PMCID: PMC10808107; DOI: 10.1038/s41598-024-52625-z. Received 09/22/2023; accepted 01/22/2024.
Abstract
COVID-19 has caused substantial global mortality since 2019, and cystitis has emerged as a contributing factor in SARS-CoV-2 infection and COVID-19 complications; the molecular links between interstitial cystitis (IC) and COVID-19, however, remain unclear. This study investigates the molecular mechanisms of COVID-19-associated cystitis (CAC) and identifies candidate drugs using bioinformatics and systems biology. We obtained gene expression profiles for IC (GSE11783) and COVID-19 (GSE147507) from the Gene Expression Omnibus (GEO) database and identified the differentially expressed genes (DEGs) common to both conditions, extracting a number of key genes from this group. We then performed Gene Ontology (GO) functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis on the DEGs, and constructed a protein-protein interaction (PPI) network, a transcription factor (TF)-gene regulatory network, a TF-miRNA regulatory network, and a gene-disease association network. Hub genes were identified and extracted from the PPI network and used to build nomogram diagnostic prediction models. The DSigDB database was used to predict candidate molecular drugs associated with the common DEGs, and Receiver Operating Characteristic (ROC) curves were used to assess how accurately the hub genes and nomogram models diagnose IC and COVID-19. An IC dataset (GSE57560) and a COVID-19 dataset (GSE171110) were selected to validate the models' diagnostic accuracy. In total, 198 overlapping DEGs were found and selected for further analysis. FCER1G, ITGAM, LCP2, LILRB2, MNDA, SPI1, and TYROBP were screened as hub genes. The nomogram model built on these seven hub genes demonstrates significant utility as a diagnostic prediction model for both IC and COVID-19, and multiple candidate molecular drugs associated with the common DEGs were discovered. These pathways, hub genes, and models may provide new perspectives for future research into mechanisms and guide personalised and effective therapeutics for IC patients infected with COVID-19.
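Extracting hub genes from a PPI network, as done above, commonly means ranking nodes by degree (or a related centrality). A minimal degree-based sketch on a toy edge list — the edges below are invented for illustration, not the study's network:

```python
from collections import Counter

def hub_genes(edges, k=3):
    """Rank genes by degree in a protein-protein interaction network."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [gene for gene, _ in degree.most_common(k)]

# Toy PPI edge list using gene symbols from the abstract; the connectivity
# is made up for the example.
edges = [("TYROBP", "FCER1G"), ("TYROBP", "ITGAM"), ("TYROBP", "SPI1"),
         ("ITGAM", "FCER1G"), ("LCP2", "FCER1G")]
top = hub_genes(edges, k=2)
```

Real workflows (e.g. Cytoscape plugins) combine several centrality measures, but degree ranking is the core idea.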
56. Aarthy M, Pandiyan GN, Paramasivan R, Kumar A, Gupta B. Identification and prioritisation of potential vaccine candidates using subtractive proteomics and designing of a multi-epitope vaccine against Wuchereria bancrofti. Sci Rep 2024;14:1970. PMID: 38263422; PMCID: PMC10806236; DOI: 10.1038/s41598-024-52457-x. Received 06/26/2023; accepted 01/18/2024.
Abstract
This study employed subtractive proteomics and immunoinformatics to analyze the Wuchereria bancrofti proteome and identify potential therapeutic targets, with a focus on designing a vaccine against the parasite species. A comprehensive bioinformatics analysis of the parasite's proteome identified 51 probable therapeutic targets, among which "Kunitz/bovine pancreatic trypsin inhibitor domain-containing protein" was identified as the most promising vaccine candidate. The candidate protein was used to design a multi-epitope vaccine, incorporating B-cell and T-cell epitopes identified through various tools. The vaccine construct underwent extensive analysis of its antigenic, physical, and chemical features, including the determination of secondary and tertiary structures. Docking and molecular dynamics simulations were performed with HLA alleles, Toll-like receptor 4 (TLR4), and TLR3 to assess its potential to elicit the human immune response. Immune simulation analysis confirmed the predicted vaccine's strong binding affinity with immunoglobulins, indicating its potential efficacy in generating an immune response. However, experimental validation and testing of this multi-epitope vaccine construct would be needed to assess its potential against W. bancrofti and even for a broader range of lymphatic filarial infections given the similarities between W. bancrofti and Brugia.
57. Leung YY, Naj AC, Chou YF, Valladares O, Schmidt M, Hamilton-Nelson K, Wheeler N, Lin H, Gangadharan P, Qu L, Clark K, Kuzma AB, Lee WP, Cantwell L, Nicaretta H, Haines J, Farrer L, Seshadri S, Brkanac Z, Cruchaga C, Pericak-Vance M, Mayeux RP, Bush WS, Destefano A, Martin E, Schellenberg GD, Wang LS. Human whole-exome genotype data for Alzheimer's disease. Nat Commun 2024;15:684. PMID: 38263370; PMCID: PMC10805795; DOI: 10.1038/s41467-024-44781-7. Received 12/19/2022; accepted 01/02/2024.
Abstract
The heterogeneity of whole-exome sequencing (WES) data generation methods presents a challenge to joint analysis. Here we present a bioinformatics strategy for joint-calling 20,504 WES samples collected across nine studies and sequenced using ten capture kits in fourteen sequencing centers in the Alzheimer's Disease Sequencing Project. The jointly genotyped variant call format (VCF) file contains only positions within the union of the capture kits, and was then processed specifically to account for the batch effects arising from the use of different capture kits across studies. We identified 8.2 million autosomal variants. 96.82% of the variants are high-quality and are located in 28,579 Ensembl transcripts; 41% of the variants are intronic and 1.8% have CADD scores > 30, indicating high predicted pathogenicity. Here we show that our new strategy can generate high-quality data from these diversely generated WES samples. The improved ability to combine data sequenced in different batches benefits the whole genomics research community.
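Restricting a joint VCF to "positions within the union of capture kits" is, at its core, an interval-union computation. A minimal sketch (coordinates and kit contents are invented for illustration):

```python
def union_of_kits(kits):
    """Merge the targeted regions of several capture kits into their union;
    only positions inside this union would be retained in the joint VCF."""
    intervals = sorted(iv for kit in kits for iv in kit)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

kit_a = [(100, 200), (300, 400)]
kit_b = [(150, 250), (500, 600)]
regions = union_of_kits([kit_a, kit_b])
```

In practice this is done per chromosome with BED files and tools such as bedtools, but the merge logic is the same.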
58. Ashraf H, Waris A, Gilani SO, Shafiq U, Iqbal J, Kamavuako EN, Berrouche Y, Brüls O, Boutaayamou M, Niazi IK. Optimizing the performance of convolutional neural network for enhanced gesture recognition using sEMG. Sci Rep 2024;14:2020. PMID: 38263441; PMCID: PMC10805798; DOI: 10.1038/s41598-024-52405-9. Received 09/28/2023; accepted 01/18/2024.
Abstract
Deep neural networks (DNNs) have demonstrated higher performance results when compared to traditional approaches for implementing robust myoelectric control (MEC) systems. However, the delay induced by optimising a MEC remains a concern for real-time applications. As a result, an optimised DNN architecture based on fine-tuned hyperparameters is required. This study investigates the optimal configuration of convolutional neural network (CNN)-based MEC by proposing an effective data segmentation technique and a generalised set of hyperparameters. Firstly, two segmentation strategies (disjoint and overlap) and various segment and overlap sizes were studied to optimise segmentation parameters. Secondly, to address the challenge of optimising the hyperparameters of a DNN-based MEC system, the problem has been abstracted as an optimisation problem, and Bayesian optimisation has been used to solve it. From 20 healthy people, ten surface electromyography (sEMG) grasping movements abstracted from daily life were chosen as the target gesture set. With an ideal segment size of 200 ms and an overlap size of 80%, the results show that the overlap segmentation technique outperforms the disjoint segmentation technique (p-value < 0.05). In comparison to manual (12.76 ± 4.66), grid (0.10 ± 0.03), and random (0.12 ± 0.05) search hyperparameters optimisation strategies, the proposed optimisation technique resulted in a mean classification error rate (CER) of 0.08 ± 0.03 across all subjects. In addition, a generalised CNN architecture with an optimal set of hyperparameters is proposed. When tested separately on all individuals, the single generalised CNN architecture produced an overall CER of 0.09 ± 0.03. 
This study's significance lies in its contribution to the field of EMG signal processing by demonstrating the superiority of the overlap segmentation technique, optimizing CNN hyperparameters through Bayesian optimization, and offering practical insights for improving prosthetic control and human-computer interfaces.
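The disjoint versus overlap segmentation strategies compared above can be sketched with a simple windowing function. The 1 kHz sampling rate is an assumption made so that the paper's 200 ms window corresponds to 200 samples; with 80% overlap the hop between windows is 40 samples.

```python
def segment(signal, window, overlap):
    """Split a 1-D signal into fixed-size windows.

    overlap is a fraction in [0, 1): 0.0 gives disjoint windows,
    0.8 gives the 80% overlap used in the paper."""
    step = max(1, round(window * (1 - overlap)))
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, step)]

sig = list(range(1000))          # 1 s of fake sEMG at an assumed 1 kHz rate
disjoint = segment(sig, window=200, overlap=0.0)    # hop = 200 samples
overlapped = segment(sig, window=200, overlap=0.8)  # hop = 40 samples
```

Overlapping multiplies the number of training windows (here 21 versus 5 from the same second of signal), which is one reason it tends to help CNN classifiers.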
59. Sabary O, Yucovich A, Shapira G, Yaakobi E. Reconstruction algorithms for DNA-storage systems. Sci Rep 2024;14:1951. PMID: 38263421; PMCID: PMC10806084; DOI: 10.1038/s41598-024-51730-3. Received 06/24/2023; accepted 01/09/2024.
Abstract
Motivated by DNA storage systems, this work presents the DNA reconstruction problem, in which a length-n string is transmitted through the DNA-storage channel, which introduces deletion, insertion, and substitution errors. The channel generates multiple noisy copies of the transmitted string, called traces. A DNA reconstruction algorithm is a mapping that receives t traces as input and produces an estimate of the original string; the goal is to minimize the edit distance between the original string and the algorithm's estimate. In this work, we present several new algorithms for this problem. Our algorithms look globally at the entire sequence of traces and use dynamic programming algorithms, originally developed for the shortest common supersequence and longest common subsequence problems, to decode the original string. Our algorithms place no limitations on the input or the number of traces, and they perform well even for error probabilities as high as 0.27. The algorithms have been tested on simulated data, on data from previous DNA storage experiments, and on a newly synthesized dataset, and are shown to outperform previous algorithms in reconstruction accuracy.
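The longest common subsequence (LCS) dynamic program mentioned above is the core primitive such reconstruction algorithms build on: symbols shared by two traces are likely to belong to the original strand. A textbook sketch (the example traces are invented, and real reconstruction combines many traces with more machinery):

```python
def lcs(s: str, t: str) -> str:
    """Longest common subsequence of two strings via dynamic programming."""
    m, n = len(s), len(t)
    # dp[i][j] holds an LCS of s[:i] and t[:j].
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if s[i] == t[j]:
                dp[i + 1][j + 1] = dp[i][j] + s[i]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

# Two noisy traces of the same hypothetical DNA strand:
common = lcs("ACGTTGCA", "ACTTGGCA")
```

Here the two 8-base traces disagree in one position each, and their LCS retains the 7 bases they agree on; a reconstruction algorithm would use such alignments to vote on the inserted/deleted symbols.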
60. Wang H, Gao C, Dantona C, Hull B, Sun J. DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ Digit Med 2024;7:16. PMID: 38253711; PMCID: PMC10803802; DOI: 10.1038/s41746-023-00989-3. Received 09/29/2023; accepted 12/19/2023.
Abstract
In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) is pivotal, but its assignment process is inefficient. This study introduces DRG-LLaMA, an advanced large language model (LLM) fine-tuned on clinical notes to enhance DRG assignment. Utilizing LLaMA as the foundational model and optimizing it through Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries, our DRG-LLaMA-7B model exhibited a noteworthy macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged Area Under the Curve (AUC) of 0.986, with a maximum input token length of 512. This model surpassed the performance of prior leading models in DRG prediction, showing relative improvements of 40.3% and 35.7% in macro-averaged F1 score compared to ClinicalBERT and CAML, respectively. Applied to base DRG and complication or comorbidity (CC)/major complication or comorbidity (MCC) prediction, DRG-LLaMA achieved top-1 prediction accuracies of 67.8% and 67.5%, respectively. Additionally, our findings indicate that DRG-LLaMA's performance improves with increased model parameters and input context lengths.
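The Low-Rank Adaptation (LoRA) used above fine-tunes a frozen weight matrix W by learning a low-rank update scaled by alpha/r, so only d*r + r*d parameters are trained instead of d*d. A numpy sketch of the arithmetic (dimensions, rank, and scaling are illustrative; in real LoRA the up-projection B is initialized to zero so training starts from the pretrained behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16              # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))          # frozen pretrained weight matrix
A = rng.normal(size=(d, r)) * 0.01   # trainable down-projection
B = rng.normal(size=(r, d))          # trainable up-projection (zero-init in
                                     # real LoRA; random here for illustration)

delta = (alpha / r) * (A @ B)        # rank <= r update
W_adapted = W + delta                # weights actually used at inference
```

With d = 64 and r = 8, the update touches all 64x64 entries of W yet has only 1,024 trainable parameters, which is why LoRA makes fine-tuning a 7B-parameter model tractable.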
61. Liu M, Srivastava G, Ramanujam J, Brylinski M. Augmented drug combination dataset to improve the performance of machine learning models predicting synergistic anticancer effects. Sci Rep 2024;14:1668. PMID: 38238448; PMCID: PMC10796434; DOI: 10.1038/s41598-024-51940-9. Received 10/23/2023; accepted 01/11/2024.
Abstract
Combination therapy has gained popularity in cancer treatment as it enhances the treatment efficacy and overcomes drug resistance. Although machine learning (ML) techniques have become an indispensable tool for discovering new drug combinations, the data on drug combination therapy currently available may be insufficient to build high-precision models. We developed a data augmentation protocol to unbiasedly scale up the existing anti-cancer drug synergy dataset. Using a new drug similarity metric, we augmented the synergy data by substituting a compound in a drug combination instance with another molecule that exhibits highly similar pharmacological effects. Using this protocol, we were able to upscale the AZ-DREAM Challenges dataset from 8798 to 6,016,697 drug combinations. Comprehensive performance evaluations show that ML models trained on the augmented data consistently achieve higher accuracy than those trained solely on the original dataset. Our data augmentation protocol provides a systematic and unbiased approach to generating more diverse and larger-scale drug combination datasets, enabling the development of more precise and effective ML models. The protocol presented in this study could serve as a foundation for future research aimed at discovering novel and effective drug combinations for cancer treatment.
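The substitution-based augmentation described above — swap one drug in a measured combination for a pharmacologically similar one and keep the synergy score — can be sketched as follows. The similarity map, drug names, and score are invented for illustration; the paper derives similarity from its own pharmacological metric.

```python
def augment(combos, similar):
    """Create new combination records by swapping in similar drugs,
    keeping the measured synergy score of the original pair."""
    out = list(combos)
    for drug_a, drug_b, score in combos:
        for sub in similar.get(drug_a, []):
            out.append((sub, drug_b, score))
        for sub in similar.get(drug_b, []):
            out.append((drug_a, sub, score))
    return out

# One measured combination and one assumed high-similarity drug pair.
combos = [("gefitinib", "trametinib", 12.4)]
similar = {"gefitinib": ["erlotinib"]}
augmented = augment(combos, similar)
```

Each similar drug multiplies the dataset, which is how 8,798 measured combinations can grow into millions of training instances.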
62. Ghosh T, Han Y, Raju V, Hossain D, McCrory MA, Higgins J, Boushey C, Delp EJ, Sazonov E. Integrated image and sensor-based food intake detection in free-living. Sci Rep 2024;14:1665. PMID: 38238423; PMCID: PMC10796396; DOI: 10.1038/s41598-024-51687-3. Received 11/19/2021; accepted 01/08/2024.
Abstract
The first step in any dietary monitoring system is the automatic detection of eating episodes. To detect eating episodes, either sensor data or images can be used, and either method can result in false-positive detection. This study aims to reduce the number of false positives in the detection of eating episodes by a wearable sensor, Automatic Ingestion Monitor v2 (AIM-2). Thirty participants wore the AIM-2 for two days each (pseudo-free-living and free-living). The eating episodes were detected by three methods: (1) recognition of solid foods and beverages in images captured by AIM-2; (2) recognition of chewing from the AIM-2 accelerometer sensor; and (3) hierarchical classification to combine confidence scores from image and accelerometer classifiers. The integration of image- and sensor-based methods achieved 94.59% sensitivity, 70.47% precision, and 80.77% F1-score in the free-living environment, which is significantly better than either of the original methods (8% higher sensitivity). The proposed method successfully reduces the number of false positives in the detection of eating episodes.
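The fusion step — combining confidence scores from the image and accelerometer classifiers — can be illustrated with a simple two-level rule: accept when either detector is very confident, or when both are moderately confident. The thresholds and the rule itself are assumptions for illustration; the paper uses a trained hierarchical classifier rather than fixed cutoffs.

```python
def detect_eating(image_conf: float, chew_conf: float,
                  solo: float = 0.9, joint: float = 0.5) -> bool:
    """Fuse image and chewing-sensor confidences into one decision."""
    if image_conf >= solo or chew_conf >= solo:
        return True                      # one detector alone is sure
    return image_conf >= joint and chew_conf >= joint  # both agree weakly
```

Requiring agreement in the mid-confidence band is what suppresses the false positives that each modality produces on its own.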
63. Ali L, Javeed A, Noor A, Rauf HT, Kadry S, Gandomi AH. Parkinson's disease detection based on features refinement through L1 regularized SVM and deep neural network. Sci Rep 2024;14:1333. PMID: 38228772; PMCID: PMC10791701; DOI: 10.1038/s41598-024-51600-y. Received 06/05/2023; accepted 01/07/2024.
Abstract
In previous studies, replicated and multiple types of speech data have been used for Parkinson's disease (PD) detection. However, two main problems in these studies are lower PD detection accuracy and inappropriate validation methodologies leading to unreliable results. This study discusses the effects of inappropriate validation methodologies used in previous studies and highlights the use of appropriate alternative validation methods that would ensure generalization. To enhance PD detection accuracy, we propose a two-stage diagnostic system that refines the extracted set of features through an L1-regularized linear support vector machine and classifies the refined subset of features through a deep neural network. To rigorously evaluate the effectiveness of the proposed diagnostic system, experiments are performed on two different voice recording-based benchmark datasets. For both datasets, the proposed diagnostic system achieves 100% accuracy under leave-one-subject-out (LOSO) cross-validation (CV) and 97.5% accuracy under k-fold CV. The results show that the proposed system outperforms the existing methods regarding PD detection accuracy. The results suggest that the proposed diagnostic system is essential to improving non-invasive diagnostic decision support in PD.
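The validation point above hinges on leave-one-subject-out (LOSO) CV: every recording of a held-out subject is excluded from training, so the model cannot exploit speaker identity. A minimal fold generator (subject IDs are invented for the example):

```python
def loso_folds(subject_ids):
    """Leave-one-subject-out cross-validation folds.

    Each fold holds out every sample of one subject, so a subject never
    appears in both train and test indices."""
    subjects = sorted(set(subject_ids))
    folds = []
    for held_out in subjects:
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        folds.append((train, test))
    return folds

# Six samples from three subjects:
ids = ["s1", "s1", "s2", "s2", "s2", "s3"]
folds = loso_folds(ids)
```

Plain k-fold CV on the same data could place replicated recordings of one speaker in both splits, which is precisely the leakage the authors warn against (scikit-learn's `LeaveOneGroupOut` implements the same scheme).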
64. Kini AS, Prema KV, Pai SN. Early stage black pepper leaf disease prediction based on transfer learning using ConvNets. Sci Rep 2024;14:1404. PMID: 38228767; PMCID: PMC10791634; DOI: 10.1038/s41598-024-51884-0. Received 08/17/2023; accepted 01/10/2024.
Abstract
Plants are exposed to diseases, insect pests, and fungi, which cause heavy crop damage and result in various leaf diseases. Leaf diseases can be diagnosed at an early stage with the aid of a smart computer vision system, enabling timely disease prevention. Black pepper is a medicinal plant that is extensively used in Ayurvedic medicine because of its therapeutic properties. The proposed work presents an intelligent transfer learning technique, implemented with state-of-the-art deep convolutional neural networks, to predict the presence of prominent diseases in black pepper leaves. The publicly available ImageNet dataset is used to train the deep neural networks, which are then applied to a newly developed black pepper leaf image dataset. The dataset consists of real-world leaf images taken directly in the fields and annotated under the supervision of an expert. The leaf diseases considered are anthracnose, slow wilt, early-stage phytophthora, phytophthora, and yellowing. The hyperparameters tuned in the deep learning models include the initial learning rate, optimization algorithm, batch size, number of epochs, and the training/validation split. With a learning rate of 0.001, accuracy ranges from 99.1 to 99.7% for the Inception V3, GoogleNet, SqueezeNet, and ResNet18 models; the proposed ResNet18 model outperforms all other models at 99.67% accuracy, with high validation accuracy and low validation loss throughout. This work represents an advance for agriculture: a cutting-edge deep neural network approach for early-stage black pepper leaf disease identification and prediction.
65. Meisburger SP, Ando N. Scaling and merging macromolecular diffuse scattering with mdx2. bioRxiv [preprint] 2024:2024.01.16.575887. PMID: 38293202; PMCID: PMC10827198; DOI: 10.1101/2024.01.16.575887.
Abstract
Diffuse scattering is a promising method to gain additional insight into protein dynamics from macromolecular crystallography (MX) experiments. Bragg intensities yield the average electron density, while the diffuse scattering can be processed to obtain a three-dimensional reciprocal space map, which is further analyzed to determine correlated motion. To make diffuse scattering techniques more accessible, we have created data processing software called mdx2 that is both convenient to use and simple to extend and modify. Mdx2 is written in Python, and it interfaces with DIALS to implement self-contained data reduction workflows. Data are stored in NeXus format for software interchange and convenient visualization. Mdx2 can be run on the command line or imported as a package, for instance to encapsulate a complete workflow in a Jupyter notebook for reproducible computing and education. Here, we describe mdx2 version 1.0, a new release incorporating state-of-the-art techniques for data reduction. We describe the implementation of a complete multi-crystal scaling and merging workflow, and test the methods using a high-redundancy dataset from cubic insulin. We show that redundancy can be leveraged during scaling to correct systematic errors and to obtain accurate and reproducible measurements of weak diffuse signals.
66. Taber CB, Sharma S, Raval MS, Senbel S, Keefe A, Shah J, Patterson E, Nolan J, Sertac Artan N, Kaya T. A holistic approach to performance prediction in collegiate athletics: player, team, and conference perspectives. Sci Rep 2024;14:1162. PMID: 38216641; PMCID: PMC10786827; DOI: 10.1038/s41598-024-51658-8. Received 05/23/2023; accepted 01/08/2024.
Abstract
Predictive sports data analytics can be revolutionary for sports performance. Existing literature discusses players' or teams' performance, independently or in tandem. Using machine learning (ML), this paper aims to holistically evaluate player-, team-, and conference (season)-level performances in Division-1 Women's basketball. The players were monitored and tested through a full competitive year. Performance was quantified at the player level using the reactive strength index modified (RSImod), at the team level by the game score (GS) metric, and at the conference level through Player Efficiency Rating (PER). The data include parameters from training, subjective stress, sleep, and recovery (WHOOP straps), in-game statistics (Polar monitors), and countermovement jumps. We used data balancing techniques and an Extreme Gradient Boosting (XGB) classifier to predict RSImod and GS with greater than 90% accuracy and a 0.9 F1 score. The XGB regressor predicted PER with an MSE of 0.026 and an R2 of 0.680. An ensemble of Random Forest, XGB, and correlation analysis identified feature importance at all levels, and Partial Dependence Plots were used to understand the impact of each feature on the target variable. Quantifying and predicting performance at all levels will allow coaches to monitor athlete readiness and help improve training.
67. Abohassan M, El-Basyouny K. Leveraging LiDAR-Based Simulations to Quantify the Complexity of the Static Environment for Autonomous Vehicles in Rural Settings. Sensors (Basel) 2024;24:452. PMID: 38257547; PMCID: PMC10820782; DOI: 10.3390/s24020452. Received 11/25/2023; revised 01/05/2024; accepted 01/09/2024.
Abstract
This paper uses virtual simulations to examine the interaction between autonomous vehicles (AVs) and their surrounding environment. A framework was developed to estimate the environment's complexity by calculating the real-time data processing requirements for AVs to navigate effectively. The VISTA simulator was used to synthesize viewpoints to replicate the captured environment accurately. With an emphasis on static physical features, roadways were dissected into relevant road features (RRFs) and the full environment (FE) to study the impact of roadside features on scene complexity and to demonstrate the gravity of wildlife-vehicle collisions (WVCs) for AVs. The results indicate that roadside features can increase environmental complexity by up to 400%. Adding a single lane to the road increased processing requirements by 12.3-16.5%. Crest vertical curves decrease data rates due to occlusion challenges, with a reported average of 4.2% data loss, while sag curves can increase complexity by 7%. In horizontal curves, roadside occlusion contributed to severe loss of road information, decreasing data rate requirements by as much as 19%. As for weather conditions, heavy rain increased the AV's processing demands by a staggering 240% compared to normal weather. AV developers and government agencies can exploit the findings of this study to better tailor AV designs and meet the necessary infrastructure requirements.
Collapse
|
68
|
Berenguer A, Morejón A, Tomás D, Mazón JN. Using Large Language Models to Enhance the Reusability of Sensor Data. SENSORS (BASEL, SWITZERLAND) 2024; 24:347. [PMID: 38257439 PMCID: PMC10818398 DOI: 10.3390/s24020347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Revised: 12/31/2023] [Accepted: 01/05/2024] [Indexed: 01/24/2024]
Abstract
The Internet of Things generates vast data volumes via diverse sensors, yet its potential remains unexploited for innovative data-driven products and services. Limitations arise from sensor-dependent data handling by manufacturers and user companies, hindering third-party access and comprehension. Initiatives like the European Data Act aim to enable high-quality access to sensor-generated data by regulating accuracy, completeness, and relevance while respecting intellectual property rights. Despite data availability, interoperability challenges impede sensor data reusability. For instance, sensor data shared in HTML formats requires intricate, time-consuming processing to attain reusable formats like JSON or XML. This study introduces a methodology aimed at converting raw sensor data extracted from web portals into structured formats, thereby enhancing data reusability. The approach utilises large language models to derive structured formats from sensor data initially presented in non-interoperable formats. The effectiveness of these language models was assessed through quantitative and qualitative evaluations in a use case involving meteorological data. In the proposed experiments, GPT-4, the best-performing LLM tested, demonstrated the feasibility of this methodology, achieving a precision of 93.51% and a recall of 85.33% in converting HTML to JSON/XML, thus confirming its potential in obtaining reusable sensor data.
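The precision and recall figures above come from comparing LLM-extracted records against gold-standard structured data. A minimal sketch of that evaluation step, with invented field names and values (the paper's actual metrics may be computed at a different granularity):

```python
# Sketch of the quantitative evaluation described above: compare the
# key-value pairs an LLM extracts from HTML against a gold JSON record and
# report precision/recall. The records below are invented for illustration.

def precision_recall(predicted: dict, gold: dict):
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {"station": "A1", "temp_c": "21.4", "humidity": "63", "wind_kmh": "12"}
pred = {"station": "A1", "temp_c": "21.4", "humidity": "64"}  # one wrong field

p, r = precision_recall(pred, gold)
# p = 2/3 (two of three predicted fields correct)
# r = 2/4 (two of four gold fields recovered)
```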
Collapse
|
69
|
Jaros R, Tomicova E, Martinek R. Template subtraction based methods for non-invasive fetal electrocardiography extraction. Sci Rep 2024; 14:630. [PMID: 38182757 PMCID: PMC10770155 DOI: 10.1038/s41598-024-51213-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2023] [Accepted: 01/02/2024] [Indexed: 01/07/2024] Open
Abstract
Assessment of fetal heart rate (fHR) through non-invasive fetal electrocardiography (fECG) is a challenging task. This study compares the performance of five template subtraction (TS) methods on a Labor dataset (12 5-min recordings) and a Pregnancy dataset (10 20-min recordings). The methods include TS without adaptation, TS using singular value decomposition (TSsvd), TS using linear prediction (TSlp), TS using a scaling factor (TSsc), and sequential analysis (SA). The influence of the chosen maternal wavelet for the continuous wavelet transform (CWT) detector is also compared. The F1 score was used to measure performance. Each recording in both datasets consisted of four signals, resulting in a total comparison of 88 signals for the TS-based methods. The study reported the following results: F1 = 95.71% with TS, F1 = 95.93% with TSsvd, F1 = 95.30% with TSlp, F1 = 95.82% with TSsc, and F1 = 95.99% with SA. The study identified gaus3 as the most suitable maternal wavelet for fetal R-peak detection using the CWT detector. Furthermore, the study classified signals from the tested datasets into categories of high, medium, and low quality, providing valuable insights for subsequent fECG signal extraction. This research contributes to advancing the understanding of non-invasive fECG signal processing and lays the groundwork for improving fetal monitoring in clinical settings.
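The baseline variant compared above, plain template subtraction without adaptation, can be sketched in a few lines: average the maternal beats around known maternal R-peaks and subtract that template at each peak, leaving the fetal residual. Peak locations are assumed given (e.g. from a CWT detector); the function name and toy signal are illustrative, not from the paper:

```python
import numpy as np

# Minimal sketch of plain template subtraction ("TS without adaptation"):
# average the maternal beats around known maternal R-peaks, then subtract
# the template at each peak. The fetal component survives in the residual.

def template_subtract(signal, m_peaks, half_win):
    beats = np.stack([signal[p - half_win:p + half_win] for p in m_peaks])
    template = beats.mean(axis=0)            # average maternal beat
    out = signal.copy()
    for p in m_peaks:
        out[p - half_win:p + half_win] -= template
    return out

# Toy example: identical maternal beats cancel exactly.
sig = np.zeros(100)
peaks = [20, 50, 80]
for p in peaks:
    sig[p - 5:p + 5] += np.hanning(10)       # same beat shape at every peak
residual = template_subtract(sig, peaks, 5)
```

In real recordings the maternal beats vary, which is exactly why the adaptive variants (SVD, linear prediction, scaling) exist.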
Collapse
|
70
|
Fernández Requena B, Nadeem S, Reddy VP, Naidoo V, Glasgow JN, Steyn AJC, Barbas C, Gonzalez-Riano C. LiLA: lipid lung-based ATLAS built through a comprehensive workflow designed for an accurate lipid annotation. Commun Biol 2024; 7:45. [PMID: 38182666 PMCID: PMC10770321 DOI: 10.1038/s42003-023-05680-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Accepted: 12/06/2023] [Indexed: 01/07/2024] Open
Abstract
Accurate lipid annotation is crucial for understanding the role of lipids in health and disease and identifying therapeutic targets. However, annotating the wide variety of lipid species in biological samples remains challenging in untargeted lipidomic studies. In this work, we present a lipid annotation workflow based on LC-MS and MS/MS strategies, the combination of four bioinformatic tools, and a decision tree to support the accurate annotation and semi-quantification of the lipid species present in lung tissue from control mice. The proposed workflow allowed us to generate a lipid lung-based ATLAS (LiLA), which was then employed to unveil the lipidomic signatures of Mycobacterium tuberculosis infection at two different time points for a deeper understanding of the disease progression. This workflow, combined with manual inspection strategies of MS/MS data, can enhance the annotation process for lipidomic studies and guide the generation of sample-specific lipidome maps. LiLA serves as a freely available data resource that can be employed in future studies to address lipidomic alterations in mouse lung tissue.
Collapse
|
71
|
Witman N, Zhou C, Häneke T, Xiao Y, Huang X, Rohner E, Sohlmér J, Grote Beverborg N, Lehtinen ML, Chien KR, Sahara M. Author Correction: Placental growth factor exerts a dual function for cardiomyogenesis and vasculogenesis during heart development. Nat Commun 2024; 15:283. [PMID: 38177121 PMCID: PMC10766948 DOI: 10.1038/s41467-023-44507-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2024] Open
|
72
|
Yuan Y, Du J, Luo J, Zhu Y, Huang Q, Zhang M. Discrimination of missing data types in metabolomics data based on particle swarm optimization algorithm and XGBoost model. Sci Rep 2024; 14:152. [PMID: 38168582 PMCID: PMC10762217 DOI: 10.1038/s41598-023-50646-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Accepted: 12/22/2023] [Indexed: 01/05/2024] Open
Abstract
In the field of data analysis, large numbers of missing values are often encountered, and in metabolomics data this problem is especially prominent. Data imputation is a common way to deal with missing metabolomics data, but traditional imputation methods usually ignore the differences between missing types, so the imputation results are unsatisfactory. To discriminate the missing types of metabolomics data, a missing data classification model (PX-MDC) based on the particle swarm optimization algorithm and XGBoost is proposed in this paper. First, the largest subset of complete data is obtained by screening out the missing values in a given data set, and the particle swarm optimization algorithm is then used to search for the concentration threshold of missing data and the proportion of low-concentration deletions relative to overall deletions. Next, missing data are simulated based on the search results. Finally, an XGBoost model is trained on the feature set proposed in this paper to build a classifier for the missing data. The experimental results show that, in the concentration-threshold search, the particle swarm optimization algorithm matches the traditional enumeration method in accuracy while significantly reducing the search time. Compared with current mainstream methods, the PX-MDC model designed in this paper exhibits higher accuracy and is able to distinguish different deletion types for the same metabolite. This study is expected to make an important breakthrough in metabolomics data imputation and provide strong support for research in related fields.
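The threshold search described above is a one-dimensional particle swarm optimization. A self-contained sketch of that search step, with a stand-in objective (the paper's actual objective involves simulated missingness patterns, and all parameter values here are illustrative):

```python
import random

# Toy sketch of the particle-swarm search step: particles explore a 1-D
# concentration-threshold space to minimize an objective. The quadratic
# objective below is a stand-in with a known minimum.

def pso_search(objective, lo, hi, n_particles=20, iters=60, seed=0):
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                        # personal bests
    gbest = min(pos, key=objective)       # global best
    for _ in range(iters):
        for i in range(n_particles):
            # inertia + cognitive + social terms (standard PSO update)
            vel[i] = (0.5 * vel[i]
                      + 1.5 * rng.random() * (pbest[i] - pos[i])
                      + 1.5 * rng.random() * (gbest - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            if objective(pos[i]) < objective(pbest[i]):
                pbest[i] = pos[i]
        gbest = min(pbest, key=objective)
    return gbest

# Stand-in objective with a known minimum at threshold = 3.0
best = pso_search(lambda t: (t - 3.0) ** 2, 0.0, 10.0)
```

Against enumeration over a fine grid, the swarm evaluates far fewer candidate thresholds, which is the speed advantage the abstract reports.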
Collapse
|
73
|
Yi Y, Billor N, Ekstrom A, Zheng J. CW_ICA: an efficient dimensionality determination method for independent component analysis. Sci Rep 2024; 14:143. [PMID: 38167428 PMCID: PMC10762178 DOI: 10.1038/s41598-023-49355-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Accepted: 12/07/2023] [Indexed: 01/05/2024] Open
Abstract
Independent component analysis (ICA) is a widely used blind source separation method for signal pre-processing. The determination of the number of independent components (ICs) is crucial for achieving optimal performance, as an incorrect choice can result in either under-decomposition or over-decomposition. In this study, we propose a robust method to automatically determine the optimal number of ICs, named column-wise independent component analysis (CW_ICA). CW_ICA divides the mixed signals into two blocks and applies ICA separately to each block. A quantitative measure, derived from the rank-based correlation matrix computed from the ICs of the two blocks, is utilized to determine the optimal number of ICs. The proposed method is validated and compared with existing determination methods using simulation and scalp EEG data. The results demonstrate that CW_ICA is a reliable and robust approach for determining the optimal number of ICs. It offers computational efficiency and can be seamlessly integrated with different ICA methods.
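The core idea, components that are genuine should reappear in both blocks while noise components should not, can be sketched with a rank-based cross-correlation matrix. In this sketch the ICA step itself is omitted and the "components" are synthetic; the function names and the 0.9 threshold are assumptions, not the paper's actual criterion:

```python
import numpy as np

# Sketch of the CW_ICA matching criterion: given components estimated
# separately from two halves of the data, build a rank-based (Spearman)
# cross-correlation matrix and count components that reliably reappear
# in both blocks.

def spearman_matrix(A, B):
    """Spearman correlation between columns of A and columns of B."""
    rank = lambda M: np.argsort(np.argsort(M, axis=0), axis=0).astype(float)
    Ra, Rb = rank(A), rank(B)
    Ra -= Ra.mean(axis=0)
    Rb -= Rb.mean(axis=0)
    norms = np.outer(np.linalg.norm(Ra, axis=0), np.linalg.norm(Rb, axis=0))
    return (Ra.T @ Rb) / norms

def n_reproducible(A, B, thresh=0.9):
    """Components whose best cross-block match exceeds the threshold."""
    M = np.abs(spearman_matrix(A, B))
    return int((M.max(axis=1) > thresh).sum())

t = np.linspace(0, 1, 500)
rng = np.random.default_rng(0)
true_src = np.column_stack([np.sin(2 * np.pi * 5 * t),
                            np.sign(np.sin(2 * np.pi * 3 * t))])
block1 = np.column_stack([true_src, rng.normal(size=(500, 2))])  # 2 real + 2 noise
block2 = np.column_stack([true_src, rng.normal(size=(500, 2))])
k = n_reproducible(block1, block2)  # only the two genuine sources match
```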
Collapse
|
74
|
Wanichthanarak K, In-on A, Fan S, Fiehn O, Wangwiwatsin A, Khoomrung S. Data processing solutions to render metabolomics more quantitative: case studies in food and clinical metabolomics using Metabox 2.0. Gigascience 2024; 13:giae005. [PMID: 38488666 PMCID: PMC10941642 DOI: 10.1093/gigascience/giae005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2023] [Revised: 12/22/2023] [Accepted: 02/02/2024] [Indexed: 03/18/2024] Open
Abstract
In classic semiquantitative metabolomics, metabolite intensities are affected by biological factors and other unwanted variations. A systematic evaluation of the data processing methods is crucial to identify adequate processing procedures for a given experimental setup. Current comparative studies are mostly focused on peak area data but not on absolute concentrations. In this study, we evaluated data processing methods to produce outputs that were most similar to the corresponding absolute quantified data. We examined the data distribution characteristics, fold difference patterns between 2 metabolites, and sample variance. We used 2 metabolomic datasets from a retail milk study and a lupus nephritis cohort as test cases. When studying the impact of data normalization, transformation, scaling, and combinations of these methods, we found that the cross-contribution compensating multiple standard normalization (ccmn) method, followed by square root data transformation, was most appropriate for a well-controlled study such as the milk study dataset. Regarding the lupus nephritis cohort study, only ccmn normalization could slightly improve the data quality of the noisy cohort. Since the assessment accounted for the resemblance between processed data and the corresponding absolute quantified data, our results provide a helpful guideline for processing metabolomic datasets within a similar context (food and clinical metabolomics). Finally, we introduce Metabox 2.0, which enables thorough analysis of metabolomic data, including data processing, biomarker analysis, integrative analysis, and data interpretation. It was successfully used to process and analyze the data in this study. An online web version is available at http://metsysbio.com/metabox.
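The winning processing chain above is a per-sample normalization followed by a square-root transformation. A hedged sketch of that chain on a synthetic samples-by-metabolites matrix; a simple median fold-change normalization stands in here for ccmn, which is considerably more involved:

```python
import numpy as np

# Hedged sketch of the processing chain evaluated above: per-sample
# normalization (simple median fold-change as a stand-in for ccmn)
# followed by square-root transformation to stabilize variance.

def normalize_then_sqrt(X):
    ref = np.median(X, axis=0)               # reference metabolite profile
    factors = np.median(X / ref, axis=1)     # per-sample fold change vs reference
    return np.sqrt(X / factors[:, None])     # remove dilution, then transform

rng = np.random.default_rng(1)
base = rng.lognormal(mean=2.0, sigma=0.5, size=(6, 20))   # 6 samples x 20 metabolites
dilution = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 2.0])       # unwanted per-sample variation
X = base * dilution[:, None]
Y = normalize_then_sqrt(X)
```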
Collapse
|
75
|
Ringbauer H, Huang Y, Akbari A, Mallick S, Olalde I, Patterson N, Reich D. Accurate detection of identity-by-descent segments in human ancient DNA. Nat Genet 2024; 56:143-151. [PMID: 38123640 PMCID: PMC10786714 DOI: 10.1038/s41588-023-01582-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Accepted: 10/20/2023] [Indexed: 12/23/2023]
Abstract
Long DNA segments shared between two individuals, known as identity-by-descent (IBD), reveal recent genealogical connections. Here we introduce ancIBD, a method for identifying IBD segments in ancient human DNA (aDNA) using a hidden Markov model and imputed genotype probabilities. We demonstrate that ancIBD accurately identifies IBD segments >8 cM for aDNA data with an average depth of >0.25× for whole-genome sequencing or >1× for 1240k single nucleotide polymorphism capture data. Applying ancIBD to 4,248 ancient Eurasian individuals, we identify relatives up to the sixth degree and genealogical connections between archaeological groups. Notably, we reveal long IBD sharing between Corded Ware and Yamnaya groups, indicating that the Yamnaya herders of the Pontic-Caspian Steppe and the Steppe-related ancestry in various European Corded Ware groups share substantial co-ancestry within only a few hundred years. These results show that detecting IBD segments can generate powerful insights into the growing aDNA record, both on a small scale relevant to life stories and on a large scale relevant to major cultural-historical events.
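The modelling idea behind ancIBD, a hidden Markov model over genotype observations, can be illustrated with a tiny two-state (IBD vs non-IBD) Viterbi decoder. This is a didactic sketch only: the observations here are binarized allele sharing, and the transition and emission probabilities are invented, not the published parameters, which operate on imputed genotype probabilities:

```python
import numpy as np

# Minimal two-state HMM sketch: decode per-site observations of whether two
# individuals share an allele into IBD / non-IBD segments via Viterbi.
# All probabilities below are illustrative.

def viterbi_ibd(obs, p_stay=0.99, p_share_ibd=0.95, p_share_bg=0.6):
    log_emit = np.log(np.array([[1 - p_share_bg, p_share_bg],      # state 0: non-IBD
                                [1 - p_share_ibd, p_share_ibd]]))  # state 1: IBD
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))
    n = len(obs)
    dp = np.full((n, 2), -np.inf)
    back = np.zeros((n, 2), dtype=int)
    dp[0] = np.log([0.99, 0.01]) + log_emit[:, obs[0]]
    for i in range(1, n):
        for s in range(2):
            scores = dp[i - 1] + trans[:, s]
            back[i, s] = scores.argmax()
            dp[i, s] = scores.max() + log_emit[s, obs[i]]
    path = [int(dp[-1].argmax())]
    for i in range(n - 1, 0, -1):          # backtrack
        path.append(back[i, path[-1]])
    return path[::-1]

# A long run of shared alleles in the middle decodes as an IBD segment.
obs = [1, 0, 1, 0, 1] * 10 + [1] * 50 + [0, 1, 0, 1, 1] * 10
states = viterbi_ibd(obs)
```

The sticky transitions (p_stay near 1) are what make the decoder favour long contiguous segments, mirroring the >8 cM segment-length focus described above.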
Collapse
|