1. Chen J, Ionita M, Feng Y, Lu Y, Orzechowski P, Garai S, Hassinger K, Bao J, Wen J, Duong-Tran D, Wagenaar J, McKeague ML, Painter MM, Mathew D, Pattekar A, Meyer NJ, Wherry EJ, Greenplate AR, Shen L. Automated Cytometric Gating with Human-Level Performance Using Bivariate Segmentation. bioRxiv 2024:2024.05.06.592739. [PMID: 38766268 PMCID: PMC11100732 DOI: 10.1101/2024.05.06.592739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
Recent advances in cytometry technology have enabled high-throughput data collection with multiple single-cell protein expression measurements. The significant biological and technical variance between samples has long posed a formidable challenge during gating, especially for the initial gates, which deal with unpredictable events such as debris and technical artifacts. Even with the same experimental machine and protocol, the target population, as well as the cell population that needs to be excluded, may vary across measurements. To address this challenge and reduce the labor of manual gating, we propose a deep learning framework, UNITO, to rigorously identify hierarchical cytometric subpopulations. UNITO transforms a cell-level classification task into an image-based semantic segmentation problem. To assess reproducibility, the framework was applied to three independent cohorts, where it successfully detected the initial gates required to identify single cellular events as well as subsequent cell gates. We validated UNITO by comparing its results with previous automated methods and with the consensus of at least four experienced immunologists. UNITO outperformed existing automated methods and differed from human consensus by no more than each individual human did. Most critically, UNITO functions as a fully automated pipeline after training and does not require human hints or prior knowledge. Unlike existing multi-channel classification or clustering pipelines, UNITO reproduces a contour similar to manual gating for each intermediate gate, which improves interpretability and allows post hoc visual inspection. Beyond being a pioneering framework that uses image segmentation for auto-gating, UNITO offers a fast and interpretable way to assign cell subtype membership, and its speed is not affected by the number of cells in each sample. Pre-gating and gating inference take approximately 2 minutes per sample with our pre-defined 9-gate system, and the framework can also adapt to any sequential prediction with different configurations.
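The idea of turning cell-level classification into image segmentation can be illustrated with a minimal sketch. It assumes each cell is a pair of marker values, that the pair is binned into a fixed-size 2D density image, and that a binary gate mask over that image is mapped back to cells. The function names, the bin count, and the hand-written mask are hypothetical; in the paper the mask would come from a trained segmentation network, which is not reproduced here.

```python
def bin_index(value, lo, hi, n_bins):
    """Map a measurement to a bin index in [0, n_bins - 1], clamping outliers."""
    if hi <= lo:
        raise ValueError("range upper bound must exceed lower bound")
    i = int((value - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)


def density_image(cells, x_range, y_range, n_bins):
    """Count cells per (row, col) bin; the image size does not depend on cell count."""
    image = [[0] * n_bins for _ in range(n_bins)]
    for x, y in cells:
        row = bin_index(y, y_range[0], y_range[1], n_bins)
        col = bin_index(x, x_range[0], x_range[1], n_bins)
        image[row][col] += 1
    return image


def cells_in_gate(cells, mask, x_range, y_range):
    """Map a binary mask over the image back to per-cell gate membership."""
    n_bins = len(mask)
    return [
        bool(mask[bin_index(y, y_range[0], y_range[1], n_bins)]
                 [bin_index(x, x_range[0], x_range[1], n_bins)])
        for x, y in cells
    ]


# Toy data (illustrative only, not from the paper).
cells = [(0.1, 0.1), (0.9, 0.9), (0.15, 0.2)]
image = density_image(cells, (0.0, 1.0), (0.0, 1.0), n_bins=2)
mask = [[1, 0], [0, 0]]  # stand-in for a predicted segmentation mask
membership = cells_in_gate(cells, mask, (0.0, 1.0), (0.0, 1.0))
```

Because the segmentation step would operate on a fixed-size image, its cost in this sketch is independent of how many cells were binned, which is consistent with the abstract's remark about speed.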
Affiliation(s)
- Jiong Chen
- Department of Bioengineering, University of Pennsylvania School of Engineering and Applied Science, PA, USA
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Matei Ionita
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Institute for Immunology and Immune Health, University of Pennsylvania Perelman School of Medicine, PA, USA
- Yanbo Feng
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Yinfeng Lu
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Department of Mathematics, University of Pennsylvania School of Arts and Sciences, PA, USA
- Patryk Orzechowski
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Department of Automatics and Robotics, AGH University of Science and Technology, al. Mickiewicza 30, Krakow, 30-059, Poland
- Sumita Garai
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Kenneth Hassinger
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Jingxuan Bao
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Junhao Wen
- Laboratory of AI and Biomedical Science, Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, CA, USA
- Duy Duong-Tran
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Department of Mathematics, United States Naval Academy, Annapolis, MD, USA
- Joost Wagenaar
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Michelle L. McKeague
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Institute for Immunology and Immune Health, University of Pennsylvania Perelman School of Medicine, PA, USA
- Mark M. Painter
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Institute for Immunology and Immune Health, University of Pennsylvania Perelman School of Medicine, PA, USA
- Divij Mathew
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Institute for Immunology and Immune Health, University of Pennsylvania Perelman School of Medicine, PA, USA
- Ajinkya Pattekar
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Institute for Immunology and Immune Health, University of Pennsylvania Perelman School of Medicine, PA, USA
- Nuala J. Meyer
- Division of Pulmonary and Critical Care Medicine, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- E. John Wherry
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Institute for Immunology and Immune Health, University of Pennsylvania Perelman School of Medicine, PA, USA
- Allison R. Greenplate
- Department of Systems Pharmacology & Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, PA, USA
- Institute for Immunology and Immune Health, University of Pennsylvania Perelman School of Medicine, PA, USA
- Li Shen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, PA, USA
2. Li X, Zhang Y, Wang J, Han J, Shen T. Long-term dynamic shifts in genomic base content and evolutionary trajectories of SARS-CoV-2 variants. J Med Virol 2023; 95:e29128. [PMID: 37772482 DOI: 10.1002/jmv.29128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 08/30/2023] [Accepted: 09/15/2023] [Indexed: 09/30/2023]
Abstract
The rapid spread and remarkable mutations of SARS-CoV-2 variants, particularly Omicron, necessitate an understanding of their evolutionary characteristics. In this study, we analyzed representative high-quality whole-genome sequences of 2,008 SARS-CoV-2 variants to explore long-term dynamic changes in genomic base (especially GC) content and variations during viral evolution. Our results demonstrated a strong negative correlation between GC content and variant emergence time (r = -0.765, p < 2.22e-16). Major gene partitions (S, N, ORF1ab) displayed similar trends. Omicron exhibited a significantly lower GC content than non-Omicron variants (p < 2.22e-16). Notably, we observed a robust negative correlation between C and T content (r = -0.778, p < 2.22e-16) and between G and A content (r = -0.773, p < 2.22e-16). Among all strains, Omicron showed the greatest base variation, with C->T mutations being the most frequent (median (interquartile range [IQR]): 29 (27, 31); 37.67%), followed by G->A mutations (11 (9, 13); 14.63%). Over a 3-year span, an annual decline rate of 0.12% in SARS-CoV-2 GC content was observed and could become more pronounced in future emerging variants. These findings provide insights into the evolutionary trajectory of SARS-CoV-2, underscoring the significance of continuous genomic surveillance for effective prediction of and response to future variants.
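The kind of analysis summarized above, GC content per genome correlated against emergence time, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names are hypothetical, ambiguous bases are simply ignored (an assumption), and the toy sequences and day offsets are invented for the example.

```python
from math import sqrt


def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence has no unambiguous bases")
    return (counts["G"] + counts["C"]) / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Toy data (illustrative only): days since first sample, and short "genomes"
# in which C is progressively replaced by T.
days = [0, 200, 400]
genomes = ["GGCCAT", "GGCTAT", "GGTTAT"]
gc_values = [gc_content(g) for g in genomes]
r = pearson_r(days, gc_values)
```

In this toy series each C->T substitution lowers GC content by the same amount at evenly spaced times, so the correlation is perfectly negative; real data would be far noisier.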
Affiliation(s)
- Xinjie Li
- Department of Microbiology and Infectious Disease Center, School of Basic Medical Sciences, Peking University, Beijing, China
- Yuqi Zhang
- Department of Microbiology and Infectious Disease Center, School of Basic Medical Sciences, Peking University, Beijing, China
- Jie Wang
- Department of Microbiology and Infectious Disease Center, School of Basic Medical Sciences, Peking University, Beijing, China
- Jun Han
- State Key Laboratory of Infectious Disease Prevention and Control, National Institute for Viral Disease Control and Prevention, China CDC, Beijing, China
- Tao Shen
- Department of Microbiology and Infectious Disease Center, School of Basic Medical Sciences, Peking University, Beijing, China
3. Zheng P, Zhou C, Ding Y, Liu B, Lu L, Zhu F, Duan S. Nanopore sequencing technology and its applications. MedComm (Beijing) 2023; 4:e316. [PMID: 37441463 PMCID: PMC10333861 DOI: 10.1002/mco2.316] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/30/2022] [Revised: 05/29/2023] [Accepted: 05/31/2023] [Indexed: 07/15/2023]
Abstract
Since the development of Sanger sequencing in 1977, sequencing technology has played a pivotal role in molecular biology research by enabling the interpretation of biological genetic codes. Today, nanopore sequencing is one of the leading third-generation sequencing technologies. With its long reads, portability, and low cost, nanopore sequencing is widely used in various scientific fields including epidemic prevention and control, disease diagnosis, and animal and plant breeding. Despite initial concerns about high error rates, continuous innovation in sequencing platforms and algorithm analysis technology has effectively addressed its accuracy. During the coronavirus disease (COVID-19) pandemic, nanopore sequencing played a critical role in detecting the severe acute respiratory syndrome coronavirus-2 virus genome and containing the pandemic. However, a lack of understanding of this technology may limit its popularization and application. Nanopore sequencing is poised to become the mainstream choice for preventing and controlling COVID-19 and future epidemics while creating value in other fields such as oncology and botany. This work introduces the contributions of nanopore sequencing during the COVID-19 pandemic to promote public understanding and its use in emerging outbreaks worldwide. We discuss its application in microbial detection, cancer genomes, and plant genomes and summarize strategies to improve its accuracy.
Affiliation(s)
- Peijie Zheng
- Department of Clinical Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Chuntao Zhou
- Department of Clinical Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Yuemin Ding
- Department of Clinical Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Institute of Translational Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Key Laboratory of Novel Targets and Drug Study for Neural Repair of Zhejiang Province, School of Medicine, Zhejiang University City College, Hangzhou, China
- Bin Liu
- Department of Clinical Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Liuyi Lu
- Department of Clinical Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Feng Zhu
- Department of Clinical Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Shiwei Duan
- Department of Clinical Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Institute of Translational Medicine, School of Medicine, Zhejiang University City College, Hangzhou, China
- Key Laboratory of Novel Targets and Drug Study for Neural Repair of Zhejiang Province, School of Medicine, Zhejiang University City College, Hangzhou, China
4. Miao M, De Clercq E, Li G. Towards Efficient and Accurate SARS-CoV-2 Genome Sequence Typing Based on Supervised Learning Approaches. Microorganisms 2022; 10:1785. [PMID: 36144387 PMCID: PMC9505117 DOI: 10.3390/microorganisms10091785] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/11/2022] [Revised: 08/24/2022] [Accepted: 09/01/2022] [Indexed: 11/16/2022]
Abstract
Despite the active development of SARS-CoV-2 surveillance methods (e.g., Nextstrain, GISAID, Pangolin), the global emergence of various SARS-CoV-2 viral lineages that potentially cause antiviral and vaccine failure has driven the need for accurate and efficient SARS-CoV-2 genome sequence classifiers. This study presents an optimized method that accurately identifies the viral lineages of SARS-CoV-2 genome sequences using existing schemes. For Nextstrain and GISAID clades, a template matching-based method is proposed to quantify the differences between viral clades and to support classification evaluation. Furthermore, to improve the typing accuracy of SARS-CoV-2 genome sequences, an ensemble model that integrates a combination of machine learning-based methods (such as Random Forest and CatBoost) with optimized weights is proposed for Nextstrain, Pangolin, and GISAID clades. Cross-validation is applied to optimize the parameters of the machine learning-based methods and the weight settings of the ensemble model. To improve the efficiency of the model, in addition to the one-hot encoding method, we have proposed a nucleotide site mutation-based data structure that requires fewer computational resources and performs better in SARS-CoV-2 genome sequence typing. Based on an accumulated database of >1 million SARS-CoV-2 genome sequences, performance evaluations show that the proposed system has a typing accuracy of 99.879%, 97.732%, and 96.291% for Nextstrain, Pangolin, and GISAID clades, respectively. A single prediction takes an average of <20 ms on a portable laptop. Overall, this study provides an efficient and accurate SARS-CoV-2 genome sequence typing system that benefits current and future surveillance of SARS-CoV-2 variants.
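The contrast between one-hot encoding and a site mutation-based representation can be illustrated with a small sketch. The abstract does not specify the authors' data structure, so the sparse set of (position, alternate base) pairs below is an assumption made purely for illustration, as are the function names and the toy sequences.

```python
BASES = "ACGT"


def one_hot(seq):
    """Dense encoding: 4 entries per position; ambiguous bases become all zeros."""
    vector = []
    for base in seq.upper():
        vector.extend(1 if base == b else 0 for b in BASES)
    return vector


def site_mutations(reference, seq):
    """Sparse encoding: (1-based position, alternate base) wherever seq differs
    from the reference. Ambiguous bases in seq are skipped (an assumption)."""
    if len(reference) != len(seq):
        raise ValueError("sequences must be aligned to the same length")
    return {
        (i + 1, s)
        for i, (r, s) in enumerate(zip(reference.upper(), seq.upper()))
        if r != s and s in BASES
    }


# Toy aligned sequences (illustrative only).
reference = "ACGTACGT"
sample = "ACGTATGT"
dense = one_hot(sample)                       # 4 * len(sample) entries
sparse = site_mutations(reference, sample)    # one entry per difference
```

For a genome of roughly 30,000 bases that differs from the reference at a few dozen sites, the dense vector has about 120,000 entries while the sparse set has only as many entries as there are differences, which is one plausible reason a mutation-based structure would need fewer resources.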
Affiliation(s)
- Miao Miao
- Hunan Provincial Key Laboratory of Clinical Epidemiology, Xiangya School of Public Health, Central South University, Changsha 410078, China
- Erik De Clercq
- Department of Microbiology, Immunology and Transplantation, Rega Institute for Medical Research, KU Leuven, 3000 Leuven, Belgium
- Guangdi Li
- Hunan Provincial Key Laboratory of Clinical Epidemiology, Xiangya School of Public Health, Central South University, Changsha 410078, China
- Hunan Children’s Hospital, Changsha 410007, China
- Correspondence: Tel.: +86-731-8480-5414
5. Munis AM, Andersson M, Mobbs A, Hyde SC, Gill DR. Genomic diversity of SARS-CoV-2 in Oxford during United Kingdom's first national lockdown. Sci Rep 2021; 11:21484. [PMID: 34728747 PMCID: PMC8564533 DOI: 10.1038/s41598-021-01022-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/03/2021] [Accepted: 10/18/2021] [Indexed: 12/15/2022]
Abstract
Epidemiological efforts to model the spread of SARS-CoV-2, the virus that causes COVID-19, are crucial to understanding and containing current and future outbreaks and to inform public health responses. Mutations that occur in viral genomes can alter virulence during outbreaks by increasing infection rates and helping the virus evade the host immune system. To understand the changes in viral genomic diversity and molecular epidemiology in Oxford during the first wave of infections in the United Kingdom, we analyzed 563 clinical SARS-CoV-2 samples via whole-genome sequencing using Nanopore MinION sequencing. Large-scale surveillance efforts during viral epidemics are likely to be confounded by the number of independent introductions of the viral strains into a region. To avoid such issues and better understand the selection-based changes occurring in the SARS-CoV-2 genome, we utilized local isolates collected during the UK's first national lockdown whereby personal interactions, international and national travel were considerably restricted and controlled. We were able to track the short-term evolution of the virus, detect the emergence of several mutations of concern or interest, and capture the viral diversity of the region. Overall, these results demonstrate genomic pathogen surveillance efforts have considerable utility in controlling the local spread of the virus.
Affiliation(s)
- Altar M Munis
- Gene Medicine Group, Nuffield Division of Clinical Laboratory Sciences, Radcliffe Department of Medicine, University of Oxford, Oxford, UK
- Alexander Mobbs
- Oxford University Hospitals NHS Foundation Trust, Oxford, UK
- Stephen C Hyde
- Gene Medicine Group, Nuffield Division of Clinical Laboratory Sciences, Radcliffe Department of Medicine, University of Oxford, Oxford, UK
- Deborah R Gill
- Gene Medicine Group, Nuffield Division of Clinical Laboratory Sciences, Radcliffe Department of Medicine, University of Oxford, Oxford, UK.
6. Di Pasquale A, Radomski N, Mangone I, Calistri P, Lorusso A, Cammà C. SARS-CoV-2 surveillance in Italy through phylogenomic inferences based on Hamming distances derived from pan-SNPs, -MNPs and -InDels. BMC Genomics 2021; 22:782. [PMID: 34717546 PMCID: PMC8556844 DOI: 10.1186/s12864-021-08112-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/21/2021] [Accepted: 10/20/2021] [Indexed: 01/08/2023]
Abstract
BACKGROUND: Faced with the ongoing global pandemic of coronavirus disease, the 'National Reference Centre for Whole Genome Sequencing of microbial pathogens: database and bioinformatic analysis' (GENPAT), formally established at the 'Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise' (IZSAM) in Teramo (Italy), is in charge of SARS-CoV-2 surveillance at the genomic scale. Because this surveillance requires correct and fast assessment of epidemiological clusters from a substantial number of samples, the present study proposes an analytical workflow for accurately identifying the PANGO lineages of SARS-CoV-2 samples and building discriminant minimum spanning trees (MST), bypassing the usual time-consuming phylogenomic inferences based on multiple sequence alignment (MSA) and a substitution model. RESULTS: GENPAT constituted two collections of SARS-CoV-2 samples. The first collection consisted of SARS-CoV-2 positive swabs collected by IZSAM from the Abruzzo region (Italy), then sequenced by next generation sequencing (NGS) and analyzed in GENPAT (n = 1,592), while the second collection included samples from several Italian provinces retrieved from the reference Global Initiative on Sharing All Influenza Data (GISAID) (n = 17,201). The main results showed that (i) GENPAT and GISAID detected the same PANGO lineages; (ii) the PANGO lineages B.1.177 (i.e. historical in Italy) and B.1.1.7 (i.e. 'UK variant') are major concerns today in several Italian provinces; and the new MST-based method (iii) clusters most of the PANGO lineages together, (iv) with a higher discriminatory power than PANGO lineages, and (v) faster than the usual phylogenomic methods based on MSA and a substitution model. CONCLUSIONS: The genome sequencing efforts of Italian provinces, combined with a structured national system of NGS data management, supported SARS-CoV-2 surveillance in Italy. We propose to build phylogenomic trees of SARS-CoV-2 variants through an accurate, discriminant and fast MST-based method that avoids the typical time-consuming steps of MSA and substitution model-based phylogenomic inference.
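The general technique named above, a minimum spanning tree built from pairwise Hamming distances, can be sketched briefly. This is not the GENPAT workflow: it assumes each sample has already been reduced to an equal-length presence/absence profile over a shared set of variants, and it uses Prim's algorithm on the complete distance graph as one standard way to obtain an MST. All names and the toy profiles are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(profiles):
    """Prim's algorithm over the complete Hamming-distance graph.

    profiles: dict mapping sample name -> profile string.
    Returns a list of (parent, child, distance) edges.
    """
    names = list(profiles)
    if not names:
        return []
    root = names[0]
    # best[n] = (distance to the tree, nearest tree node) for nodes not yet in the tree
    best = {n: (hamming(profiles[root], profiles[n]), root) for n in names[1:]}
    edges = []
    while best:
        child = min(best, key=lambda n: best[n][0])
        dist, parent = best.pop(child)
        edges.append((parent, child, dist))
        # The newly added node may offer a shorter link to the remaining nodes.
        for n in best:
            d = hamming(profiles[child], profiles[n])
            if d < best[n][0]:
                best[n] = (d, child)
    return edges


# Toy presence/absence profiles over four variants (illustrative only).
profiles = {"s1": "0000", "s2": "1000", "s3": "1100", "s4": "0001"}
tree = minimum_spanning_tree(profiles)
total_weight = sum(d for _, _, d in tree)
```

No alignment or substitution model is involved in this sketch: only pairwise mismatch counts are needed, which is the property the abstract credits for the speed of the MST-based approach.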
Affiliation(s)
- Adriano Di Pasquale
- National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: database and bioinformatics analysis (GENPAT), Istituto Zooprofilattico Sperimentale dell’Abruzzo e del Molise “Giuseppe Caporale” (IZSAM), via Campo Boario, 64100 Teramo, TE, Italy
- Nicolas Radomski
- National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: database and bioinformatics analysis (GENPAT), Istituto Zooprofilattico Sperimentale dell’Abruzzo e del Molise “Giuseppe Caporale” (IZSAM), via Campo Boario, 64100 Teramo, TE, Italy
- Iolanda Mangone
- National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: database and bioinformatics analysis (GENPAT), Istituto Zooprofilattico Sperimentale dell’Abruzzo e del Molise “Giuseppe Caporale” (IZSAM), via Campo Boario, 64100 Teramo, TE, Italy
- Paolo Calistri
- National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: database and bioinformatics analysis (GENPAT), Istituto Zooprofilattico Sperimentale dell’Abruzzo e del Molise “Giuseppe Caporale” (IZSAM), via Campo Boario, 64100 Teramo, TE, Italy
- Alessio Lorusso
- National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: database and bioinformatics analysis (GENPAT), Istituto Zooprofilattico Sperimentale dell’Abruzzo e del Molise “Giuseppe Caporale” (IZSAM), via Campo Boario, 64100 Teramo, TE, Italy
- Cesare Cammà
- National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: database and bioinformatics analysis (GENPAT), Istituto Zooprofilattico Sperimentale dell’Abruzzo e del Molise “Giuseppe Caporale” (IZSAM), via Campo Boario, 64100 Teramo, TE, Italy