1
Oh W, Jayaraman P, Sawant AS, Chan L, Levin MA, Charney AW, Kovatch P, Glicksberg BS, Nadkarni GN. Using sequence clustering to identify clinically relevant subphenotypes in patients with COVID-19 admitted to the intensive care unit. J Am Med Inform Assoc 2022; 29:489-499. [PMID: 35092685 PMCID: PMC8800515 DOI: 10.1093/jamia/ocab252] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/19/2021] [Revised: 10/01/2021] [Accepted: 11/02/2021] [Indexed: 11/13/2022]
Abstract
OBJECTIVE The novel coronavirus disease 2019 (COVID-19) has heterogeneous clinical courses, indicating that there might be distinct subphenotypes in critically ill patients. Although prior research has identified such subphenotypes, the temporal pattern of multiple clinical features has not been considered in cluster models. We aimed to identify temporal subphenotypes in critically ill patients with COVID-19 using a novel sequence cluster analysis and to associate them with clinically relevant outcomes.
MATERIALS AND METHODS We analyzed 1036 critically ill patients with laboratory-confirmed SARS-CoV-2 infection admitted to the Mount Sinai Health System in New York City. The agglomerative hierarchical clustering method was used with Levenshtein distance and Ward's minimum variance linkage.
RESULTS We identified four subphenotypes. Subphenotype I (N = 233 [22.5%]) included patients with rapid respirations and a rapid heartbeat but less need for invasive interventions within the first 24 hours, along with a relatively good prognosis. Subphenotype II (N = 418 [40.3%]) represented patients with the least degree of ailments, relatively low mortality, and the highest probability of discharge from the hospital. Subphenotype III (N = 259 [25.0%]) represented patients who experienced clinical deterioration during the first 24 hours of intensive care unit admission, leading to poor outcomes. Subphenotype IV (N = 126 [12.2%]) represented an acute respiratory distress syndrome trajectory with an almost universal need for mechanical ventilation.
CONCLUSION We used sequence cluster analysis to identify clinical subphenotypes in critically ill COVID-19 patients who had distinct temporal patterns and different clinical outcomes. This study points toward the utility of including temporal information in subphenotyping approaches.
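The clustering step named in the methods (agglomerative clustering on Levenshtein distances with Ward linkage) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `levenshtein` and `ward_clusters`, the encoding of each patient's course as a string of state symbols, and the use of the Lance-Williams update on squared edit distances are all assumptions made here. Ward's criterion is derived for Euclidean distances, so applying it to edit distances should be read as a heuristic.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def ward_clusters(seqs, k):
    """Merge sequences bottom-up until k clusters remain (k >= 1).

    Assumption: Ward linkage is applied through the Lance-Williams
    update on squared Levenshtein distances, which is only a heuristic
    for non-Euclidean distances.
    """
    clusters = {i: [i] for i in range(len(seqs))}
    d2 = {frozenset((i, j)): float(levenshtein(seqs[i], seqs[j])) ** 2
          for i, j in combinations(range(len(seqs)), 2)}
    next_id = len(seqs)
    while len(clusters) > k:
        pair = min(d2, key=d2.get)          # closest pair of clusters
        i, j = tuple(pair)
        ni, nj = len(clusters[i]), len(clusters[j])
        dij = d2.pop(pair)
        merged = clusters.pop(i) + clusters.pop(j)
        for m, members in clusters.items():
            nm = len(members)
            dmi = d2.pop(frozenset((m, i)))
            dmj = d2.pop(frozenset((m, j)))
            # Lance-Williams update for Ward's minimum variance linkage
            d2[frozenset((m, next_id))] = (
                (ni + nm) * dmi + (nj + nm) * dmj - nm * dij
            ) / (ni + nj + nm)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy sequences; each character stands for one observed state.
toy = ["AAAA", "AAAB", "BBBB", "BBBA"]
print(ward_clusters(toy, 2))
```

On real data the pairwise distance matrix grows quadratically with the number of patients, so a library implementation would normally replace this pure-Python loop.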
Affiliation(s)
- Wonsuk Oh
  - Hasso Plattner Institute of Digital Health, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Division of Data Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Pushkala Jayaraman
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Division of Data Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Ashwin S Sawant
  - Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Lili Chan
  - Division of Data Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Matthew A Levin
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Anesthesiology, Perioperative and Pain Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Alexander W Charney
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Pathology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Pamela Sklar Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Patricia Kovatch
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Pharmacological Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Benjamin S Glicksberg
  - Hasso Plattner Institute of Digital Health, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Girish N Nadkarni
  - Hasso Plattner Institute of Digital Health, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Division of Data Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
3
Verbyla KL, Yap VB, Pahwa A, Shao Y, Huttley GA. The embedding problem for Markov models of nucleotide substitution. PLoS One 2013; 8:e69187. [PMID: 23935949 PMCID: PMC3728303 DOI: 10.1371/journal.pone.0069187] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 12/11/2012] [Accepted: 06/10/2013] [Indexed: 11/18/2022]
Abstract
Continuous-time Markov processes are often used to model the complex natural phenomenon of sequence evolution. To make the process of sequence evolution tractable, simplifying assumptions are often made about the sequence properties and the underlying process. The validity of one such assumption, time-homogeneity, has never been explored. Violations of this assumption can be found by identifying non-embeddability. A process is non-embeddable if it cannot be embedded in a continuous time-homogeneous Markov process. In this study, non-embeddability was demonstrated to exist when modelling sequence evolution with Markov models. Evidence of non-embeddability was found primarily at the third codon position, possibly resulting from changes in mutation rate over time. Outgroup edges and those with a deeper time depth were found to have an increased probability of the underlying process being non-embeddable. Overall, low levels of non-embeddability were detected when examining individual edges of triads across a diverse set of alignments. Subsequent phylogenetic reconstruction analyses demonstrated that non-embeddability could affect the correct prediction of phylogenies, but at extremely low levels. Despite the existence of non-embeddability, there is minimal evidence of violations of the local time-homogeneity assumption, and consequently the impact is likely to be minor.
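To make the notion of embeddability concrete, the sketch below checks it for the simplest possible case, a two-state chain, where a closed form is available. This is an illustration only and not the procedure used in the study, which concerns four-state nucleotide models where the test is considerably harder. The function names are hypothetical. It relies on the standard result that a 2x2 transition matrix [[1-a, a], [b, 1-b]] equals exp(Q) for some rate matrix Q exactly when its determinant 1 - a - b is positive.

```python
import math


def embed_two_state(a, b):
    """Return a rate matrix Q with exp(Q) = [[1-a, a], [b, 1-b]], or None.

    a and b are the off-diagonal transition probabilities. None is
    returned when the matrix is non-embeddable (determinant <= 0).
    """
    if not (0.0 <= a <= 1.0 and 0.0 <= b <= 1.0):
        raise ValueError("a and b must be probabilities")
    s = a + b
    if s == 0.0:
        return [[0.0, 0.0], [0.0, 0.0]]   # identity matrix: Q = 0
    if s >= 1.0:
        return None                        # det = 1 - a - b is not positive
    c = -math.log(1.0 - s) / s             # scaling that makes exp(Q) match
    return [[-c * a, c * a], [c * b, -c * b]]


def transition_from_rates(q, t=1.0):
    """Closed-form exp(Q t) for a two-state rate matrix Q."""
    a, b = q[0][1], q[1][0]
    s = a + b
    if s == 0.0:
        return [[1.0, 0.0], [0.0, 1.0]]
    decay = (1.0 - math.exp(-s * t)) / s
    return [[1.0 - a * decay, a * decay],
            [b * decay, 1.0 - b * decay]]


q = embed_two_state(0.2, 0.1)
print(q)                              # a valid rate matrix
print(transition_from_rates(q))       # recovers the original probabilities
print(embed_two_state(0.7, 0.6))      # None: no time-homogeneous process fits
```

The second example fails because more than half of the probability mass switches state in both directions within one step, which no single constant-rate process can produce.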
Affiliation(s)
- Klara L. Verbyla
  - Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
  - CSIRO Mathematics, Informatics and Statistics, CSIRO, Canberra, Australian Capital Territory, Australia
- Von Bing Yap
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore
- Anuj Pahwa
  - Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- Yunli Shao
  - Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- Gavin A. Huttley
  - Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
5
Wang J, Gao X, Wang Q, Li Y. ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval. BMC Bioinformatics 2012; 13 Suppl 7:S2. [PMID: 22594999 PMCID: PMC3348016 DOI: 10.1186/1471-2105-13-s7-s2] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 12/29/2022]
Abstract
BACKGROUND The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. Pairwise measures of this kind neglect the distribution of the other proteins in the database and thus cannot meet the accuracy required of retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning contextual dissimilarity/similarity measures can improve retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised manner and do not use the known class labels of proteins in the database.
RESULTS In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure d_ij by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is that, for a pair of proteins (i, j), if their contexts N(i) and N(j) are similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing d_ij by a factor learned from the contexts N(i) and N(j). Moreover, we divide the context into hierarchical sub-contexts and obtain a contextual dissimilarity vector for each protein pair. Using the class label information of the proteins, we select relevant protein pairs (pairs with the same class label) and irrelevant protein pairs (pairs with different labels), and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing factor. Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchical Context Coherently in an iterative algorithm, ProDis-ContSHC. We test the performance of ProDis-ContSHC on two benchmark sets, i.e., the ASTRAL 1.73 database and the FSSP/DALI database. Experimental results demonstrate that plugging our supervised contextual dissimilarity measures into the retrieval systems significantly outperforms the context-free dissimilarity/similarity measures and other unsupervised contextual dissimilarity measures that do not use the class label information.
CONCLUSIONS Using the contextual proteins with their class labels in the database, we can improve the accuracy of the pairwise dissimilarity/similarity measures dramatically for protein retrieval tasks. In this work, for the first time, we propose the idea of supervised contextual dissimilarity learning, resulting in the ProDis-ContSHC algorithm. Among different contextual dissimilarity learning approaches that can be used to compare a pair of proteins, ProDis-ContSHC provides the highest accuracy. Finally, ProDis-ContSHC compares favorably with other methods reported in the recent literature.
Affiliation(s)
- Jingyan Wang
  - King Abdullah University of Science and Technology (KAUST), Mathematical and Computer Sciences and Engineering Division, Thuwal, 23955-6900, Saudi Arabia
6
Zou L, Susko E, Field C, Roger AJ. Fitting nonstationary general-time-reversible models to obtain edge-lengths and frequencies for the Barry-Hartigan model. Syst Biol 2012; 61:927-40. [PMID: 22508720 DOI: 10.1093/sysbio/sys046] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/12/2022]
Abstract
Among models of nucleotide evolution, the Barry and Hartigan (BH) model (also known as the General Markov Model) is very flexible as it allows separate arbitrary substitution matrices along edges. For a given tree, the estimates of the BH model are a set of joint probability matrices, each giving the pairwise frequencies of nucleotides at the ends of the edge. We have previously shown that, due to an identifiability problem, these cannot be expected to consistently estimate the actual pairwise frequencies. A further consequence is that internal node frequency estimates are likely to be incorrect. Here we define a nonstationary GTR model for each edge that we refer to as the NSGTR model. We fit the NSGTR model by minimizing the sums of squares between the estimates of transition probabilities under the NSGTR model and the estimates provided by a fitted BH model. This NSGTR model provides estimates that avoid the identifiability difficulties of the BH model while closely fitting it. With the best-fitting NSGTR estimates, we are able to get interpretable frequency vectors at internal nodes as well as edge length estimates that are otherwise not yielded by the BH model. These edge lengths are interpretable as the expected number of substitutions along an edge for the model. We also show that for a nonstationary continuous-time model these are not the same as the edge length parameters for conventional substitution matrices that are output by nonstationary model phylogenetic estimation programs such as nhPhyML.
Affiliation(s)
- Liwen Zou
  - Bioinformatics Research Center, Department of Genetics, North Carolina State University, NC, USA