1
Gholami S, Scheppke L, Kshirsagar M, Wu Y, Dodhia R, Bonelli R, Leung I, Sallo FB, Muldrew A, Jamison C, Peto T, Lavista Ferres J, Weeks WB, Friedlander M, Lee AY. Self-Supervised Learning for Improved Optical Coherence Tomography Detection of Macular Telangiectasia Type 2. JAMA Ophthalmol 2024; 142:226-233. [PMID: 38329740 PMCID: PMC10853868 DOI: 10.1001/jamaophthalmol.2023.6454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Accepted: 11/29/2023] [Indexed: 02/09/2024]
Abstract
Importance: Deep learning image analysis often depends on large, labeled datasets, which are difficult to obtain for rare diseases.
Objective: To develop a self-supervised approach for automated classification of macular telangiectasia type 2 (MacTel) on optical coherence tomography (OCT) with limited labeled data.
Design, Setting, and Participants: This was a retrospective comparative study. OCT images were collected by the Lowy Medical Research Institute, La Jolla, California, from May 2014 to May 2019 and by the University of Washington, Seattle, from January 2016 to October 2022. Clinical diagnoses of patients with and without MacTel were confirmed by retina specialists. Data were analyzed from January to September 2023.
Exposures: Two convolutional neural networks were pretrained using the Bootstrap Your Own Latent algorithm on unlabeled training data and fine-tuned with labeled training data to predict MacTel (self-supervised method). ResNet18 and ResNet50 models were also trained using all labeled data (supervised method).
Main Outcomes and Measures: The ground truth yes vs no MacTel diagnosis was determined by retinal specialists based on spectral-domain OCT. The models' predictions were compared against human graders using accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the precision-recall curve (AUPRC), and area under the receiver operating characteristic curve (AUROC). Uniform manifold approximation and projection was performed for dimension reduction, and GradCAM visualizations were produced for the supervised and self-supervised methods.
Results: A total of 2636 OCT scans from 780 patients with MacTel and 131 patients without MacTel were included from the MacTel Project (mean [SD] age, 60.8 [11.7] years; 63.8% female), and another 2564 from 1769 patients without MacTel from the University of Washington (mean [SD] age, 61.2 [18.1] years; 53.4% female). The self-supervised approach fine-tuned on 100% of the labeled training data with ResNet50 as the feature extractor performed the best, achieving an AUPRC of 0.971 (95% CI, 0.969-0.972), an AUROC of 0.970 (95% CI, 0.970-0.973), accuracy of 0.898, sensitivity of 0.898, specificity of 0.949, PPV of 0.935, and NPV of 0.919. With only 419 OCT volumes (185 MacTel patients in 10% of the labeled training dataset), the ResNet18 self-supervised model achieved comparable performance, with an AUPRC of 0.958 (95% CI, 0.957-0.960), an AUROC of 0.966 (95% CI, 0.964-0.967), and accuracy, sensitivity, specificity, PPV, and NPV of 90.2%, 0.884, 0.916, 0.896, and 0.906, respectively. The self-supervised models showed better agreement with the more experienced human expert graders.
Conclusions and Relevance: The findings suggest that self-supervised learning may improve the accuracy of automated MacTel vs non-MacTel binary classification on OCT with limited labeled training data, and these approaches may be applicable to other rare diseases, although further research is warranted.
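The threshold-based metrics named in this abstract (accuracy, sensitivity, specificity, PPV, NPV) all derive from the four cells of a binary confusion matrix. The sketch below shows those definitions only; the counts are hypothetical illustration values and are not taken from the study, and the names `ConfusionCounts` and `binary_metrics` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Cells of a binary confusion matrix (hypothetical container)."""
    tp: int  # disease present, model predicts present
    fp: int  # disease absent, model predicts present
    tn: int  # disease absent, model predicts absent
    fn: int  # disease present, model predicts absent


def binary_metrics(c: ConfusionCounts) -> dict:
    """Return the standard threshold-based classification metrics."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),  # recall on positives
        "specificity": c.tn / (c.tn + c.fp),  # recall on negatives
        "ppv": c.tp / (c.tp + c.fp),          # precision
        "npv": c.tn / (c.tn + c.fn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = ConfusionCounts(tp=90, fp=5, tn=95, fn=10)
metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy is a prevalence-weighted average of sensitivity and specificity, so it always falls between the two.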
Affiliation(s)
- Lea Scheppke
  - The Lowy Medical Research Institute, La Jolla, California
- Yue Wu
  - Department of Ophthalmology, University of Washington, Seattle
  - Roger and Angie Karalis Johnson Retina Center, Seattle, Washington
- Rahul Dodhia
  - AI for Good Lab, Microsoft Research, Redmond, Washington
- Irene Leung
  - Moorfields Eye Hospital, London, United Kingdom
- Ferenc B. Sallo
  - Hôpital Ophtalmique Jules-Gonin, Fondation Asile des Aveugles, University of Lausanne, Lausanne, Switzerland
- Tunde Peto
  - Queen’s University Belfast, Belfast, Northern Ireland
- Martin Friedlander
  - The Lowy Medical Research Institute, La Jolla, California
  - The Scripps Research Institute, La Jolla, California
- Aaron Y. Lee
  - Department of Ophthalmology, University of Washington, Seattle
  - Roger and Angie Karalis Johnson Retina Center, Seattle, Washington
2
Pereira M, Kshirsagar M, Mukherjee S, Dodhia R, Lavista Ferres J, de Sousa R. Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data. PLoS One 2024; 19:e0297271. [PMID: 38315667 PMCID: PMC10843030 DOI: 10.1371/journal.pone.0297271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2023] [Accepted: 01/02/2024] [Indexed: 02/07/2024] Open
Abstract
Differentially private (DP) synthetic datasets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We systematically investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic dataset generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generated using AIM and MWEM PGM algorithms can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.
Affiliation(s)
- Mayana Pereira
  - AI for Good Research Lab, Microsoft, Redmond, Washington, United States of America
  - Department of Electrical Engineering, University of Brasilia, Brasilia, Brazil
- Meghana Kshirsagar
  - AI for Good Research Lab, Microsoft, Redmond, Washington, United States of America
- Rahul Dodhia
  - AI for Good Research Lab, Microsoft, Redmond, Washington, United States of America
- Juan Lavista Ferres
  - AI for Good Research Lab, Microsoft, Redmond, Washington, United States of America
- Rafael de Sousa
  - Department of Electrical Engineering, University of Brasilia, Brasilia, Brazil
3
Ciceri G, Baggiolini A, Cho HS, Kshirsagar M, Benito-Kwiecinski S, Walsh RM, Aromolaran KA, Gonzalez-Hernandez AJ, Munguba H, Koo SY, Xu N, Sevilla KJ, Goldstein PA, Levitz J, Leslie CS, Koche RP, Studer L. An epigenetic barrier sets the timing of human neuronal maturation. Nature 2024; 626:881-890. [PMID: 38297124 PMCID: PMC10881400 DOI: 10.1038/s41586-023-06984-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2022] [Accepted: 12/15/2023] [Indexed: 02/02/2024]
Abstract
The pace of human brain development is highly protracted compared with most other species [1-7]. The maturation of cortical neurons is particularly slow, taking months to years to develop adult functions [3-5]. Remarkably, such protracted timing is retained in cortical neurons derived from human pluripotent stem cells (hPSCs) during in vitro differentiation or upon transplantation into the mouse brain [4,8,9]. Those findings suggest the presence of a cell-intrinsic clock setting the pace of neuronal maturation, although the molecular nature of this clock remains unknown. Here we identify an epigenetic developmental programme that sets the timing of human neuronal maturation. First, we developed an hPSC-based approach to synchronize the birth of cortical neurons in vitro, which enabled us to define an atlas of morphological, functional and molecular maturation. We observed a slow unfolding of maturation programmes, limited by the retention of specific epigenetic factors. Loss of function of several of those factors in cortical neurons enables precocious maturation. Transient inhibition of EZH2, EHMT1 and EHMT2 or DOT1L at the progenitor stage primes newly born neurons to rapidly acquire mature properties upon differentiation. Thus, our findings reveal that the rate at which human neurons mature is set well before neurogenesis through the establishment of an epigenetic barrier in progenitor cells. Mechanistically, this barrier holds transcriptional maturation programmes in a poised state that is gradually released to ensure the prolonged timeline of human cortical neuron maturation.
Affiliation(s)
- Gabriele Ciceri
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Arianna Baggiolini
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - Institute of Oncology Research (IOR), Bellinzona Institutes of Science (BIOS+), Bellinzona, Switzerland
  - Faculty of Biomedical Sciences, Università della Svizzera Italiana, Lugano, Switzerland
- Hyein S Cho
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - Computational Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Meghana Kshirsagar
  - Computational Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - Microsoft AI for Good Research, Redmond, WA, USA
- Silvia Benito-Kwiecinski
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Ryan M Walsh
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Hermany Munguba
  - Department of Biochemistry, Weill Cornell Medicine, New York, NY, USA
- So Yeon Koo
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - Weill Cornell Neuroscience PhD Program, New York, NY, USA
- Nan Xu
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - Louis V. Gerstner Jr Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Kaylin J Sevilla
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Peter A Goldstein
  - Department of Anesthesiology, Weill Cornell Medicine, New York, NY, USA
- Joshua Levitz
  - Department of Biochemistry, Weill Cornell Medicine, New York, NY, USA
- Christina S Leslie
  - Computational Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Richard P Koche
  - Center for Epigenetics Research, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Lorenz Studer
  - The Center for Stem Cell Biology and Developmental Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
4
Sledzieski S, Kshirsagar M, Baek M, Berger B, Dodhia R, Ferres JL. Democratizing Protein Language Models with Parameter-Efficient Fine-Tuning. bioRxiv 2023:2023.11.09.566187. [PMID: 37986761 PMCID: PMC10659351 DOI: 10.1101/2023.11.09.566187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
Proteomics has been revolutionized by large pre-trained protein language models, which learn unsupervised representations from large corpora of sequences. The parameters of these models are then fine-tuned in a supervised setting to tailor the model to a specific downstream task. However, as model size increases, the computational and memory footprint of fine-tuning becomes a barrier for many research groups. In the field of natural language processing, which has seen a similar explosion in the size of models, these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce parameter-efficient fine-tuning methods to proteomics. Using the parameter-efficient method LoRA, we train new models for two important proteomic tasks: predicting protein-protein interactions (PPI) and predicting the symmetry of homooligomers. We show that for homooligomer symmetry prediction, these approaches achieve performance competitive with traditional fine-tuning while requiring reduced memory and using three orders of magnitude fewer parameters. On the PPI prediction task, we surprisingly find that PEFT models actually outperform traditional fine-tuning while using two orders of magnitude fewer parameters. Here, we go even further to show that freezing the parameters of the language model and training only a classification head also outperforms fine-tuning, using five orders of magnitude fewer parameters, and that both of these models outperform state-of-the-art PPI prediction methods with substantially reduced compute. We also demonstrate that PEFT is robust to variations in training hyper-parameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. Thus, we provide a blueprint to democratize the power of protein language model tuning to groups that have limited computational resources.
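The parameter savings mentioned in this abstract come from how LoRA is generally formulated: a frozen weight matrix W (d_out x d_in) is left untouched and a trainable low-rank update B·A of rank r is added, scaled by alpha / r. The sketch below is a minimal pure-Python illustration of that idea for a single matrix; the dimensions, rank, and function names are assumptions for the example and do not reproduce the models or parameter ratios reported in the paper.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen."""
    rank = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


# Illustrative sizes only (assumed, not from the paper).
full = full_finetune_params(1024, 1024)
low_rank = lora_params(1024, 1024, rank=8)
print(full, low_rank, full // low_rank)

# B is conventionally initialised to zero, so the adapted layer
# starts out identical to the frozen base layer.
w = [[1.0, 2.0], [3.0, 4.0]]
a = [[0.5, -0.5]]    # rank 1, d_in = 2
b = [[0.0], [0.0]]   # d_out = 2, rank 1
x = [1.0, 1.0]
y_at_init = lora_forward(w, a, b, x, alpha=1.0)
print(y_at_init)
```

For one 1024 x 1024 matrix at rank 8 the reduction is 64-fold; larger whole-model reductions arise when only a subset of the matrices in a large model receives adapters.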
Affiliation(s)
- Samuel Sledzieski
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge MA 02139, USA
  - AI for Good Research Lab, Microsoft Corporation, Redmond WA 98052, USA
- Minkyung Baek
  - Department of Biological Sciences, Seoul National University, Seoul 08826, South Korea
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge MA 02139, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge MA 02139, USA
- Rahul Dodhia
  - AI for Good Research Lab, Microsoft Corporation, Redmond WA 98052, USA
5
Ali MS, Kshirsagar M, Naredo E, Ryan C. Dynamic Grammar Pruning for Program Size Reduction in Symbolic Regression. SN Comput Sci 2023; 4:402. [PMID: 37214587 PMCID: PMC10192180 DOI: 10.1007/s42979-023-01840-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2022] [Accepted: 04/12/2023] [Indexed: 05/24/2023]
Abstract
Grammar is a key input in grammar-based genetic programming. Grammar design influences not only performance but also program size. However, grammar design and the choice of productions often require expert input, as no automatic approach exists. This research work discusses our approach to automatically reduce a bloated grammar. By utilizing a simple Production Ranking mechanism, we identify productions that are less useful and dynamically prune them to channel evolutionary search towards better (smaller) solutions. Our objective in this work was program size reduction without compromising generalization performance. We tested our approach on 13 standard symbolic regression datasets with Grammatical Evolution. Using a grammar embodying a well-defined function set as a baseline, we compare effective genome length and test performance with our approach. Dynamic grammar pruning achieved significantly better genome lengths for all datasets, while significantly improving generalization performance on three datasets, although it worsened in five datasets. When we utilized linear scaling during the production ranking stages (the first 20 generations), the results dramatically improved. Not only were the programs smaller in all datasets, but generalization scores were also significantly better than the baseline in 6 out of 13 datasets, and comparable in the rest. When the baseline was also linearly scaled, the program size was still smaller with the Production Ranking approach, while generalization scores dropped in only three datasets without any significant compromise in the rest.
Affiliation(s)
- Muhammad Sarmad Ali
  - Department of Computer Science and Information Systems, University of Limerick, Castletroy, Limerick, V94 T9PX Ireland
- Meghana Kshirsagar
  - Department of Computer Science and Information Systems, University of Limerick, Castletroy, Limerick, V94 T9PX Ireland
- Enrique Naredo
  - Department of Computer Science and Information Systems, University of Limerick, Castletroy, Limerick, V94 T9PX Ireland
- Conor Ryan
  - Department of Computer Science and Information Systems, University of Limerick, Castletroy, Limerick, V94 T9PX Ireland
6
Meller A, Ward MD, Borowsky JH, Lotthammer JM, Kshirsagar M, Oviedo F, Lavista Ferres J, Bowman G. Predicting the locations of cryptic pockets from single protein structures using the PocketMiner graph neural network. Biophys J 2023; 122:445a. [PMID: 36784287 DOI: 10.1016/j.bpj.2022.11.2400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023] Open
Affiliation(s)
- Artur Meller
  - Washington University in St. Louis, St. Louis, MO, USA
- Michael D Ward
  - Washington University School of Medicine, St. Louis, MO, USA
- Jeffrey M Lotthammer
  - Biochemistry and Molecular Biophysics, Washington University in St. Louis, St. Louis, MO, USA
- Gregory Bowman
  - Biochemistry and Molecular Biophysics, University of Pennsylvania, Philadelphia, PA, USA
7
Mukherjee S, Kshirsagar M, Becker N, Xu Y, Weeks WB, Patel S, Ferres JL, Jackson ML. Identifying long-term effects of SARS-CoV-2 and their association with social determinants of health in a cohort of over one million COVID-19 survivors. BMC Public Health 2022; 22:2394. [PMID: 36539760 PMCID: PMC9765366 DOI: 10.1186/s12889-022-14806-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2021] [Accepted: 12/05/2022] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND: Despite an abundance of information on the risk factors of SARS-CoV-2, there have been few US-wide studies of long-term effects. In this paper we analyzed a large medical claims database of US-based individuals to identify common long-term effects as well as their associations with various social and medical risk factors.
METHODS: The medical claims database was obtained from a prominent US-based claims data processing company, namely Change Healthcare. In addition to the claims data, the dataset also included various social determinants of health such as race, income, education level and veteran status of the individuals. A self-controlled cohort design (SCCD) observational study was performed to identify ICD-10 codes whose proportion was significantly increased in the outcome period compared to the control period, and thereby to identify significant long-term effects. A logistic regression-based association analysis was then performed between identified long-term effects and social determinants of health.
RESULTS: Among the over 1.37 million COVID patients in our datasets, we found 36 out of 1724 3-digit ICD-10 codes to be statistically significantly increased in the post-COVID period (p-value < 0.05). We also found one combination of ICD-10 codes, corresponding to 'other anemias' and 'hypertension', that was statistically significantly increased in the post-COVID period (p-value < 0.05). Our logistic regression-based association analysis with social determinants of health variables, after adjusting for comorbidities and prior conditions, showed that age and gender were significantly associated with multiple long-term effects. Race was only associated with 'other sepsis', income was only associated with 'Alopecia areata' (autoimmune disease causing hair loss), while education level was only associated with 'Maternal infectious and parasitic diseases' (p-value < 0.05).
CONCLUSION: We identified several long-term effects of SARS-CoV-2 through a self-controlled study on a cohort of over one million patients. Furthermore, we found that while age and gender are commonly associated with the long-term effects, other social determinants of health such as race, income and education levels have rare or no significant associations.
Affiliation(s)
- Sumit Mukherjee
  - Insitro Labs, South San Francisco, USA (work done while at Microsoft)
- Meghana Kshirsagar
  - AI for Good Research Lab, Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, USA
- Nicholas Becker
  - AI for Good Research Lab, Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, USA
  - University of Washington, Seattle, USA
- Yixi Xu
  - AI for Good Research Lab, Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, USA
- William B. Weeks
  - AI for Good Research Lab, Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, USA
- Shwetak Patel
  - University of Washington, Seattle, USA
- Juan Lavista Ferres
  - AI for Good Research Lab, Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, USA
- Michael L. Jackson
  - Kaiser Permanente Washington, Seattle, USA
8
Kshirsagar M, Nasir M, Mukherjee S, Becker N, Dodhia R, Weeks WB, Ferres JL, Richardson B. The Risk of Hospitalization and Mortality After Breakthrough SARS-CoV-2 Infection by Vaccine Type: Observational Study of Medical Claims Data. JMIR Public Health Surveill 2022; 8:e38898. [PMID: 36265135 PMCID: PMC9645422 DOI: 10.2196/38898] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/20/2022] [Revised: 10/06/2022] [Accepted: 10/18/2022] [Indexed: 11/09/2022] Open
Abstract
BACKGROUND: Several risk factors have been identified for severe COVID-19 disease by the scientific community. In this paper, we focus on understanding the risks for severe COVID-19 infections after vaccination (ie, in breakthrough SARS-CoV-2 infections). Studying these risks by vaccine type, age, sex, comorbidities, and any prior SARS-CoV-2 infection is important to policy makers planning further vaccination efforts.
OBJECTIVE: We performed a comparative study of the risks of hospitalization (n=1140) and mortality (n=159) in a SARS-CoV-2 positive cohort of 19,815 patients who were all fully vaccinated with the Pfizer, Moderna, or Janssen vaccines.
METHODS: We performed Cox regression analysis to calculate the risk factors for developing a severe breakthrough SARS-CoV-2 infection in the study cohort by controlling for vaccine type, age, sex, comorbidities, and a prior SARS-CoV-2 infection.
RESULTS: We found lower hazard ratios for those receiving the Moderna vaccine (P<.001) and Pfizer vaccine (P<.001), with the lowest hazard rates being for Moderna, as compared to those who received the Janssen vaccine, independent of age, sex, comorbidities, vaccine type, and prior SARS-CoV-2 infection. Further, individuals who had a SARS-CoV-2 infection prior to vaccination had some increased protection over and above the protection already provided by the vaccines, from hospitalization (P=.001) and death (P=.04), independent of age, sex, comorbidities, and vaccine type. We found that the top statistically significant risk factors for severe breakthrough SARS-CoV-2 infections were age of >50, male gender, moderate and severe renal failure, severe liver disease, leukemia, chronic lung disease, coagulopathy, and alcohol abuse.
CONCLUSIONS: Among individuals who were fully vaccinated, the risk of severe breakthrough SARS-CoV-2 infection was lower for recipients of the Moderna or Pfizer vaccines and higher for recipients of the Janssen vaccine. These results from our analysis at a population level will be helpful to public health policy makers. Our result on the influence of a previous SARS-CoV-2 infection necessitates further research into the impact of multiple exposures on the risk of developing severe COVID-19.
Affiliation(s)
- Md Nasir
  - Microsoft, Redmond, WA, United States
- Nicholas Becker
  - Microsoft, Redmond, WA, United States
  - Paul G Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, United States
- Barbra Richardson
  - Department of Biostatistics and Global Health, University of Washington, Seattle, WA, United States
9
Kshirsagar M, Yuan H, Ferres JL, Leslie C. BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin. Genome Biol 2022; 23:174. [PMID: 35971180 PMCID: PMC9380350 DOI: 10.1186/s13059-022-02723-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2021] [Accepted: 06/28/2022] [Indexed: 11/10/2022] Open
Abstract
We present a novel unsupervised deep learning approach called BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. BindVAE can disentangle an input DNA sequence into distinct latent factors that encode cell-type specific in vivo binding signals for individual TFs, composite patterns for TFs involved in cooperative binding, and genomic context surrounding the binding sites. On the task of retrieving the motifs of expressed TFs in a given cell type, BindVAE is competitive with existing motif discovery approaches.
Affiliation(s)
- Han Yuan
  - Calico Life Sciences, South San Francisco, CA, USA
10
Law JN, Akers K, Tasnina N, Santina CMD, Deutsch S, Kshirsagar M, Klein-Seetharaman J, Crovella M, Rajagopalan P, Kasif S, Murali TM. Interpretable network propagation with application to expanding the repertoire of human proteins that interact with SARS-CoV-2. Gigascience 2021; 10:giab082. [PMID: 34966926 PMCID: PMC8716363 DOI: 10.1093/gigascience/giab082] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/30/2021] [Revised: 09/21/2021] [Accepted: 11/28/2021] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND: Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction.
RESULTS: We design a network propagation framework with 2 novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents.
CONCLUSIONS: We examine how our provenance-tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly emerging viruses.
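Provenance tracing of the kind described in this abstract is possible whenever the propagation is linear in the seed scores: the final score of a node is then exactly the sum of the scores obtained by propagating each seed on its own. The sketch below illustrates this with a random walk with restart on a toy path graph. The choice of random walk with restart, the restart parameter `alpha`, the toy graph, and the function names are all assumptions for illustration; the abstract does not specify the algorithm used in the paper.

```python
def propagate(adjacency, seeds, alpha=0.5, iterations=200):
    """Random walk with restart: s = (1 - alpha) * seeds + alpha * W s,
    where W is the column-normalised adjacency matrix."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    scores = list(seeds)
    for _ in range(iterations):
        spread = [
            sum(
                adjacency[i][j] / col_sums[j] * scores[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            for i in range(n)
        ]
        scores = [(1 - alpha) * seeds[i] + alpha * spread[i] for i in range(n)]
    return scores


def trace_provenance(adjacency, seeds, alpha=0.5, iterations=200):
    """Propagate each seed separately; by linearity the per-seed score
    vectors add up to the score vector for all seeds together."""
    n = len(seeds)
    contributions = {}
    for j, weight in enumerate(seeds):
        if weight == 0:
            continue
        single = [0.0] * n
        single[j] = weight
        contributions[j] = propagate(adjacency, single, alpha, iterations)
    return contributions


# Toy path graph 0 - 1 - 2 - 3 with seeds at both ends (hypothetical data).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
seeds = [1.0, 0.0, 0.0, 1.0]

total = propagate(adjacency, seeds)
contributions = trace_provenance(adjacency, seeds)
for node in range(4):
    parts = {seed: round(vec[node], 4) for seed, vec in contributions.items()}
    print(node, round(total[node], 4), parts)
```

In this toy graph, node 1 receives most of its score from seed 0, its direct neighbor, rather than from the more distant seed 3.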
Affiliation(s)
- Jeffrey N Law
  - Interdisciplinary Ph.D. Program in Genetics, Bioinformatics, and Computational Biology, Virginia Tech, Blacksburg, VA 24061, USA
- Kyle Akers
  - Interdisciplinary Ph.D. Program in Genetics, Bioinformatics, and Computational Biology, Virginia Tech, Blacksburg, VA 24061, USA
- Nure Tasnina
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Shay Deutsch
  - Department of Mathematics, University of California, Los Angeles, CA 90095, USA
- Mark Crovella
  - Department of Computer Science, Boston University, Boston, MA 02215, USA
- Simon Kasif
  - Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
- T M Murali
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
11
Yao Y, Kshirsagar M, Vaidya G, Ducrée J, Ryan C. Convergence of Blockchain, Autonomous Agents, and Knowledge Graph to Share Electronic Health Records. Front Blockchain 2021. [DOI: 10.3389/fbloc.2021.661238] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/13/2022]
Abstract
In this article, we discuss a data sharing and knowledge integration framework through autonomous agents with blockchain for implementing Electronic Health Records (EHR). This will enable us to augment existing blockchain-based EHR Systems. We discuss how major concerns in the health industry, i.e., trust, security and scalability, can be addressed by transitioning from existing models to convergence of the three technologies – blockchain, agent-based modeling, and knowledge graph in a decentralized ecosystem. Each autonomous agent is responsible for instantiating key processes, such as user authentication and authorization, smart contracts, and knowledge graph generation through data integration among the participating stakeholders in the network. We discuss a layered approach for the design of the proposed system leading to an enhanced, safer clinical decision-making system. This can pave the way toward more informed and engaged patients and citizens by delivering personalized healthcare.
|
12
|
Kshirsagar M, Tasnina N, Ward MD, Law JN, Murali TM, Lavista Ferres JM, Bowman GR, Klein-Seetharaman J. Protein sequence models for prediction and comparative analysis of the SARS-CoV-2-human interactome. Pac Symp Biocomput 2021; 26:154-165. [PMID: 33691013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Viruses such as the novel coronavirus SARS-CoV-2, which is wreaking havoc on the world, depend on interactions of their own proteins with those of the human host cells. Relatively small changes in sequence, such as between SARS-CoV and SARS-CoV-2, can dramatically change clinical phenotypes of the virus, including transmission rates and severity of the disease. On the other hand, highly dissimilar virus families such as Coronaviridae, Ebola, and HIV have overlap in functions. In this work we aim to analyze the role of protein sequence in the binding of SARS-CoV-2 virus proteins to human proteins and compare it to that of the other viruses above. We build supervised machine learning models, using Generalized Additive Models, to predict interactions based on sequence features and find that our models perform well, with an AUC-PR of 0.65 at a class skew of 1:10. Analysis of the novel predictions using an independent dataset showed statistically significant enrichment. We further map the importance of specific amino-acid sequence features in predicting binding and summarize which combinations of sequences from the virus and the host are correlated with an interaction. By analyzing the sequence-based embeddings of the interactomes from different viruses and clustering them together, we find some functionally similar proteins from different viruses. For example, the vif protein from HIV-1, vp24 from Ebola, and orf3b from SARS-CoV all function as interferon antagonists. Furthermore, we can differentiate the functions of similar viruses; for example, orf3a's interactions are more diverged than orf7b's interactions when comparing SARS-CoV and SARS-CoV-2.
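To put the reported AUC-PR of 0.65 in context, it helps to recall that a classifier ranking pairs at random is expected to score roughly the positive-class prevalence. The sketch below is illustrative only and is not the authors' evaluation code: it assumes the 1:10 class skew means one interacting pair per ten non-interacting pairs, and it uses average precision as a common estimator of the area under the precision-recall curve. The function names and toy labels are hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision measured at each positive's rank."""
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def chance_level(positives: int, negatives: int) -> float:
    """Expected precision of a random ranking, i.e. the positive prevalence."""
    return positives / (positives + negatives)


# Assumed reading of "class skew of 1:10": 1 positive per 10 negatives.
baseline = chance_level(1, 10)
print(f"chance-level AUC-PR is about {baseline:.3f}")

# Toy ranking (hypothetical data): positives land at ranks 1 and 4.
toy_ap = average_precision([1, 0, 0, 1], [0.9, 0.8, 0.7, 0.6])
print(f"toy average precision = {toy_ap:.2f}")
```

Under that reading, the chance-level value is 1/11, so an AUC-PR of 0.65 sits far above a random ranking.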
|
13
|
Yuan H, Kshirsagar M, Zamparo L, Lu Y, Leslie CS. BindSpace decodes transcription factor binding signals by large-scale sequence embedding. Nat Methods 2019; 16:858-861. [PMID: 31406384 PMCID: PMC6717532 DOI: 10.1038/s41592-019-0511-y] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 01/13/2019] [Accepted: 07/10/2019] [Indexed: 01/04/2023]
Abstract
Decoding transcription factor (TF) binding signals in genomic DNA is a fundamental problem. Here we present a prediction model called BindSpace that learns to embed DNA sequences and TF class/family labels into the same space. By training on binding data for hundreds of TFs and embedding over 1M DNA sequences, BindSpace achieves state-of-the-art multiclass binding prediction performance, in vitro and in vivo, and can distinguish signals of closely related TFs.
Affiliation(s)
- Han Yuan
- Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
- Meghana Kshirsagar
- Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Lee Zamparo
- Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Yuheng Lu
- Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Christina S Leslie
- Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
|
14
|
Kshirsagar M, Murugesan K, Carbonell JG, Klein-Seetharaman J. Multitask Matrix Completion for Learning Protein Interactions Across Diseases. J Comput Biol 2017; 24:501-514. [PMID: 28128642 DOI: 10.1089/cmb.2016.0201] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/29/2022] Open
Abstract
Disease-causing pathogens such as viruses introduce their proteins into the host cells in which they interact with the host's proteins, enabling the virus to replicate inside the host. These interactions between pathogen and host proteins are key to understanding infectious diseases. Often multiple diseases involve phylogenetically related or biologically similar pathogens. Here we present a multitask learning method to jointly model interactions between human proteins and three different but related viruses: Hepatitis C, Ebola virus, and Influenza A. Our multitask matrix completion-based model uses a shared low-rank structure in addition to a task-specific sparse structure to incorporate the various interactions. We obtain between 7 and 39 percentage points improvement in predictive performance over prior state-of-the-art models. We show how our model's parameters can be interpreted to reveal both general and specific interaction-relevant characteristics of the viruses. Our code is available online.
Affiliation(s)
- Keerthiram Murugesan
- Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania
- Jaime G Carbonell
- Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania
- Judith Klein-Seetharaman
- Metabolic & Vascular Health, Warwick Medical School, University of Warwick, Coventry, United Kingdom
|
15
|
Kshirsagar M, Schleker S, Carbonell J, Klein-Seetharaman J. Techniques for transferring host-pathogen protein interactions knowledge to new tasks. Front Microbiol 2015; 6:36. [PMID: 25699028 PMCID: PMC4313693 DOI: 10.3389/fmicb.2015.00036] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 09/09/2014] [Accepted: 01/12/2015] [Indexed: 11/17/2022] Open
Abstract
We consider the problem of building a model to predict protein-protein interactions (PPIs) between the bacterial species Salmonella Typhimurium and the plant host Arabidopsis thaliana, a host-pathogen pair for which no known PPIs are available. To achieve this, we present approaches that use homology and statistical learning methods called “transfer learning.” In the transfer learning setting, the task of predicting PPIs between Arabidopsis and its pathogen S. Typhimurium is called the “target task.” The presented approaches utilize labeled data, i.e., known PPIs of other host-pathogen pairs (we call these PPIs the “source tasks”). The homology-based approaches use heuristics based on biological intuition to predict PPIs. The transfer learning methods use the similarity of the PPIs from the source tasks to the target task to build a model. For a quantitative evaluation we consider Salmonella-mouse PPI prediction and some other host-pathogen tasks where known PPIs exist. We use metrics such as precision and recall, and our results show that our methods perform well on the target task in various transfer settings. We present a brief qualitative analysis of the Arabidopsis-Salmonella predicted interactions. We filter the predictions from all approaches using Gene Ontology term enrichment and retain only those interactions involving Salmonella effectors. We thereby observe that Arabidopsis proteins involved in, e.g., transcriptional regulation, hormone-mediated signaling, and defense response may be affected by Salmonella.
Affiliation(s)
- Meghana Kshirsagar
- School of Computer Science, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Sylvia Schleker
- Metabolic and Vascular Health, Warwick Medical School, University of Warwick, Coventry, UK; Molecular Phytomedicine, Institute of Crop Science and Resource Conservation, University of Bonn, Bonn, Germany
- Jaime Carbonell
- School of Computer Science, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
|
16
|
Schleker S, Kshirsagar M, Klein-Seetharaman J. Comparing human-Salmonella with plant-Salmonella protein-protein interaction predictions. Front Microbiol 2015; 6:45. [PMID: 25674082 PMCID: PMC4309195 DOI: 10.3389/fmicb.2015.00045] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 05/05/2014] [Accepted: 01/13/2015] [Indexed: 11/13/2022] Open
Abstract
Salmonellosis is the most frequent foodborne disease worldwide and can be transmitted to humans by a variety of routes, especially via animal and plant products. Salmonella bacteria are believed to use not only animal and human but also plant hosts despite their evolutionary distance. This raises the question of whether Salmonella employs similar mechanisms in infecting these diverse hosts. Given that most of our understanding comes from its interaction with human hosts, we investigate here to what degree knowledge of Salmonella-human interactions can be transferred to the Salmonella-plant system. We review recent publications on the analysis and prediction of Salmonella-host interactomes. Putative protein-protein interactions (PPIs) between Salmonella and its human and Arabidopsis hosts were retrieved using purely interolog-based approaches, in which predictions were inferred from available sequence and domain information of known PPIs, and machine learning approaches that integrate a larger set of useful information from different sources. Transfer learning is an especially suitable machine learning technique to predict plant host targets from the knowledge of human host targets. A comparison of the prediction results with transcriptomic data shows a clear overlap between the host proteins predicted to be targeted by PPIs and their gene ontology enrichment in both host species and regulation of gene expression. In particular, the cellular processes Salmonella interferes with in plants and humans are catabolic processes. The details of how these processes are targeted, however, are quite different between the two organisms, as expected based on their evolutionary and habitat differences. Possible implications of this observation for the evolution of host-pathogen communication are discussed.
Affiliation(s)
- Sylvia Schleker
- Klein-Seetharaman Laboratory, Division of Metabolic and Vascular Health, Warwick Medical School, University of Warwick, Coventry, UK; Department of Molecular Phytomedicine, Institute of Crop Science and Resource Conservation, University of Bonn, Bonn, Germany
- Meghana Kshirsagar
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Judith Klein-Seetharaman
- Klein-Seetharaman Laboratory, Division of Metabolic and Vascular Health, Warwick Medical School, University of Warwick, Coventry, UK
|
17
|
Abstract
Motivation: An important aspect of infectious disease research involves understanding the differences and commonalities in the infection mechanisms underlying various diseases. Systems biology-based approaches study infectious diseases by analyzing the interactions between the host species and the pathogen organisms. This work aims to combine the knowledge from experimental studies of host–pathogen interactions in several diseases to build stronger predictive models. Our approach is based on a formalism from machine learning called ‘multitask learning’, which considers the problem of building models across tasks that are related to each other. A ‘task’ in our scenario is the set of host–pathogen protein interactions involved in one disease. To integrate interactions from several tasks (i.e. diseases), our method exploits the similarity in the infection process across the diseases. In particular, we use the biological hypothesis that similar pathogens target the same critical biological processes in the host, in defining a common structure across the tasks. Results: Our current work on host–pathogen protein interaction prediction focuses on human as the host, and four bacterial species as pathogens. The multitask learning technique we develop uses a task-based regularization approach. We find that the resulting optimization problem is a difference of convex (DC) functions. To optimize, we implement a Convex–Concave procedure-based algorithm. We compare our integrative approach to baseline methods that build models on a single host–pathogen protein interaction dataset. Our results show that our approach outperforms the baselines on the training data. We further analyze the protein interaction predictions generated by the models, and find some interesting insights. 
Availability: The predictions and code are available at: http://www.cs.cmu.edu/~mkshirsa/ismb2013_paper320.html Contact: j.klein-seetharaman@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
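The abstract above notes that the optimization problem is a difference of convex (DC) functions and is solved with a Convex-Concave procedure. As a rough illustration of how such a procedure iterates, the sketch below applies it to a one-dimensional toy objective. The objective f(x) = x^4 - 2x^2 is an assumption chosen for illustration and has nothing to do with the paper's actual regularized loss; all function names are hypothetical.

```python
import math


def objective(x: float) -> float:
    """Toy DC objective: f(x) = u(x) - v(x) with u(x) = x**4 and v(x) = 2*x**2."""
    return x ** 4 - 2 * x ** 2


def cccp_step(x_t: float) -> float:
    """One convex-concave step.

    The concave part -v(x) is replaced by its tangent at x_t, leaving the
    convex surrogate x**4 - 4*x_t*x (up to a constant). Setting its
    derivative to zero gives 4*x**3 = 4*x_t, so the minimizer is the real
    cube root of x_t.
    """
    return math.copysign(abs(x_t) ** (1.0 / 3.0), x_t)


def cccp(x0: float, iterations: int = 50) -> float:
    """Run a fixed number of convex-concave steps from the starting point x0."""
    x = x0
    for _ in range(iterations):
        x = cccp_step(x)
    return x


x_final = cccp(0.5)
print(f"x = {x_final:.6f}, f(x) = {objective(x_final):.6f}")
```

Each step minimizes a convex upper bound of the objective, so the objective value never increases from one iterate to the next. In this toy case, the iterates started at 0.5 approach the local minimizer at x = 1.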
Affiliation(s)
- Meghana Kshirsagar
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, PA 15213, USA
|
18
|
Udupa A, Nahar P, Shah S, Kshirsagar M, Ghongane B. A comparative study of effects of omega-3 Fatty acids, alpha lipoic Acid and vitamin e in type 2 diabetes mellitus. Ann Med Health Sci Res 2013; 3:442-6. [PMID: 24116330 PMCID: PMC3793456 DOI: 10.4103/2141-9248.117954] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Diabetes mellitus is a metabolic disorder characterized by abnormal lipid and glucose metabolism. Various modes of adjuvant therapy have been advocated to ameliorate insulin resistance. AIM This study was intended to assess the effects of the antioxidants alpha lipoic acid (ALA), omega 3 fatty acid, and vitamin E on parameters of insulin sensitivity (blood glucose and HbA1c) in patients with type 2 diabetes mellitus and documented insulin resistance. SUBJECTS AND METHODS This was a prospective, randomized, double-blind, placebo-controlled, single-center study. A total of 104 patients with type 2 diabetes mellitus and insulin resistance were recruited. They were given ALA, omega 3 fatty acid, vitamin E, or placebo. Fasting blood glucose and HbA1c were measured at the first visit (V1) and after 90 days (V2). Statistical analysis was carried out by paired t-test using SPSS software version 11 (SPSS, Chicago, USA). RESULTS Analysis of baseline (V1) vs. end-of-treatment (V2) parameters showed a significant decrease in HbA1c in the three treatment groups. We also observed a decrease in fasting blood glucose in the three treatment groups, but it was not statistically significant (Gr. I = 0.51, Gr. II = 0.05, Gr. III = 0.22, Gr. IV = 0.88). CONCLUSION ALA, omega 3 fatty acid, and vitamin E can be used as add-on therapy in patients with type 2 diabetes mellitus to improve insulin sensitivity and lipid metabolism.
Affiliation(s)
- A Udupa
- Department of Pharmacology, B J Medical College, Pune, Maharashtra, India
|
19
|
Abstract
Motivation: Approaches that use supervised machine learning techniques for protein–protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host–pathogen PPI datasets have a large fraction of missing values, in the range of 58–85%, which makes it challenging to apply machine learning algorithms. Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross-species information in combination with machine learning techniques like Group lasso with ℓ1/ℓ2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella–human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 in F1 score over the next best technique. We also apply our method to Yersinia–human PPI prediction successfully, demonstrating the generality of our approach. Availability: Predicted interactions, datasets, and features are available at: http://www.cs.cmu.edu/~mkshirsa/eccb2012_paper46.html. Contact: judithks@cs.cmu.edu Supplementary Information: Supplementary data are available at Bioinformatics online.
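The abstract above reports precision and recall for the Salmonella–human task but not the F1 score itself. As a small worked example, the sketch below computes F1 as the harmonic mean of the two reported values. The function name is hypothetical, and the result is derived arithmetic rather than a figure quoted from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision (77.6%) and recall (84%) as reported in the abstract.
f1 = f1_score(0.776, 0.84)
print(f"F1 = {f1:.3f}")
```

With these inputs the harmonic mean comes to about 0.807.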
Affiliation(s)
- Meghana Kshirsagar
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
20
|
Zhao Z, Xia J, Tastan O, Singh I, Kshirsagar M, Carbonell J, Klein-Seetharaman J. Virus interactions with human signal transduction pathways. Int J Comput Biol Drug Des 2011; 4:83-105. [PMID: 21330695 DOI: 10.1504/ijcbdd.2011.038658] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 01/09/2023]
Abstract
Viruses depend on their hosts at every stage of their life cycles and must therefore communicate with them via Protein-Protein Interactions (PPIs). To investigate the mechanisms of communication by different viruses, we overlay reported pairwise human-virus PPIs on human signalling pathways. Of 671 pathways obtained from NCI and Reactome databases, 355 are potentially targeted by at least one virus. The majority of pathways are linked to more than one virus. We find evidence supporting the hypothesis that viruses often interact with different proteins depending on the targeted pathway. Pathway analysis indicates overrepresentation of some pathways targeted by viruses. The merged network of the most statistically significant pathways shows several centrally located proteins, which are also hub proteins. Generally, hub proteins are targeted more frequently by viruses. Numerous proteins in virus-targeted pathways are known drug targets, suggesting that these might be exploited as potential new approaches to treatments against multiple viruses.
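The abstract above does not say which statistical test was used to assess pathway overrepresentation. One common choice for this kind of analysis is a one-sided hypergeometric test, sketched below as an assumption rather than as the authors' method. The function names are hypothetical, and the only figures taken from the abstract are the 355 targeted pathways out of 671.

```python
from math import comb


def fraction_targeted(targeted: int, total: int) -> float:
    """Share of pathways targeted by at least one virus."""
    return targeted / total


def hypergeom_upper_tail(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) when drawing `draws` items without replacement.

    population: all proteins considered
    successes:  proteins in the population that are virus targets
    draws:      proteins belonging to the pathway under test
    observed:   virus targets found in that pathway
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# From the abstract: 355 of 671 pathways are potentially targeted.
print(f"targeted fraction = {fraction_targeted(355, 671):.3f}")

# Hypothetical toy pathway: 10 proteins overall, 5 virus targets,
# a 5-protein pathway in which all 5 proteins are targets.
print(f"p-value = {hypergeom_upper_tail(10, 5, 5, 5):.5f}")
```

A small p-value for a pathway would suggest that it contains more virus-targeted proteins than expected by chance. In practice, such p-values are usually corrected for multiple testing across pathways.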
Affiliation(s)
- Zhongming Zhao
- Departments of Biomedical Informatics, Psychiatry, and Cancer Biology, Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA.
|