1
|
Deschênes T, Tohoundjona FWE, Plante PL, Di Marzo V, Raymond F. Gene-based microbiome representation enhances host phenotype classification. mSystems 2023; 8:e0053123. [PMID: 37404032 PMCID: PMC10469787 DOI: 10.1128/msystems.00531-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2023] [Accepted: 05/24/2023] [Indexed: 07/06/2023] Open
Abstract
With the concomitant advances in both the microbiome and machine learning fields, the gut microbiome has become of great interest for the potential discovery of biomarkers to be used in the classification of the host health status. Shotgun metagenomics data derived from the human microbiome is composed of a high-dimensional set of microbial features. The use of such complex data for the modeling of host-microbiome interactions remains a challenge as retaining de novo content yields a highly granular set of microbial features. In this study, we compared the prediction performances of machine learning approaches according to different types of data representations derived from shotgun metagenomics. These representations include commonly used taxonomic and functional profiles and the more granular gene cluster approach. For the five case-control datasets used in this study (Type 2 diabetes, obesity, liver cirrhosis, colorectal cancer, and inflammatory bowel disease), gene-based approaches, whether used alone or in combination with reference-based data types, allowed improved or similar classification performances as the taxonomic and functional profiles. In addition, we show that using subsets of gene families from specific functional categories of genes highlight the importance of these functions on the host phenotype. This study demonstrates that both reference-free microbiome representations and curated metagenomic annotations can provide relevant representations for machine learning based on metagenomic data. IMPORTANCE Data representation is an essential part of machine learning performance when using metagenomic data. In this work, we show that different microbiome representations provide varied host phenotype classification performance depending on the dataset. In classification tasks, untargeted microbiome gene content can provide similar or improved classification compared to taxonomical profiling. Feature selection based on biological function also improves classification performance for some pathologies. Function-based feature selection combined with interpretable machine learning algorithms can generate new hypotheses that can potentially be assayed mechanistically. This work thus proposes new approaches to represent microbiome data for machine learning that can potentiate the findings associated with metagenomic data.
Collapse
Affiliation(s)
- Thomas Deschênes
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- Institut Intelligence et Données, Université Laval, Québec, Canada
| | - Fred Wilfried Elom Tohoundjona
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
| | - Pier-Luc Plante
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- Institut Intelligence et Données, Université Laval, Québec, Canada
| | - Vincenzo Di Marzo
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- École de nutrition, Faculté des sciences de l’agriculture et de l’alimentation (FSAA), Université Laval, Québec, Canada
- Centre de recherche de l’Institut universitaire de cardiologie et de pneumologie de Québec (IUCPQ), Québec, Canada
- Département de médecine, Faculté de Médecine, Université Laval, Québec, Canada
- Joint International Unit on Chemical and Biomolecular Research on the Microbiome and its Impact on Metabolic Health and Nutrition (UMI-MicroMeNu), Quebec City, Canada
| | - Frédéric Raymond
- Centre Nutrition, Santé et Société (NUTRISS) – Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Université Laval, Québec, Canada
- Canada Research Excellence Chair on the Microbiome-Endocannabinoidome Axis in Metabolic Health (CERC-MEND), Quebec City, Quebec, Canada
- Institut Intelligence et Données, Université Laval, Québec, Canada
- École de nutrition, Faculté des sciences de l’agriculture et de l’alimentation (FSAA), Université Laval, Québec, Canada
| |
Collapse
|
2
|
Ma QL, Huang FM, Guo W, Feng KY, Huang T, Cai YD. Machine Learning Classification of Time since BNT162b2 COVID-19 Vaccination Based on Array-Measured Antibody Activity. Life (Basel) 2023; 13:1304. [PMID: 37374086 DOI: 10.3390/life13061304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Revised: 05/26/2023] [Accepted: 05/29/2023] [Indexed: 06/29/2023] Open
Abstract
Vaccines trigger an immunological response that includes B and T cells, with B cells producing antibodies. SARS-CoV-2 immunity weakens over time after vaccination. Discovering key changes in antigen-reactive antibodies over time after vaccination could help improve vaccine efficiency. In this study, we collected data on blood antibody levels in a cohort of healthcare workers vaccinated for COVID-19 and obtained 73 antigens in samples from four groups according to the duration after vaccination, including 104 unvaccinated healthcare workers, 534 healthcare workers within 60 days after vaccination, 594 healthcare workers between 60 and 180 days after vaccination, and 141 healthcare workers over 180 days after vaccination. Our work was a reanalysis of the data originally collected at Irvine University. This data was obtained in Orange County, California, USA, with the collection process commencing in December 2020. British variant (B.1.1.7), South African variant (B.1.351), and Brazilian/Japanese variant (P.1) were the most prevalent strains during the sampling period. An efficient machine learning based framework containing four feature selection methods (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, and maximum relevance minimum redundancy) and four classification algorithms (decision tree, k-nearest neighbor, random forest, and support vector machine) was designed to select essential antibodies against specific antigens. Several efficient classifiers with a weighted F1 value around 0.75 were constructed. The antigen microarray used for identifying antibody levels in the coronavirus features ten distinct SARS-CoV-2 antigens, comprising various segments of both nucleocapsid protein (NP) and spike protein (S). This study revealed that S1 + S2, S1.mFcTag, S1.HisTag, S1, S2, Spike.RBD.His.Bac, Spike.RBD.rFc, and S1.RBD.mFc were most highly ranked among all features, where S1 and S2 are the subunits of Spike, and the suffixes represent the tagging information of different recombinant proteins. Meanwhile, the classification rules were obtained from the optimal decision tree to explain quantitatively the roles of antigens in the classification. This study identified antibodies associated with decreased clinical immunity based on populations with different time spans after vaccination. These antibodies have important implications for maintaining long-term immunity to SARS-CoV-2.
Collapse
Affiliation(s)
- Qing-Lan Ma
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Fei-Ming Huang
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200030, China
| | - Kai-Yan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| |
Collapse
|
3
|
Identification of Smoking-Associated Transcriptome Aberration in Blood with Machine Learning Methods. BIOMED RESEARCH INTERNATIONAL 2023; 2023:5333361. [PMID: 36644165 PMCID: PMC9833906 DOI: 10.1155/2023/5333361] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/03/2022] [Revised: 12/15/2022] [Accepted: 12/15/2022] [Indexed: 01/06/2023]
Abstract
Long-term cigarette smoking causes various human diseases, including respiratory disease, cancer, and gastrointestinal (GI) disorders. Alterations in gene expression and variable splicing processes induced by smoking are associated with the development of diseases. This study applied advanced machine learning methods to identify the isoforms with important roles in distinguishing smokers from former smokers based on the expression profile of isoforms from current and former smokers collected in one previous study. These isoforms were deemed as features, which were first analyzed by the Boruta to select features highly correlated with the target variables. Then, the selected features were evaluated by four feature ranking algorithms, resulting in four feature lists. The incremental feature selection method was applied to each list for obtaining the optimal feature subsets and building high-performance classification models. Furthermore, a series of classification rules were accessed by decision tree with the highest performance. Eventually, the rationality of the mined isoforms (features) and classification rules was verified by reviewing previous research. Features such as isoforms ENST00000464835 (expressed by LRRN3), ENST00000622663 (expressed by SASH1), and ENST00000284311 (expressed by GPR15), and pathways (cytotoxicity mediated by natural killer cell and cytokine-cytokine receptor interaction) revealed by the enrichment analysis, were highly relevant to smoking response, suggesting the robustness of our analysis pipeline.
Collapse
|
4
|
Ren J, Guo W, Feng K, Huang T, Cai Y. Identifying MicroRNA Markers That Predict COVID-19 Severity Using Machine Learning Methods. Life (Basel) 2022; 12:1964. [PMID: 36556329 PMCID: PMC9784129 DOI: 10.3390/life12121964] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2022] [Revised: 11/21/2022] [Accepted: 11/21/2022] [Indexed: 11/25/2022] Open
Abstract
Individuals with the SARS-CoV-2 infection may experience a wide range of symptoms, from being asymptomatic to having a mild fever and cough to a severe respiratory impairment that results in death. MicroRNA (miRNA), which plays a role in the antiviral effects of SARS-CoV-2 infection, has the potential to be used as a novel marker to distinguish between patients who have various COVID-19 clinical severities. In the current study, the existing blood expression profiles reported in two previous studies were combined for deep analyses. The final profiles contained 1444 miRNAs in 375 patients from six categories, which were as follows: 30 patients with mild COVID-19 symptoms, 81 patients with moderate COVID-19 symptoms, 30 non-COVID-19 patients with mild symptoms, 137 patients with severe COVID-19 symptoms, 31 non-COVID-19 patients with severe symptoms, and 66 healthy controls. An efficient computational framework containing four feature selection methods (LASSO, LightGBM, MCFS, and mRMR) and four classification algorithms (DT, KNN, RF, and SVM) was designed to screen clinical miRNA markers, and a high-precision RF model with a 0.780 weighted F1 was constructed. Some miRNAs, including miR-24-3p, whose differential expression was discovered in patients with acute lung injury complications brought on by severe COVID-19, and miR-148a-3p, differentially expressed against SARS-CoV-2 structural proteins, were identified, thereby suggesting the effectiveness and accuracy of our framework. Meanwhile, we extracted classification rules based on the DT model for the quantitative representation of the role of miRNA expression in differentiating COVID-19 patients with different severities. The search for novel biomarkers that could predict the severity of the disease could aid in the clinical diagnosis of COVID-19 and in exploring the specific mechanisms of the complications caused by SARS-CoV-2 infection. Moreover, new therapeutic targets for the disease may be found.
Collapse
Affiliation(s)
- Jingxin Ren
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200030, China
| | - Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yudong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| |
Collapse
|
5
|
Zhou M, Bian K, Hu F, Lai W. A New Strategy for Identification of Coal Miners With Abnormal Physical Signs Based on EN-mRMR. Front Bioeng Biotechnol 2022; 10:935481. [PMID: 35898648 PMCID: PMC9310099 DOI: 10.3389/fbioe.2022.935481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2022] [Accepted: 06/06/2022] [Indexed: 11/21/2022] Open
Abstract
Coal miners’ occupational health is a key part of production safety in the coal mine. Accurate identification of abnormal physical signs is the key to preventing occupational diseases and improving miners’ working environment. There are many problems when evaluating the physical health status of miners manually, such as too many sign parameters, low diagnostic efficiency, missed diagnosis, and misdiagnosis. To solve these problems, the machine learning algorithm is used to identify miners with abnormal signs. We proposed a feature screening strategy of integrating elastic net (EN) and Max-Relevance and Min-Redundancy (mRMR) to establish the model to identify abnormal signs and obtain the key physical signs. First, the raw 21 physical signs were expanded to 25 by feature construction technology. Then, the EN was used to delete redundant physical signs. Finally, the mRMR combined with the support vector classification of intelligent optimization algorithm by Gravitational Search Algorithm (GSA-SVC) is applied to further simplify the rest of 12 relatively important physical signs and obtain the optimal model with data of six physical signs. At this time, the accuracy, precision, recall, specificity, G-mean, and MCC of the test set were 97.50%, 97.78%, 97.78%, 97.14%, 0.98, and 0.95. The experimental results show that the proposed strategy improves the model performance with the smallest features and realizes the accurate identification of abnormal coal miners. The conclusion could provide reference evidence for intelligent classification and assessment of occupational health in the early stage.
Collapse
Affiliation(s)
- Mengran Zhou
- School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China
| | - Kai Bian
- School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China
| | - Feng Hu
- School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China
| | - Wenhao Lai
- School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China
| |
Collapse
|
6
|
Li Z, Huang F, Chen L, Huang T, Cai YD. Identifying In Vitro Cultured Human Hepatocytes Markers with Machine Learning Methods Based on Single-Cell RNA-Seq Data. Front Bioeng Biotechnol 2022; 10:916309. [PMID: 35706505 PMCID: PMC9189284 DOI: 10.3389/fbioe.2022.916309] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2022] [Accepted: 05/11/2022] [Indexed: 01/12/2023] Open
Abstract
Cell transplantation is an effective method for compensating for the loss of liver function and improve patient survival. However, given that hepatocytes cultivated in vitro have diverse developmental processes and physiological features, obtaining hepatocytes that can properly function in vivo is difficult. In the present study, we present an advanced computational analysis on single-cell transcriptional profiling to resolve the heterogeneity of the hepatocyte differentiation process in vitro and to mine biomarkers at different periods of differentiation. We obtained a batch of compressed and effective classification features with the Boruta method and ranked them using the Max-Relevance and Min-Redundancy method. Some key genes were identified during the in vitro culture of hepatocytes, including CD147, which not only regulates terminally differentiated cells in the liver but also affects cell differentiation. PPIA, which encodes a CD147 ligand, also appeared in the identified gene list, and the combination of the two proteins mediated multiple biological pathways. Other genes, such as TMSB10, TMEM176B, and CD63, which are involved in the maturation and differentiation of hepatocytes and assist different hepatic cell types in performing their roles were also identified. Then, several classifiers were trained and evaluated to obtain optimal classifiers and optimal feature subsets, using three classification algorithms (random forest, k-nearest neighbor, and decision tree) and the incremental feature selection method. The best random forest classifier with a 0.940 Matthews correlation coefficient was constructed to distinguish different hepatic cell types. Finally, classification rules were created for quantitatively describing hepatic cell types. In summary, This study provided potential targets for cell transplantation associated liver disease treatment strategies by elucidating the process and mechanism of hepatocyte development at both qualitative and quantitative levels.
Collapse
Affiliation(s)
- ZhanDong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - FeiMing Huang
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
Collapse
|
7
|
Huang F, Chen L, Guo W, Zhou X, Feng K, Huang T, Cai Y. Identifying COVID-19 Severity-Related SARS-CoV-2 Mutation Using a Machine Learning Method. Life (Basel) 2022; 12:806. [PMID: 35743837 PMCID: PMC9225528 DOI: 10.3390/life12060806] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2022] [Revised: 05/22/2022] [Accepted: 05/25/2022] [Indexed: 12/22/2022] Open
Abstract
SARS-CoV-2 shows great evolutionary capacity through a high frequency of genomic variation during transmission. Evolved SARS-CoV-2 often demonstrates resistance to previous vaccines and can cause poor clinical status in patients. Mutations in the SARS-CoV-2 genome involve mutations in structural and nonstructural proteins, and some of these proteins such as spike proteins have been shown to be directly associated with the clinical status of patients with severe COVID-19 pneumonia. In this study, we collected genome-wide mutation information of virulent strains and the severity of COVID-19 pneumonia in patients varying depending on their clinical status. Important protein mutations and untranslated region mutations were extracted using machine learning methods. First, through Boruta and four ranking algorithms (least absolute shrinkage and selection operator, light gradient boosting machine, max-relevance and min-redundancy, and Monte Carlo feature selection), mutations that were highly correlated with the clinical status of the patients were screened out and sorted in four feature lists. Some mutations such as D614G and V1176F were shown to be associated with viral infectivity. Moreover, previously unreported mutations such as A320V of nsp14 and I164ILV of nsp14 were also identified, which suggests their potential roles. We then applied the incremental feature selection method to each feature list to construct efficient classifiers, which can be directly used to distinguish the clinical status of COVID-19 patients. Meanwhile, four sets of quantitative rules were set up, which can help us to more intuitively understand the role of each mutation in differentiating the clinical status of COVID-19 patients. Identified key mutations linked to virologic properties will help better understand the mechanisms of infection and will aid in the development of antiviral treatments.
Collapse
Affiliation(s)
- Feiming Huang
- School of Life Sciences, Shanghai University, Shanghai 200444, China;
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China;
| | - Wei Guo
- Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) and Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200025, China;
| | - Xianchao Zhou
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine (SJTUSM), Shanghai 200025, China;
| | - Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510060, China;
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yudong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China;
| |
Collapse
|