1
Russel WA, Perry J, Bonzani C, Dontino A, Mekonnen Z, Ay A, Taye B. Feature selection and association rule learning identify risk factors of malnutrition among Ethiopian schoolchildren. FRONTIERS IN EPIDEMIOLOGY 2023; 3:1150619. [PMID: 38455884 PMCID: PMC10910994 DOI: 10.3389/fepid.2023.1150619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2023] [Accepted: 06/20/2023] [Indexed: 03/09/2024]
Abstract
Introduction: Previous studies have sought to identify risk factors for malnutrition in populations of schoolchildren, relying on traditional logistic regression methods. However, holistic machine learning (ML) approaches are emerging that may provide a more comprehensive analysis of risk factors. Methods: This study employed feature selection and association rule learning ML methods in conjunction with logistic regression on epidemiological survey data from 1,036 Ethiopian schoolchildren. Our first analysis used the entire dataset, and we then reran this analysis on age, residence, and sex population subsets. Results: Both logistic regression and ML methods identified older childhood age as a significant risk factor, while females and vaccinated individuals showed reduced odds of stunting. Our machine learning analyses provided additional insights into the data, as feature selection identified age, school latrine cleanliness, large family size, and nail trimming habits as significant risk factors for stunting, underweight, and thinness. Association rule learning revealed an association between co-occurring hygiene and socioeconomic variables and malnutrition that was otherwise missed using traditional statistical methods. Discussion: Our analysis supports the benefit of integrating feature selection methods, association rule learning techniques, and logistic regression to identify comprehensive risk factors associated with malnutrition in young children.
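To make the association rule step concrete, the following is a minimal sketch of the three standard rule metrics (support, confidence, lift) in plain Python. The records and item names (`untrimmed_nails`, `large_family`, `stunted`) are hypothetical toy values chosen for illustration; they are not data or variable names from the study, and the abstract does not state which software or thresholds the authors used.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each record is a set of items observed for one child (hypothetical data).
    """
    n = len(records)
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_consequent = sum(1 for r in records if consequent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)

    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    # Lift > 1 means the consequent is more frequent when the antecedent holds.
    lift = confidence / (n_consequent / n) if n_consequent else 0.0
    return support, confidence, lift


# Hypothetical toy survey records, one set of observed items per child.
records = [
    {"untrimmed_nails", "large_family", "stunted"},
    {"untrimmed_nails", "large_family", "stunted"},
    {"untrimmed_nails", "stunted"},
    {"large_family"},
    {"untrimmed_nails", "large_family"},
    set(),
    {"stunted"},
    set(),
]

support, confidence, lift = rule_metrics(
    records, {"untrimmed_nails", "large_family"}, {"stunted"}
)
print(support, confidence, lift)
```

A rule built from two co-occurring variables can show a lift above 1 even when neither variable stands out on its own, which is the kind of pattern the abstract says was missed by the regression analysis.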
Affiliation(s)
- William A. Russel: Department of Biology, Colgate University, Hamilton, NY, United States
- Jim Perry: Department of Computer Science, Colgate University, Hamilton, NY, United States
- Claire Bonzani: Department of Mathematics, Colgate University, Hamilton, NY, United States
- Amanda Dontino: Department of Biology, Colgate University, Hamilton, NY, United States
- Zeleke Mekonnen: Institute of Health, School of Medical Laboratory Sciences, Jimma University, Jimma, Ethiopia
- Ahmet Ay: Department of Biology, Colgate University, Hamilton, NY, United States; Department of Mathematics, Colgate University, Hamilton, NY, United States
- Bineyam Taye: Department of Biology, Colgate University, Hamilton, NY, United States
2
Ma QL, Huang FM, Guo W, Feng KY, Huang T, Cai YD. Machine Learning Classification of Time since BNT162b2 COVID-19 Vaccination Based on Array-Measured Antibody Activity. Life (Basel) 2023; 13:1304. [PMID: 37374086 DOI: 10.3390/life13061304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 05/26/2023] [Accepted: 05/29/2023] [Indexed: 06/29/2023]
Abstract
Vaccines trigger an immunological response that includes B and T cells, with B cells producing antibodies. SARS-CoV-2 immunity weakens over time after vaccination. Discovering key changes in antigen-reactive antibodies over time after vaccination could help improve vaccine efficiency. In this study, we collected data on blood antibody levels in a cohort of healthcare workers vaccinated for COVID-19 and obtained 73 antigens in samples from four groups according to the duration after vaccination, including 104 unvaccinated healthcare workers, 534 healthcare workers within 60 days after vaccination, 594 healthcare workers between 60 and 180 days after vaccination, and 141 healthcare workers over 180 days after vaccination. Our work was a reanalysis of data originally collected at the University of California, Irvine. These data were obtained in Orange County, California, USA, with the collection process commencing in December 2020. The British variant (B.1.1.7), South African variant (B.1.351), and Brazilian/Japanese variant (P.1) were the most prevalent strains during the sampling period. An efficient machine learning-based framework containing four feature selection methods (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, and maximum relevance minimum redundancy) and four classification algorithms (decision tree, k-nearest neighbor, random forest, and support vector machine) was designed to select essential antibodies against specific antigens. Several efficient classifiers with a weighted F1 value around 0.75 were constructed. The antigen microarray used for identifying antibody levels in the coronavirus features ten distinct SARS-CoV-2 antigens, comprising various segments of both nucleocapsid protein (NP) and spike protein (S). This study revealed that S1 + S2, S1.mFcTag, S1.HisTag, S1, S2, Spike.RBD.His.Bac, Spike.RBD.rFc, and S1.RBD.mFc were most highly ranked among all features, where S1 and S2 are the subunits of Spike, and the suffixes represent the tagging information of different recombinant proteins. Meanwhile, the classification rules were obtained from the optimal decision tree to explain quantitatively the roles of antigens in the classification. This study identified antibodies associated with decreased clinical immunity based on populations with different time spans after vaccination. These antibodies have important implications for maintaining long-term immunity to SARS-CoV-2.
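Because the four groups differ widely in size (104 to 594 samples), the weighted F1 reported above averages per-class F1 scores in proportion to class size. The sketch below shows that calculation in plain Python on a hypothetical two-class toy example; the labels `early` and `late` and the predictions are illustrative assumptions, not values from the study.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight by the number of true samples in this class.
        score += f1 * (tp + fn) / total
    return score


# Hypothetical labels: four "early" samples and two "late" samples.
y_true = ["early", "early", "early", "early", "late", "late"]
y_pred = ["early", "early", "early", "late", "late", "early"]
result = weighted_f1(y_true, y_pred)
print(round(result, 3))
```

Here the larger class has an F1 of 0.75 and the smaller class 0.5, so the weighted value sits closer to the larger class's score.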
Affiliation(s)
- Qing-Lan Ma: School of Life Sciences, Shanghai University, Shanghai 200444, China
- Fei-Ming Huang: School of Life Sciences, Shanghai University, Shanghai 200444, China
- Wei Guo: Key Laboratory of Stem Cell Biology, Shanghai Jiao Tong University School of Medicine (SJTUSM) & Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai 200030, China
- Kai-Yan Feng: Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China
- Tao Huang: Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China; CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Yu-Dong Cai: School of Life Sciences, Shanghai University, Shanghai 200444, China
3
Identification of Smoking-Associated Transcriptome Aberration in Blood with Machine Learning Methods. BIOMED RESEARCH INTERNATIONAL 2023; 2023:5333361. [PMID: 36644165 PMCID: PMC9833906 DOI: 10.1155/2023/5333361] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 09/03/2022] [Revised: 12/15/2022] [Accepted: 12/15/2022] [Indexed: 01/06/2023]
Abstract
Long-term cigarette smoking causes various human diseases, including respiratory disease, cancer, and gastrointestinal (GI) disorders. Alterations in gene expression and variable splicing processes induced by smoking are associated with the development of diseases. This study applied advanced machine learning methods to identify the isoforms with important roles in distinguishing current smokers from former smokers, based on the isoform expression profiles of current and former smokers collected in one previous study. These isoforms were treated as features, which were first analyzed by Boruta to select features highly correlated with the target variables. Then, the selected features were evaluated by four feature ranking algorithms, resulting in four feature lists. The incremental feature selection method was applied to each list to obtain the optimal feature subsets and build high-performance classification models. Furthermore, a series of classification rules was obtained from the decision tree with the highest performance. Finally, the rationality of the mined isoforms (features) and classification rules was verified by reviewing previous research. Features such as isoforms ENST00000464835 (expressed by LRRN3), ENST00000622663 (expressed by SASH1), and ENST00000284311 (expressed by GPR15), and pathways (cytotoxicity mediated by natural killer cell and cytokine-cytokine receptor interaction) revealed by the enrichment analysis, were highly relevant to smoking response, suggesting the robustness of our analysis pipeline.
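The incremental feature selection step described above can be sketched as follows: walk down a ranked feature list, evaluate a classifier on the top-k features for each k, and keep the smallest k with the best score. This is an illustrative sketch only. The nearest-centroid classifier, the leave-one-out evaluation, and the toy matrix are assumptions swapped in to keep the example self-contained; the abstract does not specify these details.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return min(sorted(centroids), key=lambda lab: sq_dist(centroids[lab]))


def loocv_accuracy(X, y):
    """Leave-one-out accuracy of the nearest-centroid classifier."""
    correct = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        correct += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked):
    """Evaluate top-k feature subsets for k = 1..len(ranked); keep the best."""
    best_k, best_acc, curve = 0, -1.0, []
    for k in range(1, len(ranked) + 1):
        X_k = [[row[j] for j in ranked[:k]] for row in X]
        acc = loocv_accuracy(X_k, y)
        curve.append(acc)
        if acc > best_acc:  # strict '>' keeps the smallest subset on ties
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc, curve


# Hypothetical toy data: feature 0 separates the classes, 1 and 2 are noise.
X = [
    [0.0, 5.0, 3.0], [0.1, 1.0, 9.0], [0.2, 7.0, 2.0],
    [1.0, 6.0, 8.0], [1.1, 2.0, 1.0], [1.2, 4.0, 7.0],
]
y = ["current", "current", "current", "former", "former", "former"]
ranked = [0, 2, 1]  # assumed output of some feature ranking algorithm

selected, best_acc, curve = incremental_feature_selection(X, y, ranked)
print(selected, best_acc)
```

Running the same loop once per ranked list would give one optimal subset per ranking algorithm, matching the four-list workflow the abstract describes.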
4
Shu Y, Guo Y, Zheng Y, He S, Shi Z. RNA methylation in vascular disease: a systematic review. J Cardiothorac Surg 2022; 17:323. [PMID: 36536469 PMCID: PMC9762007 DOI: 10.1186/s13019-022-02077-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 12/10/2022] [Indexed: 12/23/2022]
Abstract
Despite the rise in morbidity and mortality associated with vascular diseases, the underlying pathophysiological molecular mechanisms are still unclear. RNA N6-methyladenosine modification, as the most common cellular mechanism of RNA regulation, participates in a variety of biological functions and plays an important role in epigenetics. A large amount of evidence shows that RNA N6-methyladenosine modifications play a key role in the morbidity caused by vascular diseases. Further research on the relationship between RNA N6-methyladenosine modifications and vascular diseases is necessary to understand disease mechanisms at the gene level and to provide new tools for diagnosis and treatment. In this study, we summarize the currently available data on RNA N6-methyladenosine modifications in vascular diseases, addressing four aspects: the cellular regulatory system of N6-methyladenosine methylation, N6-methyladenosine modifications in risk factors for vascular disease, N6-methyladenosine modifications in vascular diseases, and techniques for the detection of N6-methyladenosine-methylated RNA.
Affiliation(s)
- Yue Shu: Geriatric Multi-Clinic Center, Hainan ChengMei Hospital, Haikou, Hainan, People’s Republic of China; Department of Special Medical Services, Hainan Cancer Hospital, Haikou, Hainan, People’s Republic of China
- Yilong Guo: Medical School of Chinese PLA, Beijing, People’s Republic of China; Department of Vascular and Endovascular Surgery, The First Medical Centre of Chinese PLA General Hospital, Beijing, People’s Republic of China
- Yin Zheng: Geriatric Multi-Clinic Center, Hainan ChengMei Hospital, Haikou, Hainan, People’s Republic of China; Department of Special Medical Services, Hainan Cancer Hospital, Haikou, Hainan, People’s Republic of China
- Shuwu He: Department of Cardiovascular Surgery, The Second Affiliated Hospital of Hainan Medical University, 48th of Bai Shui Tang Road, Haikou, 570311 Hainan, People’s Republic of China
- Zhensu Shi: Department of Cardiovascular Surgery, The Second Affiliated Hospital of Hainan Medical University, 48th of Bai Shui Tang Road, Haikou, 570311 Hainan, People’s Republic of China
5
Identifying MicroRNA Markers That Predict COVID-19 Severity Using Machine Learning Methods. LIFE (BASEL, SWITZERLAND) 2022; 12:life12121964. [PMID: 36556329 PMCID: PMC9784129 DOI: 10.3390/life12121964] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 10/27/2022] [Revised: 11/21/2022] [Accepted: 11/21/2022] [Indexed: 11/25/2022]
Abstract
Individuals with the SARS-CoV-2 infection may experience a wide range of symptoms, from being asymptomatic to having a mild fever and cough to a severe respiratory impairment that results in death. MicroRNA (miRNA), which plays a role in the antiviral effects of SARS-CoV-2 infection, has the potential to be used as a novel marker to distinguish between patients who have various COVID-19 clinical severities. In the current study, the existing blood expression profiles reported in two previous studies were combined for deep analyses. The final profiles contained 1444 miRNAs in 375 patients from six categories, which were as follows: 30 patients with mild COVID-19 symptoms, 81 patients with moderate COVID-19 symptoms, 30 non-COVID-19 patients with mild symptoms, 137 patients with severe COVID-19 symptoms, 31 non-COVID-19 patients with severe symptoms, and 66 healthy controls. An efficient computational framework containing four feature selection methods (LASSO, LightGBM, MCFS, and mRMR) and four classification algorithms (DT, KNN, RF, and SVM) was designed to screen clinical miRNA markers, and a high-precision RF model with a 0.780 weighted F1 was constructed. Some miRNAs, including miR-24-3p, whose differential expression was discovered in patients with acute lung injury complications brought on by severe COVID-19, and miR-148a-3p, differentially expressed against SARS-CoV-2 structural proteins, were identified, thereby suggesting the effectiveness and accuracy of our framework. Meanwhile, we extracted classification rules based on the DT model for the quantitative representation of the role of miRNA expression in differentiating COVID-19 patients with different severities. The search for novel biomarkers that could predict the severity of the disease could aid in the clinical diagnosis of COVID-19 and in exploring the specific mechanisms of the complications caused by SARS-CoV-2 infection. Moreover, new therapeutic targets for the disease may be found.
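Extracting classification rules from a decision tree amounts to reading off every root-to-leaf path as an if-then statement. The sketch below does this for a small hand-written tree stored as nested dictionaries. The feature names (`miRNA_A`, `miRNA_B`), thresholds, and leaf labels are hypothetical placeholders, not rules or cutoffs reported in the study.

```python
def extract_rules(node, conditions=()):
    """Return a list of (conditions, label) pairs, one per leaf of the tree."""
    if "label" in node:  # leaf node: emit the accumulated path
        return [(list(conditions), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(
        node["left"], conditions + (f"{feature} <= {threshold}",)
    )
    right = extract_rules(
        node["right"], conditions + (f"{feature} > {threshold}",)
    )
    return left + right


# Hypothetical tree: expression thresholds are invented for illustration.
tree = {
    "feature": "miRNA_A",
    "threshold": 2.5,
    "left": {"label": "healthy control"},
    "right": {
        "feature": "miRNA_B",
        "threshold": 1.0,
        "left": {"label": "mild COVID-19"},
        "right": {"label": "severe COVID-19"},
    },
}

rules = extract_rules(tree)
for conditions, label in rules:
    print("IF " + " AND ".join(conditions) + " THEN " + label)
```

Each printed rule states the expression ranges that lead to one severity class, which is the sense in which such rules give a quantitative reading of each marker's role.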
6
Jian F, Huang F, Zhang YH, Huang T, Cai YD. Identifying anal and cervical tumorigenesis-associated methylation signaling with machine learning methods. Front Oncol 2022; 12:998032. [PMID: 36249027 PMCID: PMC9557006 DOI: 10.3389/fonc.2022.998032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2022] [Accepted: 09/14/2022] [Indexed: 11/13/2022]
Abstract
Cervical and anal carcinoma are neoplastic diseases with various intraepithelial neoplasia stages. The underlying mechanisms for cancer initiation and progression have not been fully revealed. DNA methylation has been shown to be aberrantly regulated during tumorigenesis in anal and cervical carcinoma, revealing the important roles of DNA methylation signaling as a biomarker to distinguish cancer stages in clinics. In this research, several machine learning methods were used to analyze the methylation profiles on anal and cervical carcinoma samples, which were divided into three classes representing various stages of tumor progression. Advanced feature selection methods, including Boruta, LASSO, LightGBM, and MCFS, were used to select methylation features that are highly correlated with cancer progression. Some methylation probes including cg01550828 and its corresponding gene RNF168 have been reported to be associated with human papilloma virus-related anal cancer. As for biomarkers for cervical carcinoma, cg27012396 and its functional gene HDAC4 were confirmed to regulate the glycolysis and survival of hypoxic tumor cells in cervical carcinoma. Furthermore, we developed effective classifiers for identifying various tumor stages and derived classification rules that reflect the quantitative impact of methylation on tumorigenesis. The current study identified methylation signals associated with the development of cervical and anal carcinoma at qualitative and quantitative levels using advanced machine learning methods.
Affiliation(s)
- Fangfang Jian: Department of Obstetrics & Gynecology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- FeiMing Huang: School of Life Sciences, Shanghai University, Shanghai, China
- Yu-Hang Zhang: Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, United States
- Tao Huang: Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China; CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai: School of Life Sciences, Shanghai University, Shanghai, China
- Correspondence: Tao Huang; Yu-Dong Cai
7
Lu S, Wang H, Zhang J. Identification of uveitis-associated functions based on the feature selection analysis of gene ontology and Kyoto Encyclopedia of Genes and Genomes pathway enrichment scores. Front Mol Neurosci 2022; 15:1007352. [PMID: 36157069 PMCID: PMC9493498 DOI: 10.3389/fnmol.2022.1007352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2022] [Accepted: 08/22/2022] [Indexed: 11/13/2022]
Abstract
Uveitis is a typical type of eye inflammation affecting the middle layer of the eye (i.e., the uvea) and can lead to blindness in middle-aged and young people. Therefore, a comprehensive study determining the disease susceptibility and the underlying mechanisms of uveitis initiation and progression is urgently needed for the development of effective treatments. In the present study, 108 uveitis-related genes are collected on the basis of literature mining, and 17,560 other human genes are collected from the Ensembl database and treated as non-uveitis genes. Uveitis- and non-uveitis-related genes are then encoded by gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment scores based on the genes and their neighbors in STRING, resulting in 20,681 GO term features and 297 KEGG pathway features. Subsequently, we identify functions and biological processes that can distinguish uveitis-related genes from other human genes by using an integrated feature selection method, which incorporates a feature filtering method (Boruta) and four feature importance assessment methods (i.e., LASSO, LightGBM, MCFS, and mRMR). Some essential GO terms and KEGG pathways related to uveitis, such as GO:0001841 (neural tube formation), hsa04612 (antigen processing and presentation in human beings), and GO:0043379 (memory T cell differentiation), are identified. The plausibility of the association of mined functional features with uveitis is verified on the basis of the literature. Overall, several advanced machine learning methods are used in the current study to uncover specific functions of uveitis and provide a theoretical foundation for the clinical treatment of uveitis.
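The abstract does not give the formula for its enrichment scores. One common choice, assumed here purely for illustration, is the negative base-10 logarithm of a hypergeometric tail probability: how surprising is the overlap between a gene's neighborhood and the genes annotated to a GO term or KEGG pathway? The function name and parameter names below are hypothetical.

```python
from math import comb, log10


def enrichment_score(n_genes, n_annotated, n_neighborhood, n_overlap):
    """-log10 of P(overlap >= n_overlap) under a hypergeometric null.

    n_genes        : total genes in the background
    n_annotated    : genes annotated to the GO term or KEGG pathway
    n_neighborhood : size of the gene set (a gene plus its network neighbors)
    n_overlap      : genes in the set that carry the annotation
    """
    upper = min(n_annotated, n_neighborhood)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, n_neighborhood - k)
        for k in range(n_overlap, upper + 1)
    )
    p_value = tail / comb(n_genes, n_neighborhood)
    return -log10(p_value)


# Toy numbers: 10 background genes, 3 annotated, a set of 3, all 3 annotated.
score = enrichment_score(10, 3, 3, 3)
print(round(score, 3))
```

Computing one such score per term for every gene would yield a feature vector with one entry per GO term or KEGG pathway, consistent with the 20,681 and 297 features mentioned above.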
Affiliation(s)
- Shiheng Lu: Department of Ophthalmology, Shanghai Eye Disease Prevention and Treatment Center, Shanghai Eye Hospital, Shanghai, China; Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Key Laboratory of Ocular Fundus Diseases, Shanghai, China; Shanghai Engineering Center for Visual Science and Photomedicine, Shanghai, China; National Clinical Research Center for Eye Diseases, Shanghai, China; Shanghai Engineering Research Center for Precise Diagnosis and Treatment of Eye Diseases, Shanghai, China
- Hui Wang: Department of Orthopedics, Shanghai Yangpu Hospital of Traditional Chinese Medicine, Shanghai, China
- Jian Zhang: Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Key Laboratory of Ocular Fundus Diseases, Shanghai, China; Shanghai Engineering Center for Visual Science and Photomedicine, Shanghai, China; National Clinical Research Center for Eye Diseases, Shanghai, China; Shanghai Engineering Research Center for Precise Diagnosis and Treatment of Eye Diseases, Shanghai, China
- Correspondence: Shiheng Lu; Jian Zhang
8
Abdelwahab O, Awad N, Elserafy M, Badr E. A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma. PLoS One 2022; 17:e0269126. [PMID: 36067196 PMCID: PMC9447897 DOI: 10.1371/journal.pone.0269126] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/14/2021] [Accepted: 05/15/2022] [Indexed: 12/23/2022]
Abstract
Lung cancer (LC) is among the most commonly diagnosed cancers worldwide. There are many types of LC, but lung adenocarcinoma (LUAD) is the most common type. Although RNA-seq and microarray data provide a vast amount of gene expression data, most of the genes are irrelevant to clinical diagnosis. Feature selection (FS) techniques overcome the high dimensionality and sparsity issues of large-scale data. We propose a framework that applies an ensemble of feature selection techniques to identify genes highly correlated with LUAD. Utilizing LUAD RNA-seq data from The Cancer Genome Atlas (TCGA), we employed mutual information (MI) and recursive feature elimination (RFE) feature selection techniques along with a support vector machine (SVM) classification model. We also utilized Random Forest (RF) as an embedded FS technique. The results were integrated, and candidate biomarker genes across all techniques were identified. The proposed framework identified 12 potential biomarkers that are highly correlated with different LC types, especially LUAD. A predictive model was trained on the expression profiles of the identified biomarkers, and a performance of 97.99% was achieved. In addition, differential gene expression analysis showed that all 12 genes were significantly differentially expressed between normal and LUAD tissues and strongly correlated with LUAD according to previous reports. We propose that using multiple feature selection methods effectively reduces the number of identified biomarkers and directly affects their biological relevance.
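The integration step, where results from several feature selection techniques are combined into one candidate list, can be sketched as a simple vote count. The abstract does not say whether the authors used a strict intersection or another rule, so the `min_votes` parameter and the gene names below are hypothetical.

```python
from collections import Counter


def consensus_features(selections, min_votes=None):
    """Return features chosen by at least min_votes selection methods.

    selections maps a method name to the list of features it selected.
    By default a feature must be selected by every method (intersection).
    """
    if min_votes is None:
        min_votes = len(selections)
    votes = Counter(
        gene for chosen in selections.values() for gene in set(chosen)
    )
    return sorted(gene for gene, n in votes.items() if n >= min_votes)


# Hypothetical per-method selections (placeholder gene names).
selections = {
    "mutual_information": ["GENE_A", "GENE_B", "GENE_C"],
    "rfe_svm": ["GENE_B", "GENE_C", "GENE_D"],
    "random_forest": ["GENE_A", "GENE_B", "GENE_C"],
}

strict = consensus_features(selections)                 # chosen by all three
relaxed = consensus_features(selections, min_votes=2)   # chosen by two or more
print(strict, relaxed)
```

Requiring agreement across methods shrinks the candidate list, which is the effect the abstract attributes to using multiple feature selection methods.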
Affiliation(s)
- Omar Abdelwahab: University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
- Nourelislam Awad: University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt; Center of Informatics Science, Nile University, Giza, Egypt
- Menattallah Elserafy: University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt; Center for Genomics, Helmy Institute for Medical Sciences, Zewail City of Science and Technology, Giza, Egypt
- Eman Badr: University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt; Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
9
Song J, Huang F, Chen L, Feng K, Jian F, Huang T, Cai YD. Identification of methylation signatures associated with CAR T cell in B-cell acute lymphoblastic leukemia and non-hodgkin’s lymphoma. Front Oncol 2022; 12:976262. [PMID: 36033519 PMCID: PMC9402909 DOI: 10.3389/fonc.2022.976262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Accepted: 07/25/2022] [Indexed: 11/13/2022]
Abstract
CD19-targeted CAR T cell immunotherapy has exceptional efficacy for the treatment of B-cell malignancies. B-cell acute lymphocytic leukemia and non-Hodgkin’s lymphoma are two common B-cell malignancies with high recurrence rate and are refractory to cure. Although CAR T-cell immunotherapy overcomes the limitations of conventional treatments for such malignancies, failure of treatment and tumor recurrence remain common. In this study, we searched for important methylation signatures to differentiate CAR-transduced and untransduced T cells from patients with acute lymphoblastic leukemia and non-Hodgkin’s lymphoma. First, we used three feature ranking methods, namely, Monte Carlo feature selection, light gradient boosting machine, and least absolute shrinkage and selection operator, to rank all methylation features in order of their importance. Then, the incremental feature selection method was adopted to construct efficient classifiers and filter the optimal feature subsets. Some important methylated genes, namely, SERPINB6, ANK1, PDCD5, DAPK2, and DNAJB6, were identified. Furthermore, the classification rules for distinguishing different classes were established, which can precisely describe the role of methylation features in the classification. Overall, we applied advanced machine learning approaches to the high-throughput data, investigating the mechanism of CAR T cells to establish the theoretical foundation for modifying CAR T cells.
Affiliation(s)
- Jiwei Song: College of Life Science, Changchun Sci-Tech University, Shuangyang, China
- FeiMing Huang: School of Life Sciences, Shanghai University, Shanghai, China
- Lei Chen: College of Information Engineering, Shanghai Maritime University, Shanghai, China
- KaiYan Feng: Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, China
- Fangfang Jian: Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Tao Huang: Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China; CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai: School of Life Sciences, Shanghai University, Shanghai, China
- Correspondence: Tao Huang; Yu-Dong Cai
10
The Blood Gene Expression Signature for Kawasaki Disease in Children Identified with Advanced Feature Selection Methods. BIOMED RESEARCH INTERNATIONAL 2021; 2020:6062436. [PMID: 32685506 PMCID: PMC7327570 DOI: 10.1155/2020/6062436] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/26/2020] [Accepted: 06/12/2020] [Indexed: 01/22/2023]
Abstract
Kawasaki disease (KD) is an acute vasculitis, accompanied by coronary artery aneurysm, coronary artery dilatation, arrhythmia, and other serious cardiovascular diseases. So far, the etiology of KD is unclear; it is necessary to study the molecular mechanism and related factors of KD. In this study, we analyzed the expression profiles of 75 DB (identifying bacteria), 122 DV (identifying virus), 71 HC (healthy control), and 311 KD (Kawasaki disease) samples. 332 key genes related to KD and pathogen infections were identified using a combination of advanced feature selection methods: (1) Boruta, (2) Monte-Carlo Feature Selection (MCFS), and (3) Incremental Feature Selection (IFS). The number of signature genes was narrowed down step by step. Subsequently, their functions were revealed by KEGG and GO enrichment analyses. Our results provided clues of potential molecular mechanisms of KD and were helpful for KD detection and treatment.
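The first filtering stage named above, Boruta, keeps a feature only if it repeatedly outperforms shuffled "shadow" copies of the features. The sketch below illustrates that idea in plain Python with two stated simplifications: absolute Pearson correlation stands in for the random forest importance that Boruta normally uses, and a fixed hit-rate threshold replaces its statistical test. The data, the `min_hit_rate` parameter, and the function names are hypothetical.

```python
import math
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / math.sqrt(vx * vy)


def boruta_like(columns, y, n_iter=50, min_hit_rate=0.9, seed=0):
    """Confirm features that beat the best shadow feature in most rounds."""
    rng = random.Random(seed)
    hits = [0] * len(columns)
    for _ in range(n_iter):
        shadow_max = 0.0
        for col in columns:
            shadow = col[:]
            rng.shuffle(shadow)  # shuffling breaks any link to y
            shadow_max = max(shadow_max, abs_correlation(shadow, y))
        for j, col in enumerate(columns):
            if abs_correlation(col, y) > shadow_max:
                hits[j] += 1
    confirmed = [j for j, h in enumerate(hits) if h / n_iter >= min_hit_rate]
    return confirmed, hits


# Toy data: column 0 tracks the label exactly, column 1 is unrelated.
y = [0.0] * 15 + [1.0] * 15
columns = [
    list(y),
    [float((7 * i) % 11) for i in range(30)],
]
confirmed, hits = boruta_like(columns, y)
print(confirmed, hits)
```

Features surviving this filter would then be ranked (for example by MCFS) and passed to incremental feature selection, which is how the signature is narrowed step by step.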
11
Li JF, Ma XJ, Ying LL, Tong YH, Xiang XP. Multi-Omics Analysis of Acute Lymphoblastic Leukemia Identified the Methylation and Expression Differences Between BCP-ALL and T-ALL. Front Cell Dev Biol 2021; 8:622393. [PMID: 33553159 PMCID: PMC7859262 DOI: 10.3389/fcell.2020.622393] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/28/2020] [Accepted: 12/15/2020] [Indexed: 02/06/2023]
Abstract
Acute lymphoblastic leukemia (ALL) is a common, heterogeneous cancer that is mainly divided into BCP-ALL and T-ALL, accounting for 80–85% and 15–20% of cases, respectively. There are many differences between BCP-ALL and T-ALL, including prognosis, treatment, drug screening, and gene research. In this study, starting with methylation and gene expression data, we analyzed the molecular differences between BCP-ALL and T-ALL and identified the multi-omics signatures using Boruta and Monte Carlo feature selection methods. There were 7 expression signature genes (CD3D, VPREB3, HLA-DRA, PAX5, BLNK, GALNT6, SLC4A8) and 168 methylation sites corresponding to 175 methylation signature genes. The overall accuracy, accuracy for BCP-ALL, and accuracy for T-ALL of the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) classifier using these signatures, evaluated with 10-fold cross validation repeated 3 times, were 0.973, 0.990, and 0.933, respectively. Two genes, CD3D and VPREB3, overlapped between the 175 methylation signature genes and the 7 expression signature genes. The network analysis of the methylation and expression signature genes suggested that their common gene, CD3D, not only differed at both the methylation and expression levels but also played a key regulatory role as a hub in the network. Our results provide insights into the underlying molecular mechanisms of ALL and can facilitate more precise diagnosis and treatment of ALL.
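The evaluation protocol mentioned above, 10-fold cross validation repeated 3 times with accuracy reported per class, can be sketched as follows. This is a plain-Python illustration under assumptions: the splitter is not stratified, the seed and sample count are arbitrary, and the toy predictions are invented; the abstract does not describe the authors' exact splitting procedure.

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=3, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle gives different folds per repeat
        for fold in range(n_splits):
            test_idx = order[fold::n_splits]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, test_idx


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return {label: correct[label] / totals[label] for label in totals}


splits = list(repeated_kfold(n_samples=25))

# Hypothetical predictions pooled over the held-out folds.
scores = per_class_accuracy(
    ["BCP-ALL"] * 4 + ["T-ALL"] * 2,
    ["BCP-ALL", "BCP-ALL", "BCP-ALL", "T-ALL", "T-ALL", "T-ALL"],
)
print(len(splits), scores)
```

Every sample is held out exactly once per repeat, so three repeats give three held-out predictions per sample from which overall and per-class accuracies can be averaged.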
Affiliation(s)
- Jin-Fan Li: Department of Pathology, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Xiao-Jing Ma: Department of Pathology, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Lin-Lin Ying: Department of Pathology, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Ying-Hui Tong: Department of Pharmacy, Cancer Hospital of the University of Chinese Academy of Sciences (Zhejiang Cancer Hospital), Institute of Cancer and Basic Medicine (IBMC), Chinese Academy of Sciences, Hangzhou, China
- Xue-Ping Xiang: Department of Pathology, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
12
Li D, Lin H, Li L. Multiple Feature Selection Strategies Identified Novel Cardiac Gene Expression Signature for Heart Failure. Front Physiol 2020; 11:604241. [PMID: 33304275 PMCID: PMC7693561 DOI: 10.3389/fphys.2020.604241] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/09/2020] [Accepted: 10/15/2020] [Indexed: 02/02/2023]
Abstract
Heart failure (HF) is a serious condition in which the supply of blood pumped by the heart is insufficient to meet the demands of the body at a normal cardiac filling pressure. Approximately 26 million patients worldwide suffer from heart failure; about 17–45% of patients admitted to a hospital with heart failure die within 1 year, and the majority die within 5 years. The molecular mechanisms underlying the progression of heart failure have been poorly studied. We compared the gene expression profiles between patients with heart failure (n = 177) and without heart failure (n = 136) using multiple feature selection strategies and identified 38 HF signature genes. The support vector machine (SVM) classifier based on these 38 genes, evaluated with leave-one-out cross validation (LOOCV), achieved a sensitivity of 0.983 and a specificity of 0.963. The network analysis suggested that the hub gene SMOC2 may play important roles in HF. Other genes, such as FCN3, HMGN2, and SERPINA3, also showed great promise. Our results can facilitate the early detection of heart failure and can reveal its molecular mechanisms.
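Sensitivity and specificity are simple ratios over the two patient groups, so the reported values can be related back to whole-sample counts. The sketch below searches for the integer counts that round to the reported figures given the stated group sizes (177 and 136). The resulting counts are an inference from rounding, not numbers stated in the abstract.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def counts_matching(rate, n, digits=3):
    """All integer counts k (0..n) for which k / n rounds to the given rate."""
    return [k for k in range(n + 1) if round(k / n, digits) == round(rate, digits)]


# Group sizes and rates as reported in the abstract.
tp_candidates = counts_matching(0.983, 177)  # heart failure patients detected
tn_candidates = counts_matching(0.963, 136)  # non-HF samples correctly rejected
print(tp_candidates, tn_candidates)

# Using the single matching count for each group (inferred, not reported):
sens, spec = sensitivity_specificity(tp=174, fn=3, tn=131, fp=5)
print(round(sens, 3), round(spec, 3))
```

Under this reading, the leave-one-out evaluation would correspond to 3 missed heart failure cases and 5 false alarms among the 313 samples.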
Affiliation(s)
- Dan Li
- Department of Cardiovascular Medicine, First Hospital Affiliated to Harbin Medical University, Harbin, China
- Hong Lin
- Internal Medicine-Cardiovascular Department, Harbin Chest Hospital, Harbin, China
- Luyifei Li
- Department of Cardiovascular Medicine, First Hospital Affiliated to Harbin Medical University, Harbin, China
|
13
|
Xia Q, Shu Z, Ye T, Zhang M. Identification and Analysis of the Blood lncRNA Signature for Liver Cirrhosis and Hepatocellular Carcinoma. Front Genet 2020; 11:595699. [PMID: 33365048 PMCID: PMC7750531 DOI: 10.3389/fgene.2020.595699] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2020] [Accepted: 10/13/2020] [Indexed: 12/12/2022] Open
Abstract
As one of the most common malignant tumors, hepatocellular carcinoma (HCC) is the fifth major cause of cancer-associated mortality worldwide. In 90% of cases, HCC develops in the context of liver cirrhosis, and chronic hepatitis B virus (HBV) infection is an important etiology for cirrhosis and HCC, accounting for 53% of all HCC cases. To understand the mechanisms underlying the dynamic chain reactions from normal to HBV infection, from HBV infection to liver cirrhosis, and from liver cirrhosis to HCC, we analyzed the blood lncRNA expression profiles from 38 healthy control samples, 45 chronic hepatitis B patients, 46 liver cirrhosis patients, and 46 HCC patients. Advanced machine-learning methods, including Monte Carlo feature selection, incremental feature selection (IFS), and support vector machine (SVM), were applied to discover the signature associated with HCC progression and to construct the prediction model. One hundred seventy-one key HCC progression-associated lncRNAs were identified, and their overall accuracy was 0.823, as evaluated with leave-one-out cross validation (LOOCV). The accuracies of the lncRNA signature for healthy control, chronic hepatitis B, liver cirrhosis, and HCC were 0.895, 0.711, 0.870, and 0.826, respectively. The 171-lncRNA signature is not only useful for early detection and intervention of HCC, but also helpful for understanding the multistage tumorigenic processes of HCC.
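The combination of incremental feature selection (IFS) and LOOCV mentioned in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the feature ranking is taken as given (the paper obtains it with Monte Carlo feature selection), a simple nearest-centroid classifier stands in for the SVM, and all function names and toy data are hypothetical.

```python
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x.

    Stand-in classifier (assumption); the paper uses an SVM instead.
    """
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def loocv_accuracy(X, y, features):
    """Leave-one-out accuracy using only the given feature indices."""
    hits = 0
    for i in range(len(X)):
        train_X = [[row[f] for f in features] for j, row in enumerate(X) if j != i]
        train_y = [lab for j, lab in enumerate(y) if j != i]
        held_out = [X[i][f] for f in features]
        hits += nearest_centroid_predict(train_X, train_y, held_out) == y[i]
    return hits / len(X)


def incremental_feature_selection(X, y, ranked_features):
    """Evaluate the top-k ranked features for k = 1..n and keep the best k."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        acc = loocv_accuracy(X, y, ranked_features[:k])
        if acc > best_acc:  # strict '>' keeps the smallest k among ties
            best_k, best_acc = k, acc
    return ranked_features[:best_k], best_acc


# Toy data (assumed): feature 0 separates the classes, feature 2 is noise.
X = [
    [1.0, 0.2, 5.0], [1.2, 0.1, -4.0], [0.9, 0.3, 3.0], [1.1, 0.0, -5.0],
    [3.0, 0.9, 4.0], [3.2, 1.0, -3.0], [2.9, 0.8, 5.0], [3.1, 1.1, -4.0],
]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]
selected, acc = incremental_feature_selection(X, y, [0, 1, 2])
print(selected, acc)
```

Because the subset size is chosen by the same LOOCV accuracy that is then reported, the resulting figure tends to be optimistic; an outer validation loop or an independent test set would give a less biased estimate.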
Affiliation(s)
- Qi Xia
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China
- Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, Hangzhou, China
- Zhejiang University, Hangzhou, China
- Zheyue Shu
- Zhejiang University, Hangzhou, China
- Division of Hepatobiliary and Pancreatic Surgery, Department of Surgery, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China
- Key Laboratory of Combined Multi-Organ Transplantation, Ministry of Public Health, Hangzhou, China
- Ting Ye
- Zhejiang University, Hangzhou, China
- Min Zhang
- Zhejiang University, Hangzhou, China
- Division of Hepatobiliary and Pancreatic Surgery, Department of Surgery, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China
- Key Laboratory of Combined Multi-Organ Transplantation, Ministry of Public Health, Hangzhou, China
|
14
|
Wu Z, Shou L, Wang J, Huang T, Xu X. The Methylation Pattern for Knee and Hip Osteoarthritis. Front Cell Dev Biol 2020; 8:602024. [PMID: 33240895 PMCID: PMC7677303 DOI: 10.3389/fcell.2020.602024] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2020] [Accepted: 10/22/2020] [Indexed: 01/08/2023] Open
Abstract
Osteoarthritis (OA) is one of the most prevalent chronic joint diseases among middle-aged and elderly people. In recent years, however, the number of young people suffering from the disease has increased quickly. Osteoarthritis is known to be a common degenerative disease caused by the combination and interaction of many factors, such as natural and environmental factors. DNA methylation reflects the effects of environmental factors. Several studies of DNA methylation at specific genes in OA cartilage have indicated its potentially important role in OA. To systematically investigate the methylation pattern in knee and hip osteoarthritis, we analyzed the methylation profiles in cartilage of 16 OA hip samples, 19 control hip samples, and 62 OA knee samples. Twelve discriminative methylation sites were identified using the minimal Redundancy Maximal Relevance (mRMR) and Incremental Feature Selection (IFS) methods. The SVM classifier based on these 12 methylation sites, from genes such as MEIS1, GABRG3, RXRA, and EN1, perfectly classified the OA hip samples, control hip samples, and OA knee samples, as evaluated with leave-one-out cross validation (LOOCV). These 12 methylation sites can not only serve as biomarkers, but also shed light on the underlying mechanism of OA.
Affiliation(s)
- Zhen Wu
- Department of Orthopaedics, Tongde Hospital of Zhejiang Province, Hangzhou, China
- Lu Shou
- Department of Pneumology, Tongde Hospital of Zhejiang Province, Hangzhou, China
- Jian Wang
- Department of Orthopaedics, Tongde Hospital of Zhejiang Province, Hangzhou, China
- Tao Huang
- Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
- Xinwei Xu
- Department of Orthopaedics, Tongde Hospital of Zhejiang Province, Hangzhou, China
|
15
|
Zhu JH, Yan QL, Wang JW, Chen Y, Ye QH, Wang ZJ, Huang T. The Key Genes for Perineural Invasion in Pancreatic Ductal Adenocarcinoma Identified With Monte-Carlo Feature Selection Method. Front Genet 2020; 11:554502. [PMID: 33193628 PMCID: PMC7593847 DOI: 10.3389/fgene.2020.554502] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2020] [Accepted: 08/17/2020] [Indexed: 12/20/2022] Open
Abstract
Background Pancreatic ductal adenocarcinoma (PDAC) is the most aggressive form of pancreatic cancer. Its 5-year survival rate is only 3–5%. Perineural invasion (PNI) is a process in which cancer cells invade the surrounding nerves and perineural spaces. It is considered to be associated with the poor prognosis of PDAC. About 90% of pancreatic cancer patients have PNI. The high incidence of PNI in pancreatic cancer limits radical resection and promotes local recurrence, which negatively affects the quality of life and survival time of patients with pancreatic cancer. Objectives To investigate the mechanism of PNI in pancreatic cancer, we analyzed the gene expression profiles of tumors and adjacent tissues from 50 PDAC patients, including 28 patients with perineural invasion and 22 patients without perineural invasion. Method Using Monte-Carlo feature selection and the Incremental Feature Selection (IFS) method, we identified 26 key features, of which 15 were from tumor tissues and 11 were from adjacent tissues. Results Our results suggested that not only the tumor tissue, but also the adjacent tissue, was informative for perineural invasion prediction. The SVM classifier based on these 26 key features predicted perineural invasion with an accuracy of 0.94, as evaluated with leave-one-out cross validation (LOOCV). Conclusion The in-depth biological analysis of key feature genes, such as TNFRSF14, XPO1, and ATF3, sheds light on perineural invasion in pancreatic ductal adenocarcinoma.
Affiliation(s)
- Jin-Hui Zhu
- Department of General Surgery, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Qiu-Liang Yan
- Department of General Surgery, Jinhua People's Hospital, Jinhua, China
- Jian-Wei Wang
- Department of Surgical Oncology, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Yan Chen
- Department of General Surgery, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Qing-Huang Ye
- Department of General Surgery, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zhi-Jiang Wang
- Department of General Surgery, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
|
16
|
Zhang YH, Jin M, Li J, Kong X. Identifying circulating miRNA biomarkers for early diagnosis and monitoring of lung cancer. Biochim Biophys Acta Mol Basis Dis 2020; 1866:165847. [DOI: 10.1016/j.bbadis.2020.165847] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2020] [Revised: 04/28/2020] [Accepted: 05/19/2020] [Indexed: 02/09/2023]
|
17
|
Li J, Xu Q, Wu M, Huang T, Wang Y. Pan-Cancer Classification Based on Self-Normalizing Neural Networks and Feature Selection. Front Bioeng Biotechnol 2020; 8:766. [PMID: 32850695 PMCID: PMC7417299 DOI: 10.3389/fbioe.2020.00766] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2020] [Accepted: 06/17/2020] [Indexed: 11/13/2022] Open
Abstract
Cancer is one of the most severe diseases, and cancer classification plays an important role in cancer diagnosis and treatment. Some different cancers even share similar molecular features, such as DNA copy number variants. Pan-cancer classification is still non-trivial at the molecular level. Herein, we propose a computational method to classify cancer types by using the self-normalizing neural network (SNN) to analyze pan-cancer copy number variation data. Since the dimension of the copy number variation features is high, the Monte Carlo feature selection method was used to rank these features. A classifier was then built using the SNN together with the feature selection method to select features. Three thousand six hundred ninety-four features were chosen for the prediction model, which yielded an accuracy of 0.798 and a macro F1 of 0.789. We compared our model to the random forest method. Results show that the accuracy and macro F1 obtained by our classifier are higher than those obtained by the random forest classifier, indicating the good predictive power of our method in distinguishing four different cancer types. This method is also extendable to pan-cancer classification for other molecular features.
Affiliation(s)
- Junyi Li
- Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Qingzhe Xu
- Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Mingxiao Wu
- Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Yadong Wang
- Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
|
18
|
Ren X, Wang S, Huang T. Decipher the connections between proteins and phenotypes. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2020; 1868:140503. [PMID: 32707349 DOI: 10.1016/j.bbapap.2020.140503] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/31/2020] [Revised: 06/30/2020] [Accepted: 07/16/2020] [Indexed: 10/23/2022]
Abstract
As the outward-most representation of life, phenotype is the fundamental basis on which humans understand life and disease. But with the advent of molecular and sequencing techniques, a growing portion of scientific research focuses primarily on the molecular level of life. Our understanding of molecular variations and mechanisms can only be fully utilized when it is translated to the phenotypic level. In this study, we constructed a similarity network for phenotype ontology and then applied network analysis methods to discover phenotype/disease clusters. We then used machine learning models to predict protein-phenotype associations. Each protein was characterized by the functional profiles of its interaction neighbors on the protein-protein interaction network. Our methods can not only predict protein-phenotype associations, but also reveal the underlying mechanisms from protein to phenotype.
Affiliation(s)
- Xiaohui Ren
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Steven Wang
- Department of Molecular Biology, Columbia University, New York, USA
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
|
19
|
Pan X, Zeng T, Zhang YH, Chen L, Feng K, Huang T, Cai YD. Investigation and Prediction of Human Interactome Based on Quantitative Features. Front Bioeng Biotechnol 2020; 8:730. [PMID: 32766217 PMCID: PMC7379396 DOI: 10.3389/fbioe.2020.00730] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2020] [Accepted: 06/09/2020] [Indexed: 01/27/2023] Open
Abstract
Protein is one of the most significant components of all living creatures. All significant and essential biological structures and functions rely on proteins and their respective biological functions. However, proteins cannot perform their unique biological roles independently. They have to interact with each other to realize the complicated biological processes in all living creatures, including human beings. In other words, proteins depend on protein-protein interactions (PPIs) to realize their significant effects. Thus, the relative significance and quantitative contribution of candidate PPI features must be determined urgently. According to previous studies, 258 physical and chemical characteristics of proteins have been reported and confirmed to affect the interaction efficiency of the related proteins. Among such features, essential physicochemical features of proteins, such as stoichiometric balance, protein abundance, molecular weight, and charge distribution, have been validated to be quite significant and irreplaceable for PPIs. Therefore, in this study, we, on one hand, presented a novel computational framework to identify the key factors affecting PPIs with Boruta feature selection (BFS), Monte Carlo feature selection (MCFS), and incremental feature selection (IFS), and on the other hand, built a quantitative decision-rule system to evaluate the potential PPIs under real conditions with random forest (RF) and RIPPER algorithms, thereby supplying several new insights into the detailed biological mechanisms of complicated PPIs. The main datasets and codes can be downloaded at https://github.com/xypan1232/Mass-PPI.
Affiliation(s)
- Xiaoyong Pan
- School of Life Sciences, Shanghai University, Shanghai, China
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
- Tao Zeng
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China
- Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic, Guangzhou, China
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
20
|
Liang P, Yang W, Chen X, Long C, Zheng L, Li H, Zuo Y. Machine Learning of Single-Cell Transcriptome Highly Identifies mRNA Signature by Comparing F-Score Selection with DGE Analysis. MOLECULAR THERAPY. NUCLEIC ACIDS 2020; 20:155-163. [PMID: 32169803 PMCID: PMC7066034 DOI: 10.1016/j.omtn.2020.02.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/12/2019] [Revised: 12/27/2019] [Accepted: 02/05/2020] [Indexed: 12/21/2022]
Abstract
Human preimplantation development is a complex process involving dramatic changes in transcriptional architecture. For a better understanding of its spatiotemporal development, it is indispensable to identify key genes. Although single-cell RNA sequencing (RNA-seq) techniques can provide detailed clustering signatures, the identification of decisive factors remains difficult. Additionally, it requires high experimental cost and a long experimental period. Thus, it is highly desirable to develop computational methods for identifying effective developmental signature genes. In this study, we developed a predictor called EmPredictor to identify developmental stages of human preimplantation embryogenesis. First, we compared the F-score feature selection algorithm with differential gene expression (DGE) analysis to find specific signatures of the development stage. In addition, by training the support vector machine (SVM), four types of signature subsets were comprehensively discussed. The prediction results showed that a feature subset with 1,881 genes from the F-score algorithm obtained the best predictive performance, achieving the highest accuracy of 93.3% on the cross-validation set. Further function enrichment demonstrated that the gene set selected by the feature selection method was involved in more development-related pathways and cell fate determination biomarkers. This indicates that the F-score algorithm should be preferred for detecting key genes in multi-period data on mammalian early development.
Affiliation(s)
- Pengfei Liang
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Wuritu Yang
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Xing Chen
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Chunshen Long
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Lei Zheng
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Hanshuang Li
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Yongchun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China
|
21
|
Li M, Chen F, Zhang Y, Xiong Y, Li Q, Huang H. Identification of Post-myocardial Infarction Blood Expression Signatures Using Multiple Feature Selection Strategies. Front Physiol 2020; 11:483. [PMID: 32581823 PMCID: PMC7287215 DOI: 10.3389/fphys.2020.00483] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2020] [Accepted: 04/20/2020] [Indexed: 12/24/2022] Open
Abstract
Myocardial infarction (MI) is a type of serious heart attack in which the blood flow to the heart is suddenly interrupted, resulting in injury to the heart muscles due to a lack of oxygen supply. Although clinical diagnosis methods can be used to identify the occurrence of MI, using the changes of molecular markers or characteristic molecules in blood to characterize the early phase and later trend of MI will help us choose a more reasonable treatment plan. Previously, comparative transcriptome studies focused on finding differentially expressed genes between MI patients and healthy people. However, signature molecules altered in different phases of MI have not been well characterized. We developed a set of computational approaches integrating multiple machine learning algorithms, including Monte Carlo feature selection (MCFS), incremental feature selection (IFS), and support vector machine (SVM), to identify gene expression characteristics in different phases of MI. A total of 134 genes were determined to serve as features for building optimal SVM classifiers to distinguish acute MI and post-MI. Subsequently, functional enrichment analyses followed by protein-protein interaction analysis of the 134 genes identified several hub genes (IL1R1, TLR2, and TLR4) associated with progression of MI, which can be used as new diagnostic molecules for MI.
Affiliation(s)
- Ming Li
- Department of Cardiology, Eastern Hospital, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, Chengdu, China
- Fuli Chen
- Department of Cardiology, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, Chengdu, China
- Yaling Zhang
- Department of Nephrology, Eastern Hospital, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, Chengdu, China
- Yan Xiong
- Department of Cardiology, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, Chengdu, China
- Qiyong Li
- Department of Cardiology, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, Chengdu, China
- Hui Huang
- Department of Cardiology, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, Chengdu, China
|
22
|
Tao X, Wu X, Huang T, Mu D. Identification and Analysis of Dysfunctional Genes and Pathways in CD8+ T Cells of Non-Small Cell Lung Cancer Based on RNA Sequencing. Front Genet 2020; 11:352. [PMID: 32457792 PMCID: PMC7227791 DOI: 10.3389/fgene.2020.00352] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2020] [Accepted: 03/23/2020] [Indexed: 12/26/2022] Open
Abstract
Lung cancer, the most common of malignant tumors, is typically of the non-small cell (NSCLC) type. T-cell-based immunotherapies are a promising and powerful approach to treating NSCLCs. To characterize the CD8+ T cells of non-small cell lung cancer, we re-analyzed the published RNA-Seq gene expression profiles of 36 CD8+ T cell samples isolated from tumors (TIL) and 32 samples from adjacent uninvolved lung (NTIL). With an advanced Monte Carlo method of feature selection, we identified the CD8+ TIL specific expression patterns. These patterns revealed the key dysfunctional genes and pathways in CD8+ TIL and shed light on the molecular mechanisms of immunity and the use of immunotherapy.
Affiliation(s)
- Xuefang Tao
- Affiliated Hospital of Shaoxing University, Shaoxing, China
- Xiaotang Wu
- Shanghai Engineering Research Center of Pharmaceutical Translation, Shanghai, China
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Deguang Mu
- Department of Respiratory Medicine, Zhejiang Provincial People's Hospital, People's Hospital of Hangzhou Medical College, Hangzhou, China
|
23
|
Kalina J, Matonoha C. A sparse pair-preserving centroid-based supervised learning method for high-dimensional biomedical data or images. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.008] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
24
|
Zhang H, Jin Z, Cheng L, Zhang B. Integrative Analysis of Methylation and Gene Expression in Lung Adenocarcinoma and Squamous Cell Lung Carcinoma. Front Bioeng Biotechnol 2020; 8:3. [PMID: 32117905 PMCID: PMC7019569 DOI: 10.3389/fbioe.2020.00003] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2019] [Accepted: 01/03/2020] [Indexed: 12/18/2022] Open
Abstract
Lung cancer is a highly prevalent type of cancer with a poor 5-year survival rate of about 4–17%. Eighty percent of lung cancers are non-small-cell lung cancer (NSCLC). For a long time, the treatment of NSCLC has been mostly guided by tumor stage, and there has been no significant difference between the therapy strategies for lung adenocarcinoma (LUAD) and squamous cell lung carcinoma (SCLC), the two major subtypes of NSCLC. In recent years, important molecular differences between LUAD and SCLC have been increasingly identified, indicating that targeted therapy will become more and more histologically specific in the future. To investigate the differences between LUAD and SCLC on a multi-omics scale, we analyzed the methylation and gene expression data together. With the Boruta method to remove irrelevant features and the MCFS (Monte Carlo Feature Selection) method to identify the significantly important features, we identified 113 key methylation features and 23 key gene expression features. HNF1B and TP63 were found to be dysfunctional on both methylation and gene expression levels. The experimentally determined interaction network suggested that TP63 may play an important role in connecting methylation genes and expression genes. Many of the discovered signature genes are supported by the literature. Our results may provide directions for precision diagnosis and therapy of LUAD and SCLC.
Affiliation(s)
- Hao Zhang
- Department of Respiratory and Critical Care Medicine, Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
- Zhou Jin
- Department of Respiratory and Critical Care Medicine, Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
- Department of Respiration, Hospital of Traditional Chinese Medicine of Zhenhai, Ningbo, China
- Ling Cheng
- Shanghai Engineering Research Center of Pharmaceutical Translation, Shanghai, China
- Bin Zhang
- Department of Respiratory and Critical Care Medicine, Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
|
25
|
Zhang J, Hu H, Xu S, Jiang H, Zhu J, Qin E, He Z, Chen E. The Functional Effects of Key Driver KRAS Mutations on Gene Expression in Lung Cancer. Front Genet 2020; 11:17. [PMID: 32117436 PMCID: PMC7010953 DOI: 10.3389/fgene.2020.00017] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2019] [Accepted: 01/07/2020] [Indexed: 12/11/2022] Open
Abstract
Lung cancer is a common malignant cancer. Kirsten rat sarcoma oncogene (KRAS) mutations have been considered a key driver of lung cancers. KRAS p.G12C mutations are the most predominant in NSCLC, comprising about 11–16% of lung adenocarcinomas (p.G12C accounts for 45–50% of mutant KRAS). However, it is still not clear how KRAS mutation triggers lung cancers. To study the molecular mechanisms of KRAS mutation in lung cancer, we analyzed the gene expression profiles of 156 KRAS mutation samples and other negative samples with a two-stage feature selection approach: (1) minimal Redundancy Maximal Relevance (mRMR) and (2) Incremental Feature Selection (IFS). Finally, 41 predictive genes for KRAS mutation were identified and a KRAS mutation predictor was constructed. Its leave-one-out cross validation Matthews correlation coefficient (MCC) was 0.879. Our results are helpful for understanding the roles of KRAS mutation in lung cancer.
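For readers unfamiliar with the MCC reported in this abstract, the following minimal sketch computes the Matthews correlation coefficient from pooled held-out predictions, as would be collected over the folds of a leave-one-out run. The function name and the toy labels are illustrative assumptions, not the authors' code or data.

```python
import math


def matthews_corrcoef(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels.

    Hypothetical helper for illustration. Returns 0.0 when a marginal
    count is zero, a common convention for the undefined case.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (assumed): 4 mutant and 4 wild-type samples, one error each.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC={mcc:.3f}")
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, which makes it more informative when the positive and negative groups differ in size.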
Affiliation(s)
- Jisong Zhang
- Department of Pulmonary and Critical Care Medicine, Sir Run Run Shaw Hospital of Zhejiang University, Hangzhou, China
- Huihui Hu
- Department of Pulmonary and Critical Care Medicine, Sir Run Run Shaw Hospital of Zhejiang University, Hangzhou, China
- Shan Xu
- Department of Pulmonary and Critical Care Medicine, Sir Run Run Shaw Hospital of Zhejiang University, Hangzhou, China
- Hanliang Jiang
- Department of Pulmonary and Critical Care Medicine, Sir Run Run Shaw Hospital of Zhejiang University, Hangzhou, China
- Jihong Zhu
- Department of Anesthesiology, Sir Run Run Shaw Hospital of Zhejiang University, Hangzhou, China
- E Qin
- Department of Respiratory Medicine, Shaoxing People's Hospital (Shaoxing Hospital, Zhejiang University School of Medicine), Shaoxing, China
- Zhengfu He
- Department of Thoracic Surgery, Sir Run Run Shaw Hospital of Zhejiang University, Hangzhou, China
- Enguo Chen
- Department of Pulmonary and Critical Care Medicine, Sir Run Run Shaw Hospital of Zhejiang University, Hangzhou, China
|
26
|
Thornlow BP, Armstrong J, Holmes AD, Howard JM, Corbett-Detig RB, Lowe TM. Predicting transfer RNA gene activity from sequence and genome context. Genome Res 2020; 30:85-94. [PMID: 31857444 PMCID: PMC6961574 DOI: 10.1101/gr.256164.119] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2019] [Accepted: 12/12/2019] [Indexed: 01/25/2023]
Abstract
Transfer RNA (tRNA) genes are among the most highly transcribed genes in the genome owing to their central role in protein synthesis. However, there is evidence for a broad range of gene expression across tRNA loci. This complexity, combined with difficulty in measuring transcript abundance and high sequence identity across transcripts, has severely limited our collective understanding of tRNA gene expression regulation and evolution. We establish sequence-based correlates to tRNA gene expression and develop a tRNA gene classification method that does not require, but benefits from, comparative genomic information and achieves accuracy comparable to molecular assays. We observe that guanine + cytosine (G + C) content and CpG density surrounding tRNA loci are exceptionally well correlated with tRNA gene activity, supporting a prominent regulatory role of the local genomic context in combination with internal sequence features. We use our tRNA gene activity predictions in conjunction with a comprehensive tRNA gene ortholog set spanning 29 placental mammals to estimate the evolutionary rate of functional changes among orthologs. Our method adds a new dimension to large-scale tRNA functional prediction and will help prioritize characterization of functional tRNA variants. Its simplicity and robustness should enable development of similar approaches for other clades, as well as exploration of functional diversification of members of large gene families.
Affiliation(s)
- Bryan P Thornlow
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
| | - Joel Armstrong
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
- Genomics Institute, University of California, Santa Cruz, California 95064, USA
| | - Andrew D Holmes
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
| | - Jonathan M Howard
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
| | - Russell B Corbett-Detig
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
- Genomics Institute, University of California, Santa Cruz, California 95064, USA
| | - Todd M Lowe
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
- Genomics Institute, University of California, Santa Cruz, California 95064, USA
| |
|
27
|
Zhao X, Chen L, Guo ZH, Liu T. Predicting Drug Side Effects with Compact Integration of Heterogeneous Networks. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190220114644] [Citation(s) in RCA: 72] [Impact Index Per Article: 14.4] [Indexed: 12/11/2022]
Abstract
Background:
The side effects of drugs are not only harmful to humans but also the major
reasons for withdrawing approved drugs, bringing greater risks for pharmaceutical companies.
However, detecting the side effects for a given drug via traditional experiments is time-consuming
and expensive. In recent years, several computational methods have been proposed to predict the
side effects of drugs. However, most of the methods cannot effectively integrate the heterogeneous
properties of drugs.
Methods:
In this study, we adopted a network embedding method, Mashup, to extract essential and
informative drug features from several drug heterogeneous networks, representing different properties
of drugs. For side effects, a network was also built, from where side effect features were extracted.
These features can capture essential information about drugs and side effects in a network
level. Drug and side effect features were combined together to represent each pair of drug and side
effect, which was deemed as a sample in this study. Furthermore, they were fed into a random forest
(RF) algorithm to construct the prediction model, called the RF network model.
Results:
The RF network model was evaluated by several tests. The average of Matthews correlation
coefficients on the balanced and unbalanced datasets was 0.640 and 0.641, respectively.
Conclusion:
The RF network model was superior to the models incorporating other machine
learning algorithms and one previous model. Finally, we also investigated the influence of two feature
dimension parameters on the RF network model and found that our model was not very sensitive
to these parameters.
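The pair representation and the evaluation metric described in this abstract can be sketched as follows. This is only an illustration: the helper names and toy vectors are hypothetical, simple concatenation is an assumption about how drug and side-effect features are "combined", and the Mashup embedding and random forest steps are not reproduced. The Matthews correlation coefficient (MCC) follows its standard binary definition.

```python
import math


def pair_features(drug_vec, side_effect_vec):
    """Represent one drug/side-effect pair by concatenating both vectors (assumed)."""
    return list(drug_vec) + list(side_effect_vec)


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from two equal-length sequences of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


# Toy values, not data from the study.
sample = pair_features([0.1, 0.4], [0.7])
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(sample, matthews_corrcoef(y_true, y_pred))
```

MCC uses all four cells of the confusion matrix, which is why it is commonly reported for both balanced and unbalanced datasets, as in the results above.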
Affiliation(s)
- Xian Zhao
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Zi-Han Guo
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Tao Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
|
28
|
Chen L, Li D, Shao Y, Wang H, Liu Y, Zhang Y. Identifying Microbiota Signature and Functional Rules Associated With Bacterial Subtypes in Human Intestine. Front Genet 2019; 10:1146. [PMID: 31803234 PMCID: PMC6872643 DOI: 10.3389/fgene.2019.01146] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/17/2019] [Accepted: 10/21/2019] [Indexed: 12/12/2022]
Abstract
Gut microbiomes are integral microflora located in the human intestine with particular symbiosis. Among all microorganisms in the human intestine, bacteria are the most significant subgroup that contains many unique and functional species. The distribution patterns of bacteria in the human intestine not only reflect the different microenvironments in different sections of the intestine but also indicate that bacteria may have unique biological functions corresponding to their proper regions of the intestine. However, describing the functional differences between the bacterial subgroups and their distributions in different individuals is difficult using traditional computational approaches. Here, we first attempted to introduce four effective sets of bacterial features from independent databases. We then presented a novel computational approach to identify potential distinctive features among bacterial subgroups based on a systematic dataset on the gut microbiome from approximately 1,500 human gut bacterial strains. We also established a group of quantitative rules for explaining such distinctions. Results may reveal the microstructural characteristics of the intestinal flora and deepen our understanding on the regulatory role of bacterial subgroups in the human intestine.
Affiliation(s)
- Lijuan Chen
- College of Animal Science and Technology, Anhui Agricultural University, Hefei, China
| | - Daojie Li
- College of Animal Science and Technology, Anhui Agricultural University, Hefei, China
| | - Ye Shao
- School of Medicine, Huaqiao University, Quanzhou, China
| | - Hui Wang
- College of Animal Science and Technology, Anhui Agricultural University, Hefei, China
| | - Yuqing Liu
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, China
| | - Yunhua Zhang
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, China
| |
|
29
|
Pan X, Zeng T, Yuan F, Zhang YH, Chen L, Zhu L, Wan S, Huang T, Cai YD. Screening of Methylation Signature and Gene Functions Associated With the Subtypes of Isocitrate Dehydrogenase-Mutation Gliomas. Front Bioeng Biotechnol 2019; 7:339. [PMID: 31803734 PMCID: PMC6871504 DOI: 10.3389/fbioe.2019.00339] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 09/09/2019] [Accepted: 10/30/2019] [Indexed: 02/05/2023]
Abstract
Isocitrate dehydrogenase (IDH) is an oncogene, and the expression of a mutated IDH promotes cell proliferation and inhibits cell differentiation. IDH exists in three different isoforms, whose mutation can cause many solid tumors, especially gliomas in adults. No effective method for classifying gliomas on genetic signatures is currently available. DNA methylation may be applied to distinguish cancer cells from normal tissues. In this study, we focused on three subtypes of IDH-mutation gliomas by examining methylation data. Several advanced computational methods were used, such as Monte Carlo feature selection (MCFS), incremental feature selection (IFS), and support vector machine (SVM). The MCFS method was adopted to analyze methylation features, resulting in a feature list. Then, the IFS method incorporating SVM was applied to the list to extract important methylation features and construct an optimal SVM classifier. As a result, several methylation features (sites) were found to relate to glioma subclasses; these sites are annotated onto multiple genes, such as FLJ37543, LCE3D, FAM89A, ADCY5, ESR1, C2orf67, REST, and EPHA7. These genes are enriched in biological functions, including cellular developmental process, neuron differentiation, cellular component morphogenesis, and G-protein-coupled receptor signaling pathway. Our results, which are supported by literature reports and independent dataset validation, showed that our identified genes and functions contributed to the detailed glioma subtypes. This study provides basic research on IDH-mutation gliomas.
Affiliation(s)
- XiaoYong Pan
- School of Life Sciences, Shanghai University, Shanghai, China.,Key Laboratory of System Control and Information Processing, Ministry of Education of China, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China.,IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
| | - Tao Zeng
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China
| | - Fei Yuan
- Department of Science and Technology, Binzhou Medical University Hospital, Binzhou, China
| | - Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China.,Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai, China
| | - LiuCun Zhu
- School of Life Sciences, Shanghai University, Shanghai, China
| | - SiBao Wan
- School of Life Sciences, Shanghai University, Shanghai, China
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
|
30
|
Zhang GL, Pan LL, Huang T, Wang JH. The transcriptome difference between colorectal tumor and normal tissues revealed by single-cell sequencing. J Cancer 2019; 10:5883-5890. [PMID: 31737124 PMCID: PMC6843882 DOI: 10.7150/jca.32267] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 12/14/2018] [Accepted: 06/17/2019] [Indexed: 12/29/2022]
Abstract
Previous cancer studies were difficult to reproduce because tumor tissues, which are actually mixtures of different cancer cells, were analyzed directly. The transcriptome of a single cell is much more robust than the transcriptome of a mixed tissue and has much smaller variance. In this study, we analyzed the single-cell transcriptome of 272 colorectal cancer (CRC) epithelial cells and 160 normal epithelial cells and identified 342 discriminative transcripts using advanced machine learning methods. The most discriminative transcripts were LGALS4, PHGR1, C15orf48, HEPACAM2, PERP, FABP1, FCGBP, MT1G, TSPAN1 and CKB. We further clustered the 342 transcripts into two categories. The upregulated transcripts in CRC epithelial cells were significantly enriched in the Ribosome, Protein processing in endoplasmic reticulum, Antigen processing and presentation, and p53 signaling pathways. The downregulated transcripts in CRC epithelial cells were significantly enriched in the Mineral absorption, Aldosterone-regulated sodium reabsorption, and Oxidative phosphorylation pathways. The biological analysis of the discriminative transcripts revealed a possible mechanism of colorectal cancer.
Affiliation(s)
- Guo-Liang Zhang
- Department of Colorectal Surgery, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou 310003, Zhejiang, China
| | - Le-Lin Pan
- Department of Colorectal Surgery, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou 310003, Zhejiang, China
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Jin-Hai Wang
- Department of Colorectal Surgery, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou 310003, Zhejiang, China
| |
|
31
|
Identifying Methylation Pattern and Genes Associated with Breast Cancer Subtypes. Int J Mol Sci 2019; 20:ijms20174269. [PMID: 31480430 PMCID: PMC6747348 DOI: 10.3390/ijms20174269] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/01/2019] [Revised: 08/19/2019] [Accepted: 08/29/2019] [Indexed: 12/18/2022]
Abstract
Breast cancer is regarded worldwide as a severe human disease. Various genetic variations, including hereditary and somatic mutations, contribute to the initiation and progression of this disease. The diagnostic parameters of breast cancer are not limited to the conventional protein content and can include newly discovered genetic variants and even genetic modification patterns such as methylation and microRNA. In addition, breast cancer detection extends to detailed breast cancer stratifications to provide subtype-specific indications for further personalized treatment. One genome-wide expression–methylation quantitative trait loci analysis confirmed that different breast cancer subtypes have various methylation patterns. However, recognizing clinically applied (methylation) biomarkers is difficult due to the large number of differentially methylated genes. In this study, we attempted to re-screen a small group of functional biomarkers for the identification and distinction of different breast cancer subtypes with advanced machine learning methods. The findings may contribute to biomarker identification for different breast cancer subtypes and provide a new perspective for differential pathogenesis in breast cancer subtypes.
|
32
|
Chen L, Pan X, Zhang YH, Hu X, Feng K, Huang T, Cai YD. Primary Tumor Site Specificity is Preserved in Patient-Derived Tumor Xenograft Models. Front Genet 2019; 10:738. [PMID: 31456818 PMCID: PMC6701289 DOI: 10.3389/fgene.2019.00738] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 04/13/2019] [Accepted: 07/15/2019] [Indexed: 12/17/2022]
Abstract
Patient-derived tumor xenograft (PDX) mouse models are widely used for drug screening. The underlying assumption is that PDX tissue is very similar to the original patient tissue and has the same response to drug treatment. To investigate whether the primary tumor site information is well preserved in PDX, we analyzed the gene expression profiles of PDX mouse models originating from different tissues, including breast, kidney, large intestine, lung, ovary, pancreas, skin, and soft tissues. The popular Monte Carlo feature selection method was employed to analyze the expression profile, yielding a feature list. From this list, incremental feature selection and support vector machine (SVM) were adopted to extract distinctively expressed genes in PDXs from different primary tumor sites and build an optimal SVM classifier. In addition, we also set up a group of quantitative rules to identify primary tumor sites. A total of 755 genes were extracted by the feature selection procedures, on which the SVM classifier achieved high performance, with an MCC of 0.986, in classifying primary tumor sites originating from different tissues. Furthermore, we obtained 16 classification rules, which gave a lower accuracy but clear classification procedures. These results validated that primary tumor site specificity was well preserved in PDX, as the PDXs from different primary tumor sites were still very different and these PDX differences were similar to the differences observed in patients with tumors. For example, VIM and ABHD17C were highly expressed in the PDX from breast tissue and also highly expressed in breast cancer patients.
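The incremental feature selection (IFS) step that recurs in several abstracts in this list can be sketched as a loop over growing prefixes of a ranked feature list. This is a minimal illustration under stated assumptions: the ranking is taken as given (in the papers it comes from Monte Carlo feature selection), a leave-one-out nearest-centroid classifier stands in for the SVM, accuracy stands in for MCC, and all names and toy data are hypothetical.

```python
def nearest_centroid_loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier (SVM stand-in)."""
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        # Build class centroids from every sample except the held-out one.
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            vec = [row[f] for f in features]
            if label not in sums:
                sums[label] = [0.0] * len(features)
                counts[label] = 0
            sums[label] = [s + v for s, v in zip(sums[label], vec)]
            counts[label] += 1
        held_out = [X[i][f] for f in features]

        def sq_dist(label):
            return sum((s / counts[label] - t) ** 2
                       for s, t in zip(sums[label], held_out))

        correct += min(sums, key=sq_dist) == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features, score_fn):
    """Score the top-k ranked features for k = 1..N and keep the best prefix."""
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        score = score_fn(X, y, ranked_features[:k])
        if score > best_score:  # ties keep the smaller feature set
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score


# Toy data: feature 0 separates the classes, feature 1 is large-scale noise.
X = [[0.0, 5.0], [0.1, -5.0], [0.2, 4.0], [1.0, -4.0], [1.1, 5.0], [0.9, -5.0]]
y = [0, 0, 0, 1, 1, 1]
selected, score = incremental_feature_selection(
    X, y, ranked_features=[0, 1], score_fn=nearest_centroid_loo_accuracy)
print(selected, score)
```

Keeping the smallest prefix among tied scores is a design choice of this sketch; the papers report only that the best-performing feature set defines the optimal classifier.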
Affiliation(s)
- Lei Chen
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China.,College of Information Engineering, Shanghai Maritime University, Shanghai, China.,Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai, China
| | - Xiaoyong Pan
- Department of Medical Informatics, Erasmus Medical Center, Rotterdam, Netherlands
| | - Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Xiaohua Hu
- Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic, Guangzhou, China
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
|
33
|
Li J, Lu L, Zhang YH, Xu Y, Liu M, Feng K, Chen L, Kong X, Huang T, Cai YD. Identification of leukemia stem cell expression signatures through Monte Carlo feature selection strategy and support vector machine. Cancer Gene Ther 2019; 27:56-69. [PMID: 31138902 DOI: 10.1038/s41417-019-0105-y] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 03/19/2019] [Revised: 04/28/2019] [Accepted: 05/04/2019] [Indexed: 01/09/2023]
Abstract
Acute myeloid leukemia (AML) is a type of blood cancer characterized by the rapid growth of immature white blood cells from the bone marrow. Therapy resistance resulting from the persistence of leukemia stem cells (LSCs) is found in numerous patients. Comparative transcriptome studies have been previously conducted to analyze differentially expressed genes between LSC+ and LSC- cells. However, these studies mainly focused on a limited number of genes with the most obvious expression differences between the two cell types. We developed a computational approach incorporating several machine learning algorithms, including Monte Carlo feature selection (MCFS), incremental feature selection (IFS), support vector machine (SVM), and Repeated Incremental Pruning to Produce Error Reduction (RIPPER), to identify gene expression features specific to LSCs. A total of 1159 features (genes) were first identified, which can be used to build the optimal SVM classifier for distinguishing LSC+ and LSC- cells. Among these 1159 genes, the top 17 genes were identified as LSC-specific biomarkers. In addition, six classification rules were produced by the RIPPER algorithm. The subsequent literature review on these features/genes and the classification rules, together with functional enrichment analyses of the 1159 features/genes, confirmed the relevance of the extracted genes and rules to the characteristics of LSCs.
Affiliation(s)
- JiaRui Li
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, P. R. China.,School of Life Sciences, Shanghai University, Shanghai, 200444, P. R. China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York, NY, 10032, USA
| | - Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, P. R. China
| | - YaoChen Xu
- Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, P. R. China
| | - Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, P. R. China
| | - KaiYan Feng
- Department of Computer Science, Guangdong AIB Polytechnic, Guangzhou, 510507, P. R. China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, P. R. China.,Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai, 200241, P. R. China
| | - XiangYin Kong
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, P. R. China.
| | - Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, P. R. China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, P. R. China.
| |
|
34
|
Analysis of Expression Pattern of snoRNAs in Different Cancer Types with Machine Learning Algorithms. Int J Mol Sci 2019; 20:ijms20092185. [PMID: 31052553 PMCID: PMC6539089 DOI: 10.3390/ijms20092185] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 02/24/2019] [Revised: 04/29/2019] [Accepted: 04/30/2019] [Indexed: 01/17/2023]
Abstract
Small nucleolar RNAs (snoRNAs) are a new type of functional small RNAs involved in the chemical modifications of rRNAs, tRNAs, and small nuclear RNAs. It is reported that they play important roles in tumorigenesis via various regulatory modes. snoRNAs can both participate in the regulation of methylation and pseudouridylation and regulate the expression pattern of their host genes. This research investigated the expression pattern of snoRNAs in eight major cancer types in TCGA via several machine learning algorithms. The expression levels of snoRNAs were first analyzed by a powerful feature selection method, Monte Carlo feature selection (MCFS). A feature list and some informative features were obtained. Then, incremental feature selection (IFS) was applied to the feature list to extract the optimal features/snoRNAs, on which the support vector machine (SVM) yields its best performance. The discriminative snoRNAs included HBII-52-14, HBII-336, SNORD123, HBII-85-29, HBII-420, U3, HBI-43, SNORD116, SNORA73B, SCARNA4, HBII-85-20, etc., on which the SVM can provide a Matthews correlation coefficient (MCC) of 0.881 for predicting these eight cancer types. On the other hand, the informative features were fed into the Johnson reducer and repeated incremental pruning to produce error reduction (RIPPER) algorithms to generate classification rules, which can clearly show different snoRNA expression patterns in different cancer types. The analysis results indicated that the extracted discriminative snoRNAs can be important for identifying cancer samples of different types and that the expression pattern of snoRNAs in different cancer types can be partly uncovered by quantitative recognition rules.
|
35
|
Chen L, Pan X, Zhang YH, Kong X, Huang T, Cai YD. Tissue differences revealed by gene expression profiles of various cell lines. J Cell Biochem 2019; 120:7068-7081. [PMID: 30368905 DOI: 10.1002/jcb.27977] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 08/17/2018] [Accepted: 10/04/2018] [Indexed: 01/24/2023]
Abstract
Mechanisms through which tissues are formed and maintained remain unknown but are fundamental aspects in biology. Tissue-specific gene expression is a valuable tool to study such mechanisms. But in many biomedical studies, cell lines, rather than human body tissues, are used to investigate biological mechanisms. Whether or not cell lines maintain their tissue-specific characteristics after they are isolated and cultured outside the human body remains to be explored. In this study, we applied a novel computational method to identify core genes that contribute to the differentiation of cell lines from various tissues. Several advanced computational techniques, such as Monte Carlo feature selection method, incremental feature selection method, and support vector machine (SVM) algorithm, were incorporated in the proposed method, which extensively analyzed the gene expression profiles of cell lines from different tissues. As a result, we extracted a group of functional genes that can indicate the differences of cell lines in different tissues and built an optimal SVM classifier for identifying cell lines in different tissues. In addition, a set of rules for classifying cell lines was also reported, which can give a clearer picture of cell lines in different tissues although its performance was not better than the optimal SVM classifier. Finally, we compared such genes with the tissue-specific genes identified by the Genotype-Tissue Expression project. Results showed that most expression patterns between tissues remained in the derived cell lines despite some uniqueness that some genes show tissue specificity.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, China.,College of Information Engineering, Shanghai Maritime University, Shanghai, China.,Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai, China
| | - Xiaoyong Pan
- Department of Medical Informatics, Erasmus MC, Rotterdam, The Netherlands
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Xiangyin Kong
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
| |
|
36
|
Chen L, Pan X, Zhang YH, Huang T, Cai YD. Analysis of Gene Expression Differences between Different Pancreatic Cells. ACS OMEGA 2019; 4:6421-6435. [DOI: 10.1021/acsomega.8b02171] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 08/30/2023]
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai 200241, China
| | - Xiaoyong Pan
- Department of Medical Informatics, Erasmus MC, Rotterdam 3014ZK, Netherlands
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| |
|
37
|
Chen X, Jin Y, Feng Y. Evaluation of Plasma Extracellular Vesicle MicroRNA Signatures for Lung Adenocarcinoma and Granuloma With Monte-Carlo Feature Selection Method. Front Genet 2019; 10:367. [PMID: 31105742 PMCID: PMC6498093 DOI: 10.3389/fgene.2019.00367] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/18/2018] [Accepted: 04/05/2019] [Indexed: 12/24/2022]
Abstract
Extracellular vesicles (EVs) are a collection of secreted vesicles, including microvesicles, large oncosomes, and exosomes. They can be used in non-invasive diagnosis. MicroRNAs (miRNAs) processed by exosomes can be detected by liquid biopsy. To objectively evaluate the discriminative ability of miRNAs from whole plasma, EV and EV-free plasma, we analyzed the miRNA expression profiles in whole plasma, EV and EV-free plasma of 10 lung adenocarcinoma and 9 granuloma patients. With the Monte-Carlo feature selection method, the top discriminative miRNAs in whole plasma, EV and EV-free plasma were identified, and they were quite different. Using the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) method, we learned the classification rules: in whole plasma, granuloma patients did not express hsa-miR-223-3p while the lung adenocarcinoma patients expressed hsa-miR-223-3p; in EV, hsa-miR-23b-3p was highly expressed in granuloma patients but not in lung adenocarcinoma patients; in EV-free plasma, hsa-miR-376a-3p was expressed in granuloma patients but barely expressed in lung adenocarcinoma patients. For prediction performance, whole plasma had the highest weighted accuracy and EV outperformed EV-free plasma. Our results suggested that EV can be used as a lung cancer biomarker. However, since it is less stable and not easy to detect, there are still technological difficulties to overcome.
Affiliation(s)
- Xiangbo Chen
- Key Laboratory of Molecular Epigenetics of the Ministry of Education, Northeast Normal University, Changchun, China.,Hangzhou Baocheng Biotechnology Co., Ltd., Hangzhou, China
| | - Yunjie Jin
- Department of Oncology, Shanghai Putuo People's Hospital, Shanghai, China
| | - Yu Feng
- Shuguang Hospital, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| |
|
38
|
Zhang Y, Dong D, Li D, Lu L, Li J, Zhang Y, Chen L. Computational Method for the Identification of Molecular Metabolites Involved in Cereal Hull Color Variations. Comb Chem High Throughput Screen 2019; 21:760-770. [DOI: 10.2174/1386207322666190129105441] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/14/2018] [Revised: 08/02/2018] [Accepted: 08/16/2018] [Indexed: 11/22/2022]
Abstract
Background:
Cereal hull color is an important quality specification characteristic. Many
studies were conducted to identify genetic changes underlying cereal hull color diversity. However,
these studies mainly focused on the gene level. Recent studies have suggested that metabolomics can
accurately reflect the integrated and real-time cell processes that contribute to the formation of
different cereal colors.
Methods:
In this study, we exploited published metabolomics databases and applied several
advanced computational methods, such as minimum redundancy maximum relevance (mRMR),
incremental forward search (IFS), and random forest (RF), to investigate cereal hull color at the metabolic
level. First, the mRMR was applied to analyze cereal hull samples represented by metabolite
features, yielding a feature list. Then, the IFS and RF were used to test several feature sets,
constructed according to the aforementioned feature list. Finally, the optimal feature sets and RF
classifier were obtained based on the testing results.
Results and Conclusion:
A total of 158 key metabolites were found to be useful in distinguishing
white cereal hulls from colorful cereal hulls. A prediction model constructed with these metabolites
and a random forest algorithm generated a high Matthews correlation coefficient value of 0.701.
Furthermore, 24 of these metabolites were previously found to be relevant to cereal color. Our study
can provide new insights into the molecular basis of cereal hull color formation.
Affiliation(s)
- Yunhua Zhang
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, Anhui, China
| | - Dong Dong
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, Anhui, China
| | - Dai Li
- Anhui Province Key Laboratory of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, Anhui, China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York, United States
| | - JiaRui Li
- School of Life Sciences, Shanghai University, Shanghai, China
| | - YuHang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Lijuan Chen
- College of Animal Science and Technology, Anhui Agricultural University, Hefei, Anhui, China
| |
|
39
|
Wang T, Chen L, Zhao X. Prediction of Drug Combinations with a Network Embedding Method. Comb Chem High Throughput Screen 2019; 21:789-797. [DOI: 10.2174/1386207322666181226170140] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 09/26/2018] [Revised: 11/02/2018] [Accepted: 11/28/2018] [Indexed: 01/10/2023]
Abstract
Aim and Objective:
There are several diseases having a complicated mechanism. For such
complicated diseases, a single drug cannot treat them very well because these diseases always
involve several targets and single targeted drugs cannot modulate these targets simultaneously. Drug
combination is an effective way to treat such diseases. However, determination of effective drug
combinations is time- and cost-consuming via traditional methods. It is urgent to build quick and
cheap methods in this regard. Designing effective computational methods incorporating advanced
computational techniques to predict drug combinations is an alternative and feasible way.
Method:
In this study, we proposed a novel network embedding method, which can extract
topological features of each drug combination from a drug network that was constructed using
chemical-chemical interaction information retrieved from STITCH. These topological features were
combined with individual features of drug combination reported in one previous study. Several
advanced computational methods were employed to construct an effective prediction model, such as
the synthetic minority oversampling technique (SMOTE), which was used to tackle the imbalanced dataset,
and the minimum redundancy maximum relevance (mRMR) and incremental feature selection (IFS)
methods, which were adopted to analyze features and extract optimal features for building an optimal
support vector machine (SVM) classifier.
Results and Conclusion:
The constructed optimal SVM classifier yielded an MCC of 0.806, which
is superior to the classifier only using individual features with or without SMOTE. The performance
of the classifier can be improved by combining the topological features and essential features of a
drug combination.
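The oversampling idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study: it assumes numeric feature vectors, squared Euclidean distance, and a hypothetical function name and toy data; each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest neighbours of x among the other minority samples
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy minority class (hypothetical values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already covered by the minority class.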
Affiliation(s)
- Tianyun Wang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Xian Zhao
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
|
40
|
Deng M, Lv XD, Fang ZX, Xie XS, Chen WY. The blood transcriptional signature for active and latent tuberculosis. Infect Drug Resist 2019; 12:321-328. [PMID: 30787624 PMCID: PMC6363485 DOI: 10.2147/idr.s184640] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 01/14/2023]
Abstract
BACKGROUND Although the incidence of tuberculosis (TB) has dropped substantially, it is still a serious threat to human health. In recent years, the emergence of resistant bacilli and inadequate disease control and prevention have led to a significant rise in the global TB epidemic. TB is known to be caused by Mycobacterium tuberculosis infection, but it is not clear why the disease is active in some infected patients and latent in others. METHODS We analyzed the blood gene expression profiles of 69 latent TB patients and 54 active pulmonary TB patients from the Gene Expression Omnibus (GEO) database. RESULTS By applying minimal redundancy maximal relevance and incremental feature selection, we identified 24 signature genes that can predict TB activation. The support vector machine predictor based on these 24 genes had a sensitivity of 0.907, specificity of 0.913, and accuracy of 0.911. Although they need to be validated in a large independent dataset, the biological analysis of these 24 genes showed great promise. CONCLUSION We found that cytokine production was a key process during TB activation and that genes such as CYBB, TSPO, CD36, and STAT1 are worth further investigation.
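As a worked check of how sensitivity, specificity, and accuracy relate to the cohort sizes, the sketch below treats active TB as the positive class. The confusion-matrix counts are an assumption back-calculated from the reported rates (49 of 54 active and 63 of 69 latent patients classified correctly); the abstract does not report them.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of positives recovered
    specificity = tn / (tn + fp)   # fraction of negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Assumed counts (not given in the abstract): 54 active = 49 TP + 5 FN,
# 69 latent = 63 TN + 6 FP.
sn, sp, acc = binary_metrics(tp=49, fn=5, tn=63, fp=6)
print(round(sn, 3), round(sp, 3), round(acc, 3))  # 0.907 0.913 0.911
```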
Affiliation(s)
- Min Deng
- Department of Infectious Diseases, The First Hospital of Jiaxing, The First Affiliated Hospital of Jiaxing University, Jiaxing 314000, China,
| | - Xiao-Dong Lv
- Department of Respiration, The First Hospital of Jiaxing, The First Affiliated Hospital of Jiaxing University, Jiaxing 314000, China
| | - Zhi-Xian Fang
- Department of Respiration, The First Hospital of Jiaxing, The First Affiliated Hospital of Jiaxing University, Jiaxing 314000, China
| | - Xin-Sheng Xie
- Department of Infectious Diseases, The First Hospital of Jiaxing, The First Affiliated Hospital of Jiaxing University, Jiaxing 314000, China,
| | - Wen-Yu Chen
- Department of Respiration, The First Hospital of Jiaxing, The First Affiliated Hospital of Jiaxing University, Jiaxing 314000, China
| |
|
41
|
Chen L, Pan X, Zhang YH, Liu M, Huang T, Cai YD. Classification of Widely and Rarely Expressed Genes with Recurrent Neural Network. Comput Struct Biotechnol J 2018; 17:49-60. [PMID: 30595815 PMCID: PMC6307323 DOI: 10.1016/j.csbj.2018.12.002] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 09/26/2018] [Revised: 12/07/2018] [Accepted: 12/09/2018] [Indexed: 02/06/2023]
Abstract
Tissue-specific gene expression shapes the formation of tissues, while gene expression changes reflect the immune response of the human body to environmental stimulations or pressure, particularly in disease conditions, such as cancers. A few genes are commonly expressed across tissues or various cancers, while others are not. To investigate the functional differences between widely and rarely expressed genes, we defined the genes that were expressed in 32 normal tissues/cancers (i.e., widely expressed genes; FPKM >1 in all samples) and those that were not detected (i.e., rarely expressed genes; FPKM <1 in all samples) based on the large gene expression data set provided by Uhlen et al. Each gene was encoded using the gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment scores. Minimum redundancy maximum relevance (mRMR) was used to measure and rank these features on the mRMR feature list. Thereafter, we applied the incremental feature selection method with a supervised classifier, a recurrent neural network (RNN), to select the discriminative features for classifying widely expressed genes from rarely expressed genes and to construct an optimum RNN classifier. The Youden's indexes generated by the optimum RNN classifier and evaluated using a 10-fold cross validation were 0.739 for normal tissues and 0.639 for cancers. Furthermore, the underlying mechanisms of the key discriminative GO and KEGG features were analyzed. Results can facilitate the identification of the expression landscape of genes and elucidation of how gene expression shapes tissues and the microenvironment of cancers. Some genes are widely expressed across tissues or various cancers. A number of genes are rarely expressed across tissues or various cancers. The functional differences between widely and rarely expressed genes were studied. Several GO terms and KEGG pathways were extracted and analyzed.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, People's Republic of China.,College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.,Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai 200241, People's Republic of China
| | - XiaoYong Pan
- Department of Medical Informatics, Erasmus MC, Rotterdam, the Netherlands
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People's Republic of China
| | - Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People's Republic of China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, People's Republic of China
| |
|
42
|
Chen L, Zhang S, Pan X, Hu X, Zhang YH, Yuan F, Huang T, Cai YD. HIV infection alters the human epigenetic landscape. Gene Ther 2018; 26:29-39. [PMID: 30443044 DOI: 10.1038/s41434-018-0051-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 08/17/2018] [Revised: 10/30/2018] [Accepted: 10/31/2018] [Indexed: 02/07/2023]
Abstract
Many complex diseases or traits are the results of both genetic and environmental factors. The environmental factors affect the human body by modifying its epigenetics, which controls the activity of the genome without mutating it. Viral infection is one of the common environmental factors for complex diseases. For example, human immunodeficiency virus (HIV) infection can cause acquired immune deficiency syndrome (AIDS), HBV and HCV infections are associated with hepatocellular carcinoma, and human papillomavirus infection is a causal factor in cervical carcinoma. In this study, to investigate how HIV infection affects DNA methylation, we analyzed the blood DNA methylation data of 485,512 sites in 44 HIV- and 142 HIV+ patients. Several advanced computational methods were applied to identify the core distinctive features that differed between the HIV patients and the healthy controls. These methods can be used for differentiating HIV-infected patients from uninfected ones. These core distinctive DNA methylation features were confirmed to be functionally connected to premature aging and abnormal immune regulation, two typical pathological symptoms of HIV infection, revealing the potential regulatory mechanisms of HIV infection on the DNA methylation status of the host cells and providing novel insights into the pathogenesis of HIV infection and AIDS.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, 200444, China.,Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai, 200241, China.,College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, China
| | - Shiqi Zhang
- Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | - Xiaoyong Pan
- Department of Medical Informatics, Erasmus MC, Rotterdam, Netherlands
| | - XiaoHua Hu
- Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, 200438, China
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Fei Yuan
- Department of Science & Technology, Binzhou Medical University Hospital, Binzhou, 256603, Shandong, China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, China.
| |
|
43
|
The early detection of asthma based on blood gene expression. Mol Biol Rep 2018; 46:217-223. [PMID: 30421126 DOI: 10.1007/s11033-018-4463-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 08/29/2018] [Accepted: 11/01/2018] [Indexed: 01/10/2023]
Abstract
Asthma is a complex heterogeneous disorder with a hereditary tendency, and the most widely used therapy is inhalation of anti-inflammatory corticosteroids, which has systemic side effects. If chronic inflammation can be detected at an early stage, the dosage of corticosteroids can be kept low and the side effects avoided. Therefore, to discover early-stage blood biomarkers for asthma, we analyzed the gene expression profiles in the blood of 77 moderate asthma patients and 87 healthy controls. With advanced feature selection methods, minimal Redundancy Maximal Relevance and Incremental Feature Selection, we identified 31 genes, such as MYD88, ZFP36, CCR3 and CYP3A5, as the optimal asthma biomarkers. The sensitivity, specificity and accuracy of the 31-gene Support Vector Machine predictor evaluated with Leave-One-Out Cross Validation were 0.870, 0.816 and 0.841, respectively. A literature survey showed that many of the biomarker genes have asthma-associated functions. Our results provide not only easy-to-apply blood gene expression biomarkers for early detection of asthma but also an explainable qualitative model with biological significance.
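The ranking step of minimal Redundancy Maximal Relevance can be sketched as a greedy loop. This is one common formulation (relevance minus mean redundancy, both measured by mutual information) and is not taken from the paper; it assumes already-discretized features, and the function names, gene labels, and toy values are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )

def mrmr_rank(columns, target):
    """Greedy mRMR: repeatedly add the feature with the highest
    relevance to the target minus mean redundancy with chosen features."""
    relevance = {name: mutual_information(col, target)
                 for name, col in columns.items()}
    ranked, remaining = [], list(columns)
    while remaining:
        def score(name):
            if not ranked:
                return relevance[name]
            redundancy = sum(
                mutual_information(columns[name], columns[s]) for s in ranked
            ) / len(ranked)
            return relevance[name] - redundancy
        best = max(remaining, key=score)  # ties go to the earlier feature
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy discretized expression levels (hypothetical): gene_b duplicates gene_a,
# gene_c is weakly informative but carries different information.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "gene_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "gene_b": [0, 0, 0, 1, 1, 1, 1, 1],
    "gene_c": [0, 1, 0, 0, 1, 1, 0, 1],
}
ranking = mrmr_rank(columns, target)
print(ranking)  # ['gene_a', 'gene_c', 'gene_b']
```

The duplicate gene_b ends up last even though it is as relevant as gene_a, because its redundancy with the already selected gene_a outweighs its relevance.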
|
44
|
Identification of the Gene Expression Rules That Define the Subtypes in Glioma. J Clin Med 2018; 7:jcm7100350. [PMID: 30322114 PMCID: PMC6210469 DOI: 10.3390/jcm7100350] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 09/13/2018] [Revised: 10/09/2018] [Accepted: 10/11/2018] [Indexed: 11/16/2022]
Abstract
As a common brain cancer derived from glial cells, gliomas have three subtypes: glioblastoma, diffuse astrocytoma, and anaplastic astrocytoma. The subtypes have distinctive clinical features but are closely related to each other. A glioblastoma can be derived from the early stage of diffuse astrocytoma, which can be transformed into anaplastic astrocytoma. Due to the complexity of these dynamic processes, single-cell gene expression profiles are extremely helpful to understand what defines these subtypes. We analyzed the single-cell gene expression profiles of 5057 cells of anaplastic astrocytoma tissues, 261 cells of diffuse astrocytoma tissues, and 1023 cells of glioblastoma tissues with advanced machine learning methods. In detail, a powerful feature selection method, Monte Carlo feature selection (MCFS) method, was adopted to analyze the gene expression profiles of cells, resulting in a feature list. Then, the incremental feature selection (IFS) method was applied to the obtained feature list, with the help of support vector machine (SVM), to extract key features (genes) and construct an optimal SVM classifier. Several key biomarker genes, such as IGFBP2, IGF2BP3, PRDX1, NOV, NEFL, HOXA10, GNG12, SPRY4, and BCL11A, were identified. In addition, the underlying rules of classifying the three subtypes were produced by Johnson reducer algorithm. We found that in diffuse astrocytoma, PRDX1 is highly expressed, and in glioblastoma, the expression level of PRDX1 is low. These rules revealed the difference among the three subtypes, and how they are formed and transformed. These genes are not only biomarkers for glioma subtypes, but also drug targets that may switch the clinical features or even reverse the tumor progression.
|
45
|
Lin H, Qiu X, Zhang B, Zhang J. Identification of the predictive genes for the response of colorectal cancer patients to FOLFOX therapy. Onco Targets Ther 2018; 11:5943-5955. [PMID: 30271178 PMCID: PMC6149834 DOI: 10.2147/ott.s167656] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/11/2022]
Abstract
Background Colorectal cancer is a malignant tumor with a high death rate. Chemotherapy, radiotherapy and surgery are the three common treatments for colorectal cancer. For early colorectal cancer patients, postoperative adjuvant chemotherapy can reduce the risk of recurrence. For advanced colorectal cancer patients, palliative chemotherapy can significantly improve quality of life and prolong survival. FOLFOX is one of the mainstream chemotherapies in colorectal cancer; however, its response rate is only about 50%. Methods To systematically investigate why some colorectal cancer patients respond to FOLFOX therapy while others do not, we searched all publicly available databases and combined three gene expression datasets of colorectal cancer patients treated with FOLFOX. With the minimal redundancy maximal relevance and incremental feature selection methods, we identified the biomarker genes. Results A Support Vector Machine-based classifier was constructed to predict the response of colorectal cancer patients to FOLFOX therapy. Its accuracy, sensitivity and specificity were 0.854, 0.845 and 0.863, respectively. Conclusion The biological analysis of representative biomarker genes suggested that apoptosis and inflammation signaling pathways were essential for the response of colorectal cancer patients to FOLFOX chemotherapy.
Affiliation(s)
- Hengjun Lin
- Department of Tumor, Anus and Intestine, Jinhua People's Hospital, Jinhua, Zhejiang 321000, China,
| | - Xueke Qiu
- Department of Tumor, Anus and Intestine, Jinhua People's Hospital, Jinhua, Zhejiang 321000, China,
| | - Bo Zhang
- Department of Tumor, Anus and Intestine, Jinhua People's Hospital, Jinhua, Zhejiang 321000, China,
| | - Jichao Zhang
- Department of Tumor, Anus and Intestine, Jinhua People's Hospital, Jinhua, Zhejiang 321000, China,
| |
|
46
|
Pan X, Hu X, Zhang YH, Chen L, Zhu L, Wan S, Huang T, Cai YD. Identification of the copy number variant biomarkers for breast cancer subtypes. Mol Genet Genomics 2018; 294:95-110. [PMID: 30203254 DOI: 10.1007/s00438-018-1488-4] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Received: 03/21/2018] [Accepted: 09/03/2018] [Indexed: 01/07/2023]
Abstract
Breast cancer is a common and threatening malignant disease with multiple biological and clinical subtypes. It can be categorized into subtypes of luminal A, luminal B, Her2 positive, and basal-like. Copy number variants (CNVs) have been reported to be potential and even better biomarkers for cancer diagnosis than mRNA biomarkers, because they are considerably more stable and robust than gene expression. Thus, it is meaningful to detect CNVs of different cancers. To identify the CNV biomarkers for breast cancer subtypes, we integrated the CNV data of more than 2000 samples from two large breast cancer databases, METABRIC and The Cancer Genome Atlas (TCGA). A computational method based on Monte Carlo feature selection and incremental feature selection was proposed and tested to identify the distinctive core CNVs in different breast cancer subtypes. We identified the CNV genes that may contribute to breast cancer tumorigenesis and built a set of quantitative distinctive rules for recognition of the breast cancer subtypes. The Matthews correlation coefficient (MCC) values for tenfold cross-validation on the METABRIC training set and the independent test on the TCGA dataset were 0.515 and 0.492, respectively. The CNVs of PGAP3, GRB7, MIR4728, PNMT, STARD3, TCAP and ERBB2 were important for the accurate diagnosis of breast cancer subtypes. The findings reported in this study may further uncover the differences between breast cancer subtypes and improve diagnostic accuracy.
Affiliation(s)
- Xiaoyong Pan
- College of Life Science, Shanghai University, Shanghai, 200444, People's Republic of China.,Department of Medical Informatics, Erasmus MC, Rotterdam, The Netherlands
| | - XiaoHua Hu
- Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, 200438, People's Republic of China
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.,Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai, 200241, People's Republic of China
| | - LiuCun Zhu
- College of Life Science, Shanghai University, Shanghai, 200444, People's Republic of China
| | - ShiBao Wan
- College of Life Science, Shanghai University, Shanghai, 200444, People's Republic of China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China.
| | - Yu-Dong Cai
- College of Life Science, Shanghai University, Shanghai, 200444, People's Republic of China.
| |
|
47
|
Li J, Lan CN, Kong Y, Feng SS, Huang T. Identification and Analysis of Blood Gene Expression Signature for Osteoarthritis With Advanced Feature Selection Methods. Front Genet 2018; 9:246. [PMID: 30214455 PMCID: PMC6125376 DOI: 10.3389/fgene.2018.00246] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 05/03/2018] [Accepted: 06/22/2018] [Indexed: 12/15/2022]
Abstract
Osteoarthritis (OA) is a complex disease that affects articular joints and may cause disability. The incidence of OA is extremely high. Most elderly people have symptoms of osteoarthritis. The physiotherapy of OA is time consuming, and the chances of full recovery from OA are minimal. The most effective way of fighting OA is early diagnosis and early intervention. Liquid biopsy has become a popular noninvasive test. To find the blood gene expression signature for OA, we reanalyzed the publicly available blood gene expression profiles of 106 patients with OA and 33 control samples using an automatic computational pipeline based on advanced feature selection methods. Finally, a compact 23-gene set was identified. On the basis of these 23 genes, we constructed a Support Vector Machine (SVM) classifier and evaluated it with leave-one-out cross-validation. Its sensitivity (Sn), specificity (Sp), accuracy (ACC), and Matthews correlation coefficient (MCC) were 0.991, 0.909, 0.971, and 0.920, respectively. The performance still needs to be validated in an independent large dataset, but the in-depth biological analysis of the 23 biomarkers showed great promise and suggested that the mRNA surveillance pathway and multicellular organism growth play important roles in OA. Our results shed light on OA diagnosis through liquid biopsy.
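The reported MCC can be related to the cohort sizes with a short worked calculation. The confusion-matrix counts below are an assumption back-calculated from the reported Sn, Sp, and ACC (105 of 106 patients and 30 of 33 controls classified correctly); the abstract does not state them, and the function name is illustrative.

```python
import math

def matthews_cc(tp, fn, tn, fp):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Assumed counts (not given in the abstract): 106 OA = 105 TP + 1 FN,
# 33 controls = 30 TN + 3 FP.
mcc = matthews_cc(tp=105, fn=1, tn=30, fp=3)
print(f"{mcc:.3f}")  # 0.920
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a more informative summary when the classes are as imbalanced as 106 versus 33.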
Affiliation(s)
- Jing Li
- Department of Rehabilitation, The Second Xiangya Hospital, Central South University, Changsha, China
| | - Chun-Na Lan
- Department of Rehabilitation, The Second Xiangya Hospital, Central South University, Changsha, China
| | - Ying Kong
- Department of Rehabilitation, The Second Xiangya Hospital, Central South University, Changsha, China
| | - Song-Shan Feng
- Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| |
|
48
|
Identifying Patients with Atrioventricular Septal Defect in Down Syndrome Populations by Using Self-Normalizing Neural Networks and Feature Selection. Genes (Basel) 2018; 9:genes9040208. [PMID: 29649131 PMCID: PMC5924550 DOI: 10.3390/genes9040208] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 03/15/2018] [Revised: 03/28/2018] [Accepted: 04/03/2018] [Indexed: 02/06/2023]
Abstract
Atrioventricular septal defect (AVSD) is a clinically significant subtype of congenital heart disease (CHD) that severely influences the health of babies during birth and is associated with Down syndrome (DS). Thus, exploring the differences in functional genes in DS samples with and without AVSD is a critical way to investigate the complex association between AVSD and DS. In this study, we present a computational method to distinguish DS patients with AVSD from those without AVSD using the newly proposed self-normalizing neural network (SNN). First, each patient was encoded by using the copy number of probes on chromosome 21. The encoded features were ranked by the Monte Carlo feature selection (MCFS) method to obtain a ranked feature list. Based on this feature list, we used a two-stage incremental feature selection to construct two series of feature subsets and applied SNNs to build classifiers to identify optimal features. Results show that 2737 optimal features were obtained, and the corresponding optimal SNN classifier constructed on optimal features yielded a Matthews correlation coefficient (MCC) value of 0.748. For comparison, random forest was also used to build classifiers and uncover optimal features. This method achieved an optimal MCC value of 0.582 when the top 132 features were utilized. Finally, we analyzed some key features, derived from the optimal features in SNNs, that have literature support, to further reveal their essential roles.
|
49
|
Wang D, Li JR, Zhang YH, Chen L, Huang T, Cai YD. Identification of Differentially Expressed Genes between Original Breast Cancer and Xenograft Using Machine Learning Algorithms. Genes (Basel) 2018. [PMID: 29534550 PMCID: PMC5867876 DOI: 10.3390/genes9030155] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Indexed: 12/19/2022]
Abstract
Breast cancer is one of the most common malignancies in women. The patient-derived tumor xenograft (PDX) model is a cutting-edge approach for drug research on breast cancer. However, PDX still exhibits differences from original human tumors, thereby challenging the molecular understanding of tumorigenesis. In particular, gene expression changes after tissues are transplanted from human to mouse model. In this study, we propose a novel computational method by incorporating several machine learning algorithms, including Monte Carlo feature selection (MCFS), random forest (RF), and rough set-based rule learning, to identify genes with significant expression differences between PDX and original human tumors. First, 831 breast tumors, including 657 PDX and 174 human tumors, were collected. Based on MCFS and RF, 32 genes were then identified to be informative for the prediction of PDX and human tumors and can be used to construct a prediction model. The prediction model exhibits a Matthews correlation coefficient (MCC) value of 0.777. Seven interpretable interactions within the informative genes were detected based on the rough set-based rule learning. Furthermore, the seven interpretable interactions are well supported by previous experimental studies. Our study not only presents a method for identifying informative genes with differential expression but also provides insights into the mechanism through which gene expression changes after being transplanted from human tumor into mouse model. This work would be helpful for research and drug development for breast cancer.
Affiliation(s)
- Deling Wang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
- Department of Medical Imaging, Sun Yat-sen University Cancer Center, State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou 510060, China.
| | - Jia-Rui Li
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| | - Yu-Hang Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| |
|