1
Hadiby S, Ben Ali YM. Integrating pharmacophore model and deep learning for activity prediction of molecules with BRCA1 gene. J Bioinform Comput Biol 2024; 22:2450003. [PMID: 38567386 DOI: 10.1142/s0219720024500033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
In this paper, we propose a novel approach for predicting the activity/inactivity of molecules with the BRCA1 gene by combining pharmacophore modeling and deep learning techniques. Initially, we generated 3D pharmacophore fingerprints using a pharmacophore model, which captures the essential features and spatial arrangements critical for biological activity. These fingerprints served as informative representations of the molecular structures. Next, we employed deep learning algorithms to train a predictive model using the generated pharmacophore fingerprints. The deep learning model was designed to learn complex patterns and relationships between the pharmacophore features and the corresponding activity/inactivity labels of the molecules. By utilizing this integrated approach, we aimed to enhance the accuracy and efficiency of activity prediction. To validate the effectiveness of our approach, we conducted experiments using a dataset of known molecules with BRCA1 gene activity/inactivity from diverse sources. Our results demonstrated promising predictive performance, indicating the successful integration of pharmacophore modeling and deep learning. Furthermore, we utilized the trained model to predict the activity/inactivity of unknown molecules extracted from the ChEMBL database. The predictions obtained from the ChEMBL database were assessed and compared against experimentally determined values to evaluate the reliability and generalizability of our model. Overall, our proposed approach showcased significant potential in accurately predicting the activity/inactivity of molecules with the BRCA1 gene, thus enabling the identification of potential candidates for further investigation in drug discovery and development processes.
Affiliation(s)
- Seloua Hadiby
- Department of Computer Science, Computer Research Laboratory, Badji Mokhtar University, Annaba, Algeria
- Yamina Mohamed Ben Ali
- Department of Computer Science, Computer Research Laboratory, Badji Mokhtar University, Annaba, Algeria
2
Chen L, Qu R, Liu X. Improved multi-label classifiers for predicting protein subcellular localization. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:214-236. [PMID: 38303420 DOI: 10.3934/mbe.2024010] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 02/03/2024]
Abstract
Protein functions are closely related to their subcellular locations. At present, the prediction of protein subcellular locations is one of the most important problems in protein science. The evident defects of traditional methods make it urgent to design methods with high efficiency and low costs. To date, lots of computational methods have been proposed. However, this problem is far from being completely solved. Recently, some multi-label classifiers have been proposed to identify subcellular locations of human, animal, Gram-negative bacterial and eukaryotic proteins. These classifiers adopted the protein features derived from gene ontology information. Although they provided good performance, they can be further improved by adopting more powerful machine learning algorithms. In this study, four improved multi-label classifiers were set up for identification of subcellular locations of the above four protein types. The random k-labelsets (RAKEL) algorithm was used to tackle proteins with multiple locations, and random forest was used as the basic prediction engine. All classifiers were tested by jackknife test, indicating their high performance. Comparisons with previous classifiers further confirmed the superiority of the proposed classifiers.
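The random k-labelsets idea named in this abstract can be illustrated with a minimal sketch. It assumes the disjoint variant of RAKEL (labels partitioned into groups of size k), which the abstract does not specify, and it leaves out the base classifier (random forest in the paper). The function names and the toy location labels are hypothetical.

```python
import random

def make_labelsets(labels, k, seed=0):
    """Randomly partition the labels into disjoint groups of at most k labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]

def to_powerset_class(sample_labels, labelset):
    """Encode the labels of one labelset that a sample carries as a single class."""
    return frozenset(sample_labels) & frozenset(labelset)

def combine_predictions(predicted_classes):
    """Merge the per-labelset predictions back into one multi-label answer."""
    return set().union(*predicted_classes)

# Toy example: five hypothetical subcellular locations, labelsets of size 2.
locations = ["nucleus", "cytoplasm", "membrane", "mitochondrion", "golgi"]
labelsets = make_labelsets(locations, k=2)

# A protein found in two locations becomes one single-label target per labelset.
protein = {"nucleus", "golgi"}
targets = [to_powerset_class(protein, s) for s in labelsets]

# With perfect per-labelset predictions, the original label set is recovered.
recovered = combine_predictions(targets)
```

Each labelset turns the multi-label problem into an ordinary multi-class problem, so any single-label classifier could be trained on the `targets` of its labelset.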
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Ruyun Qu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Xintong Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
3
McNair D. Artificial Intelligence and Machine Learning for Lead-to-Candidate Decision-Making and Beyond. Annu Rev Pharmacol Toxicol 2023; 63:77-97. [PMID: 35679624 DOI: 10.1146/annurev-pharmtox-051921-023255] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 01/25/2023]
Abstract
The use of artificial intelligence (AI) and machine learning (ML) in pharmaceutical research and development has to date focused on research: target identification; docking-, fragment-, and motif-based generation of compound libraries; modeling of synthesis feasibility; rank-ordering likely hits according to structural and chemometric similarity to compounds having known activity and affinity to the target(s); optimizing a smaller library for synthesis and high-throughput screening; and combining evidence from screening to support hit-to-lead decisions. Applying AI/ML methods to lead optimization and lead-to-candidate (L2C) decision-making has shown slower progress, especially regarding predicting absorption, distribution, metabolism, excretion, and toxicology properties. The present review surveys reasons why this is so, reports progress that has occurred in recent years, and summarizes some of the issues that remain. Effective AI/ML tools to derisk L2C and later phases of development are important to accelerate the pharmaceutical development process, ameliorate escalating development costs, and achieve greater success rates.
Affiliation(s)
- Douglas McNair
- Global Health, Integrated Development, Bill & Melinda Gates Foundation, Seattle, Washington, USA
4
Philip AK, Samuel BA, Bhatia S, Khalifa SAM, El-Seedi HR. Artificial Intelligence and Precision Medicine: A New Frontier for the Treatment of Brain Tumors. Life (Basel) 2022; 13:24. [PMID: 36675973 PMCID: PMC9866715 DOI: 10.3390/life13010024] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/14/2022] [Revised: 12/08/2022] [Accepted: 12/19/2022] [Indexed: 12/24/2022]
Abstract
Brain tumors are a widespread and serious neurological phenomenon that can be life-threatening. The computing field has allowed for the development of artificial intelligence (AI), which can mimic the neural network of the human brain. One use of this technology has been to help researchers capture hidden, high-dimensional images of brain tumors. These images can provide new insights into the nature of brain tumors and help to improve treatment options. AI and precision medicine (PM) are converging to revolutionize healthcare. AI has the potential to improve cancer imaging interpretation in several ways, including more accurate tumor genotyping, more precise delineation of tumor volume, and better prediction of clinical outcomes. AI-assisted brain surgery can be an effective and safe option for treating brain tumors. This review discusses various AI and PM techniques that can be used in brain tumor treatment. These new techniques for the treatment of brain tumors, i.e., genomic profiling, microRNA panels, quantitative imaging, and radiomics, hold great promise for the future. However, there are challenges that must be overcome for these technologies to reach their full potential and improve healthcare.
Affiliation(s)
- Anil K. Philip
- School of Pharmacy, University of Nizwa, Birkat Al Mouz, Nizwa 616, Oman
- Betty Annie Samuel
- School of Pharmacy, University of Nizwa, Birkat Al Mouz, Nizwa 616, Oman
- Saurabh Bhatia
- Natural and Medical Science Research Center, University of Nizwa, Birkat Al Mouz, Nizwa 616, Oman
- Shaden A. M. Khalifa
- Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, S-106 91 Stockholm, Sweden
- Hesham R. El-Seedi
- International Research Center for Food Nutrition and Safety, Jiangsu University, Zhenjiang 212013, China
- Pharmacognosy Group, Department of Pharmaceutical Biosciences, BMC, Uppsala University, SE-751 24 Uppsala, Sweden
- International Joint Research Laboratory of Intelligent Agriculture and Agri-Products Processing, Jiangsu Education Department, Jiangsu University, Nanjing 210024, China
5
Deep Learning Based-Virtual Screening Using 2D Pharmacophore Fingerprint in Drug Discovery. Neural Process Lett 2022. [DOI: 10.1007/s11063-022-10879-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
6
Mansoor M, Nauman M, Ur Rehman H, Benso A. Gene Ontology GAN (GOGAN): a novel architecture for protein function prediction. Soft comput 2022. [DOI: 10.1007/s00500-021-06707-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/21/2022]
7
Wu X, Zeng W, Lin F, Zhou X. NeuRank: learning to rank with neural networks for drug-target interaction prediction. BMC Bioinformatics 2021; 22:567. [PMID: 34836495 PMCID: PMC8620576 DOI: 10.1186/s12859-021-04476-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/18/2021] [Accepted: 11/08/2021] [Indexed: 01/09/2023]
Abstract
BACKGROUND Experimental verification of a drug discovery process is expensive and time-consuming. Therefore, recently, the demand to more efficiently and effectively identify drug-target interactions (DTIs) has intensified. RESULTS We treat the prediction of DTIs as a ranking problem and propose a neural network architecture, NeuRank, to address it. Also, we assume that similar drug compounds are likely to interact with similar target proteins. Thus, in our model, we add drug and target similarities, which are very effective at improving the prediction of DTIs. Then, we develop NeuRank from a point-wise to a pair-wise, and further to list-wise model. CONCLUSION Finally, results from extensive experiments on five public data sets (DrugBank, Enzymes, Ion Channels, G-Protein-Coupled Receptors, and Nuclear Receptors) show that, in identifying DTIs, our models achieve better performance than other state-of-the-art methods.
Affiliation(s)
- Xiujin Wu
- School of Informatics, Xiamen University, Xiamen, China
- Wenhua Zeng
- School of Informatics, Xiamen University, Xiamen, China
- Fan Lin
- School of Informatics, Xiamen University, Xiamen, China
- Xiuze Zhou
- Shuye Technology Co., Ltd., Hangzhou, China
8
iMPT-FDNPL: Identification of Membrane Protein Types with Functional Domains and a Natural Language Processing Approach. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:7681497. [PMID: 34671418 PMCID: PMC8523280 DOI: 10.1155/2021/7681497] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 08/22/2021] [Revised: 09/15/2021] [Accepted: 09/27/2021] [Indexed: 12/20/2022]
Abstract
Membrane protein is an important kind of proteins. It plays essential roles in several cellular processes. Based on the intramolecular arrangements and positions in a cell, membrane proteins can be divided into several types. It is reported that the types of a membrane protein are highly related to its functions. Determination of membrane protein types is a hot topic in recent years. A plenty of computational methods have been proposed so far. Some of them used functional domain information to encode proteins. However, this procedure was still crude. In this study, we designed a novel feature extraction scheme to obtain informative features of proteins from their functional domain information. Such scheme termed domains as words and proteins, represented by its domains, as sentences. The natural language processing approach, word2vector, was applied to access the features of domains, which were further refined to protein features. Based on these features, RAndom k-labELsets with random forest as the base classifier was employed to build the multilabel classifier, namely, iMPT-FDNPL. The tenfold cross-validation results indicated the good performance of such classifier. Furthermore, such classifier was superior to other classifiers based on features derived from functional domains via one-hot scheme or derived from other properties of proteins, suggesting the effectiveness of protein features generated by the proposed scheme.
9
Carracedo-Reboredo P, Liñares-Blanco J, Rodríguez-Fernández N, Cedrón F, Novoa FJ, Carballal A, Maojo V, Pazos A, Fernandez-Lozano C. A review on machine learning approaches and trends in drug discovery. Comput Struct Biotechnol J 2021; 19:4538-4558. [PMID: 34471498 PMCID: PMC8387781 DOI: 10.1016/j.csbj.2021.08.011] [Citation(s) in RCA: 127] [Impact Index Per Article: 42.3] [Received: 03/01/2021] [Revised: 08/06/2021] [Accepted: 08/06/2021] [Indexed: 12/30/2022]
Abstract
Drug discovery aims at finding new compounds with specific chemical properties for the treatment of diseases. In the last years, the approach used in this search presents an important component in computer science with the skyrocketing of machine learning techniques due to its democratization. With the objectives set by the Precision Medicine initiative and the new challenges generated, it is necessary to establish robust, standard and reproducible computational methodologies to achieve the objectives set. Currently, predictive models based on Machine Learning have gained great importance in the step prior to preclinical studies. This stage manages to drastically reduce costs and research times in the discovery of new drugs. This review article focuses on how these new methodologies are being used in recent years of research. Analyzing the state of the art in this field will give us an idea of where cheminformatics will be developed in the short term, the limitations it presents and the positive results it has achieved. This review will focus mainly on the methods used to model the molecular data, as well as the biological problems addressed and the Machine Learning algorithms used for drug discovery in recent years.
Key Words
- ADMET, Absorption, distribution, metabolism, elimination and toxicity
- ADR, Adverse Drug Reaction
- AI, Artificial Intelligence
- ANN, Artificial Neural Networks
- APFP, Atom Pairs 2d FingerPrint
- AUC, Area under the Curve
- BBB, Blood–Brain barrier
- CDK, Chemistry Development Kit
- CNN, Convolutional Neural Networks
- CNS, Central Nervous System
- CPI, Compound-protein interaction
- CV, Cross Validation
- Cheminformatics
- DL, Deep Learning
- DNA, Deoxyribonucleic acid
- Deep Learning
- Drug Discovery
- ECFP, Extended Connectivity Fingerprints
- FDA, Food and Drug Administration
- FNN, Fully Connected Neural Networks
- FP, Fingerprints
- FS, Feature Selection
- GCN, Graph Convolutional Networks
- GEO, Gene Expression Omnibus
- GNN, Graph Neural Networks
- GO, Gene Ontology
- KEGG, Kyoto Encyclopedia of Genes and Genomes
- MACCS, Molecular ACCess System
- MCC, Matthews correlation coefficient
- MD, Molecular Descriptors
- MKL, Multiple Kernel Learning
- ML, Machine Learning
- Machine Learning
- Molecular Descriptors
- NB, Naive Bayes
- OOB, Out of Bag
- PCA, Principal Component Analysis
- QSAR
- QSAR, Quantitative structure–activity relationship
- RF, Random Forest
- RNA, Ribonucleic Acid
- SMILES, simplified molecular-input line-entry system
- SVM, Support Vector Machines
- TCGA, The Cancer Genome Atlas
- WHO, World Health Organization
- t-SNE, t-Distributed Stochastic Neighbor Embedding
Affiliation(s)
- Paula Carracedo-Reboredo
- Department of Computer Science and Information Technologies, Faculty of Computer Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- Jose Liñares-Blanco
- Department of Computer Science and Information Technologies, Faculty of Computer Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- CITIC-Research Center of Information and Communication Technologies, Universidade da Coruna, A Coruña 15071, Spain
- Nereida Rodríguez-Fernández
- CITIC-Research Center of Information and Communication Technologies, Universidade da Coruna, A Coruña 15071, Spain
- Department of Computer Science and Information Technologies, Faculty of Communication Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- Francisco Cedrón
- Department of Computer Science and Information Technologies, Faculty of Computer Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- Francisco J. Novoa
- Department of Computer Science and Information Technologies, Faculty of Computer Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- Adrian Carballal
- Department of Computer Science and Information Technologies, Faculty of Computer Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- CITIC-Research Center of Information and Communication Technologies, Universidade da Coruna, A Coruña 15071, Spain
- Department of Computer Science and Information Technologies, Faculty of Communication Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- Victor Maojo
- Biomedical Informatics Group, Artificial Intelligence Department, Polytechnic University of Madrid, Calle de los Ciruelos, Boadilla del Monte, Madrid 28660, Spain
- Alejandro Pazos
- Department of Computer Science and Information Technologies, Faculty of Computer Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- CITIC-Research Center of Information and Communication Technologies, Universidade da Coruna, A Coruña 15071, Spain
- Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos. Imagen Médica y Diagnóstico Radiológico (RNASA-IMEDIR), Complexo Hospitalario Universitario de A Coruña (CHUAC), SERGAS, Universidade da Coruña, Instituto de Investigación Biomédica de A Coruña (INIBIC), A Coruña, Spain
- Carlos Fernandez-Lozano
- Department of Computer Science and Information Technologies, Faculty of Computer Science, Universidade da Coruna, Campus Elviña s/n, A Coruña 15071, Spain
- CITIC-Research Center of Information and Communication Technologies, Universidade da Coruna, A Coruña 15071, Spain
- Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos. Imagen Médica y Diagnóstico Radiológico (RNASA-IMEDIR), Complexo Hospitalario Universitario de A Coruña (CHUAC), SERGAS, Universidade da Coruña, Instituto de Investigación Biomédica de A Coruña (INIBIC), A Coruña, Spain
10
Chen L, Li Z, Zeng T, Zhang YH, Li H, Huang T, Cai YD. Predicting gene phenotype by multi-label multi-class model based on essential functional features. Mol Genet Genomics 2021; 296:905-918. [PMID: 33914130 DOI: 10.1007/s00438-021-01789-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/26/2021] [Accepted: 04/13/2021] [Indexed: 12/19/2022]
Abstract
Phenotype is one of the most significant concepts in genetics, which is used to describe all the characteristics of a research object that can be observed. Considering that phenotype reflects the integrated features of genotype and environment factors, it is hard to define phenotype characteristics, even difficult to predict unknown phenotypes. Restricted by current biological techniques, it is still quite expensive and time-consuming to obtain sufficient structural information of large-scale phenotype-associated genes/proteins. Various bioinformatics methods have been presented to solve such problem, and researchers have confirmed the efficacy and prediction accuracy of functional network-based prediction. But general functional descriptions have highly complicated inner structures for phenotype prediction. To further address this issue and improve the efficacy of phenotype prediction on more than ten kinds of phenotypes, we first extract functional enrichment features from GO and KEGG, and then use node2vec to learn functional embedding features of genes from a gene-gene network. All these features are analyzed by some feature selection methods (Boruta, minimum redundancy maximum relevance) to generate a feature list. Such list is fed into the incremental feature selection, incorporating some multi-label classifiers built by RAkEL and some classic base classifiers, to build an optimum multi-label multi-class classification model for phenotype prediction. According to recent researches, our method has indeed identified many literature-supported genes/proteins and their associated phenotypes, and even some candidate genes with re-assigned new phenotypes, which provide a new computational tool for the accurate and effective phenotypic prediction.
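The incremental feature selection step mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the ranked feature list is taken as given (the paper derives it with Boruta and minimum redundancy maximum relevance), and `evaluate` is a hypothetical callback standing in for training and scoring a multi-label classifier on a feature subset.

```python
def incremental_feature_selection(ranked_features, evaluate, step=1):
    """Score growing prefixes of a ranked feature list and keep the best one.

    ranked_features: features ordered from most to least important.
    evaluate: hypothetical callback returning a score for a feature subset
              (higher is better), e.g. a cross-validated classifier metric.
    step: how many features to add between successive evaluations.
    """
    best_score, best_subset = float("-inf"), []
    for n in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:n]
        score = evaluate(subset)
        if score > best_score:  # ties keep the smaller, earlier subset
            best_score, best_subset = score, subset
    return best_subset, best_score

# Toy scoring function: reward two "useful" features, penalise subset size.
USEFUL = {"f1", "f3"}

def toy_evaluate(subset):
    return len(USEFUL & set(subset)) - 0.1 * len(subset)

ranked = ["f1", "f3", "f2", "f4"]
best_subset, best_score = incremental_feature_selection(ranked, toy_evaluate)
```

In the toy run, the first two features give the highest score, so the search returns that prefix and ignores the longer ones.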
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China; College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China
- Zhandong Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, People's Republic of China
- Tao Zeng
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
- Yu-Hang Zhang
- Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Hao Li
- College of Food Engineering, Jilin Engineering Normal University, Changchun, 130052, People's Republic of China
- Tao Huang
- Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China
11
Identifying the Immunological Gene Signatures of Immune Cell Subtypes. BIOMED RESEARCH INTERNATIONAL 2021. [DOI: 10.1155/2021/6639698] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/17/2022]
Abstract
The immune system is a complicated defensive system that comprises multiple functional cells and molecules acting against endogenous and exogenous pathogenic factors. Identifying immune cell subtypes and recognizing their unique immunological functions are difficult because of the complicated cellular components and immunological functions of the immune system. With the development of transcriptomics and high-throughput sequencing, the gene expression profiling of immune cells can provide a new strategy to explore the immune cell subtyping. On the basis of the new profiling data of mouse immune cell gene expression from the Immunological Genome Project (ImmGen), a novel computational pipeline was applied to identify different immune cell subtypes, including αβ T cells, B cells, γδ T cells, and innate lymphocytes. First, the profiling data was analyzed by a powerful feature selection method, Monte-Carlo Feature Selection, resulting in a feature list and some informative features. For the list, the two-stage incremental feature selection method, incorporating random forest as the classification algorithm, was applied to extract essential gene signatures and build an efficient classifier. On the other hand, a rule learning scheme was applied on the informative features to construct quantitative expression rules. A group of gene signatures was found as qualitatively related to the biological processes of four immune cell subtypes. The quantitative expression rules can efficiently cluster immune cells. This work provides a novel computational tool for immune cell quantitative subtyping and biomarker recognition.
12
Peng X, Chen L, Zhou JP. Identification of Carcinogenic Chemicals with Network Embedding and Deep Learning Methods. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200414084317] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/21/2023]
Abstract
Background: Cancer is the second leading cause of human death in the world. To date, many factors have been confirmed to be the cause of cancer. Among them, carcinogenic chemicals have been widely accepted as the important ones. Traditional methods for detecting carcinogenic chemicals are of low efficiency and high cost.
Objective: The aim of this study was to design an efficient computational method for the identification of carcinogenic chemicals.
Methods: A new computational model was proposed for detecting carcinogenic chemicals. As a data-driven model, carcinogenic and non-carcinogenic chemicals were obtained from Carcinogenic Potency Database (CPDB). These chemicals were represented by features extracted from five chemical networks, representing five types of chemical associations, via a network embedding method, Mashup. Obtained features were fed into a powerful deep learning method, recurrent neural network, to build the model.
Results: The jackknife test on such model provided the F-measure of 0.971 and AUROC of 0.971.
Conclusion: The proposed model was quite effective and was superior to the models with traditional machine learning algorithms, classic chemical encoding schemes or direct usage of chemical associations.
Affiliation(s)
- Xuefei Peng
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Jian-Peng Zhou
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
13
Zhu L, Yang X, Zhu R, Yu L. Identifying Discriminative Biological Function Features and Rules for Cancer-Related Long Non-coding RNAs. Front Genet 2021; 11:598773. [PMID: 33391350 PMCID: PMC7772407 DOI: 10.3389/fgene.2020.598773] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/25/2020] [Accepted: 11/23/2020] [Indexed: 01/17/2023]
Abstract
Cancer has been a major public health problem worldwide for many centuries. Cancer is a complex disease associated with accumulative genetic mutations, epigenetic aberrations, chromosomal instability, and expression alteration. Increasing lines of evidence suggest that many non-coding transcripts, which are termed as non-coding RNAs, have important regulatory roles in cancer. In particular, long non-coding RNAs (lncRNAs) play crucial roles in tumorigenesis. Cancer-related lncRNAs serve as oncogenic factors or tumor suppressors. Although many lncRNAs are identified as potential regulators in tumorigenesis by using traditional experimental methods, they are time consuming and expensive considering the tremendous amount of lncRNAs needed. Thus, effective and fast approaches to recognize tumor-related lncRNAs should be developed. The proposed approach should help us understand not only the mechanisms of lncRNAs that participate in tumorigenesis but also their satisfactory performance in distinguishing cancer-related lncRNAs. In this study, we utilized a decision tree (DT), a type of rule learning algorithm, to investigate cancer-related lncRNAs with functional annotation contents [gene ontology (GO) terms and KEGG pathways] of their co-expressed genes. Cancer-related and other lncRNAs encoded by the key enrichment features of GO and KEGG filtered by feature selection methods were used to build an informative DT, which further induced several decision rules. The rules provided not only a new tool for identifying cancer-related lncRNAs but also connected the lncRNAs and cancers with the combinations of GO terms. Results provided new directions for understanding cancer-related lncRNAs.
Affiliation(s)
- Liucun Zhu
- School of Life Sciences, Shanghai University, Shanghai, China
- Xin Yang
- School of Life Sciences, Shanghai University, Shanghai, China
- Rui Zhu
- School of Life Sciences, Shanghai University, Shanghai, China
- Lei Yu
- Department of Medical Oncology, Shanghai Concord Medical Cancer Center, Shanghai, China
14
iMPTCE-Hnetwork: A Multilabel Classifier for Identifying Metabolic Pathway Types of Chemicals and Enzymes with a Heterogeneous Network. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:6683051. [PMID: 33488764 PMCID: PMC7803417 DOI: 10.1155/2021/6683051] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 11/19/2020] [Revised: 12/16/2020] [Accepted: 12/19/2020] [Indexed: 12/16/2022]
Abstract
Metabolic pathway is an important type of biological pathways. It produces essential molecules and energies to maintain the life of living organisms. Each metabolic pathway consists of a chain of chemical reactions, which always need enzymes to participate in. Thus, chemicals and enzymes are two major components for each metabolic pathway. Although several metabolic pathways have been uncovered, the metabolic pathway system is still far from complete. Some hidden chemicals or enzymes are not discovered in a certain metabolic pathway. Besides the traditional experiments to detect hidden chemicals or enzymes, an alternative pipeline is to design efficient computational methods. In this study, we proposed a powerful multilabel classifier, called iMPTCE-Hnetwork, to uniformly assign chemicals and enzymes to metabolic pathway types reported in KEGG. Such classifier adopted the embedding features derived from a heterogeneous network, which defined chemicals and enzymes as nodes and the interactions between chemicals and enzymes as edges, through a powerful network embedding algorithm, Mashup. The popular RAndom k-labELsets (RAKEL) algorithm was employed to construct the classifier, which incorporated the support vector machine (polynomial kernel) as the basic classifier. The ten-fold cross-validation results indicated that such a classifier had good performance with accuracy higher than 0.800 and exact match higher than 0.750. Several comparisons were done to indicate the superiority of the iMPTCE-Hnetwork.
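The two measures quoted in this abstract, accuracy and exact match, can be sketched for multi-label predictions as below. The abstract does not give the formulas, so this assumes the set-based definitions common in the multi-label literature: exact match is the fraction of samples whose predicted label set equals the true set, and accuracy is the mean ratio of intersection to union. The function names and toy pathway labels are hypothetical.

```python
def exact_match(true_sets, pred_sets):
    """Fraction of samples whose predicted label set equals the true label set."""
    hits = sum(t == p for t, p in zip(true_sets, pred_sets))
    return hits / len(true_sets)

def multilabel_accuracy(true_sets, pred_sets):
    """Mean of |true ∩ predicted| / |true ∪ predicted| over all samples."""
    total = 0.0
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # Two empty label sets agree completely, so count them as a full match.
        total += len(t & p) / len(union) if union else 1.0
    return total / len(true_sets)

# Toy example: three samples with hypothetical pathway-type labels.
true_sets = [{"a", "b"}, {"c"}, {"a"}]
pred_sets = [{"a", "b"}, {"c", "d"}, {"b"}]

em = exact_match(true_sets, pred_sets)
acc = multilabel_accuracy(true_sets, pred_sets)
```

Exact match is the stricter of the two, since a partially correct prediction earns partial credit under accuracy but none under exact match.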
15
Zhang YH, Pan X, Zeng T, Chen L, Huang T, Cai YD. Identifying the RNA signatures of coronary artery disease from combined lncRNA and mRNA expression profiles. Genomics 2020; 112:4945-4958. [PMID: 32919019 DOI: 10.1016/j.ygeno.2020.09.016] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/03/2020] [Revised: 07/28/2020] [Accepted: 09/05/2020] [Indexed: 12/23/2022]
Abstract
Coronary artery disease (CAD) is the most common cardiovascular disease. CAD research has progressed greatly during the past decade. mRNA expression is a traditional and popular means of investigating various diseases, including CAD. Compared with mRNA, lncRNA has better stability and thus may serve as a better disease indicator in blood. Investigating potential CAD-related lncRNAs and mRNAs will greatly contribute to the diagnosis and treatment of CAD. In this study, a computational analysis was conducted on patients with CAD by using a comprehensive transcription dataset with combined mRNA and lncRNA expression data. Several machine learning algorithms, including feature selection methods and classification algorithms, were applied to screen for the most CAD-related RNA molecules. Decision rules were also reported to provide a quantitative description of the effect of these RNA molecules on CAD progression. These new findings (CAD-related RNA molecules and rules) can help in understanding mRNA and lncRNA expression levels in CAD.
Affiliation(s)
- Yu-Hang Zhang
- School of Life Sciences, Shanghai University, Shanghai 200444, China; Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China; Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China.
- Tao Zeng
- Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai 201210, China.
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
|
16
|
Zhang X, Chen L. Prediction of membrane protein types by fusing protein-protein interaction and protein sequence information. Biochimica et Biophysica Acta - Proteins and Proteomics 2020; 1868:140524. [PMID: 32858174 DOI: 10.1016/j.bbapap.2020.140524] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/26/2020] [Revised: 07/17/2020] [Accepted: 07/30/2020] [Indexed: 11/30/2022]
Abstract
Membrane proteins are gatekeepers to the cell and are essential in determining the function of cells. Identifying the types of membrane proteins is an essential problem in cell biology. Identifying the type of a membrane protein with traditional experimental methods is time-consuming and expensive. The alternative is to design effective computational methods, which can provide quick and reliable predictions. To date, several computational methods have been proposed in this regard. Several of them used features extracted from the sequence information of individual proteins. Recently, networks have become increasingly popular for tackling different protein-related problems, because they organize proteins at a system level and give an overview of all proteins. However, the network form weakens the essential properties of proteins, such as their sequence information. In this study, a novel feature fusion scheme was proposed, which integrated the information of protein sequences and the protein-protein interaction network. The fused features of a protein were defined as the linear combination of the sequence features of all proteins in the network, where the combination coefficients were the probabilities yielded by the random walk with restart algorithm with the protein as the seed node. Several models with such fused features and different classification algorithms were built and evaluated. Their performance in predicting the type of membrane proteins was improved compared with models using only the sequence features or only the network information.
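The fusion scheme described in the abstract can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the column normalization of the adjacency matrix, the restart probability of 0.8, the convergence tolerance, and every function and variable name are assumptions that the abstract does not specify. The walk iterates p(t+1) = (1 - r) W p(t) + r p(0), where p(0) puts all probability on the seed protein, and the fused feature vector is the probability-weighted sum of all proteins' sequence feature vectors.

```python
def random_walk_with_restart(adjacency, seed, restart=0.8, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walk that restarts at `seed`."""
    n = len(adjacency)
    # Column sums are used to turn the adjacency matrix into transition probabilities.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            walk = sum(
                adjacency[i][j] / col_sums[j] * p[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            nxt.append((1.0 - restart) * walk + restart * p0[i])
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


def fuse_features(adjacency, sequence_features, seed, restart=0.8):
    """Linear combination of all proteins' sequence features, weighted by RWR probabilities."""
    probs = random_walk_with_restart(adjacency, seed, restart)
    dim = len(sequence_features[0])
    return [
        sum(probs[k] * sequence_features[k][d] for k in range(len(probs)))
        for d in range(dim)
    ]


# Toy network: three proteins on a path 0 - 1 - 2 (hypothetical data).
toy_adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# One-hot "sequence features", so the fused vector equals the RWR probabilities.
toy_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_probs = random_walk_with_restart(toy_adjacency, seed=0)
toy_fused = fuse_features(toy_adjacency, toy_features, seed=0)
```

With a high restart probability the seed protein's own sequence features dominate the fused vector, while a lower value lets network neighbours contribute more.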
Affiliation(s)
- Xiaolin Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.
|
17
|
Alternative Polyadenylation Modification Patterns Reveal Essential Posttranscription Regulatory Mechanisms of Tumorigenesis in Multiple Tumor Types. BioMed Research International 2020; 2020:6384120. [PMID: 32626751 PMCID: PMC7315320 DOI: 10.1155/2020/6384120] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/28/2020] [Revised: 05/26/2020] [Accepted: 05/30/2020] [Indexed: 12/11/2022]
Abstract
Among various risk factors for the initiation and progression of cancer, alternative polyadenylation (APA) is a remarkable endogenous contributor that directly triggers the malignant phenotype of cancer cells. APA affects biological processes at a transcriptional level in various ways. As such, APA can be involved in tumorigenesis through gene expression, protein subcellular localization, or transcription splicing pattern. The APA sites and status of different cancer types may have diverse modification patterns and regulatory mechanisms on transcripts. Potential APA sites were screened by applying several machine learning algorithms on a TCGA-APA dataset. First, a powerful feature selection method, minimum redundancy maximum relevancy, was applied on the dataset, resulting in a feature list. Then, the feature list was fed into the incremental feature selection, which incorporated the support vector machine as the classification algorithm, to extract key APA features and build a classifier. The classifier can classify cancer patients into cancer types with perfect performance. The key APA-modified genes had a potential prognosis ability because of their significant power in the survival analysis of TCGA pan-cancer data.
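The incremental feature selection step described in the abstract can be illustrated with a small sketch. It assumes the ranked feature list (produced by minimum redundancy maximum relevancy in the paper) is already given, and it swaps the support vector machine for a dependency-free nearest-centroid classifier scored by leave-one-out accuracy. That substitution, the function names, and the toy data are all assumptions made for illustration, not the authors' setup.

```python
def nearest_centroid_loo_accuracy(samples, labels, feature_idx):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen features."""
    correct = 0
    for held in range(len(samples)):
        sums, counts = {}, {}
        # Build class centroids from every sample except the held-out one.
        for i, (x, y) in enumerate(zip(samples, labels)):
            if i == held:
                continue
            vec = [x[f] for f in feature_idx]
            if y not in sums:
                sums[y] = [0.0] * len(vec)
                counts[y] = 0
            sums[y] = [s + v for s, v in zip(sums[y], vec)]
            counts[y] += 1
        query = [samples[held][f] for f in feature_idx]

        def squared_distance(y):
            return sum((q - s / counts[y]) ** 2 for q, s in zip(query, sums[y]))

        predicted = min(sorted(sums), key=squared_distance)
        correct += predicted == labels[held]
    return correct / len(samples)


def incremental_feature_selection(samples, labels, ranked_features):
    """Evaluate the top-1, top-2, ... feature subsets and keep the best-scoring one."""
    best_k, best_score = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        score = nearest_centroid_loo_accuracy(samples, labels, ranked_features[:k])
        # Strict comparison: ties are resolved in favour of the smaller subset.
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score


# Toy data: feature 0 separates the classes, feature 1 is large-scale noise.
toy_samples = [[0.0, 5.0], [0.1, -5.0], [0.2, 4.0], [1.0, -4.0], [1.1, 5.0], [0.9, -5.0]]
toy_labels = ["A", "A", "A", "B", "B", "B"]
selected, score = incremental_feature_selection(toy_samples, toy_labels, [0, 1])
```

Because each subset is a prefix of the ranked list, only as many classifiers are trained as there are features, rather than one per possible subset.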
|
18
|
Huang G. Computational Models and Methods for Drug Target Prediction and Drug Repositioning. Comb Chem High Throughput Screen 2020; 23:270-273. [PMID: 32452755 DOI: 10.2174/138620732304200409112209] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/03/2023]
Affiliation(s)
- Guohua Huang
- Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
|
19
|
Zhang S, Zeng T, Hu B, Zhang YH, Feng K, Chen L, Niu Z, Li J, Huang T, Cai YD. Discriminating Origin Tissues of Tumor Cell Lines by Methylation Signatures and Dys-Methylated Rules. Front Bioeng Biotechnol 2020; 8:507. [PMID: 32528944 PMCID: PMC7264161 DOI: 10.3389/fbioe.2020.00507] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/07/2020] [Accepted: 04/30/2020] [Indexed: 12/18/2022] Open
Abstract
DNA methylation is an essential epigenetic modification for multiple biological processes. DNA methylation in mammals acts as an epigenetic mark of transcriptional repression. Aberrant levels of DNA methylation can be observed in various types of tumor cells. Thus, DNA methylation has attracted considerable attention among researchers to provide new and feasible tumor therapies. Conventional studies considered single-gene methylation or specific loci as biomarkers for tumorigenesis. However, genome-scale methylated modification has not been completely investigated. Thus, we proposed and compared two novel computational approaches based on multiple machine learning algorithms for the qualitative and quantitative analyses of methylation-associated genes and their dys-methylated patterns. This study contributes to the identification of novel effective genes and the establishment of optimal quantitative rules for aberrant methylation distinguishing tumor cells with different origin tissues.
Affiliation(s)
- Shiqi Zhang
- School of Life Sciences, Shanghai University, Shanghai, China; Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
- Tao Zeng
- Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai, China
- Bin Hu
- State Key Laboratory of Livestock and Poultry Breeding, Guangdong Public Laboratory of Animal Breeding and Nutrition, Guangdong Key Laboratory of Animal Breeding and Nutrition, Institute of Animal Science, Guangdong Academy of Agricultural Sciences, Guangzhou, China
- Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Kaiyan Feng
- Department of Computer Science, Guangdong AIB Polytechnic, Guangzhou, China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Zhibin Niu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jianhao Li
- State Key Laboratory of Livestock and Poultry Breeding, Guangdong Public Laboratory of Animal Breeding and Nutrition, Guangdong Key Laboratory of Animal Breeding and Nutrition, Institute of Animal Science, Guangdong Academy of Agricultural Sciences, Guangzhou, China
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
20
|
Prediction of Drug Side Effects with a Refined Negative Sample Selection Strategy. Computational and Mathematical Methods in Medicine 2020; 2020:1573543. [PMID: 32454877 PMCID: PMC7232712 DOI: 10.1155/2020/1573543] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Received: 01/28/2020] [Revised: 04/14/2020] [Accepted: 04/23/2020] [Indexed: 01/07/2023]
Abstract
Drugs are an important way to treat various diseases. However, they inevitably produce side effects, bringing great risks to human bodies and pharmaceutical companies. How to predict the side effects of drugs has become one of the essential problems in drug research. Designing efficient computational methods is an alternative way. Some studies paired the drug and side effect as a sample, thereby modeling the problem as a binary classification problem. However, the selection of negative samples is a key problem in this case. In this study, a novel negative sample selection strategy was designed for accessing high-quality negative samples. Such strategy applied the random walk with restart (RWR) algorithm on a chemical-chemical interaction network to select pairs of drugs and side effects, such that drugs were less likely to have corresponding side effects, as negative samples. Through several tests with a fixed feature extraction scheme and different machine-learning algorithms, models with selected negative samples produced high performance. The best model even yielded nearly perfect performance. These models had much higher performance than those without such strategy or with another selection strategy. Furthermore, it is not necessary to consider the balance of positive and negative samples under such a strategy.
|
21
|
Yuan F, Pan X, Zeng T, Zhang YH, Chen L, Gan Z, Huang T, Cai YD. Identifying Cell-Type Specific Genes and Expression Rules Based on Single-Cell Transcriptomic Atlas Data. Front Bioeng Biotechnol 2020; 8:350. [PMID: 32411685 PMCID: PMC7201067 DOI: 10.3389/fbioe.2020.00350] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/31/2020] [Accepted: 03/30/2020] [Indexed: 01/07/2023] Open
Abstract
Single-cell sequencing technologies have emerged to address new and longstanding biological and biomedical questions. Previous studies focused on the analysis of bulk tissue samples composed of millions of cells. However, the genomes within the cells of an individual multicellular organism are not always the same. In this study, we aimed to identify the crucial and characteristically expressed genes that may play functional roles in tissue development and organogenesis, by analyzing a single-cell transcriptomic atlas of mice. We identified the most relevant gene features and decision rules classifying 18 cell categories, providing a list of genes that may perform important functions in the process of tissue development because of their tissue-specific expression patterns. These genes may serve as biomarkers to identify the origin of unknown cell subgroups so as to recognize specific cell stages/states during the dynamic process, and also be applied as potential therapy targets for developmental disorders.
Affiliation(s)
- Fei Yuan
- School of Life Sciences, Shanghai University, Shanghai, China; Department of Science and Technology, Binzhou Medical University Hospital, Binzhou, China
- XiaoYong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Tao Zeng
- Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China
- Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China; Shanghai Key Laboratory of Pure Mathematics and Mathematical Practice, East China Normal University, Shanghai, China
- Zijun Gan
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
|
22
|
Chen L, Pan X, Guo W, Gan Z, Zhang YH, Niu Z, Huang T, Cai YD. Investigating the gene expression profiles of cells in seven embryonic stages with machine learning algorithms. Genomics 2020; 112:2524-2534. [PMID: 32045671 DOI: 10.1016/j.ygeno.2020.02.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 11/18/2019] [Revised: 12/26/2019] [Accepted: 02/07/2020] [Indexed: 12/15/2022]
Abstract
The development of embryonic cells involves several continuous stages, and some genes are related to embryogenesis. To date, few studies have systematically investigated changes in gene expression profiles during mammalian embryogenesis. In this study, a computational analysis using machine learning algorithms was performed on the gene expression profiles of mouse embryonic cells at seven stages. First, the profiles were analyzed with a Monte Carlo feature selection method to generate a feature list. Second, incremental feature selection was applied to the list by incorporating two classification algorithms: support vector machine (SVM) and repeated incremental pruning to produce error reduction (RIPPER). Through SVM, we extracted several latent gene biomarkers, indicating the stages of embryonic cells, and constructed an optimal SVM classifier that produced a nearly perfect classification of embryonic cells. Furthermore, some interesting rules were obtained with the RIPPER algorithm, suggesting different expression patterns for different stages.
Affiliation(s)
- Lei Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China; College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China; Shanghai Key Laboratory of PMMP, East China Normal University, Shanghai 200241, China.
- XiaoYong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China.
- Wei Guo
- Institute of Health Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
- Zijun Gan
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
- Yu-Hang Zhang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China; Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Zhibin Niu
- College of Intelligence and Computing, Tianjin University, Tianjin 300072, China.
- Tao Huang
- Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
- Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
|
23
|
Identifying Methylation Pattern and Genes Associated with Breast Cancer Subtypes. Int J Mol Sci 2019; 20:ijms20174269. [PMID: 31480430 PMCID: PMC6747348 DOI: 10.3390/ijms20174269] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 08/01/2019] [Revised: 08/19/2019] [Accepted: 08/29/2019] [Indexed: 12/18/2022] Open
Abstract
Breast cancer is regarded worldwide as a severe human disease. Various genetic variations, including hereditary and somatic mutations, contribute to the initiation and progression of this disease. The diagnostic parameters of breast cancer are not limited to the conventional protein content and can include newly discovered genetic variants and even genetic modification patterns such as methylation and microRNA. In addition, breast cancer detection extends to detailed breast cancer stratifications to provide subtype-specific indications for further personalized treatment. One genome-wide expression–methylation quantitative trait loci analysis confirmed that different breast cancer subtypes have various methylation patterns. However, recognizing clinically applied (methylation) biomarkers is difficult due to the large number of differentially methylated genes. In this study, we attempted to re-screen a small group of functional biomarkers for the identification and distinction of different breast cancer subtypes with advanced machine learning methods. The findings may contribute to biomarker identification for different breast cancer subtypes and provide a new perspective for differential pathogenesis in breast cancer subtypes.
|