1
López-Cortés A, Cabrera-Andrade A, Echeverría-Garcés G, Echeverría-Espinoza P, Pineda-Albán M, Elsitdie N, Bueno-Miño J, Cruz-Segundo CM, Dorado J, Pazos A, Gonzáles-Díaz H, Pérez-Castillo Y, Tejera E, Munteanu CR. Unraveling druggable cancer-driving proteins and targeted drugs using artificial intelligence and multi-omics analyses. Sci Rep 2024; 14:19359. [PMID: 39169044 PMCID: PMC11339426 DOI: 10.1038/s41598-024-68565-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Accepted: 07/25/2024] [Indexed: 08/23/2024] Open
Abstract
The druggable proteome refers to proteins that can bind to small molecules with appropriate chemical affinity, inducing a favorable clinical response. Predicting druggable proteins through screening and in silico modeling is imperative for drug design. To contribute to this field, we developed an accurate predictive classifier for druggable cancer-driving proteins using amino acid composition descriptors of protein sequences and 13 machine learning linear and non-linear classifiers. The optimal classifier was achieved with the support vector machine method, utilizing 200 tri-amino acid composition descriptors. The high performance of the model is evident from an area under the receiver operating characteristics (AUROC) of 0.975 ± 0.003 and an accuracy of 0.929 ± 0.006 (threefold cross-validation). The machine learning prediction model was enhanced with multi-omics approaches, including the target-disease evidence score, the shortest pathways to cancer hallmarks, structure-based ligandability assessment, unfavorable prognostic protein analysis, and the oncogenic variome. Additionally, we performed a drug repurposing analysis to identify drugs with the highest affinity capable of targeting the best predicted proteins. As a result, we identified 79 key druggable cancer-driving proteins with the highest ligandability, and 23 of them demonstrated unfavorable prognostic significance across 16 TCGA PanCancer types: CDKN2A, BCL10, ACVR1, CASP8, JAG1, TSC1, NBN, PREX2, PPP2R1A, DNM2, VAV1, ASXL1, TPR, HRAS, BUB1B, ATG7, MARK3, SETD2, CCNE1, MUTYH, CDKN2C, RB1, and SMARCA4. Moreover, we prioritized 11 clinically relevant drugs targeting these proteins. This strategy effectively predicts and prioritizes biomarkers, therapeutic targets, and drugs for in-depth studies in clinical trials. Scripts are available at https://github.com/muntisa/machine-learning-for-druggable-proteins .
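The tri-amino acid composition descriptors mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the common definition (frequency of each overlapping tripeptide over the 20 standard residues, giving 8,000 values); the exact normalization and the procedure used to select the 200 descriptors are not stated here, so the function name and the toy sequence below are illustrative assumptions rather than the authors' implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 20^3 = 8000 possible tripeptides, in a fixed order.
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
INDEX = {tri: i for i, tri in enumerate(TRIPEPTIDES)}


def tripeptide_composition(sequence):
    """Return the 8000-dimensional tri-amino acid composition of a sequence.

    Each entry is the count of one tripeptide divided by the number of valid
    overlapping windows. Windows containing non-standard residues are skipped
    (an assumption of this sketch).
    """
    seq = sequence.upper()
    counts = [0] * len(TRIPEPTIDES)
    total = 0
    for i in range(len(seq) - 2):
        idx = INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRIPEPTIDES)
    return [c / total for c in counts]


# Toy sequence for illustration only, not taken from the paper.
vec = tripeptide_composition("MKVLAAGMKV")
```

A vector like `vec` would then be reduced by feature selection before being passed to a classifier such as a support vector machine.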
Affiliation(s)
- Andrés López-Cortés
  - Cancer Research Group (CRG), Faculty of Medicine, Universidad de Las Américas, Quito, Ecuador.
- Alejandro Cabrera-Andrade
  - Grupo de Bio-Quimioinformática, Universidad de Las Américas, Quito, Ecuador
  - Escuela de Enfermería, Facultad de Ciencias de la Salud, Universidad de Las Américas, Quito, Ecuador
- Gabriela Echeverría-Garcés
  - Centro de Referencia Nacional de Genómica, Secuenciación y Bioinformática, Instituto Nacional de Investigación en Salud Pública "Leopoldo Izquieta Pérez", Quito, Ecuador
  - Latin American Network for the Implementation and Validation of Clinical Pharmacogenomics Guidelines (RELIVAF-CYTED), Santiago, Chile
- Micaela Pineda-Albán
  - Cancer Research Group (CRG), Faculty of Medicine, Universidad de Las Américas, Quito, Ecuador
- Nicole Elsitdie
  - Cancer Research Group (CRG), Faculty of Medicine, Universidad de Las Américas, Quito, Ecuador
- José Bueno-Miño
  - Cancer Research Group (CRG), Faculty of Medicine, Universidad de Las Américas, Quito, Ecuador
- Carlos M Cruz-Segundo
  - RNASA-IMEDIR, Computer Science Faculty, University of A Coruna, A Coruña, Spain
  - Tecnológico de Estudios Superiores de Jocotitlán, Jocotitlán, Mexico
- Julian Dorado
  - RNASA-IMEDIR, Computer Science Faculty, University of A Coruna, A Coruña, Spain
  - Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), University of A Coruna, A Coruña, Spain
- Alejandro Pazos
  - RNASA-IMEDIR, Computer Science Faculty, University of A Coruna, A Coruña, Spain
  - Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), University of A Coruna, A Coruña, Spain
  - Biomedical Research Institute of A Coruna (INIBIC), University Hospital Complex of A Coruna (CHUAC), A Coruña, Spain
- Humberto Gonzáles-Díaz
  - Department of Organic Chemistry II, University of the Basque Country UPV/EHU, Biscay, Spain
  - IKERBASQUE, Basque Foundation for Science, Biscay, Spain
- Eduardo Tejera
  - Grupo de Bio-Quimioinformática, Universidad de Las Américas, Quito, Ecuador
- Cristian R Munteanu
  - RNASA-IMEDIR, Computer Science Faculty, University of A Coruna, A Coruña, Spain
  - Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), University of A Coruna, A Coruña, Spain
  - Biomedical Research Institute of A Coruna (INIBIC), University Hospital Complex of A Coruna (CHUAC), A Coruña, Spain
2
Preethy H A, Venkatakrishnan YB, Ramakrishnan V, Krishnan UM. A network pharmacological approach for the identification of potential therapeutic targets of Brahmi Nei - a complex traditional Siddha formulation. J Biomol Struct Dyn 2024:1-24. [PMID: 38459935 DOI: 10.1080/07391102.2024.2322612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2023] [Accepted: 02/19/2024] [Indexed: 03/11/2024]
Abstract
Brahmi Nei (BN), a traditional Indian polyherbal formulation, has been described in classical texts for the treatment of anxiety and depression, as well as to fortify the immune system. The individual herbs of BN have been used to treat a wide range of conditions, including cognitive disorders, inflammation, skin ailments, and cancer. This diverse range of therapeutic activity suggests that BN may also benefit other disorders. The present study therefore aims to identify the potential therapeutic targets of BN using a network pharmacological approach, in order to understand the multi-target action of its multiple phytoconstituents. We employed the Randić index for the first time to calculate the contribution score of module-segregated targets towards diseases. Our results suggest that BN targets could also be effective in other diseases, such as lysosomal storage disorders and respiratory disorders, apart from neurological disorders. The key targets with the highest topological measures in the Targets-(Pathway)-Targets network were identified as potential therapeutic targets of BN. The top-hit target PTGS2, the gene encoding cyclooxygenase-2, was further evaluated using molecular docking, molecular dynamics simulation, and in vitro studies. Our findings open up new therapeutic facets for BN that can be explored systematically in the future. Communicated by Ramaswamy H. Sarma.
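For readers unfamiliar with the Randić index used in this abstract, the sketch below computes the standard connectivity index of an undirected graph, R(G) = Σ over edges (u, v) of 1/√(deg(u)·deg(v)). How the authors turn this value into a disease contribution score is not described here, so only the index itself is shown; the function name and the toy path network are illustrative assumptions.

```python
from math import sqrt
from collections import defaultdict


def randic_index(edges):
    """Randić connectivity index of an undirected simple graph.

    `edges` is an iterable of (u, v) pairs. Self-loops and duplicate
    edges are ignored so that node degrees match a simple graph.
    """
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = defaultdict(int)
    for edge in unique_edges:
        for node in edge:
            degree[node] += 1
    total = 0.0
    for edge in unique_edges:
        u, v = tuple(edge)
        total += 1.0 / sqrt(degree[u] * degree[v])
    return total


# Hypothetical toy network: the path A-B-C (degrees 1, 2, 1).
path = [("A", "B"), ("B", "C")]
path_index = randic_index(path)  # two edges, each contributing 1/sqrt(2)
```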
Affiliation(s)
- Agnes Preethy H
  - Centre for Nanotechnology & Advanced Biomaterials (CeNTAB), SASTRA Deemed University, Thanjavur, India
  - School of Chemical & Biotechnology (SCBT), SASTRA Deemed University, Thanjavur, India
- Uma Maheswari Krishnan
  - Centre for Nanotechnology & Advanced Biomaterials (CeNTAB), SASTRA Deemed University, Thanjavur, India
  - School of Chemical & Biotechnology (SCBT), SASTRA Deemed University, Thanjavur, India
  - School of Arts, Sciences, Humanities & Education (SASHE), SASTRA Deemed University, Thanjavur, India
3
Raju B, Narendra G, Verma H, Kumar M, Sapra B, Kaur G, Jain SK, Silakari O. Machine Learning Enabled Structure-Based Drug Repurposing Approach to Identify Potential CYP1B1 Inhibitors. ACS Omega 2022; 7:31999-32013. [PMID: 36120033 PMCID: PMC9476183 DOI: 10.1021/acsomega.2c02983] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/13/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
Drug-metabolizing enzyme (DME)-mediated pharmacokinetic resistance to some clinically approved anticancer agents is one of the main reasons for cancer treatment failure. In particular, some commonly used anticancer medicines, including docetaxel, tamoxifen, imatinib, cisplatin, and paclitaxel, are inactivated by CYP1B1. Currently, no approved drugs are available to counter this CYP1B1-mediated inactivation, which drives the pharmaceutical industry to search for new anticancer agents. Because of the extreme complexity and high risk of drug discovery and development, it is worthwhile to devise a drug repurposing strategy that may solve the resistance problem of existing chemotherapeutics. Therefore, in the current study, a drug repurposing strategy was implemented to find possible CYP1B1 inhibitors using machine learning (ML) and structure-based virtual screening (SB-VS) approaches. Initially, three different ML models were developed, namely support vector machine (SVM), random forest (RF), and artificial neural network (ANN) models; subsequently, the best ML model was employed for virtual screening of the Selleckchem database to identify potential CYP1B1 inhibitors. The inhibition potency of the obtained hits was judged by analyzing the crucial active site amino acid interactions against CYP1B1. After a thorough assessment of docking scores, binding affinities, and binding modes, four compounds were selected and further subjected to in vitro analysis. The in vitro analysis showed that chlorprothixene, nadifloxacin, and ticagrelor had promising inhibitory activity toward CYP1B1, with IC50 values in the range of 0.07-3.00 μM. These new chemical scaffolds can be explored as adjuvant therapies to address CYP1B1-mediated drug-resistance problems.
Affiliation(s)
- Baddipadige Raju
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab 147002, India
- Gera Narendra
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab 147002, India
- Himanshu Verma
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab 147002, India
- Manoj Kumar
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab 147002, India
- Bharti Sapra
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab 147002, India
- Gurleen Kaur
  - Center for Basic and Translational Research in Health Sciences, Guru Nanak Dev University, Amritsar 143005, India
- Subheet Kumar Jain
  - Center for Basic and Translational Research in Health Sciences, Guru Nanak Dev University, Amritsar 143005, India
- Om Silakari
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab 147002, India
4
Raju B, Verma H, Narendra G, Sapra B, Silakari O. Multiple machine learning, molecular docking, and ADMET screening approach for identification of selective inhibitors of CYP1B1. J Biomol Struct Dyn 2021; 40:7975-7990. [PMID: 33769194 DOI: 10.1080/07391102.2021.1905552] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 10/21/2022]
Abstract
Cytochrome P450 1B1 (CYP1B1) belongs to a ubiquitous protein family, is markedly overexpressed in tumors, and is responsible for the biotransformation-based inactivation of anti-cancer drugs. This inactivation is a cause of resistance to chemotherapeutics. In the present study, integrated in silico approaches were used to identify selective CYP1B1 inhibitors. To achieve this objective, we first developed different machine learning models for two isoforms of the CYP1 family, i.e., CYP1A1 and CYP1B1. Subsequently, small-molecule databases, including ChemBridge, Maybridge, and a natural compound library, were screened with the selected CYP1B1 and CYP1A1 models. The obtained CYP1B1 inhibitors were further subjected to molecular docking and ADMET analysis. The selectivity of the obtained hits for CYP1B1 over the other isoforms was also assessed by molecular docking analysis. Finally, two hits were found to be the most stable, retaining key interactions within the active site of CYP1B1 after the molecular dynamics simulations. The novel compounds with the IDs CYP-D9 and CYP-14 were found to be the most selective CYP1B1 inhibitors and may address the issue of resistance. Moreover, these compounds can be considered safe agents for further cell-based and animal model studies. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Baddipadige Raju
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab, India
- Himanshu Verma
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab, India
- Gera Narendra
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab, India
- Bharti Sapra
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab, India
- Om Silakari
  - Molecular Modeling Lab (MML), Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, Punjab, India
5
He P, Hou L, Tao H, Dai Q, Yao Y. An Analysis Model of Protein Mass Spectrometry Data and its Application. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191202150844] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Background:
The impact of cancer on society has created the need for new and faster theoretical models for the early diagnosis of cancer.
Methods:
In this work, a mass spectrometry (MS) data analysis method based on the star-like graph of proteins and a support vector machine (SVM) was proposed and applied to the early classification of ovarian cancer in an MS data set. First, the MS data are reduced and transformed into the corresponding protein sequence. Then, the topological indices of the star-like graph are calculated to describe the MS data of each cancer sample. Finally, an SVM model is used to classify the MS data.
Results:
Across 10 independent training and testing experiments used to evaluate the ovarian cancer detection models, the average prediction accuracy, sensitivity, and specificity of the model were 96.45%, 96.88%, and 95.67%, respectively, for [0,1]-normalized data, and 94.43%, 96.25%, and 91.11% for [-1,1]-normalized data.
Conclusion:
The model, combined with SELDI-TOF-MS technology, shows promise for the early clinical detection and diagnosis of ovarian cancer.
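The two normalization ranges and the three reported metrics can be made concrete with a short Python sketch. It assumes plain min-max scaling per feature and the usual confusion-matrix definitions with label 1 as the cancer class; the function names and toy values are illustrative and are not taken from the paper.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values to the interval [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [low for _ in values]  # constant feature: map everything to low
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy feature column scaled to both ranges mentioned in the abstract.
unit_range = min_max_scale([2.0, 4.0, 6.0])             # [0, 1] normalization
signed_range = min_max_scale([2.0, 4.0, 6.0], -1, 1)    # [-1, 1] normalization
metrics = classification_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```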
Affiliation(s)
- Pingan He
  - School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Longao Hou
  - School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Hong Tao
  - School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Qi Dai
  - College of Life Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Yuhua Yao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou 570100, China
6
López-Cortés A, Cabrera-Andrade A, Vázquez-Naya JM, Pazos A, Gonzáles-Díaz H, Paz-Y-Miño C, Guerrero S, Pérez-Castillo Y, Tejera E, Munteanu CR. Prediction of breast cancer proteins involved in immunotherapy, metastasis, and RNA-binding using molecular descriptors and artificial neural networks. Sci Rep 2020; 10:8515. [PMID: 32444848 PMCID: PMC7244564 DOI: 10.1038/s41598-020-65584-y] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 11/02/2019] [Accepted: 04/28/2020] [Indexed: 12/12/2022] Open
Abstract
Breast cancer (BC) is a heterogeneous disease involving genomic alterations, protein expression deregulation, signaling pathway alterations, hormone disruption, ethnicity, and environmental determinants. Due to the complexity of BC, the prediction of proteins involved in this disease is a trending topic in drug design. This work proposes an accurate predictive classifier for BC proteins using six sets of protein sequence descriptors and 13 machine-learning methods. After applying univariate feature selection to a mix of five descriptor families, the best classifier was obtained using the multilayer perceptron method (an artificial neural network) and 300 features. The performance of the model is demonstrated by an area under the receiver operating characteristics (AUROC) of 0.980 ± 0.0037 and an accuracy of 0.936 ± 0.0056 (3-fold cross-validation). Regarding the prediction of 4,504 cancer-associated proteins using this model, the best ranked cancer immunotherapy proteins related to BC were RPS27, SUPT4H1, CLPSL2, POLR2K, RPL38, AKT3, CDK3, RPS20, RASL11A and UBTD1; the best ranked metastasis driver proteins related to BC were S100A9, DDA1, TXN, PRNP, RPS27, S100A14, S100A7, MAPK1, AGR3 and NDUFA13; and the best ranked RNA-binding proteins related to BC were S100A9, TXN, RPS27L, RPS27, RPS27A, RPL38, MRPL54, PPAN, RPS20 and CSRP1. This model predicts several BC-related proteins that should be studied in depth to find new biomarkers and better therapeutic targets. Scripts can be downloaded at https://github.com/muntisa/neural-networks-for-breast-cancer-proteins.
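The "AUROC mean ± spread over 3 folds" style of reporting used in this abstract can be sketched in Python as follows. The AUROC is computed here through its rank interpretation (the probability that a random positive scores above a random negative, ties counted as half); the use of the population standard deviation for the ± term is an assumption of this sketch, and the per-fold values are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as P(score of a random positive > score of a random negative)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def mean_and_std(values):
    """Mean and population standard deviation of per-fold results."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


# Toy predictions for one fold: three of the four positive/negative pairs
# are ranked correctly.
fold_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Hypothetical per-fold AUROC values, summarized as mean and spread.
summary = mean_and_std([1.0, 3.0])
```

The quadratic pairwise loop keeps the definition visible; for large data sets a sort-based rank computation would be the usual choice.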
Affiliation(s)
- Andrés López-Cortés
  - Centro de Investigación Genética y Genómica, Facultad de Ciencias de la Salud Eugenio Espejo, Universidad UTE, Mariscal Sucre Avenue, Quito, 170129, Ecuador.
  - RNASA-IMEDIR, Computer Science Faculty, University of Coruna, Coruna, 15071, Spain.
  - Red Latinoamericana de Implementación y Validación de Guías Clínicas Farmacogenómicas (RELIVAF-CYTED), Quito, Ecuador.
- Alejandro Cabrera-Andrade
  - RNASA-IMEDIR, Computer Science Faculty, University of Coruna, Coruna, 15071, Spain
  - Grupo de Bio-Quimioinformática, Universidad de Las Américas, Avenue de los Granados, Quito, 170125, Ecuador
  - Carrera de Enfermería, Facultad de Ciencias de la Salud, Universidad de Las Américas, Avenue de los Granados, Quito, 170125, Ecuador
- José M Vázquez-Naya
  - RNASA-IMEDIR, Computer Science Faculty, University of Coruna, Coruna, 15071, Spain
  - Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), Campus de Elviña s/n 15071, A Coruña, Spain
  - Biomedical Research Institute of A Coruña (INIBIC), University Hospital Complex of A Coruña (CHUAC), 15006, A Coruña, Spain
- Alejandro Pazos
  - RNASA-IMEDIR, Computer Science Faculty, University of Coruna, Coruna, 15071, Spain
  - Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), Campus de Elviña s/n 15071, A Coruña, Spain
  - Biomedical Research Institute of A Coruña (INIBIC), University Hospital Complex of A Coruña (CHUAC), 15006, A Coruña, Spain
- Humberto Gonzáles-Díaz
  - Department of Organic Chemistry II, University of the Basque Country UPV/EHU, Leioa 48940, Biscay, Spain
  - IKERBASQUE, Basque Foundation for Science, Bilbao, 48011, Biscay, Spain
- César Paz-Y-Miño
  - Centro de Investigación Genética y Genómica, Facultad de Ciencias de la Salud Eugenio Espejo, Universidad UTE, Mariscal Sucre Avenue, Quito, 170129, Ecuador
- Santiago Guerrero
  - Centro de Investigación Genética y Genómica, Facultad de Ciencias de la Salud Eugenio Espejo, Universidad UTE, Mariscal Sucre Avenue, Quito, 170129, Ecuador
- Yunierkis Pérez-Castillo
  - Grupo de Bio-Quimioinformática, Universidad de Las Américas, Avenue de los Granados, Quito, 170125, Ecuador
  - Escuela de Ciencias Físicas y Matemáticas, Universidad de Las Américas, Avenue de los Granados, Quito, 170125, Ecuador
- Eduardo Tejera
  - Grupo de Bio-Quimioinformática, Universidad de Las Américas, Avenue de los Granados, Quito, 170125, Ecuador
  - Facultad de Ingeniería y Ciencias Agropecuarias, Universidad de Las Américas, Avenue de los Granados, Quito, 170125, Ecuador
- Cristian R Munteanu
  - RNASA-IMEDIR, Computer Science Faculty, University of Coruna, Coruna, 15071, Spain
  - Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), Campus de Elviña s/n 15071, A Coruña, Spain
  - Biomedical Research Institute of A Coruña (INIBIC), University Hospital Complex of A Coruña (CHUAC), 15006, A Coruña, Spain
7
Keyvanpour MR, Shirzad MB. An Analysis of QSAR Research Based on Machine Learning Concepts. Curr Drug Discov Technol 2020; 18:17-30. [PMID: 32178612 DOI: 10.2174/1570163817666200316104404] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 04/08/2019] [Revised: 08/22/2019] [Accepted: 10/28/2019] [Indexed: 11/22/2022]
Abstract
Quantitative Structure-Activity Relationship (QSAR) modeling is a popular approach developed to correlate chemical molecules with their biological activities based on their chemical structures. Machine learning techniques have proved to be promising solutions for QSAR modeling, and this area of research has therefore attracted much attention. A considerable amount of literature has been published on machine learning based QSAR modeling methodologies, yet the domain still lacks a recent and comprehensive analysis of these algorithms. This study systematically reviews the application of machine learning algorithms in QSAR, aiming to provide an analytical framework. For this purpose, we present a framework called 'ML-QSAR'. This framework has been designed for future research to: a) facilitate the selection of proper strategies among existing algorithms according to the requirements of the application area, b) help to develop and improve current methods, and c) provide a platform to study existing methodologies comparatively. In ML-QSAR, a structured categorization of QSAR modeling research based on machine learning models is presented first. Then several criteria are introduced in order to assess the models. Finally, a qualitative analysis is carried out based on these criteria.
Affiliation(s)
- Mehrnoush Barani Shirzad
  - Data Mining Research Laboratory, Department of Computer Engineering, Alzahra University, Tehran, Iran
8
Bonetta R, Valentino G. Machine learning techniques for protein function prediction. Proteins 2019; 88:397-413. [PMID: 31603244 DOI: 10.1002/prot.25832] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 03/19/2019] [Revised: 07/05/2019] [Accepted: 09/17/2019] [Indexed: 12/17/2022]
Abstract
Proteins play important roles in living organisms, and their function is directly linked with their structure. Due to the growing gap between the number of proteins being discovered and their functional characterization (in particular as a result of experimental limitations), reliable prediction of protein function through computational means has become crucial. This paper reviews the machine learning techniques used in the literature, following their evolution from simple algorithms such as logistic regression to more advanced methods like support vector machines and modern deep neural networks. Hyperparameter optimization methods adopted to boost prediction performance are presented. In parallel, the metamorphosis in the features used by these algorithms from classical physicochemical properties and amino acid composition, up to text-derived features from biomedical literature and learned feature representations using autoencoders, together with feature selection and dimensionality reduction techniques, are also reviewed. The success stories in the application of these techniques to both general and specific protein function prediction are discussed.
Affiliation(s)
- Rosalin Bonetta
  - Centre for Molecular Medicine and Biobanking, University of Malta, Msida, Malta
- Gianluca Valentino
  - Department of Communications and Computer Engineering, University of Malta, Msida, Malta
9
Concu R, Cordeiro MNDS. Alignment-Free Method to Predict Enzyme Classes and Subclasses. Int J Mol Sci 2019; 20:5389. [PMID: 31671806 PMCID: PMC6862210 DOI: 10.3390/ijms20215389] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 09/09/2019] [Revised: 10/21/2019] [Accepted: 10/23/2019] [Indexed: 01/03/2023] Open
Abstract
The Enzyme Classification (EC) number is a numerical classification scheme for enzymes, established using the chemical reactions they catalyze. This classification is based on the recommendation of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology. Six enzyme classes were recognised in the first Enzyme Classification and Nomenclature List, reported by the International Union of Biochemistry in 1961. However, a new enzyme group was recently added as the six existing EC classes could not describe enzymes involved in the movement of ions or molecules across membranes. Such enzymes are now classified in the new EC class of translocases (EC 7). Several computational methods have been developed in order to predict the EC number. However, due to this new change, all such methods are now outdated and need updating. In this work, we developed a new multi-task quantitative structure-activity relationship (QSAR) method aimed at predicting all 7 EC classes and subclasses. In so doing, we developed an alignment-free model based on artificial neural networks that proved to be very successful.
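The class/subclass structure of an EC number described above can be illustrated with a small Python parser. The seven top-level class names follow the current enzyme nomenclature; the function name, the returned dictionary layout, and the handling of unassigned positions (`-`) are illustrative assumptions and are unrelated to the authors' QSAR model.

```python
# The seven top-level EC classes, including translocases (EC 7).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(ec):
    """Split an EC number such as 'EC 7.1.2.2' into class and subclass.

    Unassigned positions written as '-' are returned as None.
    """
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    main = int(parts[0])
    if main not in EC_CLASSES:
        raise ValueError(f"unknown EC class: {main}")
    subclass = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None
    return {"class": main, "class_name": EC_CLASSES[main], "subclass": subclass}


translocase = parse_ec_number("EC 7.1.2.2")
partial = parse_ec_number("3.-.-.-")
```

A multi-task predictor of the kind described in the abstract would treat `class` and `subclass` as separate prediction targets.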
Affiliation(s)
- Riccardo Concu
  - LAQV@REQUIMTE/Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal.
- M Natália D S Cordeiro
  - LAQV@REQUIMTE/Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal.
10
Concu R, Cordeiro MNDS, Munteanu CR, González-Díaz H. PTML Model of Enzyme Subclasses for Mining the Proteome of Biofuel Producing Microorganisms. J Proteome Res 2019; 18:2735-2746. [DOI: 10.1021/acs.jproteome.8b00949] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/29/2022]
Affiliation(s)
- Riccardo Concu
  - LAQV@REQUIMTE/Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
- M. Natália D. S. Cordeiro
  - LAQV@REQUIMTE/Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
- Cristian R. Munteanu
  - RNASA-IMEDIR, Computer Science Faculty, University of A Coruña, 15071 A Coruña, Spain
  - INIBIC Biomedical Research Institute of Coruña, CHUAC University Hospital, 15006 A Coruña, Spain
- Humbert González-Díaz
  - Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Biscay, Spain
  - IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Biscay, Spain
11
Lin X, Huang X, Zhou L, Ren W, Zeng J, Yao W, Wang X. The Robust Classification Model Based on Combinatorial Features. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:650-657. [PMID: 29990202 DOI: 10.1109/tcbb.2017.2779512] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
Analyzing disease data from the viewpoint of combinatorial features may better characterize the disease phenotype. In this study, a novel method is proposed to construct feature combinations and a classification model (CFC-CM) by mining key feature relationships. CFC-CM iteratively tests for differences in the feature relationship between different groups. To do this, it uses a modified $k$-top-scoring pair (M-$k$-TSP) algorithm and then selects the most discriminative feature pairs in the current feature set to infer the combinatorial features and build the classification model. Compared with support vector machines, random forests, least absolute shrinkage and selection operator, elastic net, and M-$k$-TSP, the superior performance of CFC-CM on nine public gene expression datasets validates its potential for more precise identification of complex diseases. Subsequently, CFC-CM was applied to two metabolomics datasets, where it obtained accuracy rates of $88.73\pm 2.06\%$ and $79.11\pm 2.70\%$ in distinguishing between hepatocellular carcinoma and hepatic cirrhosis groups and between acute kidney injury (AKI) and non-AKI samples, respectively; these results were superior to those of the other five methods. In summary, the better results of CFC-CM show that, in contrast to single molecules and combinations of just two features, combinations inferred from an appropriate number of features can better identify complex diseases.
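As background for the top-scoring pair idea that M-$k$-TSP builds on, the following Python sketch implements the classical (unmodified) pair score: for features i and j, the absolute difference between the two classes in the frequency of samples with x_i < x_j. The modifications in M-$k$-TSP and the CFC-CM combination step are not described here and are not reproduced; function names and the toy data are illustrative assumptions.

```python
def tsp_score(samples, labels, i, j):
    """Classical top-scoring pair score for the feature pair (i, j).

    Returns |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|.
    """
    freq = {}
    for cls in (0, 1):
        rows = [s for s, l in zip(samples, labels) if l == cls]
        freq[cls] = sum(1 for r in rows if r[i] < r[j]) / len(rows)
    return abs(freq[0] - freq[1])


def top_scoring_pairs(samples, labels, k=1):
    """Return the k best (score, i, j) triples over all feature pairs."""
    n_features = len(samples[0])
    scored = [
        (tsp_score(samples, labels, i, j), i, j)
        for i in range(n_features)
        for j in range(i + 1, n_features)
    ]
    # Highest score first; ties are broken by feature indices for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:k]


# Toy data: features 0 and 1 swap their order between the two classes,
# feature 2 is uninformative noise.
samples = [[1, 5, 3], [2, 6, 9], [5, 1, 3], [6, 2, 9]]
labels = [0, 0, 1, 1]
best = top_scoring_pairs(samples, labels, k=1)
```

Because the score depends only on the relative order of two features within a sample, it is insensitive to monotone per-sample normalization, which is one reason pair-based rules are popular for expression data.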
12
Differential Gene Expression Analysis of RNA-seq Data Using Machine Learning for Cancer Research. Learning and Analytics in Intelligent Systems 2019. [DOI: 10.1007/978-3-030-15628-2_3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/22/2022]
13
Blanco JL, Porto-Pazos AB, Pazos A, Fernandez-Lozano C. Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection. Sci Rep 2018; 8:15688. [PMID: 30356060 PMCID: PMC6200741 DOI: 10.1038/s41598-018-33911-z] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 03/29/2018] [Accepted: 10/06/2018] [Indexed: 12/22/2022] Open
Abstract
Screening and in silico modeling are critical activities for the reduction of experimental costs. They also speed up research notably and strengthen the theoretical framework, thus allowing researchers to numerically quantify the importance of a particular subset of information. For example, in fields such as cancer and other highly prevalent diseases, having a reliable prediction method is crucial. The objective of this paper is to classify peptide sequences according to their anti-angiogenic activity to understand the underlying principles via machine learning. First, the peptide sequences were converted into three types of numerical molecular descriptors based on the amino acid composition. We performed different experiments with the descriptors and merged them to obtain baseline results for the performance of the models, particularly of each molecular descriptor subset. A feature selection process was applied to reduce the dimensionality of the problem and remove noisy features, which are highly present in biological problems. After a robust machine learning experimental design under equal conditions (nested resampling, cross-validation, hyperparameter tuning and different runs), we statistically significantly outperformed the best previously published anti-angiogenic model with a generalized linear model via coordinate descent (glmnet), achieving a mean AUC value greater than 0.96 and an accuracy of 0.86 with 200 molecular descriptors, mixed from the three groups. A final analysis with the top-40 discriminative anti-angiogenic activity peptides is presented along with a discussion of the feature selection process and the individual importance of each molecular descriptor. According to our findings, anti-angiogenic activity peptides are strongly associated with amino acid sequences SP, LSL, PF, DIT, PC, GH, RQ, QD, TC, SC, AS, CLD, ST, MF, GRE, IQ, CQ and HG.
Affiliation(s)
- Jose Liñares Blanco
  - Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain
- Ana B Porto-Pazos
  - Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain
  - Instituto de Investigación Biomédica de A Coruña (INIBIC), Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
- Alejandro Pazos
  - Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain
  - Instituto de Investigación Biomédica de A Coruña (INIBIC), Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
- Carlos Fernandez-Lozano
  - Department of Computer Science, Faculty of Computer Science, University of A Coruña, A Coruña, 15071, Spain
  - Instituto de Investigación Biomédica de A Coruña (INIBIC), Complexo Hospitalario Universitario de A Coruña, A Coruña, Spain
|
14
|
Chen Q, Meng Z, Liu X, Jin Q, Su R. Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE. Genes (Basel) 2018; 9:genes9060301. [PMID: 29914084 PMCID: PMC6027449 DOI: 10.3390/genes9060301] [Citation(s) in RCA: 52] [Impact Index Per Article: 8.7] [Received: 04/25/2018] [Revised: 05/30/2018] [Accepted: 06/06/2018] [Indexed: 11/24/2022] Open
Abstract
Feature selection, which identifies a set of the most informative features from the original feature space, has been widely used to simplify the predictor. Recursive feature elimination (RFE), one of the most popular feature selection approaches, is effective in reducing data dimensionality and increasing efficiency. RFE produces a ranking of features, as well as candidate subsets with their corresponding accuracy. The subset with the highest accuracy (HA) or a preset number of features (PreNum) is often used as the final subset. However, this may lead to a large number of features being selected, and if there is no prior knowledge about the preset number, the final subset selection is often ambiguous and subjective. A proper decision variant is in high demand to automatically determine the optimal subset. In this study, we conduct pioneering work to explore the decision variant after obtaining a list of candidate subsets from RFE. We provide a detailed analysis and comparison of several decision variants to automatically select the optimal feature subset. A random forest (RF)-based recursive feature elimination (RF-RFE) algorithm and a voting strategy are introduced. We validated the variants on two totally different molecular biology datasets, one for a toxicogenomic study and the other for protein sequence analysis. The study provides an automated way to determine the optimal feature subset when using RF-RFE.
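To make the subset-selection problem concrete, the sketch below applies the two rules named in the abstract (HA and PreNum) to a list of candidate subsets of the kind RFE produces. The third rule, `smallest_within_tolerance`, is a hypothetical illustration of an automatic decision rule and is not claimed to be one of the paper's variants; the candidate sizes and accuracies are invented for the example.

```python
def highest_accuracy(candidates):
    """HA rule: best accuracy wins; ties go to the smaller subset."""
    return max(candidates, key=lambda c: (c[1], -c[0]))


def preset_number(candidates, n_features):
    """PreNum rule: return the candidate with exactly n_features features."""
    for candidate in candidates:
        if candidate[0] == n_features:
            return candidate
    raise ValueError(f"no candidate subset with {n_features} features")


def smallest_within_tolerance(candidates, tolerance=0.01):
    """Hypothetical rule: smallest subset whose accuracy is within
    `tolerance` of the best accuracy (not taken from the paper)."""
    best = max(accuracy for _, accuracy in candidates)
    eligible = [c for c in candidates if best - c[1] <= tolerance]
    return min(eligible, key=lambda c: c[0])


# (number of features, cross-validated accuracy); invented values
candidates = [(50, 0.900), (40, 0.912), (30, 0.907), (20, 0.880), (10, 0.800)]

ha_choice = highest_accuracy(candidates)            # (40, 0.912)
prenum_choice = preset_number(candidates, 20)       # (20, 0.880)
tol_choice = smallest_within_tolerance(candidates)  # (30, 0.907)
```

The example shows the trade-off the abstract describes: HA keeps 40 features for a gain of 0.005 in accuracy over the 30-feature subset, while PreNum depends entirely on a number chosen in advance.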
Affiliation(s)
- Qi Chen
  - School of Computer Software, Tianjin University, Tianjin 300350, China.
  - The Military Transportation Command Department, Army Military Transportation University, Tianjin 300361, China.
- Zhaopeng Meng
  - School of Computer Software, Tianjin University, Tianjin 300350, China.
  - Tianjin University of Traditional Chinese Medicine, Tianjin 300193, China.
- Xinyi Liu
  - School of Computer Software, Tianjin University, Tianjin 300350, China.
- Qianguo Jin
  - School of Computer Software, Tianjin University, Tianjin 300350, China.
- Ran Su
  - School of Computer Software, Tianjin University, Tianjin 300350, China.
  - State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin 300074, China.
|
15
|
González-Durruthy M, Monserrat JM, Rasulev B, Casañola-Martín GM, Barreiro Sorrivas JM, Paraíso-Medina S, Maojo V, González-Díaz H, Pazos A, Munteanu CR. Carbon Nanotubes' Effect on Mitochondrial Oxygen Flux Dynamics: Polarography Experimental Study and Machine Learning Models using Star Graph Trace Invariants of Raman Spectra. Nanomaterials 2017; 7:nano7110386. [PMID: 29137126 PMCID: PMC5707603 DOI: 10.3390/nano7110386] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 10/07/2017] [Revised: 11/06/2017] [Accepted: 11/08/2017] [Indexed: 11/16/2022]
Abstract
This study presents the impact of carbon nanotubes (CNTs) on mitochondrial oxygen mass flux (Jm) under three experimental conditions. New experimental results and a new methodology are reported for the first time; they are based on the CNT Raman spectra star graph transform (spectral moments) and perturbation theory. The experimental measures of Jm showed that no tested CNT family can inhibit the oxygen consumption profiles of mitochondria. The best model for the prediction of Jm for other CNTs was provided by random forest using eight features, obtaining a test R-squared (R²) of 0.863 and a test root-mean-square error (RMSE) of 0.0461. The results demonstrate the capability of encoding CNT information into spectral moments of the Raman star graph (SG) transform, with potential applicability as predictive tools in nanotechnology and material risk assessments.
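For readers unfamiliar with the two reported metrics, the sketch below computes RMSE and R² from observed and predicted values. It assumes the coefficient-of-determination definition of R² (one minus the ratio of residual to total sum of squares); the abstract does not say which definition the authors used, and the numbers are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented observed and predicted values, for illustration only
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]

test_rmse = rmse(observed, predicted)     # sqrt(0.125), about 0.354
test_r2 = r_squared(observed, predicted)  # 1 - 0.5 / 5.0 = 0.9
```

RMSE is expressed in the units of the predicted quantity, so a value such as 0.0461 is only meaningful relative to the range of the measured flux.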
Affiliation(s)
- Michael González-Durruthy
  - Institute of Biological Science (ICB), Federal University of Rio Grande, Rio Grande, RS 96270-900, Brazil.
- Jose M Monserrat
  - Institute of Biological Science (ICB), Federal University of Rio Grande, Rio Grande, RS 96270-900, Brazil.
- Bakhtiyor Rasulev
  - Department of Coatings and Polymeric Materials, North Dakota State University (NDSU), Fargo, ND 58102, USA.
- José María Barreiro Sorrivas
  - Computer Science School (ETSIINF), Polytechnic University of Madrid (UPM), Calle de los Ciruelos, Boadilla del Monte, 28660 Madrid, Spain.
- Sergio Paraíso-Medina
  - Biomedical Informatics Group, Artificial Intelligence Department, Polytechnic University of Madrid, Calle de los Ciruelos, Boadilla del Monte, 28660 Madrid, Spain.
- Víctor Maojo
  - Biomedical Informatics Group, Artificial Intelligence Department, Polytechnic University of Madrid, Calle de los Ciruelos, Boadilla del Monte, 28660 Madrid, Spain.
- Humberto González-Díaz
  - Department of Organic Chemistry II, University of the Basque Country UPV/EHU, 48940 Leioa, Biscay, Spain.
  - IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Biscay, Spain.
- Alejandro Pazos
  - INIBIC Institute of Biomedical Research, CHUAC, UDC, 15006 Coruña, Spain.
  - RNASA-IMEDIR, Computer Sciences Faculty, University of Coruña, 15071 Coruña, Spain.
- Cristian R Munteanu
  - INIBIC Institute of Biomedical Research, CHUAC, UDC, 15006 Coruña, Spain.
  - RNASA-IMEDIR, Computer Sciences Faculty, University of Coruña, 15071 Coruña, Spain.
|
16
|
González-Durruthy M, Alberici LC, Curti C, Naal Z, Atique-Sawazaki DT, Vázquez-Naya JM, González-Díaz H, Munteanu CR. Experimental-Computational Study of Carbon Nanotube Effects on Mitochondrial Respiration: In Silico Nano-QSPR Machine Learning Models Based on New Raman Spectra Transform with Markov-Shannon Entropy Invariants. J Chem Inf Model 2017; 57:1029-1044. [PMID: 28414908 DOI: 10.1021/acs.jcim.6b00458] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Indexed: 12/19/2022]
Abstract
The study of the selective toxicity of carbon nanotubes (CNTs) on mitochondria (CNT-mitotoxicity) is of major interest for future biomedical applications. In the current work, the mitochondrial oxygen consumption (E3) is measured under three experimental conditions by exposure to pristine and oxidized CNTs (hydroxylated and carboxylated). Respiratory functional assays showed that the information from CNT Raman spectroscopy could be useful to predict structural parameters of mitotoxicity induced by CNTs. The in vitro functional assays show that the mitochondrial oxidative phosphorylation by ATP-synthase (or state V3 of respiration) was not perturbed in isolated rat-liver mitochondria. For the first time, a star graph (SG) transform of the CNT Raman spectra is proposed in order to obtain the raw information for a nano-QSPR model. Box-Jenkins and perturbation theory operators are used for the SG Shannon entropies. A modified RRegrs methodology is employed to test four regression methods: multiple linear regression (LM), partial least squares regression (PLS), neural networks regression (NN), and random forest (RF). RF provides the best models to predict the mitochondrial oxygen consumption in the presence of specific CNTs, with R² of 0.998-0.999 and RMSE of 0.0068-0.0133 (training and test subsets). This work is aimed at demonstrating that the SG transform of Raman spectra is useful to encode CNT information, similarly to the SG transform of the blood proteome spectra in cancer or electroencephalograms in epilepsy, and also as a prospective chemoinformatics tool for nanorisk assessment. All data files and R object models are available at https://dx.doi.org/10.6084/m9.figshare.3472349 .
Affiliation(s)
- José M Vázquez-Naya
  - RNASA-IMEDIR, Computer Science Faculty, University of A Coruna, Campus de Elviña s/n, 15071 A Coruña, Spain
- Humberto González-Díaz
  - Department of Organic Chemistry II, Faculty of Science and Technology, University of the Basque Country UPV/EHU, 48940, Leioa, Bizkaia, Spain
  - IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Bizkaia, Spain
- Cristian R Munteanu
  - RNASA-IMEDIR, Computer Science Faculty, University of A Coruna, Campus de Elviña s/n, 15071 A Coruña, Spain
  - Instituto de Investigación Biomédica de A Coruña (INIBIC), Complexo Hospitalario Universitario de A Coruña (CHUAC), A Coruña, 15006, Spain
|
17
|
McSkimming DI, Rasheed K, Kannan N. Classifying kinase conformations using a machine learning approach. BMC Bioinformatics 2017; 18:86. [PMID: 28152981 PMCID: PMC5290640 DOI: 10.1186/s12859-017-1506-2] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 11/03/2016] [Accepted: 01/28/2017] [Indexed: 02/07/2023] Open
Abstract
Background: Signaling proteins such as protein kinases adopt a diverse array of conformations to respond to regulatory signals in signaling pathways. Perhaps the most fundamental conformational change of a kinase is the transition between active and inactive states, and defining the conformational features associated with kinase activation is critical for selectively targeting abnormally regulated kinases in diseases. While manual examination of crystal structures has led to the identification of key structural features associated with kinase activation, the large number of kinase crystal structures (~3,500) and the extensive conformational diversity displayed by the protein kinase superfamily pose unique challenges in fully defining the conformational features associated with kinase activation. Although some computational approaches have been proposed, they are typically based on a small subset of crystal structures using measurements biased towards the active site geometry.
Results: We utilize an unbiased, informatics-based machine learning approach to classify all eukaryotic protein kinase conformations deposited in the PDB. We show that the orientation of the activation segment, measured by φ, ψ, χ1, and pseudo-dihedral angles, classifies kinase crystal conformations more accurately than existing methods. We show that the formation of the K-E salt bridge is statistically dependent upon the activation segment orientation and identify evolutionary differences between the activation segment conformation of tyrosine and serine/threonine kinases. We provide evidence that our method can identify conformational changes associated with the binding of allosteric regulatory proteins, and show that the greatest variation in inactive structures comes from kinase group and family specific side chain orientations.
Conclusion: We have provided the first comprehensive machine learning based classification of protein kinase active/inactive conformations, taking into account more structures and measurements than any previous classification effort. Further, our unbiased classification of inactive structures reveals residues associated with kinase functional specificity. To enable classification of new crystal structures, we have made our classifier publicly accessible through a stand-alone program housed at https://github.com/esbg/kinconform [DOI:10.5281/zenodo.249090].
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1506-2) contains supplementary material, which is available to authorized users.
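The angle features named in the Results can all be computed with the same four-point torsion formula. The sketch below is a minimal standard-library version, assuming a pseudo-dihedral is taken over four consecutive Cα positions (a common convention that the abstract does not spell out); the function name and coordinates are hypothetical and this is not the kinconform implementation.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed torsion angle (degrees, in [-180, 180]) defined by four points.

    Uses atan2(|b2| * b1.(b2 x b3), (b1 x b2).(b2 x b3)), where b1, b2, b3
    are the consecutive bond vectors. Collinear points give 0.0.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates for three fixed points; only the fourth varies
p0, p1, p2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)

cis_angle = dihedral_degrees(p0, p1, p2, (1.0, 0.0, 1.0))     # eclipsed
right_angle = dihedral_degrees(p0, p1, p2, (0.0, 1.0, 1.0))   # quarter turn
trans_angle = dihedral_degrees(p0, p1, p2, (-1.0, 0.0, 1.0))  # opposite
```

Because torsion angles are periodic, a classifier fed raw degrees treats -179 and 179 as far apart; encoding each angle as its sine and cosine is one common workaround, mentioned here as general practice rather than as the authors' choice.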
Affiliation(s)
- Khaled Rasheed
  - Department of Computer Science, University of Georgia, Athens, GA, 30602, USA
- Natarajan Kannan
  - Institute of Bioinformatics, University of Georgia, Athens, GA, 30602, USA.
  - Department of Biochemistry & Molecular Biology, University of Georgia, Athens, GA, 30602, USA.
|