1
|
Santos SDS, Torres M, Galeano D, Sánchez MDM, Cernuzzi L, Paccanaro A. Machine learning and network medicine approaches for drug repositioning for COVID-19. PATTERNS (NEW YORK, N.Y.) 2022; 3:100396. [PMID: 34778851 PMCID: PMC8576113 DOI: 10.1016/j.patter.2021.100396] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/18/2021] [Revised: 06/21/2021] [Accepted: 11/01/2021] [Indexed: 12/13/2022]
Abstract
We present two machine learning approaches for drug repurposing. While we have developed them for COVID-19, they are disease-agnostic. The two methodologies are complementary, targeting SARS-CoV-2 and host factors, respectively. Our first approach consists of a matrix factorization algorithm to rank broad-spectrum antivirals. Our second approach, based on network medicine, uses graph kernels to rank drugs according to the perturbation they induce on a subnetwork of the human interactome that is crucial for SARS-CoV-2 infection/replication. Our experiments show that our top predicted broad-spectrum antivirals include drugs indicated for compassionate use in COVID-19 patients; and that the ranking obtained by our kernel-based approach aligns with experimental data. Finally, we present the COVID-19 repositioning explorer (CoREx), an interactive online tool to explore the interplay between drugs and SARS-CoV-2 host proteins in the context of biological networks, protein function, drug clinical use, and Connectivity Map. CoREx is freely available at: https://paccanarolab.org/corex/.
Collapse
Affiliation(s)
- Suzana de Siqueira Santos
- Escola de Matemática Aplicada, Fundação Getulio Vargas, Rio de Janeiro 22250-900, Brazil
- COVID-19 International Research Team
| | - Mateo Torres
- Escola de Matemática Aplicada, Fundação Getulio Vargas, Rio de Janeiro 22250-900, Brazil
- COVID-19 International Research Team
| | - Diego Galeano
- Escola de Matemática Aplicada, Fundação Getulio Vargas, Rio de Janeiro 22250-900, Brazil
- Facultad de Ingenieria, Universidad Nacional de Asunción, Luque 110948, Paraguay
- COVID-19 International Research Team
| | | | - Luca Cernuzzi
- Universidad Católica “Nuestra Señora de la Asunción”, Asunción C.C. 1683, Paraguay
| | - Alberto Paccanaro
- Escola de Matemática Aplicada, Fundação Getulio Vargas, Rio de Janeiro 22250-900, Brazil
- Department of Computer Science, Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham Hill, Egham TW20 0EX, UK
- COVID-19 International Research Team
| |
Collapse
|
2
|
Wu J, Zhang Q, Li G. Identification of cancer-related module in protein-protein interaction network based on gene prioritization. J Bioinform Comput Biol 2021; 20:2150031. [PMID: 34860145 DOI: 10.1142/s0219720021500311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
With the rapid development of deep sequencing technologies, a large amount of high-throughput data has been available for studying the carcinogenic mechanism at the molecular level. It has been widely accepted that the development and progression of cancer are regulated by modules/pathways rather than individual genes. The investigation of identifying cancer-related active modules has received an extensive attention. In this paper, we put forward an identification method ModFinder by integrating both biological networks and gene expression profiles. More concretely, a gene scoring function is devised by using the regression model with [Formula: see text]-step random walk kernel, and the genes are ranked according to both of their active scores and degrees in the PPI network. Then a greedy algorithm NSEA is introduced to find an active module with high score and strong connectivity. Experiments were performed on both simulated data and real biological one, i.e. breast cancer and cervical cancer. Compared with the previous methods SigMod, LEAN and RegMod, ModFinder shows competitive performance. It can successfully identify a well-connected module that contains a large proportion of cancer-related genes, including some well-known oncogenes or tumor suppressors enriched in cancer-related pathways.
Collapse
Affiliation(s)
- Jingli Wu
- Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, P. R. China.,Yimeng Executive Leadership Academy, Linyi 276000, P. R. China
| | - Qi Zhang
- College of Computer Science and Information Engineering, Guangxi Normal University, Guilin 541004, P. R. China
| | - Gaoshi Li
- Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, P. R. China
| |
Collapse
|
3
|
Network modeling of patients' biomolecular profiles for clinical phenotype/outcome prediction. Sci Rep 2020; 10:3612. [PMID: 32107391 PMCID: PMC7046773 DOI: 10.1038/s41598-020-60235-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2019] [Accepted: 11/05/2019] [Indexed: 12/15/2022] Open
Abstract
Methods for phenotype and outcome prediction are largely based on inductive supervised models that use selected biomarkers to make predictions, without explicitly considering the functional relationships between individuals. We introduce a novel network-based approach named Patient-Net (P-Net) in which biomolecular profiles of patients are modeled in a graph-structured space that represents gene expression relationships between patients. Then a kernel-based semi-supervised transductive algorithm is applied to the graph to explore the overall topology of the graph and to predict the phenotype/clinical outcome of patients. Experimental tests involving several publicly available datasets of patients afflicted with pancreatic, breast, colon and colorectal cancer show that our proposed method is competitive with state-of-the-art supervised and semi-supervised predictive systems. Importantly, P-Net also provides interpretable models that can be easily visualized to gain clues about the relationships between patients, and to formulate hypotheses about their stratification.
Collapse
|
4
|
Frasca M. Gene2DisCo: Gene to disease using disease commonalities. Artif Intell Med 2017; 82:34-46. [DOI: 10.1016/j.artmed.2017.08.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2017] [Revised: 07/24/2017] [Accepted: 08/13/2017] [Indexed: 01/10/2023]
|
5
|
Ma C, Chen Y, Wilkins D, Chen X, Zhang J. An unsupervised learning approach to find ovarian cancer genes through integration of biological data. BMC Genomics 2015; 16 Suppl 9:S3. [PMID: 26328548 PMCID: PMC4547402 DOI: 10.1186/1471-2164-16-s9-s3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Cancer is a disease characterized largely by the accumulation of out-of-control somatic mutations during the lifetime of a patient. Distinguishing driver mutations from passenger mutations has posed a challenge in modern cancer research. With the advanced development of microarray experiments and clinical studies, a large numbers of candidate cancer genes have been extracted and distinguishing informative genes out of them is essential. As a matter of fact, we proposed to find the informative genes for cancer by using mutation data from ovarian cancers in our framework. In our model we utilized the patient gene mutation profile, gene expression data and gene gene interactions network to construct a graphical representation of genes and patients. Markov processes for mutation and patients are triggered separately. After this process, cancer genes are prioritized automatically by examining their scores at their stationary distributions in the eigenvector. Extensive experiments demonstrate that the integration of heterogeneous sources of information is essential in finding important cancer genes.
Collapse
|
6
|
Frasca M, Bassis S, Valentini G. Learning node labels with multi-category Hopfield networks. Neural Comput Appl 2015. [DOI: 10.1007/s00521-015-1965-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/02/2023]
|
7
|
Abstract
The challenging task of studying and modeling complex dynamics of biological systems in order to describe various human diseases has gathered great interest in recent years. Major biological processes are mediated through protein interactions, hence there is a need to understand the chaotic network that forms these processes in pursuance of understanding human diseases. The applications of protein interaction networks to disease datasets allow the identification of genes and proteins associated with diseases, the study of network properties, identification of subnetworks, and network-based disease gene classification. Although various protein interaction network analysis strategies have been employed, grand challenges are still existing. Global understanding of protein interaction networks via integration of high-throughput functional genomics data from different levels will allow researchers to examine the disease pathways and identify strategies to control them. As a result, it seems likely that more personalized, more accurate and more rapid disease gene diagnostic techniques will be devised in the future, as well as novel strategies that are more personalized. This mini-review summarizes the current practice of protein interaction networks in medical research as well as challenges to be overcome.
Collapse
Affiliation(s)
- Tuba Sevimoglu
- Department of Bioengineering, Marmara University, Goztepe, 34722 Istanbul, Turkey
| | - Kazim Yalcin Arga
- Department of Bioengineering, Marmara University, Goztepe, 34722 Istanbul, Turkey
| |
Collapse
|
8
|
Valentini G, Paccanaro A, Caniza H, Romero AE, Re M. An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods. Artif Intell Med 2014; 61:63-78. [PMID: 24726035 PMCID: PMC4070077 DOI: 10.1016/j.artmed.2014.03.003] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2013] [Revised: 03/05/2014] [Accepted: 03/10/2014] [Indexed: 02/07/2023]
Abstract
OBJECTIVE In the context of "network medicine", gene prioritization methods represent one of the main tools to discover candidate disease genes by exploiting the large amount of data covering different types of functional relationships between genes. Several works proposed to integrate multiple sources of data to improve disease gene prioritization, but to our knowledge no systematic studies focused on the quantitative evaluation of the impact of network integration on gene prioritization. In this paper, we aim at providing an extensive analysis of gene-disease associations not limited to genetic disorders, and a systematic comparison of different network integration methods for gene prioritization. MATERIALS AND METHODS We collected nine different functional networks representing different functional relationships between genes, and we combined them through both unweighted and weighted network integration methods. We then prioritized genes with respect to each of the considered 708 medical subject headings (MeSH) diseases by applying classical guilt-by-association, random walk and random walk with restart algorithms, and the recently proposed kernelized score functions. RESULTS The results obtained with classical random walk algorithms and the best single network achieved an average area under the curve (AUC) across the 708 MeSH diseases of about 0.82, while kernelized score functions and network integration boosted the average AUC to about 0.89. Weighted integration, by exploiting the different "informativeness" embedded in different functional networks, outperforms unweighted integration at 0.01 significance level, according to the Wilcoxon signed rank sum test. For each MeSH disease we provide the top-ranked unannotated candidate genes, available for further bio-medical investigation. CONCLUSIONS Network integration is necessary to boost the performances of gene prioritization methods. Moreover the methods based on kernelized score functions can further enhance disease gene ranking results, by adopting both local and global learning strategies, able to exploit the overall topology of the network.
Collapse
Affiliation(s)
- Giorgio Valentini
- AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, via Comelico 39/41, 20135 Milano, Italy.
| | - Alberto Paccanaro
- Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK
| | - Horacio Caniza
- Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK
| | - Alfonso E Romero
- Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham TW20 0EX, UK
| | - Matteo Re
- AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, via Comelico 39/41, 20135 Milano, Italy
| |
Collapse
|
9
|
Valentini G. Hierarchical ensemble methods for protein function prediction. ISRN BIOINFORMATICS 2014; 2014:901419. [PMID: 25937954 PMCID: PMC4393075 DOI: 10.1155/2014/901419] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/02/2014] [Accepted: 02/25/2014] [Indexed: 12/11/2022]
Abstract
Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the unbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods that showed significantly better performances than hierarchical-unaware "flat" prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term and then the resulting predictions are assembled in a "consensus" ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research.
Collapse
Affiliation(s)
- Giorgio Valentini
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
| |
Collapse
|
10
|
Mesiti M, Re M, Valentini G. Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction. Gigascience 2014; 3:5. [PMID: 24843788 PMCID: PMC4006453 DOI: 10.1186/2047-217x-3-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2013] [Accepted: 04/01/2014] [Indexed: 01/08/2023] Open
Abstract
Background Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satisfiable with simple off-the-shelf computers. Results We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins. Conclusions The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines.
Collapse
Affiliation(s)
- Marco Mesiti
- AnacletoLab - Department of Computer Science, University of Milano, Via Comelico 39/41, 20135 Milano, Italy
| | - Matteo Re
- AnacletoLab - Department of Computer Science, University of Milano, Via Comelico 39/41, 20135 Milano, Italy
| | - Giorgio Valentini
- AnacletoLab - Department of Computer Science, University of Milano, Via Comelico 39/41, 20135 Milano, Italy
| |
Collapse
|
11
|
Re M, Valentini G. Network-based drug ranking and repositioning with respect to DrugBank therapeutic categories. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1359-1371. [PMID: 24407295 DOI: 10.1109/tcbb.2013.62] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Drug repositioning is a challenging computational problem involving the integration of heterogeneous sources of biomolecular data and the design of label ranking algorithms able to exploit the overall topology of the underlying pharmacological network. In this context, we propose a novel semisupervised drug ranking problem: prioritizing drugs in integrated biochemical networks according to specific DrugBank therapeutic categories. Algorithms for drug repositioning usually perform the inference step into an inhomogeneous similarity space induced by the relationships existing between drugs and a second type of entity (e.g., disease, target, ligand set), thus making unfeasible a drug ranking within a homogeneous pharmacological space. To deal with this problem, we designed a general framework based on bipartite network projections by which homogeneous pharmacological networks can be constructed and integrated from heterogeneous and complementary sources of chemical, biomolecular and clinical information. Moreover, we present a novel algorithmic scheme based on kernelized score functions that adopts both local and global learning strategies to effectively rank drugs in the integrated pharmacological space using different network combination methods. Detailed experiments with more than 80 DrugBank therapeutic categories involving about 1,300 FDA-approved drugs show the effectiveness of the proposed approach.
Collapse
Affiliation(s)
- Matteo Re
- Universita degli Studi di Milano, Milano
| | | |
Collapse
|
12
|
Frasca M, Bertoni A, Re M, Valentini G. A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Netw 2013; 43:84-98. [DOI: 10.1016/j.neunet.2013.01.021] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2012] [Revised: 01/28/2013] [Accepted: 01/29/2013] [Indexed: 01/03/2023]
|
13
|
Re M, Mesiti M, Valentini G. A fast ranking algorithm for predicting gene functions in biomolecular networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1812-1818. [PMID: 23221088 DOI: 10.1109/tcbb.2012.114] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
Ranking genes in functional networks according to a specific biological function is a challenging task raising relevant performance and computational complexity problems. To cope with both these problems we developed a transductive gene ranking method based on kernelized score functions able to fully exploit the topology and the graph structure of biomolecular networks and to capture significant functional relationships between genes. We run the method on a network constructed by integrating multiple biomolecular data sources in the yeast model organism, achieving significantly better results than the compared state-of-the-art network-based algorithms for gene function prediction, and with relevant savings in computational time. The proposed approach is general and fast enough to be in perspective applied to other relevant node ranking problems in large and complex biological networks.
Collapse
Affiliation(s)
- Matteo Re
- Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39/41, I-20135 Milano, Italia.
| | | | | |
Collapse
|