1
Macorano A, Mazzolari A, Malloci G, Pedretti A, Vistoli G, Gervasoni S. An improved dataset of force fields, electronic and physicochemical descriptors of metabolic substrates. Sci Data 2024; 11:929. [PMID: 39191771 PMCID: PMC11349763 DOI: 10.1038/s41597-024-03707-0] [Received: 03/20/2024] [Accepted: 07/30/2024] [Indexed: 08/29/2024]
Abstract
In silico prediction of xenobiotic metabolism is an important strategy to accelerate the drug discovery process, as candidate compounds often fail in clinical phases due to their poor pharmacokinetic profiles. Here we present MetaQM, a dataset of quantum-mechanical (QM) optimized metabolic substrates, including force field parameters, electronic and physicochemical properties. MetaQM comprises 2054 metabolic substrates extracted from the MetaQSAR database. We provide QM-optimized geometries, General Amber Force Field (FF) parameters for all studied molecules, and an extended set of structural and physicochemical descriptors as calculated by DFT and PM7 methods. The generated data can be used in different types of analysis. FF parameters can be applied to perform classical molecular mechanics calculations as exemplified by the validating molecular dynamics simulations reported here. The calculated descriptors can represent input features for developing improved predictive models for metabolism and drug design, as exemplified in this work. Finally, the QM-optimized molecular structures are valuable starting points for both ligand- and structure-based analyses such as pharmacophore mapping and docking simulations.
Affiliation(s)
- Alessio Macorano
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, via Mangiagalli 25, 20133, Milano, Italy
- Angelica Mazzolari
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, via Mangiagalli 25, 20133, Milano, Italy
- Giuliano Malloci
  - Dipartimento di Fisica, Università degli Studi di Cagliari, Cittadella Universitaria, S.P. Monserrato-Sestu Km 0.7, I-09042, Monserrato, CA, Italy
- Alessandro Pedretti
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, via Mangiagalli 25, 20133, Milano, Italy
- Giulio Vistoli
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, via Mangiagalli 25, 20133, Milano, Italy
- Silvia Gervasoni
  - Dipartimento di Fisica, Università degli Studi di Cagliari, Cittadella Universitaria, S.P. Monserrato-Sestu Km 0.7, I-09042, Monserrato, CA, Italy
2
Mazzolari A, Perazzoni P, Sabato E, Lunghini F, Beccari AR, Vistoli G, Pedretti A. MetaSpot: A General Approach for Recognizing the Reactive Atoms Undergoing Metabolic Reactions Based on the MetaQSAR Database. Int J Mol Sci 2023; 24:11064. [PMID: 37446241 DOI: 10.3390/ijms241311064] [Received: 05/31/2023] [Revised: 06/27/2023] [Accepted: 06/28/2023] [Indexed: 07/15/2023]
Abstract
The prediction of drug metabolism is attracting great interest for the possibility of discarding molecules with unfavorable ADME/Tox profiles at an early stage of the drug discovery process. In this context, artificial intelligence methods can generate highly performing predictive models if they are trained on accurate metabolic data. MetaQSAR-based datasets were collected to predict the sites of metabolism for most metabolic reactions. The models were based on a set of structural, physicochemical, and stereo-electronic descriptors and were generated by the random forest algorithm. For each considered biotransformation, two types of models were developed: the first type involved all non-reactive atoms and included atom types among the descriptors, while the second type involved only non-reactive centers having the same atom type(s) as the reactive atoms. All the models of the first type showed very high performance; the models of the second type show, on average, worse performance while being almost always able to recognize the reactive centers; only conjugations with glucuronic acid are unsatisfactorily predicted by the models of the second type. Feature evaluation confirms the major role of lipophilicity, self-polarizability, and H-bonding for almost all considered reactions. The obtained results emphasize the possibility of recognizing the sites of metabolism by classification models trained on the MetaQSAR database. The two types of models can be synergistically combined since the first models identify which atoms can undergo a given metabolic reaction, while the second models detect the truly reactive centers. The generated models are available as scripts for the VEGA program.
Affiliation(s)
- Angelica Mazzolari
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, Via Luigi Mangiagalli, 25, I-20133 Milano, Italy
- Pietro Perazzoni
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, Via Luigi Mangiagalli, 25, I-20133 Milano, Italy
- Emanuela Sabato
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, Via Luigi Mangiagalli, 25, I-20133 Milano, Italy
- Filippo Lunghini
  - EXSCALATE, Dompé Farmaceutici S.p.A., Via Tommaso De Amicis, 95, I-80131 Napoli, Italy
- Andrea R Beccari
  - EXSCALATE, Dompé Farmaceutici S.p.A., Via Tommaso De Amicis, 95, I-80131 Napoli, Italy
- Giulio Vistoli
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, Via Luigi Mangiagalli, 25, I-20133 Milano, Italy
- Alessandro Pedretti
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, Via Luigi Mangiagalli, 25, I-20133 Milano, Italy
3
Pérez-Pérez M, Ferreira T, Igrejas G, Fdez-Riverola F. A novel gluten knowledge base of potential biomedical and health-related interactions extracted from the literature: using machine learning and graph analysis methodologies to reconstruct the bibliome. J Biomed Inform 2023:104398. [PMID: 37230405 DOI: 10.1016/j.jbi.2023.104398] [Received: 10/17/2022] [Revised: 05/12/2023] [Accepted: 05/15/2023] [Indexed: 05/27/2023]
Abstract
BACKGROUND: Despite their nutritional properties and broad availability, cereal crops have been associated with different alimentary disorders and symptoms, with most of the responsibility attributed to gluten. Gluten-related literature therefore continues to be produced at ever-growing rates, driven in part by recent exploratory studies that link gluten to non-traditional diseases and by the popularity of gluten-free diets, making it increasingly difficult to access and analyse practical and structured information. In this sense, the accelerated discovery of novel advances in diagnosis and treatment, as well as exploratory studies, produces a favourable scenario for disinformation and misinformation. OBJECTIVES: Aligned with the European Union strategy "Delivering on EU Food Safety and Nutrition in 2050", which emphasizes the inextricable links between imbalanced diets, the increased exposure to unreliable sources of information and misleading information, and the increased dependency on reliable sources of information, this paper presents GlutKNOIS, a public and interactive literature-based database that reconstructs and represents the experimental biomedical knowledge extracted from the gluten-related literature. The platform integrates knowledge from external databases, bibliometric statistics and social media discussion to offer a novel and enhanced way to search, visualise and analyse potential biomedical and health-related interactions in the gluten domain.
METHODS: For this purpose, the study applies a semi-supervised curation workflow that combines natural language processing techniques, machine learning algorithms, ontology-based normalization and integration approaches, named entity recognition methods, and graph knowledge reconstruction methodologies to process, classify, represent and analyse the experimental findings contained in the literature, complemented by data from the social discussion. RESULTS AND CONCLUSIONS: In total, 5,814 documents were manually annotated and 7,424 were fully automatically processed to reconstruct the first online gluten-related knowledge database of evidenced health-related interactions that produce health or metabolic changes based on the literature. In addition, the automatic processing of the literature combined with the proposed knowledge representation methodologies has the potential to assist in the revision and analysis of years of gluten research. The reconstructed knowledge base is public and accessible at https://sing-group.org/glutknois/.
Affiliation(s)
- Martín Pérez-Pérez
  - CINBIO, Universidade de Vigo, Department of Computer Science, ESEI - Escuela Superior de Ingeniería Informática, 32004 Ourense, España
  - SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain
- Tânia Ferreira
  - Department of Genetics and Biotechnology, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
  - Functional Genomics and Proteomics Unit, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
- Gilberto Igrejas
  - Department of Genetics and Biotechnology, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
  - Functional Genomics and Proteomics Unit, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
  - LAQV-REQUIMTE, Faculty of Science and Technology, Nova University of Lisbon, Lisbon, Portugal
- Florentino Fdez-Riverola
  - CINBIO, Universidade de Vigo, Department of Computer Science, ESEI - Escuela Superior de Ingeniería Informática, 32004 Ourense, España
  - SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain
4
Jing Y, Feng B, Gao J, Li J, Zhou G, Sun Z, Wang Y. BLAB2CancerKD: a knowledge graph database focusing on the association between lactic acid bacteria and cancer, but beyond. Database (Oxford) 2023; 2023:7176387. [PMID: 37221044 DOI: 10.1093/database/baad036] [Received: 01/01/2023] [Revised: 04/23/2023] [Accepted: 04/28/2023] [Indexed: 05/25/2023]
Abstract
In a broad sense, lactic acid bacteria (LAB) is a general term for Gram-positive bacteria that can produce lactic acid by utilizing fermentable carbohydrates. They are widely used in essential fields such as industry, agriculture, animal husbandry and medicine. At the same time, LAB are closely related to human health: they can regulate human intestinal flora and improve gastrointestinal function and body immunity. Cancer, a disease in which some cells grow out of control and spread to other body parts, is one of the leading causes of human death worldwide. In recent years, the potential of LAB in cancer treatment has attracted attention. Mining knowledge from the scientific literature can significantly accelerate the application of LAB in cancer treatment. Using 7794 published studies on LAB and cancer as source data, we have processed 16 543 biomedical concepts and 23 091 associations by using automatic text mining tools combined with manual curation by domain experts. An ontology containing 31 434 pieces of structured data was constructed. Finally, based on this ontology, a knowledge graph (KG) database, called Beyond 'Lactic acid bacteria to Cancer Knowledge graph Database' (BLAB2CancerKD), was constructed by using KG and web technology. BLAB2CancerKD presents all the relevant knowledge intuitively and clearly in various data presentation forms, and its interactive functions make it more efficient to use. BLAB2CancerKD will be continuously updated to advance the research and application of LAB in cancer therapy. Researchers can visit BLAB2CancerKD at the database URL http://110.40.139.2:18095/.
Affiliation(s)
- Yi Jing
  - Faculty of Science, The University of New South Wales, High Street, Sydney, New South Wales 2052, Australia
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
- Baiyang Feng
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Erdos East Street No. 29, Hohhot 010011, China
- Jing Gao
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Erdos East Street No. 29, Hohhot 010011, China
  - Inner Mongolia Autonomous Region Big Data Center, Chilechuan Street No. 1, Hohhot 010091, China
- Jin Li
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Erdos East Street No. 29, Hohhot 010011, China
- Ganghui Zhou
  - Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Zhaowuda Road No. 306, Hohhot 010018, China
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Erdos East Street No. 29, Hohhot 010011, China
- Zhihong Sun
  - College of Food Science and Engineering, Inner Mongolia Agricultural University, Zhaowuda Road No. 306, Hohhot 010018, China
- Yufei Wang
  - The Affiliated Hospital of Inner Mongolia Medical University, Tongdao North road No.1, Hohhot 010050, China
5
Rodriguez-Esteban R. New reasons for biologists to write with a formal language. Database (Oxford) 2022; 2022:6600538. [PMID: 35657112 PMCID: PMC9216469 DOI: 10.1093/database/baac039] [Received: 01/25/2022] [Revised: 03/18/2022] [Accepted: 05/17/2022] [Indexed: 12/03/2022]
Abstract
Current biological writing is afflicted by the use of ambiguous names, convoluted sentences, vague statements and narrative-fitted storylines. This represents a challenge for biological research in general and in particular for fields such as biological database curation and text mining, which have been tasked to cope with exponentially growing content. Improving the quality of biological writing by encouraging unambiguity and precision would foster expository discipline and machine reasoning. More specifically, the routine inclusion of formal languages in biological writing would improve our ability to describe, compile and model biology.
Affiliation(s)
- Raul Rodriguez-Esteban
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Grenzacherstrasse 124, Basel 4070, Switzerland
6
Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: The case of gluten bibliome. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.10.100] [Indexed: 11/23/2022]
7
MetaClass, a Comprehensive Classification System for Predicting the Occurrence of Metabolic Reactions Based on the MetaQSAR Database. Molecules 2021; 26:5857. [PMID: 34641400 PMCID: PMC8512547 DOI: 10.3390/molecules26195857] [Received: 07/28/2021] [Revised: 09/20/2021] [Accepted: 09/21/2021] [Indexed: 11/25/2022]
Abstract
(1) Background: Machine learning algorithms are finding fruitful applications in predicting the ADME profile of new molecules, with a particular focus on metabolism predictions. However, the development of comprehensive metabolism predictors is hampered by the lack of highly accurate metabolic resources. Hence, we recently proposed a manually curated metabolic database (MetaQSAR), the level of accuracy of which is well suited to the development of predictive models. (2) Methods: MetaQSAR was used to extract datasets to predict the metabolic reactions subdivided into major classes, classes and subclasses. The collected datasets comprised a total of 3788 first-generation metabolic reactions. Predictive models were developed by using standard random forest algorithms and sets of physicochemical, stereo-electronic and constitutional descriptors. (3) Results: The developed models showed satisfactory performance, especially for hydrolyses and conjugations, while redox reactions were predicted with greater difficulty, which was reasonable as they depend on many complex features that are not properly encoded by the included descriptors. (4) Conclusions: The generated models allowed a precise comparison of the propensity of each metabolic reaction to be predicted and the factors affecting their predictability were discussed in detail. Overall, the study led to the development of a freely downloadable global predictor, MetaClass, which correctly predicts 80% of the reported reactions, as assessed by an explorative validation analysis on an external dataset, with an overall MCC = 0.44.
8
Krantz M, Zimmer D, Adler SO, Kitashova A, Klipp E, Mühlhaus T, Nägele T. Data Management and Modeling in Plant Biology. Front Plant Sci 2021; 12:717958. [PMID: 34539712 PMCID: PMC8446634 DOI: 10.3389/fpls.2021.717958] [Received: 05/31/2021] [Accepted: 07/29/2021] [Indexed: 05/25/2023]
Abstract
The study of plant-environment interactions is a multidisciplinary research field. With the emergence of quantitative large-scale and high-throughput techniques, the amount and dimensionality of experimental data have strongly increased. Appropriate strategies for data storage, management, and evaluation are needed to make efficient use of experimental findings. Computational approaches of data mining are essential for deriving statistical trends and signatures contained in data matrices. Although current biology is challenged by high data dimensionality in general, this is particularly true for plant biology. Plants, as sessile organisms, have to cope with environmental fluctuations. This typically results in strong dynamics of metabolite and protein concentrations, which are often challenging to quantify. Summarizing experimental output results in complex data arrays, which need computational statistics and numerical methods for building quantitative models. Experimental findings need to be combined by computational models to gain a mechanistic understanding of plant metabolism. For this, bioinformatics and mathematics need to be combined with experimental setups in physiology, biochemistry, and molecular biology. This review presents and discusses concepts at the interface of experiment and computation, which are likely to shape current and future plant biology. Finally, this interface is discussed with regard to its capabilities and limitations to develop a quantitative model of plant-environment interactions.
Affiliation(s)
- Maria Krantz
  - Theoretical Biophysics, Institute of Biology, Humboldt-Universität zu Berlin, Berlin, Germany
- David Zimmer
  - Computational Systems Biology, Technische Universität Kaiserslautern, Kaiserslautern, Germany
- Stephan O. Adler
  - Theoretical Biophysics, Institute of Biology, Humboldt-Universität zu Berlin, Berlin, Germany
- Anastasia Kitashova
  - Plant Evolutionary Cell Biology, Faculty of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
- Edda Klipp
  - Theoretical Biophysics, Institute of Biology, Humboldt-Universität zu Berlin, Berlin, Germany
- Timo Mühlhaus
  - Computational Systems Biology, Technische Universität Kaiserslautern, Kaiserslautern, Germany
- Thomas Nägele
  - Plant Evolutionary Cell Biology, Faculty of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
9
Foerster H, Battey JND, Sierro N, Ivanov NV, Mueller LA. Metabolic networks of the Nicotiana genus in the spotlight: content, progress and outlook. Brief Bioinform 2021; 22:bbaa136. [PMID: 32662816 PMCID: PMC8138835 DOI: 10.1093/bib/bbaa136] [Received: 03/25/2020] [Revised: 05/19/2020] [Accepted: 06/04/2020] [Indexed: 01/09/2023]
Abstract
Manually curated metabolic databases residing at the Sol Genomics Network comprise two taxon-specific databases, SolanaCyc for the Solanaceae family and NicotianaCyc for the genus Nicotiana, as well as six species-specific databases for Nicotiana tabacum TN90, N. tabacum K326, Nicotiana benthamiana, N. sylvestris, N. tomentosiformis and N. attenuata. New pathways were created through the extraction, examination and verification of related data from the literature, with the aid of external databases and guided by an expert-led curation process. Here we describe the curation progress that has been achieved in these databases since the first release version 1.0 in 2016, the curation flow and the curation process using the example metabolic pathway for cholesterol in plants. The current content of our databases comprises 266 pathways and 36 superpathways in SolanaCyc and 143 pathways plus 21 superpathways in NicotianaCyc, manually curated and validated specifically for the Solanaceae family and Nicotiana genus, respectively. The curated data have been propagated to the respective Nicotiana-specific databases, which resulted in the enrichment and more accurate presentation of their metabolic networks. The quality and coverage in those databases have been compared with related external databases and discussed in terms of literature support and metabolic content.
10
Mazzolari A, Sommaruga L, Pedretti A, Vistoli G. MetaTREE, a Novel Database Focused on Metabolic Trees, Predicts an Important Detoxification Mechanism: The Glutathione Conjugation. Molecules 2021; 26:2098. [PMID: 33917533 PMCID: PMC8038802 DOI: 10.3390/molecules26072098] [Received: 02/23/2021] [Revised: 03/22/2021] [Accepted: 03/30/2021] [Indexed: 02/07/2023]
Abstract
(1) Background: Data accuracy plays a key role in determining model performance, and the field of metabolism prediction suffers from the lack of truly reliable data. To enhance the accuracy of metabolic data, we recently proposed a manually curated database collected by a meta-analysis of the specialized literature (MetaQSAR). Here we aim to further increase data accuracy by focusing on publications reporting exhaustive metabolic trees. This selection should indeed reduce the number of false negative data. (2) Methods: A new metabolic database (MetaTREE) was thus collected and utilized to extract a dataset of metabolic data concerning glutathione conjugation (MT-dataset). After proper pre-processing, this dataset, along with the corresponding dataset extracted from MetaQSAR (MQ-dataset), was utilized to develop binary classification models using a random forest algorithm. (3) Results: The comparison of the models generated from the two collected datasets reveals the better performance reached with the MT-dataset (MCC rose from 0.63 to 0.67, sensitivity from 0.56 to 0.58). The analysis of the applicability domain also confirms that the model based on the MT-dataset shows a more robust predictive power with a larger applicability domain. (4) Conclusions: These results confirm that focusing on metabolic trees represents a convenient approach to increase data accuracy by reducing the false negative cases. The encouraging performance of the models developed with the MT-dataset invites the use of MetaTREE for predictive studies in the field of xenobiotic metabolism.
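As an illustration of the two metrics quoted in this abstract, the following minimal sketch computes the Matthews correlation coefficient (MCC) and the sensitivity from the confusion matrix of a binary classifier. The function name `binary_metrics` and any counts passed to it are hypothetical; nothing here is taken from the MetaTREE study itself.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return (MCC, sensitivity) for a binary confusion matrix.

    tp/tn/fp/fn are the counts of true positives, true negatives,
    false positives and false negatives (hypothetical inputs).
    """
    # Sensitivity (recall): fraction of actual positives that are recovered.
    positives = tp + fn
    sensitivity = tp / positives if positives else 0.0

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal sum is zero.
    mcc = numerator / denominator if denominator else 0.0
    return mcc, sensitivity
```

Because the MCC uses all four cells of the confusion matrix, it can stay informative on unbalanced datasets where sensitivity alone would not reveal an excess of false positives.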
Affiliation(s)
- Angelica Mazzolari
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, Via Mangiagalli 25, I-20133 Milano, Italy
11
Turina P, Fariselli P, Capriotti E. ThermoScan: Semi-automatic Identification of Protein Stability Data From PubMed. Front Mol Biosci 2021; 8:620475. [PMID: 33842537 PMCID: PMC8027235 DOI: 10.3389/fmolb.2021.620475] [Received: 10/23/2020] [Accepted: 02/18/2021] [Indexed: 11/13/2022]
Abstract
In recent years, the increasing number of DNA sequencing and protein mutagenesis studies has generated a large amount of variation data published in the biomedical literature. The collection of such data has been essential for the development and assessment of tools predicting the impact of protein variants at functional and structural levels. Nevertheless, the collection of manually curated data from the literature is a highly time-consuming and costly process that requires domain experts. In particular, the development of methods for predicting the effect of amino acid variants on protein stability relies on the thermodynamic data extracted from the literature. In the past, such data were deposited in the ProTherm database, which, however, has not been maintained since 2013. To facilitate the collection of protein thermodynamic data from the literature, we developed the semi-automatic tool ThermoScan. ThermoScan is a text mining approach for the identification of relevant thermodynamic data on protein stability from full-text articles. The method relies on a regular expression searching for groups of words, including the most common conceptual words appearing in experimental studies on protein stability, several thermodynamic variables, and their units of measure. ThermoScan analyzes full-text articles from the PubMed Central Open Access subset and calculates an empirical score that allows the identification of manuscripts reporting thermodynamic data on protein stability. The method was optimized on a set of publications included in the ProTherm database, and tested on a new curated set of articles, manually selected for the presence of thermodynamic data. The results show that ThermoScan returns accurate predictions and outperforms recently developed text-mining algorithms based on the analysis of publication abstracts. Availability: The ThermoScan server is freely accessible online at https://folding.biofold.org/thermoscan.
The ThermoScan Python code and the Google Chrome extension for submitting visualized PMC web pages to the ThermoScan server are available at https://github.com/biofold/ThermoScan.
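The general idea described in this abstract, matching conceptual words, thermodynamic variables and values with units, then combining the hits into a score, can be sketched as follows. This is only an illustration under stated assumptions: the term lists, the weights and the name `stability_score` are hypothetical and are not ThermoScan's actual regular expression or scoring scheme, which the abstract does not specify.

```python
import re

# Hypothetical vocabularies (assumptions, not the published tool's lists).
CONCEPT_RE = re.compile(
    r"\b(?:unfolding|denaturation|thermal stability|thermostability)\b",
    re.IGNORECASE,
)
VARIABLE_RE = re.compile(r"\b(?:ddG|dG|dH|Tm)\b")
VALUE_UNIT_RE = re.compile(r"-?\d+(?:\.\d+)?\s*(?:kcal/mol|kJ/mol)")

# Hypothetical weights: a numeric value with a unit is treated as the
# strongest evidence that thermodynamic data are actually reported.
WEIGHTS = {"concepts": 1, "variables": 2, "values": 3}


def stability_score(text):
    """Return (score, hits) for one block of article text."""
    hits = {
        "concepts": len(CONCEPT_RE.findall(text)),
        "variables": len(VARIABLE_RE.findall(text)),
        "values": len(VALUE_UNIT_RE.findall(text)),
    }
    score = sum(WEIGHTS[name] * count for name, count in hits.items())
    return score, hits
```

In such a scheme, articles would be ranked by score and those above a chosen threshold passed to a curator; the threshold would have to be tuned on a labelled set, as the abstract reports was done with publications from ProTherm.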
Affiliation(s)
- Paola Turina
  - Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
- Piero Fariselli
  - Department of Medical Sciences, University of Torino, Torino, Italy
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
12
Paley S, Keseler IM, Krummenacker M, Karp PD. Leveraging Curation Among Escherichia coli Pathway/Genome Databases Using Ortholog-Based Annotation Propagation. Front Microbiol 2021; 12:614355. [PMID: 33763039 PMCID: PMC7982652 DOI: 10.3389/fmicb.2021.614355] [Received: 10/05/2020] [Accepted: 03/02/2021] [Indexed: 12/19/2022]
Abstract
Updating genome databases to reflect newly published molecular findings for an organism was hard enough when only a single strain of a given organism had been sequenced. With multiple sequenced strains now available for many organisms, the challenge has grown significantly because of the still-limited resources available for the manual curation that corrects errors and captures new knowledge. We have developed a method to automatically propagate multiple types of curated knowledge from genes and proteins in one genome database to their orthologs in uncurated databases for related strains, imposing several quality-control filters to reduce the chances of introducing errors. We have applied this method to propagate information from the highly curated EcoCyc database for Escherichia coli K-12 to databases for 480 other Escherichia coli strains in the BioCyc database collection. The increase in value and utility of the target databases after propagation is considerable. Target databases received updates for an average of 2,535 proteins each. In addition to widespread addition and regularization of gene and protein names, 97% of the target databases were improved by the addition of at least 200 new protein complexes, at least 800 new or updated reaction assignments, and at least 2,400 sets of GO annotations.
Affiliation(s)
- Suzanne Paley
  - Bioinformatics Research Group, SRI International, Menlo Park, CA, United States
- Ingrid M Keseler
  - Bioinformatics Research Group, SRI International, Menlo Park, CA, United States
- Markus Krummenacker
  - Bioinformatics Research Group, SRI International, Menlo Park, CA, United States
- Peter D Karp
  - Bioinformatics Research Group, SRI International, Menlo Park, CA, United States
13
Garcia-Pelaez J, Rodriguez D, Medina-Molina R, Garcia-Rivas G, Jerjes-Sánchez C, Trevino V. PubTerm: a web tool for organizing, annotating and curating genes, diseases, molecules and other concepts from PubMed records. Database (Oxford) 2019; 2019:5280306. [PMID: 30624653 PMCID: PMC6323318 DOI: 10.1093/database/bay137] [Received: 10/14/2018] [Accepted: 12/02/2018] [Indexed: 11/13/2022]
Abstract
Background and objective: Analysis, annotation and curation of biomedical scientific literature is a recurrent task in biomedical research, database curation and clinics. Commonly, the reading is centered on concepts such as genes, diseases or molecules. Database curators may also need to annotate published abstracts related to a specific topic. However, few free and intuitive tools exist to assist users in this context. Therefore, we developed PubTerm, a web tool to organize, categorize, curate and annotate a large number of PubMed abstracts related to biological entities such as genes, diseases, chemicals, species, sequence variants and other related information. Methods: A variety of interfaces were implemented to facilitate curation and annotation, including the organization of abstracts by terms, by the co-occurrence of terms or by specific phrases. Information includes statistics on the occurrence of terms. The abstracts, terms and other related information can be annotated and categorized using user-defined categories. The session information can be saved and restored, and the data can be exported to other formats. Results: The pipeline in PubTerm starts by specifying a PubMed query or list of PubMed identifiers. Then, the user can specify three lists of categories and specify what information will be highlighted in which colors. The user then utilizes the 'term view' to organize the abstracts by gene, disease, species or other information to facilitate the annotation and categorization of terms or abstracts. Other views also facilitate the exploration of abstracts and connections between terms. We have used PubTerm to quickly and efficiently curate collections of more than 400 abstracts that mention more than 350 genes to generate revised lists of susceptibility genes for diseases. An example is provided for pulmonary arterial hypertension.
Conclusions: PubTerm saves time in literature review by assisting with annotation, organization and knowledge acquisition.
Affiliation(s)
- José Garcia-Pelaez
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
| | - David Rodriguez
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
| | - Roberto Medina-Molina
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
| | - Gerardo Garcia-Rivas
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México.,Centro de Investigación Biomédica, Hospital Zambrano-Hellion, Tec Salud, Tecnologico de Monterrey, Batallón San Patricio 112 Col. Real de San Agustín, San Pedro Garza García, N.L., México
| | - Carlos Jerjes-Sánchez
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México.,Centro de Investigación Biomédica, Hospital Zambrano-Hellion, Tec Salud, Tecnologico de Monterrey, Batallón San Patricio 112 Col. Real de San Agustín, San Pedro Garza García, N.L., México
| | - Victor Trevino
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud. Ave. Morones Prieto 3000, Monterrey, N.L., México
| |
|
14
|
Pedretti A, Mazzolari A, Vistoli G, Testa B. MetaQSAR: An Integrated Database Engine to Manage and Analyze Metabolic Data. J Med Chem 2018; 61:1019-1030. [DOI: 10.1021/acs.jmedchem.7b01473] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Indexed: 12/11/2022]
Affiliation(s)
- Alessandro Pedretti
- Dipartimento di Scienze Farmaceutiche “Pietro Pratesi”, Facoltà di Farmacia, Università degli Studi di Milano, Via Luigi Mangiagalli 25, I-20133 Milano, Italy
| | - Angelica Mazzolari
- Dipartimento di Scienze Farmaceutiche “Pietro Pratesi”, Facoltà di Farmacia, Università degli Studi di Milano, Via Luigi Mangiagalli 25, I-20133 Milano, Italy
| | - Giulio Vistoli
- Dipartimento di Scienze Farmaceutiche “Pietro Pratesi”, Facoltà di Farmacia, Università degli Studi di Milano, Via Luigi Mangiagalli 25, I-20133 Milano, Italy
| | | |
|
15
|
Foerster H, Bombarely A, Battey JND, Sierro N, Ivanov NV, Mueller LA. SolCyc: a database hub at the Sol Genomics Network (SGN) for the manual curation of metabolic networks in Solanum and Nicotiana specific databases. Database (Oxford) 2018; 2018:4995113. [PMID: 29762652 PMCID: PMC5946812 DOI: 10.1093/database/bay035] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 08/23/2017] [Revised: 03/13/2018] [Accepted: 03/15/2018] [Indexed: 01/20/2023]
Abstract
Database URL https://solgenomics.net/tools/solcyc/.
Affiliation(s)
- Hartmut Foerster
- Boyce Thompson Institute, 533 Tower Road, Ithaca, New York, 14853-1801, USA
| | - Aureliano Bombarely
- Department of Horticulture, Virginia Polytechnic Institute and State University, 220 Ag Quad Lane, Blacksburg, VA 24061, USA
| | - James N D Battey
- PMI R&D, Philip Morris Products S.A (Part of Philip Morris International group of companies), Quai Jeanrenaud 6, Neuchâtel CH-2000, Switzerland
| | - Nicolas Sierro
- PMI R&D, Philip Morris Products S.A (Part of Philip Morris International group of companies), Quai Jeanrenaud 6, Neuchâtel CH-2000, Switzerland
| | - Nikolai V Ivanov
- PMI R&D, Philip Morris Products S.A (Part of Philip Morris International group of companies), Quai Jeanrenaud 6, Neuchâtel CH-2000, Switzerland
| | - Lukas A Mueller
- Boyce Thompson Institute, 533 Tower Road, Ithaca, New York, 14853-1801, USA
| |
|
16
|
Gabella C, Durinx C, Appel R. Funding knowledgebases: Towards a sustainable funding model for the UniProt use case. F1000Res 2017; 6. [PMID: 29333230 PMCID: PMC5747334 DOI: 10.12688/f1000research.12989.2] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Accepted: 03/19/2018] [Indexed: 11/30/2022] Open
Abstract
Millions of life scientists across the world rely on bioinformatics data resources for their research projects. Data resources can be very expensive, especially those with a high added value as the expert-curated knowledgebases. Despite the increasing need for such highly accurate and reliable sources of scientific information, most of them do not have secured funding over the near future and often depend on short-term grants that are much shorter than their planning horizon. Additionally, they are often evaluated as research projects rather than as research infrastructure components. In this work, twelve funding models for data resources are described and applied to the case study of the Universal Protein Resource (UniProt), a key resource for protein sequences and functional information knowledge. We show that most of the models present inconsistencies with open access or equity policies, and that while some models cannot cover the total costs, they could potentially be used as a complementary income source. We propose the Infrastructure Model as a sustainable and equitable model for all core data resources in the life sciences. With this model, funding agencies would set aside a fixed percentage of their research grant volumes, which would subsequently be redistributed to core data resources according to well-defined selection criteria. This model, compatible with the principles of open science, is in agreement with several international initiatives such as the Human Frontiers Science Program Organisation (HFSPO) and the OECD Global Science Forum (GSF) project. Here, we have estimated that less than 1% of the total amount dedicated to research grants in the life sciences would be sufficient to cover the costs of the core data resources worldwide, including both knowledgebases and deposition databases.
Affiliation(s)
- Chiara Gabella
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
| | - Christine Durinx
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
| | - Ron Appel
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
| |
|
17
|
Data management and data enrichment for systems biology projects. J Biotechnol 2017; 261:229-237. [PMID: 28606610 DOI: 10.1016/j.jbiotec.2017.06.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 02/21/2017] [Revised: 06/06/2017] [Accepted: 06/09/2017] [Indexed: 12/24/2022]
Abstract
Collecting, curating, interlinking, and sharing high quality data are central to de.NBI-SysBio, the systems biology data management service center within the de.NBI network (German Network for Bioinformatics Infrastructure). The work of the center is guided by the FAIR principles for scientific data management and stewardship. FAIR stands for the four foundational principles Findability, Accessibility, Interoperability, and Reusability, which were established to enhance the ability of machines to automatically find, access, exchange and use data. Within this overview paper we describe three tools (SABIO-RK, Excemplify, SEEK) that exemplify the contribution of de.NBI-SysBio services to FAIR data, models, and experimental methods storage and exchange. The interconnectivity of the tools and the data workflow within systems biology projects will be explained. For many years we have been the German partner in the FAIRDOM initiative (http://fair-dom.org) to establish a European data and model management service facility for systems biology.
|
18
|
Karp PD. Crowd-sourcing and author submission as alternatives to professional curation. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw149. [PMID: 28025340 PMCID: PMC5199147 DOI: 10.1093/database/baw149] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 08/23/2016] [Revised: 10/18/2016] [Accepted: 10/19/2016] [Indexed: 11/28/2022]
Abstract
Can we decrease the costs of database curation by crowd-sourcing curation work or by offloading curation to publication authors? This perspective considers the significant experience accumulated by the bioinformatics community with these two alternatives to professional curation in the last 20 years; that experience should be carefully considered when formulating new strategies for biological databases. The vast weight of empirical evidence to date suggests that crowd-sourced curation is not a successful model for biological databases. Multiple approaches to crowd-sourced curation have been attempted by multiple groups, and extremely low participation rates by ‘the crowd’ are the overwhelming outcome. The author-curation model shows more promise for boosting curator efficiency. However, its limitations include that the quality of author-submitted annotations is uncertain, the response rate is low (but significant), and to date author curation has involved relatively simple forms of annotation involving one or a few types of data. Furthermore, shifting curation to authors may simply redistribute costs rather than decreasing costs; author curation may in fact increase costs because of the overhead involved in having every curating author learn what professional curators know: curation conventions, curation software and curation procedures.
Affiliation(s)
- Peter D Karp
- Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA. Tel: 650-859-4358; Fax: 650-859-3735
| |
|