1
Piereck B, Oliveira-Lima M, Benko-Iseppon AM, Diehl S, Schneider R, Brasileiro-Vidal AC, Barbosa-Silva A. LAITOR4HPC: A text mining pipeline based on HPC for building interaction networks. BMC Bioinformatics 2020; 21:365. [PMID: 32838742 PMCID: PMC7447576 DOI: 10.1186/s12859-020-03620-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/14/2019] [Accepted: 06/19/2020] [Indexed: 11/11/2022]
Abstract
Background: The number of published full-text articles has increased dramatically. Text mining tools are an essential approach to building biological networks, updating databases and providing annotation for new pathways. PESCADOR is an online web server based on the LAITOR and NLProt text mining tools, which retrieves protein-protein co-occurrences in a tabular format and adds a network schema. Here we present an HPC-oriented version of PESCADOR’s native text mining tool, renamed LAITOR4HPC, which aims to process an unlimited number of abstracts in a short time in order to enrich available networks, build new ones and possibly highlight whether fields of research have been exhaustively studied.
Results: By taking advantage of a parallel-computing HPC infrastructure, the full collection of MEDLINE abstracts available until June 2017 was analyzed in 6 days, compared with an estimated 2 years for the original online implementation to process the same data. Additionally, three case studies are presented to illustrate possible uses of LAITOR4HPC. The first case study targeted soybean and was used to retrieve an overview of published co-occurrences in a single organism, retrieving 15,788 proteins in 7,894 co-occurrences. In the second case study, a target gene family was searched across many organisms by analyzing 15 species under biotic stress; most co-occurrences concerned Arabidopsis thaliana and Zea mays. The third case study concerned the construction and enrichment of an available pathway. A. thaliana was chosen for further analysis, and its defensin pathway was enriched, revealing additional signaling and regulation molecules and how they respond to each other in the modulation of this complex plant defense response.
Conclusions: LAITOR4HPC can be used for efficient text-mining-based construction of biological networks derived from big data sources, such as MEDLINE abstracts. Time consumption and data input limitations will depend on the resources available at the HPC facility. LAITOR4HPC offers enough flexibility for different approaches and data volumes targeted to an organism, a subject, or a specific pathway. Additionally, it can deliver comprehensive results in which interactions are classified into four types according to their reliability.
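To make the idea of protein-protein co-occurrence concrete, the following is a minimal sketch that counts how many abstracts mention each pair of names from a dictionary. It is not LAITOR4HPC's actual algorithm: the function name `cooccurrence_counts`, the abstract-level matching, the regular-expression tokenizer and the placeholder names `protA`, `protB` and `protC` are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, protein_names):
    """Count, for each unordered pair of names, the abstracts mentioning both.

    Hypothetical helper: exact, case-insensitive token matching only.
    """
    names = {name.lower() for name in protein_names}
    counts = Counter()
    for text in abstracts:
        # Split into lowercase alphanumeric tokens (a deliberate simplification).
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        found = sorted(names & tokens)
        # Every pair of names found in the same abstract is one co-occurrence.
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


# Toy input with placeholder names, not real data.
toy_abstracts = [
    "protA binds protB.",
    "protB and protC were studied; protA was not detected.",
    "protC alone.",
]
toy_counts = cooccurrence_counts(toy_abstracts, ["protA", "protB", "protC"])
for pair, n in sorted(toy_counts.items()):
    print(pair, n)
```

The resulting pair counts can be read as weighted edges of a network. The second toy abstract shows why bare co-occurrence is only weak evidence of interaction: a negated mention is counted just like a positive one.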
Affiliation(s)
- Bruna Piereck
  - Genetics Department, Laboratório de Genética e Biologia Vegetal, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
- Marx Oliveira-Lima
  - Genetics Department, Laboratório de Genética e Biologia Vegetal, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
- Ana Maria Benko-Iseppon
  - Genetics Department, Laboratório de Genética e Biologia Vegetal, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
- Sarah Diehl
  - University of Luxembourg, Luxembourg Centre for Systems Biomedicine, Bioinformatics Core, Esch-sur-Alzette, Luxembourg
- Reinhard Schneider
  - University of Luxembourg, Luxembourg Centre for Systems Biomedicine, Bioinformatics Core, Esch-sur-Alzette, Luxembourg
- Ana Christina Brasileiro-Vidal
  - Genetics Department, Laboratório de Genética e Biologia Vegetal, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
- Adriano Barbosa-Silva
  - University of Luxembourg, Luxembourg Centre for Systems Biomedicine, Bioinformatics Core, Esch-sur-Alzette, Luxembourg
  - Queen Mary University of London, Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Charterhouse Square, London, UK
2
Chen G, Jia Y, Zhu L, Li P, Zhang L, Tao C, Jim Zheng W. Gene fingerprint model for literature based detection of the associations among complex diseases: a case study of COPD. BMC Med Inform Decis Mak 2019; 19:20. [PMID: 30700303 PMCID: PMC6354331 DOI: 10.1186/s12911-019-0738-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
Background: Disease comorbidity is very common and has a significant impact on disease treatment. Revealing the associations among diseases may help to understand the mechanisms of diseases, improve their prevention and treatment, and support the discovery of new drugs or new uses of existing drugs.
Methods: In this paper, we introduce a mathematical model that represents each gene-related disease by a series of associated genes, based on the overrepresentation of genes and diseases in the PubMed literature. We also illustrate an efficient way to reveal the implicit connections between COPD and other diseases based on this model.
Results: We applied this approach to analyze the relationships between Chronic Obstructive Pulmonary Disease (COPD) and other diseases under the Lung Diseases branch of the Medical Subject Headings (MeSH) index system and detected 4 novel diseases relevant to COPD. As judged by domain experts, the F score of our approach is up to 77.6%.
Conclusions: The results demonstrate the effectiveness of the gene fingerprint model for diseases on the basis of medical literature.
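For readers unfamiliar with the F score reported above, the following is a minimal sketch of how an F score is commonly computed from expert-judged true positives, false positives and false negatives. The abstract does not state which F-score variant or counts were used, so the balanced default (`beta=1.0`), the function name `f_score` and the example counts are assumptions.

```python
def f_score(tp, fp, fn, beta=1.0):
    """Return the F-beta score from true-positive, false-positive and false-negative counts.

    With beta=1.0 this is the harmonic mean of precision and recall (F1).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, not taken from the paper.
print(round(f_score(tp=8, fp=2, fn=2), 3))
```

Because the score is a harmonic mean, it stays low unless precision and recall are both reasonably high, which is why it is often preferred over accuracy when relevant items are rare.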
Collapse
Affiliation(s)
- Guocai Chen
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX, 77030, USA
- Yuxi Jia
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX, 77030, USA
  - Department of Medical Informatics, School of Public Health, Jilin University, Changchun, Jilin, 130021, China
- Lisha Zhu
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX, 77030, USA
- Ping Li
  - Department of Development Pediatrics, The Second Affiliated Hospital of Jilin University, Changchun, Jilin, 130041, China
- Lin Zhang
  - Department of Respiratory Medicine, The Second Affiliated Hospital of Jilin University, Changchun, Jilin, 130041, China
- Cui Tao
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX, 77030, USA
- W Jim Zheng
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX, 77030, USA
3
Zhou J, Fu BQ. The research on gene-disease association based on text-mining of PubMed. BMC Bioinformatics 2018; 19:37. [PMID: 29415654 PMCID: PMC5804013 DOI: 10.1186/s12859-018-2048-y] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 09/18/2017] [Accepted: 01/29/2018] [Indexed: 11/23/2022]
Abstract
Background: The associations between genes and diseases are of critical significance for prevention, diagnosis and treatment. Although gene-disease relationships have been investigated extensively, many of the underpinnings of these associations are yet to be elucidated.
Methods: We propose a novel method that integrates the MeSH database, term weight (TW), and co-occurrence methods to predict gene-disease associations based on the cosine similarity between gene vectors and disease vectors. The vectors are derived from the texts of documents in the PubMed database according to the appearance and location of the gene or disease terms. The disease-related text data are optimized during the process of constructing the vectors.
Results: The overall distribution of cosine similarity values was investigated. Using the gene-disease association data in the OMIM database as the gold standard, the performance of cosine similarity in predicting gene-disease linkage was evaluated. The effects of applying a weight matrix, penalty weights for keywords (PWK), and normalization were also investigated. Finally, we demonstrate that our method outperforms heterogeneous network edge prediction (HNEP) in terms of precision and recall.
Conclusions: The method proposed in this paper is easy to apply, and its results can be integrated with other models to improve the overall performance of gene-disease association predictions.
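The core measure named in the Methods, cosine similarity between a gene vector and a disease vector, can be sketched as follows. This is only an illustration of the similarity step: the function name `cosine_similarity` and the toy vectors are assumptions, and the paper's term weighting, penalty weights for keywords and normalization are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors over the same three-term vocabulary.
gene_vector = [1.0, 2.0, 0.0]
disease_vector = [2.0, 4.0, 0.0]
print(cosine_similarity(gene_vector, disease_vector))
```

Cosine similarity depends only on the direction of the vectors, not their length, so a gene mentioned in many documents and a disease mentioned in few can still score highly if their term profiles are proportional.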
Collapse
Affiliation(s)
- Jie Zhou
  - Guangdong Key Laboratory of Computer Network, School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Bo-Quan Fu
  - Guangdong Key Laboratory of Computer Network, School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
4
Lopes KDP, Campos-Laborie FJ, Vialle RA, Ortega JM, De Las Rivas J. Evolutionary hallmarks of the human proteome: chasing the age and coregulation of protein-coding genes. BMC Genomics 2016; 17:725. [PMID: 27801289 PMCID: PMC5088522 DOI: 10.1186/s12864-016-3062-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 01/19/2023]
Abstract
Background: The development of large-scale technologies for quantitative transcriptomics has enabled comprehensive analysis of gene expression profiles in complete genomes. RNA-Seq allows the measurement of gene expression levels in a manner far more precise and global than previous methods. Studies using this technology are altering our view of the extent and complexity of eukaryotic transcriptomes. In this respect, multiple efforts have been made to determine and analyse the gene expression patterns of human cell types in different conditions, in both normal and pathological states. However, until recently, little has been reported about the evolutionary marks present in human protein-coding genes, particularly from the combined perspective of gene expression and protein evolution.
Results: We present a combined analysis of human protein-coding gene expression profiling and time-scale ancestry mapping that places the genes in taxonomy clades and reveals eight major evolutionary steps (“hallmarks”), which include clusters of functionally coherent proteins. The human expressed genes are analysed using an RNA-Seq dataset of 116 samples from 32 tissues. The evolutionary analysis of the human proteins combines information from: (i) a database of orthologous proteins (OMA), (ii) the taxonomy mapping of genes to lineage clades (from NCBI Taxonomy) and (iii) the evolutionary time-scale mapping provided by TimeTree (Timescale of Life). The human protein-coding genes are also placed in a relational context based on the construction of a robust gene coexpression network, which reveals tighter links between age-related protein-coding genes and finds functionally coherent gene modules.
Conclusions: Understanding the relational landscape of the human protein-coding genes is essential for interpreting the functional elements and modules of our active genome. Moreover, decoding the evolutionary history of the human genes can provide very valuable information to reveal or uncover their origin and function.
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3062-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Katia de Paiva Lopes
  - Bioinformatics and Functional Genomics Group, Cancer Research Center (CiC-IBMCC, CSIC/USAL/IBSAL), Consejo Superior de Investigaciones Cientificas (CSIC), Salamanca, Spain
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brasil
- Francisco José Campos-Laborie
  - Bioinformatics and Functional Genomics Group, Cancer Research Center (CiC-IBMCC, CSIC/USAL/IBSAL), Consejo Superior de Investigaciones Cientificas (CSIC), Salamanca, Spain
- Ricardo Assunção Vialle
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brasil
- José Miguel Ortega
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brasil
- Javier De Las Rivas
  - Bioinformatics and Functional Genomics Group, Cancer Research Center (CiC-IBMCC, CSIC/USAL/IBSAL), Consejo Superior de Investigaciones Cientificas (CSIC), Salamanca, Spain
5
Andrade-Navarro M, Perez-Iratxeta C. Text mining of biomedical literature: doing well, but we could be doing better. Methods 2015; 74:1-2. [PMID: 25703199 DOI: 10.1016/j.ymeth.2015.01.014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/24/2022]
Affiliation(s)
- Miguel Andrade-Navarro
  - Faculty of Biology, Johannes-Gutenberg University of Mainz, Gresemundweg 2, 55128 Mainz, Germany
  - Institute of Molecular Biology, Ackermannweg 4, 55128 Mainz, Germany
- Carol Perez-Iratxeta
  - Ottawa Hospital Research Institute, 501 Smyth Road, Ottawa, Ontario K1H 8L6, Canada