1
Literature Mining of Disease Associated Noncoding RNA in the Omics Era. Molecules 2022; 27:4710. [PMID: 35897884 PMCID: PMC9331993 DOI: 10.3390/molecules27154710] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/16/2022] [Revised: 07/20/2022] [Accepted: 07/22/2022] [Indexed: 02/01/2023]
Abstract
Noncoding RNAs (ncRNAs) are transcripts without protein-coding potential that play fundamental regulatory roles in diverse cellular processes and diseases. The application of deep sequencing experiments in ncRNA research has generated massive omics datasets, which require rapid examination, interpretation and validation based on existing knowledge resources. Thus, text-mining methods have been increasingly adapted for automatic extraction of relations between an ncRNA and its target or a disease condition from biomedical literature. These bioinformatics tools can also assist in more complex research, such as database curation of candidate ncRNAs and hypothesis generation with respect to pathophysiological mechanisms. In this concise review, we first introduced basic concepts and workflow of literature mining systems. Then, we compared available bioinformatics tools tailored for ncRNA studies, including the tasks, applicability, and limitations. Their powerful utilities and flexibility are demonstrated by examples in a variety of diseases, such as Alzheimer’s disease, atherosclerosis and cancers. Finally, we outlined several challenges from the viewpoints of both system developers and end users. We concluded that the application of text-mining techniques will boost disease-associated ncRNA discoveries in the biomedical literature and enable integrative biology in the current omics era.
2
Hendrickx JO, van Gastel J, Leysen H, Martin B, Maudsley S. High-dimensionality Data Analysis of Pharmacological Systems Associated with Complex Diseases. Pharmacol Rev 2020; 72:191-217. [PMID: 31843941 DOI: 10.1124/pr.119.017921] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/11/2022]
Abstract
It is widely accepted that molecular reductionist views of highly complex human physiologic activity, e.g., the aging process, as well as therapeutic drug efficacy are largely oversimplifications. Currently some of the most effective appreciation of biologic disease and drug response complexity is achieved using high-dimensionality (H-D) data streams from transcriptomic, proteomic, metabolomics, or epigenomic pipelines. Multiple H-D data sets are now common and freely accessible for complex diseases such as metabolic syndrome, cardiovascular disease, and neurodegenerative conditions such as Alzheimer's disease. Over the last decade our ability to interrogate these high-dimensionality data streams has been profoundly enhanced through the development and implementation of highly effective bioinformatic platforms. Employing these computational approaches to understand the complexity of age-related diseases provides a facile mechanism to then synergize this pathologic appreciation with a similar level of understanding of therapeutic-mediated signaling. For informative pathology and drug-based analytics that are able to generate meaningful therapeutic insight across diverse data streams, novel informatics processes such as latent semantic indexing and topological data analyses will likely be important. Elucidation of H-D molecular disease signatures from diverse data streams will likely generate and refine new therapeutic strategies that will be designed with a cognizance of a realistic appreciation of the complexity of human age-related disease and drug effects. We contend that informatic platforms should be synergistic with more advanced chemical/drug and phenotypic cellular/tissue-based analytical predictive models to assist in either de novo drug prioritization or effective repurposing for the intervention of aging-related diseases.
SIGNIFICANCE STATEMENT: All diseases, as well as pharmacological mechanisms, are far more complex than was thought a decade ago. With the advent of commonplace access to technologies that produce large volumes of high-dimensionality data (e.g., transcriptomics, proteomics, metabolomics), it is now imperative that effective tools to appreciate this highly nuanced data are developed. Being able to appreciate the subtleties of high-dimensionality data will allow molecular pharmacologists to develop the most effective multidimensional therapeutics with effectively engineered efficacy profiles.
Affiliation(s)
- Jhana O Hendrickx
  - Receptor Biology Laboratory, Department of Biomedical Research (J.O.H., J.v.G., H.L., S.M.) and Faculty of Pharmacy, Biomedical and Veterinary Sciences (J.O.H., J.v.G., H.L., B.M., S.M.), University of Antwerp, Antwerp, Belgium
- Jaana van Gastel
  - Receptor Biology Laboratory, Department of Biomedical Research (J.O.H., J.v.G., H.L., S.M.) and Faculty of Pharmacy, Biomedical and Veterinary Sciences (J.O.H., J.v.G., H.L., B.M., S.M.), University of Antwerp, Antwerp, Belgium
- Hanne Leysen
  - Receptor Biology Laboratory, Department of Biomedical Research (J.O.H., J.v.G., H.L., S.M.) and Faculty of Pharmacy, Biomedical and Veterinary Sciences (J.O.H., J.v.G., H.L., B.M., S.M.), University of Antwerp, Antwerp, Belgium
- Bronwen Martin
  - Receptor Biology Laboratory, Department of Biomedical Research (J.O.H., J.v.G., H.L., S.M.) and Faculty of Pharmacy, Biomedical and Veterinary Sciences (J.O.H., J.v.G., H.L., B.M., S.M.), University of Antwerp, Antwerp, Belgium
- Stuart Maudsley
  - Receptor Biology Laboratory, Department of Biomedical Research (J.O.H., J.v.G., H.L., S.M.) and Faculty of Pharmacy, Biomedical and Veterinary Sciences (J.O.H., J.v.G., H.L., B.M., S.M.), University of Antwerp, Antwerp, Belgium
3
Roy S, Zaman KI, Williams RW, Homayouni R. Evaluation of Sirtuin-3 probe quality and co-expressed genes using literature cohesion. BMC Bioinformatics 2019; 20:104. [PMID: 30871457 PMCID: PMC6419539 DOI: 10.1186/s12859-019-2621-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
BACKGROUND: Gene co-expression studies can provide important insights into molecular and cellular signaling pathways. The GeneNetwork database is a unique resource for co-expression analysis using data from a variety of tissues across genetically distinct inbred mice. However, extraction of biologically meaningful co-expressed gene sets is challenging due to variability in microarray platforms, probe quality, normalization methods, and confounding biological factors. In this study, we tested whether literature derived functional cohesion could be used as an objective metric in lieu of 'ground truth' to evaluate the quality of probes and microarray datasets.
RESULTS: We examined Sirtuin-3 (Sirt3) co-expressed gene sets extracted from either liver or brain tissues of BXD recombinant inbred mice in the GeneNetwork database. Depending on the microarray platform, there were as many as 26 probes that targeted different regions of Sirt3 primary transcript. Co-expressed gene sets (ranging from 100 to 1000 genes) associated with each Sirt3 probe were evaluated using the previously developed literature-derived cohesion p-value (LPv) and benchmarked against 'gold standards' derived from proteomic studies or Gene Ontology classifications. We found that the maximal F-measure was obtained at an average window size of 535 genes. Using set size of 500 genes, the Pearson correlations between LPv and F-measure as well as between LPv and mitochondrial gene enrichment p-values were 0.90 and 0.93, respectively. Importantly, we found that the LPv approach can distinguish high quality Sirt3 probes. Analysis of the most functionally cohesive Sirt3 co-expressed gene set revealed core metabolic pathways that were shared between hippocampus and liver as well as distinct pathways which were unique to each tissue. These results are consistent with other studies that suggest Sirt3 is a key metabolic regulator and has distinct functions in energy-producing vs. energy-demanding tissues.
CONCLUSIONS: Our results provide proof-of-concept that literature cohesion analysis is useful for evaluating the quality of probes and microarray datasets, particularly when experimentally derived gold standards are unavailable. Our approach would enable researchers to rapidly identify biologically meaningful co-expressed gene sets and facilitate discovery from high throughput genomic data.
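The benchmarking step in this abstract (scoring a co-expressed gene set against a gold standard by F-measure, and finding the window size that maximizes it) can be illustrated with a minimal sketch. It is not the authors' code: the gene identifiers, the candidate window sizes, and the helper names `f_measure` and `best_window` are hypothetical, and the standard balanced F-measure (harmonic mean of precision and recall) is assumed.

```python
def f_measure(predicted, gold):
    """Balanced F-measure of a predicted gene set against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_window(ranked_genes, gold, sizes):
    """Return the window size whose top-k genes score the highest F-measure."""
    return max(sizes, key=lambda k: f_measure(ranked_genes[:k], gold))


# Hypothetical data: genes ranked by co-expression with a probe (strongest
# first) and a toy gold-standard set.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
gold = {"g1", "g2", "g4"}
best_k = best_window(ranked, gold, sizes=[2, 4, 6])
```

In this toy case a small window misses a gold-standard gene and a large window dilutes precision, so an intermediate window scores best, which is the trade-off a maximal F-measure at a particular window size reflects.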
Affiliation(s)
- Sujoy Roy
  - Bioinformatics Program, University of Memphis, Memphis, 38152 USA
  - Center for Translational Informatics, University of Memphis, Memphis, 38152 USA
- Kazi I. Zaman
  - Bioinformatics Program, University of Memphis, Memphis, 38152 USA
- Robert W. Williams
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, 38163 USA
- Ramin Homayouni
  - Bioinformatics Program, University of Memphis, Memphis, 38152 USA
  - Center for Translational Informatics, University of Memphis, Memphis, 38152 USA
  - Department of Biology, University of Memphis, Memphis, 38152 USA
4
Roy S, Yun D, Madahian B, Berry MW, Deng LY, Goldowitz D, Homayouni R. Navigating the Functional Landscape of Transcription Factors via Non-Negative Tensor Factorization Analysis of MEDLINE Abstracts. Front Bioeng Biotechnol 2017; 5:48. [PMID: 28894735 PMCID: PMC5581332 DOI: 10.3389/fbioe.2017.00048] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 05/18/2017] [Accepted: 07/31/2017] [Indexed: 01/09/2023]
Abstract
In this study, we developed and evaluated a novel text-mining approach, using non-negative tensor factorization (NTF), to simultaneously extract and functionally annotate transcriptional modules consisting of sets of genes, transcription factors (TFs), and terms from MEDLINE abstracts. A sparse 3-mode term × gene × TF tensor was constructed that contained weighted frequencies of 106,895 terms in 26,781 abstracts shared among 7,695 genes and 994 TFs. The tensor was decomposed into sub-tensors using non-negative tensor factorization (NTF) across 16 different approximation ranks. Dominant entries of each of 2,861 sub-tensors were extracted to form term–gene–TF annotated transcriptional modules (ATMs). More than 94% of the ATMs were found to be enriched in at least one KEGG pathway or GO category, suggesting that the ATMs are functionally relevant. One advantage of this method is that it can discover potentially new gene–TF associations from the literature. Using a set of microarray and ChIP-Seq datasets as gold standard, we show that the precision of our method for predicting gene–TF associations is significantly higher than chance. In addition, we demonstrate that the terms in each ATM can be used to suggest new GO classifications to genes and TFs. Taken together, our results indicate that NTF is useful for simultaneous extraction and functional annotation of transcriptional regulatory networks from unstructured text, as well as for literature based discovery. A web tool called Transcriptional Regulatory Modules Extracted from Literature (TREMEL), available at http://binf1.memphis.edu/tremel, was built to enable browsing and searching of ATMs.
Affiliation(s)
- Sujoy Roy
  - Bioinformatics Program, University of Memphis, Memphis, TN, United States
  - Center for Translational Informatics, University of Memphis, Memphis, TN, United States
- Daqing Yun
  - Computer and Information Sciences Program, Harrisburg University of Science and Technology, Harrisburg, PA, United States
- Behrouz Madahian
  - Department of Mathematical Sciences, University of Memphis, Memphis, TN, United States
- Michael W Berry
  - Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, United States
- Lih-Yuan Deng
  - Department of Mathematical Sciences, University of Memphis, Memphis, TN, United States
- Daniel Goldowitz
  - Center for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, BC, Canada
- Ramin Homayouni
  - Bioinformatics Program, University of Memphis, Memphis, TN, United States
  - Center for Translational Informatics, University of Memphis, Memphis, TN, United States
  - Department of Biological Sciences, University of Memphis, Memphis, TN, United States
5
An index-based algorithm for fast on-line query processing of latent semantic analysis. PLoS One 2017; 12:e0177523. [PMID: 28520747 PMCID: PMC5433746 DOI: 10.1371/journal.pone.0177523] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/30/2016] [Accepted: 04/29/2017] [Indexed: 11/23/2022]
Abstract
Latent Semantic Analysis (LSA) is widely used for finding documents whose semantics are similar to a keyword query. Although LSA yields promising results, existing LSA algorithms involve many unnecessary operations in similarity computation and candidate checking during on-line query processing, which is expensive in terms of time cost and cannot efficiently respond to query requests, especially when the dataset becomes large. In this paper, we study the efficiency problem of on-line query processing for LSA, towards efficiently searching for documents similar to a given query. We rewrite the similarity equation of LSA in terms of an intermediate value called partial similarity, which is stored in a purpose-built index called the partial index. To reduce the search space, we give an approximate form of the similarity equation, and then develop an efficient algorithm for building the partial index, which skips partial similarities lower than a given threshold θ. Based on the partial index, we develop an efficient algorithm called ILSA for supporting fast on-line query processing. The given query is transformed into a pseudo document vector, and the similarities between the query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes corresponding to non-zero entries in the pseudo document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning candidate documents that are not promising and skipping operations that contribute little to similarity scores. Extensive experiments comparing ILSA with LSA demonstrate the efficiency and effectiveness of our proposed algorithm.
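The general idea of answering a query by accumulating thresholded partial similarities from an index might look like the following sketch. It is one plausible reading of the abstract, not the ILSA algorithm as published: the low-rank factor values are made up, the functions `build_partial_index` and `query` are hypothetical, and an unnormalized dot product stands in for whatever similarity equation the paper actually uses.

```python
from collections import defaultdict


def build_partial_index(entry_factors, doc_factors, theta):
    """Map each query-vector entry to (doc id, partial similarity) postings.

    Partial similarities below the threshold theta are skipped, which makes
    the final scores approximate but shrinks the index.
    """
    index = {}
    for entry_id, entry_vec in enumerate(entry_factors):
        postings = []
        for doc_id, doc_vec in enumerate(doc_factors):
            partial = sum(a * b for a, b in zip(entry_vec, doc_vec))
            if partial >= theta:
                postings.append((doc_id, partial))
        index[entry_id] = postings
    return index


def query(index, query_weights):
    """Score documents by accumulating partial similarities of non-zero entries."""
    scores = defaultdict(float)
    for entry_id, weight in query_weights.items():
        if weight == 0:
            continue  # zero entries contribute nothing and are never visited
        for doc_id, partial in index.get(entry_id, []):
            scores[doc_id] += weight * partial
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rank-2 factors, assumed to come from a truncated SVD computed
# offline: one row per query-vector entry and one row per document.
entry_factors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
doc_factors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

index = build_partial_index(entry_factors, doc_factors, theta=0.3)
ranking = query(index, {0: 1.0, 1: 1.0})
```

Documents whose partial similarities all fall below θ for the query's non-zero entries never appear in the postings, so they are pruned without any per-document computation at query time.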
6
Wren JD, Toby I, Hong H, Nanduri B, Kaundal R, Dozmorov MG, Thakkar S. Proceedings of the 2016 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference. BMC Bioinformatics 2016; 17:356. [PMID: 27766933 PMCID: PMC5073803 DOI: 10.1186/s12859-016-1213-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Affiliation(s)
- Jonathan D Wren
  - Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, OK, 73104-5005, USA
  - Biochemistry and Molecular Biology Department, University of Oklahoma Health Sciences Center, Oklahoma City, USA
  - Stephenson Cancer Center, University of Oklahoma Health Sciences Center, Oklahoma City, USA
  - Department of Geriatric Medicine, University of Oklahoma Health Sciences Center, Oklahoma City, USA
- Inimary Toby
  - Department of Clinical Sciences, UT Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX, 75390-9066, USA
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, USA
- Bindu Nanduri
  - Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi, MS, USA
- Rakesh Kaundal
  - Bioinformatics Facility, Institute for Integrative Genome Biology, University of California, Riverside, California, USA
- Mikhail G Dozmorov
  - Department of Biostatistics, Richmond Academy of Medicine, Virginia Commonwealth University, Virginia, USA
- Shraddha Thakkar
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, USA