1. Mao J, Cao Y, Zhang Y, Huang B, Zhao Y. A novel method for identifying key genes in macroevolution based on deep learning with attention mechanism. Sci Rep 2023; 13:19727. [PMID: 37957311 PMCID: PMC10643560 DOI: 10.1038/s41598-023-47113-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 11/09/2023] [Indexed: 11/15/2023]
Abstract
Macroevolution can be regarded as the result of evolutionary changes of synergistically acting genes. Unfortunately, the importance of these genes in macroevolution is difficult to assess, and hence the identification of macroevolutionary key genes is a major challenge in evolutionary biology. In this study, we designed various word embedding libraries of natural language processing (NLP), considering the multiple mechanisms of evolutionary genomics. A novel method (IKGM) based on three types of attention mechanisms (domain attention, kmer attention and fused attention) was proposed to calculate the weights of different genes in macroevolution. Taking 34 species of diurnal butterflies and nocturnal moths in Lepidoptera as an example, we identified a few key genes with high weights, which were annotated to the functions of circadian rhythms, sensory organs, and behavioral habits. This study not only provides a novel method to identify the key genes of macroevolution at the genomic level, but also helps us to understand the microevolution mechanisms of diurnal butterflies and nocturnal moths in Lepidoptera.
Affiliation(s)
- Jiawei Mao
  - College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
- Yong Cao
  - College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
- Yan Zhang
  - College of Mathematics and Physics, Southwest Forestry University, Kunming, 650224, China
- Biaosheng Huang
  - College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
- Youjie Zhao
  - College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
2. Gupta M, Wu H, Arora S, Gupta A, Chaudhary G, Hua Q. Gene Mutation Classification through Text Evidence Facilitating Cancer Tumour Detection. Journal of Healthcare Engineering 2021; 2021:8689873. [PMID: 34367540 PMCID: PMC8337154 DOI: 10.1155/2021/8689873] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/27/2021] [Revised: 06/26/2021] [Accepted: 07/13/2021] [Indexed: 12/03/2022]
Abstract
A cancer tumour consists of thousands of genetic mutations. Despite advances in technology, the task of distinguishing the genetic mutations that act as drivers of tumour growth from passengers (neutral genetic mutations) is still done manually. This is a time-consuming process in which pathologists interpret every genetic mutation from the clinical evidence by hand. These pieces of clinical evidence belong to a total of nine classes, but the criterion of classification is still unknown. The main aim of this research is to propose a multiclass classifier to classify the genetic mutations based on clinical evidence (i.e., the text description of these genetic mutations) using Natural Language Processing (NLP) techniques. The dataset for this research is taken from Kaggle and is provided by the Memorial Sloan Kettering Cancer Center (MSKCC), with contributions from world-class researchers and oncologists. Three text transformation models, namely, CountVectorizer, TfidfVectorizer, and Word2Vec, are utilized for the conversion of text to a matrix of token counts. Three machine learning classification models, namely, Logistic Regression (LR), Random Forest (RF), and XGBoost (XGB), along with the Recurrent Neural Network (RNN) model of deep learning, are applied to the sparse matrix (keywords count representation) of text descriptions. The accuracy score of all the proposed classifiers is evaluated by using the confusion matrix. Finally, the empirical results show that the RNN model of deep learning performed better than the other proposed classifiers, with the highest accuracy of 70%.
Affiliation(s)
- Meenu Gupta
  - Department of Computer Science and Engineering, Chandigarh University, Ajitgarh, Punjab, India
- Hao Wu
  - Digital Zhejiang Technology Operations Co., Ltd., Hangzhou, China
- Simrann Arora
  - Bharati Vidyapeeth's College of Engineering, New Delhi, India
- Akash Gupta
  - Bharati Vidyapeeth's College of Engineering, New Delhi, India
- Gopal Chaudhary
  - Bharati Vidyapeeth's College of Engineering, New Delhi, India
- Qiaozhi Hua
  - Computer School, Hubei University of Arts and Science, Xiangyang 441000, China
3. Li P, Jiang X, Zhang G, Trabucco JT, Raciti D, Smith C, Ringwald M, Marai GE, Arighi C, Shatkay H. Utilizing image and caption information for biomedical document classification. Bioinformatics 2021; 37:i468-i476. [PMID: 34252939 PMCID: PMC8346654 DOI: 10.1093/bioinformatics/btab331] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 05/06/2021] [Indexed: 11/15/2022]
Abstract
Motivation: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.
Results: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.
Availability and implementation: Source code and the list of PMIDs of the publications in our datasets are available upon request.
Affiliation(s)
- Pengyuan Li
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
- Xiangying Jiang
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA; Amazon, Seattle, WA 98109, USA
- Gongbo Zhang
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA; Google, Mountain View, CA 94043, USA
- Juan Trelles Trabucco
  - Department of Computer Science, The University of Illinois at Chicago, Chicago, IL 60612, USA
- Daniela Raciti
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- G Elisabeta Marai
  - Department of Computer Science, The University of Illinois at Chicago, Chicago, IL 60612, USA
- Cecilia Arighi
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
- Hagit Shatkay
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
4. Jiang X, Li P, Kadin J, Blake JA, Ringwald M, Shatkay H. Integrating image caption information into biomedical document classification in support of biocuration. Database: The Journal of Biological Databases and Curation 2021; 2020:5819650. [PMID: 32294192 PMCID: PMC7159034 DOI: 10.1093/database/baaa024] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/31/2019] [Revised: 02/10/2020] [Accepted: 03/11/2020] [Indexed: 01/12/2023]
Abstract
Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012–2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier’s performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL:
Affiliation(s)
- Xiangying Jiang
  - The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA
- Pengyuan Li
  - The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA
- James Kadin
  - The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
- Judith A Blake
  - The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
- Martin Ringwald
  - The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
- Hagit Shatkay
  - The Computational Biomedicine and Machine Learning Lab, Department of Computer & Information Sciences, University of Delaware, 18 Amstel Ave, Newark, DE 19716, USA