1
Tempke R, Musho T. Autonomous generation of single photon emitting materials. NANOSCALE 2024; 16:10239-10249. [PMID: 38726673 DOI: 10.1039/d3nr04944b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2024]
Abstract
The utilization of machine learning in Materials Science underscores the critical importance of the quality and quantity of data in training models effectively. Unlike fields such as image processing and natural language processing, there is limited availability of atomistic datasets, leading to biases in training data. Particularly in the domain of materials discovery, there exists an issue of continuity in atomistic datasets. Experimental data sourced from literature and patents is usually only available for favorable data, resulting in bias in the training dataset. This study focuses on developing a SMILES-based model for generating synthetic datasets of quantum materials using a variational autoencoder. This study centers on the generation of a synthetic dataset of quantum materials specifically for quantum sensing applications, with a focus on two-level quantum molecules that exhibit a dipole blockade. The proposed technique offers an improved sampling algorithm by incorporating newly generated data into the sampling algorithm to create a more normally distributed dataset. Through this technique, the study was able to generate over 1 000 000 candidate quantum materials from a small dataset of only 8000 materials. The generated dataset identified several iodine-containing molecules as promising single photon emitting materials for potential quantum sensing applications.
Affiliation(s)
- Robert Tempke
- Department of Mechanical, Materials and Aerospace Engineering, West Virginia University, P.O. Box 6106, Morgantown, WV, USA.
- Terence Musho
- Department of Mechanical, Materials and Aerospace Engineering, West Virginia University, P.O. Box 6106, Morgantown, WV, USA.
2
Vaškevičius M, Kapočiūtė-Dzikienė J, Vaškevičius A, Šlepikas L. Deep learning-based automatic action extraction from structured chemical synthesis procedures. PeerJ Comput Sci 2023; 9:e1511. [PMID: 37705639 PMCID: PMC10495970 DOI: 10.7717/peerj-cs.1511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 07/07/2023] [Indexed: 09/15/2023]
Abstract
This article proposes a methodology that uses machine learning algorithms to extract actions from structured chemical synthesis procedures, thereby bridging the gap between chemistry and natural language processing. The proposed pipeline combines ML algorithms and scripts to extract relevant data from USPTO and EPO patents, which helps transform experimental procedures into structured actions. This pipeline includes two primary tasks: classifying patent paragraphs to select chemical procedures and converting chemical procedure sentences into a structured, simplified format. We employ artificial neural networks such as long short-term memory, bidirectional LSTMs, transformers, and fine-tuned T5. Our results show that the bidirectional LSTM classifier achieved the highest accuracy of 0.939 in the first task, while the Transformer model attained the highest BLEU score of 0.951 in the second task. The developed pipeline enables the creation of a dataset of chemical reactions and their procedures in a structured format, facilitating the application of AI-based approaches to streamline synthetic pathways, predict reaction outcomes, and optimize experimental conditions. Furthermore, the developed pipeline allows for creating a structured dataset of chemical reactions and procedures, making it easier for researchers to access and utilize the valuable information in synthesis procedures.
Affiliation(s)
- Mantas Vaškevičius
- Department of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania
- JSC Synhet, Kaunas, Lithuania
- Arnas Vaškevičius
- Faculty of Mechanical Engineering and Design, Kaunas University of Technology, Kaunas, Lithuania
3
Om Kumar CU, Gajendran S, Balaji V, Nhaveen A, Sai Balakrishnan S. Securing health care data through blockchain enabled collaborative machine learning. Soft comput 2023; 27:9941-9954. [PMID: 37287568 PMCID: PMC10204011 DOI: 10.1007/s00500-023-08330-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/25/2023] [Indexed: 06/09/2023]
Abstract
Transferring of data in machine learning from one party to another party is one of the issues that has been in existence since the development of technology. Health care data collection using machine learning techniques can lead to privacy issues which cause disturbances among the parties and reduces the possibility to work with either of the parties. Since centralized way of information transfer between two parties can be limited and risky as they are connected using machine learning, this factor motivated us to use the decentralized way where there is no connection but model transfer between both parties will be in process through a federated way. The purpose of this research is to investigate a model transfer between a user and the client(s) in an organization using federated learning techniques and reward the client(s) for their efforts with tokens accordingly using blockchain technology. In this research, the user shares a model to organizations that are willing to volunteer their service to provide help to the user. The model is trained and transferred among the user and the clients in the organizations in a privacy preserving way. In this research, we found that the process of model transfer between user and the volunteered organizations works completely fine with the help of federated learning techniques and the client(s) is/are rewarded with tokens for their efforts. We used the COVID-19 dataset to test the federation process, which yielded individual results of 88% for contributor a, 85% for contributor b, and 74% for contributor c. When using the FedAvg algorithm, we were able to achieve a total accuracy of 82%.
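The abstract names the FedAvg algorithm without describing it. As a rough sketch of the aggregation step only: FedAvg combines the clients' model parameters by a weighted average, with each client weighted by its share of the training samples. The parameter vectors and sample counts below are made-up illustrative values, not data from the paper, and `fed_avg` is a hypothetical helper name.

```python
from typing import List, Sequence


def fed_avg(client_weights: Sequence[Sequence[float]],
            client_sizes: Sequence[int]) -> List[float]:
    """Weighted average of client parameter vectors (FedAvg aggregation step).

    Each client's parameters are weighted by its share of the total samples.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for params, n in zip(client_weights, client_sizes):
        for i in range(dim):
            averaged[i] += params[i] * n / total
    return averaged


# Hypothetical example: three contributors with flattened 2-parameter models.
global_model = fed_avg(
    client_weights=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    client_sizes=[100, 100, 200],
)
```

Only parameters and sample counts cross the boundary between parties in this step; the raw records stay with each contributor, which is the privacy property the abstract relies on.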
Affiliation(s)
- C. U. Om Kumar
- School of Computer Science and Engineering, Vellore Institute of Technology, Chennai Campus, Chennai, India
- Sudhakaran Gajendran
- School of Electronics Engineering, Vellore Institute of Technology, Chennai Campus, Chennai, India
- V. Balaji
- Department of Computer Science and Engineering, SRM Easwari Engineering College, Chennai, Tamil Nadu, India
- A. Nhaveen
- Department of Computer Science and Engineering, SRM Easwari Engineering College, Chennai, Tamil Nadu, India
- S. Sai Balakrishnan
- Department of Computer Science and Engineering, SRM Easwari Engineering College, Chennai, Tamil Nadu, India
4
Cifci MA, Hussain S, Canatalay PJ. Hybrid Deep Learning Approach for Accurate Tumor Detection in Medical Imaging Data. Diagnostics (Basel) 2023; 13:diagnostics13061025. [PMID: 36980333 PMCID: PMC10047127 DOI: 10.3390/diagnostics13061025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2023] [Revised: 03/04/2023] [Accepted: 03/06/2023] [Indexed: 03/30/2023] Open
Abstract
The automated extraction of critical information from electronic medical records, such as oncological medical events, has become increasingly important with the widespread use of electronic health records. However, extracting tumor-related medical events can be challenging due to their unique characteristics. To address this difficulty, we propose a novel approach that utilizes Generative Adversarial Networks (GANs) for data augmentation and pseudo-data generation algorithms to improve the model's transfer learning skills for various tumor-related medical events. Our approach involves a two-stage pre-processing and model training process, where the data is cleansed, normalized, and augmented using pseudo-data. We evaluate our approach using the i2b2/UTHealth 2010 dataset and observe promising results in extracting primary tumor site size, tumor size, and metastatic site information. The proposed method has significant implications for healthcare and medical research as it can extract vital information from electronic medical records for oncological medical events.
Affiliation(s)
- Mehmet Akif Cifci
- The Institute of Computer Technology, Tu Wien University, 1040 Vienna, Austria
- Department of Computer Engineering, Bandirma Onyedi Eylul University, Balikesir 10200, Turkey
- Engineering and Informatics Department, Klaipėdos Valstybinė Kolegija/Higher Education Institution, 92294 Klaipeda, Lithuania
- Sadiq Hussain
- Examination Branch, Dibrugarh University, Dibrugarh 786004, Assam, India
5
Gajendran S, Manjula D, Sugumaran V, Hema R. Extraction of knowledge graph of Covid-19 through mining of unstructured biomedical corpora. Comput Biol Chem 2023; 102:107808. [PMID: 36621289 PMCID: PMC9807269 DOI: 10.1016/j.compbiolchem.2022.107808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Revised: 12/21/2022] [Accepted: 12/29/2022] [Indexed: 01/04/2023]
Abstract
The number of biomedical articles published is increasing rapidly over the years. Currently there are about 30 million articles in PubMed and over 25 million mentions in Medline. Among these fundamentals, Biomedical Named Entity Recognition (BioNER) and Biomedical Relation Extraction (BioRE) are the most essential in analysing the literature. In the biomedical domain, Knowledge Graph is used to visualize the relationships between various entities such as proteins, chemicals and diseases. Scientific publications have increased dramatically as a result of the search for treatments and potential cures for the new Coronavirus, but efficiently analysing, integrating, and utilising related sources of information remains a difficulty. In order to effectively combat the disease during pandemics like COVID-19, literature must be used quickly and effectively. In this paper, we introduced a fully automated framework consists of BERT-BiLSTM, Knowledge graph, and Representation Learning model to extract the top diseases, chemicals, and proteins related to COVID-19 from the literature. The proposed framework uses Named Entity Recognition models for disease recognition, chemical recognition, and protein recognition. Then the system uses the Chemical - Disease Relation Extraction and Chemical - Protein Relation Extraction models. And the system extracts the entities and relations from the CORD-19 dataset using the models. The system then creates a Knowledge Graph for the extracted relations and entities. The system performs Representation Learning on this KG to get the embeddings of all entities and get the top related diseases, chemicals, and proteins with respect to COVID-19.
Affiliation(s)
- Sudhakaran Gajendran
- School of Electronics Engineering, Vellore Institute of Technology, Chennai, India (corresponding author)
- D. Manjula
- School of Computer Science Engineering, Vellore Institute of Technology, Chennai, India
- Vijayan Sugumaran
- Center for Data Science and Big Data Analytics, Oakland University, Rochester, MI, USA; Department of Decision and Information Sciences, School of Business Administration, Oakland University, Rochester, MI, USA
- R. Hema
- Department of Electronics and Communication Engineering, St. Joseph College of Engineering, Chennai, India
6
Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y, Lian Q. Noise Reduction Learning Based on XLNet-CRF for Biomedical Named Entity Recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:595-605. [PMID: 35259113 DOI: 10.1109/tcbb.2022.3157630] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/04/2023]
Abstract
In recent years, Biomedical Named Entity Recognition (BioNER) systems have mainly been based on deep neural networks, which are used to extract information from the rapidly expanding biomedical literature. Long-distance context autoencoding language models based on transformers have recently been employed for BioNER with great success. However, noise interference exists in the process of pre-training and fine-tuning, and there is no effective decoder for label dependency. Current models have many aspects in need of improvement for better performance. We propose two kinds of noise reduction models, Shared Labels and Dynamic Splicing, based on XLNet encoding which is a permutation language pre-training model and decoding by Conditional Random Field (CRF). By testing 15 biomedical named entity recognition datasets, the two models improved the average F1-score by 1.504 and 1.48, respectively, and state-of-the-art performance was achieved on 7 of them. Further analysis proves the effectiveness of the two models and the improvement of the recognition effect of CRF, and suggests the applicable scope of the models according to different data characteristics.
7
Rezaeenour J, Ahmadi M, Jelodar H, Shahrooei R. Systematic review of content analysis algorithms based on deep neural networks. MULTIMEDIA TOOLS AND APPLICATIONS 2022; 82:17879-17903. [PMID: 36313481 PMCID: PMC9589819 DOI: 10.1007/s11042-022-14043-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Revised: 07/12/2022] [Accepted: 10/06/2022] [Indexed: 06/16/2023]
Abstract
Today, owing to social media, the internet, etc., data is produced rapidly and occupies a large space in systems, resulting in enormous data warehouses; progress in information technology has significantly increased the speed and ease of data flow. Text mining is one of the most important methods for extracting a useful model by extracting and adapting knowledge from data sets. Many studies have been conducted on the use of deep learning for text processing and text mining. Text mining is a field that seeks to extract useful information from unstructured textual data and is widely used today. Deep learning and machine learning techniques for classification and text mining, and their types, are also discussed in this paper. Neural networks of various kinds, namely ANN, RNN, CNN, and LSTM, are studied in order to select the best technique. In this study, we conducted a Systematic Literature Review to extract and associate the algorithms and features that have been used in this area. Based on our search criteria, we retrieved 130 relevant studies from electronic databases between 1997 and 2021; we selected 43 studies for further analysis using the inclusion and exclusion criteria in Section 3.2. According to this study, hybrid LSTM is the most widely used deep learning algorithm in these studies, and among machine learning methods SVM shows high accuracy.
Affiliation(s)
- Jalal Rezaeenour
- Department of Industrial Engineering, University of Qom, Qom, Iran
- Mahnaz Ahmadi
- Department of Industrial Engineering, University of Qom, Qom, Iran
- Hamed Jelodar
- Faculty of computer science, Dalhousie University, 6050 University Ave, Halifax, NS B3H 1W5 Canada
- Roshan Shahrooei
- Department of Industrial Engineering, University of Qom, Qom, Iran
8
Tarasova OA, Rudik AV, Biziukova NY, Filimonov DA, Poroikov VV. Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach. J Cheminform 2022; 14:55. [PMID: 35964150 PMCID: PMC9375066 DOI: 10.1186/s13321-022-00633-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 07/12/2022] [Indexed: 11/24/2022] Open
Abstract
Motivation: Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced.
Methods and results: We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to the earlier developed CNER methods, our approach uses the representation of the data as a set of fragments of text (FoTs) with the subsequent preparation of a set of multi-n-grams (sequences from one to n symbols) for each FoT. Our approach may provide the recognition of novel CNEs. For the CHEMDNER corpus, the sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92 based on five-fold cross validation. We applied the developed algorithm to the extracted CNEs of potential Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
Conclusion: The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry. Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-022-00633-4.
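Two quantities in this abstract can be made concrete. Balanced accuracy is the mean of sensitivity and specificity, so the reported 0.95 and 0.88 give about 0.915, consistent with the stated 0.92 given that the inputs are themselves rounded. The multi-n-gram set of a fragment is described only as "sequences from one to n symbols"; the sketch below reads that as contiguous character substrings, which is an assumption, and both function names are hypothetical rather than the authors' code.

```python
from typing import List


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity (recall) and specificity."""
    return (sensitivity + specificity) / 2


def multi_n_grams(fragment: str, n: int) -> List[str]:
    """All contiguous character substrings of length 1..n, shortest first.

    Assumption: "sequences from one to n symbols" means contiguous substrings.
    """
    grams = []
    for size in range(1, n + 1):
        for start in range(len(fragment) - size + 1):
            grams.append(fragment[start:start + size])
    return grams


ba = balanced_accuracy(0.95, 0.88)   # values reported for the CHEMDNER corpus
grams = multi_n_grams("NaCl", 2)     # ['N', 'a', 'C', 'l', 'Na', 'aC', 'Cl']
```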
Affiliation(s)
- O A Tarasova
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia.
- A V Rudik
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
- N Yu Biziukova
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
- D A Filimonov
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
- V V Poroikov
- Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
9
An intelligent disease prediction and monitoring system using feature selection, multi-neural network and fuzzy rules. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07527-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
10
Ozaydin B, Zengul F, Oner N, Delen D. Approaches for text mining of mHealth literature. Mhealth 2022; 8:11. [PMID: 35449502 PMCID: PMC9014235 DOI: 10.21037/mhealth-22-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2022] [Accepted: 02/09/2022] [Indexed: 11/06/2022] Open
Affiliation(s)
- Bunyamin Ozaydin
- Department of Health Services Administration, the University of Alabama at Birmingham, Birmingham, AL, USA
- Ferhat Zengul
- Department of Health Services Administration, the University of Alabama at Birmingham, Birmingham, AL, USA
- Nurettin Oner
- Department of Health Services Administration, the University of Alabama at Birmingham, Birmingham, AL, USA
- Dursun Delen
- Department of Management Science and Information Systems, Oklahoma State University, Stillwater, OK, USA
11
Autonomous design of new chemical reactions using a variational autoencoder. Commun Chem 2022; 5:40. [PMID: 36697652 PMCID: PMC9814385 DOI: 10.1038/s42004-022-00647-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/28/2021] [Accepted: 02/16/2022] [Indexed: 01/28/2023] Open
Abstract
Artificial intelligence based chemistry models are a promising method of exploring chemical reaction design spaces. However, training datasets based on experimental synthesis are typically reported only for the optimal synthesis reactions. This leads to an inherited bias in the model predictions. Therefore, robust datasets that span the entirety of the solution space are necessary to remove inherited bias and permit complete training of the space. In this study, an artificial intelligence model based on a Variational AutoEncoder (VAE) has been developed and investigated to synthetically generate continuous datasets. The approach involves sampling the latent space to generate new chemical reactions. This developed technique is demonstrated by generating over 7,000,000 new reactions from a training dataset containing only 7,000 reactions. The generated reactions include molecular species that are larger and more diverse than the training set.
12
Attention-Based Deep Multiple-Instance Learning for Classifying Circular RNA and Other Long Non-Coding RNA. Genes (Basel) 2021; 12:genes12122018. [PMID: 34946967 PMCID: PMC8701965 DOI: 10.3390/genes12122018] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/11/2021] [Revised: 12/14/2021] [Accepted: 12/17/2021] [Indexed: 12/23/2022] Open
Abstract
Circular RNA (circRNA) is a distinguishable circular formed long non-coding RNA (lncRNA), which has specific roles in transcriptional regulation, multiple biological processes. The identification of circRNA from other lncRNA is necessary for relevant research. In this study, we designed attention-based multi-instance learning (MIL) network architecture fed with a raw sequence, to learn the sparse features of RNA sequences and to accomplish the circRNAs identification task. The model outperformed the state-of-art models. Moreover, following the validation of the attention mechanism effectiveness by the handwritten digit dataset, the key sequence loci underlying circRNA’s recognition were obtained based on the corresponding attention score. Then, motif enrichment analysis identified some of the key motifs for circRNA formation. In conclusion, we designed deep learning network architecture suitable for learning gene sequences with sparse features and implemented it for the circRNA identification task, and the model has strong representation capability in the indication of some key loci.
13
Ramachandran R, Arutchelvan K. Named entity recognition on bio-medical literature documents using hybrid based approach. JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING 2021:1-10. [PMID: 33723489 PMCID: PMC7947151 DOI: 10.1007/s12652-021-03078-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/22/2020] [Accepted: 03/02/2021] [Indexed: 06/02/2023]
Abstract
There have been many changes in the medical field due to technological advances. The progression in technologies provides a lot of opportunities to extract valuable insights from huge amounts of unstructured data. The literature documents published by researchers in the medical domain contain an enormous amount of knowledge. Many organizations are involved in retrieving the hidden information from the literature documents. Extracting the drug names, diseases, symptoms, route of administration, species and dosage forms from the textual document is an easy task due to the innovation of technologies in Natural Language Processing. In this article, a new hybrid based approach is proposed to identify named entities from the medical literature documents. A new dictionary has been built for route of administration, dosage forms and symptoms to annotate the entities in the medical documents. The annotated entities are used to train a blank spaCy machine learning model. The trained model provides a decent accuracy when compared with the existing model. The hybrid model is validated with the dictionary and human (optional) to calculate the confusion matrix. It is able to identify more entities than the prevailing model. The average F1 score for five entities of the proposed hybrid based approach is 73.79%.
Affiliation(s)
- R. Ramachandran
- Department of Computer and Information Science, Annamalai University, Tamil Nadu, Chidambaram, India
- K. Arutchelvan
- Department of Computer and Information Science, Annamalai University, Tamil Nadu, Chidambaram, India