1. Lindley S, Lu Y, Shukla D. The Experimentalist's Guide to Machine Learning for Small Molecule Design. ACS Applied Bio Materials 2024; 7:657-684. [PMID: 37535819 PMCID: PMC10880109 DOI: 10.1021/acsabm.3c00054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 07/17/2023] [Indexed: 08/05/2023]
Abstract
Initially part of the field of artificial intelligence, machine learning (ML) has become a booming research area since branching out into its own field in the 1990s. After three decades of refinement, ML algorithms have accelerated scientific developments across a variety of research topics. The field of small molecule design is no exception, and an increasing number of researchers are applying ML techniques in their pursuit of discovering, generating, and optimizing small molecule compounds. The goal of this review is to provide simple, yet descriptive, explanations of some of the most commonly utilized ML algorithms in the field of small molecule design along with those that are highly applicable to an experimentally focused audience. The algorithms discussed here span across three ML paradigms: supervised learning, unsupervised learning, and ensemble methods. Examples from the published literature will be provided for each algorithm. Some common pitfalls of applying ML to biological and chemical data sets will also be explained, alongside a brief summary of a few more advanced paradigms, including reinforcement learning and semi-supervised learning.
Affiliation(s)
- Sarah E. Lindley
  - Department of Bioengineering, University of Illinois, Urbana-Champaign, Illinois 61801, United States
- Yiyang Lu
  - Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana-Champaign, Illinois 61801, United States
- Diwakar Shukla
  - Department of Bioengineering, University of Illinois, Urbana-Champaign, Illinois 61801, United States
  - Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana-Champaign, Illinois 61801, United States
  - Center for Biophysics & Computational Biology, University of Illinois, Urbana-Champaign, Illinois 61801, United States
  - Department of Plant Biology, University of Illinois, Urbana-Champaign, Illinois 61801, United States
2. Silva JCF, Ferreira MA, Carvalho TFM, Silva FF, de A. Silveira S, Brommonschenkel SH, Fontes EPB. RLPredictiOme, a Machine Learning-Derived Method for High-Throughput Prediction of Plant Receptor-like Proteins, Reveals Novel Classes of Transmembrane Receptors. Int J Mol Sci 2022; 23:12176. [PMID: 36293031 PMCID: PMC9603095 DOI: 10.3390/ijms232012176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2022] [Revised: 10/08/2022] [Accepted: 10/09/2022] [Indexed: 11/16/2022] Open
Abstract
Cell surface receptors play essential roles in perceiving and processing external and internal signals at the cell surface of plants and animals. The receptor-like protein kinases (RLK) and receptor-like proteins (RLPs), two major classes of proteins with membrane receptor configuration, play a crucial role in plant development and disease defense. Although RLPs and RLKs share a similar single-pass transmembrane configuration, RLPs harbor short divergent C-terminal regions instead of the conserved kinase domain of RLKs. This RLP receptor structural design precludes sequence comparison algorithms from being used for high-throughput predictions of the RLP family in plant genomes, as has been extensively performed for RLK superfamily predictions. Here, we developed the RLPredictiOme, implemented with machine learning models in combination with Bayesian inference, capable of predicting RLP subfamilies in plant genomes. The ML models were simultaneously trained using six types of features, along with three stages to distinguish RLPs from non-RLPs (NRLPs), RLPs from RLKs, and classify new subfamilies of RLPs in plants. The ML models achieved high accuracy, precision, sensitivity, and specificity for predicting RLPs with relatively high probability ranging from 0.79 to 0.99. The prediction of the method was assessed with three datasets, two of which contained leucine-rich repeats (LRR)-RLPs from Arabidopsis and rice, and the last one consisted of the complete set of previously described Arabidopsis RLPs. In these validation tests, more than 90% of known RLPs were correctly predicted via RLPredictiOme. In addition to predicting previously characterized RLPs, RLPredictiOme uncovered new RLP subfamilies in the Arabidopsis genome. These include probable lipid transfer (PLT)-RLP, plastocyanin-like-RLP, ring finger-RLP, glycosyl-hydrolase-RLP, and glycerophosphoryldiester phosphodiesterase (GDPD, GDPDL)-RLP subfamilies, yet to be characterized. 
Compared to the only Arabidopsis GDPDL-RLK, molecular evolution studies confirmed that the ectodomain of GDPDL-RLPs might have undergone a purifying selection with a predominance of synonymous substitutions. Expression analyses revealed that predicted GDPDL-RLPs display a basal expression level and respond to developmental and biotic signals. The results of these biological assays indicate that these subfamily members have maintained functional domains during evolution and may play relevant roles in development and plant defense. Therefore, RLPredictiOme provides a framework for genome-wide surveys of the RLP superfamily as a foundation to rationalize functional studies of surface receptors and their relationships with different biological processes.
Affiliation(s)
- Jose Cleydson F. Silva
  - National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Viçosa 36570-900, Brazil
- Marco Aurélio Ferreira
  - Department of Biochemistry and Molecular Biology, Universidade Federal de Viçosa, Viçosa 36570-900, Brazil
- Thales F. M. Carvalho
  - Institute of Engineering, Science and Technology, Universidade Federal dos Vales do Jequitinhonha e Mucuri, Janaúba 39447-814, Brazil
- Fabyano F. Silva
  - Department of Animal Science, Universidade Federal de Viçosa, Viçosa 36570-900, Brazil
- Sabrina de A. Silveira
  - Department of Computer Science, Universidade Federal de Viçosa, Viçosa 36570-900, Brazil
- Elizabeth P. B. Fontes
  - Department of Biochemistry and Molecular Biology, Universidade Federal de Viçosa, Viçosa 36570-900, Brazil
3. Positive effects of bubbles as a feeding predictor on behaviour of farmed rainbow trout. Sci Rep 2022; 12:11368. [PMID: 35790759 PMCID: PMC9256598 DOI: 10.1038/s41598-022-15302-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/08/2022] [Accepted: 06/22/2022] [Indexed: 11/11/2022] Open
Abstract
Occupational enrichment emerges as a promising strategy for improving the welfare of farmed animals. This form of enrichment aims to stimulate cognitive abilities of animals by providing them with more opportunities to interact with and control their environment. Predictability of salient daily events, and in particular predictability of feeding, is currently one of the most studied occupational enrichment strategies and can take several forms. In fish, while temporal predictability of feeding has been widely investigated, signalled predictability (based on a signal, such as light or sound) has received little attention. Depending on the type of predictability used and the ecology of the species, the effects on fish welfare often differ. The present study aimed to determine which feeding predictability would be most appropriate for rainbow trout, the main continental farmed fish in Europe, and what the consequences might be for their welfare. We tested four feeding predictability conditions: temporal (based on time of day), signalled (based on bubble diffusion), temporal + signalled (based on time and bubble diffusion), and unpredictable (random feeding times). The behavioural and zootechnical outcomes recorded were swimming activity, aggressive behaviours, bursts of acceleration, jumps, emotional reactivity, and growth. Our results showed that rainbow trout can predict daily feedings relying on time and/or bubbles as predictors as early as two weeks of conditioning, as evidenced by their increased swimming activity before feeding or during feed omission tests, which made it possible to reinforce their conditioned response. Temporal predictability alone resulted in an increase in pre-feeding aggressive behaviours, bursts of acceleration, and jumps, suggesting that the use of time as the sole predictor of feedings in husbandry practices may be detrimental to fish welfare. 
Signalled predictability with bubbles alone resulted in fewer pre-feeding agonistic behaviours, bursts of acceleration, and jumps than in the temporal predictability condition. The combination of temporal and signalled predictability elicited the highest conditioned response, and the levels of pre-feeding aggressive behaviours, bursts of acceleration, and jumps tended to be lower than for temporal predictability alone. Interestingly, fish swimming activity during bubble diffusion also revealed that bubbles were highly attractive regardless of the condition. Rainbow trout growth and emotional reactivity were not affected by the predictability condition. We conclude, therefore, that the use of bubbles as a feeding predictor could represent an interesting approach to improve rainbow trout welfare in farms, by acting as both an occupational and physical enrichment.
4. Identifying Key Biomarkers and Immune Infiltration in Female Patients with Ischemic Stroke Based on Weighted Gene Co-Expression Network Analysis. Neural Plast 2022; 2022:5379876. [PMID: 35432523 PMCID: PMC9012649 DOI: 10.1155/2022/5379876] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/24/2021] [Revised: 02/24/2022] [Accepted: 03/07/2022] [Indexed: 01/02/2023] Open
Abstract
Stroke is one of the leading causes of death and disability worldwide. Evidence shows that ischemic stroke (IS) accounts for nearly 80 percent of all strokes and that the etiology, risk factors, and prognosis of this disease differ by gender. Female patients may bear a greater burden than male patients. The immune system may play an important role in the pathophysiology of females with IS. Therefore, it is critical to investigate the key biomarkers and immune infiltration of female IS patients to develop effective treatment methods. Herein, we used weighted gene co-expression network analysis (WGCNA) to determine the key modules and core genes in female IS patients using the GSE22255, GSE37587, and GSE16561 datasets from the GEO database. Subsequently, we performed functional enrichment analysis and built a protein-protein interaction (PPI) network. Ten genes were selected as the true central genes for further investigation. After that, we explored the specific molecular and biological functions of these hub genes to gain a better understanding of the underlying pathogenesis of female IS patients. Moreover, the “Cell type Identification by Estimating Relative Subsets of RNA Transcripts (CIBERSORT)” was used to examine the distribution pattern of immune subtypes in female patients with IS and normal controls, revealing a new potential target for clinical treatment of the disease.
5. Silva JCF, Teixeira RM, Silva FF, Brommonschenkel SH, Fontes EPB. Machine learning approaches and their current application in plant molecular biology: A systematic review. Plant Science: An International Journal of Experimental Plant Biology 2019; 284:37-47. [PMID: 31084877 DOI: 10.1016/j.plantsci.2019.03.020] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 12/18/2018] [Revised: 02/28/2019] [Accepted: 03/26/2019] [Indexed: 05/19/2023]
Abstract
Machine learning (ML) is a field of artificial intelligence that has rapidly emerged in molecular biology, thus allowing the exploitation of Big Data concepts in plant genomics. In this context, the main challenges are given in terms of how to analyze massive datasets and extract new knowledge in all levels of cellular systems research. In summary, ML techniques allow complex interactions to be inferred in several biological systems. Despite its potential, ML has been underused due to complex computational algorithms and definition terms. Therefore, a systematic review to disentangle ML approaches is relevant for plant scientists and has been considered in this study. We presented the main steps for ML development (from data selection to evaluation of classification/prediction models) with a respective discussion approaching functional genomics mainly in terms of pathogen effector genes in plant immunity. Additionally, we also considered how to access public source databases under an ML framework towards advancing plant molecular biology and introduced novel powerful tools, such as deep learning.
Affiliation(s)
- Jose Cleydson F Silva
  - National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Universidade Federal de Viçosa, Av. PH Rolfs s/n, Centro, Viçosa, MG, 36570-000, Brazil; Department of Biochemistry and Molecular Biology/Bioagro, Universidade Federal de Viçosa, Viçosa, MG, Brazil
- Ruan M Teixeira
  - National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Universidade Federal de Viçosa, Av. PH Rolfs s/n, Centro, Viçosa, MG, 36570-000, Brazil; Department of Biochemistry and Molecular Biology/Bioagro, Universidade Federal de Viçosa, Viçosa, MG, Brazil
- Fabyano F Silva
  - Department of Animal Science, Universidade Federal de Viçosa, Viçosa, MG, Brazil
- Sergio H Brommonschenkel
  - National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Universidade Federal de Viçosa, Av. PH Rolfs s/n, Centro, Viçosa, MG, 36570-000, Brazil; Plant Pathology Department/Bioagro, Universidade Federal de Viçosa, Viçosa, MG, Brazil
- Elizabeth P B Fontes
  - National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Universidade Federal de Viçosa, Av. PH Rolfs s/n, Centro, Viçosa, MG, 36570-000, Brazil; Department of Biochemistry and Molecular Biology/Bioagro, Universidade Federal de Viçosa, Viçosa, MG, Brazil
6. Su W, Gu X, Peterson T. TIR-Learner, a New Ensemble Method for TIR Transposable Element Annotation, Provides Evidence for Abundant New Transposable Elements in the Maize Genome. Molecular Plant 2019; 12:447-460. [PMID: 30802553 DOI: 10.1016/j.molp.2019.02.008] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 09/02/2018] [Revised: 02/19/2019] [Accepted: 02/19/2019] [Indexed: 05/21/2023]
Abstract
Transposable elements (TEs) make up a large and rapidly evolving proportion of plant genomes. Among Class II DNA TEs, TIR elements are flanked by characteristic terminal inverted repeat sequences (TIRs). TIR TEs may play important roles in genome evolution, including generating allelic diversity, inducing structural variation, and regulating gene expression. However, TIR TE identification and annotation has been hampered by the lack of effective tools, resulting in erroneous TE annotations and a significant underestimation of the proportion of TIR elements in the maize genome. This problem has largely limited our understanding of the impact of TIR elements on plant genome structure and evolution. In this paper, we propose a new method of TIR element detection and annotation. This new pipeline combines the advantages of current homology-based annotation methods with powerful de novo machine-learning approaches, resulting in greatly increased efficiency and accuracy of TIR element annotation. The results show that the copy number and genome proportion of TIR elements in maize is much larger than that of current annotations. In addition, the distribution of some TIR superfamily elements is reduced in centromeric and pericentromeric positions, while others do not show a similar bias. Finally, the incorporation of machine-learning techniques has enabled the identification of large numbers of new DTA (hAT) family elements, which have all the hallmarks of bona fide TEs yet which lack high homology with currently known DTA elements. Together, these results provide new tools for TE research and new insight into the impact of TIR elements on maize genome diversity.
Affiliation(s)
- Weijia Su
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-3260, USA
- Xun Gu
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-3260, USA
- Thomas Peterson
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-3260, USA; Department of Agronomy, Iowa State University, Ames, IA 50011-3260, USA
7. Akond Z, Hasan MN, Alam M, Alam M, Mollah MH. Classification of Functional Metagenomes Recovered from Different Environmental Samples. Bioinformation 2019; 15:26-32. [PMID: 31359995 PMCID: PMC6651027 DOI: 10.6026/97320630015026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2018] [Accepted: 12/26/2018] [Indexed: 11/23/2022] Open
Abstract
Classification of functional metagenomes from microbial communities plays a vital role in metagenomics research. In this paper, we investigated the performance of the beta-t random forest classifier for classifying metagenomics data. Nine key functional metagenomic variables were selected from 10 different microbial communities using the beta-t test statistic at the 5% level of significance. The beta-t random forest classifier then showed higher accuracy (96%) and true positive rate (96%), and lower false positive rate (5%), false discovery rate (5%), and misclassification error rate (5%), for classification of metagenomes. This method performed better than Bayes, SVM, KNN, AdaBoost, and LogitBoost.
Affiliation(s)
- Zobaer Akond
  - Bioinformatics Lab, Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
  - Institute of Environmental Science, University of Rajshahi-6205, Bangladesh
  - Agricultural Statistics and Information and Communication Technology (ASICT) Division, Bangladesh Agricultural Research Institute (BARI), Joydebpur, Gazipur-1701, Bangladesh
- Mohammad Nazmol Hasan
  - Bioinformatics Lab, Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
  - Bangabandhu Sheikh Mujibur Rahaman Agricultural University, Joydebpur, Gazipur-1706, Bangladesh
- Md. Jahangir Alam
  - Bioinformatics Lab, Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
- Munirul Alam
  - Emerging Infections, Infectious Diseases Division, International Centre for Diarrheal Disease Research, Bangladesh (icddr,b)
- Md. Nurul Haque Mollah
  - Bangabandhu Sheikh Mujibur Rahaman Agricultural University, Joydebpur, Gazipur-1706, Bangladesh