1
Liu J, Tang X, Guan X. Grain protein function prediction based on self-attention mechanism and bidirectional LSTM. Brief Bioinform 2023; 24:6886418. [PMID: 36567619 DOI: 10.1093/bib/bbac493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Revised: 10/13/2022] [Accepted: 10/18/2022] [Indexed: 12/27/2022]
Abstract
With the development of genome sequencing technology, using computing technology to predict grain protein function has become one of the important tasks of bioinformatics. The protein data of four grains, soybean, maize, indica and japonica are selected in this experimental dataset. In this paper, a novel neural network algorithm Chemical-SA-BiLSTM is proposed for grain protein function prediction. The Chemical-SA-BiLSTM algorithm fuses the chemical properties of proteins on the basis of amino acid sequences, and combines the self-attention mechanism with the bidirectional Long Short-Term Memory network. The experimental results show that the Chemical-SA-BiLSTM algorithm is superior to other classical neural network algorithms, and can more accurately predict the protein function, which proves the effectiveness of the Chemical-SA-BiLSTM algorithm in the prediction of grain protein function. The source code of our method is available at https://github.com/HwaTong/Chemical-SA-BiLSTM.
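The two ideas named in this abstract, appending chemical properties to the sequence encoding and applying self-attention, can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository). The choice of Kyte-Doolittle hydropathy as the chemical property, the function names, and the use of unlearned attention with Q = K = V are all assumptions made for illustration, and the BiLSTM stage is omitted.

```python
import math

# Kyte-Doolittle hydropathy values, used here as one example of a
# per-residue chemical property (an assumption, not the paper's feature set).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(HYDROPATHY)


def encode(sequence):
    """Return one row per residue: a 20-dim one-hot vector plus scaled hydropathy."""
    rows = []
    for residue in sequence:
        one_hot = [1.0 if residue == ref else 0.0 for ref in AMINO_ACIDS]
        rows.append(one_hot + [HYDROPATHY[residue] / 4.5])
    return rows


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    dim = len(x[0])
    scale = math.sqrt(dim)
    weights = [
        softmax([sum(a * b for a, b in zip(query, key)) / scale for key in x])
        for query in x
    ]
    output = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(dim)]
        for row in weights
    ]
    return output, weights


features = encode("MKVLA")          # hypothetical 5-residue fragment
context, attention = self_attention(features)
```

In a trained model the queries, keys, and values would come from learned projections, and the attended features would typically be passed to or combined with a recurrent layer; the sketch only shows how each position is re-expressed as a weighted mix of all positions.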
Affiliation(s)
- Jing Liu: College of Information Engineering, Shanghai Maritime University, 201306, Shanghai, China
- Xinghua Tang: College of Information Engineering, Shanghai Maritime University, 201306, Shanghai, China
- Xiao Guan: School of Health Science and Engineering, University of Shanghai for Science and Technology, 200093, Shanghai, China
2
Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N, Altamirano-Bustamante MM. Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field. Front Bioeng Biotechnol 2022; 10:788300. [PMID: 35875501 PMCID: PMC9301016 DOI: 10.3389/fbioe.2022.788300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2021] [Accepted: 05/25/2022] [Indexed: 11/23/2022]
Abstract
Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit-explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. There are many tasks such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, pathogenesis of protein misfolding and disease, and proteostasis networks that are currently unexplored or unrevealed. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between what we call binomial artificial intelligence (AI) and protein science (PS), a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring "the state of the art" in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including What kind of tasks are already explored by ML approaches to protein sciences? What are the most common ML algorithms and databases used? What is the situational diagnostic of the AI-PS inter-field? What do ML processing steps have in common? We also formulate novel questions such as Is it possible to discover what the rules of protein evolution are with the binomial AI-PS? How do protein folding pathways evolve? What are the rules that dictate the folds? 
What are the minimal nuclear protein structures? How do protein aggregates form and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we have conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows: regarding AI applications, there are mainly four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniprotKB/Swissprot were the most common ones (21 and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI-PS binomial. All research efforts to collect, integrate multidimensional data features, and then analyze and validate them are, so far, uncoordinated and scattered throughout the scientific literature without a clear epistemic goal or connection between the studies. 
Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design, protein structures, design, and function prediction while also presenting the "state of the art" on research in the AI-PS binomial until February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins and protein networks and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of the novel proteins and metabolic pathways are currently non-existent in nature, nor are they used in the chemical industry or biomedical field.
Affiliation(s)
- Jalil Villalobos-Alva: Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Luis Ochoa-Toledo: Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Mario Javier Villalobos-Alva: Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Atocha Aliseda: Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Fernando Pérez-Escamirosa: Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Francine Ochoa-Fernández: Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Ricardo Zamora-Solís: Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Sebastián Villalobos-Alva: Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Cristina Revilla-Monsalve: Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Nicolás Kemper-Valverde: Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Myriam M. Altamirano-Bustamante: Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
3
Bileschi ML, Belanger D, Bryant DH, Sanderson T, Carter B, Sculley D, Bateman A, DePristo MA, Colwell LJ. Using deep learning to annotate the protein universe. Nat Biotechnol 2022; 40:932-937. [PMID: 35190689 DOI: 10.1038/s41587-021-01179-w] [Citation(s) in RCA: 88] [Impact Index Per Article: 44.0] [Received: 05/06/2021] [Accepted: 12/02/2021] [Indexed: 12/30/2022]
Abstract
Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.
Affiliation(s)
- Theo Sanderson: Google Research, Cambridge, MA, USA; The Francis Crick Institute, London, UK
- Brandon Carter: MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Alex Bateman: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
- Mark A DePristo: Google Research, Cambridge, MA, USA; BigHat Biosciences, San Mateo, CA, USA
- Lucy J Colwell: Google Research, Cambridge, MA, USA; Department of Chemistry, University of Cambridge, Cambridge, UK
4
Chauhan V, Tiwari A, Joshi N, Khandelwal S. Multi-label classifier for protein sequence using heuristic-based deep convolution neural network. Appl Intell 2022. [DOI: 10.1007/s10489-021-02529-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
5
Yu L, Xue L, Liu F, Li Y, Jing R, Luo J. The applications of deep learning algorithms on in silico druggable proteins identification. J Adv Res 2022; 41:219-231. [PMID: 36328750 PMCID: PMC9637576 DOI: 10.1016/j.jare.2022.01.009] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 08/26/2021] [Revised: 12/21/2021] [Accepted: 01/18/2022] [Indexed: 11/20/2022]
Abstract
We developed the first deep learning-based druggable protein classifier for fast and accurate identification of potential druggable proteins. Experimental results on a standard dataset demonstrate that the prediction performance of the deep learning model is comparable to that of existing methods. We visualized the representations of druggable proteins learned by deep learning models, which helps us understand how they work. Our analysis reconfirms that the attention mechanism is especially useful for explaining deep learning models.
Introduction: The top priority in drug development is to identify novel and effective drug targets. In vitro assays are frequently used for this purpose; however, traditional experimental approaches are insufficient for large-scale exploration of novel drug targets, as they are expensive, time-consuming and laborious. Therefore, computational methods have emerged in recent decades as an alternative to aid experimental drug discovery studies by developing sophisticated predictive models to estimate unknown drugs/compounds and their targets. The recent success of deep learning (DL) techniques in machine learning and artificial intelligence has further attracted a great deal of attention in the biomedicine field, including computational drug discovery.
Objectives: This study focuses on the practical applications of deep learning algorithms for predicting druggable proteins and proposes a powerful predictor for fast and accurate identification of potential drug targets.
Methods: Using a gold-standard dataset, we explored several typical protein features and different deep learning algorithms and evaluated their performance in a comprehensive way. We provide an overview of the entire experimental process, including protein features and descriptors, neural network architectures, libraries and toolkits for deep learning modelling, performance evaluation metrics, model interpretation and visualization.
Results: Experimental results show that the hybrid model (architecture: CNN-RNN (BiLSTM) + DNN; feature: dictionary encoding + DC_TC_CTD) performed better than the other models on the benchmark dataset. This hybrid model achieved 90.0% accuracy and 0.800 MCC on the test dataset and 84.8% accuracy and 0.703 MCC on a nonredundant independent test dataset, results that are comparable to those of existing methods.
Conclusion: We developed the first deep learning-based classifier for fast and accurate identification of potential druggable proteins. We hope that this study will be helpful for future researchers who would like to use deep learning techniques to develop relevant predictive models.
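The two metrics reported in this abstract, accuracy and the Matthews correlation coefficient (MCC), can both be computed from a binary confusion matrix. The sketch below shows the standard formulas; the function names and the confusion-matrix counts are hypothetical and are not taken from the paper. They were chosen only because a balanced test set with 10% error in each class happens to give 90% accuracy and an MCC of 0.8.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a balanced 100-protein test set (illustration only).
tp, tn, fp, fn = 45, 45, 5, 5
acc = accuracy(tp, tn, fp, fn)            # 0.9
mcc = matthews_corrcoef(tp, tn, fp, fn)   # 0.8
```

MCC uses all four cells of the confusion matrix, so it stays informative on imbalanced data where accuracy alone can look deceptively high.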
6
Keresztes L, Szögi E, Varga B, Grolmusz V. Identifying super-feminine, super-masculine and sex-defining connections in the human braingraph. Cogn Neurodyn 2021; 15:949-959. [PMID: 34786030 PMCID: PMC8572280 DOI: 10.1007/s11571-021-09687-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/02/2020] [Revised: 04/23/2021] [Accepted: 05/29/2021] [Indexed: 11/26/2022]
Abstract
For more than a decade now, we have been able to discover and study thousands of cerebral connections with the application of diffusion magnetic resonance imaging (dMRI) techniques and the accompanying algorithmic workflow. While numerous connectomical results have been published elucidating the relation between the braingraph and certain biological, medical, and psychological properties, it is still a great challenge to identify a small number of brain connections closely related to those conditions. In the present contribution, by applying the 1200 Subjects Release of the Human Connectome Project (HCP) and Support Vector Machines, we identify just 102 connections out of the total number of 1950 connections in the 83-vertex graphs of 1064 subjects, which, by a simple linear test, determine the sex of the subject precisely, without any error. Next, we re-scaled the weights of the edges (corresponding to the discovered fibers) to be between 0 and 1, and, very surprisingly, we were able to identify two graph edges out of these 102, such that, if their weights are both 1, then the connectome always belongs to a female subject, independently of the other edges. Similarly, we have identified 3 edges from these 102, whose weights, if two of them are 1 and one is 0, imply that the graph belongs to a male subject, again independently of the other edges. We call the former 2 edges superfeminine and the first two of the 3 edges supermasculine edges of the human connectome. Even more interestingly, the edge connecting the right Pars Triangularis and the right Superior Parietal areas is one of the 2 superfeminine edges, and it is also the third edge, accompanying the two supermasculine connections if its weight is 0; therefore, it is also a "switching" edge. Identifying such edge-sets of distinction is the unprecedented result of this work. Supplementary information: The online version contains supplementary material available at 10.1007/s11571-021-09687-w.
Affiliation(s)
- László Keresztes: PIT Bioinformatics Group, Eötvös University, H-1117 Budapest, Hungary
- Evelin Szögi: PIT Bioinformatics Group, Eötvös University, H-1117 Budapest, Hungary
- Bálint Varga: PIT Bioinformatics Group, Eötvös University, H-1117 Budapest, Hungary
- Vince Grolmusz: PIT Bioinformatics Group, Eötvös University, H-1117 Budapest, Hungary; Uratim Ltd., H-1118 Budapest, Hungary
7
Pissarra J, Pagniez J, Petitdidier E, Séveno M, Vigy O, Bras-Gonçalves R, Lemesre JL, Holzmuller P. Proteomic Analysis of the Promastigote Secretome of Seven Leishmania Species. J Proteome Res 2021; 21:30-48. [PMID: 34806897 DOI: 10.1021/acs.jproteome.1c00244] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/30/2022]
Abstract
Leishmaniasis is one of the most impactful parasitic diseases worldwide, endangering the lives of 1 billion people every year. There are 20 different species of Leishmania able to infect humans, causing cutaneous (CL), visceral (VL), and/or mucocutaneous leishmaniasis (MCL). Leishmania parasites are known to secrete a plethora of proteins to establish infection and modulate the host's immune system. In this study, we analyzed using tandem mass spectrometry the total protein content of the secretomes produced by promastigote forms from seven Leishmania species grown in serum-free in vitro cultures. The core secretome shared by all seven Leishmania species corresponds to up to one-third of total secreted proteins, suggesting conserved mechanisms of adaptation to the vertebrate host. The relative abundance confirms the importance of known virulence factors and some proteins uniquely present in CL- or VL-causing species and may provide further insight regarding their pathogenesis. Bioinformatic analysis showed that most proteins were secreted via unconventional mechanisms, with an important role for vesicle-based secretion for all species. Gene Ontology annotation and enrichment analyses showed a high level of functional conservation among species. This study contributes to the current knowledge on the biological significance of differently secreted proteins and provides new information on the correlation of Leishmania secretome to clinical outcomes and species-specific pathogenesis.
Affiliation(s)
- Joana Pissarra: UMR 177 INTERTRYP, Institut de Recherche pour le Développement (IRD), 34394 Montpellier, France
- Julie Pagniez: UMR 177 INTERTRYP, Institut de Recherche pour le Développement (IRD), 34394 Montpellier, France
- Elodie Petitdidier: UMR 177 INTERTRYP, Institut de Recherche pour le Développement (IRD), 34394 Montpellier, France
- Martial Séveno: BCM, Univ. Montpellier, CNRS, INSERM, 34090 Montpellier, France
- Oana Vigy: IGF, Univ. Montpellier, CNRS, INSERM, 34090 Montpellier, France
- Rachel Bras-Gonçalves: UMR 177 INTERTRYP, Institut de Recherche pour le Développement (IRD), 34394 Montpellier, France
- Jean-Loup Lemesre: UMR 177 INTERTRYP, Institut de Recherche pour le Développement (IRD), 34394 Montpellier, France
- Philippe Holzmuller: UMR ASTRE, CIRAD, INRAE, University of Montpellier (I-MUSE), 34090 Montpellier, France
8
Sandaruwan PD, Wannige CT. An improved deep learning model for hierarchical classification of protein families. PLoS One 2021; 16:e0258625. [PMID: 34669708 PMCID: PMC8528337 DOI: 10.1371/journal.pone.0258625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/12/2020] [Accepted: 10/01/2021] [Indexed: 12/28/2022]
Abstract
Although genes carry information, proteins are the main players in providing all the functionalities of a living organism. Massive numbers of different proteins are involved in every function that occurs in a cell. These amino acid sequences can be hierarchically classified into a set of families and subfamilies depending on their evolutionary relatedness and similarities in their structure or function. Protein characterization to identify protein structure and function is done accurately using laboratory experiments. With the rapidly increasing number of novel protein sequences, these experiments have become difficult to carry out since they are expensive, time-consuming, and laborious. Therefore, many computational classification methods have been introduced to classify proteins and predict their functional properties. As the performance of computational techniques has progressed, deep learning has come to play a key role in many areas. Novel deep learning models such as DeepFam and ProtCNN have recently been presented to classify proteins into their families. However, these deep learning models have been used to carry out the non-hierarchical classification of proteins. In this research, we propose a deep learning neural network model named DeepHiFam with high accuracy to classify proteins hierarchically into different levels simultaneously. The model achieved an accuracy of 98.38% for protein family classification and more than 80% accuracy for the classification of protein subfamilies and sub-subfamilies. Further, DeepHiFam performed well in the non-hierarchical classification of protein families and achieved an accuracy of 98.62% and 96.14% for the popular Pfam dataset and COG dataset respectively.
9
Jing R, Wen T, Liao C, Xue L, Liu F, Yu L, Luo J. DeepT3 2.0: improving type III secreted effector predictions by an integrative deep learning framework. NAR Genom Bioinform 2021; 3:lqab086. [PMID: 34617013 PMCID: PMC8489581 DOI: 10.1093/nargab/lqab086] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/19/2021] [Revised: 08/12/2021] [Accepted: 09/09/2021] [Indexed: 11/13/2022]
Abstract
Type III secretion systems (T3SSs) are bacterial membrane-embedded nanomachines that allow a number of human, plant and animal pathogens to inject virulence factors directly into the cytoplasm of eukaryotic cells. Export of effectors through T3SSs is critical for motility and virulence of most Gram-negative pathogens. Current computational methods can predict type III secreted effectors (T3SEs) from amino acid sequences, but due to algorithmic constraints, reliable and large-scale prediction of T3SEs in Gram-negative bacteria remains a challenge. Here, we present DeepT3 2.0 (http://advintbioinforlab.com/deept3/), a novel web server that integrates different deep learning models for genome-wide prediction of T3SEs from a bacterium of interest. DeepT3 2.0 combines various deep learning architectures including convolutional, recurrent, convolutional-recurrent and multilayer neural networks to learn N-terminal representations of proteins specifically for T3SE prediction. Outcomes from the different models are processed and integrated for discriminating T3SEs and non-T3SEs. Because it leverages diverse models and an integrative deep learning framework, DeepT3 2.0 outperforms existing methods in validation datasets. In addition, the features learned from networks are analyzed and visualized to explain how models make their predictions. We propose DeepT3 2.0 as an integrated and accurate tool for the discovery of T3SEs.
Affiliation(s)
- Runyu Jing: School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
- Tingke Wen: School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
- Chengxiang Liao: School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
- Li Xue: School of Public Health, Southwest Medical University, Luzhou 646000, China
- Fengjuan Liu: School of Geography and Resources, Guizhou Education University, Guiyang 550018, China
- Lezheng Yu: School of Chemistry and Materials Science, Guizhou Education University, Guiyang 550018, China
- Jiesi Luo: Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, China
10
Fabris F, Palmer D, de Magalhães JP, Freitas AA. Comparing enrichment analysis and machine learning for identifying gene properties that discriminate between gene classes. Brief Bioinform 2021; 21:803-814. [PMID: 30895300 DOI: 10.1093/bib/bbz028] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/15/2018] [Revised: 02/18/2019] [Accepted: 02/19/2019] [Indexed: 01/08/2023]
Abstract
Biologists very often use enrichment methods based on statistical hypothesis tests to identify gene properties that are significantly over-represented in a given set of genes of interest, by comparison with a 'background' set of genes. These enrichment methods, although based on rigorous statistical foundations, are not always the best single option to identify patterns in biological data. In many cases, one can also use classification algorithms from the machine-learning field. Unlike enrichment methods, classification algorithms are designed to maximize measures of predictive performance and are capable of analysing combinations of gene properties, instead of one property at a time. In practice, however, the majority of studies use either enrichment or classification methods (rather than both), and there is a lack of literature discussing the pros and cons of both types of method. The goal of this paper is to compare and contrast enrichment and classification methods, offering two contributions. First, we discuss the (to some extent complementary) advantages and disadvantages of both types of methods for identifying gene properties that discriminate between gene classes. Second, we provide a set of high-level recommendations for using enrichment and classification methods. Overall, by highlighting the strengths and the weaknesses of both types of methods we argue that both should be used in bioinformatics analyses.
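As a concrete instance of the enrichment methods described in this abstract, over-representation of one gene property is commonly tested with a one-sided hypergeometric test against the background set. The sketch below assumes that test; the abstract does not say which tests the paper covers, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(background_size, background_hits, set_size, set_hits):
    """One-sided hypergeometric p-value for over-representation.

    background_size: number of genes in the background set
    background_hits: background genes that carry the property
    set_size:        number of genes of interest (drawn from the background)
    set_hits:        genes of interest that carry the property

    Returns P(X >= set_hits) when set_size genes are drawn without replacement.
    """
    total = comb(background_size, set_size)
    upper = min(background_hits, set_size)
    tail = sum(
        comb(background_hits, i)
        * comb(background_size - background_hits, set_size - i)
        for i in range(set_hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 5 of 10 background genes carry the property,
# and all 5 genes of interest carry it.
p = enrichment_p_value(background_size=10, background_hits=5, set_size=5, set_hits=5)
```

The test scores one property at a time, which is the limitation the abstract contrasts with classification algorithms that can analyse combinations of properties.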
Affiliation(s)
- Fabio Fabris: School of Computing, University of Kent, Kent, CT2 7NF, UK
- Daniel Palmer: Integrative Genomics of Ageing Group, Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool, UK
- João Pedro de Magalhães: Integrative Genomics of Ageing Group, Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool, UK
- Alex A Freitas: School of Computing, University of Kent, Kent, CT2 7NF, UK
11
Vu TTD, Jung J. Protein function prediction with gene ontology: from traditional to deep learning models. PeerJ 2021; 9:e12019. [PMID: 34513334 PMCID: PMC8395570 DOI: 10.7717/peerj.12019] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/13/2021] [Accepted: 07/29/2021] [Indexed: 11/25/2022]
Abstract
Protein function prediction is a crucial part of genome annotation. Prediction methods have recently witnessed rapid development, owing to the emergence of high-throughput sequencing technologies. Among the available databases for identifying protein function terms, Gene Ontology (GO) is an important resource that describes the functional properties of proteins. Researchers are employing various approaches to efficiently predict the GO terms. Meanwhile, deep learning, a fast-evolving discipline in data-driven approach, exhibits impressive potential with respect to assigning GO terms to amino acid sequences. Herein, we reviewed the currently available computational GO annotation methods for proteins, ranging from conventional to deep learning approach. Further, we selected some suitable predictors from among the reviewed tools and conducted a mini comparison of their performance using a worldwide challenge dataset. Finally, we discussed the remaining major challenges in the field, and emphasized the future directions for protein function prediction with GO.
Affiliation(s)
- Thi Thuy Duong Vu: Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
- Jaehee Jung: Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
12
Zhang D, Kabuka MR. Protein Family Classification from Scratch: A CNN Based Deep Learning Approach. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1996-2007. [PMID: 31944984 DOI: 10.1109/tcbb.2020.2966633] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/10/2023]
Abstract
Next-generation sequencing techniques provide us with an opportunity for generating sequenced proteins and identifying the biological families and functions of these proteins. However, compared with identified proteins, uncharacterized proteins constitute a notable percentage of the overall proteins in the bioinformatics research field. Traditional family classification methods often devote themselves to extracting N-Gram features from sequences while ignoring motif information as well as affinity information between motifs and adjacent amino acids. Previous clustering-based algorithms have typically been used to define protein features with domain knowledge and annotate protein families based on extensive data samples. In this paper, we apply CNN based amino acid representation learning with limited characterized proteins to explore the performance of annotated protein families by taking into account the amino acid location information. Additionally, we apply the method to all reviewed protein sequences with their families retrieved from the UniProt database to evaluate our approach. Last but not least, we verify our model using unreviewed protein records, which are typically ignored by other methods.
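The N-gram features that this abstract attributes to traditional family classifiers can be sketched as overlapping-substring counts. This illustrates the baseline representation only, not the CNN approach the paper proposes; the function name, the default n of 3, and the example sequence are hypothetical.

```python
from collections import Counter


def ngram_counts(sequence, n=3):
    """Count every overlapping length-n substring of a protein sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical 6-residue fragment; yields MKV twice, KVM once, VMK once.
counts = ngram_counts("MKVMKV", n=3)
```

Because the counts discard where each N-gram occurs, two sequences with the same N-grams in a different order get identical features. This is the position information that the abstract says the CNN representation takes into account.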
13
Jing R, Li Y, Xue L, Liu F, Li M, Luo J. autoBioSeqpy: A Deep Learning Tool for the Classification of Biological Sequences. J Chem Inf Model 2020; 60:3755-3764. [PMID: 32786512 DOI: 10.1021/acs.jcim.0c00409] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/30/2022]
Abstract
Deep learning has proven to be a powerful method with applications in various fields including image, language, and biomedical data. Thanks to libraries and toolkits such as TensorFlow, PyTorch, and Keras, researchers can use different deep learning architectures and data sets for rapid modeling. However, the available implementations of neural networks using these toolkits are usually designed for a specific research problem and are difficult to transfer to other work. Here, we present autoBioSeqpy, a tool that uses deep learning for biological sequence classification. The advantage of this tool is its simplicity. Users only need to prepare the input data set and then use a command line interface. Then, autoBioSeqpy automatically executes a series of customizable steps including text reading, parameter initialization, sequence encoding, model loading, training, and evaluation. In addition, the tool provides various ready-to-apply and adaptable model templates to improve the usability of these networks. We introduce the application of autoBioSeqpy on three biological sequence problems: the prediction of type III secreted proteins, protein subcellular localization, and CRISPR/Cas9 sgRNA activity. autoBioSeqpy is freely available with examples at https://github.com/jingry/autoBioSeqpy.
Affiliation(s)
- Runyu Jing
  - College of Cybersecurity, Sichuan University, Chengdu 610065, China
- Yizhou Li
  - College of Cybersecurity, Sichuan University, Chengdu 610065, China
- Li Xue
  - School of Public Health, Southwest Medical University, Luzhou, Sichuan 646000, China
- Fengjuan Liu
  - School of Geography and Resources, Guizhou Education University, Guiyang 550018, China
- Menglong Li
  - College of Chemistry, Sichuan University, Chengdu 610065, China
- Jiesi Luo
  - Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, Sichuan 646000, China
|
14
|
Carter B, Bileschi M, Smith J, Sanderson T, Bryant D, Belanger D, Colwell LJ. Critiquing Protein Family Classification Models Using Sufficient Input Subsets. J Comput Biol 2019; 27:1219-1231. [PMID: 31874057 DOI: 10.1089/cmb.2019.0339] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/05/2023] Open
Abstract
In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.
Affiliation(s)
- Brandon Carter
  - MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, USA
  - Google Research, Mountain View, California, USA
- Jamie Smith
  - Google Research, Mountain View, California, USA
- Drew Bryant
  - Google Research, Mountain View, California, USA
- Lucy J Colwell
  - Google Research, Mountain View, California, USA
  - Department of Chemistry, Cambridge University, Cambridge, United Kingdom
|
15
|
Zhang D, Kabuka M. Multimodal deep representation learning for protein interaction identification and protein family classification. BMC Bioinformatics 2019; 20:531. [PMID: 31787089 PMCID: PMC6886253 DOI: 10.1186/s12859-019-3084-y] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND: Protein-protein interactions (PPIs) are constantly involved in dynamic pathological and biological processes in our lives. Thus, it is crucial to understand PPIs thoroughly so that we can illuminate disease occurrence, achieve the optimal drug-target therapeutic effect and describe protein complex structures. However, compared to the protein sequences obtainable from various species and organisms, the number of revealed protein-protein interactions is relatively limited. To address this dilemma, many research efforts have been devoted to facilitating the discovery of novel PPIs. Among these methods, PPI prediction techniques that rely merely on protein sequence data are more widespread than other methods, which require extensive biological domain knowledge. RESULTS: In this paper, we propose a multi-modal deep representation learning structure that incorporates protein physicochemical features with the graph topological features from the PPI networks. Specifically, our method not only takes the protein sequence information into account but also learns the topological representation of each protein node in the PPI networks. We construct a stacked auto-encoder architecture together with a continuous bag-of-words (CBOW) model based on generated metapaths to study PPI prediction. Following that, we use supervised deep neural networks to identify the PPIs and classify the protein families. The PPI prediction accuracy for eight species ranged from 96.76% to 99.77%, which indicates that our multi-modal deep representation learning framework achieves superior performance compared to other computational methods. CONCLUSION: To the best of our knowledge, this is the first multi-modal deep representation learning framework for examining PPI networks.
Affiliation(s)
- Da Zhang
  - Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL, U.S.
- Mansur Kabuka
  - Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL, U.S.
|
16
|
Szalkai B, Grolmusz V. SECLAF: a webserver and deep neural network design tool for hierarchical biological sequence classification. Bioinformatics 2019; 34:2487-2489. [PMID: 29490010 DOI: 10.1093/bioinformatics/bty116] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 08/14/2017] [Accepted: 02/26/2018] [Indexed: 11/14/2022] Open
Abstract
Summary: Artificial intelligence tools are gaining more and more ground each year in bioinformatics. Learning algorithms can be trained for specific tasks by using the existing enormous biological databases, and the resulting models can be used for the high-quality classification of novel, uncategorized data in numerous areas, including biological sequence analysis. Here, we introduce SECLAF, a webserver that uses deep neural networks for hierarchical biological sequence classification. By applying SECLAF to residue sequences, we have reported [Methods (2018), https://doi.org/10.1016/j.ymeth.2017.06.034] the most accurate multi-label protein classifier to date (UniProt, into 698 classes: AUC 99.99%; Gene Ontology, into 983 classes: AUC 99.45%). Our framework SECLAF can be applied to other sequence classification tasks, as we describe in the present contribution. Availability and implementation: The program SECLAF is implemented in Python and is available for download, with example datasets, at the website https://pitgroup.org/seclaf/. For Gene Ontology and UniProt based classifications a webserver is also available at the address above.
Affiliation(s)
- Balázs Szalkai
  - PIT Bioinformatics Group, Institute of Mathematics, Eötvös University, H-1117 Budapest, Hungary
- Vince Grolmusz
  - PIT Bioinformatics Group, Institute of Mathematics, Eötvös University, H-1117 Budapest, Hungary
  - Uratim Ltd, H-1118 Budapest, Hungary
|
17
|
Yang KK, Wu Z, Arnold FH. Machine-learning-guided directed evolution for protein engineering. Nat Methods 2019; 16:687-694. [PMID: 31308553 DOI: 10.1038/s41592-019-0496-6] [Citation(s) in RCA: 464] [Impact Index Per Article: 92.8] [Received: 10/25/2018] [Accepted: 06/17/2019] [Indexed: 02/06/2023]
Abstract
Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence-function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function.
Affiliation(s)
- Kevin K Yang
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Zachary Wu
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Frances H Arnold
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
|
18
|
Leveraging implicit knowledge in neural networks for functional dissection and engineering of proteins. Nat Mach Intell 2019. [DOI: 10.1038/s42256-019-0049-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/08/2022]
|
19
|
Calarco L, Ellis J. Annotating the ‘hypothetical’ in hypothetical proteins: In-silico analysis of uncharacterised proteins for the Apicomplexan parasite, Neospora caninum. Vet Parasitol 2019; 265:29-37. [DOI: 10.1016/j.vetpar.2018.11.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/30/2018] [Revised: 10/30/2018] [Accepted: 11/24/2018] [Indexed: 12/12/2022]
|
20
|
Tchitchek N. Navigating in the vast and deep oceans of high-dimensional biological data. Methods 2018; 132:1-2. [DOI: 10.1016/j.ymeth.2017.11.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/06/2023] Open
|