151
|
Zhang F, Cetin Karayumak S, Hoffmann N, Rathi Y, Golby AJ, O'Donnell LJ. Deep white matter analysis (DeepWMA): Fast and consistent tractography segmentation. Med Image Anal 2020; 65:101761. [PMID: 32622304 PMCID: PMC7483951 DOI: 10.1016/j.media.2020.101761] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2019] [Revised: 06/16/2020] [Accepted: 06/18/2020] [Indexed: 02/07/2023]
Abstract
White matter tract segmentation, i.e. identifying tractography fibers (streamline trajectories) belonging to anatomically meaningful fiber tracts, is an essential step to enable tract quantification and visualization. In this study, we present a deep learning tractography segmentation method (DeepWMA) that allows fast and consistent identification of 54 major deep white matter fiber tracts from the whole brain. We create a large-scale training tractography dataset of 1 million labeled fiber samples, and we propose a novel 2D multi-channel feature descriptor (FiberMap) that encodes spatial coordinates of points along each fiber. We learn a convolutional neural network (CNN) fiber classification model based on FiberMap and obtain a high fiber classification accuracy of 90.99% on the training tractography data with ground truth fiber labels. Then, the method is evaluated on a test dataset of 597 diffusion MRI scans from six independently acquired populations across genders, the lifespan (1 day - 82 years), and different health conditions (healthy control, neuropsychiatric disorders, and brain tumor patients). We perform comparisons with two state-of-the-art tract segmentation methods. Experimental results show that our method obtains a highly consistent tract segmentation result, where on average over 99% of the fiber tracts are successfully identified across all subjects under study, most importantly, including neonates and patients with space-occupying brain tumors. We also demonstrate good generalization of the method to tractography data from multiple different fiber tracking methods. The proposed method leverages deep learning techniques and provides a fast and efficient tool for brain white matter segmentation in large diffusion MRI tractography datasets.
Collapse
Affiliation(s)
- Fan Zhang
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
| | | | - Nico Hoffmann
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Yogesh Rathi
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Alexandra J Golby
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | | |
Collapse
|
152
|
Dutta P, Mishra P, Saha S. Incomplete multi-view gene clustering with data regeneration using Shape Boltzmann Machine. Comput Biol Med 2020; 125:103965. [PMID: 32931989 DOI: 10.1016/j.compbiomed.2020.103965] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2020] [Revised: 08/08/2020] [Accepted: 08/08/2020] [Indexed: 11/17/2022]
Abstract
Deciphering patterns in the structural and functional anatomy of genes can prove to be very helpful in understanding genetic biology and genomics. Also, the availability of the multiple omics data, along with the advent of machine learning techniques, aids medical professionals in gaining insights about various biological regulations. Gene clustering is one of the many such computation techniques that can help in understanding gene behavior. However, more comprehensive and reliable insights can be gained if different modalities/views of biomedical data are considered. However, in most multi-view cases, each view contains some missing data, leading to incomplete multi-view clustering. In this study, we have presented a deep Boltzmann machine-based incomplete multi-view clustering framework for gene clustering. Here, we seek to regenerate the data of the three NCBI datasets in the incomplete modalities using Shape Boltzmann Machines. The overall performance of the proposed multi-view clustering technique has been evaluated using the Silhouette index and Davies-Bouldin index, and the comparative analysis shows an improvement over state-of-the-art methods. Finally, to prove that the improvement attained by the proposed incomplete multi-view clustering is statistically significant, we perform Welch's t-test. AVAILABILITY OF DATA AND MATERIALS: https://github.com/piyushmishra12/IMC.
Collapse
Affiliation(s)
- Pratik Dutta
- Department of Computer Science and Engineering, Indian Institute of Technology, Patna, India
| | - Piyush Mishra
- Department of Computer Science and Engineering, IIIT, Bhubaneswar, India
| | - Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology, Patna, India.
| |
Collapse
|
153
|
Hew B, Tan QW, Goh W, Ng JWX, Mutwil M. LSTrAP-Crowd: prediction of novel components of bacterial ribosomes with crowd-sourced analysis of RNA sequencing data. BMC Biol 2020; 18:114. [PMID: 32883264 PMCID: PMC7470450 DOI: 10.1186/s12915-020-00846-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2020] [Accepted: 08/12/2020] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Bacterial resistance to antibiotics is a growing health problem that is projected to cause more deaths than cancer by 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the structurally conserved bacterial ribosomes, factors involved in protein synthesis are thus prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. Here, we use a bioinformatics approach to identify novel components of protein synthesis. RESULTS In order to identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), where 285 individuals processed 26 terabytes of RNA-sequencing data of the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. CONCLUSIONS We identified genes related to protein synthesis in common bacterial pathogens and thus provide a resource of potential antibiotic development targets for experimental validation. The data can be used to explore additional vulnerabilities of bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowd-sourced.
Collapse
Affiliation(s)
- Benedict Hew
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
| | - Qiao Wen Tan
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
| | - William Goh
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
| | - Jonathan Wei Xiong Ng
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
| | - Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore.
| |
Collapse
|
154
|
Ranjan A, Fahad MS, Fernandez-Baca D, Deepak A, Tripathi S. Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1648-1659. [PMID: 30998479 DOI: 10.1109/tcbb.2019.2911609] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.
Collapse
|
155
|
Fan K, Zhang Y. Pseudo2GO: A Graph-Based Deep Learning Method for Pseudogene Function Prediction by Borrowing Information From Coding Genes. Front Genet 2020; 11:807. [PMID: 33014009 PMCID: PMC7461887 DOI: 10.3389/fgene.2020.00807] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2020] [Accepted: 07/06/2020] [Indexed: 12/16/2022] Open
Abstract
Pseudogenes are indicating more and more functional potentials recently, though historically were regarded as relics of evolution. Computational methods for predicting pseudogene functions on Gene Ontology is important for directing experimental discovery. However, no pseudogene-specific computational methods have been proposed to directly predict their Gene Ontology (GO) terms. The biggest challenge for pseudogene function prediction is the lack of enough features and functional annotations, making training a predictive model difficult. Considering the close functional similarity between pseudogenes and their parent coding genes that share great amount of DNA sequence, as well as that coding genes have rich annotations, we aim to predict pseudogene functions by borrowing information from coding genes in a graph-based way. Here we propose Pseudo2GO, a graph-based deep learning semi-supervised model for pseudogene function prediction. A sequence similarity graph is first constructed to connect pseudogenes and coding genes. Multiple features are incorporated into the model as the node attributes to enable the graph an attributed graph, including expression profiles, interactions with microRNAs, protein-protein interactions (PPIs), and genetic interactions. Graph convolutional networks are used to propagate node attributes across the graph to make classifications on pseudogenes. Comparing Pseudo2GO with other frameworks adapted from popular protein function prediction methods, we demonstrated that our method has achieved state-of-the-art performance, significantly outperforming other methods in terms of the M-AUPR metric.
Collapse
Affiliation(s)
- Kunjie Fan
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Yan Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
- The Ohio State University Comprehensive Cancer Center, Columbus, OH, United States
| |
Collapse
|
156
|
Saha S, Prasad A, Chatterjee P, Basu S, Nasipuri M. Protein function prediction from dynamic protein interaction network using gene expression data. J Bioinform Comput Biol 2020; 17:1950025. [PMID: 31617461 DOI: 10.1142/s0219720019500252] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Computational prediction of functional annotation of proteins is an uphill task. There is an ever increasing gap between functional characterization of protein sequences and deluge of protein sequences generated by large-scale sequencing projects. The dynamic nature of protein interactions is frequently observed which is mostly influenced by any new change of state or change in stimuli. Functional characterization of proteins can be inferred from their interactions with each other, which is dynamic in nature. In this work, we have used a dynamic protein-protein interaction network (PPIN), time course gene expression data and protein sequence information for prediction of functional annotation of proteins. During progression of a particular function, it has also been observed that not all the proteins are active at all time points. For unannotated active proteins, our proposed methodology explores the dynamic PPIN consisting of level-1 and level-2 neighboring proteins at different time points, filtered by Damerau-Levenshtein edit distance to estimate the similarity between two protein sequences and coefficient variation methods to assess the strength of an edge in a network. Finally, from the filtered dynamic PPIN, at each time point, functional annotations of the level-2 proteins are assigned to the unknown and unannotated active proteins through the level-1 neighbor, following a bottom-up strategy. Our proposed methodology achieves an average precision, recall and F-Score of 0.59, 0.76 and 0.61 respectively, which is significantly higher than the reported state-of-the-art methods.
Collapse
Affiliation(s)
- Sovan Saha
- Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
| | - Abhimanyu Prasad
- Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
| | - Piyali Chatterjee
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
| | - Subhadip Basu
- Department of Computer Science & Engineering, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, India
| | - Mita Nasipuri
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
| |
Collapse
|
157
|
Fan K, Guan Y, Zhang Y. Graph2GO: a multi-modal attributed network embedding method for inferring protein functions. Gigascience 2020; 9:giaa081. [PMID: 32770210 PMCID: PMC7414417 DOI: 10.1093/gigascience/giaa081] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2019] [Revised: 04/30/2020] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Identifying protein functions is important for many biological applications. Since experimental functional characterization of proteins is time-consuming and costly, accurate and efficient computational methods for predicting protein functions are in great demand for generating the testable hypotheses guiding large-scale experiments." RESULTS Here, we propose Graph2GO, a multi-modal graph-based representation learning model that can integrate heterogeneous information, including multiple types of interaction networks (sequence similarity network and protein-protein interaction network) and protein features (amino acid sequence, subcellular location, and protein domains) to predict protein functions on gene ontology. Comparing Graph2GO to BLAST, as a baseline model, and to two popular protein function prediction methods (Mashup and deepNF), we demonstrated that our model can achieve state-of-the-art performance. We show the robustness of our model by testing on multiple species. We also provide a web server supporting function query and downstream analysis on-the-fly. CONCLUSIONS Graph2GO is the first model that has utilized attributed network representation learning methods to model both interaction networks and protein features for predicting protein functions, and achieved promising performance. Our model can be easily extended to include more protein features to further improve the performance. Besides, Graph2GO is also applicable to other application scenarios involving biological networks, and the learned latent representations can be used as feature inputs for machine learning tasks in various downstream analyses.
Collapse
Affiliation(s)
- Kunjie Fan
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yan Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- The Ohio State University Comprehensive Cancer Center (OSUCCC - James), Columbus, OH 43210, USA
| |
Collapse
|
158
|
Le DH. UFO: A tool for unifying biomedical ontology-based semantic similarity calculation, enrichment analysis and visualization. PLoS One 2020; 15:e0235670. [PMID: 32645039 PMCID: PMC7347127 DOI: 10.1371/journal.pone.0235670] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2019] [Accepted: 06/22/2020] [Indexed: 02/06/2023] Open
Abstract
Background Biomedical ontologies have been growing quickly and proven to be useful in many biomedical applications. Important applications of those data include estimating the functional similarity between ontology terms and between annotated biomedical entities, analyzing enrichment for a set of biomedical entities. Many semantic similarity calculation and enrichment analysis methods have been proposed for such applications. Also, a number of tools implementing the methods have been developed on different platforms. However, these tools have implemented a small number of the semantic similarity calculation and enrichment analysis methods for a certain type of biomedical ontology. Note that the methods can be applied to all types of biomedical ontologies. More importantly, each method can be dominant in different applications; thus, users have more choice with more number of methods implemented in tools. Also, more functions would facilitate their task with ontology. Results In this study, we developed a Cytoscape app, named UFO, which unifies most of the semantic similarity measures for between-term and between-entity similarity calculation for all types of biomedical ontologies in OBO format. Based on the similarity calculation, UFO can calculate the similarity between two sets of entities and weigh imported entity networks as well as generate functional similarity networks. Besides, it can perform enrichment analysis of a set of entities by different methods. Moreover, UFO can visualize structural relationships between ontology terms, annotating relationships between entities and terms, and functional similarity between entities. Finally, we demonstrated the ability of UFO through some case studies on finding the best semantic similarity measures for assessing the similarity between human disease phenotypes, constructing biomedical entity functional similarity networks for predicting disease-associated biomarkers, and performing enrichment analysis on a set of similar phenotypes. Conclusions Taken together, UFO is expected to be a tool where biomedical ontologies can be exploited for various biomedical applications. Availability UFO is distributed as a Cytoscape app, and can be downloaded freely at Cytoscape App (http://apps.cytoscape.org/apps/ufo) for non-commercial use
Collapse
Affiliation(s)
- Duc-Hau Le
- Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
- School of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
- * E-mail:
| |
Collapse
|
159
|
Liu Z, Chen Q, Lan W, Liang J, Chen YPP, Chen B. A Survey of Network Embedding for Drug Analysis and Prediction. Curr Protein Pept Sci 2020; 22:CPPS-EPUB-107859. [PMID: 32614745 DOI: 10.2174/1389203721666200702145701] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2020] [Revised: 04/05/2020] [Accepted: 05/21/2020] [Indexed: 11/22/2022]
Abstract
Traditional network-based computational methods have shown good results in drug analysis and prediction. However, these methods are time consuming and lack universality, and it is difficult to exploit the auxiliary information of nodes and edges. Network embedding provides a promising way for alleviating the above problems by transforming network into a low-dimensional space while preserving network structure and auxiliary information. This thus facilitates the application of machine learning algorithms for subsequent processing. Network embedding has been introduced into drug analysis and prediction in the last few years, and has shown superior performance over traditional methods. However, there is no systematic review of this issue. This article offers a comprehensive survey of the primary network embedding methods and their applications in drug analysis and prediction. The network embedding technologies applied in homogeneous network and heterogeneous network are investigated and compared, including matrix decomposition, random walk, and deep learning. Especially, the Graph neural network (GNN) methods in deep learning are highlighted. Further, the applications of network embedding in drug similarity estimation, drug-target interaction prediction, adverse drug reactions prediction, protein function and therapeutic peptides prediction are discussed. Several future potential research directions are also discussed.
Collapse
Affiliation(s)
- Zhixian Liu
- School of Medical, Guangxi University, Nanning. China
| | - Qingfeng Chen
- School of Computer, Electronic and Information, Guangxi University, Nanning. China
| | - Wei Lan
- School of Computer, Electronic and Information, Guangxi University, Nanning. China
| | - Jiahai Liang
- School of Electronics and Information Engineering, Beibu Gulf University, Qinzhou. China
| | - Yi-Ping Phoebe Chen
- Department of Computer Science and Information Technology, La Trobe University, Melbourne. Australia
| | - Baoshan Chen
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning. China
| |
Collapse
|
160
|
Lennox M, Robertson N, Devereux B. Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2020; 2020:2361-2367. [PMID: 33018481 DOI: 10.1109/embc44109.2020.9176380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Deep learning has proven to be a useful tool for modelling protein properties. However, given the variability in the length of proteins, it can be difficult to summarise the sequence of amino acids effectively. In many cases, as a result of using fixed-length representations, information about long proteins can be lost through truncation, or model training can be slow due to the use of excessive padding. In this work, we aim to overcome these problems by expanding upon the original vocabulary used to represent the protein sequence. To this end, we utilise two prominent subword algorithms that have been previously used to reach state-of-the-art results in various Natural Language Processing tasks. The algorithms are used to encode the original protein sequence into a set of subsequences before they are analysed by a Doc2Vec model. The pre-trained encodings produced by each algorithm are tested on a variety of downstream tasks: four protein property prediction tasks (plasma membrane localization, thermostability, peak absorption wavelength, enantioselectivity) as well as drug-target affinity prediction tasks over two datasets. Our results significantly improve on the state-of-the-art for these tasks, demonstrating the benefits of using subword compression algorithms for modelling proteins.
Collapse
|
161
|
Poverennaya E, Kiseleva O, Romanova A, Pyatnitskiy M. Predicting Functions of Uncharacterized Human Proteins: From Canonical to Proteoforms. Genes (Basel) 2020; 11:E677. [PMID: 32575886 PMCID: PMC7350264 DOI: 10.3390/genes11060677] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2020] [Revised: 06/09/2020] [Accepted: 06/19/2020] [Indexed: 01/22/2023] Open
Abstract
Despite tremendous efforts in genomics, transcriptomics, and proteomics communities, there is still no comprehensive data about the exact number of protein-coding genes, translated proteoforms, and their function. In addition, by now, we lack functional annotation for 1193 genes, where expression was confirmed at the proteomic level (uPE1 proteins). We re-analyzed results of AP-MS experiments from the BioPlex 2.0 database to predict functions of uPE1 proteins and their splice forms. By building a protein-protein interaction network for 12 ths. identified proteins encoded by 11 ths. genes, we were able to predict Gene Ontology categories for a total of 387 uPE1 genes. We predicted different functions for canonical and alternatively spliced forms for four uPE1 genes. In total, functional differences were revealed for 62 proteoforms encoded by 31 genes. Based on these results, it can be carefully concluded that the dynamics and versatility of the interactome is ensured by changing the dominant splice form. Overall, we propose that analysis of large-scale AP-MS experiments performed for various cell lines and under various conditions is a key to understanding the full potential of genes role in cellular processes.
Collapse
Affiliation(s)
- Ekaterina Poverennaya
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia; (O.K.); (A.R.); (M.P.)
- Institute of Environmental and Agricultural Biology (X-BIO),Tyumen State University, 625003 Tyumen, Russia
| | - Olga Kiseleva
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia; (O.K.); (A.R.); (M.P.)
| | - Anastasia Romanova
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia; (O.K.); (A.R.); (M.P.)
- Faculty of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, 141701 Moscow, Russia
| | - Mikhail Pyatnitskiy
- Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia; (O.K.); (A.R.); (M.P.)
- Department of Molecular Biology and Genetics, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia
| |
Collapse
|
162
|
ElAbd H, Bromberg Y, Hoarfrost A, Lenz T, Franke A, Wendorff M. Amino acid encoding for deep learning applications. BMC Bioinformatics 2020; 21:235. [PMID: 32517697 PMCID: PMC7285590 DOI: 10.1186/s12859-020-03546-x] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2020] [Accepted: 05/12/2020] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND The number of applications of deep learning algorithms in bioinformatics is increasing as they usually achieve superior performance over classical approaches, especially, when bigger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as a continuous vector through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction - a process called 'end-to-end learning' - has led to state-of-the-art results in many fields. Although usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually-curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN. RESULTS By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension. CONCLUSION Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme.
Collapse
Affiliation(s)
- Hesham ElAbd
- Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany
| | - Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA.,Department of Genetics, Rutgers University, New Brunswick, NJ, USA.,Technical University of Munich Institute for Advanced Study, (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
| | - Adrienne Hoarfrost
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA
| | - Tobias Lenz
- Research Group for Evolutionary Immunogenomics, Max Planck Institute for Evolutionary Biology, 24306, Plön, Germany
| | - Andre Franke
- Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany.
| | - Mareike Wendorff
- Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany
| |
Collapse
|
163
|
Mutwil M. Computational approaches to unravel the pathways and evolution of specialized metabolism. CURRENT OPINION IN PLANT BIOLOGY 2020; 55:38-46. [PMID: 32200228 DOI: 10.1016/j.pbi.2020.01.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/11/2019] [Revised: 01/19/2020] [Accepted: 01/31/2020] [Indexed: 05/13/2023]
Abstract
Specialized metabolites serve as a chemical arsenal that protects plants from abiotic stress, pathogens, and herbivores, and they are an essential component of our nutrition and medicine. Despite their importance, we are still at the beginning of unravelling biosynthetic pathways that produce these compounds, which is needed to produce more resilient and nutritious crops, expand our inventory of useful biomolecules, and give valuable insights into plant evolution. This review focuses on the evolution of specialized metabolism in the plant kingdom and the state-of-the-art approaches used to identify the biosynthetic pathways of these useful compounds.
Collapse
Affiliation(s)
- Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore.
| |
Collapse
|
164
|
Crawford J, Greene CS. Incorporating biological structure into machine learning models in biomedicine. Curr Opin Biotechnol 2020; 63:126-134. [PMID: 31962244 PMCID: PMC7308204 DOI: 10.1016/j.copbio.2019.12.021] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2019] [Revised: 12/17/2019] [Accepted: 12/19/2019] [Indexed: 12/19/2022]
Abstract
In biomedical applications of machine learning, relevant information often has a rich structure that is not easily encoded as real-valued predictors. Examples of such data include DNA or RNA sequences, gene sets or pathways, gene interaction or coexpression networks, ontologies, and phylogenetic trees. We highlight recent examples of machine learning models that use structure to constrain model architecture or incorporate structured data into model training. For machine learning in biomedicine, where sample size is limited and model interpretability is crucial, incorporating prior knowledge in the form of structured data can be particularly useful. The area of research would benefit from performant open source implementations and independent benchmarking efforts.
Collapse
Affiliation(s)
- Jake Crawford
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States; Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, United States.
| |
Collapse
|
165
|
You R, Yao S, Xiong Y, Huang X, Sun F, Mamitsuka H, Zhu S. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res 2020; 47:W379-W387. [PMID: 31106361 PMCID: PMC6602452 DOI: 10.1093/nar/gkz388] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2019] [Revised: 04/24/2019] [Accepted: 05/01/2019] [Indexed: 01/19/2023] Open
Abstract
Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a problem of the large-scale multi-label classification where a protein can be associated with multiple gene ontology terms as its labels. Based on our GOLabeler—a state-of-the-art method for the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve the performance of the large-scale AFP by incorporating massive protein-protein network information. Specifically, the advantages of NetGO are threefold in using network information: (i) NetGO relies on a powerful learning to rank framework from machine learning to effectively integrate both sequence and network information of proteins; (ii) NetGO uses the massive network information of all species (>2000) in STRING (other than only some specific species) and (iii) NetGO still can use network information to annotate a protein by homology transfer, even if it is not contained in STRING. Separating training and testing data with the same time-delayed settings of CAFA, we comprehensively examined the performance of NetGO. Experimental results have clearly demonstrated that NetGO significantly outperforms GOLabeler and other competing methods. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.
Collapse
Affiliation(s)
- Ronghui You
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China.,Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
| | - Shuwei Yao
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China.,Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
| | - Yi Xiong
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
| | - Xiaodi Huang
- School of Computing and Mathematics, Charles Sturt University, Albury, NSW 2640, Australia
| | - Fengzhu Sun
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China.,Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Hiroshi Mamitsuka
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan.,Department of Computer Science, Aalto University, Espoo and Helsinki, Finland
| | - Shanfeng Zhu
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China.,Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
| |
Collapse
|
166
|
Jin S, Zeng X, Xia F, Huang W, Liu X. Application of deep learning methods in biological networks. Brief Bioinform 2020; 22:1902-1917. [PMID: 32363401 DOI: 10.1093/bib/bbaa043] [Citation(s) in RCA: 84] [Impact Index Per Article: 21.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2019] [Revised: 02/19/2020] [Accepted: 03/05/2020] [Indexed: 01/07/2023] Open
Abstract
The increase in biological data and the formation of various biomolecule interaction databases enable us to obtain diverse biological networks. These biological networks provide a wealth of raw materials for further understanding of biological systems, the discovery of complex diseases and the search for therapeutic drugs. However, the increase in data also increases the difficulty of biological networks analysis. Therefore, algorithms that can handle large, heterogeneous and complex data are needed to better analyze the data of these network structures and mine their useful information. Deep learning is a branch of machine learning that extracts more abstract features from a larger set of training data. Through the establishment of an artificial neural network with a network hierarchy structure, deep learning can extract and screen the input information layer by layer and has representation learning ability. The improved deep learning algorithm can be used to process complex and heterogeneous graph data structures and is increasingly being applied to the mining of network data information. In this paper, we first introduce the used network data deep learning models. After words, we summarize the application of deep learning on biological networks. Finally, we discuss the future development prospects of this field.
Collapse
|
167
|
Cai Y, Wang J, Deng L. SDN2GO: An Integrated Deep Learning Model for Protein Function Prediction. Front Bioeng Biotechnol 2020; 8:391. [PMID: 32411695 PMCID: PMC7201018 DOI: 10.3389/fbioe.2020.00391] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2020] [Accepted: 04/07/2020] [Indexed: 02/01/2023] Open
Abstract
The assignment of function to proteins at a large scale is essential for understanding the molecular mechanism of life. However, only a very small percentage of the more than 179 million proteins in UniProtKB have Gene Ontology (GO) annotations supported by experimental evidence. In this paper, we proposed an integrated deep-learning-based classification model, named SDN2GO, to predict protein functions. SDN2GO applies convolutional neural networks to learn and extract features from sequences, protein domains, and known PPI networks, and then utilizes a weight classifier to integrate these features and achieve accurate predictions of GO terms. We constructed the training set and the independent test set according to the time-delayed principle of the Critical Assessment of Function Annotation (CAFA) and compared it with two highly competitive methods and the classic BLAST method on the independent test set. The results show that our method outperforms others on each sub-ontology of GO. We also investigated the performance of using protein domain information. We learned from the Natural Language Processing (NLP) to process domain information and pre-trained a deep learning sub-model to extract the comprehensive features of domains. The experimental results demonstrate that the domain features we obtained are much improved the performance of our model. Our deep learning models together with the data pre-processing scripts are publicly available as an open source software at https://github.com/Charrick/SDN2GO.
Collapse
Affiliation(s)
- Yideng Cai
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Jiacheng Wang
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, China
- School of Software, Xinjiang University, Urumqi, China
| |
Collapse
|
168
|
Sangphukieo A, Laomettachit T, Ruengjitchatchawalya M. Photosynthetic protein classification using genome neighborhood-based machine learning feature. Sci Rep 2020; 10:7108. [PMID: 32346070 PMCID: PMC7189237 DOI: 10.1038/s41598-020-64053-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2019] [Accepted: 04/07/2020] [Indexed: 11/08/2022] Open
Abstract
Identification of novel photosynthetic proteins is important for understanding and improving photosynthetic efficiency. Synergistically, genome neighborhood can provide additional useful information to identify photosynthetic proteins. We, therefore, expected that applying a computational approach, particularly machine learning (ML) with the genome neighborhood-based feature should facilitate the photosynthetic function assignment. Our results revealed a functional relationship between photosynthetic genes and their conserved neighboring genes observed by 'Phylo score', indicating their functions could be inferred from the genome neighborhood profile. Therefore, we created a new method for extracting patterns based on the genome neighborhood network (GNN) and applied them for the photosynthetic protein classification using ML algorithms. Random forest (RF) classifier using genome neighborhood-based features achieved the highest accuracy up to 87% in the classification of photosynthetic proteins and also showed better performance (Mathew's correlation coefficient = 0.718) than other available tools including the sequence similarity search (0.447) and ML-based method (0.361). Furthermore, we demonstrated the ability of our model to identify novel photosynthetic proteins compared to the other methods. Our classifier is available at http://bicep2.kmutt.ac.th/photomod_standalone, https://bit.ly/2S0I2Ox and DockerHub: https://hub.docker.com/r/asangphukieo/photomod.
Collapse
Affiliation(s)
- Apiwat Sangphukieo
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, 10150, Thailand
- School of Information Technology, KMUTT, Bang Mod, Thung Khru, Bangkok, 10140, Thailand
| | - Teeraphan Laomettachit
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, 10150, Thailand
| | - Marasri Ruengjitchatchawalya
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, 10150, Thailand.
- Biotechnology program, School of Bioresources and Technology, KMUTT, Bang Khun Thian, Bangkok, 10150, Thailand.
- Algal Biotechnology Research Group, Pilot Plant Development and Training Institute (PDTI), KMUTT, Bang Khun Thian, Bangkok, 10150, Thailand.
| |
Collapse
|
169
|
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Collapse
Affiliation(s)
- Yingwen Zhao
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jian Chen
- State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
| | - Xiangliang Zhang
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing, China
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
Collapse
|
170
|
Thapa N, Chaudhari M, McManus S, Roy K, Newman RH, Saigo H, Kc DB. DeepSuccinylSite: a deep learning based approach for protein succinylation site prediction. BMC Bioinformatics 2020; 21:63. [PMID: 32321437 PMCID: PMC7178942 DOI: 10.1186/s12859-020-3342-z] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2019] [Accepted: 01/08/2020] [Indexed: 01/15/2023] Open
Abstract
Background Protein succinylation has recently emerged as an important and common post-translation modification (PTM) that occurs on lysine residues. Succinylation is notable both in its size (e.g., at 100 Da, it is one of the larger chemical PTMs) and in its ability to modify the net charge of the modified lysine residue from + 1 to − 1 at physiological pH. The gross local changes that occur in proteins upon succinylation have been shown to correspond with changes in gene activity and to be perturbed by defects in the citric acid cycle. These observations, together with the fact that succinate is generated as a metabolic intermediate during cellular respiration, have led to suggestions that protein succinylation may play a role in the interaction between cellular metabolism and important cellular functions. For instance, succinylation likely represents an important aspect of genomic regulation and repair and may have important consequences in the etiology of a number of disease states. In this study, we developed DeepSuccinylSite, a novel prediction tool that uses deep learning methodology along with embedding to identify succinylation sites in proteins based on their primary structure. Results Using an independent test set of experimentally identified succinylation sites, our method achieved efficiency scores of 79%, 68.7% and 0.48 for sensitivity, specificity and MCC respectively, with an area under the receiver operator characteristic (ROC) curve of 0.8. In side-by-side comparisons with previously described succinylation predictors, DeepSuccinylSite represents a significant improvement in overall accuracy for prediction of succinylation sites. Conclusion Together, these results suggest that our method represents a robust and complementary technique for advanced exploration of protein succinylation.
Collapse
Affiliation(s)
- Niraj Thapa
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Meenal Chaudhari
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Sean McManus
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Kaushik Roy
- Department of Computer Science, North Carolina A&T State University, Greensboro, NC, USA
| | - Robert H Newman
- Department of Biology, North Carolina A&T State University, Greensboro, NC, USA
| | - Hiroto Saigo
- Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
| | - Dukka B Kc
- Electrical Engineering and Computer Science Department, Wichita State University, Wichita, KS, USA.
| |
Collapse
|
171
|
Strodthoff N, Wagner P, Wenzel M, Samek W. UDSMProt: universal deep sequence models for protein classification. Bioinformatics 2020; 36:2401-2409. [PMID: 31913448 PMCID: PMC7178389 DOI: 10.1093/bioinformatics/btaa003] [Citation(s) in RCA: 73] [Impact Index Per Article: 18.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2019] [Revised: 12/13/2019] [Accepted: 01/02/2020] [Indexed: 01/03/2023] Open
Abstract
MOTIVATION Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to single classification tasks and rely on handcrafted features, such as position-specific-scoring matrices from expensive database searches. We argue that this level of performance can be reached or even be surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and transferring it to specific tasks by a simple fine-tuning step. RESULTS We put forward a universal deep sequence model that is pre-trained on unlabeled protein sequences from Swiss-Prot and fine-tuned on protein classification tasks. We apply it to three prototypical tasks, namely enzyme class prediction, gene ontology prediction and remote homology and fold detection. The proposed method performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them. These results stress the possibility of inferring protein properties from the sequence alone and, on more general grounds, the prospects of modern natural language processing methods in omics. Moreover, we illustrate the prospects for explainable machine learning methods in this field by selected case studies. AVAILABILITY AND IMPLEMENTATION Source code is available under https://github.com/nstrodt/UDSMProt. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Nils Strodthoff
- Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin 10587, Germany
| | - Patrick Wagner
- Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin 10587, Germany
| | - Markus Wenzel
- Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin 10587, Germany
| | - Wojciech Samek
- Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute, Berlin 10587, Germany
| |
Collapse
|
172
|
Peng J, Xue H, Wei Z, Tuncali I, Hao J, Shang X. Integrating multi-network topology for gene function prediction using deep neural networks. Brief Bioinform 2020; 22:2096-2105. [PMID: 32249297 DOI: 10.1093/bib/bbaa036] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2019] [Revised: 02/09/2020] [Accepted: 02/25/2020] [Indexed: 01/18/2023] Open
Abstract
MOTIVATION The emergence of abundant biological networks, which benefit from the development of advanced high-throughput techniques, contributes to describing and modeling complex internal interactions among biological entities such as genes and proteins. Multiple networks provide rich information for inferring the function of genes or proteins. To extract functional patterns of genes based on multiple heterogeneous networks, network embedding-based methods, aiming to capture non-linear and low-dimensional feature representation based on network biology, have recently achieved remarkable performance in gene function prediction. However, existing methods do not consider the shared information among different networks during the feature learning process. RESULTS Taking the correlation among the networks into account, we design a novel semi-supervised autoencoder method to integrate multiple networks and generate a low-dimensional feature representation. Then we utilize a convolutional neural network based on the integrated feature embedding to annotate unlabeled gene functions. We test our method on both yeast and human datasets and compare with three state-of-the-art methods. The results demonstrate the superior performance of our method. We not only provide a comprehensive analysis of the performance of the newly proposed algorithm but also provide a tool for extracting features of genes based on multiple networks, which can be used in the downstream machine learning task. AVAILABILITY DeepMNE-CNN is freely available at https://github.com/xuehansheng/DeepMNE-CNN. CONTACT jiajiepeng@nwpu.edu.cn; shang@nwpu.edu.cn; jianye.hao@tju.edu.cn.
Collapse
Affiliation(s)
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
| | - Hansheng Xue
- Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology, Northwestern Polytechnical University, Xi'an, 710072, China
| | - Zhongyu Wei
- Research School of Computer Science, Australian National University, Canberra, 2601, Australia
| | - Idil Tuncali
- School of Data Science, Fudan University, Shanghai, 200433, China
| | | | | |
Collapse
|
173
|
Fabris F, Palmer D, Salama KM, de Magalhães JP, Freitas AA. Using deep learning to associate human genes with age-related diseases. Bioinformatics 2020; 36:2202-2208. [PMID: 31845988 PMCID: PMC7141856 DOI: 10.1093/bioinformatics/btz887] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2019] [Revised: 09/06/2019] [Accepted: 12/13/2019] [Indexed: 11/15/2022] Open
Abstract
Motivation One way to identify genes possibly associated with ageing is to build a classification model (from the machine learning field) capable of classifying genes as associated with multiple age-related diseases. To build this model, we use a pre-compiled list of human genes associated with age-related diseases and apply a novel Deep Neural Network (DNN) method to find associations between gene descriptors (e.g. Gene Ontology terms, protein–protein interaction data and biological pathway information) and age-related diseases. Results The novelty of our new DNN method is its modular architecture, which has the capability of combining several sources of biological data to predict which ageing-related diseases a gene is associated with (if any). Our DNN method achieves better predictive performance than standard DNN approaches, a Gradient Boosted Tree classifier (a strong baseline method) and a Logistic Regression classifier. Given the DNN model produced by our method, we use two approaches to identify human genes that are not known to be associated with age-related diseases according to our dataset. First, we investigate genes that are close to other disease-associated genes in a complex multi-dimensional feature space learned by the DNN algorithm. Second, using the class label probabilities output by our DNN approach, we identify genes with a high probability of being associated with age-related diseases according to the model. We provide evidence of these putative associations retrieved from the DNN model with literature support. Availability and implementation The source code and datasets can be found at: https://github.com/fabiofabris/Bioinfo2019. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Fabio Fabris
- School of Computing, University of Kent, Canterbury, Kent CT2 7NF, UK
| | - Daniel Palmer
- Integrative Genomics of Ageing Group, Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool L7 8TX, UK
| | - Khalid M Salama
- School of Computing, University of Kent, Canterbury, Kent CT2 7NF, UK
| | - João Pedro de Magalhães
- Integrative Genomics of Ageing Group, Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool L7 8TX, UK
| | - Alex A Freitas
- School of Computing, University of Kent, Canterbury, Kent CT2 7NF, UK
| |
Collapse
|
174
|
The sterlet sturgeon genome sequence and the mechanisms of segmental rediploidization. Nat Ecol Evol 2020; 4:841-852. [PMID: 32231327 PMCID: PMC7269910 DOI: 10.1038/s41559-020-1166-x] [Citation(s) in RCA: 124] [Impact Index Per Article: 31.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2019] [Accepted: 02/27/2020] [Indexed: 12/20/2022]
Abstract
Sturgeons seem to be frozen in time. The archaic characteristics of this ancient fish lineage place it in a key phylogenetic position at the base of the ~30,000 modern teleost fish species. Moreover, sturgeons are notoriously polyploid, providing unique opportunities to investigate the evolution of polyploid genomes. We assembled a high-quality chromosome-level reference genome for the sterlet, Acipenser ruthenus. Our analysis revealed a very low protein evolution rate that is at least as slow as in other deep branches of the vertebrate tree, such as that of the coelacanth. We uncovered a whole-genome duplication that occurred in the Jurassic, early in the evolution of the entire sturgeon lineage. Following this polyploidization, the rediploidization of the genome included the loss of whole chromosomes in a segmental deduplication process. While known adaptive processes helped conserve a high degree of structural and functional tetraploidy over more than 180 million years, the reduction of redundancy of the polyploid genome seems to have been remarkably random. A genome assembly of the sterlet, Acipenser ruthenus, reveals a whole-genome duplication early in the evolution of the entire sturgeon lineage and provides details about the rediploidization of the genome.
Collapse
|
175
|
Makrodimitris S, van Ham RCHJ, Reinders MJT. Improving protein function prediction using protein sequence and GO-term similarities. Bioinformatics 2020; 35:1116-1124. [PMID: 30169569 PMCID: PMC6449755 DOI: 10.1093/bioinformatics/bty751] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2017] [Revised: 07/04/2018] [Accepted: 08/28/2018] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION Most automatic functional annotation methods assign Gene Ontology (GO) terms to proteins based on annotations of highly similar proteins. We advocate that proteins that are less similar are still informative. Also, despite their simplicity and structure, GO terms seem to be hard for computers to learn, in particular the Biological Process ontology, which has the most terms (>29 000). We propose to use Label-Space Dimensionality Reduction (LSDR) techniques to exploit the redundancy of GO terms and transform them into a more compact latent representation that is easier to predict. RESULTS We compare proteins using a sequence similarity profile (SSP) to a set of annotated training proteins. We introduce two new LSDR methods, one based on the structure of the GO, and one based on semantic similarity of terms. We show that these LSDR methods, as well as three existing ones, improve the Critical Assessment of Functional Annotation performance of several function prediction algorithms. Cross-validation experiments on Arabidopsis thaliana proteins pinpoint the superiority of our GO-aware LSDR over generic LSDR. Our experiments on A.thaliana proteins show that the SSP representation in combination with a kNN classifier outperforms state-of-the-art and baseline methods in terms of cross-validated F-measure. AVAILABILITY AND IMPLEMENTATION Source code for the experiments is available at https://github.com/stamakro/SSP-LSDR. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Stavros Makrodimitris
- Department of Intelligent Systems, Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands.,Department of Bioinformatics, Keygene N.V., Wageningen, The Netherlands
| | - Roeland C H J van Ham
- Department of Intelligent Systems, Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands.,Department of Bioinformatics, Keygene N.V., Wageningen, The Netherlands
| | - Marcel J T Reinders
- Department of Intelligent Systems, Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
| |
Collapse
|
176
|
Yue X, Wang Z, Huang J, Parthasarathy S, Moosavinasab S, Huang Y, Lin SM, Zhang W, Zhang P, Sun H. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics 2020; 36:1241-1251. [PMID: 31584634 PMCID: PMC7703771 DOI: 10.1093/bioinformatics/btz718] [Citation(s) in RCA: 102] [Impact Index Per Article: 25.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2019] [Revised: 08/25/2019] [Accepted: 09/26/2019] [Indexed: 01/12/2023] Open
Abstract
MOTIVATION Graph embedding learning that aims to automatically learn low-dimensional node representations, has drawn increasing attention in recent years. To date, most recent graph embedding methods are evaluated on social and information networks and are not comprehensively studied on biomedical networks under systematic experiments and analyses. On the other hand, for a variety of biomedical network analysis tasks, traditional techniques such as matrix factorization (which can be seen as a type of graph embedding methods) have shown promising results, and hence there is a need to systematically evaluate the more recent graph embedding methods (e.g. random walk-based and neural network-based) in terms of their usability and potential to further the state-of-the-art. RESULTS We select 11 representative graph embedding methods and conduct a systematic comparison on 3 important biomedical link prediction tasks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) prediction, protein-protein interaction (PPI) prediction; and 2 node classification tasks: medical term semantic type classification, protein function prediction. Our experimental results demonstrate that the recent graph embedding methods achieve promising results and deserve more attention in the future biomedical graph analysis. Compared with three state-of-the-art methods for DDAs, DDIs and protein function predictions, the recent graph embedding methods achieve competitive performance without using any biological features and the learned embeddings can be treated as complementary representations for the biological features. By summarizing the experimental results, we provide general guidelines for properly selecting graph embedding methods and setting their hyper-parameters for different biomedical tasks. AVAILABILITY AND IMPLEMENTATION As part of our contributions in the paper, we develop an easy-to-use Python package with detailed instructions, BioNEV, available at: https://github.com/xiangyue9607/BioNEV, including all source code and datasets, to facilitate studying various graph embedding methods on biomedical tasks. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xiang Yue
- Department of Computer Science and Engineering, OH, USA
| | - Zhen Wang
- Department of Computer Science and Engineering, OH, USA
| | - Jingong Huang
- Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
| | | | - Soheil Moosavinasab
- Research Information Solutions and Innovation, The Research Institute at Nationwide Children’s Hospital, Columbus, OH, USA
| | - Yungui Huang
- Research Information Solutions and Innovation, The Research Institute at Nationwide Children’s Hospital, Columbus, OH, USA
| | - Simon M Lin
- Research Information Solutions and Innovation, The Research Institute at Nationwide Children’s Hospital, Columbus, OH, USA
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China
| | - Ping Zhang
- Department of Computer Science and Engineering, OH, USA
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Huan Sun
- Department of Computer Science and Engineering, OH, USA
| |
Collapse
|
177
|
de Jongh RP, van Dijk AD, Julsing MK, Schaap PJ, de Ridder D. Designing Eukaryotic Gene Expression Regulation Using Machine Learning. Trends Biotechnol 2020; 38:191-201. [DOI: 10.1016/j.tibtech.2019.07.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2019] [Revised: 07/12/2019] [Accepted: 07/19/2019] [Indexed: 12/11/2022]
|
178
|
Mishra R, Mohapatra R, Mahanty B, Joshi RK. Analysis of microRNAs and their targets from onion (Allium cepa) using genome survey sequences (GSS) and expressed sequence tags (ESTs). Bioinformation 2019; 15:907-917. [PMID: 32256010 PMCID: PMC7088424 DOI: 10.6026/97320630015907] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2019] [Revised: 12/30/2019] [Accepted: 12/31/2019] [Indexed: 01/31/2023] Open
Abstract
MicroRNAs are small non-coding RNAs of 21-24 nucleotides in length that acts as important modulators of gene expression related to numerous biological processes including development and defense response in eukaryotes. However, only a limited report on onion (Allium cepa) miRNAs is available and their associated role in growth and development of onion is not yet clear. Therefore, it is of interest to identify miRNAs and their targets in Allium cepa using the genome survey sequences (GSSs) and expressed sequence tags (ESTs) and deduce the functions of the target genes using gene ontology (GO) terms. We report 14 potential miRNAs belonging to 13 different families (miR162, miR168, miR172c, miR172e, miR398, miR400, miR414, miR1134, miR1223, miR6219, miR7725, miR8570, miR8703 and miR8752). BLAST analysis using psRNATarget server predicted 39 potential targets for the identified miRNAs majority of which were transcription factors implicated in plant growth, development, hormone signaling and stress responses. These data forms the basis for further analysis and verification towards understanding the miRNA mediated regulatory mechanism in Allium cepa.
Collapse
Affiliation(s)
- Rukmini Mishra
- School of Applied Sciences, Centurion University of Technology and Management, Odisha, India
| | - Rupesh Mohapatra
- Centre for Biotechnology, Siksha O Anusandhan University, Bhubaneswar-751030, Odisha, India
| | - Bijayalaxmi Mahanty
- Dept. of Biotechnology, Rama Devi Women's University, Vidya Vihar, Bhubaneswar-751022, Odisha, India
| | - Raj Kumar Joshi
- Dept. of Biotechnology, Rama Devi Women's University, Vidya Vihar, Bhubaneswar-751022, Odisha, India
| |
Collapse
|
179
|
Shi Q, Chen W, Huang S, Wang Y, Xue Z. Deep learning for mining protein data. Brief Bioinform 2019; 22:194-218. [PMID: 31867611 DOI: 10.1093/bib/bbz156] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2019] [Revised: 10/21/2019] [Accepted: 11/07/2019] [Indexed: 01/16/2023] Open
Abstract
The recent emergence of deep learning to characterize complex patterns of protein big data reveals its potential to address the classic challenges in the field of protein data mining. Much research has revealed the promise of deep learning as a powerful tool to transform protein big data into valuable knowledge, leading to scientific discoveries and practical solutions. In this review, we summarize recent publications on deep learning predictive approaches in the field of mining protein data. The application architectures of these methods include multilayer perceptrons, stacked autoencoders, deep belief networks, two- or three-dimensional convolutional neural networks, recurrent neural networks, graph neural networks, and complex neural networks and are described from five perspectives: residue-level prediction, sequence-level prediction, three-dimensional structural analysis, interaction prediction, and mass spectrometry data mining. The advantages and deficiencies of these architectures are presented in relation to various tasks in protein data mining. Additionally, some practical issues and their future directions are discussed, such as robust deep learning for protein noisy data, architecture optimization for specific tasks, efficient deep learning for limited protein data, multimodal deep learning for heterogeneous protein data, and interpretable deep learning for protein understanding. This review provides comprehensive perspectives on general deep learning techniques for protein data analysis.
Collapse
Affiliation(s)
- Qiang Shi
- School of Software Engineering, Huazhong University of Science and Technology. His main interests cover machine learning especially deep learning, protein data analysis, and big data mining
| | - Weiya Chen
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, virtual reality, and data visualization
| | - Siqi Huang
- Software Engineering at Huazhong University of science and technology, focusing on Machine learning and data mining
| | - Yan Wang
- School of life, University of Science & Technology; her main interests cover protein structure and function prediction and big data mining
| | - Zhidong Xue
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, machine learning, and image processing
| |
Collapse
|
180
|
Functional Gene Network of Prenyltransferases in Arabidopsis thaliana. Molecules 2019; 24:molecules24244556. [PMID: 31842481 PMCID: PMC6943727 DOI: 10.3390/molecules24244556] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2019] [Revised: 12/09/2019] [Accepted: 12/10/2019] [Indexed: 12/17/2022] Open
Abstract
Prenyltransferases (PTs) are enzymes that catalyze prenyl chain elongation. Some are highly similar to each other at the amino acid level. Therefore, it is difficult to assign their function based solely on their sequence homology to functional orthologs. Other experiments, such as in vitro enzymatic assay, mutant analysis, and mutant complementation are necessary to assign their precise function. Moreover, subcellular localization can also influence the functionality of the enzymes within the pathway network, because different isoprenoid end products are synthesized in the cytosol, mitochondria, or plastids from prenyl diphosphate (prenyl-PP) substrates. In addition to in vivo functional experiments, in silico approaches, such as co-expression analysis, can provide information about the topology of PTs within the isoprenoid pathway network. There has been huge progress in the last few years in the characterization of individual Arabidopsis PTs, resulting in better understanding of their function and their topology within the isoprenoid pathway. Here, we summarize these findings and present the updated topological model of PTs in the Arabidopsis thaliana isoprenoid pathway.
Collapse
|
181
|
Wang J, Zhang J, Cai Y, Deng L. DeepMiR2GO: Inferring Functions of Human MicroRNAs Using a Deep Multi-Label Classification Model. Int J Mol Sci 2019; 20:E6046. [PMID: 31801264 PMCID: PMC6928926 DOI: 10.3390/ijms20236046] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2019] [Revised: 11/25/2019] [Accepted: 11/26/2019] [Indexed: 01/08/2023] Open
Abstract
MicroRNAs (miRNAs) are a highly abundant collection of functional non-coding RNAs involved in cellular regulation and various complex human diseases. Although a large number of miRNAs have been identified, most of their physiological functions remain unknown. Computational methods play a vital role in exploring the potential functions of miRNAs. Here, we present DeepMiR2GO, a tool for integrating miRNAs, proteins and diseases, to predict the gene ontology (GO) functions based on multiple deep neuro-symbolic models. DeepMiR2GO starts by integrating the miRNA co-expression network, protein-protein interaction (PPI) network, disease phenotype similarity network, and interactions or associations among them into a global heterogeneous network. Then, it employs an efficient graph embedding strategy to learn potential network representations of the global heterogeneous network as the topological features. Finally, a deep multi-label classification network based on multiple neuro-symbolic models is built and used to annotate the GO terms of miRNAs. The predicted results demonstrate that DeepMiR2GO performs significantly better than other state-of-the-art approaches in terms of precision, recall, and maximum F-measure.
Collapse
Affiliation(s)
- Jiacheng Wang
- School of Computer Science and Engineering, Central South University, Changsha 410083, China; (J.W.); (Y.C.)
| | - Jingpu Zhang
- School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan 467000, China;
| | - Yideng Cai
- School of Computer Science and Engineering, Central South University, Changsha 410083, China; (J.W.); (Y.C.)
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China; (J.W.); (Y.C.)
- School of Software, Xinjiang University, Urumqi 830008, China
| |
Collapse
|
182
|
Bonetta R, Valentino G. Machine learning techniques for protein function prediction. Proteins 2019; 88:397-413. [PMID: 31603244 DOI: 10.1002/prot.25832] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2019] [Revised: 07/05/2019] [Accepted: 09/17/2019] [Indexed: 12/17/2022]
Abstract
Proteins play important roles in living organisms, and their function is directly linked with their structure. Due to the growing gap between the number of proteins being discovered and their functional characterization (in particular as a result of experimental limitations), reliable prediction of protein function through computational means has become crucial. This paper reviews the machine learning techniques used in the literature, following their evolution from simple algorithms such as logistic regression to more advanced methods like support vector machines and modern deep neural networks. Hyperparameter optimization methods adopted to boost prediction performance are presented. In parallel, the metamorphosis in the features used by these algorithms from classical physicochemical properties and amino acid composition, up to text-derived features from biomedical literature and learned feature representations using autoencoders, together with feature selection and dimensionality reduction techniques, are also reviewed. The success stories in the application of these techniques to both general and specific protein function prediction are discussed.
Collapse
Affiliation(s)
- Rosalin Bonetta
- Centre for Molecular Medicine and Biobanking, University of Malta, Msida, Malta
| | - Gianluca Valentino
- Department of Communications and Computer Engineering, University of Malta, Msida, Malta
| |
Collapse
|
183
|
Mishra S, Rastogi YP, Jabin S, Kaur P, Amir M, Khatun S. A deep learning ensemble for function prediction of hypothetical proteins from pathogenic bacterial species. Comput Biol Chem 2019; 83:107147. [PMID: 31698160 DOI: 10.1016/j.compbiolchem.2019.107147] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2019] [Revised: 10/05/2019] [Accepted: 10/09/2019] [Indexed: 01/06/2023]
Abstract
Protein function prediction is a crucial task in the post-genomics era due to their diverse irreplaceable roles in a biological system. Traditional methods involved cost-intensive and time-consuming molecular biology techniques but they proved to be ineffective after the outburst of sequencing data through the advent of cost-effective and advanced sequencing techniques. To manage the pace of annotation with that of data generation, there is a shift to computational approaches which are based on homology, sequence and structure-based features, protein-protein interaction networks, phylogenetic profiles, and physicochemical properties, etc. A combination of these features has proven to be promising for protein function prediction in terms of improving prediction accuracy. In the present work, we have employed a combination of features based on sequence, physicochemical property, subsequence and annotation features with a total of 9890 features extracted and/or calculated for 171,212 reviewed prokaryotic proteins of 9 bacterial phyla from UniProtKB, to train a supervised deep learning ensemble model with the aim to categorize a bacterial hypothetical/unreviewed protein's function into 1739 GO terms as functional classes. The proposed system being fully dedicated to bacterial organisms is a novel attempt amongst various existing machine learning based protein function prediction systems based on mixed organisms. Experimental results demonstrate the success of the proposed deep learning ensemble model based on deep neural network method with F1 measure of 0.7912 on the prepared Test dataset 1 of reviewed proteins.
Collapse
Affiliation(s)
- Sarthak Mishra
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
| | - Yash Pratap Rastogi
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
| | - Suraiya Jabin
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India.
| | - Punit Kaur
- Department of Biophysics, All India Institute of Medical Sciences (AIIMS), New Delhi, 110 029, Delhi, India
| | - Mohammad Amir
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
| | - Shabnam Khatun
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
| |
Collapse
|
184
|
Mahapatra M, Mahanty B, Joshi RK. Genome wide identification and functional assignments of C 2H 2 Zinc-finger family transcription factors in Dichanthelium oligosanthes. Bioinformation 2019; 15:689-696. [PMID: 31787818 PMCID: PMC6859702 DOI: 10.6026/97320630015689] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2019] [Revised: 10/16/2019] [Accepted: 10/16/2019] [Indexed: 12/23/2022] Open
Abstract
Transcription factors (TFs) are biological regulators of gene function in response to various internal and external stimuli. C2H2 zinc finger proteins (C2H2-ZFPs) are a large family of TFs that play crucial roles in plant growth and development, hormone signalling and response to biotic and abiotic stresses. While C2H2-ZFPs have been well characterized in many model and crop plants, they are yet to be ascertained in the evolutionarily important C3 plant Dichanthelium oligosanthes (Heller's rosette grass). In the present study, we report 32 C2H2-ZF genes (DoZFs) belonging to three different classes-Q type, C-type and Z-type based on structural elucidation and phylogenetic analysis. Sequence comparisons revealed paralogs within the DoZFs and orthologs among with rice ZF genes. Motif assignment showed the presence of the distinctive C2H2-ZF conserved domain "QALGGH" in these proteins. Cis-element analysis indicated that majority of the predicted C2H2-ZFPs are associated with hormone signalling and abiotic stress responses. Further, their role in nucleic acid binding and transcriptional regulation was also observed using predicted functional assignment. Thus, we report an overview of the C2H2-ZF gene family in D. oligosanthes that could serve as the basis for future experimental studies on isolation and functional implication of these genes in different biological mechanism of C3 plants.
Collapse
Affiliation(s)
- Manisha Mahapatra
- Department of Biotechnology, Rama Devi Women's University, Vidya Vihar, Bhubaneswar-751022, Odisha, INDIA
| | - Bijayalaxmi Mahanty
- Department of Biotechnology, Rama Devi Women's University, Vidya Vihar, Bhubaneswar-751022, Odisha, INDIA
| | - Raj Kumar Joshi
- Department of Biotechnology, Rama Devi Women's University, Vidya Vihar, Bhubaneswar-751022, Odisha, INDIA
| |
Collapse
|
185
|
Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities. AN INTERNATIONAL JOURNAL ON INFORMATION FUSION 2019; 50:71-91. [PMID: 30467459 PMCID: PMC6242341 DOI: 10.1016/j.inffus.2018.09.012] [Citation(s) in RCA: 222] [Impact Index Per Article: 44.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/10/2023]
Abstract
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
Collapse
Affiliation(s)
- Marinka Zitnik
- Department of Computer Science, Stanford University,
Stanford, CA, USA
| | - Francis Nguyen
- Department of Medical Biophysics, University of Toronto,
Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
| | - Bo Wang
- Hikvision Research Institute, Santa Clara, CA, USA
| | - Jure Leskovec
- Department of Computer Science, Stanford University,
Stanford, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Anna Goldenberg
- Genetics & Genome Biology, SickKids Research Institute,
Toronto, ON, Canada
- Department of Computer Science, University of Toronto,
Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| | - Michael M. Hoffman
- Department of Medical Biophysics, University of Toronto,
Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
- Department of Computer Science, University of Toronto,
Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| |
Collapse
|
186
|
Investigation of machine learning techniques on proteomics: A comprehensive survey. PROGRESS IN BIOPHYSICS AND MOLECULAR BIOLOGY 2019; 149:54-69. [PMID: 31568792 DOI: 10.1016/j.pbiomolbio.2019.09.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/05/2019] [Revised: 09/16/2019] [Accepted: 09/23/2019] [Indexed: 11/21/2022]
Abstract
Proteomics is the extensive investigation of proteins which has empowered the recognizable proof of consistently expanding quantities of protein. Proteins are necessary part of living life form, with numerous capacities. The proteome is the complete arrangement of proteins that are created or altered by a life form or framework of the organism. Proteome fluctuates with time and unambiguous prerequisites, or stresses, that a cell or organism experiences. Proteomics is an interdisciplinary area that has derived from the hereditary data of different genome ventures. Much proteomics information is gathered with the assistance of high throughput techniques, for example, mass spectrometry and microarray. It would regularly take weeks or months to analyze the information and perform examinations by hand. Therefore, scholars and scientific experts are teaming up with computer science researchers and mathematicians to make projects and pipeline to computationally examine the protein information. Utilizing bioinformatics procedures, scientists are prepared to do quicker investigation and protein information storing. The goal of this paper is to brief about the review of machine learning procedures and its application in the field of proteomics.
Collapse
|
187
|
Nakano FK, Lietaert M, Vens C. Machine learning for discovering missing or wrong protein function annotations : A comparison using updated benchmark datasets. BMC Bioinformatics 2019; 20:485. [PMID: 31547800 PMCID: PMC6755698 DOI: 10.1186/s12859-019-3060-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2019] [Accepted: 08/27/2019] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND A massive amount of proteomic data is generated on a daily basis, nonetheless annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. RESULTS The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were less significant differences among the methods. CONCLUSIONS The experiments have showed that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies, nonetheless the old versions of the datasets should not be disregarded since other tasks in machine learning could benefit from them.
Collapse
Affiliation(s)
- Felipe Kenji Nakano
- KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500 Belgium
- ITEC - imec, Etienne Sabbelaan 51, Kortrijk, 8500 Belgium
| | - Mathias Lietaert
- Howest University of Applied Sciences, Campus Brugge Station, Rijselstraat 5, Brugge, 8200 Belgium
| | - Celine Vens
- KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500 Belgium
- ITEC - imec, Etienne Sabbelaan 51, Kortrijk, 8500 Belgium
| |
Collapse
|
188
|
Kulmanov M, Hoehndorf R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 2019; 36:422-429. [PMID: 31350877 PMCID: PMC9883727 DOI: 10.1093/bioinformatics/btz595] [Citation(s) in RCA: 133] [Impact Index Per Article: 26.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2019] [Revised: 07/01/2019] [Accepted: 07/24/2019] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Protein function prediction is one of the major tasks of bioinformatics that can help in wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence based features, protein-protein interaction networks, protein structure or literature. However, other than sequence, most of the features are difficult to obtain or not available for many proteins thereby limiting their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than methods that incorporate multiple features and predicting protein functions may require a lot of time. RESULTS We developed a novel method for predicting protein functions from sequence alone which combines deep convolutional neural network (CNN) model with sequence similarity based predictions. Our CNN model scans the sequence for motifs which are predictive for protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins. AVAILABILITY AND IMPLEMENTATION http://deepgoplus.bio2vec.net/ . SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Maxat Kulmanov
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
| | | |
Collapse
|
189
|
Chen H, Shaw D, Zeng J, Bu D, Jiang T. DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning. Bioinformatics 2019; 35:i284-i294. [PMID: 31510699 PMCID: PMC6612874 DOI: 10.1093/bioinformatics/btz367] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023] Open
Abstract
MOTIVATION Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, the existing approaches to predicting isoform functions are far from satisfactory due to at least two reasons: (i) unlike genes, isoform-level functional annotations are scarce. (ii) The information of isoform functions is concealed in various types of data including isoform sequences, co-expression relationship among isoforms, etc. RESULTS In this study, we present a novel approach, DIFFUSE (Deep learning-based prediction of IsoForm FUnctions from Sequences and Expression), to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework by first using a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences and then refining the prediction using a conditional random field (CRF) based on co-expression relationship. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train both the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE could effectively predict the functions of isoforms and genes. It achieves an average area under the receiver operating characteristics curve of 0.840 and area under the precision-recall curve of 0.581 over 4184 GO functional categories, which are significantly higher than the state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences. AVAILABILITY AND IMPLEMENTATION https://github.com/haochenucr/DIFFUSE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Hao Chen
- Department of Compute Science and Engineering, University of California, Riverside, CA, USA
| | - Dipan Shaw
- Department of Compute Science and Engineering, University of California, Riverside, CA, USA
| | - Jianyang Zeng
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
| | - Dongbo Bu
- Key Lab of Intelligent Information Process, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Tao Jiang
- Department of Compute Science and Engineering, University of California, Riverside, CA, USA
- Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China
| |
Collapse
|
190
|
Lv Z, Ao C, Zou Q. Protein Function Prediction: From Traditional Classifier to Deep Learning. Proteomics 2019; 19:e1900119. [PMID: 31187588 DOI: 10.1002/pmic.201900119] [Citation(s) in RCA: 61] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2019] [Revised: 05/20/2019] [Indexed: 11/12/2022]
Abstract
Deep learning demonstrates greater competence over traditional machine learning techniques for many tasks. In last several years, deep learning has been applied to protein function prediction and a series of good achievements has been obtained. These findings extensively advanced our understanding of protein function. However, the accuracy of protein function prediction based upon deep learning still has yet to be improved. In article number 1900019, Issue 12, Zhang et al. construct DeepFunc, a deep learning framework using derived feature information of protein sequence and protein interactions network. They find that implementing DeepFunc for protein function prediction is more accurate than using DeepGO, a similar method reported previously. Meanwhile, they find that the method of combining multiple derived feature information in DeepFunc is much better than the method of using only single derived feature information. Due to its fully exploiting feature representation learning ability, deep learning with more derived feature information will enable it to be a promising method for solving more complicated protein function prediction problems and other bioinformatics challenges. Recent researches have provided some major insights into the value for using deep learning to protein function prediction problem.
Collapse
Affiliation(s)
- Zhibin Lv
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, P. R. China
| | - Chunyan Ao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, P. R. China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, P. R. China.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, P. R. China
| |
Collapse
|
191
|
Teso S, Masera L, Diligenti M, Passerini A. Combining learning and constraints for genome-wide protein annotation. BMC Bioinformatics 2019; 20:338. [PMID: 31208327 PMCID: PMC6580517 DOI: 10.1186/s12859-019-2875-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2018] [Accepted: 05/03/2019] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND The advent of high-throughput experimental techniques paved the way to genome-wide computational analysis and predictive annotation studies. When considering the joint annotation of a large set of related entities, like all proteins of a certain genome, many candidate annotations could be inconsistent, or very unlikely, given the existing knowledge. A sound predictive framework capable of accounting for this type of constraints in making predictions could substantially contribute to the quality of machine-generated annotations at a genomic scale. RESULTS We present OCELOT, a predictive pipeline which simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors for functional and protein-protein interaction (PPI) prediction with a consistency layer enforcing (soft) constraints as fuzzy logic rules. The enforced rules represent the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all ancestor terms) as well as rules combining interaction and function prediction. An extensive experimental evaluation on the Yeast genome shows that the integration of prior knowledge via rules substantially improves the quality of the predictions. The system largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as OCELOT), and has comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning.
Collapse
Affiliation(s)
- Stefano Teso
- Computer Science Department, KULeuven, Celestijnenlaan 200 A bus 2402, Leuven, 3001 Belgium
| | - Luca Masera
- Department of Information Engineering and Computer Science, University of Trento, Via Sommarive, 5, Povo di Trento, 38123 Italy
| | - Michelangelo Diligenti
- Department of Information Engineering and Mathematics, University of Siena, San Niccolò, via Roma, 56, Siena, 53100 Italy
| | - Andrea Passerini
- Department of Information Engineering and Computer Science, University of Trento, Via Sommarive, 5, Povo di Trento, 38123 Italy
| |
Collapse
|
192
|
Saha S, Chatterjee P, Basu S, Nasipuri M, Plewczynski D. FunPred 3.0: improved protein function prediction using protein interaction network. PeerJ 2019; 7:e6830. [PMID: 31198622 PMCID: PMC6535044 DOI: 10.7717/peerj.6830] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2018] [Accepted: 03/21/2019] [Indexed: 11/23/2022] Open
Abstract
Proteins are the most versatile macromolecules in living systems and perform crucial biological functions. In the advent of the post-genomic era, the next generation sequencing is done routinely at the population scale for a variety of species. The challenging problem is to massively determine the functions of proteins that are yet not characterized by detailed experimental studies. Identification of protein functions experimentally is a laborious and time-consuming task involving many resources. We therefore propose the automated protein function prediction methodology using in silico algorithms trained on carefully curated experimental datasets. We present the improved protein function prediction tool FunPred 3.0, an extended version of our previous methodology FunPred 2, which exploits neighborhood properties in protein–protein interaction network (PPIN) and physicochemical properties of amino acids. Our method is validated using the available functional annotations in the PPIN network of Saccharomyces cerevisiae in the latest Munich information center for protein (MIPS) dataset. The PPIN data of S. cerevisiae in MIPS dataset includes 4,554 unique proteins in 13,528 protein–protein interactions after the elimination of the self-replicating and the self-interacting protein pairs. Using the developed FunPred 3.0 tool, we are able to achieve the mean precision, the recall and the F-score values of 0.55, 0.82 and 0.66, respectively. FunPred 3.0 is then used to predict the functions of unpredicted protein pairs (incomplete and missing functional annotations) in MIPS dataset of S. cerevisiae. The method is also capable of predicting the subcellular localization of proteins along with its corresponding functions. The code and the complete prediction results are available freely at: https://github.com/SovanSaha/FunPred-3.0.git.
Collapse
Affiliation(s)
- Sovan Saha
- Department of Computer Science and Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, Kolkata, West Bengal, India
| | - Piyali Chatterjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
| | - Subhadip Basu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
| | - Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
| | - Dariusz Plewczynski
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland.,Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
| |
Collapse
|
193
|
Zhang F, Song H, Zeng M, Li Y, Kurgan L, Li M. DeepFunc: A Deep Learning Framework for Accurate Prediction of Protein Functions from Protein Sequences and Interactions. Proteomics 2019; 19:e1900019. [PMID: 30941889 DOI: 10.1002/pmic.201900019] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2019] [Revised: 03/18/2019] [Indexed: 01/06/2023]
Abstract
Annotation of protein functions plays an important role in understanding life at the molecular level. High-throughput sequencing produces massive numbers of raw proteins sequences and only about 1% of them have been manually annotated with functions. Experimental annotations of functions are expensive, time-consuming and do not keep up with the rapid growth of the sequence numbers. This motivates the development of computational approaches that predict protein functions. A novel deep learning framework, DeepFunc, is proposed which accurately predicts protein functions from protein sequence- and network-derived information. More precisely, DeepFunc uses a long and sparse binary vector to encode information concerning domains, families, and motifs collected from the InterPro tool that is associated with the input protein sequence. This vector is processed with two neural layers to obtain a low-dimensional vector which is combined with topological information extracted from protein-protein interactions (PPIs) and functional linkages. The combined information is processed by a deep neural network that predicts protein functions. DeepFunc is empirically and comparatively tested on a benchmark testing dataset and the Critical Assessment of protein Function Annotation algorithms (CAFA) 3 dataset. The experimental results demonstrate that DeepFunc outperforms current methods on the testing dataset and that it secures the highest Fmax = 0.54 and AUC = 0.94 on the CAFA3 dataset.
Collapse
Affiliation(s)
- Fuhao Zhang
- School of Computer Science and Engineering, Central South University, Changsha, 410083, P. R. China
| | - Hong Song
- School of Computer Science and Engineering, Central South University, Changsha, 410083, P. R. China
| | - Min Zeng
- School of Computer Science and Engineering, Central South University, Changsha, 410083, P. R. China
| | - Yaohang Li
- School of Computer Science and Engineering, Central South University, Changsha, 410083, P. R. China.,Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha, 410083, P. R. China
| |
Collapse
|
194
|
Zhu J, Zhao Q, Katsevich E, Sabatti C. Exploratory Gene Ontology Analysis with Interactive Visualization. Sci Rep 2019; 9:7793. [PMID: 31127124 PMCID: PMC6534545 DOI: 10.1038/s41598-019-42178-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2018] [Accepted: 03/14/2019] [Indexed: 12/17/2022] Open
Abstract
The Gene Ontology (GO) is a central resource for functional-genomics research. Scientists rely on the functional annotations in the GO for hypothesis generation and couple it with high-throughput biological data to enhance interpretation of results. At the same time, the sheer number of concepts (>30,000) and relationships (>70,000) presents a challenge: it can be difficult to draw a comprehensive picture of how certain concepts of interest might relate with the rest of the ontology structure. Here we present new visualization strategies to facilitate the exploration and use of the information in the GO. We rely on novel graphical display and software architecture that allow significant interaction. To illustrate the potential of our strategies, we provide examples from high-throughput genomic analyses, including chromatin immunoprecipitation experiments and genome-wide association studies. The scientist can also use our visualizations to identify gene sets that likely experience coordinated changes in their expression and use them to simulate biologically-grounded single cell RNA sequencing data, or conduct power studies for differential gene expression studies using our built-in pipeline. Our software and documentation are available at http://aegis.stanford.edu .
Collapse
Affiliation(s)
- Junjie Zhu
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA.
| | - Qian Zhao
- Department of Statistics, Stanford University, Stanford, CA, USA
| | - Eugene Katsevich
- Department of Statistics, Stanford University, Stanford, CA, USA
| | - Chiara Sabatti
- Department of Statistics, Stanford University, Stanford, CA, USA.
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
| |
Collapse
|
195
|
Stephenson N, Shane E, Chase J, Rowland J, Ries D, Justice N, Zhang J, Chan L, Cao R. Survey of Machine Learning Techniques in Drug Discovery. Curr Drug Metab 2019; 20:185-193. [DOI: 10.2174/1389200219666180820112457] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2017] [Revised: 01/01/2018] [Accepted: 03/19/2018] [Indexed: 12/19/2022]
Abstract
Background:Drug discovery, which is the process of discovering new candidate medications, is very important for pharmaceutical industries. At its current stage, discovering new drugs is still a very expensive and time-consuming process, requiring Phases I, II and III for clinical trials. Recently, machine learning techniques in Artificial Intelligence (AI), especially the deep learning techniques which allow a computational model to generate multiple layers, have been widely applied and achieved state-of-the-art performance in different fields, such as speech recognition, image classification, bioinformatics, etc. One very important application of these AI techniques is in the field of drug discovery.Methods:We did a large-scale literature search on existing scientific websites (e.g, ScienceDirect, Arxiv) and startup companies to understand current status of machine learning techniques in drug discovery.Results:Our experiments demonstrated that there are different patterns in machine learning fields and drug discovery fields. For example, keywords like prediction, brain, discovery, and treatment are usually in drug discovery fields. Also, the total number of papers published in drug discovery fields with machine learning techniques is increasing every year.Conclusion:The main focus of this survey is to understand the current status of machine learning techniques in the drug discovery field within both academic and industrial settings, and discuss its potential future applications. Several interesting patterns for machine learning techniques in drug discovery fields are discussed in this survey.
Collapse
Affiliation(s)
- Natalie Stephenson
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, United States
| | - Emily Shane
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, United States
| | - Jessica Chase
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, United States
| | - Jason Rowland
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, United States
| | - David Ries
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, United States
| | - Nicola Justice
- Department of Mathematics, Pacific Lutheran University, Tacoma, WA 98447, United States
| | - Jie Zhang
- Key Laboratory of Hebei Province for Plant Physiology and Molecular Pathology, College of Life Sciences, Hebei Agricultural University, Baoding, China
| | - Leong Chan
- School of Business, Pacific Lutheran University, Tacoma, WA 98447, United States
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, United States
| |
Collapse
|
196
|
Sureyya Rifaioglu A, Doğan T, Jesus Martin M, Cetin-Atalay R, Atalay V. DEEPred: Automated Protein Function Prediction with Multi-task Feed-forward Deep Neural Networks. Sci Rep 2019; 9:7344. [PMID: 31089211 PMCID: PMC6517386 DOI: 10.1038/s41598-019-43708-3] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2018] [Accepted: 04/27/2019] [Indexed: 01/22/2023] Open
Abstract
Automated protein function prediction is critical for the annotation of uncharacterized protein sequences, where accurate prediction methods are still required. Recently, deep learning based methods have outperformed conventional algorithms in computer vision and natural language processing due to the prevention of overfitting and efficient training. Here, we propose DEEPred, a hierarchical stack of multi-task feed-forward deep neural networks, as a solution to Gene Ontology (GO) based protein function prediction. DEEPred was optimized through rigorous hyper-parameter tests, and benchmarked using three types of protein descriptors, training datasets with varying sizes and GO terms form different levels. Furthermore, in order to explore how training with larger but potentially noisy data would change the performance, electronically made GO annotations were also included in the training process. The overall predictive performance of DEEPred was assessed using CAFA2 and CAFA3 challenge datasets, in comparison with the state-of-the-art protein function prediction methods. Finally, we evaluated selected novel annotations produced by DEEPred with a literature-based case study considering the 'biofilm formation process' in Pseudomonas aeruginosa. This study reports that deep learning algorithms have significant potential in protein function prediction; particularly when the source data is large. The neural network architecture of DEEPred can also be applied to the prediction of the other types of ontological associations. The source code and all datasets used in this study are available at: https://github.com/cansyl/DEEPred .
Collapse
Affiliation(s)
- Ahmet Sureyya Rifaioglu
- Department of Computer Engineering, METU, Ankara, 06800, Turkey
- Department of Computer Engineering, İskenderun Technical University, Hatay, 31200, Turkey
| | - Tunca Doğan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, UK.
- KanSiL, Department of Health Informatics, Graduate School of Informatics, METU, Ankara, 06800, Turkey.
| | - Maria Jesus Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, UK
| | - Rengul Cetin-Atalay
- KanSiL, Department of Health Informatics, Graduate School of Informatics, METU, Ankara, 06800, Turkey
| | - Volkan Atalay
- Department of Computer Engineering, METU, Ankara, 06800, Turkey.
- KanSiL, Department of Health Informatics, Graduate School of Informatics, METU, Ankara, 06800, Turkey.
| |
Collapse
|
197
|
Leveraging implicit knowledge in neural networks for functional dissection and engineering of proteins. NAT MACH INTELL 2019. [DOI: 10.1038/s42256-019-0049-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
198
|
Nelson W, Zitnik M, Wang B, Leskovec J, Goldenberg A, Sharan R. To Embed or Not: Network Embedding as a Paradigm in Computational Biology. Front Genet 2019; 10:381. [PMID: 31118945 PMCID: PMC6504708 DOI: 10.3389/fgene.2019.00381] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2019] [Accepted: 04/09/2019] [Indexed: 12/20/2022] Open
Abstract
Current technology is producing high throughput biomedical data at an ever-growing rate. A common approach to interpreting such data is through network-based analyses. Since biological networks are notoriously complex and hard to decipher, a growing body of work applies graph embedding techniques to simplify, visualize, and facilitate the analysis of the resulting networks. In this review, we survey traditional and new approaches for graph embedding and compare their application to fundamental problems in network biology with using the networks directly. We consider a broad variety of applications including protein network alignment, community detection, and protein function prediction. We find that in all of these domains both types of approaches are of value and their performance depends on the evaluation measures being used and the goal of the project. In particular, network embedding methods outshine direct methods according to some of those measures and are, thus, an essential tool in bioinformatics research.
Collapse
Affiliation(s)
- Walter Nelson
- Genetics and Genome Biology, SickKids Research Institute, Toronto, ON, Canada
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada
| | - Marinka Zitnik
- Department of Computer Science, Stanford University, Stanford, CA, United States
| | - Bo Wang
- Department of Computer Science, Stanford University, Stanford, CA, United States
- Peter Munk Cardiac Center, University Health Network, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| | - Jure Leskovec
- Department of Computer Science, Stanford University, Stanford, CA, United States
- Chan Zuckerberg Biohub, San Francisco, CA, United States
| | - Anna Goldenberg
- Genetics and Genome Biology, SickKids Research Institute, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
| | - Roded Sharan
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| |
Collapse
|
199
|
Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods 2019; 166:4-21. [PMID: 31022451 DOI: 10.1016/j.ymeth.2019.04.008] [Citation(s) in RCA: 134] [Impact Index Per Article: 26.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2018] [Revised: 03/23/2019] [Accepted: 04/15/2019] [Indexed: 12/13/2022] Open
Abstract
Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the exoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples.
Collapse
|
200
|
From Genotype to Phenotype: Augmenting Deep Learning with Networks and Systems Biology. ACTA ACUST UNITED AC 2019; 15:68-73. [PMID: 31777764 DOI: 10.1016/j.coisb.2019.04.001] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023]
Abstract
Cells, as complex systems, consist of diverse interacting biomolecules arranged in dynamic hierarchical modules. Recent advances in deep learning methods now allow one to encode this rich existing knowledge in the architecture of the learning procedure, thus providing the models with the knowledge that is absent in the training data. By encoding biological networks in the architecture, one can develop flexible deep models that propagate information through the molecular networks to successfully classify cell states. Moreover, this flexibility in the architecture can be harnessed to model the hierarchical structure of real biological systems, efficiently converting gene-level data to pathway-level information with an ultimate impact on cell phenotype. Furthermore, such models could require fewer training samples, are more generalizable across diverse biological contexts, and can make predictions that are more consistent with the current understanding on the inner-working of biological systems.
Collapse
|