1
Ru X, Ye X, Sakurai T, Zou Q. Application of learning to rank in bioinformatics tasks. Brief Bioinform 2021; 22:6102666. [PMID: 33454758 DOI: 10.1093/bib/bbaa394] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/03/2020] [Revised: 11/09/2020] [Accepted: 11/24/2020] [Indexed: 12/17/2022]
Abstract
Over the past decades, learning to rank (LTR) algorithms have been gradually applied to bioinformatics. Such methods have shown significant advantages in multiple research tasks in this field. Therefore, it is necessary to summarize and discuss the application of these algorithms so that they can be used conveniently and contribute to bioinformatics. In this paper, the characteristics of LTR algorithms and their strengths over other types of algorithms are analyzed from multiple perspectives of their application in bioinformatics. Finally, the paper discusses the shortcomings of LTR algorithms, ways to make better use of them, and some open problems that currently exist.
Affiliation(s)
- Xiucai Ye
- Department of Computer Science and Center for Artificial Intelligence Research (C-AIR), University of Tsukuba
- Quan Zou
- University of Electronic Science and Technology of China
2
Postic G, Janel N, Tufféry P, Moroy G. An information gain-based approach for evaluating protein structure models. Comput Struct Biotechnol J 2020; 18:2228-2236. [PMID: 32837711 PMCID: PMC7431362 DOI: 10.1016/j.csbj.2020.08.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/01/2020] [Revised: 08/06/2020] [Accepted: 08/07/2020] [Indexed: 12/23/2022]
Abstract
For three decades now, knowledge-based scoring functions that operate through the "potential of mean force" (PMF) approach have continuously proven useful for studying protein structures. Although these statistical potentials are not to be confused with their physics-based counterparts of the same name (i.e. PMFs obtained by molecular dynamics simulations), their particular success in assessing the native-like character of protein structure predictions has led authors to consider the computed scores as approximations of the free energy. However, this physical justification has been a matter of controversy since the beginning. Alternative interpretations based on Bayes' theorem have been proposed, but the misleading formalism that invokes the inverse Boltzmann law remains recurrent in the literature. In this article, we present a conceptually new method for ranking protein structure models by quality, which is (i) independent of any physics-based explanation and (ii) relevant to statistics and to a general definition of information gain. The theoretical development described in this study provides new insights into how statistical PMFs work, in comparison with our approach. To prove the concept, we have built interatomic distance-dependent scoring functions, based on the former and new equations, and compared their performance on an independent benchmark of 60,000 protein structures. The results demonstrate that our new formalism outperforms statistical PMFs in evaluating the quality of protein structural decoys. Therefore, this original type of score offers a possibility to improve the success of statistical PMFs in the various fields of structural biology where they are applied. The open-source code is available for download at https://gitlab.rpbs.univ-paris-diderot.fr/src/ig-score.
Affiliation(s)
- Guillaume Postic
- Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, F-75013 Paris, France; Université de Paris, BFA, UMR 8251, CNRS, F-75013 Paris, France; Institut Français de Bioinformatique (IFB), UMS 3601-CNRS, Université Paris-Saclay, Orsay, France; Ressource Parisienne en Bioinformatique Structurale (RPBS), Paris, France
- Nathalie Janel
- Université de Paris, BFA, UMR 8251, CNRS, F-75013 Paris, France
- Pierre Tufféry
- Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, F-75013 Paris, France; Ressource Parisienne en Bioinformatique Structurale (RPBS), Paris, France
- Gautier Moroy
- Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, F-75013 Paris, France
3
Gadiyaram V, Vishveshwara S, Vishveshwara S. From Quantum Chemistry to Networks in Biology: A Graph Spectral Approach to Protein Structure Analyses. J Chem Inf Model 2019; 59:1715-1727. [DOI: 10.1021/acs.jcim.9b00002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/29/2022]
Affiliation(s)
- Vasundhara Gadiyaram
- IISc Mathematics Initiative (IMI), Indian Institute of Science, C V Raman Road, Bengaluru, Karnataka 560012, India
- Smitha Vishveshwara
- Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801-3080, United States
- Saraswathi Vishveshwara
- Molecular Biophysics Unit, Indian Institute of Science, C V Raman Road, Bengaluru, Karnataka 560012, India
4
Chikkerur J, Samanta AK, Dhali A, Kolte AP, Roy S, Maria P. In Silico evaluation and identification of fungi capable of producing endo-inulinase enzyme. PLoS One 2018; 13:e0200607. [PMID: 30001376 PMCID: PMC6042768 DOI: 10.1371/journal.pone.0200607] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/08/2017] [Accepted: 06/29/2018] [Indexed: 11/28/2022]
Abstract
The enzyme endo-inulinase hydrolyzes inulin to short-chain fructooligosaccharides (FOS), which are potential prebiotics with many health-promoting benefits. Although the raw materials for inulin production are inexpensive and readily available, commercial production of FOS from inulin is limited by the inadequate availability of the enzyme source. This study aimed to identify fungi capable of producing endo-inulinase based on the in silico analysis of proteins retrieved from the non-redundant protein sequence database. The endo-inulinase of Aspergillus ficuum was used as the reference sequence. Amino acid sequences with >90% sequence coverage, belonging to different fungi, were retrieved from the database and used for constructing three-dimensional (3D) protein models using SWISS-MODEL and Bhageerath-H. The 3D models of comparable quality to that of the reference endo-inulinase were selected based on QMEAN Z score. The selected models were evaluated and validated for different structural and functional qualities using Pro-Q, ProSA, PSN-QA, VERIFY-3D, PROCHECK, the ProTSAV metaserver, STRAP, molecular docking, and molecular dynamics simulation analyses. A total of 230 proteins belonging to 53 fungal species exhibited sequence coverage >90%. Sixty-one protein sequences with >60% sequence identity were modeled as endo-inulinase with higher QMEAN Z scores. The evaluation and validation of these 61 selected models for different structural and functional qualities revealed that 60 models belonging to 22 fungal species exhibited a native-like structure and the same unique motifs and residues as the reference endo-inulinase. Further, these models also exhibited a similar kind of interaction between the active site around the conserved glutamate residue and the substrate as the reference endo-inulinase. In conclusion, based on the current study, 22 fungal species could be identified as endo-inulinase producers. Nevertheless, further biological assessment of their capability for producing endo-inulinase is needed if they are to be used for commercial endo-inulinase production for application in the FOS industry.
Affiliation(s)
- Jayaram Chikkerur
- Animal Nutrition Division, ICAR-National Institute of Animal Nutrition and Physiology, Bengaluru, Karnataka, India
- Department of Microbiology, School of Sciences, Jain University, Bengaluru, Karnataka, India
- Ashis Kumar Samanta
- Animal Nutrition Division, ICAR-National Institute of Animal Nutrition and Physiology, Bengaluru, Karnataka, India
- Arindam Dhali
- Bioenergetics and Environmental Sciences Division, ICAR-National Institute of Animal Nutrition and Physiology, Bengaluru, Karnataka, India
- Atul Purushottam Kolte
- Animal Nutrition Division, ICAR-National Institute of Animal Nutrition and Physiology, Bengaluru, Karnataka, India
- Sohini Roy
- Animal Nutrition Division, ICAR-National Institute of Animal Nutrition and Physiology, Bengaluru, Karnataka, India
- Department of Microbiology, School of Sciences, Jain University, Bengaluru, Karnataka, India
- Pratheepa Maria
- Division of Genomic Resources, ICAR-National Bureau of Agricultural Insect Resources, Bengaluru, Karnataka, India
5
Jing X, Dong Q. MQAPRank: improved global protein model quality assessment by learning-to-rank. BMC Bioinformatics 2017; 18:275. [PMID: 28545390 PMCID: PMC5445322 DOI: 10.1186/s12859-017-1691-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 02/12/2017] [Accepted: 05/16/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Protein structure prediction has made considerable progress during the last few decades, and a greater number of models can now be predicted for a given sequence. Consequently, assessing the quality of predicted protein models is one of the key components of successful protein structure prediction. Over the past years, a number of methods have been developed to address this issue; they can be roughly divided into three categories: single methods, quasi-single methods and clustering (or consensus) methods. Although these methods have achieved much success at different levels, accurate protein model quality assessment is still an open problem.
RESULTS: Here, we present MQAPRank, a global protein model quality assessment program based on learning-to-rank. MQAPRank first sorts the decoy models with a single method based on a learning-to-rank algorithm to indicate their relative qualities for the target protein. It then takes the first five models as references to predict the qualities of the other models by using the average GDT_TS scores between the reference models and the other models. Benchmarked on the CASP11 and 3DRobot datasets, MQAPRank achieved better performance than other leading protein model quality assessment methods. Recently, MQAPRank participated in CASP12 under the group name FDUBio and achieved state-of-the-art performance.
CONCLUSIONS: MQAPRank provides a convenient and powerful tool for protein model quality assessment with state-of-the-art performance; it is useful for protein structure prediction and model quality assessment.
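The two-stage procedure described in this abstract (rank first, then score against the top-ranked references) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rank_then_consensus`, the dictionary inputs, and the toy numbers are all hypothetical. The initial scores stand in for the output of a learning-to-rank model, and the pairwise GDT_TS values are assumed to be precomputed by a structure comparison tool. The abstract does not say how the reference models themselves are scored, so here each reference is scored against the remaining references.

```python
from typing import Dict, List


def rank_then_consensus(
    initial_scores: Dict[str, float],
    pairwise_gdt_ts: Dict[str, Dict[str, float]],
    n_refs: int = 5,
) -> Dict[str, float]:
    """Score each model by its mean GDT_TS to the top-ranked reference models.

    initial_scores: hypothetical stage-one scores (higher means better),
        standing in for a learning-to-rank model's output.
    pairwise_gdt_ts: assumed precomputed GDT_TS between every pair of models.
    n_refs: number of top-ranked models used as references.
    """
    # Stage 1: sort models by the initial score, best first.
    ranked: List[str] = sorted(initial_scores, key=initial_scores.get, reverse=True)
    refs = ranked[:n_refs]

    # Stage 2: average similarity of each model to the reference set.
    predicted: Dict[str, float] = {}
    for model in ranked:
        # Assumption: a reference model is not compared with itself.
        others = [ref for ref in refs if ref != model]
        if not others:
            # Only one model available; fall back to its initial score.
            predicted[model] = initial_scores[model]
            continue
        total = sum(pairwise_gdt_ts[model][ref] for ref in others)
        predicted[model] = total / len(others)
    return predicted


# Toy data (made up for illustration only).
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
toy_gdt = {
    "a": {"a": 1.0, "b": 0.8, "c": 0.2},
    "b": {"a": 0.8, "b": 1.0, "c": 0.4},
    "c": {"a": 0.2, "b": 0.4, "c": 1.0},
}
print(rank_then_consensus(toy_scores, toy_gdt, n_refs=2))
```

With two references, the low-ranked model "c" receives the mean of its similarities to "a" and "b", so a model that resembles the top-ranked models is scored higher regardless of its own initial score.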
Affiliation(s)
- Xiaoyang Jing
- School of Computer Science, Fudan University, Shanghai, 200433 People’s Republic of China
- Qiwen Dong
- School of Data Science and Engineering, East China Normal University, Shanghai, 200062 People’s Republic of China
6
Kaushik R, Jayaram B. Structural difficulty index: a reliable measure for modelability of protein tertiary structures. Protein Eng Des Sel 2016; 29:391-7. [PMID: 27334454 DOI: 10.1093/protein/gzw025] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 03/13/2016] [Accepted: 05/27/2016] [Indexed: 11/13/2022]
Abstract
The success in protein tertiary-structure prediction is considered to be a function of coverage and similarity/identity of their sequences with suitable templates in the structural databases. However, this measure of modelability of a protein sequence into its structure may be misleading. Addressing this limitation, we propose here a 'structural difficulty (SD)' index, which is derived from secondary structures, homology and physicochemical features of protein sequences. The SD index reflects the capability of predicting accurate structures and helps to assess the potential for developing proteome-level structural databases for various organisms with some of the best methodologies currently available. For instance, the plausibility of populating the structural database of the human proteome with reliable-quality structures under 3 Å root mean square deviation from the corresponding natives is found to be ∼37% of a total of 11 084 manually curated soluble proteins and ∼64% for all annotated and reviewed unique soluble proteins (344 661 sequences) of UniProtKB. Also, for 77 human pathogenic viruses comprising 2365 globular viral proteins, of which only 162 structures are solved experimentally, the SD index scores 1336 proteins in the modelable zone. Availability of reliable protein structures may prove a crucial aid in developing species-wise structural proteomic databases for accelerating function annotation and for drug development endeavors.
Affiliation(s)
- Rahul Kaushik
- Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India; Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
- B Jayaram
- Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India; Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India; Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
7
ProTSAV: A protein tertiary structure analysis and validation server. Biochimica et Biophysica Acta - Proteins and Proteomics 2015; 1864:11-9. [PMID: 26478257 DOI: 10.1016/j.bbapap.2015.10.004] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Received: 07/15/2015] [Revised: 09/26/2015] [Accepted: 10/14/2015] [Indexed: 01/06/2023]
Abstract
Quality assessment of predicted model structures of proteins is as important as the protein tertiary structure prediction. A highly efficient quality assessment of predicted model structures directs further research on function. Here we present a new server, ProTSAV, capable of evaluating predicted model structures based on some popular online servers and standalone tools. ProTSAV furnishes the user with a single quality score in the case of an individual protein structure, along with a graphical representation and ranking in the case of multiple protein structure assessment. The server is validated on ~64,446 protein structures including experimental structures from RCSB and predicted model structures for CASP targets and from public decoy sets. ProTSAV succeeds in predicting the quality of protein structures with a specificity of 100% and a sensitivity of 98% on experimentally solved structures and achieves a specificity of 88% and a sensitivity of 91% on predicted protein structures of CASP11 targets under 2 Å. The server overcomes the limitations of any single server/method and is seen to be robust in helping in quality assessment. ProTSAV is freely available at http://www.scfbio-iitd.res.in/software/proteomics/protsav.jsp.
8
An empirical energy function for structural assessment of protein transmembrane domains. Biochimie 2015; 115:155-61. [DOI: 10.1016/j.biochi.2015.05.018] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 01/19/2015] [Accepted: 05/21/2015] [Indexed: 11/19/2022]