1
Capped norm linear discriminant analysis and its applications. Appl Intell 2023. [DOI: 10.1007/s10489-022-04395-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
2
Mahmood T, Choi J, Ryoung Park K. Artificial Intelligence-based Classification of Pollen Grains Using Attention-guided Pollen Features Aggregation Network. Journal of King Saud University - Computer and Information Sciences 2023. [DOI: 10.1016/j.jksuci.2023.01.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2023]
3
Qin X, Zhang L, Liu M, Xu Z, Liu G. ASFold-DNN: Protein Fold Recognition Based on Evolutionary Features With Variable Parameters Using Full Connected Neural Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2712-2722. [PMID: 34133282 DOI: 10.1109/tcbb.2021.3089168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Protein fold recognition contributes to understanding protein function, which greatly helps gene therapy for diseases and the development of new drugs. Researchers have made considerable progress in this direction, but challenges remain on datasets with low sequence similarity. In this study, we propose the ASFold-DNN framework for protein fold recognition. First, four groups of evolutionary features are extracted from the primary structures of proteins, and a preliminary selection of the variable parameter is made for two of the feature groups, ACC_HMM and SXG_HMM. Then several feature selection algorithms are compared, and the best feature selection scheme is obtained by varying their internal threshold values. Finally, multiple hyper-parameters of a fully connected neural network are optimized to construct the best model. The DD, EDD and TG datasets, which have low sequence similarity, are used to evaluate the models constructed by the framework, and the final prediction accuracies are 85.28, 95.00 and 88.84 percent, respectively. Furthermore, the ASTRAL186 and LE datasets are introduced to verify the generalization ability of the proposed framework. Comprehensive experimental results show that the ASFold-DNN framework outperforms state-of-the-art methods for protein fold recognition. The source code and data of ASFold-DNN can be downloaded from https://github.com/Bioinformatics-Laboratory/project/tree/master/ASFold.
4
Villegas-Morcillo A, Gomez AM, Sanchez V. An analysis of protein language model embeddings for fold prediction. Brief Bioinform 2022; 23:6571527. [PMID: 35443054 DOI: 10.1093/bib/bbac142] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 02/01/2022] [Revised: 03/21/2022] [Accepted: 03/28/2022] [Indexed: 11/13/2022] Open
Abstract
The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignments (MSA) as the input source. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at the amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
Affiliation(s)
- Amelia Villegas-Morcillo, Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
- Angel M Gomez, Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
- Victoria Sanchez, Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain
5
Two-dimensional Bhattacharyya bound linear discriminant analysis with its applications. Appl Intell 2021; 52:8793-8809. [PMID: 34764624 PMCID: PMC8568685 DOI: 10.1007/s10489-021-02843-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/12/2021] [Indexed: 11/16/2022]
Abstract
The recently proposed L2-norm linear discriminant analysis criterion based on Bhattacharyya error bound estimation (L2BLDA) is an effective improvement over linear discriminant analysis (LDA) and handles vector input samples. When the inputs are two-dimensional (2D), such as images, converting them to vectors disregards the inherent structure of the image and may lose useful information. In this paper, we propose a novel two-dimensional Bhattacharyya bound linear discriminant analysis (2DBLDA). 2DBLDA maximizes the matrix-based between-class distance, which is measured by the weighted pairwise distances of class means, and minimizes the matrix-based within-class distance. The criterion of 2DBLDA is equivalent to optimizing the upper bound of the Bhattacharyya error. The weighting constant between the between-class and within-class terms is determined by the data involved, which makes the proposed 2DBLDA adaptive. The construction of 2DBLDA avoids the small sample size (SSS) problem, is robust, and can be solved through a simple standard eigenvalue decomposition problem. The experimental results on image recognition and face image reconstruction demonstrate the effectiveness of 2DBLDA.
6
Bankapur S, Patil N. Enhanced Protein Structural Class Prediction Using Effective Feature Modeling and Ensemble of Classifiers. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2409-2419. [PMID: 32149653 DOI: 10.1109/tcbb.2020.2979430] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
Protein Secondary Structural Class (PSSC) information is important for addressing further challenges in protein sequence analysis, such as protein fold recognition, protein tertiary structure prediction, and analysis of protein functions for drug discovery. Identifying PSSC using biological methods is time-consuming and costly. Several computational models have been developed to predict the structural class; however, they generalize poorly. Hence, predicting PSSC from protein sequences remains a difficult task. In this article, we propose an effective, novel and generalized prediction model consisting of feature modeling and an ensemble of classifiers. The proposed feature modeling extracts discriminating information (features) using three techniques: (i) Embedding: features are extracted on the basis of spatial residue arrangements of the sequences using word embedding approaches; (ii) SkipXGram Bi-gram: various sets of skipped bi-gram features are extracted from the sequences; and (iii) General Statistical (GS): features are extracted that cover the global information of structural sequences. The combined sets of features are trained and classified using an ensemble of three classifiers: Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting Machines (GBM). When assessed on five benchmark datasets (high and low sequence similarity), viz. z277, z498, 25PDB, 1189, and FC699, the proposed model reported overall accuracies of 93.55, 97.58, 81.82, 81.11, and 93.93 percent, respectively. The proposed model is further validated on a large-scale updated low similarity (≤ 25%) dataset, where it achieved an overall accuracy of 81.11 percent. The proposed generalized model is robust and consistently outperformed several state-of-the-art models on all five benchmark datasets.
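The skipped bi-gram idea in technique (ii) can be illustrated with a minimal sketch. It assumes plain count-based features computed directly on the residue string; the function names, the 20-letter alphabet constant, and the normalization are illustrative assumptions, and the paper's exact SkipXGram definition may differ.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def skip_bigram_features(sequence, skip):
    """Normalized counts of residue pairs with `skip` residues between them.

    skip=0 gives ordinary adjacent bi-grams. Pairs that contain a
    non-standard residue are counted in the total but not emitted.
    """
    pairs = Counter(
        (sequence[i], sequence[i + skip + 1])
        for i in range(len(sequence) - skip - 1)
    )
    total = sum(pairs.values()) or 1  # avoid division by zero on short input
    return [pairs[(a, b)] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def feature_vector(sequence, max_skip=2):
    """Concatenate the 400-dimensional blocks for skip = 0..max_skip."""
    vec = []
    for skip in range(max_skip + 1):
        vec.extend(skip_bigram_features(sequence, skip))
    return vec
```

Each skip value contributes one block of 400 values, so `feature_vector(seq, max_skip=2)` has length 1200 regardless of sequence length, which is what allows sequences of different lengths to share one classifier input.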
7
Villegas-Morcillo A, Sanchez V, Gomez AM. FoldHSphere: deep hyperspherical embeddings for protein fold recognition. BMC Bioinformatics 2021; 22:490. [PMID: 34641786 PMCID: PMC8507389 DOI: 10.1186/s12859-021-04419-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/12/2021] [Accepted: 09/29/2021] [Indexed: 12/01/2022] Open
Abstract
Background: Current state-of-the-art deep learning approaches for protein fold recognition learn protein embeddings that improve prediction performance at the fold level. However, there still exists a performance gap between the fold level and the (relatively easier) family level, suggesting that it might be possible to learn an embedding space that better represents the protein folds. Results: In this paper, we propose the FoldHSphere method to learn a better fold embedding space through a two-stage training procedure. We first obtain prototype vectors for each fold class that are maximally separated in hyperspherical space. We then train a neural network by minimizing the angular large margin cosine loss to learn protein embeddings clustered around the corresponding hyperspherical fold prototypes. Our network architectures, ResCNN-GRU and ResCNN-BGRU, process the input protein sequences by applying several residual-convolutional blocks followed by a gated recurrent unit-based recurrent layer. Evaluation results on the LINDAHL dataset indicate that the use of our hyperspherical embeddings effectively bridges the performance gap between the family and fold levels. Furthermore, our FoldHSpherePro ensemble method yields an accuracy of 81.3% at the fold level, outperforming all the state-of-the-art methods. Conclusions: Our methodology is efficient in learning discriminative and fold-representative embeddings for the protein domains. The proposed hyperspherical embeddings are effective at identifying the protein fold class by pairwise comparison, even when amino acid sequence similarities are low. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04419-7.
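As a rough illustration of training against fixed hyperspherical prototypes, the sketch below computes a large margin cosine loss in its commonly used CosFace-style form for a single embedding. The `scale` and `margin` values, the function names, and the two-prototype toy setup are assumptions made for the example; the exact loss and hyper-parameters used by FoldHSphere are not given in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def lmcl_loss(embedding, prototypes, target, scale=30.0, margin=0.35):
    """Large margin cosine loss for one sample against fixed class prototypes.

    The margin is subtracted from the target-class cosine only, so the
    embedding must be closer to its own prototype by at least `margin`
    before the loss becomes small.
    """
    logits = []
    for j, proto in enumerate(prototypes):
        cos = cosine(embedding, proto)
        if j == target:
            cos -= margin
        logits.append(scale * cos)
    # Numerically stable log-sum-exp for the softmax normalizer.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[target]


# Toy setup (assumption): two maximally separated prototypes on the unit circle.
prototypes = [[1.0, 0.0], [-1.0, 0.0]]
loss_aligned = lmcl_loss([1.0, 0.0], prototypes, target=0)
loss_opposed = lmcl_loss([1.0, 0.0], prototypes, target=1)
```

In this toy case the embedding aligned with its prototype gets a loss near zero, while the same embedding labeled with the opposite prototype is penalized heavily, which is the clustering pressure the abstract describes.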
Affiliation(s)
- Amelia Villegas-Morcillo, Department of Signal Theory, Telematics and Communications, University of Granada, Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
- Victoria Sanchez, Department of Signal Theory, Telematics and Communications, University of Granada, Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
- Angel M Gomez, Department of Signal Theory, Telematics and Communications, University of Granada, Periodista Daniel Saucedo Aranda, 18071, Granada, Spain
8
Bankapur S, Patil N. An Enhanced Protein Fold Recognition for Low Similarity Datasets Using Convolutional and Skip-Gram Features With Deep Neural Network. IEEE Trans Nanobioscience 2020; 20:42-49. [PMID: 32894720 DOI: 10.1109/tnb.2020.3022456] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
Abstract
Protein fold recognition is one of the important tasks of structural biology, and it helps in addressing further challenges such as predicting protein tertiary structures and their functions. Many machine learning studies have been published on identifying protein folds effectively. However, very few have reported fold recognition accuracy above 80% on benchmark datasets. In this study, an effective set of global and local features is extracted using the proposed Convolutional (Conv) and SkipXGram bi-gram (SXGbg) techniques, and fold recognition is performed using the proposed deep neural network. The proposed model achieved 91.4% fold accuracy on one of the derived low similarity (< 25%) datasets of the latest extended version of SCOPe_2.07. The proposed model is further evaluated on three popular and publicly available benchmark datasets, DD, EDD, and TG, and obtained fold accuracies of 85.9%, 95.8%, and 88.8%, respectively. This work is the first to report fold recognition accuracy above 85% on all the benchmark datasets. The proposed model outperformed the best state-of-the-art models by 5% to 23% on DD, 2% to 19% on EDD, and 3% to 30% on the TG dataset.
9
Książek W, Hammad M, Pławiak P, Acharya UR, Tadeusiewicz R. Development of novel ensemble model using stacking learning and evolutionary computation techniques for automated hepatocellular carcinoma detection. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.08.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/16/2022]
10
Perales-González C, Carbonero-Ruz M, Pérez-Rodríguez J, Becerra-Alonso D, Fernández-Navarro F. Negative correlation learning in the extreme learning machine framework. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-04788-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/09/2023]
11
Demir FB, Tuncer T, Kocamaz AF, Ertam F. A survival classification method for hepatocellular carcinoma patients with chaotic Darcy optimization method based feature selection. Med Hypotheses 2020; 139:109626. [PMID: 32087492 DOI: 10.1016/j.mehy.2020.109626] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/11/2020] [Revised: 02/10/2020] [Accepted: 02/12/2020] [Indexed: 12/18/2022]
Abstract
Surveys are one of the crucial data retrieval methods in the literature. However, surveys often contain missing data and redundant features. Therefore, missing feature completion and feature selection have been widely used for knowledge extraction from surveys. We hypothesize that both problems can be addressed together, and we present a classification method to test this hypothesis. The proposed method consists of missing feature completion with a statistical moment (the average) and feature selection using a novel swarm optimization method. First, an average-based supervised feature completion method is applied to a Hepatocellular Carcinoma (HCC) survey. The HCC survey used consists of 49 features. To select meaningful features, a chaotic Darcy optimization-based feature selection method is presented, and this method selects the 31 most discriminative features of the completed HCC dataset. An accuracy of 0.9879 was obtained using the proposed chaotic Darcy optimization-based HCC survival classification method.
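One plausible reading of "average-based supervised feature completion" is to fill each missing value with the mean of that feature among samples of the same class. The sketch below follows that reading; the function name, the use of `None` for missing entries, and the toy data are assumptions for illustration, not details taken from the paper.

```python
def class_mean_impute(rows, labels):
    """Fill None entries with the per-class mean of the same feature.

    rows:   list of equal-length feature lists, with None marking missing data
    labels: class label for each row (this is what makes the step supervised)
    """
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                key = (label, j)
                sums[key] = sums.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1

    completed = []
    for row, label in zip(rows, labels):
        new_row = []
        for j, value in enumerate(row):
            if value is None:
                key = (label, j)
                if key not in counts:
                    # No observed value for this feature in this class.
                    raise ValueError(f"feature {j} has no data in class {label!r}")
                value = sums[key] / counts[key]
            new_row.append(value)
        completed.append(new_row)
    return completed


# Toy data (hypothetical): two features, two classes.
rows = [[1.0, None], [3.0, 4.0], [None, 10.0], [5.0, None]]
labels = [1, 1, 0, 0]
completed = class_mean_impute(rows, labels)
```

Because the class label is used during completion, this step has to be fitted on training data only when the result feeds a classifier; otherwise label information leaks into the test features.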
Affiliation(s)
- Fahrettin Burak Demir, Department of Computer Sciences, Vahap Kucuk Vocational School, Malatya Turgut Ozal University, Malatya, Turkey
- Turker Tuncer, Department of Digital Forensics Engineering, Technology Faculty, Firat University, Elazig, Turkey
- Adnan Fatih Kocamaz, Department of Computer Engineering, Engineering Faculty, Inonu University, Malatya, Turkey
- Fatih Ertam, Department of Digital Forensics Engineering, Technology Faculty, Firat University, Elazig, Turkey
12
Deep Learning in the Biomedical Applications: Recent and Future Status. Applied Sciences-Basel 2019. [DOI: 10.3390/app9081526] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Indexed: 02/06/2023]
Abstract
Deep neural networks currently represent the most effective machine learning technology in the biomedical domain. In this domain, the areas of interest are the Omics (the study of the genome, i.e. genomics, and of proteins, i.e. transcriptomics, proteomics, and metabolomics), bioimaging (the study of biological cells and tissue), medical imaging (the study of human organs by creating visual representations), BBMI (the study of the brain and body machine interface) and public and medical health management (PmHM). This paper reviews the major deep learning concepts pertinent to such biomedical applications. Concise overviews are provided for the Omics and the BBMI. We end our analysis with a critical discussion, interpretation and relevant open challenges.