1
Lilhore UK, Simiaya S, Alhussein M, Faujdar N, Dalal S, Aurangzeb K. Optimizing protein sequence classification: integrating deep learning models with Bayesian optimization for enhanced biological analysis. BMC Med Inform Decis Mak 2024; 24:236. [PMID: 39192227 DOI: 10.1186/s12911-024-02631-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2024] [Accepted: 08/07/2024] [Indexed: 08/29/2024] Open
Abstract
Improving the accuracy of protein sequence classification is important for advancing biological analysis and medical research. This study presents ProtICNN-BiLSTM, a model that combines attention-based Improved Convolutional Neural Networks (ICNN) with Bidirectional Long Short-Term Memory (BiLSTM) units. Our main goal is to improve the accuracy of protein sequence classification by tuning the model with Bayesian optimization. ProtICNN-BiLSTM combines CNN and BiLSTM architectures to capture both local and global protein sequence dependencies. The ICNN component uses convolutional operations to identify local patterns, while the BiLSTM component captures long-range associations by reading the sequence in both the forward and backward directions. Bayesian optimization tunes the model hyperparameters for efficiency and robustness. The model was validated on PDB-14,189 and other protein datasets. We found that ProtICNN-BiLSTM outperforms traditional classification models, which we attribute to the fine-tuning by Bayesian optimization and the integration of local and global sequence information. The model could serve as a tool for computational bioinformatics and for medical and biological research.
Affiliation(s)
- Umesh Kumar Lilhore
  - School of Computing Science and Engineering, Galgotias University, Greater Noida, UP, India
- Sarita Simiaya
  - School of Computing Science and Engineering, Galgotias University, Greater Noida, UP, India
- Musaed Alhussein
  - Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, P. O. Box 51178, Riyadh, 11543, Saudi Arabia
- Neetu Faujdar
  - Department of Computer Engineering and Applications, GLA University, 281406, UP, Mathura, India
- Khursheed Aurangzeb
  - Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, P. O. Box 51178, Riyadh, 11543, Saudi Arabia
2
Hasanzadeh A, Hamblin MR, Kiani J, Noori H, Hardie JM, Karimi M, Shafiee H. Could artificial intelligence revolutionize the development of nanovectors for gene therapy and mRNA vaccines? Nano Today 2022; 47:101665. [PMID: 37034382 PMCID: PMC10081506 DOI: 10.1016/j.nantod.2022.101665] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 06/19/2023]
Abstract
Gene therapy enables the introduction of nucleic acids like DNA and RNA into host cells, and is expected to revolutionize the treatment of a wide range of diseases. This growth has been further accelerated by the discovery of CRISPR/Cas technology, which allows accurate genomic editing in a broad range of cells and organisms in vitro and in vivo. Despite many advances in gene delivery and the development of various viral and non-viral gene delivery vectors, the lack of highly efficient non-viral systems with low cellular toxicity remains a challenge. The application of cutting-edge technologies such as artificial intelligence (AI) has great potential to find new paradigms to solve this issue. Herein, we review AI and its major subfields including machine learning (ML), neural networks (NNs), expert systems, deep learning (DL), computer vision and robotics. We discuss the potential of AI-based models and algorithms in the design of targeted gene delivery vehicles capable of crossing extracellular and intracellular barriers by viral mimicry strategies. We finally discuss the role of AI in improving the function of CRISPR/Cas systems, developing novel nanobots, and mRNA vaccine carriers.
Affiliation(s)
- Akbar Hasanzadeh
  - Cellular and Molecular Research Center, Iran University of Medical Sciences, Tehran 1449614535, Iran
  - Department of Medical Nanotechnology, Faculty of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran 1449614535, Iran
- Michael R Hamblin
  - Laser Research Centre, Faculty of Health Science, University of Johannesburg, Doornfontein 2028, South Africa
  - Radiation Biology Research Center, Iran University of Medical Sciences, Tehran, Iran
- Jafar Kiani
  - Oncopathology Research Center, Iran University of Medical Sciences, Tehran 1449614535, Iran
  - Department of Molecular Medicine, Faculty of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran, Iran
- Hamid Noori
  - Cellular and Molecular Research Center, Iran University of Medical Sciences, Tehran 1449614535, Iran
  - Department of Medical Nanotechnology, Faculty of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran 1449614535, Iran
- Joseph M. Hardie
  - Division of Engineering in Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02139 USA
- Mahdi Karimi
  - Cellular and Molecular Research Center, Iran University of Medical Sciences, Tehran 1449614535, Iran
  - Department of Medical Nanotechnology, Faculty of Advanced Technologies in Medicine, Iran University of Medical Sciences, Tehran 1449614535, Iran
  - Oncopathology Research Center, Iran University of Medical Sciences, Tehran 1449614535, Iran
  - Research Center for Science and Technology in Medicine, Tehran University of Medical Sciences, Tehran 141556559, Iran
  - Applied Biotechnology Research Centre, Tehran Medical Science, Islamic Azad University, Tehran 1584743311, Iran
- Hadi Shafiee
  - Division of Engineering in Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02139 USA
3
Sikander R, Arif M, Ghulam A, Worachartcheewan A, Thafar MA, Habib S. Identification of the ubiquitin-proteasome pathway domain by hyperparameter optimization based on a 2D convolutional neural network. Front Genet 2022; 13:851688. [PMID: 35937990 PMCID: PMC9355632 DOI: 10.3389/fgene.2022.851688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 06/29/2022] [Indexed: 11/13/2022] Open
Abstract
The major mechanism of proteolysis in the cytosol and nucleus is the ubiquitin-proteasome pathway (UPP). The highly controlled UPP affects a wide range of cellular processes and substrates, and flaws in the system can lead to the pathogenesis of a number of serious human diseases. Knowledge about UPPs provides useful hints for understanding cellular processes and for drug discovery. The exponential growth in next-generation sequencing wet lab approaches has accelerated the accumulation of unannotated data in online databases, making UPP characterization and analysis more challenging. Thus, computational methods are used as an alternative for fast and accurate identification of UPPs. To this end, we developed a novel deep learning-based predictor named "2DCNN-UPP" for identifying UPPs with a low error rate. The proposed method uses a two-dimensional convolutional neural network (2D-CNN) with dipeptide deviation features. To avoid overfitting, a genetic algorithm is employed to select the optimal features. Finally, the optimized attribute set is fed as input to the 2D-CNN learning engine for building the model. Empirical results from a 10-fold cross-validation test show that the proposed predictor achieves higher overall accuracy and AUC (ROC) than other state-of-the-art methods for UPP classification. The model was trained with 10-fold cross-validation and then evaluated on an independent test set. As experimentally validated ubiquitination sites emerge, a proteomics-based predictor of ubiquitination needs to be devised. We also evaluated the generalization power of the trained model on the independent test and obtained an accuracy of 0.862, a sensitivity of 0.921, a specificity of 0.803, and a Matthews correlation coefficient (MCC) of 0.730. Four approaches were applied to the sequences, and the physical properties were calculated in combination. With 10-fold cross-validation, 2DCNN-UPP obtained an AUC (ROC) value of 0.862. We analyzed the relationship between the predicted scores of UPP and non-UPP proteins. Finally, this research could support large-scale analysis of the relationship between UPP proteins and non-UPP proteins in particular, and of other protein problems in general, and might improve computational biological research. The latest features in our model framework, together with Dipeptide Deviation from Expected Mean (DDE)-based protein structure features, could therefore be used for the prediction of protein structure, functions, and different molecules, such as DNA and RNA.
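The abstract names Dipeptide Deviation from Expected Mean (DDE) features without defining them. The following is a minimal sketch of the commonly cited DDE formulation, assuming the standard genetic code with 61 sense codons. The function name `dde_features` and the choice to skip dipeptides containing non-standard residues are illustrative assumptions, not details from the paper; the genetic-algorithm feature selection and the 2D-CNN are not shown.

```python
import math
from itertools import product

# Number of codons encoding each amino acid in the standard genetic code (sums to 61).
CODONS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}
AMINO_ACIDS = sorted(CODONS)
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 entries


def dde_features(sequence):
    """Return a 400-value DDE vector for one protein sequence (hypothetical helper)."""
    seq = sequence.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")

    # Count each overlapping dipeptide; pairs with non-standard residues are skipped.
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1

    features = []
    for dipeptide in DIPEPTIDES:
        first, second = dipeptide
        observed = counts[dipeptide] / n_pairs                      # dipeptide composition
        expected = (CODONS[first] / 61) * (CODONS[second] / 61)     # theoretical mean
        variance = expected * (1 - expected) / n_pairs              # theoretical variance
        features.append((observed - expected) / math.sqrt(variance))
    return features
```

Each value is a z-score-like deviation: positive when a dipeptide occurs more often than codon usage alone would suggest, negative when it occurs less often.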
Affiliation(s)
- Rahu Sikander
  - School of Computer Science and Technology, Xidian University, Xi’an, China
- Muhammad Arif
  - Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
- Ali Ghulam
  - Computerization and Network Section, Sindh Agriculture University, Tando Jam, Pakistan
- Apilak Worachartcheewan
  - Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
- Maha A. Thafar
  - Department of Computer Science, College of Computer and Information Technology, Taif University, Taif, Saudi Arabia
- Shabana Habib
  - Department of Information Technology, College of Computer, Qassim University, Buraydah, Saudi Arabia
4
Nguyen TTD, Chen S, Ho QT, Ou YY. Using multiple convolutional window scanning of convolutional neural network for an efficient prediction of ATP-binding sites in transport proteins. Proteins 2022; 90:1486-1492. [PMID: 35246878 DOI: 10.1002/prot.26329] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/04/2021] [Revised: 02/23/2022] [Accepted: 02/25/2022] [Indexed: 12/31/2022]
Abstract
Protein multiple sequence alignment information has long been an important feature for inferring the functions of proteins from related sequences with known functions. It is also one of the underlying ideas of AlphaFold 2, a breakthrough model for predicting the three-dimensional structures of proteins from their primary sequences. Our study used protein multiple sequence alignment information in the form of position-specific scoring matrices as input. We also refined the use of a convolutional neural network, a well-known deep-learning architecture with impressive results on image and image-like data. Specifically, we revisited the prediction of adenosine triphosphate (ATP)-binding sites with more efficient convolutional neural networks. We applied convolutional filters with multiple scanning window sizes to position-specific scoring matrices to extract as much useful information as possible. Furthermore, only the most specific motifs are retained at each feature map output through the one-max pooling layer before going to the next layer. We assumed that this would help retain the most conserved motifs, which are discriminative for prediction. Our experimental results show that a convolutional neural network with relatively few convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining different hyper-parameters. Our models were superior to the traditional use of convolutional neural networks on the same datasets as well as to other machine-learning classification algorithms.
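The combination of multi-window scanning and one-max pooling described above can be sketched in plain Python. This is only an illustration under stated assumptions: the window sizes (3, 5, 7), the four filters per window, the ReLU clipping, and the random weights standing in for learned filters are all hypothetical, and the function names are not from the paper.

```python
import random


def conv_one_max(pssm, kernel):
    """Slide one kernel (window x columns) along the sequence axis of the PSSM
    and keep only the largest ReLU-clipped response (one-max pooling)."""
    window = len(kernel)
    best = 0.0  # ReLU floor: negative responses are clipped to zero
    for start in range(len(pssm) - window + 1):
        response = sum(
            pssm[start + i][j] * kernel[i][j]
            for i in range(window)
            for j in range(len(kernel[i]))
        )
        best = max(best, response)
    return best


def multi_window_features(pssm, kernels_by_window):
    """Concatenate the one-max-pooled response of every kernel of every window size."""
    return [
        conv_one_max(pssm, kernel)
        for window in sorted(kernels_by_window)
        for kernel in kernels_by_window[window]
    ]


# Toy input: a 30-residue PSSM with 20 columns, filled with random scores.
rng = random.Random(0)
pssm = [[rng.uniform(-1, 1) for _ in range(20)] for _ in range(30)]

# Four random kernels for each assumed window size.
kernels = {
    window: [
        [[rng.uniform(-1, 1) for _ in range(20)] for _ in range(window)]
        for _ in range(4)
    ]
    for window in (3, 5, 7)
}

features = multi_window_features(pssm, kernels)  # one value per kernel, 12 in total
```

Because each kernel contributes a single pooled value, the feature vector has a fixed length regardless of protein length, which is one reason one-max pooling suits variable-length sequences.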
Affiliation(s)
- Syun Chen
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Quang-Thai Ho
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Yu-Yen Ou
  - Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
5
Xiao X, Shao YT, Luo ZT, Qiu WR. m5C-HPromoter: An Ensemble Deep Learning Predictor for Identifying 5-methylcytosine Sites in Human Promoters. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220330150259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Aims:
This paper aims to identify 5-methylcytosine sites in human promoters.
Background:
Aberrant DNA methylation patterns are often associated with tumor development: hypermethylation inhibits the expression of tumor suppressor genes, and hypomethylation stimulates the expression of certain oncogenes. Most DNA methylation occurs on the CpG islands of gene promoter regions.
Objective:
Therefore, a comprehensive view of the methylation status of the promoter regions of human genes is extremely important for understanding cancer pathogenesis and the function of post-transcriptional modification.
Method:
This paper constructed three human promoter methylation datasets, totaling 3 million sample sequences, for small cell lung cancer, non-small cell lung cancer, and hepatocellular carcinoma from the Cancer Cell Line Encyclopedia (CCLE) database. Frequency-based one-hot encoding was used to encode the sample sequences, and an innovative stacking-based ensemble deep learning classifier was applied to establish the m5C-HPromoter predictor.
Result:
Averaged over 10 runs of 5-fold cross-validation, m5C-HPromoter obtained Accuracy (Acc) = 0.9270, Matthews correlation coefficient (MCC) = 0.7234, Sensitivity (Sn) = 0.9123, and Specificity (Sp) = 0.9290.
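The four metrics reported in the result (Acc, MCC, Sn, Sp) follow from the counts of a binary confusion matrix. A minimal sketch using their standard definitions is given below; the function name and the example counts are hypothetical and are not data from the paper. It assumes both classes are present, so the sensitivity and specificity denominators are non-zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Acc, Sn, Sp, and MCC from binary confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts for illustration only.
example = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

Unlike accuracy, MCC stays informative on imbalanced datasets, which is presumably why it is reported alongside Acc, Sn, and Sp.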
Affiliation(s)
- Xuan Xiao
  - Department of Computer, Jing-De-Zhen Ceramic Institute, 333046, Jing-De-Zhen, China
- Yu-Tao Shao
  - Department of Computer, Jing-De-Zhen Ceramic Institute, 333046, Jing-De-Zhen, China
- Zhen-Tao Luo
  - Department of Computer, Jing-De-Zhen Ceramic Institute, 333046, Jing-De-Zhen, China
- Wang-Ren Qiu
  - Department of Computer, Jing-De-Zhen Ceramic Institute, 333046, Jing-De-Zhen, China
6
Nguyen TTD, Ho QT, Tarn YC, Ou YY. MFPS_CNN: Multi-filter pattern scanning from position-specific scoring matrix with convolutional neural network for efficient prediction of ion transporters. Mol Inform 2022; 41:e2100271. [PMID: 35322557 DOI: 10.1002/minf.202100271] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/09/2021] [Accepted: 03/23/2022] [Indexed: 11/08/2022]
Abstract
In cellular transportation mechanisms, the movement of ions across the cell membrane and its proper control are important for cells, especially for life processes. Ion transporters/pumps and ion channel proteins work as border guards controlling the incessant traffic of ions across cell membranes. We revisited the classification of transporters and ion channels from membrane proteins with a more efficient deep learning approach. Specifically, we applied multi-window scanning filters of convolutional neural networks to almost full-length position-specific scoring matrices to extract useful information. In this way, we were able to retain important evolutionary information of the proteins. Our experimental results show that a convolutional neural network with a minimum number of convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining different techniques for handling data imbalance and different protein encoding methods. We also showed that our models were superior to traditional deep learning approaches on the same datasets as well as to other machine learning classification algorithms.
7
Nguyen TTD, Ho QT, Le NQK, Phan VD, Ou YY. Use Chou's 5-Steps Rule With Different Word Embedding Types to Boost Performance of Electron Transport Protein Prediction Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1235-1244. [PMID: 32750894 DOI: 10.1109/tcbb.2020.3010975] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
Living organisms receive necessary energy substances directly from cellular respiration. The completion of electron storage and transportation requires the process of cellular respiration with the aid of electron transport chains. Therefore, deciphering electron transport proteins is needed. Identifying these proteins with high performance depends directly on the choice of feature extraction method and machine learning algorithm. In this study, protein sequences were treated as natural language sentences composed of words. The proposed word embedding-based feature sets, built on word embedding modulation and protein motif frequencies, were useful for feature selection. Five word embedding types and a variety of conjoint features were examined for this feature selection. The support vector machine algorithm was then employed to perform classification. In 5-fold cross-validation, the average accuracy, specificity, sensitivity, and MCC all surpass 0.95. In the independent test, these metrics are 96.82, 97.16, 95.76 percent, and 0.9, respectively. Compared to state-of-the-art predictors, the proposed method performs better on all metrics, indicating its effectiveness in identifying electron transport proteins. Furthermore, this study reveals insights about the applicability of various word embeddings for understanding surveyed sequences.
8
Sikander R, Wang Y, Ghulam A, Wu X. Identification of Enzymes-specific Protein Domain Based on DDE, and Convolutional Neural Network. Front Genet 2021; 12:759384. [PMID: 34917128 PMCID: PMC8670239 DOI: 10.3389/fgene.2021.759384] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/16/2021] [Accepted: 10/25/2021] [Indexed: 11/21/2022] Open
Abstract
Predicting the protein sequence information of enzymes and non-enzymes is an important but very challenging task. Existing methods use protein geometric structures only or protein sequences alone to predict enzymatic functions. Thus, their prediction results are unsatisfactory. In this paper, we propose a novel approach for predicting the amino acid sequences of enzymes and non-enzymes via a Convolutional Neural Network (CNN). In the CNN, the roles of enzymes are predicted from multiple sides of biological information, including information on sequences and structures. We propose the use of two-dimensional data via 2DCNN to predict the proteins of enzymes and non-enzymes by using the same fivefold cross-validation function. We also use an independent dataset to test the performance of our model, and the results demonstrate that we are able to solve the overfitting problem. We used the CNN model proposed herein to demonstrate the superiority of our model for classifying an entire set of filters, such as 32, 64, and 128 parameters, with the fivefold validation test set as the independent classification. Via the Dipeptide Deviation from Expected Mean (DDE) matrix, mutation information is extracted from amino acid sequences, and structural information with the distance and angle of amino acids is conveyed. The derived feature maps are then encoded in DDE exploitation. The results on the independent datasets are then compared with two other methods, namely, GRU and XGBOOST. All analyses were conducted using 32, 64, and 128 filters on our proposed CNN method. The cross-validation datasets achieved an accuracy score of 0.8762%, whereas the accuracy of independent datasets was 0.7621%. With fivefold cross-validation, the ROC AUC score achieved was 0.95%. The performance of our model and that of other models in terms of sensitivity (0.9028%) and specificity (0.8497%) was compared. The overall accuracy of our model was 0.9133% compared with 0.8310% for the other model.
Affiliation(s)
- Rahu Sikander
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Yuping Wang
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Ali Ghulam
  - Computerization and Network Section, Sindh Agriculture University, Tando Jam, Pakistan
- Xianjuan Wu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
9
Zhao Q, Ma J, Wang Y, Xie F, Lv Z, Xu Y, Shi H, Han K. Mul-SNO: A novel prediction tool for S-nitrosylation sites based on deep learning methods. IEEE J Biomed Health Inform 2021; 26:2379-2387. [PMID: 34762593 DOI: 10.1109/jbhi.2021.3123503] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/09/2022]
Abstract
Protein S-nitrosylation (SNO) is one of the most important post-translational modifications and is formed by the covalent modification of cysteine residues by nitric oxide. Extensive studies have shown that SNO plays a pivotal role in the plant immune response and in treating various major human diseases. In recent years, SNO sites have become a hot research topic. Traditional biochemical methods for SNO site identification are time-consuming and costly. In this study, we developed an economical and efficient SNO site prediction tool named Mul-SNO. Mul-SNO ensembles two popular and powerful deep learning models, bidirectional long short-term memory (BiLSTM) and bidirectional encoder representations from Transformers (BERT). Compared with existing state-of-the-art methods, Mul-SNO obtained better ACC values of 0.911 and 0.796 on 10-fold cross-validation and independent data sets, respectively. The prediction server is freely available at http://lab.malab.cn/~mjq/Mul-SNO/.
10
Sandaruwan PD, Wannige CT. An improved deep learning model for hierarchical classification of protein families. PLoS One 2021; 16:e0258625. [PMID: 34669708 PMCID: PMC8528337 DOI: 10.1371/journal.pone.0258625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/12/2020] [Accepted: 10/01/2021] [Indexed: 12/28/2022] Open
Abstract
Although genes carry information, proteins are the main players in providing all the functionalities of a living organism. Massive numbers of different proteins are involved in every function that occurs in a cell. These amino acid sequences can be hierarchically classified into a set of families and subfamilies depending on their evolutionary relatedness and similarities in their structure or function. Protein characterization to identify protein structure and function is done accurately using laboratory experiments. With the rapidly increasing number of novel protein sequences, these experiments have become difficult to carry out since they are expensive, time-consuming, and laborious. Therefore, many computational classification methods have been introduced to classify proteins and predict their functional properties. As computational techniques have improved, deep learning has come to play a key role in many areas. Novel deep learning models such as DeepFam and ProtCNN have recently been presented to classify proteins into their families. However, these deep learning models have been used to carry out non-hierarchical classification of proteins. In this research, we propose a deep learning neural network model named DeepHiFam that classifies proteins hierarchically into different levels simultaneously with high accuracy. The model achieved an accuracy of 98.38% for protein family classification and more than 80% accuracy for the classification of protein subfamilies and sub-subfamilies. Further, DeepHiFam performed well in the non-hierarchical classification of protein families and achieved accuracies of 98.62% and 96.14% on the popular Pfam dataset and the COG dataset, respectively.
11
Zhang Y, Li M, Ji Z, Fan W, Yuan S, Liu Q, Chen Q. Twin self-supervision based semi-supervised learning (TS-SSL): Retinal anomaly classification in SD-OCT images. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.08.051] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/29/2023]
12
Przybył K, Koszela K, Adamski F, Samborska K, Walkowiak K, Polarczyk M. Deep and Machine Learning Using SEM, FTIR, and Texture Analysis to Detect Polysaccharide in Raspberry Powders. Sensors (Basel, Switzerland) 2021; 21:5823. [PMID: 34502718 PMCID: PMC8434077 DOI: 10.3390/s21175823] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/23/2021] [Revised: 08/16/2021] [Accepted: 08/24/2021] [Indexed: 11/27/2022]
Abstract
In this paper, artificial neural networks (ANN) and Fourier transform infrared spectroscopy (FTIR) were used to identify raspberry powders that differ in the amount and type of polysaccharide. FTIR absorbance spectra were prepared, as well as training sets that take into account the structure of microparticles acquired from Scanning Electron Microscopy (SEM) images. In addition, Multi-Layer Perceptron Networks (MLPNs) with a set of texture descriptors (machine learning) and a Convolutional Neural Network (CNN) with bitmaps (deep learning) were devised, which is an innovative approach to this problem. The aim of the paper was to create MLPN and CNN models with high classification efficiency, which translates into recognizing microparticles (obtaining their homogeneity) of raspberry powders on the basis of image pixel texture.
Affiliation(s)
- Krzysztof Przybył
  - Food Sciences and Nutrition, Department of Food Technology of Plant Origin, Poznan University of Life Sciences, Wojska Polskiego 31, 60-624 Poznan, Poland
- Krzysztof Koszela
  - Department of Biosystems Engineering, Poznan University of Life Sciences, Wojska Polskiego 50, 60-625 Poznan, Poland
- Franciszek Adamski
  - Food Sciences and Nutrition, Department of Food Technology of Plant Origin, Poznan University of Life Sciences, Wojska Polskiego 31, 60-624 Poznan, Poland
- Katarzyna Samborska
  - Institute of Food Sciences, Warsaw University of Life Sciences WULS-SGGW, Nowoursynowska 159c, 02-787 Warsaw, Poland
- Katarzyna Walkowiak
  - Food Sciences and Nutrition, Department of Physics and Biophysics, Poznan University of Life Sciences, Wojska Polskiego 28, 60-637 Poznan, Poland
- Mariusz Polarczyk
  - Main Library and Scientific Information Centre, Poznan University of Life Sciences, Witosa 45, 61-693 Poznan, Poland
13
Abstract
Background:
The evolutionary history of organisms can be described by phylogenetic trees. We need to compare the topologies of rooted phylogenetic trees when researching the evolution of a given set of species.
Objective:
Several metrics have been proposed to measure the dissimilarity between rooted phylogenetic trees, and these metrics are defined in different ways.
Methods:
This paper analyzes these metrics in terms of their definitions and, by means of experiments, the distance values they compute.
Results:
The experiments show that the distances calculated by the cluster metric, the partition metric, and the equivalent metric fit a Gaussian distribution well, and that the equivalent metric describes the difference between trees better than the others.
Conclusion:
Moreover, the paper presents a tool called CDRPT (Computing Distance for Rooted Phylogenetic Trees). CDRPT is a web server that calculates the distance between trees online. It can also be used offline by installing application packages for Windows, which makes it convenient for researchers. The home page of CDRPT is http://bioinformatics.imu.edu.cn/tree/.
Affiliation(s)
- Juan Wang
  - School of Computer Science, Inner Mongolia University, Hohhot, China
- Xinyue Qi
  - School of Computer Science, Inner Mongolia University, Hohhot, China
- Bo Cui
  - School of Computer Science, Inner Mongolia University, Hohhot, China
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
14
Li J, Chang M, Gao Q, Song X, Gao Z. Lung Cancer Classification and Gene Selection by Combining Affinity Propagation Clustering and Sparse Group Lasso. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017103557] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 01/03/2023]
Abstract
Background:
Cancer threatens human health seriously. Diagnosing cancer via gene expression
analysis is a hot topic in cancer research.
Objective:
The study aimed to diagnose the accurate type of lung cancer and discover the pathogenic
genes.
Methods:
In this study, Affinity Propagation (AP) clustering with similarity score was employed
to each type of lung cancer and normal lung. After grouping genes, sparse group lasso was adopted
to construct four binary classifiers and the voting strategy was used to integrate them.
Results:
This study screened six gene groups that may associate with different lung cancer subtypes
among 73 genes groups, and identified three possible key pathogenic genes, KRAS, BRAF
and VDR. Furthermore, this study achieved improved classification accuracies at minority classes
SQ and COID in comparison with other four methods.
Conclusion:
We propose the AP clustering based sparse group lasso (AP-SGL), which provides
an alternative for simultaneous diagnosis and gene selection for lung cancer.
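The way gene groups enter a sparse group lasso model can be illustrated with its penalty term. The sketch below uses the commonly cited form of the penalty (an L1 term plus a group-wise L2 term weighted by the square root of the group size); the abstract does not state the exact formulation used in AP-SGL, so the function name, the `lam` and `alpha` parameters, and the toy coefficients are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def sparse_group_lasso_penalty(
    coefficients: Sequence[float],
    groups: List[List[int]],
    lam: float,
    alpha: float,
) -> float:
    """Return lam * (alpha * ||b||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||b_g||_2).

    `groups` lists the coefficient indices belonging to each gene group,
    for example the clusters produced by a clustering step.
    """
    l1_term = sum(abs(b) for b in coefficients)
    group_term = 0.0
    for group in groups:
        group_norm = math.sqrt(sum(coefficients[i] ** 2 for i in group))
        group_term += math.sqrt(len(group)) * group_norm
    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)


# Toy example (made-up values): two groups of two genes, second group inactive.
beta = [3.0, 4.0, 0.0, 0.0]
gene_groups = [[0, 1], [2, 3]]
penalty = sparse_group_lasso_penalty(beta, gene_groups, lam=1.0, alpha=0.5)
```

A group whose coefficients are all zero contributes nothing to either term, which is how the penalty can discard whole gene groups while also selecting individual genes inside the groups it keeps.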
Affiliation(s)
- Juntao Li
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China
| | - Mingming Chang
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China
| | - Qinghui Gao
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China
| | - Xuekun Song
- School of Information Technology, Henan University of Chinese Medicine, Zhengzhou, 450046, China
| | - Zhiyu Gao
- School of Information Technology, Henan University of Chinese Medicine, Zhengzhou, 450046, China
| |
|
15
|
Nguyen TTD, Le NQK, Ho QT, Phan DV, Ou YY. TNFPred: identifying tumor necrosis factors using hybrid features based on word embeddings. BMC Med Genomics 2020; 13:155. [PMID: 33087125 PMCID: PMC7579990 DOI: 10.1186/s12920-020-00779-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/23/2022] Open
Abstract
Background Cytokines are a class of small proteins that act as chemical messengers and play a significant role in essential cellular processes including immunity regulation, hematopoiesis, and inflammation. As one important family of cytokines, tumor necrosis factors are associated with the regulation of various biological processes such as proliferation and differentiation of cells, apoptosis, lipid metabolism, and coagulation. These cytokines are also implicated in various diseases such as insulin resistance, autoimmune diseases, and cancer. Considering the interdependence between this kind of cytokine and others, distinguishing tumor necrosis factors from other cytokines is a challenge for biological scientists. Methods In this research, we employed a word embedding technique to create hybrid features, which proved to identify tumor necrosis factors efficiently given cytokine sequences. We segmented each protein sequence into protein words and created a corresponding word embedding for each word. Then, a word embedding-based vector for each sequence was created and input into machine learning classification models. When extracting feature sets, we not only diversified the segmentation sizes of the protein sequence but also tried different combinations of split grams to find the features that generated the optimal prediction. Furthermore, our methodology follows a well-defined procedure to build a reliable classification tool. Results With our proposed hybrid features, prediction models obtain more promising performance compared to seven prominent sequence-based feature types. Results from 10 independent runs on the surveyed dataset show that, on average, our optimal models obtain an area under the curve of 0.984 and 0.998 on 5-fold cross-validation and independent test, respectively. Conclusions These results show that biologists can use our model to identify tumor necrosis factors from other cytokines efficiently. Moreover, this study shows that natural language processing techniques can reasonably be applied to help biologists solve bioinformatics problems efficiently.
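The segmentation and embedding step described in the Methods can be sketched as follows. This is a minimal illustration, not the TNFPred implementation: it assumes overlapping words of a fixed length and a simple average of word vectors, neither of which is specified in the abstract, and the embedding values are made up.

```python
from typing import Dict, List


def split_into_words(sequence: str, n: int) -> List[str]:
    """Split a protein sequence into overlapping words (n-grams) of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def sequence_vector(sequence: str, embeddings: Dict[str, List[float]], n: int) -> List[float]:
    """Average the embeddings of every word in the sequence that has a vector."""
    vectors = [embeddings[w] for w in split_into_words(sequence, n) if w in embeddings]
    if not vectors:
        raise ValueError("no word of the sequence has an embedding")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Hypothetical 2-dimensional embeddings for three 2-grams (illustrative values only).
toy_embeddings = {"MK": [1.0, 0.0], "KV": [0.0, 1.0], "VL": [1.0, 1.0]}
words = split_into_words("MKVL", 2)
vector = sequence_vector("MKVL", toy_embeddings, 2)
```

Under the same assumptions, a hybrid feature could be formed by concatenating the vectors obtained for several word lengths, for example `n = 1` and `n = 2`.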
Affiliation(s)
| | - Nguyen-Quoc-Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan.,Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan
| | - Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan
| | - Dinh-Van Phan
- University of Economics, The University of Danang, Danang, 550000, Vietnam
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan.
| |
|
16
|
Lin CJ, Jeng SY. Optimization of Deep Learning Network Parameters Using Uniform Experimental Design for Breast Cancer Histopathological Image Classification. Diagnostics (Basel) 2020; 10:diagnostics10090662. [PMID: 32882935 PMCID: PMC7555941 DOI: 10.3390/diagnostics10090662] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/16/2020] [Revised: 08/21/2020] [Accepted: 08/31/2020] [Indexed: 12/20/2022] Open
Abstract
Breast cancer, a common cancer type, is a major health concern in women. Recently, researchers have used convolutional neural networks (CNNs) for medical image analysis and demonstrated their classification performance for breast cancer diagnosis on histopathological image datasets. However, the parameter settings of a CNN model are complicated, and using Breast Cancer Histopathological Database data for the classification is time-consuming. To overcome these problems, this study used a uniform experimental design (UED) to optimize the CNN parameters for breast cancer histopathological image classification. In UED, regression analysis was used to optimize the parameters. The experimental results indicated that the proposed method with UED parameter optimization achieved a classification accuracy of 84.41%. In conclusion, the proposed method can improve the classification accuracy effectively, with results superior to those of other similar methods.
Affiliation(s)
- Cheng-Jian Lin
- Department of Computer Science and Information Engineering, National Chin-Yi University of Technology, Taichung 411, Taiwan;
- School of Intelligence, National Taichung University of Science and Technology, Taichung 404, Taiwan
- Correspondence:
| | - Shiou-Yun Jeng
- Department of Computer Science and Information Engineering, National Chin-Yi University of Technology, Taichung 411, Taiwan;
| |
|
17
|
Abstract
During the last three decades or so, many efforts have been made to study protein cleavage
sites targeted by disease-causing enzymes, such as HIV (Human Immunodeficiency Virus) protease
and SARS (Severe Acute Respiratory Syndrome) coronavirus main proteinase. It has become increasingly
clear via this mini-review that the motivation driving the aforementioned studies is quite wise,
and that the results acquired through these studies are very rewarding, particularly for developing peptide
drugs.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
|
18
|
Wiktorowicz A, Wit A, Dziewierz A, Rzeszutko L, Dudek D, Kleczynski P. Calcium Pattern Assessment in Patients with Severe Aortic Stenosis Via the Chou's 5-Steps Rule. Curr Pharm Des 2020; 25:3769-3775. [PMID: 31566130 DOI: 10.2174/1381612825666190930101258] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 05/13/2019] [Accepted: 09/26/2019] [Indexed: 02/07/2023]
Abstract
BACKGROUND Progression of aortic valve calcifications (AVC) leads to aortic valve stenosis (AS). Importantly, the AVC degree has a great impact on AS progression, treatment selection and outcomes. Methods of AVC assessment do not provide accurate quantitative evaluation and analysis of calcium distribution and deposition in a repetitive manner. OBJECTIVE We aim to prepare a reliable tool for detailed AVC pattern analysis with quantitative parameters. METHODS We analyzed computed tomography (CT) scans of fifty patients with severe AS using a dedicated software based on MATLAB version R2017a (MathWorks, Natick, MA, USA) and ImageJ version 1.51 (NIH, USA) with the BoneJ plugin version 1.4.2 with a self-developed algorithm. RESULTS We listed unique parameters describing AVC and prepared 3D AVC models with color pointed calcium layer thickness in the stenotic aortic valve. These parameters were derived from CT-images in a semi-automated and repeatable manner. They were divided into morphometric, topological and textural parameters and may yield crucial information about the anatomy of the stenotic aortic valve. CONCLUSION In our study, we were able to obtain and define quantitative parameters for calcium assessment of the degenerated aortic valves. Whether the defined parameters are able to predict potential long-term outcomes after treatment, requires further investigation.
Affiliation(s)
- Agata Wiktorowicz
- 2nd Department of Cardiology, Institute of Cardiology, Jagiellonian University Medical College, 31-501 Kopernika St. 17, Krakow, Poland
| | - Adrian Wit
- Faculty of Physics and Applied Computer Science, University of Science and Technology, Mickiewicza Ave. 30, 30-059 Krakow, Poland
| | - Artur Dziewierz
- 2nd Department of Cardiology, Institute of Cardiology, Jagiellonian University Medical College, 31-501 Kopernika St. 17, Krakow, Poland
| | - Lukasz Rzeszutko
- 2nd Department of Cardiology, Institute of Cardiology, Jagiellonian University Medical College, 31-501 Kopernika St. 17, Krakow, Poland
| | - Dariusz Dudek
- 2nd Department of Cardiology, Institute of Cardiology, Jagiellonian University Medical College, 31-501 Kopernika St. 17, Krakow, Poland
| | - Pawel Kleczynski
- 2nd Department of Cardiology, Institute of Cardiology, Jagiellonian University Medical College, 31-501 Kopernika St. 17, Krakow, Poland
| |
|
19
|
Augmented EMTCNN: A Fast and Accurate Facial Landmark Detection Network. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10072253] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 12/17/2022]
Abstract
Facial landmarks represent prominent feature points on the face that can be used as anchor points in many face-related tasks. So far, a lot of research has been done with the aim of achieving efficient extraction of landmarks from facial images. Employing a large number of feature points for landmark detection and tracking usually requires excessive processing time. On the contrary, relying on too few feature points cannot accurately represent diverse landmark properties, such as shape. To extract the 68 most popular facial landmark points efficiently, in our previous study, we proposed a model called EMTCNN that extended the multi-task cascaded convolutional neural network for real-time face landmark detection. To improve the detection accuracy, in this study, we augment the EMTCNN model by using two convolution techniques—dilated convolution and CoordConv. The former makes it possible to increase the filter size without a significant increase in computation time. The latter enables the spatial coordinate information of landmarks to be reflected in the model. We demonstrate that our model can improve the detection accuracy while maintaining the processing speed.
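The claim that dilated convolution enlarges the filter without adding much computation can be made concrete with a one-dimensional sketch. It is not taken from the EMTCNN code: the function names are hypothetical, the operation is the cross-correlation form used in most deep learning libraries, and the signal and kernel are toy values. A kernel with k taps and dilation d spans k + (k - 1)(d - 1) input positions while still performing only k multiplications per output.

```python
from typing import List, Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float], dilation: int = 1) -> List[float]:
    """Valid (no padding) 1D cross-correlation with gaps of `dilation` between taps."""
    span = effective_kernel_size(len(kernel), dilation)
    outputs = []
    for start in range(len(signal) - span + 1):
        # Each output still uses only len(kernel) multiply-adds.
        outputs.append(sum(w * signal[start + i * dilation] for i, w in enumerate(kernel)))
    return outputs


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = dilated_conv1d(signal, kernel, dilation=1)    # spans 3 inputs per output
dilated = dilated_conv1d(signal, kernel, dilation=2)  # spans 5 inputs per output
```

With the same three weights, the dilated version compares samples four positions apart instead of two, which is the wider receptive field the abstract refers to.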
|
20
|
Do DT, Le NQK. Using extreme gradient boosting to identify origin of replication in Saccharomyces cerevisiae via hybrid features. Genomics 2020; 112:2445-2451. [PMID: 31987913 DOI: 10.1016/j.ygeno.2020.01.017] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 10/14/2019] [Revised: 01/12/2020] [Accepted: 01/23/2020] [Indexed: 12/11/2022]
Abstract
DNA replication is a fundamental task that plays a crucial role in the propagation of all living things on earth. Hence, the accurate identification of its origin could be the key to giving an insightful understanding of the regulatory mechanism of gene expression. Indeed, with the robust development of computational techniques and the abundant biological sequencing data, it has become possible for scientists to identify the origin of replication accurately and promptly. This growing concern has drawn a lot of attention among experts in this field. However, to gain better outcomes, more work is required. Therefore, this study is designed to explore the combination of state-of-the-art features and extreme gradient boosting learning system in classifying DNA sequences. Our hybrid approach is able to identify the origin of DNA replication with achieved sensitivity of 85.19%, specificity of 93.83%, accuracy of 89.51%, and MCC of 0.7931. Evidence is presented to show that our proposed method is superior to the state-of-the-art methods on the same benchmark dataset. Moreover, the research results represent a further step towards developing the prediction models for DNA replication in particular and DNA sequences in general.
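The four reported metrics are all functions of the same confusion matrix, which the following sketch makes explicit. The counts used here are hypothetical: they are not taken from the paper and were chosen only because they reproduce the reported percentages under the standard definitions.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for a balanced test set of 81 positives and 81 negatives.
metrics = binary_metrics(tp=69, fn=12, tn=76, fp=5)
```

Because the reported accuracy (89.51%) equals the mean of the reported sensitivity (85.19%) and specificity (93.83%), the figures are consistent with equal numbers of positive and negative samples, which is why a balanced toy set was used.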
Affiliation(s)
- Duyen Thi Do
- Toxicology and Biomedicine Research Group, Faculty of Applied Sciences, Ton Duc Thang University, Ho Chi Minh City, Viet Nam.
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City 106, Taiwan; Research Center of Artificial Intelligence in Medicine, Taipei Medical University, Taipei City 106, Taiwan.
| |
|
21
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector while still retaining considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth of the "distorted key theory" and the "wenxing diagram", substantially strengthened the power in treating multi-label systems, and led to the establishment of the famous "5-steps rule". All these logical developments are quite natural and are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and profound.
|
22
|
Le NQK, Ho QT, Yapp EKY, Ou YY, Yeh HY. DeepETC: A deep convolutional neural network architecture for investigating and classifying electron transport chain's complexes. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.09.070] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 02/06/2023]
|
23
|
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/20/2022]
|
24
|
Shao Y, Chou KC. pLoc_Deep-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.126034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/20/2022]
|
25
|
iQSP: A Sequence-Based Tool for the Prediction and Analysis of Quorum Sensing Peptides via Chou's 5-Steps Rule and Informative Physicochemical Properties. Int J Mol Sci 2019; 21:ijms21010075. [PMID: 31861928 PMCID: PMC6981611 DOI: 10.3390/ijms21010075] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 11/18/2019] [Revised: 12/13/2019] [Accepted: 12/18/2019] [Indexed: 01/18/2023] Open
Abstract
Understanding the functional mechanism of quorum-sensing peptides (QSPs) plays an essential role in finding new opportunities to combat bacterial infections by designing drugs. With the avalanche of newly available peptide sequences in the post-genomic age, it is highly desirable to develop a computational model for efficient, rapid and high-throughput QSP identification based purely on peptide sequence information. Although a few methods have been developed for predicting QSPs, their prediction accuracy and interpretability still require further improvement. Thus, in this work, we proposed an accurate sequence-based predictor (called iQSP) and a set of interpretable rules (called IR-QSP) for predicting and analyzing QSPs. In iQSP, we utilized a powerful support vector machine (SVM) together with 18 informative features from physicochemical properties (PCPs). A rigorous independent validation test showed that iQSP achieved a maximum accuracy and MCC of 93.00% and 0.86, respectively. Furthermore, a set of interpretable rules, IR-QSP, was extracted by using a random forest model and the 18 informative PCPs. Finally, for the convenience of experimental scientists, the iQSP web server was established and made freely available online. It is anticipated that iQSP will become a useful tool, or at least a complement to existing methods, for predicting and analyzing QSPs.
|
26
|
Le NQK, Huynh TT. Identifying SNAREs by Incorporating Deep Learning Architecture and Amino Acid Embedding Representation. Front Physiol 2019; 10:1501. [PMID: 31920706 PMCID: PMC6914855 DOI: 10.3389/fphys.2019.01501] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 03/28/2019] [Accepted: 11/26/2019] [Indexed: 12/12/2022] Open
Abstract
SNAREs (soluble N-ethylmaleimide-sensitive factor activating protein receptors) are a group of proteins that are crucial for membrane fusion and exocytosis of neurotransmitters from the cell. They play an important role in a broad range of cell processes, including cell growth, cytokinesis, and synaptic transmission, to promote cell membrane integration in eukaryotes. Many studies have determined that SNARE proteins are associated with many human diseases, especially cancer. Therefore, identifying their functions is a challenging problem for scientists seeking to better understand cancer and to design drug targets for treatment. We described each protein sequence based on amino acid embeddings using fastText, a natural language processing model that performs well in its field. Because each protein sequence is similar to a sentence with different words, applying a language model to protein sequences is both challenging and promising. Once generated, the amino acid embedding features were fed into a deep learning algorithm for prediction. Our model, which combines the fastText model and deep convolutional neural networks, could identify SNARE proteins with an independent test accuracy of 92.8%, sensitivity of 88.5%, specificity of 97%, and Matthews correlation coefficient (MCC) of 0.86. Our performance results were superior to those of the state-of-the-art predictor (SNARE-CNN). We suggest this study as a reliable method for biologists for SNARE identification, and it serves as a basis for applying the fastText word embedding model in bioinformatics, especially in protein sequencing prediction.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
| | - Tuan-Tu Huynh
- Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Bien Hoa, Vietnam
- Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
| |
|
27
|
Chou KC. Impacts of Pseudo Amino Acid Components and 5-steps Rule to Proteomics and Proteome Analysis. Curr Top Med Chem 2019; 19:2283-2300. [DOI: 10.2174/1568026619666191018100141] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/07/2019] [Revised: 08/18/2019] [Accepted: 08/26/2019] [Indexed: 01/27/2023]
Abstract
Stimulated by the 5-steps rule during the last decade or so, computational proteomics has achieved remarkable progress in the following three areas: (1) protein structural class prediction; (2) protein subcellular location prediction; (3) post-translational modification (PTM) site prediction. The results obtained by these predictions are very useful not only for an in-depth study of the functions of proteins and their biological processes in a cell, but also for developing novel drugs against major diseases such as cancers, Alzheimer’s, and Parkinson’s. Moreover, since the targets to be predicted may have the multi-label feature, two sets of metrics are introduced: one for inspecting the global prediction quality, and the other for the local prediction quality. All the predictors covered in this review have a user-friendly web-server, through which the majority of experimental scientists can easily obtain their desired data without the need to go through the complicated mathematics.
Affiliation(s)
- Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| |
|
28
|
Malebary SJ, Rehman MSU, Khan YD. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou's 5-step rule. PLoS One 2019; 14:e0223993. [PMID: 31751380 PMCID: PMC6874067 DOI: 10.1371/journal.pone.0223993] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 07/08/2019] [Accepted: 10/02/2019] [Indexed: 01/22/2023] Open
Abstract
Among the different post-translational modifications (PTMs), one of the most important is lysine crotonylation in proteins. Its importance in relation to different diseases and essential biological processes cannot be underestimated. The key step in finding the hidden mechanisms of crotonylation, along with its occurrence sites, is to fully understand the mechanism behind this biological process. In previously reported studies, researchers have used different techniques, such as position weighted matrix (PWM), support vector machine (SVM), k nearest neighbors (KNN), and many others. However, the maximum prediction accuracy achieved was not very high. To address this, we propose an improved predictor for lysine crotonylation sites named iCrotoK-PseAAC, in which we have incorporated various position and composition relative features along with statistical moments into PseAAC. The results of self-consistency testing were 100% accurate, while 10-fold cross-validation gave 99.0% accuracy. Based on the validation and comparison of the model, it is concluded that iCrotoK-PseAAC is more accurate than the previously proposed models.
Affiliation(s)
- Sharaf Jameel Malebary
- Department of Information Technology, King Abdul Aziz University, Rabigh, Kingdom of Saudi Arabia
| | - Muhammad Safi ur Rehman
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
|
29
|
Kurisu K, Yoshiuchi K, Ogino K, Oda T. Machine learning analysis to identify the association between risk factors and onset of nosocomial diarrhea: a retrospective cohort study. PeerJ 2019; 7:e7969. [PMID: 31687281 PMCID: PMC6825409 DOI: 10.7717/peerj.7969] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/24/2019] [Accepted: 10/01/2019] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND Although several risk factors for nosocomial diarrhea have been identified, the details of the association between these factors and the onset of nosocomial diarrhea, such as the degree of importance or temporal pattern of influence, remain unclear. We aimed to determine the association between risk factors and onset of nosocomial diarrhea using machine learning algorithms. METHODS We retrospectively collected data of patients with acute cerebral infarction. Seven variables, including age, sex, modified Rankin Scale (mRS) score, and number of days of antibiotics, tube feeding, proton pump inhibitors, and histamine 2-receptor antagonist use, were used in the analysis. We split the data into a training dataset and an independent test dataset. Based on the training dataset, we developed a random forest, support vector machine (SVM), and radial basis function (RBF) network model. By calculating the area under the curve (AUC) of the receiver operating characteristic curve using 5-fold cross-validation, we performed feature selection and hyperparameter optimization in each model. According to their final performances, we selected the optimal model and also validated it in the independent test dataset. Based on the selected model, we visualized the variable importance and the association between each variable and the outcome using partial dependence plots. RESULTS Two hundred and eighteen patients were included. In the cross-validation within the training dataset, the random forest model achieved an AUC of 0.944, which was higher than in the SVM and RBF network models. The random forest model also achieved an AUC of 0.832 in the independent test dataset. Tube feeding use days, mRS score, antibiotic use days, age and sex were strongly associated with the onset of nosocomial diarrhea, in this order. Tube feeding use had an inverse U-shaped association with the outcome. The mRS score and age had a convex downward and increasing association, while antibiotic use had a convex upward association with the outcome. CONCLUSION We revealed the degree of importance and temporal pattern of the influence of several risk factors for nosocomial diarrhea, which could help clinicians manage nosocomial diarrhea.
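The AUC used for model selection can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; it is an illustration only, the function name is hypothetical, and the labels and scores are made-up values rather than study data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy predictions: 1 = diarrhea onset, 0 = no onset (illustrative values only).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(toy_labels, toy_scores)
```

In a 5-fold cross-validation, this quantity would be computed on each held-out fold and then averaged, so that no case is scored by a model that saw it during training.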
Affiliation(s)
- Ken Kurisu
- Department of Stress Sciences and Psychosomatic Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Department of Infectious Diseases, Showa General Hospital, Tokyo, Japan
| | - Kazuhiro Yoshiuchi
- Department of Stress Sciences and Psychosomatic Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
| | - Kei Ogino
- Department of Stress Sciences and Psychosomatic Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Department of Infectious Diseases, Showa General Hospital, Tokyo, Japan
| | - Toshimi Oda
- Department of Infectious Diseases, Showa General Hospital, Tokyo, Japan
| |
|
30
|
Le NQK, Yapp EKY, Nagasundaram N, Chua MCH, Yeh HY. Computational identification of vesicular transport proteins from sequences using deep gated recurrent units architecture. Comput Struct Biotechnol J 2019; 17:1245-1254. [PMID: 31921391 PMCID: PMC6944713 DOI: 10.1016/j.csbj.2019.09.005] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 05/06/2019] [Revised: 09/07/2019] [Accepted: 09/11/2019] [Indexed: 11/20/2022] Open
Abstract
Protein function prediction is one of the most well-studied topics, attracting attention from countless researchers in the field of computational biology. Implementing deep neural networks that help improve the prediction of protein function, however, is still a major challenge. In this research, we propose a new strategy that combines gated recurrent units and position-specific scoring matrix profiles to predict vesicular transport proteins, which carry out a biological function of great importance. Although this function is difficult to discover, our model is able to achieve accuracies of 82.3% and 85.8% in the cross-validation and independent dataset, respectively. We also address the problem of imbalance in the dataset by tuning the class weights in the deep learning model. The results showed sensitivity, specificity, MCC, and AUC values of 79.2%, 82.9%, 0.52, and 0.861, respectively. Our strategy shows superior results on the same dataset against all other state-of-the-art algorithms. In this research, we have proposed a technique for the discovery of more proteins, particularly proteins connected with vesicular transport. In addition, our accomplishment could encourage the use of gated recurrent unit architectures in protein function prediction.
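Class-weight tuning for an imbalanced dataset usually starts from weights that are inversely proportional to class frequency. The sketch below shows that common starting heuristic, n_samples / (n_classes * class_count); it is an assumption for illustration, since the abstract does not report the weights that were actually tuned, and the label counts are made up.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy dataset: 20 positive (vesicular transport) and 80 negative sequences.
toy_labels = [1] * 20 + [0] * 80
weights = balanced_class_weights(toy_labels)
```

With these weights, an error on the minority class costs four times as much as an error on the majority class in the loss, and the values can then be adjusted up or down during tuning.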
Affiliation(s)
- Nguyen Quoc Khanh Le
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639818, Singapore
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
| | - Edward Kien Yee Yapp
- Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, 138634, Singapore
| | - N. Nagasundaram
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639818, Singapore
| | - Matthew Chin Heng Chua
- Institute of Systems Science, 25 Heng Mui Keng Terrace, National University of Singapore, 119615, Singapore
| | - Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639818, Singapore
| |
|
31
|
Chen Y, Fan X. Use of Chou's 5-Steps Rule to Reveal Active Compound and Mechanism of Shuangshen Pingfei San on Idiopathic Pulmonary Fibrosis. Curr Mol Med 2019; 20:220-230. [PMID: 31612829 DOI: 10.2174/1566524019666191011160543] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 06/12/2019] [Revised: 09/20/2019] [Accepted: 09/23/2019] [Indexed: 12/19/2022]
Abstract
BACKGROUND Shuangshen Pingfei San (SPS) is a derivative of the classic formula Renshen Pingfei San used in treating idiopathic pulmonary fibrosis (IPF). METHODS In this study, Chou's 5-steps rule was applied to explore the potential active compounds and mechanism of SPS on IPF. Compound-target, target-pathway, herb-target and core gene target interaction networks were established and analyzed. A total of 296 compounds and 69 candidate therapeutic targets of SPS in treating IPF were obtained. Network analysis revealed that the main active compounds were flavonoids (such as apigenin, quercetin, naringenin, luteolin); other clusters (such as ginsenoside Rh2, diosgenin, tanshinone IIa) might also play significant roles. SPS regulated multiple IPF-related genes, which affect fibrosis (PTGS2, KDR, FGFR1, TGFB, VEGFA, MMP2/9) and inflammation (PPARG, TNF, IL13, IL4, IL1B, etc.). CONCLUSION In conclusion, the anti-pulmonary fibrosis effect of SPS might be related to the regulation of inflammation and pro-fibrotic signaling pathways. The potential active compounds and mechanisms of SPS on IPF revealed here may benefit further study.
Affiliation(s)
- Yeqing Chen
- College of Basic Medicine, Nanjing University of Chinese Medicine, Nanjing, China; Jiangsu Collaborative Innovation Center of Chinese Medicinal Resources Industrialization, Nanjing, China
- Xinsheng Fan
- College of Basic Medicine, Nanjing University of Chinese Medicine, Nanjing, China; Jiangsu Collaborative Innovation Center of Chinese Medicinal Resources Industrialization, Nanjing, China

32
Liu K, Chen W, Lin H. XG-PseU: an eXtreme Gradient Boosting based method for identifying pseudouridine sites. Mol Genet Genomics 2019; 295:13-21. [DOI: 10.1007/s00438-019-01600-9] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 06/16/2019] [Accepted: 07/29/2019] [Indexed: 01/08/2023]

33
Le NQK, Huynh TT, Yapp EKY, Yeh HY. Identification of clathrin proteins by incorporating hyperparameter optimization in deep learning and PSSM profiles. Comput Methods Programs Biomed 2019; 177:81-88. [PMID: 31319963 DOI: 10.1016/j.cmpb.2019.05.016] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 01/18/2019] [Revised: 05/06/2019] [Accepted: 05/16/2019] [Indexed: 06/10/2023]
Abstract
BACKGROUND AND OBJECTIVES Clathrin is an adaptor protein that serves as the principal element of the vesicle-coating complex and is important for the membrane cleavage that releases the invaginated vesicle from the plasma membrane. The functional loss of clathrins has been tied to many human diseases, e.g., neurodegenerative disorders, cancer, and Alzheimer's disease. Therefore, creating a precise model to identify its functions is a crucial step towards understanding human diseases and designing drug targets. METHODS We present a deep learning model using a two-dimensional convolutional neural network (CNN) and position-specific scoring matrix (PSSM) profiles to identify clathrin proteins from high-throughput sequences. Traditionally, 2D CNNs take images as input, so we treated the 20 × 20 PSSM profile as an image of 20 × 20 pixels. The input PSSM profile was then fed to our 2D CNN, in which we varied a range of parameters to improve the performance of the model. Based on the 10-fold cross-validation results, a hyper-parameter optimization process was employed to find the best model for our dataset. Finally, an independent dataset was used to assess the predictive ability of the current model. RESULTS Our model could identify clathrin proteins with a sensitivity of 92.2%, specificity of 91.2%, accuracy of 91.8%, and MCC of 0.83 on the independent dataset. Compared to state-of-the-art traditional neural networks, our method achieved a significant improvement in all typical measurement metrics. CONCLUSIONS Through the proposed study, we provide an effective tool for investigating clathrin proteins, and our achievement could promote the use of deep learning in biomedical research. We also provide the source code and dataset freely at https://www.github.com/khanhlee/deep-clathrin/.
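The abstract above does not say how a PSSM with one row per residue becomes a 20 × 20 "image". A minimal sketch of one common convention follows, in which rows belonging to the same amino acid are summed; this aggregation rule, the function name `pssm_to_square`, and the toy data are assumptions for illustration and are not taken from the paper or its repository.

```python
# Column order assumed here: the usual PSI-BLAST PSSM column order.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_to_square(sequence, pssm):
    """Collapse an L x 20 PSSM into a 20 x 20 matrix (hypothetical helper).

    Row i of the result is the sum of the PSSM rows at every sequence
    position whose residue is amino acid i.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    square = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:
            # Skip non-standard residues such as 'X'.
            continue
        for j, score in enumerate(row):
            square[i][j] += score
    return square


# Toy input: three residues with made-up scores, not real PSSM values.
toy_sequence = "AAR"
toy_pssm = [[1] * 20, [2] * 20, [5] * 20]
image = pssm_to_square(toy_sequence, toy_pssm)
```

The resulting fixed-size matrix can be passed to a 2D CNN regardless of the original sequence length, which is the practical reason for collapsing the profile this way.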
Affiliation(s)
- Nguyen Quoc Khanh Le
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798 Singapore
- Tuan-Tu Huynh
- Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, No. 10 Huynh Van Nghe Road, Bien Hoa, Dong Nai, Vietnam
- Edward Kien Yee Yapp
- Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, 138634 Singapore
- Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798 Singapore

34
Le NQK, Yapp EKY, Yeh HY. ET-GRU: using multi-layer gated recurrent units to identify electron transport proteins. BMC Bioinformatics 2019; 20:377. [PMID: 31277574 PMCID: PMC6612191 DOI: 10.1186/s12859-019-2972-5] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 03/03/2019] [Accepted: 06/27/2019] [Indexed: 12/01/2022] Open
Abstract
BACKGROUND The electron transport chain is a series of protein complexes embedded in the process of cellular respiration, which is an important process for transferring electrons and other macromolecules throughout the cell. It is also the major process for extracting energy via redox reactions in the oxidation of sugars. Many studies have determined that electron transport proteins are implicated in a variety of human diseases, e.g., diabetes, Parkinson's disease, and Alzheimer's disease. Few bioinformatics studies have been conducted to identify electron transport proteins with high accuracy, and the performance of existing methods leaves considerable room for improvement. Here, we present a novel deep neural network architecture to address this problem. RESULTS Most previous studies could not feed the original position-specific scoring matrix (PSSM) profiles into neural networks, which led to a loss of information and prevented the networks from achieving the best results. In this paper, we present a novel approach that applies deep gated recurrent units (GRU) to full PSSMs to resolve this problem. Our approach can precisely predict electron transporters, with cross-validation and independent test accuracies of 93.5% and 92.3%, respectively. Our approach demonstrates superior performance to all of the state-of-the-art predictors on electron transport proteins. CONCLUSIONS Through the proposed study, we provide ET-GRU, a web server for discriminating electron transport proteins in particular and other protein functions in general. Our achievement could also promote the use of GRU in computational biology, especially in protein function prediction.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, Singapore, 639798 Singapore
- Edward Kien Yee Yapp
- Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, Singapore, 138634 Singapore
- Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, Singapore, 639798 Singapore

35
Gao R, Wang M, Zhou J, Fu Y, Liang M, Guo D, Nie J. Prediction of Enzyme Function Based on Three Parallel Deep CNN and Amino Acid Mutation. Int J Mol Sci 2019; 20:E2845. [PMID: 31212665 PMCID: PMC6600291 DOI: 10.3390/ijms20112845] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 05/05/2019] [Revised: 06/03/2019] [Accepted: 06/04/2019] [Indexed: 01/28/2023] Open
Abstract
During the past decade, as the number of proteins in the PDB database has gradually increased, traditional methods have struggled to characterize the function of newly discovered enzymes in chemical reactions. Computational models and protein feature representations for predicting enzymatic function have therefore become more important. Most existing methods for predicting enzymatic function have used protein geometric structure or protein sequence alone. In this paper, enzyme functions are predicted from multiple kinds of biological information, including sequence information and structure information. First, we extract mutation information from the amino acid sequence using the position scoring matrix and express structure information with amino acid distances and angles. Then, we use histograms to represent the extracted sequence and structural features, respectively. Meanwhile, we establish a network model of three parallel Deep Convolutional Neural Networks (DCNN) to learn three features of the enzyme for function prediction simultaneously, and the outputs are fused through two different architectures. Finally, the proposed model was evaluated on a large dataset of 43,843 enzymes from the PDB and achieved 92.34% correct classification when sequence information is considered, an improvement over the previous result.
Affiliation(s)
- Ruibo Gao
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Mengmeng Wang
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Jiaoyan Zhou
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Yuhang Fu
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Meng Liang
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Dongliang Guo
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Junlan Nie
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.

36
Niu B, Liang C, Lu Y, Zhao M, Chen Q, Zhang Y, Zheng L, Chou KC. Glioma stages prediction based on machine learning algorithm combined with protein-protein interaction networks. Genomics 2019; 112:837-847. [PMID: 31150762 DOI: 10.1016/j.ygeno.2019.05.024] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/03/2019] [Accepted: 05/25/2019] [Indexed: 12/18/2022]
Abstract
BACKGROUND Glioma is the most lethal nervous system cancer. Recent studies have made great efforts to study the occurrence and development of glioma, but the molecular mechanisms are still unclear. This study was designed to reveal the molecular mechanisms of glioma based on protein-protein interaction networks combined with machine learning methods. Key differentially expressed genes (DEGs) were screened and selected using the protein-protein interaction (PPI) networks. RESULTS A total of 19 genes were selected between grade I and grade II, 21 genes between grade II and grade III, and 20 genes between grade III and grade IV. Then, five machine learning methods were employed to predict glioma stages based on the selected key genes. After comparison, the Complement Naive Bayes classifier was employed to build the prediction model for grade II-III with an accuracy of 72.8%, and random forest was employed to build the prediction models for grade I-II and grade III-IV with accuracies of 97.1% and 83.2%, respectively. Finally, the selected genes were analyzed by PPI networks, Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and the results improve our understanding of the biological functions of the selected DEGs involved in glioma growth. We expect that the key genes identified have guiding significance for the occurrence of gliomas or, at the very least, that they are useful for tumor researchers. CONCLUSION Machine learning combined with PPI networks and GO and KEGG analyses of selected DEGs improves our understanding of the biological functions involved in glioma growth.
Affiliation(s)
- Bing Niu
- School of Life Sciences, Shanghai University, Shanghai 200444, China; Gordon Life Science Institute, Boston, MA 02478, USA.
- Chaofeng Liang
- Department of Neurosurgery, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
- Yi Lu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Manman Zhao
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Qin Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
- Yuhui Zhang
- Renji Hospital, Medical School, Shanghai Jiaotong University, 160 Pujian Rd, New Pudong District, Shanghai 200127, China; Changhai Hospital, Second Military Medical University, Shanghai 200433, China.
- Linfeng Zheng
- Department of Radiology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200080, China; Department of Radiology, Shanghai First People's Hospital, Baoshan Branch, Shanghai 200940, China.
- Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Boston, MA 02478, USA.