1
Yong CSY, Atheeqah-Hamzah N. Transcriptome-wide Identification of Nine Tandem Repeat Protein Families in Roselle (Hibiscus sabdariffa L.). Trop Life Sci Res 2024; 35:121-148. [PMID: 39464663 PMCID: PMC11507979 DOI: 10.21315/tlsr2024.35.3.6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Accepted: 05/20/2024] [Indexed: 10/29/2024] [Open access]
Abstract
Plants are rich in tandem repeats-containing proteins. It is postulated that the occurrence of tandem repeat gene families facilitates the adaptation and survival of plants in adverse environmental conditions. This study intended to identify the tandem repeats in the transcriptome of a high potential tropical horticultural plant, roselle (Hibiscus sabdariffa L.). A total of 92,974 annotated de novo assembled transcripts were analysed using in silico approach, and 6,541 transcripts that encoded proteins containing tandem repeats with length of 20-60 amino acid residues were identified. Domain analysis revealed a total of nine tandem repeat protein families in the transcriptome of roselle, which are the Ankyrin repeats (ANK), Armadillo repeats (ARM), elongation factor-hand domain repeats (EF-hand), Huntingtin, elongation factor 3, protein phosphatase 2A, yeast kinase TOR1 repeats (HEAT), Kelch repeats (Kelch), leucine rich repeats (LRR), pentatricopeptide repeats (PPR), tetratricopeptide repeats (TPR) and WD40 repeats (WD40). Functional annotation analysis further matched 6,236 transcripts to 1,045 known proteins that contained tandem repeats including proteins implicated in plant development, protein-protein interaction, immunity and abiotic stress responses. The findings provide new insights into the occurrence of tandem repeats in the transcriptome and lay the foundation to elucidate the functional associations between tandem peptide repeats (TRs) and proteins in roselle and facilitate the identification of novel biotic and abiotic response related tandem repeats genes that may be useful in breeding improved varieties.
Affiliation(s)
- Christina Seok Yien Yong
  - Department of Biology, Faculty of Science, Universiti Putra Malaysia, Jalan UPM, 43400 Serdang, Selangor, Malaysia
- Nur Atheeqah-Hamzah
  - Department of Biology, Faculty of Science, Universiti Putra Malaysia, Jalan UPM, 43400 Serdang, Selangor, Malaysia
2
Nazari L, Aslan MF, Sabanci K, Ropelewska E. Integrated transcriptomic meta-analysis and comparative artificial intelligence models in maize under biotic stress. Sci Rep 2023; 13:15899. [PMID: 37741865 PMCID: PMC10517993 DOI: 10.1038/s41598-023-42984-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 09/17/2023] [Indexed: 09/25/2023] [Open access]
Abstract
Biotic stress imposed by pathogens, including fungi, bacteria, and viruses, can cause heavy damage leading to yield reduction in maize. Therefore, the identification of resistance genes paves the way to the development of disease-resistant cultivars and is essential for reliable production in maize. Identifying different gene expression patterns can deepen our understanding of maize resistance to disease. This study includes machine learning and deep learning-based applications for classifying genes expressed under normal and biotic stress conditions in maize. The machine learning algorithms used are Naive Bayes (NB), K-Nearest Neighbor (KNN), Ensemble, Support Vector Machine (SVM), and Decision Tree (DT). A Bidirectional Long Short Term Memory (BiLSTM) based network with Recurrent Neural Network (RNN) architecture is proposed for gene classification with deep learning. To increase the performance of these algorithms, features are selected from the raw gene features through the Relief feature selection algorithm. The findings indicated the efficacy of BiLSTM over the other machine learning algorithms. Some top genes ((S)-beta-macrocarpene synthase, zealexin A1 synthase, polyphenol oxidase I, chloroplastic, pathogenesis-related protein 10, CHY1, chitinase chem 5, barwin, and uncharacterized LOC100273479) were shown to be differentially upregulated under biotic stress conditions.
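To make the Relief step named in this abstract more concrete, here is a minimal Python sketch of the basic two-class Relief weighting. It is an illustration only, not the authors' code: the abstract does not state which Relief variant or parameters were used, and the function name `relief_weights` and the toy data are assumptions introduced here.

```python
import random


def relief_weights(X, y, n_iter=None, seed=0):
    """Basic two-class Relief (illustrative sketch, not the paper's implementation).

    A feature gains weight when it differs at the nearest miss (other class)
    and loses weight when it differs at the nearest hit (same class).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])

    # Per-feature ranges, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(d))

    n_iter = n_iter or n
    weights = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(n)  # sample one instance at random
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda k: distance(X[i], X[k]))
        miss = min(misses, key=lambda k: distance(X[i], X[k]))
        for j in range(d):
            weights[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n_iter
    return weights


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.2], [1.0, 0.8], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
```

Features would then be ranked by weight and the top-ranked ones passed to the classifiers; how many features the study retained is not given in the abstract.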
Affiliation(s)
- Leyla Nazari
  - Crop and Horticultural Science Research Department, Fars Agricultural and Natural Resources Research and Education Center, Agricultural Research, Education and Extension Organization (AREEO), Shiraz, Iran
- Muhammet Fatih Aslan
  - Electrical and Electronics Engineering, Karamanoglu Mehmetbey University, Karaman, Turkey
- Kadir Sabanci
  - Electrical and Electronics Engineering, Karamanoglu Mehmetbey University, Karaman, Turkey
- Ewa Ropelewska
  - Fruit and Vegetable Storage and Processing Department, The National Institute of Horticultural Research, Skierniewice, Poland
3
Feng C, Wu J, Wei H, Xu L, Zou Q. CRCF: A Method of Identifying Secretory Proteins of Malaria Parasites. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2149-2157. [PMID: 34061749 DOI: 10.1109/tcbb.2021.3085589] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/12/2023]
Abstract
Malaria is a mosquito-borne disease that results in millions of cases and deaths annually. The development of a fast computational method that identifies secretory proteins of the malaria parasite is important for research on antimalarial drugs and vaccines. Thus, a method was developed to identify the secretory proteins of malaria parasites. In this method, a reduced alphabet was selected to recode the original protein sequence. A feature synthesis method was used to synthesise three different types of feature information. Finally, the random forest method was used as a classifier to identify the secretory proteins. In addition, a web server was developed to share the proposed algorithm. Experiments using the benchmark dataset demonstrated that the overall accuracy achieved by the proposed method was greater than 97.8 percent using the 10-fold cross-validation method. Furthermore, the reduced schemes and characteristic performance analyses are discussed.
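The recoding step described in this abstract can be sketched as follows. The abstract does not say which reduced alphabet CRCF uses, so the six-group scheme below is a hypothetical grouping chosen only for illustration, and the helper names `recode` and `composition` are likewise assumptions.

```python
from collections import Counter

# Hypothetical six-group reduction of the 20 amino acids (illustration only;
# the reduced alphabet actually used by CRCF is not given in the abstract).
REDUCED_GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine / proline
}
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in REDUCED_GROUPS.items()
    for residue in members
}


def recode(sequence):
    """Replace each residue with its group letter; unknown residues become 'X'."""
    return "".join(RESIDUE_TO_GROUP.get(residue, "X") for residue in sequence.upper())


def composition(recoded):
    """Fraction of each group letter, in the fixed order of REDUCED_GROUPS."""
    counts = Counter(recoded)
    total = len(recoded) or 1
    return [counts[group] / total for group in REDUCED_GROUPS]


recoded = recode("MKDE")             # "AKDD"
features = composition(recoded)      # [0.25, 0.0, 0.0, 0.25, 0.5, 0.0]
```

A composition vector like this is only one simple feature type; the method in the abstract synthesises three types of feature information before passing them to a random forest classifier.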
4
CWLy-RF: A novel approach for identifying cell wall lyases based on random forest classifier. Genomics 2021; 113:2919-2924. [PMID: 34186189 DOI: 10.1016/j.ygeno.2021.06.038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/15/2021] [Revised: 06/20/2021] [Accepted: 06/25/2021] [Indexed: 02/05/2023]
Abstract
Drug resistance of pathogenic bacteria has become increasingly serious due to the abuse of antibiotics in recent years. Researchers have found that cell wall lyases are effective antibacterial agents that can specifically recognize target bacteria and degrade bacterial peptidoglycan. Traditional wet experiments are usually expensive, time-consuming and laborious for the identification of lyases. Therefore, there is an urgent need to develop prediction tools based on computer methods to identify lyases quickly and accurately. In this paper, a new predictor, CWLy-RF, is proposed based on the random forest (RF) algorithm to identify cell wall lyases. In this method, we combined three features, namely, 400D, 188D and the composition of k-spaced amino acid group pairs, using mixed-feature representation methods. Afterward, we improved the feature representation ability with the selected top 100 features by using the information gain method and trained a predictive model using RF. The constructed prediction model is evaluated by using 10-fold cross-validation. The accuracy obtained was 96.09%, the AUC was 0.993, the MCC was 0.922, the sensitivity was 94.92%, and the specificity was 97.32%. We have proved that the proposed predictor CWLy-RF is superior to other latest models, and it will hopefully become an effective and useful tool for identifying lyases.
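For readers unfamiliar with the reported measures, the sketch below computes accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC) from binary confusion-matrix counts using their standard definitions. The counts in the usage lines are made up for illustration and are not taken from the paper; AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + fn + tn + fp
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fn=10, tn=95, fp=5)
```

In a 10-fold cross-validation such as the one described above, these counts would typically be accumulated over the ten held-out folds before the metrics are computed.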
5
Niu M, Lin Y, Zou Q. sgRNACNN: identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks. Plant Molecular Biology 2021; 105:483-495. [PMID: 33385273 DOI: 10.1007/s11103-020-01102-y] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 05/17/2020] [Accepted: 12/01/2020] [Indexed: 06/12/2023]
Abstract
Key message: We proposed an ensemble convolutional neural network model to identify sgRNA high on-target activity in four crops and we used one-hot encoding and k-mers for sequence encoding. As an important component of the CRISPR/Cas9 system, single-guide RNA (sgRNA) plays an important role in gene redirection and editing. sgRNA has played an important role in the improvement of agronomic species, but there is a lack of effective bioinformatics tools to identify the activity of sgRNA in agronomic species. Therefore, it is necessary to develop a method based on machine learning to identify sgRNA high on-target activity. In this work, we proposed a simple convolutional neural network method to identify sgRNA high on-target activity. Our study used one-hot encoding and k-mers for sequence data conversion and a voting algorithm for constructing the convolutional neural network ensemble model sgRNACNN for the prediction of sgRNA activity. The ensemble model sgRNACNN was used for predictions in four crops: Glycine max, Zea mays, Sorghum bicolor and Triticum aestivum. The accuracy rates of the four crops in the sgRNACNN model were 82.43%, 80.33%, 78.25% and 87.49%, respectively. The experimental results showed that sgRNACNN realizes the identification of high on-target activity sgRNA of agronomic data and can meet the demands of sgRNA activity prediction in agronomy to a certain extent. These results have certain significance for guiding crop gene editing and academic research. The source code and relevant dataset can be found at the following link: https://github.com/nmt315320/sgRNACNN.git
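The two sequence encodings and the voting step mentioned in this abstract can be illustrated with a short Python sketch. This is not the sgRNACNN code (which is in the repository linked above); the function names, the choice of k = 2 and the example sequence are assumptions made here, and the neural networks themselves are left out.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def one_hot(sequence):
    """Encode each base as a 4-element indicator vector in A, C, G, T order.

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers, returned in lexicographic k-mer order."""
    sequence = sequence.upper()
    vocabulary = ["".join(p) for p in product(BASES, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


def majority_vote(labels):
    """Return the most frequent label among the base models' predictions.

    Ties are not handled specially here; an odd number of voters avoids them.
    """
    return Counter(labels).most_common(1)[0][0]


matrix = one_hot("ACGT")              # 4 x 4 identity-like matrix
vector = kmer_counts("ACGTAC", k=2)   # 16 counts, one per dinucleotide
ensemble_label = majority_vote([1, 0, 1])
```

In an ensemble of this kind, each base network would receive one of the encodings and emit a class label, and `majority_vote` would combine those labels into the final prediction.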
Affiliation(s)
- Mengting Niu
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yuan Lin
  - Department of System Integration, Sparebanken Vest, Bergen, Norway
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
6
Using a low correlation high orthogonality feature set and machine learning methods to identify plant pentatricopeptide repeat coding gene/protein. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.02.079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
7
Zhao X, Wang H, Li H, Wu Y, Wang G. Identifying Plant Pentatricopeptide Repeat Proteins Using a Variable Selection Method. Frontiers in Plant Science 2021; 12:506681. [PMID: 33732270 PMCID: PMC7957076 DOI: 10.3389/fpls.2021.506681] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/22/2019] [Accepted: 02/08/2021] [Indexed: 05/05/2023]
Abstract
Motivation: Pentatricopeptide repeat (PPR), which is a triangular pentapeptide repeat domain, plays an important role in plant growth. Features extracted from sequences are applicable to PPR protein identification using certain classification methods. However, which components of a multidimensional feature (namely variables) are more effective for protein discrimination has never been discussed. Therefore, we seek to select variables from a multidimensional feature for identifying PPR proteins. Method: A framework of variable selection for identifying PPR proteins is proposed. Samples representing PPR positive proteins and negative ones are equally split into a training and a testing set. Variable importance is regarded as scores derived from an iteration of resampling, training, and scoring step on the training set. A model selection method based on Gaussian mixture model is applied to automatic choice of variables which are effective to identify PPR proteins. Measurements are used on the testing set to show the effectiveness of the selected variables. Results: Certain variables other than the multidimensional feature they belong to do work for discrimination between PPR positive proteins and those negative ones. In addition, the content of methionine may play an important role in predicting PPR proteins.
Affiliation(s)
- Xudong Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Hanxu Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Hangyu Li
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yiming Wu
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin, China
  - *Correspondence: Guohua Wang
8
Zhang J, Lv L, Lu D, Kong D, Al-Alashaari MAA, Zhao X. Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors. BMC Bioinformatics 2020; 21:480. [PMID: 33109082 PMCID: PMC7590791 DOI: 10.1186/s12859-020-03826-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/21/2020] [Accepted: 10/19/2020] [Indexed: 12/13/2022] [Open access]
Abstract
Background: Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. As to protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should exist in advance. However, this has hardly ever been seen in prevailing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. As to protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered. Results: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset containing 1947 protein sequences as a case, experiments are made to identify bacterial type IV secreted effectors (T4SE) from protein sequences, which are composed of 399 T4SE and 1548 non-T4SE. Comparable and quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method. Conclusions: Certain variables other than an encoded feature they belong to do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.
Affiliation(s)
- Jian Zhang
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Lixin Lv
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Donglei Lu
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Denan Kong
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
- Xudong Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
9
Feng C, Ma Z, Yang D, Li X, Zhang J, Li Y. A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features. Front Bioeng Biotechnol 2020; 8:285. [PMID: 32432088 PMCID: PMC7214540 DOI: 10.3389/fbioe.2020.00285] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 01/15/2020] [Accepted: 03/18/2020] [Indexed: 11/13/2022] [Open access]
Abstract
The thermostability of proteins is a key factor considered during enzyme engineering, and finding a method that can identify thermophilic and non-thermophilic proteins will be helpful for enzyme design. In this study, we established a novel method combining mixed features and machine learning to achieve this recognition task. In this method, an amino acid reduction scheme was adopted to recode the amino acid sequence. Then, the physicochemical characteristics, auto-cross covariance (ACC), and reduced dipeptides were calculated and integrated to form a mixed feature set, which was processed using correlation analysis, feature selection, and principal component analysis (PCA) to remove redundant information. Finally, four machine learning methods and a dataset containing 500 random observations out of 915 thermophilic proteins and 500 random samples out of 793 non-thermophilic proteins were used to train and predict the data. The experimental results showed that 98.2% of thermophilic and non-thermophilic proteins were correctly identified using 10-fold cross-validation. Moreover, our analysis of the final reserved features and removed features yielded information about the crucial, unimportant and insensitive elements; it also provided essential information for enzyme design.
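The balanced sampling described in this abstract (500 proteins drawn at random from each class) can be sketched in a few lines of Python. The function name `balanced_subsample`, the fixed seed and the integer placeholders standing in for protein records are assumptions for illustration; the abstract does not describe how the authors drew their samples.

```python
import random


def balanced_subsample(positives, negatives, n_per_class, seed=0):
    """Draw the same number of items from each class without replacement.

    Returns a shuffled list of (item, label) pairs, with label 1 for
    positives and 0 for negatives.
    """
    rng = random.Random(seed)
    chosen_pos = rng.sample(positives, n_per_class)
    chosen_neg = rng.sample(negatives, n_per_class)
    dataset = [(item, 1) for item in chosen_pos] + [(item, 0) for item in chosen_neg]
    rng.shuffle(dataset)
    return dataset


# Placeholder identifiers standing in for 915 thermophilic and
# 793 non-thermophilic protein records.
thermophilic = list(range(915))
non_thermophilic = list(range(1000, 1793))
dataset = balanced_subsample(thermophilic, non_thermophilic, n_per_class=500)
```

Equal class sizes keep overall accuracy from being inflated by a majority class, which matters when a single figure such as the 98.2% above is reported.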
Affiliation(s)
- Changli Feng
  - College of Information Science and Technology, Taishan University, Tai’an, China
- Zhaogui Ma
  - College of Information Science and Technology, Taishan University, Tai’an, China
- Deyun Yang
  - College of Information Science and Technology, Taishan University, Tai’an, China
- Xin Li
  - College of Information Science and Technology, Taishan University, Tai’an, China
- Jun Zhang
  - Department of Rehabilitation, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
- Yanjuan Li
  - Information and Computer Engineering College, Northeast Forestry University, Harbin, China
10
Taxonomy dimension reduction for colorectal cancer prediction. Comput Biol Chem 2019; 83:107160. [DOI: 10.1016/j.compbiolchem.2019.107160] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 06/24/2019] [Revised: 11/02/2019] [Accepted: 11/04/2019] [Indexed: 02/01/2023]