1. A convex multi-class model via distance metric learning based class-to-instance confidence. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
2. RNA-Seq-Based Breast Cancer Subtypes Classification Using Machine Learning Approaches. Computational Intelligence and Neuroscience 2020; 2020:4737969. [PMID: 33178256 PMCID: PMC7644310 DOI: 10.1155/2020/4737969] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/21/2019] [Revised: 05/31/2020] [Accepted: 10/09/2020] [Indexed: 12/20/2022]
Abstract
Background: Breast invasive carcinoma (BRCA) is not a single disease, as each subtype has a distinct morphological structure. Although several computational methods have been proposed for breast cancer subtype identification, the specific interaction mechanisms of the genes involved in the subtypes are still incompletely understood. Identifying and exploring the corresponding gene interaction mechanisms for each subtype of breast cancer could have an important impact on personalized treatment for different patients.
Methods: We integrate the biological importance of genes from the gene regulatory networks into the differential expression analysis and then obtain the weighted differentially expressed genes (weighted DEGs). A gene with a high weight regulates more target genes and thus holds more biological importance. In addition, we constructed gene coexpression networks for control and experiment groups, and the significantly differentially interacting structures encouraged us to design the corresponding Gene Ontology (GO) enrichment based on gene coexpression networks (GOEGCN). The GOEGCN considers the two-side distinction analysis between gene coexpression networks for control and experiment groups. The method allows us to study how the modulated coexpressed gene couples impact biological functions at a GO level.
Results: We modeled the binary classification with weighted DEGs for each subtype. The binary classifier could make a good prediction for an unseen sample, and the experimental results validated the effectiveness of our proposed approaches. The novel enriched GO terms based on GOEGCN for control and experiment groups of each subtype explain, to some extent, the specific biological function changes according to the two-side distinction of coexpression network structures.
Conclusion: The weighted DEGs contain biological importance derived from the gene regulatory network. Based on the weighted DEGs, five binary classifiers were learned and showed good performance on the "Sensitivity," "Specificity," "Accuracy," "F1," and "AUC" metrics. The GOEGCN with weighted DEGs for control and experiment groups presented novel GO enrichment analysis results, and the novel enriched GO terms would, to some extent, further unveil the changes of specific biological functions among all the BRCA subtypes. The R code in this research is available at https://github.com/yxchspring/GOEGCN_BRCA_Subtypes.
3. Yan J, Zhang Z, Lin K, Yang F, Luo X. A hybrid scheme-based one-vs-all decision trees for multi-class classification tasks. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105922] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 10/24/2022]
4. Krzhizhanovskaya VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA, Brissos S, Teixeira J. Performance Analysis of Binarization Strategies for Multi-class Imbalanced Data Classification. Lecture Notes in Computer Science 2020. [PMCID: PMC7303687 DOI: 10.1007/978-3-030-50423-6_11] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Multi-class imbalanced classification tasks are characterized by the skewed distribution of examples among the classes and, usually, strong overlapping between class regions in the feature space. Furthermore, the goal of the final system is frequently to obtain very high precision for each of the concepts. All of these factors contribute to the complexity of the task and increase the difficulty of building a quality data model by learning algorithms. One way of addressing these challenges is the use of so-called binarization strategies, which decompose the multi-class problem into several binary tasks with lower complexity. Because of the different decomposition schemes used by each of those methods, some of them are considered to be better suited for handling imbalanced data than others. In this study, we focus on the well-known binary approaches, namely One-Vs-All, One-Vs-One, and Error-Correcting Output Codes, and their effectiveness in multi-class imbalanced data classification, with respect to the base classifiers and various aggregation schemes for each of the strategies. We compare the performance of these approaches and try to boost the performance of seemingly weaker methods by sampling algorithms. A detailed comparative experimental study of the considered methods, supported by statistical analysis, is presented. The results show the differences among various binarization strategies. We show how one can mitigate those differences using simple oversampling methods.
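To make the decomposition idea concrete, the following is a minimal sketch of how One-Vs-All and One-Vs-One turn one multi-class label vector into binary tasks, and how the two schemes can differ in per-task imbalance. It is not the study's implementation: the function names, the toy labels, and the `imbalance_ratio` helper are assumptions introduced for illustration, and Error-Correcting Output Codes are not sketched here.

```python
from itertools import combinations


def one_vs_all(labels):
    """Return {cls: binary_labels}; 1 marks the class, 0 marks every other class."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def one_vs_one(labels):
    """Return {(a, b): (indices, binary_labels)} using only samples of classes a and b."""
    classes = sorted(set(labels))
    tasks = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        tasks[(a, b)] = (idx, [1 if labels[i] == a else 0 for i in idx])
    return tasks


def imbalance_ratio(binary_labels):
    """Majority count divided by minority count (assumes both classes are present)."""
    pos = sum(binary_labels)
    neg = len(binary_labels) - pos
    return max(pos, neg) / min(pos, neg)


# Hypothetical skewed label vector: 6 / 3 / 1 examples per class.
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
ova = one_vs_all(labels)   # 3 binary tasks, each over all 10 samples
ovo = one_vs_one(labels)   # 3 binary tasks, each over two classes only
```

In this toy case the One-Vs-All task for the rarest class pits 1 example against 9, whereas the One-Vs-One task between the two smallest classes is only 3 against 1, which illustrates why the schemes may behave differently on imbalanced data.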
5. Yang L, Gao H, Liu Z, Tang L. Identification of Phage Virion Proteins by Using the g-gap Tripeptide Composition. Lett Org Chem 2019. [DOI: 10.2174/1570178615666180910112813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Phages are widely distributed in locations populated by bacterial hosts. Phage proteins can be divided into two main categories, that is, virion and non-virion proteins with different functions. In practice, phage virion proteins are mainly used to clarify the lysis mechanism of bacterial cells and to develop new antibacterial drugs. Accurate identification of phage virion proteins is therefore essential to understanding the phage lysis mechanism. Although some computational methods have focused on identifying virion proteins, the results are not satisfactory, which leaves room for improvement. In this study, a new sequence-based method was proposed to identify phage virion proteins using g-gap tripeptide composition. In this approach, the protein features were first extracted from the g-gap tripeptide composition. Subsequently, we obtained an optimal feature subset by performing incremental feature selection (IFS) with information gain. Finally, the support vector machine (SVM) was used as the classifier to discriminate virion proteins from non-virion proteins. In a 10-fold cross-validation test, our proposed method achieved an accuracy of 97.40% with an AUC of 0.9958, which outperforms state-of-the-art methods. The result reveals that our proposed method could be a promising method for phage virion protein identification.
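The feature extraction step can be sketched as follows. The abstract does not define the g-gap tripeptide composition, so this sketch assumes one plausible reading: the normalized frequency of residue triples in which consecutive members are separated by g intervening residues, giving 20^3 = 8000 features. The function name and the handling of non-standard residues are likewise assumptions, not the authors' code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_tripeptide_composition(sequence, g=0):
    """Return an 8000-dimensional frequency vector of g-gap residue triples.

    Assumed definition: a triple is (s[i], s[i + g + 1], s[i + 2g + 2]).
    With g = 0 this reduces to the ordinary (contiguous) tripeptide composition.
    """
    triples = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]
    counts = dict.fromkeys(triples, 0)
    step = g + 1
    total = 0
    for i in range(len(sequence) - 2 * step):
        key = sequence[i] + sequence[i + step] + sequence[i + 2 * step]
        if key in counts:  # windows containing non-standard residues are skipped
            counts[key] += 1
            total += 1
    if total == 0:  # sequence too short for this gap
        return [0.0] * len(triples)
    return [counts[t] / total for t in triples]


# Toy sequence: with g = 1 the only triples are "ADF" and "CEG".
vec = g_gap_tripeptide_composition("ACDEFG", g=1)
```

A vector like this would then be passed to feature selection and a classifier; those later stages are not shown here.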
Affiliation(s)
- Liangwei Yang: School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Gao: School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhen Liu: School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Lixia Tang: Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
6. Nath A, Subbiah K. The role of pertinently diversified and balanced training as well as testing data sets in achieving the true performance of classifiers in predicting the antifreeze proteins. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.07.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 11/24/2022]
7. Mukhopadhyay S, Das NK, Kurmi I, Pradhan A, Ghosh N, Panigrahi PK. Tissue multifractality and hidden Markov model based integrated framework for optimum precancer detection. Journal of Biomedical Optics 2017; 22:1-8. [PMID: 29052373 DOI: 10.1117/1.jbo.22.10.105005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/12/2017] [Accepted: 09/29/2017] [Indexed: 06/07/2023]
Abstract
We report the application of a hidden Markov model (HMM) on multifractal tissue optical properties derived via the Born approximation-based inverse light scattering method for effective discrimination of precancerous human cervical tissue sites from the normal ones. Two global fractal parameters, generalized Hurst exponent and the corresponding singularity spectrum width, computed by multifractal detrended fluctuation analysis (MFDFA), are used here as potential biomarkers. We develop a methodology that makes use of these multifractal parameters by integrating with different statistical classifiers like the HMM and support vector machine (SVM). It is shown that the MFDFA-HMM integrated model achieves significantly better discrimination between normal and different grades of cancer as compared to the MFDFA-SVM integrated model.
Affiliation(s)
- Nandan K Das: Indian Institute of Science Education and Research Kolkata, Mohanpur, West Bengal, India; Nanyang Technological University, School of Chemical and Biomedical Engineering, Singapore
- Indrajit Kurmi: Indian Institute of Technology Kanpur, Department of Physics, Kanpur, Uttar Pradesh, India
- Asima Pradhan: Indian Institute of Technology Kanpur, Department of Physics, Kanpur, Uttar Pradesh, India; Indian Institute of Technology Kanpur, Center for Lasers and Photonics, Kanpur, West Bengal, India
- Nirmalya Ghosh: Indian Institute of Science Education and Research Kolkata, Mohanpur, West Bengal, India
- Prasanta K Panigrahi: Indian Institute of Science Education and Research Kolkata, Mohanpur, West Bengal, India
8. Zararsız G, Goksuluk D, Korkmaz S, Eldem V, Zararsiz GE, Duru IP, Ozturk A. A comprehensive simulation study on classification of RNA-Seq data. PLoS One 2017; 12:e0182507. [PMID: 28832679 PMCID: PMC5568128 DOI: 10.1371/journal.pone.0182507] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 03/18/2017] [Accepted: 07/19/2017] [Indexed: 02/02/2023]
Abstract
RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging powerful method for diagnosis, disease classification and monitoring at the molecular level, as well as for providing potential markers of diseases. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data since they violate both data structure and distributional assumptions. However, it is possible to apply these algorithms with appropriate modifications to RNA-Seq data. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method on model performances. A comprehensive simulation study is conducted and the results are compared with the results of two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and differential-expression rate, and decreasing the dispersion parameter and number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion. We conclude that, as a count-based classifier, the power transformed PLDA and, as a microarray-based classifier, vst or rlog transformed RF and SVM classifiers may be a good choice for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
Affiliation(s)
- Gökmen Zararsız: Turcosa Analytics Solutions Ltd Co, Erciyes Teknopark, 38039, Kayseri, Turkey; Department of Biostatistics, Erciyes University, Kayseri, Turkey
- Dincer Goksuluk: Turcosa Analytics Solutions Ltd Co, Erciyes Teknopark, 38039, Kayseri, Turkey; Department of Biostatistics, Hacettepe University, Ankara, Turkey
- Selcuk Korkmaz: Turcosa Analytics Solutions Ltd Co, Erciyes Teknopark, 38039, Kayseri, Turkey; Department of Biostatistics, Hacettepe University, Ankara, Turkey
- Vahap Eldem: Department of Biology, Istanbul University, Istanbul, Turkey
- Ahmet Ozturk: Department of Biostatistics, Erciyes University, Kayseri, Turkey
9. Gitoee A, Faridi A, France J. Mathematical models for response to amino acids: estimating the response of broiler chickens to branched-chain amino acids using support vector regression and neural network models. Neural Comput Appl 2017. [DOI: 10.1007/s00521-017-2842-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/01/2022]
10. Rojas-Moraleda R, Valous NA, Gowen A, Esquerre C, Härtel S, Salinas L, O’Donnell C. A frame-based ANN for classification of hyperspectral images: assessment of mechanical damage in mushrooms. Neural Comput Appl 2016. [DOI: 10.1007/s00521-016-2376-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 10/21/2022]
11. Ranganarayanan P, Thanigesan N, Ananth V, Jayaraman VK, Ramakrishnan V. Identification of Glucose-Binding Pockets in Human Serum Albumin Using Support Vector Machine and Molecular Dynamics Simulations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:148-157. [PMID: 26886739 DOI: 10.1109/tcbb.2015.2415806] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
Human Serum Albumin (HSA) has been suggested as an alternate biomarker to the existing Hemoglobin-A1c (HbA1c) marker for glycemic monitoring. Development and usage of HSA as an alternate biomarker requires the identification of glycation sites, or equivalently, glucose-binding pockets. In this work, we combine molecular dynamics simulations of HSA and the state-of-the-art machine learning method Support Vector Machine (SVM) to predict glucose-binding pockets in HSA. SVM uses the three-dimensional arrangement of atoms and their chemical properties to predict the glucose-binding ability of a pocket. Feature selection reveals that the arrangement of atoms and their chemical properties within the first 4 Å from the centroid of the pocket play an important role in the binding of glucose. With a 10-fold cross-validation accuracy of 84 percent, our SVM model reveals seven new potential glucose-binding sites in HSA, of which two are exposed only during the dynamics of HSA. The predictions are further corroborated using docking studies. These findings can complement studies directed towards the development of HSA as an alternate biomarker for glycemic monitoring.
12. Li P, Hu Y, Yi J, Li J, Yang J, Wang J. Identification of potential biomarkers to differentially diagnose solid pseudopapillary tumors and pancreatic malignancies via a gene regulatory network. J Transl Med 2015; 13:361. [PMID: 26578390 PMCID: PMC4650856 DOI: 10.1186/s12967-015-0718-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 07/30/2015] [Accepted: 10/31/2015] [Indexed: 01/18/2023]
Abstract
Background: Solid pseudopapillary neoplasms (SPN) are pancreatic tumors with low malignant potential and good prognosis. However, differential diagnosis between SPN and pancreatic malignancies including pancreatic neuroendocrine tumor (PanNET) and ductal adenocarcinoma (PDAC) is difficult. This study tried to identify candidate biomarkers for the distinction between SPN and the two malignant pancreatic tumors by examining the gene regulatory network of SPN.
Methods: The gene regulatory network for SPN was constructed by a co-expression model. Genes that have been reported to be correlated with SPN were used as clues to hunt for more SPN-related genes in the network according to a shortest path approach. By means of the K-nearest neighbor (KNN) classifier evaluated by the jackknife test, sets of genes to distinguish SPN and malignant pancreatic tumors were determined.
Results: We took a new strategy to identify candidate biomarkers for differentiating SPN from the two malignant pancreatic tumors PanNET and PDAC by analyzing shortest paths among SPN-related genes in the gene regulatory network. Forty-three new SPN-relevant genes were discovered, among which hsa-miR-194 and hsa-miR-7, along with 7 transcription factors (TFs) such as SOX11, SMAD3 and SOX4, could correctly differentiate SPN from PanNET, while hsa-miR-204 and 4 TFs such as SOX9, TCF7 and PPARD were demonstrated as potential markers for SPN versus PDAC. Fourteen genes were demonstrated to serve as candidate biomarkers for distinguishing SPN from PanNET and PDAC when considering them together as malignant pancreatic tumors.
Conclusion: This study provides new candidate genes related to SPN and potential biomarkers to differentiate SPN from PanNET and PDAC, which may help to diagnose patients with SPN in clinical settings. Furthermore, candidate biomarkers such as SOX11 and hsa-miR-204, which could cause cell proliferation but inhibit invasion or metastasis, may be of importance in understanding the molecular mechanism of pancreatic oncogenesis and could be possible therapeutic targets for malignant pancreatic tumors.
Electronic supplementary material: The online version of this article (doi:10.1186/s12967-015-0718-3) contains supplementary material, which is available to authorized users.
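The evaluation scheme named in the Methods, a KNN classifier assessed by the jackknife (leave-one-out) test, can be sketched as below. This is a minimal illustration and not the study's code: the Euclidean distance, the value of k, the function names, and the two-feature toy values are all assumptions, and the numbers are made up purely so the example runs.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training samples."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def jackknife_accuracy(X, y, k=3):
    """Leave each sample out in turn, train on the rest, and score the held-out sample."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)


# Made-up toy feature vectors (not data from the study).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]
y = ["SPN", "SPN", "SPN", "PDAC", "PDAC", "PDAC"]
accuracy = jackknife_accuracy(X, y, k=3)
```

Under this scheme a candidate gene set could be scored by restricting the feature vectors to those genes and comparing the resulting jackknife accuracies.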
Affiliation(s)
- Pengping Li: State Key Laboratory of Pharmaceutical Biotechnology, Collaborative Innovation Center of Chemistry for Life Sciences, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of life sciences, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China.
- Yuebing Hu: Department of Neurosurgery, Jinling Hospital, School of Medicine, Nanjing University, 305 East Zhongshan Road, Nanjing, 210002, China.
- Jiao Yi: State Key Laboratory of Pharmaceutical Biotechnology, Collaborative Innovation Center of Chemistry for Life Sciences, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of life sciences, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China.
- Jie Li: State Key Laboratory of Pharmaceutical Biotechnology, Collaborative Innovation Center of Chemistry for Life Sciences, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of life sciences, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China.
- Jie Yang: State Key Laboratory of Pharmaceutical Biotechnology, Collaborative Innovation Center of Chemistry for Life Sciences, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of life sciences, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China.
- Jin Wang: State Key Laboratory of Pharmaceutical Biotechnology, Collaborative Innovation Center of Chemistry for Life Sciences, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, NJU Advanced Institute for Life Sciences (NAILS), School of life sciences, Nanjing University, 163 Xianlin Road, Nanjing, 210023, China.
13. Reboiro-Jato M, Díaz F, Glez-Peña D, Fdez-Riverola F. A novel ensemble of classifiers that use biological relevant gene sets for microarray classification. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2014.01.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/30/2023]
14. He L, Yang X, Hao Z. An adaptive class pairwise dimensionality reduction algorithm. Neural Comput Appl 2013. [DOI: 10.1007/s00521-012-0897-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
15. Garcia-Manteiga JM. Data Analysis and Interpretation in Metabolomics. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Metabolomics represents the new ‘omics’ approach of the functional genomics era. It consists of the identification and quantification of all small molecules, namely metabolites, in a given biological system. While metabolomics refers to the analysis of any possible biological system, metabonomics is specifically applied to disease and physiopathological situations. The data collected within these approaches is highly integrative of the other higher levels and is hence amenable to exploration from a top-down systems biology point of view. The aim of this chapter is to give a global view of the state of the art in metabolomics, describing the two analytical techniques usually used to generate this kind of data: nuclear magnetic resonance (NMR) and mass spectrometry. In addition, the author focuses on the different data analysis tools that can be applied to such studies to extract information, with special interest in attempts to integrate metabolomics with other ‘omics’ approaches and its relevance in systems biology modeling.
16. Analyzing the presence of noise in multi-class problems: alleviating its influence with the One-vs-One decomposition. Knowl Inf Syst 2012. [DOI: 10.1007/s10115-012-0570-1] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Indexed: 10/27/2022]
17. Khan A, Majid A, Hayat M. CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Comput Biol Chem 2011; 35:218-29. [PMID: 21864791 DOI: 10.1016/j.compbiolchem.2011.05.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 02/23/2011] [Revised: 05/17/2011] [Accepted: 05/18/2011] [Indexed: 12/18/2022]
Abstract
Precise information about protein locations in a cell facilitates the understanding of the function of a protein and its interaction in the cellular environment. This information further helps in the study of specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on fusion of individual classifiers. The proposed approach utilizes features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of individual learning mechanisms such as support vector machine, nearest neighbor, probabilistic neural network, and covariant discriminant, which are trained using PseAAC based features, is first analyzed. Classifiers are developed using the same learning mechanism but trained on PseAAC based feature spaces of varying dimensions. These classifiers are combined through a voting strategy, and an improvement in prediction performance is achieved. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both the DC based feature space and PseAAC based feature spaces of varying dimensions. The predictive performance of the proposed CE-PLoc is evaluated for two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47 and 83.99% are obtained for the 12 and 14 subcellular locations datasets, respectively. In the independent dataset test, prediction accuracies are 87.04 and 87.33% for the 12 and 14 class datasets, respectively.
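The fusion step, combining classifiers "through a voting strategy", can be illustrated with a plain majority vote. The abstract does not specify the voting rule, so the unweighted vote, the tie-breaking rule, the function names, and the example location labels below are all assumptions made for this sketch rather than the CE-PLoc implementation.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse the labels predicted by several base classifiers for one protein."""
    counts = Counter(predictions)
    top = max(counts.values())
    # Tie-break (an assumption): the earliest-listed classifier among tied labels wins.
    for label in predictions:
        if counts[label] == top:
            return label


def fuse(per_classifier_predictions):
    """Fuse equal-length prediction lists, one list per base classifier."""
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_predictions)]


# Hypothetical outputs of three base classifiers for three proteins.
fused = fuse([
    ["nucleus", "cytoplasm", "membrane"],
    ["nucleus", "membrane", "membrane"],
    ["cytoplasm", "cytoplasm", "nucleus"],
])
```

Each inner list here would come from a classifier trained on a different feature space (for example, DC or PseAAC of a given dimension); only the combination step is shown.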
Affiliation(s)
- Asifullah Khan: Department of Information and Computer Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan.
18. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 2010; 273:236-47. [PMID: 21168420 PMCID: PMC7125570 DOI: 10.1016/j.jtbi.2010.12.024] [Citation(s) in RCA: 966] [Impact Index Per Article: 69.0] [Received: 10/21/2010] [Revised: 12/08/2010] [Accepted: 12/13/2010] [Indexed: 11/29/2022]
Abstract
With the accomplishment of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. This unbalanced situation, which has critically limited our ability to utilize newly discovered proteins in a timely manner for basic research and drug development, has called for developing computational methods or high-throughput automated tools for rapidly and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. During the last two decades or so, many methods in this regard have been established in the hope of bridging such a gap. In the course of developing these methods, the following often needed to be considered: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications as well as its recent development, particularly in how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
Affiliation(s)
- Kuo-Chen Chou: Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA.
19. Identification and optimization of classifier genes from multi-class earthworm microarray dataset. PLoS One 2010; 5:e13715. [PMID: 21060837 PMCID: PMC2965664 DOI: 10.1371/journal.pone.0013715] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 06/28/2010] [Accepted: 10/06/2010] [Indexed: 11/19/2022]
Abstract
Monitoring, assessment and prediction of the environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with the explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes and have used it to profile gene expression in 248 earthworms exposed to TNT, RDX or neither. We assembled a new machine learning pipeline consisting of several well-established feature filtering/selection and classification techniques to analyze the 248-array dataset in order to construct classifier models that can separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. First, a total of 869 genes differentially expressed in response to TNT or RDX exposure were identified using a univariate statistical algorithm of class comparison. Then, decision tree-based algorithms were applied to select a subset of 354 classifier genes, which were ranked by their overall weight of significance. A multiclass support vector machine (MC-SVM) method and an unsupervised K-means clustering method were applied to independently refine the classifier, producing smaller subsets of 39 and 30 classifier genes, respectively, with 11 common genes being potential biomarkers. The combined 58 genes were considered the refined subset and used to build MC-SVM and clustering models with classification accuracies of 83.5% and 56.9%, respectively. This study demonstrates that the machine learning approach can be used to identify and optimize a small subset of classifier/biomarker genes from high-dimensional datasets and generate classification models of acceptable precision for multiple classes.
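The gene counts reported in this abstract are consistent with inclusion-exclusion, assuming that the "combined 58 genes" denotes the union of the two refined subsets. The symbols A (the 39 genes from the MC-SVM refinement) and B (the 30 genes from the clustering refinement) are labels introduced here for illustration:

```latex
\[
|A \cup B| = |A| + |B| - |A \cap B| = 39 + 30 - 11 = 58
\]
```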