1
Zhang Q, Zheng W, Song Z, Zhang Q, Yang L, Wu J, Lin J, Xu G, Yu H. Machine Learning Enables Prediction of Pyrrolysyl-tRNA Synthetase Substrate Specificity. ACS Synth Biol 2023; 12:2403-2417. [PMID: 37486975 DOI: 10.1021/acssynbio.3c00225] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 07/26/2023]
Abstract
Knowledge about the substrate scope of a given enzyme is informative for elucidating biochemical pathways and for expanding applications of the enzyme. However, no general methods are available to accurately predict the substrate specificity of an enzyme. Pyrrolysyl-tRNA synthetase (PylRS) is a powerful tool for incorporating various noncanonical amino acids (NCAAs) into proteins, which makes it possible to probe, image, rationally engineer, and evolve protein structure and function. However, the incorporation of a new NCAA typically requires the selection of large libraries of PylRS with randomized mutations at active sites, and this process requires multiple rounds of selection for each new substrate. Therefore, a single aminoacyl-tRNA synthetase with broad substrate promiscuity is ideal to facilitate widespread applications of the genetic NCAA incorporation technique. Herein, machine learning models were developed to predict the substrate specificity of PylRS to accept novel NCAAs that could be incorporated into proteins by three PylRS mutants. The models were built from a training set of 285 unique enzyme-substrate pairs of three PylRS mutants (IFRS, BtaRS, and MFRS) against 95 NCAAs. The best BaggingTree (BT) model was then used to virtually screen an NCAA library containing 1474 phenylalanine, tyrosine, tryptophan, and alanine analogues, and 156 NCAAs were predicted to be accepted by at least one of the three PylRS mutants. Then, 27 NCAAs, including 24 positive and 3 negative substrates, were experimentally tested for their activities. Of the 24 positive substrates, 20 showed weak or strong activity and were accepted by at least one PylRS mutant, and 11 of these NCAAs had never been reported to be incorporated into proteins before. The three negative substrates did not show any activity. Experimental results suggested that the BT model provides a three-class classification accuracy of 0.69 and a binary classification accuracy of 0.86.
This study expanded the substrate scope of three PylRS variants and provided a framework for developing machine learning models to predict substrate specificity of other PylRS variants.
Affiliation(s)
- Qunfeng Zhang
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Wenlong Zheng
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
- Zhongdi Song
  - Key Laboratory of Pollution Exposure and Health Intervention of Zhejiang Province, Interdisciplinary Research Academy, Zhejiang Shuren University, Hangzhou 310015, China
- Qiang Zhang
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Lirong Yang
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
- Jianping Wu
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
- Jianping Lin
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Gang Xu
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Haoran Yu
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
2
Murad T, Ali S, Patterson M. Exploring the Potential of GANs in Biological Sequence Analysis. Biology 2023; 12:854. [PMID: 37372139 DOI: 10.3390/biology12060854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 06/03/2023] [Accepted: 06/12/2023] [Indexed: 06/29/2023]
Abstract
Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, and in building prevention mechanisms to limit their spread and impact, as viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for biological sequence analysis that can effectively analyze the functions and structures of the sequences. However, these ML-based methods face challenges with the data imbalance generally associated with biological sequence datasets, which hinders their performance. Although various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handle the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are utilized to generate synthetic data that closely resemble real data; these generated data can thus be employed to enhance the ML models' performance by eliminating the class imbalance problem for biological sequence analysis. We perform four distinct classification tasks using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host), and our results illustrate that GANs can improve the overall classification performance.
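The remark that SMOTE relies on local information can be made concrete with a small sketch. The code below is not the authors' implementation; it is a minimal pure-Python illustration of SMOTE-style oversampling, and the function name `smote_sample`, the toy data, and the parameter values are all assumptions chosen for this example. Each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours, so it never leaves the local neighbourhood of existing data.

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (squared Euclidean distance). Hypothetical helper."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy two-dimensional minority class (illustrative values only).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sample(minority)
```

Because every synthetic point is a convex combination of two existing minority samples, the generated data stay inside the bounding box of the original minority class, which is the local behaviour that a distribution-level generator such as a GAN is meant to go beyond.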
Affiliation(s)
- Taslim Murad
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Sarwan Ali
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
3
Szep M, Pintican R, Boca B, Perja A, Duma M, Feier D, Epure F, Fetica B, Eniu D, Roman A, Dudea SM, Chiorean A. Whole-Tumor ADC Texture Analysis Is Able to Predict Breast Cancer Receptor Status. Diagnostics (Basel) 2023; 13:1414. [PMID: 37189515 DOI: 10.3390/diagnostics13081414] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/12/2023] [Revised: 04/04/2023] [Accepted: 04/11/2023] [Indexed: 05/17/2023]
Abstract
There are different breast cancer molecular subtypes with differences in incidence, treatment response and outcome. They are roughly divided into estrogen and progesterone receptor (ER and PR) negative and positive cancers. In this retrospective study, we included 185 patients augmented with 25 SMOTE patients and divided them into two groups: the training group consisted of 150 patients and the validation cohort consisted of 60 patients. Tumors were manually delineated and whole-volume tumor segmentation was used to extract first-order radiomic features. The ADC-based radiomics model reached an AUC of 0.81 in the training cohort and was confirmed in the validation set, which yielded an AUC of 0.93, in differentiating ER/PR positive from ER/PR negative status. We also tested a combined model using radiomics data together with ki67% proliferation index and histological grade, and obtained a higher AUC of 0.93, which was also confirmed in the validation group. In conclusion, whole-volume ADC texture analysis is able to predict hormonal status in breast cancer masses.
Affiliation(s)
- Madalina Szep
  - Department of Radiology, "Iuliu Hatieganu" University of Medicine and Pharmacy, 400347 Cluj-Napoca, Romania
- Roxana Pintican
  - Department of Radiology, "Iuliu Hatieganu" University of Medicine and Pharmacy, 400347 Cluj-Napoca, Romania
- Bianca Boca
  - Department of Medical Imaging, "Iuliu Hatieganu" University of Medicine and Pharmacy, 400347 Cluj-Napoca, Romania
- Andra Perja
  - Department of Radiology and Medical Imaging, County Clinical Emergency Hospital, 400347 Cluj-Napoca, Romania
- Diana Feier
  - Department of Radiology, "Iuliu Hatieganu" University of Medicine and Pharmacy, 400347 Cluj-Napoca, Romania
  - Medimages Breast Center, 400462 Cluj-Napoca, Romania
- Flavia Epure
  - Medical Imaging Department, Medisprof Cancer Center, 400641 Cluj-Napoca, Romania
- Bogdan Fetica
  - Department of Pathology, "Ion Chiricuță" Oncology Institute, 400015 Cluj-Napoca, Romania
- Dan Eniu
  - Department of Surgical Oncology, "Iuliu Hatieganu" University of Medicine and Pharmacy, 400347 Cluj-Napoca, Romania
- Andrei Roman
  - Department of Radiology, "Ion Chiricuță" Oncology Institute, 400015 Cluj-Napoca, Romania
- Sorin Marian Dudea
  - Department of Radiology, "Iuliu Hatieganu" University of Medicine and Pharmacy, 400347 Cluj-Napoca, Romania
4
Machine learning to improve the interpretation of intercalating dye-based quantitative PCR results. Sci Rep 2022; 12:16445. [PMID: 36180590 PMCID: PMC9525288 DOI: 10.1038/s41598-022-21010-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Accepted: 09/21/2022] [Indexed: 11/16/2022]
Abstract
This study aimed to evaluate the contribution of a machine learning (ML) approach to the interpretation of intercalating dye-based quantitative PCR (IDqPCR) signals applied to the diagnosis of mucormycosis. The ML-based classification approach was applied to 734 IDqPCR results categorized as positive (n = 74) or negative (n = 660) for mucormycosis after combining “visual reading” of the amplification and denaturation curves with clinical, radiological and microbiological criteria. Fourteen features were calculated to characterize the curves and fed into several pipelines including four ML algorithms. An initial subset (n = 345) was used to design the classifiers. The classifier predictions were combined by majority voting to estimate the performance of 48 meta-classifiers on an external dataset (n = 389). The visual reading returned 57 (7.7%), 568 (77.4%) and 109 (14.8%) positive, negative and doubtful results, respectively. The Kappa coefficients of all the meta-classifiers were greater than 0.83 for the classification of IDqPCR results on the external dataset. Among these meta-classifiers, 6 exhibited Kappa coefficients of 1. The proposed ML-based approach allows a rigorous interpretation of IDqPCR curves, making the diagnosis of mucormycosis available to non-specialists in molecular diagnosis. A free online application was developed to classify IDqPCR from the raw data of the thermal cycler output (http://gepamy-sat.asso.st/).
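The two evaluation steps named in this abstract, combining classifier outputs by majority voting and scoring agreement with Cohen's Kappa, can be sketched as follows. This is a minimal pure-Python illustration, not the study's pipeline: the function names, the three toy classifiers, and the labels are assumptions made for the example (an odd number of voters is used so that ties cannot occur).

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one label list per classifier; returns the most
    frequent label for each sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: (observed agreement - chance agreement) / (1 - chance agreement).
    Undefined when chance agreement equals 1 (a single class everywhere)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy reference labels (1 = positive, 0 = negative) and three hypothetical classifiers.
y_true = [1, 0, 0, 1, 0, 0]
clf_a = [1, 0, 0, 1, 0, 1]
clf_b = [1, 0, 1, 1, 0, 0]
clf_c = [0, 0, 0, 1, 0, 0]

voted = majority_vote([clf_a, clf_b, clf_c])
kappa_single = cohen_kappa(y_true, clf_a)
kappa_voted = cohen_kappa(y_true, voted)
```

In this toy case each classifier makes one error on a different sample, so the vote recovers the reference labels and its Kappa is higher than that of any single voter. Kappa is preferred over raw accuracy here because the classes are strongly imbalanced (74 positive versus 660 negative results in the study).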
5
Choi HS, Jung D, Kim S, Yoon S. Imbalanced Data Classification via Cooperative Interaction Between Classifier and Generator. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:3343-3356. [PMID: 33531305 DOI: 10.1109/tnnls.2021.3052243] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
Classifiers learned from imbalanced data can be strongly biased toward the majority class. To address this issue, several methods have been proposed using generative adversarial networks (GANs). Existing GAN-based methods, however, do not effectively utilize the relationship between a classifier and a generator. This article proposes a novel three-player structure consisting of a discriminator, a generator, and a classifier, along with decision boundary regularization. Our method is distinctive in that the generator is trained in cooperation with the classifier to provide minority samples that gradually expand the minority decision region, improving performance for imbalanced data classification. The proposed method outperforms the existing methods on real data sets as well as synthetic imbalanced data sets.
6
Chen W, Yang K, Yu Z, Zhang W. Double-kernel based class-specific broad learning system for multiclass imbalance learning. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
7
Zhou L, Tang Y, Yan G. A New Estimation Method for the Biological Interaction Predicting Problems. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1415-1423. [PMID: 33406043 DOI: 10.1109/tcbb.2021.3049642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Over the past decades, computational methods have been developed to predict various interactions in biological problems. These methods usually treat the prediction problem as a semi-supervised problem or a positive-unlabeled (PU) learning problem. Researchers have focused on the prediction of unlabeled samples, hoping to find novel interactions in the datasets they collected. However, most of the computational methods could only predict a small proportion of undiscovered interactions, and the total number was unknown. In this paper, we developed an estimation method with deep learning to calculate the number of undiscovered interactions in the unlabeled samples, derived its asymptotic interval estimation, and applied it successfully to a compound synergism dataset, a drug-target interaction (DTI) dataset and a microRNA-disease interaction dataset. Moreover, this method could reveal which dataset contained more undiscovered interactions and could serve as guidance for experimental validation. Furthermore, we compared our method with some mixture proportion estimators and demonstrated the efficacy of our method. Finally, we proved that AUC and AUPR are related to the number of undiscovered interactions, which can be regarded as another evaluation indicator for computational methods.
8
Makond B, Wang KJ, Wang KM. Benchmarking prognosis methods for survivability - A case study for patients with contingent primary cancers. Comput Biol Med 2021; 138:104888. [PMID: 34610552 DOI: 10.1016/j.compbiomed.2021.104888] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2021] [Accepted: 09/17/2021] [Indexed: 11/18/2022]
Abstract
BACKGROUND: An increasing number of patients with a first primary cancer are diagnosed with a second primary cancer, but prognosis methods to predict the survivability of a patient with multiple primary cancers have not been fully benchmarked. METHODS: This study investigated the five-year survivability prognosis performance of six machine learning approaches: artificial neural network, decision tree (DT), logistic regression, support vector machine, naïve Bayes (NB), and Bayesian network (BN). The synthetic minority over-sampling technique (SMOTE) was used to address the class imbalance problem, and a nationwide cancer patient database containing 7,845 subjects in Taiwan was used as a sample source. Ten primary and secondary cancers and their key variables affecting the survivability of the patients were identified. RESULTS: All the models using SMOTE improved sensitivity and specificity significantly. NB had the highest performance in terms of accuracy and specificity, whereas BN had the highest performance in terms of sensitivity. Further, NB, BN, and DT outperformed the others in computational time and power of knowledge representation. CONCLUSIONS: Selecting the appropriate prognosis models to predict survivability of patients with two contingent primary cancers can aid precise prediction and can support appropriate treatment advice.
Affiliation(s)
- Bunjira Makond
  - Faculty of Commerce and Management, Prince of Songkla University, Trang, Thailand.
- Kung-Jeng Wang
  - Department of Industrial Management, National Taiwan University of Science and Technology, Taipei, 106, ROC, Taiwan.
- Kung-Min Wang
  - Department of Surgery, Shin-Kong Wu Ho-Su Memorial Hospital, Taipei, R.O.C, Taiwan.
9
Zhang Q, Wang D, Han K, Huang DS. Predicting TF-DNA Binding Motifs from ChIP-seq Datasets Using the Bag-Based Classifier Combined With a Multi-Fold Learning Scheme. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1743-1751. [PMID: 32946398 DOI: 10.1109/tcbb.2020.3025007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]
Abstract
The rapid development of high-throughput sequencing technology provides unique opportunities for studying transcription factor binding sites, but also brings new computational challenges. Recently, a series of discriminative motif discovery (DMD) methods have been proposed and offer promising solutions for addressing these challenges. However, because of the huge computation cost, most of them have to choose approximate schemes that either sacrifice the accuracy of motif representation or tune motif parameters indirectly. In this paper, we propose a bag-based classifier combined with a multi-fold learning scheme (BCMF) to discover motifs from ChIP-seq datasets. First, BCMF naturally formulates input sequences as labeled bags. Then, a bag-based classifier, combined with a bag feature extraction strategy, is applied to construct the objective function, and a multi-fold learning scheme is used to solve it. Compared with the existing DMD tools, BCMF features three improvements: 1) learning the position weight matrix (PWM) directly in a continuous space; 2) representing a positive bag with a feature fused from its k "most positive" patterns; and 3) applying a more advanced learning scheme. The experimental results on 134 ChIP-seq datasets show that BCMF substantially outperforms existing DMD methods (including DREME, HOMER, XXmotif, motifRG, EDCOD and our previous work).
10
Moghadas-Dastjerdi H, Rahman SETH, Sannachi L, Wright FC, Gandhi S, Trudeau ME, Sadeghi-Naini A, Czarnota GJ. Prediction of chemotherapy response in breast cancer patients at pre-treatment using second derivative texture of CT images and machine learning. Transl Oncol 2021; 14:101183. [PMID: 34293685 PMCID: PMC8319580 DOI: 10.1016/j.tranon.2021.101183] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/25/2021] [Revised: 07/07/2021] [Accepted: 07/13/2021] [Indexed: 01/01/2023]
Abstract
Textural and second derivative textural features of CT images can be used in conjunction with machine learning models to predict breast cancer response to chemotherapy prior to the start of treatment. The proposed predictive model separates the patients at pre-treatment into two cohorts (responders/non-responders) with significantly different survival. The proposed methodology is a step forward towards the precision oncology paradigm for breast cancer patients.
Although neoadjuvant chemotherapy (NAC) is a crucial component of treatment for locally advanced breast cancer (LABC), only about 70% of patients respond to it. Effective adjustment of NAC for individual patients can significantly improve survival rates of those resistant to standard regimens. Thus, the early prediction of NAC outcome is of great importance in facilitating a personalized paradigm for breast cancer therapeutics. In this study, quantitative computed tomography (qCT) parametric imaging in conjunction with machine learning techniques was investigated to predict LABC tumor response to NAC. Textural and second derivative textural (SDT) features of CT images of 72 patients diagnosed with LABC were analysed before the initiation of NAC to quantify intra-tumor heterogeneity. These quantitative features were processed through a correlation-based feature reduction followed by a sequential feature selection with a bootstrap 0.632+ area under the receiver operating characteristic (ROC) curve (AUC0.632+) criterion. The best feature subset consisted of a combination of one textural and three SDT features. Using these features, an AdaBoost decision tree could predict the patient response with a cross-validated AUC0.632+, accuracy, sensitivity and specificity of 0.88, 85%, 88% and 75%, respectively. This study demonstrates, for the first time, that a combination of textural and SDT features of CT images can be used to predict breast cancer response to NAC prior to the start of treatment, which can potentially facilitate early therapy adjustments.
Affiliation(s)
- Hadi Moghadas-Dastjerdi
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Shan-E-Tallat Hira Rahman
  - Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Faculty of Engineering, University of Waterloo, Waterloo, ON, Canada
- Lakshmanan Sannachi
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Frances C Wright
  - Surgical Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, and Department of Surgery, University of Toronto, Toronto, ON, Canada
- Sonal Gandhi
  - Division of Medical Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, and Department of Medicine, University of Toronto, Toronto, ON, Canada
- Maureen E Trudeau
  - Division of Medical Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, and Department of Medicine, University of Toronto, Toronto, ON, Canada
- Ali Sadeghi-Naini
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, ON, Canada.
- Gregory J Czarnota
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada.
11
Tian H, Jiang X, Tao P. PASSer: Prediction of Allosteric Sites Server. Machine Learning: Science and Technology 2021; 2. [PMID: 34396127 DOI: 10.1088/2632-2153/abe6d6] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Indexed: 11/12/2022]
Abstract
Allostery is considered important in regulating protein activity. Drug development depends on the understanding of allosteric mechanisms, especially the identification of allosteric sites, which is a prerequisite in drug discovery and design. Many computational methods have been developed for allosteric site prediction using pocket features and protein dynamics. Here, we present an ensemble learning method, consisting of eXtreme gradient boosting (XGBoost) and a graph convolutional neural network (GCNN), to predict allosteric sites. Our model can learn physical properties and topology without any prior information, and shows good performance under multiple indicators. Prediction results showed that 84.9% of allosteric pockets in the test set appeared in the top 3 positions. The PASSer: Protein Allosteric Sites Server (https://passer.smu.edu), along with a command line interface (CLI, https://github.com/smutaogroup/passerCLI), provides insights for further analysis in drug discovery.
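The "top 3 positions" figure is a top-k hit rate: for each protein, candidate pockets are ranked by predicted score and the prediction counts as a hit when the known allosteric pocket is among the k best. The sketch below is a minimal illustration under stated assumptions; it is not taken from PASSer, and the function name, the data layout (a score list plus the index of the true pocket) and the toy scores are all hypothetical.

```python
def top_k_hit_rate(proteins, k=3):
    """proteins: list of (scores, true_idx) pairs, where scores holds one
    predicted score per candidate pocket and true_idx is the position of
    the known allosteric pocket. Returns the fraction of proteins whose
    true pocket ranks within the k highest-scoring pockets."""
    hits = 0
    for scores, true_idx in proteins:
        # Pocket indices ordered from highest to lowest predicted score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(proteins)

# Toy predictions for four hypothetical proteins.
proteins = [
    ([0.9, 0.2, 0.4, 0.1], 0),       # true pocket ranked 1st
    ([0.3, 0.8, 0.5, 0.6, 0.1], 0),  # true pocket ranked 4th
    ([0.1, 0.7, 0.6, 0.65], 2),      # true pocket ranked 3rd
    ([0.5, 0.4, 0.3, 0.2, 0.9], 3),  # true pocket ranked 5th
]
rate_top3 = top_k_hit_rate(proteins, k=3)
rate_top1 = top_k_hit_rate(proteins, k=1)
```

Reporting the rate at several values of k shows how quickly the true site surfaces in the ranking, which matters when only a few candidate pockets can be followed up experimentally.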
Affiliation(s)
- Hao Tian
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas, United States of America
- Xi Jiang
  - Department of Statistical Science, Southern Methodist University, Dallas, Texas, United States of America
- Peng Tao
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas, United States of America
12
Chen S, Gan M, Lv H, Jiang R. DeepCAPE: A Deep Convolutional Neural Network for the Accurate Prediction of Enhancers. Genomics, Proteomics & Bioinformatics 2021; 19:565-577. [PMID: 33581335 PMCID: PMC9040020 DOI: 10.1016/j.gpb.2019.04.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/24/2018] [Revised: 03/15/2019] [Accepted: 04/29/2019] [Indexed: 12/12/2022]
Abstract
The establishment of a landscape of enhancers across human cells is crucial to deciphering the mechanisms of gene regulation, cell differentiation, and disease development. High-throughput experimental approaches, which have successfully reported enhancers in typical cell lines, are still too costly and time-consuming for the systematic identification of enhancers specific to different cell lines. Existing computational methods, which predict regulatory elements purely from DNA sequences, lack the power of cell line-specific screening. Recent studies have suggested that the chromatin accessibility of a DNA segment is closely related to its potential function in regulation, and thus may provide useful information for identifying regulatory elements. Motivated by this understanding, we integrate DNA sequences and chromatin accessibility data to accurately predict enhancers in a cell line-specific manner. We propose DeepCAPE, a deep convolutional neural network that predicts enhancers via the integration of DNA sequences and DNase-seq data. Benefitting from a well-designed feature extraction mechanism and skip connection strategy, our model not only consistently outperforms existing methods in the imbalanced classification of cell line-specific enhancers against background sequences, but also has the ability to self-adapt to different sizes of datasets. In addition, with the adoption of an auto-encoder, our model is capable of making cross-cell line predictions. We further visualize kernels of the first convolutional layer and show the match between identified sequence signatures and known motifs. We finally demonstrate the potential ability of our model to explain functional implications of putative disease-associated genetic variants and to discriminate disease-related enhancers. The source code and detailed tutorial of DeepCAPE are freely available at https://github.com/ShengquanChen/DeepCAPE.
Affiliation(s)
- Shengquan Chen
  - MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Mingxin Gan
  - Department of Management Science and Engineering, School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
- Hairong Lv
  - MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Rui Jiang
  - MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China.
13
Akhter N, Chennupati G, Djidjev H, Shehu A. Decoy selection for protein structure prediction via extreme gradient boosting and ranking. BMC Bioinformatics 2020; 21:189. [PMID: 33297949 PMCID: PMC7724862 DOI: 10.1186/s12859-020-3523-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/15/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022]
Abstract
Background: Identifying one or more biologically active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme lack of balance between positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection despite some issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promise. However, lack of generalization over varied test cases remains a bottleneck for these methods. Results: We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods, while consistently offering better performance on varied test cases. Moreover, ML-Select shows promising results even for decoy sets consisting mostly of low-quality decoys. Conclusions: ML-Select is a useful method for decoy selection. This work suggests further research into more effective ways to adopt machine learning frameworks to achieve robust performance for decoy selection in template-free protein structure prediction.
Affiliation(s)
- Nasrin Akhter
- Department of Computer Science, George Mason University, Fairfax, 22030, VA, USA
| | - Gopinath Chennupati
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Bikini Atoll Rd., Los Alamos, 87545, USA.
| | - Hristo Djidjev
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Bikini Atoll Rd., Los Alamos, 87545, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, 22030, VA, USA.,Department of Bioengineering, George Mason University, Fairfax, 22030, VA, USA.,School of Systems Biology, George Mason University, Manassas, 20110, VA, USA
| |
|
14
|
Suh S, Lee H, Lukowicz P, Lee YO. CEGAN: Classification Enhancement Generative Adversarial Networks for unraveling data imbalance problems. Neural Netw 2020; 133:69-86. [PMID: 33125919 DOI: 10.1016/j.neunet.2020.10.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2019] [Revised: 07/23/2020] [Accepted: 10/11/2020] [Indexed: 10/23/2022]
Abstract
Data imbalance is a frequent and challenging problem in classification. In real-world datasets, many class distributions are imbalanced, and classification under such conditions is strongly biased toward the majority class. Recently, the potential of GANs as a data augmentation method for minority data has been studied. In this paper, we propose classification enhancement generative adversarial networks (CEGAN) to enhance the quality of generated synthetic minority data and, more importantly, to improve prediction accuracy under data-imbalanced conditions. In addition, we propose an ambiguity reduction method that uses the generated synthetic minority data for cases in which multiple similar classes degrade classification accuracy. The proposed method is demonstrated on five benchmark datasets. The results indicate that approximating the real data distribution using CEGAN improves classification performance significantly under data-imbalanced conditions compared with various standard data augmentation methods.
Affiliation(s)
- Sungho Suh
- Smart Convergence Group, Korea Institute of Science and Technology Europe Forschungsgesellschaft mbH, 66123 Saarbrücken, Germany; Department of Computer Science, TU Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Haebom Lee
- Smart Convergence Group, Korea Institute of Science and Technology Europe Forschungsgesellschaft mbH, 66123 Saarbrücken, Germany
| | - Paul Lukowicz
- Department of Computer Science, TU Kaiserslautern, 67663 Kaiserslautern, Germany; German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
| | - Yong Oh Lee
- Smart Convergence Group, Korea Institute of Science and Technology Europe Forschungsgesellschaft mbH, 66123 Saarbrücken, Germany.
| |
|
15
|
Liu Y, Li A, Zhao XM, Wang M. DeepTL-Ubi: A novel deep transfer learning method for effectively predicting ubiquitination sites of multiple species. Methods 2020; 192:103-111. [PMID: 32791338 DOI: 10.1016/j.ymeth.2020.08.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2020] [Revised: 07/17/2020] [Accepted: 08/06/2020] [Indexed: 11/16/2022] Open
Abstract
Ubiquitination is one of the most important post-translational modifications and is involved in many biological processes. Because mass spectrometry-based methods for identifying ubiquitination sites are costly and time-consuming, computational approaches provide an alternative way to determine ubiquitination sites. Although machine learning-based methods can effectively predict ubiquitination sites, most of them rely on feature engineering, which may lead to biased or incomplete features. Recently, deep learning has achieved great success in the prediction of post-translational modification sites. However, deep learning has not been explored for the prediction of species-specific ubiquitination sites. In this paper, we propose a novel deep transfer learning method, named DeepTL-Ubi, for predicting ubiquitination sites of multiple species. DeepTL-Ubi enhances the performance of species-specific ubiquitination site prediction by transferring common knowledge from the large amount of human data to other species, which effectively addresses the problem of insufficient training data for other species. In addition, we train and test our model on ubiquitination sites collected for multiple species from several sources. Experimental results show that our transfer learning technique can effectively improve predictive performance for species with small sample sizes, and that DeepTL-Ubi is superior to existing tools for many species. The source code and training data of DeepTL-Ubi are publicly deposited at https://github.com/USTC-HIlab/DeepTL-Ubi.
Affiliation(s)
- Yu Liu
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.
| | - Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China; Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China.
| | - Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai 200433, China.
| | - Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China; Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China.
| |
|
16
|
Xu C, Zhu G. Semi-supervised Learning Algorithm Based on Linear Lie Group for Imbalanced Multi-class Classification. Neural Process Lett 2020. [DOI: 10.1007/s11063-020-10287-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
|
17
|
Moghadas-Dastjerdi H, Sha-E-Tallat HR, Sannachi L, Sadeghi-Naini A, Czarnota GJ. A priori prediction of tumour response to neoadjuvant chemotherapy in breast cancer patients using quantitative CT and machine learning. Sci Rep 2020; 10:10936. [PMID: 32616912 PMCID: PMC7331583 DOI: 10.1038/s41598-020-67823-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2019] [Accepted: 06/08/2020] [Indexed: 12/19/2022] Open
Abstract
Response to Neoadjuvant chemotherapy (NAC) has demonstrated a high correlation to survival in locally advanced breast cancer (LABC) patients. An early prediction of responsiveness to NAC could facilitate treatment adjustments on an individual patient basis that would be expected to improve treatment outcomes and patient survival. This study investigated, for the first time, the efficacy of quantitative computed tomography (qCT) parametric imaging to characterize intra-tumour heterogeneity and its application in predicting tumour response to NAC in LABC patients. Textural analyses were performed on CT images acquired from 72 patients before the start of chemotherapy to determine quantitative features of intra-tumour heterogeneity. The best feature subset for response prediction was selected through a sequential feature selection with bootstrap 0.632+ area under the receiver operating characteristic (ROC) curve ($\mathrm{AUC}_{0.632+}$) as a performance criterion. Several classifiers were evaluated for response prediction using the selected feature subset. Amongst the applied classifiers an Adaboost decision tree provided the best results with cross-validated $\mathrm{AUC}_{0.632+}$, accuracy, sensitivity and specificity of 0.89, 84%, 80% and 88%, respectively. The promising results obtained in this study demonstrate the potential of the proposed biomarkers to be used as predictors of LABC tumour response to NAC prior to the start of treatment.
Affiliation(s)
- Hadi Moghadas-Dastjerdi
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada.,Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada.,Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada.,Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Hira Rahman Sha-E-Tallat
- Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada.,Faculty of Engineering, University of Waterloo, Waterloo, ON, Canada
| | - Lakshmanan Sannachi
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada.,Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada.,Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada.,Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
| | - Ali Sadeghi-Naini
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada.,Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada.,Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada.,Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, ON, Canada
| | - Gregory J Czarnota
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada. .,Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada. .,Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada. .,Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada.
| |
|
18
|
Tadepalli S, Akhter N, Barbara D, Shehu A. Anomaly Detection-Based Recognition of Near-Native Protein Structures. IEEE Trans Nanobioscience 2020; 19:562-570. [PMID: 32340957 DOI: 10.1109/tnb.2020.2990642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
The three-dimensional structures populated by a protein molecule determine to a great extent its biological activities. The rich information encoded by protein structure on protein function continues to motivate the development of computational approaches for determining functionally-relevant structures. The majority of structures generated in silico are not relevant. Discriminating relevant/native protein structures from non-native ones is an outstanding challenge in computational structural biology. Inherently, this is a recognition problem that can be addressed under the umbrella of machine learning. In this paper, based on the premise that near-native structures are effectively anomalies, we build on the concept of anomaly detection in machine learning. We propose methods that automatically select relevant subsets, as well as methods that select a single structure to offer as prediction. Evaluations are carried out on benchmark datasets and demonstrate that the proposed methods advance the state of the art. The presented results motivate further building on and adapting concepts and techniques from machine learning to improve recognition of near-native structures in protein structure prediction.
|
19
|
|
20
|
Abuassba AO, Zhang D, Luo X. A Heterogeneous AdaBoost Ensemble Based Extreme Learning Machines for Imbalanced Data. INTERNATIONAL JOURNAL OF COGNITIVE INFORMATICS AND NATURAL INTELLIGENCE 2019. [DOI: 10.4018/ijcini.2019070102] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Extreme learning machine (ELM) is an effective learning algorithm for the single hidden layer feed-forward neural network (SLFN). It is diversified in the form of kernels or feature mapping functions, while achieving a good learning performance. It is agile in learning and often has good performance, including kernel ELM and Regularized ELM. Dealing with imbalanced data has been a long-term focus for the learning algorithms to achieve satisfactory analytical results. It is obvious that the unbalanced class distribution imposes very challenging obstacles to implement learning tasks in real-world applications, including online visual tracking and image quality assessment. This article addresses this issue through advanced diverse AdaBoost based ELM ensemble (AELME) for imbalanced binary and multiclass data classification. This article aims to improve classification accuracy of the imbalanced data. In the proposed method, the ensemble is developed while splitting the trained data into corresponding subsets. And different algorithms of enhanced ELM, including regularized ELM and kernel ELM, are used as base learners, so that an active learner is constructed from a group of relatively weak base learners. Furthermore, AELME is implemented by training a randomly selected ELM classifier on a subset, chosen by random re-sampling. Then, the labels of unseen data could be predicted using the weighting approach. AELME is validated through classification on real-world benchmark datasets.
Affiliation(s)
- Adnan Omer Abuassba
- University of Science and Technology Beijing (USTB), Beijing, China & Arab Open University - Palestine, Ramallah, Palestine
| | - Dezheng Zhang
- University of Science and Technology Beijing (USTB), Beijing, China
| | - Xiong Luo
- University of Science and Technology Beijing (USTB), Beijing, China
| |
|
21
|
Zhang L, Yu G, Xia D, Wang J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.097] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
|
22
|
Bhattacharya M, Jurkovitz C, Shatkay H. Chronic Kidney Disease stratification using office visit records: Handling data imbalance via hierarchical meta-classification. BMC Med Inform Decis Mak 2018; 18:125. [PMID: 30537962 PMCID: PMC6290512 DOI: 10.1186/s12911-018-0675-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Background Chronic Kidney Disease (CKD) is one of several conditions that affect a growing percentage of the US population; the disease is accompanied by multiple co-morbidities and is hard to diagnose in and of itself. In its advanced forms it carries severe outcomes and can lead to death. It is thus important to detect the disease as early as possible, which can help in devising effective intervention and treatment plans. Here we investigate ways to utilize information available in electronic health records (EHRs) from regular office visits of more than 13,000 patients, in order to distinguish among several stages of the disease. While clinical data stored in EHRs provide valuable information for risk stratification, one of the major challenges in using them arises from data imbalance. That is, records associated with a more severe condition are typically under-represented compared to those associated with a milder manifestation of the disease. To address imbalance, we propose and develop a sampling-based ensemble approach, hierarchical meta-classification, aiming to stratify CKD patients into severity stages, using simple quantitative non-text features gathered from standard office visit records. Methods The proposed hierarchical meta-classification method frames the multiclass classification task as a hierarchy of two subtasks. The first is binary classification, separating records associated with the majority class from those associated with all minority classes combined, using meta-classification. The second subtask separates the records assigned to the combined minority classes into the individual constituent classes. Results The proposed method identifies a significant proportion of patients suffering from the more advanced stages of the condition, while also correctly identifying most of the less severe cases, maintaining high sensitivity, specificity and F-measure (≥ 93%). Our results show that the high level of performance attained by our method is preserved even when the size of the training set is significantly reduced, demonstrating the stability and generalizability of our approach. Conclusion We present a new approach to perform classification while addressing data imbalance, which is inherent in the biomedical domain. Our model effectively identifies severity stages of CKD patients, using information readily available in office visit records within the realistic context of high data imbalance.
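The two-subtask hierarchy described in the Methods above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `make_stage_labels` and `hierarchical_predict`, the class names, and the threshold rules standing in for trained classifiers are all hypothetical, and the meta-classification and sampling steps of the paper are omitted.

```python
from collections import Counter


def make_stage_labels(labels):
    """Derive the label sets for the two subtasks from multiclass labels."""
    # The most frequent class is treated as the majority class.
    majority = Counter(labels).most_common(1)[0][0]
    # Subtask 1: majority class versus all minority classes combined.
    stage1 = ["majority" if y == majority else "minority" for y in labels]
    # Subtask 2: indices of minority records, which keep their original labels.
    stage2_idx = [i for i, y in enumerate(labels) if y != majority]
    return majority, stage1, stage2_idx


def hierarchical_predict(x, stage1_clf, stage2_clf, majority):
    """Route a record through the binary stage, then the minority stage."""
    if stage1_clf(x) == "majority":
        return majority
    return stage2_clf(x)


# Hypothetical toy data: three severity classes with a clear majority.
labels = ["mild"] * 6 + ["moderate"] * 2 + ["severe"]
majority, stage1, stage2_idx = make_stage_labels(labels)

# Hypothetical threshold rules standing in for trained classifiers.
stage1_rule = lambda x: "majority" if x < 30 else "minority"
stage2_rule = lambda x: "moderate" if x < 60 else "severe"

for value in (10, 45, 80):
    print(value, hierarchical_predict(value, stage1_rule, stage2_rule, majority))
```

In practice each rule would be replaced by a classifier trained on the corresponding label set, with the second one fitted only on the records listed in `stage2_idx`.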
Affiliation(s)
- Moumita Bhattacharya
- Computational Biomedicine Lab, Computer and Information Sciences, University of Delaware, Newark, DE, USA.
| | | | - Hagit Shatkay
- Computational Biomedicine Lab, Computer and Information Sciences, University of Delaware, Newark, DE, USA.,Center for Bioinformatics and Computational Biology, Delaware Biotechnology Inst, University of Delaware, Newark, DE, USA
| |
|
23
|
Sastry A, Monk J, Tegel H, Uhlen M, Palsson BO, Rockberg J, Brunk E. Machine learning in computational biology to accelerate high-throughput protein expression. Bioinformatics 2018; 33:2487-2495. [PMID: 28398465 DOI: 10.1093/bioinformatics/btx207] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2016] [Accepted: 04/05/2017] [Indexed: 01/21/2023] Open
Abstract
Motivation The Human Protein Atlas (HPA) enables the simultaneous characterization of thousands of proteins across various tissues to pinpoint their spatial location in the human body. This has been achieved through transcriptomics and high-throughput immunohistochemistry-based approaches, where over 40 000 unique human protein fragments have been expressed in E. coli. These datasets enable quantitative tracking of entire cellular proteomes and present new avenues for understanding molecular-level properties influencing expression and solubility. Results Combining computational biology and machine learning identifies protein properties that hinder the HPA high-throughput antibody production pipeline. We predict protein expression and solubility with accuracies of 70% and 80%, respectively, based on a subset of key properties (aromaticity, hydropathy and isoelectric point). We guide the selection of protein fragments based on these characteristics to optimize high-throughput experimentation. Availability and implementation We present the machine learning workflow as a series of IPython notebooks hosted on GitHub (https://github.com/SBRG/Protein_ML). The workflow can be used as a template for analysis of further expression and solubility datasets. Contact ebrunk@ucsd.edu or johanr@biotech.kth.se. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Anand Sastry
- Department of Bioengineering, University of California, San Diego, CA, USA
| | - Jonathan Monk
- Department of Bioengineering, University of California, San Diego, CA, USA
| | - Hanna Tegel
- KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden
| | - Mathias Uhlen
- KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden.,The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
| | - Bernhard O Palsson
- Department of Bioengineering, University of California, San Diego, CA, USA.,The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
| | - Johan Rockberg
- KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden
| | - Elizabeth Brunk
- Department of Bioengineering, University of California, San Diego, CA, USA.,The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
| |
|
24
|
Dynamic affinity-based classification of multi-class imbalanced data with one-versus-one decomposition: a fuzzy rough set approach. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-017-1126-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
25
|
Akkasi A, Varoğlu E, Dimililer N. Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text. APPL INTELL 2017. [DOI: 10.1007/s10489-017-0920-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
26
|
iDPF-PseRAAAC: A Web-Server for Identifying the Defensin Peptide Family and Subfamily Using Pseudo Reduced Amino Acid Alphabet Composition. PLoS One 2015; 10:e0145541. [PMID: 26713618 PMCID: PMC4694767 DOI: 10.1371/journal.pone.0145541] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2015] [Accepted: 12/04/2015] [Indexed: 11/29/2022] Open
Abstract
Defensins, one of the most abundant classes of antimicrobial peptides, are an essential part of the innate immunity that has evolved in most living organisms, from lower organisms to humans. To identify specific defensins as interesting antifungal leads, in this study we constructed a more rigorous benchmark dataset and developed the iDPF-PseRAAAC server to predict the defensin family and subfamily. Using reduced dipeptide compositions, the overall accuracy of the proposed method increased to 95.10% for the defensin family and 98.39% for the vertebrate subfamily, which is higher than the accuracy of other methods. The jackknife test shows that an improvement of more than 4% was obtained compared with the previous method. A free online server was further established for the convenience of most experimental scientists at http://wlxy.imu.edu.cn/college/biostation/fuwu/iDPF-PseRAAAC/index.asp. A friendly guide is provided to describe how to use the web server. We anticipate that iDPF-PseRAAAC may become a useful high-throughput tool for both basic research and drug design.
|
27
|
Dai HL. Imbalanced Protein Data Classification Using Ensemble FTM-SVM. IEEE Trans Nanobioscience 2015; 14:350-359. [DOI: 10.1109/tnb.2015.2431292] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
28
|
You ZH, Chan KCC, Hu P. Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest. PLoS One 2015; 10:e0125811. [PMID: 25946106 PMCID: PMC4422660 DOI: 10.1371/journal.pone.0125811] [Citation(s) in RCA: 92] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2014] [Accepted: 03/04/2015] [Indexed: 11/18/2022] Open
Abstract
The study of protein-protein interactions (PPIs) can be very important for the understanding of biological cellular functions. However, detecting PPIs in the laboratory is both time-consuming and expensive. For this reason, there has been much recent effort to develop techniques for computational prediction of PPIs, as this can complement laboratory procedures and provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale. Although much progress has already been achieved in this direction, the problem is still far from being solved. More effective approaches are still required to overcome the limitations of the current ones. In this study, a novel Multi-scale Local Descriptor (MLD) feature representation scheme is proposed to extract features from a protein sequence. This scheme can capture multi-scale local information by varying the length of protein-sequence segments. Based on the MLD, an ensemble learning method, the Random Forest (RF) method, is used as the classifier. The MLD feature representation scheme facilitates the mining of interaction information from multi-scale continuous amino acid segments, making it easier to capture multiple overlapping continuous binding patterns within a protein sequence. When the proposed method is tested with the PPI data of Saccharomyces cerevisiae, it achieves a prediction accuracy of 94.72% with 94.34% sensitivity at a precision of 98.91%. Extensive experiments are performed to compare our method with existing sequence-based methods. Experimental results show that the performance of our predictor is better than that of several other state-of-the-art predictors, including on the H. pylori dataset. These good results can largely be credited to the learning capabilities of the RF model and the novel MLD feature representation scheme. The experimental results show that the proposed approach can be very promising for predicting PPIs and can be a useful tool for future proteomic studies.
Affiliation(s)
- Zhu-Hong You
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Keith C C Chan
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
| | - Pengwei Hu
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
| |
|
29
|
Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines. BIOMED RESEARCH INTERNATIONAL 2015; 2015:867516. [PMID: 26000305 PMCID: PMC4426769 DOI: 10.1155/2015/867516] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/01/2014] [Revised: 01/09/2015] [Accepted: 01/09/2015] [Indexed: 11/27/2022]
Abstract
Proteins and their interactions lie at the heart of most underlying biological processes. Consequently, correct detection of protein-protein interactions (PPIs) is of fundamental importance for understanding the molecular mechanisms in biological systems. Although advances in high-throughput experimental technology make it possible to detect a large number of PPIs, the data generated through these methods are unreliable and may not include all possible PPIs. To address this problem, this study develops a novel computational approach to effectively detect protein interactions. The approach is based on a novel matrix-based representation of protein sequence combined with the support vector machine (SVM) algorithm, and it fully considers the sequence order and dipeptide information of the protein primary sequence. When applied to yeast PPI datasets, the proposed method reaches 90.06% prediction accuracy with 94.37% specificity at a sensitivity of 85.74%, indicating that this predictor is a useful tool for predicting PPIs. The results also demonstrate that our approach can be a helpful supplement to the interactions that have been detected experimentally.
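The three figures reported above are linked through the confusion matrix, which the short sketch below makes explicit. It assumes equal numbers of interacting and non-interacting pairs, which the abstract does not state; the function name `accuracy_from_rates` is hypothetical. Under that assumption the accuracy is the mean of sensitivity and specificity, roughly 90.06%, in line with the reported value.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy implied by per-class rates and class sizes."""
    true_pos = sensitivity * n_pos   # positives correctly recovered
    true_neg = specificity * n_neg   # negatives correctly rejected
    return (true_pos + true_neg) / (n_pos + n_neg)


# Reported rates; the 1:1 class ratio is an assumption, not from the abstract.
acc = accuracy_from_rates(sensitivity=0.8574, specificity=0.9437, n_pos=1, n_neg=1)
print(f"implied accuracy: {100 * acc:.2f}%")
```

With unequal class sizes the same function weights each rate by its class count, so the implied accuracy moves toward the rate of the larger class.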
|
30
|
Tomar D, Agarwal S. An effective Weighted Multi-class Least Squares Twin Support Vector Machine for Imbalanced data classification. INT J COMPUT INT SYS 2015. [DOI: 10.1080/18756891.2015.1061395] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022] Open
|
31
|
You ZH, Zhu L, Zheng CH, Yu HJ, Deng SP, Ji Z. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics 2014; 15 Suppl 15:S9. [PMID: 25474679 PMCID: PMC4271571 DOI: 10.1186/1471-2105-15-s15-s9] [Citation(s) in RCA: 84] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Identifying protein-protein interactions (PPIs) is essential for elucidating protein functions and understanding the molecular mechanisms inside the cell. However, the experimental methods for detecting PPIs are both time-consuming and expensive. Therefore, computational prediction of protein interactions is becoming increasingly popular; it can provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale and can be used to complement experimental approaches. Although much progress has already been achieved in this direction, the problem is still far from being solved and new approaches are still required to overcome the limitations of the current prediction models. RESULTS In this work, a sequence-based approach is developed by combining a novel Multi-scale Continuous and Discontinuous (MCD) feature representation and Support Vector Machine (SVM). The MCD representation gives adequate consideration to the interactions between sequentially distant but spatially close amino acid residues, and thus it can sufficiently capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence. An effective feature selection method, mRMR, was employed to construct an optimized and more discriminative feature set by excluding redundant features. Finally, a prediction model is trained and tested based on the SVM algorithm to predict the interaction probability of protein pairs. CONCLUSIONS When applied to the yeast PPI data set, the proposed approach achieved 91.36% prediction accuracy with 91.94% precision at a sensitivity of 90.67%. Extensive experiments are conducted to compare our method with existing sequence-based methods. Experimental results show that the performance of our predictor is better than that of several other state-of-the-art predictors, whose average prediction accuracy is 84.91%, sensitivity is 83.24%, and precision is 86.12%. These results show that the proposed approach is very promising for predicting PPIs, so it can be a useful supplementary tool for future proteomics studies. The source code and the datasets are freely available at http://csse.szu.edu.cn/staff/youzh/MCDPPI.zip for academic use.
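Because this abstract reports precision where others report specificity, a small worked conversion may help when comparing entries. The sketch below recovers the implied specificity and accuracy from the reported sensitivity and precision, assuming equal numbers of positive and negative pairs (an assumption not stated in the abstract); the function name `specificity_from_precision` is hypothetical. Under that assumption the implied accuracy comes out close to the reported 91.36%.

```python
def specificity_from_precision(sensitivity, precision, n_pos, n_neg):
    """Specificity implied by sensitivity, precision, and class sizes."""
    true_pos = sensitivity * n_pos
    # precision = TP / (TP + FP), so FP = TP * (1 - precision) / precision
    false_pos = true_pos * (1.0 - precision) / precision
    return 1.0 - false_pos / n_neg


# Reported rates; the 1:1 class ratio is an assumption, not from the abstract.
sens, prec = 0.9067, 0.9194
spec = specificity_from_precision(sens, prec, n_pos=1, n_neg=1)
implied_acc = (sens + spec) / 2  # balanced classes: mean of the two rates
print(f"implied specificity: {100 * spec:.2f}%")
print(f"implied accuracy: {100 * implied_acc:.2f}%")
```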
|
32
|
You ZH, Yu JZ, Zhu L, Li S, Wen ZK. A MapReduce based parallel SVM for large-scale predicting protein–protein interactions. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.05.072] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
33
|
Ma C, Zhang HH, Wang X. Machine learning for Big Data analytics in plants. TRENDS IN PLANT SCIENCE 2014; 19:798-808. [PMID: 25223304 DOI: 10.1016/j.tplants.2014.08.004] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/28/2014] [Revised: 07/30/2014] [Accepted: 08/20/2014] [Indexed: 05/19/2023]
Abstract
Rapid advances in high-throughput genomic technology have enabled biology to enter the era of 'Big Data' (large datasets). The plant science community not only needs to build its own Big-Data-compatible parallel computing and data management infrastructures, but also to seek novel analytical paradigms to extract information from the overwhelming amounts of data. Machine learning offers promising computational and analytical solutions for the integrative analysis of large, heterogeneous and unstructured datasets on the Big-Data scale, and is gradually gaining popularity in biology. This review introduces the basic concepts and procedures of machine-learning applications and envisages how machine learning could interface with Big Data technology to facilitate basic research and biotechnology in the plant sciences.
Affiliation(s)
- Chuang Ma
- School of Plant Sciences, University of Arizona, 1140 E. South Campus Drive, Tucson, AZ 85721, USA
- Hao Helen Zhang
- Department of Mathematics, University of Arizona, 617 North Santa Rita Ave, Tucson, AZ 85721, USA
- Xiangfeng Wang
- School of Plant Sciences, University of Arizona, 1140 E. South Campus Drive, Tucson, AZ 85721, USA; Department of Plant Genetics and Breeding, College of Agronomy and Biotechnology, China Agricultural University, Beijing 100193, China.
|
34
|
Lee BJ, Ku B, Nam J, Pham DD, Kim JY. Prediction of fasting plasma glucose status using anthropometric measures for diagnosing type 2 diabetes. IEEE J Biomed Health Inform 2014; 18:555-61. [PMID: 24608055 DOI: 10.1109/jbhi.2013.2264509] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
It is well known that body fat distribution and obesity are important risk factors for type 2 diabetes. Prediction of type 2 diabetes using a combination of anthropometric measures remains a controversial issue. This study aims to predict the fasting plasma glucose (FPG) status that is used in the diagnosis of type 2 diabetes by a combination of various measures among Korean adults. A total of 4870 subjects (2955 females and 1915 males) participated in this study. Based on 37 anthropometric measures, we compared predictions of FPG status using individual versus combined measures using two machine-learning algorithms. The values of the area under the receiver operating characteristic curve in the predictions by logistic regression and naive Bayes classifier based on the combination of measures were 0.741 and 0.739 in females, respectively, and were 0.687 and 0.686 in males, respectively. Our results indicate that prediction of FPG status using a combination of anthropometric measures was superior to individual measures alone in both females and males. We show that using balanced data of normal and high FPG groups can improve the prediction and reduce the intrinsic bias of the model toward the majority class.
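The area under the ROC curve reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that pairwise definition follows; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the authors' code or data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    `labels` holds 0/1 class labels; `scores` holds classifier outputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): three of the four pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The quadratic pairwise loop is chosen for readability; a rank-based computation would be preferable for cohorts of several thousand subjects.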
|
35
|
Yu H, Ni J. An Improved Ensemble Learning Method for Classifying High-Dimensional and Imbalanced Biomedicine Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:657-666. [PMID: 26356336 DOI: 10.1109/tcbb.2014.2306838] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Training classifiers on skewed data is technically challenging, and the task becomes more difficult when the data are also high-dimensional. Skewed data often appear in the biomedical field. In this study, we try to deal with this problem by combining the asymmetric bagging ensemble classifier (asBagging) presented in previous work with an improved random subspace (RS) generation strategy called feature subspace (FSS). Specifically, FSS is a novel method to improve the balance between accuracy and diversity of the base classifiers in asBagging. In view of the strong generalization capability of the support vector machine (SVM), we adopt it as the base classifier. Extensive experiments on four benchmark biomedicine data sets indicate that the proposed ensemble learning method outperforms many baseline approaches in terms of the Accuracy, F-measure, G-mean and AUC evaluation criteria; thus it can be regarded as an effective and efficient tool to deal with high-dimensional and imbalanced biomedical data.
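The sampling side of an asymmetric bagging ensemble can be sketched as follows. This is an illustration under stated assumptions: each bag keeps every minority example, draws an equally sized random subset of the majority class, and picks a plain random feature subset. The paper's FSS strategy is not described in the abstract, so ordinary random subspace selection stands in for it here, and the function name `asymmetric_bags` is hypothetical. Training a base classifier (an SVM in the paper) on each bag is left out.

```python
import random


def asymmetric_bags(minority_idx, majority_idx, n_features, n_bags,
                    subspace_size, seed=0):
    """Build row/feature index sets for an asymmetric bagging ensemble.

    Every bag contains all minority rows plus an equally sized random
    sample of majority rows, so each base classifier sees balanced data.
    The feature subset is a plain random subspace (an assumption; this is
    not the FSS method from the paper).
    """
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        sampled_majority = rng.sample(majority_idx, len(minority_idx))
        features = sorted(rng.sample(range(n_features), subspace_size))
        bags.append({"rows": list(minority_idx) + sampled_majority,
                     "features": features})
    return bags


# Hypothetical data set: 3 minority rows, 20 majority rows, 10 features.
bags = asymmetric_bags([0, 1, 2], list(range(3, 23)),
                       n_features=10, n_bags=4, subspace_size=5)
for bag in bags:
    print(bag["rows"], bag["features"])
```

Because the majority class is resampled independently for each bag, the ensemble as a whole still sees most majority examples even though each base classifier is trained on a balanced subset.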
|
36
|
Wang KJ, Makond B, Chen KH, Wang KM. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2013.09.014] [Citation(s) in RCA: 81] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
37
|
Bakhtiarizadeh MR, Moradi-Shahrbabak M, Ebrahimi M, Ebrahimie E. Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology. J Theor Biol 2014; 356:213-22. [PMID: 24819464 DOI: 10.1016/j.jtbi.2014.04.040] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2014] [Revised: 04/03/2014] [Accepted: 04/29/2014] [Indexed: 01/05/2023]
Abstract
Due to the central roles of lipid binding proteins (LBPs) in many biological processes, sequence based identification of LBPs is of great interest. The major challenge is that LBPs are diverse in sequence, structure, and function which results in low accuracy of sequence homology based methods. Therefore, there is a need for developing alternative functional prediction methods irrespective of sequence similarity. To identify LBPs from non-LBPs, the performances of support vector machine (SVM) and neural network were compared in this study. Comprehensive protein features and various techniques were employed to create datasets. Five-fold cross-validation (CV) and independent evaluation (IE) tests were used to assess the validity of the two methods. The results indicated that SVM outperforms neural network. SVM achieved 89.28% (CV) and 89.55% (IE) overall accuracy in identification of LBPs from non-LBPs and 92.06% (CV) and 92.90% (IE) (in average) for classification of different LBPs classes. Increasing the number and the range of extracted protein features as well as optimization of the SVM parameters significantly increased the efficiency of LBPs class prediction in comparison to the only previous report in this field. Altogether, the results showed that the SVM algorithm can be run on broad, computationally calculated protein features and offers a promising tool in detection of LBPs classes. The proposed approach has the potential to integrate and improve the common sequence alignment based methods.
Affiliation(s)
- Mohammad Moradi-Shahrbabak
- Department of Animal Science, College of Agriculture and Natural Resources, University of Tehran, Karaj, Iran
- Mansour Ebrahimi
- Department of Biology, School of Basic Sciences, University of Qom, Qom, Iran
- Esmaeil Ebrahimie
- Department of Crop Production & Plant Breeding, College of Agriculture, Shiraz University, Shiraz, Iran; School of Molecular and Biomedical Science, The University of Adelaide, Adelaide, Australia.
|
38
|
Abdi L, Hashemi S. To combat multi-class imbalanced problems by means of over-sampling and boosting techniques. Soft comput 2014. [DOI: 10.1007/s00500-014-1291-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
|
39
|
Abstract
β-lactam group of antibiotics is the most widely used therapeutic molecules for treating bacterial infections. The main mode of bacterial resistance to β-lactams is by β-lactamases. In the present study, we report our results on the role of cation-π interactions in β-lactamases and their environmental preferences. The number of interactions formed by arginine is higher than lysine in the cationic group, while tyrosine is comparatively higher than phenylalanine and tryptophan in the π group. Our results indicate that cation-π interactions might play an important role in the global conformational stability of β-lactamases.
|
40
|
Wang KJ, Makond B, Wang KM. An improved survivability prognosis of breast cancer by using sampling and feature selection technique to solve imbalanced patient classification data. BMC Med Inform Decis Mak 2013; 13:124. [PMID: 24207108 PMCID: PMC3829096 DOI: 10.1186/1472-6947-13-124] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2013] [Accepted: 10/28/2013] [Indexed: 11/22/2022] Open
Abstract
Background Breast cancer is one of the most critical cancers and is a major cause of cancer death among women. It is essential to know the survivability of the patients in order to ease the decision making process regarding medical treatment and financial preparation. Recent breast cancer data sets are imbalanced (i.e., the number of surviving patients exceeds the number of non-surviving patients), whereas standard classifiers are not well suited to imbalanced data sets. Methods to improve the survivability prognosis of breast cancer therefore need to be studied. Methods Two well-known five-year prognosis models/classifiers [i.e., logistic regression (LR) and decision tree (DT)] are constructed by combining synthetic minority over-sampling technique (SMOTE), cost-sensitive classifier technique (CSC), under-sampling, bagging, and boosting. The feature selection method is used to select relevant variables, while the pruning technique is applied to obtain low information-burden models. These methods are applied on data obtained from the Surveillance, Epidemiology, and End Results database. The improvements of survivability prognosis of breast cancer are investigated based on the experimental results. Results Experimental results confirm that the DT and LR models combined with SMOTE, CSC, and under-sampling generate higher predictive performance consecutively than the original ones. Most of the time, DT and LR models combined with SMOTE and CSC use less informative burden/features when a feature selection method and a pruning technique are applied. Conclusions LR is found to have better statistical power than DT in predicting five-year survivability. CSC is superior to SMOTE, under-sampling, bagging, and boosting to improve the prognostic performance of DT and LR.
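SMOTE, one of the resampling techniques named above, creates synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal sketch of that idea for numeric features only; the function name `smote`, the default `k=5`, and the toy points are assumptions for illustration and do not reproduce the study's configuration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority points by neighbour interpolation.

    For each synthetic point: pick a minority example, pick one of its k
    nearest minority neighbours, and take a random point on the segment
    between them.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority, n_synthetic=3):
    print(point)
```

Every synthetic point lies on a segment between two real minority points, which is why SMOTE tends to widen the minority decision region rather than duplicate existing records.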
Affiliation(s)
- Kung-Jeng Wang
- Department of Industrial Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan.
|
41
|
Wang M, Zhao XM, Tan H, Akutsu T, Whisstock JC, Song J. Cascleave 2.0, a new approach for predicting caspase and granzyme cleavage targets. Bioinformatics 2013; 30:71-80. [PMID: 24149049 DOI: 10.1093/bioinformatics/btt603] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Caspases and granzyme B (GrB) are important proteases involved in fundamental cellular processes and play essential roles in programmed cell death, necrosis and inflammation. Although a number of substrates for both types have been experimentally identified, the complete repertoire of caspases and granzyme B substrates remained to be fully characterized. Accordingly, systematic bioinformatics studies of known cleavage sites may provide important insights into their substrate specificity and facilitate the discovery of novel substrates. RESULTS We develop a new bioinformatics tool, termed Cascleave 2.0, which builds on previous success of the Cascleave tool for predicting generic caspase cleavage sites. It can be efficiently used to predict potential caspase-specific cleavage sites for the human caspase-1, 3, 6, 7, 8 and GrB. In particular, we integrate heterogeneous sequence and protein functional information from various sources to improve the prediction accuracy of Cascleave 2.0. During classification, we use both maximum relevance minimum redundancy and forward feature selection techniques to quantify the relative contribution of each feature to prediction and thus remove redundant as well as irrelevant features. A systematic evaluation of Cascleave 2.0 using the benchmark data and comparison with other state-of-the-art tools using independent test data indicate that Cascleave 2.0 outperforms other tools on protease-specific cleavage site prediction of caspase-1, 3, 6, 7 and GrB. Cascleave 2.0 is anticipated to be used as a powerful tool for identifying novel substrates and cleavage sites of caspases and GrB and help understand the functional roles of these important proteases in human proteolytic cascades. AVAILABILITY AND IMPLEMENTATION http://www.structbioinfor.org/cascleave2/.
Affiliation(s)
- Mingjun Wang
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, Department of Computer Science, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China, Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan and ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Melbourne, Victoria 3800, Australia
|
42
|
Fernández A, López V, Galar M, del Jesus MJ, Herrera F. Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2013.01.018] [Citation(s) in RCA: 236] [Impact Index Per Article: 21.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
43
|
Lee BJ, Kim KH, Ku B, Jang JS, Kim JY. Prediction of body mass index status from voice signals based on machine learning for automated medical applications. Artif Intell Med 2013; 58:51-61. [PMID: 23453267 DOI: 10.1016/j.artmed.2013.02.001] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2012] [Revised: 12/21/2012] [Accepted: 02/05/2013] [Indexed: 11/28/2022]
Abstract
OBJECTIVES The body mass index (BMI) provides essential medical information related to body weight for the treatment and prognosis prediction of diseases such as cardiovascular disease, diabetes, and stroke. We propose a method for the prediction of normal, overweight, and obese classes based only on the combination of voice features that are associated with BMI status, independently of weight and height measurements. MATERIALS AND METHODS A total of 1568 subjects were divided into 4 groups according to age and gender differences. We performed statistical analyses by analysis of variance (ANOVA) and Scheffe test to find significant features in each group. We predicted BMI status (normal, overweight, and obese) by a logistic regression algorithm and two ensemble classification algorithms (bagging and random forests) based on statistically significant features. RESULTS In the Female-2030 group (females aged 20-40 years), classification experiments using an imbalanced (original) data set gave area under the receiver operating characteristic curve (AUC) values of 0.569-0.731 by logistic regression, whereas experiments using a balanced data set gave AUC values of 0.893-0.994 by random forests. AUC values in Female-4050 (females aged 41-60 years), Male-2030 (males aged 20-40 years), and Male-4050 (males aged 41-60 years) groups by logistic regression in imbalanced data were 0.585-0.654, 0.581-0.614, and 0.557-0.653, respectively. AUC values in Female-4050, Male-2030, and Male-4050 groups in balanced data were 0.629-0.893 by bagging, 0.707-0.916 by random forests, and 0.695-0.854 by bagging, respectively. In each group, we found discriminatory features showing statistical differences among normal, overweight, and obese classes. The results showed that the classification models built by logistic regression in imbalanced data were better than those built by the other two algorithms, and significant features differed according to age and gender groups. 
CONCLUSION Our results could support the development of BMI diagnosis tools for real-time monitoring; such tools are considered helpful in improving automated BMI status diagnosis in remote healthcare or telemedicine and are expected to have applications in forensic and medical science.
Affiliation(s)
- Bum Ju Lee
- Medical Research Division, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon 305-811, Republic of Korea
|
44
|
Wang S, Yao X. Multiclass Imbalance Problems: Analysis and Potential Solutions. IEEE Trans Syst Man Cybern B Cybern 2012; 42:1119-30. [DOI: 10.1109/tsmcb.2012.2187280] [Citation(s) in RCA: 319] [Impact Index Per Article: 26.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|
45
|
Batuwita R, Palade V. Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning. J Bioinform Comput Biol 2012; 10:1250003. [DOI: 10.1142/s0219720012500035] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
Abstract
One common and challenging problem faced by many bioinformatics applications, such as promoter recognition, splice site prediction, RNA gene prediction, drug discovery and protein classification, is the imbalance of the available datasets. In most of these applications, the positive data examples are largely outnumbered by the negative data examples, which often leads to the development of sub-optimal prediction models having a high negative recognition rate (Specificity = SP) and a low positive recognition rate (Sensitivity = SE). When class imbalance learning methods are applied, the SE is usually increased at the expense of some reduction in the SP. In this paper, we point out that in these data-imbalanced bioinformatics applications, the goal of applying class imbalance learning methods should be to increase the SE as much as possible while keeping the reduction of SP as low as possible. We explain that the existing performance measures used in class imbalance learning can still produce sub-optimal models with respect to this classification goal. In order to overcome these problems, we introduce a new performance measure called Adjusted Geometric-mean (AGm). The experimental results obtained on ten real-world imbalanced bioinformatics datasets demonstrate that the AGm metric can achieve a lower rate of reduction of SP than the existing performance metrics when increasing the SE through class imbalance learning methods. This characteristic of the AGm metric makes it more suitable for achieving the proposed classification goal in imbalanced bioinformatics dataset learning.
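The SE and SP quantities defined above, and the plain geometric mean that AGm adjusts, can be computed from predictions as in the sketch below. The abstract does not give the AGm formula, so only the standard G-mean, sqrt(SE × SP), is shown; the function names and the toy labels are assumptions for illustration.

```python
import math


def sensitivity_specificity(y_true, y_pred):
    """Return (SE, SP) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    se = tp / (tp + fn)  # positive recognition rate
    sp = tn / (tn + fp)  # negative recognition rate
    return se, sp


def g_mean(se, sp):
    """Standard geometric mean of sensitivity and specificity."""
    return math.sqrt(se * sp)


# Hypothetical imbalanced set: 2 positives, 8 negatives.
y_true = [1, 1] + [0] * 8
always_negative = [0] * 10
se, sp = sensitivity_specificity(y_true, always_negative)
print(se, sp, g_mean(se, sp))
```

A classifier that labels everything negative reaches 80% accuracy on this toy set yet has SE = 0, so its G-mean collapses to zero, which is the kind of failure that imbalance-aware measures are designed to expose.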
Affiliation(s)
- Rukshan Batuwita
- University of Oxford, Department of Computer Science, Oxford, OX1 3QD, United Kingdom
- Vasile Palade
- University of Oxford, Department of Computer Science, Oxford, OX1 3QD, United Kingdom
|
46
|
Identification of human protein complexes from local sub-graphs of protein-protein interaction network based on random forest with topological structure features. Anal Chim Acta 2012; 718:32-41. [PMID: 22305895 DOI: 10.1016/j.aca.2011.12.069] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2011] [Revised: 12/28/2011] [Accepted: 12/30/2011] [Indexed: 11/20/2022]
Abstract
In the post-genomic era, one of the most important and challenging tasks is to identify protein complexes and further elucidate their molecular mechanisms in specific biological processes. Previous computational approaches usually identify protein complexes from protein interaction networks based on dense sub-graphs and incomplete prior information. Additionally, these computational approaches pay little attention to the biological properties of proteins, and there is no common evaluation metric to evaluate their performance. It is therefore necessary to construct a novel method for identifying protein complexes and elucidating their function. In this study, a novel approach is proposed to identify protein complexes using random forest and topological structure. Each protein complex is represented by a graph of interactions, where a descriptor of the protein primary structure is used to characterize the biological properties of each protein and each vertex is weighted by the descriptor. The topological structure features are developed and used to characterize protein complexes. The random forest algorithm is utilized to build a prediction model and identify protein complexes from local sub-graphs instead of dense sub-graphs. As a demonstration, the proposed approach is applied to protein interaction data in human, and satisfactory results are obtained with an accuracy of 80.24%, sensitivity of 81.94%, specificity of 80.07%, and Matthews correlation coefficient of 0.4087 in a 10-fold cross-validation test. Some new protein complexes are identified, and analysis based on Gene Ontology shows that the complexes are likely to be true complexes and play important roles in the pathogenesis of some diseases. PCI-RFTS, a corresponding executable program for protein complex identification, can be acquired freely on request from the authors.
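The Matthews correlation coefficient reported above combines all four cells of the confusion matrix, which is why it can be much lower than accuracy, sensitivity, and specificity when the classes are imbalanced. A minimal sketch follows; the confusion-matrix counts in the example are hypothetical and are not taken from the paper.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 100 positives and 1000 negatives, with
# sensitivity 0.82 and specificity 0.80.
print(mcc(tp=82, fp=200, tn=800, fn=18))
```

With these made-up counts the sensitivity and specificity are both near 0.8 while the MCC comes out near 0.41, illustrating how a tenfold class imbalance alone can produce a gap of the size seen in the abstract.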
|
47
|
Sun C, Zhao XM, Tang W, Chen L. FGsub: Fusarium graminearum protein subcellular localizations predicted from primary structures. BMC SYSTEMS BIOLOGY 2010; 4 Suppl 2:S12. [PMID: 20840726 PMCID: PMC2982686 DOI: 10.1186/1752-0509-4-s2-s12] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
Background The fungal pathogen Fusarium graminearum (teleomorph Gibberella zeae) is the causal agent of several destructive crop diseases, where a set of genes usually work in concert to cause diseases to crops. To function appropriately, the F. graminearum proteins inside one cell should be assigned to different compartments, i.e. subcellular localizations. Therefore, the subcellular localizations of F. graminearum proteins can provide insights into protein functions and pathogenic mechanisms of this destructive pathogen fungus. Unfortunately, no subcellular localization information is currently available for F. graminearum proteins. Because laboratory experiments are expensive and time-consuming, computational approaches provide an alternative way to predict F. graminearum protein subcellular localizations. Results In this paper, we developed a novel predictor, namely FGsub, to predict F. graminearum protein subcellular localizations from the primary structures. First, a non-redundant fungi data set with subcellular localization annotation is collected from the UniProtKB database and used as the training set, where the subcellular locations are classified into 10 groups. Subsequently, a Support Vector Machine (SVM) is trained on the training set and used to predict F. graminearum protein subcellular localizations for those proteins that do not have significant sequence similarity to those in the training set. The performance of SVMs on the training set with 10-fold cross-validation demonstrates the efficiency and effectiveness of the proposed method. In addition, for F. graminearum proteins that have significant sequence similarity to those in the training set, BLAST is utilized to transfer annotations of homologous proteins to uncharacterized F. graminearum proteins so that the F. graminearum proteins are annotated more comprehensively. Conclusions In this work, we present FGsub to predict F. graminearum protein subcellular localizations in a comprehensive manner. We make a fourfold contribution to this field. First, we present a new algorithm to cope with the imbalance problem that arises in protein subcellular localization prediction, which can solve the imbalance problem and avoid false positive results. Second, we design an ensemble classifier which employs feature selection to further improve prediction accuracy. Third, we use BLAST to complement machine learning based methods, which enlarges our prediction coverage. Last and most important, we predict the subcellular localizations of 12786 F. graminearum proteins, which provide insights into protein functions and pathogenic mechanisms of this destructive pathogen fungus.
Affiliation(s)
- Chenglei Sun
- Institute of Systems Biology, Shanghai University, Shanghai, China.
|
48
|
Jain P, Hirst JD. Automatic structure classification of small proteins using random forest. BMC Bioinformatics 2010; 11:364. [PMID: 20594334 PMCID: PMC2916923 DOI: 10.1186/1471-2105-11-364] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2010] [Accepted: 07/01/2010] [Indexed: 11/29/2022] Open
Abstract
Background Random forest, an ensemble based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs. Results Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of Matthew's correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases the MCC for classification to different structural levels decreases. Conclusions The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB, which are yet to be assigned to the SCOP classification hierarchy.
Affiliation(s)
- Pooja Jain
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
|
49
|
|
50
|
Deng L, Guan J, Dong Q, Zhou S. Prediction of protein-protein interaction sites using an ensemble method. BMC Bioinformatics 2009; 10:426. [PMID: 20015386 PMCID: PMC2808167 DOI: 10.1186/1471-2105-10-426] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2009] [Accepted: 12/16/2009] [Indexed: 01/23/2023] Open
Abstract
Background Prediction of protein-protein interaction sites is one of the most challenging and intriguing problems in the field of computational biology. Although much progress has been achieved by using various machine learning methods and a variety of available features, the problem is still far from being solved. Results In this paper, an ensemble method is proposed, which combines a bootstrap resampling technique, SVM-based fusion classifiers and a weighted voting strategy, to overcome the imbalance problem and effectively utilize a wide variety of features. We evaluate the ensemble classifier using a dataset extracted from 99 polypeptide chains with 10-fold cross validation, and obtain an AUC score of 0.86, with a sensitivity of 0.76 and a specificity of 0.78, which are better than those of the existing methods. To improve the usefulness of the proposed method, two special ensemble classifiers are designed to handle the cases of missing homologues and structural information respectively, and the performance is still encouraging. The robustness of the ensemble method is also evaluated by effectively classifying interaction sites from surface residues as well as from all residues in proteins. Moreover, we demonstrate the applicability of the proposed method to identify interaction sites from the non-structural proteins (NS) of the influenza A virus, which may be utilized as potential drug target sites. Conclusion Our experimental results show that the ensemble classifiers are quite effective in predicting protein interaction sites. The Sub-EnClassifiers with the resampling technique can alleviate the imbalance problem, and the combination of Sub-EnClassifiers with a wide variety of feature groups can significantly improve prediction performance.
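The weighted voting step that combines the sub-classifiers can be sketched as below. The abstract does not state how the weights are chosen or where the decision threshold lies, so the weights, the 0.5 threshold, and the function name `weighted_vote` are all assumptions made for illustration.

```python
def weighted_vote(predictions, weights, threshold=0.5):
    """Combine binary votes from several sub-classifiers.

    `predictions` holds one 0/1 vote per sub-classifier and `weights`
    the matching non-negative weights (for example, a validation score
    per sub-classifier). Returns the final label and the weighted
    fraction of votes for the positive class.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w for p, w in zip(predictions, weights) if p == 1) / total
    label = 1 if score >= threshold else 0
    return label, score


# Hypothetical case: one strong sub-classifier outvotes two weak ones.
print(weighted_vote([1, 0, 0], [0.9, 0.2, 0.2]))
# With equal weights the same votes yield the majority label instead.
print(weighted_vote([1, 0, 0], [1.0, 1.0, 1.0]))
```

Weighting lets sub-classifiers trained on more informative feature groups or better resamples dominate the decision, whereas equal weights reduce the scheme to simple majority voting.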
Affiliation(s)
- Lei Deng
- Department of Computer Science and Technology, Tongji University, Shanghai 201804, China.
|