Reference Citation Analysis

For:	Zhao XM, Li X, Chen L, Aihara K. Protein classification with imbalanced data. Proteins 2007;70:1125-32. [DOI: 10.1002/prot.21870] [Citation(s) in RCA: 97] [Impact Index Per Article: 5.7] [Indexed: 11/08/2022]


Cited by Other Article(s)

Zhang Q, Zheng W, Song Z, Zhang Q, Yang L, Wu J, Lin J, Xu G, Yu H. Machine Learning Enables Prediction of Pyrrolysyl-tRNA Synthetase Substrate Specificity. ACS Synth Biol 2023;12:2403-2417. [PMID: 37486975 DOI: 10.1021/acssynbio.3c00225] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 07/26/2023]

Abstract

Knowledge about the substrate scope for a given enzyme is informative for elucidating biochemical pathways and also for expanding applications of the enzyme. However, no general methods are available to accurately predict the substrate specificity of an enzyme. Pyrrolysyl-tRNA synthetase (PylRS) is a powerful tool for incorporating various noncanonical amino acids (NCAAs) into proteins, which enabled us to probe, image, rationally engineer, and evolve protein structure and function. However, the incorporation of a new NCAA typically requires the selection of large libraries of PylRS with randomized mutations at active sites, and this process requires multiple rounds of selection for each new substrate. Therefore, a single aminoacyl-tRNA synthetase with broad substrate promiscuity is ideal to facilitate widespread applications of the genetic NCAA incorporation technique. Herein, machine learning models were developed to predict the substrate specificity of PylRS to accept novel NCAAs that could be incorporated into proteins by three PylRS mutants. The models were built from a training set of 285 unique enzyme-substrate pairs of three PylRS mutants including IFRS, BtaRS, and MFRS against 95 NCAAs. The best BaggingTree (BT) model was then used to virtually screen an NCAA library containing 1474 phenylalanine, tyrosine, tryptophan, and alanine analogues, and 156 NCAAs were predicted to be accepted by at least one of the three PylRS mutants. Then, 27 NCAAs including 24 positive and 3 negative substrates were experimentally tested for their activities, and 20 of the 24 positive substrates showed weak or strong activity and were accepted by at least one PylRS mutant, among which 11 NCAAs were never reported to be incorporated into proteins before. The three negative substrates did not show any activity. Experimental results suggested that the BT model provides a three-class classification accuracy of 0.69 and a binary classification accuracy of 0.86. This study expanded the substrate scope of three PylRS variants and provided a framework for developing machine learning models to predict substrate specificity of other PylRS variants.


Murad T, Ali S, Patterson M. Exploring the Potential of GANs in Biological Sequence Analysis. BIOLOGY 2023;12:854. [PMID: 37372139 DOI: 10.3390/biology12060854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 06/03/2023] [Accepted: 06/12/2023] [Indexed: 06/29/2023]

Szep M, Pintican R, Boca B, Perja A, Duma M, Feier D, Epure F, Fetica B, Eniu D, Roman A, Dudea SM, Chiorean A. Whole-Tumor ADC Texture Analysis Is Able to Predict Breast Cancer Receptor Status. Diagnostics (Basel) 2023;13:diagnostics13081414. [PMID: 37189515 DOI: 10.3390/diagnostics13081414] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/12/2023] [Revised: 04/04/2023] [Accepted: 04/11/2023] [Indexed: 05/17/2023]

Machine learning to improve the interpretation of intercalating dye-based quantitative PCR results. Sci Rep 2022;12:16445. [PMID: 36180590 PMCID: PMC9525288 DOI: 10.1038/s41598-022-21010-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Accepted: 09/21/2022] [Indexed: 11/16/2022]

Choi HS, Jung D, Kim S, Yoon S. Imbalanced Data Classification via Cooperative Interaction Between Classifier and Generator. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022;33:3343-3356. [PMID: 33531305 DOI: 10.1109/tnnls.2021.3052243] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]

Chen W, Yang K, Yu Z, Zhang W. Double-kernel based class-specific broad learning system for multiclass imbalance learning. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]

Zhou L, Tang Y, Yan G. A New Estimation Method for the Biological Interaction Predicting Problems. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022;19:1415-1423. [PMID: 33406043 DOI: 10.1109/tcbb.2021.3049642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]

Makond B, Wang KJ, Wang KM. Benchmarking prognosis methods for survivability - A case study for patients with contingent primary cancers. Comput Biol Med 2021;138:104888. [PMID: 34610552 DOI: 10.1016/j.compbiomed.2021.104888] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2021] [Accepted: 09/17/2021] [Indexed: 11/18/2022]

Zhang Q, Wang D, Han K, Huang DS. Predicting TF-DNA Binding Motifs from ChIP-seq Datasets Using the Bag-Based Classifier Combined With a Multi-Fold Learning Scheme. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:1743-1751. [PMID: 32946398 DOI: 10.1109/tcbb.2020.3025007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]

Moghadas-Dastjerdi H, Rahman SETH, Sannachi L, Wright FC, Gandhi S, Trudeau ME, Sadeghi-Naini A, Czarnota GJ. Prediction of chemotherapy response in breast cancer patients at pre-treatment using second derivative texture of CT images and machine learning. Transl Oncol 2021;14:101183. [PMID: 34293685 PMCID: PMC8319580 DOI: 10.1016/j.tranon.2021.101183] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/25/2021] [Revised: 07/07/2021] [Accepted: 07/13/2021] [Indexed: 01/01/2023]

Abstract

• Textural and second derivative textural features of CT images can be used in conjunction with machine learning models to predict breast cancer response to chemotherapy prior to the start of treatment.
• The proposed predictive model separates the patients at pre-treatment into two cohorts (responders/non-responders) with significantly different survival.
• The proposed methodology is a step forward towards the precision oncology paradigm for breast cancer patients.

Although neoadjuvant chemotherapy (NAC) is a crucial component of treatment for locally advanced breast cancer (LABC), only about 70% of patients respond to it. Effective adjustment of NAC for individual patients can significantly improve survival rates of those resistant to standard regimens. Thus, the early prediction of NAC outcome is of great importance in facilitating a personalized paradigm for breast cancer therapeutics. In this study, quantitative computed tomography (qCT) parametric imaging in conjunction with machine learning techniques was investigated to predict LABC tumor response to NAC. Textural and second derivative textural (SDT) features of CT images of 72 patients diagnosed with LABC were analysed before the initiation of NAC to quantify intra-tumor heterogeneity. These quantitative features were processed through a correlation-based feature reduction followed by a sequential feature selection with a bootstrap 0.632+ area under the receiver operating characteristic (ROC) curve (AUC0.632+) criterion. The best feature subset consisted of a combination of one textural and three SDT features. Using these features, an AdaBoost decision tree could predict the patient response with a cross-validated AUC0.632+, accuracy, sensitivity and specificity of 0.88, 85%, 88% and 75%, respectively. This study demonstrates, for the first time, that a combination of textural and SDT features of CT images can be used to predict breast cancer response to NAC prior to the start of treatment, which can potentially facilitate early therapy adjustments.
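The bootstrap 0.632+ criterion named in this abstract can be illustrated with a minimal sketch. The function below implements the error-rate form of the 0.632+ estimator described by Efron and Tibshirani; the abstract uses an AUC-based variant whose exact formulation is not given in this listing, so the function name, its arguments, and the example numbers are illustrative assumptions and not the authors' procedure.

```python
def bootstrap_632_plus(err_train, err_oob, gamma):
    """Error-rate form of the bootstrap 0.632+ estimate (hypothetical helper).

    err_train: apparent (resubstitution) error on the training data
    err_oob:   average error on out-of-bootstrap samples
    gamma:     no-information error rate (e.g. 0.5 for balanced binary labels)
    """
    # Cap the out-of-bootstrap error at the no-information rate.
    err_oob = min(err_oob, gamma)
    # Relative overfitting rate, defined as 0 when there is no overfitting.
    if err_oob > err_train and gamma > err_train:
        r = (err_oob - err_train) / (gamma - err_train)
    else:
        r = 0.0
    # Weight grows from 0.632 (no overfitting) to 1.0 (maximal overfitting).
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * err_train + w * err_oob


# Made-up error rates, purely for illustration.
estimate = bootstrap_632_plus(err_train=0.10, err_oob=0.20, gamma=0.50)
print(round(estimate, 4))
```

With no overfitting the estimate reduces to the plain 0.632 weighting; as the out-of-bootstrap error approaches the no-information rate, the weight shifts entirely onto the out-of-bootstrap error.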


Affiliation(s)

Hadi Moghadas-Dastjerdi Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
Shan-E-Tallat Hira Rahman Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Faculty of Engineering, University of Waterloo, Waterloo, ON, Canada
Lakshmanan Sannachi Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
Frances C Wright Surgical Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, and Department of Surgery, University of Toronto, Toronto, ON, Canada
Sonal Gandhi Division of Medical Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, and Department of Medicine, University of Toronto, Toronto, ON, Canada
Maureen E Trudeau Division of Medical Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, and Department of Medicine, University of Toronto, Toronto, ON, Canada
Ali Sadeghi-Naini Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, ON, Canada.
Gregory J Czarnota Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Center, Sunnybrook Health Sciences Center, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada.


Tian H, Jiang X, Tao P. PASSer: Prediction of Allosteric Sites Server. MACHINE LEARNING-SCIENCE AND TECHNOLOGY 2021;2. [PMID: 34396127 DOI: 10.1088/2632-2153/abe6d6] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Indexed: 11/12/2022]

Chen S, Gan M, Lv H, Jiang R. DeepCAPE: A Deep Convolutional Neural Network for the Accurate Prediction of Enhancers. GENOMICS PROTEOMICS & BIOINFORMATICS 2021;19:565-577. [PMID: 33581335 PMCID: PMC9040020 DOI: 10.1016/j.gpb.2019.04.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/24/2018] [Revised: 03/15/2019] [Accepted: 04/29/2019] [Indexed: 12/12/2022]

Abstract

The establishment of a landscape of enhancers across human cells is crucial to deciphering the mechanism of gene regulation, cell differentiation, and disease development. High-throughput experimental approaches, which contain successfully reported enhancers in typical cell lines, are still too costly and time-consuming to perform systematic identification of enhancers specific to different cell lines. Existing computational methods, capable of predicting regulatory elements purely relying on DNA sequences, lack the power of cell line-specific screening. Recent studies have suggested that chromatin accessibility of a DNA segment is closely related to its potential function in regulation, and thus may provide useful information in identifying regulatory elements. Motivated by the aforementioned understanding, we integrate DNA sequences and chromatin accessibility data to accurately predict enhancers in a cell line-specific manner. We proposed DeepCAPE, a deep convolutional neural network to predict enhancers via the integration of DNA sequences and DNase-seq data. Benefitting from the well-designed feature extraction mechanism and skip connection strategy, our model not only consistently outperforms existing methods in the imbalanced classification of cell line-specific enhancers against background sequences, but also has the ability to self-adapt to different sizes of datasets. Besides, with the adoption of auto-encoder, our model is capable of making cross-cell line predictions. We further visualize kernels of the first convolutional layer and show the match of identified sequence signatures and known motifs. We finally demonstrate the potential ability of our model to explain functional implications of putative disease-associated genetic variants and discriminate disease-related enhancers. The source code and detailed tutorial of DeepCAPE are freely available at https://github.com/ShengquanChen/DeepCAPE.


Akhter N, Chennupati G, Djidjev H, Shehu A. Decoy selection for protein structure prediction via extreme gradient boosting and ranking. BMC Bioinformatics 2020;21:189. [PMID: 33297949 PMCID: PMC7724862 DOI: 10.1186/s12859-020-3523-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/15/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022]

Suh S, Lee H, Lukowicz P, Lee YO. CEGAN: Classification Enhancement Generative Adversarial Networks for unraveling data imbalance problems. Neural Netw 2020;133:69-86. [PMID: 33125919 DOI: 10.1016/j.neunet.2020.10.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/29/2019] [Revised: 07/23/2020] [Accepted: 10/11/2020] [Indexed: 10/23/2022]

Liu Y, Li A, Zhao XM, Wang M. DeepTL-Ubi: A novel deep transfer learning method for effectively predicting ubiquitination sites of multiple species. Methods 2020;192:103-111. [PMID: 32791338 DOI: 10.1016/j.ymeth.2020.08.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 06/11/2020] [Revised: 07/17/2020] [Accepted: 08/06/2020] [Indexed: 11/16/2022]

Xu C, Zhu G. Semi-supervised Learning Algorithm Based on Linear Lie Group for Imbalanced Multi-class Classification. Neural Process Lett 2020. [DOI: 10.1007/s11063-020-10287-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]

Moghadas-Dastjerdi H, Sha-E-Tallat HR, Sannachi L, Sadeghi-Naini A, Czarnota GJ. A priori prediction of tumour response to neoadjuvant chemotherapy in breast cancer patients using quantitative CT and machine learning. Sci Rep 2020;10:10936. [PMID: 32616912 PMCID: PMC7331583 DOI: 10.1038/s41598-020-67823-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/28/2019] [Accepted: 06/08/2020] [Indexed: 12/19/2022]

Abstract

Response to Neoadjuvant chemotherapy (NAC) has demonstrated a high correlation to survival in locally advanced breast cancer (LABC) patients. An early prediction of responsiveness to NAC could facilitate treatment adjustments on an individual patient basis that would be expected to improve treatment outcomes and patient survival. This study investigated, for the first time, the efficacy of quantitative computed tomography (qCT) parametric imaging to characterize intra-tumour heterogeneity and its application in predicting tumour response to NAC in LABC patients. Textural analyses were performed on CT images acquired from 72 patients before the start of chemotherapy to determine quantitative features of intra-tumour heterogeneity. The best feature subset for response prediction was selected through a sequential feature selection with bootstrap 0.632+ area under the receiver operating characteristic (ROC) curve (AUC0.632+) as a performance criterion. Several classifiers were evaluated for response prediction using the selected feature subset. Amongst the applied classifiers an Adaboost decision tree provided the best results with cross-validated AUC0.632+, accuracy, sensitivity and specificity of 0.89, 84%, 80% and 88%, respectively. The promising results obtained in this study demonstrate the potential of the proposed biomarkers to be used as predictors of LABC tumour response to NAC prior to the start of treatment.


Affiliation(s)

Hadi Moghadas-Dastjerdi Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
Hira Rahman Sha-E-Tallat Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Faculty of Engineering, University of Waterloo, Waterloo, ON, Canada
Lakshmanan Sannachi Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
Ali Sadeghi-Naini Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, ON, Canada
Gregory J Czarnota Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Physical Sciences Platform, Sunnybrook Research Institute, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, ON, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada


Tadepalli S, Akhter N, Barbara D, Shehu A. Anomaly Detection-Based Recognition of Near-Native Protein Structures. IEEE Trans Nanobioscience 2020;19:562-570. [PMID: 32340957 DOI: 10.1109/tnb.2020.2990642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]

Random Balance ensembles for multiclass imbalance learning. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105434] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 11/19/2022]

Abuassba AO, Zhang D, Luo X. A Heterogeneous AdaBoost Ensemble Based Extreme Learning Machines for Imbalanced Data. INTERNATIONAL JOURNAL OF COGNITIVE INFORMATICS AND NATURAL INTELLIGENCE 2019. [DOI: 10.4018/ijcini.2019070102] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/09/2022]

Zhang L, Yu G, Xia D, Wang J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.097] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Indexed: 01/01/2023]

Bhattacharya M, Jurkovitz C, Shatkay H. Chronic Kidney Disease stratification using office visit records: Handling data imbalance via hierarchical meta-classification. BMC Med Inform Decis Mak 2018;18:125. [PMID: 30537962 PMCID: PMC6290512 DOI: 10.1186/s12911-018-0675-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]

Abstract

Background

Chronic Kidney Disease (CKD) is one of several conditions that affect a growing percentage of the US population; the disease is accompanied by multiple co-morbidities, and is hard to diagnose in and of itself. In its advanced forms it carries severe outcomes and can lead to death. It is thus important to detect the disease as early as possible, which can help devise effective intervention and treatment plans.

Here we investigate ways to utilize information available in electronic health records (EHRs) from regular office visits of more than 13,000 patients, in order to distinguish among several stages of the disease. While clinical data stored in EHRs provide valuable information for risk-stratification, one of the major challenges in using them arises from data imbalance. That is, records associated with a more severe condition are typically under-represented compared to those associated with a milder manifestation of the disease. To address imbalance, we propose and develop a sampling-based ensemble approach, hierarchical meta-classification, aiming to stratify CKD patients into severity stages, using simple quantitative non-text features gathered from standard office visit records.

Methods

The proposed hierarchical meta-classification method frames the multiclass classification task as a hierarchy of two subtasks. The first is binary classification, separating records associated with the majority class from those associated with all minority classes combined, using meta-classification. The second subtask separates the records assigned to the combined minority classes into the individual constituent classes.
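The two-subtask hierarchy described in this Methods paragraph can be sketched as follows. This is a minimal illustration only: a simple nearest-centroid rule stands in for the meta-classifiers used in the study, and every name here (`NearestCentroid`, `TwoStageClassifier`, the toy data) is a hypothetical choice made for the example.

```python
from collections import Counter, defaultdict


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Stand-in base learner (an assumption, not the paper's classifier)."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.centroids = {lab: centroid(rows) for lab, rows in groups.items()}
        return self

    def predict_one(self, row):
        return min(self.centroids, key=lambda lab: sq_dist(row, self.centroids[lab]))


class TwoStageClassifier:
    """Stage 1: majority class vs. all minority classes combined.
    Stage 2: split the combined minority group into its constituent classes."""

    def fit(self, X, y):
        self.majority = Counter(y).most_common(1)[0][0]
        is_majority = [label == self.majority for label in y]
        self.stage1 = NearestCentroid().fit(X, is_majority)
        minority = [(row, lab) for row, lab in zip(X, y) if lab != self.majority]
        self.stage2 = NearestCentroid().fit(
            [row for row, _ in minority], [lab for _, lab in minority]
        )
        return self

    def predict_one(self, row):
        if self.stage1.predict_one(row):
            return self.majority
        return self.stage2.predict_one(row)


# Toy imbalanced data: class "A" dominates, "B" and "C" are rare.
X = [[0.0], [0.5], [1.0], [-0.5], [0.2], [-1.0], [10.0], [11.0], [20.0], [21.0]]
y = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
model = TwoStageClassifier().fit(X, y)
print([model.predict_one(p) for p in ([1.0], [9.0], [19.0])])
```

Splitting the task this way means the second-stage learner only ever sees minority-class records, so it is not swamped by the majority class.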

Results

The proposed method identifies a significant proportion of patients suffering from the more advanced stages of the condition, while also correctly identifying most of the less severe cases, maintaining high sensitivity, specificity and F-measure (≥ 93%). Our results show that the high level of performance attained by our method is preserved even when the size of the training set is significantly reduced, demonstrating the stability and generalizability of our approach.

Conclusion

We present a new approach to perform classification while addressing data imbalance, which is inherent in the biomedical domain. Our model effectively identifies severity stages of CKD patients, using information readily available in office visit records within the realistic context of high data imbalance.


Sastry A, Monk J, Tegel H, Uhlen M, Palsson BO, Rockberg J, Brunk E. Machine learning in computational biology to accelerate high-throughput protein expression. Bioinformatics 2018;33:2487-2495. [PMID: 28398465 DOI: 10.1093/bioinformatics/btx207] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/08/2016] [Accepted: 04/05/2017] [Indexed: 01/21/2023]

Dynamic affinity-based classification of multi-class imbalanced data with one-versus-one decomposition: a fuzzy rough set approach. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-017-1126-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 10/18/2022]

Akkasi A, Varoğlu E, Dimililer N. Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text. APPL INTELL 2017. [DOI: 10.1007/s10489-017-0920-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 11/28/2022]

iDPF-PseRAAAC: A Web-Server for Identifying the Defensin Peptide Family and Subfamily Using Pseudo Reduced Amino Acid Alphabet Composition. PLoS One 2015;10:e0145541. [PMID: 26713618 PMCID: PMC4694767 DOI: 10.1371/journal.pone.0145541] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 09/17/2015] [Accepted: 12/04/2015] [Indexed: 11/29/2022]

Dai HL. Imbalanced Protein Data Classification Using Ensemble FTM-SVM. IEEE Trans Nanobioscience 2015;14:350-359. [DOI: 10.1109/tnb.2015.2431292] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]

You ZH, Chan KCC, Hu P. Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest. PLoS One 2015;10:e0125811. [PMID: 25946106 PMCID: PMC4422660 DOI: 10.1371/journal.pone.0125811] [Citation(s) in RCA: 92] [Impact Index Per Article: 10.2] [Received: 07/14/2014] [Accepted: 03/04/2015] [Indexed: 11/18/2022]

Abstract

The study of protein-protein interactions (PPIs) can be very important for the understanding of biological cellular functions. However, detecting PPIs in the laboratory is both time-consuming and expensive. For this reason, there has been much recent effort to develop techniques for computational prediction of PPIs, as this can complement laboratory procedures and provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale. Although much progress has already been achieved in this direction, the problem is still far from being solved. More effective approaches are still required to overcome the limitations of the current ones. In this study, a novel Multi-scale Local Descriptor (MLD) feature representation scheme is proposed to extract features from a protein sequence. This scheme can capture multi-scale local information by varying the length of protein-sequence segments. Based on the MLD, an ensemble learning method, the Random Forest (RF) method, is used as the classifier. The MLD feature representation scheme facilitates the mining of interaction information from multi-scale continuous amino acid segments, making it easier to capture multiple overlapping continuous binding patterns within a protein sequence. When the proposed method is tested with the PPI data of Saccharomyces cerevisiae, it achieves a prediction accuracy of 94.72% with 94.34% sensitivity at a precision of 98.91%. Extensive experiments are performed to compare our method with existing sequence-based methods. Experimental results show that the performance of our predictor is better than that of several other state-of-the-art predictors, including on the H. pylori dataset. Such good results can largely be credited to the learning capabilities of the RF model and the novel MLD feature representation scheme. The experimental results show that the proposed approach can be very promising for predicting PPIs and can be a useful tool for future proteomic studies.


Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines. BIOMED RESEARCH INTERNATIONAL 2015;2015:867516. [PMID: 26000305 PMCID: PMC4426769 DOI: 10.1155/2015/867516] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 10/01/2014] [Revised: 01/09/2015] [Accepted: 01/09/2015] [Indexed: 11/27/2022]

Tomar D, Agarwal S. An effective Weighted Multi-class Least Squares Twin Support Vector Machine for Imbalanced data classification. INT J COMPUT INT SYS 2015. [DOI: 10.1080/18756891.2015.1061395] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 10/23/2022]

You ZH, Zhu L, Zheng CH, Yu HJ, Deng SP, Ji Z. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics 2014;15 Suppl 15:S9. [PMID: 25474679 PMCID: PMC4271571 DOI: 10.1186/1471-2105-15-s15-s9] [Citation(s) in RCA: 84] [Impact Index Per Article: 8.4] [Indexed: 02/06/2023]

Abstract

BACKGROUND

Identifying protein-protein interactions (PPIs) is essential for elucidating protein functions and understanding the molecular mechanisms inside the cell. However, the experimental methods for detecting PPIs are both time-consuming and expensive. Therefore, computational prediction of protein interactions is becoming increasingly popular, as it can provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale and can be used to complement experimental approaches. Although much progress has already been achieved in this direction, the problem is still far from being solved and new approaches are still required to overcome the limitations of the current prediction models.

RESULTS

In this work, a sequence-based approach is developed by combining a novel Multi-scale Continuous and Discontinuous (MCD) feature representation with a Support Vector Machine (SVM). The MCD representation gives adequate consideration to the interactions between sequentially distant but spatially close amino acid residues, so it can capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence. An effective feature selection method, mRMR, was employed to construct an optimized and more discriminative feature set by excluding redundant features. Finally, a prediction model based on the SVM algorithm is trained and tested to predict the interaction probability of protein pairs.

CONCLUSIONS

When applied to the yeast PPI data set, the proposed approach achieved 91.36% prediction accuracy with 91.94% precision at a sensitivity of 90.67%. Extensive experiments were conducted to compare our method with existing sequence-based methods. Experimental results show that the performance of our predictor is better than that of several other state-of-the-art predictors, whose average prediction accuracy is 84.91%, sensitivity is 83.24%, and precision is 86.12%. The results show that the proposed approach is very promising for predicting PPIs, so it can be a useful supplementary tool for future proteomics studies. The source code and the datasets are freely available at http://csse.szu.edu.cn/staff/youzh/MCDPPI.zip for academic use.

You ZH, Yu JZ, Zhu L, Li S, Wen ZK. A MapReduce based parallel SVM for large-scale predicting protein–protein interactions. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.05.072] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]

Ma C, Zhang HH, Wang X. Machine learning for Big Data analytics in plants. TRENDS IN PLANT SCIENCE 2014;19:798-808. [PMID: 25223304 DOI: 10.1016/j.tplants.2014.08.004] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/28/2014] [Revised: 07/30/2014] [Accepted: 08/20/2014] [Indexed: 05/19/2023]

Lee BJ, Ku B, Nam J, Pham DD, Kim JY. Prediction of fasting plasma glucose status using anthropometric measures for diagnosing type 2 diabetes. IEEE J Biomed Health Inform 2014;18:555-61. [PMID: 24608055 DOI: 10.1109/jbhi.2013.2264509] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]

Yu H, Ni J. An Improved Ensemble Learning Method for Classifying High-Dimensional and Imbalanced Biomedicine Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014;11:657-666. [PMID: 26356336 DOI: 10.1109/tcbb.2014.2306838] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]

Wang KJ, Makond B, Chen KH, Wang KM. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2013.09.014] [Citation(s) in RCA: 81] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]

Bakhtiarizadeh MR, Moradi-Shahrbabak M, Ebrahimi M, Ebrahimie E. Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology. J Theor Biol 2014;356:213-22. [PMID: 24819464 DOI: 10.1016/j.jtbi.2014.04.040] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2014] [Revised: 04/03/2014] [Accepted: 04/29/2014] [Indexed: 01/05/2023]

Abdi L, Hashemi S. To combat multi-class imbalanced problems by means of over-sampling and boosting techniques. Soft comput 2014. [DOI: 10.1007/s00500-014-1291-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]

Cation-π interactions in β-lactamases: the role in structural stability. Cell Biochem Biophys 2013;66:147-55. [PMID: 23109179 DOI: 10.1007/s12013-012-9463-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]

Wang KJ, Makond B, Wang KM. An improved survivability prognosis of breast cancer by using sampling and feature selection technique to solve imbalanced patient classification data. BMC Med Inform Decis Mak 2013;13:124. [PMID: 24207108 PMCID: PMC3829096 DOI: 10.1186/1472-6947-13-124] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2013] [Accepted: 10/28/2013] [Indexed: 11/22/2022] Open

Wang M, Zhao XM, Tan H, Akutsu T, Whisstock JC, Song J. Cascleave 2.0, a new approach for predicting caspase and granzyme cleavage targets. Bioinformatics 2013;30:71-80. [PMID: 24149049 DOI: 10.1093/bioinformatics/btt603] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]

Fernández A, López V, Galar M, del Jesus MJ, Herrera F. Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2013.01.018] [Citation(s) in RCA: 236] [Impact Index Per Article: 21.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]

Lee BJ, Kim KH, Ku B, Jang JS, Kim JY. Prediction of body mass index status from voice signals based on machine learning for automated medical applications. Artif Intell Med 2013;58:51-61. [PMID: 23453267 DOI: 10.1016/j.artmed.2013.02.001] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2012] [Revised: 12/21/2012] [Accepted: 02/05/2013] [Indexed: 11/28/2022]

Abstract

OBJECTIVES

The body mass index (BMI) provides essential medical information related to body weight for the treatment and prognosis prediction of diseases such as cardiovascular disease, diabetes, and stroke. We propose a method for the prediction of normal, overweight, and obese classes based only on the combination of voice features that are associated with BMI status, independently of weight and height measurements.

MATERIALS AND METHODS

A total of 1568 subjects were divided into 4 groups according to age and gender differences. We performed statistical analyses by analysis of variance (ANOVA) and Scheffe test to find significant features in each group. We predicted BMI status (normal, overweight, and obese) by a logistic regression algorithm and two ensemble classification algorithms (bagging and random forests) based on statistically significant features.

RESULTS

In the Female-2030 group (females aged 20-40 years), classification experiments using an imbalanced (original) data set gave area under the receiver operating characteristic curve (AUC) values of 0.569-0.731 by logistic regression, whereas experiments using a balanced data set gave AUC values of 0.893-0.994 by random forests. AUC values in Female-4050 (females aged 41-60 years), Male-2030 (males aged 20-40 years), and Male-4050 (males aged 41-60 years) groups by logistic regression in imbalanced data were 0.585-0.654, 0.581-0.614, and 0.557-0.653, respectively. AUC values in Female-4050, Male-2030, and Male-4050 groups in balanced data were 0.629-0.893 by bagging, 0.707-0.916 by random forests, and 0.695-0.854 by bagging, respectively. In each group, we found discriminatory features showing statistical differences among normal, overweight, and obese classes. The results showed that the classification models built by logistic regression in imbalanced data were better than those built by the other two algorithms, and significant features differed according to age and gender groups.

CONCLUSION

Our results could support the development of BMI diagnosis tools for real-time monitoring; such tools are considered helpful in improving automated BMI status diagnosis in remote healthcare or telemedicine and are expected to have applications in forensic and medical science.
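The AUC values quoted in this abstract can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition for the binary case only; it is not the authors' code, the abstract's three-class setting would need a per-class or averaged variant, and the function name `roc_auc` and the scores and labels are hypothetical illustration values.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced sample: 2 positives among 8 negatives.
scores = [0.9, 0.4, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(roc_auc(scores, labels))
```

Because the measure is built from positive-negative pairs rather than raw counts, it does not reward a model for simply predicting the majority class, which is one reason it is commonly reported for imbalanced data.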

Wang S, Yao X. Multiclass Imbalance Problems: Analysis and Potential Solutions. IEEE Trans Syst Man Cybern B Cybern 2012;42:1119-30. [DOI: 10.1109/tsmcb.2012.2187280] [Citation(s) in RCA: 319] [Impact Index Per Article: 26.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]

Batuwita R, Palade V. Adjusted geometric-mean: a novel performance measure for imbalanced bioinformatics datasets learning. J Bioinform Comput Biol 2012;10:1250003. [DOI: 10.1142/s0219720012500035] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]

Identification of human protein complexes from local sub-graphs of protein-protein interaction network based on random forest with topological structure features. Anal Chim Acta 2012;718:32-41. [PMID: 22305895 DOI: 10.1016/j.aca.2011.12.069] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2011] [Revised: 12/28/2011] [Accepted: 12/30/2011] [Indexed: 11/20/2022]

Abstract

In the post-genomic era, one of the most important and challenging tasks is to identify protein complexes and further elucidate their molecular mechanisms in specific biological processes. Previous computational approaches usually identify protein complexes from protein interaction networks based on dense sub-graphs and incomplete prior information. Additionally, these approaches pay little attention to the biological properties of proteins, and there is no common metric for evaluating performance. It is therefore necessary to construct a novel method for identifying protein complexes and elucidating their function. In this study, a novel approach is proposed to identify protein complexes using random forest and topological structure. Each protein complex is represented by a graph of interactions, where a descriptor of the protein primary structure is used to characterize the biological properties of each protein and each vertex is weighted by the descriptor. Topological structure features are developed and used to characterize protein complexes. The random forest algorithm is used to build a prediction model and identify protein complexes from local sub-graphs instead of dense sub-graphs. As a demonstration, the proposed approach is applied to human protein interaction data, and satisfactory results are obtained, with accuracy of 80.24%, sensitivity of 81.94%, specificity of 80.07%, and Matthews correlation coefficient of 0.4087 in a 10-fold cross-validation test. Some new protein complexes are identified, and analysis based on Gene Ontology shows that the complexes are likely to be true complexes and play important roles in the pathogenesis of some diseases. PCI-RFTS, a corresponding executable program for protein complex identification, can be acquired freely on request from the authors.
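The abstract above reports accuracy, sensitivity, and specificity near 80% alongside a much lower Matthews correlation coefficient. The sketch below shows how these four measures are computed from a binary confusion matrix and how such a gap can arise when negatives greatly outnumber positives. It is an illustration only: the function name `binary_metrics` and the counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity, and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts: 100 positives and 1000 negatives.
metrics = binary_metrics(tp=82, fp=200, tn=800, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts the 200 false positives outnumber the 82 true positives, so precision is low even though both class-wise rates are about 80%; the MCC reflects that, while accuracy does not.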

Sun C, Zhao XM, Tang W, Chen L. FGsub: Fusarium graminearum protein subcellular localizations predicted from primary structures. BMC SYSTEMS BIOLOGY 2010;4 Suppl 2:S12. [PMID: 20840726 PMCID: PMC2982686 DOI: 10.1186/1752-0509-4-s2-s12] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]

Abstract

Background

The fungal pathogen Fusarium graminearum (teleomorph Gibberella zeae) is the causal agent of several destructive crop diseases, in which a set of genes usually work in concert to cause disease. To function appropriately, the F. graminearum proteins inside a cell must be assigned to different compartments, i.e. subcellular localizations. Therefore, the subcellular localizations of F. graminearum proteins can provide insights into protein functions and the pathogenic mechanisms of this destructive fungal pathogen. Unfortunately, no subcellular localization information is currently available for F. graminearum proteins. Because laboratory experiments are expensive and time-consuming, computational approaches provide an alternative way to predict F. graminearum protein subcellular localizations.

Results

In this paper, we developed a novel predictor, FGsub, to predict F. graminearum protein subcellular localizations from primary structures. First, a non-redundant fungal data set with subcellular localization annotations was collected from the UniProtKB database and used as the training set, with the subcellular locations classified into 10 groups. Subsequently, a Support Vector Machine (SVM) was trained on the training set and used to predict subcellular localizations for those F. graminearum proteins that have no significant sequence similarity to proteins in the training set. The performance of the SVMs on the training set with 10-fold cross-validation demonstrates the efficiency and effectiveness of the proposed method. In addition, for F. graminearum proteins that do have significant sequence similarity to proteins in the training set, BLAST was used to transfer annotations from homologous proteins to uncharacterized F. graminearum proteins, so that the F. graminearum proteins are annotated more comprehensively.

Conclusions

In this work, we present FGsub to predict F. graminearum protein subcellular localizations in a comprehensive manner. We make four contributions to this field. First, we present a new algorithm to cope with the imbalance problem that arises in protein subcellular localization prediction, which can address the imbalance and avoid false positive results. Second, we design an ensemble classifier that employs feature selection to further improve prediction accuracy. Third, we use BLAST to complement machine-learning-based methods, which enlarges our prediction coverage. Last and most important, we predict the subcellular localizations of 12786 F. graminearum proteins, which provides insights into protein functions and the pathogenic mechanisms of this destructive fungal pathogen.
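The two-route design described in this abstract (annotation transfer for proteins with a significant homolog, a trained classifier otherwise) can be sketched as a simple dispatch. This is a hedged illustration, not the FGsub implementation: `predict_localization`, `toy_lookup`, and `toy_classifier` are hypothetical names, and the toy stand-ins replace what would be a real BLAST search and a trained SVM.

```python
from typing import Callable, Optional, Tuple


def predict_localization(
    sequence: str,
    homolog_annotation: Callable[[str], Optional[str]],
    classifier: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (localization, source) for one protein sequence.

    homolog_annotation returns a transferred label when a significant
    homolog exists, or None otherwise; classifier is the fallback model.
    """
    annotation = homolog_annotation(sequence)
    if annotation is not None:
        return annotation, "homology transfer"
    return classifier(sequence), "classifier"


# Toy stand-ins (assumptions for illustration only).
KNOWN_ANNOTATIONS = {"MKTAYIAK": "nucleus"}


def toy_lookup(sequence: str) -> Optional[str]:
    # Stands in for a similarity search; here an exact-match dictionary lookup.
    return KNOWN_ANNOTATIONS.get(sequence)


def toy_classifier(sequence: str) -> str:
    # Stands in for a trained model; here a trivial lysine-count rule.
    return "cytoplasm" if sequence.count("K") >= 2 else "membrane"


print(predict_localization("MKTAYIAK", toy_lookup, toy_classifier))
print(predict_localization("MKKLLV", toy_lookup, toy_classifier))
```

Returning the source alongside the label keeps it visible which route produced each annotation, which matters when the two routes have different error characteristics.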

Jain P, Hirst JD. Automatic structure classification of small proteins using random forest. BMC Bioinformatics 2010;11:364. [PMID: 20594334 PMCID: PMC2916923 DOI: 10.1186/1471-2105-11-364] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2010] [Accepted: 07/01/2010] [Indexed: 11/29/2022] Open

Imbalanced classification using support vector machine ensemble. Neural Comput Appl 2010. [DOI: 10.1007/s00521-010-0349-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]

Deng L, Guan J, Dong Q, Zhou S. Prediction of protein-protein interaction sites using an ensemble method. BMC Bioinformatics 2009;10:426. [PMID: 20015386 PMCID: PMC2808167 DOI: 10.1186/1471-2105-10-426] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2009] [Accepted: 12/16/2009] [Indexed: 01/23/2023] Open