1
Li Y, Hsu W. A classification for complex imbalanced data in disease screening and early diagnosis. Stat Med 2022; 41:3679-3695. [PMID: 35603639 PMCID: PMC9541048 DOI: 10.1002/sim.9442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Revised: 04/11/2022] [Accepted: 05/10/2022] [Indexed: 11/09/2022]
Abstract
Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Traditional classification methods often perform poorly when the class distribution is severely skewed, and even more so under a high-dimensional longitudinal data structure. Given the ubiquity of big data in modern health research, imbalanced classification in disease diagnosis is expected to face an additional level of difficulty imposed by such a complex data structure. In this article, we propose a nonparametric classification approach for imbalanced data in longitudinal and high-dimensional settings. Functional principal component analysis is first applied for feature extraction under the longitudinal structure. The univariate exponential loss function coupled with a group LASSO penalty is then adopted in the classification procedure for high-dimensional settings. Along with a good improvement in imbalanced classification, our approach provides a meaningful feature selection for interpretation while enjoying a remarkably lower computational complexity. The proposed method is illustrated with a real-data application to the early detection of Alzheimer's disease, and its finite-sample performance is extensively evaluated by simulations.
Affiliation(s)
- Yiming Li
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Wei‐Wen Hsu
- Division of Biostatistics and Bioinformatics, Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, Ohio, USA
2
Namdar K, Haider MA, Khalvati F. A Modified AUC for Training Convolutional Neural Networks: Taking Confidence Into Account. Front Artif Intell 2021; 4:582928. [PMID: 34917933 PMCID: PMC8670229 DOI: 10.3389/frai.2021.582928] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/13/2020] [Accepted: 09/30/2021] [Indexed: 11/13/2022] Open
Abstract
The receiver operating characteristic (ROC) curve is an informative tool in binary classification, and the Area Under the ROC Curve (AUC) is a popular metric for reporting the performance of binary classifiers. In this paper, we first present a comprehensive review of the ROC curve and the AUC metric. Next, we propose a modified version of AUC that takes the confidence of the model into account and, at the same time, incorporates AUC into the Binary Cross Entropy (BCE) loss used for training a Convolutional Neural Network for classification tasks. We demonstrate this on three datasets: MNIST, prostate MRI, and brain MRI. Furthermore, we have published GenuineAI, a new Python library, which provides functions for conventional AUC and the proposed modified AUC along with metrics including sensitivity, specificity, recall, precision, and F1 for each point of the ROC curve.
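As a point of reference for the conventional AUC mentioned in this abstract, the following is a minimal sketch in plain Python. It uses the standard pairwise-ranking interpretation of AUC and computes sensitivity and specificity at one threshold. It does not reproduce the modified AUC proposed by the authors, and it does not use the GenuineAI library; the function names `roc_auc` and `sensitivity_specificity` are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """Conventional AUC: the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative.
    Ties between a positive and a negative count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive.
    Assumes both classes are present in `labels`."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(y_true, y_score))
    print(sensitivity_specificity(y_true, y_score, 0.5))
```

The pairwise form is quadratic in the number of samples; it is used here because it makes the ranking interpretation explicit, not because it is the most efficient way to compute AUC.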
Affiliation(s)
- Khashayar Namdar
- Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- The Hospital for Sick Children (SickKids), Toronto, ON, Canada
- Masoom A Haider
- Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
- Sunnybrook Research Institute, Toronto, ON, Canada
- Farzad Khalvati
- Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- The Hospital for Sick Children (SickKids), Toronto, ON, Canada
3
Zhang Y, Wang Z, Wang S, Shang J. Comparative Analysis of Unsupervised Protein Similarity Prediction Based on Graph Embedding. Front Genet 2021; 12:744334. [PMID: 34630534 PMCID: PMC8493040 DOI: 10.3389/fgene.2021.744334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2021] [Accepted: 08/25/2021] [Indexed: 11/13/2022] Open
Abstract
The study of protein-protein interaction and the determination of protein functions are important parts of proteomics. Computational methods are used to study the similarity between proteins based on Gene Ontology (GO) to explore their functions and possible interactions. GO is a series of standardized terms that describe gene products from molecular functions, biological processes, and cell components. Previous studies on assessing the similarity of GO terms were primarily based on Information Content (IC) between GO terms to measure the similarity of proteins. However, these methods tend to ignore the structural information between GO terms. Therefore, considering the structural information of GO terms, we systematically analyze the performance of the GO graph and GO Annotation (GOA) graph in calculating the similarity of proteins using different graph embedding methods. When applied to the actual Human and Yeast datasets, the feature vectors of GO terms and proteins are learned based on different graph embedding methods. To measure the similarity of the proteins annotated by different GO numbers, we used Dynamic Time Warping (DTW) and cosine to calculate protein similarity in GO graph and GOA graph, respectively. Link prediction experiments were then performed to evaluate the reliability of protein similarity networks constructed by different methods. It is shown that graph embedding methods have obvious advantages over the traditional IC-based methods. We found that random walk graph embedding methods, in particular, showed excellent performance in calculating the similarity of proteins. By comparing link prediction experiment results from GO(DTW) and GOA(cosine) methods, it is shown that GO(DTW) features provide highly effective information for analyzing the similarity among proteins.
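The two similarity measures named in this abstract can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' pipeline: the abstract does not say how the embeddings are arranged, so the sketch assumes that a protein in the GO graph setting is a sequence of GO-term embedding vectors (compared with DTW), while a protein in the GOA graph setting is a single embedding vector (compared with cosine similarity). The function names and the Euclidean local cost are hypothetical choices.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance, used here as the local cost inside DTW (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b, local_cost=euclidean):
    """Classic dynamic time warping distance between two sequences of vectors.
    The sequences may have different lengths, which is why DTW suits
    proteins annotated with different numbers of GO terms."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_cost(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # Toy one-dimensional "embeddings"; real embeddings would be longer vectors.
    protein_a = [[0.0], [1.0], [2.0]]
    protein_b = [[0.0], [1.0], [1.0], [2.0]]
    print(dtw_distance(protein_a, protein_b))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

DTW returns a distance (smaller means more similar), whereas cosine returns a similarity (larger means more similar), so the two values are not directly comparable without a conversion step.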
Affiliation(s)
- Yuanyuan Zhang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao, China
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
- Ziqi Wang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao, China
- Shudong Wang
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
- Junliang Shang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
4
Leclercq M, Vittrant B, Martin-Magniette ML, Scott Boyer MP, Perin O, Bergeron A, Fradet Y, Droit A. Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data. Front Genet 2019; 10:452. [PMID: 31156708 PMCID: PMC6532608 DOI: 10.3389/fgene.2019.00452] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 01/23/2019] [Accepted: 04/30/2019] [Indexed: 12/11/2022] Open
Abstract
The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated with machine learning models that efficiently predict a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML.
Affiliation(s)
- Mickael Leclercq
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
- Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Benjamin Vittrant
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
- Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Marie Laure Martin-Magniette
- Institute of Plant Sciences Paris Saclay IPS2, CNRS, INRA, Université Paris-Sud, Université Evry, Université Paris-Saclay, Paris Diderot, Sorbonne Paris-Cité, Orsay, France
- UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, Paris, France
- Marie Pier Scott Boyer
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
- Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Olivier Perin
- Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
- Alain Bergeron
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
- Département de Chirurgie, Oncology Axis, Université Laval, Québec City, QC, Canada
- Yves Fradet
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
- Département de Chirurgie, Oncology Axis, Université Laval, Québec City, QC, Canada
- Arnaud Droit
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
- Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada